PEIT

Overview

Code for the ACL 2023 paper "PEIT: Bridging the Modality Gap with Pre-trained Models for End-to-End Image Translation" [paper]

Download ECOIT Dataset

We have released the ECOIT dataset; you can download it here.

Installation

pip install -e ./

python setup.py build_ext --inplace

Folder structure

The instructions below assume the following folder structure:

/tmp/data
|—— mt_data (parallel text data)
    |—— train.en
    |—— train.zh
    |—— valid.en
    |—— valid.zh
|—— it_data (Synthetic Data)
    |—— train_images
        |—— 0.jpg
        |—— 1.jpg
        |—— ....
        |—— 10000000.jpg
        |—— text.en
        |—— text.zh
    |—— valid_images
        |—— 0.jpg
        |—— 1.jpg
        |—— ....
        |—— 1000.jpg
        |—— text.en
        |—— text.zh
|—— ft_data (ECOIT)
    |—— train_images
        |—— 0.jpg
        |—— 1.jpg
        |—— ....
        |—— 470000.jpg
        |—— text.en
        |—— text.zh
    |—— valid_images
        |—— 0.jpg
        |—— 1.jpg
        |—— ....
        |—— 2000.jpg
        |—— text.en
        |—— text.zh
    |—— test_images
        |—— 0.jpg
        |—— 1.jpg
        |—— ....
        |—— 1000.jpg
        |—— text.en
        |—— text.zh

The text.en and text.zh files in each of train_images, valid_images, and test_images store the sentences one per line, ordered by image id.
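The pairing between images and sentences could be read back roughly as follows. This is a minimal sketch, not code from the repository: the function name `load_pairs` is hypothetical, and it assumes that line i of each text file (counting from zero) belongs to i.jpg, which the layout above suggests but does not state.

```python
from pathlib import Path


def load_pairs(image_dir, langs=("en", "zh")):
    """Pair each image in image_dir with its sentences in text.<lang>.

    Assumption (not confirmed by the repository): line i of each text
    file, counted from zero, belongs to the image named <i>.jpg.
    """
    image_dir = Path(image_dir)
    columns = [
        (image_dir / f"text.{lang}").read_text(encoding="utf-8").splitlines()
        for lang in langs
    ]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("text files have different numbers of lines")

    pairs = []
    for idx, sentences in enumerate(zip(*columns)):
        image_path = image_dir / f"{idx}.jpg"
        if image_path.exists():  # skip ids that have no image on disk
            pairs.append((image_path, *sentences))
    return pairs
```

Calling `load_pairs("/tmp/data/ft_data/valid_images")` would then return tuples of the form (image path, English sentence, Chinese sentence).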

Data Synthesis

To synthesize data, run: python data_synthesis_and_preprocess/data_synthesis/data_synthesis.py
You need to provide the paths to the ECOIT dataset (which supplies the background images), the pre-downloaded fonts (available from Google Fonts), and the MT dataset (which supplies the parallel sentence pairs). The script uses them to create image translation samples in train_images and valid_images.

Data processing and Training

Process:

Data Cleaning
BPE training with sentencepiece
Applying BPE
Binarizing the data with fairseq-preprocess
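The first step, data cleaning, might look like the sketch below. It is an illustration only: the function name `clean_parallel`, the filtering rules (empty lines, overlong lines, extreme length ratios, exact duplicates), and the thresholds are all assumptions, not the authors' actual cleaning procedure.

```python
def clean_parallel(src_lines, tgt_lines, max_len=250, max_ratio=8.0):
    """Filter a parallel corpus with a few common heuristics.

    The rules and default thresholds are illustrative assumptions,
    not the settings used for PEIT.
    """
    seen = set()
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # drop pairs with an empty side
        # Lengths are counted in characters because Chinese text
        # is not space-delimited.
        if len(src) > max_len or len(tgt) > max_len:
            continue  # drop overlong sentences
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # drop pairs that are likely misaligned
        if (src, tgt) in seen:
            continue  # drop exact duplicate pairs
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

The cleaned pairs would then be written back to train.en and train.zh before BPE training with sentencepiece and binarization with fairseq-preprocess.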

Datasets:

UN dataset

Training

mt_data is used to pre-train the text-only machine translation model in stage one.
it_data is used to pre-train the image translation model in stage two.
ft_data is the ECOIT dataset and is used to fine-tune the image translation model.
Please refer to run.sh for the complete training process.
Please refer to run_generate.sh for image translation inference.

Multi-Line Image Translation

For multi-line image translation, set --multi-line --model-height 320 --model-width 480 during IT training. The visual encoder of the IT model will then flatten the feature maps from multiple rows into a single row.
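The flattening step can be pictured with plain Python lists. This is only a conceptual sketch: the function name `flatten_rows` is hypothetical, the row-major order is an assumption, and the actual visual encoder operates on tensors rather than nested lists.

```python
def flatten_rows(feature_map):
    """Concatenate the rows of an H x W grid of feature vectors into one row.

    feature_map: list of H rows, each a list of W feature vectors.
    Returns one list of H * W feature vectors in row-major order
    (assumed here; the repository may order them differently).
    """
    return [vector for row in feature_map for vector in row]


# A 2 x 3 grid of 1-dimensional features becomes a sequence of length 6.
grid = [[[1.0], [2.0], [3.0]],
        [[4.0], [5.0], [6.0]]]
sequence = flatten_rows(grid)
```

After flattening, the downstream encoder sees a multi-line image as one longer sequence, in the same form as a single-line image.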

Pretrained OCR Initialization

For the CRNN visual encoder, we initialize the parameters from a pretrained OCR model (zh_sim_g2.pth), which you can download from EasyOCR.

Citation

Please cite as:

@inproceedings{zhu-etal-2023-peit,
    title = "{PEIT}: Bridging the Modality Gap with Pre-trained Models for End-to-End Image Translation",
    author = "Zhu, Shaolin  and
      Li, Shangjie  and
      Lei, Yikun  and
      Xiong, Deyi",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.751",
    doi = "10.18653/v1/2023.acl-long.751",
    pages = "13433--13447",
    abstract = "Image translation is a task that translates an image containing text in the source language to the target language. One major challenge with image translation is the modality gap between visual text inputs and textual inputs/outputs of machine translation (MT). In this paper, we propose PEIT, an end-to-end image translation framework that bridges the modality gap with pre-trained models. It is composed of four essential components: a visual encoder, a shared encoder-decoder backbone network, a vision-text representation aligner equipped with the shared encoder and a cross-modal regularizer stacked over the shared decoder. Both the aligner and regularizer aim at reducing the modality gap. To train PEIT, we employ a two-stage pre-training strategy with an auxiliary MT task: (1) pre-training the MT model on the MT training data to initialize the shared encoder-decoder backbone network; and (2) pre-training PEIT with the aligner and regularizer on a synthesized dataset with rendered images containing text from the MT training data. In order to facilitate the evaluation of PEIT and promote research on image translation, we create a large-scale image translation corpus ECOIT containing 480K image-translation pairs via crowd-sourcing and manual post-editing from real-world images in the e-commerce domain. Experiments on the curated ECOIT benchmark dataset demonstrate that PEIT substantially outperforms both cascaded image translation systems (OCR+MT) and previous strong end-to-end image translation model, with fewer parameters and faster decoding speed.",
}

