TransCorpus

TransCorpus is a scalable, production-ready API and CLI toolkit for large-scale parallel translation, preprocessing, and corpus management. It supports multi-GPU translation, robust checkpointing, and safe concurrent downloads, making it ideal for research and industry-scale machine translation workflows.

Features

🚀 Multi-GPU and multi-process translation
📦 Corpus downloading and preprocessing
🔒 Safe, resumable, and concurrent file downloads
🧩 Split and checkpoint management for large corpora
🛠️ Easy deployment and extensibility
🖥️ Cross-platform: Linux, macOS, Windows

Quick Start

Clone and Install

git clone https://github.com/jknafou/TransCorpus.git
cd TransCorpus
UV_INDEX_STRATEGY=unsafe-best-match rye sync
source .venv/bin/activate

Download a Corpus

transcorpus download-corpus [corpus_name]

Preprocess the corpus by splits

transcorpus preprocess [corpus_name] [language] --num-split 100

Translate (and preprocess if not done) the corpus by split

transcorpus translate [corpus_name] [language] --num-split 100

Preview a corpus with two languages next to each other:

transcorpus preview [corpus_name] [language1] Opt[language2]

Example of two languages next to each other

A demo mode can be tested using the -d flag for each command.

Preprocess and Translate (Multi-GPU Example)

The following example translates the bio corpus (PubMed) of about 30GB, preprocessing it with 4 parallel workers, while translating each available split with two GPUs of different sizes. It can easily be modified to one needs. When deployed on an HPC cluster, for example with SLURM, it will automatically resume from where it left off in the previous run. With shared memory, multiple GPUs from different nodes can work simultaneously.

# Preprocess with 4 workers iteratively, split into 20 parts (here in demo mode)
./example/multi_GPU.sh bio de 4 20

Research-Proven Performance

Paper to be submitted to EMNLP2025

TransCorpus enables the training of state-of-the-art language models through synthetic translation. For example, TransBERT achieved superior performance by leveraging corpus translation with this toolkit. A paper detailing these results will be submitted to EMNLP 2025. 📝 Current Paper Version. If you use this toolkit, please cite:

@misc{knafou-transbert,
	author = {Knafou, Julien and Mottin, Luc and Ana\"{i}s, Mottaz and Alexandre, Flament and  Ruch, Patrick},
	title = {TransBERT: A Framework for Synthetic Translation in Domain-Specific Language Modeling},
	year = {2025},
	note = {Submitted to EMNLP2025. Anonymous ACL submission available:},
	url = {https://transbert.s3.text-analytics.ch/TransBERT.pdf},
}

🧬 Pretrained Models

Looking for pretrained models built with TransCorpus? Check out TransBERT-bio-fr on Hugging Face 🤗, a French biomedical language model trained entirely on synthetic translations generated by this toolkit. Also available, TransCorpus-bio-fr on Hugging Face 🤗

New corpus upload

One can easily add its own corpus (along with a demo) to the repo following the same schema of domains.json:

    "bio": {
        "database": {
            "file": "https://transcorpus.s3.text-analytics.ch/bibmed.tar.gz"
        },
        "corpus": {
            "file":
                "https://transcorpus.s3.text-analytics.ch/title_abstract_en.txt"
            ,
            "demo":
                "https://transcorpus.s3.text-analytics.ch/1k_sample.txt"
        },
        "id": {
            "file":
                "https://transcorpus.s3.text-analytics.ch/PMID.txt"
            ,
            "demo":
                "https://transcorpus.s3.text-analytics.ch/PMID_1k_sample.txt"
        },
        "language": "en"
    }

Where each line of the corpus is a different document. For the moment, a life-science corpus is available comprising about 28GB of raw text, 22M of abstracts from PubMed. The database it is made of can also be downloaded using transcorpus download-database bio.

Deployment

Requirements:

Python 3.10+
rye (for dependency management)
CUDA-enabled GPUs (for multi-GPU translation)

Contributing

Pull requests and issues are welcome!

License

MIT License

Acknowledgements

Swiss AI Center
fairseq
PyTorch
rye

TransCorpus makes large-scale, robust translation easy and reproducible.

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
.github/workflows		.github/workflows
example		example
src/transcorpus		src/transcorpus
tests		tests
.gitignore		.gitignore
.pylintrc		.pylintrc
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

TransCorpus

Features

Quick Start

Preprocess and Translate (Multi-GPU Example)

Research-Proven Performance

Paper to be submitted to EMNLP2025

🧬 Pretrained Models

New corpus upload

Deployment

Requirements:

Contributing

License

Acknowledgements

About

Uh oh!

Releases

Packages

Languages

jknafou/TransCorpus

Folders and files

Latest commit

History

Repository files navigation

TransCorpus

Features

Quick Start

Preprocess and Translate (Multi-GPU Example)

Research-Proven Performance

Paper to be submitted to EMNLP2025

🧬 Pretrained Models

New corpus upload

Deployment

Requirements:

Contributing

License

Acknowledgements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages