Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang
- [2024/9/14]🔥🔥🔥 We release the official demo. Thanks very much for Huggingface providing the GPU resource.
- [2024/9/13]🔥🔥🔥 We release the Huggingface deployment.
- [2024/9/03]🔥🔥🔥 We open-source the codes, weights, and benchmarks. The paper can be found in this repo. We also have submitted it to Arxiv.
- [2024/9/03]🔥🔥🔥 We release the OCR-2.0 model GOT!
Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. They are also restricted to use that follow the license agreement of Vary.
We encourage everyone to develop GOT applications based on this repo. Thanks for the following contributions :
Colab of GOT ~ contributor: @Zizhe Wang
CPU version of GOT ~ contributor: @ElvisClaros
Online demo ~ contributor: @Joseph Pollack
Dokcer & client demo ~ contributor: @QIN2DIM
GUI of GOT ~ contributor: @XJF2332
Towards OCR-2.0 via a Unified End-to-end Model
- Our environment is cuda11.8+torch2.0.1
- Clone this repository and navigate to the GOT folder
git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git
cd 'the GOT folder'
- Install Package
conda create -n got python=3.10 -y
conda activate got
pip install -e .
- Install Flash-Attention
pip install ninja
pip install flash-attn --no-build-isolation
- Huggingface
- Google Drive
- BaiduYun code: OCR2
- plain texts OCR:
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type ocr
- format texts OCR:
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format
- fine-grained OCR:
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --box [x1,y1,x2,y2]
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --color red/green/blue
- multi-crop OCR:
python3 GOT/demo/run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /an/image/file.png
- multi-page OCR (the image path contains multiple .png files):
python3 GOT/demo/run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /images/path/ --multi-page
- render the formatted OCR results:
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format --render
Note: The rendering results can be found in /results/demo.html. Please open the demo.html to see the results.
- Train sample can be found here. Note that the '<image>' in the 'conversations'-'human'-'value' is necessary!
- This codebase only supports post-training (stage-2/stage-3) upon our GOT weights.
- If you want train from stage-1 described in our paper, you need this repo.
deepspeed /GOT-OCR-2.0-master/GOT/train/train_GOT.py \ --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json --model_name_or_path /GOT_weights/ \ --use_im_start_end True \ --bf16 True \ --gradient_accumulation_steps 2 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 200 \ --save_total_limit 1 \ --weight_decay 0. \ --warmup_ratio 0.001 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 8192 \ --gradient_checkpointing True \ --dataloader_num_workers 8 \ --report_to none \ --per_device_train_batch_size 2 \ --num_train_epochs 1 \ --learning_rate 2e-5 \ --datasets pdf-ocr+scence \ --output_dir /your/output/path