Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations


Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang


Links: Project Page · Tar Paper on arXiv · Hugging Face Model · Hugging Face Spaces (Demo 1 and Demo 2)

News

  • June 2025. Code and models are released.

Contents

  • Install
  • Models
  • Inference
  • Demo
  • Train
  • Evaluation
  • Citation
  • License

Install

git clone https://github.com/csuhan/Tar && cd Tar

conda create -n tar python=3.10 -y
conda activate tar

pip install -r requirements.txt

# optional
pip install flash-attn --no-build-isolation
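
A quick sanity check can confirm the environment before moving on. This is a minimal sketch, assuming PyTorch was installed via requirements.txt; flash-attn is optional, as noted above:

# Environment sanity check: verifies PyTorch and CUDA availability,
# and reports whether the optional flash-attn build is importable.
import torch
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional)")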

Models

1️⃣ Text-Aligned Tokenizer (TA-Tok)

| Model  | Encoder | Input Size | Codebook Size | Link       |
|--------|---------|------------|---------------|------------|
| TA-Tok | SigLIP2 | 384px      | 65536         | ta_tok.pth |

2️⃣ De-Tokenizer

| Model   | Type | VQVAE          | Output Size | Link                  |
|---------|------|----------------|-------------|-----------------------|
| AR-DTok | AR   | vq_ds16_t2i.pt | 256px       | ar_dtok_lp_256px.pth  |
| AR-DTok | AR   | vq_ds16_t2i.pt | 512px       | ar_dtok_lp_512px.pth  |
| AR-DTok | AR   | vq_ds16_t2i.pt | 1024px      | ar_dtok_lp_1024px.pth |

3️⃣ LLM

| Model    | Vision Tokenizer | LLM                   | Link            |
|----------|------------------|-----------------------|-----------------|
| Tar-1.5B | TA-Tok           | Qwen2.5-1.5B-Instruct | csuhan/Tar-1.5B |
| Tar-7B   | TA-Tok           | Qwen2.5-7B-Instruct   | csuhan/Tar-7B   |
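
All checkpoints above are hosted on Hugging Face. Below is a minimal sketch of fetching them with huggingface_hub; the repo ids mirror the inference snippets later in this README, and using snapshot_download for the LLM weights is our assumption (transformers can also pull csuhan/Tar-1.5B by name):

from huggingface_hub import hf_hub_download, snapshot_download

# TA-Tok tokenizer and the 1024px AR-DTok de-tokenizer
ta_tok_path = hf_hub_download("csuhan/TA-Tok", "ta_tok.pth")
ar_dtok_path = hf_hub_download("csuhan/TA-Tok", "ar_dtok_lp_1024px.pth")
# LlamaGen VQVAE used by AR-DTok
vqvae_path = hf_hub_download("peizesun/llamagen_t2i", "vq_ds16_t2i.pt")
# Full Tar-1.5B LLM weights (local snapshot directory)
llm_dir = snapshot_download("csuhan/Tar-1.5B")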

Inference

1️⃣ Text-to-image generation

from huggingface_hub import hf_hub_download
from t2i_inference import T2IConfig, TextToImageInference

config = T2IConfig(
    ar_path=hf_hub_download("csuhan/TA-Tok", "ar_dtok_lp_1024px.pth"),
    encoder_path=hf_hub_download("csuhan/TA-Tok", "ta_tok.pth"),
    decoder_path=hf_hub_download("peizesun/llamagen_t2i", "vq_ds16_t2i.pt"),
)
inference = TextToImageInference(config)

prompt = "A photo of a macaw"
image = inference.generate_image(prompt)
image.save("generated_image.png")

You can directly run python t2i_inference.py to generate images. The models will be downloaded automatically.
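
The same inference object can be reused across prompts. A small usage sketch; generate_image and its PIL return type are taken from the snippet above:

# Reuses `inference` from the snippet above; generate_image returns a PIL image.
prompts = ["A photo of a macaw", "A watercolor of a lighthouse at dusk"]
for i, prompt in enumerate(prompts):
    inference.generate_image(prompt).save(f"generated_{i}.png")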

2️⃣ Image Understanding

from huggingface_hub import hf_hub_download
from i2t_inference import I2TConfig, ImageToTextInference

config = I2TConfig(ta_tok_path=hf_hub_download("csuhan/TA-Tok", "ta_tok.pth"))
inference = ImageToTextInference(config)
description = inference.generate('asset/dog_cat.jpg', "Describe the image shortly.")
print(description)

You can run python i2t_inference.py to generate text for a given image.
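
The question string is free-form, so one image can be queried several ways. A short usage sketch, with the generate signature taken from the snippet above:

# Ask several questions about one image, reusing `inference` from above.
for question in ["Describe the image shortly.", "What animals are in the image?"]:
    print(question, "->", inference.generate("asset/dog_cat.jpg", question))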

Demo

🔥 Try the Hugging Face Space demos: Demo 1 and Demo 2

Run the demo locally:

python app.py

Train

Data format

Each data item should contain at least the following keys:

{
  "image": "path/to/image",
  "conversations": [
    {"from": "human", "value": "<image>\nDescribe the image shortly."},
    {"from": "gpt", "value": "The image describes a xxx"}
  ]
}

If a data item contains more than one image, the image key holds a list of image paths. We also recommend using parquet datasets instead of local JSON datasets. The parquet format differs slightly from the JSON format (a conversion sketch follows the example below):

{
  "image": {"bytes": img_bytes},
  "conversations": [
    ...
  ]
}
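
Here is a minimal conversion sketch, assuming pandas with pyarrow is installed and a hypothetical train.jsonl holding one JSON data item per line; it embeds the raw image bytes as shown above:

import json
import pandas as pd

# Hypothetical input: one JSON data item (first format above) per line.
rows = []
with open("train.jsonl") as f:
    for line in f:
        item = json.loads(line)
        with open(item["image"], "rb") as img:
            rows.append({"image": {"bytes": img.read()},
                         "conversations": item["conversations"]})

# pandas (via pyarrow) serializes the dict column as a struct with binary bytes.
pd.DataFrame(rows).to_parquet("train.parquet")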

For a quick start with Tar on small-scale datasets, run the following script:

bash scripts/Tar_1.5B_pretrain_demo.sh

The required data will be downloaded automatically. Note: make sure /tmp has more than 500 GB of free storage for the download, or change the data path in scripts/data_demo.yaml.

We also provide the model trained with the above script: csuhan/tar_1.5B_pretrain_demo. You can use it to verify that your environment is set up correctly.

Evaluation

1️⃣ Image Understanding Evaluation

bash scripts/eval/Tar_1.5B_pretrain_demo_und_eval.sh

You can modify MODEL_PATH and --tasks to evaluate other models and tasks.

2️⃣ Text-to-image Evaluation

bash scripts/eval/Tar_1.5B_pretrain_demo_gen_eval.sh

Note: you still need to follow the instructions in DPG-Bench and GenEval to evaluate the results.

Citation

@article{han2025tar,
  title={Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations}, 
  author={Han, Jiaming and Chen, Hao and Zhao, Yang and Wang, Hanyu and Zhao, Qi and Yang, Ziyan and He, Hao and Yue, Xiangyu and Jiang, Lu},
  journal={arXiv preprint arXiv:2506.18898},
  year={2025},
}

License

This project is licensed under the Apache 2.0 License.
