Jiaming Han, Hao Chen†, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue‡, Lu Jiang‡
† Project Lead ‡ Corresponding Authors
- June 2025. Code and models are released.
```bash
git clone https://github.com/csuhan/Tar && cd Tar
conda create -n tar python=3.10 -y
conda activate tar
pip install -r requirements.txt

# optional: flash-attn for faster attention kernels
pip install flash-attn --no-build-isolation
```
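A quick sanity check for the environment, shown here as a minimal sketch; `flash-attn` is optional, so its import is allowed to fail:

```python
import torch

# Basic PyTorch / CUDA check.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# flash-attn is optional (see the pip install above), so treat it as such.
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional)")
```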
1️⃣ Text-Aligned Tokenizer (TA-Tok)
| Model | Encoder | Input Size | Codebook Size | Link |
| --- | --- | --- | --- | --- |
| TA-Tok | SigLIP2 | 384px | 65536 | ta_tok.pth |
2️⃣ De-Tokenizer
| Model | Type | VQVAE | Output Size | Link |
| --- | --- | --- | --- | --- |
| AR-DTok | AR | vq_ds16_t2i.pt | 256px | ar_dtok_lp_256px.pth |
| AR-DTok | AR | vq_ds16_t2i.pt | 512px | ar_dtok_lp_512px.pth |
| AR-DTok | AR | vq_ds16_t2i.pt | 1024px | ar_dtok_lp_1024px.pth |
3️⃣ LLM
| Model | Vision Tokenizer | LLM | Link |
| --- | --- | --- | --- |
| Tar-1.5B | TA-Tok | Qwen2.5-1.5B-Instruct | csuhan/Tar-1.5B |
| Tar-7B | TA-Tok | Qwen2.5-7B-Instruct | csuhan/Tar-7B |
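A minimal sketch for fetching the released checkpoints from the Hugging Face Hub; repo and file names are taken from the tables above, and `hf_hub_download`/`snapshot_download` return local cache paths that can be passed to the configs below:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Tokenizer and de-tokenizer weights (single files in the csuhan/TA-Tok repo).
ta_tok_path = hf_hub_download("csuhan/TA-Tok", "ta_tok.pth")
ar_dtok_path = hf_hub_download("csuhan/TA-Tok", "ar_dtok_lp_1024px.pth")

# Full Tar-1.5B LLM checkpoint (downloads the whole repo).
tar_1_5b_dir = snapshot_download("csuhan/Tar-1.5B")

print(ta_tok_path, ar_dtok_path, tar_1_5b_dir, sep="\n")
```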
1️⃣ Text-to-image generation
```python
from huggingface_hub import hf_hub_download

from t2i_inference import T2IConfig, TextToImageInference

config = T2IConfig(
    ar_path=hf_hub_download("csuhan/TA-Tok", "ar_dtok_lp_1024px.pth"),
    encoder_path=hf_hub_download("csuhan/TA-Tok", "ta_tok.pth"),
    decoder_path=hf_hub_download("peizesun/llamagen_t2i", "vq_ds16_t2i.pt"),
)
inference = TextToImageInference(config)

prompt = "A photo of a macaw"
image = inference.generate_image(prompt)
image.save("generated_image.png")
```
You can directly run `python t2i_inference.py` to generate images; the models will be downloaded automatically.
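The De-Tokenizer table above also lists 256px and 512px AR-DTok checkpoints. A minimal sketch for generating at a lower output resolution, assuming `T2IConfig` takes the same arguments as in the snippet above:

```python
from huggingface_hub import hf_hub_download

from t2i_inference import T2IConfig, TextToImageInference

# Swap in the 512px AR-DTok checkpoint; encoder and decoder stay the same.
config = T2IConfig(
    ar_path=hf_hub_download("csuhan/TA-Tok", "ar_dtok_lp_512px.pth"),
    encoder_path=hf_hub_download("csuhan/TA-Tok", "ta_tok.pth"),
    decoder_path=hf_hub_download("peizesun/llamagen_t2i", "vq_ds16_t2i.pt"),
)
inference = TextToImageInference(config)
inference.generate_image("A photo of a macaw").save("macaw_512px.png")
```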
2️⃣ Image Understanding
```python
from huggingface_hub import hf_hub_download

from i2t_inference import I2TConfig, ImageToTextInference

config = I2TConfig(ta_tok_path=hf_hub_download("csuhan/TA-Tok", "ta_tok.pth"))
inference = ImageToTextInference(config)
description = inference.generate("asset/dog_cat.jpg", "Describe the image shortly.")
print(description)
```
You can run `python i2t_inference.py` to generate text for a given image.
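A small sketch that chains the two interfaces above, generating an image and then describing it; it assumes only the constructors and methods shown in the snippets above:

```python
from huggingface_hub import hf_hub_download

from t2i_inference import T2IConfig, TextToImageInference
from i2t_inference import I2TConfig, ImageToTextInference

ta_tok_path = hf_hub_download("csuhan/TA-Tok", "ta_tok.pth")

# Text -> image
t2i = TextToImageInference(T2IConfig(
    ar_path=hf_hub_download("csuhan/TA-Tok", "ar_dtok_lp_1024px.pth"),
    encoder_path=ta_tok_path,
    decoder_path=hf_hub_download("peizesun/llamagen_t2i", "vq_ds16_t2i.pt"),
))
image = t2i.generate_image("A photo of a macaw")
image.save("macaw.png")

# Image -> text on the generated output
i2t = ImageToTextInference(I2TConfig(ta_tok_path=ta_tok_path))
print(i2t.generate("macaw.png", "Describe the image shortly."))
```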
🔥 Try the Hugging Face Space demos: Demo 1 and Demo 2
Run the demo locally:

```bash
python app.py
```
Data format
Each data item should contain at least the following keys:
```json
{
    "image": "path/to/image",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the image shortly."},
        {"from": "gpt", "value": "The image describes a xxx"}
    ]
}
```
If a data item contains more than one image, the `image` key should be a list of images. We also recommend using Parquet datasets instead of local JSON datasets. The Parquet format differs slightly from the JSON format (see the conversion sketch after the example below):
```json
{
    "image": {"bytes": img_bytes},
    "conversations": [
        ...
    ]
}
```
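A minimal sketch for converting JSON-style items into a Parquet dataset with `pandas`/`pyarrow`; the file names are placeholders, the column layout mirrors the format above, and the exact schema expected by the training code may differ:

```python
import json

import pandas as pd

# Load JSON-style items (the first format shown above). "data.json" is a placeholder path.
with open("data.json") as f:
    items = json.load(f)

rows = []
for item in items:
    # Read the image file and store its raw bytes, matching the Parquet format above.
    with open(item["image"], "rb") as img_f:
        img_bytes = img_f.read()
    rows.append({
        "image": {"bytes": img_bytes},
        "conversations": item["conversations"],
    })

# Requires pyarrow (or fastparquet) as the Parquet engine.
pd.DataFrame(rows).to_parquet("data.parquet")
```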
For a quick start with Tar on a small-scale dataset, you can run the following script:

```bash
bash scripts/Tar_1.5B_pretrain_demo.sh
```
The required data will be downloaded automatically. Note: make sure `/tmp` has more than 500 GB of free storage for the downloaded data, or change the data path in `scripts/data_demo.yaml`.
We also provide the model trained with the above script, csuhan/tar_1.5B_pretrain_demo, which you can use to verify that your environment is set up correctly.
1️⃣ Image Understanding Evaluation
```bash
bash scripts/eval/Tar_1.5B_pretrain_demo_und_eval.sh
```
You can modify `MODEL_PATH` and `--tasks` to evaluate other models and tasks.
2️⃣ Text-to-image Evaluation
```bash
bash scripts/eval/Tar_1.5B_pretrain_demo_gen_eval.sh
```
Note that you still need to follow the instructions in DPG-Bench and GenEval to evaluate the results.
```bibtex
@article{han2025tar,
  title={Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations},
  author={Han, Jiaming and Chen, Hao and Zhao, Yang and Wang, Hanyu and Zhao, Qi and Yang, Ziyan and He, Hao and Yue, Xiangyu and Jiang, Lu},
  journal={arXiv preprint arXiv:2506.18898},
  year={2025}
}
```
This project is licensed under the Apache 2.0 License.