Ting Pan1,2*, Lulu Tang2*, Xinlong Wang2¶, Shiguang Shan1
We present Tokenize Anything via Prompting, a unified and promptable model capable of simultaneously segmenting, recognizing, and captioning arbitrary regions with flexible visual prompts (point, box, and sketch). The model is trained with exhaustive segmentation masks sourced from SA-1B, coupled with semantic priors from a pre-trained EVA-CLIP with 5 billion parameters.
- torch >= 2.1
- flash-attn >= 2.3.3 (for TextGeneration)
- gradio-image-prompter (for GradioApp, install from URL)
Clone this repository to local disk and install:

```bash
cd tokenize-anything && pip install .
```
You can also install directly from the remote repository:

```bash
pip install git+ssh://git@github.com/baaivision/tokenize-anything.git
```
The TAP models can be used for diverse vision and language tasks.
We adopt a modular design that decouples all components and predictors.
As a best practice, implement your custom predictor and asynchronous pipeline as follows:
```python
from tokenize_anything import model_registry

with <distributed_actor>:
    model = model_registry["<model_type>"](checkpoint="<path/to/checkpoint>")
    results = <custom_predictor>(model, *args, **kwargs)

server.collect_results()
```
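To make the decoupled pattern above concrete, here is a minimal, framework-free sketch. Note that `dummy_model` and `custom_predictor` below are illustrative stand-ins, not the TAP API; in practice the model comes from `model_registry["<model_type>"](checkpoint=...)` as shown above.

```python
# Minimal sketch of the decoupled predictor pattern.
# The model and predictor are stand-ins for illustration only.

def custom_predictor(model, inputs):
    """Run the model on each input and collect per-item results."""
    return [model(x) for x in inputs]

def dummy_model(image_path):
    # Stand-in for a TAP model call; returns a placeholder result dict.
    return {"image": image_path, "masks": [], "captions": []}

results = custom_predictor(dummy_model, ["a.jpg", "b.jpg"])
print(len(results))  # one result per input
```

Because the predictor only assumes the model is callable, the same pipeline code can wrap any of the TAP checkpoints, or run under a distributed actor as in the snippet above.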
See the built-in examples (web demo and evaluations) provided in `scripts` for more details.
See Inference Guide.
See Concept Guide.
See Evaluation Guide for TAP-H.
See Evaluation Guide for TAP-L.
See Evaluation Guide for TAP-B.
- Three versions of the model are available with different image encoders.
- Use a longer pre-training and fine-tuning schedule (improved segmentation and caption performance).
- Apply weight decay for all bias parameters (avoid FP16 overflow in QK matmul).
- Sample point prompts from predicted mask instead of GT box during VG training.
| Model | Description | Schedule | MD5 | Weights |
|---|---|---|---|---|
| tap_vit_h | ViT-H TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | 4bdfb9 | 🤗 HF link |
| tap_vit_l | ViT-L TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | c1d41f | 🤗 HF link |
| tap_vit_b | ViT-B TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | 707f80 | 🤗 HF link |
- Two versions of the model are available with different image encoders.
- Original paper results.
| Model | Description | Schedule | MD5 | Weights |
|---|---|---|---|---|
| tap_vit_l | ViT-L TAP v1.0 model | (50% SA-1B, 90k), (VG, 25ep) | 03f8ec | 🤗 HF link |
| tap_vit_b | ViT-B TAP v1.0 model | (50% SA-1B, 90k), (VG, 25ep) | b45cbf | 🤗 HF link |
Note: You can generate these weights following the Concept Guide.
| Concept | Description | Weights |
|---|---|---|
| Merged-2560 | Merged concepts | 🤗 HF link |
| LVIS-1203 | LVIS concepts | 🤗 HF link |
| COCO-80 | COCO concepts | 🤗 HF link |
```bibtex
@article{pan2023tap,
  title={Tokenize Anything via Prompting},
  author={Pan, Ting and Tang, Lulu and Wang, Xinlong and Shan, Shiguang},
  journal={arXiv preprint arXiv:2312.09128},
  year={2023}
}
```
We thank these repositories: SAM, EVA, LLaMA, FlashAttention, Gradio, Detectron2, and CodeWithGPU.