
Tokenize Anything via Prompting

Ting Pan¹,²*,   Lulu Tang²*,   Xinlong Wang,   Shiguang Shan¹

¹ICT-CAS,   ²BAAI
* Equal Contribution, Project Lead

[Paper] [🤗 Demo]

We present Tokenize Anything via Prompting (ECCV 2024), a unified and promptable model capable of simultaneously segmenting, recognizing, and captioning arbitrary regions with flexible visual prompts (point, box, and sketch). The model is trained on exhaustive segmentation masks sourced from SA-1B, coupled with semantic priors from a pre-trained EVA-CLIP with 5 billion parameters.

Installation

Preliminaries

torch >= 2.1

flash-attn >= 2.3.3 (for TextGeneration)

gradio-image-prompter (for GradioApp; install from URL)
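
A quick sanity check for these requirements (a minimal sketch; flash-attn is only needed for text generation, and the packaging module ships with pip):

from packaging.version import Version
import torch

# torch >= 2.1 is required; strip any local build suffix such as "+cu121".
assert Version(torch.__version__.split("+")[0]) >= Version("2.1")

try:
    import flash_attn  # only required for TextGeneration
    assert Version(flash_attn.__version__) >= Version("2.3.3")
except ImportError:
    print("flash-attn not installed; text generation will be unavailable")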

Installing Package

Clone this repository to local disk and install:

git clone https://github.com/baaivision/tokenize-anything.git
cd tokenize-anything && pip install .

You can also install from the remote repository:

pip install git+ssh://git@github.com/baaivision/tokenize-anything.git
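
Either way, you can confirm the install by listing the registered model types (a minimal sketch, assuming model_registry is a plain dict-like mapping, as the bracket indexing in Quick Start suggests):

from tokenize_anything import model_registry

# Should print the available model types, e.g. the tap_vit_* entries.
print(list(model_registry.keys()))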

Quick Start

Development

The TAP models can be used for diverse vision and language tasks.

We adopt a modular design that decouples all components and predictors.

As a best practice, implement your custom predictor and asynchronous pipeline as follows:

from tokenize_anything import model_registry

with <distributed_actor>:
    # Build the model from the registry and load its checkpoint.
    model = model_registry["<model_type>"](checkpoint="<path/to/checkpoint>")
    # Run your custom predictor on this actor.
    results = <custom_predictor>(model, *args, **kwargs)

# Gather the results from all actors.
server.collect_results()

See the built-in examples (web demo and evaluations) provided in scripts for more details, or the concrete sketch below.
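
As one concrete illustration, this sketch fills in the placeholders using Python's multiprocessing in place of a distributed actor; the tap_vit_l model type and the checkpoint path are assumptions (see Models below), and the queue stands in for server.collect_results():

import multiprocessing as mp

from tokenize_anything import model_registry

def worker(rank, checkpoint, queue):
    # Hypothetical predictor: build a model on this worker and report back.
    model = model_registry["tap_vit_l"](checkpoint=checkpoint)
    queue.put((rank, type(model).__name__))

if __name__ == "__main__":
    checkpoint = "weights/tap_vit_l_v1_1.pkl"  # hypothetical local path
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(i, checkpoint, queue)) for i in range(2)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]  # cf. server.collect_results()
    for p in procs:
        p.join()
    print(results)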

Inference

See Inference Guide.

See Concept Guide.

Evaluation

See Evaluation Guide for TAP-H.

See Evaluation Guide for TAP-L.

See Evaluation Guide for TAP-B.

Models

Model weights

V1.1 Release Notes

  • Three versions of the model are available, each with a different image encoder.
  • Uses a longer pre-training and fine-tuning schedule (improves segmentation and captioning performance).
  • Applies weight decay to all bias parameters (avoids FP16 overflow in the QK matmul).
  • Samples point prompts from the predicted mask instead of the GT box during VG training.
| Model | Description | Schedule | MD5 | Weights |
| --- | --- | --- | --- | --- |
| tap_vit_h | ViT-H TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | 4bdfb9 | 🤗 HF link |
| tap_vit_l | ViT-L TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | c1d41f | 🤗 HF link |
| tap_vit_b | ViT-B TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | 707f80 | 🤗 HF link |
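
The MD5 column lists the first six hex digits of each checkpoint's checksum; a downloaded file can be verified like so (a minimal sketch with a hypothetical filename):

import hashlib

def md5_prefix(path, n=6):
    # Stream the file in 1 MiB chunks and return the first n hex digits of its MD5.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:n]

print(md5_prefix("tap_vit_h_v1_1.pkl"))  # expect "4bdfb9" for tap_vit_h v1.1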

V1.0 Release Notes

  • Two versions of the model are available, each with a different image encoder.
  • Reproduces the original paper results.
| Model | Description | Schedule | MD5 | Weights |
| --- | --- | --- | --- | --- |
| tap_vit_l | ViT-L TAP v1.0 model | (50% SA-1B, 90k), (VG, 25ep) | 03f8ec | 🤗 HF link |
| tap_vit_b | ViT-B TAP v1.0 model | (50% SA-1B, 90k), (VG, 25ep) | b45cbf | 🤗 HF link |

Concept weights

Note: You can generate these weights following the Concept Guide.

| Concept | Description | Weights |
| --- | --- | --- |
| Merged-2560 | Merged concepts | 🤗 HF link |
| LVIS-1203 | LVIS concepts | 🤗 HF link |
| COCO-80 | COCO concepts | 🤗 HF link |

License

Apache License 2.0

Citation

@article{pan2023tap,
  title={Tokenize Anything via Prompting},
  author={Pan, Ting and Tang, Lulu and Wang, Xinlong and Shan, Shiguang},
  journal={arXiv preprint arXiv:2312.09128},
  year={2023}
}

Acknowledgement

We thank the following repositories: SAM, EVA, LLaMA, FlashAttention, Gradio, Detectron2, and CodeWithGPU.
