Authors: Mengshi Qi, Pengfei Zhu, Xiangtai Li, Xiaoyang Bi, Lu Qi, Huadong Ma, Ming-Hsuan Yang
[2025/03/17] The validation set for IC-VOS is now released. Enjoy exploring and working with it!
[2025/02/08] The code and dataset for Image-to-image / Image-to-video In-Context Learning are released. Enjoy it :)
- Dual Consistency SAM (DC-SAM) for one-shot segmentation: Fully explores positive and negative features of visual prompts, generating high-quality prompts tailored for one-shot segmentation tasks.
- Query cyclic-consistent cross-attention mechanism: Ensures precise focus on key areas, effectively filtering out confusing components and improving accuracy and specificity in one-shot segmentation.
- New video in-context segmentation benchmark (IC-VOS): Introduces a manually collected benchmark from existing video datasets, providing a robust platform for evaluating state-of-the-art methods.
- Extension to SAM2 with mask tube design: Enhances prompt generation for video object segmentation, achieving strong performance on the proposed IC-VOS benchmark.
Step 1: clone this repository:
git clone https://github.com/zaplm/DC-SAM.git && cd DC-SAM
Step 2: create a conda environment and install the dependencies:
conda create -n DCSAM python=3.10 -y
conda activate DCSAM
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
cd segment-anything-2 && pip install -e .
cd ../segment-anything && pip install -e .
cd .. && pip install -r requirements.txt
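To quickly confirm the environment, the following checks should succeed (a minimal sanity check; it assumes the editable installs above registered the segment_anything and sam2 packages):
# verify PyTorch / CUDA and the two editable installs
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import segment_anything, sam2; print('SAM / SAM2 packages import OK')"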
Download the following datasets:
Download the PASCAL VOC2012 devkit (train/val data):
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
Download the PASCAL VOC2012 SDS extended mask annotations from [Google Drive].
Download the COCO2014 train/val images and annotations:
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
Download the COCO2014 train/val mask annotations: train2014.zip [Google Drive] and [val2014.zip], then place both the train2014/ and val2014/ mask directories under the annotations/ directory.
Download our proposed IC-VOS validation set from [Baidu Netdisk]/[HuggingFace dataset]. (COCO2014 train/val annotations and PASCAL VOC2012 SDS extended mask annotations can also be accessed via our [HuggingFace dataset])
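If you use the HuggingFace mirror, the standard huggingface-cli download works; the dataset ID below is only a placeholder, substitute the one linked above:
pip install -U "huggingface_hub[cli]"
huggingface-cli download <hf-dataset-id> --repo-type dataset --local-dir Datasets/IC-VOS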
Create a directory '../dataset' for the above few-shot segmentation datasets and arrange each dataset so that the directory structure looks as follows (example extraction commands are given after the tree):
Datasets/
├── VOC2012/ # PASCAL VOC2012 devkit
│ ├── Annotations/
│ ├── ImageSets/
│ ├── ...
│ └── SegmentationClassAug/
├── COCO2014/
│ ├── annotations/
│ │ ├── train2014/ # (dir.) training masks (from Google Drive)
│ │ ├── val2014/ # (dir.) validation masks (from Google Drive)
│ │ └── ..some json files..
│ ├── train2014/
│ └── val2014/
└── IC-VOS/
├── 0a43a414/
│ ├── Annotations/
│ └── JPEGImages/
└── ...
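For reference, one way to unpack the archives above into this layout (a sketch; it assumes the archives sit in the current directory, and note that the VOC devkit extracts to VOCdevkit/VOC2012 and is renamed here):
mkdir -p Datasets
tar -xf VOCtrainval_11-May-2012.tar            # extracts to VOCdevkit/VOC2012
mv VOCdevkit/VOC2012 Datasets/VOC2012
mkdir -p Datasets/COCO2014
unzip -q train2014.zip -d Datasets/COCO2014
unzip -q val2014.zip -d Datasets/COCO2014
unzip -q annotations_trainval2014.zip -d Datasets/COCO2014
# Add SegmentationClassAug/ (VOC) and the train2014/ / val2014/ mask folders from
# Google Drive under Datasets/VOC2012/ and Datasets/COCO2014/annotations/ respectively.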
The ResNet-50/101_v2 backbones can be downloaded from Google Drive.
The VGG-16_bn backbone can be downloaded from here.
The SAM checkpoint can be downloaded from this repository.
The SAM2 checkpoint can be downloaded from this repository.
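For convenience, the official checkpoints can also be fetched from the command line; which SAM/SAM2 variants DC-SAM expects is not restated here, so the ViT-H checkpoint and SAM2's bundled download script (checkpoints/download_ckpts.sh in the upstream repository) below are only examples:
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
cd segment-anything-2/checkpoints && ./download_ckpts.sh && cd ../..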
We recommend training DC-SAM with 4 GPUs.
Train on images:
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=6224 train_image.py --epochs 50/100 --benchmark coco/pascal --lr 1e-4/2e-4 --bsz 2 --nshot 1 \
--num_query 25 --sam_version 1/2 --nworker 8 --fold 0/1/2/3 --backbone resnet50/vgg16 --logpath log_name
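For example, one concrete instantiation of the command above, assuming the slash-separated alternatives pair up in order (PASCAL: 100 epochs, lr 2e-4); the log name is hypothetical:
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=6224 train_image.py --epochs 100 --benchmark pascal --lr 2e-4 --bsz 2 --nshot 1 \
--num_query 25 --sam_version 1 --nworker 8 --fold 0 --backbone resnet50 --logpath pascal_fold0_rn50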
Train on video (image-to-video pretraining or finetuning):
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=6224 train_video.py --epochs 40/10 --benchmark coco_all/coco_mask_tube --lr 1e-4/1e-5 --bsz 8/1 --nshot 1 \
--num_query 25 --data_type image/video --sam_version 2 --nworker 8 --backbone resnet50 --logpath log_name
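Likewise, a concrete pretrain-then-finetune pair for the video model, assuming the alternatives pair up in order (image-level pretraining on coco_all: 40 epochs, lr 1e-4, bsz 8; video finetuning on coco_mask_tube: 10 epochs, lr 1e-5, bsz 1); log names are hypothetical:
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=6224 train_video.py --epochs 40 --benchmark coco_all --lr 1e-4 --bsz 8 --nshot 1 \
--num_query 25 --data_type image --sam_version 2 --nworker 8 --backbone resnet50 --logpath video_pretrain
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=6224 train_video.py --epochs 10 --benchmark coco_mask_tube --lr 1e-5 --bsz 1 --nshot 1 \
--num_query 25 --data_type video --sam_version 2 --nworker 8 --backbone resnet50 --logpath video_finetune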
To evaluate on the IC-VOS benchmark:
python eval_video.py --coco_path /path/to/coco --icvos_path /path/to/icvos --ckpt /path/to/ckpt
To evaluate on few-shot segmentation benchmarks:
python eval_image.py --datapath /path/to/benchmark --benchmark pascal/coco --fold 0/1/2/3 --ckpt /path/to/ckpt
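To sweep all four folds on a benchmark, a simple loop works; the per-fold checkpoint naming below is hypothetical, so point --ckpt at your own files:
for fold in 0 1 2 3; do
  python eval_image.py --datapath /path/to/benchmark --benchmark pascal --fold $fold --ckpt /path/to/ckpt_fold${fold}.pth
done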
@article{qi2025dc,
title={DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency},
author={Qi, Mengshi and Zhu, Pengfei and Ma, Huadong and Qi, Lu and Li, Xiangtai and Yang, Ming-Hsuan},
journal={arXiv preprint arXiv:2504.12080},
year={2025}
}
This repository is licensed under Apache 2.0.