This repo contains the official evaluation code and dataset for the paper "ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness"
Chinese Version: [知乎]
- 🔥 More than 5,800 image-text questions covering diverse application scenarios and practical challenges for VLM evaluation.
- 🔥 3 categories and 11 tasks evaluating a range of color-centric capabilities across Perception, Reasoning, and Robustness.
- 🔎 The scaling law for the language and vision parts: the scaling law still holds for color understanding but is much weaker and depends mainly on the language model part.
- 🔎 Absolute performance gaps between models of different sizes: the absolute performance of all VLMs is relatively low, and the gaps between different models (open-source vs. proprietary, small vs. large) are not large.
- 🔎 Introducing reasoning steps: Adding reasoning steps can still improve VLMs' performance on ColorBench tasks, even for color robustness.
- 🔎 The effect of color clues: color clues are indeed leveraged by VLMs in most tasks. However, in the color illusion and mimicry tasks, colors can mislead VLMs into giving wrong answers, and converting color images to grayscale can improve accuracy (see the sketch after this list).
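The grayscale conversion mentioned in the last finding is straightforward to reproduce with Pillow. The sketch below is illustrative only and is not taken from the ColorBench codebase; the file names are hypothetical.

```python
# Minimal sketch of the grayscale ablation: convert a color image to grayscale
# (kept in 3-channel RGB so it still fits standard VLM image pipelines).
# File names are hypothetical; ColorBench's own code may differ.
from PIL import Image

img = Image.open("color_illusion_example.png")   # hypothetical input image
gray = img.convert("L").convert("RGB")           # grayscale, converted back to RGB
gray.save("color_illusion_example_gray.png")
```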
ColorBench is the first benchmark explicitly designed to comprehensively evaluate the color understanding capabilities of VLMs across three key dimensions: Color Perception, Color Reasoning, and Color Robustness. The benchmark consists of 1,448 instances and more than 5,800 image-text questions spanning 11 diverse tasks (Color Recognition, Color Extraction, Object Recognition, Color Proportion, Color Comparison, Color Counting, Object Counting, Color Illusion, Color Mimicry, Color Blindness, and Color Robustness). For the Color Perception and Color Reasoning categories, each instance contains an image, a question, and 3 to 6 multiple-choice options with exactly one correct answer. For Color Robustness, each instance consists of 10 multiple-choice image-text questions built from a seed image and 9 color-edited variants of it.
We conduct an extensive evaluation of 32 vision-language models (VLMs) spanning a range of large language model (LLM) sizes and architectures. Our evaluation includes state-of-the-art models such as GPT-4o, Gemini-2-flash, LLaVA-OV, LLaVA-NEXT, Cambrian-1, InternVL2, Qwen2.5-VL, and Eagle. This selection covers a diverse set of architectures, including both proprietary and open-source models, enabling a comprehensive assessment of their reasoning capabilities under different computational constraints.
We provide detailed instructions for evaluation as follows.
Install the packages required by the VLMs:
```bash
conda create -n colorbench python=3.11
conda activate colorbench
pip3 install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
```
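As an optional sanity check (not part of the repository), you can verify that the key dependencies import and that a GPU is visible before running inference:

```python
# Optional environment check; assumes PyTorch was installed via requirements.txt.
import torch
import flash_attn  # raises ImportError if the flash-attn build above failed

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn imported successfully")
```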
We release ColorBench on Hugging Face; it contains more than 5,800 image-text pairs. You can download and view the dataset with the following commands:
```python
from datasets import load_dataset

dataset = load_dataset("umd-zhou-lab/ColorBench", "test")

# Evaluation samples
print(dataset["test"][0])
```
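To get a feel for the data, you can iterate over a few samples as sketched below. The field names used here (`task`, `question`, `choices`) are assumptions based on the dataset description above, not the confirmed schema; print `dataset["test"].column_names` to see the actual columns.

```python
# Print the real schema first; the per-field access below uses assumed names.
print(dataset["test"].column_names)

for sample in dataset["test"].select(range(3)):
    # Hypothetical field names -- replace with the actual column names printed above.
    print(sample.get("task"), "|", sample.get("question"))
    print("options:", sample.get("choices"))
```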
Inference code for several VLMs is provided in the model folder. You can use the script model_inference.sh to run inference on our benchmark:
```bash
bash model_inference.sh
```
Before running it, modify the necessary folder paths and API keys in model_inference.sh:
```bash
ROOT_DIR="PATH/TO/ROOT_DIR"      # Needed only if using json for model inference
RESULT_DIR="PATH/TO/RESULT_DIR"  # Path to save the model inference results
GEMINI_API_KEY="YOUR_API_KEY"
GPT4O_API_KEY="YOUR_API_KEY"
```
The cache folder for models and the dataset can be modified in path_utils.py. If not changed, the cache folder defaults to the home directory:
```python
CACHE_DIR = "YOUR_HF_CACHE_FOLDER"
```
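For reference, this is how a custom cache directory is typically passed to the Hugging Face libraries; path_utils.py may apply CACHE_DIR differently, so treat this only as an illustration.

```python
# Illustration only: applying a cache directory with standard Hugging Face APIs.
# path_utils.py may wire CACHE_DIR into the pipeline in its own way.
from datasets import load_dataset
from transformers import AutoProcessor

CACHE_DIR = "YOUR_HF_CACHE_FOLDER"

dataset = load_dataset("umd-zhou-lab/ColorBench", "test", cache_dir=CACHE_DIR)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", cache_dir=CACHE_DIR)
```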
We provide the script metrics_eval.py to compute evaluation metrics from the inference results. You can run this command to get the final results:
```bash
python3 evaluation/metrics_eval.py --result_dir=RESULT_DIR --save_dir=SAVE_DIR
```
The final results will be saved in the folder SAVE_DIR.
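At its core, the evaluation is accuracy: each model prediction is compared against the ground-truth option. The sketch below assumes a hypothetical result format (one JSON file per model, containing records with `prediction` and `answer` fields); the files written by the inference scripts may differ, so metrics_eval.py remains the authoritative implementation.

```python
# Illustration only, with a hypothetical result format; use metrics_eval.py for real scores.
import json
from pathlib import Path

def accuracy(result_file: Path) -> float:
    """Fraction of questions whose predicted option matches the ground truth."""
    records = json.loads(result_file.read_text())
    correct = sum(r["prediction"] == r["answer"] for r in records)  # hypothetical field names
    return correct / len(records)

for path in sorted(Path("PATH/TO/RESULT_DIR").glob("*.json")):  # placeholder directory
    print(path.name, f"{accuracy(path):.3f}")
```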
```bibtex
@misc{liang2025colorbenchvlmsunderstandcolorful,
  title={ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness},
  author={Yijun Liang and Ming Li and Chenrui Fan and Ziyue Li and Dang Nguyen and Kwesi Cobbina and Shweta Bhardwaj and Jiuhai Chen and Fuxiao Liu and Tianyi Zhou},
  year={2025},
  eprint={2504.10514},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.10514},
}
```