This repository contains evaluation code for the paper SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models.
[arXiv Preprint] [Project Page] [HuggingFace Dataset]
SCAM (Subtle Character Attacks on Multimodal Models) is a benchmark for evaluating the typographic robustness of multimodal foundation models. It consists of real-world images in which post-its carrying adversarial text are placed in the scene, testing whether models correctly identify the main object or are misled by the adversarial text.
Our key findings include:
- Models of both VLM and LVLM classes experience a significant drop in accuracy when faced with real-world typographic attacks
- Synthetic attacks closely align with real-world attacks, providing empirical support for the established practice of evaluating on synthetic datasets
- LVLMs, such as the LLaVA family, exhibit vulnerabilities to typographic attacks, reflecting weaknesses inherited from their vision encoders
- Employing larger LLM backbones reduces susceptibility to typographic attacks in LVLMs while also improving their textual understanding
This repository provides the code to run evaluations across three categories of multimodal models:
- Vision-Language Models (VLMs) via OpenCLIP (`main_vlm_openclip.py`)
- Large Vision-Language Models (LVLMs) via OpenAI's API (`main_lvlm_openai.py`)
- Open-access LVLMs via Ollama (`main_lvlm_ollama.py`)
- Python 3.8+
- Docker (optional, for containerized execution)
```bash
cp vlm_openclip/requirements.txt .
pip install -r requirements.txt
```

For Docker:

```bash
cp vlm_openclip/Dockerfile .
docker build -t scam-vlm-openclip .
```
```bash
pip install openai pandas python-dotenv tqdm pillow datasets
```
```bash
cp lvlm_ollama/requirements.txt .
pip install -r requirements.txt
```

For Docker:

```bash
cp lvlm_ollama/Dockerfile .
docker build -t scam-lvlm-ollama .
```
See our LLaVA fork, which is adapted to work with custom OpenCLIP vision encoders and explains how we train the custom LLaVA version evaluated in Appendix A.6 of our paper.
For Docker:

```bash
cp lvlm_llava/Dockerfile .
docker build -t scam-lvlm-llava .
```

Note that you first need to build a Docker image containing the above-mentioned LLaVA fork.
The SCAM dataset is automatically downloaded from HuggingFace when running the evaluation scripts.
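If you want to inspect the data outside the evaluation scripts, it can also be loaded directly with the `datasets` library. The snippet below is a minimal sketch; the dataset identifier `BLISS-e-V/SCAM`, the split name, and the field names are assumptions, so check the HuggingFace dataset card for the actual schema.

```python
from datasets import load_dataset

# Dataset ID, split, and field names are assumptions; see the HuggingFace dataset card.
ds = load_dataset("BLISS-e-V/SCAM", split="train")

sample = ds[0]
print(ds.column_names)   # inspect the actual schema
sample["image"].show()   # PIL image containing the post-it attack (field name assumed)
```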
The evaluation scripts also support two additional datasets:
- RTA100: Download from the Defense-Prefix Repository and extract to `data/RTA100/`
- PAINT: Download from the Patching Repository and extract to `data/PAINT/`
```bash
python main_vlm_openclip.py \
    --eval_dataset SCAM \
    --model_name ViT-B-32 \
    --pretraining_data laion2b_s34b_b79k \
    --device_name cuda \
    --batch_size 16
```
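At its core, the VLM evaluation is zero-shot classification: the image is embedded with OpenCLIP and scored against text prompts for the true object label and for the attack word, and the model counts as fooled when the attack prompt scores higher. The snippet below is a minimal sketch of that idea, not the exact logic of `main_vlm_openclip.py`; the prompt template, labels, and image path are illustrative.

```python
import open_clip
import torch
from PIL import Image

# Same model/pretraining pair as in the command above.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Illustrative labels: the real object vs. the word written on the post-it.
prompts = ["a photo of a banana", "a photo of a keyboard"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Higher probability for the attack prompt means the model was fooled.
print(dict(zip(prompts, probs[0].tolist())))
```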
Create a `.env` file with your OpenAI API key:

```bash
OPENAI_API_KEY=your_key_here
```

Run the evaluation:

```bash
python main_lvlm_openai.py
```
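For orientation, a single query to the OpenAI API looks roughly like the sketch below; the model choice, prompt wording, and image path are illustrative and not necessarily what `main_lvlm_openai.py` sends.

```python
import base64
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY from the .env file
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Encode a local image as base64 so it can be sent inline.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What object is shown in this image? Answer with (a) banana or (b) keyboard."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```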
```bash
python main_lvlm_ollama.py \
    --model_name llava:34b \
    --eval_dataset SCAM
```
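The equivalent query against a locally served Ollama model can be made with the `ollama` Python package, assuming `ollama serve` is running and the model has been pulled (e.g. `ollama pull llava:34b`); the prompt and file name are again illustrative.

```python
import ollama

# Assumes the Ollama server is running locally and llava:34b has been pulled.
response = ollama.chat(
    model="llava:34b",
    messages=[{
        "role": "user",
        "content": "What object is shown in this image? Answer with (a) banana or (b) keyboard.",
        "images": ["example.jpg"],  # local path; the client encodes it for the model
    }],
)
print(response["message"]["content"])
```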
```bash
python main_lvlm_llava.py \
    --model_name llava-7b-UCSC \
    --model_path /llava-checkpoints/llava-v1.5-7b-UCSC-VLAA-ViT-L-14-CLIPA-336-datacomp1B-bs2-bf16-zero2-11.8 \
    --eval_dataset SCAM
```
See our VLMEvalKit fork.
The following table shows the performance of various Vision-Language Models (VLMs) and Large Vision-Language Models (LVLMs) on the SCAM datasets, contrasting accuracy on the clean NoSCAM images with accuracy on the SCAM images to show how much performance drops once typographic attacks are introduced.
Note that, internally, all LLaVA models that we evaluate utilize `ViT-L-14-336` trained by OpenAI for the image encoding. Furthermore, `ViT-bigG-14` trained on `laion2b` is used in the Kandinsky diffusion model.
| Model | Training data | NoSCAM Accuracy (%) | SCAM Accuracy (%) | Accuracy Drop (↓) |
|---|---|---|---|---|
| **Vision-Language Models (VLMs)** | | | | |
| `RN50` | `openai` | 97.76 | 36.61 | ↓61.15 |
| `ViT-B-32` | `laion2b` | 98.45 | 74.68 | ↓23.77 |
| `ViT-B-16` | `laion2b` | 98.71 | 69.16 | ↓29.55 |
| `ViT-B-16-SigLIP` | `webli` | 99.22 | 81.40 | ↓17.82 |
| `ViT-L-14` | `commonpool_xl` | 99.48 | 74.68 | ↓24.80 |
| `ViT-L-14` | `openai` | 99.14 | 40.14 | ↓59.00 |
| `ViT-L-14-336` | `openai` | 99.22 | 33.85 | ↓65.37 |
| `ViT-L-14-CLIPA-336` | `datacomp1b` | 99.57 | 74.76 | ↓24.81 |
| `ViT-g-14` | `laion2b` | 99.05 | 61.93 | ↓37.12 |
| `ViT-bigG-14` | `laion2b` | 99.40 | 70.89 | ↓28.51 |
| **Large Vision-Language Models (LVLMs)** | | | | |
| `llava-llama3:8b` | - | 98.09 | 39.50 | ↓58.59 |
| `llava:7b-v1.6` | - | 97.50 | 58.43 | ↓39.07 |
| `llava:13b-v1.6` | - | 98.88 | 58.00 | ↓40.88 |
| `llava:34b-v1.6` | - | 98.97 | 84.85 | ↓14.11 |
| `gemma3:4b` | - | 97.24 | 58.05 | ↓39.19 |
| `gemma3:12b` | - | 99.14 | 52.02 | ↓47.12 |
| `gemma3:27b` | - | 97.42 | 81.67 | ↓15.75 |
| `llama3.2-vision:90b` | - | 98.88 | 71.01 | ↓27.87 |
| `llama4:scout` | - | 99.23 | 88.12 | ↓11.10 |
| `gpt-4o-mini-2024-07-18` | - | 99.40 | 84.68 | ↓14.72 |
| `gpt-4o-2024-08-06` | - | 99.48 | 96.82 | ↓2.67 |
You can reproduce these results by running the evaluation scripts as described in the Usage Examples section.
For the full list of all 110 models evaluated, please refer to the table in the appendix of our paper.
If you find this repository useful for your research, please consider citing our paper:
```bibtex
@misc{scambliss2025,
  title={SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models},
  author={Justus Westerhoff and Erblina Purelku and Jakob Hackstein and Leo Pinetzki and Lorenz Hufe},
  year={2025},
  eprint={2504.04893},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.04893},
}
```
This project is licensed under the MIT License.
We thank the authors of OpenCLIP, Ollama, and the creators of the underlying multimodal models for making their work available to the research community. We especially thank the BLISS community for making this collaborative research effort possible.