This is a PyTorch implementation of the paper "QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models".
- Prepare the environment.
Create a Python environment and activate it with the following commands.
cd qava
conda create -n qava python=3.10
conda activate qava
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install transformers==4.31.0 numpy==1.26.4
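As an optional sanity check (not part of the repository), the environment can be verified from Python; the expected versions in the comments simply restate the pins above.

# Optional sanity check: confirm versions and GPU visibility.
import torch
import torchvision
import transformers
import numpy

print("torch:", torch.__version__)                 # expected 2.0.0
print("torchvision:", torchvision.__version__)     # expected 0.15.0
print("transformers:", transformers.__version__)   # expected 4.31.0
print("numpy:", numpy.__version__)                 # expected 1.26.4
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())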
- Clone this repository.
git clone https://github.com/btzyd/qava.git
- Prepare the COCO dataset
In the annotation folder, we provide VQA 32+50 (as defined in the paper), generated from VQA v2, as annotation/vqa_val_image_32_ques_50.json.
As for images, you can download COCO val2014 from its official website and then extract it, for example as sketched below.
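For reference, here is a minimal Python sketch for fetching and extracting val2014; the zip URL is the standard COCO download link, and the target directory is only an example that matches the --image_folder argument used later (the archive is roughly 6 GB, so adjust paths to your storage).

# Sketch: download and extract COCO val2014 (example paths, adjust as needed).
import os
import urllib.request
import zipfile

url = "http://images.cocodataset.org/zips/val2014.zip"   # standard COCO download link
target_dir = "/root/nfs/dataset"                          # should match --image_folder
os.makedirs(target_dir, exist_ok=True)
zip_path = os.path.join(target_dir, "val2014.zip")

urllib.request.urlretrieve(url, zip_path)                 # large download (~6 GB)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target_dir)                             # creates <target_dir>/val2014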
- Download BLIP-2 and InstructBLIP Models
You can download the InstructBLIP-7B model from the HuggingFace link instructblip-vicuna-7b. Alternatively, you can download the model on the fly when loading it in Python code, though that download may be unstable. We recommend downloading the model first and then running the Python code to load it from a local directory.
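As an illustration only (this is not code from attack_vqa.py), loading the locally downloaded model with transformers 4.31 might look like the following; the local path is just the example used in the commands below.

# Sketch: load InstructBLIP from a local directory with transformers.
import torch
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_path = "/root/huggingface_model/instructblip-vicuna-7b"  # example local path
processor = InstructBlipProcessor.from_pretrained(model_path)
model = InstructBlipForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16
).to("cuda").eval()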
The adversarial attacks (and a clean baseline) can be executed with the following commands.
# clean baseline
python -m torch.distributed.run \
--nproc_per_node=8 attack_vqa.py \
--dataset annotation/vqa_val_image_32_ques_50.json \
--image_folder /root/nfs/dataset \
--model /root/huggingface_model/instructblip-vicuna-7b \
--no_attack \
--output_dir output/clean_baseline
# L_llm
python -m torch.distributed.run \
--nproc_per_node=8 attack_vqa.py \
--dataset annotation/vqa_val_image_32_ques_50.json \
--image_folder /root/nfs/dataset \
--model /root/huggingface_model/instructblip-vicuna-7b \
--loss llm \
--question_num 10 \
--output_dir output/pgd_attack_rsq_llm
# L_QAVA
python -m torch.distributed.run \
--nproc_per_node=8 attack_vqa.py \
--dataset annotation/vqa_val_image_32_ques_50.json \
--image_folder /root/nfs/dataset \
--model /root/huggingface_model/instructblip-vicuna-7b \
--loss Qout \
--question_num 10 \
--output_dir output/pgd_attack_qava
The meanings of the parameters are as follows:
- nproc_per_node: Our code supports multiple GPUs; specify the number of GPUs to use with this parameter.
- dataset: The annotation file, defaulting to VQA 32+50 (annotation/vqa_val_image_32_ques_50.json).
- image_folder: The folder that contains "val2014".
- model: The path to the LVLM.
- loss: Select "llm" for $\mathcal{L}_\text{LLM}$ and "Qout" for $\mathcal{L}_\text{QAVA}$ (a generic PGD sketch follows this list).
- question_num: The number of randomly selected questions.
- output_dir: Path to save the attack results.
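For intuition about the loss parameter, here is a generic, simplified PGD sketch showing how the chosen loss drives the image perturbation. It is not the repository's implementation: loss_fn stands in for whichever objective attack_vqa.py computes ("llm" or "Qout"), and the step sizes are typical illustrative values rather than the paper's settings.

# Generic PGD sketch (not the repository's implementation).
import torch

def pgd_attack(loss_fn, images, epsilon=8 / 255, alpha=2 / 255, steps=10):
    # images: clean batch in [0, 1]; loss_fn: objective to maximize on the adversarial images.
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)                                # objective to maximize
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()           # gradient ascent step
        adv = images + torch.clamp(adv - images, -epsilon, epsilon)  # project to the L_inf ball
        adv = torch.clamp(adv, 0, 1)                       # keep a valid image
    return adv.detach()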
You can use the VQA v2 evaluation code to evaluate the clean and attack results; we also provide the question file and annotation file that the evaluation code requires.
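For orientation, the standard VQA v2 accuracy of a predicted answer is min(#annotators who gave that answer / 3, 1); a simplified sketch of the metric is below (the official evaluation script additionally normalizes answers and averages over subsets of the ten human answers).

# Simplified sketch of the VQA accuracy formula (not the official script).
def vqa_accuracy(predicted, human_answers):
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators agree with the prediction -> accuracy 1.0
print(vqa_accuracy("blue", ["blue"] * 4 + ["navy"] * 6))

If you find QAVA helpful, please cite: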
@misc{zhang2025qavaqueryagnosticvisualattack,
title={QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models},
author={Yudong Zhang and Ruobing Xie and Jiansheng Chen and Xingwu Sun and Zhanhui Kang and Yu Wang},
year={2025},
eprint={2504.11038},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.11038},
}