This is a PyTorch implementation of the paper "QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models".
- Prepare the environment.
Create a Python environment and activate it with the following commands.
cd qava
conda create -n qava python=3.10
conda activate qava
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install transformers==4.31.0 numpy==1.26.4
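As an optional sanity check (not part of the repository), the environment can be verified from Python; the expected versions in the comments simply restate the pins above.

# Optional sanity check: confirm versions and GPU visibility.
import torch
import torchvision
import transformers
import numpy

print("torch:", torch.__version__)                 # expected 2.0.0
print("torchvision:", torchvision.__version__)     # expected 0.15.0
print("transformers:", transformers.__version__)   # expected 4.31.0
print("numpy:", numpy.__version__)                 # expected 1.26.4
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())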
- Clone this repository.
git clone https://github.com/btzyd/qava.git
- Prepare the COCO dataset
In the annotation folder, we provide VQA 32+50 (as defined in the paper), generated from VQA v2, as annotation/vqa_val_image_32_ques_50.json.
As for images, you can download COCO val2014 from its official website and then extract it, for example as sketched below.
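For reference, here is a minimal Python sketch for fetching and extracting val2014; the zip URL is the standard COCO download link, and the target directory is only an example that matches the --image_folder argument used later (the archive is roughly 6 GB, so adjust paths to your storage).

# Sketch: download and extract COCO val2014 (example paths, adjust as needed).
import os
import urllib.request
import zipfile

url = "http://images.cocodataset.org/zips/val2014.zip"   # standard COCO download link
target_dir = "/root/nfs/dataset"                          # should match --image_folder
os.makedirs(target_dir, exist_ok=True)
zip_path = os.path.join(target_dir, "val2014.zip")

urllib.request.urlretrieve(url, zip_path)                 # large download (~6 GB)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target_dir)                             # creates <target_dir>/val2014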
- Download BLIP-2 and InstructBLIP Models
You can download the InstructBLIP-7B model from the HuggingFace link instructblip-vicuna-7b. Alternatively, you can download the model on the fly when loading it in Python code, though that download may be unstable. We recommend downloading the model first and then running the Python code to load it from a local directory.
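As an illustration only (this is not code from attack_vqa.py), loading the locally downloaded model with transformers 4.31 might look like the following; the local path is just the example used in the commands below.

# Sketch: load InstructBLIP from a local directory with transformers.
import torch
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_path = "/root/huggingface_model/instructblip-vicuna-7b"  # example local path
processor = InstructBlipProcessor.from_pretrained(model_path)
model = InstructBlipForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16
).to("cuda").eval()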
The adversarial attacks (and a clean baseline) can be executed with the following commands.
# clean baseline
python -m torch.distributed.run \
--nproc_per_node=8 attack_vqa.py \
--dataset annotation/vqa_val_image_32_ques_50.json \
--image_folder /root/nfs/dataset \
--model /root/huggingface_model/instructblip-vicuna-7b \
--no_attack \
--output_dir output/clean_baseline
# L_llm
python -m torch.distributed.run \
--nproc_per_node=8 attack_vqa.py \
--dataset annotation/vqa_val_image_32_ques_50.json \
--image_folder /root/nfs/dataset \
--model /root/huggingface_model/instructblip-vicuna-7b \
--loss llm \
--question_num 10 \
--output_dir output/pgd_attack_rsq_llm
# L_QAVA
python -m torch.distributed.run \
--nproc_per_node=8 attack_vqa.py \
--dataset annotation/vqa_val_image_32_ques_50.json \
--image_folder /root/nfs/dataset \
--model /root/huggingface_model/instructblip-vicuna-7b \
--loss Qout \
--question_num 10 \
--output_dir output/pgd_attack_qava
The meanings of the parameters are as follows:
- nproc_per_node: Our code supports multiple GPUs; specify the number of GPUs to use with this parameter.
- dataset: The annotation file, defaulting to VQA 32+50 (annotation/vqa_val_image_32_ques_50.json).
- image_folder: The folder that contains "val2014".
- model: The path to the LVLM.
- loss: Select "llm" for $\mathcal{L}_\text{LLM}$ and "Qout" for $\mathcal{L}_\text{QAVA}$ (a generic PGD sketch follows this list).
- question_num: The number of randomly selected questions.
- output_dir: Path to save the attack results.
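For intuition about the loss parameter, here is a generic, simplified PGD sketch showing how the chosen loss drives the image perturbation. It is not the repository's implementation: loss_fn stands in for whichever objective attack_vqa.py computes ("llm" or "Qout"), and the step sizes are typical illustrative values rather than the paper's settings.

# Generic PGD sketch (not the repository's implementation).
import torch

def pgd_attack(loss_fn, images, epsilon=8 / 255, alpha=2 / 255, steps=10):
    # images: clean batch in [0, 1]; loss_fn: objective to maximize on the adversarial images.
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)                                # objective to maximize
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()           # gradient ascent step
        adv = images + torch.clamp(adv - images, -epsilon, epsilon)  # project to the L_inf ball
        adv = torch.clamp(adv, 0, 1)                       # keep a valid image
    return adv.detach()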
You can use the VQA v2 evaluation code to evaluate the clean and attack results; we also provide the question file and annotation file that the evaluation code requires.
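For orientation, the standard VQA v2 accuracy of a predicted answer is min(#annotators who gave that answer / 3, 1); a simplified sketch of the metric is below (the official evaluation script additionally normalizes answers and averages over subsets of the ten human answers).

# Simplified sketch of the VQA accuracy formula (not the official script).
def vqa_accuracy(predicted, human_answers):
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators agree with the prediction -> accuracy 1.0
print(vqa_accuracy("blue", ["blue"] * 4 + ["navy"] * 6))

If you find QAVA helpful, please cite: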
@misc{zhang2025qavaqueryagnosticvisualattack,
title={QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models},
author={Yudong Zhang and Ruobing Xie and Jiansheng Chen and Xingwu Sun and Zhanhui Kang and Yu Wang},
year={2025},
eprint={2504.11038},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.11038},
}