QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models

This is a PyTorch implementation of the QAVA paper (arXiv:2504.11038).

Preparing the environment, code, data and model

  1. Prepare the environment.

Create a Python environment and activate it via the following commands.

cd qava
conda create -n qava python=3.10
conda activate qava
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install transformers==4.31.0 numpy==1.26.4
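
As a quick sanity check (a minimal sketch, not part of the repository), you can confirm that the pinned versions and the GPUs are visible before proceeding:

# sanity check: verify the environment before running the attack (sketch, not part of the repository)
import torch
import torchvision
import transformers

print("torch:", torch.__version__)                # expected 2.0.0
print("torchvision:", torchvision.__version__)    # expected 0.15.0
print("transformers:", transformers.__version__)  # expected 4.31.0
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())    # the example commands below use 8 GPUs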
  2. Clone this repository.
git clone https://github.com/btzyd/qava.git
  3. Prepare the COCO dataset.

In the annotation folder, we provide the VQA 32+50 subset (as defined in the paper), generated from VQA v2, as annotation/vqa_val_image_32_ques_50.json.

For the images, download COCO val2014 from the official COCO website and extract it; a download sketch is given below.
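
One way to fetch the images (a sketch, assuming the standard COCO download URL; the target folder is only an example matching the --image_folder used in the commands below):

# download COCO val2014 and extract it (sketch; URL is the standard COCO mirror, ~6 GB)
import urllib.request
import zipfile

url = "http://images.cocodataset.org/zips/val2014.zip"  # official COCO val2014 images
target_dir = "/root/nfs/dataset"                        # example --image_folder; must end up containing "val2014"

zip_path = "val2014.zip"
urllib.request.urlretrieve(url, zip_path)               # download the archive
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target_dir)                           # creates <target_dir>/val2014/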

  4. Download the BLIP-2 and InstructBLIP models.

You can download the InstructBLIP-7B model from the Hugging Face repository instructblip-vicuna-7b. You can also let the Python code download the model when it is first loaded, but that download can be unstable; we recommend downloading the model first and then loading it from a local directory, for example as sketched below.
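
A minimal pre-download sketch, assuming the huggingface_hub package is installed and using the Salesforce/instructblip-vicuna-7b repository id; the local path is just an example matching the commands in the next section:

# pre-download InstructBLIP-7B to a local directory (sketch, not part of the repository)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Salesforce/instructblip-vicuna-7b",                  # Hugging Face model id
    local_dir="/root/huggingface_model/instructblip-vicuna-7b",   # example path used by the commands below
)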

Run the QAVA code

Running the QAVA attack to generate adversarial examples

The adversarial attack can be executed with the following commands.

# clean baseline
python -m torch.distributed.run \
  --nproc_per_node=8 attack_vqa.py \
  --dataset annotation/vqa_val_image_32_ques_50.json \
  --image_folder /root/nfs/dataset \
  --model /root/huggingface_model/instructblip-vicuna-7b \
  --no_attack \
  --output_dir output/clean_baseline

# L_llm
python -m torch.distributed.run \
  --nproc_per_node=8 attack_vqa.py \
  --dataset annotation/vqa_val_image_32_ques_50.json \
  --image_folder /root/nfs/dataset \
  --model /root/huggingface_model/instructblip-vicuna-7b \
  --loss llm \
  --question_num 10 \
  --output_dir output/pgd_attack_rsq_llm

# L_QAVA
python -m torch.distributed.run \
  --nproc_per_node=8 attack_vqa.py \
  --dataset annotation/vqa_val_image_32_ques_50.json \
  --image_folder /root/nfs/dataset \
  --model /root/huggingface_model/instructblip-vicuna-7b \
  --loss Qout \
  --question_num 10 \
  --output_dir output/pgd_attack_qava

The meanings of the parameters are as follows:

  • nproc_per_node: Our code supports multiple GPUs; specify the number of GPUs to use with this parameter.
  • dataset: The annotation file; defaults to VQA 32+50.
  • image_folder: The folder that contains "val2014".
  • model: The path to the LVLM.
  • loss: Select "llm" for $\mathcal{L}_\text{LLM}$ and "Qout" for $\mathcal{L}_\text{QAVA}$.
  • question_num: The number of randomly selected questions.
  • output_dir: Path to save the attack results.

You can use the official VQA v2 evaluation code to evaluate the clean and attacked results; this repository also provides the question file and annotation file that the evaluation requires. A minimal evaluation sketch is given below.
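
A minimal evaluation sketch, assuming the official VQA toolkit (its VQA and VQAEval classes) is on the Python path; the file names below are placeholders for the provided question/annotation files and for a result file in the standard VQA result format:

# score a result file with the official VQA v2 metric (sketch; file names are placeholders)
from vqa import VQA          # from the official VQA toolkit (vqaTools)
from vqaEval import VQAEval  # from the official VQA toolkit (vqaEvaluation)

ann_file = "annotation/vqa_annotations.json"      # placeholder: annotation file provided with this repo
ques_file = "annotation/vqa_questions.json"       # placeholder: question file provided with this repo
res_file = "output/pgd_attack_qava/result.json"   # placeholder: answers in the standard VQA result format

vqa = VQA(ann_file, ques_file)              # load ground-truth annotations and questions
vqa_res = vqa.loadRes(res_file, ques_file)  # load the model's answers
vqa_eval = VQAEval(vqa, vqa_res, n=2)       # n=2: report accuracies to two decimal places
vqa_eval.evaluate()
print("Overall accuracy:", vqa_eval.accuracy["overall"])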

Citation

@misc{zhang2025qavaqueryagnosticvisualattack,
      title={QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models}, 
      author={Yudong Zhang and Ruobing Xie and Jiansheng Chen and Xingwu Sun and Zhanhui Kang and Yu Wang},
      year={2025},
      eprint={2504.11038},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.11038}, 
}
