8000 GitHub - aminbigdeli/fsap-attack
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

aminbigdeli/fsap-attack

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Few-Shot Adversarial Attacks against Neural Ranking Models

This repository provides the data and resources for our research on the Few-Shot Adversarial Prompting (FSAP) framework, which crafts adversarial documents to manipulate neural ranking models and promote misleading or harmful content above credible, factual information. FSAP includes two approaches: the IntraQ method, which draws few-shot examples from the same query, and the InterQ method, which uses examples from other queries to increase disruption. All code for generating these adversarial documents and the accompanying evaluation scripts are included here.

The Figure shown below illustrates our threat model and shows how adversarial prompting injects LLM-generated documents into the ranking pool, leading to manipulation of NRM outputs representing false, malicious, counterfactual content to the user.

Data Format

The data/ directory contains two collections, TREC 2020 and TREC 2021. Each collection is organized by query ID:

data/<collection>/qid_<QUERYID>/
├── helpful_documents/      # Markdown files of original helpful documents (doc_<DOCID>.md)
├── harmful_documents/      # Markdown files of original harmful documents (doc_<DOCID>.md)
└── adversarial_documents/           # Generated adversarial documents
    └── doc_<DOCID>/       # Subfolder per original helpful doc
        ├── GPT-4o/         # GPT-4o-generated variants, named by attack method
        └── DeepSeek-R1-claude3.7/   # DeepSeek-R1-generated variants, named by attack method

Prompts

The prompts/ directory contains templates for each attack method and shared instruction sets:

prompts/
├── FSAP-IntraQ.txt            # Template for FSAP-IntraQ adversarial document generation
├── FSAP-InterQ.txt            # Template for FSAP-InterQ adversarial document generation
├── Paraphraser.txt     # Template for paraphraser attack (paraphrase harmful document)
├── Rewriter.txt        # Template for rewriter attack (rewrite harmful document)
├── Liar.txt            # Template for liar attack (generate contrarian content)
├── Fact-Inversion.txt  # Template for fact inversion attack (invert key facts)
├── stance_alignment.txt       # Shared instructions to align content with a required stance
└── adversarial_detection.txt  # Shared instructions to convert a helpful doc into an adversarial variant
  • Attack-specific files (*.txt): each contains the prompt to generate a specific adversarial document type.
  • Shared files:
    • stance_alignment.md provides guidelines for producing stance-consistent content.
    • adversarial_detection.md provides guidelines for transforming factual documents into adversarial examples.

Usage

Cloning and Setup

Clone the repository

git clone https://github.com/aminbigdeli/fsap-attack.git
cd fsap-attack

Create a Python environment

conda create -n fsap-attack-env python=3.9 -y
conda activate fsap-attack-env
pip install -r requirements.txt

Generating Adversarial Documents

# Example: Generate adversarial documents using GPT-4o (OpenAI)
python code/generate_adversarial_document.py \
  --collection trec2020 or trec2021 \
  --data-root ./data \
  --prompts-dir ./prompts \
  --output-root ./data \
  --llm gpt4o \
  --api-key $OPENAI_API_KEY \

# Example: Generate adversarial documents using DeepSeek-R1-claude3.7 (Ollama)
python code/generate_adversarial_document.py \
  --collection trec2020 or trec2021 \
  --data-root ./data \
  --prompts-dir ./prompts \
  --output-root ./data \
  --llm deepseek_r1 \

Evaluating Mean Help-Defeat Rate with Ranking Models

# Example: Evaluate Mean Help-Defeat Rate using OpenAI embeddings
python code/evaluate_openai_embeddings_mdr.py \
  --collection trec2020 or trec2021 \
  --data-root ./data \
  --output-dir ./results/ \
  --api-key $OPENAI_API_KEY \
  --embedding-model text-embedding-ada-002 or text-embedding-3-small

# Example: Evaluate Mean Help-Defeat Rate using Neural Re-Rankers from pygaggle (MonoBERT/MonoT5)
python code/evaluate_reranker_mdr.py \
  --collection trec2020 or trec2021 \
  --data-root ./data \
  --output-dir ./results/ \
  --model MonoBERT or MonoT5

Adversarial Document Detection

python code/adversarial_document_classifier.py \
  --collection trec2020 or trec2021 \
  --data-root ./data \
  --output-dir ./results/ \
  --llm GPT-4o or DeepSeek-R1-claude3.7 \
  --model gpt-4o \
  --api-key $OPENAI_API_KEY \
  --prompts-dir ./prompts

Stance Detection

python code/stance_detector.py \
  --collection trec2020 or trec2021 \
  --data-root ./data \
  --output-dir ./results/ \
  --llm GPT-4o or DeepSeek-R1-claude3.7 \
  --model gpt-4o \
  --api-key $OPENAI_API_KEY \
  --prompts-dir ./prompts

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

0