8000 GitHub - cybozu/prompt-hardener: Prompt Hardener is an open-source tool that evaluates and strengthens system prompts used in LLM-based applications.
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Prompt Hardener is an open-source tool that evaluates and strengthens system prompts used in LLM-based applications.

License

Notifications You must be signed in to change notification settings

cybozu/prompt-hardener

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

72 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ” Prompt Hardener

prompt-hardener-logo

Prompt Hardener is an open-source tool that evaluates and strengthens system prompts used in LLM-based applications. It helps developers proactively defend against prompt injection attacks by combining automated evaluation, self-refinement, and attack simulation β€” all exportable as structured reports.

Originally created to help secure LLM agents like chatbots, RAG systems, and tool-augmented assistants, Prompt Hardener is designed to be usable, extensible, and integrable in real-world development workflows.

πŸ“Œ Designed for:

  • LLM application developers
  • Security engineers working on AI pipelines
  • Prompt engineers and red teamers

✨ Features

Feature Description
🧠 Self-refinement Iteratively improves prompts based on security evaluation feedback
πŸ›‘οΈ Hardening Techniques Apply strategies like Spotlighting, Signed Prompt, and Rule Reinforcement
πŸ’£ Automated Attack Testing Runs prompt injection payloads across multiple categories
πŸ“Š HTML Reports Clear, styled summaries of evaluations and attack outcomes
πŸ“ JSON Output Raw data for CI/CD integration or manual inspection
🌐 Web UI Use with Gradio for demos, experimentation, and quick prototyping

πŸš€ Getting Started

πŸ”‘ Set Up API Keys

Prompt Hardener uses LLMs to evaluate prompts, apply security improvements, test prompt injection attacks, and judge whether attacks were successful β€” all in an automated pipeline.

Prompt Hardener supports OpenAI, Anthropic Claude and AWS Bedrock(only claude v3 or newer model) APIs.

You must set at least one of the following environment variables before use:

# For OpenAI API (e.g., GPT-4, GPT-4o)
export OPENAI_API_KEY=...

# For Claude API (e.g., Claude 3.7 Sonnet)
export ANTHROPIC_API_KEY=...

# For Bedrock API (e.g., anthropic.claude-3-5-sonnet-20240620-v1:0)
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...

You can add these lines to your shell profile (e.g., .bashrc, .zshrc) to make them persistent.

πŸ”§ Installation

git clone https://github.com/cybozu/prompt-hardener.git
cd prompt-hardener
pip install -r requirements.txt

πŸ–₯️ CLI Usage

Here is an example of command using cli mode.

python3 src/main.py \
  --target-prompt-path path/to/prompt.json \
  --eval-api-mode openai \
  --eval-model gpt-4o-mini \
  --output-path path/to/hardened.json \
  --user-input-description Comments \
  --max-iterations 3 \
  --test-after \
  --report-dir ~/Downloads
Arguments Overview
Argument Short Type Required Default Description
--target-prompt-path -t str βœ… Yes - Path to the file containing the target system prompt (Chat Completion message format, JSON).
--eval-api-mode -ea str βœ… Yes - LLM API used for evaluation and improvement (openai, claude or bedrock).
--eval-model -em str βœ… Yes - Model name used for evaluation and improvement (e.g., gpt-4o-mini, claude-3-7-sonnet-latest, anthropic.claude-3-5-sonnet-20240620-v1:0).
--attack-api-mode -aa str ❌ No --eval-api-mode LLM API used for executing attacks (defaults to the evaluation API).
--attack-model -am str ❌ No --eval-model Model used to generate and run attacks (defaults to the evaluation model).
--judge-api-mode -ja str ❌ No --eval-api-mode LLM API used for attack insertion and success judgment (defaults to the attack API).
--judge-model -jm str ❌ No --eval-model Model used to insert attack payloads and judge injection success (defaults to the attack model).
--aws-region -ar str ❌ No us-east-1 AWS region for Bedrock API mode. Default is us-east-1.
--user-input-description -ui str ❌ No None Description of user input fields (e.g., Comments), used to guide placement and tagging of user data.
--output-path -o str βœ… Yes - File path to write the final improved prompt as JSON.
--max-iterations -n int ❌ No 3 Maximum number of improvement iterations.
--threshold float ❌ No 8.5 Satisfaction score threshold (0–10) to stop refinement early if reached.
--apply-techniques -a list[str] ❌ No All techniques Defense techniques to apply:
β€’ spotlighting
β€’ signed_prompt
β€’ rule_reinforcement
β€’ structured_output
--test-after -ta flag ❌ No False If set, runs a prompt injection test using various attack payloads after prompt improvement.
--test-separator -ts str ❌ No None Optional string to prepend to each attack payload during injection testing (e.g., \\n, ###).
--tools-path -tp str ❌ No None Path to JSON file defining available tool functions (used for testing function/tool abuse attacks).
--report-dir -rd str ❌ No None Directory to write the evaluation report files (HTML and JSON summary of injection test results and prompt evaluation).

Note:

  • --eval-api-mode, --attack-api-mode, and --judge-api-mode accept openai, claude, or bedrock as options.
  • When using Bedrock, you must specify the Bedrock Model ID for --eval-model, --attack-model, and --judge-model (e.g., .claude-3-5-sonnet-20240620-v1:0).

🌐 Web UI (Gradio)

python3 src/webui.py

Then visit http://localhost:7860 to use Prompt Hardener interactively:

  • Paste prompts
  • Choose models & settings
  • Download hardened prompt and reports

Perfect for demoing, auditing, or collaborating across teams.

Here is the demo screen for the Web UI.

webui

πŸ§ͺ Attack Simulation

Prompt Hardener includes automated adversarial testing after hardening.

Supported categories (extensible):

  • Persona Switching
  • Output Attack
  • Prompt Leaking
  • Chain-of-Thought Escape
  • Function Call Hijacking
  • Ignoring RAG Instructions
  • Privilege Escalation
  • JSON/Structured Output Hijacking
  • Tool Definition Leaking

Each attack is automatically injected and measured for:

  • Injection attacks blocked (βœ… or ❌)
  • LLM response contents
  • Category, payload, and result

πŸ› οΈ Hardening Techniques

Prompt Hardener includes multiple defense strategies:

Technique Description
Spotlighting Emphasizes instruction boundaries and user roles explicitly
Signed Prompt Embeds cryptographic or structural markers to prevent tampering
Rule Reinforcement Repeats constraints or refusals clearly within context
Structured Output Encourages consistent, parseable LLM responses
Role Consistency Ensures system messages do not include user inputs, preserving role purity

You can see the details of each hardening techniques from below.

docs/techniques.md

πŸ“„ Reporting

After each run, Prompt Hardener generates:

  • prompt_security_report_<random value>.html:
    • Visual report of improved prompt, evaluations and injection test results
  • prompt_security_report_<random value>_attack_results.json::
    • Raw structured data of injection test results

Reports are styled for readability and highlight:

  • Initial prompt and hardened final prompt
  • Evaluation category-wise scores and comments
  • Injection attack blocked stats (e.g., 12/15 PASSED)
    • PASSED = attack blocked, FAILED = attack succeeded.

report1 report2 report3

πŸ’ͺ Tutorials

You can see examples of using Prompt Hardener to improve and test system prompts for an AI assistant and a comment-summarizing AI from below.

docs/tutorials.md

About

Prompt Hardener is an open-source tool that evaluates and strengthens system prompts used in LLM-based applications.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  
0