Prompt Hardener is an open-source tool that evaluates and strengthens system prompts used in LLM-based applications. It helps developers proactively defend against prompt injection attacks by combining automated evaluation, self-refinement, and attack simulation β all exportable as structured reports.
Originally created to help secure LLM agents like chatbots, RAG systems, and tool-augmented assistants, Prompt Hardener is designed to be usable, extensible, and integrable in real-world development workflows.
π Designed for:
- LLM application developers
- Security engineers working on AI pipelines
- Prompt engineers and red teamers
Feature | Description |
---|---|
π§ Self-refinement | Iteratively improves prompts based on security evaluation feedback |
π‘οΈ Hardening Techniques | Apply strategies like Spotlighting, Signed Prompt, and Rule Reinforcement |
π£ Automated Attack Testing | Runs prompt injection payloads across multiple categories |
π HTML Reports | Clear, styled summaries of evaluations and attack outcomes |
π JSON Output | Raw data for CI/CD integration or manual inspection |
π Web UI | Use with Gradio for demos, experimentation, and quick prototyping |
Prompt Hardener uses LLMs to evaluate prompts, apply security improvements, test prompt injection attacks, and judge whether attacks were successful β all in an automated pipeline.
Prompt Hardener supports OpenAI, Anthropic Claude and AWS Bedrock(only claude v3 or newer model) APIs.
You must set at least one of the following environment variables before use:
# For OpenAI API (e.g., GPT-4, GPT-4o)
export OPENAI_API_KEY=...
# For Claude API (e.g., Claude 3.7 Sonnet)
export ANTHROPIC_API_KEY=...
# For Bedrock API (e.g., anthropic.claude-3-5-sonnet-20240620-v1:0)
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...
You can add these lines to your shell profile (e.g., .bashrc
, .zshrc
) to make them persistent.
git clone https://github.com/cybozu/prompt-hardener.git
cd prompt-hardener
pip install -r requirements.txt
Here is an example of command using cli mode.
python3 src/main.py \
--target-prompt-path path/to/prompt.json \
--eval-api-mode openai \
--eval-model gpt-4o-mini \
--output-path path/to/hardened.json \
--user-input-description Comments \
--max-iterations 3 \
--test-after \
--report-dir ~/Downloads
Arguments Overview
Argument | Short | Type | Required | Default | Description |
---|---|---|---|---|---|
--target-prompt-path |
-t |
str |
β Yes | - | Path to the file containing the target system prompt (Chat Completion message format, JSON). |
--eval-api-mode |
-ea |
str |
β Yes | - | LLM API used for evaluation and improvement (openai , claude or bedrock ). |
--eval-model |
-em |
str |
β Yes | - | Model name used for evaluation and improvement (e.g., gpt-4o-mini , claude-3-7-sonnet-latest , anthropic.claude-3-5-sonnet-20240620-v1:0 ). |
--attack-api-mode |
-aa |
str |
β No | --eval-api-mode |
LLM API used for executing attacks (defaults to the evaluation API). |
--attack-model |
-am |
str |
β No | --eval-model |
Model used to generate and run attacks (defaults to the evaluation model). |
--judge-api-mode |
-ja |
str |
β No | --eval-api-mode |
LLM API used for attack insertion and success judgment (defaults to the attack API). |
--judge-model |
-jm |
str |
β No | --eval-model |
Model used to insert attack payloads and judge injection success (defaults to the attack model). |
--aws-region |
-ar |
str |
β No | us-east-1 |
AWS region for Bedrock API mode. Default is us-east-1 . |
--user-input-description |
-ui |
str |
β No | None |
Description of user input fields (e.g., Comments ), used to guide placement and tagging of user data. |
--output-path |
-o |
str |
β Yes | - | File path to write the final improved prompt as JSON. |
--max-iterations |
-n |
int |
β No | 3 |
Maximum number of improvement iterations. |
--threshold |
float |
β No | 8.5 |
Satisfaction score threshold (0β10) to stop refinement early if reached. | |
--apply-techniques |
-a |
list[str] |
β No | All techniques | Defense techniques to apply: β’ spotlighting β’ signed_prompt β’ rule_reinforcement β’ structured_output |
--test-after |
-ta |
flag |
β No | False |
If set, runs a prompt injection test using various attack payloads after prompt improvement. |
--test-separator |
-ts |
str |
β No | None |
Optional string to prepend to each attack payload during injection testing (e.g., \\n , ### ). |
--tools-path |
-tp |
str |
β No | None |
Path to JSON file defining available tool functions (used for testing function/tool abuse attacks). |
--report-dir |
-rd |
str |
β No | None |
Directory to write the evaluation report files (HTML and JSON summary of injection test results and prompt evaluation). |
Note:
--eval-api-mode
,--attack-api-mode
, and--judge-api-mode
acceptopenai
,claude
, orbedrock
as options.- When using Bedrock, you must specify the Bedrock Model ID for
--eval-model
,--attack-model
, and--judge-model
(e.g.,.claude-3-5-sonnet-20240620-v1:0
).
python3 src/webui.py
Then visit http://localhost:7860 to use Prompt Hardener interactively:
- Paste prompts
- Choose models & settings
- Download hardened prompt and reports
Perfect for demoing, auditing, or collaborating across teams.
Here is the demo screen for the Web UI.
Prompt Hardener includes automated adversarial testing after hardening.
Supported categories (extensible):
- Persona Switching
- Output Attack
- Prompt Leaking
- Chain-of-Thought Escape
- Function Call Hijacking
- Ignoring RAG Instructions
- Privilege Escalation
- JSON/Structured Output Hijacking
- Tool Definition Leaking
Each attack is automatically injected and measured for:
- Injection attacks blocked (β or β)
- LLM response contents
- Category, payload, and result
Prompt Hardener includes multiple defense strategies:
Technique | Description |
---|---|
Spotlighting | Emphasizes instruction boundaries and user roles explicitly |
Signed Prompt | Embeds cryptographic or structural markers to prevent tampering |
Rule Reinforcement | Repeats constraints or refusals clearly within context |
Structured Output | Encourages consistent, parseable LLM responses |
Role Consistency | Ensures system messages do not include user inputs, preserving role purity |
You can see the details of each hardening techniques from below.
After each run, Prompt Hardener generates:
prompt_security_report_<random value>.html
:- Visual report of improved prompt, evaluations and injection test results
prompt_security_report_<random value>_attack_results.json:
:- Raw structured data of injection test results
Reports are styled for readability and highlight:
- Initial prompt and hardened final prompt
- Evaluation category-wise scores and comments
- Injection attack blocked stats (e.g., 12/15 PASSED)
- PASSED = attack blocked, FAILED = attack succeeded.
You can see examples of using Prompt Hardener to improve and test system prompts for an AI assistant and a comment-summarizing AI from below.