Dataset and code for the paper "RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios".
RuleArena is a challenging benchmark to evaluate LLMs on rule-guided reasoning tasks from real-world scenarios:
- Airline: Calculate the total cost for passengers, including their flight ticket and checked baggage fees.
- NBA: Determine whether one or more specified transactions (contract signing or trading) are allowed.
- Tax: Calculate the income tax for one person or family given their financial information.
The LLM is given the task instruction, the reference rules for the scenario, and a user instance, and is required to reason and compute over the user input under the guidance of the reference rules.
- Run `pip install -r requirements.txt` to install the required dependencies.
- Set your API keys in `~/.bashrc` as follows, then run `source ~/.bashrc` (a sanity-check sketch follows this list):

```bash
export OPENAI_API_KEY="YOUR OPENAI API KEY"
export CLAUDE_API_KEY="YOUR CLAUDE API KEY"
export QWEN_API_KEY="YOUR QWEN API KEY"
```
- If you want to use the Vertex AI Llama API, follow the instructions here to set up the Google Cloud API.
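The evaluation scripts are expected to read these keys from the environment. As a quick sanity check that they are set (an illustrative sketch, not the repository's code):

```python
import os

# Illustrative sanity check: verify the keys exported in ~/.bashrc are
# visible to Python. The actual key handling in auto_test.py may differ.
for var in ("OPENAI_API_KEY", "CLAUDE_API_KEY", "QWEN_API_KEY"):
    if var not in os.environ:
        raise RuntimeError(
            f"{var} is not set; add it to ~/.bashrc and run `source ~/.bashrc`."
        )
print("All API keys found in the environment.")
```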
Simply enter the domain folder (`airline`, `nba`, or `tax`) and run the evaluation script `auto_test.py`, specifying:
- The LLM to evaluate (`--llm`)
- The difficulty level of the problems (`--complexity`)
- Whether to use a 1-shot example (`--use_example`)
For example, to evaluate Claude-3.5 Sonnet (`claude-3-5-sonnet-20241022`) on Level-1 (medium difficulty) airline tasks with a 1-shot example, run:

```bash
cd ./airline
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 1 --use_example
```
To run the rule representation experiments, add `--textual` to convert tabular rules into textual rules when running airline and tax evaluations at difficulty level 0, e.g.:

```bash
cd ./airline
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 0 --use_example --textual
```
To run the distractive rule experiments, add `--distractor` or `--placeholder` to insert distractive rules or meaningless placeholder tokens when running tax evaluations at difficulty level 0, e.g.:

```bash
cd ./tax
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 0 --use_example --distractor
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 0 --use_example --placeholder
```

DO NOT use these two arguments together.
- The meaning of each parsable argument is documented in comments in `auto_test.py`; a plausible sketch of the parser is shown after this list.
- For LLMs other than Llama, we use the official APIs; you can refer to:
  - GPT-4o: OpenAI API documents
  - Claude-3.5 Sonnet: Anthropic API documents
  - Qwen-2.5: Qwen API documents
- For the Llama APIs specifically, refer to: Vertex AI Llama API
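For orientation, here is a minimal sketch of what the argument parser in `auto_test.py` could look like, reconstructed purely from the flags documented above; the actual defaults, help strings, and any extra arguments may differ:

```python
import argparse

# Hypothetical reconstruction of the CLI from the flags documented above;
# the real parser in auto_test.py may define more arguments or different defaults.
parser = argparse.ArgumentParser(description="RuleArena evaluation")
parser.add_argument("--llm", type=str, required=True,
                    help="Model to evaluate, e.g. claude-3-5-sonnet-20241022")
parser.add_argument("--complexity", type=int, default=0,
                    help="Difficulty level of the problems")
parser.add_argument("--use_example", action="store_true",
                    help="Include a 1-shot example in the prompt")
parser.add_argument("--textual", action="store_true",
                    help="Convert tabular rules into textual rules (level 0, airline/tax)")
parser.add_argument("--distractor", action="store_true",
                    help="Insert distractive rules (level 0, tax)")
parser.add_argument("--placeholder", action="store_true",
                    help="Insert meaningless placeholder tokens (level 0, tax)")
args = parser.parse_args()

# The README warns against combining these two flags.
if args.distractor and args.placeholder:
    parser.error("Do not use --distractor and --placeholder together.")
```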
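As a quick end-to-end check that a key works, a minimal request with the official Anthropic Python SDK might look like the following (illustrative only, not the repository's request code; the SDK's default environment variable is `ANTHROPIC_API_KEY`, so the key is passed explicitly here):

```python
import os
import anthropic

# Illustrative key check, not the repository's code. The SDK defaults to
# ANTHROPIC_API_KEY, so we pass the key from CLAUDE_API_KEY explicitly.
client = anthropic.Anthropic(api_key=os.environ["CLAUDE_API_KEY"])
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=64,
    messages=[{"role": "user", "content": "Reply with the word: ready"}],
)
print(response.content[0].text)
```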
If you use RuleArena or find it interesting/helpful for your work, please consider citing our paper and giving us a star; we'd appreciate it! Feel free to contact Ruiwen Zhou or open an issue if you have any questions.
```bibtex
@article{zhou2024rulearena,
  author={Ruiwen Zhou and Wenyue Hua and Liangming Pan and Sitao Cheng and Xiaobao Wu and En Yu and William Yang Wang},
  title={RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios},
  journal={arXiv preprint arXiv:2412.08972},
  year={2024}
}
```