Dataset and code for the paper "RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios".
RuleArena is a challenging benchmark to evaluate LLMs on rule-guided reasoning tasks from real-world scenarios:
- Airline: Calculate the total cost for passengers, including their flight ticket and checked baggage fees.
- NBA: Determine whether one or more specified transactions (contract signing or trading) are allowed.
- Tax: Calculate the income tax for one person or family given their financial information.
The LLM is given the task instruction, the reference rules for the scenario, and a user instance, and is required to reason and compute over the user input under the guidance of the reference rules.
- Run `pip install -r requirements.txt` to install the required dependencies.
- Set your API keys in `~/.bashrc` as follows, then run `source ~/.bashrc` (a sanity-check sketch follows this list):

```bash
export OPENAI_API_KEY="YOUR OPENAI API KEY"
export CLAUDE_API_KEY="YOUR CLAUDE API KEY"
export QWEN_API_KEY="YOUR QWEN API KEY"
```
- If you want to use the Vertex AI Llama API, follow the instructions here to set up the Google Cloud API.
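The evaluation scripts are expected to read these keys from the environment. As a quick sanity check that they are set (an illustrative sketch, not the repository's code):

```python
import os

# Illustrative sanity check: verify the keys exported in ~/.bashrc are
# visible to Python. The actual key handling in auto_test.py may differ.
for var in ("OPENAI_API_KEY", "CLAUDE_API_KEY", "QWEN_API_KEY"):
    if var not in os.environ:
        raise RuntimeError(
            f"{var} is not set; add it to ~/.bashrc and run `source ~/.bashrc`."
        )
print("All API keys found in the environment.")
```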
Simply enter the domain folder (`airline`, `nba`, or `tax`) and run the evaluation script `auto_test.py`, specifying:
- The LLM to evaluate (`--llm`)
- The difficulty level of the problems (`--complexity`)
- Whether to use a 1-shot example (`--use_example`)
For example, to evaluate Claude-3.5 Sonnet (`claude-3-5-sonnet-20241022`) on Level-1 (medium difficulty) airline tasks with a 1-shot example, run:

```bash
cd ./airline
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 1 --use_example
```
To run the rule representation experiments, add `--textual` to convert tabular rules into textual rules when running airline and tax evaluations at difficulty level 0, e.g.:

```bash
cd ./airline
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 0 --use_example --textual
```
To run the distractive rule experiments, add `--distractor` or `--placeholder` to insert distractive rules or meaningless placeholder tokens when running tax evaluations at difficulty level 0, e.g.:

```bash
cd ./tax
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 0 --use_example --distractor
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 0 --use_example --placeholder
```

DO NOT use these two arguments together.
- The meaning of each parsable argument is documented in comments in `auto_test.py`; a plausible sketch of the parser is shown after this list.
- For LLMs other than Llama, we use the official APIs; you can refer to:
  - GPT-4o: OpenAI API documents
  - Claude-3.5 Sonnet: Anthropic API documents
  - Qwen-2.5: Qwen API documents
- For the Llama APIs specifically, refer to: Vertex AI Llama API
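For orientation, here is a minimal sketch of what the argument parser in `auto_test.py` could look like, reconstructed purely from the flags documented above; the actual defaults, help strings, and any extra arguments may differ:

```python
import argparse

# Hypothetical reconstruction of the CLI from the flags documented above;
# the real parser in auto_test.py may define more arguments or different defaults.
parser = argparse.ArgumentParser(description="RuleArena evaluation")
parser.add_argument("--llm", type=str, required=True,
                    help="Model to evaluate, e.g. claude-3-5-sonnet-20241022")
parser.add_argument("--complexity", type=int, default=0,
                    help="Difficulty level of the problems")
parser.add_argument("--use_example", action="store_true",
                    help="Include a 1-shot example in the prompt")
parser.add_argument("--textual", action="store_true",
                    help="Convert tabular rules into textual rules (level 0, airline/tax)")
parser.add_argument("--distractor", action="store_true",
                    help="Insert distractive rules (level 0, tax)")
parser.add_argument("--placeholder", action="store_true",
                    help="Insert meaningless placeholder tokens (level 0, tax)")
args = parser.parse_args()

# The README warns against combining these two flags.
if args.distractor and args.placeholder:
    parser.error("Do not use --distractor and --placeholder together.")
```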
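As a quick end-to-end check that a key works, a minimal request with the official Anthropic Python SDK might look like the following (illustrative only, not the repository's request code; the SDK's default environment variable is `ANTHROPIC_API_KEY`, so the key is passed explicitly here):

```python
import os
import anthropic

# Illustrative key check, not the repository's code. The SDK defaults to
# ANTHROPIC_API_KEY, so we pass the key from CLAUDE_API_KEY explicitly.
client = anthropic.Anthropic(api_key=os.environ["CLAUDE_API_KEY"])
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=64,
    messages=[{"role": "user", "content": "Reply with the word: ready"}],
)
print(response.content[0].text)
```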
If you use RuleArena or find it interesting/helpful for your work, please consider citing our paper and giving us a star; we'd appreciate it! Feel free to contact Ruiwen Zhou or open an issue if you have any questions.
```bibtex
@article{zhou2024rulearena,
  author={Ruiwen Zhou and Wenyue Hua and Liangming Pan and Sitao Cheng and Xiaobao Wu and En Yu and William Yang Wang},
  title={RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios},
  journal={arXiv preprint arXiv:2412.08972},
  year={2024}
}
```