RuleArena

Dataset and code for the paper "RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios".

Introduction

RuleArena is a challenging benchmark for evaluating LLMs on rule-guided reasoning tasks drawn from real-world scenarios:

  • Airline: Calculate the total cost for passengers, including their flight ticket and checked baggage fees.
  • NBA: Determine whether one or more specified transactions (contract signing or trading) are allowed.
  • Tax: Calculate the income tax for one person or family given their financial information.

The LLM is given the task instruction, the reference rules for the scenario, and a user instance, and is required to reason and compute over the user input under the guidance of the reference rules.
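For intuition, here is a minimal sketch of how such a prompt could be assembled from those three inputs. The function name and prompt wording are illustrative assumptions, not the repository's actual implementation:

def build_prompt(task_instruction: str, reference_rules: str, user_instance: str) -> str:
    # Hypothetical prompt layout: instruction, then rules, then the user instance.
    return (
        f"{task_instruction}\n\n"
        f"Reference rules:\n{reference_rules}\n\n"
        f"User input:\n{user_instance}\n\n"
        "Reason step by step under the reference rules, then state the final answer."
    )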

Environment

  • Run pip install -r requirements.txt to install critical dependencies.
  • Set your API keys as follows in ~/.bashrc, then run source ~/.bashrc:
export OPENAI_API_KEY="YOUR OPENAI API KEY"
export CLAUDE_API_KEY="YOUR CLAUDE API KEY"
export QWEN_API_KEY="YOUR QWEN API KEY"
  • If you want to use the Vertex AI Llama API, follow the instructions here to set up the Google Cloud API. A quick sanity check for the exported keys is sketched after this list.
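As a convenience, this minimal Python snippet (an assumption for illustration, not part of the repository) verifies that the exported keys are visible to the process:

import os

# Check that each key exported in ~/.bashrc is visible to Python.
# Vertex AI authentication is configured separately via Google Cloud.
for key in ("OPENAI_API_KEY", "CLAUDE_API_KEY", "QWEN_API_KEY"):
    if not os.environ.get(key):
        print(f"Warning: {key} is not set")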

How to Use

Main Results

Enter the domain folder (airline, nba, or tax) and run the evaluation script auto_test.py, specifying:

  • The LLM (--llm) to evaluate
  • The difficulty level (--complexity) of problems
  • Whether to use a 1-shot example (--use_example)

For example, to evaluate Claude-3.5 Sonnet (claude-3-5-sonnet-20241022) on Level-1 (medium difficulty) airline tasks with a 1-shot example, run:

cd ./airline
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 1 --use_example
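To sweep several difficulty levels for the same model, a loop such as the following works. It is a sketch, run from within the domain folder, covering only the levels mentioned in this README (0 and 1):

import subprocess

# Run the evaluation at each difficulty level for one model.
for level in (0, 1):
    subprocess.run(
        ["python", "auto_test.py",
         "--llm", "claude-3-5-sonnet-20241022",
         "--complexity", str(level),
         "--use_example"],
        check=True,
    )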

Experiments for Different Rule Representation

To run the rule representation experiments, add --textual to convert tabular rules into textual rules when running the airline and tax evaluations at difficulty level 0, for example:

cd ./airline
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 0 --use_example --textual

Experiments for Distractive Rules

To run the distractive rule experiments, add --distractor or --placeholder to insert distracting rules or meaningless placeholder tokens when running the tax evaluations at difficulty level 0, for example:

cd ./tax
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 0 --use_example --distractor
python auto_test.py --llm claude-3-5-sonnet-20241022 --complexity 0 --use_example --placeholder

Note: DO NOT use these two arguments together.

Citation

If you use RuleArena and find it interesting or helpful for your work, please consider citing our paper and giving us a star; we'd appreciate it! Feel free to contact Ruiwen Zhou or open an issue if you have any questions.

@article{zhou2024rulearena,
  author={Ruiwen Zhou and Wenyue Hua and Liangming Pan and Sitao Cheng and Xiaobao Wu and En Yu and William Yang Wang},
  title={RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios},
  journal={arXiv preprint arXiv:2412.08972},
  year={2024}
}
