STORM-BORN is a challenging benchmark of human‐like mathematical derivations designed to push the reasoning capabilities of large language models (LLMs).
Unlike conventional numerical or formal proofs, STORM-BORN focuses on dense, approximation-rich derivations with heuristic cues, curated from the latest academic papers and vetted by human mathematicians via a multi‐agent, human-in-the-loop framework.
The dataset can be used to fine-tune LLMs and improve how well their reasoning generalizes to other datasets. It can also serve as a benchmark of reasoning ability: because free-form derivations are difficult to score automatically, we additionally provide a multiple-choice format that turns the generation of correct answers into a selection task.
├── data
│   ├── storm-born.jsonl            # full dataset
│   ├── storm-born-choice.jsonl     # multiple-choice format
│   ├── storm_born_train.jsonl      # training split
│   ├── storm_born_test.jsonl       # test split
│   ├── storm_born_abcd.jsonl       # multiple-choice with four options
│   └── storm_born_test_abcd.jsonl  # test set in multiple-choice format
│
├── data_generation
│   ├── clean_data.py               # post-process raw model outputs
│   ├── generate_v1.py              # synthesize initial derivations
│   ├── pipeline.py                 # end-to-end data generation
│   └── ...                         # additional helper scripts
│
├── data_evaluation
│   └── benchmark_evaluation
│       ├── llm_as_judge.py         # evaluate STORM-BORN with an LLM-as-Judge
│       └── multiple_choice_eval.py # evaluate LLMs on the multiple-choice data
│
└── train
    ├── case_study.png
    └── methods.png
- Clone the repository

  git clone <repo_url> && cd STORM-BORN

- Create a virtual environment

  python3 -m venv .venv && source .venv/bin/activate

- Install dependencies (a quick import check follows this list)

  pip install google-generativeai openai typing-extensions

- (Optional) Install Axolotl for SFT: follow the Axolotl installation instructions if you plan to fine-tune models.
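To confirm the environment is set up, you can check that the three dependencies above resolve to installed versions; this uses only the standard library:

import importlib.metadata as md

# Print the installed version of each required distribution;
# raises PackageNotFoundError if one is missing.
for pkg in ("google-generativeai", "openai", "typing-extensions"):
    print(pkg, md.version(pkg))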
data/storm-born.jsonl is a JSONL file in which each line is one problem instance.
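As a quick sanity check you can load the file with a few lines of Python; the snippet below only assumes that each non-empty line is a JSON object and makes no assumption about the field names:

import json

# Load every record from the full dataset (one JSON object per line).
with open("data/storm-born.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} problem instances loaded")
print("fields of the first record:", sorted(records[0].keys()))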
- Clean model outputs (see the illustrative sketch after this list)

  python data_generation/clean_data.py \
    --input raw_outputs.jsonl \
    --output data/storm-born.jsonl

- (Re)generate with the multi-agent pipeline

  python data_generation/generate_v1.py \
    --config configs/gen_v1.yaml \
    --output-dir data/tmp
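Purely as an illustration of the kind of post-processing involved (this is not the actual logic of clean_data.py, and the "derivation" field name is hypothetical), a cleaning pass might normalize whitespace and drop empty or duplicate records before writing the final JSONL:

import json

def normalize(text: str) -> str:
    # Illustrative cleanup: trim trailing whitespace and drop blank lines.
    return "\n".join(ln.rstrip() for ln in text.splitlines() if ln.strip())

seen, cleaned = set(), []
with open("raw_outputs.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        record["derivation"] = normalize(record.get("derivation", ""))  # hypothetical field
        if record["derivation"] and record["derivation"] not in seen:   # skip empty / duplicate
            seen.add(record["derivation"])
            cleaned.append(record)

with open("data/storm-born.jsonl", "w", encoding="utf-8") as f:
    for record in cleaned:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")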
Use an LLM to judge model answers on STORM-BORN:
python data_evaluation/benchmark_evaluation/llm_as_judge.py \
--dataset data/storm-born.jsonl \
--model gpt-4 \
--output results/benchmark.json
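Conceptually, the judging step looks like the minimal sketch below (written against the openai Python client; the prompt wording and the judge() arguments are illustrative, not the exact ones used by llm_as_judge.py):

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(problem: str, reference: str, answer: str, model: str = "gpt-4") -> bool:
    # Ask the judge model for a binary verdict on a candidate derivation.
    prompt = (
        "You are grading a mathematical derivation.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Reference derivation:\n{reference}\n\n"
        f"Candidate answer:\n{answer}\n\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    reply = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = reply.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")  # "INCORRECT" does not match this prefix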
Use an LLM to select the correct answer on STORM-BORN-CHOICE:
python data_evaluation/benchmark_evaluation/multiple_choice_eval.py \
--dataset data/storm-born-choice.jsonl \
--model gpt-4 \
--output results/benchmark.json
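Because the model only has to name an option letter, scoring reduces to exact matching against the gold label. A minimal sketch of that comparison (the response parsing and the gold-label handling are illustrative, not the exact behavior of multiple_choice_eval.py):

import re

def extract_choice(response: str) -> str | None:
    # Pull the first standalone option letter (A-D) out of a model response.
    match = re.search(r"\b([A-D])\b", response.upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], gold_labels: list[str]) -> float:
    # Fraction of items where the extracted letter equals the gold letter.
    correct = sum(
        extract_choice(pred) == gold for pred, gold in zip(predictions, gold_labels)
    )
    return correct / len(gold_labels)

# Example: two model responses scored against gold labels "B" and "D".
print(accuracy(["The answer is B.", "I pick option C"], ["B", "D"]))  # 0.5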
After fine-tuning on STORM-BORN, evaluate your model on downstream tasks using your preferred framework.
We fine-tune models with the Axolotl framework (not included in this repository). A typical command might look like:
python train.py \
--model_name_or_path elephantai/llama-13b \
--data_path data/storm-born.jsonl \
--output_dir checkpoints/storm-born-sft \
--batch_size 4 \
--epochs 3 \
--lr 2e-5
If you use STORM-BORN, please cite:
@inproceedings{liu2025stormborn,
title = {{STORM}-{BORN}: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework},
author = {Liu, Wenhao and Lu, Zhenyi and Hu, Xinyu and Zhang, Jerry and Li, Dailin and Cen, Jiacheng and Cao, Huilin and Wang, Haiteng and Li, Yuhan and Xie, Kun and Li, Dandan and Zhang, Pei and Zhang, Chengbo and Ren, Yuxiang and Ma, Yan and Huang, Xiaohong},
booktitle = {The 63rd Annual Meeting of the Association for Computational Linguistics},
year = {2025},
url = {https://github.com/lwhere/STORM-BORN}
}
This project is released under the MIT License. See LICENSE for details.