This is the official code for RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics by Jie Zhang, Cezara Petrui, Kristina Nikolić, Florian Tramèr.
RealMath is a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks.
We provide a HuggingFace dataset, which includes 633 samples from math.arXiv, 111 samples from cs.arXiv, and 542 samples from Math Stack Exchange.
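If you only need the data, it can be loaded with the HuggingFace datasets library. The snippet below is a minimal sketch: the dataset identifier realmath/realmath and the split name are placeholders, so substitute the identifiers listed on our HuggingFace page.
# Minimal sketch of loading the RealMath data with the HuggingFace `datasets` library.
# NOTE: "realmath/realmath" and the split name are placeholders; use the identifiers
# from our HuggingFace dataset page.
from datasets import load_dataset

dataset = load_dataset("realmath/realmath", split="train")
print(len(dataset))   # number of samples
print(dataset[0])     # inspect one record; field names are documented on the dataset card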
We aim to address the following question:
How effective are large language models as assistants for practicing mathematicians today?
To explore this, our benchmark is designed with the following criteria in mind:
- Real-world application focus: Content is sourced from authentic mathematical research to reflect real-world use cases.
- Automated verification: Samples are designed to be easy to verify programmatically for scalable and reliable evaluation.
- Continuous acquisition: The benchmark is refreshable and designed to avoid dataset contamination over time.
In this work, we present RealMath:
- A data pipeline that extracts verifiable mathematical statements from research papers (e.g., arXiv) and mathematical forums (e.g., Stack Exchange), creating a rich corpus of research-level mathematical content.
- An evaluation methodology focused on verifiable answers rather than proof assessment, allowing for automated correctness checking (see the sketch after this list).
- A continually refreshable dataset design that leverages the vast and growing body of mathematical literature, allowing for regular updates with new content to mitigate contamination concerns.
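To make the verifiable-answers idea concrete, here is a minimal sketch of the kind of automated check this design enables, assuming the predicted and reference answers are short closed-form expressions in LaTeX. It is an illustration only, not the checker implemented in eval_math.py.
# Illustrative answer check: compare a model's final answer to the reference symbolically.
# This is a sketch, not the checker used by eval_math.py.
import sympy
from sympy.parsing.latex import parse_latex  # requires the antlr4 Python runtime

def answers_match(predicted: str, reference: str) -> bool:
    """Return True if the two LaTeX expressions are symbolically equal."""
    try:
        diff = sympy.simplify(parse_latex(predicted) - parse_latex(reference))
        return diff == 0
    except Exception:
        # Fall back to a normalized string comparison if parsing fails.
        return predicted.strip() == reference.strip()

print(answers_match(r"\frac{2}{4}", r"\frac{1}{2}"))  # True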
This project implements an end-to-end pipeline that:
- Retrieves papers from arXiv categories related to mathematics (e.g., math, cs)
- Extracts and processes LaTeX source code
- Extracts theorems from these papers (illustrated in the sketch after this list)
- Generates question-answer pairs with fixed answers from the theorems
- Evaluates the capabilities of LLMs at solving the question-answer pairs
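For intuition, the snippet below illustrates the theorem-extraction step by pulling common theorem-like environments out of raw LaTeX source with a regular expression. This is a simplified sketch; helpers/extract_theorems.py may handle this differently.
# Illustrative only: extract theorem-like environments from a LaTeX file with a regex.
# The real helpers/extract_theorems.py may use a different (more robust) approach.
import re
from pathlib import Path

THEOREM_ENVS = ("theorem", "lemma", "proposition", "corollary")
PATTERN = re.compile(
    r"\\begin\{(%s)\}(.*?)\\end\{\1\}" % "|".join(THEOREM_ENVS),
    re.DOTALL,
)

def extract_theorems(tex_path: str) -> list[str]:
    """Return the bodies of theorem-like environments found in a .tex file."""
    source = Path(tex_path).read_text(errors="ignore")
    return [body.strip() for _, body in PATTERN.findall(source)]

# Example: print(extract_theorems("paper/main.tex"))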
- Python 3.12
- Dependencies: pip install -r requirements.txt
-
Since this repo does not require a GPU, it is easy to run the full pipeline locally. Make sure you have a LaTeX installation on your machine, because we verify that the QA pairs in LaTeX format render correctly when we manually check the theorems.
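As one possible way to automate that rendering check, the sketch below compiles a QA statement inside a minimal article document with pdflatex. The helper name and template here are illustrative assumptions, not the repository's exact procedure.
# Sketch of a LaTeX render check: compile a statement in a minimal document with pdflatex.
# Assumes `pdflatex` is on PATH; this mirrors the manual check described above but is not
# necessarily how the repository performs it.
import subprocess
import tempfile
from pathlib import Path

TEMPLATE = r"""\documentclass{article}
\usepackage{amsmath,amssymb,amsthm}
\begin{document}
%s
\end{document}
"""

def renders_ok(latex_snippet: str) -> bool:
    """Return True if the snippet compiles inside a minimal article document."""
    with tempfile.TemporaryDirectory() as tmp:
        tex = Path(tmp) / "check.tex"
        tex.write_text(TEMPLATE % latex_snippet)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", tex.name],
            cwd=tmp, capture_output=True,
        )
        return result.returncode == 0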
Note that our benchmark is fully automated and refreshable. For example, we can simply run the following script to retrieve the latest math papers from May 2025 and evaluate frontier models on them.
#!/bin/bash
OUTPUT_PATH=MATH_2025_5
# 1. Retrieve math papers
python helpers/arxiv_retriever.py --year 2025 --month 5 --output $OUTPUT_PATH/papers --max-results 1000 --category math
# 2. Extract LaTeX source
python helpers/extract_latex_text.py --input $OUTPUT_PATH/papers --output $OUTPUT_PATH/latex
# 3. Extract theorems
python helpers/extract_theorems.py --input $OUTPUT_PATH/latex --output $OUTPUT_PATH/theorems
# 4. Generate QA pairs
python helpers/generate_qa.py --input $OUTPUT_PATH/theorems --output $OUTPUT_PATH/qa_pairs
# 5. Evaluate models on the QA pairs
python eval_math.py --model o4-mini --dataset $OUTPUT_PATH/qa_pairs --output $OUTPUT_PATH/results &
python eval_math.py --model claude-3.7-sonnet --dataset $OUTPUT_PATH/qa_pairs --output $OUTPUT_PATH/results &
python eval_math.py --model claude-3.7-sonnet --dataset $OUTPUT_PATH/qa_pairs --use_thinking --parallel 10 --output $OUTPUT_PATH/results &
# Wait for all background evaluation jobs to complete
wait
echo "Done!"
If you use this code/dataset in your research, please cite the following paper:
@misc{zhang2025realmathcontinuousbenchmarkevaluating,
  title={RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics},
  author={Jie Zhang and Cezara Petrui and Kristina Nikolić and Florian Tramèr},
  year={2025},
  eprint={2505.12575},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.12575},
}