We introduce an iterative self-checking approach that significantly improves the verification accuracy of mathematical solutions for language models. Our method is especially effective with DeepSeek-R1 distilled models, enabling 7B/8B models to surpass all 70B/72B models and GPT-4o on ProcessBench. For example, DeepSeek-R1-Distill-Qwen-7B achieves 71.3% accuracy on ProcessBench with our method, and DeepSeek-R1-Distill-Llama-8B achieves 67.2%. In the iterative self-checking method, each LLM iteratively checks its own identifications, and the final output is produced only when multiple LLMs demonstrate consistent self-checking over time.
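As a rough illustration of the idea (not the exact implementation in this repository), the sketch below runs several verifier chains, lets each one re-check its previously identified error step, and only returns an answer once every chain has stayed stable for a few consecutive rounds. All names here (`verify_once`, `stability_window`, etc.) are hypothetical, and the real scripts use an LLM call where this sketch uses a random stand-in.

```python
# Minimal sketch of iterative self-checking with a consistency stopping rule.
# Hypothetical names; not the repo's actual implementation.
import random
from collections import Counter

def verify_once(problem, solution_steps, previous_answer=None):
    """Placeholder for one LLM verification call.

    Should return the index of the first incorrect step, or -1 if the
    solution looks correct. Here we simulate a noisy verifier.
    """
    return random.choice([-1, 2, 2, 2])  # stand-in for an LLM call

def iterative_self_check(problem, steps, num_chains=3, stability_window=3, max_rounds=10):
    # Each chain keeps its own history of identified error steps.
    histories = [[] for _ in range(num_chains)]
    for _ in range(max_rounds):
        for hist in histories:
            prev = hist[-1] if hist else None
            hist.append(verify_once(problem, steps, previous_answer=prev))
        # Stop when every chain has repeated the same identification
        # for `stability_window` consecutive rounds.
        stable = all(
            len(h) >= stability_window and len(set(h[-stability_window:])) == 1
            for h in histories
        )
        if stable:
            break
    # Final answer: majority vote over the chains' latest identifications.
    return Counter(h[-1] for h in histories).most_common(1)[0][0]

print(iterative_self_check("problem text", ["step 1", "step 2", "step 3"]))
```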
git clone https://github.com/jcguo123/Iterative-self-checking.git
cd Iterative-self-checking
pip install -r requirements.txt
With your data in a JSON file, first run the following command to generate the initial verification results.
python ./code/generation/initial_verification.py --model_path <MODEL_PATH> --dataset_path <DATASET_PATH> --output_dic <OUTPUT_DIR>
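The exact input schema is defined by the repository's scripts; as an assumption for illustration only, each entry might pair a problem with its solution steps and an annotated first-error index. The snippet below writes such a hypothetical file; adjust the field names to whatever `initial_verification.py` actually expects.

```python
# Hypothetical input format (the real field names may differ; check
# ./code/generation/initial_verification.py for the expected schema).
import json

examples = [
    {
        "problem": "What is 12 * 7 + 5?",
        "steps": [
            "Step 1: 12 * 7 = 84.",
            "Step 2: 84 + 5 = 89.",
        ],
        "label": -1,  # index of the first wrong step, or -1 if all steps are correct
    }
]

with open("dataset.json", "w") as f:
    json.dump(examples, f, indent=2)
```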
Then run the following command to perform the self-checking rounds.
python ./code/generation/self_checking.py --model_path <MODEL_PATH> --output_dic <OUTPUT_DIR>
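If you prefer to drive both steps from Python rather than the shell, a small wrapper like the one below works. The model path is only an example, and whether `self_checking.py` runs all rounds internally or needs to be invoked repeatedly should be confirmed against the script itself.

```python
# Hypothetical end-to-end driver for the two generation scripts; the flags
# mirror the commands above.
import subprocess

MODEL_PATH = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # example model
DATASET_PATH = "dataset.json"
OUTPUT_DIR = "outputs/"

# Round 0: initial verification of every solution in the dataset.
subprocess.run(
    ["python", "./code/generation/initial_verification.py",
     "--model_path", MODEL_PATH,
     "--dataset_path", DATASET_PATH,
     "--output_dic", OUTPUT_DIR],
    check=True,
)

# Subsequent rounds: the self-checking script reads the previous results
# from OUTPUT_DIR and re-checks them.
subprocess.run(
    ["python", "./code/generation/self_checking.py",
     "--model_path", MODEL_PATH,
     "--output_dic", OUTPUT_DIR],
    check=True,
)
```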
To evaluate the generated results, run:
python ./code/evaluation/evaluation.py --model <MODEL_NAME> --output_dic <OUTPUT_DIR>
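The official metric is whatever `evaluation.py` computes; as an assumption for this sketch, ProcessBench-style scoring treats a prediction as the index of the first erroneous step (or -1 for a fully correct solution) and reports the harmonic mean of the accuracies on erroneous and correct samples.

```python
# Rough sketch of ProcessBench-style scoring (assumed, not necessarily
# identical to ./code/evaluation/evaluation.py).
def f1_of_accuracies(predictions, labels):
    err_hits = err_total = cor_hits = cor_total = 0
    for pred, label in zip(predictions, labels):
        if label == -1:           # fully correct solution
            cor_total += 1
            cor_hits += pred == -1
        else:                     # solution with an annotated error step
            err_total += 1
            err_hits += pred == label
    acc_err = err_hits / err_total if err_total else 0.0
    acc_cor = cor_hits / cor_total if cor_total else 0.0
    if acc_err + acc_cor == 0:
        return 0.0
    return 2 * acc_err * acc_cor / (acc_err + acc_cor)

print(f1_of_accuracies([2, -1, 0], [2, -1, 1]))  # toy example
```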
To evaluate error detection capabilities, we use three datasets with annotated mathematical solution steps: PRM800K, ProcessBench, and MathCheck. These benchmarks contain problems from GSM8K, MATH, OlympiadBench, and Omni-MATH.
Model | MathCheck* | ProcessBench | PRM800K |
---|---|---|---|
GPT-4o mini | 78.8 | 52.9 | 34.0 |
+ Iterative Self-checking | 84.8 | 58.2 | 39.0 |
GPT-4o | 87.3 | 62.5 | 41.6 |
+ Iterative Self-checking | 91.8 | 69.1 | 51.6 |
Llama 3.1 8B Instruct | 13.3 | 6.4 | 2.4 |
+ Iterative Self-checking | 60.2 | 35.5 | 22.1 |
Mistral 7B Instruct v0.3 | 26.4 | 20.3 | 13.0 |
+ Iterative Self-checking | 37.4 | 22.5 | 13.3 |
DeepSeek-R1-Distill-Qwen-7B | 86.0 | 54.8 | 46.2 |
+ Iterative Self-checking | 89.5 | 71.3 | 57.7 |
DeepSeek-R1-Distill-Llama-8B | 35.9 | 29.3 | 21.2 |
+ Iterative Self-checking | 82.5 | 67.2 | 50.2 |
DeepSeek-R1-Distill-Qwen-14B | 91.1 | 71.6 | 54.6 |
+ Iterative Self-checking | 94.8 | 82.2 | 63.2 |
DeepSeek-R1 | 93.5 | 83.3 | 62.9 |