Official repository for the paper *Self-Training Elicits Concise Reasoning in Large Language Models* by Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, and Se-Young Yun.
Try our models directly in the browser with our Hugging Face Space demo.
We provide all the fine-tuned models for concise reasoning on GSM8K and MATH:
- LLaMA-3.2 Models:
- Qwen2.5 Models:
- Gemma-2 Models:
- DeepSeek Models:
We also provide a Gradio interface for running our models locally. See the `gradio_demo` directory for instructions.
- Set up the Conda environment:

```bash
conda create --name concise python=3.12 -y
conda activate concise
pip install -r requirements.txt
```
The pipeline expects the following directory structure:

```
.
├── models/                    # Pre-trained models
│   ├── llama-3.2-1b-instruct/
│   ├── llama-3.2-3b-instruct/
│   └── ...
├── data/                      # Dataset and generated data
│   ├── gsm8k/
│   ├── math/
│   └── few_shot_examples/     # Few-shot examples for prompting
```
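To populate `models/`, one option is the Hugging Face CLI. The sketch below is illustrative: the repo ID and target directory are assumptions based on the tree above, not values prescribed by this repo.

```bash
# Sketch: fetch one base model into the expected layout. Gated models
# (e.g., Llama) may first require `huggingface-cli login`.
pip install -U "huggingface_hub[cli]"
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
  --local-dir models/llama-3.2-1b-instruct
```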
Supported models:

- Llama 3.2 (1B and 3B)
- Qwen 2.5 (Math 1.5B and 3B)
- Gemma 2 (2B)
- Llama 3.1 (8B)
- DeepSeek Math (7B)

Supported datasets:

- GSM8K
- MATH
Our training pipeline (`training_pipeline.sh`) supports two primary training modes:
- Simple Training: Train using either zero-shot or few-shot generated data (example invocation after this list)
- Augmented Training: Train using a combination of both approaches
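The augmented mode is shown in the getting-started command below. For the simple mode, a zero-shot run might look like the following; the parameter values are illustrative, and all parameters are documented in the parameter list further down:

```bash
# Simple training on zero-shot generated data (illustrative values)
TRAINING_TYPE="simple" SIMPLE_APPROACH="zero-shot" ZERO_SHOT_PROMPT_SYSTEM="irpo:16" \
  ./src/scripts/training_pipeline.sh
```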
- Clone the repository
- Download the original pre-trained models and place them in the `models/` directory (see the download sketch above)
- Run the training pipeline:

```bash
TRAINING_TYPE="augmented" ZERO_SHOT_PROMPT_SYSTEM="irpo:16" FEW_SHOT_PROMPT_SYSTEM="gpt4o:16" ./src/scripts/training_pipeline.sh
```
This command will:
- Generate 16 diverse reasoning paths for each problem using both IRPO zero-shot and GPT-4o few-shot approaches
- Combine these datasets for augmented training
- Train models with the shortest correct reasoning path for each question (sketched below)
- Evaluate models on test sets and report accuracy metrics
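The shortest-correct-path selection in the third step can be pictured as a best-of-N filter. Below is a minimal `jq` sketch, assuming a JSONL file whose records carry `question`, `correct`, and `rationale` fields; the file name and schema are assumptions for illustration, not the repo's actual preprocessing code:

```bash
# Hypothetical sketch: per problem, keep the shortest correct rationale.
jq -s '
  map(select(.correct))                 # keep only correct reasoning paths
  | group_by(.question)                 # bucket paths by problem
  | map(min_by(.rationale | length))    # shortest correct path per problem
' generated_paths.jsonl > shortest_correct.json
```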
The `training_pipeline.sh` script orchestrates the entire training process:
- Generation Phase: Creates reasoning paths using the specified prompting approaches
- Preprocessing Phase: Converts generated paths into a training-ready format
- Training Phase: Fine-tunes the model on the generated data
- Evaluation Phase: Tests model performance on benchmark datasets
You can modify the following parameters in the script:
- `TRAINING_TYPE`: Choose "simple" or "augmented"
- `SIMPLE_APPROACH`: If using simple training, choose "zero-shot" or "few-shot"
- `ZERO_SHOT_PROMPT_SYSTEM`: Format is "method:num_paths" (e.g., "irpo:16")
- `FEW_SHOT_PROMPT_SYSTEM`: Format is "method:num_paths" (e.g., "gpt4o:16")
- `USE_SHORTEST`: Set to true to use only the shortest rationales during training
- `CUDA_DEVICES`: Specify which GPUs to use (e.g., "0,1,2,3")
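Putting several of the parameters above together, a few-shot simple run pinned to two GPUs and restricted to the shortest rationales might look like this (all values are illustrative):

```bash
# Few-shot simple training on GPUs 0 and 1, keeping only the shortest
# correct rationales
TRAINING_TYPE="simple" SIMPLE_APPROACH="few-shot" FEW_SHOT_PROMPT_SYSTEM="gpt4o:16" \
  USE_SHORTEST=true CUDA_DEVICES="0,1" ./src/scripts/training_pipeline.sh
```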
If you find our work useful, please consider citing our paper:
```bibtex
@article{munkhbat2025self,
  title={Self-Training Elicits Concise Reasoning in Large Language Models},
  author={Munkhbat, Tergel and Ho, Namgyu and Kim, Seohyun and Yang, Yongjin and Kim, Yujin and Yun, Se-Young},
  journal={arXiv preprint arXiv:2502.20122},
  year={2025}
}
```