This repository provides the codebase used in the paper Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective.
We introduce a novel framework for interpreting the reasoning capabilities of large language models (LLMs) through the lens of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM’s parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.
- 🐍 Python 3.10
- ⚡ PyTorch 2.6.0
- 🧠 veRL
- 🚀 vLLM
- 📊 OpenCompass
We provide the reasoning trajectories generated by Qwen2.5-72B-Instruct and DeepSeek-Distill-Qwen-14B on Hugging Face:
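Once downloaded, the trajectory files can be loaded with the Hugging Face `datasets` library. A minimal sketch, using a hypothetical placeholder repository id (substitute the actual dataset linked above):

```python
from datasets import load_dataset

# "<hf-org>/<trajectory-dataset>" is a hypothetical placeholder; replace it with
# the Hugging Face dataset id referenced above.
trajectories = load_dataset("<hf-org>/<trajectory-dataset>", split="train")
print(trajectories[0])  # inspect one reasoning-trajectory record
```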
We use veRL for training; refer to the veRL GitHub repository for instructions on launching training runs. The relevant code is located in `src/third_party/verl`:
📦 verl
┣ 📂trainer
┃ ┣ 📂config
┃ ┃ ┗ 📜 ppo_trainer.yaml
┃ ┗ 📂ppo
┃ ┃ ┗ 📜 ray_trainer.py
┣ 📂workers
┃ ┗ 📂actor
┃ ┃ ┗ 📜 dp_actor.py
┣ 📜 data_process_grpo.py
┗ 📜 data_process_sft.py
- `data_process_sft.py`: Script to generate Parquet files for SFT training.
- `data_process_grpo.py`: Script to generate Parquet files for GRPO training.
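As a rough illustration of the expected output, the sketch below writes SFT-style records to Parquet with pandas. The column names here are hypothetical; the schema actually consumed by the trainer is defined in `data_process_sft.py`:

```python
import pandas as pd

# Hypothetical column names for illustration only; see data_process_sft.py for
# the actual schema.
records = [
    {"question": "What is 2 + 3?", "response": "Let's reason step by step. 2 + 3 = 5. The answer is 5."},
    {"question": "What is 7 * 6?", "response": "Let's reason step by step. 7 * 6 = 42. The answer is 42."},
]
pd.DataFrame(records).to_parquet("sft_train.parquet", index=False)
```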
The GRPO training implementation is detailed in:
- `verl/trainer/ppo/ray_trainer.py`
- `verl/workers/actor/dp_actor.py`
- `verl/trainer/config/ppo_trainer.yaml`
Specifically, we generate the `gen_mask` to control the number of update trajectories per question and add it to the batch in lines 867–881 of `ray_trainer.py`:
```python
from collections import defaultdict
import random

# Group rollout indices by the uid of the prompt they were sampled from.
index = batch.non_tensor_batch['uid']
prompt2sample = defaultdict(list)
for i in range(len(index)):
    prompt2sample[index[i]].append(i)

# Recover the per-token response mask from the attention mask.
responses = batch.batch['responses']
response_length = responses.size(1)
attention_mask = batch.batch['attention_mask']
response_mask = attention_mask[:, -response_length:]

# For each prompt, unmask only `num_update_sample_per_prompt` randomly chosen
# trajectories so that they alone contribute to the policy update.
mask = torch.zeros_like(response_mask)
for indices in prompt2sample.values():
    for i in random.sample(indices, self.config.actor_rollout_ref.actor.num_update_sample_per_prompt):
        mask[i] = 1

batch.batch.update({"gen_mask": mask})
```
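The toy snippet below (not repository code) makes the selection behavior concrete: for each question, exactly `num_update_sample_per_prompt` of its rollouts end up unmasked.

```python
import random
from collections import defaultdict

import torch

# Toy illustration: 2 questions with 4 rollouts each, keeping 2 trajectories
# per question for the policy update.
uids = ["q1", "q1", "q1", "q1", "q2", "q2", "q2", "q2"]
response_length = 5
num_update_sample_per_prompt = 2

prompt2sample = defaultdict(list)
for i, uid in enumerate(uids):
    prompt2sample[uid].append(i)

gen_mask = torch.zeros(len(uids), response_length, dtype=torch.long)
for indices in prompt2sample.values():
    for i in random.sample(indices, num_update_sample_per_prompt):
        gen_mask[i] = 1

print(gen_mask.sum(dim=1))  # exactly two rows per question are fully unmasked
```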
Then, in line 279 of `dp_actor.py`, we obtain the `gen_mask`:

```python
gen_mask = data['gen_mask']
```
We combine the `gen_mask` and `response_mask` in lines 291–296 of `dp_actor.py`:
```python
pg_loss, pg_clipfrac, ppo_kl = core_algos.compute_policy_loss(
    old_log_prob=old_log_prob,
    log_prob=log_prob,
    advantages=advantages,
    eos_mask=response_mask & gen_mask,  # only tokens of selected trajectories contribute
    cliprange_min=clip_ratio_min,
    cliprange_max=clip_ratio_max,
)
```
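Intersecting the two masks means a token contributes to the loss only if it is a real response token and belongs to a selected trajectory. The sketch below is a generic masked-mean illustration of that idea, not veRL's actual `compute_policy_loss` implementation:

```python
import torch

def masked_mean(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Average only over positions where mask == 1.
    return (values * mask).sum() / mask.sum().clamp(min=1)

# Toy per-token policy-gradient losses for two responses of length 3.
pg_losses = torch.tensor([[0.5, 0.2, 0.1],
                          [0.4, 0.3, 0.3]])
response_mask = torch.tensor([[1, 1, 1],
                              [1, 1, 0]])  # second response ends in padding
gen_mask = torch.tensor([[1, 1, 1],
                         [0, 0, 0]])       # second trajectory excluded from the update

loss = masked_mean(pg_losses, response_mask & gen_mask)
print(loss)  # averages only the first response's tokens
```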
We also add the configuration parameter `num_update_sample_per_prompt` in `ppo_trainer.yaml`.
For the SFT experiments, you can control the number of update trajectories per question by adjusting the training trajectories included in `data_process_sft.py`.
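A minimal sketch of that idea (hypothetical field names, not the actual script): keep at most `k` trajectories per question before writing the Parquet file.

```python
import random
from collections import defaultdict

def subsample_per_question(records, k, seed=0):
    """Keep at most k trajectories per question (hypothetical 'question' field)."""
    rng = random.Random(seed)
    by_question = defaultdict(list)
    for record in records:
        by_question[record["question"]].append(record)
    kept = []
    for trajectories in by_question.values():
        kept.extend(rng.sample(trajectories, min(k, len(trajectories))))
    return kept
```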
We utilize OpenCompass for evaluation. A demo evaluation config file is provided in `src/third_party/opencompass/evaluation_config.py`.
📦 opencompass
┗ 📜 evaluation_config.py
Please refer to the OpenCompass documentation for details on running the evaluation.
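For reference, an OpenCompass model entry typically looks like the sketch below; the model path and values here are placeholders, and `evaluation_config.py` in this repository is the authoritative example:

```python
from opencompass.models import HuggingFaceCausalLM

# Placeholder values for illustration; see evaluation_config.py for the actual setup.
models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='my-finetuned-model',
        path='path/to/checkpoint',
        tokenizer_path='path/to/checkpoint',
        max_out_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
# Datasets are usually pulled in from OpenCompass's predefined dataset configs;
# see the OpenCompass documentation for details.
```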
We provide several demo notebooks in `src/` for experiments discussed in the paper:
📦 src
┣ 📂third_party
┃ ┣ 📂opencompass
┃ ┗ 📂verl
┣ 📜 landscape.ipynb # Visualizing landscape
┣ 📜 pseudo_gradient_descent.ipynb # Experiments on pseudo gradient descent
┗ 📜 trajectory_pruning.ipynb # Experiments on trajectory pruning
If you find this repository helpful, please consider citing our paper:
```bibtex
@article{abs-2505-19815,
  author  = {Junnan Liu and Hongwei Liu and Linchen Xiao and Shudong Liu and Taolin Zhang and Zihan Ma and Songyang Zhang and Kai Chen},
  title   = {Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective},
  journal = {CoRR},
  volume  = {abs/2505.19815},
  year    = {2025}
}
```