RaML: Deciphering Trajectory-Aided LLM Reasoning from an Optimization Perspective

🤗 Dataset  |  📄 arXiv

This repository provides the codebase used in the paper Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective.

Framework Overview

📋 Introduction

We introduce a novel framework for interpreting the reasoning capabilities of large language models (LLMs) through the lens of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM’s parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.
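As a rough sketch of this framing (the notation below, an implicit inner-loop step size $\alpha$ and per-question losses $\mathcal{L}^{\text{inner}}$ / $\mathcal{L}^{\text{outer}}$, is ours for illustration and does not appear in the repository), the setup resembles a bi-level meta-learning objective:

$\theta_i' = \theta - \alpha \, \nabla_\theta \, \mathcal{L}^{\text{inner}}_{q_i}(\theta)$    (pseudo-gradient update induced by the reasoning trajectory for question $q_i$)

$\min_\theta \; \mathbb{E}_{q_i \sim \mathcal{Q}} \big[ \mathcal{L}^{\text{outer}}_{q_i}(\theta_i') \big]$    (outer loop over the training questions)

Decoding a reasoning trajectory at test time then plays the role of inner-loop adaptation to an unseen question, while training over many questions corresponds to the outer loop.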

🧰 Dependencies

  • 🐍 Python 3.10
  • ⚡ PyTorch 2.6.0
  • 🧠 veRL
  • 🚀 vLLM
  • 📊 OpenCompass

📁 Code Organization

📚 Data

We provide the reasoning trajectories generated by Qwen2.5-72B-Instruct and DeepSeek-Distill-Qwen-14B on Hugging Face; see the 🤗 Dataset link above.
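A minimal loading sketch, assuming the Hugging Face datasets library; the dataset ID below is a placeholder, not the actual path, so substitute the ID from the 🤗 Dataset link:

from datasets import load_dataset

# Placeholder dataset ID; replace with the actual path from the 🤗 Dataset link above.
ds = load_dataset("open-compass/RaML-trajectories", split="train")
print(ds[0])  # inspect one question / reasoning-trajectory record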

🎓 SFT & GRPO Training

We use veRL to conduct training; please refer to its GitHub repository for instructions on launching training runs.

The relevant code is located in src/third_party/verl:

📦 verl
 ┣ 📂trainer
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜 ppo_trainer.yaml
 ┃ ┗ 📂ppo
 ┃ ┃ ┗ 📜 ray_trainer.py
 ┣ 📂workers
 ┃ ┗ 📂actor
 ┃ ┃ ┗ 📜 dp_actor.py
 ┣ 📜 data_process_grpo.py
 ┗ 📜 data_process_sft.py

🛠️ Data Processing Scripts

  • data_process_sft.py: Script to generate Parquet files for SFT training.
  • data_process_grpo.py: Script to generate Parquet files for GRPO training (a minimal construction sketch follows below).
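As a rough illustration of what these scripts produce, the sketch below writes a flat Parquet table of prompt/response pairs with pandas. The input file name and the column names ("prompt", "response") are our assumptions for illustration, not the scripts' actual schema:

import json
import pandas as pd

# Hypothetical JSONL input of {"question": ..., "trajectory": ...} records.
records = []
with open("trajectories.jsonl") as f:
    for line in f:
        item = json.loads(line)
        records.append({
            "prompt": item["question"],
            "response": item["trajectory"],
        })

# SFT/GRPO trainers typically consume a flat Parquet table of such pairs.
pd.DataFrame(records).to_parquet("train.parquet", index=False)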

⚙️ GRPO Training Configuration

The GRPO training implementation is detailed in:

  • verl/trainer/ppo/ray_trainer.py
  • verl/workers/actor/dp_actor.py
  • verl/trainer/config/ppo_trainer.yaml

Specifically, we generate the gen_mask to control the number of update trajectories per question and add it to the batch in lines 867–881 of ray_trainer.py:

import random
from collections import defaultdict

# Group rollout sample indices by the question (uid) they were generated from.
index = batch.non_tensor_batch['uid']
prompt2sample = defaultdict(list)
for i in range(len(index)):
    prompt2sample[index[i]].append(i)

# Recover the per-token response mask from the attention mask.
responses = batch.batch['responses']
response_length = responses.size(1)
attention_mask = batch.batch['attention_mask']
response_mask = attention_mask[:, -response_length:]

# For each question, mark only `num_update_sample_per_prompt` randomly chosen
# trajectories as eligible for the policy update; all other rows stay zero.
mask = torch.zeros_like(response_mask)
for indices in prompt2sample.values():
    for i in random.sample(indices, self.config.actor_rollout_ref.actor.num_update_sample_per_prompt):
        mask[i] = 1
batch.batch.update({"gen_mask": mask})

Then, in line 279 of dp_actor.py, we obtain the gen_mask:

gen_mask = data['gen_mask']

We combine the gen_mask and response_mask in lines 291–296 of dp_actor.py:

# Tokens contribute to the policy loss only if they belong to the response
# (response_mask) AND to a trajectory selected for this update (gen_mask).
pg_loss, pg_clipfrac, ppo_kl = core_algos.compute_policy_loss(
    old_log_prob=old_log_prob,
    log_prob=log_prob,
    advantages=advantages,
    eos_mask=response_mask & gen_mask,
    cliprange_min=clip_ratio_min,
    cliprange_max=clip_ratio_max
)
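For intuition, the sketch below shows one way a masked, clipped policy-gradient loss can be computed; it is an illustrative re-implementation under our own naming and default values, not veRL's core_algos.compute_policy_loss:

import torch

def masked_clipped_pg_loss(old_log_prob, log_prob, advantages, mask,
                           cliprange_min=0.8, cliprange_max=1.2):
    # Per-token importance ratio between the current and the rollout policy.
    ratio = torch.exp(log_prob - old_log_prob)
    # PPO-style clipped objective (negated, since we minimize the loss).
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, cliprange_min, cliprange_max)
    per_token_loss = torch.max(unclipped, clipped)
    # Average only over tokens kept by the combined response/gen mask.
    mask = mask.float()
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)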

We also add the configuration parameter num_update_sample_per_prompt to ppo_trainer.yaml.

For the SFT experiments, you can control the number of update trajectories per question in data_process_sft.py by adjusting how many training trajectories are kept for each question, as sketched below.
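A minimal subsampling sketch, assuming the trajectories are dicts keyed by a "question" field; the field name, the helper, and the per-question limit k are our assumptions, not the script's actual interface:

import random
from collections import defaultdict

def subsample_trajectories(records, k, seed=0):
    # Keep at most k trajectories per question, chosen uniformly at random.
    random.seed(seed)
    by_question = defaultdict(list)
    for rec in records:
        by_question[rec["question"]].append(rec)
    kept = []
    for trajs in by_question.values():
        kept.extend(random.sample(trajs, min(k, len(trajs))))
    return kept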

🧪 Evaluation on Benchmarks

We utilize OpenCompass for evaluation. A demo evaluation config file is provided in src/third_party/opencompass/evaluation_config.py.

📦 opencompass
 ┗ 📜 evaluation_config.py

Please refer to the OpenCompass documentation for details on running the evaluation.

📊 Analysis Notebooks

We provide several demo notebooks in src/ for experiments discussed in the paper:

📦 src
 ┣ 📂third_party
 ┃ ┣ 📂opencompass
 ┃ ┗ 📂verl
 ┣ 📜 landscape.ipynb                # Visualizing landscape
 ┣ 📜 pseudo_gradient_descent.ipynb  # Experiments on pseudo gradient descent
 ┗ 📜 trajectory_pruning.ipynb       # Experiments on trajectory pruning

📖 Citation

If you find this repository helpful, please consider citing our paper:

@article{abs-2505-19815,
  author       = {Junnan Liu and Hongwei Liu and Linchen Xiao and Shudong Liu and Taolin Zhang and Zihan Ma and Songyang Zhang and Kai Chen},
  title        = {Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective},
  journal      = {CoRR},
  volume       = {abs/2505.19815},
  year         = {2025}
}
