We initially explored GRPO on synthetic chain-of-thought graph-extraction data generated by a reasoning model (DeepSeek R1), with an LLM involved in the reward function.
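As a rough illustration of what a reward function for graph extraction can look like (this is a hedged sketch, not the reward used in the notebook — the actual setup involves an LLM judge; function and triple-format names here are hypothetical), one simple rule-based component scores a completion by triple-level F1 against the ground-truth graph:

```python
import re


def parse_triples(text: str) -> set[tuple[str, str, str]]:
    """Extract (head; relation; tail) triples from text, normalized to lowercase.

    Assumes triples are written inline as "(h; r; t)" — a hypothetical format
    chosen for this sketch, not necessarily the one used in the notebooks.
    """
    triples = set()
    for match in re.finditer(r"\(([^;()]+);([^;()]+);([^;()]+)\)", text):
        head, rel, tail = (part.strip().lower() for part in match.groups())
        triples.add((head, rel, tail))
    return triples


def graph_extraction_reward(completion: str, ground_truth: str) -> float:
    """Reward a completion with the F1 score of its extracted triples
    against the reference triples (0.0 when either side is empty)."""
    pred = parse_triples(completion)
    gold = parse_triples(ground_truth)
    overlap = len(pred & gold)
    if not pred or not gold or overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In a GRPO loop this score would be computed per sampled completion in a group; an LLM-based judge can be blended in on top of (or instead of) such a rule-based check.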
```
.
├── LICENSE
├── ground_truth_gen                                # data gen via DeepSeek R1
│   ├── polished_rl_training_data.csv
│   └── r1_distill_reasoning_graph_extraction.ipynb
└── train
    └── Qwen_GRPO_Graph_Extraction.ipynb            # training process
```
Update: the training notebook doesn't render properly on GitHub; open it from Colab instead:
| Data Gen | Training |
|---|---|
Thanks to:

- DeepSeek-Math and DeepSeek R1's work, and the DeepSeek R1 model
- Qwen's great base model
- Will's work, this one (cannot find the author), and Unsloth's Daniel Han Chen's work