“The limits of my language mean the limits of my world.”
– Wittgenstein
This repository is based on TinyZero, a reproduction of DeepSeek R1-Zero on the Countdown task, built on top of veRL.
conda create -n zero python=3.9
# install torch [or you can skip this step and let vllm install the correct version for you]
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# install vllm
pip install vllm==0.6.3 # or you can install 0.5.4, 0.4.2, or 0.3.1
pip install ray
# verl
pip install -e .
# flash attention 2
pip install flash-attn --no-build-isolation
# quality of life
pip install wandb IPython matplotlib
# behavioral evals
pip install asynciolimiter loguru tenacity anthropic openai
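After installing, you can optionally sanity-check that the core dependencies import and report their versions (this only checks the packages installed above):
# optional: verify that the core dependencies import cleanly
python -c "import torch, vllm, ray; print(torch.__version__, vllm.__version__, ray.__version__)"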
Data Preparation
First, generate the Countdown dataset using the script from the original repo.
conda activate zero
python ./examples/data_preprocess/countdown.py --local_dir {path_to_your_dataset}
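For example, to write the processed dataset under a local data/countdown directory (the path is just an illustration):
# example: generate the Countdown data (output path is illustrative)
python ./examples/data_preprocess/countdown.py --local_dir data/countdown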
To generate the priming data, we use claude-3.5-sonnet. We generate five datasets with different behaviors.
python generate_cot_datasets/api_gen.py --api_key {your_api_key} --dataset_type {dataset_type} --target_samples {target_samples} --output_file {output_file} --seed {seed} --max_target {max_target} --min_target {min_target}
# process the data into parquet format
sh ./scripts/process_data.sh
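As an illustration, one behavior dataset could be generated with concrete values like the following; the dataset_type value, file path, and numeric settings are placeholders, so check generate_cot_datasets/api_gen.py for the options it actually accepts.
# example: generate a single behavior dataset (all values are illustrative)
python generate_cot_datasets/api_gen.py --api_key $ANTHROPIC_API_KEY --dataset_type backtracking --target_samples 1000 --output_file data/cot/backtracking.json --seed 42 --max_target 100 --min_target 10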
We generate two empty-CoT datasets: one whose CoT is length-matched to the all-strategies data and one whose CoT is simply empty.
python generate_cot_datasets/generate_empty_cot.py --input_file {input_file} --output_file {output_file}
We convert the all-strategies dataset so that it contains only incorrect examples.
python generate_cot_datasets/generate_no_positive_cot.py --input_file {input_file} --output_file {output_file}
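For instance, assuming the all-strategies data was written to data/cot/all_strategies.json (paths are placeholders), the two variants can be derived as follows:
# example: derive the empty-CoT and incorrect-only variants (paths are illustrative)
python generate_cot_datasets/generate_empty_cot.py --input_file data/cot/all_strategies.json --output_file data/cot/empty_cot.json
python generate_cot_datasets/generate_no_positive_cot.py --input_file data/cot/all_strategies.json --output_file data/cot/no_positive.json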
We run SFT on the priming data to get a new primed base model.
chmod +x scripts/sft.sh
./scripts/sft.sh
We run PPO on the primed model.
sh ./scripts/train.sh
We run the behavioral evals with gpt-4o-mini to get the results.
sh ./scripts/behavioral_evals.sh
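The evals call the OpenAI API, so make sure a key is available before running; a sketch, assuming the script reads it from the standard OPENAI_API_KEY environment variable used by the openai client:
# assumption: the eval script reads the key from the standard OPENAI_API_KEY variable
export OPENAI_API_KEY={your_api_key}
sh ./scripts/behavioral_evals.sh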
First, we label the pretraining data to get the behavior counts.
# Classify behaviors
python pretraining_analysis/relabel_pretrain_offline.py --user username --start 0 --end 1000000 --save_every 10000 --dataset_name {dataset_name}
# Process and get stats
python pretraining_analysis/process_pretrain_labelled.py --dataset_name {dataset_name}
python pretraining_analysis/get_stats.py --dataset_name {dataset_name}
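Because the labeling script takes an explicit --start/--end range, you can shard a large corpus across several processes; a sketch, assuming the script only touches the slice it is given (verify this before relying on it) and using a placeholder dataset name:
# example: label a corpus in two parallel shards (dataset name and ranges are illustrative)
python pretraining_analysis/relabel_pretrain_offline.py --user username --start 0 --end 500000 --save_every 10000 --dataset_name open-web-math &
python pretraining_analysis/relabel_pretrain_offline.py --user username --start 500000 --end 1000000 --save_every 10000 --dataset_name open-web-math &
wait
# process the labelled data and compute behavior statistics
python pretraining_analysis/process_pretrain_labelled.py --dataset_name open-web-math
python pretraining_analysis/get_stats.py --dataset_name open-web-math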
We generate a new dataset with the labeled data.
# Label as COT
python pretraining_analysis/relabel_pretrain_qa.py --user username --start 0 --end 1000000 --save_every 10000 --dataset_name {dataset_name} --method {curated|negative}
# Format as QA and save as parquet
python pretraining_analysis/generate_qa_parquets.py --dataset_name {dataset_name}
# Trim for SFT and save as parquet
python pretraining_analysis/save_parquets.py --dataset_name {dataset_name}
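Putting the three steps together, a curated run might look like this (the dataset name is a placeholder):
# example: build the curated QA dataset end to end (dataset name is illustrative)
python pretraining_analysis/relabel_pretrain_qa.py --user username --start 0 --end 1000000 --save_every 10000 --dataset_name open-web-math --method curated
python pretraining_analysis/generate_qa_parquets.py --dataset_name open-web-math
python pretraining_analysis/save_parquets.py --dataset_name open-web-math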
We then run SFT on the new dataset to get a new primed base model and run PPO on it.