Minimal recipe for test-time scaling and strong reasoning performance matching o1-preview with just 1,000 examples & budget forcing
Updates:
- 2025-02: I adapted the original work to run on my GPU-poor RTX 4070 SUPER with 12 GB VRAM, using a 1.5B base model on Windows.
- 2025-02: We released s1.1, a better model than s1, by reusing the same s1K questions with reasoning traces generated by r1 instead of Gemini: s1K-1.1. Check this tweet for details.
- 2025-01: We released our paper, announced via this tweet.
This repository provides an overview of all resources for the paper "s1: Simple test-time scaling".
- Paper: https://arxiv.org/abs/2501.19393
- Model: https://hf.co/simplescaling/s1-32B
- Data: https://hf.co/datasets/simplescaling/s1K
- eval/: Evaluation scripts
- data/: Synthetic data creation scripts & co
- train/: Training scripts
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the fine-tuned model
model_path = "./s1-1"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

def generate_response(prompt, system_prompt="", max_length=2048, temperature=0.1):
    # Build the prompt according to the Qwen2.5 chat template:
    # 1. System block (guiding instructions)
    # 2. User block (the user's prompt)
    # 3. Assistant block (the model's response starts here; we open a <think> block for the model to generate into)
    formatted_prompt = (
        f"<|im_start|>system\n"
        f"{system_prompt}\n"
        f"<|im_end|>\n"
        f"<|im_start|>user\n"
        f"{prompt}\n"
        f"<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>\n"  # Leave this block open so the model generates the closing </think> followed by its answer.
    )

    # Tokenize the prompt and move it to the model's device
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    # Generate the output from the model
    output = model.generate(
        **inputs,
        max_length=max_length,
        temperature=temperature,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
    )

    # Decode the generated tokens to a string (preserving special tokens)
    full_response = tokenizer.decode(output[0], skip_special_tokens=False)

    # Optionally keep only the text before the end marker (<|im_end|>):
    # full_response = full_response.split("<|im_end|>")[0]
    return full_response

# Test the generation
user_input = "What is the third derivative of -25*o - 11*o**2 - 33*o + 58*o + 23*o**5 wrt o?"
# Expected answer: 1380*o**2
print("Bot:", generate_response(user_input))
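The script above generates one uninterrupted response. The paper's budget forcing additionally controls how long the model thinks: if the model tries to close its thinking block too early, the end-of-thinking delimiter is suppressed and a string such as "Wait" is appended so it keeps reasoning. The repository's actual evaluation implements this with vLLM (see max_thinking_tokens in the known-issues note further below); the sketch here only approximates the idea on top of the model/tokenizer loaded above, and min_thinking_tokens/max_waits are illustrative parameters, not values from the paper.

def generate_with_budget_forcing(prompt, system_prompt="", min_thinking_tokens=256, max_waits=2, max_new_tokens=2048):
    # Rough budget-forcing sketch: if the model emits </think> before spending
    # min_thinking_tokens, drop the closing tag, append "Wait", and let it continue
    # reasoning (at most max_waits times).
    text = (
        f"<|im_start|>system\n{system_prompt}\n<|im_end|>\n"
        f"<|im_start|>user\n{prompt}\n<|im_end|>\n"
        f"<|im_start|>assistant\n<think>\n"
    )
    for attempt in range(max_waits + 1):
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
        )
        new_tokens = output[0][inputs.input_ids.shape[1]:]
        new_text = tokenizer.decode(new_tokens, skip_special_tokens=False)
        stopped_early = "</think>" in new_text and len(new_tokens) < min_thinking_tokens
        if not stopped_early or attempt == max_waits:
            return text + new_text
        # Strip everything from </think> onwards and nudge the model to keep thinking.
        text = text + new_text.split("</think>")[0] + "Wait"

print("Bot (budget forced):", generate_with_budget_forcing(user_input))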
# soon...
Install the `transformers` & `torch` libraries and run:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "./s1-1"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in raspberry"
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
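The decoded `response` still contains the model's reasoning trace. Assuming the `<think> ... </think>` delimiters used in the script further above (an assumption; the exact delimiters depend on the fine-tuned model and chat template), you can split off the final answer:

print(response)

# Separate the reasoning trace from the final answer (assumes <think> ... </think> delimiters).
if "</think>" in response:
    thinking, answer = response.split("</think>", 1)
    print("Final answer:", answer.strip())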
To run training, you can find our script at `train/sft.py`, which you can invoke directly.
To train s1.1-1.5B, I used a single RTX 4070 SUPER with 12 GB VRAM (roughly 7.5 hours). Check the wandb logs here.
Quick start (I used a Conda environment and container):
conda create -n llm-training python=3.10 -y
conda activate llm-training
git clone https://github.com/BimwerxNZ/s1.1-1.5B.git
cd s1.1-1.5B
pip install -r requirements.txt
python train/sft.py
Note: If you encounter an out-of-memory (OOM) issue, consider enabling gradient checkpointing by adding the `--gradient_checkpointing=True` flag to your training command, e.g. `python train/sft.py --gradient_checkpointing=True`.
We cloned lm-evaluation-harness at commit `4cec66e4e468d15789473d6d63c3a61a751fa524` and modified it. Setup:
cd eval/lm-evaluation-harness
pip install -e .[math,vllm]
All commands are in `eval/commands.sh`. For AIME24 we always pick the `aime24_nofigures` result, which uses a dataset that only contains the AIME24 figures if they are important for the task.
If you want to compute statistics (average thinking tokens, etc.) for an evaluation run, you can use:
python eval/compute_sample_stats.py path_to_samples_file.jsonl
All our evaluation result files are at: https://hf.co/datasets/simplescaling/results
To run REBASE: commands are in eval/rebase/run.sh
Note that for the REBASE evaluations in the Discussion section we used https://huggingface.co/simplescaling/step-conditional-control-old, trained on an older version of our dataset (https://huggingface.co/datasets/simplescaling/s1K-step-conditional-control-old), and ran them on an older version of our evaluation using https://huggingface.co/datasets/Maxwell-Jia/AIME_2024.
To recreate s1K, follow the steps below. In various files you will have to replace the organizations `simplescaling` and `qfq` with an organization that you own. Note that s1K-1.1 is a better dataset, generated with r1 traces instead of Gemini traces.
- Run `data/collect_data.py` followed by `data/fix_gpqa.py` & `data/add_aime.py` to collect the questions; make sure to change the hub path in the respective files to one of your own.
- Generate traces with Gemini via `python data/gemini.py`.
- Generate answers with Qwen via `python data/bulk_inference.py`, which can be launched with `data/bulk_inference.sh`.
- Add features by running `python data/featurization.py`.
- Run the final filtering by going through `data/filter.ipynb`.
All figures and some tables are created via this colab, equivalent to `visuals/visuals.ipynb`. Some are subsequently edited via the `visuals/s1.fig` file, which you can load in Figma.
- vLLM throws `ValueError: Token id XXXXX is out of vocabulary`
  - This can happen with budget forcing, especially when running with temperature 1, where the model will sometimes predict a vocab id that is larger than its max token id but still within its embedding size, i.e. anything above 151664 and below 152064. When we refeed the model's previous outputs to it (which is done when setting e.g. max_thinking_tokens in the evaluation), vLLM raises this error because it applies the check even though it would only be a real problem for ids >152064. To fix it, you can comment out the check that raises the ValueError (it is the line `if max_input_id > tokenizer.max_token_id:` in `vllm/engine/llm_engine.py`).
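For orientation, the check looks roughly like the snippet below; the `if` line is the one quoted above, while the body and error message are paraphrased from the reported error and may differ between vLLM versions.

# In vllm/engine/llm_engine.py -- the check to comment out (body paraphrased, may differ by version):
# if max_input_id > tokenizer.max_token_id:
#     raise ValueError(f"Token id {max_input_id} is out of vocabulary")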
@misc{muennighoff2025s1simpletesttimescaling,
title={s1: Simple test-time scaling},
author={Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},
year={2025},
eprint={2501.19393},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.19393},
}