Chinese Version | English Version
Pre-training is the first and most crucial training stage in the development of large language models. As the open-source community continues to improve in areas such as model architecture, training strategies, open-source datasets, and data methods, we are committed to continuously tracking resources available for large model pre-training to give back to developers in the open-source large language model community.
Compared to comprehensive reviews, our scope is limited to commonly used resources and cutting-edge attempts related to pre-training, aiming to help users quickly get started with large language model pre-training. We also welcome contributions and updates from the open-source community to jointly promote the development of large models.
Related project links: [LLMSurvey] [YuLan-Chat] | [YuLan-Mini]
Technical reports typically distill experience from training runs on hundreds or thousands of accelerators, so reading the open-source ones is highly recommended.
- The Llama 3 Herd of Models. [paper]
- Qwen2.5 Technical Report. [paper]
- Gemma 3 Technical Report. [paper]
- Nemotron-4 340B Technical Report. [paper]
- Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs. [paper]
- Baichuan 2: Open Large-scale Language Models. [paper]
- DeepSeek-V3 Technical Report. [paper]
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
- Mixtral of Experts. [paper]
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]
- OLMoE: Open Mixture-of-Experts Language Models. [paper]
- Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]
- LLM360: Towards Fully Transparent Open-Source LLMs. [paper]
- Nemotron-4 15B Technical Report. [paper]
- Phi-4 Technical Report. [paper]
- OLMo: Accelerating the Science of Language Models. [paper]
- 2 OLMo 2 Furious. [paper]
- Yi: Open Foundation Models by 01.AI. [paper]
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]
- Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]
- MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]
All Technical Reports
- LLaMA: Open and Efficient Foundation Language Models. [paper]
- Llama 2: Open Foundation and Fine-Tuned Chat Models. [paper]
- The Llama 3 Herd of Models. [paper]
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. [paper]
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. [paper]
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. [paper]
- DeepSeek-V3 Technical Report. [paper]
- Gemma: Open Models Based on Gemini Research and Technology. [paper]
- Gemma 2: Improving Open Language Models at a Practical Size. [paper]
- Gemma 3 Technical Report. [paper]
- Gemini: A Family of Highly Capable Multimodal Models. [paper]
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. [paper]
- Textbooks Are All You Need. [paper]
- Textbooks Are All You Need II: phi-1.5 technical report. [paper]
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. [paper]
- Phi-4 Technical Report. [paper]
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling. [paper]
- GLM-130B: An Open Bilingual Pre-trained Model. [paper]
- ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. [paper]
- Baichuan 2: Open Large-scale Language Models. [paper]
- Baichuan-M1: Pushing the Medical Capability of Large Language Models. [paper]
- The Falcon Series of Open Language Models. [paper]
- Falcon2-11B Technical Report. [paper]
- Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]
- InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. [paper]
- InternLM2 Technical Report. [paper]
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]
- MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models. [paper]
- Skywork: A More Open Bilingual Foundation Model. [paper]
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]
- Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]
- Nemotron-4 15B Technical Report. [paper]
- Nemotron-4 340B Technical Report. [paper]
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]
- OLMo: Accelerating the Science of Language Models. [paper]
- 2 OLMo 2 Furious. [paper]
- OLMoE: Open Mixture-of-Experts Language Models. [paper]
- YuLan: An Open-source Large Language Model. [resource] [code] [paper]
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]
- LLM360: Towards Fully Transparent Open-Source LLMs. [paper]
We discuss training from the following aspects: training frameworks, training strategies, interpretability, model architecture improvements, and learning rate annealing.
The most commonly used training framework is Megatron-LM, which provides an efficient out-of-the-box baseline. Combining it with other libraries can achieve even better training speed.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. [code] [paper]
The most commonly used pre-training framework; it has a steep learning curve but is very stable.
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. [paper]
Computation-communication overlapping for MoE.
- DeepEP: an efficient expert-parallel communication library. [code]
Expert parallel acceleration.
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling. [code]
Accelerating FP8 matrix multiplication using the asynchronous features of Hopper.
- Liger Kernel: Efficient Triton Kernels for LLM Training. [code] [paper]
A library of efficient Triton kernels for LLM training.
All Training Frameworks
- Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. [code] [paper]
Zero-redundancy data parallelism; a minimal configuration sketch follows this list.
- TorchTitan: One-stop PyTorch native solution for production ready LLM pretraining. [code] [paper]
Torch-native parallelism based on DTensor
- Flash Linear Attention [code]
Efficient Triton-based implementations for state-of-the-art linear attention models
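For the zero-redundancy data parallelism mentioned above, here is a minimal DeepSpeed initialization sketch. The model, batch sizes, and optimizer settings are placeholder assumptions rather than a recommended recipe, and the script is meant to be launched with the `deepspeed` CLI.

```python
import deepspeed
import torch.nn as nn

# Minimal ZeRO-2 configuration (illustrative values, not a tuned recipe).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients across ranks
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4, "weight_decay": 0.1}},
}

model = nn.TransformerEncoderLayer(d_model=512, nhead=8)  # placeholder model

# deepspeed.initialize wraps the model in a training engine that handles ZeRO
# partitioning, mixed precision, and gradient accumulation.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

With ZeRO stage 2, each rank keeps the full parameters but only a shard of the optimizer states and gradients, trading a little communication for a large memory saving.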
This part covers hyperparameter scaling laws, parallel strategies, initialization strategies, optimizer selection, FP8 training, and more.
- Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining. [paper] [homepage]
Scaling laws for hyperparameters; an illustrative curve-fitting sketch follows this list.
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters. [demo]
Visualizing the memory usage of parallel strategies.
- A Spectral Condition for Feature Learning. [paper]
An advanced version of MuP.
- Muon is Scalable for LLM Training. [code] [paper]
An efficient optimizer.
- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training. [paper] [code]
FP8 training that also keeps optimizer states and activations in FP8.
- Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models. [paper]
Scaling laws for optimal MoE sparsity.
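To make the idea of a hyperparameter scaling law concrete, below is a minimal curve-fitting sketch. It assumes a generic power-law form lr*(N) = a · N^b for the optimal learning rate versus parameter count, with made-up data points; the actual functional forms and coefficients in the papers above differ and should be taken from the papers themselves.

```python
import numpy as np

# Hypothetical (parameter count, empirically best learning rate) pairs
# collected from small-scale sweeps; not numbers from any cited paper.
model_sizes = np.array([1e8, 3e8, 1e9, 3e9])
best_lrs = np.array([6e-3, 3.5e-3, 2e-3, 1.2e-3])

# Assume lr*(N) = a * N**b and fit it as a straight line in log-log space.
b, log_a = np.polyfit(np.log(model_sizes), np.log(best_lrs), 1)
a = np.exp(log_a)

print(f"fitted: lr*(N) ~ {a:.3g} * N^{b:.3f}")
print(f"extrapolated LR for a 7e10-parameter model: {a * (7e10) ** b:.2e}")
```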
We list some interpretability works that are inspiring for pre-training.
- On the Biology of a Large Language Model. [blog]
- Physics of Language Models. [homepage]
- In-context Learning and Induction Heads. [blog] [paper]
- Rethinking Reflection in Pre-Training. [paper]
We list some recent improvements to model architectures; a sketch of the auxiliary-loss-free load-balancing idea follows the list.
- Gated Delta Networks: Improving Mamba2 with Delta Rule. [paper]
- RWKV-7 "Goose" with Expressive Dynamic State Evolution. [paper]
- Mixture of Hidden-Dimensions Transformer. [paper]
- Titans: Learning to Memorize at Test Time. [paper]
- Ultra-Sparse Memory Network. [paper]
- Large Language Diffusion Models. [paper]
- Better & Faster Large Language Models via Multi-token Prediction. [paper]
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. [paper]
- Stick-breaking Attention. [code] [paper]
- Forgetting Transformer: Softmax Attention with a Forget Gate. [paper]
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. [unofficial code] [paper]
- MoBA: Mixture of Block Attention for Long-Context LLMs. [code] [paper]
- KV Shifting Attention Enhances Language Modeling. [paper]
- Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models. [paper]
- Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts. [paper]
- ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs. [paper]
- μnit Scaling: Simple and Scalable FP8 LLM Training. [paper]
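As a concrete illustration of the auxiliary-loss-free load-balancing idea listed above, here is a minimal sketch of bias-based routing: each expert carries a bias that is added to its routing score only for top-k selection, and the bias is nudged down for overloaded experts and up for underloaded ones after each batch. This is a simplified rendering of the idea under assumed details (sign-based update, softmax gating over the selected scores), not the exact implementation from the papers.

```python
import torch

def route_topk_with_bias(scores, bias, k=2, update_rate=1e-3):
    """scores: [tokens, experts] router affinities; bias: [experts] balancing bias."""
    # The bias only influences which experts are selected; the gating weights
    # are still computed from the original scores.
    topk_idx = torch.topk(scores + bias, k, dim=-1).indices    # [tokens, k]
    gate = torch.gather(scores, -1, topk_idx).softmax(dim=-1)  # [tokens, k]

    # Count how many tokens each expert received, then nudge the bias toward
    # balance: overloaded experts get a lower bias, underloaded a higher one.
    load = torch.zeros(scores.size(-1)).scatter_add_(
        0, topk_idx.reshape(-1), torch.ones(topk_idx.numel()))
    bias = bias - update_rate * torch.sign(load - load.mean())
    return topk_idx, gate, bias

# Toy usage: 16 tokens routed over 8 experts with top-2 selection.
scores = torch.randn(16, 8)
bias = torch.zeros(8)
topk_idx, gate, bias = route_topk_with_bias(scores, bias)
```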
Learning rate annealing is often combined with data quality screening; a sketch of the warmup-stable-decay schedule follows the list below.
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]
- Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. [paper]
- Scaling Law with Learning Rate Annealing. [paper]
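The MiniCPM report popularized the warmup-stable-decay (WSD) schedule: the learning rate stays constant for most of training and is only annealed in a short final phase, which is also where higher-quality data is often up-weighted. Below is a minimal sketch of such a schedule; the phase fractions, peak rate, and linear decay shape are illustrative assumptions rather than the settings of any cited paper.

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.01,
           decay_frac=0.1, min_lr=1e-5):
    """Warmup-Stable-Decay: linear warmup, long constant plateau,
    then a short annealing phase at the end of training."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                       # warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:                        # stable plateau
        return peak_lr
    progress = (step - decay_start) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress  # linear anneal to min_lr

# Inspect the schedule at a few points of a 100k-step run.
for s in [0, 500, 50_000, 95_000, 99_999]:
    print(s, f"{wsd_lr(s, 100_000):.2e}")
```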
We discuss existing open-source datasets mainly from four aspects: web pages, mathematics, code, and general-purpose.
Web data forms the core corpus in pre-training.
- DCLM. [paper] [resource]
An open-source web dataset: 3.8T tokens obtained after fastText-based filtering and other steps (see the filtering sketch after this list).
- FineWeb-Edu. [paper] [resource]
A corpus filtered and scored from FineWeb by educational quality; it helps on knowledge-intensive tasks.
- Nemotron-CC-HQ. [paper] [resource]
NVIDIA's high-quality Common Crawl corpus.
- Chinese-FineWeb-Edu. [resource]
OpenCSG's open-source Chinese corpus with educational-quality scoring, filtered and scored from Map-CC, SkyPile, WuDao, Wanjuan, and other sources.
- FineWeb2: A sparkling update with 1000s of languages [resource]
A multilingual dataset.
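Several of the web corpora above (DCLM and FineWeb-Edu in particular) are built by scoring documents with a lightweight classifier and keeping only those above a threshold. The sketch below shows that general pattern with the `fasttext` Python package; the model path, label name, and threshold are hypothetical placeholders, not the classifiers those projects actually release.

```python
import fasttext

# Hypothetical quality classifier; DCLM and FineWeb-Edu train their own models.
model = fasttext.load_model("quality_classifier.bin")

def keep_document(text, threshold=0.8):
    # fastText predicts on a single line, so newlines are flattened first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__high_quality" and probs[0] >= threshold

docs = ["An introduction to linear algebra ...", "CLICK HERE to win a prize!!!"]
filtered = [d for d in docs if keep_document(d)]
```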
Mathematical pre-training corpora can significantly improve the mathematical ability of the base model and the upper limit of post-training.
- MegaMath: Pushing the Limits of Open Math Corpora. [resource] [paper]
The largest open-source high-quality mathematical CC corpus.
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models. [resource] [paper]
Synthetic mathematical instruction data.
- mlfoundations-dev/stackoverflow_math. [resource]
Math-related questions.
- DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning. [resource] [paper]
A high-difficulty mathematical dataset.
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
Collects open-source Lean theorem-proving datasets.
Code data can not only enhance the code generation ability of the base model but also improve its mathematical and logical abilities.
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models. [resource] [paper]
Cleaned from The-Stack-V2.
- SmolLM-corpus. [resource]
Educational-quality scoring for Python code.
- The-Stack-V2. [resource]
The largest-scale raw (uncleaned) code dataset.
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
Cleans Jupyter Notebook and Python data using educational-quality scoring.
- HuggingFaceTB/issues-kaggle-notebooks. [resource]
GitHub Issues and Kaggle Notebooks data.
- mlfoundations-dev/stackoverflow. [resource]
Programming Q&A forum.
- Magicoder: Empowering Code Generation with OSS-Instruct. [resource] [paper]
Training with synthetic instruction data generated from open-source code.
General-purpose data is often scarce, long-tail data that plays a crucial role in the usability of post-trained models.
- YuLan: An Open-source Large Language Model. [resource] [code] [paper]
Long-tail knowledge enhancement and cleaning of various general-purpose data sources.
- MinerU: An Open-Source Solution for Precise Document Content Extraction. [code] [paper]
Converts PDFs to Markdown with broad format compatibility.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling. [homepage] [paper]
arXiv, conversations, DM Math, etc.
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. [resource] [paper]
Encyclopedias, books, papers, Reddit, etc.
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models. [resource] [paper]
Law, exams, news, patents, encyclopedias, etc.
- MAmmoTH2: Scaling Instructions from the Web. [resource] [paper]
Q&A pairs mined from web pages.
- togethercomputer/Long-Data-Collections. [resource]
Filtered books, papers, web pages, and instructions from datasets such as RedPajama, Pile, and P3.
- Longattn: Selecting long-context training data via token-level attention. [resource] [paper]
Q&A data selected for long-range dependencies via token-level attention.
Datasets are often paired with high-quality data methods. We elaborate on these from the aspects of tokenizers, data mixing ratios and curricula, and data synthesis.
Tokenization is an important but often overlooked component that can significantly affect a model's abilities in mathematics, knowledge recall, and more.
- SuperBPE: Space Travel for Language Models. [code] [paper]
A tokenizer-training method that learns multi-word (superword) tokens.
- Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies. [code] [demo] [paper]
Predicts the optimal vocabulary size for a given model scale.
- Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs. [code] [paper]
Compares how different tokenizers split numbers; reproduced in the snippet after this list.
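The number-tokenization comparison above is easy to reproduce: the snippet below prints how two tokenizers split the same arithmetic string (GPT-2 merges digit groups via BPE, while some newer tokenizers split numbers into individual digits). It assumes the listed checkpoints can be downloaded from the Hugging Face Hub; swap in whichever tokenizers you want to compare.

```python
from transformers import AutoTokenizer

text = "The answer to 12345 + 67890 is 80235."

# Any two tokenizers with different number handling will do; these are just examples.
for name in ["gpt2", "Qwen/Qwen2.5-0.5B"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:20s} -> {tok.tokenize(text)}")
```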
Multi-stage pre-training lets the model make full use of high-quality, small-scale data. Introducing more math, code, CoT, and even long chain-of-thought data in the continued pre-training (CPT) stage will form the core capabilities of the next generation of pre-trained models.
- Nemotron-4 15B Technical Report. [paper]
Splits training into 8T-token pre-training plus CPT on a smaller amount of data.
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
Uses educational-quality scores to build the data curriculum.
- DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. [code] [paper]
Optimizes the pre-training data mixture; a toy sampler sketch follows this list.
- Efficient Online Data Mixing For Language Model Pre-Training. [paper]
Online data mixing.
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance. [paper]
Data mixing laws.
- Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?. [paper]
Infers the data mixture of commercial models such as GPT from the merge rules of their BPE tokenizers.
- CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training. [homepage] [paper]
A clustering-based iterative data-mixture bootstrapping framework.
- Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens. [demo] [homepage] [paper]
Building an index for large-scale pre-training datasets to check data quality.
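As a toy illustration of the data-mixing problem these works study, the sketch below samples each training document's domain according to fixed mixture weights. The domains and ratios here are made up; choosing or adapting such weights is precisely what methods like DoReMi and Data Mixing Laws address.

```python
import random

# Hypothetical mixture weights, for illustration only.
MIXTURE = {"web": 0.60, "code": 0.20, "math": 0.10, "books": 0.10}

def sample_batch(sources, mixture, batch_size, rng):
    """Draw one batch, picking each document's domain according to the mixture weights."""
    domains, weights = zip(*mixture.items())
    batch = []
    for _ in range(batch_size):
        domain = rng.choices(domains, weights=weights, k=1)[0]
        batch.append(rng.choice(sources[domain]))  # a real pipeline would stream from shards
    return batch

# Toy corpora standing in for tokenized shards.
sources = {d: [f"{d}_doc_{i}" for i in range(100)] for d in MIXTURE}
print(sample_batch(sources, MIXTURE, batch_size=8, rng=random.Random(0)))
```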
In addition to the synthetic math and code data mentioned above, we summarize some general synthetic-data methods and resources. Using more long chain-of-thought data in the later stages of pre-training is also becoming a direction worth exploring.
- Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems. [resource] [code] [paper]
Imitation learning on synthetic long chain-of-thought data.
- Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions. [code] [paper]
Generating synthetic instruction data rich in information to learn knowledge from a limited corpus.
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. [resource] [code] [paper]
Constructs long-form creative-writing data.
- Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use. [paper]
Multi-step reasoning data synthesis, decomposing complex tasks into sub-trajectories and optimizing data generation with reinforcement learning.
- WildChat: 1M ChatGPT Interaction Logs in the Wild. [resource] [paper]
An open-source dataset of real user conversations.
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. [resource] [code] [paper]
Alignment data synthesis.
If you have suggestions for the project content, please submit Issues and PRs to jointly promote the development of large language models.