Chinese Version | English Version
Pre-training is the first and most crucial training stage in the development of large language models. As the open-source community continues to improve in areas such as model architecture, training strategies, open-source datasets, and data methods, we are committed to continuously tracking resources available for large model pre-training to give back to developers in the open-source large language model community.
Compared to comprehensive reviews, our scope is limited to commonly used resources and cutting-edge attempts related to pre-training, aiming to help users quickly get started with large language model pre-training. We also welcome contributions and updates from the open-source community to jointly promote the development of large models.
Related project links: [LLMSurvey] [YuLan-Chat] | [YuLan-Mini]
Technical reports typically distill experience from training runs on hundreds or thousands of accelerators, so reading the open-source ones is highly recommended.
- The Llama 3 Herd of Models. [paper]
- Qwen2.5 Technical Report. [paper]
- Gemma 3 Technical Report. [paper]
- Nemotron-4 340B Technical Report. [paper]
- Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs. [paper]
- Baichuan 2: Open Large-scale Language Models. [paper]
- DeepSeek-V3 Technical Report. [paper]
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
- Mixtral of Experts. [paper]
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]
- OLMoE: Open Mixture-of-Experts Language Models. [paper]
- Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]
- LLM360: Towards Fully Transparent Open-Source LLMs. [paper]
- Nemotron-4 15B Technical Report. [paper]
- Phi-4 Technical Report. [paper]
- OLMo: Accelerating the Science of Language Models. [paper]
- 2 OLMo 2 Furious. [paper]
- Yi: Open Foundation Models by 01.AI. [paper]
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]
- Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]
- MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]
All Technical Reports
- LLaMA: Open and Efficient Foundation Language Models. [paper]
- Llama 2: Open Foundation and Fine-Tuned Chat Models. [paper]
- The Llama 3 Herd of Models. [paper]
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. [paper]
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. [paper]
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. [paper]
- DeepSeek-V3 Technical Report. [paper]
- Gemma: Open Models Based on Gemini Research and Technology. [paper]
- Gemma 2: Improving Open Language Models at a Practical Size. [paper]
- Gemma 3 Technical Report. [paper]
- Gemini: A Family of Highly Capable Multimodal Models. [paper]
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. [paper]
- Textbooks Are All You Need. [paper]
- Textbooks Are All You Need II: phi-1.5 technical report. [paper]
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. [paper]
- Phi-4 Technical Report. [paper]
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling. [paper]
- GLM-130B: An Open Bilingual Pre-trained Model. [paper]
- ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. [paper]
- Baichuan 2: Open Large-scale Language Models. [paper]
- Baichuan-M1: Pushing the Medical Capability of Large Language Models. [paper]
- The Falcon Series of Open Language Models. [paper]
- Falcon2-11B Technical Report. [paper]
- Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]
- InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. [paper]
- InternLM2 Technical Report. [paper]
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]
- MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models. [paper]
- Skywork: A More Open Bilingual Foundation Model. [paper]
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]
- Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]
- Nemotron-4 15B Technical Report. [paper]
- Nemotron-4 340B Technical Report. [paper]
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]
- OLMo: Accelerating the Science of Language Models. [paper]
- 2 OLMo 2 Furious. [paper]
- OLMoE: Open Mixture-of-Experts Language Models. [paper]
- YuLan: An Open-source Large Language Model. [resource] [code] [paper]
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]
- LLM360: Towards Fully Transparent Open-Source LLMs. [paper]
We discuss training from the following aspects: training frameworks, training strategies, interpretability, model architecture improvements, and learning rate annealing.
The most commonly used training framework is Megatron-LM, which provides an efficient out-of-the-box baseline. Combining it with other libraries can achieve even better training speed.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. [code] [paper]
The most commonly used pre-training framework; it has a steep learning curve but is very stable.
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. [paper]
Computation-communication overlapping for MoE.
- DeepEP: an efficient expert-parallel communication library. [code]
Expert parallel acceleration.
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling. [code]
Accelerating FP8 matrix multiplication using the asynchronous features of Hopper.
- Liger Kernel: Efficient Triton Kernels for LLM Training. [code] [paper]
A library of efficient Triton kernels for LLM training.
All Training Frameworks
- Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. [code] [paper]
Zero-redundancy data parallelism; a minimal configuration sketch follows this list.
- TorchTitan: One-stop PyTorch native solution for production ready LLM pretraining. [code] [paper]
Torch-native parallelism based on DTensor
- Flash Linear Attention [code]
Efficient Triton-based implementations for state-of-the-art linear attention models
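For the zero-redundancy data parallelism mentioned above, here is a minimal DeepSpeed initialization sketch. The model, batch sizes, and optimizer settings are placeholder assumptions rather than a recommended recipe, and the script is meant to be launched with the `deepspeed` CLI.

```python
import deepspeed
import torch.nn as nn

# Minimal ZeRO-2 configuration (illustrative values, not a tuned recipe).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients across ranks
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4, "weight_decay": 0.1}},
}

model = nn.TransformerEncoderLayer(d_model=512, nhead=8)  # placeholder model

# deepspeed.initialize wraps the model in a training engine that handles ZeRO
# partitioning, mixed precision, and gradient accumulation.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

With ZeRO stage 2, each rank keeps the full parameters but only a shard of the optimizer states and gradients, trading a little communication for a large memory saving.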
This part covers hyperparameter scaling laws, parallel strategies, initialization strategies, optimizer selection, FP8 training, and more.
- Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining. [paper] [homepage]
Scaling laws for hyperparameters; an illustrative curve-fitting sketch follows this list.
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters. [demo]
Visualizing the memory usage of parallel strategies.
- A Spectral Condition for Feature Learning. [paper]
An advanced version of MuP.
- Muon is Scalable for LLM Training. [code] [paper]
An efficient optimizer.
- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training. [paper] [code]
FP8 training that also keeps optimizer states and activations in FP8.
- Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models. [paper]
Scaling laws for optimal MoE sparsity.
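To make the idea of a hyperparameter scaling law concrete, below is a minimal curve-fitting sketch. It assumes a generic power-law form lr*(N) = a · N^b for the optimal learning rate versus parameter count, with made-up data points; the actual functional forms and coefficients in the papers above differ and should be taken from the papers themselves.

```python
import numpy as np

# Hypothetical (parameter count, empirically best learning rate) pairs
# collected from small-scale sweeps; not numbers from any cited paper.
model_sizes = np.array([1e8, 3e8, 1e9, 3e9])
best_lrs = np.array([6e-3, 3.5e-3, 2e-3, 1.2e-3])

# Assume lr*(N) = a * N**b and fit it as a straight line in log-log space.
b, log_a = np.polyfit(np.log(model_sizes), np.log(best_lrs), 1)
a = np.exp(log_a)

print(f"fitted: lr*(N) ~ {a:.3g} * N^{b:.3f}")
print(f"extrapolated LR for a 7e10-parameter model: {a * (7e10) ** b:.2e}")
```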
We list some interpretability works that are inspiring for pre-training.
- On the Biology of a Large Language Model. [blog]
- Physics of Language Models. [homepage]
- In-context Learning and Induction Heads. [blog] [paper]
- Rethinking Reflection in Pre-Training. [paper]
We list some recent improvements to model architectures; a sketch of the auxiliary-loss-free load-balancing idea follows the list.
- Gated Delta Networks: Improving Mamba2 with Delta Rule. [paper]
- RWKV-7 "Goose" with Expressive Dynamic State Evolution. [paper]
- Mixture of Hidden-Dimensions Transformer. [paper]
- Titans: Learning to Memorize at Test Time. [paper]
- Ultra-Sparse Memory Network. [paper]
- Large Language Diffusion Models. [paper]
- Better & Faster Large Language Models via Multi-token Prediction. [paper]
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. [paper]
- Stick-breaking Attention. [code] [paper]
- Forgetting Transformer: Softmax Attention with a Forget Gate. [paper]
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. [unofficial code] [paper]
- MoBA: Mixture of Block Attention for Long-Context LLMs. [code] [paper]
- KV Shifting Attention Enhances Language Modeling. [paper]
- Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models. [paper]
- Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts. [paper]
- ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs. [paper]
- μnit Scaling: Simple and Scalable FP8 LLM Training. [paper]
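As a concrete illustration of the auxiliary-loss-free load-balancing idea listed above, here is a minimal sketch of bias-based routing: each expert carries a bias that is added to its routing score only for top-k selection, and the bias is nudged down for overloaded experts and up for underloaded ones after each batch. This is a simplified rendering of the idea under assumed details (sign-based update, softmax gating over the selected scores), not the exact implementation from the papers.

```python
import torch

def route_topk_with_bias(scores, bias, k=2, update_rate=1e-3):
    """scores: [tokens, experts] router affinities; bias: [experts] balancing bias."""
    # The bias only influences which experts are selected; the gating weights
    # are still computed from the original scores.
    topk_idx = torch.topk(scores + bias, k, dim=-1).indices    # [tokens, k]
    gate = torch.gather(scores, -1, topk_idx).softmax(dim=-1)  # [tokens, k]

    # Count how many tokens each expert received, then nudge the bias toward
    # balance: overloaded experts get a lower bias, underloaded a higher one.
    load = torch.zeros(scores.size(-1)).scatter_add_(
        0, topk_idx.reshape(-1), torch.ones(topk_idx.numel()))
    bias = bias - update_rate * torch.sign(load - load.mean())
    return topk_idx, gate, bias

# Toy usage: 16 tokens routed over 8 experts with top-2 selection.
scores = torch.randn(16, 8)
bias = torch.zeros(8)
topk_idx, gate, bias = route_topk_with_bias(scores, bias)
```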
Learning rate annealing is often combined with data quality screening; a sketch of the warmup-stable-decay schedule follows the list below.
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]
- Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. [paper]
- Scaling Law with Learning Rate Annealing. [paper]
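The MiniCPM report popularized the warmup-stable-decay (WSD) schedule: the learning rate stays constant for most of training and is only annealed in a short final phase, which is also where higher-quality data is often up-weighted. Below is a minimal sketch of such a schedule; the phase fractions, peak rate, and linear decay shape are illustrative assumptions rather than the settings of any cited paper.

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.01,
           decay_frac=0.1, min_lr=1e-5):
    """Warmup-Stable-Decay: linear warmup, long constant plateau,
    then a short annealing phase at the end of training."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                       # warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:                        # stable plateau
        return peak_lr
    progress = (step - decay_start) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress  # linear anneal to min_lr

# Inspect the schedule at a few points of a 100k-step run.
for s in [0, 500, 50_000, 95_000, 99_999]:
    print(s, f"{wsd_lr(s, 100_000):.2e}")
```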
We discuss existing open-source datasets mainly from four aspects: web pages, mathematics, code, and general-purpose.
Web data forms the core corpus in pre-training.
- DCLM. [paper] [resource]
An open-source web dataset: 3.8T tokens obtained after fastText-based filtering and other steps (see the filtering sketch after this list).
- FineWeb-Edu. [paper] [resource]
A corpus filtered and scored from FineWeb by educational quality; it helps on knowledge-intensive tasks.
- Nemotron-CC-HQ. [paper] [resource]
NVIDIA's high-quality Common Crawl corpus.
- Chinese-FineWeb-Edu. [resource]
OpenCSG's open-source Chinese corpus with educational-quality scoring, filtered and scored from Map-CC, SkyPile, WuDao, Wanjuan, and other sources.
- FineWeb2: A sparkling update with 1000s of languages [resource]
A multilingual dataset.
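Several of the web corpora above (DCLM and FineWeb-Edu in particular) are built by scoring documents with a lightweight classifier and keeping only those above a threshold. The sketch below shows that general pattern with the `fasttext` Python package; the model path, label name, and threshold are hypothetical placeholders, not the classifiers those projects actually release.

```python
import fasttext

# Hypothetical quality classifier; DCLM and FineWeb-Edu train their own models.
model = fasttext.load_model("quality_classifier.bin")

def keep_document(text, threshold=0.8):
    # fastText predicts on a single line, so newlines are flattened first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__high_quality" and probs[0] >= threshold

docs = ["An introduction to linear algebra ...", "CLICK HERE to win a prize!!!"]
filtered = [d for d in docs if keep_document(d)]
```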
Mathematical pre-training corpora can significantly improve the mathematical ability of the base model and the upper limit of post-training.
- MegaMath: Pushing the Limits of Open Math Corpora. [resource] [paper]
The largest open-source high-quality mathematical CC corpus.
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models. [resource] [paper]
Synthetic mathematical instruction data.
- mlfoundations-dev/stackoverflow_math. [resource]
Math-related questions.
- DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning. [resource] [paper]
A high-difficulty mathematical dataset.
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
Collects open-source Lean theorem-proving datasets.
Code data can not only enhance the code generation ability of the base model but also improve its mathematical and logical abilities.
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models. [resource] [paper]
Cleaned from The-Stack-V2.
- SmolLM-corpus. [resource]
Educational-quality scoring for Python code.
- The-Stack-V2. [resource]
The largest-scale raw (uncleaned) code dataset.
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
Cleans Jupyter Notebook and Python data using educational-quality scoring.
- HuggingFaceTB/issues-kaggle-notebooks. [resource]
GitHub Issues and Kaggle Notebooks data.
- mlfoundations-dev/stackoverflow. [resource]
Programming Q&A forum.
- Magicoder: Empowering Code Generation with OSS-Instruct. [resource] [paper]
Training with synthetic instruction data generated from open-source code.
General-purpose data is often scarce, long-tail data that plays a crucial role in the usability of post-trained models.
- YuLan: An Open-source Large Language Model. [resource] [code] [paper]
Long-tail knowledge enhancement and cleaning of various general-purpose data sources.
- MinerU: An Open-Source Solution for Precise Document Content Extraction. [code] [paper]
Converts PDFs to Markdown with broad format compatibility.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling. [homepage] [paper]
arXiv, conversations, DM Math, etc.
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. [resource] [paper]
Encyclopedias, books, papers, Reddit, etc.
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models. [resource] [paper]
Law, exams, news, patents, encyclopedias, etc.
- MAmmoTH2: Scaling Instructions from the Web. [resource] [paper]
Q&A pairs mined from web pages.
- togethercomputer/Long-Data-Collections. [resource]
Filtered books, papers, web pages, and instructions from datasets such as RedPajama, Pile, and P3.
- Longattn: Selecting long-context training data via token-level attention. [resource] [paper]
Q&A data selected for long-range dependencies via token-level attention.
Datasets are often paired with high-quality data methods. We elaborate on these from the aspects of tokenizers, data mixing ratios and curricula, and data synthesis.
Tokenization is an important but often overlooked component that can significantly affect a model's abilities in mathematics, knowledge recall, and more.
- SuperBPE: Space Travel for Language Models. [code] [paper]
A tokenizer-training method that learns multi-word (superword) tokens.
- Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies. [code] [demo] [paper]
Predicts the optimal vocabulary size for a given model scale.
- Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs. [code] [paper]
Compares how different tokenizers split numbers; reproduced in the snippet after this list.
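The number-tokenization comparison above is easy to reproduce: the snippet below prints how two tokenizers split the same arithmetic string (GPT-2 merges digit groups via BPE, while some newer tokenizers split numbers into individual digits). It assumes the listed checkpoints can be downloaded from the Hugging Face Hub; swap in whichever tokenizers you want to compare.

```python
from transformers import AutoTokenizer

text = "The answer to 12345 + 67890 is 80235."

# Any two tokenizers with different number handling will do; these are just examples.
for name in ["gpt2", "Qwen/Qwen2.5-0.5B"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:20s} -> {tok.tokenize(text)}")
```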
Multi-stage pre-training lets the model make full use of high-quality, small-scale data. Introducing more math, code, CoT, and even long chain-of-thought data in the continued pre-training (CPT) stage will form the core capabilities of the next generation of pre-trained models.
- Nemotron-4 15B Technical Report. [paper]
Splits training into 8T-token pre-training plus CPT on a smaller amount of data.
- YuLan-Mini: An Open Data-efficient Language Model. [code] [resource] [paper]
Uses educational-quality scores to build the data curriculum.
- DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. [code] [paper]
Optimizes the pre-training data mixture; a toy sampler sketch follows this list.
- Efficient Online Data Mixing For Language Model Pre-Training. [paper]
Online data mixing.
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance. [paper]
Data mixing laws.
- Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?. [paper]
Infers the data mixture of commercial models such as GPT from the merge rules of their BPE tokenizers.
- CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training. [homepage] [paper]
A clustering-based iterative data-mixture bootstrapping framework.
- Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens. [demo] [homepage] [paper]
Building an index for large-scale pre-training datasets to check data quality.
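As a toy illustration of the data-mixing problem these works study, the sketch below samples each training document's domain according to fixed mixture weights. The domains and ratios here are made up; choosing or adapting such weights is precisely what methods like DoReMi and Data Mixing Laws address.

```python
import random

# Hypothetical mixture weights, for illustration only.
MIXTURE = {"web": 0.60, "code": 0.20, "math": 0.10, "books": 0.10}

def sample_batch(sources, mixture, batch_size, rng):
    """Draw one batch, picking each document's domain according to the mixture weights."""
    domains, weights = zip(*mixture.items())
    batch = []
    for _ in range(batch_size):
        domain = rng.choices(domains, weights=weights, k=1)[0]
        batch.append(rng.choice(sources[domain]))  # a real pipeline would stream from shards
    return batch

# Toy corpora standing in for tokenized shards.
sources = {d: [f"{d}_doc_{i}" for i in range(100)] for d in MIXTURE}
print(sample_batch(sources, MIXTURE, batch_size=8, rng=random.Random(0)))
```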
In addition to the synthetic math and code data mentioned above, we summarize some general synthetic-data methods and resources. Using more long chain-of-thought data in the later stages of pre-training is also becoming a direction worth exploring.
- Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems. [resource] [code] [paper]
Imitation learning on synthetic long chain-of-thought data.
- Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions. [code] [paper]
Generating synthetic instruction data rich in information to learn knowledge from a limited corpus.
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. [resource] [code] [paper]
Constructs long-form creative-writing data.
- Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use. [paper]
Multi-step reasoning data synthesis, decomposing complex tasks into sub-trajectories and optimizing data generation with reinforcement learning.
- WildChat: 1M ChatGPT Interaction Logs in the Wild. [resource] [paper]
An open-source dataset of real user conversations.
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. [resource] [code] [paper]
Alignment data synthesis.
If you have suggestions for the project content, please submit Issues and PRs to jointly promote the development of large language models.