GitHub - tanishqkumar/beyond-nanogpt: Minimal and annotated implementations of key ideas from modern deep learning research.

Beyond NanoGPT: Go From LLM Beginner to AI Researcher!

Beyond-NanoGPT is a minimal, educational repo that aims to bridge the gap between nanoGPT and research-level deep learning. It includes annotated, from-scratch implementations of almost 100 crucial modern techniques in frontier deep learning, to help newcomers learn enough to start running experiments of their own.

The repo implements everything from KV caching and speculative decoding for LLMs to architectures like vision transformers and MLP-mixers; from attention variants like linear or multi-latent attention to generative models like denoising diffusion models and flow matching algorithms; from landmark RL papers like PPO, A3C, and AlphaZero to systems fundamentals like GPU communication algorithms and data/tensor parallelism.

Because everything is implemented by hand, the code comments explain the especially subtle details that both papers and production codebases often gloss over.

A glimpse of some plots you can make: (Left) language model speedups from attention-variants/linear_attention.ipynb; (Center) samples from a small denoising diffusion model trained on MNIST in generative-models/train_ddpm.py; (Right) reward over time for a small MLP policy on CartPole in rl/fundamentals/train_ppo.py.

LESSONS.md documents some of the things I've learned in the months spent writing this codebase.

Quickstart

Clone the Repo:

git clone https://github.com/tanishqkumar/beyond-nanogpt.git

Get Minimal Dependencies:

pip install torch numpy torchvision wandb tqdm transformers datasets diffusers matplotlib pillow jupyter gym

Start learning! The code is meant for you to read carefully, hack around with, then re-implement yourself from scratch and compare against. You can run the .py files with plain Python, as follows.
```
cd architectures/
python train_dit.py
```
or for instance
```
cd rl/fundamentals/
python train_reinforce.py --verbose --wandb 
```
Everything is written to be run on a single GPU. The code is self-documenting, with comments that give intuition and elaborate on subtleties I found tricky to implement. Arguments are specified at the bottom of each file. Jupyter notebooks are meant to be stepped through.
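As a rough illustration of what an argument block at the bottom of a script can look like, here is a minimal sketch using the standard-library argparse module. Only the --verbose and --wandb flags come from this README; the `parse_args` and `main` helpers and the `--lr` argument are hypothetical and are not taken from any file in the repo.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    # These two flags are the ones the README says every .py script accepts.
    parser.add_argument("--verbose", action="store_true", help="print extra logs")
    parser.add_argument("--wandb", action="store_true", help="log metrics to wandb")
    # Hypothetical hyperparameter, included only to show a typed argument.
    parser.add_argument("--lr", type=float, default=3e-4, help="learning rate")
    return parser.parse_args(argv)


def main(args):
    """Stand-in for a training entry point; a real script would train here."""
    if args.verbose:
        print(f"config: {vars(args)}")
    return args


# A real script would call parse_args() with no argument to read sys.argv;
# an explicit list is passed here so the sketch runs anywhere.
args = main(parse_args(["--verbose"]))
```

Editing the defaults in a block like this is one simple way to adapt a script without touching the training code.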

Current Implementations and Roadmap

Asterisks (*) denote particularly tricky implementations.

Architectures

Basic Transformer language-models/transformer.py and train_naive.py [paper]
Vision Transformer (ViT) architectures/train_vit.py [paper]
Diffusion Transformer (DiT) architectures/train_dit.py [paper]
Recurrent Neural Network (RNN) architectures/train_rnn.py [paper]
Residual Networks (ResNet) architectures/train_resnet.py [paper]
MLP-Mixer architectures/train_mlp_mixer.py [paper]
LSTM architectures/train_lstm.py [paper]
Mixture-of-Experts* (MoE) architectures/train_moe.py [paper]
Mamba* architectures/train_mamba.py [paper]

Attention Variants

Vanilla Self-Attention attention-variants/vanilla_attention.ipynb [paper]
Multi-head Self-Attention attention-variants/mhsa.ipynb [paper]
Grouped-Query Attention attention-variants/gqa.ipynb [paper]
Linear Attention* attention-variants/linear_attention.ipynb [paper]
Sparse Attention attention-variants/sparse_attention.ipynb [paper]
Cross Attention attention-variants/cross_attention.ipynb [paper]
Multi-Latent Attention* attention-variants/mla.ipynb [paper]

Language Models

Optimized Dataloading language-models/dataloaders [reference]
- Producer-consumer asynchronous dataloading
- Sequence packing
Byte-Pair Encoding language-models/bpe.ipynb [paper]
KV Caching language-models/KV_cache.ipynb [reference]
Speculative Decoding language-models/speculative_decoding.ipynb [paper]
RoPE embeddings* language-models/rope.ipynb [paper]
Multi-token Prediction language-models/train_mtp.py [paper]

Reinforcement Learning

Deep RL
- Fundamentals rl/fundamentals
  - DQN train_dqn.py [paper]
  - REINFORCE train_reinforce.py [paper]
  - PPO train_ppo.py [paper]
- Actor-Critic and Key Variants rl/actor-critic
  - Advantage Actor-Critic (A2C) train_a2c.py [paper]
  - Asynchronous Advantage Actor-Critic (A3C) train_a3c.py [paper]
  - IMPALA* (distributed RL) train_impala.py [paper]
  - Deep Deterministic Policy Gradient (DDPG) train_ddpg.py [paper]
  - Soft Actor-Critic* (SAC) train_sac.py [paper]
- Model-based RL rl/model-based
  - Model Predictive Control (MPC) train_mpc.py [reference]
  - Expert Iteration (MCTS) train_expert_iteration.py [paper]
  - Probabilistic Ensembles with Trajectory Sampling (PETS)
- Neural Chess Engine (AlphaZero) rl/chess [paper]
  - Define the architecture and environment model.py and env.py
  - MCTS for move search mcts.py
  - Self-play train.py
  - Dynamic batching and multiprocessing mcts.py
LLMs rl/llms
- RLHF a base model with UltraFeedback
- DPO a base model with UltraFeedback
- GRPO for reasoning: outcome reward on math train_grpo_gsm.py [paper]
- GRPO to use a new API correctly
- GRPO to write good haikus with an LLM autograder

Generative Models

Generative Adversarial Networks (GAN) generative-models/train_gan.py [paper]
Pix2Pix (Conditional GANs) generative-models/train_pix2pix.py [paper]
Variational Autoencoders (VAE) generative-models/train_vae.py [paper]
- Train an autoencoder for reconstruction generative-models/train_autoencoder.py
Neural Radiance Fields (NeRF)
Denoising Diffusion Probabilistic Models* (DDPM) generative-models/train_ddpm.py [paper]
Classifier-based diffusion guidance generative-models/ddpm_classifier_guidance.py [paper]
- Classifier-free diffusion guidance generative-models/ddpm_classifier_free_guidance.py [paper]
Flow matching generative-models/train_flow_matching.py [paper]

MLSys

GPU Communication Algorithms* (scatter, gather, ring/tree allreduce) mlsys/comms.py [reference]
Distributed Data Parallel mlsys/train_ddp.py [paper]
Tensor Parallel
Ring Attention (Context Parallel)
Paged Attention
Flash Attention in Triton

Evals

BERT on SST-2 (old-school NLP)
GSM8k (generative) evals/eval_gsm8k.py [paper]
MMLU (multiple-choice) evals/eval_mmlu.py [paper]
SimpleQA (LLM judge) evals/eval_simpleqa.py [paper]
Design our own eval ("good taste")

RAG

Train Small Embedding and Reranking Models
RAG 101: Retrieval on Q&A Answers rag/intro_rag.py
Multi-Hop Decomposition RAG
Sparse and Dense Retrieval
Graph RAG

Agents

Let an LLM use internet search for Q&A agents/basic-search-use
Coding Agent agents/coding-agent
- Tool use (search, run code, read/write files) & sandboxing for powerful tools [/tools]
- ReAct (iterated CoT with tool use in between) agent.py
- Memory/context management distinguishing short vs long term memory memory.py
- Evaluate: can it make a correct PR end-to-end in response to a GitHub issue? [demo]
Simulate a society with language models
Tree-of-Thoughts deep research agents
Parallel multi-agent deep research

Notes

The codebase will generally work with either a CPU or GPU, but most implementations effectively require a GPU, as they will be untenably slow otherwise. I recommend either a consumer laptop with a GPU, paying for Colab/Runpod, or simply asking a compute provider or local university for a compute grant if those are out of budget (this works surprisingly well, people are very generous). Obvious exceptions like data/tensor parallel require multi-GPU nodes.
All .py scripts take in --verbose and --wandb as command line arguments when you run them. Feel free to hack these to your needs.
Feel free to email me at tanishq@stanford.edu with feedback or implementation/feature requests, and to raise any bugs as GitHub issues. I am committing to implementing new techniques people want over the next month, and welcome contributions or bug fixes by others.
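Since the first note above says the code runs on either a CPU or a GPU, it can help to check which one is available before launching a long run. The sketch below is a generic pattern, not code from this repo: `pick_device` is a hypothetical helper, and it falls back to "cpu" when PyTorch is not installed so that the snippet runs anywhere.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    # Hypothetical helper: if torch is not installed, report "cpu" rather than fail.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"running on: {device}")
```

If this reports "cpu", expect most of the training scripts to be very slow, as the note above warns.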

If this codebase helped you, please share it and give it a star! You can cite the repository in your work as follows.

@misc{kumar2025beyond,
  author = {Tanishq Kumar},
  title = {Beyond-NanoGPT: From LLM Beginner to AI Researcher},
  year = {2025},
  howpublished = {\url{https://github.com/tanishqkumar/beyond-nanogpt}},
  note = {Accessed: 2025-01-XX}
}

Happy coding, and may your gradients never vanish!
