Andrej Karpathy’s open-source reference for the full LLM training pipeline. The repo covers tokenization, pretraining, supervised fine-tuning, RLHF, evaluation, inference, and a minimal chat UI in roughly 8,000 lines of Python.
Released in 2025. MIT licensed. The GitHub repository showed 54,644 stars, 7,421 forks, and a May 5, 2026 push timestamp when rechecked through GitHub’s API on June 12, 2026. Its README frames the project as a minimal experimental harness for training LLMs on a single GPU node, with the current example claiming GPT-2-capability training for about $48 on an 8xH100 node in roughly two hours, or closer to $15 on a spot instance.
System Verdict
Pick nanochat if the goal is understanding how a ChatGPT-class system is actually built. The codebase reads end-to-end in a day. Every stage from tokenizer to RLHF is visible without wrappers hiding the mechanics.
Skip it for anything production. It is not a serving framework, not a multi-node distributed trainer, and not a chatbot product. Use a hosted API (Claude, ChatGPT) for deployment. Use Megatron-LM, NeMo, or Axolotl for real training workloads.
The natural companion is nanoGPT. Pick nanoGPT if only the transformer and pretraining loop matter. Pick nanochat for the full pipeline through RLHF and chat serving.
Key Facts
| Fact | Detail |
|---|---|
| Author | Andrej Karpathy (former OpenAI, Tesla AI) |
| Released | October 2025 |
| License | MIT |
| Lines of code | ~8,000 Python |
| Pipeline coverage | Tokenizer, pretraining, SFT, RLHF, eval, inference, chat UI |
| Reference reproduction run | GPT-2-capability model in about two hours on an 8xH100 node, with the README citing about $48 or about $15 on spot |
| Hyperparameter control | Single --depth flag; other hparams auto-computed |
| Eval suite included | MMLU, GSM8K, HumanEval |
| Hardware floor | CPU or Apple MPS for toy runs. 8xH100 for the speedrun. |
| Repository signal | 54,644 stars, 7,421 forks, MIT license, and May 5, 2026 push timestamp as of June 12, 2026 |
| Recent experiments | FP8 precision, batch-size tweaks, NVIDIA ClimbMix data, autoresearch loops |
What it actually is
A single-repo walk-through of the LLM stack. The core library ships the tokenizer, transformer, training loop, and inference. Scripts handle each pipeline stage: pretraining on Fineweb/ClimbMix data, SFT on instruction data, RLHF, and a chat-interface demo.
The design dial is a single --depth flag: it sets the layer count, and the remaining hyperparameters are auto-derived for compute-optimal training. No hundred-parameter config files.
The GPT-2-capability example is the headline benchmark, and the repo carries it through to a chat UI without a production framework hiding the mechanics.
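To make the single-dial idea concrete, here is a minimal sketch of how a layer count could drive every other setting. The constants (`ASPECT_RATIO`, `HEAD_DIM`, `TOKENS_PER_PARAM`), the `derive_config` function, and the parameter-count approximation are all assumptions chosen for illustration; they are not taken from the nanochat source, and the repo's actual derivation may differ.

```python
from dataclasses import dataclass

# Hypothetical constants for illustration only; not read from nanochat.
ASPECT_RATIO = 64        # assumed model width per layer of depth
HEAD_DIM = 128           # assumed width of one attention head
TOKENS_PER_PARAM = 20    # common compute-optimal rule of thumb


@dataclass(frozen=True)
class DerivedConfig:
    depth: int
    model_dim: int
    num_heads: int
    approx_params: int
    target_tokens: int


def derive_config(depth: int) -> DerivedConfig:
    """Derive a full model/training config from the layer count alone."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    model_dim = depth * ASPECT_RATIO
    # Ceiling division so very shallow models still get one head.
    num_heads = max(1, -(-model_dim // HEAD_DIM))
    # Rough block parameter count: 4*d^2 for attention plus 8*d^2 for the MLP,
    # per layer. Embeddings are ignored in this approximation.
    approx_params = 12 * depth * model_dim ** 2
    target_tokens = TOKENS_PER_PARAM * approx_params
    return DerivedConfig(depth, model_dim, num_heads, approx_params, target_tokens)


if __name__ == "__main__":
    for d in (4, 12, 20):
        print(derive_config(d))
```

The point of the design is that a reader changes one number and gets a consistent model and data budget, instead of tuning width, heads, and token count separately.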
When to pick nanochat
- Learning how language models are built. The codebase does not hide mechanics behind abstractions.
- Teaching LLM internals. Educators get a complete, citable, modern reference implementation in one repo.
- Research ablations on a small budget. A minimal baseline makes architecture experiments fast to iterate.
- Understanding what small-scale pretraining costs in 2026. The README’s examples (about $48 on demand, about $15 on spot) are concrete enough to make the tradeoff visible.
- Companion reading to a theory course. Hugging Face and Stanford CS224N cover the math; nanochat is the working code.
When to pick something else
- Production LLM training at scale: Megatron-LM or NeMo for pretraining, Axolotl for fine-tuning. nanochat is not a distributed trainer.
- Deploying a chatbot: Claude or ChatGPT APIs. nanochat’s chat UI is a demo, not a product.
- Pretraining-only study: nanoGPT is Karpathy’s earlier repo. Smaller scope, fewer moving parts.
- Tiny LLM research with a ready-made checkpoint: TinyLlama (1.1B, fully trained). nanochat gives training code, not a usable model.
- Multimodal or MoE work: Out of scope. nanochat sticks to one well-defined text-only path.
Pricing
| Component | Cost |
|---|---|
| nanochat codebase | Free (MIT) |
| GPU example run | About $48 on an 8xH100 node for about two hours, with the README citing about $15 on spot |
| CPU or MPS experimentation | Free on existing hardware |
| Inference after training | User’s choice of provider or self-host |
Verified 2026-06-12 via the nanochat GitHub README and GitHub repository metadata.
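The GPU figures in the table imply a per-GPU hourly rate, which is easier to compare against a cloud provider's price list. The sketch below does only that arithmetic using the README's rounded numbers (about $48, about $15, 8 GPUs, about two hours); the function name `gpu_hour_rate` is illustrative, and the results are approximations because the inputs are.

```python
def gpu_hour_rate(total_cost_usd: float, num_gpus: int, hours: float) -> float:
    """Return the implied price of one GPU for one hour."""
    if num_gpus <= 0 or hours <= 0:
        raise ValueError("num_gpus and hours must be positive")
    return total_cost_usd / (num_gpus * hours)


NUM_GPUS = 8      # 8xH100 node
RUN_HOURS = 2.0   # "about two hours" per the README example

on_demand_rate = gpu_hour_rate(48.0, NUM_GPUS, RUN_HOURS)  # 3.0 USD per GPU-hour
spot_rate = gpu_hour_rate(15.0, NUM_GPUS, RUN_HOURS)       # 0.9375 USD per GPU-hour
spot_saving = 1.0 - 15.0 / 48.0                            # 0.6875, i.e. about 69%

if __name__ == "__main__":
    print(f"on-demand: ${on_demand_rate:.2f} per GPU-hour")
    print(f"spot:      ${spot_rate:.2f} per GPU-hour")
    print(f"spot saving: {spot_saving:.0%}")
```

To estimate a run on different hardware, multiply the local per-GPU hourly price by 16 GPU-hours. Treat the result as a rough guide, since run time will not stay fixed across GPU types.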
Against the alternatives
| | nanochat | nanoGPT | Megatron-LM |
|---|---|---|---|
| Scope | Full pipeline incl. RLHF and chat UI | Pretraining only | Industrial distributed training |
| Lines of code | ~8,000 | ~600 core (model and training loop, ~300 each) | 100,000+ |
| Readability | High | Highest | Low |
| Production-ready | No | No | Yes |
| Multi-node training | Not primary target | No | Yes |
| RLHF included | Yes | No | Add-on required |
| Best viewed as | Complete reference | Minimal pretraining demo | Production trainer |
Failure modes
- Not a deployable chatbot. Models trained here are GPT-2-scale research artifacts. Quality is nowhere near a production assistant.
- Not a production training framework. No multi-node distribution, no production data pipelines, no inference safety rails.
- Hardware requirement for meaningful runs. The headline example still assumes an 8xH100 node. CPU and MPS paths exist but produce toy models.
- Scope is intentionally narrow. Multimodal, mixture-of-experts, and vision-language models are out of the design remit.
- Pedagogical value depends on the author. Karpathy’s commentary in release notes and videos is part of the learning loop. Without that context the code alone teaches less.
- Speedrun leaderboard implies competition the code was not built for. Community entries favor efficiency tricks that can obscure the teaching value of the default path.
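The hardware point above (CUDA for meaningful runs, MPS or CPU for toy ones) usually shows up in code as a device fallback. The following is a generic PyTorch-style sketch of that fallback, not nanochat's own logic: `pick_device` and `probe_device` are hypothetical helpers, and the probe degrades to "cpu" if PyTorch is not installed.

```python
def pick_device(has_cuda: bool, has_mps: bool) -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


def probe_device() -> str:
    """Detect the best available backend, falling back to CPU without PyTorch."""
    try:
        import torch
    except ImportError:
        return pick_device(False, False)
    mps_backend = getattr(torch.backends, "mps", None)
    has_mps = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), has_mps)


if __name__ == "__main__":
    device = probe_device()
    # On "mps" or "cpu", expect toy-scale runs only.
    print(f"selected device: {device}")
```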
Methodology
This page was produced by the aipedia.wiki editorial pipeline, an automated system that ingests vendor documentation, verifies claims against primary sources, and generates the editorial analysis shown here. No individual human wrote this review. Scoring follows the four-dimension rubric at /about/scoring/ (unweighted average of Utility, Value, Moat, and Longevity). Last verified 2026-06-12 against the nanochat GitHub repo, README, and GitHub repository metadata.
FAQ
Is nanochat a chatbot I can use? No. The repo includes a minimal chat interface as an inference demo. Models trained with it are GPT-2-scale, not production assistants. For a real chatbot, use Claude or ChatGPT.
How many lines of code is nanochat? About 8,000 across the core library and scripts (GitHub). The design goal is a codebase a competent reader can walk end-to-end in a day.
What hardware is needed? For learning and small experiments, a laptop with CPU or Apple MPS runs the code at toy scale. For the headline GPT-2-capability example, the README cites roughly two hours on an 8xH100 node, about $48 at on-demand rates, and closer to $15 on spot.
What changed vs nanoGPT? nanoGPT covers pretraining only. nanochat adds the tokenizer, SFT, RLHF, eval suite, inference, and a chat UI in the same repo. Pick nanoGPT for pretraining theory, nanochat for the complete pipeline.
Can nanochat produce a usable model? Not in the modern assistant sense. The speedrun output is a GPT-2-grade model suitable for research and teaching, not for production chat. Use it to understand how capability scales with compute, not to deploy.
Sources
- nanochat GitHub (karpathy/nanochat): README, architecture, license, repository metadata, and current training-cost example
- nanoGPT reference: companion repo for pretraining-only study