A minimal, dependency-free GPT implementation in pure Python that trains a transformer-based language model and generates text.
MicroGPT is a bare-bones transformer with these key components:
- Autograd Engine (`Value` class): Implements backpropagation via computational graph traversal
- Tokenizer: Maps characters to token IDs (0 to vocab_size-1)
- Transformer: Single-layer GPT-2-like architecture with:
- Token & position embeddings
- Multi-head self-attention (4 heads)
- Feed-forward MLP layers
- RMSNorm (a simplified variant of layer normalization)
- Training: Adam optimizer over 1000 steps to minimize next-token prediction loss
- Inference: Samples new sequences token-by-token using temperature-controlled sampling
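The autograd engine described above can be sketched as a tiny scalar `Value` class. This is a minimal illustration of the idea (backpropagation via reverse traversal of the computation graph), not the exact code from the repo; the operator set here is deliberately reduced to `+`, `*`, and `tanh`.

```python
import math

class Value:
    """Scalar with autograd support: a minimal sketch of the concept."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(a+b)/da = 1
            other.grad += out.grad         # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad  # d tanh/dx = 1 - tanh^2
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Example: c = a*b + b, so dc/da = b = 3 and dc/db = a + 1 = 3
a, b = Value(2.0), Value(3.0)
c = a * b + b
c.backward()
```

Every forward operation records a closure that knows how to route gradients back to its inputs; calling `backward()` on the final loss replays those closures in reverse topological order.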
- Prepare input data in `input.txt` (one item per line):
emma
olivia
ava
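The character tokenizer can be sketched as follows. This is a hypothetical reconstruction from the description above (characters mapped to IDs 0 to vocab_size-1, plus a BOS token marking sequence start/end); names like `stoi`, `encode`, and `decode` are illustrative, not taken from the repo.

```python
# Build a character vocabulary from the training names.
docs = ["emma", "olivia", "ava"]
chars = sorted(set("".join(docs)))
BOS = len(chars)                 # reserve the last ID as the BOS/EOS marker
vocab_size = len(chars) + 1
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    # Wrap each name in BOS tokens so the model learns where names begin and end.
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

def decode(ids):
    # Drop BOS markers and map IDs back to characters.
    return "".join(itos[i] for i in ids if i != BOS)

tokens = encode("emma")
```

Using one shared token for both start and end keeps the vocabulary minimal: sampling stops as soon as the model emits BOS again.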
- Run training & inference: `python3 microgpt.py`

The script automatically trains for 1000 steps and generates 20 samples.
Input (10 names):
emma, olivia, ava, sophia, isabella, mia, charlotte, amelia, harper, evelyn
Training:
- Starts with high loss (~2.8) and decreases to ~0.35
- Model learns to predict next characters in names
Generated Output:
sample 1: charlotte
sample 2: sophia
sample 3: evelyn
sample 4: ava
...
The model learns character patterns and generates realistic-sounding names from the training data.
| Step | What Happens |
|---|---|
| Tokenization | Characters → IDs (BOS token marks sequence start/end) |
| Forward Pass | Token → embedding → attention → MLP → next-token logits |
| Loss | Cross-entropy loss between predicted & actual next token |
| Backward | Gradients flow through computation graph via chain rule |
| Optimizer | Adam updates all parameters with learning rate decay |
| Inference | Sample from softmax distribution (with temperature control) |
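The inference step in the table, sampling from a softmax distribution with temperature control, can be sketched like this. The function name and inverse-CDF sampling strategy are illustrative assumptions; only the temperature-scaled softmax itself is from the description above.

```python
import math, random

def sample_from_logits(logits, temperature=0.5, rng=random):
    # Dividing logits by the temperature sharpens (<1) or flattens (>1)
    # the distribution before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling: walk the cumulative distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

random.seed(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_from_logits([2.0, 1.0, 0.0], temperature=0.5)] += 1
```

At temperature 0.5 the highest logit wins most of the time, while still leaving some probability mass on the alternatives, which is why generated names vary between runs.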
Key hyperparameters in the code:
- `n_embd = 16`: Embedding dimension
- `n_head = 4`: Number of attention heads
- `n_layer = 1`: Number of transformer layers
- `block_size = 16`: Maximum sequence length
- `num_steps = 1000`: Training iterations
- `temperature = 0.5`: Sampling creativity (0-1; lower = more deterministic)
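The optimizer step, Adam with learning-rate decay, can be sketched on a toy one-parameter problem. The linear decay schedule and the hyperparameter defaults below are illustrative assumptions, not values read from the repo.

```python
import math

def adam_train(num_steps=1000, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    # Minimize f(w) = (w - 5)^2 with Adam plus a linear learning-rate decay.
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, num_steps + 1):
        g = 2 * (w - 5.0)                      # gradient of (w - 5)^2
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        lr_t = lr * (1 - t / (num_steps + 1))  # one possible linear decay schedule
        w -= lr_t * m_hat / (math.sqrt(v_hat) + eps)
    return w

w = adam_train()
```

In the real model the same update runs over every parameter of the embeddings, attention, and MLP; the decay drives step sizes toward zero as training approaches `num_steps`.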
Uses only the Python standard library: `os`, `math`, `random`.
Attribution: Core algorithm by @karpathy