This project includes:
- Implementation of a micro causal language model with modern architectural optimizations, including RoPE with YaRN scaling, GQA, RMSNorm, a KV cache, and pre-norm attention blocks (a minimal sketch of the block layout follows the table below).
- Pretraining ablation studies on hyperparameters and architecture.
| params | len_vocab | rope_theta | n_layers | d_model | kv_heads | q_heads |
|---|---|---|---|---|---|---|
| 26M | 6400 | 1e6 | 8 | 512 | 2 | 8 |
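
A minimal sketch of the pre-norm block layout described above, under assumed module names (`RMSNorm`, `PreNormBlock`) and with the attention and feed-forward sublayers passed in as placeholders; it illustrates the normalize-then-residual ordering rather than reproducing the repository's exact code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS of the features,
    with a learned gain and no mean subtraction or bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class PreNormBlock(nn.Module):
    """Pre-norm decoder block: normalize *before* each sublayer, then add the
    residual. Post-norm would instead normalize the sum x + sublayer(x),
    which tends to require a much lower learning rate to stay stable."""
    def __init__(self, d_model: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.ffn_norm = RMSNorm(d_model)
        self.attn = attn  # e.g. GQA attention with RoPE (sketched further below)
        self.ffn = ffn    # e.g. a SwiGLU feed-forward network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.attn_norm(x))
        x = x + self.ffn(self.ffn_norm(x))
        return x
```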
- Pretraining takes 20 minutes on a single H200, which costs approximately 1 dollar using RunPod.
- RMSNorm stabilizes the training process, and the model converges to a lower loss.
- With Post-Norm, the learning rate has to be reduced by a factor of 5 to keep training stable.
- The Pre-Norm setup simplifies the gradient flow, making training more stable and allowing for a higher learning rate.
- Adding relative position information helps the model learn: the RoPE variant converges to a lower loss (a minimal RoPE sketch appears after this list).
- Prompting the NoPE model shows that it can still infer positional information without being given explicit position embeddings.
- For the SiLU feed-forward network, d_ff is set to 4 × d_model to approximately match the parameter count of the SwiGLU variant (the parameter-count arithmetic is shown below).
- GQA can provide up to an 8x reduction in KV-cache memory with no degradation in perplexity (a back-of-the-envelope cache-size calculation is included below).
- A single KV head might be sufficient to capture all the information, as many attention heads may converge to similar patterns during pretraining.
- Batch size is set to 32 to promote better generalization.
- Learning rate is set to 5e-4. Smaller values (5e-5, 5e-6) cause optimization to stall, with the loss stuck in a plateau, while higher values risk catastrophic forgetting.
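
A minimal sketch of plain RoPE using the base frequency θ = 1e6 from the table. The interleaved channel-pairing convention is only one of several in common use, and the YaRN frequency rescaling used for context extension is omitted for brevity.

```python
import torch

def rope_rotate(x: torch.Tensor, theta: float = 1e6) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, heads, seq, head_dim).
    head_dim must be even; channel pairs are rotated by position-dependent angles."""
    _, _, seq, d = x.shape
    # Per-pair frequencies theta ** (-2i / d), as in the RoPE formulation.
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.outer(torch.arange(seq, dtype=torch.float32), inv_freq)  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # interleaved even/odd channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before attention; values are left untouched.
q = torch.randn(1, 8, 16, 64)  # (batch, q_heads, seq, head_dim) for the table's config
q_rot = rope_rotate(q)
```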
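The parameter-matching arithmetic behind the d_ff choice: a SiLU FFN has two d_model × d_ff matrices while SwiGLU has three, so their hidden widths must differ for the counts to line up. The d_ff = (8/3) × d_model setting for SwiGLU below is the common LLaMA-style convention and an assumption about this project's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SiLUFFN(nn.Module):
    # Two matrices (up, down) with d_ff = 4*d: parameter count is 2 * d * 4d = 8 * d^2.
    def __init__(self, d: int):
        super().__init__()
        self.up = nn.Linear(d, 4 * d, bias=False)
        self.down = nn.Linear(4 * d, d, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class SwiGLUFFN(nn.Module):
    # Three matrices (gate, up, down) with d_ff = (8/3)*d: count is 3 * d * (8/3)d = 8 * d^2.
    def __init__(self, d: int):
        super().__init__()
        d_ff = int(8 * d / 3)
        self.gate = nn.Linear(d, d_ff, bias=False)
        self.up = nn.Linear(d, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

d_model = 512
silu_params = sum(p.numel() for p in SiLUFFN(d_model).parameters())      # 2,097,152
swiglu_params = sum(p.numel() for p in SwiGLUFFN(d_model).parameters())  # 2,096,640
```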
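A back-of-the-envelope KV-cache calculation for the table's configuration, assuming a sequence length of 2048 and fp16 cache entries (both assumptions); the cache scales with kv_heads rather than with the number of query heads.

```python
def kv_cache_bytes(n_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Total size of the K and V caches; it scales with kv_heads, not q_heads."""
    return 2 * n_layers * batch * kv_heads * seq_len * head_dim * bytes_per_elem

# Table config: 8 layers, d_model = 512, 8 query heads -> head_dim = 64.
mha = kv_cache_bytes(n_layers=8, kv_heads=8, head_dim=64, seq_len=2048)  # 32 MiB
gqa = kv_cache_bytes(n_layers=8, kv_heads=2, head_dim=64, seq_len=2048)  # 8 MiB
mqa = kv_cache_bytes(n_layers=8, kv_heads=1, head_dim=64, seq_len=2048)  # 4 MiB
print(mha / gqa, mha / mqa)  # 4.0 8.0 -- the "up to 8x" case is a single shared KV head

# Inside attention, each cached KV head is shared by q_heads // kv_heads query heads,
# e.g. k = k.repeat_interleave(q_heads // kv_heads, dim=1) before the usual softmax(QK^T)V.
```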