This project includes:
- Implementation of a micro causal language model with modern architectural optimizations, including RoPE with YaRN scaling, GQA, RMSNorm, a KV cache, and pre-norm attention blocks (a minimal sketch of the block layout follows the table below).
- Pretraining ablation studies on hyperparameters and architecture.
| params | len_vocab | rope_theta | n_layers | d_model | kv_heads | q_heads |
|---|---|---|---|---|---|---|
| 26M | 6400 | 1e6 | 8 | 512 | 2 | 8 |
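
A minimal sketch of the pre-norm block layout described above, under assumed module names (`RMSNorm`, `PreNormBlock`) and with the attention and feed-forward sublayers passed in as placeholders; it illustrates the normalize-then-residual ordering rather than reproducing the repository's exact code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS of the features,
    with a learned gain and no mean subtraction or bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class PreNormBlock(nn.Module):
    """Pre-norm decoder block: normalize *before* each sublayer, then add the
    residual. Post-norm would instead normalize the sum x + sublayer(x),
    which tends to require a much lower learning rate to stay stable."""
    def __init__(self, d_model: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.ffn_norm = RMSNorm(d_model)
        self.attn = attn  # e.g. GQA attention with RoPE (sketched further below)
        self.ffn = ffn    # e.g. a SwiGLU feed-forward network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.attn_norm(x))
        x = x + self.ffn(self.ffn_norm(x))
        return x
```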
- Pretraining takes 20 minutes on a single H200, which costs approximately 1 dollar using RunPod.
- RMSNorm stabilizes the training process, and the model converges to a lower loss.
- With Post-Norm, the learning rate has to be reduced by a factor of 5 to keep training stable.
- The Pre-Norm setup simplifies the gradient flow, making training more stable and allowing for a higher learning rate.
- Adding relative position information helps the model learn: the RoPE variant converges to a lower loss (a minimal RoPE sketch appears after this list).
- Prompting the NoPE model shows that it can still infer positional information without being given explicit position embeddings.
- For the SiLU feed-forward network, d_ff is set to 4 × d_model to approximately match the parameter count of the SwiGLU variant (the parameter-count arithmetic is shown below).
- GQA can provide up to an 8x reduction in KV-cache memory with no degradation in perplexity (a back-of-the-envelope cache-size calculation is included below).
- A single KV head might be sufficient to capture all the information, as many attention heads may converge to similar patterns during pretraining.
- Batch size is set to 32 to promote better generalization.
- Learning rate is set to 5e-4. Smaller values (5e-5, 5e-6) cause optimization to stall, with the loss stuck in a plateau, while higher values risk catastrophic forgetting.
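
A minimal sketch of plain RoPE using the base frequency θ = 1e6 from the table. The interleaved channel-pairing convention is only one of several in common use, and the YaRN frequency rescaling used for context extension is omitted for brevity.

```python
import torch

def rope_rotate(x: torch.Tensor, theta: float = 1e6) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, heads, seq, head_dim).
    head_dim must be even; channel pairs are rotated by position-dependent angles."""
    _, _, seq, d = x.shape
    # Per-pair frequencies theta ** (-2i / d), as in the RoPE formulation.
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.outer(torch.arange(seq, dtype=torch.float32), inv_freq)  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # interleaved even/odd channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before attention; values are left untouched.
q = torch.randn(1, 8, 16, 64)  # (batch, q_heads, seq, head_dim) for the table's config
q_rot = rope_rotate(q)
```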
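The parameter-matching arithmetic behind the d_ff choice: a SiLU FFN has two d_model × d_ff matrices while SwiGLU has three, so their hidden widths must differ for the counts to line up. The d_ff = (8/3) × d_model setting for SwiGLU below is the common LLaMA-style convention and an assumption about this project's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SiLUFFN(nn.Module):
    # Two matrices (up, down) with d_ff = 4*d: parameter count is 2 * d * 4d = 8 * d^2.
    def __init__(self, d: int):
        super().__init__()
        self.up = nn.Linear(d, 4 * d, bias=False)
        self.down = nn.Linear(4 * d, d, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class SwiGLUFFN(nn.Module):
    # Three matrices (gate, up, down) with d_ff = (8/3)*d: count is 3 * d * (8/3)d = 8 * d^2.
    def __init__(self, d: int):
        super().__init__()
        d_ff = int(8 * d / 3)
        self.gate = nn.Linear(d, d_ff, bias=False)
        self.up = nn.Linear(d, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

d_model = 512
silu_params = sum(p.numel() for p in SiLUFFN(d_model).parameters())      # 2,097,152
swiglu_params = sum(p.numel() for p in SwiGLUFFN(d_model).parameters())  # 2,096,640
```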
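A back-of-the-envelope KV-cache calculation for the table's configuration, assuming a sequence length of 2048 and fp16 cache entries (both assumptions); the cache scales with kv_heads rather than with the number of query heads.

```python
def kv_cache_bytes(n_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Total size of the K and V caches; it scales with kv_heads, not q_heads."""
    return 2 * n_layers * batch * kv_heads * seq_len * head_dim * bytes_per_elem

# Table config: 8 layers, d_model = 512, 8 query heads -> head_dim = 64.
mha = kv_cache_bytes(n_layers=8, kv_heads=8, head_dim=64, seq_len=2048)  # 32 MiB
gqa = kv_cache_bytes(n_layers=8, kv_heads=2, head_dim=64, seq_len=2048)  # 8 MiB
mqa = kv_cache_bytes(n_layers=8, kv_heads=1, head_dim=64, seq_len=2048)  # 4 MiB
print(mha / gqa, mha / mqa)  # 4.0 8.0 -- the "up to 8x" case is a single shared KV head

# Inside attention, each cached KV head is shared by q_heads // kv_heads query heads,
# e.g. k = k.repeat_interleave(q_heads // kv_heads, dim=1) before the usual softmax(QK^T)V.
```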