
TransformerLM

This project includes

  • Implementation of a micro causal language model with modern architectural optimizations, including RoPE with YaRN scaling, GQA, RMSNorm, a KV cache, and pre-norm attention blocks (a minimal block sketch follows this list).
  • Pretraining ablation studies on hyperparameters and architecture.
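A minimal sketch of the pre-norm decoder block structure described above. Class and argument names are illustrative, not the repository's actual identifiers; the attention and feed-forward modules are sketched in the experiment sections further down.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Pre-norm transformer block: x + Attn(Norm(x)), then x + FFN(Norm(x))."""

    def __init__(self, d_model: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        # nn.RMSNorm is available in recent PyTorch; an explicit RMSNorm
        # implementation is sketched in the RMSNorm section below.
        self.attn_norm = nn.RMSNorm(d_model)
        self.ffn_norm = nn.RMSNorm(d_model)
        self.attn = attn   # e.g. grouped-query attention with RoPE and a KV cache
        self.ffn = ffn     # e.g. a SwiGLU feed-forward network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.attn_norm(x))  # residual around the attention sub-layer
        x = x + self.ffn(self.ffn_norm(x))    # residual around the feed-forward sub-layer
        return x
```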

Architecture

(architecture diagram)

Config

params   len_vocab   rope_theta   n_layers   d_model   kv_heads   q_heads
26M      6400        1e6          8          512       2          8
  • Pretraining takes 20 minutes on a single H200, which costs approximately $1 on RunPod.
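The config table above maps naturally onto a small configuration object; the sketch below is illustrative, and fields such as d_ff and max_seq_len are assumptions not listed in the table.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    len_vocab: int = 6400     # tokenizer vocabulary size
    d_model: int = 512        # hidden size
    n_layers: int = 8         # number of decoder blocks
    q_heads: int = 8          # query heads
    kv_heads: int = 2         # key/value heads -> each KV head serves 4 query heads
    rope_theta: float = 1e6   # RoPE base frequency
    # assumed for illustration only; not part of the table above
    d_ff: int = 1408          # feed-forward hidden size
    max_seq_len: int = 1024   # context length

cfg = ModelConfig()
assert cfg.q_heads % cfg.kv_heads == 0, "GQA needs q_heads to be a multiple of kv_heads"
```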

Experiments

Pretraining

Learning rate

(figure: learning-rate ablation)

Batch size

(figure: batch-size ablation)

RMSNorm

(figure: RMSNorm ablation)
  • RMSNorm stabilizes the training process, and the model converges to a lower loss.
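A minimal RMSNorm implementation for reference (the epsilon value is an assumption): unlike LayerNorm, it only rescales by the root-mean-square of the activations with a learned gain, without mean subtraction or a bias term.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), computed over the feature dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)
```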

Pre-Norm vs. Post-Norm

(figure: Pre-Norm vs. Post-Norm ablation)
  • For Post-Norm, the learning rate has to be decreased by a factor of 5 to achieve stable training.
  • The Pre-Norm setup simplifies the gradient flow, making training more stable and allowing for a higher learning rate.
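Schematically, the two setups differ only in where normalization sits relative to the residual connection (sublayer stands for attention or the FFN; a sketch, not the repository's code):

```python
def pre_norm_step(x, sublayer, norm):
    # Pre-norm: normalize the sub-layer input; the residual path stays an identity,
    # which keeps gradients well-scaled and tolerates a higher learning rate.
    return x + sublayer(norm(x))

def post_norm_step(x, sublayer, norm):
    # Post-norm: normalize after the residual addition; gradients must pass through
    # the norm, which in this setup required a ~5x lower learning rate to stay stable.
    return norm(x + sublayer(x))
```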

NoPE vs. RoPE

(figure: NoPE vs. RoPE ablation)
  • Adding relative position embeddings helps the model learn: RoPE converges to a lower loss (a compact RoPE sketch follows this list).
  • Prompting the NoPE model shows that it can still infer positional information without being given explicit position embeddings.
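A compact RoPE sketch that pairs consecutive head dimensions; theta corresponds to the rope_theta config value, and YaRN scaling of the per-frequency wavelengths is omitted for brevity. Function names are illustrative.

```python
import torch

def rope_frequencies(head_dim: int, max_seq_len: int, theta: float = 1e6):
    # one rotation angle per (position, frequency pair)
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(max_seq_len).float()
    angles = torch.outer(pos, inv_freq)              # (seq, head_dim / 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim); rotate each pair (x1, x2) by its position-dependent angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = cos[: x.size(-2)], sin[: x.size(-2)]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```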

SwiGLU vs. SiLU

(figure: SwiGLU vs. SiLU ablation)
  • For SiLU, d_ff is set to 4 × d_model to approximately match the parameter count of the SwiGLU feed-forward network (both variants are sketched below).
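For reference, the two feed-forward variants compared here (a sketch; layer names are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: down(SiLU(gate(x)) * up(x)); three weight matrices."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class SiLUFFN(nn.Module):
    """Plain FFN baseline: down(SiLU(up(x))); two weight matrices, so setting
    d_ff = 4 * d_model roughly matches the parameter count of a SwiGLU FFN
    whose hidden size is about 8/3 * d_model."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))
```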

Number of KV heads in GQA

(figure: KV-head ablation)
  • GQA can provide up to an 8× reduction in KV-cache memory with no degradation in perplexity.
  • A single KV head might be sufficient to capture all the necessary information, as many attention heads may converge to similar patterns during pretraining (see the GQA sketch after this list).
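A sketch of grouped-query attention showing where the KV-cache saving comes from: only kv_heads key/value heads are projected (and would be cached at inference time), then repeated across the query heads. RoPE and the actual KV-cache bookkeeping are omitted, and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, q_heads: int, kv_heads: int):
        super().__init__()
        self.head_dim = d_model // q_heads
        self.q_heads, self.kv_heads = q_heads, kv_heads
        self.q_proj = nn.Linear(d_model, q_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, kv_heads * self.head_dim, bias=False)  # smaller K/V
        self.v_proj = nn.Linear(d_model, kv_heads * self.head_dim, bias=False)  # -> smaller cache
        self.o_proj = nn.Linear(q_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.kv_heads, self.head_dim).transpose(1, 2)
        # each KV head serves q_heads // kv_heads query heads
        k = k.repeat_interleave(self.q_heads // self.kv_heads, dim=1)
        v = v.repeat_interleave(self.q_heads // self.kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))
```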

Post-training

SFT

(figure: SFT training curve, exported from Weights & Biases)
  • Batch size is set to 32 to promote better generalization.
  • Learning rate is set to 5e-4. Smaller values (5e-5, 5e-6) cause optimization to stall, with the loss stuck in a plateau, while higher values risk catastrophic forgetting (a training-setup sketch follows this list).
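The notes above correspond to a training setup along these lines. AdamW, the loss-masking convention, and the placeholder names model and sft_dataset are assumptions for illustration, not the repository's actual code.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

batch_size = 32       # smaller batches for better generalization during SFT
learning_rate = 5e-4  # lower values (5e-5, 5e-6) stalled; higher values risk forgetting

# `model` and `sft_dataset` are placeholders for the pretrained LM and the instruction data
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
loader = DataLoader(sft_dataset, batch_size=batch_size, shuffle=True)

for batch in loader:
    logits = model(batch["input_ids"])           # (batch, seq, vocab)
    loss = F.cross_entropy(                      # standard next-token cross-entropy
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,                       # prompt tokens typically masked out
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```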
