Adaptive Mamba: Elastic compute with dynamic Matryoshka scaling
🤗 Model on HuggingFace | 📂 GitHub
Also known as: ElasticGPT • Accordion-Net • Dynamic Compute Budget Model
Adamba (Adaptive Mamba) combines three efficiency techniques into a unified pipeline:
| Technique | Implementation | Purpose |
|---|---|---|
| Matryoshka (MRL) | Width: 128 → 4096 per layer | Elastic compute |
| Early Exit | ConfidenceGate | Skip layers when confident |
| Static SSM | Mamba at full dim | Stable memory backbone |
```
┌─────────────────────────────────────────────────┐
│ PROMPT → LayerDimPredictor → [dim per layer]    │
│                                                 │
│   Attention + MLP: Dynamic (sliced)             │
│   Mamba: Static (full dim)                      │
│                                                 │
│   Gate > 0.95 → EXIT EARLY                      │
│   Gate < 0.50 → EXPAND remaining layers         │
└─────────────────────────────────────────────────┘
```
- 🎯 `LayerDimPredictor`: Predicts per-layer dims upfront (no graph breaks)
- 🚪 `ConfidenceGate`: Unified early exit + dim expansion
- 📦 `MatryoshkaKVCache`: Slice-down cache strategy
- 🧠 Static Mamba: Uses efficient CUDA kernel (no SSM state resizing)
Key insight: Resizing SSM states on the fly is mathematically messy. Resizing Attention heads is trivial. Keep Mamba static, make Attention/MLP dynamic.
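To make this concrete, here is a minimal sketch of the slicing idea (the class and function names below are illustrative assumptions, not the repo's API): attention/MLP projections keep their full-width weights but run on only the leading active dims, and the activation is zero-padded back to full width before the static Mamba block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedLinear(nn.Linear):
    """Illustrative Matryoshka-style projection: weights are stored at full width,
    but each forward pass may use only the leading rows/columns."""
    def forward(self, x: torch.Tensor, active_out: int | None = None) -> torch.Tensor:
        active_in = x.shape[-1]                      # input is already sliced upstream
        active_out = active_out or self.out_features
        w = self.weight[:active_out, :active_in]     # views of the full weight, no copy
        b = self.bias[:active_out] if self.bias is not None else None
        return F.linear(x, w, b)

def pad_to_full(x: torch.Tensor, full_dim: int) -> torch.Tensor:
    """Zero-pad a sliced activation back to full width before the static Mamba block,
    which always runs at the full hidden dimension."""
    return F.pad(x, (0, full_dim - x.shape[-1]))
```

In this picture the LayerDimPredictor chooses a `dim` per layer up front, and a sliced call then looks like `proj(x[..., :dim], active_out=dim)`.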
Results from the `tiny_experiment` validation suite:

```
Early exits: 71.5% (37.5% compute saved!)
Gate loss: 0.28 → 0.03 (self-supervised difficulty learning)
Hard tasks get more dims than easy tasks ✓
```
The gate trains itself using a shadow loss: at each layer, the model computes what the loss would be if it exited there and compares it to the full-depth loss, teaching the gate when it is safe to exit.
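Here is a minimal sketch of that training signal, assuming a shared LM head, a list of per-layer hidden states, and sigmoid gate outputs (all names are illustrative, not the repo's code):

```python
import torch
import torch.nn.functional as F

def shadow_gate_loss(hidden_states, lm_head, targets, gate_outputs, tol=0.05):
    """Illustrative shadow loss: for every layer, compute the LM loss the model
    would have if it exited there, then train the gate to be confident exactly
    when that loss is within `tol` of the final layer's loss.
    hidden_states: list of [B, T, D] tensors, one per layer
    gate_outputs:  list of per-layer gate confidences in (0, 1)"""
    with torch.no_grad():
        layer_losses = []
        for h in hidden_states:
            logits = lm_head(h)                       # exit through the shared output head
            layer_losses.append(F.cross_entropy(logits.flatten(0, 1), targets.flatten()))
        final_loss = layer_losses[-1]
        # Target 1 ("safe to exit") when exiting here costs at most `tol` extra loss
        gate_targets = torch.stack([(l - final_loss <= tol).float() for l in layer_losses])
    gate_scores = torch.stack([g.mean() for g in gate_outputs])  # scalar confidence per layer
    return F.binary_cross_entropy(gate_scores, gate_targets)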
```
nanochat-d32 (1.9B, 32 layers, dim=2048)
        ↓ Surgery (add 32 Mamba layers)
Stage 1: 6.4B (dim=2048)   ← Hybrid, no expansion
        ↓ Progressive expand
Stage 2: 9.3B (dim=2560)
        ↓ Progressive expand
Stage 3: 20B (dim=4096)
```
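Conceptually, the surgery step interleaves freshly initialized Mamba blocks into the pretrained stack. A rough sketch, under the assumption that each new Mamba block exposes an `out_proj` on its residual branch (illustrative, not the repo's `scripts/surgery.py`):

```python
import torch.nn as nn

def interleave_mamba(transformer_blocks: nn.ModuleList, make_mamba_block) -> nn.ModuleList:
    """Illustrative surgery: insert a new Mamba block after each pretrained block.
    Zero-initializing the Mamba output projection makes each new residual branch a
    no-op, so the hybrid reproduces the base model's outputs exactly at step 0."""
    hybrid = []
    for block in transformer_blocks:
        hybrid.append(block)                          # keep the pretrained block as-is
        mamba = make_mamba_block()                    # assumed factory for a Mamba block
        nn.init.zeros_(mamba.out_proj.weight)         # zero-init: nothing to copy (Stage 1)
        if mamba.out_proj.bias is not None:
            nn.init.zeros_(mamba.out_proj.bias)
        hybrid.append(mamba)
    return nn.ModuleList(hybrid)
```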
| Stage | Model Size | Dim | Hours (8×H100) | Est. Cost |
|---|---|---|---|---|
| 1 | 6.4B | 2048 | 40h | $1,000 |
| 2 | 9.3B | 2560 | 50h | $1,200 |
| 3 | 20B | 4096 | 100h | $2,400 |
| Total | | | 190h | ~$4,600 |
```bash
# 1. Download nanochat-d32 base
huggingface-cli download karpathy/nanochat-d32 \
    --local-dir ~/.cache/nanochat/chatsft_checkpoints/d32

# 2. Create 6.4B hybrid
python -m scripts.surgery --new-dim=2048

# 3. Train Stage 1
torchrun --nproc_per_node=8 -m scripts.fractal_train \
    --checkpoint ~/.cache/nanochat/hybrid_checkpoints/d32_2048/model.pt \
    --expanded-dim=2048 --matryoshka --sample-dim

# 4. Expand → Stage 2
python -m scripts.surgery --expand-from=2048 --new-dim=2560
```

| File | Purpose |
|---|---|
| `nanochat/hybrid_gpt.py` | Adamba model (Mamba+Attention+Gate) |
| `nanochat/mamba_block.py` | Static Mamba with SSM fallback |
| `nanochat/matryoshka.py` | Dimension slicing + energy loss |
| `nanochat/confidence_probe.py` | V2: LayerDimPredictor, ConfidenceGate |
| `scripts/surgery.py` | Create/expand hybrid checkpoints |
| `scripts/fractal_train.py` | Matryoshka training |
| `tiny_experiment/` | Local validation suite |
| Mode | Dim | Compute | Use Case |
|---|---|---|---|
| Ghost | 128 | ~0.4% | Trivial tasks |
| Whisper | 512 | ~6% | Simple Q&A |
| Normal | 1024 | ~25% | General use |
| Think | 2048+ | 100% | Complex reasoning |
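These modes could be nothing more than a lookup from mode name to active width; the dict below is an illustrative sketch (the compute column roughly matches (dim/2048)² for the sliced attention/MLP paths):

```python
# Illustrative mode-to-width lookup; names and values mirror the table above
COMPUTE_MODES = {
    "ghost":   128,   # ~0.4% of full attention/MLP compute
    "whisper": 512,   # ~6%
    "normal":  1024,  # ~25%
    "think":   2048,  # 100% (or higher after expansion)
}

def active_dim(mode: str) -> int:
    return COMPUTE_MODES[mode]
```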
- Matryoshka Embeddings (OpenAI/Harvard): MRL applied to model weights
- FastBERT / DeeBERT: Confidence-based early exit
- Mixture of Depths (Google DeepMind): Dynamic FLOP allocation
Use the `--phase` flag in `scripts/fractal_train.py`:

| Phase | Command | What Trains | Matryoshka |
|---|---|---|---|
| 1 | `--phase=1` | Mamba only (frozen attn/mlp) | ✗ Off |
| 2 | `--phase=2` | All + Gates | ✓ On |
| 3 | `--phase=3` | Expanded weights | ✓ On |
```bash
# Phase 1: Integrate Mamba (freeze attention)
torchrun -m scripts.fractal_train --phase=1 --checkpoint=phase1.pt

# Phase 2: Matryoshka + Gates (unfreeze all)
torchrun -m scripts.fractal_train --phase=2 --checkpoint=phase2.pt

# Phase 3: After expansion surgery
torchrun -m scripts.fractal_train --phase=3 --checkpoint=phase3.pt
```

Mamba (Stage 1): uses zero-init ✓ (correct, nothing to copy).
MLP/Attention expansion (Stage 2/3): uses LoRA-style initialization:

```python
# Instead of zeros, new dims = A @ B (low-rank, small init)
expand_weight_lora(weight, target_size, dim, rank=16, std=0.01)
```
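The call above refers to `scripts/surgery.py`; the function body below is only a sketch of what LoRA-style expansion of a single weight matrix could look like (simplified signature, illustrative):

```python
import torch

def expand_weight_lora_sketch(weight: torch.Tensor, new_out: int, new_in: int,
                              rank: int = 16, std: float = 0.01) -> torch.Tensor:
    """Illustrative LoRA-style expansion: trained weights are copied verbatim and the
    new rows/columns are filled with a small low-rank A @ B product instead of zeros."""
    old_out, old_in = weight.shape
    expanded = torch.zeros(new_out, new_in, dtype=weight.dtype, device=weight.device)
    expanded[:old_out, :old_in] = weight                       # preserve trained weights
    A = torch.randn(new_out, rank, dtype=weight.dtype, device=weight.device) * std
    B = torch.randn(rank, new_in, dtype=weight.dtype, device=weight.device) * std
    low_rank = A @ B                                           # [new_out, new_in], small init
    expanded[old_out:, :] = low_rank[old_out:, :]              # new output dims
    expanded[:old_out, old_in:] = low_rank[:old_out, old_in:]  # new input dims
    return expanded
```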
Problem solved: MHA weights are stored head-by-head as `[Head1 | Head2 | ... | Head16]`, so naively appending new dims at the end would put them all after the last head rather than inside each head.

Solution: `expand_attention_interleaved()` expands each head's dims separately:

```
[Head1_128+32 | Head2_128+32 | ... | Head16_128+32]   ← CORRECT
```
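A sketch of the per-head idea (illustrative, for a single per-head projection such as the query weight; new rows are left at zero here, whereas the real code could reuse the LoRA-style init):

```python
import torch

def expand_attention_interleaved_sketch(proj_weight: torch.Tensor, n_heads: int,
                                        old_head_dim: int, new_head_dim: int) -> torch.Tensor:
    """Illustrative per-head expansion: grow each head's slice from old_head_dim to
    new_head_dim in place, instead of appending all new rows after the last head."""
    out_features, in_features = proj_weight.shape
    assert out_features == n_heads * old_head_dim
    heads = proj_weight.view(n_heads, old_head_dim, in_features)
    expanded = torch.zeros(n_heads, new_head_dim, in_features,
                           dtype=proj_weight.dtype, device=proj_weight.device)
    expanded[:, :old_head_dim, :] = heads            # copy each head's trained rows
    return expanded.reshape(n_heads * new_head_dim, in_features)
```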
Functions in `scripts/surgery.py`:
- `expand_weight_lora()` - LoRA-style low-rank initialization
- `expand_attention_interleaved()` - Per-head dimension expansion
Based on nanochat by Andrej Karpathy