Extract training data from your Claude Code sessions.
Every time you use Claude Code, you generate valuable training signal:
- Task completions — Real problems solved in real codebases
- Corrections — When you guide Claude back on track, that's gold
- Tool usage patterns — File operations, git commands, test runs
- Domain knowledge — Your codebase patterns, preferences, workflows
This data just sits in ~/.claude/projects/, unused. Meanwhile, fine-tuning requires exactly this kind of high-quality, task-specific data.
Distiller reads your Claude Code sessions, scores them for quality, and exports training-ready datasets:
You get:
- Quality-tiered outputs — High/good/diverse quality buckets
- PII removal — API keys, emails, secrets automatically redacted
- Rich metadata — Turn counts, tool usage, quality scores for filtering
- ChatML format — Ready for fine-tuning pipelines
┌────────────────────────────────────────────────────────────┐
│                                                            │
│  Claude Code        Distiller Pipeline           Output    │
│  ───────────        ──────────────────           ──────    │
│                                                            │
│  ~/.claude/  ──▶  Ingest ──▶ Clean ──▶ Score ──▶  JSONL    │
│  projects/        (parse)    (PII)     (quality) (tier)    │
│                                                            │
└────────────────────────────────────────────────────────────┘
Pipeline stages:
- Ingest — Reads session files, extracts conversations
- Clean — Redacts emails, API keys, secrets, sensitive paths
- Score — Rates quality using heuristics + LLM judge
- Export — Creates tiered ChatML datasets
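To make the Clean stage concrete: redaction is essentially pattern matching over message text. Here is a minimal sketch with simplified, illustrative regexes; it is not Distiller's implementation, whose actual patterns live in config/pii_patterns.yaml:

```python
import re

# Simplified example patterns; the real ones are defined in config/pii_patterns.yaml.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "api_key": re.compile(r"sk-[A-Za-z0-9-]{20,}"),
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "home_path": re.compile(r"/Users/[^/\s]+"),
}

def redact(text: str) -> str:
    """Replace anything matching a PII pattern with a typed placeholder."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{name.upper()}]", text)
    return text

print(redact("Email me at dev@example.com, key sk-or-v1-abc123def456ghi789"))
# -> Email me at [REDACTED_EMAIL], key [REDACTED_API_KEY]
```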
Quality scoring combines two methods:
| Method | What it measures |
|---|---|
| Heuristic (fast, free) | Tool success rate, response depth, task completion signals |
| LLM Judge (via OpenRouter) | Task completion, tool effectiveness, code quality (1-5 scales) |
Scores are normalized to 0-1 and combined. By default, conversations scoring below 0.6 are filtered out, but this threshold is configurable.
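For intuition, the combination step puts both scores on the same 0-1 scale, blends them, and applies the cutoff. A minimal sketch; the weighting and the mapping of the 1-5 judge scale are illustrative assumptions, not Distiller's exact formula:

```python
def combine_scores(heuristic: float, llm_judge: float | None,
                   llm_weight: float = 0.6, min_score: float = 0.6) -> tuple[float, bool]:
    """Blend a 0-1 heuristic score with an optional 1-5 LLM judge score.

    Illustrative only -- the real weighting lives in Distiller's scoring code.
    """
    if llm_judge is None:                        # --no-llm-judge: heuristic only
        overall = heuristic
    else:
        judge_normalized = (llm_judge - 1) / 4   # map the 1-5 scale onto 0-1
        overall = (1 - llm_weight) * heuristic + llm_weight * judge_normalized
    return overall, overall >= min_score         # (score, keep this conversation?)

score, keep = combine_scores(heuristic=0.75, llm_judge=5.0)
print(f"overall={score:.2f} keep={keep}")        # overall=0.90 keep=True
```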
Quick start:

# Clone and install
git clone https://github.com/roach88/distiller.git
cd distiller
uv venv && source .venv/bin/activate
uv pip install -e .
# Set up LLM scoring (optional but recommended)
echo "OPENROUTER_API_KEY=your-key-here" > .env
# Initialize
distiller init
# Extract training data
distiller run

Your training data will be in output/.

Requirements:
- Python 3.10+
- uv (recommended) or pip
- macOS or Linux
- OpenRouter API key (optional, for LLM scoring — free tier at openrouter.ai/keys)
1. Clone the repository

   git clone https://github.com/roach88/distiller.git
   cd distiller

2. Create virtual environment and install

   uv venv && source .venv/bin/activate
   uv pip install -e .

   # Optional: S3 support
   uv pip install -e ".[s3]"

3. Configure LLM scoring (optional)

   echo "OPENROUTER_API_KEY=sk-or-v1-your-key" > .env

   Without this, Distiller uses heuristic scoring only (still works, just less accurate).

4. Initialize

   distiller init

   This creates the database, imports existing sessions, and sets up a weekly cron job.

5. Add to PATH (for global access)

   The distiller command is available when the venv is active. To use it from anywhere without activating, run this from inside the repo:

   # Add to your shell config (run from inside the distiller directory)
   echo "export PATH=\"\$PATH:$PWD/.venv/bin\"" >> ~/.zshrc
   source ~/.zshrc

   For bash, use ~/.bashrc instead of ~/.zshrc.
Commands:

distiller                      # Show help
distiller init                 # First-time setup
distiller run                  # Extract with LLM scoring
distiller run --no-llm-judge   # Heuristic only (faster, free)
distiller status               # Database stats and recent outputs
distiller cron                 # View/update weekly schedule
distiller reprocess            # Re-extract everything
distiller paths                # Show file locations

# Project filter
distiller run -p myproject
# Quality threshold (default: 0.6)
distiller run --min-score 0.8
# Upload to S3
distiller run --s3-bucket my-training-bucket

Distiller sets up a weekly cron job during init:
# View current schedule
distiller cron
# Change to Fridays at 9am
distiller cron --hour 9 --day 5
# Disable
distiller cron --remove

Training data is saved as ChatML-format JSONL:
output/
├── training_high_quality.jsonl   # score ≥ 0.8
├── training_good_quality.jsonl   # score 0.7-0.8
└── training_diverse.jsonl        # score 0.6-0.7
Each example:
{
"messages": [
{"role": "system", "content": "You are an expert AI coding assistant..."},
{"role": "user", "content": "Fix the auth bug in login.ts"},
{"role": "assistant", "content": "I'll investigate...", "tool_calls": [...]},
{"role": "tool", "tool_call_id": "call_123", "name": "Read", "content": "..."}
],
"metadata": {
"conversation_id": "session_abc123",
"project": "my-app",
"turn_count": 12,
"tool_call_count": 8,
"quality_scores": {
"heuristic": {"overall": 0.75},
"llm_judge": {"overall": 0.85, "reasoning": "..."},
"overall": 0.80
}
}
}

Environment variables:

| Variable | Description | Default |
|---|---|---|
| OPENROUTER_API_KEY | LLM scoring API key | (none) |
| DISTILLER_DB_PATH | Database location | data/raw_logs.db |
| DISTILLER_S3_BUCKET | S3 upload bucket | (none) |
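Because each example carries its quality scores in metadata, the exported files can also be re-sliced with a custom cutoff instead of the three fixed tiers. A minimal sketch in Python, assuming the output layout and metadata fields shown in the format example above:

```python
import json
from pathlib import Path

def load_examples(output_dir: str = "output", min_overall: float = 0.85) -> list[dict]:
    """Read the exported JSONL files and keep examples above a custom threshold."""
    kept = []
    for path in sorted(Path(output_dir).glob("training_*.jsonl")):
        with path.open() as f:
            for line in f:
                example = json.loads(line)
                if example["metadata"]["quality_scores"]["overall"] >= min_overall:
                    kept.append(example)
    return kept

examples = load_examples()
print(f"{len(examples)} examples scored at or above 0.85")
```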
To configure LLM scoring, edit config/scoring.yaml:
provider: "openrouter" # or "ollama" for local
model: "anthropic/claude-3-haiku"
scoring:
  temperature: 0.1
  max_tokens: 500
  timeout: 30

For fully local, free scoring:
brew install ollama
ollama pull llama3

Then in config/scoring.yaml:
provider: "ollama"
model: "llama3"uv pip install -e ".[s3]"
aws configure
distiller run --s3-bucket my-bucketCheck that Claude Code has created sessions:
ls ~/.claude/projects/

If empty, use Claude Code for a while first.
If LLM scoring isn't working, either create .env with your OpenRouter key, or run without LLM scoring:
distiller run --no-llm-judge

If too few conversations make it into the output:

- Lower the threshold: --min-score 0.4
- Try a different scoring model in config/scoring.yaml
- Check that conversations have meaningful exchanges (not just single-turn)
To reset the database and re-import everything:

rm data/raw_logs.db
distiller init

For development:

uv pip install -e ".[dev]"
pytest
# Regenerate BAML client after editing baml_src/
baml-cli generate

distiller/
├── cli.py                # CLI entry point
├── llm_providers.py      # OpenRouter/Ollama abstraction
├── scripts/
│   ├── ingest.py         # Session import
│   ├── parse.py          # Conversation extraction
│   ├── clean_pii.py      # PII redaction
│   ├── score_quality.py
│   └── export_chatml.py
├── baml_src/             # LLM scoring prompts
├── baml_client/          # Generated client
└── config/
    ├── scoring.yaml
    └── pii_patterns.yaml
MIT