Benchmarks measuring TOON's token efficiency and retrieval accuracy compared to JSON, XML, YAML, and CSV.
> [!NOTE]
> Results are automatically embedded in the main README. This guide focuses on running the benchmarks locally.
```sh
# Run token efficiency benchmark
pnpm benchmark:token-efficiency

# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy
```

The token efficiency benchmark measures token count reduction across JSON, XML, YAML, CSV, and TOON:
- Generate datasets (GitHub repos, analytics, orders)
- Convert to all formats (TOON, JSON, XML, YAML, CSV)
- Tokenize using `gpt-tokenizer` (`o200k_base` encoding)
- Calculate savings and generate report

```sh
pnpm benchmark:token-efficiency
```

Results are saved to `results/token-efficiency.md`.
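The reported savings are relative to the JSON token count for the same dataset. A minimal sketch of that calculation (function and field names here are illustrative, not the benchmark's actual code):

```typescript
// Illustrative only: compute percentage token savings vs. the JSON baseline.
type TokenCounts = Record<string, number>

export function tokenSavings(counts: TokenCounts, baseline = 'json'): Record<string, number> {
  const base = counts[baseline]
  if (!base) throw new Error(`missing baseline format: ${baseline}`)
  const savings: Record<string, number> = {}
  for (const [format, tokens] of Object.entries(counts)) {
    // Positive values mean the format uses fewer tokens than the baseline.
    savings[format] = Math.round(((base - tokens) / base) * 1000) / 10
  }
  return savings
}

// Example: a format at 600 tokens vs. JSON at 1000 is a 40% saving.
console.log(tokenSavings({ json: 1000, toon: 600, csv: 550 }))
```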
The retrieval accuracy benchmark tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):
- Generate ~150-160 questions across 4 datasets
- Convert each dataset to all 6 formats
- Query each LLM with formatted data + question
- Validate answers using `gpt-5-nano` as judge
- Aggregate metrics and generate report
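The aggregation step boils down to the share of judge-approved answers per format. A hedged sketch of that idea (the `Verdict` shape is assumed for illustration, not taken from the source):

```typescript
// Illustrative only: aggregate judge verdicts into per-format accuracy.
interface Verdict {
  format: string   // e.g. 'toon', 'json', 'csv'
  correct: boolean // the judge's decision for one question
}

export function accuracyByFormat(verdicts: Verdict[]): Record<string, number> {
  const totals: Record<string, { correct: number; total: number }> = {}
  for (const v of verdicts) {
    const t = (totals[v.format] ??= { correct: 0, total: 0 })
    t.total++
    if (v.correct) t.correct++
  }
  return Object.fromEntries(
    Object.entries(totals).map(([format, t]) => [format, t.correct / t.total])
  )
}
```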
- Edit `src/evaluate.ts` and add models to the exported `models` array:

  ```ts
  export const models: LanguageModelV2[] = [
    openai('gpt-5-nano'),
    anthropic('claude-haiku-4-5-20251001'),
    google('gemini-2.5-flash'),
    xai('grok-4-fast-non-reasoning'),
    // Add your models here
  ]
  ```
- Duplicate `.env.example` to `.env` and add your API keys:

  ```sh
  cp .env.example .env
  ```
```sh
# Full benchmark
pnpm benchmark:accuracy

# Dry run (10 questions only, for testing setup)
DRY_RUN=true pnpm benchmark:accuracy
```

Running the script will:
- Prompt you to select which models to test.
- Skip models with existing results (rerun to overwrite).
- Show progress with rate limiting.
- Save results to `results/accuracy/models/{model-id}.json`.
- Generate report at `results/retrieval-accuracy.md`.
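Skipping models with existing results presumably reduces to checking for the per-model JSON file before querying. A minimal sketch under that assumption (the path helper and the slash-flattening rule are hypothetical, not the benchmark's actual code):

```typescript
// Illustrative only: skip a model when its results file already exists.
import { existsSync } from 'node:fs'
import { join } from 'node:path'

export function resultsPath(modelId: string, root = 'results/accuracy/models'): string {
  // Hypothetical: flatten ids like 'openai/gpt-5-nano' into a safe filename.
  return join(root, `${modelId.replace(/\//g, '-')}.json`)
}

export function hasExistingResults(modelId: string, root?: string): boolean {
  return existsSync(resultsPath(modelId, root))
}
```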
Edit `src/constants.ts` to adjust:

- `MODEL_RPM_LIMITS` – Rate limits per model
- `DEFAULT_CONCURRENCY` – Parallel tasks (default: 10)
- `DRY_RUN_LIMITS` – Questions per dry run (default: 10)
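A per-model RPM limit translates into a minimum spacing between consecutive requests to that model. A sketch of the arithmetic (the function name is illustrative):

```typescript
// Illustrative only: convert a requests-per-minute limit into the minimum
// delay (in ms) to wait between consecutive requests to one model.
export function minDelayMs(rpm: number): number {
  if (rpm <= 0) throw new Error('rpm must be positive')
  return Math.ceil(60_000 / rpm)
}

// e.g. a 120 RPM limit means at most one request every 500 ms.
```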
```
scripts/
├── accuracy-benchmark.ts         # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts # Token counting benchmark
└── fetch-github-repos.ts         # Update GitHub dataset
src/
├── constants.ts   # Configuration
├── datasets.ts    # Test data generators
├── evaluate.ts    # LLM evaluation
├── formatters.ts  # Format converters
├── questions.ts   # Question generation
├── report.ts      # Markdown reports
├── storage.ts     # Result caching
└── utils.ts       # Helpers
data/
└── github-repos.json # Top 100 GitHub repos
results/
├── token-efficiency.md   # Token savings report
├── retrieval-accuracy.md # Accuracy report
└── accuracy/models/      # Per-model results (JSON)
```