Benchmark LLMs on real professional tasks, not academic puzzles. A YAML-driven experiment pipeline and live React dashboard for the GDPVal Gold Subset (220 tasks across 11 industries).
microsoft dashboard evaluation openai mlops vite github-actions huggingface yaml-pipeline azure-openai ai-evaluation code-interpreter gpt-5 anthropic benchmark-automation llm-benchmark gdpval real-world-tasks self-qa professional-tasks
Updated Mar 3, 2026 - Python