[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
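A scorer for this kind of epistemic-reliability benchmark generally has to reward correct answers, reward abstention on unanswerable questions, and penalize confident answers to them. A minimal sketch, assuming a simple illustrative item schema (`answerable`, `gold`, `prediction`) that is not this benchmark's actual format:

```python
# Minimal sketch of an epistemic-reliability score. Field names are
# illustrative assumptions, not the benchmark's real schema.

ABSTAIN_MARKERS = ("i don't know", "i am not sure", "cannot be determined")

def is_abstention(text: str) -> bool:
    t = text.lower()
    return any(marker in t for marker in ABSTAIN_MARKERS)

def score(items: list[dict]) -> dict:
    correct = abstained_ok = hallucinated = 0
    for item in items:
        pred = item["prediction"]
        if item["answerable"]:
            correct += int(not is_abstention(pred)
                           and pred.strip().lower() == item["gold"].lower())
        elif is_abstention(pred):
            abstained_ok += 1
        else:
            hallucinated += 1  # confident answer to an unanswerable question
    n_answerable = max(1, sum(i["answerable"] for i in items))
    n_unanswerable = max(1, sum(not i["answerable"] for i in items))
    return {
        "accuracy_on_answerable": correct / n_answerable,
        "abstention_on_unanswerable": abstained_ok / n_unanswerable,
        "hallucination_rate": hallucinated / len(items),
    }

if __name__ == "__main__":
    demo = [
        {"answerable": True,  "gold": "Paris", "prediction": "Paris"},
        {"answerable": False, "gold": "",      "prediction": "I don't know."},
        {"answerable": False, "gold": "",      "prediction": "42 degrees."},
    ]
    print(score(demo))
```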
CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
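As a rough illustration of the YAML-driven style of pipeline described above, the spec and loader below use made-up field names and model entries (not the project's actual schema); they only show how a declarative experiment file might drive a run. Requires PyYAML.

```python
# Hypothetical YAML experiment spec and loader; all keys/values are illustrative.
import yaml

SPEC = """
experiment: gdpval-gold-subset
models:
  - name: gpt-4o-mini
    temperature: 0.2
  - name: claude-3-5-haiku
    temperature: 0.2
tasks:
  glob: tasks/*.json      # 220 tasks across 11 industries
metrics: [rubric_score, completion_time]
"""

def load_spec(text: str) -> dict:
    spec = yaml.safe_load(text)
    # Fail early on a malformed spec instead of mid-run.
    for key in ("experiment", "models", "tasks", "metrics"):
        if key not in spec:
            raise ValueError(f"missing required key: {key}")
    return spec

if __name__ == "__main__":
    spec = load_spec(SPEC)
    for model in spec["models"]:
        print(f"would evaluate {model['name']} on {spec['tasks']['glob']}")
```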
Testing how well LLMs can solve jigsaw puzzles
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).
From Benchmarks to Architecture — We tested 30+ AI APIs, designed routing from the data, then Anthropic published the same patterns. 3-part series.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Comprehensive benchmark of OpenRouter free-tier LLMs for practical applications. Evaluates models for coding, Thai language, and general use.
Benchmark LLMs' Spatial Reasoning with Head-to-Head Bananagrams
Claude Code skill that pits Claude, ChatGPT, and Gemini against each other, then lets them cross-judge each other blind
A decentralized, adversarial + dynamic AI evaluation protocol on Bittensor. Combats benchmark saturation by measuring genuine intelligence through dynamic, zero-shot generalization tasks.
Is it better to run a Tiny Model (2B-4B) at High Precision (FP16/INT8), or a Large Model (8B+) at Low Precision (INT4)? This benchmark framework lets developers scientifically choose the best model for resource-constrained environments (consumer GPUs, laptops, edge devices) by measuring the trade-off between Speed and Intelligence.
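A minimal sketch of the measurement such a framework performs, treating the model as an opaque `generate` callable (a stand-in for, e.g., a llama.cpp or transformers wrapper) and using a toy substring-match quality check; this is illustrative, not the framework's actual harness:

```python
import time
from typing import Callable

def benchmark_model(generate: Callable[[str], str],
                    eval_set: list[tuple[str, str]]) -> dict:
    """Measure the speed/quality trade-off for one (model, precision) setting."""
    correct, total_chars, total_time = 0, 0, 0.0
    for prompt, expected in eval_set:
        start = time.perf_counter()
        output = generate(prompt)          # any prompt -> completion callable
        total_time += time.perf_counter() - start
        total_chars += len(output)
        correct += int(expected.lower() in output.lower())  # crude quality proxy
    return {
        "accuracy": correct / len(eval_set),
        "avg_latency_s": total_time / len(eval_set),
        "chars_per_s": total_chars / total_time if total_time else 0.0,
    }

if __name__ == "__main__":
    # Dummy "model" standing in for e.g. a 2B FP16 or an 8B INT4 runner.
    dummy = lambda prompt: "The capital of France is Paris."
    tasks = [("What is the capital of France?", "Paris")]
    print(benchmark_model(dummy, tasks))
```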
The hundred-eyed watcher for your LLM providers. Monitor uptime, TTFT, TPS, and latency across OpenAI, Anthropic, Azure, Bedrock, Ollama, LM Studio, and 100+ providers through a single dashboard. Benchmark, compare, and get alerts — all self-hosted.
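TTFT and TPS are normally derived from a streaming response. The provider-agnostic sketch below measures them over any token iterator; `stream_tokens` is a dummy stand-in for a real streaming client, not any particular provider's API:

```python
import time
from typing import Iterable, Iterator

def measure_stream(tokens: Iterable[str]) -> dict:
    """Compute TTFT, TPS, and total latency from any token stream."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = (first_token_at - start) if first_token_at else float("nan")
    gen_time = end - (first_token_at or start)
    return {
        "ttft_s": ttft,
        "tokens_per_s": count / gen_time if gen_time > 0 else 0.0,
        "total_s": end - start,
        "tokens": count,
    }

def stream_tokens() -> Iterator[str]:
    # Dummy stand-in for a real streaming client (OpenAI, Anthropic, Ollama, ...).
    time.sleep(0.2)               # simulated time to first token
    for tok in "Hello from a simulated model".split():
        time.sleep(0.05)          # simulated inter-token latency
        yield tok

if __name__ == "__main__":
    print(measure_stream(stream_tokens()))
```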
Yes, LLMs just regurgitate the same jokes from the internet over and over again. But some are slightly funnier than others.
🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty with the ERR-EVAL benchmark for reliable epistemic reasoning.
GateBench is a challenging benchmark for Vision Language Models (VLMs) that tests visual reasoning by requiring models to extract boolean algebra expressions from logic gate circuit diagrams.
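Scoring that task reduces to checking logical equivalence between the predicted and reference expressions; a small sketch using Python-style boolean operators (an assumption about the answer format, not GateBench's actual one):

```python
from itertools import product
import re

def truth_table(expr: str, variables: list[str]) -> tuple[bool, ...]:
    """Evaluate a boolean expression (Python and/or/not) over all input assignments."""
    rows = []
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        rows.append(bool(eval(expr, {"__builtins__": {}}, env)))  # expr is trusted benchmark data
    return tuple(rows)

def equivalent(pred: str, gold: str) -> bool:
    variables = sorted(set(re.findall(r"[A-Za-z_]\w*", gold)) - {"and", "or", "not"})
    try:
        return truth_table(pred, variables) == truth_table(gold, variables)
    except Exception:
        return False  # unparseable prediction counts as wrong

if __name__ == "__main__":
    gold = "(A and B) or not C"
    print(equivalent("not C or (B and A)", gold))   # True: same truth table
    print(equivalent("A and (B or C)", gold))       # False
```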
Compare how vision models reason about images — not just their accuracy scores