[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
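A scorer for this kind of epistemic-reliability benchmark generally has to reward correct answers, reward abstention on unanswerable questions, and penalize confident answers to them. A minimal sketch, assuming a simple illustrative item schema (`answerable`, `gold`, `prediction`) that is not this benchmark's actual format:

```python
# Minimal sketch of an epistemic-reliability score. Field names are
# illustrative assumptions, not the benchmark's real schema.

ABSTAIN_MARKERS = ("i don't know", "i am not sure", "cannot be determined")

def is_abstention(text: str) -> bool:
    t = text.lower()
    return any(marker in t for marker in ABSTAIN_MARKERS)

def score(items: list[dict]) -> dict:
    correct = abstained_ok = hallucinated = 0
    for item in items:
        pred = item["prediction"]
        if item["answerable"]:
            correct += int(not is_abstention(pred)
                           and pred.strip().lower() == item["gold"].lower())
        elif is_abstention(pred):
            abstained_ok += 1
        else:
            hallucinated += 1  # confident answer to an unanswerable question
    n_answerable = max(1, sum(i["answerable"] for i in items))
    n_unanswerable = max(1, sum(not i["answerable"] for i in items))
    return {
        "accuracy_on_answerable": correct / n_answerable,
        "abstention_on_unanswerable": abstained_ok / n_unanswerable,
        "hallucination_rate": hallucinated / len(items),
    }

if __name__ == "__main__":
    demo = [
        {"answerable": True,  "gold": "Paris", "prediction": "Paris"},
        {"answerable": False, "gold": "",      "prediction": "I don't know."},
        {"answerable": False, "gold": "",      "prediction": "42 degrees."},
    ]
    print(score(demo))
```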
CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
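As a rough illustration of the YAML-driven style of pipeline described above, the spec and loader below use made-up field names and model entries (not the project's actual schema); they only show how a declarative experiment file might drive a run. Requires PyYAML.

```python
# Hypothetical YAML experiment spec and loader; all keys/values are illustrative.
import yaml

SPEC = """
experiment: gdpval-gold-subset
models:
  - name: gpt-4o-mini
    temperature: 0.2
  - name: claude-3-5-haiku
    temperature: 0.2
tasks:
  glob: tasks/*.json      # 220 tasks across 11 industries
metrics: [rubric_score, completion_time]
"""

def load_spec(text: str) -> dict:
    spec = yaml.safe_load(text)
    # Fail early on a malformed spec instead of mid-run.
    for key in ("experiment", "models", "tasks", "metrics"):
        if key not in spec:
            raise ValueError(f"missing required key: {key}")
    return spec

if __name__ == "__main__":
    spec = load_spec(SPEC)
    for model in spec["models"]:
        print(f"would evaluate {model['name']} on {spec['tasks']['glob']}")
```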
Testing how well LLMs can solve jigsaw puzzles
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).
From Benchmarks to Architecture — We tested 30+ AI APIs, designed routing from the data, then Anthropic published the same patterns. 3-part series.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Comprehensive benchmark of OpenRouter free-tier LLMs for practical applications. Evaluates models for coding, Thai language, and general use.
Benchmark LLMs' Spatial Reasoning with Head-to-Head Bananagrams
Claude Code skill that pits Claude, ChatGPT, and Gemini against each other, then lets them cross-judge each other blind
A decentralized, adversarial + dynamic AI evaluation protocol on Bittensor. Combats benchmark saturation by measuring genuine intelligence through dynamic, zero-shot generalization tasks.
Is it better to run a Tiny Model (2B-4B) at High Precision (FP16/INT8), or a Large Model (8B+) at Low Precision (INT4)? This benchmark framework lets developers scientifically choose the best model for resource-constrained environments (consumer GPUs, laptops, edge devices) by measuring the trade-off between Speed and Intelligence.
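A minimal sketch of the measurement such a framework performs, treating the model as an opaque `generate` callable (a stand-in for, e.g., a llama.cpp or transformers wrapper) and using a toy substring-match quality check; this is illustrative, not the framework's actual harness:

```python
import time
from typing import Callable

def benchmark_model(generate: Callable[[str], str],
                    eval_set: list[tuple[str, str]]) -> dict:
    """Measure the speed/quality trade-off for one (model, precision) setting."""
    correct, total_chars, total_time = 0, 0, 0.0
    for prompt, expected in eval_set:
        start = time.perf_counter()
        output = generate(prompt)          # any prompt -> completion callable
        total_time += time.perf_counter() - start
        total_chars += len(output)
        correct += int(expected.lower() in output.lower())  # crude quality proxy
    return {
        "accuracy": correct / len(eval_set),
        "avg_latency_s": total_time / len(eval_set),
        "chars_per_s": total_chars / total_time if total_time else 0.0,
    }

if __name__ == "__main__":
    # Dummy "model" standing in for e.g. a 2B FP16 or an 8B INT4 runner.
    dummy = lambda prompt: "The capital of France is Paris."
    tasks = [("What is the capital of France?", "Paris")]
    print(benchmark_model(dummy, tasks))
```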
The hundred-eyed watcher for your LLM providers. Monitor uptime, TTFT, TPS, and latency across OpenAI, Anthropic, Azure, Bedrock, Ollama, LM Studio, and 100+ providers through a single dashboard. Benchmark, compare, and get alerts — all self-hosted.
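TTFT and TPS are normally derived from a streaming response. The provider-agnostic sketch below measures them over any token iterator; `stream_tokens` is a dummy stand-in for a real streaming client, not any particular provider's API:

```python
import time
from typing import Iterable, Iterator

def measure_stream(tokens: Iterable[str]) -> dict:
    """Compute TTFT, TPS, and total latency from any token stream."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = (first_token_at - start) if first_token_at else float("nan")
    gen_time = end - (first_token_at or start)
    return {
        "ttft_s": ttft,
        "tokens_per_s": count / gen_time if gen_time > 0 else 0.0,
        "total_s": end - start,
        "tokens": count,
    }

def stream_tokens() -> Iterator[str]:
    # Dummy stand-in for a real streaming client (OpenAI, Anthropic, Ollama, ...).
    time.sleep(0.2)               # simulated time to first token
    for tok in "Hello from a simulated model".split():
        time.sleep(0.05)          # simulated inter-token latency
        yield tok

if __name__ == "__main__":
    print(measure_stream(stream_tokens()))
```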
Yes, LLMs just regurgitate the same jokes from the internet over and over again. But some are slightly funnier than others.
🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty with the ERR-EVAL benchmark for reliable epistemic reasoning.
GateBench is a challenging benchmark for Vision Language Models (VLMs) that tests visual reasoning by requiring models to extract boolean algebra expressions from logic gate circuit diagrams.
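Scoring that task reduces to checking logical equivalence between the predicted and reference expressions; a small sketch using Python-style boolean operators (an assumption about the answer format, not GateBench's actual one):

```python
from itertools import product
import re

def truth_table(expr: str, variables: list[str]) -> tuple[bool, ...]:
    """Evaluate a boolean expression (Python and/or/not) over all input assignments."""
    rows = []
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        rows.append(bool(eval(expr, {"__builtins__": {}}, env)))  # expr is trusted benchmark data
    return tuple(rows)

def equivalent(pred: str, gold: str) -> bool:
    variables = sorted(set(re.findall(r"[A-Za-z_]\w*", gold)) - {"and", "or", "not"})
    try:
        return truth_table(pred, variables) == truth_table(gold, variables)
    except Exception:
        return False  # unparseable prediction counts as wrong

if __name__ == "__main__":
    gold = "(A and B) or not C"
    print(equivalent("not C or (B and A)", gold))   # True: same truth table
    print(equivalent("A and (B or C)", gold))       # False
```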
Compare how vision models reason about images — not just their accuracy scores