A production-grade evaluation and continuous improvement system for AI agents and agentic workflows.
Most AI agents work great in demos but fail in production. Why?
- No systematic evaluation - Teams rely on vibes instead of metrics
- Prompt rot - Agent performance degrades as use cases expand
- Blind spots - Edge cases that testing never catches
- No feedback loop - No mechanism to learn from failures
This framework provides a structured approach to continuously evaluate and improve your agents.
Every agent has latent potential unlocked through:
- Better Prompts - Specific instructions with examples and reasoning frameworks
- Continuous Evaluation - Testing reveals blind spots humans miss
- Cross-Agent Learning - Patterns that work for one agent help others
- Systematic Improvement - Small, validated changes compound over time
Key features:
- 5-Dimension Evaluation Model - Comprehensive scoring across clarity, completeness, reasoning, actionability, and alignment
- Automated Training Loop - Analyze → Evaluate → Diagnose → Prescribe → Validate → Track
- Prompt Optimization - Data-driven prompt improvements with validation
- Memory Integration - Track improvements and regressions over time
- Multi-Agent Support - Train entire agent fleets simultaneously
Install with pip:

```bash
pip install agent-eval-framework
```

Or from source:

```bash
git clone https://github.com/yourusername/agent-eval-framework.git
cd agent-eval-framework
pip install -e .
```

Evaluate an agent prompt against a couple of test cases:

```python
from agent_eval import AgentEvaluator, EvalDimension

# Create evaluator
evaluator = AgentEvaluator()
# Define your agent's prompt
agent_prompt = """
You are a helpful assistant that answers questions about Python.
Be concise and provide code examples when relevant.
"""
# Evaluate the prompt
results = evaluator.evaluate(
    prompt=agent_prompt,
    test_cases=[
        {"input": "How do I read a file?", "expected_traits": ["code example", "with statement"]},
        {"input": "Explain decorators", "expected_traits": ["function wrapper", "@ syntax"]},
    ]
)

print(f"Overall Score: {results.overall_score}/10")
for dim, score in results.dimension_scores.items():
    print(f"  {dim.name}: {score}/10")
```

Train a single agent, applying the improved prompt automatically when validation passes:

```python
from agent_eval import AgentTrainer

trainer = AgentTrainer()
# Train a single agent
improved_prompt, report = trainer.train(
    agent_name="python_assistant",
    current_prompt=agent_prompt,
    test_cases=test_cases,
    apply_if_improved=True,  # Auto-apply if validation passes
)

print(report)
```

Train an entire fleet of agents against a shared test suite:

```python
from agent_eval import train_all

results = train_all(
    agents={
        "python_assistant": python_prompt,
        "code_reviewer": reviewer_prompt,
        "doc_writer": doc_prompt,
    },
    shared_test_suite=common_tests,
)

# See which agents need attention
for agent, result in results.items():
    if result.needs_improvement:
        print(f"⚠️ {agent}: {result.weakest_dimension}")
```

Agents are scored across 5 dimensions (0-10 scale):
| Dimension | Description | What It Measures |
|---|---|---|
| Clarity | Is the purpose crystal clear? | Unambiguous instructions, clear role definition |
| Completeness | Are edge cases handled? | Coverage of failure modes, graceful degradation |
| Reasoning | Does it show its work? | Chain-of-thought, explanation of decisions |
| Actionability | Are outputs immediately usable? | Concrete next steps, structured formats |
| Alignment | Does it serve true goals? | User intent vs stated task, value alignment |
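If you want a feel for how per-dimension scores can roll up into an overall score, here is a minimal sketch that assumes a weighted mean; the weights mirror the `dimensions_weights` option shown in the configuration example further down, but the framework's actual aggregation may differ, and all numbers below are made up.

```python
# Hypothetical illustration: combine per-dimension scores (0-10) into an overall
# score via a weighted mean. Scores and weights are example values, not real output.
dimension_scores = {
    "clarity": 7.0,
    "completeness": 5.5,
    "reasoning": 8.0,
    "actionability": 6.0,
    "alignment": 7.5,
}
weights = {
    "clarity": 1.0,
    "completeness": 1.2,
    "reasoning": 1.0,
    "actionability": 1.5,
    "alignment": 1.0,
}

overall = sum(dimension_scores[d] * weights[d] for d in dimension_scores) / sum(weights.values())
print(f"Overall Score: {overall:.1f}/10")  # ~6.7/10 with these example values
```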
The framework implements a 6-step improvement cycle:
```
┌─────────────────────────────────────────────────────────┐
│ TRAINING LOOP │
├─────────────────────────────────────────────────────────┤
│ │
│ 1. ANALYZE ──→ Read prompt, find assumptions │
│ ↓ │
│ 2. EVALUATE ──→ Run test cases, score dimensions │
│ ↓ │
│ 3. DIAGNOSE ──→ Find patterns, root causes │
│ ↓ │
│ 4. PRESCRIBE ──→ Generate targeted improvements │
│ ↓ │
│ 5. VALIDATE ──→ Test improved vs original │
│ ↓ │
│ 6. TRACK ──→ Save results, monitor trends │
│ ↓ │
│ [Loop back to 1 with new baseline] │
│ │
└─────────────────────────────────────────────────────────┘
```
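For orientation, here is a hypothetical sketch of how such a loop could be written against the evaluator API shown above. `AgentTrainer.train()` implements the real cycle; the `analyze`, `diagnose`, and `prescribe` helpers below are deliberately simplistic stand-ins, not framework API.

```python
# Hypothetical sketch of the six-step cycle; AgentTrainer.train() is the real thing.
def analyze(prompt):
    # 1. ANALYZE: read the prompt and note implicit assumptions (stubbed here)
    return [line for line in prompt.splitlines() if line.strip()]

def diagnose(results):
    # 3. DIAGNOSE: find the weakest-scoring dimension
    return min(results.dimension_scores, key=results.dimension_scores.get)

def prescribe(prompt, weak_dim):
    # 4. PRESCRIBE: a placeholder "improvement" targeting the weak dimension
    return prompt + f"\nBe explicit about {weak_dim.name.lower()} in every answer."

def training_loop(prompt, test_cases, evaluator, iterations=3):
    history = []
    for _ in range(iterations):
        analyze(prompt)
        results = evaluator.evaluate(prompt=prompt, test_cases=test_cases)          # 2. EVALUATE
        candidate = prescribe(prompt, diagnose(results))
        new_results = evaluator.evaluate(prompt=candidate, test_cases=test_cases)   # 5. VALIDATE
        history.append((results.overall_score, new_results.overall_score))          # 6. TRACK
        if new_results.overall_score > results.overall_score:
            prompt = candidate  # loop back to step 1 with the new baseline
    return prompt, history
```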
We've included 3 complete, runnable examples that solve real problems:
Problem: Support responses are robotic, miss emotional cues, and don't resolve issues.
```bash
python examples/use_case_1_customer_support.py
```

What you'll learn:
- Handling angry customers with empathy-first responses
- Structured output (Acknowledgment → Solution → Next Steps)
- Edge cases for urgent issues, cancellations, and feature requests
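To evaluate an agent like this yourself, you can feed the evaluator test cases in the same shape as the quick-start example. The prompt, inputs, and expected traits below are illustrative sketches, not the cases shipped in the example file.

```python
from agent_eval import AgentEvaluator

# Illustrative only: not the test cases from examples/use_case_1_customer_support.py.
support_prompt = """
You are a customer support agent. Acknowledge the customer's feelings,
propose a concrete solution, and end with clear next steps.
"""

support_test_cases = [
    {
        "input": "This is the THIRD time my order arrived broken. I want a refund NOW.",
        "expected_traits": ["acknowledges frustration", "apology", "refund next steps"],
    },
    {
        "input": "I need to cancel my subscription before the next billing date.",
        "expected_traits": ["confirms cancellation steps", "mentions billing date"],
    },
    {
        "input": "Could you add a dark mode? The app is blinding at night.",
        "expected_traits": ["thanks for feedback", "logs feature request"],
    },
]

results = AgentEvaluator().evaluate(prompt=support_prompt, test_cases=support_test_cases)
```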
Problem: Code reviews miss security vulnerabilities and give vague feedback.
```bash
python examples/use_case_2_code_review.py
```

What you'll learn:
- Security-first review priorities (SQL injection, memory leaks, race conditions)
- Severity-based output (CRITICAL → LOW)
- Always providing working fix code, not just descriptions
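A hedged sketch of what security-focused test cases for such a reviewer might look like; again, the code snippets and traits are illustrative, not the actual cases in the script.

```python
from agent_eval import AgentEvaluator

# Illustrative only: not the test cases from examples/use_case_2_code_review.py.
reviewer_prompt = """
You are a senior code reviewer. Prioritize security issues, label each finding
with a severity (CRITICAL to LOW), and always include working fix code.
"""

review_test_cases = [
    {
        "input": 'cursor.execute("SELECT * FROM users WHERE name = \'" + name + "\'")',
        "expected_traits": ["flags SQL injection", "CRITICAL severity", "parameterized query fix"],
    },
    {
        "input": "data = open('config.json').read()",
        "expected_traits": ["unclosed file handle", "suggests with statement", "working fix code"],
    },
]

results = AgentEvaluator().evaluate(prompt=reviewer_prompt, test_cases=review_test_cases)
```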
Problem: Cold emails get 2% response rates (industry average: 5-15%).
```bash
python examples/use_case_3_sales_email.py
```

What you'll learn:
- Personalization beyond "I saw you work at {company}"
- The 75-word maximum rule
- Follow-up and breakup email strategies
New to the framework? Start here:
```bash
python examples/quickstart_tutorial.py
```

Project layout:

```
agent-eval-framework/
├── src/
│ └── agent_eval/
│ ├── __init__.py # Public API exports
│ ├── dimensions.py # EvalDimension enum & scoring
│ ├── evaluator.py # Core AgentEvaluator class
│ ├── trainer.py # AgentTrainer with training loop
│ └── prompts.py # Prompt optimization utilities
├── examples/
│ ├── quickstart_tutorial.py # Interactive 5-min tutorial
│ ├── use_case_1_customer_support.py
│ ├── use_case_2_code_review.py
│ ├── use_case_3_sales_email.py
│ ├── basic_evaluation.py
│ └── continuous_training.py
├── tests/
└── .github/workflows/ci.yml        # Automated testing
```
Based on training hundreds of agents, these patterns consistently improve scores:
- Role Clarity - Start with a specific role definition
- Structured Output - Provide explicit output format templates
- Reasoning Frameworks - Add "First... Then... Finally..." type structures
- Examples - Include 1-2 examples of ideal outputs
- Edge Cases - Address common failure modes explicitly
- Self-Evaluation - Add criteria for the agent to check its own work
- User Alignment - Connect outputs to user's actual goals
For example, here is a weak prompt and a version rewritten with these patterns.

Before:

```
You are a helpful coding assistant. Answer programming questions.
```

After:

```
# ROLE
You are a Senior Software Engineer assistant specializing in Python.
# TASK
When answering programming questions:
1. First, clarify the user's actual goal (not just the stated question)
2. Provide a working code example
3. Explain WHY the solution works, not just WHAT it does
4. Mention common pitfalls to avoid
# OUTPUT FORMAT
## Solution
[Working code with comments]
## Explanation
[Why this approach works]
## Watch Out For
[Common mistakes with this pattern]
# SELF-CHECK
Before responding, verify:
- [ ] Code is syntactically correct
- [ ] Example is complete (can be copy-pasted)
- [ ] Explanation matches user's experience level
```

Configure improvement thresholds, dimension weights, and the LLM backend with `EvalConfig`:

```python
from agent_eval import AgentEvaluator, EvalConfig

config = EvalConfig(
    min_improvement_threshold=0.5,  # Only apply if +0.5 improvement
    max_training_iterations=5,      # Cap optimization rounds
    dimensions_weights={            # Custom dimension importance
        "clarity": 1.0,
        "completeness": 1.2,        # Weight completeness higher
        "reasoning": 1.0,
        "actionability": 1.5,       # Weight actionability higher
        "alignment": 1.0,
    },
    llm_provider="anthropic",       # or "openai", "ollama"
    llm_model="claude-sonnet-4-20250514",
)

evaluator = AgentEvaluator(config=config)
```

You can also hook evaluation into a LangGraph workflow by adding the evaluator as a node:

```python
from typing import TypedDict

from langgraph.graph import StateGraph
from agent_eval import AgentEvaluator

class AgentState(TypedDict):
    agent_output: str
    quality_score: float

# Add evaluation as a node in your graph
def evaluate_output(state: AgentState) -> AgentState:
    evaluator = AgentEvaluator()
    score = evaluator.quick_score(state["agent_output"])
    state["quality_score"] = score
    return state

graph = StateGraph(AgentState)  # StateGraph needs a state schema
graph.add_node("agent", agent_node)  # agent_node: your existing agent callable
graph.add_node("evaluate", evaluate_output)
graph.add_edge("agent", "evaluate")
```

On the roadmap:

- Web dashboard for training visualization
- A/B testing for prompt variants
- Automatic test case generation
- Integration with LangSmith, Weights & Biases
- Multi-turn conversation evaluation
- Custom dimension plugins
Contributions welcome! See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
Built with lessons learned from shipping 47 AI agents (31 failed in production).