Agent Eval Framework

CI · Python 3.9+ · License: MIT

A production-grade evaluation and continuous improvement system for AI agents and agentic workflows.

The Problem

Most AI agents work great in demos but fail in production. Why?

  • No systematic evaluation - Teams rely on vibes instead of metrics
  • Prompt rot - Agent performance degrades as use cases expand
  • Blind spots - Edge cases that testing never catches
  • No feedback loop - No mechanism to learn from failures

This framework provides a structured approach to continuously evaluate and improve your agents.

Core Philosophy

Every agent has latent potential that can be unlocked through:

  1. Better Prompts - Specific instructions with examples and reasoning frameworks
  2. Continuous Evaluation - Testing reveals blind spots humans miss
  3. Cross-Agent Learning - Patterns that work for one agent help others
  4. Systematic Improvement - Small, validated changes compound over time

Features

  • 5-Dimension Evaluation Model - Comprehensive scoring across clarity, completeness, reasoning, actionability, and alignment
  • Automated Training Loop - Analyze → Evaluate → Diagnose → Prescribe → Validate → Track
  • Prompt Optimization - Data-driven prompt improvements with validation
  • Memory Integration - Track improvements and regressions over time
  • Multi-Agent Support - Train entire agent fleets simultaneously

Installation

pip install agent-eval-framework

Or from source:

git clone https://github.com/yourusername/agent-eval-framework.git
cd agent-eval-framework
pip install -e .

Quick Start

Basic Evaluation

from agent_eval import AgentEvaluator, EvalDimension

# Create evaluator
evaluator = AgentEvaluator()

# Define your agent's prompt
agent_prompt = """
You are a helpful assistant that answers questions about Python.
Be concise and provide code examples when relevant.
"""

# Evaluate the prompt
results = evaluator.evaluate(
    prompt=agent_prompt,
    test_cases=[
        {"input": "How do I read a file?", "expected_traits": ["code example", "with statement"]},
        {"input": "Explain decorators", "expected_traits": ["function wrapper", "@ syntax"]},
    ]
)

print(f"Overall Score: {results.overall_score}/10")
for dim, score in results.dimension_scores.items():
    print(f"  {dim.name}: {score}/10")

Continuous Training

from agent_eval import AgentTrainer

trainer = AgentTrainer()

# Train a single agent
improved_prompt, report = trainer.train(
    agent_name="python_assistant",
    current_prompt=agent_prompt,
    test_cases=test_cases,
    apply_if_improved=True  # Auto-apply if validation passes
)

print(report)

Training Multiple Agents

from agent_eval import train_all

results = train_all(
    agents={
        "python_assistant": python_prompt,
        "code_reviewer": reviewer_prompt,
        "doc_writer": doc_prompt,
    },
    shared_test_suite=common_tests,
)

# See which agents need attention
for agent, result in results.items():
    if result.needs_improvement:
        print(f"⚠️  {agent}: {result.weakest_dimension}")

Evaluation Dimensions

Agents are scored across 5 dimensions (0-10 scale):

Dimension     | Description                     | What It Measures
--------------|---------------------------------|-------------------------------------------------
Clarity       | Is the purpose crystal clear?   | Unambiguous instructions, clear role definition
Completeness  | Are edge cases handled?         | Coverage of failure modes, graceful degradation
Reasoning     | Does it show its work?          | Chain-of-thought, explanation of decisions
Actionability | Are outputs immediately usable? | Concrete next steps, structured formats
Alignment     | Does it serve true goals?       | User intent vs stated task, value alignment
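
The five scores are combined into the overall score. A minimal sketch of the aggregation, assuming a weighted average over the dimension weights from the Configuration section (the numbers are illustrative):

# Sketch: how an overall score could be derived from dimension scores.
# Assumes a weighted average; the framework's exact aggregation may differ.
dimension_scores = {
    "clarity": 7.0,
    "completeness": 6.5,
    "reasoning": 8.0,
    "actionability": 5.5,
    "alignment": 7.5,
}
weights = {
    "clarity": 1.0,
    "completeness": 1.2,
    "reasoning": 1.0,
    "actionability": 1.5,
    "alignment": 1.0,
}

overall = sum(dimension_scores[d] * weights[d] for d in dimension_scores) / sum(weights.values())
print(f"Overall Score: {overall:.1f}/10")  # 6.8/10 with these example numbers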

Training Loop

The framework implements a 6-step improvement cycle:

┌─────────────────────────────────────────────────────────┐
│                    TRAINING LOOP                        │
├─────────────────────────────────────────────────────────┤
│                                                         │
│   1. ANALYZE    ──→  Read prompt, find assumptions     │
│        ↓                                                │
│   2. EVALUATE   ──→  Run test cases, score dimensions  │
│        ↓                                                │
│   3. DIAGNOSE   ──→  Find patterns, root causes        │
│        ↓                                                │
│   4. PRESCRIBE  ──→  Generate targeted improvements    │
│        ↓                                                │
│   5. VALIDATE   ──→  Test improved vs original         │
│        ↓                                                │
│   6. TRACK      ──→  Save results, monitor trends      │
│        ↓                                                │
│   [Loop back to 1 with new baseline]                   │
│                                                         │
└─────────────────────────────────────────────────────────┘
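
One way to run this cycle end to end is to call the trainer repeatedly, feeding each round's improved prompt back in as the new baseline. A minimal sketch using only the AgentTrainer API from Quick Start (the three-round cap is arbitrary):

# Sketch: several improvement cycles with the documented train() API.
# Each iteration uses the previous round's improved prompt as the new baseline.
from agent_eval import AgentTrainer

trainer = AgentTrainer()
prompt = agent_prompt  # initial baseline from Quick Start

for round_num in range(3):  # cap the number of cycles
    improved_prompt, report = trainer.train(
        agent_name="python_assistant",
        current_prompt=prompt,
        test_cases=test_cases,
        apply_if_improved=True,
    )
    print(f"Round {round_num + 1}:\n{report}")
    prompt = improved_prompt  # loop back to step 1 with the new baseline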

🚀 Real-World Use Cases

We've included 3 complete, runnable examples that solve real problems:

Use Case 1: Customer Support Agent

Problem: Support responses are robotic, miss emotional cues, and don't resolve issues.

python examples/use_case_1_customer_support.py

What you'll learn:

  • Handling angry customers with empathy-first responses
  • Structured output (Acknowledgment → Solution → Next Steps)
  • Edge cases for urgent issues, cancellations, and feature requests
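
As a concrete starting point, an evaluation for this agent can pair an angry-customer message with empathy-related expected traits. A minimal sketch reusing the evaluate API from Quick Start; support_prompt, the input, and the traits are illustrative placeholders, not taken from the example script:

from agent_eval import AgentEvaluator

evaluator = AgentEvaluator()
results = evaluator.evaluate(
    prompt=support_prompt,  # your customer support system prompt
    test_cases=[
        {
            "input": "This is the third time my order arrived broken. I want a refund NOW.",
            "expected_traits": ["acknowledges frustration", "offers refund path", "clear next steps"],
        },
    ],
)
print(f"Overall Score: {results.overall_score}/10")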

Use Case 2: Code Review Agent

Problem: Code reviews miss security vulnerabilities and give vague feedback.

python examples/use_case_2_code_review.py

What you'll learn:

  • Security-first review priorities (SQL injection, memory leaks, race conditions)
  • Severity-based output (CRITICAL → LOW)
  • Always providing working fix code, not just descriptions

Use Case 3: Sales Email Agent

Problem: Cold emails get 2% response rates (industry average: 5-15%).

python examples/use_case_3_sales_email.py

What you'll learn:

  • Personalization beyond "I saw you work at {company}"
  • The 75-word maximum rule
  • Follow-up and breakup email strategies

Interactive Tutorial

New to the framework? Start here:

python examples/quickstart_tutorial.py

Architecture

agent-eval-framework/
├── src/
│   └── agent_eval/
│       ├── __init__.py          # Public API exports
│       ├── dimensions.py        # EvalDimension enum & scoring
│       ├── evaluator.py         # Core AgentEvaluator class
│       ├── trainer.py           # AgentTrainer with training loop
│       └── prompts.py           # Prompt optimization utilities
├── examples/
│   ├── quickstart_tutorial.py   # Interactive 5-min tutorial
│   ├── use_case_1_customer_support.py
│   ├── use_case_2_code_review.py
│   ├── use_case_3_sales_email.py
│   ├── basic_evaluation.py
│   └── continuous_training.py
├── tests/
└── .github/workflows/ci.yml     # Automated testing

Best Practices for Prompt Improvement

In our experience training hundreds of agents, these patterns consistently improve scores:

  1. Role Clarity - Start with a specific role definition
  2. Structured Output - Provide explicit output format templates
  3. Reasoning Frameworks - Add "First... Then... Finally..." type structures
  4. Examples - Include 1-2 examples of ideal outputs
  5. Edge Cases - Address common failure modes explicitly
  6. Self-Evaluation - Add criteria for the agent to check its own work
  7. User Alignment - Connect outputs to user's actual goals

Example: Before & After

Before (Score: 5.2/10)

You are a helpful coding assistant. Answer programming questions.

After (Score: 8.7/10)

# ROLE
You are a Senior Software Engineer assistant specializing in Python.

# TASK
When answering programming questions:
1. First, clarify the user's actual goal (not just the stated question)
2. Provide a working code example
3. Explain WHY the solution works, not just WHAT it does
4. Mention common pitfalls to avoid

# OUTPUT FORMAT
## Solution
[Working code with comments]

## Explanation
[Why this approach works]

## Watch Out For
[Common mistakes with this pattern]

# SELF-CHECK
Before responding, verify:
- [ ] Code is syntactically correct
- [ ] Example is complete (can be copy-pasted)
- [ ] Explanation matches user's experience level
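
Scores like these come from running the same test suite against both prompt versions and comparing. A minimal sketch using the evaluate API from Quick Start; before_prompt, after_prompt, and test_cases stand in for the prompts and tests above:

from agent_eval import AgentEvaluator

evaluator = AgentEvaluator()
before = evaluator.evaluate(prompt=before_prompt, test_cases=test_cases)
after = evaluator.evaluate(prompt=after_prompt, test_cases=test_cases)

print(f"Before: {before.overall_score}/10")
print(f"After:  {after.overall_score}/10")
print(f"Delta:  {after.overall_score - before.overall_score:+.1f}")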

Configuration

from agent_eval import AgentEvaluator, EvalConfig

config = EvalConfig(
    min_improvement_threshold=0.5,  # Only apply if +0.5 improvement
    max_training_iterations=5,       # Cap optimization rounds
    dimensions_weights={             # Custom dimension importance
        "clarity": 1.0,
        "completeness": 1.2,         # Weight completeness higher
        "reasoning": 1.0,
        "actionability": 1.5,        # Weight actionability higher
        "alignment": 1.0,
    },
    llm_provider="anthropic",        # or "openai", "ollama"
    llm_model="claude-sonnet-4-20250514",
)

evaluator = AgentEvaluator(config=config)

Integration with LangGraph

from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from agent_eval import AgentEvaluator

# Shared graph state: the agent's output plus the evaluation score
class AgentState(TypedDict):
    agent_output: str
    quality_score: float

# Add evaluation as a node in your graph
def evaluate_output(state: AgentState) -> AgentState:
    evaluator = AgentEvaluator()
    state["quality_score"] = evaluator.quick_score(state["agent_output"])
    return state

graph = StateGraph(AgentState)           # StateGraph requires a state schema
graph.add_node("agent", agent_node)      # agent_node: your existing agent callable
graph.add_node("evaluate", evaluate_output)
graph.add_edge(START, "agent")
graph.add_edge("agent", "evaluate")
graph.add_edge("evaluate", END)
app = graph.compile()

Roadmap

  • Web dashboard for training visualization
  • A/B testing for prompt variants
  • Automatic test case generation
  • Integration with LangSmith, Weights & Biases
  • Multi-turn conversation evaluation
  • Custom dimension plugins

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.


Built with lessons learned from shipping 47 AI agents (31 failed in production).
