A production-grade evaluation and continuous improvement system for AI agents and agentic workflows.
Most AI agents work great in demos but fail in production. Why?
- No systematic evaluation - Teams rely on vibes instead of metrics
- Prompt rot - Agent performance degrades as use cases expand
- Blind spots - Edge cases that testing never catches
- No feedback loop - No mechanism to learn from failures
This framework provides a structured approach to continuously evaluate and improve your agents.
Every agent has latent potential unlocked through:
- Better Prompts - Specific instructions with examples and reasoning frameworks
- Continuous Evaluation - Testing reveals blind spots humans miss
- Cross-Agent Learning - Patterns that work for one agent help others
- Systematic Improvement - Small, validated changes compound over time
Key features:
- 5-Dimension Evaluation Model - Comprehensive scoring across clarity, completeness, reasoning, actionability, and alignment
- Automated Training Loop - Analyze → Evaluate → Diagnose → Prescribe → Validate → Track
- Prompt Optimization - Data-driven prompt improvements with validation
- Memory Integration - Track improvements and regressions over time
- Multi-Agent Support - Train entire agent fleets simultaneously
Install with pip:

```bash
pip install agent-eval-framework
```

Or from source:

```bash
git clone https://github.com/yourusername/agent-eval-framework.git
cd agent-eval-framework
pip install -e .
```

Evaluate an agent prompt against a couple of test cases:

```python
from agent_eval import AgentEvaluator, EvalDimension

# Create evaluator
evaluator = AgentEvaluator()
# Define your agent's prompt
agent_prompt = """
You are a helpful assistant that answers questions about Python.
Be concise and provide code examples when relevant.
"""
# Evaluate the prompt
results = evaluator.evaluate(
    prompt=agent_prompt,
    test_cases=[
        {"input": "How do I read a file?", "expected_traits": ["code example", "with statement"]},
        {"input": "Explain decorators", "expected_traits": ["function wrapper", "@ syntax"]},
    ]
)

print(f"Overall Score: {results.overall_score}/10")
for dim, score in results.dimension_scores.items():
    print(f"  {dim.name}: {score}/10")
```

Train a single agent, applying the improved prompt automatically when validation passes:

```python
from agent_eval import AgentTrainer

trainer = AgentTrainer()
# Train a single agent
improved_prompt, report = trainer.train(
    agent_name="python_assistant",
    current_prompt=agent_prompt,
    test_cases=test_cases,
    apply_if_improved=True,  # Auto-apply if validation passes
)

print(report)
```

Train an entire fleet of agents against a shared test suite:

```python
from agent_eval import train_all

results = train_all(
    agents={
        "python_assistant": python_prompt,
        "code_reviewer": reviewer_prompt,
        "doc_writer": doc_prompt,
    },
    shared_test_suite=common_tests,
)

# See which agents need attention
for agent, result in results.items():
    if result.needs_improvement:
        print(f"⚠️ {agent}: {result.weakest_dimension}")
```

Agents are scored across 5 dimensions (0-10 scale):
| Dimension | Description | What It Measures |
|---|---|---|
| Clarity | Is the purpose crystal clear? | Unambiguous instructions, clear role definition |
| Completeness | Are edge cases handled? | Coverage of failure modes, graceful degradation |
| Reasoning | Does it show its work? | Chain-of-thought, explanation of decisions |
| Actionability | Are outputs immediately usable? | Concrete next steps, structured formats |
| Alignment | Does it serve true goals? | User intent vs stated task, value alignment |
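If you want a feel for how per-dimension scores can roll up into an overall score, here is a minimal sketch that assumes a weighted mean; the weights mirror the `dimensions_weights` option shown in the configuration example further down, but the framework's actual aggregation may differ, and all numbers below are made up.

```python
# Hypothetical illustration: combine per-dimension scores (0-10) into an overall
# score via a weighted mean. Scores and weights are example values, not real output.
dimension_scores = {
    "clarity": 7.0,
    "completeness": 5.5,
    "reasoning": 8.0,
    "actionability": 6.0,
    "alignment": 7.5,
}
weights = {
    "clarity": 1.0,
    "completeness": 1.2,
    "reasoning": 1.0,
    "actionability": 1.5,
    "alignment": 1.0,
}

overall = sum(dimension_scores[d] * weights[d] for d in dimension_scores) / sum(weights.values())
print(f"Overall Score: {overall:.1f}/10")  # ~6.7/10 with these example values
```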
The framework implements a 6-step improvement cycle:
```
┌─────────────────────────────────────────────────────────┐
│ TRAINING LOOP │
├─────────────────────────────────────────────────────────┤
│ │
│ 1. ANALYZE ──→ Read prompt, find assumptions │
│ ↓ │
│ 2. EVALUATE ──→ Run test cases, score dimensions │
│ ↓ │
│ 3. DIAGNOSE ──→ Find patterns, root causes │
│ ↓ │
│ 4. PRESCRIBE ──→ Generate targeted improvements │
│ ↓ │
│ 5. VALIDATE ──→ Test improved vs original │
│ ↓ │
│ 6. TRACK ──→ Save results, monitor trends │
│ ↓ │
│ [Loop back to 1 with new baseline] │
│ │
└─────────────────────────────────────────────────────────┘
```
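For orientation, here is a hypothetical sketch of how such a loop could be written against the evaluator API shown above. `AgentTrainer.train()` implements the real cycle; the `analyze`, `diagnose`, and `prescribe` helpers below are deliberately simplistic stand-ins, not framework API.

```python
# Hypothetical sketch of the six-step cycle; AgentTrainer.train() is the real thing.
def analyze(prompt):
    # 1. ANALYZE: read the prompt and note implicit assumptions (stubbed here)
    return [line for line in prompt.splitlines() if line.strip()]

def diagnose(results):
    # 3. DIAGNOSE: find the weakest-scoring dimension
    return min(results.dimension_scores, key=results.dimension_scores.get)

def prescribe(prompt, weak_dim):
    # 4. PRESCRIBE: a placeholder "improvement" targeting the weak dimension
    return prompt + f"\nBe explicit about {weak_dim.name.lower()} in every answer."

def training_loop(prompt, test_cases, evaluator, iterations=3):
    history = []
    for _ in range(iterations):
        analyze(prompt)
        results = evaluator.evaluate(prompt=prompt, test_cases=test_cases)          # 2. EVALUATE
        candidate = prescribe(prompt, diagnose(results))
        new_results = evaluator.evaluate(prompt=candidate, test_cases=test_cases)   # 5. VALIDATE
        history.append((results.overall_score, new_results.overall_score))          # 6. TRACK
        if new_results.overall_score > results.overall_score:
            prompt = candidate  # loop back to step 1 with the new baseline
    return prompt, history
```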
We've included 3 complete, runnable examples that solve real problems:
Problem: Support responses are robotic, miss emotional cues, and don't resolve issues.
```bash
python examples/use_case_1_customer_support.py
```

What you'll learn:
- Handling angry customers with empathy-first responses
- Structured output (Acknowledgment → Solution → Next Steps)
- Edge cases for urgent issues, cancellations, and feature requests
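To evaluate an agent like this yourself, you can feed the evaluator test cases in the same shape as the quick-start example. The prompt, inputs, and expected traits below are illustrative sketches, not the cases shipped in the example file.

```python
from agent_eval import AgentEvaluator

# Illustrative only: not the test cases from examples/use_case_1_customer_support.py.
support_prompt = """
You are a customer support agent. Acknowledge the customer's feelings,
propose a concrete solution, and end with clear next steps.
"""

support_test_cases = [
    {
        "input": "This is the THIRD time my order arrived broken. I want a refund NOW.",
        "expected_traits": ["acknowledges frustration", "apology", "refund next steps"],
    },
    {
        "input": "I need to cancel my subscription before the next billing date.",
        "expected_traits": ["confirms cancellation steps", "mentions billing date"],
    },
    {
        "input": "Could you add a dark mode? The app is blinding at night.",
        "expected_traits": ["thanks for feedback", "logs feature request"],
    },
]

results = AgentEvaluator().evaluate(prompt=support_prompt, test_cases=support_test_cases)
```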
Problem: Code reviews miss security vulnerabilities and give vague feedback.
```bash
python examples/use_case_2_code_review.py
```

What you'll learn:
- Security-first review priorities (SQL injection, memory leaks, race conditions)
- Severity-based output (CRITICAL → LOW)
- Always providing working fix code, not just descriptions
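A hedged sketch of what security-focused test cases for such a reviewer might look like; again, the code snippets and traits are illustrative, not the actual cases in the script.

```python
from agent_eval import AgentEvaluator

# Illustrative only: not the test cases from examples/use_case_2_code_review.py.
reviewer_prompt = """
You are a senior code reviewer. Prioritize security issues, label each finding
with a severity (CRITICAL to LOW), and always include working fix code.
"""

review_test_cases = [
    {
        "input": 'cursor.execute("SELECT * FROM users WHERE name = \'" + name + "\'")',
        "expected_traits": ["flags SQL injection", "CRITICAL severity", "parameterized query fix"],
    },
    {
        "input": "data = open('config.json').read()",
        "expected_traits": ["unclosed file handle", "suggests with statement", "working fix code"],
    },
]

results = AgentEvaluator().evaluate(prompt=reviewer_prompt, test_cases=review_test_cases)
```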
Problem: Cold emails get 2% response rates (industry average: 5-15%).
```bash
python examples/use_case_3_sales_email.py
```

What you'll learn:
- Personalization beyond "I saw you work at {company}"
- The 75-word maximum rule
- Follow-up and breakup email strategies
New to the framework? Start here:
```bash
python examples/quickstart_tutorial.py
```

Project layout:

```
agent-eval-framework/
├── src/
│ └── agent_eval/
│ ├── __init__.py # Public API exports
│ ├── dimensions.py # EvalDimension enum & scoring
│ ├── evaluator.py # Core AgentEvaluator class
│ ├── trainer.py # AgentTrainer with training loop
│ └── prompts.py # Prompt optimization utilities
├── examples/
│ ├── quickstart_tutorial.py # Interactive 5-min tutorial
│ ├── use_case_1_customer_support.py
│ ├── use_case_2_code_review.py
│ ├── use_case_3_sales_email.py
│ ├── basic_evaluation.py
│ └── continuous_training.py
├── tests/
└── .github/workflows/ci.yml        # Automated testing
```
Based on training hundreds of agents, these patterns consistently improve scores:
- Role Clarity - Start with a specific role definition
- Structured Output - Provide explicit output format templates
- Reasoning Frameworks - Add "First... Then... Finally..." type structures
- Examples - Include 1-2 examples of ideal outputs
- Edge Cases - Address common failure modes explicitly
- Self-Evaluation - Add criteria for the agent to check its own work
- User Alignment - Connect outputs to user's actual goals
For example, here is a weak prompt and a version rewritten with these patterns.

Before:

```
You are a helpful coding assistant. Answer programming questions.
```

After:

```
# ROLE
You are a Senior Software Engineer assistant specializing in Python.
# TASK
When answering programming questions:
1. First, clarify the user's actual goal (not just the stated question)
2. Provide a working code example
3. Explain WHY the solution works, not just WHAT it does
4. Mention common pitfalls to avoid
# OUTPUT FORMAT
## Solution
[Working code with comments]
## Explanation
[Why this approach works]
## Watch Out For
[Common mistakes with this pattern]
# SELF-CHECK
Before responding, verify:
- [ ] Code is syntactically correct
- [ ] Example is complete (can be copy-pasted)
- [ ] Explanation matches user's experience level
```

Configure improvement thresholds, dimension weights, and the LLM backend with `EvalConfig`:

```python
from agent_eval import AgentEvaluator, EvalConfig

config = EvalConfig(
    min_improvement_threshold=0.5,  # Only apply if +0.5 improvement
    max_training_iterations=5,      # Cap optimization rounds
    dimensions_weights={            # Custom dimension importance
        "clarity": 1.0,
        "completeness": 1.2,        # Weight completeness higher
        "reasoning": 1.0,
        "actionability": 1.5,       # Weight actionability higher
        "alignment": 1.0,
    },
    llm_provider="anthropic",       # or "openai", "ollama"
    llm_model="claude-sonnet-4-20250514",
)

evaluator = AgentEvaluator(config=config)
```

You can also hook evaluation into a LangGraph workflow by adding the evaluator as a node:

```python
from typing import TypedDict

from langgraph.graph import StateGraph
from agent_eval import AgentEvaluator

class AgentState(TypedDict):
    agent_output: str
    quality_score: float

# Add evaluation as a node in your graph
def evaluate_output(state: AgentState) -> AgentState:
    evaluator = AgentEvaluator()
    score = evaluator.quick_score(state["agent_output"])
    state["quality_score"] = score
    return state

graph = StateGraph(AgentState)  # StateGraph needs a state schema
graph.add_node("agent", agent_node)  # agent_node: your existing agent callable
graph.add_node("evaluate", evaluate_output)
graph.add_edge("agent", "evaluate")
```

On the roadmap:

- Web dashboard for training visualization
- A/B testing for prompt variants
- Automatic test case generation
- Integration with LangSmith, Weights & Biases
- Multi-turn conversation evaluation
- Custom dimension plugins
Contributions welcome! See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
Built with lessons learned from shipping 47 AI agents (31 failed in production).