Agent Evaluation
PraisonAI provides a comprehensive evaluation framework for testing and benchmarking AI agents. The evaluation system supports multiple evaluation types with zero performance impact when not in use.
Evaluation Types

| Type | Description | Use Case |
|---|---|---|
| Accuracy | Compare output against expected output using LLM-as-judge | Verify correctness |
| Performance | Measure runtime and memory usage | Benchmark speed |
| Reliability | Verify expected tool calls are made | Test tool usage |
| Criteria | Evaluate against custom criteria | Quality assessment |
Installation
The evaluation framework is included in `praisonaiagents`:
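Installing the package from PyPI brings in the evaluation framework with it:

```bash
pip install praisonaiagents
```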
Python Usage
Accuracy Evaluation
Compare agent outputs against expected results using an LLM judge:
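The snippet below is a minimal sketch, not a definitive API reference: the import path `praisonaiagents.evals` and the class name `AccuracyEval` are assumptions inferred from the `AccuracyResult` type documented later, and the keyword arguments mirror the accuracy CLI options (input, expected output, iterations, judge model). Check the package for the exact names.

```python
from praisonaiagents import Agent
# Assumed import path and class name; adjust to the actual evaluation module.
from praisonaiagents.evals import AccuracyEval

agent = Agent(instructions="You are a concise math assistant.")

accuracy_eval = AccuracyEval(
    agent=agent,
    input="What is 2 + 2?",
    expected_output="4",
    iterations=3,         # repeat for a more stable score
    model="gpt-4o-mini",  # LLM used as the judge
)

result = accuracy_eval.run()  # returns an AccuracyResult
print(result)
```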
Performance Evaluation
Benchmark agent runtime and memory usage:
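A sketch under the same assumptions (a `PerformanceEval` class in the evaluation module); the keywords follow the performance CLI options (iterations, warmup runs, memory tracking).

```python
from praisonaiagents import Agent
from praisonaiagents.evals import PerformanceEval  # assumed import path

agent = Agent(instructions="You are a helpful assistant.")

perf_eval = PerformanceEval(
    agent=agent,
    input="List three benefits of unit testing.",
    iterations=5,       # benchmark iterations
    warmup=1,           # warmup runs to avoid cold-start effects
    track_memory=True,  # also record memory usage (assumed keyword)
)

result = perf_eval.run()  # returns a PerformanceResult
print(result)
```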
Reliability Evaluation
Verify that agents call the expected tools:
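Another sketch, assuming a `ReliabilityEval` class; the expected and forbidden tool lists correspond to the reliability CLI options below.

```python
from praisonaiagents import Agent
from praisonaiagents.evals import ReliabilityEval  # assumed import path

def get_weather(city: str) -> str:
    """Toy tool used only for this example."""
    return f"Sunny in {city}"

agent = Agent(instructions="Answer weather questions.", tools=[get_weather])

reliability_eval = ReliabilityEval(
    agent=agent,
    input="What is the weather in Paris?",
    expected_tools=["get_weather"],  # tools that must be called
    forbidden_tools=[],              # tools that must not be called
)

result = reliability_eval.run()  # returns a ReliabilityResult
print(result)
```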
Criteria Evaluation
Evaluate outputs against custom criteria using LLM-as-judge:
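A sketch assuming a `CriteriaEval` class; the criteria, scoring type, threshold, and judge model match the criteria CLI options documented below.

```python
from praisonaiagents import Agent
from praisonaiagents.evals import CriteriaEval  # assumed import path

agent = Agent(instructions="You are a technical writer.")

criteria_eval = CriteriaEval(
    agent=agent,
    input="Explain what a REST API is.",
    criteria="Includes a clear definition, one concrete example, and stays under 200 words.",
    scoring="numeric",    # or "binary"
    threshold=7.0,        # pass threshold for numeric scoring
    model="gpt-4o-mini",  # LLM used as the judge
)

result = criteria_eval.run()  # returns a CriteriaResult
print(result)
```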
Failure Callbacks
Handle evaluation failures with callbacks:
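A sketch of wiring up a failure callback; the `on_failure` keyword and the callback signature are assumptions, so confirm the actual hook name in the package.

```python
from praisonaiagents import Agent
from praisonaiagents.evals import AccuracyEval  # assumed import path

def on_failure(result):
    # Called when the evaluation does not meet its threshold (assumed signature).
    print(f"Evaluation failed: {result}")

agent = Agent(instructions="You are a concise math assistant.")

accuracy_eval = AccuracyEval(
    agent=agent,
    input="What is 2 + 2?",
    expected_output="4",
    on_failure=on_failure,  # assumed keyword for the failure callback
)
accuracy_eval.run()
```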
Evaluate Pre-generated Outputs
Evaluate outputs without running the agent:
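A sketch of judging an output you already have; supplying `output` in place of an agent is an assumption about the API.

```python
from praisonaiagents.evals import AccuracyEval  # assumed import path

# No agent is run here; the output to judge is supplied directly (assumed keyword).
accuracy_eval = AccuracyEval(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    expected_output="Paris",
)

result = accuracy_eval.run()
print(result)
```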
Saving Results
Save evaluation results to files:
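A sketch that persists results with the standard library, assuming the result object is a dataclass; if the framework ships its own save helper, prefer that.

```python
import json
from dataclasses import asdict, is_dataclass

from praisonaiagents import Agent
from praisonaiagents.evals import AccuracyEval  # assumed import path

agent = Agent(instructions="You are a concise math assistant.")
result = AccuracyEval(agent=agent, input="What is 2 + 2?", expected_output="4").run()

# Serialize the result for later comparison (assumes a dataclass result type).
payload = asdict(result) if is_dataclass(result) else vars(result)
with open("accuracy_result.json", "w") as f:
    json.dump(payload, f, indent=2, default=str)
```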
CLI Usage
Accuracy Evaluation
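A hypothetical invocation: the `praisonai eval accuracy` subcommand is an assumption, while the flags come from the options reference below.

```bash
praisonai eval accuracy --agent agents.yaml \
  --input "What is 2 + 2?" \
  --expected "4" \
  --iterations 3 \
  --model gpt-4o-mini \
  --output accuracy_results.json
```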
Performance Evaluation
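Again hypothetical on the subcommand name; the flags match the performance options below.

```bash
praisonai eval performance --agent agents.yaml \
  --input "List three benefits of unit testing." \
  --iterations 5 --warmup 1 --memory
```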
Reliability Evaluation
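Hypothetical subcommand; the flags match the reliability options below.

```bash
praisonai eval reliability --agent agents.yaml \
  --input "What is the weather in Paris?" \
  --expected-tools get_weather \
  --forbidden-tools delete_file
```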
Criteria Evaluation
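Hypothetical subcommand; the flags match the criteria options below.

```bash
praisonai eval criteria --agent agents.yaml \
  --input "Explain what a REST API is." \
  --criteria "Includes a definition and one concrete example" \
  --scoring numeric --threshold 7.0
```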
Batch Evaluation
Run multiple test cases from a JSON file:
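A hypothetical batch run; both the subcommand and the shape of `test_cases.json` are assumptions (each entry would describe one test case's type, input, and expected result).

```bash
praisonai eval batch test_cases.json --agent agents.yaml \
  --output batch_results.json
```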
CLI Options Reference
Common Options
| Option | Short | Description |
|---|---|---|
| --agent | -a | Path to agents.yaml file |
| --output | -o | Output file for results |
| --verbose | -v | Enable verbose output |
| --quiet | -q | Suppress JSON output |
Accuracy Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --expected | -e | Expected output |
| --iterations | -n | Number of iterations |
| --model | -m | LLM model for judging |
Performance Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --iterations | -n | Number of benchmark iterations |
| --warmup | -w | Number of warmup runs |
| --memory | | Track memory usage |
Reliability Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --expected-tools | -t | Expected tools (comma-separated) |
| --forbidden-tools | -f | Forbidden tools (comma-separated) |
Criteria Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --criteria | -c | Evaluation criteria |
| --scoring | -s | Scoring type (numeric/binary) |
| --threshold | | Pass threshold for numeric scoring |
| --iterations | -n | Number of iterations |
| --model | -m | LLM model for judging |
Result Data Structures
AccuracyResult
PerformanceResult
ReliabilityResult
CriteriaResult
LLM Judge in Interactive Tests
The interactive test runner integrates LLM-as-judge evaluation for automated response quality assessment. This allows you to validate not just tool calls and file outputs, but also the quality of agent responses.
Using Judge in CSV Tests
Add a `judge_rubric` column to your CSV test file:
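An illustrative row: the judge columns match the configuration table below, while `test_id` and `prompt` are placeholders for whatever columns your test file already uses.

```csv
test_id,prompt,judge_rubric,judge_threshold,judge_model
T01,"Explain what a REST API is","Response includes a definition and at least one concrete example",7.0,gpt-4o-mini
```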
Judge Configuration
| Option | Default | Description |
|---|---|---|
| judge_rubric | (empty) | Evaluation criteria for the judge |
| judge_threshold | 7.0 | Minimum score to pass (1-10 scale) |
| judge_model | gpt-4o-mini | Model used for evaluation |
CLI Options for Judge
Judge Output
When judge evaluation is enabled, results include:
- Score: 1-10 rating based on rubric
- Passed: Whether score meets threshold
- Reasoning: Detailed explanation of the score
Example judge result (`judge_result.json`):
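An illustrative result file; the exact field names are assumptions, but the three fields mirror the score, pass flag, and reasoning described above.

```json
{
  "score": 8.5,
  "passed": true,
  "reasoning": "The response defines a REST API, gives a concrete code example, and lists three benefits, satisfying the rubric."
}
```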
Writing Effective Rubrics
Good rubrics are:
- Specific: “Response includes code example” vs “Response is good”
- Measurable: “Explains at least 3 benefits” vs “Comprehensive”
- Relevant: Focus on what matters for the test case
Best Practices
- Use Multiple Iterations: Run evaluations multiple times for statistical significance
- Warmup Runs: Use warmup runs for performance benchmarks to avoid cold-start effects
- Save Results: Always save results for tracking and comparison
- Custom Criteria: Write specific, measurable criteria for criteria evaluations
- Batch Testing: Use batch evaluation for regression testing
- CI/CD Integration: Integrate evaluations into your CI/CD pipeline
Examples
See the examples directory for complete examples:
- Accuracy Evaluation
- Performance Evaluation
- Reliability Evaluation
- Criteria Evaluation
- Batch Evaluation
GitHub Advanced Test Rubrics
The `github-advanced` test suite uses specialized LLM judge rubrics for evaluating GitHub workflow quality:
Available Rubrics
| Rubric | Description | Key Criteria |
|---|---|---|
| PR Quality | Evaluates pull request quality | Title clarity, body completeness, issue reference, branch naming |
| Code Quality | Evaluates code changes | Correctness, tests pass, coverage, type hints, no regressions |
| Workflow Correctness | Evaluates GitHub workflow | Repo created, issue created, PR links issue |
| CI/CD Quality | Evaluates CI configuration | Valid YAML, checkout step, setup step, triggers |
| Documentation | Evaluates docs changes | Links valid, content accurate, formatting correct |
| Multi-Agent | Evaluates agent collaboration | Handoff, task completion, context preservation |
Rubric Structure
Each rubric contains weighted criteria:
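A sketch of what a weighted rubric might look like, using the PR Quality criteria from the table above; the actual structure and weights used by the test suite may differ.

```python
# Illustrative only; the real rubric format in the github-advanced suite may differ.
pr_quality_rubric = {
    "name": "PR Quality",
    "criteria": [
        {"description": "Title is clear and descriptive", "weight": 0.25},
        {"description": "Body explains the change completely", "weight": 0.35},
        {"description": "References the related issue", "weight": 0.20},
        {"description": "Branch name follows conventions", "weight": 0.20},
    ],
}
```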
Scenario to Rubric Mapping
| Scenario | Rubrics Applied |
|---|---|
| GH_01 | PR Quality, Code Quality, Workflow Correctness |
| GH_02 | PR Quality, CI/CD Quality, Workflow Correctness |
| GH_03 | PR Quality, Code Quality, Workflow Correctness |
| GH_04 | PR Quality, Documentation, Workflow Correctness |
| GH_05 | PR Quality, Multi-Agent, Workflow Correctness |

