Evaluation uses LLM-as-Judge to assess AI outputs with human-like reasoning, providing scores, feedback, and improvement suggestions.

How Evaluation Works

LLM as Judge: AI evaluates AI with human-like reasoning
Accuracy: compare output vs. expected result
Performance: measure speed and memory usage
Reliability: verify tools are called correctly

LLM as Judge

The Judge class uses an LLM to evaluate outputs with human-like reasoning. This is the recommended approach for most evaluations.
from praisonaiagents.eval import Judge

# Evaluate any output
result = Judge().run(
    output="The capital of France is Paris.",
    expected="Paris is the capital of France."
)

print(f"Score: {result.score}/10")
print(f"Reasoning: {result.reasoning}")

Judge Types

AccuracyJudge: compares output against the expected output.
from praisonaiagents.eval import AccuracyJudge

judge = AccuracyJudge()
result = judge.run(
    output="Paris",
    expected="Paris",
    input="What is the capital of France?"
)
# Score: 10/10 - Perfect match

CriteriaJudge: evaluates against custom criteria.
from praisonaiagents.eval import CriteriaJudge

judge = CriteriaJudge(criteria="Response is professional and helpful")
result = judge.run(output="Hello! How can I assist you today?")
# Score: 9/10 - Professional and helpful

RecipeJudge: evaluates multi-agent workflow outputs.
from praisonaiagents.eval import RecipeJudge

judge = RecipeJudge()
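# workflow_output: the final output produced by your multi-agent workflow run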
result = judge.run(
    output=workflow_output,
    expected="Complete research report"
)

Judge Registry

Register and retrieve custom judges:
from praisonaiagents.eval import add_judge, get_judge, list_judges

# Register a custom judge
add_judge("my_judge", MyCustomJudge)

# List all judges
print(list_judges())  # ['accuracy', 'criteria', 'recipe', 'my_judge']

# Get a judge by name
judge = get_judge("my_judge")

Evaluation Types

Accuracy Evaluation

Compare agent output against expected output using LLM-as-judge.
from praisonaiagents.eval import AccuracyEvaluator

evaluator = AccuracyEvaluator(
    agent=my_agent,
    input_text="What is 2+2?",
    expected_output="4"
)

result = evaluator.run(print_summary=True)
print(f"Average Score: {result.avg_score}/10")

Performance Evaluation

Measure runtime and memory usage.
from praisonaiagents.eval import PerformanceEvaluator

evaluator = PerformanceEvaluator(
    agent=my_agent,
    input_text="Hello!",
    num_iterations=10,
    warmup_runs=2
)

result = evaluator.run(print_summary=True)
print(f"Avg Time: {result.avg_run_time:.3f}s")
print(f"Avg Memory: {result.avg_memory_usage:.2f}MB")

| Metric | Description |
| --- | --- |
| avg_run_time | Average execution time |
| min_run_time | Fastest execution |
| max_run_time | Slowest execution |
| std_dev_run_time | Standard deviation of execution time |
| median_run_time | Median execution time |
| p95_run_time | 95th percentile execution time |
| avg_memory_usage | Average memory usage (MB) |
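
If the remaining metrics are exposed as attributes on the same result object (an assumption; only avg_run_time and avg_memory_usage are shown above), they can be printed the same way:
# Assumes the attribute names match the metric names in the table.
print(f"Median: {result.median_run_time:.3f}s")
print(f"P95: {result.p95_run_time:.3f}s")
print(f"Range: {result.min_run_time:.3f}s - {result.max_run_time:.3f}s")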

Reliability Evaluation

Verify that expected tools are called.
from praisonaiagents.eval import ReliabilityEvaluator

evaluator = ReliabilityEvaluator(
    agent=my_agent,
    input_text="Search for AI news",
    expected_tools=["search_web", "summarize"]
)

result = evaluator.run(print_summary=True)
print(f"Status: {result.status}")  # PASSED or FAILED
print(f"Pass Rate: {result.pass_rate}%")

Criteria Evaluation

Evaluate against custom criteria with numeric or binary scoring.
from praisonaiagents.eval import CriteriaEvaluator

evaluator = CriteriaEvaluator(
    criteria="Response is helpful, accurate, and professional",
    agent=my_agent,
    input_text="How do I reset my password?",
    scoring_type="numeric",
    threshold=7.0
)

result = evaluator.run(print_summary=True)
print(f"Score: {result.avg_score}/10")
print(f"Passed: {result.all_passed}")

Evaluation Flow

Each evaluator runs the agent (or function) on the given input, asks an LLM judge to score the output against the expected result or criteria, and aggregates the scores into a result summary that can optionally be saved to disk.
Async Evaluation

All evaluators support async execution:
import asyncio
from praisonaiagents.eval import AccuracyEvaluator

async def evaluate():
    evaluator = AccuracyEvaluator(
        agent=my_agent,
        input_text="Hello",
        expected_output="Hi there!"
    )
    
    result = await evaluator.run_async(print_summary=True)
    return result

result = asyncio.run(evaluate())
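
Because every evaluator exposes run_async, several evaluations can also run concurrently. A minimal sketch using asyncio.gather, assuming the evaluators are safe to run against the same agent at the same time:
import asyncio
from praisonaiagents.eval import AccuracyEvaluator

async def evaluate_all():
    evaluators = [
        AccuracyEvaluator(agent=my_agent, input_text="2+2", expected_output="4"),
        AccuracyEvaluator(agent=my_agent, input_text="Capital of France?", expected_output="Paris"),
    ]
    # Run all evaluations concurrently and collect their results in order.
    return await asyncio.gather(*(e.run_async() for e in evaluators))

results = asyncio.run(evaluate_all())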

Saving Results

Save evaluation results for later analysis:
from praisonaiagents.eval import AccuracyEvaluator

evaluator = AccuracyEvaluator(
    agent=my_agent,
    input_text="Test",
    expected_output="Expected",
    save_results_path="./eval_results/accuracy_{timestamp}.json"
)

result = evaluator.run()
# Results automatically saved to file
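
Since the example path ends in .json, saved results can presumably be loaded back with the standard library for later analysis. The file schema isn't documented here, so inspect it before relying on specific keys:
import json
from pathlib import Path

# Load every saved accuracy result for offline analysis.
for path in sorted(Path("./eval_results").glob("accuracy_*.json")):
    with path.open() as f:
        data = json.load(f)
    print(path.name, data)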

Evaluation Packages

Run multiple test cases as a batch:
from praisonaiagents.eval import EvalPackage, EvalCase, Judge

# Define test cases
cases = [
    EvalCase(name="math", input="2+2", expected="4"),
    EvalCase(name="geography", input="Capital of France?", expected="Paris"),
    EvalCase(name="greeting", input="Hello", expected="Hi"),
]

# Create package
package = EvalPackage(
    name="Math and Geography Tests",
    cases=cases
)

# Run cases with Judge
judge = Judge()
for case in package.cases:
    result = judge.run(
        agent=my_agent,
        input=case.input,
        expected=case.expected
    )
    print(f"{case.name}: {result.score}/10")

Quick Reference

Judge

from praisonaiagents.eval import Judge
result = Judge().run(output="...", expected="...")

Accuracy

from praisonaiagents.eval import AccuracyEvaluator
result = AccuracyEvaluator(agent=a, input_text="...", expected_output="...").run()

Performance

from praisonaiagents.eval import PerformanceEvaluator
result = PerformanceEvaluator(func=f, num_iterations=10).run()

Reliability

from praisonaiagents.eval import ReliabilityEvaluator
result = ReliabilityEvaluator(agent=a, expected_tools=["..."]).run()

Best Practices

Use Judge for most evaluations - it provides human-like reasoning.
Be specific: “Response is under 100 words and includes a greeting” is better than “Response is good”.
Run evaluations multiple times to account for LLM non-determinism (see the sketch below).
Use save_results_path to track evaluation history over time.
Use Accuracy for correctness, Performance for speed, Reliability for tool usage.
LLM Costs: Each evaluation makes LLM API calls. Use num_iterations wisely and consider caching for repeated evaluations.
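
As a sketch of the repeated-runs practice above, you can average several Judge scores for the same output (three repetitions is an arbitrary choice):
from statistics import mean
from praisonaiagents.eval import Judge

answer = "The capital of France is Paris."
reference = "Paris is the capital of France."

# Average several judge runs to smooth out LLM non-determinism.
judge = Judge()
scores = [judge.run(output=answer, expected=reference).score for _ in range(3)]
print(f"Mean score: {mean(scores):.1f}/10")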

Guardrails: protect agents with input/output validation
Hooks: intercept and modify agent behavior