Agent Evaluation
PraisonAI provides a comprehensive evaluation framework for testing and benchmarking AI agents. The evaluation system supports multiple evaluation types with zero performance impact when not in use.
Evaluation Types

| Type | Description | Use Case |
|---|---|---|
| Accuracy | Compare output against expected output using LLM-as-judge | Verify correctness |
| Performance | Measure runtime and memory usage | Benchmark speed |
| Reliability | Verify expected tool calls are made | Test tool usage |
| Criteria | Evaluate against custom criteria | Quality assessment |
Installation
The evaluation framework is included in `praisonaiagents`:
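Installing the package from PyPI brings in the evaluation framework with it:

```bash
pip install praisonaiagents
```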
Python Usage
Accuracy Evaluation
Compare agent outputs against expected results using an LLM judge:
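The snippet below is a minimal sketch, not a definitive API reference: the import path `praisonaiagents.evals` and the class name `AccuracyEval` are assumptions inferred from the `AccuracyResult` type documented later, and the keyword arguments mirror the accuracy CLI options (input, expected output, iterations, judge model). Check the package for the exact names.

```python
from praisonaiagents import Agent
# Assumed import path and class name; adjust to the actual evaluation module.
from praisonaiagents.evals import AccuracyEval

agent = Agent(instructions="You are a concise math assistant.")

accuracy_eval = AccuracyEval(
    agent=agent,
    input="What is 2 + 2?",
    expected_output="4",
    iterations=3,         # repeat for a more stable score
    model="gpt-4o-mini",  # LLM used as the judge
)

result = accuracy_eval.run()  # returns an AccuracyResult
print(result)
```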
Performance Evaluation
Benchmark agent runtime and memory usage:
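A sketch under the same assumptions (a `PerformanceEval` class in the evaluation module); the keywords follow the performance CLI options (iterations, warmup runs, memory tracking).

```python
from praisonaiagents import Agent
from praisonaiagents.evals import PerformanceEval  # assumed import path

agent = Agent(instructions="You are a helpful assistant.")

perf_eval = PerformanceEval(
    agent=agent,
    input="List three benefits of unit testing.",
    iterations=5,       # benchmark iterations
    warmup=1,           # warmup runs to avoid cold-start effects
    track_memory=True,  # also record memory usage (assumed keyword)
)

result = perf_eval.run()  # returns a PerformanceResult
print(result)
```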
Reliability Evaluation
Verify that agents call the expected tools:
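Another sketch, assuming a `ReliabilityEval` class; the expected and forbidden tool lists correspond to the reliability CLI options below.

```python
from praisonaiagents import Agent
from praisonaiagents.evals import ReliabilityEval  # assumed import path

def get_weather(city: str) -> str:
    """Toy tool used only for this example."""
    return f"Sunny in {city}"

agent = Agent(instructions="Answer weather questions.", tools=[get_weather])

reliability_eval = ReliabilityEval(
    agent=agent,
    input="What is the weather in Paris?",
    expected_tools=["get_weather"],  # tools that must be called
    forbidden_tools=[],              # tools that must not be called
)

result = reliability_eval.run()  # returns a ReliabilityResult
print(result)
```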
Criteria Evaluation
Evaluate outputs against custom criteria using LLM-as-judge:
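A sketch assuming a `CriteriaEval` class; the criteria, scoring type, threshold, and judge model match the criteria CLI options documented below.

```python
from praisonaiagents import Agent
from praisonaiagents.evals import CriteriaEval  # assumed import path

agent = Agent(instructions="You are a technical writer.")

criteria_eval = CriteriaEval(
    agent=agent,
    input="Explain what a REST API is.",
    criteria="Includes a clear definition, one concrete example, and stays under 200 words.",
    scoring="numeric",    # or "binary"
    threshold=7.0,        # pass threshold for numeric scoring
    model="gpt-4o-mini",  # LLM used as the judge
)

result = criteria_eval.run()  # returns a CriteriaResult
print(result)
```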
Failure Callbacks
Handle evaluation failures with callbacks:
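A sketch of wiring up a failure callback; the `on_failure` keyword and the callback signature are assumptions, so confirm the actual hook name in the package.

```python
from praisonaiagents import Agent
from praisonaiagents.evals import AccuracyEval  # assumed import path

def on_failure(result):
    # Called when the evaluation does not meet its threshold (assumed signature).
    print(f"Evaluation failed: {result}")

agent = Agent(instructions="You are a concise math assistant.")

accuracy_eval = AccuracyEval(
    agent=agent,
    input="What is 2 + 2?",
    expected_output="4",
    on_failure=on_failure,  # assumed keyword for the failure callback
)
accuracy_eval.run()
```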
Evaluate Pre-generated Outputs
Evaluate outputs without running the agent:
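A sketch of judging an output you already have; supplying `output` in place of an agent is an assumption about the API.

```python
from praisonaiagents.evals import AccuracyEval  # assumed import path

# No agent is run here; the output to judge is supplied directly (assumed keyword).
accuracy_eval = AccuracyEval(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    expected_output="Paris",
)

result = accuracy_eval.run()
print(result)
```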
Saving Results
Save evaluation results to files:
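A sketch that persists results with the standard library, assuming the result object is a dataclass; if the framework ships its own save helper, prefer that.

```python
import json
from dataclasses import asdict, is_dataclass

from praisonaiagents import Agent
from praisonaiagents.evals import AccuracyEval  # assumed import path

agent = Agent(instructions="You are a concise math assistant.")
result = AccuracyEval(agent=agent, input="What is 2 + 2?", expected_output="4").run()

# Serialize the result for later comparison (assumes a dataclass result type).
payload = asdict(result) if is_dataclass(result) else vars(result)
with open("accuracy_result.json", "w") as f:
    json.dump(payload, f, indent=2, default=str)
```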
CLI Usage
Accuracy Evaluation
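A hypothetical invocation: the `praisonai eval accuracy` subcommand is an assumption, while the flags come from the options reference below.

```bash
praisonai eval accuracy --agent agents.yaml \
  --input "What is 2 + 2?" \
  --expected "4" \
  --iterations 3 \
  --model gpt-4o-mini \
  --output accuracy_results.json
```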
Performance Evaluation
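Again hypothetical on the subcommand name; the flags match the performance options below.

```bash
praisonai eval performance --agent agents.yaml \
  --input "List three benefits of unit testing." \
  --iterations 5 --warmup 1 --memory
```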
Reliability Evaluation
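Hypothetical subcommand; the flags match the reliability options below.

```bash
praisonai eval reliability --agent agents.yaml \
  --input "What is the weather in Paris?" \
  --expected-tools get_weather \
  --forbidden-tools delete_file
```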
Criteria Evaluation
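Hypothetical subcommand; the flags match the criteria options below.

```bash
praisonai eval criteria --agent agents.yaml \
  --input "Explain what a REST API is." \
  --criteria "Includes a definition and one concrete example" \
  --scoring numeric --threshold 7.0
```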
Batch Evaluation
Run multiple test cases from a JSON file:
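A hypothetical batch run; both the subcommand and the shape of `test_cases.json` are assumptions (each entry would describe one test case's type, input, and expected result).

```bash
praisonai eval batch test_cases.json --agent agents.yaml \
  --output batch_results.json
```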
CLI Options Reference
Common Options
| Option | Short | Description |
|---|---|---|
| --agent | -a | Path to agents.yaml file |
| --output | -o | Output file for results |
| --verbose | -v | Enable verbose output |
| --quiet | -q | Suppress JSON output |
Accuracy Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --expected | -e | Expected output |
| --iterations | -n | Number of iterations |
| --model | -m | LLM model for judging |
Performance Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --iterations | -n | Number of benchmark iterations |
| --warmup | -w | Number of warmup runs |
| --memory | | Track memory usage |
Reliability Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --expected-tools | -t | Expected tools (comma-separated) |
| --forbidden-tools | -f | Forbidden tools (comma-separated) |
Criteria Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input text for the agent |
| --criteria | -c | Evaluation criteria |
| --scoring | -s | Scoring type (numeric/binary) |
| --threshold | | Pass threshold for numeric scoring |
| --iterations | -n | Number of iterations |
| --model | -m | LLM model for judging |
Result Data Structures
AccuracyResult
PerformanceResult
ReliabilityResult
CriteriaResult
LLM Judge in Interactive Tests
The interactive test runner integrates LLM-as-judge evaluation for automated response quality assessment. This allows you to validate not just tool calls and file outputs, but also the quality of agent responses.
Using Judge in CSV Tests
Add a `judge_rubric` column to your CSV test file:
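An illustrative row: the judge columns match the configuration table below, while `test_id` and `prompt` are placeholders for whatever columns your test file already uses.

```csv
test_id,prompt,judge_rubric,judge_threshold,judge_model
T01,"Explain what a REST API is","Response includes a definition and at least one concrete example",7.0,gpt-4o-mini
```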
Judge Configuration
| Option | Default | Description |
|---|---|---|
| judge_rubric | (empty) | Evaluation criteria for the judge |
| judge_threshold | 7.0 | Minimum score to pass (1-10 scale) |
| judge_model | gpt-4o-mini | Model used for evaluation |
CLI Options for Judge
Judge Output
When judge evaluation is enabled, results include:
- Score: 1-10 rating based on rubric
- Passed: Whether score meets threshold
- Reasoning: Detailed explanation of the score
Example judge result (`judge_result.json`):
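An illustrative result file; the exact field names are assumptions, but the three fields mirror the score, pass flag, and reasoning described above.

```json
{
  "score": 8.5,
  "passed": true,
  "reasoning": "The response defines a REST API, gives a concrete code example, and lists three benefits, satisfying the rubric."
}
```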
Writing Effective Rubrics
Good rubrics are:
- Specific: “Response includes code example” vs “Response is good”
- Measurable: “Explains at least 3 benefits” vs “Comprehensive”
- Relevant: Focus on what matters for the test case
Best Practices
- Use Multiple Iterations: Run evaluations multiple times for statistical significance
- Warmup Runs: Use warmup runs for performance benchmarks to avoid cold-start effects
- Save Results: Always save results for tracking and comparison
- Custom Criteria: Write specific, measurable criteria for criteria evaluations
- Batch Testing: Use batch evaluation for regression testing
- CI/CD Integration: Integrate evaluations into your CI/CD pipeline
Examples
See the examples directory for complete examples:
- Accuracy Evaluation
- Performance Evaluation
- Reliability Evaluation
- Criteria Evaluation
- Batch Evaluation
GitHub Advanced Test Rubrics
The `github-advanced` test suite uses specialized LLM judge rubrics for evaluating GitHub workflow quality:
Available Rubrics
| Rubric | Description | Key Criteria |
|---|---|---|
| PR Quality | Evaluates pull request quality | Title clarity, body completeness, issue reference, branch naming |
| Code Quality | Evaluates code changes | Correctness, tests pass, coverage, type hints, no regressions |
| Workflow Correctness | Evaluates GitHub workflow | Repo created, issue created, PR links issue |
| CI/CD Quality | Evaluates CI configuration | Valid YAML, checkout step, setup step, triggers |
| Documentation | Evaluates docs changes | Links valid, content accurate, formatting correct |
| Multi-Agent | Evaluates agent collaboration | Handoff, task completion, context preservation |
Rubric Structure
Each rubric contains weighted criteria:
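A sketch of what a weighted rubric might look like, using the PR Quality criteria from the table above; the actual structure and weights used by the test suite may differ.

```python
# Illustrative only; the real rubric format in the github-advanced suite may differ.
pr_quality_rubric = {
    "name": "PR Quality",
    "criteria": [
        {"description": "Title is clear and descriptive", "weight": 0.25},
        {"description": "Body explains the change completely", "weight": 0.35},
        {"description": "References the related issue", "weight": 0.20},
        {"description": "Branch name follows conventions", "weight": 0.20},
    ],
}
```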
Scenario to Rubric Mapping
| Scenario | Rubrics Applied |
|---|---|
| GH_01 | PR Quality, Code Quality, Workflow Correctness |
| GH_02 | PR Quality, CI/CD Quality, Workflow Correctness |
| GH_03 | PR Quality, Code Quality, Workflow Correctness |
| GH_04 | PR Quality, Documentation, Workflow Correctness |
| GH_05 | PR Quality, Multi-Agent, Workflow Correctness |

