Editing AI-Generated Evals
How to review and improve your generated test cases
Why you should review generated evals
`agentura generate` creates a strong starting point, but AI-generated cases usually need a short review. The two most common issues are expected values that are overly specific and expected values that are too vague to catch regressions.
Spend 5 minutes reviewing `evals/accuracy.jsonl` after generation. It pays off quickly.
Understanding the JSONL format
```jsonl
{"input": "how do I reset my password?", "expected": "password reset"}
{"input": "what payment methods do you accept?", "expected": "credit card"}
```

Each line is one test case: `input` is sent to your agent, and `expected` is what the response must contain to pass.
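Because a single malformed line silently shrinks your test suite, it is worth validating the file before a run. Here is a minimal sketch in plain Python (stdlib only); `load_cases` is a hypothetical helper for illustration, not part of agentura:

```python
import json

def load_cases(text: str) -> list[dict]:
    """Parse JSONL eval cases, failing loudly on malformed lines."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        case = json.loads(line)  # raises ValueError on invalid JSON
        for key in ("input", "expected"):
            if key not in case:
                raise ValueError(f"line {lineno}: missing '{key}' field")
        cases.append(case)
    return cases

sample = '{"input": "how do I reset my password?", "expected": "password reset"}\n'
print(len(load_cases(sample)))  # → 1
```

Failing loudly beats skipping bad lines: a dataset that quietly drops cases still reports green while testing less than you think.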
Scorer types and when to use each
| Scorer | How it works | Best for |
|---|---|---|
| exact_match | Response must equal expected exactly | Structured outputs, numbers, yes/no answers |
| fuzzy | Response must contain expected as a substring | Most use cases; the recommended default |
| semantic_similarity | Scores semantic closeness | When wording varies but meaning should match |
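The scoring behaviors in the table can be sketched as follows. This is an illustrative approximation, not agentura's actual implementation: the function names are assumptions, and `difflib.SequenceMatcher` stands in for a real embedding-based semantic scorer.

```python
from difflib import SequenceMatcher

def exact_match(response: str, expected: str) -> float:
    """Pass only if the response equals expected (ignoring surrounding whitespace)."""
    return 1.0 if response.strip() == expected.strip() else 0.0

def fuzzy(response: str, expected: str) -> float:
    """Pass if expected appears anywhere in the response, case-insensitively."""
    return 1.0 if expected.lower() in response.lower() else 0.0

def semantic_similarity(response: str, expected: str) -> float:
    """Stand-in: character-level similarity; a real scorer would compare embeddings."""
    return SequenceMatcher(None, response.lower(), expected.lower()).ratio()

print(fuzzy("You can reset it in Settings > Password.", "password"))  # → 1.0
print(exact_match("four", "4"))                                       # → 0.0
```

The example shows why `fuzzy` is a forgiving default: the agent's full sentence passes as long as the key term appears, while `exact_match` fails on any paraphrase.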
The default scorer is `fuzzy`. Change it in `agentura.yaml`:
```yaml
evals:
  - name: accuracy
    type: golden_dataset
    dataset: ./evals/accuracy.jsonl
    scorer: fuzzy  # or exact_match, semantic_similarity
    threshold: 0.8
```

Common edits to make
Expected value too rigid
Before: `{"input": "how do I cancel?", "expected": "to cancel your subscription, navigate to settings"}`

After: `{"input": "how do I cancel?", "expected": "cancel"}`

With fuzzy scoring, the expected value only needs to name the key concept; a full sentence rarely appears verbatim in the response and will fail on harmless rephrasing.
Expected value too vague
Before: `{"input": "what is 2+2", "expected": "number"}`

After: `{"input": "what is 2+2", "expected": "4"}`
Be specific enough to catch incorrect answers.
Add edge cases manually
Generated cases cover happy paths well but often miss weird inputs. Add a few:
```jsonl
{"input": "asdfjkl;", "expected": "I don't understand"}
{"input": "", "expected": "please provide a message"}
```
Editing the quality rubric
`evals/quality-rubric.md` defines what "good" means for `llm_judge`. Keep criteria specific and testable.
Threshold tuning
Threshold is the minimum passing score (0.0 to 1.0). Start with 0.8 and tune from real runs:
- Too many false failures: lower to 0.7
- Missing real regressions: raise to 0.9
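A quick way to see a threshold change's effect before editing `agentura.yaml` is to take the per-case scores from a run and compute the pass rate at each candidate threshold. A sketch with made-up scores for illustration:

```python
def pass_rate(scores: list[float], threshold: float) -> float:
    """Fraction of cases whose score meets or exceeds the threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Illustrative per-case scores; substitute the scores from your own run.
scores = [0.95, 0.88, 0.72, 0.65, 1.0, 0.81]

for t in (0.7, 0.8, 0.9):
    print(f"threshold {t}: {pass_rate(scores, t):.0%} pass")
```

If a small bump from 0.8 to 0.9 halves the pass rate, your scores are clustered near the boundary and the eval will be flaky; prefer a threshold sitting in a gap between clearly-good and clearly-bad scores.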
Running locally to validate
```shell
agentura run
```

Run this after editing to confirm scores match your expectations before pushing.