Agentura Docs

Editing AI-Generated Evals

How to review and improve your generated test cases

Why you should review generated evals

agentura generate creates a strong starting point, but AI-generated cases usually need a short review. The two most common issues are expected values so specific they break on harmless wording changes, and expected values so vague they pass even when the answer is wrong.

Spend 5 minutes reviewing evals/accuracy.jsonl after generation. It pays off quickly.

Understanding the JSONL format

```jsonl
{"input": "how do I reset my password?", "expected": "password reset"}
{"input": "what payment methods do you accept?", "expected": "credit card"}
```

Each line is one test case. input is sent to your agent; expected is the text the response must contain (with the default fuzzy scorer) for the case to pass.
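If you edit the file by hand, a quick validation pass catches malformed lines before a run. Here is a minimal sketch; load_cases is a hypothetical helper, not part of the agentura CLI:

```python
import json

# Hypothetical helper (not part of the agentura CLI): load an evals JSONL
# file and verify each line has the required "input" and "expected" keys.
def load_cases(path):
    cases = []
    with open(path) as f:
        for n, raw in enumerate(f, 1):
            raw = raw.strip()
            if not raw:
                continue  # skip blank lines
            case = json.loads(raw)  # raises on malformed JSON
            missing = {"input", "expected"} - case.keys()
            if missing:
                raise ValueError(f"line {n} missing keys: {missing}")
            cases.append(case)
    return cases
```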

Scorer types and when to use each

| Scorer | How it works | Best for |
|---|---|---|
| exact_match | Response must equal expected exactly | Structured outputs, numbers, yes/no answers |
| fuzzy | Response must contain expected as a substring | Most use cases; recommended default |
| semantic_similarity | Scores semantic closeness between response and expected | When wording varies but meaning should match |

The default scorer is fuzzy. Change it in agentura.yaml:

```yaml
evals:
  - name: accuracy
    type: golden_dataset
    dataset: ./evals/accuracy.jsonl
    scorer: fuzzy        # or exact_match, semantic_similarity
    threshold: 0.8
```
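To build intuition for how the scorers differ, here is a rough sketch of each. These are illustrative stand-ins, not Agentura's actual implementations; in particular, token_overlap is a crude proxy for what a real semantic_similarity scorer would do with embeddings:

```python
def exact_match(response, expected):
    # Score 1.0 only when the whole response equals the expected value.
    return 1.0 if response.strip() == expected.strip() else 0.0

def fuzzy(response, expected):
    # Score 1.0 when the expected value appears anywhere in the response.
    return 1.0 if expected.lower() in response.lower() else 0.0

def token_overlap(response, expected):
    # Crude stand-in for semantic similarity: Jaccard overlap of word sets.
    # A real scorer would compare embedding vectors instead.
    a, b = set(response.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0
```

For example, fuzzy("You can cancel from settings", "cancel") scores 1.0 and clears a 0.8 threshold, while exact_match on the same pair scores 0.0.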

Common edits to make

Expected value too rigid

Before: {"input": "how do I cancel?", "expected": "to cancel your subscription, navigate to settings"}
After: {"input": "how do I cancel?", "expected": "cancel"}

With fuzzy scoring, checking only that the key concept appears keeps the test from breaking on harmless rephrasing.

Expected value too vague

Before: {"input": "what is 2+2", "expected": "number"}
After: {"input": "what is 2+2", "expected": "4"}

Be specific enough to catch incorrect answers.
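One way to apply both rules at scale is a small lint pass over your cases. Everything below is a hypothetical sketch; the vague-word list is just an example you would tune for your domain:

```python
# Hypothetical lint pass over eval cases: flag expected values that look
# too rigid (long exact phrases) or too vague (generic single words).
VAGUE_WORDS = {"number", "answer", "response", "text", "yes", "no"}

def lint_case(case):
    expected = case["expected"]
    issues = []
    if len(expected.split()) > 6:
        issues.append("possibly too rigid: long exact phrase")
    if expected.lower() in VAGUE_WORDS:
        issues.append("possibly too vague: generic word")
    return issues
```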

Add edge cases manually

Generated cases cover happy paths well but often miss weird inputs. Add a few:

```jsonl
{"input": "asdfjkl;", "expected": "I don't understand"}
{"input": "", "expected": "please provide a message"}
```

Editing the quality rubric

evals/quality-rubric.md defines what "good" means for llm_judge. Keep criteria specific and testable.
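As a sketch of what "specific and testable" can look like, a rubric might read as follows (the criteria below are illustrative, not a template shipped with Agentura):

```markdown
# Quality rubric

A response passes if it:

1. Directly answers the user's question in the first sentence.
2. Mentions only features that exist in the product.
3. Stays under 150 words.

A response fails if it tells the user to "contact support" for a
question the docs already answer.
```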

Threshold tuning

Threshold is the minimum passing score (0.0 to 1.0). Start with 0.8 and tune from real runs:

  • Too many false failures: lower to 0.7
  • Missing real regressions: raise to 0.9
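A quick way to see how a threshold choice plays out is to replay the scores from a past run at several candidate thresholds. A sketch, using made-up example scores:

```python
def pass_rate(scores, threshold):
    # Fraction of cases whose score meets or exceeds the threshold.
    return sum(s >= threshold for s in scores) / len(scores)

run_scores = [0.95, 0.82, 0.74, 0.91, 0.66]  # made-up scores from one run
for t in (0.7, 0.8, 0.9):
    print(f"threshold {t}: {pass_rate(run_scores, t):.0%} of cases pass")
```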

Running locally to validate

```bash
agentura run
```

Run this after editing to confirm scores match your expectations before pushing.