Eval Strategies
Scorers and eval patterns for single-turn, tool-calling, and conversation workflows
Scorers
| Scorer | How it works | Best for |
|---|---|---|
| exact_match | String must match exactly | Structured outputs, codes, IDs |
| contains | Output must contain the expected string | Partial matches, key phrases |
| fuzzy_match | Token overlap score | Most general cases |
| semantic_similarity | Embedding cosine similarity | When wording can vary |
For semantic_similarity, the scorer auto-detects your LLM provider: set ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, or GROQ_API_KEY. If no key is set, it falls back to a local Ollama instance when one is running, and to fuzzy_match when no provider is available at all.
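For intuition, the token-overlap scoring behind fuzzy_match can be sketched as a Jaccard-style ratio over word tokens. This is one plausible formula, not necessarily the framework's exact implementation:

```python
def fuzzy_match(output: str, expected: str) -> float:
    """Score by Jaccard overlap of lowercase word tokens (0.0-1.0).

    Illustrative only: the framework's actual tokenization and
    overlap formula may differ.
    """
    out_tokens = set(output.lower().split())
    exp_tokens = set(expected.lower().split())
    if not out_tokens and not exp_tokens:
        return 1.0
    if not out_tokens or not exp_tokens:
        return 0.0
    return len(out_tokens & exp_tokens) / len(out_tokens | exp_tokens)
```

Under this formula, an exact one-word answer scores 1.0, while a verbose answer containing the expected word scores lower — which is why fuzzy_match pairs well with a threshold like 0.8 rather than 1.0.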
golden_dataset
BEST FOR
Known-answer tasks like factual QA, classification, and structured output checks.
Runs your input/expected pairs and scores each response with your selected scorer. Use fuzzy_match for most general cases, exact_match for strict outputs, contains for key phrases, and semantic_similarity when wording can vary.
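A minimal sketch of what a golden_dataset run does, assuming a JSONL file of input/expected pairs, an agent callable, and a scorer callable. The runner below is hypothetical, not the framework's API:

```python
import json

def run_golden_dataset(path: str, agent, scorer, threshold: float) -> bool:
    """Run each input/expected pair, score the responses, and compare
    the mean score against the suite threshold (hypothetical runner)."""
    scores = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            response = agent(case["input"])  # your agent under test
            scores.append(scorer(response, case["expected"]))
    return sum(scores) / len(scores) >= threshold
```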
Example config
- name: accuracy
  type: golden_dataset
  dataset: ./evals/accuracy.jsonl
  scorer: fuzzy_match
  threshold: 0.8
Example dataset
{"input": "what is 2+2", "expected": "4"}
{"input": "what is the capital of France", "expected": "Paris"}
llm_judge
BEST FOR
Subjective quality checks where there is no single exact answer (tone, helpfulness, completeness).
Uses an LLM judge to score outputs against your rubric on a 0.0–1.0 scale. This is ideal for evaluating behavior and communication quality.
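To make the judge flow concrete, here is a sketch of how a judge prompt might be assembled and its reply parsed into a 0.0–1.0 score. The prompt template and parsing logic are assumptions for illustration, not the framework's internals:

```python
import re

def build_judge_prompt(rubric: str, user_input: str, output: str) -> str:
    """Assemble a judge prompt from the rubric and the case under test
    (hypothetical template, not the framework's exact format)."""
    return (
        f"{rubric}\n\n"
        f"Input: {user_input}\n"
        f"Output: {output}\n\n"
        "Reply with a single score between 0.0 and 1.0."
    )

def parse_judge_score(reply: str) -> float:
    """Extract the first number from the judge's reply, clamped to 0.0-1.0;
    an unparseable reply scores 0.0."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return 0.0
    return min(max(float(match.group()), 0.0), 1.0)
```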
Example config
- name: quality
  type: llm_judge
  dataset: ./evals/quality.jsonl
  rubric: ./evals/quality-rubric.md
  threshold: 0.7
Example rubric
# Quality Rubric
Score 1.0 if the answer is correct and concise.
Score 0.5 if the answer is correct but verbose.
Score 0.0 if the answer is incorrect.
performance
BEST FOR
Latency budgets and speed regression detection when response time and cost ceilings matter.
Measures per-case latency and cost, then compares suite-level results against your thresholds. Performance suites catch slowdowns and cost spikes before users feel them.
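The threshold comparison can be sketched as below, using the common nearest-rank convention for p95; the framework's exact percentile method and pass criteria are assumptions:

```python
import math

def p95_ms(latencies_ms: list[float]) -> float:
    """95th percentile via nearest-rank (one common convention;
    the framework's percentile method may differ)."""
    ranked = sorted(latencies_ms)
    index = max(0, math.ceil(0.95 * len(ranked)) - 1)
    return ranked[index]

def performance_passes(latencies_ms, costs_usd,
                       max_p95_ms, max_cost_per_call_usd) -> bool:
    """Suite passes only if both the latency and cost ceilings hold."""
    avg_cost = sum(costs_usd) / len(costs_usd)
    return p95_ms(latencies_ms) <= max_p95_ms and avg_cost <= max_cost_per_call_usd
```

p95 is used rather than the mean so that one slow outlier in twenty calls still fails the suite, which is exactly the kind of regression users notice.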
Example config
- name: speed
  type: performance
  dataset: ./evals/accuracy.jsonl
  max_p95_ms: 3000
  max_cost_per_call_usd: 0.01
  threshold: 0.8
tool_use
BEST FOR
Agents that call tools or functions — validates that the right tool was called with the right arguments.
Checks that your agent calls the expected tool, passes the expected arguments, and returns the expected output. Useful for agents that route to different tools based on user intent.
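A sketch of how a single tool_use case might be scored, assuming a pass requires the expected tool name plus every expected argument, with extra arguments ignored. That strictness policy is an assumption, not the framework's documented behavior:

```python
def score_tool_case(case: dict, actual_tool: str, actual_args: dict) -> float:
    """Hypothetical scorer: 1.0 only if the tool name matches and every
    expected argument is present with the expected value."""
    if actual_tool != case["expected_tool"]:
        return 0.0
    for key, value in case.get("expected_args", {}).items():
        if actual_args.get(key) != value:
            return 0.0
    # Extra arguments the agent passes are ignored here (an assumption).
    return 1.0
```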
Example config
- name: tool-routing
  type: tool_use
  dataset: ./evals/tools.jsonl
  threshold: 0.9
Example dataset
{"input": "book a flight to Paris", "expected_tool": "search_flights", "expected_args": {"destination": "Paris"}}
{"input": "what is the weather in London", "expected_tool": "get_weather", "expected_args": {"city": "London"}}
Multi-turn conversation eval
BEST FOR
Conversational agents where behavior across multiple turns matters — not just single responses.
Most eval tools only test single questions. Agentura tests whether your agent behaves consistently across a full conversation, including whether it remembers context from earlier turns.
Example dataset
{
  "conversation": [
    {"role": "user", "content": "I am on the Pro plan, what storage do I get?"},
    {"role": "assistant", "expected": "Pro plan includes 100GB storage"},
    {"role": "user", "content": "Can I upgrade individual team members?"},
    {"role": "assistant", "expected": "Yes, you can manage seats in Settings > Team"}
  ],
  "eval_turns": [2, 4]
}
eval_turns specifies which assistant turns to score. Turns not listed are used as conversation context but not scored.
Example config
Multi-turn cases work with golden_dataset — no special config needed. Just include conversation-format entries in your dataset file alongside regular single-turn entries.
- name: conversation
  type: golden_dataset
  dataset: ./evals/conversation.jsonl
  scorer: semantic_similarity
  threshold: 0.8
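The multi-turn flow can be sketched as a replay loop that feeds the growing history to the agent and scores only the turns listed in eval_turns. The 1-based turn indexing is inferred from the example above, and the agent/scorer callables are hypothetical:

```python
def score_conversation(case: dict, agent, scorer) -> float:
    """Replay a conversation case turn by turn, scoring only the assistant
    turns whose 1-based position appears in eval_turns (indexing inferred
    from the example dataset; the framework's convention may differ)."""
    history, scores = [], []
    for position, turn in enumerate(case["conversation"], start=1):
        if turn["role"] == "user":
            history.append(turn)
        else:
            response = agent(history)  # agent sees the conversation so far
            history.append({"role": "assistant", "content": response})
            if position in case["eval_turns"]:
                scores.append(scorer(response, turn["expected"]))
    return sum(scores) / len(scores) if scores else 0.0
```

Because unscored turns still enter the history, the agent's memory of earlier context is exercised even on turns that are only used as setup.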