Eval Strategies
Scorers and eval patterns for single-turn, tool-calling, and conversation workflows
Scorers
| Scorer | How it works | Best for |
|---|---|---|
| exact_match | String must match exactly | Structured outputs, codes, IDs |
| contains | Output must contain the expected string | Partial matches, key phrases |
| fuzzy_match | Token overlap score | Most general cases |
| semantic_similarity | Embedding cosine similarity | When wording can vary |
For semantic_similarity, the scorer auto-detects your LLM provider: set ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, or GROQ_API_KEY. If no key is set, it falls back to a local Ollama instance when one is running, and to fuzzy_match when no provider is available at all.
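For intuition, the token-overlap scoring behind fuzzy_match can be sketched as a Jaccard-style ratio over word tokens. This is one plausible formula, not necessarily the framework's exact implementation:

```python
def fuzzy_match(output: str, expected: str) -> float:
    """Score by Jaccard overlap of lowercase word tokens (0.0-1.0).

    Illustrative only: the framework's actual tokenization and
    overlap formula may differ.
    """
    out_tokens = set(output.lower().split())
    exp_tokens = set(expected.lower().split())
    if not out_tokens and not exp_tokens:
        return 1.0
    if not out_tokens or not exp_tokens:
        return 0.0
    return len(out_tokens & exp_tokens) / len(out_tokens | exp_tokens)
```

Under this formula, an exact one-word answer scores 1.0, while a verbose answer containing the expected word scores lower — which is why fuzzy_match pairs well with a threshold like 0.8 rather than 1.0.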
golden_dataset
BEST FOR
Known-answer tasks like factual QA, classification, and structured output checks.
Runs your input/expected pairs and scores each response with your selected scorer. Use fuzzy_match for most general cases, exact_match for strict outputs, contains for key phrases, and semantic_similarity when wording can vary.
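A minimal sketch of what a golden_dataset run does, assuming a JSONL file of input/expected pairs, an agent callable, and a scorer callable. The runner below is hypothetical, not the framework's API:

```python
import json

def run_golden_dataset(path: str, agent, scorer, threshold: float) -> bool:
    """Run each input/expected pair, score the responses, and compare
    the mean score against the suite threshold (hypothetical runner)."""
    scores = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            response = agent(case["input"])  # your agent under test
            scores.append(scorer(response, case["expected"]))
    return sum(scores) / len(scores) >= threshold
```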
Example config
- name: accuracy
  type: golden_dataset
  dataset: ./evals/accuracy.jsonl
  scorer: fuzzy_match
  threshold: 0.8
Example dataset
{"input": "what is 2+2", "expected": "4"}
{"input": "what is the capital of France", "expected": "Paris"}
llm_judge
BEST FOR
Subjective quality checks where there is no single exact answer (tone, helpfulness, completeness).
Uses an LLM judge to score outputs against your rubric on a 0.0–1.0 scale. This is ideal for evaluating behavior and communication quality.
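To make the judge flow concrete, here is a sketch of how a judge prompt might be assembled and its reply parsed into a 0.0–1.0 score. The prompt template and parsing logic are assumptions for illustration, not the framework's internals:

```python
import re

def build_judge_prompt(rubric: str, user_input: str, output: str) -> str:
    """Assemble a judge prompt from the rubric and the case under test
    (hypothetical template, not the framework's exact format)."""
    return (
        f"{rubric}\n\n"
        f"Input: {user_input}\n"
        f"Output: {output}\n\n"
        "Reply with a single score between 0.0 and 1.0."
    )

def parse_judge_score(reply: str) -> float:
    """Extract the first number from the judge's reply, clamped to 0.0-1.0;
    an unparseable reply scores 0.0."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return 0.0
    return min(max(float(match.group()), 0.0), 1.0)
```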
Example config
- name: quality
  type: llm_judge
  dataset: ./evals/quality.jsonl
  rubric: ./evals/quality-rubric.md
  threshold: 0.7
Example rubric
# Quality Rubric
Score 1.0 if the answer is correct and concise.
Score 0.5 if the answer is correct but verbose.
Score 0.0 if the answer is incorrect.
performance
BEST FOR
Latency budgets and speed regression detection when response time and cost ceilings matter.
Measures per-case latency and cost, then compares suite-level results against your thresholds. Performance suites catch slowdowns and cost spikes before users feel them.
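The threshold comparison can be sketched as below, using the common nearest-rank convention for p95; the framework's exact percentile method and pass criteria are assumptions:

```python
import math

def p95_ms(latencies_ms: list[float]) -> float:
    """95th percentile via nearest-rank (one common convention;
    the framework's percentile method may differ)."""
    ranked = sorted(latencies_ms)
    index = max(0, math.ceil(0.95 * len(ranked)) - 1)
    return ranked[index]

def performance_passes(latencies_ms, costs_usd,
                       max_p95_ms, max_cost_per_call_usd) -> bool:
    """Suite passes only if both the latency and cost ceilings hold."""
    avg_cost = sum(costs_usd) / len(costs_usd)
    return p95_ms(latencies_ms) <= max_p95_ms and avg_cost <= max_cost_per_call_usd
```

p95 is used rather than the mean so that one slow outlier in twenty calls still fails the suite, which is exactly the kind of regression users notice.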
Example config
- name: speed
  type: performance
  dataset: ./evals/accuracy.jsonl
  max_p95_ms: 3000
  max_cost_per_call_usd: 0.01
  threshold: 0.8
tool_use
BEST FOR
Agents that call tools or functions — validates that the right tool was called with the right arguments.
Checks that your agent calls the expected tool, passes the expected arguments, and returns the expected output. Useful for agents that route to different tools based on user intent.
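A sketch of how a single tool_use case might be scored, assuming a pass requires the expected tool name plus every expected argument, with extra arguments ignored. That strictness policy is an assumption, not the framework's documented behavior:

```python
def score_tool_case(case: dict, actual_tool: str, actual_args: dict) -> float:
    """Hypothetical scorer: 1.0 only if the tool name matches and every
    expected argument is present with the expected value."""
    if actual_tool != case["expected_tool"]:
        return 0.0
    for key, value in case.get("expected_args", {}).items():
        if actual_args.get(key) != value:
            return 0.0
    # Extra arguments the agent passes are ignored here (an assumption).
    return 1.0
```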
Example config
- name: tool-routing
  type: tool_use
  dataset: ./evals/tools.jsonl
  threshold: 0.9
Example dataset
{"input": "book a flight to Paris", "expected_tool": "search_flights", "expected_args": {"destination": "Paris"}}
{"input": "what is the weather in London", "expected_tool": "get_weather", "expected_args": {"city": "London"}}
Multi-turn conversation eval
BEST FOR
Conversational agents where behavior across multiple turns matters — not just single responses.
Most eval tools only test single questions. Agentura tests whether your agent behaves consistently across a full conversation, including whether it remembers context from earlier turns.
Example dataset
{
  "conversation": [
    {"role": "user", "content": "I am on the Pro plan, what storage do I get?"},
    {"role": "assistant", "expected": "Pro plan includes 100GB storage"},
    {"role": "user", "content": "Can I upgrade individual team members?"},
    {"role": "assistant", "expected": "Yes, you can manage seats in Settings > Team"}
  ],
  "eval_turns": [2, 4]
}
eval_turns specifies which assistant turns to score. Turns not listed are used as conversation context but not scored.
Example config
Multi-turn cases work with golden_dataset — no special config needed. Just include conversation-format entries in your dataset file alongside regular single-turn entries.
- name: conversation
  type: golden_dataset
  dataset: ./evals/conversation.jsonl
  scorer: semantic_similarity
  threshold: 0.8
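The multi-turn flow can be sketched as a replay loop that feeds the growing history to the agent and scores only the turns listed in eval_turns. The 1-based turn indexing is inferred from the example above, and the agent/scorer callables are hypothetical:

```python
def score_conversation(case: dict, agent, scorer) -> float:
    """Replay a conversation case turn by turn, scoring only the assistant
    turns whose 1-based position appears in eval_turns (indexing inferred
    from the example dataset; the framework's convention may differ)."""
    history, scores = [], []
    for position, turn in enumerate(case["conversation"], start=1):
        if turn["role"] == "user":
            history.append(turn)
        else:
            response = agent(history)  # agent sees the conversation so far
            history.append({"role": "assistant", "content": response})
            if position in case["eval_turns"]:
                scores.append(scorer(response, turn["expected"]))
    return sum(scores) / len(scores) if scores else 0.0
```

Because unscored turns still enter the history, the agent's memory of earlier context is exercised even on turns that are only used as setup.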