Agentura Docs

agentura.yaml Reference

All configuration options explained

version

versionnumberRequired

Configuration schema version. Must be set to 1.

version: 1

agent

agent.typehttp | cli | sdkRequired

Agent type. Most teams use http.

type: http
agent.endpointstringRequired for http

HTTP endpoint that accepts { input } and returns { output }.

endpoint: https://your-agent.example.com/api/agent
agent.timeout_msnumberOptional

Per-case timeout in milliseconds. Default is 10000.

timeout_ms: 10000
agent.headersmap<string,string>Optional

Optional headers sent with each request to your agent endpoint.

headers:
  Authorization: Bearer your-token

evals

evals[].namestringRequired

Unique suite name.

name: accuracy
evals[].typegolden_dataset | llm_judge | performanceRequired

Suite scoring strategy.

type: golden_dataset
evals[].datasetstringRequired

Path to JSONL test cases relative to repo root.

dataset: ./evals/accuracy.jsonl
evals[].scorerfuzzy | exact_match | semantic_similarity | containsOptional (golden_dataset)

Golden dataset scorer. Recommended default is fuzzy.

scorer: fuzzy
evals[].rubricstringRequired for llm_judge

Path to rubric markdown file used by judge scoring.

rubric: ./evals/quality-rubric.md
evals[].latency_threshold_msnumberRequired for performance

Maximum acceptable latency per case in milliseconds.

latency_threshold_ms: 5000
evals[].thresholdnumber (0-1)Optional

Minimum score required for suite pass. Typical value is 0.8.

threshold: 0.8

ci

ci.block_on_regressionbooleanOptional

If true, PR checks fail when regression is detected.

block_on_regression: false
ci.regression_thresholdnumber (0-1)Optional (default 0.05)

Allowed score drop before counting as regression.

regression_threshold: 0.05
ci.compare_tostringOptional (default main)

Baseline branch for comparisons.

compare_to: main
ci.post_commentbooleanOptional (default true)

Post and update PR comments with suite results.

post_comment: true
ci.fail_on_new_suitebooleanOptional (default false)

If true, fail checks when a new suite has no baseline yet.

fail_on_new_suite: false

Complete example

yaml
version: 1

agent:
  type: http
  endpoint: https://your-agent.example.com/api/agent
  timeout_ms: 10000

evals:
  - name: accuracy
    type: golden_dataset
    dataset: ./evals/accuracy.jsonl
    scorer: fuzzy
    threshold: 0.8

  - name: quality
    type: llm_judge
    dataset: ./evals/quality.jsonl
    rubric: ./evals/quality-rubric.md
    threshold: 0.7

  - name: speed
    type: performance
    dataset: ./evals/accuracy.jsonl
    latency_threshold_ms: 5000
    threshold: 0.8

ci:
  block_on_regression: false
  compare_to: main
  post_comment: true