agentura.yaml Reference
All configuration options explained
version
versionnumberRequiredConfiguration schema version. Must be set to 1.
version: 1agent
agent.typehttp | cli | sdkRequiredAgent type. Most teams use http.
type: httpagent.endpointstringRequired for httpHTTP endpoint that accepts { input } and returns { output }.
endpoint: https://your-agent.example.com/api/agentagent.timeout_msnumberOptionalPer-case timeout in milliseconds. Default is 10000.
timeout_ms: 10000agent.headersmap<string,string>OptionalOptional headers sent with each request to your agent endpoint.
headers:
Authorization: Bearer your-tokenevals
evals[].namestringRequiredUnique suite name.
name: accuracyevals[].typegolden_dataset | llm_judge | performanceRequiredSuite scoring strategy.
type: golden_datasetevals[].datasetstringRequiredPath to JSONL test cases relative to repo root.
dataset: ./evals/accuracy.jsonlevals[].scorerfuzzy | exact_match | semantic_similarity | containsOptional (golden_dataset)Golden dataset scorer. Recommended default is fuzzy.
scorer: fuzzyevals[].rubricstringRequired for llm_judgePath to rubric markdown file used by judge scoring.
rubric: ./evals/quality-rubric.mdevals[].latency_threshold_msnumberRequired for performanceMaximum acceptable latency per case in milliseconds.
latency_threshold_ms: 5000evals[].thresholdnumber (0-1)OptionalMinimum score required for suite pass. Typical value is 0.8.
threshold: 0.8ci
ci.block_on_regressionbooleanOptionalIf true, PR checks fail when regression is detected.
block_on_regression: falseci.regression_thresholdnumber (0-1)Optional (default 0.05)Allowed score drop before counting as regression.
regression_threshold: 0.05ci.compare_tostringOptional (default main)Baseline branch for comparisons.
compare_to: mainci.post_commentbooleanOptional (default true)Post and update PR comments with suite results.
post_comment: trueci.fail_on_new_suitebooleanOptional (default false)If true, fail checks when a new suite has no baseline yet.
fail_on_new_suite: falseComplete example
version: 1
agent:
type: http
endpoint: https://your-agent.example.com/api/agent
timeout_ms: 10000
evals:
- name: accuracy
type: golden_dataset
dataset: ./evals/accuracy.jsonl
scorer: fuzzy
threshold: 0.8
- name: quality
type: llm_judge
dataset: ./evals/quality.jsonl
rubric: ./evals/quality-rubric.md
threshold: 0.7
- name: speed
type: performance
dataset: ./evals/accuracy.jsonl
latency_threshold_ms: 5000
threshold: 0.8
ci:
block_on_regression: false
compare_to: main
post_comment: true