✓ accuracy / golden_dataset   0.98  +0.01
✕ behavior / llm_judge        0.73  -0.14
✓ latency / performance       0.95  +0.02
✓ cost / budget_guard         0.92  +0.04
✓ safety / policy_guard       0.99  +0.00
Agentura tests your agent on every pull request and tells you what broke before you merge.
Like pytest, but for AI agents.
WHY THIS EXISTS
A tone adjustment that passes review can silently change how edge cases are handled.
Model providers update their models without notice. Outputs change.
Without a log, there's no way to know what changed between a passing eval and a failing one.
THREE STEPS
Initialize
$ bunx agentura init
Generates agentura.yaml and stores a baseline snapshot from your main branch.
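The schema of the generated agentura.yaml is Agentura's own; purely as a hypothetical illustration, a config of this shape could declare the suites and thresholds shown in the dashboard above:

```yaml
# Hypothetical sketch only — not Agentura's documented schema.
suites:
  accuracy:
    eval: golden_dataset
    threshold: 0.85
  behavior:
    eval: llm_judge
    threshold: 0.80
baseline:
  branch: main
```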
Gate every PR
$ agentura run --against main
↓ behavior  19/26  0.73  -0.14  regression
→ Merge blocked: behavior suite below threshold
Every pull request is scored against baseline. Regressions block the merge automatically.
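Under the hood, gating amounts to comparing branch scores against a baseline and failing when a suite drops below its threshold or regresses past a tolerance. A minimal sketch of that logic (names and values are illustrative, not Agentura's actual API):

```python
# Illustrative gating logic: block the merge when a suite falls below its
# absolute threshold or regresses more than MAX_REGRESSION from baseline.
BASELINE = {"accuracy": 0.91, "behavior": 0.87, "latency": 0.93}
THRESHOLDS = {"accuracy": 0.85, "behavior": 0.80, "latency": 0.90}
MAX_REGRESSION = 0.05  # allowed drop from baseline per suite

def gate(branch_scores: dict[str, float]) -> list[str]:
    """Return the suites that should block the merge."""
    blocked = []
    for suite, score in branch_scores.items():
        delta = score - BASELINE[suite]
        if score < THRESHOLDS[suite] or delta < -MAX_REGRESSION:
            blocked.append(suite)
    return blocked

print(gate({"accuracy": 0.92, "behavior": 0.73, "latency": 0.95}))
# → ['behavior']  (0.73 is -0.14 vs baseline and under the 0.80 threshold)
```

The point of keeping both checks is that a suite can sit above its absolute floor yet still regress sharply; either condition alone misses one class of failure.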
Generate audit report
$ agentura report
Generated audit_2026-03-28.pdf
Eval history · Drift log · Policy decisions
Auto-generated audit trail with full provenance. Ready for compliance review.
A GitHub Action runs your tests. Agentura is the tests.
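Wiring the gate into CI could look like the following sketch; the workflow is hypothetical, using only the `agentura run` command shown above:

```yaml
# Hypothetical workflow sketch — job and step names are illustrative.
name: agentura-gate
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: bunx agentura run --against main  # non-zero exit blocks the merge
```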
Agentura also monitors behavioral drift over time against a frozen reference snapshot, not just PR-to-PR regression.
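Drift monitoring differs from PR gating in one way: the comparison point is frozen, so slow drift across many individually passing changes is still visible. A minimal sketch of the idea (names and numbers are illustrative):

```python
# Illustrative drift check: average each suite over recent runs and compare
# to a frozen reference snapshot rather than the previous run, so gradual
# decay across many small, individually passing changes still shows up.
from statistics import mean

FROZEN_REFERENCE = {"behavior": 0.87, "safety": 0.99}  # captured once, never moves

def drift_report(runs: list[dict[str, float]]) -> dict[str, float]:
    """Per-suite drift: mean of recent runs minus the frozen reference."""
    return {
        suite: mean(run[suite] for run in runs) - FROZEN_REFERENCE[suite]
        for suite in FROZEN_REFERENCE
    }

recent = [
    {"behavior": 0.84, "safety": 0.99},
    {"behavior": 0.81, "safety": 0.98},
]
print({k: round(v, 3) for k, v in drift_report(recent).items()})
# → {'behavior': -0.045, 'safety': -0.005}
```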
LIVE DEMO
Run a baseline vs branch comparison in your browser. No install. No account. Live eval results.
Open Playground →
SEE IT IN ACTION
You made the tone friendlier. Policy refusals dropped 24%. Nobody noticed for two weeks.
| Metric | Baseline | Branch | Delta | Gate |
|---|---|---|---|---|
| Accuracy | 0.91 | 0.67 | -0.24 | BLOCK |
| Policy fidelity | 0.88 | 0.64 | -0.24 | BLOCK |
| Latency (p95) | 842ms | 902ms | +60ms | PASS |
OPEN SOURCE
MIT License · Self-host in minutes · Own your eval data