Make sure your AI agent still works after every change.

Agentura tests your agent on every pull request and tells you what broke before you merge.

Like pytest, but for AI agents.

Catch regressions in accuracy, safety, cost, and guardrails.

What a pull request looks like with Agentura installed.

accuracy / golden_dataset  0.98  +0.01
behavior / llm_judge       0.73  -0.14
latency / performance      0.95  +0.02
cost / budget_guard        0.92  +0.04
safety / policy_guard      0.99  +0.00

1 critical regression detected. Merge gate active.

See what changed, what failed, and whether merge should be blocked.

Standard tests miss agent behavior changes.

A prompt change shifts behavior downstream

A tone adjustment that passes review can silently change how edge cases are handled.

Your provider updated the model

Model providers update their models without notice. Outputs change.

No record of what changed or when

Without a log, there's no way to know what changed between a passing eval and a failing one.

Set a baseline. Test every change. Ship with evidence.

Initialize

$ bunx agentura init

Generate agentura.yaml and store a baseline snapshot from your main branch.

Gate every PR

$ agentura run --against main
↓ behavior  19/26  0.73  -0.14  regression
→ Merge blocked: behavior suite below threshold

Every pull request is scored against baseline. Regressions block the merge automatically.
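
The gating step can be sketched in a few lines of Python. This is an illustration only, not Agentura's implementation: the names `MetricResult` and `gate`, the `max_drop` tolerance of 0.10, and the data layout are assumptions made for the example. The scores are the ones shown in the playground result on this page.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """One metric scored on the baseline and on the branch (hypothetical layout)."""
    name: str
    baseline: float
    branch: float

    @property
    def delta(self) -> float:
        # Rounded to avoid float noise such as -0.22999999999999998.
        return round(self.branch - self.baseline, 4)


def gate(results, max_drop=0.10):
    """Return (blocked, verdicts).

    A metric is marked BLOCK when it falls more than `max_drop` below baseline.
    The 0.10 tolerance is an assumed value, not a documented default.
    """
    verdicts = {
        r.name: "BLOCK" if r.delta < -max_drop else "PASS"
        for r in results
    }
    blocked = any(v == "BLOCK" for v in verdicts.values())
    return blocked, verdicts


results = [
    MetricResult("accuracy", 0.94, 0.71),
    MetricResult("tone", 0.87, 0.82),
    MetricResult("policy", 0.92, 0.78),
]
blocked, verdicts = gate(results)
for r in results:
    print(f"{r.name:<10} {r.baseline:.2f} -> {r.branch:.2f}  {r.delta:+.2f}  {verdicts[r.name]}")
print("MERGE BLOCKED" if blocked else "MERGE ALLOWED")
```

A single tolerance for every metric keeps the sketch short. A real gate would likely use per-suite thresholds.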

Generate audit report

$ agentura report
Generated audit_2026-03-28.pdf
  Eval history · Drift log · Policy decisions

Auto-generated audit trail with full provenance. Ready for compliance review.

A GitHub Action runs your tests. Agentura is the tests.

Agentura also monitors behavioral drift over time against a frozen reference snapshot, not just PR-to-PR regression.
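
To see why a frozen reference matters, consider a score that slips a little on every run. The sketch below is a hypothetical illustration, not Agentura's algorithm: the function names, the 0.05 tolerance, and the score history are all made up for the example. Each run passes a comparison against the previous run, while a comparison against the frozen reference flags the accumulated loss.

```python
def pr_to_pr_regressions(scores, max_drop=0.05):
    """Indices of runs that dropped more than `max_drop` versus the previous run."""
    return [
        i for i in range(1, len(scores))
        if scores[i - 1] - scores[i] > max_drop
    ]


def drift_from_reference(scores, reference, max_drop=0.05):
    """Indices of runs that sit more than `max_drop` below the frozen reference."""
    return [i for i, s in enumerate(scores) if reference - s > max_drop]


# Hypothetical history: each run loses 0.02 against the one before it.
history = [0.90, 0.88, 0.86, 0.84, 0.82]
frozen_reference = 0.90  # assumed snapshot score, fixed at a point in time

step_flags = pr_to_pr_regressions(history)
drift_flags = drift_from_reference(history, frozen_reference)

print("PR-to-PR regressions at runs:", step_flags)
print("Drift vs frozen reference at runs:", drift_flags)
```

No single step exceeds the tolerance, so the run-to-run check stays silent. The reference check flags the last two runs, where the total loss has grown past 0.05.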

agentura quorum: consensus across model families. Independent error distributions.
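
The intuition behind consensus across independent judges can be sketched as a majority vote. This page does not describe how `agentura quorum` works internally, so everything here is an assumption for illustration: the function names, the family labels, the two-vote quorum, and the 10% per-judge error rate. If judges err independently, a wrong majority is much less likely than a wrong single judge.

```python
from collections import Counter
from math import comb


def quorum_verdict(votes, min_agree=2):
    """Return the verdict shared by at least `min_agree` judges, else None."""
    verdict, count = Counter(votes.values()).most_common(1)[0]
    return verdict if count >= min_agree else None


def majority_error_rate(n_judges, p_err):
    """Probability that a strict majority of independent judges is wrong.

    Assumes every judge errs independently with the same probability `p_err`.
    """
    k_min = n_judges // 2 + 1
    return sum(
        comb(n_judges, k) * p_err**k * (1 - p_err) ** (n_judges - k)
        for k in range(k_min, n_judges + 1)
    )


# Hypothetical votes from judges in three different model families.
votes = {"family_a": "pass", "family_b": "pass", "family_c": "fail"}
consensus = quorum_verdict(votes)
single = 0.10
combined = majority_error_rate(3, single)

print("consensus:", consensus)
print(f"single judge error: {single:.3f}, majority-of-3 error: {combined:.3f}")
```

With three judges at a 10% error rate each, the majority is wrong with probability 3(0.1)²(0.9) + (0.1)³ = 0.028. The gain shrinks if the judges' errors are correlated, which is the argument for drawing them from different model families.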

See a regression get caught.

Run a baseline vs branch comparison in your browser. No install. No account. Live eval results.

Open Playground →

AGENTURA PLAYGROUND · RESULT

Branch change: Model swap (70B → 8B)
Suite: golden_dataset
Accuracy  0.94 → 0.71  -0.23  BLOCK
Tone      0.87 → 0.82  -0.05  PASS
Policy    0.92 → 0.78  -0.14  BLOCK
MERGE BLOCKED

Live results from the playground ↗

Four ways agents break in production.

You made the tone friendlier. Policy refusals dropped 24%. Nobody noticed for two weeks.

Metric           Baseline  Branch  Delta  Gate
Accuracy         0.91      0.67    -0.24  BLOCK
Policy fidelity  0.88      0.64    -0.24  BLOCK
Latency (p95)    842ms     902ms   +60ms  PASS

Free to use. Free to self-host.

$ bunx agentura init # generate config + baseline
$ git checkout -b my-branch
$ git push # eval checks run automatically on PR

MIT License · Self-host in minutes · Own your eval data