Make sure your AI agent still works
after every change.

Agentura tests your agent on every pull request and tells you what broke before you merge.

Like pytest, but for AI agents.

Try the Playground → · View on GitHub

WHY THIS EXISTS

Standard tests miss agent behavior changes.

Prompt change shifts behavior downstream

A tone adjustment that passes review can silently change how edge cases are handled.

Anthropic, OpenAI, or another provider updated the model

Model providers update their models without notice. Outputs change.

No record of what changed or when

Without a log, there's no way to know what changed between a passing eval and a failing one.

THREE STEPS

Set a baseline. Test every change. Ship with evidence.

Initialize

$ bunx agentura init

Generate agentura.yaml and store a baseline snapshot from your main branch.

Gate every PR

$ agentura run --against main

↓ behavior  19/26  0.73  -0.14  regression
→ Merge blocked: behavior suite below threshold

Every pull request is scored against baseline. Regressions block the merge automatically.
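The gate decision in the sample output can be sketched roughly as follows. This is an illustration only, not Agentura's actual implementation: the `SuiteResult` type, the `gate` function, and the 0.80 threshold are assumptions made for the example. Only the 19/26 pass count and the -0.14 delta come from the output above, and the 0.87 baseline is inferred from those two figures.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Outcome of one eval suite run (hypothetical structure)."""
    name: str
    passed: int
    total: int

    @property
    def score(self) -> float:
        return self.passed / self.total


def gate(branch: SuiteResult, baseline_score: float, threshold: float) -> dict:
    """Compare a branch run against the stored baseline score.

    The merge is blocked when the branch score falls below `threshold`.
    """
    delta = branch.score - baseline_score
    return {
        "suite": branch.name,
        "score": round(branch.score, 2),
        "delta": round(delta, 2),
        "blocked": branch.score < threshold,
    }


# 19 of 26 cases pass; baseline 0.87 is inferred, threshold 0.80 is assumed.
result = gate(SuiteResult("behavior", 19, 26), baseline_score=0.87, threshold=0.80)
print(result)
```

In this sketch the block decision depends on the absolute threshold, while the delta is reported so a reviewer can see how far the branch moved from baseline.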

Generate audit report

$ agentura report

Generated audit_2026-03-28.pdf
  Eval history · Drift log · Policy decisions

Auto-generated audit trail with full provenance. Ready for compliance review.

A GitHub Action runs your tests. Agentura is the tests.

Agentura also monitors behavioral drift over time against a frozen reference snapshot, not just PR-to-PR regression.
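To see why a frozen reference matters, consider a run history where every step is a small drop. The sketch below uses hypothetical scores and a hypothetical tolerance of 0.05, and the function names are invented for illustration; it is not how Agentura computes drift.

```python
def pr_regressions(scores: list[float], tolerance: float) -> list[int]:
    """Indices where a run dropped more than `tolerance` versus the previous run."""
    return [
        i for i in range(1, len(scores))
        if scores[i - 1] - scores[i] > tolerance
    ]


def drift_from_reference(scores: list[float], reference: float, tolerance: float) -> list[int]:
    """Indices where a run dropped more than `tolerance` versus the frozen reference."""
    return [i for i, s in enumerate(scores) if reference - s > tolerance]


# Hypothetical history: each step loses about 0.02, under the 0.05 tolerance.
history = [0.90, 0.88, 0.86, 0.84, 0.82]

step_flags = pr_regressions(history, tolerance=0.05)
drift_flags = drift_from_reference(history, reference=0.90, tolerance=0.05)
print(step_flags)   # no single step exceeds the tolerance
print(drift_flags)  # the later runs have drifted past it
```

No individual change trips the step-to-step check, yet the later runs sit well below the reference. Comparing against a fixed snapshot catches that kind of cumulative loss.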

agentura quorum: consensus across model families. Independent error distributions.
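The page does not describe how `agentura quorum` works internally. As a general illustration of consensus judging, a majority vote across judges backed by different model families might look like the sketch below. The `quorum` function, the judge labels, and the verdict strings are all hypothetical.

```python
from collections import Counter


def quorum(verdicts: dict[str, str], min_agree: int) -> str:
    """Return the verdict shared by at least `min_agree` judges, else "no_consensus".

    `verdicts` maps a judge label (for example, one per model family) to its verdict.
    """
    if not verdicts:
        return "no_consensus"
    label, count = Counter(verdicts.values()).most_common(1)[0]
    return label if count >= min_agree else "no_consensus"


# Hypothetical verdicts from three judges backed by different model families.
three_judges = {"family_a": "pass", "family_b": "pass", "family_c": "fail"}
split_judges = {"family_a": "pass", "family_b": "fail"}

print(quorum(three_judges, min_agree=2))
print(quorum(split_judges, min_agree=2))
```

The intuition is that if the judges' mistakes are independent, the chance that a majority is wrong on the same case is lower than the chance that any single judge is wrong.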

LIVE DEMO

See a regression get caught.

Run a baseline vs branch comparison in your browser. No install. No account. Live eval results.

Open Playground →

AGENTURA PLAYGROUND · RESULT

Branch change: Model swap (70B → 8B)

Suite: golden_dataset

Metric	Baseline	Branch	Delta	Gate
Accuracy	0.94	0.71	-0.23	BLOCK
Tone	0.87	0.82	-0.05	PASS
Policy	0.92	0.78	-0.14	BLOCK

MERGE BLOCKED

Live results from the playground ↗

SEE IT IN ACTION

Five ways agents break production

You made the tone friendlier. Policy fidelity dropped 24 points. Nobody noticed for two weeks.

Metric	Baseline	Branch	Delta	Gate
Accuracy	0.91	0.67	-0.24	BLOCK
Policy fidelity	0.88	0.64	-0.24	BLOCK
Latency (p95)	842ms	902ms	+60ms	PASS
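A table like this mixes metrics where higher is better (accuracy, policy fidelity) with one where lower is better (latency), so a per-metric gate needs to know which direction counts as worse. The sketch below shows one way to do that. The `Metric` type and the tolerance values (0.10 for scores, 100 ms for latency) are assumptions chosen for illustration, not documented Agentura settings; the baseline and branch values are taken from the table.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    baseline: float
    branch: float
    higher_is_better: bool
    tolerance: float  # largest acceptable move in the "worse" direction (assumed)


def gate_metric(m: Metric) -> str:
    """Return "BLOCK" when the metric worsened by more than its tolerance."""
    delta = m.branch - m.baseline
    worsening = -delta if m.higher_is_better else delta
    return "BLOCK" if worsening > m.tolerance else "PASS"


metrics = [
    Metric("Accuracy", 0.91, 0.67, higher_is_better=True, tolerance=0.10),
    Metric("Policy fidelity", 0.88, 0.64, higher_is_better=True, tolerance=0.10),
    Metric("Latency (p95)", 842, 902, higher_is_better=False, tolerance=100),
]

gates = {m.name: gate_metric(m) for m in metrics}
merge_blocked = "BLOCK" in gates.values()
print(gates, merge_blocked)
```

With these assumed tolerances, the two score metrics block and the 60 ms latency increase passes, matching the Gate column above. A single blocked metric is enough to block the merge.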

OPEN SOURCE

Free to use. Free to self-host.

$ bunx agentura init # generate config + baseline

$ git checkout -b my-branch

$ git push # eval checks run automatically on PR

GitHub Repo · Read the Docs

MIT License · Self-host in minutes · Own your eval data
