Baseline Comparison
How Agentura knows if a PR made things worse
Every completed run on the baseline branch (usually main) becomes the reference point. On each PR, Agentura compares current suite scores against that baseline and computes a delta.
- Baseline scores are stored per suite after runs on main
- PR runs are compared suite-by-suite against baseline
- Delta is shown in PR comments (`+0.05` improved, `-0.05` regressed)
- `regression_threshold` (default 0.05) sets how far a suite score may drop below baseline before the drop counts as a regression
- `block_on_regression: true` fails the GitHub Check when regression is detected
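The comparison described above can be sketched in a few lines of Python. This is an illustrative sketch, not Agentura's actual implementation: the `SuiteDelta` class, the `compare_to_baseline` function, and the "Within threshold" label are hypothetical names, and it assumes a drop must be strictly larger than `regression_threshold` to count as a regression.

```python
from dataclasses import dataclass


@dataclass
class SuiteDelta:
    """One row of the comparison (hypothetical structure)."""
    suite: str
    baseline: float
    current: float
    delta: float
    status: str


def compare_to_baseline(baseline, current, regression_threshold=0.05):
    """Compare PR suite scores with baseline scores, suite by suite."""
    results = []
    for suite, base_score in baseline.items():
        if suite not in current:
            continue  # assumption: suites missing from the PR run are skipped
        # Round to avoid float noise such as -0.06000000000000005.
        delta = round(current[suite] - base_score, 4)
        if delta < -regression_threshold:
            status = "Regression"
        elif delta > 0:
            status = "Improved"
        else:
            status = "Within threshold"  # hypothetical label for small drops
        results.append(SuiteDelta(suite, base_score, current[suite], delta, status))
    return results


rows = compare_to_baseline(
    baseline={"accuracy": 0.90, "quality": 0.78},
    current={"accuracy": 0.84, "quality": 0.82},
)
for row in rows:
    print(f"{row.suite}: {row.delta:+.2f} {row.status}")
```

With the scores from the example table, `accuracy` drops by 0.06, which exceeds the default threshold, while `quality` improves by 0.04.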
Example PR delta table
| Suite | Baseline | Current | Delta | Status |
|---|---|---|---|---|
| accuracy | 0.90 | 0.84 | -0.06 | Regression |
| quality | 0.78 | 0.82 | +0.04 | Improved |
CI configuration
```yaml
ci:
  block_on_regression: true   # fail the PR check if a regression is detected
  regression_threshold: 0.05  # a 0.05 score drop triggers failure
  compare_to: main
  post_comment: true
```
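As a rough illustration of how these settings might gate a PR, the sketch below maps suite statuses to a check result. The `check_conclusion` function and the plain dict standing in for the parsed `ci` block are hypothetical. It also assumes `block_on_regression` is off when the key is absent, which the configuration above does not confirm.

```python
def check_conclusion(statuses, ci_config):
    """Return a check conclusion for a list of per-suite statuses."""
    regressed = any(status == "Regression" for status in statuses)
    # Assumption: blocking is opt-in, so a missing key means "do not block".
    if regressed and ci_config.get("block_on_regression", False):
        return "failure"
    return "success"


# Plain dict mirroring the `ci` block above (no YAML parsing in this sketch).
ci_config = {
    "block_on_regression": True,
    "regression_threshold": 0.05,
    "compare_to": "main",
    "post_comment": True,
}

print(check_conclusion(["Regression", "Improved"], ci_config))
```

With `block_on_regression` set to `false`, the same statuses would still appear in the PR comment but would not fail the check in this sketch.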