Agentura Docs

Baseline Comparison

How Agentura knows if a PR made things worse

Every completed run on the baseline branch (usually main) becomes the reference point. On each PR, Agentura compares current suite scores against that baseline and computes a delta.

  • Baseline scores are stored per suite after runs on main
  • PR runs are compared suite-by-suite against baseline
  • Delta is shown in PR comments (`+0.05` improved, `-0.05` regressed)
  • `regression_threshold` (default 0.05) sets how far a suite score must drop below baseline before the drop counts as a regression
  • `block_on_regression: true` fails the GitHub Check when a regression is detected

Example PR delta table

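| Suite    | Baseline | Current | Delta | Status     |
|----------|----------|---------|-------|------------|
| accuracy | 0.90     | 0.84    | -0.06 | Regression |
| quality  | 0.78     | 0.82    | +0.04 | Improved   |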
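
The comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Agentura's actual implementation: the names `SuiteDelta`, `compare_to_baseline`, and `should_fail_check` are hypothetical, as is the "Within threshold" status. It also assumes that a drop exactly equal to `regression_threshold` counts as a regression, which the docs do not specify.

```python
from dataclasses import dataclass


@dataclass
class SuiteDelta:
    """One row of the PR delta table (hypothetical structure)."""
    suite: str
    baseline: float
    current: float
    delta: float
    status: str


def compare_to_baseline(baseline, current, regression_threshold=0.05):
    """Compare PR scores against baseline scores, suite by suite."""
    rows = []
    for suite, base_score in baseline.items():
        if suite not in current:
            continue  # suite did not run on this PR; nothing to compare
        # Round to avoid floating-point noise such as -0.06000000000000005.
        delta = round(current[suite] - base_score, 4)
        if delta <= -regression_threshold:
            # Assumption: a drop equal to the threshold already counts.
            status = "Regression"
        elif delta > 0:
            status = "Improved"
        else:
            status = "Within threshold"  # hypothetical label
        rows.append(SuiteDelta(suite, base_score, current[suite], delta, status))
    return rows


def should_fail_check(rows, block_on_regression=True):
    """Fail the check only if blocking is enabled and some suite regressed."""
    return block_on_regression and any(r.status == "Regression" for r in rows)


# Scores taken from the example PR delta table.
rows = compare_to_baseline(
    baseline={"accuracy": 0.90, "quality": 0.78},
    current={"accuracy": 0.84, "quality": 0.82},
)
for r in rows:
    print(f"{r.suite}: {r.delta:+.2f} {r.status}")
# accuracy: -0.06 Regression
# quality: +0.04 Improved
```

With the default threshold of 0.05, the accuracy drop of 0.06 is large enough to count as a regression, so `should_fail_check(rows)` returns `True` in this sketch.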

CI configuration

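```yaml
ci:
  block_on_regression: true   # fail the GitHub Check when a regression is detected
  regression_threshold: 0.05  # a score drop of 0.05 (5 points on a 0-1 scale) counts as a regression
  compare_to: main            # baseline branch to compare against
  post_comment: true          # post the delta in a PR comment
```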