Detect behavior drift and structural changes across agent executions. Compare baselines to identify root causes.
Review each regression, compare the baseline vs current behavior, and use experiments to validate fixes.