GovernSegment: All traffic

Regressions

Detect behavior drift and structural changes across agent executions. Compare baselines to identify root causes.

Error detected

2 critical regressions detected

Review each regression, compare the baseline vs current behavior, and use experiments to validate fixes.

Investigate traces Validate with experiment

Total5

Critical2

Warning2

Foxhound suggests

Claude Agent SDKAdd a "use only provided refund-policy context" guard to the system prompt.

Expected impact: faithful-to-context evaluator score expected to lift from 0.71 → ≥ 0.92 within 100 traces.

Returns Resolution Copilot regressed after the compressed v18 rolloutcritical

The flagship refund workflow became cheaper but incorrectly denied a damaged-shipment exception that the baseline handled correctly.

Prompt: support-reply

Trace Diff

Policy Grounding Reviewer hallucinated unsupported denial textcritical

The stricter policy branch fabricated a denial rule instead of asking for more evidence or escalating.

Prompt: refund-policy-check

Trace Diff

Premium Escalation Triage missed chargeback escalationwarning

A VIP billing case stayed on the cheap self-serve path instead of handing off to the premium queue.

Prompt: support-reply

Trace Diff

Shipping Delay Resolution degraded during a logistics timeout clusterwarning

A live-status timeout raised latency and lowered answer quality during a high-volume shipping-support period.

Prompt: shipping-delay-triage

Trace Diff

Returns recovery path validated on the weekly exception datasethealthy

The v19 recovery path currently looks safe to promote across the refund exception cohort.

Prompt: support-reply

Trace Diff

GovernSegment: All traffic

Regressions

Detect behavior drift and structural changes across agent executions. Compare baselines to identify root causes.

Error detected

2 critical regressions detected

Review each regression, compare the baseline vs current behavior, and use experiments to validate fixes.

Investigate traces Validate with experiment

Total5

Critical2

Warning2

Foxhound suggests

Claude Agent SDKAdd a "use only provided refund-policy context" guard to the system prompt.

Expected impact: faithful-to-context evaluator score expected to lift from 0.71 → ≥ 0.92 within 100 traces.

Returns Resolution Copilot regressed after the compressed v18 rolloutcritical

The flagship refund workflow became cheaper but incorrectly denied a damaged-shipment exception that the baseline handled correctly.

Prompt: support-reply

Trace Diff

Policy Grounding Reviewer hallucinated unsupported denial textcritical

The stricter policy branch fabricated a denial rule instead of asking for more evidence or escalating.

Prompt: refund-policy-check

Trace Diff

Premium Escalation Triage missed chargeback escalationwarning

A VIP billing case stayed on the cheap self-serve path instead of handing off to the premium queue.

Prompt: support-reply

Trace Diff

Shipping Delay Resolution degraded during a logistics timeout clusterwarning

A live-status timeout raised latency and lowered answer quality during a high-volume shipping-support period.

Prompt: shipping-delay-triage

Trace Diff

Returns recovery path validated on the weekly exception datasethealthy

The v19 recovery path currently looks safe to promote across the refund exception cohort.

Prompt: support-reply

Trace Diff