returns_policy_correctness: Flags the v18 refund rollout as incorrect on exception-heavy return scenarios.
categorical
gpt-4o
Tracks whether premium and risk-sensitive cases are routed to humans early enough.
categorical
gpt-4o-mini
Measures how often policy-heavy answers remain grounded in approved support rules.
numeric
gpt-4o
shipping_latency_guardrail: Tracks whether cheaper logistics paths remain inside latency expectations during carrier instability.
numeric
gpt-4o-mini
Checks whether operator-visible summaries are specific enough to support real investigations.
numeric
gpt-4o
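
The entries above appear to follow a repeating three-field pattern: a description, a score type (categorical or numeric), and a model. As a minimal sketch, assuming that reading is right, they could be held as typed records so the listing can be filtered or grouped. The class name `EvaluatorEntry`, its field names, and the helper `group_by_model` are hypothetical and not part of the original listing. Entries with no identifier in the listing keep `name=None` rather than a guessed name.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical set of allowed score types, taken from the values seen in the listing.
ALLOWED_SCORE_TYPES = {"categorical", "numeric"}


@dataclass(frozen=True)
class EvaluatorEntry:
    """One record from the listing. All field names are illustrative assumptions."""
    name: Optional[str]  # None where the listing shows no identifier
    description: str
    score_type: str      # "categorical" or "numeric"
    model: str

    def __post_init__(self) -> None:
        # Reject score types that do not appear in the listing.
        if self.score_type not in ALLOWED_SCORE_TYPES:
            raise ValueError(f"unexpected score type: {self.score_type!r}")


ENTRIES: List[EvaluatorEntry] = [
    EvaluatorEntry(
        "returns_policy_correctness",
        "Flags the v18 refund rollout as incorrect on exception-heavy return scenarios.",
        "categorical",
        "gpt-4o",
    ),
    EvaluatorEntry(
        None,
        "Tracks whether premium and risk-sensitive cases are routed to humans early enough.",
        "categorical",
        "gpt-4o-mini",
    ),
    EvaluatorEntry(
        None,
        "Measures how often policy-heavy answers remain grounded in approved support rules.",
        "numeric",
        "gpt-4o",
    ),
    EvaluatorEntry(
        "shipping_latency_guardrail",
        "Tracks whether cheaper logistics paths remain inside latency expectations "
        "during carrier instability.",
        "numeric",
        "gpt-4o-mini",
    ),
    EvaluatorEntry(
        None,
        "Checks whether operator-visible summaries are specific enough to support "
        "real investigations.",
        "numeric",
        "gpt-4o",
    ),
]


def group_by_model(entries: List[EvaluatorEntry]) -> Dict[str, List[EvaluatorEntry]]:
    """Group entries by the model string attached to each one."""
    grouped: Dict[str, List[EvaluatorEntry]] = defaultdict(list)
    for entry in entries:
        grouped[entry.model].append(entry)
    return dict(grouped)
```

Grouping by model or filtering on `score_type` then becomes a one-line operation, and the `None` names make it easy to spot which entries still lack an identifier.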