Improve

Segment: All traffic

Evaluators

Configure scoring criteria for agent outputs, then use evaluators to grade dataset cases and experiment runs.

Regression

5 evaluators active, 1 healthy

4 evaluators are showing degraded health. Review the affected evaluators and check their recent scoring trends.


Total: 5

Healthy: 1

Degraded: 4

Evaluator | Type | Model | Health | Description
returns_policy_correctness | categorical | gpt-4o | (not shown) | Flags the v18 refund rollout as incorrect on exception-heavy return scenarios.
escalation_correctness | categorical | gpt-4o-mini | (not shown) | Tracks whether premium and risk-sensitive cases are routed to humans early enough.
policy_groundedness | numeric | gpt-4o | (not shown) | Measures how often policy-heavy answers remain grounded in approved support rules.
shipping_latency_guardrail | numeric | gpt-4o-mini | (not shown) | Tracks whether cheaper logistics paths remain inside latency expectations during carrier instability.
operator_summary_quality | numeric | gpt-4o | (not shown) | Checks whether operator-visible summaries are specific enough to support real investigations.

5 evaluators
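
For readers who want to work with this listing programmatically, the following is a minimal sketch of how the five evaluators could be represented and tallied. The `Evaluator` class, its field names, and the `summarize` helper are hypothetical and are not part of any product API; only the evaluator names, types, and models are taken from the listing. Per-evaluator health is left out because the listing does not show which evaluator is the healthy one.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluator:
    """Hypothetical record for one row of the evaluator listing."""
    name: str
    score_type: str  # "categorical" or "numeric", as shown in the Type column
    model: str       # value shown in the Model column


# Rows copied from the listing. Health is omitted on purpose: the
# listing does not show per-evaluator health values.
EVALUATORS = [
    Evaluator("returns_policy_correctness", "categorical", "gpt-4o"),
    Evaluator("escalation_correctness", "categorical", "gpt-4o-mini"),
    Evaluator("policy_groundedness", "numeric", "gpt-4o"),
    Evaluator("shipping_latency_guardrail", "numeric", "gpt-4o-mini"),
    Evaluator("operator_summary_quality", "numeric", "gpt-4o"),
]


def summarize(evaluators):
    """Count evaluators in total, by score type, and by model."""
    return {
        "total": len(evaluators),
        "by_type": dict(Counter(e.score_type for e in evaluators)),
        "by_model": dict(Counter(e.model for e in evaluators)),
    }


if __name__ == "__main__":
    print(summarize(EVALUATORS))
```

The total produced by this tally matches the "5 evaluators" count shown on the page; the split by type and model is derived only from the rows listed.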
