Model evaluation & data quality
Agent trajectory trace review
Established a human-first review program for agent evaluation traces that focuses on sample quality rather than isolated model scoring. Reviewers step through each trace with the same visibility the model had and record structured judgments on task success, environment or state transitions, response quality, and instruction adherence. They then classify pathologies against a controlled taxonomy and assign a taxonomy-aligned verdict before any comparison with automated checks.
Deliverables
- Per-trace fields for correctness, reasonableness, and failure attribution
- Sample-level pathology patterns with cross-trace risk annotations
- Final sample-quality verdicts governed by taxonomy rules
- Documented human-verifier agreement, discrepancy labels, and supporting evidence
- Instruction-following guidance for cases where rubric criteria demand more than the instructions strictly imply
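As an illustration only, the per-trace fields above could be captured in a record like the following. This is a minimal sketch: the class name, field names, verdict labels, and the 1-5 quality scale are all assumptions made for the example, not the program's actual schema or taxonomy. The one behavior taken from the description is that the human verdict must exist before it is compared with an automated check.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    # Hypothetical labels; the real taxonomy is not specified in this document.
    USABLE = "usable"
    AMBIGUOUS_SCENARIO = "ambiguous_scenario"
    DEFECTIVE_SCENARIO = "defective_scenario"


@dataclass
class TraceReview:
    """One reviewer's structured judgment on a single trace (illustrative)."""
    trace_id: str
    task_success: bool
    state_transitions_ok: bool
    response_quality: int            # assumed 1-5 scale
    instruction_adherence: bool
    pathologies: List[str] = field(default_factory=list)
    failure_attribution: Optional[str] = None   # e.g. "model" or "scenario"
    verdict: Optional[Verdict] = None           # human verdict
    verifier_verdict: Optional[Verdict] = None  # automated check, recorded later

    def record_verifier(self, verifier_verdict: Verdict) -> bool:
        """Store the automated verdict and report whether it matches the human one."""
        if self.verdict is None:
            # Enforce human-first ordering: no comparison before a human verdict.
            raise ValueError("assign the human verdict before comparing with automation")
        self.verifier_verdict = verifier_verdict
        return self.verdict == verifier_verdict
```

Keeping the comparison behind a method that rejects a missing human verdict is one way to make the human-first ordering hard to bypass by accident.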
Outcomes
- Trace- and sample-level labels suitable for audit and longitudinal analysis
- Systematic separation of model underperformance from ambiguous or defective scenarios
- Automated checks triangulated against human judgment rather than accepted by default
- Review cycle shortened relative to the prior 5-day benchmark without lowering sampling rigor
1,750+ traces reviewed
93% human-verifier concordance
36h median review cycle
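For readers who want to see how figures of this kind might be derived, here is a minimal sketch. It assumes concordance means the share of traces where the human verdict equals the automated verifier's verdict, and that the review cycle is measured in hours per trace. Both definitions, the function names, and the toy data are assumptions for illustration and do not reproduce the program's data.

```python
from statistics import median


def concordance(pairs):
    """Fraction of (human_verdict, verifier_verdict) pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no reviewed traces")
    agreements = sum(1 for human, verifier in pairs if human == verifier)
    return agreements / len(pairs)


def discrepancies(labelled):
    """Trace ids where human and verifier verdicts differ, for follow-up labeling."""
    return [trace_id for trace_id, human, verifier in labelled if human != verifier]


# Toy data, invented for the example.
labelled = [
    ("t1", "usable", "usable"),
    ("t2", "defective_scenario", "usable"),
    ("t3", "usable", "usable"),
    ("t4", "ambiguous_scenario", "ambiguous_scenario"),
]
rate = concordance((h, v) for _, h, v in labelled)   # 0.75
to_inspect = discrepancies(labelled)                 # ["t2"]
cycle = median([20, 28, 40])                         # 28 (hours)
```

Raw agreement is the simplest measure; a chance-corrected statistic would be a reasonable alternative when verdict classes are heavily imbalanced.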