Selected projects
Each engagement pairs acceptance criteria with measurable time-to-value: tighter review cycles, favorable variance against estimate, or ahead-of-milestone delivery, all without diluting review rigor. Figures are representative of contracted scope.
Model evaluation & data quality
Agent trajectory trace review
Established a human-first review program for agent evaluation traces, focused on sample quality rather than isolated model scoring. Reviewers traverse each trace with the same visibility the model had and record structured judgments on task success, environment and state transitions, response quality, and instruction adherence; they then classify pathologies against a controlled taxonomy and assign a taxonomy-aligned verdict before any comparison with automated checks.
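A minimal sketch of the per-trace record such a program might produce; the field names, score scale, and verdict labels below are illustrative assumptions, not the contracted schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    """Taxonomy-aligned sample-quality verdicts (illustrative labels)."""
    VALID = "valid"
    AMBIGUOUS_SCENARIO = "ambiguous_scenario"
    DEFECTIVE_SCENARIO = "defective_scenario"
    MODEL_FAILURE = "model_failure"


@dataclass
class TraceReview:
    """One reviewer's structured judgment on a single agent trace."""
    trace_id: str
    task_success: bool            # did the agent accomplish the stated task?
    state_transitions_ok: bool    # environment/state changes consistent with actions
    response_quality: int         # rubric score, e.g. 1-5
    instruction_adherence: bool   # explicit instructions followed
    pathologies: list[str] = field(default_factory=list)  # codes from the controlled taxonomy
    verdict: Verdict = Verdict.VALID  # assigned before comparison with automated checks
    evidence: str = ""            # quotes or pointers supporting the judgment
```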
Deliverables
- Per-trace fields for correctness, reasonableness, and failure attribution
- Sample-level pathology patterns with cross-trace risk annotations
- Final sample-quality verdicts governed by taxonomy rules
- Documented human-verifier agreement, discrepancy labels, and supporting evidence
- Instruction-following guidance where rubric criteria exceed what instructions strictly imply
Outcomes
- Trace- and sample-level labels suitable for audit and longitudinal analysis
- Systematic separation of model underperformance from ambiguous or defective scenarios
- Automated checks triangulated against human judgment rather than accepted by default
- Review cadence tightened versus the prior 5-day benchmark without lowering sampling rigor
Multimodal evaluation
Adversarial failure-mode evaluation suite
Designed and executed targeted prompts to elicit defined failure modes across multiple frontier language models. Produced a multimodal evaluation suite emphasizing edge-case reasoning, adversarial robustness, and failure-taxonomy coverage, suitable for model comparison and regression monitoring.
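As a hedged sketch, one corpus entry and the recall measure reported below might look like the following; the field names and the recall definition are assumptions for illustration, not the delivered schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdversarialPrompt:
    """One entry in the adversarial prompt corpus (illustrative fields)."""
    prompt_id: str
    text: str                  # the adversarial prompt itself
    modality: str              # e.g. "text" or "image+text"
    target_failure_mode: str   # taxonomy code the prompt is designed to elicit
    intent: str                # documented rationale for why it should trigger that mode
    reference_behavior: str    # what a robust model is expected to do instead


def targeted_failure_recall(elicited: set[str], targeted: set[str]) -> float:
    """Share of targeted failure modes actually elicited in at least one model."""
    return len(elicited & targeted) / len(targeted) if targeted else 0.0
```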
Deliverables
- Structured adversarial prompt corpus with documented intent per scenario
- Cross-model failure analysis and comparative reporting
- Gold-standard or reference-checked items across four model families
- Hierarchical failure categorization aligned to client taxonomy
Outcomes
- Shipped ahead of the baseline schedule while maintaining recall targets under review
- High coverage of intended failure modes under quality gates
- Phase-two scope extension following interim review
- Engagement priced below prior vendor benchmarks for comparable volume
- 4,100+ evaluation prompts
- 91% targeted failure recall
- 31% under the estimated calendar
STEM reasoning
Expert-level STEM assessment corpus
Authored graduate-level STEM items with multi-step reasoning requirements, spanning mathematics, physics, chemistry, and biology. Ground truth, solution sketches, and verification protocols were produced under PhD-level review to support training and evaluation use cases.
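One verification step, deliberately simplified: the sketch below assumes a numeric-answer item checked by independent re-derivation; the actual protocols also had to handle symbolic and free-response items.

```python
import math


def verify_numeric_item(reference: float, rederived: float,
                        rel_tol: float = 1e-6) -> bool:
    """Accept ground truth only when an independent solver, working from
    the problem statement alone, reproduces the reference answer within
    tolerance. Symbolic and free-response items need richer checks."""
    return math.isclose(reference, rederived, rel_tol=rel_tol)
```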
Deliverables
- Single-domain and cross-domain items with specified difficulty bands
- Verified reference answers and worked explanations
- Contributor guidelines and adjudication workflow
- Coverage matrix by subdomain and cognitive skill
Outcomes
- End-to-end delivery in a shorter window than competing proposals without relaxing verification
- Item verification passed under defined accuracy criteria
- External subject-matter review completed without blocking findings
- Subsequent engagement for adjacent domains
- 2,850 items delivered
- 24 subdomains in the coverage matrix
- 9-week end-to-end delivery
Finance and economics
Financial reasoning benchmark
Developed financial analysis and reasoning tasks including valuation, market context, and investment judgment. Content authored and reviewed by practitioners with relevant professional credentials and checked against stated industry conventions.
Deliverables
- Scenario-based items requiring synthesis across sources
- Difficulty and modality tags for downstream stratification
- Reviewer rubrics and sign-off workflow
- Documentation suitable for procurement and compliance review
Outcomes
- Production cycle compressed versus the initial 28-day estimate with milestone sign-off unchanged
- Dual independent review on a stratified sample
- Quality bar consistent with institutional evaluation expectations
- Total cost of ownership below prior outsourcing quotes
Software engineering
Multi-language coding benchmark
Built a suite of programming tasks spanning algorithms, system design, and production-style scenarios. Reference implementations, automated test harnesses, and editorial guidelines support consistent difficulty and objective scoring across languages.
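A minimal sketch of deterministic scoring against a hypothetical single-function task; released items ran full multi-language test suites under CI rather than this toy harness.

```python
from collections.abc import Callable


def score_submission(solution: Callable[[int], int],
                     cases: list[tuple[int, int]]) -> float:
    """Deterministic scoring: fixed cases, no randomness, so re-running
    a submission always yields the same score."""
    passed = 0
    for arg, expected in cases:
        try:
            if solution(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case counts as a failure
    return passed / len(cases)


# Usage: fib(n) reference pairs; an identity function passes 2 of 3 cases.
cases = [(0, 0), (1, 1), (10, 55)]
print(score_submission(lambda n: n, cases))  # 0.666...
```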
Deliverables
- Versioned task specifications and reference solutions
- Automated tests with deterministic scoring
- Language coverage and parity matrix
- Contributor onboarding and editorial quality gates
Outcomes
- First production-ready drop delivered under the six-week target with transparent variance reporting
- Automated test suites passing under CI for released items
- Formal acceptance against agreed acceptance criteria
- Integration into the client’s public evaluation workflow
- 680 published tasks
- 6 languages covered
- 6 weeks to first release
CLI and DevOps
Terminal and environment reasoning tasks
Authored multi-step command-line scenarios requiring systems reasoning in containerized environments. Reference solutions, reproducible Docker contexts, and automated graders support evaluation of frontier models on realistic DevOps-style workloads.
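As an illustrative sketch of a grader tied to observable outcomes, assuming a running scenario container; the paths, process name, and container handle here are hypothetical, not the delivered scenarios.

```python
import subprocess


def check_in_container(container: str, cmd: list[str]) -> bool:
    """Run a check inside the scenario container; grade on exit status only."""
    result = subprocess.run(["docker", "exec", container, *cmd],
                            capture_output=True)
    return result.returncode == 0


def grade_scenario(container: str) -> dict[str, bool]:
    """Each rubric line maps to a filesystem or process observation."""
    return {
        "config deployed": check_in_container(container, ["test", "-f", "/etc/app/app.conf"]),
        "service running": check_in_container(container, ["pgrep", "-x", "appd"]),
        "log rotated": check_in_container(container, ["test", "-s", "/var/log/app/app.log.1"]),
    }
```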
Deliverables
- Scenario scripts with deterministic environment setup
- Graded rubrics tied to observable filesystem and process outcomes
- Hard-tier subset calibrated to expected low pass rates
- Operational runbook for environment refresh and troubleshooting
Outcomes
- Milestone beat by eleven days with buffer retained for hard-tier expansion
- Multi-step verification suitable for high-complexity items
- Container hygiene and security review completed
- Expanded scope in a follow-on statement of work
- 265 graded scenarios
- <36% pass rate on the hard tier
- 11 days ahead of milestone