Proof of work

Evaluation, data, and engineering delivery

Representative work where strong quality bars meet aggressive timelines: human-in-the-loop evaluation, benchmark and dataset production, engineering suites, and infrastructure-style assessments, all delivered with minimal schedule slack.

Selected projects

Each engagement pairs acceptance criteria with time-to-value: tighter cycles, favorable variance versus estimate, or ahead-of-milestone delivery, without diluting review rigor. Figures are representative of contracted scope.

Model evaluation & data quality

Agent trajectory trace review

Established a human-first review program for agent evaluation traces, focused on sample quality rather than isolated model scoring. Reviewers traverse each trace with the same visibility the model had; record structured judgments on task success, environment and state transitions, response quality, and instruction adherence; classify pathologies against a controlled taxonomy; and assign a taxonomy-aligned verdict before any comparison with automated checks.
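
For illustration, a minimal sketch of what a per-trace review record of this kind could look like. The field names, taxonomy codes, and verdict rule below are hypothetical stand-ins, not the client schema.

    from dataclasses import dataclass, field
    from enum import Enum

    class Verdict(Enum):
        # Hypothetical taxonomy-aligned verdicts; the actual controlled
        # taxonomy is client-specific.
        VALID = "valid_sample"
        MODEL_FAILURE = "model_failure"
        DEFECTIVE_SCENARIO = "defective_scenario"
        AMBIGUOUS = "ambiguous"

    @dataclass
    class TraceReview:
        trace_id: str
        task_success: bool           # did the agent complete the task?
        state_transitions_ok: bool   # environment/state changes plausible?
        response_quality: int        # rubric score, e.g. 1-5
        instruction_adherence: bool
        pathologies: list[str] = field(default_factory=list)  # taxonomy codes

        def verdict(self) -> Verdict:
            # Verdict is assigned before any comparison with automated checks.
            if "broken_environment" in self.pathologies:
                return Verdict.DEFECTIVE_SCENARIO
            if "underspecified_task" in self.pathologies:
                return Verdict.AMBIGUOUS
            if not (self.task_success and self.instruction_adherence):
                return Verdict.MODEL_FAILURE
            return Verdict.VALID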

Deliverables

  • Per-trace fields for correctness, reasonableness, and failure attribution
  • Sample-level pathology patterns with cross-trace risk annotations
  • Final sample-quality verdicts governed by taxonomy rules
  • Documented human-verifier agreement, discrepancy labels, and supporting evidence
  • Instruction-following guidance where rubric criteria exceed what instructions strictly imply

Outcomes

  • Trace- and sample-level labels suitable for audit and longitudinal analysis
  • Systematic separation of model underperformance from ambiguous or defective scenarios
  • Automated checks triangulated against human judgment rather than accepted by default
  • Review cadence tightened versus the prior 5-day benchmark without lowering sampling rigor
  • 1,750+ traces reviewed
  • 93% human-verifier concordance
  • 36h median review cycle

Multimodal evaluation

Adversarial failure-mode evaluation suite

Designed targeted prompts to elicit defined failure modes and executed them against multiple frontier language models. Produced a multimodal evaluation suite emphasizing edge-case reasoning, adversarial robustness, and failure-taxonomy coverage, suitable for model comparison and regression monitoring.
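
As a sketch only: one plausible record shape for an intent-documented adversarial prompt, and one reading of "targeted failure recall" (the intended failure mode elicited by at least one evaluated model). The records, names, and metric definition are invented for illustration.

    # Hypothetical corpus records: each prompt documents the failure mode
    # it is intended to elicit and what each model actually exhibited.
    corpus = [
        {"prompt_id": "adv-0001", "intended_failure": "unit_confusion",
         "observed": {"model_a": "unit_confusion", "model_b": None}},
        {"prompt_id": "adv-0002", "intended_failure": "premise_acceptance",
         "observed": {"model_a": None, "model_b": "premise_acceptance"}},
    ]

    def targeted_failure_recall(corpus: list[dict]) -> float:
        # Fraction of prompts whose intended failure mode was elicited
        # by at least one model under evaluation.
        hits = sum(
            1 for item in corpus
            if item["intended_failure"] in item["observed"].values()
        )
        return hits / len(corpus)

    print(f"recall = {targeted_failure_recall(corpus):.0%}")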

Deliverables

  • Structured adversarial prompt corpus with documented intent per scenario
  • Cross-model failure analysis and comparative reporting
  • Gold-standard or reference-checked items across four model families
  • Hierarchical failure categorization aligned to client taxonomy

Outcomes

  • Shipped ahead of the baseline schedule while maintaining recall targets under review
  • High coverage of intended failure modes under quality gates
  • Phase-two scope extension following interim review
  • Engagement priced below prior vendor benchmarks for comparable volume
  • 4,100+ evaluation prompts
  • 91% targeted failure recall
  • −31% vs. estimated calendar

STEM reasoning

Expert-level STEM assessment corpus

Authored graduate-level STEM items spanning mathematics, physics, chemistry, and biology with multi-step reasoning requirements. Ground truth, solution sketches, and verification protocols were produced under PhD-level review to support training and evaluation use cases.
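
A minimal sketch of the adjudication step such a verification protocol might include; the agreement rule and the three outcome labels are assumptions for illustration, not the documented workflow.

    def adjudicate(reference: str, reviewer_answers: list[str]) -> str:
        # An item is verified only when independent reviewers agree with
        # the reference answer; disagreement escalates rather than being
        # silently overwritten.
        agree = sum(answer == reference for answer in reviewer_answers)
        if agree == len(reviewer_answers):
            return "verified"
        if agree == 0:
            return "rework_reference"  # the reference itself is suspect
        return "escalate"              # split opinion -> senior adjudicator

    print(adjudicate("42 J", ["42 J", "42 J"]))  # verified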

Deliverables

  • Single-domain and cross-domain items with specified difficulty bands
  • Verified reference answers and worked explanations
  • Contributor guidelines and adjudication workflow
  • Coverage matrix by subdomain and cognitive skill

Outcomes

  • End-to-end delivery in a shorter window than competing proposals without relaxing verification
  • Item verification passed under defined accuracy criteria
  • External subject-matter review completed without blocking findings
  • Subsequent engagement for adjacent domains
  • 2,850 items delivered
  • 24 subdomains in matrix
  • 9 wk end-to-end delivery

Finance and economics

Financial reasoning benchmark

Developed financial analysis and reasoning tasks including valuation, market context, and investment judgment. Content authored and reviewed by practitioners with relevant professional credentials and checked against stated industry conventions.
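
Illustrative only: one way a stratified sample for dual independent review could be drawn, assuming a difficulty tag as the stratification key. The fraction mirrors the 12% figure reported below; everything else is hypothetical.

    import random
    from collections import defaultdict

    def stratified_sample(items: list[dict], key: str, fraction: float,
                          seed: int = 0) -> list[dict]:
        # Draw a fixed fraction from every stratum so the review sample
        # mirrors the corpus composition instead of oversampling easy items.
        rng = random.Random(seed)
        strata = defaultdict(list)
        for item in items:
            strata[item[key]].append(item)
        sample = []
        for bucket in strata.values():
            k = max(1, round(len(bucket) * fraction))
            sample.extend(rng.sample(bucket, k))
        return sample

    items = [{"id": i, "difficulty": d}
             for i, d in enumerate(["easy", "medium", "hard"] * 20)]
    review_set = stratified_sample(items, "difficulty", 0.12)
    print(len(review_set))  # ~12% of 60 items, balanced across bands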

Deliverables

  • Scenario-based items requiring synthesis across sources
  • Difficulty and modality tags for downstream stratification
  • Reviewer rubrics and sign-off workflow
  • Documentation suitable for procurement and compliance review

Outcomes

  • Production cycle compressed versus the initial 28-day estimate with milestone sign-off unchanged
  • Dual independent review on a stratified sample
  • Quality bar consistent with institutional evaluation expectations
  • Total cost of ownership below prior outsourcing quotes
  • 980 benchmark items
  • 17-day production window
  • 12% independent review sample

Software engineering

Multi-language coding benchmark

Built a suite of programming tasks spanning algorithms, system design, and production-style scenarios. Reference implementations, automated test harnesses, and editorial guidelines support consistent difficulty and objective scoring across languages.
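
A hedged sketch of what a deterministic scoring harness of this shape might look like. The task record, test command, and output fields are hypothetical; a production harness would add sandboxing, pinned toolchains, and result caching.

    import json
    import subprocess
    import sys

    def score_task(task: dict) -> dict:
        # Run the task's pinned test command and reduce it to a
        # deterministic pass/fail: exit code only, no flaky retries,
        # no wall-clock-dependent scoring.
        result = subprocess.run(
            task["test_cmd"],
            capture_output=True,
            text=True,
            timeout=task.get("timeout_s", 60),
        )
        return {
            "task_id": task["task_id"],
            "version": task["version"],  # versioned spec, per the deliverables
            "passed": result.returncode == 0,
        }

    if __name__ == "__main__":
        task = {"task_id": "two-sum-py", "version": "1.2.0",
                "test_cmd": [sys.executable, "-m", "pytest", "tests/two_sum"]}
        print(json.dumps(score_task(task)))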

Deliverables

  • Versioned task specifications and reference solutions
  • Automated tests with deterministic scoring
  • Language coverage and parity matrix
  • Contributor onboarding and editorial quality gates

Outcomes

  • First production-ready drop delivered under the six-week target with transparent variance reporting
  • Automated test suites passing under CI for released items
  • Formal acceptance against agreed acceptance criteria
  • Integration into the client’s public evaluation workflow
  • 680 published tasks
  • 6 languages covered
  • 6 wk time to first release

CLI and DevOps

Terminal and environment reasoning tasks

Authored multi-step command-line scenarios requiring systems reasoning in containerized environments. Reference solutions, reproducible Docker contexts, and automated graders support evaluation of frontier models on realistic DevOps-style workloads.
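
For illustration, a minimal grader sketch that scores one scenario purely by observable outcomes inside its container, never by the command history that produced them. The container name, paths, and checks are invented; real rubrics carry more checks per scenario.

    import subprocess

    def grade(container: str) -> dict:
        # Run a shell check inside the scenario container via `docker exec`
        # and treat exit code 0 as success.
        def check(cmd: str) -> bool:
            result = subprocess.run(
                ["docker", "exec", container, "sh", "-c", cmd],
                capture_output=True, text=True,
            )
            return result.returncode == 0

        checks = {
            # expected filesystem outcome: non-empty backup archive exists
            "backup_created": check("test -s /var/backups/app.tar.gz"),
            # expected process outcome: the service is actually running
            "service_running": check("pgrep -x nginx"),
        }
        checks["passed"] = all(checks.values())
        return checks

    print(grade("scenario-042"))  # hypothetical container name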

Deliverables

  • Scenario scripts with deterministic environment setup
  • Graded rubrics tied to observable filesystem and process outcomes
  • Hard-tier subset calibrated to expected low pass rates
  • Operational runbook for environment refresh and troubleshooting

Outcomes

  • Delivered eleven days ahead of milestone, with buffer retained for hard-tier expansion
  • Multi-step verification suitable for high-complexity items
  • Container hygiene and security review completed
  • Expanded scope in a follow-on statement of work
  • 265 graded scenarios
  • <36% pass rate (hard tier)
  • 11 days ahead of milestone

Discuss a comparable initiative

Share domain constraints, volume targets, and timelines, and we will propose a staffing or delivery approach accordingly.