Proof of work

Evaluation, data, and engineering delivery

Representative work where strong quality bars meet aggressive timelines: human-in-the-loop evaluation, benchmark and dataset production, engineering suites, and infrastructure-style assessments, all delivered with minimal schedule slack.

Selected projects

Each engagement pairs acceptance criteria with time-to-value: tighter cycles, favorable variance versus estimate, or ahead-of-milestone delivery, without diluting review rigor. Figures are representative of contracted scope.

Model evaluation & data quality

Agent trajectory trace review

Established a human-first review program for agent evaluation traces, focused on sample quality rather than isolated model scoring. Reviewers traverse each trace with the same visibility the model had; record structured judgments on task success, environment and state transitions, response quality, and instruction adherence; classify pathologies against a controlled taxonomy; and assign a taxonomy-aligned verdict before any comparison with automated checks.
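
For illustration, a minimal sketch of what a per-trace review record of this kind could look like. The field names, taxonomy codes, and verdict rule below are hypothetical stand-ins, not the client schema.

    from dataclasses import dataclass, field
    from enum import Enum

    class Verdict(Enum):
        # Hypothetical taxonomy-aligned verdicts; the actual controlled
        # taxonomy is client-specific.
        VALID = "valid_sample"
        MODEL_FAILURE = "model_failure"
        DEFECTIVE_SCENARIO = "defective_scenario"
        AMBIGUOUS = "ambiguous"

    @dataclass
    class TraceReview:
        trace_id: str
        task_success: bool           # did the agent complete the task?
        state_transitions_ok: bool   # environment/state changes plausible?
        response_quality: int        # rubric score, e.g. 1-5
        instruction_adherence: bool
        pathologies: list[str] = field(default_factory=list)  # taxonomy codes

        def verdict(self) -> Verdict:
            # Verdict is assigned before any comparison with automated checks.
            if "broken_environment" in self.pathologies:
                return Verdict.DEFECTIVE_SCENARIO
            if "underspecified_task" in self.pathologies:
                return Verdict.AMBIGUOUS
            if not (self.task_success and self.instruction_adherence):
                return Verdict.MODEL_FAILURE
            return Verdict.VALID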

Deliverables

  • Per-trace fields for correctness, reasonableness, and failure attribution
  • Sample-level pathology patterns with cross-trace risk annotations
  • Final sample-quality verdicts governed by taxonomy rules
  • Documented human-verifier agreement, discrepancy labels, and supporting evidence
  • Instruction-following guidance where rubric criteria exceed what instructions strictly imply

Outcomes

  • Trace- and sample-level labels suitable for audit and longitudinal analysis
  • Systematic separation of model underperformance from ambiguous or defective scenarios
  • Automated checks triangulated against human judgment rather than accepted by default
  • Review cadence tightened versus the prior 5-day benchmark without lowering sampling rigor
  • 1,750+ traces reviewed
  • 93% human-verifier concordance
  • 36h median review cycle

Multimodal evaluation

Adversarial failure-mode evaluation suite

Designed targeted prompts to elicit defined failure modes and executed them against multiple frontier language models. Produced a multimodal evaluation suite emphasizing edge-case reasoning, adversarial robustness, and failure-taxonomy coverage, suitable for model comparison and regression monitoring.
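
As a sketch only: one plausible record shape for an intent-documented adversarial prompt, and one reading of "targeted failure recall" (the intended failure mode elicited by at least one evaluated model). The records, names, and metric definition are invented for illustration.

    # Hypothetical corpus records: each prompt documents the failure mode
    # it is intended to elicit and what each model actually exhibited.
    corpus = [
        {"prompt_id": "adv-0001", "intended_failure": "unit_confusion",
         "observed": {"model_a": "unit_confusion", "model_b": None}},
        {"prompt_id": "adv-0002", "intended_failure": "premise_acceptance",
         "observed": {"model_a": None, "model_b": "premise_acceptance"}},
    ]

    def targeted_failure_recall(corpus: list[dict]) -> float:
        # Fraction of prompts whose intended failure mode was elicited
        # by at least one model under evaluation.
        hits = sum(
            1 for item in corpus
            if item["intended_failure"] in item["observed"].values()
        )
        return hits / len(corpus)

    print(f"recall = {targeted_failure_recall(corpus):.0%}")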

Deliverables

  • Structured adversarial prompt corpus with documented intent per scenario
  • Cross-model failure analysis and comparative reporting
  • Gold-standard or reference-checked items across four model families
  • Hierarchical failure categorization aligned to client taxonomy

Outcomes

  • Shipped ahead of the baseline schedule while maintaining recall targets under review
  • High coverage of intended failure modes under quality gates
  • Phase-two scope extension following interim review
  • Engagement priced below prior vendor benchmarks for comparable volume
  • 4,100+ evaluation prompts
  • 91% targeted failure recall
  • −31% vs. estimated calendar

STEM reasoning

Expert-level STEM assessment corpus

Authored graduate-level STEM items spanning mathematics, physics, chemistry, and biology with multi-step reasoning requirements. Ground truth, solution sketches, and verification protocols were produced under PhD-level review to support training and evaluation use cases.
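
A minimal sketch of the adjudication step such a verification protocol might include; the agreement rule and the three outcome labels are assumptions for illustration, not the documented workflow.

    def adjudicate(reference: str, reviewer_answers: list[str]) -> str:
        # An item is verified only when independent reviewers agree with
        # the reference answer; disagreement escalates rather than being
        # silently overwritten.
        agree = sum(answer == reference for answer in reviewer_answers)
        if agree == len(reviewer_answers):
            return "verified"
        if agree == 0:
            return "rework_reference"  # the reference itself is suspect
        return "escalate"              # split opinion -> senior adjudicator

    print(adjudicate("42 J", ["42 J", "42 J"]))  # verified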

Deliverables

  • Single-domain and cross-domain items with specified difficulty bands
  • Verified reference answers and worked explanations
  • Contributor guidelines and adjudication workflow
  • Coverage matrix by subdomain and cognitive skill

Outcomes

  • End-to-end delivery in a shorter window than competing proposals without relaxing verification
  • Item verification passed under defined accuracy criteria
  • External subject-matter review completed without blocking findings
  • Subsequent engagement for adjacent domains
  • 2,850 items delivered
  • 24 subdomains in matrix
  • 9 wk end-to-end delivery

Finance and economics

Financial reasoning benchmark

Developed financial analysis and reasoning tasks including valuation, market context, and investment judgment. Content authored and reviewed by practitioners with relevant professional credentials and checked against stated industry conventions.
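
Illustrative only: one way a stratified sample for dual independent review could be drawn, assuming a difficulty tag as the stratification key. The fraction mirrors the 12% figure reported below; everything else is hypothetical.

    import random
    from collections import defaultdict

    def stratified_sample(items: list[dict], key: str, fraction: float,
                          seed: int = 0) -> list[dict]:
        # Draw a fixed fraction from every stratum so the review sample
        # mirrors the corpus composition instead of oversampling easy items.
        rng = random.Random(seed)
        strata = defaultdict(list)
        for item in items:
            strata[item[key]].append(item)
        sample = []
        for bucket in strata.values():
            k = max(1, round(len(bucket) * fraction))
            sample.extend(rng.sample(bucket, k))
        return sample

    items = [{"id": i, "difficulty": d}
             for i, d in enumerate(["easy", "medium", "hard"] * 20)]
    review_set = stratified_sample(items, "difficulty", 0.12)
    print(len(review_set))  # ~12% of 60 items, balanced across bands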

Deliverables

  • Scenario-based items requiring synthesis across sources
  • Difficulty and modality tags for downstream stratification
  • Reviewer rubrics and sign-off workflow
  • Documentation suitable for procurement and compliance review

Outcomes

  • Production cycle compressed versus the initial 28-day estimate with milestone sign-off unchanged
  • Dual independent review on a stratified sample
  • Quality bar consistent with institutional evaluation expectations
  • Total cost of ownership below prior outsourcing quotes
  • 980 benchmark items
  • 17-day production window
  • 12% independent review sample

Software engineering

Multi-language coding benchmark

Built a suite of programming tasks spanning algorithms, system design, and production-style scenarios. Reference implementations, automated test harnesses, and editorial guidelines support consistent difficulty and objective scoring across languages.
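
A hedged sketch of what a deterministic scoring harness of this shape might look like. The task record, test command, and output fields are hypothetical; a production harness would add sandboxing, pinned toolchains, and result caching.

    import json
    import subprocess
    import sys

    def score_task(task: dict) -> dict:
        # Run the task's pinned test command and reduce it to a
        # deterministic pass/fail: exit code only, no flaky retries,
        # no wall-clock-dependent scoring.
        result = subprocess.run(
            task["test_cmd"],
            capture_output=True,
            text=True,
            timeout=task.get("timeout_s", 60),
        )
        return {
            "task_id": task["task_id"],
            "version": task["version"],  # versioned spec, per the deliverables
            "passed": result.returncode == 0,
        }

    if __name__ == "__main__":
        task = {"task_id": "two-sum-py", "version": "1.2.0",
                "test_cmd": [sys.executable, "-m", "pytest", "tests/two_sum"]}
        print(json.dumps(score_task(task)))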

Deliverables

  • Versioned task specifications and reference solutions
  • Automated tests with deterministic scoring
  • Language coverage and parity matrix
  • Contributor onboarding and editorial quality gates

Outcomes

  • First production-ready drop delivered under the six-week target with transparent variance reporting
  • Automated test suites passing under CI for released items
  • Formal acceptance against agreed acceptance criteria
  • Integration into the client’s public evaluation workflow
  • 680 published tasks
  • 6 languages covered
  • 6 wk time to first release

CLI and DevOps

Terminal and environment reasoning tasks

Authored multi-step command-line scenarios requiring systems reasoning in containerized environments. Reference solutions, reproducible Docker contexts, and automated graders support evaluation of frontier models on realistic DevOps-style workloads.
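
For illustration, a minimal grader sketch that scores one scenario purely by observable outcomes inside its container, never by the command history that produced them. The container name, paths, and checks are invented; real rubrics carry more checks per scenario.

    import subprocess

    def grade(container: str) -> dict:
        # Run a shell check inside the scenario container via `docker exec`
        # and treat exit code 0 as success.
        def check(cmd: str) -> bool:
            result = subprocess.run(
                ["docker", "exec", container, "sh", "-c", cmd],
                capture_output=True, text=True,
            )
            return result.returncode == 0

        checks = {
            # expected filesystem outcome: non-empty backup archive exists
            "backup_created": check("test -s /var/backups/app.tar.gz"),
            # expected process outcome: the service is actually running
            "service_running": check("pgrep -x nginx"),
        }
        checks["passed"] = all(checks.values())
        return checks

    print(grade("scenario-042"))  # hypothetical container name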

Deliverables

  • Scenario scripts with deterministic environment setup
  • Graded rubrics tied to observable filesystem and process outcomes
  • Hard-tier subset calibrated to expected low pass rates
  • Operational runbook for environment refresh and troubleshooting

Outcomes

  • Delivered eleven days ahead of milestone, with buffer retained for hard-tier expansion
  • Multi-step verification suitable for high-complexity items
  • Container hygiene and security review completed
  • Expanded scope in a follow-on statement of work
  • 265 graded scenarios
  • <36% pass rate (hard tier)
  • 11 days ahead of milestone

Discuss a comparable initiative

Share domain constraints, volume targets, and timelines, and we will propose a staffing or delivery approach accordingly.