Methodology
We publish how Cortex Engine is measured because the headline number on /benchmarks is only as good as the process behind it. Everything below is reproducible from the public repo; the eval harness lives at research/eval/.
What we measure
Three axes per (system, chart) pair:
- Cost — input/output/cache tokens × per-model prices, summed to dollars per chart.
- Speed — wall-clock total + per-lane latency.
- Accuracy — precision, recall, and F1 against a judge-generated gold list. Recall is the headline: “% of human-missed errors caught.”
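As a rough sketch of how the cost and speed axes roll up per chart, assuming placeholder field names and a simplified price shape rather than the harness's actual schema:

```ts
// Illustrative only: field names and the price shape are assumptions,
// not the actual research/eval/ schema.
interface ChartMetrics {
  chartId: string;
  tokens: { input: number; output: number; cache: number };
  latencyMs: { total: number; perLane: Record<string, number> };
}

// Dollars per chart: token counts times per-model prices (USD per token).
function costUsd(
  m: ChartMetrics,
  price: { input: number; output: number; cache: number }
): number {
  return (
    m.tokens.input * price.input +
    m.tokens.output * price.output +
    m.tokens.cache * price.cache
  );
}
```

Accuracy is handled separately, from the judge's match output described under Scoring.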
Corpus
The corpus is de-identified primary-care charts captured from a real Practice Fusion environment by the Cortex Lens extension. Captures pass through a HIPAA Safe Harbor 18-identifier auto-redaction pipeline (extension/src/popup/safeHarbor.ts); each fixture carries an audit field listing every applied rule and the categories that still need human review. The audit is a checklist, not a compliance certification — the corpus is research data only and never includes production PHI.
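For illustration, a fixture and its audit field might look roughly like this (the names below are assumptions, not the actual safeHarbor.ts types):

```ts
// Hypothetical fixture shape; the real extension types may differ.
interface RedactionAudit {
  appliedRules: string[];     // Safe Harbor identifier rules that fired, e.g. "names", "dates"
  needsHumanReview: string[]; // identifier categories the pipeline could not clear automatically
  reviewedBy?: string;        // set once a human has signed off on the fixture
}

interface ChartFixture {
  id: string;
  capture: unknown;           // de-identified chart payload from Cortex Lens
  audit: RedactionAudit;
}
```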
The corpus is small by design; we'd rather show 11 honest numbers than 247 numbers from synthetic data nobody trusts. As the corpus grows, we publish per-chart precision and recall so you can see where Cortex helps and where it doesn't.
Judge
We use the current state-of-the-art non-Anthropic reasoning model as our judge. The pick is auto-discovered: pnpm eval:refresh-judge spins up an agent (Claude with web_search) that finds the leading reasoning model on the Artificial Analysis intelligence leaderboard, surveys medical benchmarks (HealthBench, MedQA), and writes its choice plus citations into research/eval/judges.json. Every results row records the exact judge model and version that scored it, so historical comparisons stay honest as the leaderboard shifts.
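An entry in judges.json might look something like this (the key names here are assumptions, not the file's actual schema):

```ts
// Hypothetical record written by pnpm eval:refresh-judge.
interface JudgeRecord {
  model: string;       // leaderboard-leading non-Anthropic reasoning model at refresh time
  version: string;     // exact version pinned so historical rows stay comparable
  selectedAt: string;  // ISO timestamp of the refresh run
  citations: string[]; // leaderboard and medical-benchmark sources the agent cited
}
```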
The judge is non-Anthropic on purpose: the outputs under test come from Anthropic models, and we want to avoid same-org judging bias. The judge is itself an LLM and therefore imperfect; see the caveats below.
Scoring
Two judge calls per (chart, system) pair:
- Gold generation. The judge reads the chart cold and lists every coding/documentation issue a competent reviewer should flag. The result is cached per (chart, judge config) so we don't pay for the same gold list again across systems. The prompt restricts the judge to the lanes the system under test would actually run — when no encounter note is available, audit-risk and E/M-level matches are excluded so chart-only systems aren't penalized for categories they cannot fairly be judged on.
- Match scoring. The judge compares the system's flat output (chart-insights, suggested codes, audit risks, E/M level) against the gold list and emits a matches array. We compute precision, recall, and F1 deterministically from the matches in code; we do not trust the judge's arithmetic.
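A minimal sketch of that deterministic step, assuming a simplified matches shape (the real schema in research/eval/ may differ):

```ts
interface Match {
  goldId: string;          // issue from the judge's gold list
  systemId: string | null; // matching finding from the system under test, or null if missed
}

function score(matches: Match[], systemFindingCount: number) {
  const tp = matches.filter((m) => m.systemId !== null).length; // gold issues the system caught
  const fn = matches.length - tp;                               // gold issues the system missed
  const fp = systemFindingCount - tp;                           // system findings with no gold match
  const precision = tp + fp > 0 ? tp / (tp + fp) : 0;
  const recall = tp + fn > 0 ? tp / (tp + fn) : 0;
  const f1 = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}
```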
Self-consistency runs (pnpm eval --runs N) replicate each pair N times and produce stability numbers (insight-id Jaccard, fingerprint Jaccard, rationale Levenshtein similarity) that surface how much the model wobbles run-to-run on the same chart.
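For example, the insight-id stability number is just Jaccard similarity over the sets of insight ids that two runs produced for the same chart; averaging it across all run pairs is one way to aggregate (a sketch, not necessarily the harness's exact aggregation):

```ts
// Jaccard similarity between two runs' insight-id sets: 1 = identical, 0 = disjoint.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  for (const id of a) if (b.has(id)) intersection++;
  return intersection / (a.size + b.size - intersection);
}
```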
The "human baseline"
The corpus contains the original coding and documentation mistakes humans made when authoring those charts. A reviewer looking at the untouched chart catches some fraction of those mistakes; the rest ship. Recall against the judge's gold list equals the share of those mistakes Cortex catches — i.e., the share of human-missed errors Cortex would surface to fix before sign-off. That's what the headline number on /benchmarks measures.
Caveats
- Small n. Today's corpus is 11 charts. Confidence intervals on the medians are wide. Don't draw strong conclusions from a single chart's number.
- Specialty bias. Primary care only. Numbers don't directly transfer to specialty workflows until we evaluate against specialty corpora.
- Single-judge bias. A panel-of-judges run would reduce systematic bias. We default to a single judge for cost reasons; expanding to a panel is planned (a Phase 4 work item).
- Corpus coverage. Captures so far are summary views (no encounter-note text). Lanes that depend on note text (E/M level, suggested codes from note, audit risks) don't run on those charts; recall on richer captures is likely different.
- Tiny golds inflate recall. A chart whose gold list has one issue can hit 100% recall trivially. We publish per-chart numbers so you can see when this happens.
Privacy posture
Corpus capture happens client-side in the Cortex Lens extension and runs through the Safe Harbor 18-identifier pipeline before anything leaves the browser. Notes that depend on a real chart are scrubbed and reviewed by a human before commit. The corpus and benchmark results live in this public repo; production PHI never flows through this evaluation pipeline. See /privacy for the broader Cortex EHR privacy posture.
Reproduce: run pnpm eval:refresh-judge (refreshes the judge pick if it's stale), then pnpm eval --system <id>, then pnpm eval:summarize. The summary file at public/benchmarks/summary.json is what this site renders.
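If you want to poke at the output directly, something like this reads the same file the site renders (the field names below are assumptions about summary.json, not its guaranteed schema):

```ts
import { readFileSync } from "node:fs";

// Parse the published summary and print one line per evaluated system.
const summary = JSON.parse(
  readFileSync("public/benchmarks/summary.json", "utf8")
);
for (const row of summary.systems ?? []) {
  console.log(row.id, "recall:", row.recall, "cost/chart:", row.costUsd);
}
```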