Opus is the current accuracy ceiling.
Cortex Engine v0.4 (Opus 4.7) posted the highest mean recall at 54%. It is not cheap, but it sets the bar for what the prompt stack can catch.
Cortex Engine Benchmarks
These benchmarks score Cortex against de-identified primary-care charts with real coding and documentation gaps. The useful question is not which model sounds smartest. It is which engine catches more clinically relevant issues without becoming slow, expensive, or noisy.
Cortex Engine v0.5 (Cerebras GPT-OSS 120B) ran at a median of 1.1s per chart, the fastest in the sweep, but its mean recall was 21%. The low-cost and high-speed candidates still give up too much recall for a physician-facing default.
Cortex Engine v0.5 (OpenAI GPT-5.5) is the strongest OpenAI row by overall accuracy in this sweep (47%). Its 87% precision is the highest of any row, but at 7.5¢ and 37.6s it is also the second most expensive and the slowest, which makes it a comparison point, not the obvious production default.
All evaluated systems
Rows are sorted by overall accuracy, descending.
| System | Model · prompts | Role | Charts | Overall | Recall | Precision | Cost | Median latency | Note |
|---|---|---|---|---|---|---|---|---|---|
| Cortex Engine v0.4 (Opus 4.7) | claude-opus-4-7 · prompts v0.4 | Accuracy ceiling | 11 | 51% | 54% | 58% | 14.0¢ | 15.9s | Highest recall in this sweep |
| Cortex Engine v0.5 (OpenAI GPT-5.5) | gpt-5.5 · prompts v0.4 | OpenAI candidate | 11 | 47% | 41% | 87% | 7.5¢ | 37.6s | High precision, expensive |
| Cortex Engine v0.5 (Gemini 3.1 Pro Preview, thinking high) | gemini-3.1-pro-preview · prompts v0.4 | Gemini candidate | 11 | 45% | 39% | 81% | 3.8¢ | 22.5s | Fast Gemini candidate |
| Cortex Engine v0.4 (Haiku 4.5) | claude-haiku-4-5-20251001 · prompts v0.4 | Anthropic low-cost | 11 | 38% | 45% | 40% | 0.80¢ | 8.5s | Comparison point |
| Cortex Engine v0.5 (OpenAI GPT-5.4) | gpt-5.4 · prompts v0.4 | OpenAI candidate | 11 | 38% | 35% | 50% | 1.7¢ | 12.0s | Comparison point |
| Cortex Engine v0.4 (Sonnet 4.6) | claude-sonnet-4-6 · prompts v0.4 | Production baseline | 11 | 35% | 47% | 34% | 2.8¢ | 26.8s | Current production |
| Cortex Engine v0.5 (Gemini 3.5 Flash, thinking 2048) | gemini-3.5-flash · prompts v0.4 | Gemini candidate | 11 | 32% | 29% | 71% | 1.5¢ | 6.3s | Fast Gemini candidate |
| Cortex Engine v0.5 (OpenAI GPT-5.4 Nano) | gpt-5.4-nano · prompts v0.4 | OpenAI candidate | 11 | 32% | 27% | 44% | 0.11¢ | 6.1s | Cheapest OpenAI candidate |
| Cortex Engine v0.5 (Cerebras GPT-OSS 120B) | gpt-oss-120b · prompts v0.4 | Speed floor | 11 | 22% | 21% | 53% | 0.17¢ | 1.1s | Fastest, misses more |
Lowest overall-accuracy rows stay visible so a high mean cannot hide weak chart types.
| Chart | Gold | Flagged | Recall | Overall |
|---|---|---|---|---|
| chart-005 | 12 | 7 | 0% | 0% |
| chart-003 | 4 | 5 | 25% | 22% |
| chart-001 | 7 | 8 | 29% | 27% |
| chart-006 | 1 | 6 | 100% | 29% |
| chart-007 | 12 | 9 | 25% | 29% |
| chart-010 | 8 | 7 | 33% | 38% |
Lowest overall-accuracy rows stay visible so a high mean cannot hide weak chart types.
| Chart | Gold | Flagged | Recall | Overall |
|---|---|---|---|---|
| chart-007 | 12 | 6 | 17% | 22% |
| chart-005 | 12 | 5 | 17% | 24% |
| chart-006 | 1 | 5 | 100% | 33% |
| chart-004 | 10 | 6 | 30% | 37% |
| chart-010 | 8 | 6 | 33% | 40% |
| chart-001 | 7 | 6 | 43% | 46% |
Benchmark method
The published leaderboard currently uses the primary GPT-5.5 judge to create gold issues and match Cortex output against them. Code then computes precision, recall, and overall accuracy from the match array; "overall accuracy" is the F1 score, the harmonic mean of precision and recall. The benchmark does not trust judge arithmetic: the judge supplies the matches, and code does the counting.
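As a rough illustration of scoring from a match array, here is a minimal TypeScript sketch. It is not the repo's scorer. The `MatchPair`, `ChartResult`, and `Score` shapes and the `scoreChart` and `meanScore` helpers are hypothetical names introduced for this example, as are the issue ids. Two further assumptions are labeled in the comments: that a zero denominator scores 0, and that a leaderboard row is the mean of per-chart scores. The second would explain why a row's overall value can sit below the F1 of its mean precision and recall.

```typescript
// Hypothetical shapes: the real match array in the repo may look different.
interface MatchPair {
  goldId: string;    // id of a gold issue created by the judge
  flaggedId: string; // id of the Cortex-flagged issue the judge matched to it
}

interface ChartResult {
  goldCount: number;    // gold issues for this chart
  flaggedCount: number; // issues Cortex flagged for this chart
  matches: MatchPair[]; // judge-accepted pairs; all counting happens in code
}

interface Score {
  precision: number;
  recall: number;
  f1: number;
}

function scoreChart(chart: ChartResult): Score {
  // Count distinct ids so a repeated pair cannot inflate either side.
  const matchedGold = new Set(chart.matches.map((m) => m.goldId)).size;
  const matchedFlagged = new Set(chart.matches.map((m) => m.flaggedId)).size;
  // Assumption: an empty denominator scores 0 rather than being skipped.
  const precision = chart.flaggedCount > 0 ? matchedFlagged / chart.flaggedCount : 0;
  const recall = chart.goldCount > 0 ? matchedGold / chart.goldCount : 0;
  const denom = precision + recall;
  const f1 = denom > 0 ? (2 * precision * recall) / denom : 0;
  return { precision, recall, f1 };
}

// Assumption: a leaderboard row is the unweighted mean of per-chart scores.
function meanScore(charts: ChartResult[]): Score {
  const scores = charts.map(scoreChart);
  const mean = (pick: (s: Score) => number): number =>
    scores.length > 0 ? scores.reduce((sum, s) => sum + pick(s), 0) / scores.length : 0;
  return {
    precision: mean((s) => s.precision),
    recall: mean((s) => s.recall),
    f1: mean((s) => s.f1),
  };
}

// Counts mirror two per-chart rows above; the ids are invented for illustration.
const chart006: ChartResult = {
  goldCount: 1,
  flaggedCount: 6,
  matches: [{ goldId: "gold-1", flaggedId: "flag-3" }],
};
const chart005: ChartResult = { goldCount: 12, flaggedCount: 7, matches: [] };

console.log(scoreChart(chart006)); // recall 1, precision 1/6, f1 = 2/7 (about 0.29)
console.log(meanScore([chart006, chart005]).f1); // (2/7 + 0) / 2 = 1/7
```

The chart-006 case shows why a row can report 100% recall and still land at 29% overall: the single gold issue was caught, but five of the six flags had no gold match, so precision pulls the F1 down.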
We do not treat any single model as canonical truth. Consensus oracle v1 keeps GPT-5.5 as the published score for now, records Opus 4.7 disagreement audits, and is designed to add newer frontier judges as they become clearly stronger.
The corpus contains de-identified primary-care charts captured from real Practice Fusion workflows. Some charts have no encounter note, so note-dependent lanes are skipped for those rows.
The public summary ignores single-chart smoke runs, archived prompt controls, and superseded variants; raw result files stay in the repo. Lane-level provider failures are tracked separately from total row failures.
This is still a small, primary-care-only corpus. Treat the ranking as a decision aid, not a clinical validation claim. Product-debatable and single-judge disagreements are flagged in the repo before they affect leaderboard scoring.
Judge: GPT-5.5 (high). Reproduce with pnpm eval --system <id>, then pnpm eval:summarize. Summary file: public/benchmarks/summary.json. Oracle policy: research/eval/consensus/v1/manifest.json.