Cortex Engine Benchmarks

Cortex Engine v0.4 (Sonnet 4.6) catches 61% of the coding and documentation errors a senior reviewer would flag in primary-care charts.

Mean recall across 11 de-identified primary-care charts that contain real coding and documentation mistakes. The judge is a frontier reasoning model with no Cortex affiliation; we score Cortex's flags against the judge's gold list and compute precision, recall, and F1 per chart. See methodology for the full setup and caveats.
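Concretely, each chart's score reduces to standard precision/recall/F1 over matched flags. A minimal sketch of that arithmetic (the true-positive count itself comes from judging Cortex's flags against the gold list, which this sketch takes as given):

```python
def chart_scores(gold: int, flagged: int, true_positives: int):
    """Precision, recall, F1 for one chart, given the size of the
    judge's gold list, the number of flags Cortex raised, and how
    many of those flags matched a gold issue."""
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / gold if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# e.g. a chart with 3 gold issues, 5 flags, 3 of them matching:
p, r, f1 = chart_scores(gold=3, flagged=5, true_positives=3)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.6 1.0 0.75
```

This matches the chart-002 row in the per-chart breakdown (precision 60%, recall 100%, F1 75%).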

| Median recall | Median precision | Median F1 | Median $/chart | Median sec/chart |
|---|---|---|---|---|
| 50% | 40% | 44% | $0.033 | 25.4s |

All evaluated systems

| System | Model · prompts | Charts | Recall (mean) | Precision (mean) | F1 (mean) | $ / chart | sec / chart |
|---|---|---|---|---|---|---|---|
| Cortex Engine (current production) | claude-sonnet-4-6 · prompts v0.4-brevity | 1 | — | — | — | $0.057 | 29.0s |
| Cortex Engine v0.4 (Haiku 4.5) | claude-haiku-4-5-20251001 · prompts v0.4 | 3 | — | — | — | $0.0044 | 4.6s |
| Cortex Engine v0.4 (Opus 4.7) | claude-opus-4-7 · prompts v0.4 | 1 | — | — | — | $0.306 | 25.6s |
| Cortex Engine v0.4 (Sonnet 4.6) | claude-sonnet-4-6 · prompts v0.4 | 11 | 61% | 44% | 47% | $0.033 | 25.4s |
| Cortex Engine v0.3 (Sonnet 4.6, no brevity rule) | claude-sonnet-4-6 · prompts v0.3 | 1 | — | — | — | $0.071 | 41.8s |

Per-chart breakdown · Cortex Engine v0.4 (Sonnet 4.6)

| Chart | Gold issues | Cortex flagged | Recall | Precision | F1 |
|---|---|---|---|---|---|
| chart-007 | 12 | 10 | 17% | 20% | 18% |
| chart-006 | 1 | 6 | 100% | 17% | 29% |
| chart-005 | 12 | 8 | 27% | 38% | 32% |
| chart-001 | 7 | 9 | 50% | 33% | 40% |
| chart-009 | 1 | 4 | 100% | 25% | 40% |
| chart-003 | 4 | 5 | 50% | 40% | 44% |
| chart-004 | 10 | 8 | 44% | 50% | 47% |
| chart-008 | 14 | 11 | 46% | 55% | 50% |
| example-001 | 10 | 13 | 62% | 62% | 62% |
| chart-002 | 3 | 5 | 100% | 60% | 75% |
| chart-010 | 8 | 7 | 75% | 86% | 80% |
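The aggregate numbers can be reproduced directly from this breakdown: the 61% headline is the mean of the per-chart recalls, while the scorecard reports medians. A quick sanity check, with the per-chart percentages copied from the table above:

```python
from statistics import mean, median

# (recall %, precision %, F1 %) per chart, from the per-chart breakdown.
charts = {
    "chart-007": (17, 20, 18),
    "chart-006": (100, 17, 29),
    "chart-005": (27, 38, 32),
    "chart-001": (50, 33, 40),
    "chart-009": (100, 25, 40),
    "chart-003": (50, 40, 44),
    "chart-004": (44, 50, 47),
    "chart-008": (46, 55, 50),
    "example-001": (62, 62, 62),
    "chart-002": (100, 60, 75),
    "chart-010": (75, 86, 80),
}

recalls = [r for r, _, _ in charts.values()]
precisions = [p for _, p, _ in charts.values()]
f1s = [f for _, _, f in charts.values()]

print(f"mean recall   {mean(recalls):.0f}%")       # 61% — the headline number
print(f"median recall {median(recalls):.0f}%")     # 50%
print(f"median prec   {median(precisions):.0f}%")  # 40%
print(f"median F1     {median(f1s):.0f}%")         # 44%
```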