Cortex Engine Benchmarks
Cortex Engine v0.4 (Sonnet 4.6) catches 61% of the coding and documentation errors a senior reviewer would flag in primary-care charts.
The 61% figure is mean recall across 11 de-identified primary-care charts that contain real coding and documentation mistakes. The judge is a frontier reasoning model with no Cortex affiliation; we score Cortex's flags against the judge's gold list and compute precision, recall, and F1 for each chart. See methodology for the full setup and caveats.
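As a rough illustration of the per-chart scoring described above, the sketch below computes precision, recall, and F1 for a single chart. It assumes each flag can be reduced to a comparable identifier and matched exactly against the gold list; the actual matching procedure is covered in the methodology and may differ. The names `score_chart` and `ChartScore` and the flag identifiers are hypothetical.

```python
from typing import NamedTuple


class ChartScore(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_chart(cortex_flags: set[str], gold_flags: set[str]) -> ChartScore:
    """Score one chart's flags against the judge's gold list.

    Assumption: flags are matched by exact identifier. Real matching
    may be fuzzier (for example, judged semantically).
    """
    true_positives = len(cortex_flags & gold_flags)
    # Precision: share of Cortex's flags that appear on the gold list.
    precision = true_positives / len(cortex_flags) if cortex_flags else 0.0
    # Recall: share of gold-list errors that Cortex flagged.
    recall = true_positives / len(gold_flags) if gold_flags else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return ChartScore(precision, recall, f1)


# Hypothetical chart, not taken from the benchmark data.
gold = {"wrong-dx-code", "unsupported-visit-level", "missing-modifier"}
cortex = {"wrong-dx-code", "missing-modifier", "vague-assessment", "duplicate-entry"}

score = score_chart(cortex, gold)
print(f"precision={score.precision:.2f} recall={score.recall:.2f} f1={score.f1:.2f}")
# prints: precision=0.50 recall=0.67 f1=0.57
```

In this made-up chart, Cortex finds two of the three gold errors (recall 0.67), and two of its four flags are on the gold list (precision 0.50).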
Median recall: 50%
Median precision: 40%
Median F1: 44%
Median cost per chart: $0.033
Median time per chart: 25.4 s
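The headline recall is a mean, while the summary figures are medians, so the two can differ on a small set of charts. The sketch below shows the effect with made-up per-chart recall values; the numbers are hypothetical and are not the benchmark's data.

```python
from statistics import mean, median

# Hypothetical per-chart recall values (NOT the benchmark's data).
# A couple of charts with perfect recall pull the mean above the median.
recalls = [0.25, 0.5, 0.5, 1.0, 1.0]

mean_recall = mean(recalls)
median_recall = median(recalls)

print(f"mean recall:   {mean_recall:.2f}")
print(f"median recall: {median_recall:.2f}")
```

With only 11 charts, a few very high or very low per-chart scores can move the mean noticeably, which is why it helps to read both figures together.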