Cortex Engine Benchmarks

Cortex Engine v0.4 (Sonnet 4.6) catches an average of 61% of the coding and documentation errors a senior reviewer would flag in primary-care charts.

The 61% figure is mean recall across 11 de-identified primary-care charts that contain real coding and documentation mistakes. The judge is a frontier reasoning model with no Cortex affiliation; we score Cortex's flags against the judge's gold list and compute precision, recall, and F1 per chart. See methodology for the full setup and caveats.
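
As a rough illustration of the per-chart scoring described above, the sketch below computes precision, recall, and F1 from three counts. It assumes the standard definitions (precision = matched / flagged, recall = matched / gold, F1 = their harmonic mean). The function name score_chart and the matched count are hypothetical: how a Cortex flag is matched to a gold issue is defined in the methodology, and the published figures may rest on matching rules this simple count model does not capture.

```python
from dataclasses import dataclass


@dataclass
class ChartScore:
    precision: float
    recall: float
    f1: float


def score_chart(gold_issues: int, flagged: int, matched: int) -> ChartScore:
    """Score one chart from raw counts (hypothetical helper, not Cortex's code).

    gold_issues: number of issues on the judge's gold list
    flagged:     number of issues Cortex flagged
    matched:     number of flags that correspond to a gold issue
    """
    precision = matched / flagged if flagged else 0.0
    recall = matched / gold_issues if gold_issues else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall; define it as 0 when both are 0.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return ChartScore(precision, recall, f1)


# chart-010 reports 8 gold issues and 7 flags; 6 matches is an assumed count
# chosen because it reproduces the reported 75% / 86% / 80%.
score = score_chart(gold_issues=8, flagged=7, matched=6)
print(f"recall {score.recall:.0%}, precision {score.precision:.0%}, F1 {score.f1:.0%}")
```

Under these assumptions the chart-010 example prints recall 75%, precision 86%, F1 80%, matching that chart's row in the per-chart breakdown.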

Median recall	Median precision	Median F1	Median $ / chart	Median sec / chart
50%	40%	44%	$0.033	25.4s

All evaluated systems

System	Charts	Recall (mean)	Precision (mean)	F1 (mean)	$ / chart	sec / chart
Cortex Engine (current production) claude-sonnet-4-6 · prompts v0.4-brevity	1	—	—	—	$0.057	29.0s
Cortex Engine v0.4 (Haiku 4.5) claude-haiku-4-5-20251001 · prompts v0.4	3	—	—	—	$0.0044	4.6s
Cortex Engine v0.4 (Opus 4.7) claude-opus-4-7 · prompts v0.4	1	—	—	—	$0.306	25.6s
Cortex Engine v0.4 (Sonnet 4.6) claude-sonnet-4-6 · prompts v0.4	11	61%	44%	47%	$0.033	25.4s
Cortex Engine v0.3 (Sonnet 4.6, no brevity rule) claude-sonnet-4-6 · prompts v0.3	1	—	—	—	$0.071	41.8s

Per-chart breakdown · Cortex Engine v0.4 (Sonnet 4.6)

Chart	Gold issues	Cortex flagged	Recall	Precision	F1
chart-007	12	10	17%	20%	18%
chart-006	1	6	100%	17%	29%
chart-005	12	8	27%	38%	32%
chart-001	7	9	50%	33%	40%
chart-009	1	4	100%	25%	40%
chart-003	4	5	50%	40%	44%
chart-004	10	8	44%	50%	47%
chart-008	14	11	46%	55%	50%
example-001	10	13	62%	62%	62%
chart-002	3	5	100%	60%	75%
chart-010	8	7	75%	86%	80%
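
The headline means and medians can be re-derived from the per-chart rows above. The sketch below is a minimal check that works from the rounded percentages in the table, so results computed from unrounded scores could differ slightly; the helper name summarize is hypothetical.

```python
from statistics import mean, median

# (recall %, precision %, F1 %) per chart, copied from the per-chart table.
per_chart = {
    "chart-007": (17, 20, 18),
    "chart-006": (100, 17, 29),
    "chart-005": (27, 38, 32),
    "chart-001": (50, 33, 40),
    "chart-009": (100, 25, 40),
    "chart-003": (50, 40, 44),
    "chart-004": (44, 50, 47),
    "chart-008": (46, 55, 50),
    "example-001": (62, 62, 62),
    "chart-002": (100, 60, 75),
    "chart-010": (75, 86, 80),
}


def summarize(values):
    """Return (mean, median) rounded to whole percentage points."""
    return round(mean(values)), round(median(values))


# Transpose the rows into one tuple per metric.
recall, precision, f1 = zip(*per_chart.values())

for name, column in (("recall", recall), ("precision", precision), ("F1", f1)):
    avg, mid = summarize(column)
    print(f"{name}: mean {avg}%, median {mid}%")
```

This prints mean 61% / median 50% for recall, 44% / 40% for precision, and 47% / 44% for F1, in line with the summary figures and the Sonnet 4.6 row. The gap between mean and median recall comes largely from the three charts scored at 100%, two of which have a single gold issue.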