CortexCharts

Cortex Engine Benchmarks

Measured on the mistakes that matter.

These benchmarks score Cortex against de-identified primary-care charts with real coding and documentation gaps. The useful question is not which model sounds smartest. It is which engine catches more clinically relevant issues without becoming slow, expensive, or noisy.

The decision, in plain terms

01

Opus is the current accuracy ceiling.

Cortex Engine v0.4 (Opus 4.7) posted the highest mean recall at 54% and the highest overall accuracy at 51%. At 14.0¢ per chart it is also the most expensive row, but it sets the bar for what the prompt stack can catch.

02

The fastest engines are not ready to replace production.

Cortex Engine v0.5 (Cerebras GPT-OSS 120B) ran at 1.1s median per chart but reached only 21% recall, against 47% for the Sonnet 4.6 production baseline. The low-cost and speed candidates still give up too much recall for a physician-facing default.

03

GPT-5.5 is precise, but not the fast candidate.

Cortex Engine v0.5 (OpenAI GPT-5.5) is the strongest OpenAI row by overall accuracy in this sweep, at 47%. Its 87% precision is useful, but at 7.5¢ and 37.6s per chart (the slowest row) it is a comparison point, not the obvious production default.

Fastest

1.1s

Cortex Engine v0.5 (Cerebras GPT-OSS 120B)

Cheapest

0.11¢

Cortex Engine v0.5 (OpenAI GPT-5.4 Nano)

Best overall accuracy

51%

Cortex Engine v0.4 (Opus 4.7)

All evaluated systems

Leaderboard

Rows are sorted by overall accuracy, highest first.

Cortex Engine v0.4 (Opus 4.7)
claude-opus-4-7 · prompts v0.4 · 11 charts
Accuracy ceiling · Highest recall in this sweep
Overall accuracy 51% · Recall 54% · Precision 58% · Cost 14.0¢ / chart · Latency 15.9s / chart · Lane errors 0

Cortex Engine v0.5 (OpenAI GPT-5.5)
gpt-5.5 · prompts v0.4 · 11 charts
OpenAI candidate · High precision, expensive
Overall accuracy 47% · Recall 41% · Precision 87% · Cost 7.5¢ / chart · Latency 37.6s / chart · Lane errors 0

Cortex Engine v0.5 (Gemini 3.1 Pro Preview, thinking high)
gemini-3.1-pro-preview · prompts v0.4 · 11 charts
Gemini candidate · Fast Gemini candidate
Overall accuracy 45% · Recall 39% · Precision 81% · Cost 3.8¢ / chart · Latency 22.5s / chart · Lane errors 0

Cortex Engine v0.4 (Haiku 4.5)
claude-haiku-4-5-20251001 · prompts v0.4 · 11 charts
Anthropic low-cost · Comparison point
Overall accuracy 38% · Recall 45% · Precision 40% · Cost 0.80¢ / chart · Latency 8.5s / chart · Lane errors 0

Cortex Engine v0.5 (OpenAI GPT-5.4)
gpt-5.4 · prompts v0.4 · 11 charts
OpenAI candidate · Comparison point
Overall accuracy 38% · Recall 35% · Precision 50% · Cost 1.7¢ / chart · Latency 12.0s / chart · Lane errors 0

Cortex Engine v0.4 (Sonnet 4.6)
claude-sonnet-4-6 · prompts v0.4 · 11 charts
Production baseline · Current production
Overall accuracy 35% · Recall 47% · Precision 34% · Cost 2.8¢ / chart · Latency 26.8s / chart · Lane errors 0

Cortex Engine v0.5 (Gemini 3.5 Flash, thinking 2048)
gemini-3.5-flash · prompts v0.4 · 11 charts
Gemini candidate · Fast Gemini candidate
Overall accuracy 32% · Recall 29% · Precision 71% · Cost 1.5¢ / chart · Latency 6.3s / chart · Lane errors 0

Cortex Engine v0.5 (OpenAI GPT-5.4 Nano)
gpt-5.4-nano · prompts v0.4 · 11 charts
OpenAI candidate · Cheapest OpenAI candidate
Overall accuracy 32% · Recall 27% · Precision 44% · Cost 0.11¢ / chart · Latency 6.1s / chart · Lane errors 0

Cortex Engine v0.5 (Cerebras GPT-OSS 120B)
gpt-oss-120b · prompts v0.4 · 11 charts
Speed floor · Fastest, misses more
Overall accuracy 22% · Recall 21% · Precision 53% · Cost 0.17¢ / chart · Latency 1.1s / chart · Lane errors 0
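The three headline cards (Fastest, Cheapest, Best overall accuracy) can be re-derived from the leaderboard rows with a minimum or maximum over one column each. The sketch below is illustrative only: the figures are copied from this page, while the `Row` type and `headline_cards` function are hypothetical names, not part of the Cortex repo.

```python
from typing import Dict, List, NamedTuple


class Row(NamedTuple):
    system: str
    overall_pct: int        # overall accuracy, percent
    cents_per_chart: float  # cost per chart, cents
    sec_per_chart: float    # latency per chart, seconds


# Figures copied from the leaderboard on this page.
ROWS: List[Row] = [
    Row("Cortex Engine v0.4 (Opus 4.7)", 51, 14.0, 15.9),
    Row("Cortex Engine v0.5 (OpenAI GPT-5.5)", 47, 7.5, 37.6),
    Row("Cortex Engine v0.5 (Gemini 3.1 Pro Preview, thinking high)", 45, 3.8, 22.5),
    Row("Cortex Engine v0.4 (Haiku 4.5)", 38, 0.80, 8.5),
    Row("Cortex Engine v0.5 (OpenAI GPT-5.4)", 38, 1.7, 12.0),
    Row("Cortex Engine v0.4 (Sonnet 4.6)", 35, 2.8, 26.8),
    Row("Cortex Engine v0.5 (Gemini 3.5 Flash, thinking 2048)", 32, 1.5, 6.3),
    Row("Cortex Engine v0.5 (OpenAI GPT-5.4 Nano)", 32, 0.11, 6.1),
    Row("Cortex Engine v0.5 (Cerebras GPT-OSS 120B)", 22, 0.17, 1.1),
]


def headline_cards(rows: List[Row]) -> Dict[str, Row]:
    """Pick the winning row for each headline card."""
    return {
        "Fastest": min(rows, key=lambda r: r.sec_per_chart),
        "Cheapest": min(rows, key=lambda r: r.cents_per_chart),
        "Best overall accuracy": max(rows, key=lambda r: r.overall_pct),
    }


cards = headline_cards(ROWS)
for title, row in cards.items():
    print(f"{title}: {row.system}")
```

Each card is decided on a single column, which is why none of the three winners coincide: no row in this sweep leads on speed, cost, and accuracy at once.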

Weakest charts · Cortex Engine v0.4 (Sonnet 4.6)

Lowest overall-accuracy rows stay visible so a high mean cannot hide weak chart types.

Chart      Gold  Flagged  Recall  Overall
chart-005    12        7      0%       0%
chart-003     4        5     25%      22%
chart-001     7        8     29%      27%
chart-006     1        6    100%      29%
chart-007    12        9     25%      29%
chart-010     8        7     33%      38%

Weakest charts · Cortex Engine v0.4 (Opus 4.7)

Lowest overall-accuracy rows stay visible so a high mean cannot hide weak chart types.

Chart      Gold  Flagged  Recall  Overall
chart-007    12        6     17%      22%
chart-005    12        5     17%      24%
chart-006     1        5    100%      33%
chart-004    10        6     30%      37%
chart-010     8        6     33%      40%
chart-001     7        6     43%      46%

Benchmark method

Methodology

Accuracy

The published leaderboard currently uses GPT-5.5 as the primary judge, both to create the gold issues and to match Cortex output against them. Precision, recall, and overall accuracy are then computed in code from the judge's match array; overall accuracy is the F1 score, the harmonic mean of precision and recall. The benchmark never trusts the judge's own arithmetic.
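As a minimal sketch of that calculation, the function below scores one chart from a match array. The shape of the match array (a list of gold-id and flagged-id pairs), the function name, and the ids are assumptions for illustration; the repo's actual schema is not shown on this page.

```python
from typing import List, NamedTuple, Set, Tuple


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float  # reported on the leaderboard as "overall accuracy"


def score_matches(
    gold_ids: Set[str],
    flagged_ids: Set[str],
    matches: List[Tuple[str, str]],  # hypothetical shape: (gold_id, flagged_id)
) -> Scores:
    """Compute precision, recall, and F1 for one chart from its match array."""
    # Count distinct ids so a duplicated match cannot inflate either score.
    matched_gold = {g for g, _ in matches if g in gold_ids}
    matched_flagged = {f for _, f in matches if f in flagged_ids}
    recall = len(matched_gold) / len(gold_ids) if gold_ids else 0.0
    precision = len(matched_flagged) / len(flagged_ids) if flagged_ids else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


# Hypothetical chart: 4 gold issues, 5 flagged issues, 1 match.
gold = {"g1", "g2", "g3", "g4"}
flagged = {"f1", "f2", "f3", "f4", "f5"}
scores = score_matches(gold, flagged, [("g2", "f4")])
print(f"precision={scores.precision:.0%} recall={scores.recall:.0%} overall={scores.f1:.0%}")
# prints: precision=20% recall=25% overall=22%
```

The counts in the example mirror the Gold and Flagged values listed for chart-003 under Sonnet 4.6. The single match is inferred from that row's 25% recall, not read from the repo.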

Oracle

We do not treat any single model as canonical truth. Consensus oracle v1 keeps GPT-5.5 as the judge behind the published score for now, records disagreement audits from Opus 4.7, and is designed to add newer frontier judges as they become clearly stronger.

Corpus

The corpus contains de-identified primary-care charts captured from real Practice Fusion workflows. Some charts have no encounter note, so note-dependent lanes are skipped for those rows.
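One way to picture the lane-skipping rule is a filter applied per chart before any lane runs. This is a sketch under stated assumptions: the lane names, the `needs_encounter_note` flag, and the `Chart` fields are all hypothetical and are not taken from the Cortex repo.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Lane:
    name: str                   # hypothetical lane name
    needs_encounter_note: bool  # True if the lane reads the encounter note


@dataclass(frozen=True)
class Chart:
    chart_id: str
    encounter_note: Optional[str]  # None or blank when the chart has no note


def runnable_lanes(chart: Chart, lanes: List[Lane]) -> List[Lane]:
    """Return the lanes that can run on this chart, skipping note-dependent
    lanes when there is no encounter note to read."""
    has_note = bool(chart.encounter_note and chart.encounter_note.strip())
    return [lane for lane in lanes if has_note or not lane.needs_encounter_note]


# Hypothetical lanes: one works from structured data, one needs the note.
LANES = [
    Lane("coding_gaps", needs_encounter_note=False),
    Lane("note_documentation", needs_encounter_note=True),
]
no_note_chart = Chart("chart-example", encounter_note=None)
print([lane.name for lane in runnable_lanes(no_note_chart, LANES)])
# prints: ['coding_gaps']
```

Skipping a lane this way is different from a lane failing: a skipped lane produces no flagged issues and no error for that row.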

Publishing guardrails

The public summary ignores single-chart smoke runs, archived prompt controls, and superseded variants; raw result files stay in the repo. Lane-level provider failures are tracked separately from total row failures.
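The guardrails amount to a filter over raw runs plus two separate failure counters. The sketch below assumes a hypothetical `Run` record; every field name here is invented for illustration and does not describe the real result files or `summary.json`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    system_id: str
    chart_count: int
    archived_prompt_control: bool = False
    superseded_by: Optional[str] = None  # id of the variant that replaced this run
    lane_provider_failures: int = 0      # individual lanes that failed at the provider
    failed_rows: int = 0                 # charts where the whole row failed


def is_publishable(run: Run) -> bool:
    """Apply the three exclusions named in the guardrails."""
    if run.chart_count <= 1:           # single-chart smoke run
        return False
    if run.archived_prompt_control:    # archived prompt control
        return False
    if run.superseded_by is not None:  # superseded variant
        return False
    return True


def public_summary(runs: List[Run]) -> List[dict]:
    """Keep publishable runs and report the two failure counts separately."""
    return [
        {
            "system": run.system_id,
            "charts": run.chart_count,
            "lane_provider_failures": run.lane_provider_failures,
            "failed_rows": run.failed_rows,
        }
        for run in runs
        if is_publishable(run)
    ]


runs = [
    Run("engine-a", chart_count=11, lane_provider_failures=2),
    Run("engine-a-smoke", chart_count=1),
    Run("engine-b-old", chart_count=11, superseded_by="engine-b"),
    Run("prompt-control", chart_count=11, archived_prompt_control=True),
]
print([row["system"] for row in public_summary(runs)])
# prints: ['engine-a']
```

The filter only affects what is summarized. The excluded runs remain in the input list, in the same way the raw result files stay in the repo.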

Caveats

This is still a small, primary-care-only corpus. Treat the ranking as a decision aid, not a clinical validation claim. Product-debatable and single-judge disagreements are flagged in the repo before they affect leaderboard scoring.

Judge: GPT-5.5 (high). Reproduce with pnpm eval --system <id>, then pnpm eval:summarize. Summary file: public/benchmarks/summary.json. Oracle policy: research/eval/consensus/v1/manifest.json.