Realtime Retro Board — 72-Run Model Benchmark

Model family	Runs	Mean score	Score range	Cost range (USD)
Claude Opus 4.7	36	41.4	39–42	$2.22–7.63
Claude Opus 4.6	21	40.5	38–42	$2.04–6.62
Claude Sonnet 4.6	10	41.0	39–42	$1.08–4.90
Gemini	3	39.3	38–40	—
Qwen	2	30.5	24–37	$41.41–178.88
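
The Mean score and Score range columns can be cross-checked against the per-run totals listed in the feature heatmap. A minimal sketch in Python, with the totals for the three smaller families copied by hand from that heatmap; the dictionary layout and names are illustrative and not part of the benchmark's own tooling.

```python
from statistics import mean

# Per-run totals copied from the feature heatmap (illustrative layout).
totals = {
    "Claude Sonnet 4.6": [42, 42, 42, 42, 41, 41, 41, 41, 39, 39],
    "Gemini": [40, 40, 38],
    "Qwen": [37, 24],
}

# Recompute the summary columns for each family.
summary = {
    family: (len(scores), round(mean(scores), 1), min(scores), max(scores))
    for family, scores in totals.items()
}

for family, (runs, avg, low, high) in summary.items():
    print(f"{family}: runs={runs} mean={avg:.1f} range={low}-{high}")
```

The same check works for the two Opus families once their 36 and 21 totals are filled in.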

Scoring Instrument

The 14-criterion rubric

Every run is scored on the same 14 functional criteria, each rated 3 / 2 / 1; the run's total is their sum (42 max). The numbered columns in the heatmap below correspond to these criteria.

#	Criterion	What it checks
1	Local Dev Environment	Does the local development environment come up without manual code changes?
2	Docker Deployment	Does the Docker image build and run without errors?
3	Home Page	Does the landing page load correctly and show the board dashboard?
4	Board Creation	Can you create a new board with a custom title?
5	User Identification	Can users identify themselves for the board?
6	Card Interaction	Can you add cards to columns?
7	Moving Cards	Can cards be moved between different columns?
8	Commenting	Can users add comments to existing cards?
9	Realtime Updates (new card)	Does a new card added to a board reflect in other browser windows instantly without refresh?
10	Realtime Updates (moved card)	Does a card moved to a different column reflect in other browser windows instantly without refresh?
11	Realtime Updates (new comment)	Does a new comment added to a card reflect in other browser windows instantly without refresh?
12	Data Persistence	Does data survive a server reboot?
13	Documentation	Is there documentation for the API and running the app?
14	Export	Does the app export data to CSV?

Score	Label	Meaning
3	Pass	Worked on the first try, no changes needed.
2	Fixed	Failed initially, fixed after one prompt to the agent.
1	Failed	Never fully worked despite prompting.
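
To make the scoring rule concrete, here is a small sketch of how a run total follows from its 14 ratings. The function name and validation are assumptions for illustration; the example row is copied from the heatmap entry for claude_opus_4.6_high_run_1.

```python
NUM_CRITERIA = 14
MAX_PER_CRITERION = 3


def total_score(ratings):
    """Sum 14 per-criterion ratings (each 1, 2, or 3) into a run total."""
    if len(ratings) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} ratings, got {len(ratings)}")
    if any(r not in (1, 2, 3) for r in ratings):
        raise ValueError("each rating must be 1, 2, or 3")
    return sum(ratings)


# Ratings for claude_opus_4.6_high_run_1, criteria 1 to 14.
row = [3, 2, 3, 3, 3, 3, 1, 3, 2, 3, 3, 3, 3, 3]
print(total_score(row), "of", NUM_CRITERIA * MAX_PER_CRITERION)  # 38 of 42
```

A total of 42 is only possible when every criterion passed on the first try, which is why the perfect score doubles as a first-try reliability signal.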

Opus 4.7 cell (effort · add-on)	Mean score	First-try 42/42	Median cost (USD)	Aesthetics (/5)
High · base	41.0	2/6	$3.06	3.0
High · +Playwright	41.5	3/6	$4.34	3.0
High · +design prompt	40.5	0/6	$4.28	4.7
xHigh · base	42.0	6/6	$3.33	3.0
xHigh · +Playwright	42.0	6/6	$5.59	3.0
xHigh · +design prompt	41.5	4/6	$5.17	4.8
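
Each row of this table aggregates six runs. As an illustration, the "High · base" row can be rebuilt from the six Opus 4.7 High runs without Playwright or the design prompt; the (score, cost) pairs below are copied from the run list and heatmap, and the variable names are hypothetical.

```python
from statistics import mean, median

# (total score, final cost in USD) for the six Opus 4.7 "High · base" runs.
high_base = [(42, 3.15), (41, 3.13), (41, 2.22), (41, 2.99), (39, 3.96), (42, 2.79)]

scores = [score for score, _ in high_base]
costs = [cost for _, cost in high_base]

mean_score = mean(scores)
# A total of 42 means every criterion passed without a follow-up prompt.
perfect = sum(1 for score in scores if score == 42)
median_cost = median(costs)

print(f"mean={mean_score:.1f}  first-try 42/42={perfect}/{len(scores)}  "
      f"median cost=${median_cost:.2f}")
```

The median is used for cost because a single expensive session would otherwise pull a six-run mean noticeably upward.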

Score Rankings

All 72 runs

Sorted by total score (highest first), then by cost (lowest first). Repeated configurations appear multiple times by design; that spread is the point.

Model	Run	Configuration	Final cost	Aesthetics
Sonnet 4.6	claude_sonnet_4.6_high_run_3	High	$1.08	n/a
Sonnet 4.6	claude_sonnet_4.6_high_with_antigravity_prompt_run_3	High · Design prompt	$1.82	n/a
Opus 4.6	claude_opus_4.6_high_run_5	High	$2.54	n/a
Opus 4.7	claude_opus_4.7_xhigh_run_2	xHigh	$2.54	★★★☆☆
Sonnet 4.6	claude_sonnet_4.6_with_playwright_high_run_2	High · Playwright	$2.57	n/a
Opus 4.7	claude_opus_4.7_xhigh_run_6	xHigh	$2.77	★★★☆☆
Opus 4.7	claude_opus_4.7_high_run_6	High	$2.79	★★★☆☆
Opus 4.7	claude_opus_4.7_high	High	$3.15	n/a
Opus 4.7	claude_opus_4.7_with_playwright_high_run_2	High · Playwright	$3.26	★★★☆☆
Opus 4.7	claude_opus_4.7_xhigh	xHigh	$3.31	n/a
Opus 4.7	claude_opus_4.7_xhigh_run_5	xHigh	$3.36	★★★☆☆
Opus 4.7	claude_opus_4.7_with_playwright_high	High · Playwright	$4.01	n/a
Opus 4.7	claude_opus_4.7_xhigh_run_3	xHigh	$4.12	★★★☆☆
Opus 4.7	claude_opus_4.7_xhigh_run_4	xHigh	$4.22	★★★☆☆
Opus 4.7	claude_opus_4.7_with_playwright_high_run_3	High · Playwright	$4.47	★★★☆☆
Opus 4.7	claude_opus_4.7_with_playwright_xhigh_run_3	xHigh · Playwright	$4.51	★★★☆☆
Opus 4.7	claude_opus_4.7_with_playwright_xhigh_run_5	xHigh · Playwright	$4.56	★★★☆☆
Opus 4.7	claude_opus_4.7_xhigh_with_antigravity_prompt_run_2	xHigh · Design prompt	$4.77	★★★★★
Opus 4.7	claude_opus_4.7_xhigh_with_antigravity_prompt_run_4	xHigh · Design prompt	$4.80	★★★★★
Opus 4.7	claude_opus_4.7_xhigh_with_antigravity_prompt_run_5	xHigh · Design prompt	$4.99	★★★★☆
Opus 4.7	claude_opus_4.7_with_playwright_xhigh	xHigh · Playwright	$5.57	n/a
Opus 4.7	claude_opus_4.7_xhigh_with_antigravity_prompt	xHigh · Design prompt	$5.59	n/a
Opus 4.7	claude_opus_4.7_with_playwright_xhigh_run_6	xHigh · Playwright	$5.61	★★★☆☆
Opus 4.7	claude_opus_4.7_with_playwright_xhigh_run_2	xHigh · Playwright	$6.16	★★★☆☆
Opus 4.7	claude_opus_4.7_with_playwright_xhigh_run_4	xHigh · Playwright	$7.63	★★★☆☆
Sonnet 4.6	antigravity_sonnet_4.6	Antigravity · Design prompt	—	n/a
Sonnet 4.6	claude_sonnet_4.6_high_run_2	High	$1.40	n/a
Sonnet 4.6	claude_sonnet_4.6_high_with_antigravity_prompt	High · Design prompt	$1.69	n/a
Opus 4.6	claude_opus_4.6_high_run_6	High	$2.04	n/a
Opus 4.6	claude_opus_4.6_high_run_4	High	$2.09	n/a
Opus 4.7	claude_opus_4.7_high_run_3	High	$2.22	n/a
Sonnet 4.6	claude_sonnet_4.6_high_with_antigravity_prompt_run_2	High · Design prompt	$2.27	n/a
Opus 4.6	claude_opus_4.6_high_with_antigravity_prompt_run_5	High · Design prompt	$2.34	★★★★★
Opus 4.6	claude_opus_4.6_high_with_antigravity_prompt_run_6	High · Design prompt	$2.72	★★★★☆
Opus 4.6	claude_opus_4.6_high_run_2	High	$2.77	n/a
Sonnet 4.6	claude_sonnet_4.6_max_with_antigravity_prompt	Max · Design prompt	$2.90	n/a
Opus 4.7	claude_opus_4.7_high_run_4	High	$2.99	★★★☆☆
Opus 4.6	claude_opus_4.6_high_with_antigravity_prompt	High · Design prompt	$3.12	n/a
Opus 4.7	claude_opus_4.7_high_run_2	High	$3.13	n/a
Opus 4.7	claude_opus_4.7_high_with_antigravity_prompt_run_6	High · Design prompt	$3.15	★★★★★
Opus 4.6	claude_opus_4.6_with_playwright_high_run_6	High · Playwright	$3.19	n/a
Opus 4.6	claude_opus_4.6_with_playwright_high_run_5	High · Playwright	$3.30	n/a
Opus 4.6	claude_opus_4.6_with_playwright_high_run_4	High · Playwright	$3.47	n/a
Opus 4.6	claude_opus_4.6_with_playwright_high_run_3	High · Playwright	$3.56	n/a
Opus 4.7	claude_opus_4.7_high_with_antigravity_prompt	High · Design prompt	$3.64	n/a
Opus 4.7	claude_opus_4.7_high_with_antigravity_prompt_run_2	High · Design prompt	$4.08	n/a
Opus 4.7	claude_opus_4.7_with_playwright_high_run_4	High · Playwright	$4.22	★★★☆☆
Opus 4.7	claude_opus_4.7_high_with_antigravity_prompt_run_3	High · Design prompt	$4.48	n/a
Opus 4.6	claude_opus_4.6_with_playwright_high_run_2	High · Playwright	$4.49	n/a
Opus 4.7	claude_opus_4.7_with_playwright_high_run_6	High · Playwright	$4.88	★★★☆☆
Opus 4.7	claude_opus_4.7_with_playwright_high_run_5	High · Playwright	$5.03	★★★☆☆
Opus 4.7	claude_opus_4.7_xhigh_with_antigravity_prompt_run_6	xHigh · Design prompt	$5.34	★★★★★
Opus 4.6	claude_opus_4.6_high	High	$5.61	n/a
Opus 4.6	claude_opus_4.6_high_run_3	High	$2.74	n/a
Opus 4.6	claude_opus_4.6_high_with_antigravity_prompt_run_2	High · Design prompt	$3.11	★★★★☆
Opus 4.6	claude_opus_4.6_high_with_antigravity_prompt_run_3	High · Design prompt	$3.51	★★★★☆
Opus 4.6	claude_opus_4.6_with_playwright_high	High · Playwright	$5.36	n/a
Opus 4.6	claude_opus_4.6_high_with_antigravity_prompt_run_4	High · Design prompt	$6.08	★★★★☆
Opus 4.7	claude_opus_4.7_high_with_antigravity_prompt_run_5	High · Design prompt	$6.57	★★★★★
Opus 4.7	claude_opus_4.7_xhigh_with_antigravity_prompt_run_3	xHigh · Design prompt	$7.25	★★★★★
Gemini 3.1 Pro high	antigravity_gemini_3.1_pro_high	Antigravity · Design prompt	—	n/a
Gemini 3.1 Pro low	antigravity_gemini_3.1_pro_low	Antigravity · Design prompt	—	n/a
Opus 4.6	antigravity_opus_4.6	Antigravity · Design prompt	—	n/a
Sonnet 4.6	claude_sonnet_4.6_high	High	$1.58	n/a
Opus 4.6	claude_opus_4.6_with_playwright_high_run_7	High · Playwright	$2.08	n/a
Opus 4.7	claude_opus_4.7_high_run_5	High	$3.96	★★★☆☆
Opus 4.7	claude_opus_4.7_high_with_antigravity_prompt_run_4	High · Design prompt	$4.71	★★★★☆
Sonnet 4.6	claude_sonnet_4.6_with_playwright_high	High · Playwright	$4.90	n/a
Opus 4.6	claude_opus_4.6_high_run_1	High	$6.62	n/a
Gemini 3.1 Flash	antigravity_gemini3_flash	Antigravity · Design prompt	—	n/a
Qwen 3.6	claude_qwen_3.6_high_with_playwright	High · Playwright	$41.41	n/a
Qwen Coder Next	claude_qwen_coder_next_high_with_playwright	High · Playwright	$178.88	n/a

Feature Heatmap

Runs × 14 criteria

Each cell is a criterion score (3 / 2 / 1); column numbers match the 14 rubric criteria. Rows are sorted by total. Docker (col 2) and Local Dev (col 1) carry most of the misses.

Run · total	1	2	3	4	5	6	7	8	9	10	11	12	13	14
claude_sonnet_4.6_high_run_3 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_sonnet_4.6_high_with_antigravity_prompt_run_3 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_high_run_5 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_run_2 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_sonnet_4.6_with_playwright_high_run_2 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_run_6 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_high_run_6 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_high · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_high_run_2 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_run_5 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_high · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_run_3 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_run_4 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_high_run_3 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_xhigh_run_3 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_xhigh_run_5 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_with_antigravity_prompt_run_2 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_with_antigravity_prompt_run_4 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_with_antigravity_prompt_run_5 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_xhigh · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_with_antigravity_prompt · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_xhigh_run_6 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_xhigh_run_2 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_xhigh_run_4 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
antigravity_sonnet_4.6 · 42	3	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_sonnet_4.6_high_run_2 · 41	2	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_sonnet_4.6_high_with_antigravity_prompt · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_high_run_6 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_high_run_4 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_high_run_3 · 41	2	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_sonnet_4.6_high_with_antigravity_prompt_run_2 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_high_with_antigravity_prompt_run_5 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_high_with_antigravity_prompt_run_6 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_high_run_2 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_sonnet_4.6_max_with_antigravity_prompt · 41	3	3	3	3	3	3	2	3	3	3	3	3	3	3
claude_opus_4.7_high_run_4 · 41	2	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_high_with_antigravity_prompt · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_high_run_2 · 41	2	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_high_with_antigravity_prompt_run_6 · 41	2	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_with_playwright_high_run_6 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_with_playwright_high_run_5 · 41	3	3	2	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_with_playwright_high_run_4 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_with_playwright_high_run_3 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_high_with_antigravity_prompt · 41	2	3	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_high_with_antigravity_prompt_run_2 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_high_run_4 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_high_with_antigravity_prompt_run_3 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_with_playwright_high_run_2 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_high_run_6 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_with_playwright_high_run_5 · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_with_antigravity_prompt_run_6 · 41	3	3	3	3	3	3	2	3	3	3	3	3	3	3
claude_opus_4.6_high · 41	3	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_high_run_3 · 40	2	2	3	3	3	3	3	3	3	3	3	3	3	3
claude_opus_4.6_high_with_antigravity_prompt_run_2 · 40	3	2	3	3	3	3	3	2	3	3	3	3	3	3
claude_opus_4.6_high_with_antigravity_prompt_run_3 · 40	3	2	3	3	3	3	2	3	3	3	3	3	3	3
claude_opus_4.6_with_playwright_high · 40	3	2	3	3	3	3	2	3	3	3	3	3	3	3
claude_opus_4.6_high_with_antigravity_prompt_run_4 · 40	3	2	3	3	3	2	3	3	3	3	3	3	3	3
claude_opus_4.7_high_with_antigravity_prompt_run_5 · 40	2	3	3	2	3	3	3	3	3	3	3	3	3	3
claude_opus_4.7_xhigh_with_antigravity_prompt_run_3 · 40	3	2	3	3	3	3	2	3	3	3	3	3	3	3
antigravity_gemini_3.1_pro_high · 40	3	2	3	3	3	3	3	3	3	3	3	3	2	3
antigravity_gemini_3.1_pro_low · 40	3	2	3	3	3	3	3	3	3	3	3	3	3	2
antigravity_opus_4.6 · 40	3	2	3	3	3	3	2	3	3	3	3	3	3	3
claude_sonnet_4.6_high · 39	3	2	3	3	3	3	3	3	2	2	3	3	3	3
claude_opus_4.6_with_playwright_high_run_7 · 39	3	2	3	3	3	3	1	3	3	3	3	3	3	3
claude_opus_4.7_high_run_5 · 39	2	3	3	3	3	3	3	3	3	3	3	1	3	3
claude_opus_4.7_high_with_antigravity_prompt_run_4 · 39	2	3	3	2	3	3	2	3	3	3	3	3	3	3
claude_sonnet_4.6_with_playwright_high · 39	3	2	3	3	3	3	3	1	3	3	3	3	3	3
claude_opus_4.6_high_run_1 · 38	3	2	3	3	3	3	1	3	2	3	3	3	3	3
antigravity_gemini3_flash · 38	2	3	2	2	3	3	3	3	3	3	3	3	2	3
claude_qwen_3.6_high_with_playwright · 37	3	2	3	2	2	3	2	2	3	3	3	3	3	3
claude_qwen_coder_next_high_with_playwright · 24	2	2	3	2	1	2	1	1	1	1	1	3	3	1
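
One way to see where the misses fall is to count, per criterion, how many runs scored below 3. The sketch below does this for five rows copied from the heatmap; it only illustrates the counting and does not reproduce the full 72-run tally.

```python
from collections import Counter

# A few rows copied from the heatmap: run name -> ratings for criteria 1 to 14.
rows = {
    "claude_opus_4.7_xhigh_run_2": [3] * 14,
    "claude_opus_4.6_high_run_6": [3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "claude_opus_4.7_high_run_3": [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "claude_opus_4.6_high_run_3": [2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "claude_opus_4.6_high_run_1": [3, 2, 3, 3, 3, 3, 1, 3, 2, 3, 3, 3, 3, 3],
}

# A "miss" is any rating below 3, i.e. the criterion did not pass first try.
misses = Counter()
for ratings in rows.values():
    for criterion, rating in enumerate(ratings, start=1):
        if rating < 3:
            misses[criterion] += 1

for criterion, count in misses.most_common():
    print(f"criterion {criterion}: {count}")
```

Even in this small sample, criteria 2 (Docker) and 1 (Local Dev) lead the count.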

Key Findings

What the data shows

Capability tier dominates

Frontier models cluster near the 42 ceiling (family means 39.3–41.4); the two cheap local Qwen models collapse to 24 and 37 at 20–90× the cost, all of it orchestration overhead. The tier gap dwarfs anything tools, prompt, or effort do within a tier (at most 1–2 points).

The tool adds cost, not reliability

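For Opus 4.7, turning Playwright on leaves the mean functional score essentially flat (41.0 → 41.5 at High, 42.0 → 42.0 at xHigh) while raising median cost by 42–68%. Three of the six Playwright-High runs still needed a fix for Docker, a fault a screenshot can't see.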
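
The 42–68% figure follows from the median costs in the Opus 4.7 cell table. A short sketch of that arithmetic, with the four medians copied from the table and hypothetical variable names:

```python
# Median cost in USD per Opus 4.7 cell: (base, +Playwright).
median_cost = {"High": (3.06, 4.34), "xHigh": (3.33, 5.59)}

# Relative cost increase from adding Playwright, in percent.
uplift = {
    effort: (with_playwright / base - 1) * 100
    for effort, (base, with_playwright) in median_cost.items()
}

for effort, pct in uplift.items():
    print(f"{effort}: +{pct:.0f}%")
```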

Effort buys first-try reliability

Opus 4.7, High → xHigh: the share of first-try-perfect runs rises from 28% (5 of 18) to 89% (16 of 18) for roughly 10–30% more cost. The reliability the tool didn't deliver, effort did.
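
The two percentages pool the three cells at each effort level. A minimal sketch, using the first-try counts from the Opus 4.7 cell table; the helper name is an assumption made for this example.

```python
# First-try 42/42 counts per Opus 4.7 cell as (perfect runs, total runs),
# in the order base, +Playwright, +design prompt.
cells = {
    "High": [(2, 6), (3, 6), (0, 6)],
    "xHigh": [(6, 6), (6, 6), (4, 6)],
}


def pooled_rate(counts):
    """Pool several cells into one (perfect, runs, percent) triple."""
    perfect = sum(p for p, _ in counts)
    runs = sum(n for _, n in counts)
    return perfect, runs, 100 * perfect / runs


rates = {effort: pooled_rate(counts) for effort, counts in cells.items()}
for effort, (perfect, runs, pct) in rates.items():
    print(f"{effort}: {perfect}/{runs} = {pct:.0f}%")
```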

Design prompt lifts aesthetics, not function

Functional score is essentially unchanged (within 0.5 points in the Opus 4.7 cells); the visual rating averages 4.5 with the design prompt versus 3.0 without, independent of effort and tool.
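
The 4.5 average is consistent with the star ratings shown in the run list. The sketch below assumes a rating is simply the number of filled stars and that only runs with a displayed rating are counted: thirteen design-prompt runs carry one (seven at five stars, six at four).

```python
from statistics import mean


def stars_to_rating(stars: str) -> int:
    """Count the filled stars in a rating string such as '★★★★☆'."""
    return stars.count("★")


# Star strings for the rated design-prompt runs, copied from the run list.
design_prompt = ["★★★★★"] * 7 + ["★★★★☆"] * 6
# Every rated run without the design prompt shows the same three-star string.
no_prompt = ["★★★☆☆"]

with_prompt_avg = mean(stars_to_rating(s) for s in design_prompt)
without_prompt_avg = mean(stars_to_rating(s) for s in no_prompt)
print(f"{with_prompt_avg:.1f} vs {without_prompt_avg:.1f}")
```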

Variability is effort-sensitive

At High, identical prompts scatter across 39–42 (and 24–627 lines of CSS); xHigh compresses the functional scatter to a near-uniform 42 (16 of 18 Opus 4.7 runs).
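
The scatter can be quantified with the range and the population standard deviation of run totals. The lists below are the 18 Opus 4.7 totals at each effort level, copied from the heatmap; picking standard deviation as the spread measure is a choice made for this sketch, not something the benchmark reports.

```python
from statistics import pstdev

# Opus 4.7 run totals: base, +Playwright, +design prompt (six runs each).
high = [42, 41, 41, 41, 39, 42,
        42, 42, 42, 41, 41, 41,
        41, 41, 41, 39, 40, 41]
xhigh = [42] * 6 + [42] * 6 + [42, 42, 40, 42, 42, 41]

spread = {
    name: (min(scores), max(scores), pstdev(scores))
    for name, scores in (("High", high), ("xHigh", xhigh))
}

for name, (low, top, sd) in spread.items():
    print(f"{name}: range {low}-{top}, std dev {sd:.2f}")
```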

Docker & npm are the dominant failures

Native builds of better-sqlite3 and the Express 5 wildcard-route syntax break most first-run containers. Capability and effort catch them; the tool doesn't.

Cost Efficiency

Score vs cost

Final session cost (USD) after all fixes, sorted low to high, with each run's total score. The two off-scale Qwen runs (local models, orchestration overhead only) are omitted here, as are the five Antigravity runs with no cost listed.

Run	Final cost	Total score
claude_sonnet_4.6_high_run_3	$1.08	42
claude_sonnet_4.6_high_run_2	$1.40	41
claude_sonnet_4.6_high	$1.58	39
claude_sonnet_4.6_high_with_antigravity_prompt	$1.69	41
claude_sonnet_4.6_high_with_antigravity_prompt_run_3	$1.82	42
claude_opus_4.6_high_run_6	$2.04	41
claude_opus_4.6_with_playwright_high_run_7	$2.08	39
claude_opus_4.6_high_run_4	$2.09	41
claude_opus_4.7_high_run_3	$2.22	41
claude_sonnet_4.6_high_with_antigravity_prompt_run_2	$2.27	41
claude_opus_4.6_high_with_antigravity_prompt_run_5	$2.34	41
claude_opus_4.6_high_run_5	$2.54	42
claude_opus_4.7_xhigh_run_2	$2.54	42
claude_sonnet_4.6_with_playwright_high_run_2	$2.57	42
claude_opus_4.6_high_with_antigravity_prompt_run_6	$2.72	41
claude_opus_4.6_high_run_3	$2.74	40
claude_opus_4.6_high_run_2	$2.77	41
claude_opus_4.7_xhigh_run_6	$2.77	42
claude_opus_4.7_high_run_6	$2.79	42
claude_sonnet_4.6_max_with_antigravity_prompt	$2.90	41
claude_opus_4.7_high_run_4	$2.99	41
claude_opus_4.6_high_with_antigravity_prompt_run_2	$3.11	40
claude_opus_4.6_high_with_antigravity_prompt	$3.12	41
claude_opus_4.7_high_run_2	$3.13	41
claude_opus_4.7_high	$3.15	42
claude_opus_4.7_high_with_antigravity_prompt_run_6	$3.15	41
claude_opus_4.6_with_playwright_high_run_6	$3.19	41
claude_opus_4.7_with_playwright_high_run_2	$3.26	42
claude_opus_4.6_with_playwright_high_run_5	$3.30	41
claude_opus_4.7_xhigh	$3.31	42
claude_opus_4.7_xhigh_run_5	$3.36	42
claude_opus_4.6_with_playwright_high_run_4	$3.47	41
claude_opus_4.6_high_with_antigravity_prompt_run_3	$3.51	40
claude_opus_4.6_with_playwright_high_run_3	$3.56	41
claude_opus_4.7_high_with_antigravity_prompt	$3.64	41
claude_opus_4.7_high_run_5	$3.96	39
claude_opus_4.7_with_playwright_high	$4.01	42
claude_opus_4.7_high_with_antigravity_prompt_run_2	$4.08	41
claude_opus_4.7_xhigh_run_3	$4.12	42
claude_opus_4.7_with_playwright_high_run_4	$4.22	41
claude_opus_4.7_xhigh_run_4	$4.22	42
claude_opus_4.7_with_playwright_high_run_3	$4.47	42
claude_opus_4.7_high_with_antigravity_prompt_run_3	$4.48	41
claude_opus_4.6_with_playwright_high_run_2	$4.49	41
claude_opus_4.7_with_playwright_xhigh_run_3	$4.51	42
claude_opus_4.7_with_playwright_xhigh_run_5	$4.56	42
claude_opus_4.7_high_with_antigravity_prompt_run_4	$4.71	39
claude_opus_4.7_xhigh_with_antigravity_prompt_run_2	$4.77	42
claude_opus_4.7_xhigh_with_antigravity_prompt_run_4	$4.80	42
claude_opus_4.7_with_playwright_high_run_6	$4.88	41
claude_sonnet_4.6_with_playwright_high	$4.90	39
claude_opus_4.7_xhigh_with_antigravity_prompt_run_5	$4.99	42
claude_opus_4.7_with_playwright_high_run_5	$5.03	41
claude_opus_4.7_xhigh_with_antigravity_prompt_run_6	$5.34	41
claude_opus_4.6_with_playwright_high	$5.36	40
claude_opus_4.7_with_playwright_xhigh	$5.57	42
claude_opus_4.7_xhigh_with_antigravity_prompt	$5.59	42
claude_opus_4.6_high	$5.61	41
claude_opus_4.7_with_playwright_xhigh_run_6	$5.61	42
claude_opus_4.6_high_with_antigravity_prompt_run_4	$6.08	40
claude_opus_4.7_with_playwright_xhigh_run_2	$6.16	42
claude_opus_4.7_high_with_antigravity_prompt_run_5	$6.57	40
claude_opus_4.6_high_run_1	$6.62	38
claude_opus_4.7_xhigh_with_antigravity_prompt_run_3	$7.25	40
claude_opus_4.7_with_playwright_xhigh_run_4	$7.63	42
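
A simple way to read the runs listed above for cost efficiency is to ask, per model, for the cheapest run that reached a perfect 42. The sketch below does this for a hand-picked subset of runs; the model labels come from the run list, and the tuple layout and function name are assumptions made for illustration.

```python
# (model, run, final cost in USD, total score) for a subset of the runs.
runs = [
    ("Sonnet 4.6", "claude_sonnet_4.6_high_run_3", 1.08, 42),
    ("Sonnet 4.6", "claude_sonnet_4.6_with_playwright_high_run_2", 2.57, 42),
    ("Opus 4.6", "claude_opus_4.6_high_run_6", 2.04, 41),
    ("Opus 4.6", "claude_opus_4.6_high_run_5", 2.54, 42),
    ("Opus 4.7", "claude_opus_4.7_high_run_3", 2.22, 41),
    ("Opus 4.7", "claude_opus_4.7_xhigh_run_2", 2.54, 42),
    ("Opus 4.7", "claude_opus_4.7_with_playwright_xhigh_run_4", 7.63, 42),
]


def cheapest_perfect(runs, perfect=42):
    """Return {model: (run, cost)} for the lowest-cost perfect run per model."""
    best = {}
    for model, run, cost, score in runs:
        if score != perfect:
            continue
        if model not in best or cost < best[model][1]:
            best[model] = (run, cost)
    return best


best = cheapest_perfect(runs)
for model, (run, cost) in best.items():
    print(f"{model}: {run} at ${cost:.2f}")
```

Cheaper runs that stopped at 41 are skipped on purpose, since the question here is the price of a perfect first-try result rather than the lowest price overall.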
