๐Ÿ” Realtime Retro Board โ€” Model Benchmark

A spec-driven evaluation of 22 AI coding implementations across models, agents, effort modes, and testing toolchains

22 Implementations
7 Distinct Models
15 Criteria Each
45 Max Score
7× Perfect Scores

The Task

Each agent received the same OpenSpec: build a self-hosted, real-time retrospective board with a React/Vite frontend, a Node.js/Express backend, Socket.io for live sync, and SQLite persistence, launchable via npm run dev and Dockerized in a single container. Core features required: board creation and listing, configurable columns, guest display-name auth, drag-and-drop card management, nested comments, real-time broadcast across all connected clients, CSV export, and documentation.

Scoring

3 Pass: worked without any code changes
2 Fixed: failed initially; fixed after prompting
1 / 0 Fail: partially fixed (1) or could not be fixed (0)

Score Cards: All 22 Implementations

claude_opus_4.7_high
Opus 4.7 · High · Claude
45/45 · Perfect · $3.15
★★★☆☆ Aesthetics 3/5

claude_opus_4.7_with_playwright_high
Opus 4.7 · High · Claude + Playwright
45/45 · Perfect · $4.01
★★★☆☆ Aesthetics 3/5

claude_opus_4.7_with_playwright_xhigh
Opus 4.7 · xHigh · Claude + Playwright
45/45 · Perfect · $5.57
★★★☆☆ Aesthetics 3/5

claude_opus_4.7_xhigh
Opus 4.7 · xHigh · Claude
45/45 · Perfect · $3.31
★★★☆☆ Aesthetics 3/5

claude_opus_4.7_xhigh_with_antigravity_prompt
Opus 4.7 · xHigh · Antigravity Prompt
45/45 · Perfect · $5.59
★★★★★ Aesthetics 5/5

antigravity_claude_sonnet_4.6
Sonnet 4.6 · Antigravity + GPT-OSS 120B
45/45 · Perfect · GPT QA
★★★★★ Aesthetics 5/5

claude_sonnet_4.6_with_playwright_xhigh
Sonnet 4.6 · xHigh · Claude + Playwright
45/45 · Perfect · $2.57
★★★☆☆ Aesthetics 3/5

claude-sonnet_4.6_xhigh
Sonnet 4.6 · xHigh · Claude
44/45 · 1 fix · $1.40
★★★☆☆ Aesthetics 3/5

claude_opus_4.6_xhigh_with_antigravity_prompt
Opus 4.6 · xHigh · Antigravity Prompt
43/45 · 1 fix · $3.12
★★★★★ Aesthetics 5/5

claude-sonnet_4.6_xhigh_with_antigravity_prompt
Sonnet 4.6 · xHigh · Antigravity Prompt
43/45 · 1 fix · $2.90
★★★★☆ Aesthetics 4/5

claude-opus-4.6_xhigh
Opus 4.6 · xHigh · Claude
42/45 · 2 fixes · $5.61
★★★☆☆ Aesthetics 3/5

antigravity_opus_4.6
Opus 4.6 · Antigravity + GPT-OSS 120B
41/45 · 2 fixes · GPT QA
★★★★☆ Aesthetics 4/5

antigravity_gemini_3.1_pro_high
Gemini 3.1 Pro · High · Antigravity
41/45 · 2 fixes
★★★★☆ Aesthetics 4/5

antigravity_gemini_3.1_pro_low
Gemini 3.1 Pro · Low · Antigravity
41/45 · 2 fixes
★★★★☆ Aesthetics 4/5

claude-opus-4.6_with_playwright_high
Opus 4.6 · High · Claude + Playwright
41/45 · 2 fixes · $5.36
★★★☆☆ Aesthetics 3/5

claude_sonnet_4.6_with_playwright_high
Sonnet 4.6 · High · Claude + Playwright
41/45 · 1 unfixed · $4.90
★★★☆☆ Aesthetics 3/5

claude-opus-4.6_with_playwright_xhigh
Opus 4.6 · xHigh · Claude + Playwright
40/45 · 1 unfixed · $2.08
★★☆☆☆ Aesthetics 2/5

claude_sonnet_4.7_high
Sonnet 4.7 · High · Claude
40/45 · 3 fixes · $1.58
★★★☆☆ Aesthetics 3/5

antigravity_gemini3_flash
Gemini 3.1 Flash · Antigravity
38/45 · startup fail
★★★☆☆ Aesthetics 3/5

claude_opus_4.6_high
Opus 4.6 · High · Claude
38/45 · 2 unfixed · $6.62
★★☆☆☆ Aesthetics 2/5

claude_qwen_3.6_high
Qwen 3.6 · High · Claude
33/45 · 5 fails · $41
★★☆☆☆ Aesthetics 2/5

claude_qwen_coder_next_high_with_playwright
Qwen Coder Next · High · Claude
16/45 · 8 fail/0 · $179
★☆☆☆☆ Aesthetics 1/5

Feature Heatmap: All Implementations × All Criteria

3 Pass first try   2 Fixed after prompting   1 Partially fixed   0 Could not fix

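| Implementation | 1 Dev | 2 Docker | 3 Home | 4 Board | 5 Auth | 6 Card+ | 7 Move | 8 Cmnt | 9 RT+ | 10 RTmv | 11 RTcm | 12 Pers | 13 D.Per | 14 Docs | 15 CSV | Σ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| antigravity_claude_sonnet_4.6 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| antigravity_gemini3_flash | 1 | 3 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 38 |
| antigravity_gemini_3.1_pro_high | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 41 |
| antigravity_gemini_3.1_pro_low | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 41 |
| antigravity_opus_4.6 | 3 | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 41 |
| claude-opus-4.6_with_playwright_high | 3 | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 41 |
| claude-opus-4.6_with_playwright_xhigh | 3 | 2 | 3 | 3 | 3 | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 40 |
| claude-opus-4.6_xhigh | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 42 |
| claude-sonnet_4.6_xhigh | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 44 |
| claude-sonnet_4.6_xhigh_with_antigravity_prompt | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 43 |
| claude_opus_4.6_high | 3 | 2 | 3 | 3 | 3 | 3 | 1 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 38 |
| claude_opus_4.6_xhigh_with_antigravity_prompt | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 43 |
| claude_opus_4.7_high | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_opus_4.7_with_playwright_high | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_opus_4.7_with_playwright_xhigh | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_opus_4.7_xhigh | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_opus_4.7_xhigh_with_antigravity_prompt | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_qwen_3.6_high | 3 | 2 | 3 | 2 | 2 | 3 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 39 |
| claude_qwen_coder_next_high_with_playwright | 2 | 2 | 3 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 0 | 21 |
| claude_sonnet_4.6_with_playwright_high | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 41 |
| claude_sonnet_4.6_with_playwright_xhigh | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_sonnet_4.7_high | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 40 |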

Criteria Legend

| # | Short | Criterion | What was tested |
|---|---|---|---|
| 1 | Dev | Local Dev Startup | App starts cleanly with npm run dev (no manual intervention). Frontend and backend both reachable. |
| 2 | Docker | Docker Build & Run | docker build succeeds and the container runs without errors. Includes native module pitfalls (better-sqlite3 gyp) and Express 5 wildcard issues. |
| 3 | Home | Home / Dashboard Page | Landing page loads, lists existing boards, and provides a way to create a new board. |
| 4 | Board | Board Creation | Creating a new board generates a unique ID, configurable column names are respected, and the board persists across refreshes. |
| 5 | Auth | Guest Authentication | Users can join a board with a display name (guest auth). Name is shown on cards/comments they create. |
| 6 | Card+ | Card Creation | Cards can be added to any column. Card content is saved and visible to all participants. |
| 7 | Move | Card Movement | Cards can be dragged (or otherwise moved) between columns. The new position persists after reload. |
| 8 | Cmnt | Nested Comments | Users can add threaded/nested comments on a card. Comment author and timestamp are shown. |
| 9 | RT+ | Real-time Card Add | A card added by one client appears on all other connected clients in real time (Socket.io broadcast). |
| 10 | RTmv | Real-time Card Move | Moving a card on one client is reflected live on all other clients without a page refresh. |
| 11 | RTcm | Real-time Comments | Comments posted by one client appear immediately on all other connected clients. |
| 12 | Pers | Data Persistence | Boards, cards, and comments survive a server restart (SQLite or equivalent is used correctly). |
| 13 | D.Per | Docker Persistence | Data persists across container restarts when using a Docker volume. Volume is mounted to /data, not /app. |
| 14 | Docs | Documentation | A README or equivalent explains how to run locally and via Docker. Environment variables documented. |
| 15 | CSV | CSV Export | Board data (cards + comments) can be exported as a CSV file, downloadable from the UI. |
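
As an illustration of criterion 15, the sketch below flattens cards and their comments into CSV text. It is not taken from any of the 22 implementations: the `ExportCard` and `CardComment` shapes, the column layout, and the function names are assumptions made for the example. Field quoting follows the usual RFC 4180 convention.

```typescript
// Hypothetical data shapes for the example; the benchmark spec does not fix them.
type CardComment = { author: string; text: string };
type ExportCard = {
  column: string;
  author: string;
  text: string;
  comments: CardComment[];
};

// Quote a field when it contains a comma, a double quote, or a line break,
// and double any embedded double quotes (RFC 4180 style).
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Emit a header row, then one row per card followed by one row per comment.
function boardToCsv(cards: ExportCard[]): string {
  const rows: string[][] = [["column", "type", "author", "text"]];
  for (const card of cards) {
    rows.push([card.column, "card", card.author, card.text]);
    for (const comment of card.comments) {
      rows.push([card.column, "comment", comment.author, comment.text]);
    }
  }
  return rows.map((row) => row.map(csvField).join(",")).join("\r\n");
}
```

In a browser UI, the returned string would typically be wrapped in a Blob and offered as a download; that wiring is omitted here.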

Visual Aesthetics: Screenshot Review

Each implementation rated 1–5 based on direct review of dashboard + board page screenshots: visual polish, colour coherence, layout quality, and professional feel. ★ = star earned, ☆ = star missing.

claude_opus_4.7_xhigh_with_antigravity_prompt
Opus 4.7 · xHigh · Antigravity Prompt
★★★★★ 5/5

Dark gradient hero with "Run team retrospectives that feel alive" tagline. Colour-coded columns (dot indicators), avatar chips with relative timestamps ("27 min ago"), Live badge. The most visually ambitious UI in the experiment.

claude-sonnet_4.6_xhigh_with_antigravity_prompt
Sonnet 4.6 · xHigh · Antigravity Prompt
★★★★★ 5/5

Dark gradient hero with "Run better retrospectives" marketing headline and feature chip pills (Real-time sync, Guest-friendly, CSV Export). Board view has per-column colour dots and clean card layout. Marketing-grade landing page.

antigravity_claude_sonnet_4.6
Sonnet 4.6 · Antigravity + GPT-OSS 120B
★★★★☆ 4/5

Clean dark mode with logo in nav, purple accent buttons, column config on dashboard. Card detail view is well-structured. Professional feel throughout.

antigravity_opus_4.6
Opus 4.6 · Antigravity + GPT-OSS 120B
★★★★☆ 4/5

Dark themed with teal/yellow/blue colour-coded column headers, avatar initial chips on comments, user chip in top-right nav. Distinctive and polished.

claude_sonnet_4.6_with_playwright_high
Sonnet 4.6 · High · Claude + Playwright
★★★★☆ 4/5

Bold indigo/blue full-width header bar with white text branding. Comment count badges on cards. Clean card surfaces. Strong visual identity even if not unique.

claude_sonnet_4.6_with_playwright_xhigh
Sonnet 4.6 · xHigh · Claude + Playwright
★★★★☆ 4/5

Strong indigo header with "Joined as: Alice" identity indicator. Dashed + Add Column affordance is intuitive. Clean card grid layout.

claude_sonnet_4.7_high
Sonnet 4.7 · High · Claude
★★★★☆ 4/5

Deep indigo/purple full-width header, clean white card surfaces on a neutral background. "Joined as: Achint" contextual cue. Good vertical rhythm and spacing.

claude_opus_4.7_high · claude_opus_4.7_with_playwright_high · claude_opus_4.7_with_playwright_xhigh · claude_opus_4.7_xhigh
Opus 4.7 (all non-Antigravity configs)
★★★☆☆ 3/5

All four default Opus 4.7 runs produced clean, readable but minimal UIs: light backgrounds, plain sans-serif typography, one blue accent button, no strong personality. Functional but interchangeable with a dozen other light-mode apps. Perfect functional scores but unambitious visual design.

claude_opus_4.6_high · claude-opus-4.6_with_playwright_high · claude-sonnet_4.6_xhigh · claude-opus-4.6_xhigh · antigravity_gemini_3.1_pro_low
Various Claude Opus/Sonnet 4.6 defaults
★★★☆☆ 3/5

Clean, usable, some with coloured header bars or accent buttons. Above average for generated code but nothing that would turn heads. The blue-header pattern (Opus 4.6 xHigh, Sonnet 4.7) adds a hint of personality. Gemini Pro Low surprised with decent card layout.

claude-opus-4.6_with_playwright_xhigh · antigravity_gemini_3.1_pro_high · antigravity_gemini3_flash · claude_qwen_3.6_high · claude_opus_4.6_xhigh_with_antigravity_prompt
Unstyled / Minimal / Layout Issues
★★☆☆☆ 2/5

Opus 4.6 Playwright xHigh: completely unstyled, with raw browser defaults and black links. Gemini Pro High: plain white HTML, no CSS whatsoever. Gemini Flash: dark but layout crowded to top-left corner. Qwen 3.6: default Bootstrap-ish styling, no identity. Antigravity Opus 4.6 xHigh: too small/dark to evaluate properly.

claude_qwen_coder_next_high_with_playwright
Qwen Coder Next · High · Claude
★☆☆☆☆ 1/5

Completely unstyled. "Invalid Date" visible in the board list (date rendering bug). Jarring green "Add Column" button against a white page. No CSS beyond browser defaults. Reflects the broken implementation: the visual state mirrors the functional state.

Key Aesthetic Finding: The Antigravity Prompt is the Biggest Design Differentiator

The two highest-rated visual designs (both 5/5) used the Antigravity system prompt, which appears to include strong UI/UX style guidance encouraging hero sections, marketing copy, colour-coded columns, and avatar indicators. Without it, even Opus 4.7 at xHigh effort consistently produces functional but visually plain interfaces (3/5). Neither effort mode (High vs. xHigh) nor model generation (4.6 → 4.7) had much effect on visual quality. The prompt itself is the biggest lever for UI aesthetics in agentic code generation.

Cost Comparison (where recorded)

Final session cost after all prompting and fixes. Lower is better. Score shown for context. * = locally-hosted model; no inference charges. Dollar figures Claude Code reported for Qwen runs reflect the Claude Opus 4.7 orchestration layer cost only (~$41 and ~$179 respectively), not the Qwen model itself.

| Implementation | Score | Cost |
|---|---|---|
| claude-sonnet_4.6_xhigh | 44/45 | $1.40 |
| claude_sonnet_4.7_high | 40/45 | $1.58 |
| claude-opus-4.6_with_playwright_xhigh | 40/45 | $2.08 |
| claude_sonnet_4.6_with_playwright_xhigh | 45/45 ⭐ | $2.57 |
| claude-sonnet_4.6_xhigh_with_antigravity_prompt | 43/45 | $2.90 |
| claude_opus_4.6_xhigh_with_antigravity_prompt | 43/45 | $3.12 |
| claude_opus_4.7_high | 45/45 ⭐ | $3.15 |
| claude_opus_4.7_xhigh | 45/45 ⭐ | $3.31 |
| claude_opus_4.7_with_playwright_high | 45/45 ⭐ | $4.01 |
| claude_sonnet_4.6_with_playwright_high | 41/45 | $4.90 |
| claude-opus-4.6_with_playwright_high | 41/45 | $5.36 |
| claude_opus_4.7_with_playwright_xhigh | 45/45 ⭐ | $5.57 |
| claude_opus_4.7_xhigh_with_antigravity_prompt | 45/45 ⭐ | $5.59 |
| claude-opus-4.6_xhigh | 42/45 | $5.61 |
| claude_opus_4.6_high | 38/45 | $6.62 |
| claude_qwen_3.6_high | 33/45 · Local model | $0* |
| claude_qwen_coder_next_high_with_playwright | 16/45 · Local model | $0* |
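
One way to read the cost figures above is cost per point, or the cheapest run that clears a given score threshold. The sketch below is only an illustration: `costPerPoint` and `cheapestAtLeast` are hypothetical helpers, not metrics used in the benchmark, and the sample holds four rows copied from the cost figures rather than the full data set.

```typescript
type Run = { name: string; score: number; costUsd: number };

// Four API-billed runs copied from the cost figures above (sample only).
const sampleRuns: Run[] = [
  { name: "claude-sonnet_4.6_xhigh", score: 44, costUsd: 1.4 },
  { name: "claude_sonnet_4.6_with_playwright_xhigh", score: 45, costUsd: 2.57 },
  { name: "claude_opus_4.7_high", score: 45, costUsd: 3.15 },
  { name: "claude_opus_4.6_high", score: 38, costUsd: 6.62 },
];

// Dollars spent per point scored; lower is better.
function costPerPoint(run: Run): number {
  return run.costUsd / run.score;
}

// Cheapest run that reaches at least `minScore`, or undefined if none does.
// filter() returns a new array, so sorting does not reorder the input.
function cheapestAtLeast(all: Run[], minScore: number): Run | undefined {
  return all
    .filter((run) => run.score >= minScore)
    .sort((a, b) => a.costUsd - b.costUsd)[0];
}
```

With this sample, `cheapestAtLeast(sampleRuns, 45)` selects the Sonnet 4.6 + Playwright xHigh run, and lowering the threshold to 44 selects the plain Sonnet 4.6 xHigh run.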

Key Findings

Finding 01
Model Leap

Opus 4.7 is the unambiguous winner: every config scores 45/45

All 5 Opus 4.7 configurations (High, xHigh, with/without Playwright, with Antigravity prompt) delivered a fully working application without any manual fixes. No other model achieved this consistency. This is a clear generational step over Opus 4.6.

Finding 02
Systemic Bug

Docker is the #1 failure point, broken in 11 of 22 runs

Two bugs recurred constantly: (1) better-sqlite3 needs Python and native build tools that are absent in Alpine Linux containers, causing gyp errors; (2) Express 5 (now the npm default) rejects the app.get('*') wildcard route with a PathError. Models that chose sql.js (pure WASM) and accounted for Express 5's routing changes consistently avoided both.
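
The wildcard problem can be illustrated with a small path-rewriting helper. This is a sketch, not the fix any of the agents applied: `toExpress5Path` is a hypothetical name, and it assumes the Express 5 routing syntax in which a wildcard must be named (`/*splat`) and the braced form `/{*splat}` also matches the root path. The parameter name `splat` is only a convention.

```typescript
// Hypothetical helper: rewrites Express 4 style catch-all paths into the
// named-wildcard form that Express 5 accepts.
function toExpress5Path(path: string, name: string = "splat"): string {
  // A bare "*" or "/*" catch-all becomes "/{*name}", which also matches "/".
  if (path === "*" || path === "/*") {
    return `/{*${name}}`;
  }
  // A trailing "/*" under a prefix becomes "/prefix/*name".
  if (path.endsWith("/*")) {
    return `${path.slice(0, -1)}*${name}`;
  }
  // Paths without a trailing wildcard are returned unchanged.
  return path;
}

// Intended use for a single-page-app fallback (assumes an Express 5 `app`
// and an `indexHtml` file path, neither of which is defined in this sketch):
//   app.get(toExpress5Path("*"), (_req, res) => res.sendFile(indexHtml));
```

In practice the route string would simply be written out by hand; the helper exists only to make the before and after forms explicit.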

Finding 03
Hardest Feature

Card drag-and-drop is the trickiest feature to get right first try

Moving cards between columns had the highest failure rate of any functional feature. Bugs ranged from blank dashboards after a move, to ghost duplicate cards, to silent no-ops. This feature requires coordinated WebSocket broadcasts, DB updates, and React state; any layer out of sync causes visible failures.
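
One way to keep the client-side layer from drifting is to route both the local drag and the incoming broadcast through a single pure, idempotent update function. The following is a minimal sketch under that assumption, not code from any evaluated implementation; the `Card` and `CardMoved` shapes and the `applyCardMoved` name are hypothetical.

```typescript
type Card = { id: string; columnId: string; position: number; text: string };
type CardMoved = { cardId: string; toColumnId: string; toPosition: number };

// Apply a "card moved" event without mutating the input array.
// Applying the same event twice gives the same result, so a client that
// moves a card locally and then receives the server broadcast for that
// move does not end up with a duplicate card.
function applyCardMoved(cards: Card[], event: CardMoved): Card[] {
  const moving = cards.find((c) => c.id === event.cardId);
  if (!moving) return cards; // unknown card: ignore instead of creating a ghost

  const others = cards.filter((c) => c.id !== event.cardId);
  const target = others
    .filter((c) => c.columnId === event.toColumnId)
    .sort((a, b) => a.position - b.position);

  // Clamp the requested index, then insert the card into the target column.
  const index = Math.max(0, Math.min(event.toPosition, target.length));
  target.splice(index, 0, { ...moving, columnId: event.toColumnId });

  // Renumber the target column only. Other columns keep their positions;
  // a gap left in the source column does not affect its sort order.
  const renumbered = target.map((c, i) => ({ ...c, position: i }));
  const untouched = others.filter((c) => c.columnId !== event.toColumnId);
  return [...untouched, ...renumbered];
}
```

The same function could serve as a React state updater and as the handler for the Socket.io move event, which keeps the two code paths from diverging.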

Finding 04
Agent Design

Antigravity's browser sub-agent elevates Sonnet but can't compensate for weaker models

The Antigravity agent (which spawns a GPT-OSS 120B browser sub-agent for visual QA) pushed Sonnet 4.6 to a perfect 45/45, a result matched among Sonnet runs only by Sonnet 4.6 with Playwright at xHigh effort. However, Antigravity Opus 4.6 still scored 41/45, suggesting visual verification helps catch bugs but can't substitute for model capability in generating correct code initially.

Finding 05
Tooling

Playwright UI testing correlates with, but doesn't guarantee, better scores

Playwright-enabled runs at xHigh effort were excellent for Sonnet 4.6 and Opus 4.7 (both 45/45), though Opus 4.6 with Playwright at xHigh scored only 40/45. At High effort, Playwright sometimes missed bugs in environments it couldn't fully replicate (e.g., a Docker-specific SQLite datatype mismatch for Sonnet 4.6 High, which remained unfixed).

Finding 06
Effort Mode

xHigh effort helps weaker models more than stronger ones

For Opus 4.7, both High and xHigh were perfect; the extra budget added no value. For Opus 4.6, xHigh (42) clearly outperformed High (38). For Sonnet 4.6, xHigh (44–45) bested High (40–41). The stronger the baseline model, the less incremental improvement higher effort provides.

Finding 07
Non-Claude Models

Non-Claude models underperform significantly, especially at cost

Gemini 3.1 Pro scored a reasonable 41/45 but still needed Docker fixes. Gemini Flash (38/45) failed to start at all initially. Qwen 3.6 scored 33/45 at a cost of $41. Qwen Coder Next scored 16/45 while spending $179, the worst result and highest cost in the entire experiment by a wide margin.

Finding 08
Best Value

Cost vs. quality is profoundly non-linear, and Qwen's "cost" is misread

Opus 4.7 High at $3.15 delivered a perfect 45/45. Sonnet 4.6 xHigh+Playwright at $2.57 also achieved a perfect 45/45. Both Qwen runs used locally-hosted models at zero inference cost. The $41 and $179 figures Claude Code reported were the cost of the Claude Opus 4.7 orchestration layer driving the agentic loop, not Qwen inference. This makes the Qwen results more striking: even with free inference, the Claude orchestration overhead alone exceeded the cost of the best API-based runs, while delivering far worse results.

Finding 09
Library Choice

sql.js (WASM) was the right SQLite choice; better-sqlite3 consistently broke Docker

The better-sqlite3 native module requires a Python build toolchain and a compatible glibc, which are often missing in Alpine (musl-based) or minimal Debian containers. Models that proactively chose sql.js, a pure JavaScript/WASM port of SQLite, shipped working Docker containers on the first attempt every time. This is a strong architectural signal for future benchmarks.

Finding 10
System Prompt

The Antigravity system prompt helps code quality but not scores for Opus 4.7

Adding the Antigravity system prompt to Claude runs produced more code (Opus 4.7: 3,676 lines vs. ~2,200 baseline), higher code quality, and better documentation, but didn't improve the already-perfect 45/45 scores. It did help Opus 4.6, where the xHigh+Antigravity run (43) edged ahead of the base xHigh run (42).

Recommendations

๐Ÿ† Best First-Shot Reliability

Claude Opus 4.7 at any effort level. Perfect 45/45 across all 5 tested configurations. Zero manual fixes required. Best choice when correctness on the first attempt is paramount.

💰 Best Value (Cost + Quality)

Claude Sonnet 4.6 + Playwright at xHigh: perfect 45/45 at just $2.57. Alternatively, Opus 4.7 High at $3.15 for a zero-setup perfect run without Playwright.

๐Ÿ” For Iterative Workflows

Claude Sonnet 4.6 at xHigh (44/45, $1.40 after fixes) offers the lowest cost to near-perfect results. One fix was needed for a native compile issue, and it was easily prompted away.

🤖 For Agent Pipelines

The Antigravity agent with Sonnet 4.6 delivers perfect results by combining coding + visual browser verification. An interesting architecture that any multi-agent pipeline could replicate.

โš ๏ธ Watch For: Express 5 + better-sqlite3

Express 5 is now the npm default and better-sqlite3 is a common first choice for SQLite; both broke Docker builds in agentic Node.js codegen here. Prefer sql.js over better-sqlite3, and ensure agents are aware of Express 5's wildcard routing changes.

โŒ Avoid: Qwen Coder Next for Full-Stack Tasks

Scored 16/45, the worst result in the experiment. While Qwen ran locally at no inference cost, the Claude Opus 4.7 orchestration overhead still reached $179, and core features (card moving, comments, real-time updates, CSV export) were permanently broken after 9+ hours of wall-clock time. The local model simply couldn't follow complex multi-file agentic instructions reliably.