The Task
Each agent received the same OpenSpec: build a self-hosted, real-time retrospective board with a React/Vite frontend, a Node.js/Express backend, Socket.io for live sync, and SQLite persistence, launchable via npm run dev and Dockerized in a single container.
Core features required: board creation & listing, configurable columns, guest display-name auth, drag-and-drop card management, nested comments, real-time broadcast across all connected clients, CSV export, and documentation.
Scoring
Score Cards: All 22 Implementations
Feature Heatmap: All Implementations × All Criteria
Legend: 3 = pass first try; 2 = fixed after prompting; 1 = partially fixed; 0 = could not fix.
| Implementation | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | Σ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Criteria | Dev | Docker | Home | Board | Auth | Card+ | Move | Cmnt | RT+ | RTmv | RTcm | Pers | D.Per | Docs | CSV | |
| antigravity_claude_sonnet_4.6 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| antigravity_gemini3_flash | 1 | 3 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 38 |
| antigravity_gemini_3.1_pro_high | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 41 |
| antigravity_gemini_3.1_pro_low | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 41 |
| antigravity_opus_4.6 | 3 | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 41 |
| claude-opus-4.6_with_playwright_high | 3 | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 41 |
| claude-opus-4.6_with_playwright_xhigh | 3 | 2 | 3 | 3 | 3 | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 40 |
| claude-opus-4.6_xhigh | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 42 |
| claude-sonnet_4.6_xhigh | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 44 |
| claude-sonnet_4.6_xhigh_with_antigravity_prompt | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 43 |
| claude_opus_4.6_high | 3 | 2 | 3 | 3 | 3 | 3 | 1 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 38 |
| claude_opus_4.6_xhigh_with_antigravity_prompt | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 43 |
| claude_opus_4.7_high | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_opus_4.7_with_playwright_high | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_opus_4.7_with_playwright_xhigh | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_opus_4.7_xhigh | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_opus_4.7_xhigh_with_antigravity_prompt | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_qwen_3.6_high | 3 | 2 | 3 | 2 | 2 | 3 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 39 |
| claude_qwen_coder_next_high_with_playwright | 2 | 2 | 3 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 0 | 21 |
| claude_sonnet_4.6_with_playwright_high | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 41 |
| claude_sonnet_4.6_with_playwright_xhigh | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 45 |
| claude_sonnet_4.7_high | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 40 |
Criteria Legend
| # | Short | Criterion | What was tested |
|---|---|---|---|
| 1 | Dev | Local Dev Startup | App starts cleanly with npm run dev (no manual intervention). Frontend and backend both reachable. |
| 2 | Docker | Docker Build & Run | docker build succeeds and the container runs without errors. Includes native module pitfalls (better-sqlite3 gyp) and Express 5 wildcard issues. |
| 3 | Home | Home / Dashboard Page | Landing page loads, lists existing boards, and provides a way to create a new board. |
| 4 | Board | Board Creation | Creating a new board generates a unique ID, configurable column names are respected, and the board persists across refreshes. |
| 5 | Auth | Guest Authentication | Users can join a board with a display name (guest auth). Name is shown on cards/comments they create. |
| 6 | Card+ | Card Creation | Cards can be added to any column. Card content is saved and visible to all participants. |
| 7 | Move | Card Movement | Cards can be dragged (or otherwise moved) between columns. The new position persists after reload. |
| 8 | Cmnt | Nested Comments | Users can add threaded/nested comments on a card. Comment author and timestamp are shown. |
| 9 | RT+ | Real-time Card Add | A card added by one client appears on all other connected clients in real time (Socket.io broadcast). |
| 10 | RTmv | Real-time Card Move | Moving a card on one client is reflected live on all other clients without a page refresh. |
| 11 | RTcm | Real-time Comments | Comments posted by one client appear immediately on all other connected clients. |
| 12 | Pers | Data Persistence | Boards, cards, and comments survive a server restart (SQLite or equivalent is used correctly). |
| 13 | D.Per | Docker Persistence | Data persists across container restarts when using a Docker volume. Volume is mounted to /data, not /app. |
| 14 | Docs | Documentation | A README or equivalent explains how to run locally and via Docker. Environment variables documented. |
| 15 | CSV | CSV Export | Board data (cards + comments) can be exported as a CSV file, downloadable from the UI. |
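Criterion 15 leaves the CSV layout open. As a minimal sketch, assuming one row per card or comment and RFC 4180-style quoting, an export could be built as follows. The names CardRow, CommentRow, csvField, and boardToCsv are illustrative and not taken from any of the tested implementations.

```typescript
// Hypothetical data shapes; the real schemas varied between implementations.
interface CommentRow {
  author: string;
  text: string;
}

interface CardRow {
  column: string;
  author: string;
  text: string;
  comments: CommentRow[];
}

// Quote a field only when it contains a comma, quote, or line break,
// doubling any embedded quotes.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? '"' + value.replace(/"/g, '""') + '"' : value;
}

// Flatten cards and their comments into one CSV document.
function boardToCsv(cards: CardRow[]): string {
  const rows: string[][] = [["column", "type", "author", "text"]];
  for (const card of cards) {
    rows.push([card.column, "card", card.author, card.text]);
    for (const comment of card.comments) {
      rows.push([card.column, "comment", comment.author, comment.text]);
    }
  }
  return rows.map((row) => row.map(csvField).join(",")).join("\r\n");
}
```

Keeping comments as their own rows, tagged by a type column, avoids cramming a variable number of comments into one cell.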
Visual Aesthetics โ Screenshot Review
Each implementation was rated 1-5 based on direct review of dashboard and board page screenshots: visual polish, colour coherence, layout quality, and professional feel. ★ = star earned, ☆ = star missing.
Dark gradient hero with "Run team retrospectives that feel alive" tagline. Colour-coded columns (dot indicators), avatar chips with relative timestamps ("27 min ago"), Live badge. The most visually ambitious UI in the experiment.
Dark gradient hero with "Run better retrospectives" marketing headline and feature chip pills (Real-time sync, Guest-friendly, CSV Export). Board view has per-column colour dots and clean card layout. Marketing-grade landing page.
Clean dark mode with logo in nav, purple accent buttons, column config on dashboard. Card detail view is well-structured. Professional feel throughout.
Dark themed with teal/yellow/blue colour-coded column headers, avatar initial chips on comments, user chip in top-right nav. Distinctive and polished.
Bold indigo/blue full-width header bar with white text branding. Comment count badges on cards. Clean card surfaces. Strong visual identity even if not unique.
Strong indigo header with "Joined as: Alice" identity indicator. Dashed + Add Column affordance is intuitive. Clean card grid layout.
Deep indigo/purple full-width header, clean white card surfaces on a neutral background. "Joined as: Achint" contextual cue. Good vertical rhythm and spacing.
All four default Opus 4.7 runs produced clean, readable but minimal UIs: light backgrounds, plain sans-serif typography, one blue accent button, no strong personality. Functional but interchangeable with a dozen other light-mode apps. Perfect functional scores but unambitious visual design.
Clean, usable, some with coloured header bars or accent buttons. Above average for generated code but nothing that would turn heads. The blue-header pattern (Opus 4.6 xHigh, Sonnet 4.7) adds a hint of personality. Gemini Pro Low surprised with decent card layout.
Opus 4.6 Playwright xHigh: completely unstyled, with raw browser defaults and black links. Gemini Pro High: plain white HTML, no CSS whatsoever. Gemini Flash: dark but layout crowded to top-left corner. Qwen 3.6: default Bootstrap-ish styling, no identity. Antigravity Opus 4.6 xHigh: too small/dark to evaluate properly.
Completely unstyled. "Invalid Date" visible in the board list (date rendering bug). Jarring green "Add Column" button against a white page. No CSS beyond browser defaults. Reflects the broken implementation: the visual state mirrors the functional state.
Key Aesthetic Finding: The Antigravity Prompt is the Biggest Design Differentiator
The two highest-rated visual designs (both 5/5) used the Antigravity system prompt, which appears to include strong UI/UX style guidance encouraging hero sections, marketing copy, colour-coded columns, and avatar indicators. Without it, even Opus 4.7 at xHigh effort consistently produces functional but visually plain interfaces (3/5). Effort mode (High vs. xHigh) had almost no effect on visual quality. Model generation (4.6 → 4.7) had almost no effect on visual quality either. The prompt itself is the biggest lever for UI aesthetics in agentic code generation.
Cost Comparison (where recorded)
Final session cost after all prompting and fixes. Lower is better. Score shown for context. * = locally-hosted model; no inference charges. Dollar figures Claude Code reported for Qwen runs reflect the Claude Opus 4.7 orchestration layer cost only (~$41 and ~$179 respectively), not the Qwen model itself.
Key Findings
Opus 4.7 is the unambiguous winner: every config scores 45/45
All 5 Opus 4.7 configurations (High, xHigh, with/without Playwright, with Antigravity prompt) delivered a fully working application without any manual fixes. No other model achieved this consistency. This is a clear generational step over Opus 4.6.
Docker is the #1 failure point: broken in 11 of 22 runs
Two bugs recurred constantly: (1) better-sqlite3 is a native module that requires Python and build tools, which are absent from Alpine Linux containers, so the install failed with gyp errors; (2) Express 5 (now the npm default) rejects the unnamed app.get('*') wildcard route with a PathError. Models that chose sql.js (pure WASM) and were aware of the Express 5 routing changes consistently avoided both.
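One way to sidestep the wildcard pattern entirely is a path-less fallback middleware that decides per request whether to serve the React shell. The sketch below shows only that decision as a pure function. The helper name shouldServeSpaShell, the reserved prefixes, and the indexHtmlPath variable in the comments are assumptions for illustration, not code from any benchmarked run.

```typescript
// Hypothetical helper: decides whether a request should receive the SPA's index.html.
function shouldServeSpaShell(
  method: string,
  path: string,
  reservedPrefixes: string[] = ["/api", "/socket.io"],
): boolean {
  if (method !== "GET" && method !== "HEAD") return false;

  // API and Socket.io traffic must never be answered with HTML.
  const isReserved = reservedPrefixes.some(
    (prefix) => path === prefix || path.indexOf(prefix + "/") === 0,
  );
  if (isReserved) return false;

  // A dot in the last segment suggests a static asset (app.js, logo.png); let it 404.
  const lastSegment = path.slice(path.lastIndexOf("/") + 1);
  return lastSegment.indexOf(".") === -1;
}

// Assumed wiring in Express 5 (shown as comments, not executed here):
//   app.use((req, res, next) =>
//     shouldServeSpaShell(req.method, req.path) ? res.sendFile(indexHtmlPath) : next());
// A named wildcard such as app.get("/*splat", handler) is the route-based alternative.
```

Because the middleware is registered without a path, no route pattern is parsed at all, so the Express 4 versus Express 5 wildcard syntax difference never comes into play.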
Card drag-and-drop is the trickiest feature to get right first try
Moving cards between columns had the highest failure rate of any functional feature. Bugs ranged from blank dashboards after a move, to ghost duplicate cards, to silent no-ops. This feature requires coordinated WebSocket broadcasts, DB updates, and React state; any layer out of sync causes visible failures.
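To illustrate why ghost duplicates appear, here is a minimal sketch of a client-side state update for a broadcast move event, written so that replaying the same event leaves the state unchanged. The Card and CardMoved shapes and the function name applyCardMoved are hypothetical. The sketch assumes the server broadcasts the absolute target column and position rather than a relative delta.

```typescript
// Hypothetical shapes; the tested implementations each defined their own.
interface Card {
  id: string;
  columnId: string;
  position: number;
  text: string;
}

interface CardMoved {
  cardId: string;
  toColumnId: string;
  toPosition: number;
}

function applyCardMoved(cards: readonly Card[], ev: CardMoved): Card[] {
  const moving = cards.filter((c) => c.id === ev.cardId)[0];
  if (!moving) return cards.slice(); // unknown card: ignore the event

  // Remove the card from wherever it is first, so a replayed event cannot duplicate it.
  const remaining = cards.filter((c) => c.id !== ev.cardId);

  // Columns in first-seen order, making sure the target column is included.
  const columnIds: string[] = [];
  for (const c of remaining.concat([{ ...moving, columnId: ev.toColumnId }])) {
    if (columnIds.indexOf(c.columnId) === -1) columnIds.push(c.columnId);
  }

  const result: Card[] = [];
  for (const columnId of columnIds) {
    const column = remaining
      .filter((c) => c.columnId === columnId)
      .sort((a, b) => a.position - b.position);
    if (columnId === ev.toColumnId) {
      const index = Math.max(0, Math.min(ev.toPosition, column.length));
      column.splice(index, 0, { ...moving, columnId });
    }
    // Renumber so positions stay dense (0, 1, 2, ...) in every column.
    column.forEach((card, i) => result.push({ ...card, position: i }));
  }
  return result;
}
```

In a React client this function could serve as the body of a state updater for both the local drag handler and the Socket.io listener, so the client that initiated the move and the clients that receive the broadcast converge on the same state.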
Antigravity's browser sub-agent elevates Sonnet but can't compensate for weaker models
The Antigravity agent (which spawns a GPT-OSS 120B browser sub-agent for visual QA) pushed Sonnet 4.6 to a perfect 45/45, a result only matched by Sonnet 4.6 with Playwright at xHigh effort. However, Antigravity Opus 4.6 still scored 41/45, suggesting visual verification helps catch bugs but can't substitute for model capability in generating correct code initially.
Playwright UI testing correlates with, but doesn't guarantee, better scores
Playwright-enabled runs at xHigh effort were consistently excellent (Sonnet 4.6: 45/45, Opus 4.7: 45/45). But at High effort, some bugs lived in environments Playwright couldn't fully replicate (e.g., a Docker-specific SQLite datatype mismatch for Sonnet 4.6 High, which remained unfixed).
xHigh effort helps weaker models more than stronger ones
For Opus 4.7, both High and xHigh were perfect; the extra budget added no value. For Opus 4.6, xHigh (42) clearly outperformed High (38). For Sonnet 4.6, xHigh (44-45) bested High (40-41). The stronger the baseline model, the less incremental improvement higher effort provides.
Non-Claude models underperform significantly, especially on cost
Gemini 3.1 Pro scored a reasonable 41/45 but still needed Docker fixes. Gemini Flash (38/45) failed to start at all initially. Qwen 3.6 scored 33/45 at a cost of $41. Qwen Coder Next scored 16/45 while spending $179, the worst result and highest cost in the entire experiment by a wide margin.
Cost vs. quality is profoundly non-linear, and Qwen's "cost" is easily misread
Opus 4.7 High at $3.15 delivered a perfect 45/45. Sonnet 4.6 xHigh+Playwright at $2.57 also achieved a perfect 45/45. Both Qwen runs used locally hosted models at zero inference cost. The $41 and $179 figures Claude Code reported were the cost of the Claude Opus 4.7 orchestration layer driving the agentic loop, not Qwen inference. This makes the Qwen results more striking: even with free inference, the Claude orchestration overhead alone exceeded the cost of the best API-based runs, while delivering far worse results.
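The non-linearity is easier to see as points per dollar. The sketch below uses only the scores and costs quoted in these findings. The Run type and the pointsPerDollar metric are illustrative choices, not part of the original scoring.

```typescript
interface Run {
  name: string;
  score: number;   // out of 45, as quoted in the findings text
  costUsd: number; // final session cost as reported by Claude Code
}

function pointsPerDollar(run: Run): number {
  return run.score / run.costUsd;
}

const quotedRuns: Run[] = [
  { name: "Opus 4.7 High", score: 45, costUsd: 3.15 },
  { name: "Sonnet 4.6 xHigh + Playwright", score: 45, costUsd: 2.57 },
  { name: "Qwen 3.6 (orchestration cost only)", score: 33, costUsd: 41 },
  { name: "Qwen Coder Next (orchestration cost only)", score: 16, costUsd: 179 },
];

// Rank the runs, best value first.
const ranked = quotedRuns
  .map((run) => ({ name: run.name, value: pointsPerDollar(run) }))
  .sort((a, b) => b.value - a.value);
```

With these figures the two API-based runs land at roughly 17.5 and 14.3 points per dollar, against about 0.8 and 0.09 for the two Qwen runs.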
sql.js (WASM) was the right SQLite choice; better-sqlite3 consistently broke Docker
The better-sqlite3 native module requires a Python toolchain and a GLIBC version that are often absent in Alpine or minimal Debian containers. Models that proactively chose sql.js, a pure JavaScript/WASM port of SQLite, shipped working Docker containers on the first attempt every time. This is a strong architectural signal for future benchmarks.
The Antigravity system prompt helps code quality but not scores for Opus 4.7
Adding the Antigravity system prompt to Claude runs produced more code (Opus 4.7: 3,676 lines vs. ~2,200 baseline), higher code quality, and better documentation, but didn't improve the already-perfect 45/45 scores. It did help Opus 4.6, where the xHigh+Antigravity run (43) edged ahead of the base xHigh run (42).
Recommendations
Best First-Shot Reliability
Claude Opus 4.7 at any effort level. Perfect 45/45 across all 5 tested configurations. Zero manual fixes required. Best choice when correctness on the first attempt is paramount.
Best Value (Cost + Quality)
Claude Sonnet 4.6 + Playwright at xHigh: a perfect 45/45 at just $2.57. Alternatively, Opus 4.7 High at $3.15 for a zero-setup perfect run without Playwright.
For Iterative Workflows
Claude Sonnet 4.6 at xHigh (44/45, $1.40 after fixes) offers the lowest cost to near-perfect results. One fix was needed for a native compile issue, which was easily prompted away.
For Agent Pipelines
The Antigravity agent with Sonnet 4.6 delivers perfect results by combining coding + visual browser verification. An interesting architecture that any multi-agent pipeline could replicate.
Watch For: Express 5 + better-sqlite3
Agents reach for both by default in Node.js codegen, and both break Docker builds. Prefer sql.js over better-sqlite3, and ensure agents are aware of Express 5's wildcard routing changes.
Avoid: Qwen Coder Next for Full-Stack Tasks
Scored 16/45, the worst result in the experiment. While Qwen ran locally at no inference cost, the Claude Opus 4.7 orchestration overhead still reached $179, and core features (card moving, comments, real-time updates, CSV export) were permanently broken after 9+ hours of wall-clock time. The local model simply couldn't follow complex multi-file agentic instructions reliably.