Prompt
One initial prompt and a fixed number of repair turns.
Arena is optimized for watchable, reproducible visual tasks. It is intentionally narrower than a universal benchmark.
Each agent and model pair receives the same task package, time limit, prompt, and repair budget. Final results are judged through visible browser output, automated checks, and a human-readable rubric.
V1 uses a 20-minute target per run.
Desktop and mobile rendering, console errors, and key interactions.
Screenshots, clips, logs, and result notes are attached to each run.
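As a rough illustration, the shared run constraints could be captured in one immutable record that is handed unchanged to every agent and model pair. This is only a sketch under assumptions: the names RunConfig, plan_runs, task_id, and repair_turns are hypothetical and do not come from Arena itself. Only the 20-minute target is taken from the text, and the number of repair turns is left as a required parameter because the text does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Constraints shared by every run of one task (hypothetical structure)."""
    task_id: str
    prompt: str                   # the single initial prompt
    repair_turns: int             # fixed per task; the count is not given in the text
    time_limit_minutes: int = 20  # V1 target per run, from the text


def plan_runs(config, pairs):
    """Attach the same frozen config to each (agent, model) pair."""
    return [(agent, model, config) for agent, model in pairs]


if __name__ == "__main__":
    cfg = RunConfig(task_id="demo-task", prompt="Build a small Canvas game.", repair_turns=2)
    for agent, model, run_cfg in plan_runs(cfg, [("agent-a", "model-x"), ("agent-b", "model-y")]):
        print(agent, model, run_cfg.time_limit_minutes, run_cfg.repair_turns)
```

Freezing the record is one way to make the "same package for everyone" rule hard to break by accident: any attempt to change the prompt or budget for a single pair raises an error instead of silently skewing the comparison.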
Starts, renders, and remains interactable.
Completes the actual requested behavior.
Looks coherent and communicates state.
Feels usable across expected inputs.
Avoids obvious brittle shortcuts.
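The five criteria above could be recorded per run roughly as follows. This is a sketch, not Arena's actual scoring code: the short criterion keys, the 0 to 2 scale, and the function name validate_scores are all assumptions made for illustration. Only the criterion descriptions are taken from the text. The sketch deliberately reports scores per criterion and computes no overall total.

```python
# Criterion keys are hypothetical labels; descriptions are quoted from the rubric.
RUBRIC = {
    "runs": "Starts, renders, and remains interactable.",
    "task_completion": "Completes the actual requested behavior.",
    "visual_clarity": "Looks coherent and communicates state.",
    "usability": "Feels usable across expected inputs.",
    "robustness": "Avoids obvious brittle shortcuts.",
}


def validate_scores(scores, max_score=2):
    """Check that every criterion has a score in range; return them in rubric order.

    The 0..max_score scale is an assumption, not something the text specifies.
    """
    missing = set(RUBRIC) - set(scores)
    unknown = set(scores) - set(RUBRIC)
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    for name, value in scores.items():
        if not 0 <= value <= max_score:
            raise ValueError(f"{name} score {value} outside 0..{max_score}")
    return {name: scores[name] for name in RUBRIC}


if __name__ == "__main__":
    example = {"runs": 2, "task_completion": 1, "visual_clarity": 2,
               "usability": 1, "robustness": 0}
    for name, value in validate_scores(example).items():
        print(f"{name}: {value}  ({RUBRIC[name]})")
```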
Arena results are strongest for visual coding tasks such as games, Canvas tools, and Three.js scenes. They should be read as public evidence of how an agent and model pair behaved on a specific case, not as a global model ranking.