Intuitively, frontend work with coding agents feels different from backend work: more flipping between chat and browser, more eyeballing, more manual tweaking and fine-tuning.
We can quantify that now with data. SWE-chat (paper, dataset) is a public dataset just released by SALT-NLP that captures real development work with coding agents on public GitHub repos — actual sessions, not benchmark tasks. We pulled the 2,333 Claude Code sessions that touched only frontend or only backend files and compared the two groups. The finding: frontend work with agents needs more human intervention.
## Session outcomes are similar
SWE-chat scores each session on a 0–100 session_success estimate — an LLM-derived measure of how well the work ended, given the conversation transcript and resulting commits. It's an overall score of "did this go well?"
On that session-level measure, frontend and backend Claude Code sessions land in the same place. The medians are identical at 82.0, and the full distributions almost completely overlap.
So at the session level, the outcomes are essentially the same between frontend and backend. But dig deeper and the journeys to those outcomes look very different.
## Journeys are different: more "that doesn't work" moments
session_success summarizes each session into a single outcome score. When we drop down a level to the individual prompts inside the session, the picture changes. SWE-chat also tags every user prompt with a prompt_pushback label, including failure_report: prompts where the user explicitly tells the agent something didn't work.
There's a significant difference in the frequency of "that doesn't work" moments.
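As a concrete illustration, here is a minimal sketch of how a per-prompt failure-report rate could be computed. The prompt_pushback field name comes from the dataset description above; the session data and other label values are invented for illustration.

```python
from collections import Counter

def failure_report_rate(prompts):
    """Share of user prompts labeled as failure reports.

    Each prompt is assumed to be a dict with a 'prompt_pushback'
    field, as described for the SWE-chat dataset.
    """
    labels = Counter(p["prompt_pushback"] for p in prompts)
    return labels["failure_report"] / len(prompts)

# Hypothetical session: 1 of 4 prompts reports a failure.
session = [
    {"prompt_pushback": "none"},
    {"prompt_pushback": "failure_report"},
    {"prompt_pushback": "none"},
    {"prompt_pushback": "scope_change"},
]
print(failure_report_rate(session))  # 0.25
```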
There are two obvious interpretations:
- Frontend just fails more. UI work mixes code generation with browser behaviour, rendering, interaction state, and design judgement. There are more places for it to look or behave incorrectly.
- Frontend bugs are easier to spot. A misaligned section is obvious in the browser tab. A subtle bug in an API endpoint can sit unnoticed until something pages oncall.
Either way, working with an agent on frontend means more visible errors that invite human intervention. The data shows it in two distinct ways: the user reprompts the agent more, and the user writes or edits more of the final code themselves.
## Intervention 1: more reprompting
The prompt_pushback labels let us compute a reprompt-rate metric that measures corrections, rejections, and user takeovers — any prompt where the user is steering the agent rather than asking for a fresh thing.
There is a clear reprompt gap between frontend and backend. The gap survives even when you control for user persona (the dataset labels users as Expert Nitpickers, Vague Requesters, Mind Changers, etc.), so the difference is not due to a persona mix shift.
In other words: the agent is harder to steer on frontend, regardless of who is doing the steering.
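One plausible way to operationalize a per-session reprompt rate is the share of a session's prompts that steer rather than request new work. The label set below is an assumption: only failure_report is named in the dataset description above, and the other label names are hypothetical stand-ins for corrections, rejections, and takeovers.

```python
# Hypothetical set of "steering" labels; only failure_report
# is confirmed by the dataset description.
STEERING = {"correction", "rejection", "user_takeover", "failure_report"}

def session_reprompt_rate(prompt_labels):
    """Share of a session's prompts that steer the agent
    rather than ask for something new."""
    steering = sum(1 for lbl in prompt_labels if lbl in STEERING)
    return steering / len(prompt_labels)

# Toy session: 2 of 4 prompts are steering prompts.
print(session_reprompt_rate(["none", "correction", "none", "failure_report"]))  # 0.5
```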
## Intervention 2: more direct editing
The other way humans intervene is by writing and editing the code themselves. SWE-chat captures this with agent_percentage: for each session, an estimate of the share of committed lines that came from the agent rather than the human. A higher number means more of the shipped code was authored by the agent; a lower number means the human wrote or rewrote more of it.
On that metric, frontend's median is clearly lower than backend's.
Whether humans are cleaning up agent errors or proactively writing parts of the code themselves, it's clear they are compelled to step back into their editor and directly author code much more frequently on frontend than backend.
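The median comparison itself is simple to reproduce. A minimal sketch, assuming each session record exposes an agent_percentage field as described above; the numbers below are toy values, not the dataset's:

```python
from statistics import median

def median_agent_share(sessions):
    """Median agent-authored share of committed lines across sessions.

    Each session is assumed to be a dict with an 'agent_percentage'
    field, per the dataset description.
    """
    return median(s["agent_percentage"] for s in sessions)

# Invented toy values for illustration only.
frontend = [{"agent_percentage": p} for p in (40, 57, 70)]
backend = [{"agent_percentage": p} for p in (75, 83, 90)]
print(median_agent_share(frontend), median_agent_share(backend))  # 57 83
```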
## Chat is not enough
Pull the findings together and a single picture emerges.
| Metric | Frontend | Backend | What it means |
|---|---|---|---|
| Median session success | 82 | 82 | Same final outcome quality |
| Failure-report rate per prompt | 10.3% | 7.7% | More "this is broken" moments |
| Reprompt rate per session | 44% | 36% | More steering and correction |
| Median agent-authored share | 57% | 83% | More human-authored committed code |
Frontend work with coding agents gets to the same final quality, but only because humans absorb the cost through intervention. There is a gap between what coding agents can generate and what humans still have to inspect, correct, refine, and steer before frontend work actually ships.
This is the thesis behind Handle: a frontend-specific control surface for coding agents. The agent produces a first draft quickly, while you use precision tools to refine — all against the live product.