
Is Claude Code better at backend than frontend? No. But also yes.

In real Claude Code usage, frontend work gets to good outcomes, but only because humans stay much closer to the work — with more reprompting, more correction, and more human steering.


Intuitively, frontend work with coding agents feels different than backend work. There's more flipping between chat and browser, more eyeballing, and more manual tweaking and fine-tuning.

We can quantify that now with data. SWE-chat (paper, dataset) is a public dataset just released by SALT-NLP that captures real development work with coding agents on public GitHub repos — actual sessions, not benchmark tasks. We pulled the 2,333 Claude Code sessions that touched only frontend or only backend files and compared the two groups. The finding: frontend work with agents needs more human intervention.
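For orientation, here is a minimal sketch of how that filtering could look in pandas. The file name swe_chat_sessions.jsonl, the column names (agent, session_id, files_touched), and the extension lists are illustrative assumptions, not the dataset's actual schema or the paper's classification rules.

```python
import pandas as pd

# Assumed layout: one row per session, with the columns named below.
# The real SWE-chat release may use different field names or file formats.
sessions = pd.read_json("swe_chat_sessions.jsonl", lines=True)

# Illustrative extension lists; the paper's actual frontend/backend rules may differ.
FRONTEND_EXT = {".tsx", ".jsx", ".css", ".scss", ".html", ".vue", ".svelte"}
BACKEND_EXT = {".py", ".go", ".rb", ".java", ".rs", ".php", ".sql"}

def classify(files):
    """Label a session only if every touched file sits on one side of the stack."""
    exts = {f[f.rfind("."):].lower() for f in files if "." in f}
    if exts and exts <= FRONTEND_EXT:
        return "frontend"
    if exts and exts <= BACKEND_EXT:
        return "backend"
    return "mixed"  # excluded from the comparison

claude = sessions[sessions["agent"] == "claude_code"].copy()
claude["stack"] = claude["files_touched"].apply(classify)
subset = claude[claude["stack"].isin(["frontend", "backend"])]
print(subset["stack"].value_counts())  # 770 frontend, 1,563 backend in our pull
```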

Session outcomes are similar

SWE-chat scores each session on a 0–100 session_success estimate — an LLM-derived measure of how well the work ended, given the conversation transcript and resulting commits. It's an overall score of "did this go well?"

On that session-level measure, frontend and backend Claude Code sessions land in the same place. The medians are identical at 82.0, and the full distributions almost completely overlap.

Frontend and backend session_success distributions overlap almost perfectly, both centred on a median of 82. n = 770 frontend, 1,563 backend sessions; p = 0.005.
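Once sessions are labelled, the distribution comparison is a few lines. This sketch continues from the snippet above and uses a two-sided Mann-Whitney U test as one reasonable choice; it is not necessarily the test the SWE-chat authors ran.

```python
from scipy.stats import mannwhitneyu

fe = subset.loc[subset["stack"] == "frontend", "session_success"]
be = subset.loc[subset["stack"] == "backend", "session_success"]

print(fe.median(), be.median())  # both land at 82.0

# Rank-based test: compares the two distributions without a normality assumption.
stat, p = mannwhitneyu(fe, be, alternative="two-sided")
print(f"n_frontend={len(fe)}, n_backend={len(be)}, p={p:.3g}")
```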

The session-level outcomes are essentially the same between frontend and backend. But when we dig deeper, the journeys to those outcomes are different.

Journeys are different: more "that doesn't work" moments

session_success summarizes each session into a single outcome score. When we drop down a level to the individual prompts inside the session, the picture changes. SWE-chat also tags every user prompt with a prompt_pushback label, including failure_report: prompts where the user explicitly tells the agent something didn't work.

There's a significant difference in the frequency of "that doesn't work" moments.

Frontend prompts report failure 34% more often than backend prompts. n = 6,288 frontend, 19,616 backend prompts; p ≈ 5e-11.
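At the prompt level the comparison becomes a difference in proportions. A sketch assuming a per-prompt table (swe_chat_prompts.jsonl is a made-up file name) where each row carries the session's stack label and its prompt_pushback tag; the two-proportion z-test is our choice here, not necessarily the paper's.

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Assumed layout: one row per user prompt, carrying the parent session's stack label.
prompts = pd.read_json("swe_chat_prompts.jsonl", lines=True)
prompts = prompts[prompts["stack"].isin(["frontend", "backend"])]

failure = prompts["prompt_pushback"] == "failure_report"
counts = failure.groupby(prompts["stack"]).agg(["sum", "size"])
print(counts["sum"] / counts["size"])  # ~10.3% frontend vs ~7.7% backend

# Two-sample test on the failure-report proportions.
stat, p = proportions_ztest(count=counts["sum"].to_numpy(), nobs=counts["size"].to_numpy())
print(f"p={p:.1g}")
```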

There are two obvious interpretations:

  • Frontend just fails more. UI work mixes code generation with browser behaviour, rendering, interaction state, and design judgement. There are more places for it to look or behave incorrectly.
  • Frontend bugs are easier to spot. A misaligned section is obvious in the browser tab. A subtle bug in an API endpoint can sit unnoticed until something pages oncall.

Either way, working with an agent on frontend means more visible errors that invite human intervention. The data shows it in two distinct ways: the user reprompts the agent more, and the user writes or edits more of the final code themselves.

Intervention 1: more reprompting

Examining prompt_pushback also lets us compute a reprompt-rate metric that counts corrections, rejections, and user takeovers: any prompt where the user is steering the agent rather than asking for something new.

44% of frontend sessions contain at least one correction, rejection, or takeover prompt — versus 36% for backend. n = 770 frontend, 1,563 backend sessions; p = 0.001.
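The session-level reprompt rate is the share of sessions containing at least one steering prompt. Continuing from the prompts table above, and assuming each prompt row also carries a session_id; the pushback label strings below are illustrative guesses rather than confirmed field values.

```python
STEERING = {"correction", "rejection", "takeover"}  # illustrative label names

# One row per (stack, session): did any prompt in the session steer the agent?
steer = (
    prompts.assign(steering=prompts["prompt_pushback"].isin(STEERING))
           .groupby(["stack", "session_id"], as_index=False)["steering"]
           .any()
)
print(steer.groupby("stack")["steering"].mean())  # ~44% frontend vs ~36% backend
```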

There is a clear reprompt gap between frontend and backend. The gap survives even when you control for user persona (the dataset labels users as Expert Nitpickers, Vague Requesters, Mind Changers, etc.), so the difference is not due to a persona mix shift.
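The persona control amounts to recomputing the same rate within each persona bucket. A hypothetical sketch: it assumes a per-session persona column, which the real schema may expose differently since the labels describe users.

```python
# Join the per-session steering flag back onto session metadata (assumed columns).
with_persona = steer.merge(subset[["session_id", "persona"]], on="session_id")

# If the gap survives the control, frontend > backend should hold within each persona row.
print(with_persona.groupby(["persona", "stack"])["steering"].mean().unstack("stack"))
```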

In other words: the agent is harder to steer on frontend, regardless of who is doing the steering.

Intervention 2: more direct editing

The other way humans intervene is by writing and editing the code themselves. SWE-chat captures this with agent_percentage: for each session, an estimate of the share of committed lines that came from the agent rather than the human. A higher number means more of the shipped code was authored by the agent; a lower number means the human wrote or rewrote more of it.

On that metric, frontend's median is clearly lower than backend's.

In frontend sessions, agents author a median of 57% of committed lines. In backend, that figure jumps to 83%, a 26-point gap. n = 770 frontend, 1,563 backend sessions; p = 0.03.
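agent_percentage is already a per-session number, so the comparison is a pair of medians over the same labelled sessions, again with Mann-Whitney as one plausible significance test rather than the paper's confirmed method.

```python
from scipy.stats import mannwhitneyu

# Median share of committed lines authored by the agent, per stack.
print(subset.groupby("stack")["agent_percentage"].median())  # ~57% frontend vs ~83% backend

fe = subset.loc[subset["stack"] == "frontend", "agent_percentage"]
be = subset.loc[subset["stack"] == "backend", "agent_percentage"]
print(mannwhitneyu(fe, be, alternative="two-sided"))  # the post reports p = 0.03 for this gap
```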

Whether humans are cleaning up agent errors or proactively writing parts of the code themselves, it's clear they are compelled to step back into their editor and directly author code much more frequently on frontend than backend.

Chat is not enough

Pull the findings together and a single picture emerges.

Metric                            Frontend   Backend   What it means
Median session success            82         82        Same final outcome quality
Failure-report rate per prompt    10.3%      7.7%      More "this is broken" moments
Reprompt rate per session         44%        36%       More steering and correction
Median agent-authored share       57%        83%       More human-authored committed code

Frontend work with coding agents gets to good final quality, but only because the cost is absorbed by human intervention. There is a gap between what coding agents can generate and what humans still have to inspect, correct, refine, and steer to make frontend work actually ship.

This is the thesis behind Handle: a frontend-specific control surface for coding agents. The agent produces a first draft quickly, while you use precision tools to refine — all against the live product.
