We Are Changing Our Developer Productivity Experiment Design

(metr.org)

51 points | by ej88 7 hours ago

10 comments

keeda 2 hours ago
> When surveyed, 30% to 50% of developers told us that they were choosing not to submit some tasks because they did not want to do them without AI. This implies we are systematically missing tasks which have high expected uplift from AI.
In fact, one of the developers in the original study later revealed on Twitter that he had already done exactly that during the study, i.e. filtered out tasks he preferred not to do without AI: https://xcancel.com/ruben_bloom/status/1943536052037390531
While this was only one developer (that we know of), given the N was 16 and he seems to have been one of the more AI-experienced devs, this could have had a non-trivial effect on the results.
The original study gets a lot of air-time from AI naysayers, let's see how much this follow-up gets ;-)
- sjaiisba 2 hours ago
  > 3. Regarding me specifically, I work on the LessWrong codebase which is technically open-source. I feel like calling myself an "open-source developer" has the wrong connotations, and makes it more sound like I contribute to a highly-used Python library or something as an upper-tier developer which I'm not
  That’s very interesting! This kinda matches what I see at work:
  - low performers love it. it really does make them output more (which includes bugs, etc. it’s causing some contention that’s yet to be resolved)
  - some high performers love it. these were guys who are more into greenfield stuff and ok with 90% good. very smart, but just not interested in anything outside of going fast
  - everyone else seems to be finding use out of it, but reviews are painful
atleastoptimal 3 hours ago
It's kind of funny that METR is known primarily for both the most bearish study on AI progress (the original 20% slowdown one), and the most bullish one on AI progress (the long-task horizon study showing exponential increase in duration of tasks AI models can accomplish with respect to date of release).
In either case, it seems people ended up bolstering their preexisting views on AI based on whichever study most affirmed them (for the former, that AI coding models didn't actually help and created a mirage of productivity that required more work to fix than it was worth; for the latter, that AI models were improving at an exponential rate and will inevitably eclipse SWEs in all tasks in a deterministic amount of time).
I think the truth is somewhere in the middle. Just anecdotally we've seen multi-million dollar fortunes being minted by small teams developing using 90% AI-assisted coding. Anthropic claims they solely use agents to code and don't modify any code manually.
- sjaiisba 2 hours ago
  > Anthropic claims they solely use agents to code and don't modify any code manually.
  Have you used CC? It shows. They did not make their fortune off this, and it’s at least lost me a customer because of how sloppy it is. The model is good, and it’s why they have to gate access to it. I’d much rather use a different harness.
  I do think you’re on to something though. As societal wealth further concentrates among the few, we’re going to get more and more slop for the rest of us because we have no money (relatively speaking). Agentic coding is here to stay because we as a society are force-fed more and more slop. It’s already rampant, this is just automating it.
  - Wowfunhappy 33 minutes ago
    ...uh, I think Claude Code is great, actually. A lot of that is indeed just the strength of the underlying model, but the local client is great too. Plan mode, checkpoints, subagents... I've been using Claude Code for a year now, and I feel like Anthropic has steadily been eliminating pain points.
    It's certainly a lot better than the Gemini cli!
    - Philpax 12 minutes ago
      Functionality-wise, it's great, but it's a buggy mess, and it seems to be getting worse with each release.
ej88 6 hours ago
Really interesting updates to their 2025 experiment.
Repeat devs from the original experiment went from a 0-40% slowdown to now a -10% to 40% speedup - and METR estimates this as a 'lower-bound'
more devs saying they don't even want to do 50% of their work without AI, even for $50/hr
30-50% of devs decided not to submit certain tasks without AI, missing the tasks with the highest uplift
it also seems like there is a skill gap - repeat devs from the first study are more productive with ai tools than newly recruited ones with variable experience
overall it seems like the high preference for devs to use AI is actually hurting METR's ability to judge their speedup, due to a refusal to do tasks without it. imo this is indirectly quite supportive for ai coding's productivity claims.
- roxolotl 3 hours ago
  The finding of the first study was people cannot judge their performance with these tools. So I don’t think the lack of individuals willing to work without them is indicative of productivity improvements. I think it’s indicative of them being enjoyable to use.
  - logicprog 18 minutes ago
    It was claimed to find that, but I don't think it did. It compared developers' beliefs about average speed up across tasks, measured by asking them once at the end, compared to the average comparative speed measured per task and then averaged. That's measuring two different things, and all kinds of things could mess up developers' fuzzy recollection of the gestalt of several tasks (such as recency bias and question/study framing) that wouldn't affect it if you asked them right after; moreover, when tasks were broken down by task type, the speed up/slow down results actually matched developers' qualitative comments.
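The selection effect discussed in this sub-thread (developers withholding the tasks they least want to do without AI) can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the speedup ratios are made up, `refusal_threshold` is a hypothetical cutoff, and a plain average of ratios stands in for whatever estimator METR actually uses.

```python
from statistics import mean

# Hypothetical per-task speedup ratios (time without AI / time with AI).
# These numbers are invented for illustration; they are not METR data.
task_speedups = [0.9, 1.0, 1.1, 1.2, 1.5, 2.0, 3.0, 5.0]


def submitted_tasks(speedups, refusal_threshold):
    """Return the tasks a developer would still submit to the study.

    Assumption: a developer withholds any task whose expected speedup is
    at or above refusal_threshold, because it might be randomly assigned
    to the no-AI condition.
    """
    return [s for s in speedups if s < refusal_threshold]


true_mean = mean(task_speedups)
observed = submitted_tasks(task_speedups, refusal_threshold=2.0)
observed_mean = mean(observed)

print(f"mean speedup over all tasks:       {true_mean:.2f}")
print(f"mean speedup over submitted tasks: {observed_mean:.2f}")
```

Under these assumptions the study only ever sees the low-uplift tasks, so the measured average understates the average over all tasks. Real refusals need not follow a clean cutoff like this, which is part of why the direction of the bias is hard to pin down.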
arctic-true 4 hours ago
Those developer quotes are tough to read. Rate limits are going to hit like a truck when the labs eventually need to make a profit.
- simonw 3 hours ago
  At this point the AI labs would pretty much have to form an illegal price fixing cartel in order to jack the prices up, they've been competing to drive down prices for so long.
  They'd have to get the Chinese AI labs to go along with that price fixing too.
  - arctic-true 3 hours ago
    They’d have an entire country of geniuses prepared to defend against the antitrust allegations, who’s to stop them? /s
daxfohl 3 hours ago
"I don't want to do this without AI" sounds like we're already well into the brain atrophy stage of this. Now what? (I'd think about it myself but....)
- marcosdumay 3 hours ago
  "I avoid issues like AI can finish things in just 2 hours, but I have to spend 20 hours. I will feel so painful if the task is decided as AI-disallowed."
  Which really doesn't sound like the results they got, where developers may get up to twice as productive in the best scenario.
  There's surely something scary there. And the lack of people ambivalent about AI isn't a certain indication that it's as well accepted as they think; it can just as easily be caused by polarization.
- falcor84 2 hours ago
  I'm pretty sure that this was exactly the response to the first generation of devs who insisted on coding with a terminal instead of submitting punch cards like "real programmers".
- bitwize 2 hours ago
  AI will soon be an intrinsic part of the job. Now what? "Get your thumb out of your ass and learn [how to use AI]." —Eric S. Raymond
camgunz 4 hours ago
Unless this measures the entire SDLC longitudinally (like say, over a year) I'm not interested. I too can tell Claude Code to do things all day every day, but unless we have data on the defect rate it doesn't matter at all.
- falcor84 2 hours ago
  Do any of those companies collect and share data on their defect rates to give you a baseline to compare against?
softwaredoug 6 hours ago
I'm a bit perplexed by the developer selection effects.
I get that developers want to use AI. But are they also claiming there's not still a no/low-AI population of developers? Or that their means of selection don't find these developers?
Are they worried that by splitting devs into groups of AI experience they might be measuring some confounder that causes people to choose AI / not AI in their careers?
- sgillen 4 hours ago
  The study was designed to have devs who are comfortable with AI perform 50% of tasks with AI and 50% without. So the problem is the population of "Developers who use AI regularly but are willing to do tasks without AI" is shrinking.
  >> Are they worried that by splitting devs into groups of AI experience they might be measuring some confounder that causes people to choose AI / not AI in their careers?
  The developer sample size was small (16 people in the original study) and the task sample size is larger (~250 tasks). I think the worry is variance in developer productivity would totally wash out any signal.
- selridge 4 hours ago
  Here is my read:
  Developers are refusing to complete the survey or selecting themselves out because they (apparently) don’t want to complete the non-AI task.
  They also saw selection effects from a large reduction in the pay for the study (which is an unfortunate confounder here), $150/hr -> $50/hr.
  They guess this makes their estimates lower bounds, but the selection effect is complicated (which they acknowledge).
  Overall this is a hard problem for them in the current state. It will be challenging to produce convincing year over year analysis under these conditions.
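sgillen's point about developer variance in this sub-thread can be sketched with a small deterministic example. Everything here is an assumption for illustration: the developer names, the baseline hours, and a uniform `TRUE_SPEEDUP` are hypothetical, and the "between" split is one deliberately unlucky assignment rather than a random one.

```python
from statistics import mean

# Hypothetical baseline hours per task for four developers (made-up numbers).
baseline_hours = {"dev_a": 2.0, "dev_b": 4.0, "dev_c": 8.0, "dev_d": 16.0}

# Assumed for illustration: AI divides task time by 1.25 for every developer.
TRUE_SPEEDUP = 1.25


def hours(dev, with_ai):
    """Hours the given developer needs for one task in the given condition."""
    base = baseline_hours[dev]
    return base / TRUE_SPEEDUP if with_ai else base


# Between-developer design: split developers into an AI group and a no-AI
# group. Here the slower developers happen to land in the AI group.
ai_group = ["dev_c", "dev_d"]
no_ai_group = ["dev_a", "dev_b"]
between_estimate = (
    mean(hours(d, with_ai=False) for d in no_ai_group)
    / mean(hours(d, with_ai=True) for d in ai_group)
)

# Within-developer design: every developer works in both conditions, so
# each one's baseline cancels out of their own ratio.
within_estimate = mean(
    hours(d, with_ai=False) / hours(d, with_ai=True) for d in baseline_hours
)

print(f"between-developer estimate: {between_estimate:.4f}")
print(f"within-developer estimate:  {within_estimate:.4f}")
```

With only a handful of developers, the gap in baseline speed between the two groups dominates the between-developer estimate, while the within-developer estimate recovers the assumed speedup. That is the appeal of having each developer do some tasks with AI and some without, and it only works if developers are willing to do both.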
sgillen 4 hours ago
This is very interesting because I see a lot of AI detractors point to the original study as proof that AI is overhyped and nothing to worry about. In this new study the findings are essentially reversed (20% slowdown to 20% speedup).
- selridge 3 hours ago
  I think their old findings were hard to treat as gospel just due to the kind of comparison + the sample, but this new result is probably much noisier.
  It’s hard to make reliable, directional assumptions about the kind of self-selection and refusal they saw, even without worrying about the reward dropping 66%.
- fxwin 2 hours ago
  fwiw i think the interesting part about the original study wasn't so much the slowdown part, but the discrepancy between perceived and measured speedup/slowdown (which is the part i used to bring up frequently when talking to other devs)
- ej88 4 hours ago
  not enough people look at the slope, just the coords
- simonw 3 hours ago
  AI detractors loved that previous study so much. It seems to have been brought up in the majority of conversations about AI productivity over the past six months.
  (Notable to me was how few other studies they cited, which I think is because studies showing AI productivity loss are quite uncommon.)
  - smohare 2 hours ago
    Or maybe there’s just not that many good studies, period?
    A lot of them barely rise above the level of collected anecdote, nor explore long term or more elusive factors (such as cross-system entropy). They’re also targeting an area that is fairly difficult to measure and control for.
tonymet 1 hour ago
> "AI tools lead to worse productivity"
> The subjects are using ChatGPT 2.5 and copy-pasting code.
The reason AI hype seems to be so bipolar is that "AI" isn't one thing. Hundreds of models, dozens of tools. And to get something done well, a seasoned engineer needs to master half a dozen at a time.
Bnjoroge 3 hours ago
never been a better time to be a swe who doesn't use, or significantly limits the use of, AI agents
- Krei-se 1 hour ago
  great to see that wisdom and sanity are still found on yc