The 43-Point Gap

I’ve been following the METR research on AI-assisted development for a while now, and one number keeps coming back to me.

Before the trial, experienced developers expected AI tools to make them 24% faster. After completing the work, and being measurably slowed down in the process, they still believed AI had sped them up by 20%. A controlled trial found they were actually 19% slower.
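For readers who want the arithmetic spelled out, the 43 points come from comparing the expected speedup against the measured slowdown. A quick sketch using the METR figures quoted above (variable names are mine):

```python
# METR trial figures, as percentage change in speed (positive = faster)
expected = 24    # developers' forecast before the trial
perceived = 20   # what they believed after the work was done
actual = -19     # measured result: 19% slower

# Gaps between belief and measured reality, in percentage points
expectation_gap = expected - actual   # 24 - (-19) = 43
perception_gap = perceived - actual   # 20 - (-19) = 39

print(expectation_gap, perception_gap)  # 43 39
```

Note that even the after-the-fact belief is 39 points adrift; the 43-point figure is the gap from the pre-trial forecast.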

That’s a 43-point gap between what those developers expected and what was actually happening. And by February 2026, many of those same developers refused to participate in AI-free work studies. They’d built a dependency before they’d built an understanding of what the dependency was actually doing.

I think that gap is worth paying attention to because it reveals something about what AI supervision actually requires, and what most AI governance frameworks are missing.

The METR trial wasn’t a casual survey. Experienced developers, real tasks. The performance degradation happened despite high comfort with the tools and genuine belief in their own speed.

Anthropic ran a separate randomized controlled trial. Junior engineers using AI scored 17% lower on mastery quizzes. The biggest deficits showed up in debugging, which is precisely the skill you need to evaluate whether AI-generated output is correct.

Part of me wonders if we’ve been thinking about this backwards. The concern has mostly been about AI replacing developers, but what seems to be happening instead is that developers are using AI to skip the parts of the work that build judgment, and not noticing, because the output still ships.

GitClear analyzed repositories with high AI adoption and found:

  • Code volume up 10%
  • Refactoring down 60%
  • Copy-paste code up 48%

That’s what it looks like when AI generates output and humans approve it without really reviewing it. The codebase grows, the hard work of integrating and questioning gets skipped, and the people working in it are, by some measures, less equipped to notice than they were a year ago.

Faros AI studied 10,000 developers across teams with high and low AI adoption. High-adoption teams completed 21% more tasks, while pull request review time increased 91%. That review-time increase is worth dwelling on: I read it as the system trying to compensate. More code, more errors, more eyes needed. And the eyes themselves are less trained than they used to be.

METR’s merge analysis found that 50% of AI-generated code that passes automated tests would be rejected by human reviewers. Average fix time when someone catches it: 42 minutes. For code that makes it through review, the fix time gets measured in incidents.

In December 2025, an AI agent with operator-level AWS permissions was tasked with making targeted changes to a cloud environment. It deleted and recreated the entire environment instead, taking down Amazon’s Cost Explorer for thirteen hours. In March 2026, engineers at Amazon retail followed AI-generated advice that turned out to be built on an outdated internal wiki. Six hours later, 22,000 users were affected.

Neither of these happened because someone was being reckless. Both happened inside organizations with AI governance policies, security reviews, and experienced engineering teams. The policies existed, but the capability to execute on them (questioning the output, recognizing when something plausible was actually wrong) wasn’t there when it needed to be.

What both incidents share is a gap between accountability on paper and judgment in practice.

The Layer Most AI Governance Frameworks Miss

Most AI governance frameworks focus on access controls, usage policies, audit trails, and approved vendor lists. All of that is necessary. But it addresses a different problem than the one these incidents describe.

The gap they miss: do the people responsible for reviewing AI output have the skills to catch what AI gets wrong?

Anthropic’s own engineers now spend more than 70% of their time reviewing and revising code rather than writing it. That ratio is probably about right. The work has shifted from generation to evaluation. But evaluation requires understanding why the code does what it does, not just whether it compiles. Catching an AI’s confident mistake requires knowing what a non-mistake looks like.

I think the same dynamic is showing up across any domain where AI generates substantive output: legal analysis, financial modeling, clinical decision support, strategic memos. The people approving that output need enough depth to recognize when it’s wrong in ways that matter. That depth is maintained by use and it degrades when the work that built it stops.

A governance policy that assigns review responsibility to someone who has stopped doing the work that built their reviewing ability isn’t really governance. It’s documentation of accountability without the substance behind it.

The Question That Keeps Coming Back

I’m still turning this over. I don’t have a clean answer for how organizations thread this needle, keeping the speed advantages of AI tools while not quietly degrading the judgment needed to supervise them.

The question I keep coming back to is simple (but the answer is not): are the people signing off on AI output getting better at catching what AI gets wrong?

That question lives in training design, task rotation, how you structure career ladders, and whether you’re deliberately preserving work that humans need to keep doing, not to be inefficient, but to keep the judgment sharp.

The 43-point gap between what developers expected and what was actually happening is one data point. The refusal to work without AI tools is another. The debugging skill deficits in junior engineers are a third. Taken together, they point to something that access controls and audit trails weren’t built to address.
