The Judgment Problem

A close-up of a pocket watch on a table — Photo by Ruben Caldera on Unsplash

I’ve been reading Anthropic’s piece on recursive self-improvement, When AI Builds Itself. The argument is that AI is getting good enough at writing AI that the systems may soon design and train their own successors with little human direction. Anthropic puts a rough timeline on it of years, not decades, lays out three scenarios for how it could play out, and argues for some kind of coordinated slowdown before the last of those scenarios arrives.

My first reaction was the one I have to most timeline claims in this business. We’ve heard it before: self-driving cars were a couple of years out for about a decade, and AGI has been imminent since before most of us were paying attention. Reasonable people still disagree hard about whether an intelligence explosion is even possible, with serious researchers lined up on both sides. So I read the prediction, noted it, and mostly set it aside.

I don’t think the timeline is what we should all really be worried about, though. What’s more interesting is a different admission buried in the piece, one that connects to what I’ve been writing about lately.

Even in an article arguing that AI is about to automate AI research, Anthropic is honest about what the models still can’t do. They write that “large performance gaps persist when it comes to Claude exercising judgment in choosing goals in both engineering and research.” The systems are getting very good at executing a task once someone has decided it’s the right task, but they’re still not good at deciding which task is worth doing.

Anthropic is careful about its own headline number, too. They mention that their engineers shipped roughly eight times more code per quarter than they did from 2021 to 2025, then immediately note that this “is almost certainly an overstatement of the true productivity gain,” because lines of code measure volume, not value.

That gap between volume and value is the whole story, and it isn’t only Anthropic’s internal accounting. METR ran a controlled study of experienced open-source developers using AI tools in early 2025. The developers were 19% slower with the tools than without them. Stranger still, those same developers believed the AI had made them about 20% faster. The felt experience and the actual output have come apart, and the thing papering over the gap is how productive the work looks.

This is the third time I’ve ended up in the same place from a different door.

A couple of weeks ago, I wrote about what happens to the person doing the work. Lean on the tools too hard, and you get faster at producing things while getting worse at understanding them. The artifacts ship, but the judgment never gets built.

After that, I wrote about what happens on the receiving end. AI sounds equally confident whether it’s right or wrong, so the reader can’t use tone as a signal anymore. You have to verify final products on something other than how ‘finished’ or ‘confident’ the answer looks.

The Anthropic piece pulls me into the same spot again, this time at the level of the whole organization. As execution gets cheaper, what stays scarce is the judgment about what to execute on. Each time the tools absorb another layer of the work, the part left standing is the same one: deciding what’s worth doing and recognizing whether the result is any good.

So set the recursive-self-improvement debate aside. You don’t need a position on the intelligence explosion to take something useful from this.

If execution keeps getting cheaper, as it has so far and likely will, both the bottleneck and the value move to the same place. They land on judgment: deciding which problems are worth solving, and knowing whether the result is any good. That holds whether the aggressive timeline is correct or whether this turns out to be another decade-long “couple of years away.” It doesn’t depend on the prediction at all. The cheaper execution gets, the more the scarce and valuable work is the deciding, not the doing.

Most organizations are not built for that. They’re built to maximize execution. Throughput, velocity, tickets closed, features shipped. That’s what the org chart optimizes for, what the dashboards measure, and what people get promoted on. Almost none of it measures judgment, and a fair amount of what the modern organization is built for today quietly trains judgment out of people and processes.

If the execution layer keeps thinning, you find out fairly quickly whether you have people who can tell good work from bad. This connects back to the craft problem: when a team leans entirely on tools to produce, it creates a gap, with fewer people who understand the systems well enough to judge them. You can run on that gap for a while. It shows up later, when something breaks and nobody can say why, or when there are ten plausible directions and no one has the standing to pick one.

None of this requires believing AI will build its own successors next year. It only requires noticing where things are already heading. The companies that come out of this in reasonable shape will probably be the ones that treated judgment as something worth building on purpose.
