The Confidence Problem

An antique map of Europe — confident lines drawn over territory that was only partially understood
Photo by Jakob Braun on Unsplash

I use AI tools constantly: writing, research, code, analysis. They’re part of how I work now and have been for a while. The more I’ve used them, the more I’ve noticed something about how they communicate that I didn’t pay enough attention to early on: they sound the same whether they’re right or wrong.

Ask a model to summarize a document it has in front of it, and you get a fluent, measured answer. Ask it to identify who said something it can’t actually verify, and you get the same fluent, measured answer. There’s no visible difference: no hedging, no “I’m not certain but,” and no “you might want to check this.” Either way, you get a complete-sounding response that reads as ready to act on.

This isn’t an accident. A language model’s job is to predict what comes next: pattern-completion at a scale that makes it look like reasoning. When you ask a question, the model doesn’t query a database or work from evidence to a conclusion. It generates what an answer to this kind of question typically sounds like, drawing on the statistical weight of everything it trained on. When the surrounding context is rich and the patterns are dense, the output is fluent and confident. The model doesn’t have an epistemic state; there’s no internal “I’m not sure about this.” It has strong patterns and weaker patterns, and both produce the same tone.

What makes that concrete is a pattern that keeps showing up in research on model calibration. When models generate incorrect information, they tend to use more confident language, not less. Researchers examining the internal mechanics of LLMs found that when a model produces a wrong answer, specific circuits reliably inflate the expressed confidence, not randomly, but as a structural feature of how the models are trained. Markers like “definitely,” “certainly,” and “without doubt” appear more often in wrong answers: the model sounds most certain exactly when it’s doing the most pure pattern-filling with nothing real underneath it.

This gets mentioned in AI safety discussions but I don’t think the implication has fully landed outside of them. With a human source, confidence is at least a weak signal. People hedge when they’re uncertain. They say “I think” or “I’m pretty sure” or “you’d want to check this.” It’s imperfect, but it tracks something real: the person’s actual sense of what they know. With AI, that signal runs backwards: the more certain it sounds, the more skeptical you probably should be, because high-confidence outputs are often the ones furthest from anything the model can actually verify.

The data on what this looks like in practice is pretty compelling. A Columbia Journalism Review study tested ChatGPT’s ability to identify quotes from popular journalism sources. The model misattributed 76 percent of them. Of the 153 cases where it was wrong, it flagged any uncertainty in only 7. Stanford researchers studying legal queries found hallucination rates of 58 to 88 percent across major models, with every instance expressed in the same fluent, professional tone a correct answer would carry. Both studies involved normal queries, not adversarial testing, and in both cases a wrong answer arrived sounding exactly like a right one.

This is a different argument from the one I made last week about vibe coding and craft. That piece was about what AI does to the person producing work: the developer who stops engaging with what they’re building. This is about what it does to the person receiving work: the reader, the analyst, the decision-maker. The failure is quieter because the content looks fine and reads confidently. There’s nothing in the output itself that tells you whether it’s grounded in evidence or assembled from pattern-weight.

That’s the harder problem. Most of the conversation about AI accuracy has focused on hallucination rates, and those have genuinely come down. Frontier models today show error rates somewhere between 3 and 19 percent, depending on the task type and what you’re measuring. That’s real improvement, but lower error rates don’t solve the calibration problem, which is that the confidence in an output doesn’t correlate with its accuracy. The tools we naturally use to evaluate human communication (tone, certainty level, the presence or absence of hedging) actively mislead us when we try to apply them to AI. We’re pattern-matchers too, and we’re pattern-matching on signals the model didn’t generate in good faith.
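For readers who want "calibration" pinned down, here is a minimal sketch of one common way to put a number on it: expected calibration error, which groups answers by stated confidence and compares each group's average confidence with how often it was actually right. The function name, the bin count, and the toy numbers are my own assumptions for illustration. None of it is data from the studies above, and it assumes you have a numeric confidence for each answer, which most chat interfaces don't give you.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 5
) -> float:
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: values in [0, 1], one per answer (hypothetical inputs).
    correct: whether each answer turned out to be right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one answer")

    # Group answers into equal-width confidence bins.
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append(i)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        # Each bin contributes its gap, weighted by its share of all answers.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Toy, made-up data: a source that is right about as often as it claims.
    calibrated_conf = [0.9] * 10 + [0.5] * 10
    calibrated_ok = [True] * 9 + [False] + [True] * 5 + [False] * 5

    # Toy, made-up data: a source that sounds very sure but is mostly wrong.
    overconfident_conf = [0.95] * 10
    overconfident_ok = [True] * 3 + [False] * 7

    print(expected_calibration_error(calibrated_conf, calibrated_ok))
    print(expected_calibration_error(overconfident_conf, overconfident_ok))
```

A score near zero means stated confidence tracks accuracy, and a large score means it doesn't. The error rate is a separate number: a model could become more accurate overall while this gap stays wide, which is the distinction the paragraph above is drawing.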

I’ve come to think that the verification habit has to come from somewhere other than how an output sounds. The useful question isn’t “does this sound right?” It’s “what would I need to see to verify this independently?” Those are different questions, and the second one requires actually staying engaged with the content rather than just receiving it. It means being willing to track a claim back to a source, check whether the source says what the model says it says, and do that even when the output is fluent enough to make it feel unnecessary.

That isn’t a huge burden most of the time. Most AI outputs are operating on facts that are either checkable quickly or in domains where errors are obvious. But the minority matters, and the outputs that matter most tend to be exactly the ones where the patterns are plausible and the underlying claim is specific, contextual, and not something the model can actually know. The model can’t distinguish between those situations. You can, but only if you’re still actually checking.

What I find harder to resolve is where the habit actually comes from and whether the environment around these tools makes it easier or harder to maintain. Every AI response feels complete. Every output is finished, tidy, ready to use. The friction that might ordinarily make you stop and check something (a gap, a hedge, a visible uncertainty) is exactly the kind of friction that good AI output has been designed to smooth over. Asking “is this actually true?” in the presence of confident, complete-sounding information requires swimming against a current.

Stulberg’s point from last week keeps coming back to me: the cost of skipping the hard part doesn’t appear until you need what you didn’t build. With craft, what you skip is judgment. With information, what you skip is calibration. Which means the calibration work falls to you. Not constantly, not for everything, but in the moments that matter, which tend to be exactly the moments when the output is most complete and confident and ready to act on.
