Ask a language model who painted the Sistine Chapel ceiling and it'll tell you Michelangelo without hesitation. Ask it for the name of the third person to walk on the moon and it might say, with identical conviction, someone who never existed.

Both answers arrive the same way. The model computes a probability distribution over its entire vocabulary given everything before it, picks a next token from that distribution, and moves on. Repeat until done. At no point does anything inside the machine verify whether the output is true.

This is worth sitting with for a second. There's no lookup table. No internal encyclopedia being consulted. No module that compares a candidate answer against stored facts and rejects the ones that fail. The architecture is a sequence of matrix multiplications that transform input tokens into a probability distribution over what should come next. "Should" here means statistically likely, not factually correct. The training objective, predicting the next token across billions of documents, rewards fluency and plausibility. Truth is a side effect that shows up when plausible and true happen to overlap.
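The mechanics can be sketched in a few lines. Everything here is a toy: the vocabulary and logits are invented for illustration, not taken from any real model, but the shape of the computation — scores in, softmax, sample — is the whole loop.

```python
import math
import random

# Invented toy vocabulary and logits -- not from any real model.
vocab = ["Michelangelo", "Raphael", "Conrad", "Bean", "<eos>"]

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(logits, rng=random):
    # Sample proportionally to probability. "Should come next" means
    # statistically likely -- nothing here checks whether it is true.
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# A sharply peaked distribution: the top token gets almost all the mass.
confident_logits = [9.0, 2.0, 1.0, 1.0, 0.5]
print(softmax(confident_logits)[0])  # top probability > 0.99
```

Generation is just this step in a loop: append the sampled token to the context, recompute the distribution, sample again.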

When they don't, you get hallucination.

The word implies malfunction, but the mechanical reality is quieter than that. The model hits a prompt that pushes it into territory where its learned correlations stop tracking reality. Rare facts. Multi-step reasoning. Dates, numbers, proper nouns that barely appeared in training data. The highest-probability continuation is still a plausible-sounding string of tokens. It just happens to be wrong. And because the model has no way to flag its own uncertainty, it delivers the wrong answer with the same smooth confidence as the right one.
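One way to see the missing uncertainty flag: compare a sharply peaked distribution with a nearly flat one. Both decode to the same confident-sounding token; only the entropy — computed here by us as outside observers, never surfaced by the model itself — reveals that one answer was knowledge and the other a coin flip. The names and probabilities are invented for illustration.

```python
import math

def entropy(probs):
    # Shannon entropy in bits: higher means the distribution is closer to a guess.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def greedy(vocab, probs):
    # Greedy decoding: emit the single most probable token.
    return vocab[max(range(len(probs)), key=probs.__getitem__)]

vocab = ["Conrad", "Bean", "Aldrin", "Armstrong"]

sharp = [0.97, 0.01, 0.01, 0.01]   # the model "knows"
flat  = [0.28, 0.26, 0.24, 0.22]   # the model is guessing

# Identical surface output either way; the uncertainty never reaches the text.
print(greedy(vocab, sharp), entropy(sharp))  # "Conrad", low entropy
print(greedy(vocab, flat), entropy(flat))    # "Conrad", nearly 2 bits
```

The reader of the output sees "Conrad" both times, with the same fluent delivery. The near-maximal entropy of the second case is discarded the moment a token is chosen.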

That's what makes hallucination structural rather than incidental. The entire system is optimized to produce the most plausible continuation, and plausibility is not truth. You can't patch that out.

Not everyone agrees this is permanent. A 2025 paper from OpenAI argues the problem is incentive-based, not architectural: models hallucinate because benchmarks reward guessing over abstention, and changing how we score them could fix it. On the other side, Xu et al. published a formal impossibility proof showing that any computable language model used as a general problem solver will inevitably hallucinate, regardless of training data or design. It's a diagonalization argument from learning theory. The debate is genuinely unsettled.

Retrieval-augmented generation helps. Give the model verified documents to condition on and it hallucinates less. But the generation step still runs through the same probability distribution. The model doesn't process retrieved text differently from anything else. It has better context. That's all.
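The pipeline can be sketched as follows. This is a deliberately crude illustration: word-overlap similarity stands in for real embedding vectors, and a stub stands in for the model. The names `retrieve` and `generate` are hypothetical, not any library's API.

```python
def embed(text):
    # Stand-in "embedding": a bag of words. Real systems use dense vectors.
    return frozenset(text.lower().split())

def similarity(a, b):
    # Jaccard overlap between word sets -- crude, but enough to rank documents.
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(query, docs, k=2):
    # Return the k documents most similar to the query.
    q = embed(query)
    return sorted(docs, key=lambda d: similarity(q, embed(d)), reverse=True)[:k]

def generate(prompt):
    # Stub for the language model. The key point: retrieved text arrives
    # as ordinary tokens and flows through the same next-token sampling
    # as everything else. Nothing marks it as verified.
    return f"[model continues from {len(prompt)} chars of context]"

docs = [
    "Pete Conrad was the third person to walk on the Moon, during Apollo 12.",
    "The Sistine Chapel ceiling was painted by Michelangelo.",
]
query = "Who was the third person to walk on the moon?"
context = "\n".join(retrieve(query, docs))
answer = generate(context + "\n\n" + query)
```

The retrieved passage is simply prepended to the prompt. Nothing downstream treats it as privileged ground truth, which is why retrieval reduces hallucination without eliminating it.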

There's a parallel in eyewitness testimony, actually. Witnesses who give the most confident accounts in court are not reliably more accurate than hesitant ones, at least not by the time testimony reaches the stand. Confidence is a performance, not a verification. We've known this about humans for decades and still struggle with it.

With an LLM the disconnect is starker. When the probabilities are high, next-token prediction just sounds like someone who knows what they're talking about. A model trained on enough text will produce fluent, assured reasoning that looks indistinguishable from understanding, right up until you check the facts and find they were never part of the process.

Sources: