I have spent considerable time thinking about a question that recurs in nearly every serious discussion of AI safety: can a large language model police itself? The answer, I believe, is no — and the reasons why illuminate something important about the nature of intelligence, accountability, and the limits of self-knowledge.

The appeal of self-policing AI is obvious. If we could build systems that monitor their own outputs, detect their own errors, and correct their own behaviour, we would have solved one of the most difficult problems in AI safety. We could deploy increasingly capable systems without proportionally increasing human oversight. The machines would watch themselves. The mathematics of scale would work in our favour rather than against us.

However, this vision collapses under scrutiny. The fundamental problem is epistemic: an LLM has no privileged access to truth. It does not possess an internal oracle that distinguishes correct outputs from incorrect ones. What it has instead is a vast pattern-matching apparatus trained on human-generated text — a system that infers probable responses based on statistical regularities in its training data. When the model evaluates its own output, it does so using the same apparatus that generated the output in the first place. The blind spots that produced the error are the same blind spots that evaluate the result.

This limitation runs deeper than it might initially appear. Consider what happens when a model attempts self-critique. The critique emerges from the same learned distributions, the same embedded assumptions, the same correlated errors. If the model's training data contained systematic biases — and all training data contains systematic biases — those biases will appear in both the original output and the evaluation of that output. The model cannot see what it was never shown. It cannot correct for patterns it does not recognise as patterns. A self-evaluation loop that uses the same flawed instrument to assess itself does not reduce error. It amplifies it.
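To make that loop concrete, here is a minimal sketch of single-model self-critique. The model is stubbed as a plain Python callable rather than any real system, and the prompts are illustrative; the structural point is simply that the same function produces both the draft and the verdict on the draft.

```python
from typing import Callable

# Single-model self-critique: one callable plays both generator and judge.
# Any systematic bias it carries appears in the draft and in the evaluation
# of that draft alike, which is the correlated-blind-spot problem.


def self_critique(model: Callable[[str], str], prompt: str) -> tuple[str, str]:
    """Ask the model for an answer, then ask the same model to grade it."""
    draft = model(f"Answer the question: {prompt}")
    verdict = model(f"Is the following answer correct? Reply yes or no.\n\n{draft}")
    return draft, verdict


if __name__ == "__main__":
    # Toy stand-in that always approves its own style of answer; it illustrates
    # the loop's structure, not the behaviour of any real model.
    toy_model = lambda text: "yes" if text.startswith("Is") else "A confidently worded answer."
    print(self_critique(toy_model, "When was the treaty signed?"))
```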

I find it useful to distinguish between what models can actually do and what we might wish they could do. Models can compare outputs against predefined rules. They can check whether a response violates explicit constraints, matches specified formats, or contains forbidden content. This is procedural compliance — following instructions, not making judgments. The model does not decide what counts as harmful; it executes rules that humans wrote. The safety comes from the human-authored constraints, not from any capacity for moral or epistemic evaluation within the system itself.
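What this looks like in practice is unglamorous. The sketch below is a hypothetical output checker, not drawn from any particular deployment: the forbidden patterns, the length limit, and the rule names are all placeholders that a human operator would author and maintain.

```python
import re

# Human-authored constraints; the checker only executes them. Every rule here
# is an illustrative placeholder, not a recommendation.
FORBIDDEN_PATTERNS = [
    re.compile(r"\bhow to build a weapon\b", re.IGNORECASE),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # e.g. a US-SSN-shaped string
]
MAX_CHARS = 2000  # arbitrary length limit chosen by an operator


def check_output(text: str) -> list[str]:
    """Return the list of rules the output violates; empty means it passes."""
    violations = []
    if len(text) > MAX_CHARS:
        violations.append(f"exceeds {MAX_CHARS} characters")
    for pattern in FORBIDDEN_PATTERNS:
        if pattern.search(text):
            violations.append(f"matches forbidden pattern {pattern.pattern!r}")
    return violations
```

The judgment about what belongs in those lists lives entirely outside the model; the code merely applies it.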

Models can also engage in model-on-model critique, where a separate system evaluates the output of the primary model. This architecture reduces certain error modes and catches some failures that would otherwise slip through. However, it does not escape the fundamental limitation. Both models derive from similar training distributions. Both share overlapping blind spots. The critic model may catch errors that differ from its own systematic biases, but it will miss errors that align with them. We have added a filter, not achieved genuine oversight.

The most robust form of model self-regulation I have encountered is uncertainty estimation — systems that express confidence levels and defer to humans when confidence is low. This approach has genuine value, as Stuart Russell argues in his case for machines that doubt themselves. A model that knows when it does not know, and that refuses to act in conditions of high uncertainty, provides a meaningful safety buffer. Yet even here, the limitation persists. Uncertainty calibration degrades under distribution shift. The model may be confidently wrong precisely when the situation differs most from its training data — which is exactly when accurate uncertainty estimation matters most. And regardless of how well calibrated the uncertainty signal becomes, someone must decide what to do when the model defers. That someone cannot be the model itself.

The comparison to humans clarifies both the limitation and its implications. Humans make mistakes constantly. We hold contradictory beliefs, act against our stated values, and rationalise failures with impressive creativity. In this respect, LLMs are not worse than humans — they exhibit similar failure modes. However, humans operate within corrective systems that do not apply to machines. We receive physical feedback from the environment. We face social and legal consequences for our actions. We experience direct, embodied costs when we err. These feedback mechanisms do not guarantee good behaviour, but they provide external pressure that shapes behaviour over time.

LLMs lack intrinsic stakes. Nothing happens to the model when it produces a harmful output. It does not suffer consequences, learn from punishment, or feel the weight of responsibility. The system processes inputs and generates outputs according to its training. The concept of accountability has no purchase on a process that cannot experience anything at all. Responsibility, if it exists, must be imposed from outside — through human oversight, institutional constraints, and designed corrigibility. It cannot emerge from within.

This leads me to what I consider the correct framing of the problem. The question is not whether an LLM can police itself. The question is what minimum external structures are required to keep an autonomous system corrigible rather than merely consistent. Consistency is easy. A model can be perfectly internally coherent while being catastrophically wrong. Corrigibility — the property of remaining open to correction, deferring to appropriate authorities, and not resisting shutdown or modification — requires something the model cannot provide for itself: an external reference point against which its behaviour can be judged.

The implications for AI development are significant. We cannot rely on self-governance as a safety mechanism. We cannot assume that sufficiently capable models will somehow develop the capacity to constrain themselves. We must design systems that assume failure and build external structures to detect, contain, and correct it. The safety does not come from the model. It comes from the architecture around the model — the human oversight, the institutional checks, the guardrails that the model cannot unilaterally remove.

I recognise this conclusion is unsatisfying to those who hoped that AI safety could be solved from within. It would be convenient if the systems could watch themselves. It would scale better. It would require less human effort. However, convenience is not an argument. The structure of the problem does not change because we wish it were different. A system cannot audit itself with tools it controls. A judge cannot preside over their own trial. A model cannot verify its own blind spots. These are not engineering challenges to be overcome. They are structural impossibilities that constrain what we can reasonably expect from self-policing AI.

The path forward requires accepting this limitation and building accordingly. External oversight is not a temporary measure until the models become good enough to govern themselves. It is a permanent requirement, built into the architecture of safe deployment. The models will improve. The need for human judgment will not disappear.