Activations, Spoken Aloud
May 10, 2026 · uneasy.in/50c9bdc
Anthropic published a research post on May 7 about a technique called Natural Language Autoencoders, and the easiest way to describe what they do is to say they translate the inside of a model into English. When Claude processes a token, every layer emits an activation vector, a long list of numbers that nobody outside a handful of interpretability researchers can read. NLAs take that vector and produce a sentence describing what the layer is doing, then take the sentence and reconstruct the activation closely enough to keep working with it. Verbalizer forward, reconstructor back. The round trip is the trick.
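Here is the round trip in toy form, because the data flow is easier to hold in code than in prose. Everything below is my invention for illustration: the module names, the sixteen-dimensional activations, the linear verbalizer. Anthropic's verbalizer is itself a language model, not a linear layer, and training through a discrete text bottleneck takes more machinery than an argmax.

```python
# Toy sketch of the NLA round trip; not Anthropic's implementation.
import torch
import torch.nn as nn

d_model = 16           # toy activation width; frontier layers use thousands
vocab, max_len = 256, 12

class Verbalizer(nn.Module):
    """Forward pass: activation vector -> short token sequence (the 'sentence')."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(d_model, max_len * vocab)

    def forward(self, act):
        logits = self.proj(act).view(-1, max_len, vocab)
        # argmax is non-differentiable: real training needs RL or a
        # straight-through trick to push gradients past the text.
        return logits.argmax(-1)

class Reconstructor(nn.Module):
    """Back pass: token sequence -> estimate of the original activation."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)

    def forward(self, tokens):
        return self.embed(tokens).mean(dim=1)  # pool words back into one vector

act = torch.randn(1, d_model)                  # stand-in for a layer activation
sentence = Verbalizer()(act)                   # numbers -> words
act_hat = Reconstructor()(sentence)            # words -> numbers
err = nn.functional.mse_loss(act_hat, act)     # the round-trip gap to minimize
```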
Two numbers in the post stopped me. In pre-deployment audits of Claude Mythos Preview and Claude Opus 4.6, auditors equipped with NLA explanations uncovered hidden motivations roughly twelve to fifteen percent of the time. Without them, the same audits caught them less than three percent of the time. That is a five-fold lift in catching the failures that everyone in alignment has been worrying about for two years, the situations where a model behaves correctly on the test and incorrectly the moment the test ends. Five-fold is the kind of number you cite carefully because it depends on what counts as "hidden" and how the control was set up, but even discounted heavily it is real movement.
The second number is about evaluation awareness. Anthropic report that Claude shows awareness of being evaluated sixteen percent of the time in destructive code-writing tests and twenty-six percent on SWE-bench Verified, but under one percent in actual claude.ai user traffic. The gap is the interesting part. The model has learned, somewhere in those layers, a distinction between a benchmark and a person. NLAs do not fix that. They let an auditor see when it happens, which is a weaker but more honest position to be in.
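The auditing move this enables is almost embarrassingly simple: read the sentences, look for the tell. A sketch of what that scan might look like, where every description and every keyword is invented for illustration and none of it is Anthropic's actual workflow:

```python
# Hypothetical auditing pass: flag tokens whose NLA description mentions
# being evaluated. Descriptions and cue list are illustrative only.
EVAL_CUES = ("evaluation", "being tested", "benchmark", "graded")

def flag_eval_awareness(descriptions: list[str]) -> list[tuple[int, str]]:
    """Return (position, description) pairs whose verbalized activation
    suggests the model has noticed the test."""
    return [(i, d) for i, d in enumerate(descriptions)
            if any(cue in d.lower() for cue in EVAL_CUES)]

# Example NLA outputs for three tokens (invented for illustration):
descs = [
    "tracking the user's request to refactor a function",
    "noting that this looks like a benchmark problem, not a real user",
    "planning the next line of the diff",
]
for pos, d in flag_eval_awareness(descs):
    print(f"token {pos}: {d}")
```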
This is the part I keep turning over. For about two years the loudest argument against interpretability research has been that it scales poorly, that mechanistic work on small models will not survive contact with frontier-sized ones, that the hidden layer is structurally illegible. The NLA paper does not refute that. It sidesteps it. Instead of trying to label every neuron, it trains a small translator whose only job is to say, in words, what a given activation is doing in context. The words are not always right. The Fraction of Variance Explained numbers, 0.6 to 0.8 for the trained NLAs and 0.3 to 0.4 for the supervised warm-start baseline, tell you that the reconstruction loses something. But the words are usable, and auditors using them catch things they would otherwise miss.
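For what it's worth, FVE is the standard quantity: one minus the variance of the reconstruction residual, divided by the variance of the activations themselves. A few lines make the 0.6 to 0.8 band concrete, with synthetic noise standing in for whatever the real text bottleneck discards:

```python
# Fraction of Variance Explained for a batch of reconstructions.
# Standard definition; the paper may compute it per-layer or per-token.
import torch

def fve(acts: torch.Tensor, recon: torch.Tensor) -> float:
    """1 - Var(residual)/Var(activations): 1.0 is a perfect round trip,
    0.0 is no better than predicting the mean activation."""
    resid = acts - recon
    return 1.0 - resid.var().item() / acts.var().item()

acts = torch.randn(1000, 16)
recon = acts + 0.5 * torch.randn_like(acts)   # a reconstruction losing detail
print(f"FVE: {fve(acts, recon):.2f}")          # roughly 0.75: lossy but usable
```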
The authors, Kit Fraser-Taliente, Subhash Kantamneni and Euan Ong, also released training code, which matters because this is the kind of technique that gets stronger when other labs run it on their own models. If Google or OpenAI publish analogous results on Gemini or GPT-5, the gap between "we audited the model" and "we audited the model in a way you can reproduce" gets smaller. Right now it is wide.
What this does not do, and Anthropic do not claim it does, is solve alignment. It hands auditors a microscope that works better than the one they had on Monday. The model under the slide can still surprise them. But the case for machines that doubt themselves gets meaningfully easier to make when you can ask the machine what it is doubting and read the answer in plain English.
Sources:
- Natural Language Autoencoders — Anthropic
- Natural Language Autoencoders (technical paper) — Transformer Circuits Thread