Most of what humanity has written, recorded, and published does not exist on the internet. Not even close. Large language models, search engines, recommendation algorithms: they all treat the web as though it were a reasonable proxy for human knowledge. It is not. It is a shallow, recent, and spectacularly incomplete sample.
Google has scanned tens of millions of books, but most sit behind copyright walls, neither fully searchable nor publicly readable. The rest exist on shelves, in basements, in charity shops where nobody is looking. The vast majority of the world's cultural heritage has never been digitized in any form. Not suppressed, not restricted. Just absent.
The pre-internet age was not merely analogue. It was geographically bounded. John Holbo, writing on Crooked Timber, described it as a kind of epistemic accident: you knew what the six people around you knew, what your local library stocked, what your local record shop carried. A left-handed guitarist might never discover that left-handed guitars existed. That accidental ignorance, that texture of ordinary life, was never documented in a form that any crawler could find. It was the water, not the fish.
The physical record is vanishing too. When the Chicago Sun-Times consolidated its suburban papers, photographs from the Aurora Beacon-News and Elgin Courier-News were thrown in the bin. The Louisville Courier Journal's archive of roughly ten million photographs nearly followed before the University of Louisville negotiated a last-minute donation. These aren't edge cases. They are the norm for local journalism across America and, by extension, for any community record that depended on newsprint.
Meanwhile, born-digital content fares no better. Pew Research found in 2024 that a quarter of all web pages that existed at some point between 2013 and 2023 are no longer accessible. MySpace disclosed in 2019 that a server migration had destroyed millions of songs, videos, and photographs, in what the Long Now Foundation described as irreversible data loss. Andy Warhol's digital artwork from the 1980s sat stranded on obsolete Commodore hardware for decades.
The gap is self-reinforcing. If knowledge isn't online, AI can't learn it. If AI can't surface it, fewer people encounter it. If fewer people encounter it, there's less incentive to digitize it. The loop tightens, and the memory without metadata that defined most of human experience drifts further from retrieval.
I think about this when people describe AI as a knowledge tool. It is a tool for a particular kind of knowledge, overwhelmingly English-language, overwhelmingly post-1990s, overwhelmingly sourced from the kind of person who publishes on the internet. Everything else, the vast majority of what humans have thought and made and recorded, sits in formats that no model will ever ingest. Not because the technology couldn't handle it, but because nobody is going to scan it.
Sources:
- The Archaeology of the Ignorance of Pre-Internet Ordinary Life, Crooked Timber
- Who's Going to Save Local Newspaper Archives?, Columbia Journalism Review
- When Online Content Disappears, Pew Research Center
- Shining a Light on the Digital Dark Age, Long Now Foundation