Twelve hours on eight NVIDIA P100 GPUs. That was the training cost for the model that ended the LSTM era in natural language processing. The base Transformer achieved state-of-the-art machine translation at a fraction of the training cost of every previously published model. The paper, arXiv:1706.03762, appeared in June 2017. Its eight authors were all at Google.

The core hypothesis came from Jakob Uszkoreit: attention alone, without recurrence, could handle sequence transduction. His own father, the computational linguist Hans Uszkoreit, wasn't convinced. The prevailing view in 2017 was that recurrence (processing tokens one at a time, first to last) was structurally necessary for capturing temporal dependencies. LSTMs powered Google Translate and virtually every production NLP system. Proposing to throw that away felt reckless.

Noam Shazeer designed the specific mechanism: scaled dot-product attention, multi-head attention, and a positional encoding scheme that preserved word order once recurrence was gone, so entire sequences could be processed in parallel. Llion Jones later described the process at NVIDIA's GTC conference: "We had very recently started throwing bits of the model away, just to see how much worse it would get. And to our surprise it started getting better."
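
The mechanism fits in a few lines. Below is a minimal NumPy sketch of scaled dot-product attention and the sinusoidal positional encoding; the two formulas come straight from the paper, while the function names and toy dimensions are illustrative, not the authors'. Multi-head attention simply runs several of these in parallel over learned linear projections and concatenates the results.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, equation (1) of the paper.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how much each query attends to each key
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # each output is a weighted mix of values

def sinusoidal_positional_encoding(seq_len, d_model):
    # The paper's fixed encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    # PE(pos, 2i+1) = cos(same). Order information without any recurrence.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy self-attention over 4 token embeddings of width 8 (sizes chosen for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) + sinusoidal_positional_encoding(4, 8)
out = scaled_dot_product_attention(x, x, x)  # Q = K = V = x: self-attention
print(out.shape)  # (4, 8): every position attends to every other in one parallel step
```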

Jones also named the paper: "Attention Is All You Need," a nod to the Beatles' "All You Need Is Love." The architecture itself got called Transformer because Uszkoreit liked the sound of the word.

The paradigm collapsed faster than anyone expected. GPT-1 arrived in June 2018, decoder-only and transformer-based. BERT landed four months later and crushed eleven NLP benchmarks simultaneously. By October 2019, Google was running BERT on live search queries. From paper to undeniable dominance took barely more than two years. Publishing an LSTM paper after 2019 felt like submitting a fax.

Then all eight authors left Google. Every one of them. Shazeer built LaMDA internally, watched Google refuse to release it, and quit in 2021 to found Character.AI. Vaswani and Parmar started Adept, then Essential AI. Uszkoreit went to design RNA molecules at Inceptive. Jones co-founded Sakana AI in Tokyo. Gomez co-founded Cohere. Polosukhin left immediately after the paper to build NEAR Protocol. Kaiser joined OpenAI.

In August 2024, Google structured a $2.7 billion licensing deal with Character.AI to bring Shazeer back. Based on his ownership stake, he personally netted somewhere between $750 million and a billion dollars. He now co-leads Gemini alongside Jeff Dean.

The paper has 173,000 citations, among the most-cited scientific papers of the twenty-first century. Vision Transformers displaced CNNs as the dominant architecture in image recognition. AlphaFold, a transformer derivative, earned Demis Hassabis and John Jumper a share of the 2024 Nobel Prize in Chemistry.

At that GTC panel, Aidan Gomez said something worth sitting with: "I think the world needs something better than the transformer." He may be right. But nobody has found it yet.
