The Transformer, 2017: Attention Is All You Need

Mountain View, California. Late 2016. A group of researchers at Google Brain are sitting around a table, talking about a problem that has been bothering them for years. The problem is not technical, exactly — they have all the technical skills required to work on it. The problem is conceptual.

They are working on machine translation — the use of neural networks to translate text from one language to another. The current state-of-the-art systems use encoder-decoder architectures with LSTMs and attention mechanisms. These systems work well. They have dramatically improved machine translation over the previous generation of statistical approaches. But they are slow to train, because LSTMs are inherently sequential — they process input one step at a time, and you cannot compute step three before you have computed step two.

The attention mechanism that Bahdanau and colleagues had developed in 2014 was the most important component of these systems — the mechanism that allowed the decoder to look at the entire source sentence when generating each output word. The LSTM was the computational substrate that the attention mechanism ran on.

One of the researchers at the table — Llion Jones — says something that sounds, at first, like a joke: “What if attention is all you need?”

It is not a joke. It is one of the most important ideas in the history of AI.

The Team That Built the Future

The “Attention Is All You Need” paper, published in December 2017, was the work of eight researchers: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. All were working at Google Brain or Google Research when the work was done, though some have since moved on to other organisations.

Understanding who these people were and what they were working on illuminates why the Transformer architecture emerged when and where it did.

Ashish Vaswani and Noam Shazeer were experienced Google Brain researchers who had been working on neural network architectures for language. Shazeer in particular had worked on mixture-of-experts architectures — a way of making neural networks much larger without proportionally increasing the computational cost of each forward pass — that would later become important for scaling large language models.

Niki Parmar and Jakob Uszkoreit had backgrounds in computational linguistics and NLP, and had been working on the specific challenges of making sequence models faster and more parallelisable.

Llion Jones and Aidan Gomez were earlier-career researchers who brought fresh perspectives and, in Gomez’s case, a willingness to push on assumptions that more established researchers might have been less willing to question.

Łukasz Kaiser and Illia Polosukhin were working on related problems in neural program synthesis and multi-task learning, and they contributed perspectives on architectural flexibility and generalisation.

The team’s diversity — different backgrounds, different levels of experience, different primary research interests — was productive. The Transformer architecture emerged from a conversation that involved multiple perspectives on what was wrong with existing approaches and what a better architecture might look like.

The Problem: Sequential Processing in a Parallel World

To understand why the Transformer was important, you need to understand the specific problem it solved — and why that problem was worth solving.

The dominant approach to sequence modelling in 2016 — LSTMs with attention — had a fundamental computational bottleneck: sequential dependency. An LSTM processed a sequence by updating its hidden state at each step, with the new hidden state depending on the previous hidden state and the current input. This sequential dependency meant that the computation at step t could not begin until the computation at step t-1 was complete.

For training, this sequential dependency meant that LSTMs could not take full advantage of the massively parallel computing hardware — GPUs and TPUs — that had transformed other deep learning applications. A GPU with thousands of processing cores could not speed up sequential LSTM computation proportionally, because the steps had to be executed in order. The GPU was underutilised during LSTM training.

For long sequences, the problem was even more acute. A sentence of one hundred words required one hundred sequential LSTM steps to process. During training, the backpropagation signal had to propagate backward through one hundred time steps to reach the beginning of the sequence, potentially suffering from the vanishing gradient problem even with LSTM’s gating mechanisms.

The attention mechanism, as developed for encoder-decoder translation systems, had addressed some of these problems — the decoder could attend directly to any part of the encoder’s output without routing information through sequential hidden states. But the attention mechanism was used on top of LSTM, not instead of it. The LSTM bottleneck remained.

The insight at the core of the Transformer was to ask: what if you replaced the sequential LSTM with attention? Not just attention on top of LSTM — attention instead of LSTM? What if the entire computation was attention-based?

Self-Attention: The Core Mechanism

The mechanism at the heart of the Transformer — self-attention — was an extension of the encoder-decoder attention mechanism to sequences that were attending to themselves.

In the encoder-decoder attention used in translation systems, the decoder at each step attended to the encoder’s output: it computed a weighted sum of the encoder hidden states, with weights that reflected how relevant each source position was to the current target position being generated.

Self-attention did the same thing within a single sequence. Each position in the sequence computed a weighted sum of all other positions in the sequence, with weights reflecting how relevant each other position was to the current position. This allowed each position to gather information from all other positions directly, without having to route information through a sequence of hidden states.

The specific implementation of self-attention in the Transformer used three learned projections of each input — query, key, and value — to compute attention weights and aggregate information. Each input was projected into a “query” vector (what this position is looking for), a “key” vector (what this position has to offer), and a “value” vector (the information to be aggregated). Attention weights were computed by comparing the query of each position with the keys of all other positions; the output was a weighted sum of the values.

This implementation was elegant in a specific sense: it made the computation of attention weights and the aggregation of information differentiable, allowing the system to be trained end-to-end with backpropagation. The query, key, and value projections were learned parameters, not hand-designed. The system learned, from data, what to attend to and how to aggregate the attended information.

Multi-headed attention — computing multiple sets of attention weights in parallel, each attending to different aspects of the input — extended the expressiveness of the mechanism. Different attention heads could learn to attend to syntactic relationships, semantic relationships, positional relationships, and other aspects of the input simultaneously.

Positional Encoding: Solving the Order Problem

A potential problem with self-attention as described so far was that it was permutation-invariant: if you changed the order of the words in a sentence, the same attention weights would be computed and the same outputs produced. The meaning of a sentence depends on word order — “dog bites man” is different from “man bites dog” — and an architecture that was insensitive to word order would fail catastrophically.

The Transformer addressed this problem through positional encoding — the addition of information about position to the input representations before they were processed by attention. Each position in the input sequence was assigned a unique positional encoding, typically based on sinusoidal functions of the position and the encoding dimension.

By adding positional encodings to the input representations, the Transformer made position information available to the self-attention mechanism. Attention weights could then depend on position — a word at position five might attend differently to a word at position two versus a word at position eight. The architecture was no longer permutation-invariant.

The specific form of positional encoding used in the original Transformer — sinusoidal functions of position — was chosen for specific theoretical reasons: it allowed the model to generalise to sequence lengths not seen during training, and it had specific mathematical properties that made position relationships interpretable in terms of the encoding.

Subsequent work explored alternative forms of positional encoding, including learned positional encodings that were optimised during training and relative positional encodings that encoded the distance between positions rather than their absolute values. But the original sinusoidal encoding worked well for the initial demonstrations and established the principle.

The Architecture: Encoder, Decoder, and the Full Design

The complete Transformer architecture for machine translation consisted of an encoder and a decoder, each composed of multiple stacked layers of attention and feed-forward operations.

The encoder processed the source sequence (the sentence to be translated). Each encoder layer consisted of a multi-headed self-attention operation — allowing each position to attend to all other positions in the source sequence — followed by a feed-forward network applied independently to each position.

The decoder generated the target sequence (the translated sentence) one token at a time. Each decoder layer consisted of three components: a masked self-attention operation (masked so that each position could only attend to positions earlier in the sequence, since the later positions had not yet been generated), an encoder-decoder attention operation (allowing each target position to attend to all positions in the source sequence), and a feed-forward network.

Both encoder and decoder used residual connections — connections that added the input of each sub-layer to its output — which addressed the vanishing gradient problem in deep networks and made training more stable. Layer normalisation was applied after each sub-layer.

The number of stacked layers, the size of the attention representations, and the number of attention heads were hyperparameters that could be varied to trade off model capacity against computational cost. The original paper used 6 layers, 512-dimensional representations, and 8 attention heads for both encoder and decoder.

The resulting architecture was substantially simpler than the LSTM-based sequence models it replaced. It had no recurrence, no gates, no hidden states that were updated sequentially. The entire computation was a series of attention and feed-forward operations that could be parallelised across positions.

Why It Was Fast: The Parallelism Advantage

The key practical advantage of the Transformer over LSTM-based models was training speed. Because self-attention computed the relationship between all pairs of positions simultaneously — not sequentially — it could take full advantage of GPU and TPU parallelism in a way that sequential LSTM computation could not.

For a sequence of length n, self-attention computed all n×n pairs of attention weights in a single parallelised operation. LSTM required n sequential steps, each depending on the result of the previous step. On hardware with tens of thousands of parallel processing cores, the Transformer could train dramatically faster than an equivalent LSTM model.

The practical consequence was that Transformer models could be trained to a higher quality level in the same amount of time as LSTM models, or to the same quality level in much less time. For researchers working at the frontier of what was computationally feasible, this speedup was transformative.

The paper demonstrated the advantage quantitatively. On the WMT 2014 English-to-German translation benchmark, the Transformer achieved state-of-the-art performance while training for approximately 12 hours on 8 GPUs. LSTM-based models that achieved comparable performance typically required days of training on many more GPUs.

The training speed advantage was not just a matter of convenience. It meant that researchers could iterate faster — could try more architectural variations, more hyperparameter settings, more training procedures in the same time. Faster iteration accelerated the research cycle and enabled the rapid architectural improvements that followed the original Transformer paper.

Why It Was Better: Understanding the Performance Gains

The Transformer did not just train faster than LSTM-based models. It also performed better, achieving state-of-the-art results on machine translation benchmarks by substantial margins.

The performance improvement came from several sources.

Direct long-range attention. Self-attention connected any two positions in a sequence directly, regardless of their distance. LSTM had to route information through a sequence of hidden states, potentially degrading long-range dependencies even with the gating mechanisms that LSTM used to mitigate this problem. For tasks where long-range dependencies were important — understanding pronouns that referred to entities mentioned many sentences earlier, handling idioms and fixed phrases, translating sentences with complex syntactic structure — direct attention was more effective than sequential hidden state routing.

Expressiveness of multi-headed attention. The multiple attention heads in the Transformer could learn to attend to different aspects of the input simultaneously — syntactic relationships, semantic relationships, positional patterns, co-reference chains. This expressiveness allowed the model to represent richer relationships between parts of a sequence than single-channel LSTM hidden states could.

Training dynamics. The residual connections and layer normalisation in the Transformer, combined with the parallelism that allowed more training time in practice, produced training dynamics that led to better-converged models.

Ability to scale. The Transformer architecture proved more amenable to scaling than LSTM-based models. As model size increased — more layers, larger representations — Transformer performance improved smoothly and substantially. LSTM-based models showed diminishing returns from scale more quickly. The Transformer’s scalability was not immediately apparent from the original paper’s results, but it became one of the most important features of the architecture as subsequent researchers explored the relationship between scale and performance.

The Paper’s Reception: Immediate Recognition and Subsequent Explosion

The reception of the “Attention Is All You Need” paper was, by the standards of machine learning research, unusually positive and unusually fast.

The paper was submitted to NeurIPS 2017 and accepted as an oral presentation — one of the most competitive forms of recognition in the field. The presentation attracted significant attention, and the paper was downloaded thousands of times in the weeks following its publication.

Within a year of publication, the Transformer had been adopted by most major NLP research groups. The combination of state-of-the-art performance and fast training made it obviously superior to LSTM-based models for the translation tasks that were the standard NLP benchmark. Researchers who tried the Transformer on other NLP tasks — summarisation, question answering, text classification — found that it worked well for those tasks too.

The explosion came in 2018, when two papers demonstrated the power of combining the Transformer with large-scale pre-training.

Google’s BERT (Bidirectional Encoder Representations from Transformers) used the Transformer encoder, pre-trained on massive text corpora using a masked language modelling objective, and showed that fine-tuning the pre-trained model on specific NLP tasks produced dramatic improvements across the standard benchmarks. BERT set new state-of-the-art results on eleven NLP tasks simultaneously, with improvements that were larger than several years of previous incremental progress.

OpenAI’s GPT (Generative Pre-trained Transformer) used the Transformer decoder, pre-trained with a standard language modelling objective (predicting the next token), and showed that the resulting model could be fine-tuned effectively for a wide range of downstream tasks.

Together, BERT and GPT established the pre-training paradigm for NLP — the approach of training large Transformer models on internet-scale text and fine-tuning them for specific tasks — that would define NLP research for the subsequent decade and would eventually produce GPT-3, GPT-4, and the large language models that are now transforming how people interact with AI.

The Attention Mechanism’s Intellectual History

The attention mechanism that the Transformer generalised into self-attention had a specific intellectual history that the paper built on — and that deserves to be traced.

The foundational paper was Bahdanau, Cho, and Bengio’s 2014 “Neural Machine Translation by Jointly Learning to Align and Translate.” This paper introduced the mechanism of computing learned weights over the encoder hidden states, allowing the decoder to attend selectively to different source positions at each decoding step. The attention weights were interpretable as a soft alignment between source and target words, and the mechanism dramatically improved translation quality for long sentences.

Several papers between 2014 and 2017 explored extensions and variations of the attention mechanism. Luong and colleagues explored different formulations of the attention score function. Vinyals and colleagues developed pointer networks — attention mechanisms that allowed models to point to positions in the input rather than just aggregating information. Transformers for images explored spatial attention for visual processing.

The specific insight of the “Attention Is All You Need” paper was to remove the LSTM entirely — to use attention not as an add-on to a recurrent model but as the primary computational mechanism. This was a conceptual step beyond the previous literature, which had consistently used attention on top of recurrence rather than instead of it.

The paper also introduced several specific technical innovations that made attention-only models work well: multi-headed attention, the query-key-value formulation, the scaled dot-product attention computation, the positional encoding approach. These innovations were not just incremental improvements on existing attention mechanisms — they were a coherent, carefully designed package that made attention-only sequence modelling practical and effective.

The Broader Applications: Vision, Audio, and Everything Else

The Transformer was designed for language, but it quickly proved applicable to essentially any domain where data could be represented as sequences.

Vision. The Vision Transformer (ViT), published by Google in 2020, applied the Transformer architecture to image classification. Images were divided into patches, each patch was treated as a “word” in a sequence, and a standard Transformer encoder was used to classify the image. The results were competitive with convolutional networks for large-scale image recognition, and the architecture had specific advantages — particularly in scaling and in the ability to capture long-range spatial relationships — that made it attractive for certain applications.

Audio. Transformer architectures were applied to speech recognition and audio understanding, gradually replacing or complementing the recurrent architectures that had previously dominated. Wav2Vec and its successors used Transformer encoders pre-trained on large unlabelled audio datasets, following the pattern established by BERT for text.

Scientific data. AlphaFold 2, the protein structure prediction system that solved one of biology’s grand challenges, used a Transformer-based architecture to process amino acid sequences and predict three-dimensional protein structures. The protein language models that preceded AlphaFold — large Transformer models pre-trained on protein sequence databases — demonstrated that the pre-training paradigm transferred to biological sequence data.

Multi-modal models. Architectures like DALL-E, Stable Diffusion, and GPT-4V combined Transformers processing text and image data, enabling systems that could generate images from text descriptions, answer questions about images, and perform other tasks that required understanding relationships between different modalities.

The universality of the Transformer architecture — its applicability across modalities and domains — was one of its most important features and one of the things that distinguished it from the convolutional networks that had preceded it. Convolutional networks worked well for spatial data (images, audio spectrograms) but not for sequential text data. LSTMs worked well for sequences but not for images. The Transformer worked for everything that could be represented as a sequence — which, as subsequent research showed, was essentially everything.

The Scaling Revolution: What Bigger Transformers Could Do

The most consequential property of the Transformer architecture — more important, in the long run, than its performance on any specific benchmark — was its ability to scale.

The scaling laws for Transformer language models, empirically documented by Kaplan, McCandlish, and colleagues at OpenAI in 2020, showed that Transformer performance improved smoothly and predictably as model size, training data size, and computing budget increased. Each order-of-magnitude increase in these quantities produced a consistent improvement in performance, following a power law relationship.

The scaling laws implied that making language models more capable was largely a matter of training larger models on more data with more compute. Not fundamentally new architectures, not theoretical breakthroughs — just scale.

This implication transformed the economics of AI research. The organisations with the most resources — the most GPUs, the most training data, the largest computing budgets — would produce the most capable language models. The research that mattered most was not the research that developed the cleverest algorithms but the research that most efficiently used large-scale computing resources to train large models.

The Transformer’s scalability was specifically tied to properties of its architecture. Because the attention mechanism was inherently parallelisable, Transformer training could take full advantage of distributed computing across thousands of GPUs simultaneously. Because the architecture had few inductive biases — few assumptions about the structure of the data — it could learn from essentially any kind of text, making large-scale pre-training on diverse internet text effective.

GPT-2, with 1.5 billion parameters, demonstrated in 2019 that scaling produced qualitative changes in language generation capability. GPT-3, with 175 billion parameters, demonstrated in 2020 that further scaling produced the in-context learning capabilities that had not been present at smaller scales. GPT-4 and subsequent models demonstrated that scaling continued to produce qualitative capability improvements even at hundreds of billions of parameters.

What Attention Learns: Visualising the Mechanism

One of the most fascinating aspects of the Transformer is what the attention mechanism learns to attend to — the specific relationships between words and positions that attention weights capture after training.

Researchers who visualised the attention weights of trained Transformer models found that different attention heads specialised in different types of relationships.

Some heads learned to attend to syntactic structure — to track subject-verb-object relationships, to follow coreference chains, to capture the scope of negation. These heads produced attention patterns that, when visualised, looked like dependency parse trees — the formal syntactic structures that linguists had studied for decades.

Other heads learned to attend to positional patterns — to attend to adjacent words, or to words at specific distances, or to the beginning and end of the sequence. These heads captured the local context that was important for the morphological and collocational patterns of language.

Still other heads learned to attend to semantic relationships — to bring together words that were semantically related but syntactically distant, to connect pronouns with their referents, to relate verbs with their typical subjects and objects.

These visualisations were striking because they showed that the Transformer had learned, from training data alone and without any explicit instruction about syntactic or semantic structure, to represent the linguistic structures that linguists had painstakingly catalogued through decades of theoretical work. The emergent linguistic knowledge in the Transformer’s attention weights was not programmed; it was learned.

The visualisations also provided partial answers to the question of what large language models understood. The attention patterns showed that the models were tracking syntactic and semantic structure — they were not just memorising surface patterns. Whether this tracking constituted genuine understanding, in the philosophical sense, remained debated. But it was clearly something more than arbitrary pattern matching.

The Team Scatters: What Happened to the Paper’s Authors

The eight authors of the “Attention Is All You Need” paper have, in the years since its publication, scattered across the AI landscape in ways that illustrate the extraordinary commercial value that the paper created.

Ashish Vaswani left Google and co-founded Adept AI, an AI company focused on building AI that could take actions in software. Noam Shazeer left Google and co-founded Character.AI, a company building AI-powered conversational personas that became one of the most widely used AI consumer applications. Niki Parmar and Jakob Uszkoreit also left Google; Uszkoreit co-founded Inceptive, a company using AI for biological design. Llion Jones and Ilya Polosukhin left Google; Polosukhin co-founded NEAR Protocol. Aidan Gomez left Google and co-founded Cohere, an AI company providing language model APIs. Łukasz Kaiser remained at Google for several years before moving to OpenAI.

The founding of multiple significant AI companies by the paper’s authors within a few years of publication is a striking illustration of the commercial value that the Transformer architecture created. The specific technical knowledge embodied in the paper — not just the architecture but the experience of developing it, the understanding of how to make it work, the intuitions about what would scale well — was worth hundreds of millions of dollars in venture capital terms.

The scattering was also a sign of the competitive landscape that the Transformer had helped create. By 2019, it was clear that large Transformer language models were going to be commercially extremely valuable. Every major technology company and dozens of startups were racing to build and deploy such models. The researchers who had developed the foundational architecture were in enormous demand.

The Philosophical Questions the Transformer Raised

The Transformer’s success — and particularly the surprising capabilities of large Transformer language models — raised philosophical questions that the AI community had been debating since the era of ELIZA.

Do large language models understand language? The Transformer’s attention mechanism learns to represent syntactic and semantic structure in ways that look like understanding. The models trained with the Transformer can answer questions, translate between languages, summarise text, write code, explain concepts, and engage in sophisticated conversations. Do these capabilities constitute genuine understanding, or are they sophisticated pattern matching that mimics understanding without achieving it?

The question is harder with Transformer-based language models than with ELIZA because the capabilities are so much broader and so much more impressive. An ELIZA user who probed the system with unexpected inputs quickly discovered its limitations. A GPT-4 user who probes with unexpected inputs often finds that the model handles them appropriately. The space of inputs where the model seems to understand is much larger than with previous systems.

But the failures are also revealing. Language models hallucinate — confidently assert false information. They fail at certain systematic reasoning tasks — multi-step arithmetic, logical puzzles that require careful tracking of state. They are sensitive to the specific framing of questions in ways that genuine understanding would not be. These failures suggest that the models are doing something different from human understanding, even if what they are doing is impressive in many contexts.

What is the relationship between scale and intelligence? The scaling laws demonstrate that Transformer language models become more capable as they are scaled up. But “more capable” means better performance on benchmarks — better accuracy on specific evaluation tasks. Does better benchmark performance imply more intelligence, in some deeper sense? Or is it possible to achieve better benchmark performance without the kind of general reasoning and understanding that human intelligence involves?

This question is not purely philosophical — it has practical implications for how AI should be evaluated and deployed. If benchmark performance is a reliable proxy for underlying capability, then the scaling results justify optimism about the path to general AI. If benchmark performance can be achieved without the underlying capability, then there may be fundamental limitations to the current paradigm that scale alone cannot overcome.

The Architecture That Runs the World

The Transformer architecture, introduced in a December 2017 paper by eight Google Brain researchers, has become the foundation of essentially every large AI system deployed at significant scale.

GPT-4, the language model behind ChatGPT, is a Transformer. Claude, Gemini, LLaMA, Mistral — every major large language model is a Transformer. DALL-E, Midjourney, Stable Diffusion — the major image generation systems use Transformer architectures. AlphaFold 2 uses a Transformer. The speech recognition systems in every major voice assistant use Transformers. The translation systems in Google Translate, DeepL, and their competitors use Transformers.

The Transformer’s universality — its applicability across modalities and domains, its scalability, its ability to be pre-trained on internet-scale data and fine-tuned for specific tasks — has made it the dominant architectural paradigm for AI in the way that the convolutional network was the dominant paradigm for image recognition in the years immediately following AlexNet.

Whether the Transformer will remain dominant — whether it will be superseded by fundamentally different architectures, whether its limitations will eventually become obstacles that scale cannot overcome — is an open question. Researchers are exploring alternatives: state space models, which aim to address the computational cost of attention for long sequences; mixture-of-experts architectures that scale model capacity without proportionally scaling computation; architectures specifically designed for physical or spatial reasoning.

But for the foreseeable future, the Transformer is the architecture that runs the world’s AI systems. The eight researchers who asked “what if attention is all you need?” and answered their own question with five months of work and a paper that runs to fifteen pages have built the foundation of modern AI.

The Legacy: A Paper That Changed Everything

The “Attention Is All You Need” paper sits alongside a small number of scientific papers that genuinely changed the trajectory of an entire field. Turing’s 1936 paper on computation. Shannon’s 1948 paper on information theory. Watson and Crick’s 1953 paper on DNA structure. Rumelhart, Hinton, and Williams’ 1986 paper on backpropagation.

These papers share certain properties: they solved a fundamental problem, they introduced ideas that were immediately applicable, and they opened research directions that proved enormously productive. The Transformer paper shares all of these properties.

It solved the fundamental problem of efficient sequence modelling — how to model long-range dependencies in sequences without the computational bottleneck of sequential recurrence.

It introduced ideas — self-attention, multi-headed attention, the query-key-value formulation — that were immediately adopted by the research community and that have proved robust as the field has scaled from the models of 2017 to the models of 2024.

It opened the research direction of large-scale pre-training for NLP, which produced BERT, GPT, and their successors, and which has transformed what AI can do with language.

The paper is also notable for what it was not: it was not the product of a decades-long research programme. The team that built the Transformer was not part of the neural network underground that had maintained the connectionist research tradition through two AI winters. They were practitioners who saw a problem with existing approaches and found a better solution. The Transformer was the product of focused, applied research rather than sustained basic research, and it emerged in the specific institutional environment of Google Brain where such research was possible.

This is also part of the Transformer’s story: the deep learning revolution had, by 2017, created an environment in which foundational architectural innovations could emerge from industrial research as well as academic research. The relationship between industry and academia in AI had been transformed by the commercial value that deep learning had demonstrated, and the Transformer was a product of that transformation.