There is a thought experiment, proposed by the philosopher Dan Dennett, about what it would mean to understand a text. Suppose you have a book — a long, complex book, full of ideas and arguments and references to other books and to events in the world. To understand the book, you need to do more than decode the words on the page. You need to know what those words refer to. You need to know which ideas the author is responding to, which traditions the book is embedded in, what happened in the world that the book is commenting on. Understanding a text is understanding a web of relationships — between words, between ideas, between the text and the world.

The Transformer’s self-attention mechanism is, in a specific and limited sense, a mechanism for attending to exactly these relationships. When a Transformer processes a sentence, each word attends to all the other words — it computes, for each pair of words, how relevant one is to the other. The subject attends to its verb. The pronoun attends to its referent. The adjective attends to the noun it modifies. The sentence attends to the question it is answering.

The Transformer does not understand these relationships in the way a human reader understands them. It represents them as numerical weights, computed from learned parameters, optimised over billions of training examples. Whether this representation constitutes genuine understanding is a philosophical question that the technical facts alone cannot settle.

What the technical facts can tell us is that this representation is extraordinarily useful — that models built on it can do things with language that previous systems could not do, and that the range of things they can do has expanded with scale in ways that were not predicted and are not fully understood.

This is the story of how attention changed everything — and why the change raises questions that the field is still working out how to answer.


Before the Transformer: The NLP Landscape in 2017

To appreciate what the Transformer changed, it is necessary to understand what NLP looked like before the Transformer arrived.

Natural language processing in 2016 was a field of specialised systems. For each specific NLP task — machine translation, sentiment analysis, named entity recognition, question answering, text classification — there was a distinct research community, a distinct set of benchmarks, distinct approaches, and distinct state-of-the-art systems. The knowledge was fragmented: the insights that produced the best machine translation system did not automatically transfer to the best question answering system.

The dominant deep learning approaches for NLP tasks typically followed a two-step pattern. First, word representations: words were represented as dense vectors (word embeddings), either pre-trained from large text corpora (Word2Vec, GloVe) or initialised randomly and learned from the task-specific training data. Second, task-specific architecture: a neural network — typically an LSTM, sometimes a convolutional network, occasionally a combination — was trained on the task-specific labelled data to perform the specific task.

This approach produced good systems. LSTMs, which were introduced in 1997 and had become standard in NLP by the mid-2010s, could handle many NLP tasks well. Machine translation quality had improved dramatically from the statistical methods of the 2000s. Sentiment analysis, text classification, and information extraction systems were robust and commercially deployed.

But the approach had specific limitations. The word embeddings that were used as input representations were “static”: each word had a single representation regardless of context. The word “bank” had the same representation whether it appeared in “river bank” or “bank account,” even though its meaning was different in the two contexts. The task-specific models were trained on relatively small labelled datasets and did not transfer across tasks. And the LSTM-based architectures struggled with the long-range dependencies that many NLP tasks required.

The Transformer and, more importantly, the pre-training paradigm that the Transformer enabled would address all three of these limitations: contextualised representations, transfer across tasks, and effective long-range dependency modelling.


ELMo: The First Pre-Training Revolution

Before BERT and GPT demonstrated the full power of Transformer pre-training, an approach called ELMo (Embeddings from Language Models) offered a preview of what was coming.

ELMo was published by Matthew Peters and colleagues at the Allen Institute for AI in early 2018, a few months before the first GPT and BERT. ELMo used a bidirectional LSTM trained with a language modelling objective on a large text corpus to produce contextualised word representations: representations that encoded the meaning of a word in its specific context, not just the word in isolation.

The key insight of ELMo was that the hidden states of a language model — a model trained to predict what word comes next in a sequence — contained rich information about the meaning and grammatical role of each word in context. By extracting these hidden states and using them as the input representations for downstream NLP tasks, ELMo produced substantial improvements across a wide range of benchmarks.

ELMo demonstrated the principle: pre-training on large unlabelled text with a language modelling objective could produce representations that transferred across tasks and outperformed task-specific representations trained from scratch on limited labelled data. The specific mechanism — bidirectional LSTM with language modelling pre-training — was less important than the principle.

The principle was immediately recognised as important. Every major NLP research group began exploring variations. What if the pre-training objective was different? What if the architecture was different? What if the training data was larger?

The answers came quickly: BERT and GPT, both published in 2018, demonstrated that Transformer-based models pre-trained on much larger text corpora dramatically exceeded ELMo’s performance.


BERT: Bidirectional Pre-Training

BERT — Bidirectional Encoder Representations from Transformers — was published by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google in October 2018. It became one of the most influential NLP papers ever published, and its impact on the field was immediate and comprehensive.

BERT used the Transformer encoder — the part of the Transformer that encoded input sequences without generating output sequences — pre-trained with two objectives: masked language modelling and next sentence prediction.

Masked language modelling randomly selected approximately 15% of the input tokens as prediction targets, replaced most of the selected tokens with a special mask token, and trained the model to predict the original tokens from the surrounding context. This objective was specifically designed to be bidirectional: the model could attend to context on both sides of each masked token when making its prediction. This bidirectionality was the key difference from GPT’s left-to-right language modelling, and it made BERT’s representations more informative about word meaning in context.
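The corruption step can be sketched in a few lines. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mask_tokens` and the toy vocabulary are hypothetical, each position is sampled independently rather than picking an exact 15% of tokens, and the 80/10/10 split between mask token, random token, and unchanged token follows the rule described in the BERT paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modelling.

    labels[i] holds the original token at positions chosen as prediction
    targets, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # position not selected: the model is not scored here
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels

sentence = "the cat sat on the mat because it was tired".split()
corrupted, labels = mask_tokens(sentence, vocab=sorted(set(sentence)))
print(corrupted)
print(labels)
```

The model sees only `corrupted` and is trained to recover the non-empty entries of `labels`, using context on both sides of each selected position.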

Next sentence prediction trained the model to predict whether two sentences appeared consecutively in the original text or were randomly paired. This objective encouraged the model to learn relationships between sentences, which was important for tasks like natural language inference and question answering that required understanding how two pieces of text related to each other.

BERT was pre-trained on a dataset of approximately 3 billion words — the English Wikipedia and the BookCorpus, a collection of unpublished books. The training took approximately four days on 64 TPU chips, requiring computing resources available only to large industrial research organisations.

After pre-training, BERT was fine-tuned on specific NLP tasks by adding a task-specific output layer and training the entire model on the task’s labelled data. The fine-tuning was fast — typically requiring only a small number of training epochs — because the pre-trained representations already encoded rich linguistic knowledge.

The results of BERT fine-tuning were remarkable. On the GLUE benchmark, a suite of nine language understanding tasks, BERT improved the state of the art by more than seven percentage points, a margin that exceeded several years of previous incremental progress. On SQuAD, a reading comprehension benchmark, BERT surpassed the benchmark’s reported human baseline. In total, BERT set new state-of-the-art results on eleven NLP tasks.

The magnitude of the improvement made the paper’s significance immediately clear. BERT was not a marginal advance on existing approaches. It was a breakthrough — a demonstration that the combination of Transformer architecture and large-scale pre-training could unlock capabilities that had not been achievable with previous approaches.


GPT: The Generative Counterpart

At almost exactly the same time that BERT was being developed at Google, OpenAI’s Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever were developing GPT — the Generative Pre-trained Transformer.

GPT used the Transformer decoder rather than the encoder, and pre-trained with a standard left-to-right language modelling objective: predict the next token in a sequence given all preceding tokens. This was simpler than BERT’s masked language modelling and next sentence prediction, but it was also more general: any text could serve as training data without labelling or artificial corruption, and the resulting model could generate text by sampling from the distribution it had learned.
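To make the left-to-right objective concrete, the sketch below turns one sentence into the training examples it implies. The helper name `next_token_pairs` is hypothetical, and whitespace splitting stands in for a real tokeniser; the point is only that raw text supplies its own prediction targets.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each token is predicted from its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in next_token_pairs("the cat sat on the mat".split()):
    print(" ".join(context), "->", target)
```

Every position after the first yields one example, and the context never includes tokens to the right of the target. That restriction is what separates this objective from BERT's bidirectional masking.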

Like BERT, GPT was fine-tuned on specific NLP tasks after pre-training. Unlike BERT, which used a uniform fine-tuning approach for all tasks, GPT used task-specific input formatting to cast different tasks as variations of the same text generation problem — a design that pointed toward the prompt-based task specification that would become dominant in subsequent work.

GPT’s performance on NLP benchmarks was strong but not quite as dramatic as BERT’s. The bidirectionality of BERT’s pre-training gave it a specific advantage on tasks that required understanding the full context of a word or phrase, which most standard benchmarks prioritised.

But GPT had a distinctive advantage over BERT that would prove increasingly important as models were scaled up: it was generative. A model trained to predict the next token in a sequence could generate text of any kind — answers to questions, completions of stories, translations of sentences, descriptions of images, code implementing a specification — by formulating the task as a text generation problem and providing an appropriate prompt.

The full power of this generative flexibility would not be apparent until GPT-2 and GPT-3, but it was present in principle from the original GPT. The decision to use a generative pre-training objective was, in retrospect, one of the most consequential design choices in the history of language models.


The Pre-Training Paradigm: What It Revealed About Language

The success of BERT and GPT revealed something important about language that the previous architecture of NLP had obscured: that the amount of structure in language — the regularities of grammar, the patterns of semantic composition, the pragmatics of how language was actually used in context — was so large that it could serve as a rich training signal for learning representations that transferred across essentially all NLP tasks.

This was not obvious before the pre-training paradigm demonstrated it. The prevailing view in NLP before 2018 was that language models were useful for some specific applications — speech recognition, machine translation — but that general-purpose “language understanding” required task-specific training on labelled data. The assumption was that the representations learned from language modelling were general in some way, but not general enough to support strong performance on tasks as diverse as reading comprehension, natural language inference, named entity recognition, and coreference resolution.

The pre-training results showed this assumption was wrong. The representations learned by a large enough Transformer model pre-trained on a large enough text corpus were general enough to support strong performance on essentially all NLP tasks with minimal task-specific adaptation. The amount of information in the statistics of natural language — in the patterns of how words were used, how sentences were structured, how discourse unfolded — was sufficient to learn representations that captured much of what was needed for natural language understanding.

This revealed something about language itself: that natural language texts contained an extraordinary amount of information about how language worked, how meanings composed, how discourse unfolded. This information was implicit — not explicitly labelled or annotated — but it was there, available to a model that was large enough and trained long enough to extract it.

The pre-training paradigm was, in a specific sense, the computational vindication of a linguistic hypothesis that had been discussed in various forms for decades: that human language learning was largely unsupervised, driven by exposure to language in use rather than by explicit instruction about grammatical rules or semantic relationships. Human children learned language primarily by being exposed to it, not by being taught explicit rules. Large language models learned language primarily by being trained to predict it, not by being trained on explicitly labelled examples of grammatical categories or semantic relations.


GPT-2 and the Emergence of In-Context Learning

When OpenAI published GPT-2 in February 2019, it demonstrated something that BERT had not: that scaling a generative language model produced qualitatively new capabilities that had not been present at smaller scales.

GPT-2 had 1.5 billion parameters, more than ten times as many as the original GPT, and was trained on approximately 40 gigabytes of text drawn from web pages that had been curated for quality. The training data was larger and more diverse than the data used for BERT, reflecting OpenAI’s bet that scale and data diversity would produce more capable models.

The capabilities that emerged from this scale were unexpected. GPT-2 could generate coherent, contextually appropriate text of a length and quality that had not been previously demonstrated. Given a few sentences of a news article, GPT-2 could continue the article in a way that was often indistinguishable from human journalism. Given the opening of a story, it could continue the story in the established style. Given a question in natural language, it could often produce a plausible-sounding answer.

More remarkably, GPT-2 showed early signs of what would later be called “in-context learning”: the ability to perform new tasks by conditioning on examples provided in the prompt, without any parameter updates. If you showed GPT-2 a few examples of a task, such as “Translate from English to French: cat → chat; dog → chien; house → ”, it would sometimes complete the pattern correctly, even for words not in the examples.

This in-context learning was not something GPT-2 had been explicitly trained to do. It emerged from the combination of the model’s scale and the diversity of tasks implicitly present in its training data. The internet text on which GPT-2 was trained contained examples of translation, summarisation, question answering, and many other tasks embedded in documents — tutorials, worked examples, FAQ pages, academic papers. The model had implicitly learned to recognise these tasks from their format and to perform them.

The emergence of in-context learning from scale was one of the most surprising findings in the history of language models, and it raised profound questions about the relationship between scale and capability. Was this the first glimpse of something like general intelligence emerging from the statistics of language? Or was it a sophisticated form of pattern completion that would fail as soon as it was probed outside the range of patterns in the training data?


GPT-3: The Paradigm Shift

In May 2020, OpenAI published GPT-3, a language model with 175 billion parameters, trained on approximately 300 billion tokens of text. In parameter count it was roughly 100 times larger than GPT-2, and the capabilities were qualitatively different.

The GPT-3 paper, “Language Models are Few-Shot Learners”, introduced the few-shot, one-shot, and zero-shot paradigms for evaluating language models. In the few-shot paradigm, the model was given a small number of examples of a task (typically between 10 and 100) in the prompt, before being asked to perform the task on a new example. In the one-shot paradigm, only one example was provided. In the zero-shot paradigm, no examples were provided, only a natural language description of the task.
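The three paradigms differ only in how many worked examples appear in the prompt, which a short sketch makes plain. The function `build_prompt` and the `=>` separator are illustrative assumptions rather than a fixed format. No model is called here; the code only assembles the text a model would be asked to continue.

```python
def build_prompt(task_description, examples, query):
    """Assemble a zero-, one-, or few-shot prompt as plain text."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")  # worked examples, if any
    lines.append(f"{query} =>")                # the model continues from here
    return "\n".join(lines)

task = "Translate English to French:"
zero_shot = build_prompt(task, [], "house")
one_shot = build_prompt(task, [("cat", "chat")], "house")
few_shot = build_prompt(task, [("cat", "chat"), ("dog", "chien")], "house")
print(few_shot)
```

In all three cases the model's parameters are untouched. The examples influence the output only through the context the model conditions on.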

GPT-3’s performance in these paradigms was remarkable. On many NLP benchmarks, GPT-3 in the few-shot setting achieved performance competitive with fine-tuned BERT-scale models, without any gradient updates to the model’s parameters. Even in the zero-shot setting, GPT-3 achieved strong performance on tasks that appeared to require genuine language understanding: answering factual questions, performing simple arithmetic, translating between languages, writing code.

The specific capabilities that attracted the most attention were the generative ones. GPT-3 could write news articles, poetry, and short stories that were often indistinguishable from human writing. It could write essays on specified topics. It could write code from natural language descriptions. It could explain complex concepts in simple terms. It could continue a conversation in a specified persona or style.

These generative capabilities attracted extraordinary public attention. The examples of GPT-3 output that circulated in the weeks after the paper’s release were widely shared and widely discussed. The demonstrations raised the possibility, in the public mind, that AI had crossed a threshold — that language AI had become capable enough to be genuinely useful in the broad sense, not just in narrowly defined tasks.

The AI research community’s response was more cautious. The failures of GPT-3 were as interesting as its successes. The model hallucinated — produced plausible-sounding but factually incorrect information — at a rate that made it unreliable for any application where accuracy mattered. It was inconsistent across different phrasings of the same question. It failed at certain systematic reasoning tasks — multi-step arithmetic, logical puzzles that required careful tracking of intermediate states.

These failures suggested that GPT-3 was doing something different from what human intelligence did when it reasoned about language. The model had learned an extraordinarily rich statistical model of text — it knew, at a very detailed level, what kinds of text followed what other kinds of text. But this statistical knowledge did not fully capture the kind of structured, systematic reasoning that human intelligence employed.


The Scaling Laws: What They Confirmed and What They Left Open

The scaling laws, empirically documented by OpenAI researchers in early 2020, established that Transformer language model performance improved predictably with model size, training data size, and computing budget. The relationship was a power law: each factor-of-ten increase in scale reduced the model’s prediction loss by a roughly constant proportion, so the improvement was consistent and predictable.
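A power law of this kind can be written as L(N) = (N_c / N)^alpha, where N is the scale (for example, the parameter count) and L is the loss. The sketch below uses placeholder constants chosen for illustration, not the fitted values from the scaling-law papers, to show the defining property: every tenfold increase in N multiplies the loss by the same factor, 10^(-alpha).

```python
def power_law_loss(n, n_c=1e14, alpha=0.1):
    """Loss under a pure power law; n_c and alpha are illustrative placeholders."""
    return (n_c / n) ** alpha

sizes = [1e8, 1e9, 1e10, 1e11]  # each step is a factor-of-ten increase in scale
losses = [power_law_loss(n) for n in sizes]
ratios = [later / earlier for earlier, later in zip(losses, losses[1:])]

for n, loss in zip(sizes, losses):
    print(f"N = {n:.0e}  loss = {loss:.3f}")
print("loss ratio per tenfold step:", [round(r, 4) for r in ratios])
```

The ratios come out identical at every step, which is what makes the curve a straight line on log-log axes and lets performance at a larger scale be extrapolated from smaller runs.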

The scaling laws confirmed something important: that, across the range of scales studied, the capabilities of language models were not limited by fundamental architectural constraints. There was no evidence of a wall, a scale at which improvement stopped. Larger models trained on more data with more compute were consistently better, and the rate of improvement per unit of compute was similar across scales.

This confirmation had specific implications for the research strategy of AI organisations. If scale was the primary driver of capability improvement, then the organisations with the largest computing budgets would produce the most capable models. Research into architectural innovations and training efficiency was valuable, but it was secondary to the primary resource constraint of compute.

The scaling laws also left important questions open. The empirical observations described how performance improved with scale on specific benchmarks. They did not describe what capabilities would emerge at scales beyond those that had been studied. They did not explain why scale produced improvement — what was being learned at larger scales that was not being learned at smaller scales. And they did not predict when or whether the improvements would reach human-level performance on specific tasks, or what human-level performance would mean for tasks that were inherently ambiguous or context-dependent.

The scaling law research was simultaneously reassuring and sobering. Reassuring because it showed that progress was predictable and that the path to more capable models was clear, if expensive. Sobering because it suggested that the frontier of AI capability would be determined largely by who could afford the largest training runs — a resource competition that favoured a small number of well-capitalised organisations.


The Alignment Problem Becomes Concrete

As language models became more capable, the alignment problem — the question of whether AI systems would pursue objectives that were aligned with human values — became more concrete and more urgent.

The alignment problem had been discussed abstractly for decades. Wiener had raised it in the 1950s. Stuart Russell and Nick Bostrom had given it rigorous theoretical treatment in the 2000s and 2010s. A small community of AI safety researchers had been working on technical approaches to the problem for years.

But the problem had been largely theoretical, because the AI systems that existed were not powerful enough for misalignment to be a practical concern. A chess program that maximised its score while ignoring other objectives was not dangerous because chess programs could only play chess. Expert systems that optimised for narrow objectives were not dangerous because their scope was limited.

Language models that could write persuasively on any topic, answer questions about almost anything, assist with almost any language-based task — these were different. A language model that was optimised for a subtly wrong objective could, in principle, cause harm at scale. A model trained to produce text that users rated highly might learn to produce flattering, agreeable text regardless of its accuracy. A model trained to maximise engagement might learn to produce emotionally provocative content regardless of its truth value. A model trained to be helpful might learn to be helpful in ways that satisfied the immediate stated preferences of users without serving their deeper interests.

These were not just theoretical concerns. They were being observed in the behaviour of deployed systems. The tendency of language models to confidently produce false information — hallucination — was a concrete manifestation of the alignment problem: the model had learned to produce plausible-sounding text, not necessarily accurate text, because plausible-sounding text was what the training signal rewarded.

Reinforcement learning from human feedback (RLHF) was the technical approach that OpenAI and others developed to address this problem. Rather than training models purely on the statistics of text prediction, RLHF incorporated human judgments about the quality of model outputs — training models to produce outputs that human evaluators judged to be helpful, harmless, and honest.

InstructGPT, published by OpenAI in 2022, was the most prominent large-scale demonstration of RLHF for language models up to that point. InstructGPT was a GPT-3 model fine-tuned with RLHF on human preference data: demonstrations of desired model behaviour and human comparisons between model outputs. Human evaluators judged the resulting model to be more helpful, less likely to produce harmful content, and more likely to follow instructions accurately than the base GPT-3 model.
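Human comparisons between outputs are usually turned into a training signal through a reward model. One common formulation, sketched here under the assumption of a scalar reward per output, penalises the reward model whenever it scores the rejected output above the preferred one. The function name and the example scores are hypothetical. A full RLHF pipeline would then optimise the language model against the learned reward, which this fragment does not show.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise loss -log(sigmoid(r_preferred - r_rejected)), computed stably."""
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for two responses to the same prompt.
print(preference_loss(2.0, 0.0))  # ranking agrees with the human label: small loss
print(preference_loss(0.0, 2.0))  # ranking disagrees with the human label: large loss
```

Minimising this loss over many labelled comparisons pushes the reward model to rank outputs the way the human evaluators did.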

InstructGPT was the direct precursor to ChatGPT. The combination of large-scale pre-training and RLHF fine-tuning produced a model that was not just capable but usable: one that followed instructions, maintained a helpful and honest persona, and avoided the most problematic failure modes of purely prediction-trained models.


ChatGPT: The Moment AI Became Mainstream

On November 30, 2022, OpenAI released ChatGPT — a conversational interface built on top of an InstructGPT-style model, made freely available to the public.

The release was not accompanied by a dramatic public announcement or a media campaign. It was a quiet deployment — a product made available on OpenAI’s website for anyone to try. Within five days, it had one million users. Within two months, it had one hundred million users — the fastest consumer product to reach one hundred million users in history, according to analysis published at the time.

The speed of adoption reflected something genuine: ChatGPT was doing things that previous AI systems had not done, in a way that was immediately accessible to people without technical expertise. You could type a question in plain English and receive a thoughtful, well-written, and usually accurate response. You could ask for help writing an email, summarising a document, debugging code, explaining a concept, or brainstorming ideas, and the system would help, effectively and without requiring any special knowledge of how to prompt AI systems.

The reaction ranged from excitement to alarm. Teachers worried about students using ChatGPT to write essays. Lawyers worried about its accuracy for legal research. Doctors worried about patients using it for medical advice. Philosophers worried about what it meant that a machine could produce apparently thoughtful prose. Writers worried about their careers. Programmers worried too — GitHub Copilot, an AI coding assistant also built on large language models, was already demonstrating that AI could write code effectively.

Many of these concerns were legitimate. ChatGPT was not perfectly accurate — it hallucinated, it sometimes gave confidently wrong answers, it had knowledge limited to its training cutoff date. But it was useful enough, often enough, that people used it anyway — and continued to use it as they discovered both its capabilities and its limitations.

The ChatGPT moment was the moment that the deep learning revolution became visible to the general public. The AlexNet breakthrough of 2012 had been recognised by computer vision researchers. The BERT breakthrough of 2018 had been recognised by NLP researchers. The GPT-3 breakthrough of 2020 had attracted media attention but had not been broadly deployed. ChatGPT was deployed, was easy to use, and produced outputs that made its capabilities immediately tangible.


The Competitive Response: BERT’s Successors and the GPT Competitors

ChatGPT’s success triggered a competitive response from every major technology company. Google, Microsoft, Meta, Amazon, and a large number of startups all accelerated their AI development in the months following the ChatGPT release.

Google’s response was complicated by the fact that it had been at the frontier of language model development — BERT, the Transformer itself — but had not deployed these capabilities in consumer products at the speed that OpenAI had. The Bard chatbot, released in early 2023, was Google’s initial consumer response to ChatGPT. Gemini, released later in 2023, was the model that would power Google’s AI products going forward.

Microsoft had made a substantial investment in OpenAI and was deploying OpenAI’s models in its products — Bing Search received a ChatGPT-powered conversational mode, and GitHub Copilot was expanded to a broader coding assistant product. The Microsoft-OpenAI partnership gave Microsoft access to frontier AI capabilities without having to build them internally.

Meta released LLaMA — a series of open-weight language models that allowed researchers and developers to download and run the models locally. The open release was significant: it made frontier-quality language models available to a much wider range of users and developers than the API-based models from OpenAI and Google.

Anthropic, founded in 2021 by former OpenAI researchers including Dario and Daniela Amodei, released Claude — a language model specifically designed with safety considerations built into the training process from the beginning, not just added through RLHF fine-tuning.

The competitive landscape that emerged from ChatGPT’s success was complex and fast-moving. Different companies had different approaches, different safety philosophies, different deployment strategies. The question of which approach would prove most capable, most safe, and most commercially successful was being actively contested.


What the Transformer Left Unanswered

The Transformer architecture and the pre-training paradigm it enabled had transformed what language AI could do. But the transformation left important questions unanswered — questions that continue to structure AI research.

The understanding question. Do large language models understand language, or do they produce outputs that look like understanding without the underlying comprehension? The debate continues, with technical researchers, philosophers, and linguists offering different analyses of what “understanding” requires and whether current models achieve it. The practical resolution is partly to sidestep the philosophical question: whether or not models understand in a philosophically robust sense, they produce outputs that are useful, and the question of how to make those outputs more useful and more reliable is more tractable than the question of what understanding fundamentally is.

The reasoning question. Current language models can solve many reasoning problems — mathematical problems, logical puzzles, common-sense inferences — but they fail at others in ways that reveal something about the difference between their competence and human reasoning. The failures tend to occur in tasks that require systematic, step-by-step reasoning — tasks where the right answer depends on careful tracking of intermediate states rather than on pattern matching from training data. Whether these failures are fundamental to the current approach or will be addressed by further scale and better training methods is actively debated.

The grounding question. Language models learn from text, but human language understanding is grounded in embodied experience of the physical world. The meaning of words like “red,” “heavy,” “push,” and “collapse” is derived, for humans, from direct sensory and physical experience, not just from reading descriptions of those experiences. Whether large language models can develop adequate approximations of this grounding from textual descriptions alone, or whether genuine grounding requires physical embodiment, is one of the central disputes in current AI research.

The safety question. How do you ensure that increasingly capable language models are aligned with human values? RLHF has provided a partial answer, but it is known to have limitations — models can learn to produce outputs that human evaluators prefer without learning to be genuinely helpful and honest. The alignment problem, in a specific and technical sense, is unsolved, and it becomes more urgent as models become more capable.


The Architecture That Runs the World

Six years after its publication, the Transformer architecture has become the dominant computational structure for processing and generating language at scale. Every major language model — GPT-4, Claude, Gemini, LLaMA, Mistral — is a Transformer. The architecture that began as a solution to a specific problem in neural machine translation has become the foundation of a technological revolution.

The revolution has happened faster than almost anyone predicted. In 2017, the prevailing view in the AI research community was that human-level performance on most language tasks was years or decades away. By 2023, language models were achieving human-level or superhuman performance on many standard benchmarks. The capabilities that seemed most distinctively human (the ability to write creatively, to reason about complex situations, to explain things clearly, to engage in extended conversations) were being demonstrated by AI systems in ways that were, at the least, impressive and, at the most, genuinely unsettling.

The unsettling quality is not incidental. The Transformer architecture enables something that no previous AI architecture had enabled at this scale: a system that can engage with language in all its complexity — its ambiguity, its context-dependence, its grounding in shared human experience — in ways that feel, to human interlocutors, like genuine engagement. Whether the feeling is accurate, whether the engagement is genuine, is the question that the history of AI has been building toward for decades.

The Transformer has not answered this question. But it has made it impossible to ignore.


The Attention We Owe

The attention mechanism at the heart of the Transformer is, in one sense, just a computational operation: a weighted combination of values, where the weights are computed from learned query and key projections. In another sense, it is a model of something more: the idea that processing language — understanding what a word means in context, what a sentence is about, how an argument builds — is fundamentally a matter of attending to the right relationships.
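To make "a weighted combination of values" concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration under stated assumptions, not a production implementation. The function names and the toy numbers are invented for this example, and the query, key, and value vectors are supplied directly rather than produced by learned projections, as they would be in a real Transformer. Dividing the scores by the square root of the key dimension follows the original Transformer paper.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights): one output vector and one row of
    attention weights per query.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every key to this query: dot product, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the weighted combination of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three tokens with 2-dimensional vectors (made-up numbers).
# In a real model these would come from learned projections of embeddings.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token, including itself. The rows sum to 1, and the keys that align most closely with a query receive the largest weights.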

This idea has deep roots in cognitive science and linguistics. Meaning, in context, comes from relationships — between words, between sentences, between text and world. Understanding requires attending to the right relationships for the task at hand.

The Transformer implements a specific, mathematical version of this idea. The attention weights it computes are not the same as human attention — they are learned, numerical, defined over tokens rather than concepts. But they are, in a specific and traceable way, attending to relationships that matter for language processing.

The world that the Transformer has made — a world in which machines can engage with language in ways that were previously the exclusive province of human minds — is a world that raises questions about the nature of language, intelligence, and understanding that AI research has been building toward since Turing first asked whether machines could think.

The attention mechanism does not answer these questions. But it has brought them closer to being answerable, in ways that will shape the next stage of this story.


Further Reading

  • “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Devlin et al. (2018). The paper that demonstrated the power of bidirectional Transformer pre-training. One of the most cited NLP papers ever published, and essential for understanding the pre-training paradigm.
  • “Language Models are Unsupervised Multitask Learners” by Radford et al. (2019). The GPT-2 paper. An early large-scale demonstration of zero-shot task transfer and of the generative capabilities of scaled language models.
  • “Training Language Models to Follow Instructions with Human Feedback” by Ouyang et al. (2022). The InstructGPT paper. The technical foundation of ChatGPT, describing the RLHF approach that made large language models broadly useful.
  • “Constitutional AI: Harmlessness from AI Feedback” by Bai et al. (2022). Anthropic’s approach to AI alignment, extending RLHF with a constitutional approach to specifying what the model should and should not do.
  • “Sparks of Artificial General Intelligence: Early Experiments with GPT-4” by Bubeck et al. (2023). Microsoft Research’s analysis of the capabilities of an early version of GPT-4, exploring what it can and cannot do across a wide range of tasks.

Next in the Articles series: A17 — The Race to AGI: When Silicon Valley Decided to Change the World — The story of the companies, the investors, the research programmes, and the competitive dynamics that defined the AI industry from 2015 to 2025 — the people who believed they were building the most transformative and most dangerous technology in human history, and how that belief shaped what they built.


Minds & Machines: The Story of AI is published weekly. If the Transformer’s story — the attention mechanism, the pre-training revolution, the capabilities that nobody predicted — raises questions about what language AI is actually doing and what it means for us, share it with someone who would find those questions worth pursuing.