The Godfathers Go Underground

Pittsburgh, Pennsylvania. 1986. Geoffrey Hinton is at a dinner party with colleagues from Carnegie Mellon’s computer science department. He has been at Carnegie Mellon for four years, having moved there from UC San Diego, where he had been part of the cognitive science community that produced the backpropagation paper. He is one of the most intellectually respected researchers in the building.

The conversation turns to research directions, to where the field is going, to what the next decade will bring. Hinton says what he has been saying for years: the future of AI is neural networks. Learning-based systems. Distributed representations. The biological approach.

One of his colleagues — a respected researcher in the symbolic AI tradition, a person Hinton genuinely likes — shakes his head with a combination of affection and exasperation. “Geoff,” he says, “you’re going to keep working on this your whole career and it’s never going to amount to anything.”

Hinton smiles. He has heard this before. He will hear it again. He does not change his mind.

This is what the underground years looked like: not dramatic confrontations or public battles, but the accumulated social pressure of a field that had decided you were pursuing a dead end, expressed through a thousand dinners and seminars and grant review panels and editorial decisions. The neural network researchers were not persecuted. They were tolerated, occasionally celebrated for their ingenuity, and fundamentally not taken seriously as the future of the field they were in.

They kept working anyway. The story of why — and of what they built during the years when nobody was watching — is one of the most important stories in the history of science.

The Landscape of Marginalisation: What the Underground Looked Like

The period from roughly 1987 to 2006 — from the second AI winter through the beginning of the deep learning revival — was the underground period for neural network research. Understanding what this period looked like requires understanding both the specific ways that neural networks were marginalised and the specific ways that marginalised researchers sustained their work.

The marginalisation was not absolute. Neural network research did not disappear during this period. Papers were published, conferences were attended, students were trained. The International Joint Conference on Neural Networks continued to meet. IEEE Transactions on Neural Networks continued to publish. There was a functioning research community, with its own culture, its own leaders, its own internal debates.

But the neural network community was a minority within the broader AI and machine learning research landscape, and it was a minority that the mainstream did not take seriously as the path forward for the field. The most prestigious venues — NeurIPS (then called NIPS), ICML, and the top AI conferences — were not dominated by neural network research. The most prestigious grants were not primarily going to neural network researchers. The students who chose research areas based on where the field seemed to be going often chose statistical learning theory, support vector machines, graphical models, and other approaches that had the advantage of strong theoretical foundations and established academic credibility.

The machine learning mainstream through the 1990s and early 2000s was dominated by approaches with theoretical guarantees: support vector machines, which came with convergence proofs and generalisation bounds derived from statistical learning theory; Gaussian processes, which were theoretically well-founded Bayesian approaches to regression and classification; boosting algorithms like AdaBoost, which had elegant theoretical analysis and strong empirical performance.

These approaches were not wrong — they were often very good for the specific problems they were applied to. What they lacked was the ability to scale to the complex, high-dimensional data — images, audio, text — that would ultimately require the kind of deep, learned representations that neural networks could provide. But the limitation was not yet fully apparent, because the datasets were still small and the benchmarks were still manageable by the approaches that did not require deep representations.

Toronto: Hinton’s Island

Geoffrey Hinton’s move to the University of Toronto in 1987 established the institutional base from which the neural network underground would eventually emerge. The twenty-five years he spent at Toronto — before his partial move to Google in 2013 — defined what sustained neural network research through the lean years looked like at its best.

Toronto was not, in 1987, a particularly prominent centre of AI research. It was a solid Canadian university with good computer science — a department that had some distinguished researchers but was not in the same conversation as MIT, Stanford, or CMU for overall AI prestige. This relative obscurity was, in some ways, an asset. The pressures that dominated the prestige institutions — the need to publish in specific venues, to pursue research directions that could attract the most competitive funding, to engage with the field’s mainstream on the mainstream’s terms — were somewhat less acute at Toronto.

Hinton built a research group that had a specific culture. It valued theoretical understanding but was never purely theoretical — the goal was to build systems that worked, to demonstrate empirically that the neural network approach could deliver on its promise, and to understand theoretically why it worked when it did. It valued depth of engagement over breadth of coverage — Hinton’s group was not trying to cover all of machine learning, it was trying to understand a specific set of questions about learning in neural networks as deeply as possible.

The group was small. By the standards of the competitive AI labs at major American research universities, it was tiny: Hinton typically had a handful of PhD students and a few postdoctoral researchers at any given time. The smallness was partly a resource constraint, since Canadian research funding was limited, and partly a deliberate choice. Hinton preferred a small group in which everyone could engage deeply with everyone else’s work over a large group in which individual researchers were relatively isolated.

The students who came through this small, intense research group went on to become some of the most important figures in deep learning. Ruslan Salakhutdinov, who co-authored with Hinton the 2006 Science paper that began the deep learning revival. Yann LeCun, who was a brief postdoctoral visitor at Toronto before Bell Labs. Richard Zemel, who would become an important figure in deep learning and probabilistic modelling. Brendan Frey, who would contribute to graphical models and deep learning. And eventually Ilya Sutskever and Alex Krizhevsky, who would build AlexNet.

The common thread was not a specific research direction but a culture of serious engagement with hard problems, of willingness to follow the evidence regardless of what the mainstream was thinking, and of belief that the neural network approach would eventually prove itself.

The Funding Problem: How Research Survived Without Support

Research requires resources — computing time, graduate student stipends, postdoctoral salaries, conference travel, equipment. Neural network research during the underground period was chronically under-resourced by the standards of the competing approaches that dominated the mainstream.

The chronic under-resourcing was not random. It was a consequence of the funding mechanism of academic research. In AI and machine learning, the major funding sources — DARPA in the United States, the Canadian research councils, the European Framework Programmes — were not hostile to neural networks, but they were responsive to the field’s conventional assessments of what was promising. Reviewers on grant panels were drawn from the research community, and the research community’s assessment was that statistical learning theory, Bayesian methods, and kernel methods were more promising and more theoretically grounded than neural networks.

Neural network researchers developed strategies for surviving in this constrained environment.

Cross-disciplinary funding. Research that could be framed as cognitive science — as an investigation of how biological minds might work computationally — could sometimes attract funding from cognitive science sources. Hinton’s work was genuinely cognitive scientific, connecting to questions about how the brain represented information and learned from experience. This genuine connection to neuroscience and psychology provided access to funding streams that were not primarily about machine learning performance.

Industrial collaboration. Corporate research labs — Bell Labs, AT&T Research, NEC Research — were sometimes more willing to fund neural network research than government funding bodies, partly because they were less subject to academic peer review and partly because they had their own assessments of what was technically promising. LeCun’s long association with Bell Labs and AT&T Research was the most important example, and the ZIP code recognition deployment that he achieved there demonstrated that corporate environments could provide both the resources and the application pressure that advanced neural network research.

Canadian funding. The Natural Sciences and Engineering Research Council of Canada provided Hinton with more stable, if modest, funding than was easily available in the more competitive American funding environment. The Canadian system’s lower competitive pressure — grants were more reliably renewed for researchers with strong track records — provided a degree of continuity that was harder to achieve in the American system.

Doing more with less. Neural network researchers became expert at extracting maximum research value from minimum resources. The GPU repurposing that made AlexNet possible — using gaming graphics cards for neural network training — was an extreme example of this culture of resourcefulness. More generally, the underground period developed a research culture in which resource constraints were accepted as given and researchers competed to achieve the most with the least.

The Ideas That Kept the Programme Alive

The underground period was not just a period of survival — it was a period of intellectual progress, of the development of ideas that would eventually prove foundational to the deep learning revolution. Understanding what the neural network researchers were actually working on during these years is important for understanding where deep learning came from.

Restricted Boltzmann Machines. Hinton’s work on the Restricted Boltzmann Machine (RBM), a probabilistic generative model that could learn to represent the statistical structure of a dataset, was one of the most important intellectual contributions of the underground period. The model itself dated back to Paul Smolensky’s “harmonium” of the mid-1980s; what Hinton added was a practical way to train it. RBMs provided a way to do unsupervised learning in neural networks, that is, to learn useful representations without labelled data, which addressed one of the main criticisms of the supervised learning approach.

The RBM work led directly to the 2006 breakthrough. Hinton and Salakhutdinov’s demonstration that deep networks could be pre-trained using a stack of RBMs — with each RBM learning to represent the output of the previous one — provided a way around the vanishing gradient problem that had made training deep networks difficult. The pre-training approach initialised the deep network’s weights in a good region of parameter space, from which backpropagation fine-tuning could proceed effectively.
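The stacking idea can be sketched in a few lines. What follows is a toy illustration, not the procedure from the 2006 paper: it assumes one-step contrastive divergence (CD-1) as the training rule, which the passage above does not name, and the class `TinyRBM` and the function `pretrain_stack` are hypothetical names invented for this sketch. The point it shows is the greedy, layer-by-layer structure, in which each RBM is trained on the hidden activations of the one below it.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """A toy binary RBM (hypothetical helper, for illustration only)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.c))]

    def visible_probs(self, h):
        return [sigmoid(self.b[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b))]

    def cd1_update(self, v0, lr=0.1):
        """One CD-1 step (assumed training rule): data statistics minus
        statistics after a single reconstruction."""
        h0 = self.hidden_probs(v0)
        h0_sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        for i in range(len(self.b)):
            for j in range(len(self.c)):
                self.W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
            self.b[i] += lr * (v0[i] - v1[i])
        for j in range(len(self.c)):
            self.c[j] += lr * (h0[j] - h1[j])


def pretrain_stack(data, layer_sizes, epochs=5, seed=0):
    """Greedy layer-wise pre-training: train one RBM, then feed its hidden
    activations to the next RBM as if they were data."""
    rng = random.Random(seed)
    stack, inputs = [], data
    for n_hidden in layer_sizes:
        rbm = TinyRBM(len(inputs[0]), n_hidden, rng)
        for _ in range(epochs):
            for v in inputs:
                rbm.cd1_update(v)
        stack.append(rbm)
        inputs = [rbm.hidden_probs(v) for v in inputs]
    return stack


toy_data = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]]
stack = pretrain_stack(toy_data, layer_sizes=[4, 2])
print(len(stack), "RBMs trained, one per layer")
```

In a full system the weights of each trained RBM would be copied into the corresponding layer of a feed-forward network, and backpropagation would then fine-tune all layers together from that starting point.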

Distributed Representations. The theoretical understanding of distributed representations — of why representing information in patterns of activation across many units, rather than in the activation of individual units, was more powerful — was developed and deepened during the underground period. Hinton’s 1986 paper “Distributed Representations” had introduced the idea, but the subsequent two decades saw it developed into a comprehensive theoretical framework.

The power of distributed representations was twofold. First, they were efficient: n binary units could represent 2^n distinct patterns, compared to n distinct patterns with local representations. Second, they were generalisable: distributed representations naturally captured similarity — similar objects had similar patterns of activation, and properties of similar objects could be shared through the overlap in their representations.
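Both properties are easy to check on a small case. The sketch below is only an illustration: the function names are invented here, and the four features used to describe "cat", "dog", and "car" are hypothetical, chosen to make the overlap visible.

```python
from itertools import product


def local_codes(n):
    """Local (one-hot) coding: exactly one of n units is active."""
    return [tuple(1 if i == k else 0 for i in range(n)) for k in range(n)]


def distributed_codes(n):
    """Distributed coding: any on/off pattern across n binary units."""
    return list(product((0, 1), repeat=n))


def overlap(a, b):
    """Number of units that are active in both patterns."""
    return sum(x & y for x, y in zip(a, b))


n = 4
print(len(local_codes(n)), len(distributed_codes(n)))  # 4 16

# Hypothetical features: (animal, furry, has_wheels, barks)
cat = (1, 1, 0, 0)
dog = (1, 1, 0, 1)
car = (0, 0, 1, 0)
print(overlap(cat, dog), overlap(cat, car))  # 2 0
```

With one-hot codes every pair of distinct items has zero overlap, so nothing learned about one item transfers to another. With distributed codes, whatever is attached to the shared "animal" and "furry" units applies to both cat and dog.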

The attention mechanism that Bengio’s group developed in 2014, and the word embeddings that became standard in NLP in the 2010s, were specific implementations of the distributed representation idea that Hinton had been developing since the 1980s.

Long Short-Term Memory. Sepp Hochreiter and Jürgen Schmidhuber’s 1997 LSTM paper — which solved the vanishing gradient problem in recurrent networks through gating mechanisms — was one of the most important contributions of the underground period. The solution was not immediately recognised as the breakthrough it was, but LSTM would eventually become the dominant approach to sequence modelling and would enable applications in speech recognition, machine translation, and language modelling that would not have been possible with standard recurrent networks.
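A small numerical sketch makes the contrast concrete. It is a deliberately simplified picture, under stated assumptions: the recurrent network is reduced to a single scalar unit, and the LSTM is reduced to its direct cell-state path with the gate value held fixed, ignoring the gates' own dependence on earlier states. The function names and the numbers are chosen for illustration.

```python
import math


def rnn_gradient(w, h0, steps):
    """d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(...)^2)
        grad *= w * (1.0 - h * h)
    return grad


def cell_state_gradient(gate, steps):
    """d c_T / d c_0 along the additive path c_t = gate * c_{t-1} + (input term),
    with the gate treated as a fixed number (a simplifying assumption)."""
    return gate ** steps


steps = 100
print(rnn_gradient(w=0.9, h0=0.5, steps=steps))        # shrinks towards zero
print(cell_state_gradient(gate=0.99, steps=steps))     # stays of order one
print(cell_state_gradient(gate=1.0, steps=steps))      # passed through unchanged
```

In the plain recurrence every step multiplies the gradient by a factor below one, so the signal from a hundred steps back is effectively lost. Along the gated cell-state path the per-step factor can sit at or near one, which is what lets error signals survive across long sequences.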

Convolutional Networks. LeCun’s development and refinement of convolutional networks through the underground period — from the original LeNet through the 1998 paper and beyond — maintained and advanced the most practically successful neural network architecture of the era. The convolutional network was the template for AlexNet and for all subsequent computer vision deep learning architectures.
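The core operation of a convolutional layer is small enough to write out. This is a minimal sketch of the general idea, not LeNet itself: a single hand-written kernel is slid over a toy image, with no padding, no stride, and no learning, and the function name and example values are assumptions made for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over the image (no padding, stride 1).
    The same kernel weights are reused at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# Toy 4x4 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1]]  # responds where intensity rises from left to right

feature_map = conv2d_valid(image, edge_kernel)
print(feature_map[0])  # [0, 1, 0]

# Weight sharing: 2 kernel weights produce a 4x3 feature map, whereas a fully
# connected layer from 16 inputs to 12 outputs would need 16 * 12 weights.
shared_weights = len(edge_kernel) * len(edge_kernel[0])
dense_weights = (4 * 4) * (4 * 3)
print(shared_weights, dense_weights)  # 2 192
```

In a trained network the kernel values are learned rather than written by hand, and many kernels are applied in parallel and stacked in layers, but the reuse of one small set of weights across the whole image is the same.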

The Bell Labs Years: LeCun’s Productive Isolation

Yann LeCun’s years at Bell Labs and AT&T Research — from 1988 through the early 2000s — were the most productive period of his career in terms of foundational contributions, even as the broader machine learning field was focused elsewhere.

Bell Labs provided an unusual combination of resources and freedom. The institution’s tradition — established through the development of the transistor, information theory, and Unix — was to provide excellent researchers with excellent resources and trust them to pursue research that would eventually prove important, without requiring immediate commercial relevance. This tradition was under pressure through the corporate restructurings that affected AT&T through the 1990s, but it provided enough protection for LeCun to pursue a research agenda that the academic funding environment might not have supported.

The industrial context provided specific benefits. LeCun had access to real-world problems — not just image recognition on academic benchmarks but the specific challenges of reading handwritten text on checks and envelopes, of processing visual information at the scale of a major postal or banking operation. The ZIP code recognition deployment was not just a demonstration but a genuine production system, with all the engineering challenges that implied.

The production challenges were intellectually productive. LeCun’s convolutional network had to work not just on carefully curated research images but on the messy, varied, sometimes degraded images that arrived in real-world operation. The demands of production deployment drove improvements in the architecture, the training procedure, and the preprocessing pipeline that would not have been driven by purely academic research.

At the same time, the industrial environment had costs. The pace of publication was slower than in academia — Bell Labs was not a place that pressured researchers to publish prolifically, but neither was it a place where publishing was straightforwardly the primary product of the work. Influence on the academic field required academic publications, and the publication rate from LeCun’s Bell Labs years, while excellent in quality, was lower than it might have been in an academic environment.

The 1998 LeNet paper — “Gradient-Based Learning Applied to Document Recognition” — was the definitive publication of the convolutional network approach and became one of the most cited papers in machine learning history. But the four years between the paper’s publication and the NeurIPS 2002 conference where it began to receive the attention it deserved were years in which the approach was underestimated by the mainstream.

The Community That Kept the Faith

The underground period was sustained not just by individual researchers working in isolation but by a community — a network of people who shared the conviction that the neural network approach was right and who maintained that community through informal connections, shared conferences, and mutual support.

The annual Neural Information Processing Systems (NIPS) conference — now called NeurIPS — was the primary meeting ground for this community. NIPS had been founded in 1987, at almost exactly the moment the underground period began, and it became the venue where the neural network community gathered, shared results, argued about directions, and maintained the sense of collective intellectual project that isolated individual researchers could not sustain alone.

The NIPS community was broader than just neural network researchers — it included statisticians, neuroscientists, and other researchers interested in learning systems — but the neural network researchers were a significant and coherent presence within it. The conference provided a venue where Hinton, LeCun, Bengio, Schmidhuber, and their students could engage with each other’s work, debate interpretations, and maintain the collaborative-competitive relationship that drove intellectual progress.

Less visible but equally important were the informal connections — the email exchanges, the workshop conversations, the visits between labs — that allowed ideas to circulate and develop across institutional boundaries. Bengio spent time at both MIT and Bell Labs before establishing his Montreal group. LeCun visited Hinton at Toronto. Students moved between groups, carrying ideas across institutional boundaries. The community was small enough to be coherent but distributed enough to develop multiple perspectives on shared problems.

This community culture had specific characteristics that were different from the mainstream machine learning culture of the 1990s and early 2000s.

It was empirically oriented. The neural network community valued demonstrations that things worked, that specific architectures could learn specific tasks, over theoretical proofs that things were guaranteed to work under specific conditions. This empirical orientation was sometimes criticised by the statistically oriented mainstream as lacking rigour, but it was productive for developing practical systems.

It was biologically inspired. Many of the researchers in the neural network community maintained a genuine interest in neuroscience and cognitive science — in understanding how biological systems solved learning and representation problems, and in whether the solutions that evolution had found were informative for the design of artificial systems. This biological inspiration was sometimes seen as a weakness — as importing unjustified assumptions from biology — but it was also a source of creative ideas that the more theoretically pure approaches did not have access to.

It was patient with long-term investments. The underground period required researchers who could sustain interest in a direction that was not currently producing the most impressive benchmark results, who could invest in foundational work — the LSTM, the development of better training algorithms, the theoretical understanding of distributed representations — that would not pay off for years. The patience required by the underground period selected for researchers who were genuinely convinced of the approach’s promise, not just following the current fashion.

The Counterrevolution: SVMs and the Theoretical Mainstream

The period from roughly 1995 to 2010 was dominated, in the machine learning mainstream, by the support vector machine and its theoretical framework — Vapnik-Chervonenkis (VC) theory and structural risk minimisation. Understanding why SVMs dominated, and why their dominance eventually ended, illuminates what the underground period was actually about.

Vladimir Vapnik’s development of the support vector machine in the 1990s — building on theoretical work that he had been developing since the 1960s — produced a learning algorithm with several properties that the machine learning mainstream valued highly.

SVMs had theoretical guarantees. The VC theory that underpinned SVMs provided bounds on generalisation error — mathematically rigorous statements about how well a classifier trained on finite data would perform on new examples. These bounds were the kind of theoretical foundation that statistical machine learning researchers aspired to.

SVMs had convex optimisation. The SVM training problem was a convex quadratic program: a problem in which every local minimum is also a global minimum, so a standard solver was guaranteed to find the global optimum. Neural networks, by contrast, were trained by gradient descent on a non-convex objective, providing no such guarantees. The convexity of SVM training was seen as a major advantage: you knew you were finding the best solution, not just a good one.
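For readers who want to see the problem itself, the standard textbook soft-margin formulation is given below. The notation is an assumption of this sketch rather than anything fixed by the text: training pairs $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, weight vector $w$, bias $b$, slack variables $\xi_i$, and a penalty constant $C > 0$.

```latex
\min_{w,\,b,\,\xi} \;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \,\bigl(w^{\top} x_i + b\bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n.
```

The objective is a convex quadratic in $(w, b, \xi)$ and the constraints are linear, which is what makes this a convex quadratic program. A multi-layer network's loss, by contrast, involves products and compositions of weights across layers and has no such structure.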

SVMs worked well on the benchmarks of the era. On the standard machine learning benchmarks of the 1990s and early 2000s — handwritten digit recognition, text categorisation, protein structure prediction — SVMs performed well. Their performance was often better than neural networks trained with the limited resources and techniques available at the time.

The neural network researchers’ response to SVM dominance was telling. They did not abandon their approach. They argued — sometimes publicly, sometimes among themselves — that the advantages of SVMs were advantages for a specific kind of problem on a specific kind of data. For complex, high-dimensional data — images, audio, natural language — the ability to learn hierarchical representations through multiple layers of non-linear transformation was more important than the theoretical guarantees and convex optimisation of SVMs.

This argument was right but difficult to demonstrate convincingly with the data and computing resources available at the time. The datasets where SVMs outperformed neural networks were the datasets that the field was using as benchmarks. The datasets where neural networks would outperform SVMs — large, diverse, messy datasets like ImageNet — did not yet exist or were not yet the field’s primary evaluation standard.

The underground period was, in part, a long wait for the world to catch up with the evidence that the neural network researchers needed to make their case convincingly.

The Data Problem: Waiting for the World

One of the most important constraints on neural network research during the underground period was data. Deep networks needed large, diverse datasets to learn useful representations. The datasets available through most of the underground period were too small to allow deep networks to demonstrate their potential.

The MNIST dataset — handwritten digit recognition, sixty thousand training images — was a reasonable benchmark for demonstrating that learning algorithms could recognise handwritten digits. LeNet performed well on MNIST. But MNIST was too small and too simple to demonstrate the advantages of deep representations over shallower approaches. SVMs and other methods also performed well on MNIST; the dataset was not complex enough to discriminate between approaches.

The CIFAR-10 dataset (fifty thousand training images across ten categories of natural images) was more challenging but still too small for deep networks to realise their potential. Deep networks trained on CIFAR-10 tended to overfit unless regularisation was carefully applied.

The data problem was related to the computing problem but distinct from it. Even with sufficient computing power, a deep network trained on too little data would overfit. What neural network research needed was a dataset large enough, diverse enough, and complex enough that deep representations — the hierarchical features learned through multiple layers — would prove demonstrably superior to the shallow representations that SVMs and other approaches used.

ImageNet was that dataset. When it became available in 2009 and the ILSVRC competition began in 2010, the environment changed. For the first time, there was a dataset large enough and complex enough that deep networks could demonstrate their advantages. The AlexNet result in 2012 was the consequence: given sufficient data, sufficient computing, and the right architecture, deep networks outperformed every alternative by a dramatic margin.

The underground period was, partly, a period of waiting for the data infrastructure that would allow neural networks to prove themselves. The researchers who built ImageNet — particularly Fei-Fei Li — were not themselves neural network researchers, but they built the resource that made the neural network revolution possible.

The Computing Problem: GPU Computing Changes Everything

The other major constraint on neural network research during the underground period was computing. Training large neural networks on large datasets required computational resources that were simply not available to academic researchers — or were available only at costs that were prohibitive for research groups working on modest academic grants.

The computing constraint was genuine and important. Neural networks trained with backpropagation required millions or billions of multiply-accumulate operations per training example, repeated over millions of examples, potentially for hundreds of training passes. Even a relatively modest deep network trained on a medium-sized dataset could require weeks of computation on the hardware available to academic researchers in the 1990s and early 2000s.
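A back-of-envelope calculation shows how quickly these numbers add up. Every figure below is a hypothetical round number chosen for illustration, not a measurement of any real network or machine, and `training_days` is a name invented for this sketch.

```python
SECONDS_PER_DAY = 86_400


def training_days(macs_per_example, examples, epochs, macs_per_second):
    """Rough training time: total multiply-accumulates divided by sustained
    throughput. macs_per_example is assumed to cover forward and backward passes."""
    total_macs = macs_per_example * examples * epochs
    return total_macs / macs_per_second / SECONDS_PER_DAY


# Hypothetical workload: 10 million MACs per example, 1 million examples,
# 100 passes over the data.
cpu_days = training_days(1e7, 1e6, 100, macs_per_second=1e9)
print(round(cpu_days, 1), "days at an assumed 1e9 MACs per second")

# The same workload with an assumed 50x higher sustained throughput.
fast_days = training_days(1e7, 1e6, 100, macs_per_second=50e9)
print(round(fast_days, 2), "days at an assumed 50x the throughput")
```

Under these assumptions a single training run takes more than a week, and a research project typically needs many runs to tune architectures and hyperparameters, so the practical cost is a multiple of that figure.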

The computing situation changed dramatically with the emergence of NVIDIA’s CUDA platform in 2007, a programming interface that allowed general-purpose computing on NVIDIA graphics processing units. GPUs had been designed for the specific computational demands of rendering 3D graphics: massively parallel processing of many small operations on floating-point data. These demands turned out to be similar to those of neural network training, which also consists of many parallel multiply-accumulate operations on floating-point weights and activations.

Researchers who discovered that GPUs could dramatically accelerate neural network training — achieving speedups of one to two orders of magnitude over CPU training — found that they had been handed a resource that changed the economics of neural network research. Training that had previously required weeks could now be done in days. Training that had been economically infeasible was now affordable. Networks that had been too large to train were now within reach.

Alex Krizhevsky, who would build AlexNet, was one of the first researchers to systematically exploit GPU computing for neural network training. His development of cuda-convnet — a highly optimised CUDA implementation of convolutional network training — was a technical achievement that made AlexNet practically possible. The AlexNet paper is, among other things, a report of the engineering work required to make GPU-accelerated deep learning efficient.

The GPU computing revolution was not solely responsible for the AlexNet breakthrough (ImageNet data and the convolutional architecture were equally necessary), but it was an enabling condition. Without the ability to train a deep network on ImageNet in days rather than months, the AlexNet result would not have happened when it did.

The Slow Accumulation: Results That Built a Case

Through the underground period, neural network researchers were building a case — an accumulating body of evidence that the approach worked, that it could be applied to increasingly challenging problems, that its specific capabilities were different from and complementary to those of the statistical mainstream.

The case was built slowly, result by result, across years and decades.

Speech recognition. The application of deep learning to speech recognition — in particular, the work of Hinton’s group with researchers at Microsoft and Google in the early 2010s — produced dramatic improvements in word error rate that demonstrated the power of deep representations for acoustic modelling. The improvements were large enough to convince the speech recognition community, which had been dominated by GMM-HMM approaches for thirty years, that deep learning was the right direction.

Natural language modelling. Bengio’s development of neural language models — neural networks trained to predict the next word in a sequence — demonstrated that learned word representations could capture semantic and syntactic relationships that traditional n-gram models could not. The word vector representations that emerged from neural language models — later refined into Word2Vec and GloVe embeddings — became a foundational tool in natural language processing.

Object recognition. LeCun’s convolutional network applications to character recognition, object detection, and image segmentation demonstrated that learned visual features were more powerful than hand-engineered features for visual recognition tasks. The demonstrations were compelling but limited to the datasets available — larger datasets and more computing power would have produced more dramatic results earlier.

Generative modelling. The development of restricted Boltzmann machines, deep belief networks, and eventually variational autoencoders and generative adversarial networks demonstrated that neural networks could not only classify data but generate data — could learn the statistical structure of a dataset well enough to produce new examples that shared the dataset’s properties.

Each of these results contributed to the case. Individually, none of them was decisive — the statistical mainstream could always argue that the specific results could have been achieved with simpler methods, that the advantages were not large enough to justify the additional complexity, that the theoretical guarantees of competing approaches were worth the performance cost.

Collectively, the results were building toward a case that would eventually become undeniable: that learned hierarchical representations were genuinely better for complex, high-dimensional data than shallow approaches, and that the advantages would become larger, not smaller, as the data became richer and more plentiful.

The Moment of Vindication: 2012 and Its Aftermath

The AlexNet result in 2012 was not a surprise to the underground community. It was a vindication.

Hinton, LeCun, and Bengio had been expecting something like this result for years; they had been working toward it, building the architectures and algorithms and theoretical understanding that would make it possible. What they had not been certain of was the timing and the scale of the effect. The magnitude of AlexNet’s improvement over the previous year’s best result, a relative reduction in error of roughly 40% from a single entry, was more dramatic than most people had predicted.

The vindication was not just technical. It was social and institutional. After years of being told that the neural network approach was misguided, that it would never amount to anything, that the theoretical foundations were inadequate, the research community’s most high-profile benchmark had been decisively won by a neural network system trained by a group that had been working in the tradition they had spent decades establishing.

The response of the machine learning community was surprisingly rapid. Within a few months of the 2012 result, the conversation in machine learning had shifted. The most competitive researchers were pivoting toward deep learning. Graduate students who had been planning to work on kernel methods or graphical models were reconsidering. The major technology companies, among them Google, Facebook, Microsoft, and Baidu, were accelerating their investment in deep learning research and hiring the neural network researchers who had been working in the margins of the field.

The speed of the transition reflected how clearly the evidence pointed in one direction. AlexNet was not a marginal improvement on a contested benchmark. It was a decisive result on the most important benchmark in the field. The statistical mainstream’s theoretical advantages — the convergence proofs, the generalisation bounds, the convex optimisation — were less compelling than 10 percentage points of error rate improvement on the most important computer vision dataset in the world.

What the Underground Period Built That Would Not Have Been Built Otherwise

The underground period was not merely a period of waiting — it was a period of building, during which specific ideas, architectures, and theoretical frameworks were developed that proved foundational to the deep learning revolution.

If neural network research had been part of the mainstream from 1987 to 2006 — if it had been as well-funded, as competitive, and as focused on near-term results as the statistical mainstream — would it have developed the same ideas? Probably not in the same way.

The underground period developed ideas with long time horizons — the LSTM took a decade to become widely used after its publication; the distributed representation framework took two decades to find its full expression in word embeddings; the convolutional architecture took fifteen years from its first large-scale deployment to its recognition as the dominant approach. These long-horizon ideas required a research culture that could sustain investment through extended periods of apparent unproductivity.

The underground period also developed a culture of empirical commitment — of investing in demonstrations that things worked, in building systems that could be deployed, in accumulating empirical evidence even when theoretical understanding lagged. This culture was less prestigious than the theoretical culture of the statistical mainstream but more directly connected to the practical demonstrations that would eventually prove decisive.

The community that the underground period produced was itself a valuable product: a network of researchers who shared the neural network commitment, who knew each other’s work, and who had debated ideas across years of conferences and workshops. When the moment of vindication came, there was a ready community of researchers who could rapidly extend the breakthrough, apply it to new domains, and develop the theoretical understanding that the breakthrough’s success demanded.

The Personal Cost: What the Underground Required

The underground period exacted personal costs from the researchers who sustained it. These costs are worth acknowledging honestly, both because they are part of the story and because they illuminate what it means to maintain an unpopular conviction through years of institutional pressure.

The social cost was real. Being a neural network researcher during the 1990s meant occupying a marginal position within the machine learning community — being tolerated but not taken seriously as a leader, being included in the conversation but not defining its direction. The most competitive positions, the most prestigious grants, the most influential review committees were occupied primarily by the statistical mainstream. Neural network researchers had to work harder, compete more aggressively, and make a stronger case for the importance of their work than their mainstream colleagues who had the field’s prevailing assessment on their side.

The personal motivation to continue — to maintain the conviction that the approach was right despite years of evidence that the field did not agree — required a specific kind of intellectual character. Not stubbornness, exactly — the good neural network researchers were not dogmatic about their specific approaches, were willing to revise specific claims when the evidence demanded it. But a kind of principled contrarianism, a willingness to trust one’s own assessment of the evidence over the field’s consensus, that is both admirable and socially costly.

Hinton has described, in interviews, the experience of being told repeatedly that his research direction was wrong while maintaining his conviction that it was right. The experience was not comfortable. It required intellectual confidence rather than arrogance: arrogance would have made it difficult to revise specific ideas, whereas confidence in the overall direction could coexist with openness to revising specific claims.

The opportunity cost was also real. Researchers who spent the underground period working on neural networks were not working on approaches that the field valued more highly. They were making career choices that, had they turned out to be wrong, would have been costly. The career risk of sustained commitment to an unfashionable approach is real, and the underground period required researchers to accept that risk.

That the risk paid off — that the approach vindicated itself as thoroughly as it did — does not make the risk retrospectively trivial. The underground researchers did not know in 1995 or 2000 that they were right. They continued because they believed they were right, not because they had certainty. The willingness to act under uncertainty, to make career-defining commitments based on judgment rather than consensus, is one of the more admirable features of the underground period’s leading figures.

The Legacy of the Underground: What the Field Learned

The underground period left a specific legacy in the culture of machine learning research — a set of lessons that the field has absorbed, with varying degrees of success, from the experience of the neural network community.

Empirical evidence can overturn theoretical consensus. The neural network community maintained a research direction that lacked the theoretical foundations that the statistical mainstream valued. When the empirical evidence became sufficiently clear, as it did when AlexNet demonstrated what deep learning could do at scale, it proved more persuasive than years of theoretical argument. The lesson is not that theory is unimportant but that empirical performance on the right benchmarks is an independent form of evidence that can challenge theoretical arguments.

Scale changes the picture. The advantages of deep learning were not visible at small scale — on small datasets with limited computing power, SVMs and other shallow methods were often competitive with or superior to deep networks. The advantages of deep learning became visible only at larger scale — with larger datasets and more computing power. The lesson is that evaluation at one scale may not predict performance at another scale, and that research directions that look unpromising at small scale may prove transformative at large scale.

Research communities need patience. The underground period required decades of patient investment before the payoff. The research culture that enabled this patience — the acceptance of long time horizons, the willingness to invest in foundational work before applications were clear, the maintenance of community across years of unfashionability — is a specific kind of institutional asset that is difficult to create and easy to destroy.

Contrarianism is occasionally right. The neural network community was contrarian — it maintained a position that the mainstream had rejected. Contrarianism is usually wrong; the mainstream is usually right for good reasons. But occasionally the mainstream is wrong, and the value of maintaining a contrarian position is precisely that it preserves an alternative when the mainstream’s consensus turns out to be mistaken. The underground period is one of the clearest examples in the history of science of a contrarian position that turned out to be right.

The Comeback Complete: From Underground to Revolution

The story of the underground period ends not with the AlexNet result of 2012 but with the transformation of the field that followed it — the rapid, almost complete shift of machine learning research toward deep learning that made the underground period’s ideas the mainstream within a few years.

By 2014, deep learning was the dominant approach in computer vision. By 2015, it was dominant in speech recognition. By 2017, with the development of the Transformer architecture, it was dominant in natural language processing. By 2020, it was the primary approach for virtually every major AI application.

The researchers who had sustained the underground period were recognised, eventually, as having been right. The Turing Award came, shared by Hinton, LeCun, and Bengio as the 2018 recipients. The positions at major technology companies came: Hinton at Google, LeCun at Facebook (later Meta), and Bengio as the founder of Mila with close industry ties. The graduate students who had chosen to work in the underground tradition found themselves in enormous demand, recruited by every major technology company.

The vindication was complete. But it would not have come without the underground — without the decades of patient work, the sustained belief in a direction that the field had rejected, the specific ideas and architectures and theoretical frameworks that the underground period had developed.

The most important research programme in modern AI survived by going underground. That is one of the central facts about how deep learning came to transform the world.