The Bitter Lesson

What seventy years of AI research keeps teaching us

The Man Behind the Observation

In March 2019, Rich Sutton published a short essay on his personal website titled "The Bitter Lesson." It ran barely thirteen paragraphs. It contained no equations, no graphs, no experimental results. And yet it may be the single most important piece of writing about artificial intelligence strategy produced in the last two decades.

To understand why this essay carries so much weight, you have to understand who wrote it. Richard S. Sutton is not a commentator or a journalist covering AI. He is one of the people who built it. Sutton is widely regarded as the father of modern reinforcement learning, the branch of machine learning concerned with how agents learn to make sequential decisions by interacting with an environment. His 1984 doctoral thesis at the University of Massachusetts Amherst laid the groundwork for temporal-difference learning, a family of algorithms that would go on to power everything from backgammon-playing programs to systems that taught themselves to play Atari games, then Go, and then to navigate the open-ended complexity of the real world.

His textbook, Reinforcement Learning: An Introduction, co-authored with Andrew Barto, is the definitive reference in the field. It has been cited tens of thousands of times. Sutton has held positions at AT&T Labs, the University of Alberta, and DeepMind. He has spent over four decades watching AI research unfold from the inside, observing which approaches succeed and which fail, which ideas persist and which get swept away. He has the rare combination of deep technical expertise and the long historical memory necessary to see patterns that span entire generations of research.

When Rich Sutton says he has noticed something important about AI, the field listens. And what he noticed was something that most researchers already knew, at some level, but did not want to accept.

The Core Thesis

The bitter lesson is this: over the seventy-year history of AI research, general methods that leverage computation have ultimately proven more effective than specialized methods that leverage human knowledge. And this is not a close contest. It is not that general methods win slightly or win in certain domains. They win overwhelmingly, repeatedly, and across every domain where the comparison has been made.

Sutton puts it with characteristic directness:

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

The word "bitter" is carefully chosen. The lesson is bitter because it runs directly counter to the instincts and motivations of the researchers who do the work. AI researchers are, by nature and training, people who understand domains deeply. They understand the structure of language, the geometry of vision, the strategy of games. They want to build that understanding into their systems. They want to encode the fruits of human insight, to give machines a head start by sharing what we already know.

And the bitter lesson says: that impulse, however natural and well-intentioned, is ultimately counterproductive. The approaches that win are the ones that do not try to encode human knowledge but instead create general frameworks that can discover their own representations, their own strategies, their own understanding, given enough computation.

The Pattern in Chess

Sutton's essay draws on several historical case studies to make his point. The first and perhaps most vivid is computer chess.

For decades, the dominant approach to computer chess was to encode human chess knowledge into the system. Researchers worked with grandmasters to develop sophisticated evaluation functions that could assess a position's quality based on material balance, pawn structure, king safety, piece activity, and dozens of other strategic considerations that strong human players had identified over centuries of play. The programs were, in a sense, containers for human chess expertise, translated into code.

This approach worked. Programs improved steadily. But the breakthroughs came from a different direction. Deep Blue, the IBM system that defeated world champion Garry Kasparov in 1997, succeeded not primarily because it had better chess knowledge than its predecessors, but because it could search roughly 200 million positions per second. It used specialized hardware to apply brute-force search at a scale that was previously impossible. Deep Blue did contain hand-crafted chess knowledge in its evaluation function, but its competitive advantage was raw computational search depth.
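To make the recipe concrete, here is a toy sketch, in Python, of the knowledge-rich approach as it looked in that era: a hand-crafted evaluation function built from human-chosen features, plugged into brute-force depth-limited search. The Position interface (material, mobility, doubled_pawns, king_safety, legal_moves, apply, is_terminal) is hypothetical and invented for illustration; Deep Blue's real evaluation function had thousands of features and ran in custom hardware, so treat this only as the shape of the idea.

    # Toy sketch of a knowledge-rich chess engine: human-chosen features with
    # hand-tuned weights, combined with brute-force depth-limited search.
    # The Position interface used below is hypothetical, not Deep Blue's code.

    PIECE_VALUES = {"P": 1.0, "N": 3.0, "B": 3.1, "R": 5.0, "Q": 9.0}

    def evaluate(position):
        """Every term here is a piece of encoded human chess knowledge."""
        score = sum(PIECE_VALUES[p] * n for p, n in position.material().items())
        score += 0.10 * position.mobility()        # piece activity
        score -= 0.20 * position.doubled_pawns()   # pawn structure
        score += 0.30 * position.king_safety()     # king safety
        return score

    def negamax(position, depth):
        """Brute force: with a fixed evaluation, strength comes from search depth."""
        if depth == 0 or position.is_terminal():
            return evaluate(position)
        return max(-negamax(position.apply(move), depth - 1)
                   for move in position.legal_moves())

The lesson of Deep Blue sits in the second function, not the first: strength came less from making evaluate smarter and more from pushing the search deeper and faster.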

Many in the chess programming community were disappointed by this result. It felt like cheating, or at least like the wrong kind of achievement. A program that truly "understood" chess, they felt, should not need to search so deeply. It should be able to evaluate a position with the kind of intuitive grasp that a grandmaster brings to the board.

Then, in 2017, DeepMind's AlphaZero made the debate moot. AlphaZero was given nothing but the rules of chess. No opening books, no endgame tablebases, no hand-crafted evaluation functions, no strategic concepts from human play. It learned entirely through self-play, using a deep neural network and Monte Carlo tree search. After four hours of training it was already playing at superhuman strength; after nine hours it defeated Stockfish, the strongest traditional chess engine in the world, in a hundred-game match, winning 28 games, drawing 72, and losing none.
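The training recipe itself is strikingly simple to state. The sketch below shows, in schematic Python, the general shape of an AlphaZero-style self-play loop; the helpers (new_game, mcts_search, sample_move, update_network) are hypothetical placeholders, and DeepMind's actual system differs in many details. The point is what goes in: the rules, a network, and computation, and nothing else.

    # Schematic sketch of an AlphaZero-style training loop. The helpers named
    # below are hypothetical placeholders, not DeepMind's implementation.

    def self_play_game(network, rules, simulations=800):
        """Play one game against itself, recording search-improved move targets."""
        game, trajectory = rules.new_game(), []
        while not game.is_over():
            # MCTS uses the network's policy and value outputs to guide simulations.
            visit_counts = mcts_search(game, network, simulations)
            trajectory.append((game.state(), visit_counts))
            game.play_move(sample_move(visit_counts))
        return [(state, counts, game.outcome()) for state, counts in trajectory]

    def train(network, rules, iterations, games_per_iteration=100):
        for _ in range(iterations):
            examples = []
            for _ in range(games_per_iteration):
                examples += self_play_game(network, rules)
            # Fit the policy head to the search's visit counts and the value head
            # to the final game result -- no human games, no hand-crafted features.
            update_network(network, examples)
        return network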

AlphaZero did not just win. It played chess in a style that grandmasters described as alien, creative, and beautiful. It sacrificed material in ways that violated conventional chess wisdom. It discovered strategic ideas that humans had never considered in centuries of play. The system that encoded zero human chess knowledge ended up playing more creatively, more beautifully, and more effectively than any system that had been built on human expertise.

The Pattern in Go

The story of Go is even more dramatic, because the game was long considered fundamentally resistant to the brute-force approach that had worked in chess. Go's branching factor is roughly 250 compared to chess's roughly 35. The number of possible board positions exceeds the number of atoms in the observable universe. For decades, researchers believed that strong Go play would require the kind of pattern recognition and intuitive judgment that only human-like understanding could provide. They built systems around hand-crafted heuristics, influence maps, and pattern databases derived from expert play.

These systems were mediocre. For years, the best Go programs played at a level that any moderately experienced human amateur could defeat. The gap between machine and expert human performance in Go was used as an argument that AI needed more sophisticated, knowledge-rich approaches, that brute force and general methods had hit their ceiling.

Then AlphaGo, again from DeepMind, defeated world champion Lee Sedol in 2016, using deep neural networks trained on human games combined with reinforcement learning through self-play. Its successor, AlphaGo Zero, went further: it was trained entirely from self-play with no human game data at all, and it surpassed the version that had beaten Lee Sedol within 72 hours. AlphaZero later generalized this approach across chess, Go, and shogi simultaneously, using the same architecture and algorithm for all three games.

The domain that was supposed to be the proof case for knowledge-rich, human-guided AI instead became the proof case for the bitter lesson. The methods that worked were the ones that threw away human knowledge and let computation discover everything from scratch.

The Pattern in Speech Recognition

Sutton cites speech recognition as another instance of the same pattern. For decades, speech recognition research was dominated by approaches grounded in human knowledge of linguistics. Researchers built systems around phonemes, the basic units of sound that linguists had identified. They encoded knowledge of phonotactics (which sound combinations are legal in a given language), prosody (the rhythm and intonation of speech), and grammar. They built elaborate pipelines that moved from acoustic signals to phoneme hypotheses to word hypotheses to sentence hypotheses, each stage informed by expert knowledge of how human language works.

These systems were fragile and complex. They required enormous engineering effort to build and maintain. They broke in noisy environments. They struggled with accents, dialects, and informal speech. They improved slowly, incrementally, through painstaking refinement of their knowledge-rich components.

The revolution came when researchers moved to statistical methods, first Hidden Markov Models and then deep neural networks, that learned their representations directly from large amounts of data. Modern speech recognition systems, like those in your phone or smart speaker, use end-to-end deep learning that takes raw audio waveforms as input and produces text as output. They have no explicit representation of phonemes, no built-in knowledge of phonotactics, no hand-crafted linguistic rules. They work by learning statistical patterns from vast quantities of transcribed speech, powered by enormous amounts of computation.
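What "end-to-end" means in practice can be shown in a few lines. The following is a minimal sketch in PyTorch of the raw-waveform-in, characters-out shape of such a system, trained with a CTC loss; the layer sizes are arbitrary and this is nowhere near a production recognizer. Notice what is absent: no phonemes, no pronunciation dictionary, no linguistic rules.

    # Minimal end-to-end speech recognizer sketch: raw audio in, character
    # logits out, trained with CTC. Sizes are arbitrary; illustration only.
    import torch
    import torch.nn as nn

    class EndToEndASR(nn.Module):
        def __init__(self, n_chars=29):                # characters plus the CTC blank
            super().__init__()
            # Strided 1-D convolutions turn the raw waveform into frame features.
            self.encoder = nn.Sequential(
                nn.Conv1d(1, 64, kernel_size=80, stride=16), nn.ReLU(),
                nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            )
            self.rnn = nn.LSTM(128, 256, batch_first=True, bidirectional=True)
            self.head = nn.Linear(512, n_chars)

        def forward(self, waveform):                   # waveform: (batch, samples)
            x = self.encoder(waveform.unsqueeze(1))    # (batch, 128, frames)
            x, _ = self.rnn(x.transpose(1, 2))         # (batch, frames, 512)
            return self.head(x)                        # (batch, frames, n_chars)

    model = EndToEndASR()
    logits = model(torch.randn(2, 16000))              # two 1-second clips at 16 kHz
    log_probs = logits.log_softmax(-1).transpose(0, 1) # (frames, batch, n_chars)
    targets = torch.randint(1, 29, (2, 12))            # dummy character transcripts
    # CTC aligns frame-level predictions to the transcript with no phoneme labels.
    loss = nn.CTCLoss()(log_probs, targets,
                        input_lengths=torch.full((2,), log_probs.size(0)),
                        target_lengths=torch.full((2,), 12))

Scale the layers and the training data up by several orders of magnitude and you have, in outline, the recognizers running in today's phones and smart speakers.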

And they work dramatically better than any knowledge-rich system ever did. They handle noise, accents, and informal speech with a robustness that hand-engineered systems never achieved. The cumulative knowledge of decades of linguistic research, laboriously encoded by experts, was surpassed by systems that knew nothing about linguistics but had access to enough data and computation.

The Pattern in Computer Vision

Computer vision tells the same story. For decades, researchers built vision systems around hand-crafted features: edge detectors, corner detectors, texture descriptors, shape models. They encoded human knowledge about what makes visual features distinctive and informative. Systems like SIFT (Scale-Invariant Feature Transform) and HOG (Histogram of Oriented Gradients) were engineering marvels, carefully designed to capture the kinds of visual information that human researchers judged to be important.

In 2012, AlexNet, a deep convolutional neural network, won the ImageNet Large Scale Visual Recognition Challenge by a margin that shocked the field. AlexNet learned its own features from data. Its early layers learned edge detectors and texture analyzers. Its later layers learned to recognize complex shapes and object parts. But none of these representations were hand-designed. They emerged from training on millions of labeled images using backpropagation and large-scale GPU computation.
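The shift can be stated in a few lines of code. In a convolutional network, the visual "features" are just weight tensors, initialized randomly and shaped entirely by gradient descent on labeled data. The PyTorch fragment below uses an AlexNet-like first layer (96 filters of size 11 by 11 with stride 4); everything else about it, including the dummy data and loss, is an arbitrary stand-in for illustration.

    # In a convnet, features are learned parameters, not designed artifacts.
    import torch
    import torch.nn as nn

    conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)   # AlexNet-like first layer
    print(conv1.weight.shape)        # torch.Size([96, 3, 11, 11]): 96 learnable filters

    # Before training these filters are noise; after training on ImageNet-scale
    # data, many come to resemble the edge and texture detectors that researchers
    # once designed by hand.
    images = torch.randn(8, 3, 224, 224)                 # a dummy batch of images
    features = conv1(images)                             # (8, 96, 54, 54)
    loss = features.mean()                               # stand-in objective
    loss.backward()                                      # gradients flow into the filters
    print(conv1.weight.grad.shape)   # the "feature design" happens here, via backprop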

The features that AlexNet learned were not identical to the ones that human researchers had designed. In many cases, they were richer, more nuanced, and more effective. The system discovered visual representations that decades of human expertise in computer vision had not identified. Within a few years, hand-crafted features were almost entirely abandoned by the computer vision community. The knowledge-rich approach did not just lose. It became obsolete.

The Recurring Dynamic

Sutton identifies a consistent pattern across all of these domains. The cycle goes like this:

First, a problem area emerges that is poorly understood and difficult for machines. Early approaches use whatever tools are available and make limited progress. Researchers then bring domain expertise to bear, encoding human knowledge into their systems. Performance improves. Papers are published. Careers are built. A community forms around the knowledge-rich approach, with shared assumptions, benchmarks, and methods.

Then, someone tries a simpler, more general approach that relies less on human knowledge and more on computation and learning. Initially, this approach may perform worse than the knowledge-rich systems. The community is skeptical. The general approach seems naive, unprincipled, even intellectually lazy. Why would you throw away decades of carefully accumulated domain knowledge?

But computation keeps getting cheaper. Moore's Law, or its functional equivalents, keeps delivering more processing power. The general approach scales with computation in a way that the knowledge-rich approach cannot. Within a few years, or sometimes within a few months, the general approach overtakes the specialized one. And it does not just overtake it marginally. It renders the knowledge-rich approach completely uncompetitive.

The researchers who spent years building the knowledge-rich systems are left with a difficult realization. Their deep understanding of the domain, which they encoded so carefully into their systems, was not just unnecessary. It was, in a sense, an obstacle. It locked them into particular representations and particular ways of thinking about the problem. The systems that discarded all of that knowledge were free to discover better representations, ones that no human expert had imagined.

Why It Is Bitter

The lesson is bitter for profoundly human reasons. It is bitter because AI researchers are smart people who have spent years, sometimes decades, acquiring deep expertise in specific domains. They understand how language works, how vision works, how strategic reasoning works. They have Ph.D.s and publication records that reflect this expertise. And the bitter lesson says: your expertise, however real and hard-won, is less valuable than you think.

It is bitter because it suggests that the most productive thing an AI researcher can do is not to study a domain more deeply, but to develop more general methods and then wait for computation to catch up. This feels wrong. It feels intellectually defeatist. It feels like giving up on understanding.

It is bitter because it implies something uncomfortable about the nature of human knowledge itself. When a system that knows nothing about chess plays more creatively than any human grandmaster, when a system that knows nothing about linguistics handles speech better than any linguist-designed system, it raises a question that most researchers would rather not confront: is human knowledge, in many domains, actually a compressed and impoverished approximation of patterns that are better discovered from scratch by a sufficiently powerful learning system?

Sutton is careful to note that this does not mean human knowledge is worthless in general. It means that the attempt to build human knowledge into AI systems is ultimately counterproductive. Human knowledge is valuable for humans. But encoding it into machines is a dead end, because machines can discover representations that are better suited to their own computational architecture than any representation borrowed from human cognition.

The Prediction That Came True

Sutton published "The Bitter Lesson" in March 2019. By that point, the evidence was already accumulating. OpenAI had released GPT-2 just weeks earlier, in February 2019: a language model with 1.5 billion parameters that could generate remarkably coherent text. GPT-2 was itself an extension of the original GPT, published in 2018, which had demonstrated that a large transformer model trained on a simple next-token prediction objective could develop surprisingly broad language capabilities. Sutton was not predicting the future. He was naming something that was already happening.

What happened next was the most dramatic validation of the bitter lesson that the field has ever seen. GPT-3, released in 2020, scaled to 175 billion parameters. The jump in capability was not incremental. GPT-3 could write essays, generate code, answer factual questions, translate between languages, and perform dozens of other tasks that had traditionally required specialized, domain-specific systems. It did all of this with a single model, a single training objective (predict the next token), and a single method (a transformer trained on a large corpus of text).
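The single training objective behind all of this fits in a few lines. The sketch below (PyTorch, with a toy vocabulary and a random tensor standing in for the model's output) shows the entire supervision signal a GPT-style model receives: shift the text by one position and ask the model to predict each next token. Everything beyond that is scale.

    # The entire training signal of a GPT-style language model: predict token
    # t+1 from tokens 1..t, scored with cross-entropy. The model itself is
    # elided; the random logits below stand in for model(inputs).
    import torch
    import torch.nn.functional as F

    vocab_size, batch, seq_len = 50_000, 4, 128
    tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))   # a toy batch of text

    inputs, targets = tokens[:, :-1], tokens[:, 1:]               # shift by one position
    logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)

    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()   # one objective, applied to ever more data and compute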

The scaling continued. GPT-4, Claude, Gemini, and their successors demonstrated that the same basic approach, transformers trained on next-token prediction at increasing scale, continued to yield dramatic improvements. Capabilities that researchers had spent years trying to engineer into specialized systems, from common-sense reasoning to mathematical problem-solving to code generation, emerged as side effects of simply making the model bigger and training it on more data.

The large language model revolution is the bitter lesson made manifest. Natural language processing, as a field, had spent decades building knowledge-rich systems: parsers, semantic role labelers, coreference resolution systems, sentiment analyzers, each one encoding human linguistic knowledge into specialized architectures. All of these were rendered largely obsolete by a general method (the transformer) applied with enough computation. The researchers who had spent their careers on these specialized systems watched as a single, general architecture surpassed all of their work.

Inside OpenAI, Sutton's essay became something close to scripture. It circulated as required reading for engineers. The decision to bet on scale, to build larger and larger models rather than more sophisticated architectures, was a direct application of the bitter lesson. And the bet paid off spectacularly.

Scaling Laws and Empirical Confirmation

The bitter lesson received further empirical support from the discovery of neural scaling laws. In 2020, researchers at OpenAI, including Jared Kaplan, published a landmark paper showing that the performance of language models followed smooth, predictable power laws as a function of model size, dataset size, and the amount of compute used for training. These scaling laws held over many orders of magnitude, suggesting that the relationship between computation and capability was not just real but remarkably regular and predictable.

This was a quantitative vindication of Sutton's qualitative observation. The scaling laws said, in precise mathematical terms, that more computation reliably produces better performance. You did not need clever architectural innovations or domain-specific engineering. You needed scale. The Chinchilla paper from DeepMind in 2022 refined the scaling laws, showing that models had been under-trained relative to their size and that optimal performance required scaling data and parameters together. But the fundamental message was the same: general methods plus computation beats everything else.
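Stated schematically, and with the exponents hedged to the approximate values reported in the Kaplan et al. paper, the scaling laws say that test loss L falls as a power law in parameter count N, dataset size D, and training compute C, while the Chinchilla result concerns how a fixed compute budget should be split between N and D:

    L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
    L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
    L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C},
    \qquad \alpha_N \approx 0.076, \quad \alpha_D \approx 0.095, \quad \alpha_C \approx 0.05

    N_{\mathrm{opt}} \propto C^{a}, \qquad D_{\mathrm{opt}} \propto C^{b},
    \qquad a \approx b \approx 0.5 \quad \text{(Chinchilla)}

The specific constants matter less than the shape: smooth, monotonic curves that held across many orders of magnitude, with no sign that domain-specific cleverness was needed to stay on them.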

The Counter-Arguments

The bitter lesson is not without its critics, and the counter-arguments are worth taking seriously.

Sample Efficiency

Perhaps the most common objection is about sample efficiency. Human children learn language from a few million words of input, not trillions. They learn to recognize objects from a handful of examples, not millions of labeled images. If general methods require vastly more data and computation than human learning, the argument goes, then they are not truly general. They are brute-force approximations that succeed only because we can throw resources at them that biological systems cannot.

This is a real concern, but it may be less damaging to Sutton's thesis than it first appears. Human children do not learn from scratch. They begin with a brain that was shaped by hundreds of millions of years of evolution, which is itself a form of computation over data. The "sample efficiency" of human learning may be, in part, an artifact of the enormous computational investment that evolution has already made. If you account for the evolutionary compute budget, the comparison looks different.

Embodied and Situated Intelligence

A related objection comes from the embodied cognition tradition, which argues that intelligence is fundamentally grounded in physical interaction with the world. On this view, language models trained on text are missing something essential. True understanding requires a body, sensory experience, and interaction with physical reality. No amount of scaling a text-based system will produce genuine understanding, because understanding is not the kind of thing that can be learned from text alone.

This argument may turn out to be correct, but the bitter lesson would predict a specific response: if embodied experience is important, the solution is not to hand-engineer representations of embodied knowledge, but to create general systems that can learn from embodied experience at scale. Robotics researchers are increasingly finding that this approach works. Large-scale simulation, combined with sim-to-real transfer, is producing robotic systems that learn general motor skills without hand-coded movement primitives, following the same pattern the bitter lesson describes.

Inductive Biases and Architecture Design

Another objection is that the success of general methods is itself a product of well-chosen inductive biases. The transformer architecture, for instance, is not a "general method" in the sense that a lookup table is general. It embodies specific assumptions about the importance of attention, positional relationships, and hierarchical composition. Its success may reflect the quality of these architectural choices, not just the quantity of computation applied.

This is fair, but Sutton's point is about the level of abstraction at which human knowledge is applied. There is a difference between designing a general-purpose learning architecture (like the transformer) and encoding domain-specific knowledge (like linguistic parse trees or chess evaluation functions). The bitter lesson does not say that all forms of human insight are useless. It says that domain-specific knowledge, encoded directly into the system, tends to get surpassed by more general approaches. The transformer is general precisely because it does not encode knowledge about any specific domain.
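For concreteness, the "assumption about attention" at issue amounts to roughly the following: a minimal sketch of scaled dot-product self-attention in PyTorch, with real transformers adding multiple heads, residual connections, feed-forward layers, and positional encodings on top. Nothing in it refers to language, vision, chess, or any other domain, which is the sense in which the architecture is general.

    # Minimal scaled dot-product self-attention: the transformer's core
    # inductive bias. Every position computes a learned, content-dependent
    # weighted average over every other position.
    import math
    import torch

    def self_attention(x, w_q, w_k, w_v):
        """x: (batch, seq_len, d_model); w_*: (d_model, d_model) learned matrices."""
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # pairwise relevance
        weights = scores.softmax(dim=-1)     # each position attends to all others
        return weights @ v                   # content-dependent weighted average

    d_model = 64
    x = torch.randn(2, 10, d_model)          # a toy batch of 10-token sequences
    params = [torch.randn(d_model, d_model) for _ in range(3)]
    out = self_attention(x, *params)         # (2, 10, 64)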

Interpretability and Safety

A more practical objection concerns interpretability and safety. Systems that learn their own representations are, almost by definition, harder to understand and control than systems built around human-legible knowledge structures. If we cannot understand how a system makes decisions, we cannot verify that it is making them for the right reasons. In high-stakes domains like medical diagnosis, autonomous driving, or criminal justice, this opacity is a serious problem.

This objection does not refute the bitter lesson so much as complicate its implications. It may be true that general, computation-heavy methods produce the best performance, and also true that deploying such systems responsibly requires additional work on interpretability and alignment that the bitter lesson does not address. Performance is not the only thing that matters. But the bitter lesson is specifically about what produces the best performance, and on that narrow question, the evidence is clear.

Diminishing Returns

Some researchers argue that scaling will eventually hit a wall. Energy costs, data availability, and physical limits on computation will prevent the indefinite continuation of the scaling paradigm. When that happens, the argument goes, domain-specific knowledge and architectural cleverness will become important again.

This is possible. But it is worth noting that people have predicted the imminent end of scaling at every point in the history of computing, and they have consistently been wrong. The forms of computation change (CPUs gave way to GPUs, which are giving way to custom AI accelerators), but the overall trend of increasing computation per unit cost has continued for decades. Betting against scale has, historically, been a losing bet.

The Two Principles

Sutton distills his observation into two principles for AI research:

First, AI researchers should not try to build in knowledge of the domain. The history of AI shows that this approach provides short-term gains but creates a ceiling that is eventually surpassed by simpler, more general methods. Instead, researchers should focus on developing methods that can discover their own representations and strategies through computation and learning.

Second, AI researchers should focus on methods that can take advantage of increasing computation. A good AI method is one that gets better as you give it more compute. A method that reaches a fixed performance ceiling, regardless of how much computation is available, is ultimately a dead end, because computation will keep getting cheaper and more abundant.

These two principles are related. Domain-specific knowledge creates fixed representations that cannot be improved by adding more computation. General learning methods, by contrast, can use additional computation to discover better representations. As computation grows exponentially over time, the gap between these two approaches widens exponentially as well.

Philosophical Implications

The bitter lesson has implications that extend well beyond AI research strategy. Read carefully, it says something unsettling about the nature of intelligence itself.

If general methods that learn from experience consistently outperform methods that encode human knowledge, it suggests that intelligence is less about knowing the right things and more about having the right learning process. The content of knowledge matters less than the ability to acquire knowledge. The specific representations matter less than the capacity to discover representations.

This resonates with a deep philosophical tradition. Socrates claimed that wisdom consists in knowing that you know nothing. The bitter lesson offers a computational version of this claim: the best AI systems are the ones that start knowing nothing and learn everything from experience.

But the pattern is older than AI. It is older than computers. Natural selection is the original bitter lesson. For billions of years, evolution has been running the same experiment: general optimization over vast computation versus designed solutions. And the result is always the same. No engineer could have designed the human eye, the immune system, or the neural architecture of a bird in flight. These solutions emerged from a process that encoded no domain knowledge whatsoever, that had no concept of optics or immunology or aerodynamics. Evolution is just search, operating over an unthinkable number of iterations. It works because it scales. It has always worked because it scales. The bitter lesson is not a discovery about AI. It is a rediscovery of something that was true long before humans existed to be bitter about it.

It also raises questions about the relationship between human cognition and artificial intelligence. Human brains evolved under severe constraints: limited energy, limited sensory bandwidth, limited lifespan for learning. Our cognitive architecture is shaped by these constraints. The concepts, categories, and representations we use to understand the world are adapted to our specific biological situation. They are good enough for human purposes, but they may not be optimal in any absolute sense.

When an AI system discovers representations that outperform human-designed ones, it may be discovering something closer to the true structure of the problem, unencumbered by the biological constraints that shaped human cognition. The bitter lesson, in its strongest form, suggests that human understanding is a locally useful but globally suboptimal compression of reality. Machines, freed from our constraints and given enough computation, can find better compressions.

This is a humbling thought. It does not diminish human achievement, any more than the telescope diminished the achievement of naked-eye astronomy. But it does suggest that the project of encoding human knowledge into machines is fundamentally misguided, not because human knowledge is wrong, but because it is limited in ways we cannot see from inside our own cognition.

What Sutton Got Right and Where the Story Continues

Seven years after its publication, "The Bitter Lesson" looks more prescient than even its author might have anticipated. The rise of large language models, the success of foundation models across vision, language, and multimodal tasks, and the continued discovery of scaling laws have all confirmed the central thesis. The methods that scaled with computation won. The methods that relied on human knowledge were superseded.

But the story is not over. The current frontier of AI research is grappling with questions that the bitter lesson addresses only obliquely. How do you align a system that has discovered its own representations? How do you ensure safety in a system whose internal workings you do not fully understand? How do you build systems that are not just powerful but trustworthy?

These questions may require new forms of human insight, not domain knowledge encoded into the system, but meta-level understanding of how to train, evaluate, and deploy systems whose capabilities emerge from scale. The bitter lesson tells us what works. The next challenge is making what works also work for us.

Perhaps there is a sequel to the bitter lesson that has not yet been written. If the first bitter lesson is that you cannot beat scale with knowledge, the second may be that you cannot govern scale with ignorance. Understanding these systems, not to encode our knowledge into them, but to ensure they remain beneficial as they surpass us, may turn out to be the most important intellectual challenge of the coming decades.

Sutton's thirteen paragraphs did not answer every question. They did not need to. They identified a pattern so clear and so consistent that ignoring it requires active effort. The bitter lesson is not a theory. It is an observation, backed by seventy years of evidence, that the field keeps re-learning and keeps resisting. General methods that leverage computation win. They always win. The pattern does not care whether we accept it or not.

That is the bitter lesson. It is simple, it is uncomfortable, and it is true.