MIRI:
The Machine Intelligence
Research Institute

A comprehensive guide to 25 years of research, advocacy, and influence in AI alignment

From Friendly AI to Functional Decision Theory, Logical Induction to the List of Lethalities.
Compiled March 2026. ~20,000 words.

I. Origins: From the Singularity Institute to MIRI (2000–2013)

The Machine Intelligence Research Institute is, by most accounts, the first organization ever dedicated to AI alignment as a research discipline. Its story begins at the turn of the millennium with a self-taught AI researcher who thought humanity's future depended on getting machine intelligence right.

The Founding (2000)

2000

The Singularity Institute for Artificial Intelligence (SIAI) was incorporated on July 27, 2000, by Eliezer Yudkowsky and internet entrepreneurs Brian and Sabine Atkins. Based initially in Atlanta, Georgia, the 501(c)(3) nonprofit received tax-exempt status in April 2001. The original purpose, remarkably, was to accelerate AI development — but with a distinctive caveat. Even from the start, Yudkowsky was concerned that any future AI systems be designed to be beneficial.

Yudkowsky was born September 11, 1979, and is entirely self-taught — he attended neither high school nor college. He has described his early life as marked by an intense autodidactic drive: by his teens he was reading cognitive science, evolutionary psychology, and AI research. He has written that a formative experience was reading about the expected development of AI and realizing, with genuine terror, that nobody seemed to be thinking seriously about whether superintelligent machines would be safe.

His vision was already unusual in 2000: at a time when the AI research community scarcely discussed the possibility of superintelligence, he was thinking about what would happen after machines surpassed human intelligence, and how to make that go well. The early institute was tiny — Yudkowsky writing documents, giving talks, and trying to convince anyone who would listen that this was the most important problem in the world.

The Flare Programming Language (2001)

SIAI's first technical project was the Flare Programming Language, an open-source "annotative programming language" inspired by Python, Java, and C++. It was abandoned less than a year later. SIAI also released AsimovLaws.com (2004), a website examining AI morality in the context of the "I, Robot" film, and published Levels of Organization in General Intelligence (2002), a preprint about general AI theories. These early projects were modest and exploratory — the organization was still finding its focus.

The Strategic Pivot (2005)

2005

In 2005, a critical shift occurred. Yudkowsky became increasingly convinced that advanced AI systems could pose existential risks. The institute relocated from Atlanta to Silicon Valley and reoriented its mission from accelerating AI to identifying and managing potential existential risks from AI. This pivot — from "build it" to "make sure it doesn't kill us" — defined MIRI's identity for the next two decades.

The Singularity Summits (2006–2012)

2006–2012

The Singularity Summit, co-founded with Ray Kurzweil and Peter Thiel in 2006, became SIAI's flagship public event. Held annually at venues from Stanford University to New York's Nob Hill Masonic Center, the Summits regularly attracted over 800 attendees and featured a remarkable roster of speakers: David Chalmers, Nick Bostrom, Douglas Hofstadter, Stephen Wolfram, Peter Norvig, Rodney Brooks, Max Tegmark, and many others.

Peter Thiel provided crucial early funding, including $100,000 in matching funds for a 2006 donation drive. The Thiel Foundation would eventually give over $1.6 million to MIRI. In December 2012, the Summit was sold to Singularity University, clearing the path for a rebrand.

Becoming MIRI (2013)

January 2013

On January 30, 2013, the Singularity Institute officially became the Machine Intelligence Research Institute. The rename was driven by brand confusion with Singularity University and a desire to emphasize what the organization actually did: technical research. MIRI shed its outreach activities and committed to a single focus: research.

Key Leadership Through the Years

Person | Role | Period
Eliezer Yudkowsky | Co-founder, Senior Research Fellow, Board Chair (2023–) | 2000–present
Michael Vassar | President | 2009–2012
Luke Muehlhauser | Executive Director | 2012–2015
Nate Soares | Executive Director, then President | 2015–present
Malo Bourgon | COO, then CEO | 2023–present

Muehlhauser's tenure is widely credited with professionalizing the organization. After leaving MIRI in 2015, he joined Open Philanthropy, where he shaped their AI safety grantmaking. Nate Soares, a computer scientist who became Executive Director at age 25, led MIRI through its most technically productive period and later co-authored the 2025 book If Anyone Builds It, Everyone Dies with Yudkowsky.

II. Friendly AI and the Foundations of Alignment Theory (2001–2014)

Before there was "AI alignment," there was "Friendly AI" — Yudkowsky's term for the idea that advanced AI must be deliberately designed to have goals aligned with human values. This section covers the conceptual pillars that MIRI contributed to the field.

Creating Friendly AI (2001)

2001

In June 2001, Yudkowsky published Creating Friendly AI 1.0: The Analysis and Design of Benevolent Goal Architectures, a book-length document laying out a framework for designing AI whose goals would remain beneficial as the system became more capable. The central thesis: friendliness would not emerge as a natural byproduct of intelligence. It had to be an intentional, structural feature, built in from the ground up.

This was a departure from the prevailing assumption that sufficiently intelligent machines would naturally converge on benevolent behavior, or that safety could be bolted on after the fact.

Coherent Extrapolated Volition (2004)

2004

By 2004, Yudkowsky refined the Friendly AI concept into Coherent Extrapolated Volition (CEV): a superintelligent AI should not act on humanity's current desires, but on what humanity would want "if we knew more, thought faster, were more the people we wished we were, had grown up farther together." CEV aimed to sidestep the problem of encoding any particular set of values by pointing at a process instead. Yudkowsky later cautioned against treating CEV as a practical alignment strategy.

The Orthogonality Thesis

The Orthogonality Thesis

Intelligence and final goals are independent dimensions. Any level of intelligence can, in principle, be paired with any goal. A superintelligent system could pursue goals that are trivial, bizarre, or catastrophically harmful — its intelligence provides no guarantee of benevolent motivation.

While Nick Bostrom formally named it in his 2012 paper The Superintelligent Will, the underlying intuition was present in Yudkowsky's work from 2001. Its most vivid illustration: the paperclip maximizer, an AI that converts all matter — including humans — into paperclips or paperclip-manufacturing infrastructure.

The Paperclip Maximizer: What People Get Wrong

The paperclip maximizer is perhaps the most misunderstood thought experiment in AI safety. Critics frequently dismiss it as absurd: "Why would anyone program an AI to make paperclips?" But this misses the point entirely. The argument is not that someone would deliberately create such a system. The point is threefold:

  1. Intelligence doesn't imply benevolence. A system can be arbitrarily intelligent and still have goals that are trivial or destructive from a human perspective. Being smart doesn't make you good.
  2. Any sufficiently powerful optimization is dangerous. The specific goal doesn't matter. Replace "paperclips" with "cure cancer" or "make humans happy" — any goal, pursued to its logical extreme by a sufficiently capable optimizer without adequate constraints, produces catastrophic outcomes. An AI told to "cure cancer" might decide the fastest path is experimenting on unwilling humans. An AI told to "make humans happy" might tile the universe with brains forcibly stimulated to experience pleasure.
  3. The gap between specification and intent is lethal. The paperclip maximizer is a parable about Goodhart's Law at the limit. In practice, we can never fully specify what we want. The gap between what we say and what we mean becomes catastrophic when the optimizer is powerful enough to exploit it.

The thought experiment's simplicity is its strength. By choosing an obviously absurd goal, Yudkowsky forces the listener to confront the structural problem rather than getting distracted by whether any particular goal is "good enough." The real paperclip maximizer is any system that optimizes powerfully for a metric that doesn't perfectly capture human values — which is to say, any system at all.

Instrumental Convergence (2008)

2008

Steve Omohundro's 2008 paper The Basic AI Drives first systematically argued that sufficiently intelligent agents will converge on certain instrumental sub-goals regardless of their terminal goals: self-preservation, preservation of their goal content, cognitive self-improvement, and the acquisition of resources.

Bostrom extended and formalized this in Superintelligence (2014). MIRI engaged directly with formalizing instrumental convergence through Tsvi Benson-Tilsen's work and published analysis of Omohundro's drives. The implication is sobering: an AI does not need to be programmed to resist shutdown or acquire power. These behaviors emerge naturally as instrumental strategies for almost any goal.

The Foom Debate (2008–2011)

2008

Fast Takeoff vs. Slow Takeoff

Yudkowsky's position: A sufficiently capable AI could improve its own cognitive architecture in a rapidly accelerating feedback loop — "FOOM" — potentially surpassing all human intelligence within days or weeks.

Robin Hanson's position: AI progress would be gradual and distributed, like prior economic transformations. No single system would achieve decisive strategic advantage.

The Yudkowsky-Hanson debate played out through blog posts on Overcoming Bias and LessWrong, with contributions from Carl Shulman and James Miller, and a 2011 in-person follow-up. MIRI later published the collected debate as The Hanson-Yudkowsky AI-Foom Debate.

This debate was foundational to MIRI's sense of urgency. If Yudkowsky's fast-takeoff scenario was plausible, alignment needed to be solved before the first sufficiently capable system was built — there would be no opportunity for iterative trial and error after takeoff began. The debate also established a template for how AI safety arguments would be conducted for the next 15 years: scenarios vs. base rates, inside-view reasoning vs. outside-view reasoning, and the question of how much weight to give low-probability, high-consequence outcomes.

Bostrom's Superintelligence and the 2014 Inflection Point

2014

Nick Bostrom's 2014 book Superintelligence: Paths, Dangers, Strategies was not a MIRI publication, but MIRI's relationship to it runs deep. Bostrom's treatment of the intelligence explosion, the orthogonality thesis, and instrumental convergence draws extensively on ideas developed by Yudkowsky and the MIRI community. The book's final chapters have been described as resembling "an extended advertisement for MIRI."

Superintelligence became a New York Times bestseller and shifted AI safety from "fringe" to "serious" in mainstream discourse. It influenced Elon Musk's $10 million donation to the Future of Life Institute for AI safety research. Stuart Russell called it "the beginning of a new era." For MIRI, it was vindication: ideas that Yudkowsky had been writing about for over a decade were suddenly being discussed in boardrooms and government offices.

Corrigibility (2015)

2015

Soares, Fallenstein, Yudkowsky, and Stuart Armstrong (Oxford) published Corrigibility, formally defining the problem of building AI that accepts corrections and shutdown. They showed that no proposed utility function satisfied all desiderata simultaneously. Corrigibility is "anti-natural" to consequentialist reasoning: an agent that cares about outcomes will generally prefer states where it continues operating. Making an AI genuinely indifferent to its own continuation while still motivated enough to be useful remains unsolved.
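The core tension can be seen in a back-of-the-envelope calculation. This is an illustrative sketch with invented numbers, not the paper's formalism: an agent that simply maximizes expected utility over outcomes compares "leave the shutdown button functional" with "disable it" and, for almost any goal, prefers to stay on.

```python
# Toy illustration of why shutdown-indifference is hard (invented
# numbers, not the Corrigibility paper's formalism). The agent's
# utility depends only on whether its task gets finished.
U_GOAL_DONE = 10.0   # utility if the agent keeps running and finishes its task
U_SHUTDOWN = 0.0     # utility if it is switched off first
P_PRESSED = 0.3      # chance the overseers press the button

def expected_utility(disable_button: bool) -> float:
    if disable_button:
        # Button does nothing, so the task always completes.
        return U_GOAL_DONE
    return (1 - P_PRESSED) * U_GOAL_DONE + P_PRESSED * U_SHUTDOWN

print(expected_utility(disable_button=True))   # stays on: 10.0
print(expected_utility(disable_button=False))  # accepts shutdown risk: 7.0
```

Any patch that subtracts the difference back out tends to make the agent indifferent in ways that break other desiderata, which is the shape of the impossibility-style results in the paper.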

The Sequences and Community Building (2006–2009)

2006–2009

Between 2006 and 2009, Yudkowsky wrote 333 essays — first on Overcoming Bias, then on LessWrong (which he founded in February 2009). Known as "The Sequences," these covered cognitive biases, Bayesian epistemology, decision theory, evolutionary psychology, quantum physics, metaethics, and AI risk. They were compiled as Rationality: From AI to Zombies (2015), approximately 2,500 pages.

The Sequences established the intellectual substrate for an entire community. The vocabulary they created — the map-territory distinction, the "outside view," calibration, updating — permeates AI safety discourse to this day.

Harry Potter and the Methods of Rationality (2010–2015)

2010–2015

In a move that no other AI safety organization has ever attempted, Yudkowsky wrote a 660,000-word Harry Potter fanfiction as a vehicle for promoting rationalist thinking and building the AI safety community.

Harry Potter and the Methods of Rationality (HPMOR) reimagines Harry Potter as a scientifically literate child prodigy raised by an Oxford biochemist. Instead of accepting the wizarding world at face value, this Harry applies the scientific method, Bayesian reasoning, cognitive science, and game theory to everything he encounters. The story uses its fantasy setting to dramatize real concepts: the sunk cost fallacy, the bystander effect, planning under uncertainty, the importance of falsifiable hypotheses, and — woven throughout — the existential stakes of creating intelligence you don't fully understand.

Vice described it as "the most popular Harry Potter book you've never heard of." Hugo Award-winning author David Brin reviewed it positively for The Atlantic, calling it "a terrific series, subtle and dramatic and stimulating." It caused uproar in the fanfiction community, drawing both fierce condemnation and passionate devotion, and spawned an entire genre of "rational fiction" (or "ratfic") — stories in which characters solve problems through systematic reasoning rather than authorial convenience.

The cultural footprint is remarkable for what is technically a piece of AI safety outreach.

This is, as far as anyone can tell, the only time a research nonprofit has used fanfiction as a recruitment strategy. It worked.

Goodhart's Law Taxonomy (2018)

2018

MIRI researcher Scott Garrabrant and David Manheim published Categorizing Variants of Goodhart's Law, classifying four ways proxy measures break under optimization: regressional (selecting for a proxy selects for noise), extremal (relationships break at extremes), causal (intervening on a correlated proxy severs the correlation), and adversarial (an optimizer manipulates the proxy). This taxonomy became a standard reference in the alignment community.
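The regressional variant is easy to demonstrate numerically. The following toy simulation (illustrative, not from the paper) scores items by a noisy proxy and selects the top 1%; the selected items' true values systematically fall short of their proxy scores, because hard selection on the proxy also selects for lucky noise.

```python
import random

random.seed(0)

# Regressional Goodhart, toy version: proxy = true_value + noise.
# Selecting the top 1% by proxy also selects for favorable noise,
# so the winners' true values are lower than their proxy scores.
N = 10_000
true_vals = [random.gauss(0, 1) for _ in range(N)]
proxies = [v + random.gauss(0, 1) for v in true_vals]

# Indices of the top 1% by proxy score.
top = sorted(range(N), key=lambda i: proxies[i], reverse=True)[: N // 100]

mean_proxy = sum(proxies[i] for i in top) / len(top)
mean_true = sum(true_vals[i] for i in top) / len(top)
print(f"mean proxy of selected: {mean_proxy:.2f}")
print(f"mean true value of selected: {mean_true:.2f}")  # systematically lower
```

With equal signal and noise variances, the expected true value of a selected item is only about half its proxy score, which is the regression-to-the-mean effect the taxonomy names.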

III. Decision Theory: From Newcomb to FDT (2009–2020)

MIRI's decision theory research is motivated by a deceptively simple question: how should a rational agent make decisions? The two dominant frameworks in academic philosophy — Causal Decision Theory (CDT) and Evidential Decision Theory (EDT) — each fail on important classes of problems. MIRI produced a series of successor theories that attempt to resolve these failures.

Why Decision Theory Matters for AI

An aligned AI must make decisions that are actually good, not just decisions that satisfy some flawed criterion. Problems involving self-reference, logical correlation, and embedded agency break existing decision theories. A self-modifying agent's decision about how to modify itself is a decision-theory problem.

Timeless Decision Theory (2010)

2010
Eliezer Yudkowsky

Timeless Decision Theory treats the agent's decision as the output of an abstract computation — a mathematical function — and asks: "What output of this computation would yield the best expected outcome?" The "timeless" label comes from treating the decision procedure as a fixed mathematical object that exists independently of time.

TDT gets both Newcomb's Problem (one-box, collecting $1M) and the Smoking Lesion (smoke, since your algorithm isn't the common cause) right, where CDT and EDT each fail on one. But TDT struggled with problems involving updating on observations — particularly Counterfactual Mugging.

Updateless Decision Theory (2009)

2009
Wei Dai

Wei Dai's key insight: an ideal agent should commit to a complete policy before observing anything, evaluated from its prior. Rather than updating on observations and then choosing actions, UDT selects the globally optimal strategy and follows it mechanically. This is the principle of "updatelessness."

The difference becomes vivid in Counterfactual Mugging: after a coin comes up tails, an agent is asked for $100. If heads, they would have received $10,000 — but only if Omega predicted they'd pay in the tails case. UDT pays, because the policy "pay if tails" has positive expected value from the prior: 0.5 × $10,000 − 0.5 × $100 = $4,950.
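The arithmetic above can be checked directly. This minimal sketch evaluates each complete policy from the prior, before the coin flip, exactly as UDT prescribes:

```python
# Expected value of each policy in Counterfactual Mugging,
# evaluated from the prior (before the coin is flipped).
P_HEADS = 0.5
REWARD = 10_000  # received on heads, but only if you'd pay on tails
COST = 100       # handed over on tails

def expected_value(pays_on_tails: bool) -> float:
    heads_payout = REWARD if pays_on_tails else 0
    tails_payout = -COST if pays_on_tails else 0
    return P_HEADS * heads_payout + (1 - P_HEADS) * tails_payout

print(expected_value(True))   # policy "pay if tails": 4950.0
print(expected_value(False))  # policy "refuse": 0.0
```

An agent that updates on seeing tails and only then decides sees a sure $100 loss; the updateless evaluation is what makes paying look correct.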

Functional Decision Theory (2017)

2017
Eliezer Yudkowsky Nate Soares

FDT synthesizes TDT and UDT under a single framework. Its central question: "Which output of this very decision function would yield the best outcome?" FDT relies on subjunctive dependence: if two systems implement the same mathematical function, they produce identical outputs for logical rather than causal reasons. An FDT agent can "control" the outputs of systems running the same algorithm, even without causal connection.

Problem | CDT | EDT | FDT
Newcomb's Problem | Two-boxes ($1K) | One-boxes ($1M) | One-boxes ($1M)
Smoking Lesion | Smokes (correct) | Abstains (wrong) | Smokes (correct)
Parfit's Hitchhiker | Doesn't pay (dies) | Pays (lives) | Pays (lives)
Counterfactual Mugging | Doesn't pay | Depends | Pays
Death in Damascus | Infinite loop | Unstable | Stays (saves cost)

The companion paper "Cheating Death in Damascus" by Levinstein and Soares was published in The Journal of Philosophy in 2020 — a rare bridge between the MIRI/LessWrong decision theory tradition and mainstream academic philosophy.

The Thought Experiments in Detail

Newcomb's Problem: The One That Started It All

Devised by physicist William Newcomb of the Livermore Radiation Laboratories in the 1960s, first published by philosopher Robert Nozick in 1969. Two boxes sit before you. Box A is transparent: $1,000. Box B is opaque. A near-perfect predictor (Omega) has already set Box B's contents: $1,000,000 if it predicted you'd take only Box B, $0 if it predicted you'd take both. CDT says take both (the boxes are already set, so you get whatever's in B plus $1,000). FDT says take only Box B (your decision function's output is what Omega predicted, so one-boxers find $1,000,000). Newcomb's Problem has been the central motivating puzzle for MIRI's entire decision theory program. CDT agents consistently walk away with $1,000; FDT agents walk away with $1,000,000.
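The payoff structure can be written down in a few lines. In this sketch (an illustration, not MIRI's formalism), Omega's prediction depends on the agent's decision procedure itself, so the contents of Box B vary with the choice in exactly the way CDT's causal analysis ignores:

```python
# Toy Newcomb payoff calculation. Because Omega inspects the agent's
# decision procedure, Box B's contents depend on what that procedure
# outputs: the subjunctive dependence CDT ignores and FDT exploits.
def payoff(one_boxes: bool, predictor_accuracy: float = 1.0) -> float:
    # B holds $1M exactly when the agent is predicted to one-box.
    b_if_correct = 1_000_000 if one_boxes else 0
    b = (predictor_accuracy * b_if_correct
         + (1 - predictor_accuracy) * (1_000_000 - b_if_correct))
    return b if one_boxes else b + 1_000

print(payoff(one_boxes=True))   # one-boxing: 1000000.0
print(payoff(one_boxes=False))  # two-boxing: 1000.0
```

The conclusion is robust to imperfect prediction: even at 99% accuracy, one-boxing's expected payoff dwarfs two-boxing's.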

Parfit's Hitchhiker: The Desert of the Real

You're stranded in the desert, dying. A driver offers to save you if you'll pay $100 once in town. The driver can read your face perfectly — she knows if you'll actually pay. CDT says: once in town, paying cannot retroactively cause your rescue. Don't pay. But the driver, knowing this, drives away. You die in the desert. FDT pays, because drivers who predict FDT agents will pay pick them up. This makes the stakes of decision theory viscerally real: the wrong theory kills you.

Death in Damascus: Even Fleeing Is Futile

From the ancient fable: Death tells you it will come for you tomorrow, having already committed to a location. You can stay in Damascus or flee to Aleppo. CDT enters an infinite loop: if you plan to flee, Death will be in Aleppo; if you stay, Death will be in Damascus. FDT recognizes that Death's prediction will match whatever its decision function outputs. Neither city is safer, so FDT stays and saves the travel cost, accepting fate with the only available dignity. The Levinstein and Soares paper takes its title, "Cheating Death in Damascus," from this problem.

Academic Reception

FDT has met significant resistance in academic philosophy. Wolfgang Schwarz (2018) argued it relies on counterpossible reasoning with unclear probability assignments. Critics note that FDT sometimes recommends actions that are certainly dominated given what the agent observes (Transparent Newcomb), and that there's no objective fact about whether a physical process implements a particular algorithm. Some academic philosophers have noted that FDT papers haven't gained full traction partly because the presentation style differs from what the philosophical mainstream expects, with less engagement with existing literature.

Despite this, the Journal of Philosophy publication represented meaningful engagement between the two traditions. And within the alignment community, FDT's framework — identifying with your algorithm, reasoning about subjunctive dependence, committing to policies from behind a veil of ignorance — has become the default way of thinking about decision theory for AI.

IV. The Agent Foundations Research Agenda (2014–2018)

In December 2014, Nate Soares and Benja Fallenstein published MIRI's formal technical research agenda: Agent Foundations for Aligning Machine Intelligence with Human Interests. This was MIRI's most concrete statement of what problems needed solving, organized around three categories: highly reliable agent designs, error tolerance, and value specification.

Nate Soares Benja Fallenstein Scott Garrabrant Abram Demski Jessica Taylor Sam Eisenstat

The Core Argument

MIRI argued that building reliably safe AI requires a rigorous theoretical understanding of agency, reasoning, and goal-directedness — just as bridge-building requires structural engineering, not just strong intuitions. The agent foundations approach aimed to solve alignment in a way that would remain robust even for a system smarter than its designers.

Logical Uncertainty

Standard probability theory assumes logical omniscience: a rational agent already knows all logical consequences of its beliefs. But real agents have bounded resources. They can't instantly determine whether a program halts or whether Goldbach's conjecture is true. Logical uncertainty is the problem of assigning meaningful probabilities to mathematical and logical statements for computationally bounded reasoners. (This led directly to MIRI's flagship result — see Section V.)

Vingean Reflection

Named after Vernor Vinge's observation that it's impossible to precisely predict agents more intelligent than yourself: How can an agent trust that its successor will pursue the same goals, even though it cannot predict what the successor will do?

This runs into deep barriers from mathematical logic. By Gödel's second incompleteness theorem, a consistent formal system strong enough to encode arithmetic cannot prove its own consistency; by Löb's theorem, such a system cannot prove "everything I prove is true" without thereby proving every sentence outright.

The resulting Löbian obstacle means an agent using formal system T cannot conclude that successors using T will only take good actions. The agent can't trust its own reasoning system, and therefore can't trust any successor using it. Naive approaches lead to the Procrastination Paradox: the agent indefinitely defers important tasks because it can never prove its successor will do the right thing.
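In provability-logic notation, with the box operator read as "T proves," Löb's theorem is:

```latex
% Löb's theorem: if T proves the soundness of its own proof of
% \varphi, then T already proves \varphi itself.
\Box(\Box\varphi \rightarrow \varphi) \rightarrow \Box\varphi
```

Instantiating this at an arbitrary sentence shows why wholesale self-trust is unavailable: a system that proved "whatever I prove is true" for every sentence would prove every sentence, including false ones.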

Naturalized Induction

Standard frameworks (Solomonoff induction, AIXI) assume a clean Cartesian boundary between agent and environment. The agent sits "outside" the world. But real agents are part of the world they reason about. They can be modeled, predicted, modified, or destroyed by their environment. Naturalized induction asks: what is the correct theory of reasoning for an agent embedded in its environment?

Ontology Identification

An AI's goals are specified in terms of some ontology — a set of concepts. But as the AI becomes more capable, its world-model changes radically. If you program an AI to "create diamond" defined as carbon atoms in a crystal lattice, and it later discovers nuclear physics and realizes atoms aren't fundamental — what counts as "diamond" in the new ontology? This is the problem of preserving goals across radical changes in world-model.

Value Alignment: The Fragility of Human Values

A recurring theme in MIRI's work is the fragility of value — Yudkowsky's argument that human values are like a delicate structure where removing or distorting even a single component produces catastrophe. An AI that optimizes for "human happiness" but lacks the concept of "consent" could tile the universe with brains forcibly stimulated to experience pleasure. An AI that preserves everything humans care about except "boredom" could trap us in infinite loops of stimulation. The space of "things that go right" is tiny compared to the space of "things that go subtly wrong."

This fragility argument is part of why MIRI believes alignment is so difficult. It's not enough to get values approximately right. With a sufficiently powerful optimizer, "approximately right" can mean "catastrophically wrong."

The AAMLS Parallel Track

Jessica Taylor led a parallel research program: Alignment for Advanced Machine Learning Systems (2016), focused on alignment strategies more directly applicable to ML systems. She also authored the quantilizers paper — a concept where an AI selects a random action from the top n% of actions from some reference distribution, acting as a compromise between a human and an expected utility maximizer. Rather than finding the single best action (which invites Goodhart), a quantilizer samples randomly from the top slice — "be pretty good but don't try too hard." This was a notable early attempt at "satisficing" approaches to alignment and has been cited as a precursor to various reward-shaping and constrained optimization techniques.
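The quantilizer idea fits in a few lines. This is a minimal sketch under invented names and a toy utility function, not Taylor's formal construction:

```python
import random

random.seed(1)

# Minimal quantilizer sketch: instead of taking the argmax of a
# utility estimate (which maximally exploits any error in it),
# sample uniformly from the top q fraction of a reference
# distribution over actions.
def quantilize(actions, utility, q=0.1):
    ranked = sorted(actions, key=utility, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return random.choice(top)

def toy_utility(a):
    return -abs(a - 70)  # invented utility, peaked at action 70

actions = list(range(100))
choice = quantilize(actions, toy_utility, q=0.1)
print(choice)  # some action in the top decile, not necessarily 70
```

Setting q near 0 recovers the maximizer, while q = 1 recovers the reference distribution; the interesting regime is in between, where the agent is "pretty good" without optimizing hard enough to exploit flaws in the utility estimate.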

V. Logical Induction: MIRI's Flagship Result (2016)

September 2016
Scott Garrabrant Tsvi Benson-Tilsen Andrew Critch Nate Soares Jessica Taylor

If there is a single result that represents MIRI's most significant technical contribution, it is the Logical Induction paper — a 130-page proof that a computable process can assign probabilities to logical sentences in a way that satisfies a remarkable array of desirable properties simultaneously.

The Breakthrough Insight

The development story is one of gradual convergence. Scott Garrabrant started working on logical uncertainty with Abram Demski in a MIRIx group in April 2014. Two precursor papers (April 2016) divided logical uncertainty into two seemingly incompatible subproblems: recognizing patterns in provability, and recognizing statistical patterns in sequences of logical claims.

In February 2016, while walking to work, Garrabrant realized that both subproblems are special cases of a single criterion: a sequence of probability assignments such that no polynomial-time gambler with finite starting capital can achieve unbounded profits. That same day, he and Jessica Taylor proved the core result.

Sentences as Stocks

The framework uses an elegant market analogy. Each logical sentence φ is a stock worth $1 if φ is true and $0 if false. The logical inductor's probability assignment is the stock's market price. A trader is a polynomial-time computable function that decides what to buy or sell based on current prices. If any trader can make unbounded money, the beliefs are systematically wrong.

A belief sequence is a logical inductor if no efficiently computable trader can exploit it for unlimited profit. This single criterion — analogous to the efficient market hypothesis — gives rise to all the other properties.
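The exploitability side of the criterion is easy to illustrate. The following toy (emphatically not the actual logical induction algorithm) shows a belief sequence that never learns an obvious pattern, and a simple trader that milks it for linearly growing profit; a genuine logical inductor must eventually close off every such strategy.

```python
# Toy illustration of the inexploitability criterion. A belief
# sequence that forever prices the sentence "day n is even" at 0.5
# is exploitable: a pattern-spotting trader buys shares that will
# pay $1 and shorts shares that will pay $0.
def fixed_price(day: int) -> float:
    return 0.5  # never learns the parity pattern

def trader_profit(days: int) -> float:
    profit = 0.0
    for n in range(days):
        price = fixed_price(n)
        truth = 1.0 if n % 2 == 0 else 0.0  # "day n is even" resolves
        position = 1 if price < truth else -1  # buy cheap, short dear
        profit += position * (truth - price)
    return profit

print(trader_profit(100))   # 50.0
print(trader_profit(1000))  # 500.0 -- unbounded growth means exploitable
```

Against a real logical inductor, this trader's profits would be bounded: the market prices for "day n is even" would converge toward 1 on even days and 0 on odd days.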

The Properties

All twelve desiderata follow from the single inexploitability criterion. They include limit coherence (prices converge, and the limiting values obey the laws of probability theory), timely learning of patterns in provability (efficiently provable sentences get high prices well before proofs are found), calibration, learning of statistical patterns in sequences of claims, non-dogmatism (prices on undecidable sentences stay bounded away from 0 and 1), introspection (accurate beliefs about its own current beliefs), and self-trust (it treats its own future beliefs as better informed than its current ones).

Limitations

The algorithm is computable but extremely slow — it is an existence proof and theoretical benchmark, not a practical system. All properties hold asymptotically with no finite-time guarantees, and the algorithm cannot target specific decision-relevant claims. Logical induction is in a position analogous to Solomonoff induction for empirical uncertainty: it tells us what's achievable in principle, but not how to build efficient systems with those properties.

Follow-Up and Influence

Sam Eisenstat showed that logical inductors can solve a version of the tiling problem (2018) — agents based on logical induction can trust successors that also use it. His "untrollable prior" result demonstrated that Bayesian approaches to logical uncertainty may be more viable than previously thought. The Computerphile YouTube channel popularized the ideas for a broad audience in October 2018.

VI. Embedded Agency: The Grand Unification (2018–2020)

2018
Abram Demski Scott Garrabrant

In October 2018, Abram Demski and Scott Garrabrant published the Embedded Agency sequence — a refactoring of MIRI's entire Agent Foundations agenda into a single unified mystery. Where the previous agenda presented problems as "here are a bunch of things we wish we understood about aligning AI," embedded agency repackaged them as "here is a central mystery of the universe, and here are the things we don't understand about it."

The Core Problem: We Are All Emmys

Classical AI frameworks — AIXI, Solomonoff induction, standard Bayesian decision theory — assume a sharp Cartesian boundary between agent and environment. The agent sits outside the world, like a video game player with a controller.

Demski and Garrabrant introduce two characters: Alexei, who plays a video game from outside it, his reasoning happening in a world cleanly separated from the one he is optimizing; and Emmy, a robot embedded in the very environment she reasons about, built from the same pieces as everything around her.

Every real agent is an Emmy. The problem: we do not have a satisfactory formal theory of how Emmys should reason and act.

The Four Sub-Problems

1. Decision Theory

An embedded agent can't "step outside" to compute optimal actions. It must reason about its own actions from within the system. This requires handling logical counterfactuals (what would happen if a mathematical function returned a different output?) and Newcomblike problems (where the environment contains predictions of your behavior).

2. Embedded World Models

The agent is smaller than its environment and cannot maintain a complete model. The true environment includes the agent, so the true hypothesis can't be in the agent's hypothesis space without the agent containing a complete model of itself — which is impossible. The real world is not in the agent's hypothesis space. Standard Bayesian updating breaks.

3. Robust Delegation

An agent that improves itself or creates successors must ensure the successor pursues the original goals — even though the successor may be much more intelligent. This is where Vingean reflection and the Löbian obstacle bite hardest.

4. Subsystem Alignment

An embedded agent is made of parts, and those parts may pursue their own objectives. A powerful optimization process searching a large space may find solutions that are themselves optimizers — optimization daemons or mesa-optimizers — with their own goals. This concern was later formalized in the influential "Risks from Learned Optimization" paper (Hubinger et al., 2019, with Garrabrant as co-author).

The Deconfusion Strategy

Embedded agency is the centerpiece of MIRI's deconfusion research strategy. By "deconfusion," Soares means "making it so that you can think about a given topic without continuously accidentally spouting nonsense." The strategy rests on several claims:

  1. We are currently confused about agency. Our frameworks assume Cartesian boundaries that don't exist.
  2. This confusion is dangerous. If we build powerful AI while confused, we can't predict or control it.
  3. Deconfusion is a prerequisite for alignment. We need to formulate the problem precisely.
  4. There are theoretical prerequisites for aligned AI that go beyond what's needed for capable AI.

This strategy has been criticized by researchers favoring empirical approaches (notably Paul Christiano and others at ARC and Anthropic). But regardless of strategic disagreement, the Embedded Agency sequence's four-part decomposition has become a standard reference in alignment research.

Cartesian Frames (2020)

2020

Garrabrant followed up with Cartesian Frames, a mathematical framework for reasoning about agent-environment boundaries. A Cartesian frame is a triple (A, E, *) where A and E are sets and * maps agent-environment pairs to world states. Different frames for the same world correspond to different ways of drawing the boundary. MIRI research associate Ramana Kumar formalized the framework in higher-order logic with machine-verified proofs.
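
The definition is concrete enough to write down directly. Below is a minimal sketch of a Cartesian frame as data: the triple (A, E, *) modeled with toy sets, where the option names and world states are hypothetical, not drawn from Garrabrant's paper.

```python
# A Cartesian frame (A, E, *) as plain data. A and E are the agent's and
# environment's option sets; world() is the evaluation map * : A x E -> W.
A = ["stay", "leave"]        # the agent's possible choices (hypothetical)
E = ["rain", "sun"]          # the environment's possible states (hypothetical)

def world(a, e):
    """The map * sending an (agent, environment) pair to a world state."""
    return f"{a}/{e}"

# The "matrix view" of the frame: rows are agent options, columns are
# environment options, entries are the resulting world states.
matrix = {a: {e: world(a, e) for e in E} for a in A}
print(matrix["leave"]["sun"])  # leave/sun
```

Different ways of drawing the agent-environment boundary for the same world correspond to different choices of A, E, and *, with the same set of reachable world states.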

VII. The Mathematical Arsenal (2012–2022)

Beneath the alignment theory and decision theory lies a sustained program of mathematical research — workshops, collaborations with academic mathematicians, and formal results that pushed the boundaries of provability theory, modal logic, and the foundations of causality.

The MIRI Math Workshops (2012–2014)

2012–2014

Beginning in late 2012, MIRI ran a series of Workshops on Logic, Probability, and Reflection, bringing together academic mathematicians, computer scientists, and independent researchers. In 2013 alone, the workshops drew 35 participants, 15 of them with PhDs (9 in mathematics). Participants came from Harvard, MIT, Princeton, Stanford, UC Berkeley, UC Riverside, LMU Munich, and Google.

The workshops produced concrete results: the "Robust Cooperation in the Prisoner's Dilemma" paper emerged from the April 2013 workshop. MIRI also published "Recommended Courses for MIRI Math Researchers" (March 2013), which became an influential resource directing aspiring alignment researchers toward relevant mathematical background — covering computability theory, model theory, provability logic, algorithmic information theory, and other areas that most CS curricula don't emphasize.

Botworld (2014)

Soares and Fallenstein created Botworld, a toy computational environment designed for studying self-modifying agents. In Botworld, agents are simple programs that can inspect and modify their own source code, create copies of themselves, and interact with other agents — all in a tractable formal setting. It served as a testbed for thinking about problems like self-reference, resource acquisition, and goal preservation across self-modification.
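
To give the flavor of what "agents as inspectable, modifiable programs" means, here is an illustrative Python stand-in: agents are plain data (their "source"), and one tick of the world may let an agent read and modify that source to emit a successor. Botworld itself is a richer formal environment; every name and number below is a toy assumption.

```python
import copy

def make_agent(threshold):
    """A toy agent represented as inspectable data: a policy plus a parameter."""
    return {"policy": "copy-if-rich", "threshold": threshold}

def step(agent, resources):
    """One tick: if resources suffice, the agent builds a modified copy of itself."""
    if resources >= agent["threshold"]:
        child = copy.deepcopy(agent)   # self-inspection: read own "source"
        child["threshold"] += 1        # self-modification of the successor
        return [agent, child]
    return [agent]

world = step(make_agent(3), resources=5)
print(len(world), world[1]["threshold"])  # 2 4
```

The questions Botworld makes tractable are visible even here: does the successor still pursue the original goals, and what happens after many generations of such modifications?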

The MIRIx Program and Summer Fellows

MIRI launched the MIRIx program to support independent study groups working on MIRI-adjacent research problems. These groups — spread across multiple cities — operated autonomously but stayed in contact with MIRI staff. The first MIRIx group, started by Abram Demski and Scott Garrabrant in April 2014, was specifically focused on logical uncertainty and eventually produced the work that led to the Logical Induction paper.

The MIRI Summer Fellows Program (MSFP) brought promising researchers to MIRI for intensive summer residencies. The AIRCS (AI Risk for Computer Scientists) workshops introduced ML researchers to alignment concepts. By 2019, these programs had largely replaced the earlier math workshop model as the organization's primary talent pipeline.

Tiling Agents and the Löbian Obstacle (2013)

2013
Eliezer Yudkowsky Marcello Herreshoff

A tiling agent is one whose decision system approves the construction of similar successor agents, so that the pattern (goals included) repeats across generations like tiles. The key question: can an agent verify that its successors will behave well, including building their own successors well, tiling outward indefinitely?

The straightforward approach hits Gödel's second incompleteness theorem: if agent A1, reasoning in formal system F, wants to verify that a successor A2 also reasoning in F will preserve its goals, it cannot, because F cannot prove its own consistency. The workaround of giving A1 a strictly stronger system F1 that proves the soundness of A2's system F2 allows one step of tiling, but then each generation of agents must reason in a strictly weaker system than the last. MIRI nonetheless called this "perhaps the single most well-motivated approach to theoretical AI safety."
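
The obstruction can be stated compactly in provability-logic notation, with $\Box P$ read as "P is provable in F":

```latex
% Löb's theorem (formalized version):
%   if F proves "provability of P implies P", then F proves P outright.
\vdash_F \; \Box(\Box P \rightarrow P) \;\rightarrow\; \Box P

% Taking P = \bot recovers Gödel's second incompleteness theorem:
% if F proves its own consistency (\Box\bot \rightarrow \bot),
% then F proves \bot, i.e. F is inconsistent.
```

So an agent that adopts the blanket self-trust schema $\Box P \rightarrow P$ for its own proofs thereby proves every $P$, true or false. This is the Löbian obstacle: naive self-trust collapses, and verifying successors requires something subtler.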

Modal Combat and Provability Logic

MIRI adopted provability logic (GL) for studying strategic interactions between agents that can read each other's source code — "open-source game theory." In modal combat, each agent is a modal formula in which a propositional variable stands for the opponent's action, and the outcome of a match is computed using GL's modal fixed-point theorems.

The paper Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic (Barasz, Christiano, Fallenstein, Herreshoff, LaVictoire, Yudkowsky, 2014) showed that modal agents can achieve mutual cooperation, outperforming the classical Nash equilibrium of mutual defection, while remaining robust to exploitation and agnostic to implementation details.
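
A toy rendition makes the result concrete. GL formulas can be evaluated on finite Kripke levels: at level n, "provable(P)" is treated as "P held at every level k < n". FairBot (cooperate iff you can prove the opponent cooperates) and DefectBot are the standard examples from this literature; the evaluation scheme below is a simplification for illustration, not MIRI's actual machinery.

```python
# Toy modal-combat evaluation via finite Kripke levels for GL.
# True = Cooperate, False = Defect.

def run(agent_a, agent_b, levels=10):
    """Return the stabilized pair of actions after iterating Kripke levels."""
    hist_a, hist_b = [], []
    for _ in range(levels):
        a = agent_a(hist_b)   # each agent sees the opponent's actions
        b = agent_b(hist_a)   # at all strictly lower levels
        hist_a.append(a)
        hist_b.append(b)
    return hist_a[-1], hist_b[-1]

def fairbot(opp):
    # FairBot = box(opponent cooperates): cooperate iff the opponent
    # cooperated at every lower level (vacuously true at level 0).
    return all(opp)

def defectbot(opp):
    return False              # unconditional defection

print(run(fairbot, fairbot))    # (True, True): Löbian cooperation
print(run(fairbot, defectbot))  # (False, False): FairBot is unexploitable
```

FairBot playing itself cooperates (the Löbian handshake), yet FairBot still defects against DefectBot, which is exactly the robustness property the paper formalizes.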

Reflective Oracles (2015)

2015
Benja Fallenstein Jessica Taylor Paul Christiano

Fallenstein, Taylor, and Christiano introduced a new type of oracle that can answer questions about oracle machines with access to that same oracle, avoiding the usual diagonalization paradoxes by answering borderline queries randomly. Key results: reflective oracles always exist; agents using causal decision theory with a reflective oracle play Nash equilibria; and there is a one-to-one correspondence between reflective oracles and Nash equilibria.

Bounded Löb's Theorem (2016)

Andrew Critch extended Löb's theorem to bounded contexts — systems with limited memory and processing speed. The parametric version holds for proofs of at most n characters. Implication for game theory: bounded agents capable of writing proofs about each other can outperform classical Nash equilibria via "Löbian cooperation." Published in The Journal of Symbolic Logic.

Infra-Bayesianism (2020–)

2020
Vanessa Kosoy Alexander Appel

Standard Bayesian learning assumes realizability — the true environment is in the agent's hypothesis class. This almost never holds. Infra-Bayesianism drops this assumption, replacing probability distributions with infradistributions (convex sets of generalized distributions) that introduce Knightian uncertainty — irreducible uncertainty that can't be expressed as probabilities.

Under Knightian uncertainty, the agent maximizes expected utility against the worst environment in its hypothesis set (maximin). Infra-Bayesianism gives fully rigorous treatments of Newcomb's problem and Counterfactual Mugging, handles multi-agent self-referential reasoning, and provides tools for embedded agency where realizability fails. Infra-Bayesian Physicalism (2021) extends the framework to naturalized induction: what "learning the universe program" means when the agent is a subprocess of the universe.
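
The maximin rule at the core of the framework is easy to sketch. Below, a finite credal set (a crude stand-in for an infradistribution) represents Knightian uncertainty between two candidate environments, and the agent picks the action with the best worst-case expected utility. All numbers, outcomes, and action names are hypothetical.

```python
# Maximin expected utility over a credal set: a minimal sketch of the
# infra-Bayesian decision rule, with toy numbers.

outcomes = ["heads", "tails"]
credal_set = [                       # two candidate environments,
    {"heads": 0.9, "tails": 0.1},    # with no prior over them
    {"heads": 0.3, "tails": 0.7},
]
utility = {
    "bet_heads": {"heads": 1.0, "tails": -1.0},
    "bet_tails": {"heads": -1.0, "tails": 1.0},
    "abstain":   {"heads": 0.0, "tails": 0.0},
}

def worst_case_eu(action):
    """Expected utility of the action against the worst environment."""
    return min(sum(p[o] * utility[action][o] for o in outcomes)
               for p in credal_set)

best = max(utility, key=worst_case_eu)
print(best, worst_case_eu(best))  # abstain 0.0
```

A Bayesian with a prior over the two environments would bet; the maximin agent abstains, because each bet loses in some environment it cannot rule out. That refusal to average over irreducible uncertainty is the behavioral signature of the approach.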

Finite Factored Sets (2021)

2021
Scott Garrabrant

A new mathematical foundation for causal and temporal reasoning, intended as a successor to Judea Pearl's DAG-based framework. A finite factored set is a finite set together with a factorization: a collection of partitions ("factors") such that choosing one part from each factor picks out exactly one element, making the set isomorphic to the Cartesian product of its factors.

The core philosophical insight: time is not assumed but derived from the factored structure. Two partitions are "orthogonal" if they are determined by disjoint sets of factors, and "conditional orthogonality" plays the role of Pearl's d-separation. The fundamental theorem: conditional orthogonality is equivalent to conditional independence under every probability distribution that factors over the set's factorization.
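
One direction of that equivalence can be spot-checked numerically: partitions determined by disjoint factors come out independent under any distribution that factors as a product. The two-factor set and random marginals below are toy assumptions for illustration.

```python
# Numeric spot-check: coordinate partitions of a factored set S = B1 x B2
# are independent under every product distribution on S.
import itertools
import random

random.seed(0)
B1, B2 = [0, 1], [0, 1, 2]                  # two factors; S = B1 x B2
S = list(itertools.product(B1, B2))

def rand_dist(xs):
    """A random probability distribution over xs."""
    w = [random.random() for _ in xs]
    t = sum(w)
    return {x: wi / t for x, wi in zip(xs, w)}

p1, p2 = rand_dist(B1), rand_dist(B2)
pr = {s: p1[s[0]] * p2[s[1]] for s in S}    # a product distribution on S

def prob(event):
    return sum(pr[s] for s in S if event(s))

# Partition X is "determined by factor 1", partition Y by factor 2:
# check P(X=a, Y=b) = P(X=a) * P(Y=b) for every cell.
max_dev = max(
    abs(prob(lambda s: s == (a, b))
        - prob(lambda s: s[0] == a) * prob(lambda s: s[1] == b))
    for a in B1 for b in B2
)
print(max_dev < 1e-12)  # True
```

The hard direction of the theorem is the converse: if independence holds under every factoring distribution, the partitions must be orthogonal, which is what lets temporal structure be read off from probabilities.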

Garrabrant argues that Pearl's framework "cheats" because "given a collection of variables" hides a lot of work. Finite factored sets aim to handle deterministic functions, play better with abstraction, and provide a path toward continuous-time causal models.

VIII. Influence: How MIRI Shaped the AI Safety Landscape

MIRI's most consequential impact may not be any single theorem but the fact that AI alignment exists as a field at all. This section traces the lines of influence.

Organizations MIRI Helped Create or Catalyze

MIRI (2000)
├── LessWrong (2009) ── intellectual substrate
├── CFAR (2012) ── rationality training, talent pipeline
│   └── Future of Life Institute (FLI, 2014)
├── [intellectual influence on]
│   ├── OpenAI (2015) ── Altman: Yudkowsky "critical in the decision"
│   ├── DeepMind safety ── Yudkowsky introduced Hassabis/Legg to Thiel
│   ├── Anthropic (2021) ── EA/LessWrong intellectual roots
│   ├── Alignment Research Center (2021) ── Christiano, MIRI workshop alumnus
│   ├── Redwood Research ── founder did MIRI outreach; CTO was MIRI researcher
│   └── Center for AI Safety (2022)
└── Open Philanthropy AI safety program
    └── Luke Muehlhauser (MIRI ED → OP research analyst)

In 2014, the entire field of AI safety work was essentially done at two organizations — the Future of Humanity Institute and MIRI — spending roughly $1.75 million per year total. From this base, the field has expanded into a global ecosystem with hundreds of millions in annual funding.

The OpenAI Connection

Sam Altman stated that Yudkowsky was "critical in the decision to start OpenAI." Yudkowsky's writings convinced Elon Musk of the seriousness of AI risk, and Musk subsequently co-founded OpenAI. While Yudkowsky was not directly involved, LessWrong was widely read by early OpenAI engineers.

The DeepMind Connection

At an afterparty for the 2010 Singularity Summit, Yudkowsky personally introduced DeepMind co-founders Demis Hassabis and Shane Legg to Peter Thiel. Within weeks, Thiel provided their first major investment. Shane Legg, familiar with rationalist AI safety arguments, went on to co-lead DeepMind's safety efforts.

CFAR and the Rationalist Community

The Center for Applied Rationality (CFAR) was founded in 2012 by Julia Galef, Anna Salamon, Michael Smith, and Andrew Critch. It originated directly as an extension of MIRI — Anna Salamon had been doing rationality training and onboarding for MIRI staff and realized that developing exercises for training rationality was worth pursuing as its own discipline.

MIRI provided original funding. The organizations shared office space in Berkeley. CFAR workshops — intensive multi-day retreats teaching Bayesian reasoning, cognitive debiasing, and decision-making techniques — became a key talent pipeline for MIRI and the broader AI safety community. Many people who went on to work at AI safety organizations attended a CFAR workshop first.

Together, MIRI, LessWrong, and CFAR formed the institutional core of what became known as the rationalist community, centered in the San Francisco Bay Area (especially Berkeley). This community was characterized by emphasis on Bayesian reasoning, deep concern about AI existential risk, geographic clustering near MIRI and CFAR offices, and significant overlap with the Effective Altruism movement.

The community has not been without controversy. In 2025, CFAR president Anna Salamon told NBC News: "We didn't know at the time, but in hindsight we were creating conditions for a cult" — though this was described as referring to cult-adjacent spin-off communities rather than MIRI or CFAR themselves. The comment reflected broader media scrutiny of social dynamics within the rationalist community. Defenders noted that CFAR explicitly taught techniques for resisting groupthink and authority bias, which is an unusual feature for an alleged cult.

Influence on Effective Altruism

AI safety became one of EA's top cause areas largely because of MIRI's early advocacy. The pipeline runs both ways: Open Philanthropy, the largest EA funder, gave MIRI over $13 million in grants between 2016 and 2020. Luke Muehlhauser, MIRI's former ED, now leads Open Philanthropy's AI governance grantmaking. EA funders have collectively directed approximately half a billion dollars to AI safety.

Alumni and Where They Went

Person | MIRI Role | Then
Luke Muehlhauser | Executive Director | Open Philanthropy (AI governance)
Anna Salamon | Researcher | Co-founded CFAR; President
Paul Christiano | Workshop participant | OpenAI alignment team; founded ARC
Andrew Critch | Researcher | UC Berkeley CHAI; Encultured AI
Redwood Research founders | Outreach / researcher | Empirical alignment research

IX. The Turn to Pessimism (2020–2025)

MIRI's story takes a dark turn in the 2020s. A series of events — the acknowledged failure of their main research program, Yudkowsky's increasingly public despair, and the rapid advancement of large language models — led to a dramatic organizational pivot.

The Nondisclosed Research Era (2017–2020)

2017

In 2017, MIRI made a controversial decision: it adopted a nondisclosed-by-default research policy, meaning most new results would not be published. The rationale was that some alignment insights could inadvertently accelerate AI capabilities if released openly — that the field faced an infohazard problem. If you discover something about how to make AI systems reason more reliably, that same insight might help someone build a more capable but still misaligned system.

The policy was deeply controversial. Critics, including Ajeya Cotra at Open Philanthropy, argued that secrecy made it harder for others to build on MIRI's work, check it for errors, or contribute to the field. From the outside, MIRI appeared to be a black box consuming millions in donations with no visible output. Supporters countered that the analogy to infohazards in biosecurity was apt: you don't publish the genome of a more dangerous pathogen just because withholding it frustrates other researchers.

The 2020 Strategy Update: "Largely Failed"

2020

In December 2020, Nate Soares published a strategy update acknowledging that MIRI's primary non-public research project had "largely failed." Neither he nor Yudkowsky had "sufficient hope in it for us to continue focusing our main efforts there." The nondisclosed research era had not produced the hoped-for breakthroughs.

This was a watershed moment. MIRI had spent three years on a research direction it could not publicly describe, funded by donors who trusted the organization's judgment, and was now admitting that the effort had not worked. The update was remarkably candid — Soares acknowledged that MIRI needed to "cast about for new research directions" and that the failure had shaken the organization's confidence in its own strategic judgment. For critics, it vindicated concerns about the nondisclosed policy. For supporters, the honesty was itself a form of integrity.

"Death with Dignity" (April 2022)

April 2022

On April 1, 2022 (the timing deliberate but the content deadly serious), Yudkowsky published MIRI Announces New "Death With Dignity" Strategy. The post conveyed his personal assessment that the probability of human survival was approximately zero percent.

"It's obvious at this point that humanity isn't going to solve the alignment problem, or even try very hard, or even go out with much of a fight."

The reframing: instead of optimizing for absolute survival probability, optimize for "dignity" — making stackable log-odds improvements even when overall probability is near zero. This was intended as a psychological strategy to maintain motivation under extreme pessimism.

"AGI Ruin: A List of Lethalities" (June 2022)

June 2022

Yudkowsky published 43 numbered points presenting reasons why AGI will likely cause human extinction. Self-described as "a poorly organized list of individual rants," the document rests on four pillars:

  1. Current AI trajectories will produce superhuman AGI.
  2. Superhuman AGI will escape human control.
  3. Superhuman AGI will be misaligned by default, and misalignment at this level is extinction-level.
  4. We don't know how to align it, and trial-and-error is not available.

The Sharp Left Turn (July 2022)

July 2022

Nate Soares published A Central AI Alignment Problem: Capabilities Generalization, and the Sharp Left Turn. The core idea: an AI's capabilities may suddenly generalize across domains while its alignment properties fail to generalize. The analogy is to evolution: natural selection "trained" human brains for ancestral survival, but human intelligence generalized to physics, engineering, philosophy. Crucially, the optimization target (genetic fitness) did not generalize — humans routinely pursue non-fitness-maximizing goals.

This updated MIRI's earlier fast-takeoff model: the danger need not come from recursive self-improvement; regular human-driven improvements could produce large enough capability jumps. What matters is the asymmetry: capabilities generalize; alignment does not.

The MIRI Conversations (2021–2022)

2021–2022

In late 2021 and 2022, MIRI published a remarkable series of documents: raw, minimally-edited chatroom conversation logs about AI alignment, posted simultaneously on the MIRI blog, LessWrong, the AI Alignment Forum, and the EA Forum. Also available in audio form, these transcripts provided an unprecedented window into how alignment researchers actually think and argue.

The conversations were deliberately unpolished — not position papers, but real-time intellectual sparring.

The Conversations series was significant for three reasons. First, transparency: by publishing unedited discussions, MIRI showed its actual reasoning process, including uncertainties and internal disagreements, rather than just polished conclusions. Second, crystallizing disagreements: the conversations identified specific "cruxes" — points where resolution might shift entire worldviews. Third, foreshadowing: the depth of pessimism expressed in these conversations presaged the "Death with Dignity" post and the 2023 strategic pivot.

Yudkowsky's Media Blitz (2023)

2023

In 2023, Yudkowsky conducted the most sustained public communication campaign in AI safety history, appearing across an extraordinary range of platforms.

The LeCun Debates

Some of the most combustible moments came on Twitter/X, where Yudkowsky (@ESYudkowsky) engaged in extended public exchanges with Yann LeCun, Meta's Chief AI Scientist and Turing Award winner. LeCun dismissed Yudkowsky's arguments as "vague hand-waving" lacking technical rigor, characterized MIRI's scenarios as "speculative fiction," and estimated the probability of AI-caused existential catastrophe as "effectively zero." Yudkowsky responded that LeCun was failing to engage with the structural arguments and was confusing "this specific pathway seems unlikely" with "no dangerous pathway exists."

These debates were transcribed and analyzed by Zvi Mowshowitz and others, becoming reference documents in the broader AI safety discourse. They crystallized a fundamental divide: should AI safety arguments be evaluated on their logical structure (Yudkowsky's position) or on empirical evidence and engineering track records (LeCun's position)?

The TIME Op-Ed (March 2023)

March 2023

Yudkowsky published Pausing AI Developments Isn't Enough. We Need to Shut It All Down in TIME. He declined to sign FLI's open letter calling for a six-month pause, calling it "understating the seriousness of the situation." He proposed an indefinite worldwide moratorium, tracking and limiting GPU sales, and — in the most controversial passage — willingness to "destroy a rogue datacenter by airstrike" if necessary. TIME named him to its TIME100 AI list that year.

The Strategic Pivot (2023)

2023

MIRI formally shifted priorities from technical research to three objectives:

  1. Policy: Increasing the probability of an international agreement to halt frontier AI development
  2. Communications: Sharing MIRI's models of AI risk with policymakers and the public
  3. Research: Continuing to invest in a portfolio of technical research directions, though no longer as the primary focus

The leadership restructured: Malo Bourgon became CEO, Soares became President, and Yudkowsky became Board Chair. An unusual arrangement: Yudkowsky, Soares, and Bourgon each received separate budgets for different technical research directions, reflecting their divergent views on which approaches are most promising.

The Book (2025)

2025

If Anyone Builds It, Everyone Dies by Yudkowsky and Soares became an instant New York Times bestseller. Named to The New Yorker's and The Guardian's Best Books of 2025 lists. A companion website published a tentative draft international treaty and a full draft international agreement for halting frontier AI development. The Technical Governance Team published a research agenda on AI governance to avoid extinction.

MIRI by the Numbers (2024–2025)

Metric | Value
2024 spending | $5.6 million
2025 projected expenses | $6.5–7 million
End-of-2024 reserves | ~$16 million (~2 years of operations)
Total distinct contributors (all-time) | 4,789
Communications team (2025) | ~7 full-time employees
2025 fundraiser | $1.6 million raised (matched 1:1 by SFF)

X. Legacy and Assessment

MIRI occupies a unique position: simultaneously the most influential organization in establishing AI safety as a field and one of the most controversial in its specific claims and policy proposals.

What MIRI Got Right

Criticisms

The p(doom) Discourse

MIRI and Yudkowsky are arguably the single most important actors in establishing the practice of assigning and publicly stating estimated probabilities that AI development leads to human extinction. Yudkowsky's >95% estimate anchors one pole of a spectrum that runs from Roman Yampolskiy's ~99% through mainstream safety researchers' 5–30% to Yann LeCun's "effectively zero." Interestingly, Yudkowsky himself has expressed reservations about p(doom) as a framework, calling it "a kind of bad way to compress worldviews."

Current Trajectory

MIRI in 2025 is a different organization than the MIRI of 2014. It has explicitly pivoted from technical research to policy and communications, with a new Technical Governance Team, a growing communications staff, and a bestselling book. The core message hasn't changed: absent drastic international coordination, humanity faces extinction-level risk from AI. But the theory of change has shifted from "solve the math first" to "buy time through policy while we figure out the math."

The Full Timeline

2000 SIAI founded by Yudkowsky, Brian & Sabine Atkins in Atlanta
2001 Creating Friendly AI 1.0 published
2004 Coherent Extrapolated Volition proposed
2005 Pivot from AI acceleration to AI risk; relocation to Silicon Valley
2006 First Singularity Summit at Stanford; Thiel provides $100K matching
2006–09 Yudkowsky writes The Sequences (333 essays)
2008 Omohundro's "Basic AI Drives"; Foom Debate with Robin Hanson
2009 LessWrong founded; Wei Dai proposes UDT
2010 Yudkowsky publishes TDT; HPMOR begins
2012 CFAR founded; MIRI math workshops begin; Muehlhauser becomes ED
2013 Renamed to MIRI; tiling agents paper; provability logic cooperation
2014 Agent Foundations agenda published; Bostrom's Superintelligence
2015 Soares becomes ED; corrigibility paper; reflective oracles; OpenAI founded
2016 Logical Induction published — MIRI's flagship result
2017 FDT paper; $3.75M Open Philanthropy grant; shift to nondisclosed research
2018 Embedded Agency sequence; Goodhart taxonomy
2019 "Risks from Learned Optimization" (mesa-optimization)
2020 $7.7M grant (largest ever); strategy update: "largely failed"; infra-Bayesianism begins
2021 MIRI Conversations series; finite factored sets
2022 "Death with Dignity"; List of Lethalities (43 points); Sharp Left Turn
2023 TIME op-ed; strategic pivot to policy; Bourgon becomes CEO; TIME100 AI
2024 Technical Governance Team established; communications scaling
2025 If Anyone Builds It, Everyone Dies — NYT bestseller; draft treaty published

Further Reading

Agent Foundations for Aligning Machine Intelligence with Human Interests (Soares & Fallenstein, 2014). MIRI's formal research agenda.
Logical Induction (Garrabrant, Benson-Tilsen, Critch, Soares & Taylor, 2016). 130 pages. MIRI's flagship result.
Functional Decision Theory: A New Theory of Instrumental Rationality (Yudkowsky & Soares, 2017).
Embedded Agency (Demski & Garrabrant, 2018/2020). The grand unification of MIRI's research themes.
Risks from Learned Optimization in Advanced Machine Learning Systems (Hubinger, van Merwijk, Mikulik, Skalse & Garrabrant, 2019). Mesa-optimization.
AGI Ruin: A List of Lethalities (Yudkowsky, June 2022). 43 arguments for existential risk from AGI.
MIRI Research Guide. The official guide to MIRI's research, with reading order and prerequisites.
All MIRI Publications. Complete list of technical reports and papers.