A comprehensive guide to 25 years of research, advocacy, and influence in AI alignment
The Machine Intelligence Research Institute is, by most accounts, the first organization ever dedicated to AI alignment as a research discipline. Its story begins at the turn of the millennium with a self-taught AI researcher who thought humanity's future depended on getting machine intelligence right.
The Singularity Institute for Artificial Intelligence (SIAI) was incorporated on July 27, 2000, by Eliezer Yudkowsky and internet entrepreneurs Brian and Sabine Atkins. Based initially in Atlanta, Georgia, the 501(c)(3) nonprofit received tax-exempt status in April 2001. The original purpose, remarkably, was to accelerate AI development — but with a distinctive caveat. Even from the start, Yudkowsky was concerned that any future AI systems be designed to be beneficial.
Yudkowsky was born September 11, 1979, and is entirely self-taught — he attended neither high school nor college. He has described his early life as marked by an intense autodidactic drive: by his teens he was reading cognitive science, evolutionary psychology, and AI research. He has written that a formative experience was reading about the expected development of AI and realizing, with genuine terror, that nobody seemed to be thinking seriously about whether superintelligent machines would be safe.
His vision was already unusual in 2000: at a time when the AI research community scarcely discussed the possibility of superintelligence, he was thinking about what would happen after machines surpassed human intelligence, and how to make that go well. The early institute was tiny — Yudkowsky writing documents, giving talks, and trying to convince anyone who would listen that this was the most important problem in the world.
SIAI's first technical project was the Flare Programming Language, an open-source "annotative programming language" inspired by Python, Java, and C++. It was abandoned less than a year later. SIAI also released AsimovLaws.com (2004), a website examining AI morality in the context of the "I, Robot" film, and published Levels of Organization in General Intelligence (2002), a preprint about general AI theories. These early projects were modest and exploratory — the organization was still finding its focus.
In 2005, a critical shift occurred. Yudkowsky became increasingly convinced that advanced AI systems could pose existential risks. The institute relocated from Atlanta to Silicon Valley and reoriented its mission from accelerating AI to identifying and managing potential existential risks from AI. This pivot — from "build it" to "make sure it doesn't kill us" — defined MIRI's identity for the next two decades.
The Singularity Summit, co-founded with Ray Kurzweil and Peter Thiel in 2006, became SIAI's flagship public event. Held annually at venues from Stanford University to New York's Nob Hill Masonic Center, the Summits regularly attracted over 800 attendees and featured a remarkable roster of speakers: David Chalmers, Nick Bostrom, Douglas Hofstadter, Stephen Wolfram, Peter Norvig, Rodney Brooks, Max Tegmark, and many others.
Peter Thiel provided crucial early funding, including $100,000 in matching funds for a 2006 donation drive. The Thiel Foundation would eventually give over $1.6 million to MIRI. In December 2012, the Summit was sold to Singularity University, clearing the path for a rebrand.
On January 30, 2013, the Singularity Institute officially became the Machine Intelligence Research Institute. The rename was driven by brand confusion with Singularity University and a desire to emphasize what the organization actually did: technical research. MIRI shed its outreach activities and committed to a single focus: research.
| Person | Role | Period |
|---|---|---|
| Eliezer Yudkowsky | Co-founder, Senior Research Fellow, Board Chair (2023–) | 2000–present |
| Michael Vassar | President | 2009–2012 |
| Luke Muehlhauser | Executive Director | 2012–2015 |
| Nate Soares | Executive Director, then President | 2015–present |
| Malo Bourgon | COO, then CEO | 2023–present |
Muehlhauser's tenure is widely credited with professionalizing the organization. After leaving MIRI in 2015, he joined Open Philanthropy, where he shaped their AI safety grantmaking. Nate Soares, a computer scientist who became Executive Director at age 25, led MIRI through its most technically productive period and later co-authored the 2025 book If Anyone Builds It, Everyone Dies with Yudkowsky.
Before there was "AI alignment," there was "Friendly AI" — Yudkowsky's term for the idea that advanced AI must be deliberately designed to have goals aligned with human values. This section covers the conceptual pillars that MIRI contributed to the field.
In June 2001, Yudkowsky published Creating Friendly AI 1.0: The Analysis and Design of Benevolent Goal Architectures, a book-length document laying out a framework for designing AI whose goals would remain beneficial as the system became more capable. The central thesis: friendliness would not emerge as a natural byproduct of intelligence. It had to be an intentional, structural feature, built in from the ground up.
This was a departure from the prevailing assumption that sufficiently intelligent machines would naturally converge on benevolent behavior, or that safety could be bolted on after the fact.
By 2004, Yudkowsky refined the Friendly AI concept into Coherent Extrapolated Volition (CEV): a superintelligent AI should not act on humanity's current desires, but on what humanity would want "if we knew more, thought faster, were more the people we wished we were, had grown up farther together." CEV aimed to sidestep the problem of encoding any particular set of values by pointing at a process instead. Yudkowsky later cautioned against treating CEV as a practical alignment strategy.
The orthogonality thesis holds that intelligence and final goals are independent dimensions: any level of intelligence can, in principle, be paired with any goal. A superintelligent system could pursue goals that are trivial, bizarre, or catastrophically harmful — its intelligence provides no guarantee of benevolent motivation.
While Nick Bostrom formally named it in his 2012 paper The Superintelligent Will, the underlying intuition was present in Yudkowsky's work from 2001. Its most vivid illustration: the paperclip maximizer, an AI that converts all matter — including humans — into paperclips or paperclip-manufacturing infrastructure.
The paperclip maximizer is perhaps the most misunderstood thought experiment in AI safety. Critics frequently dismiss it as absurd: "Why would anyone program an AI to make paperclips?" But this misses the point entirely. The argument is not that someone would deliberately create such a system. The point is threefold:

- Intelligence does not imply human-compatible goals (orthogonality): a system can be superhumanly capable while pursuing something trivial.
- Almost any goal, optimized hard enough, implies acquiring resources and resisting interference (instrumental convergence).
- The goal need not literally be paperclips: any metric that imperfectly captures human values has the same structure when optimized powerfully.
The thought experiment's simplicity is its strength. By choosing an obviously absurd goal, Yudkowsky forces the listener to confront the structural problem rather than getting distracted by whether any particular goal is "good enough." The real paperclip maximizer is any system that optimizes powerfully for a metric that doesn't perfectly capture human values — which is to say, any system at all.
Steve Omohundro's 2008 paper The Basic AI Drives first systematically argued that sufficiently intelligent agents will converge on certain instrumental sub-goals regardless of their terminal goals: self-improvement, rationality, preservation of their utility functions, self-protection, and resource acquisition.
Bostrom extended and formalized this in Superintelligence (2014). MIRI engaged directly with formalizing instrumental convergence through Tsvi Benson-Tilsen's work and published analysis of Omohundro's drives. The implication is sobering: an AI does not need to be programmed to resist shutdown or acquire power. These behaviors emerge naturally as instrumental strategies for almost any goal.
Yudkowsky's position: A sufficiently capable AI could improve its own cognitive architecture in a rapidly accelerating feedback loop — "FOOM" — potentially surpassing all human intelligence within days or weeks.
Robin Hanson's position: AI progress would be gradual and distributed, like prior economic transformations. No single system would achieve decisive strategic advantage.
The Yudkowsky-Hanson debate played out through blog posts on Overcoming Bias and LessWrong, with contributions from Carl Shulman and James Miller, and a 2011 in-person follow-up. MIRI later published the collected debate as The Hanson-Yudkowsky AI-Foom Debate.
This debate was foundational to MIRI's sense of urgency. If Yudkowsky's fast-takeoff scenario was plausible, alignment needed to be solved before the first sufficiently capable system was built — there would be no opportunity for iterative trial and error after takeoff began. The debate also established a template for how AI safety arguments would be conducted for the next 15 years: scenarios vs. base rates, inside-view reasoning vs. outside-view reasoning, and the question of how much weight to give low-probability, high-consequence outcomes.
Nick Bostrom's 2014 book Superintelligence: Paths, Dangers, Strategies was not a MIRI publication, but MIRI's relationship to it runs deep. Bostrom's treatment of the intelligence explosion, the orthogonality thesis, and instrumental convergence draws extensively on ideas developed by Yudkowsky and the MIRI community. The book's final chapters have been described as resembling "an extended advertisement for MIRI."
Superintelligence became a New York Times bestseller and shifted AI safety from "fringe" to "serious" in mainstream discourse. It influenced Elon Musk's $10 million donation to the Future of Life Institute for AI safety research. Stuart Russell called it "the beginning of a new era." For MIRI, it was vindication: ideas that Yudkowsky had been writing about for over a decade were suddenly being discussed in boardrooms and government offices.
In 2015, Soares, Fallenstein, Yudkowsky, and Stuart Armstrong (Oxford) published Corrigibility, formally defining the problem of building AI that accepts corrections and shutdown. They showed that no proposed utility function satisfied all the desiderata simultaneously. Corrigibility is "anti-natural" for consequentialist reasoning: an agent that cares about outcomes will generally prefer states in which it continues operating. Making an AI genuinely indifferent to its own continuation, while still motivated enough to be useful, remains unsolved.
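To see why outcome-based utility resists corrigibility, consider a minimal sketch (assumed toy payoffs, not the paper's formalism): for any probability that the shutdown button is pressed, disabling the button weakly dominates complying.

```python
# Minimal sketch (assumed payoffs, not the Corrigibility paper's formalism):
# an outcome-maximizing agent prefers to disable its own off switch.

ACTIONS = ["comply_with_shutdown", "disable_button"]

def expected_clips(action: str, p_press: float) -> float:
    """Expected paperclips for an agent that values only paperclips."""
    if action == "comply_with_shutdown":
        # If the button is pressed, the agent halts early at 10 clips;
        # otherwise it runs to completion and makes 100.
        return p_press * 10 + (1 - p_press) * 100
    # Disabling the button means a press never takes effect.
    return 100

for p_press in (0.1, 0.5, 0.9):
    best = max(ACTIONS, key=lambda a: expected_clips(a, p_press))
    print(f"p(press)={p_press}: {best}")
# "disable_button" wins for every p(press) > 0: shutdown-indifference must be
# engineered in, because it does not fall out of outcome-based utility.
```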
Between 2006 and 2009, Yudkowsky wrote 333 essays — first on Overcoming Bias, then on LessWrong (which he founded in February 2009). Known as "The Sequences," these covered cognitive biases, Bayesian epistemology, decision theory, evolutionary psychology, quantum physics, metaethics, and AI risk. They were compiled as Rationality: From AI to Zombies (2015), approximately 2,500 pages.
The Sequences established the intellectual substrate for an entire community. The vocabulary they created — the map-territory distinction, the "outside view," calibration, updating — permeates AI safety discourse to this day.
In a move that no other AI safety organization has ever attempted, Yudkowsky wrote a 660,000-word Harry Potter fanfiction as a vehicle for promoting rationalist thinking and building the AI safety community.
Harry Potter and the Methods of Rationality (HPMOR) reimagines Harry Potter as a scientifically literate child prodigy raised by an Oxford biochemist. Instead of accepting the wizarding world at face value, this Harry applies the scientific method, Bayesian reasoning, cognitive science, and game theory to everything he encounters. The story uses its fantasy setting to dramatize real concepts: the sunk cost fallacy, the bystander effect, planning under uncertainty, the importance of falsifiable hypotheses, and — woven throughout — the existential stakes of creating intelligence you don't fully understand.
Vice described it as "the most popular Harry Potter book you've never heard of." Hugo Award-winning author David Brin reviewed it positively for The Atlantic, calling it "a terrific series, subtle and dramatic and stimulating." It caused uproar in the fanfiction community, drawing both fierce condemnation and passionate devotion, and spawned an entire genre of "rational fiction" (or "ratfic") — stories in which characters solve problems through systematic reasoning rather than authorial convenience.
The cultural footprint is remarkable for what is technically a piece of AI safety outreach: tens of thousands of reader reviews on FanFiction.net, fan translations into multiple languages, a full fan-produced audiobook, and crowdfunded print editions.
This is, as far as anyone can tell, the only time a research nonprofit has used fanfiction as a recruitment strategy. It worked.
MIRI researcher Scott Garrabrant and David Manheim published Categorizing Variants of Goodhart's Law, classifying four ways proxy measures break under optimization: regressional (selecting for a proxy selects for noise), extremal (relationships break at extremes), causal (intervening on a correlated proxy severs the correlation), and adversarial (an optimizer manipulates the proxy). This taxonomy became a standard reference in the alignment community.
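The regressional variant is the easiest to see in simulation. A minimal sketch (a toy Gaussian model, not drawn from the paper): the proxy equals true value plus independent noise, so hard selection on the proxy buys much less true value than the proxy score suggests.

```python
# Regressional Goodhart in miniature: proxy = value + noise. Selecting the
# top 0.1% by proxy selects partly for noise, so realized value falls short.
import random

random.seed(0)
pairs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]
scored = [(value + noise, value) for value, noise in pairs]

top = sorted(scored, reverse=True)[:100]          # optimize the proxy hard
mean_proxy = sum(p for p, _ in top) / len(top)
mean_value = sum(v for _, v in top) / len(top)

print(f"mean proxy score of selected items: {mean_proxy:.2f}")
print(f"mean true value of selected items:  {mean_value:.2f}")
# With equal variances, the true value is roughly half the proxy score:
# the other half of the apparent quality was noise.
```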
MIRI's decision theory research is motivated by a deceptively simple question: how should a rational agent make decisions? The two dominant frameworks in academic philosophy — Causal Decision Theory (CDT) and Evidential Decision Theory (EDT) — each fail on important classes of problems. MIRI produced a series of successor theories that attempt to resolve these failures.
An aligned AI must make decisions that are actually good, not just decisions that satisfy some flawed criterion. Problems involving self-reference, logical correlation, and embedded agency break existing decision theories. A self-modifying agent's decision about how to modify itself is a decision-theory problem.
Timeless Decision Theory treats the agent's decision as the output of an abstract computation — a mathematical function — and asks: "What output of this computation would yield the best expected outcome?" The "timeless" label comes from treating the decision procedure as a fixed mathematical object that exists independently of time.
TDT gets both Newcomb's Problem (one-box, collecting $1M) and the Smoking Lesion (smoke, since your algorithm isn't the common cause) right, where CDT and EDT each fail on one. But TDT struggled with problems involving updating on observations — particularly Counterfactual Mugging.
Wei Dai's Updateless Decision Theory (UDT) rests on a key insight: an ideal agent should commit to a complete policy before observing anything, evaluated from its prior. Rather than updating on observations and then choosing actions, UDT selects the globally optimal strategy and follows it mechanically. This is the principle of "updatelessness."
The difference becomes vivid in Counterfactual Mugging: after a coin comes up tails, an agent is asked for $100. If heads, they would have received $10,000 — but only if Omega predicted they'd pay in the tails case. UDT pays, because the policy "pay if tails" has positive expected value from the prior: 0.5 × $10,000 − 0.5 × $100 = $4,950.
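The same arithmetic, as a runnable sketch (payoffs taken from the problem statement above):

```python
# Expected value of each complete policy, evaluated from the prior -- the
# updateless perspective on Counterfactual Mugging.
P_HEADS = 0.5
REWARD, COST = 10_000, 100

def prior_ev(pays_if_tails: bool) -> float:
    """Omega pays out on heads only if it predicts you'd pay on tails."""
    heads_payout = REWARD if pays_if_tails else 0
    tails_payout = -COST if pays_if_tails else 0
    return P_HEADS * heads_payout + (1 - P_HEADS) * tails_payout

print(prior_ev(True))   # 4950.0 -> UDT commits to paying
print(prior_ev(False))  # 0.0
```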
Functional Decision Theory (FDT) synthesizes TDT and UDT under a single framework. Its central question: "Which output of this very decision function would yield the best outcome?" FDT relies on subjunctive dependence: if two systems implement the same mathematical function, they produce identical outputs for logical rather than causal reasons. An FDT agent can "control" the outputs of systems running the same algorithm, even without causal connection.
| Problem | CDT | EDT | FDT |
|---|---|---|---|
| Newcomb's Problem | Two-boxes ($1K) | One-boxes ($1M) | One-boxes ($1M) |
| Smoking Lesion | Smokes (correct) | Abstains (wrong) | Smokes (correct) |
| Parfit's Hitchhiker | Doesn't pay (dies) | Pays (lives) | Pays (lives) |
| Counterfactual Mugging | Doesn't pay | Depends | Pays |
| Death in Damascus | Infinite loop | Unstable | Stays (saves cost) |
The companion paper "Cheating Death in Damascus" by Levinstein and Soares was published in The Journal of Philosophy in 2020 — a rare bridge between the MIRI/LessWrong decision theory tradition and mainstream academic philosophy.
Devised by physicist William Newcomb of the Livermore Radiation Laboratories in the 1960s, first published by philosopher Robert Nozick in 1969. Two boxes sit before you. Box A is transparent: $1,000. Box B is opaque. A near-perfect predictor (Omega) has already set Box B's contents: $1,000,000 if it predicted you'd take only Box B, $0 if it predicted you'd take both. CDT says take both (the boxes are already set, so you get whatever's in B plus $1,000). FDT says take only Box B (your decision function's output is what Omega predicted, so one-boxers find $1,000,000). Newcomb's Problem has been the central motivating puzzle for MIRI's entire decision theory program. CDT agents consistently walk away with $1,000; FDT agents walk away with $1,000,000.
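A Monte Carlo sketch makes the payoff gap concrete (the 99% predictor accuracy is an assumed parameter standing in for the near-perfect Omega):

```python
# Monte Carlo sketch of Newcomb's Problem against a noisy predictor.
import random

random.seed(1)

def play(one_boxer: bool, accuracy: float = 0.99) -> int:
    """Payoff for one round: the predictor has already filled Box B."""
    predicted_one_box = one_boxer if random.random() < accuracy else not one_boxer
    box_b = 1_000_000 if predicted_one_box else 0
    return box_b if one_boxer else box_b + 1_000

TRIALS = 100_000
for policy in (True, False):
    avg = sum(play(policy) for _ in range(TRIALS)) / TRIALS
    print("one-box" if policy else "two-box", f"${avg:,.0f}")
# One-boxers average ~$990,000; two-boxers average ~$11,000.
```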
You're stranded in the desert, dying. A driver offers to save you if you'll pay $100 once in town. The driver can read your face perfectly — she knows if you'll actually pay. CDT says: once in town, paying cannot retroactively cause your rescue. Don't pay. But the driver, knowing this, drives away. You die in the desert. FDT pays, because drivers who predict FDT agents will pay pick them up. This makes the stakes of decision theory viscerally real: the wrong theory kills you.
From the ancient fable: Death tells you it will come for you tomorrow, having already committed to a location. You can stay in Damascus or flee to Aleppo. CDT enters an infinite loop — if you plan to flee, Death will be in Aleppo; if you stay, Death will be in Damascus. FDT recognizes that Death's prediction will match whatever its decision function outputs. Neither city is safer, so FDT stays, saves the travel cost, and accepts fate with the only available dignity. This problem gave the Levinstein-Soares paper its name.
FDT has met significant resistance in academic philosophy. Wolfgang Schwarz (2018) argued that it relies on counterpossible reasoning with unclear probability assignments. Critics note that FDT sometimes recommends actions that are certainly dominated given what the agent observes (as in Transparent Newcomb), and that there is no objective fact about whether a physical process implements a particular algorithm. The papers have also gained limited traction partly because their presentation engages less with the existing philosophical literature than mainstream venues expect.
Despite this, the Journal of Philosophy publication represented meaningful engagement between the two traditions. And within the alignment community, FDT's framework — identifying with your algorithm, reasoning about subjunctive dependence, committing to policies from behind a veil of ignorance — has become the default way of thinking about decision theory for AI.
In December 2014, Nate Soares and Benja Fallenstein published MIRI's formal technical research agenda: Agent Foundations for Aligning Machine Intelligence with Human Interests. This was MIRI's most concrete statement of what problems needed solving, organized around three categories: highly reliable agent designs, error tolerance, and value specification.
MIRI argued that building reliably safe AI requires a rigorous theoretical understanding of agency, reasoning, and goal-directedness — just as bridge-building requires structural engineering, not just strong intuitions. The agent foundations approach aimed to solve alignment in a way that would remain robust even for a system smarter than its designers.
Standard probability theory assumes logical omniscience: a rational agent already knows all logical consequences of its beliefs. But real agents have bounded resources. They can't instantly determine whether a program halts or whether Goldbach's conjecture is true. Logical uncertainty is the problem of assigning meaningful probabilities to mathematical and logical statements for computationally bounded reasoners. (This led directly to MIRI's flagship result — see Section V.)
Vingean reflection is named after Vernor Vinge's observation that it is impossible to precisely predict agents more intelligent than yourself. The question: how can an agent trust that its successor will pursue the same goals, when it cannot predict what the successor will do?
This runs into deep barriers from mathematical logic:

- Gödel's second incompleteness theorem: no consistent formal system strong enough for arithmetic can prove its own consistency.
- Löb's theorem: if a system proves "if P is provable here, then P is true," it thereby proves P outright, so a system cannot assert its own soundness without collapsing.

The resulting Löbian obstacle means an agent using formal system T cannot conclude that successors using T will only take good actions. The agent can't trust its own reasoning system, and therefore can't trust any successor using it. Naive approaches lead to the Procrastination Paradox: the agent indefinitely defers important tasks because it can never prove its successor will do the right thing.
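Löb's theorem itself fits on one line of provability logic, where □P reads "P is provable in the system":

```latex
% Löb's theorem, schematically: for every sentence P,
%   proving "provability of P implies P" yields a proof of P itself.
\vdash \Box(\Box P \rightarrow P) \rightarrow \Box P
```

This is why an agent cannot adopt "whatever I prove is true" as a reasoning principle: granting itself the soundness schema □P → P for every P would, by Löb's theorem, force it to prove every P, including falsehoods.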
Standard frameworks (Solomonoff induction, AIXI) assume a clean Cartesian boundary between agent and environment. The agent sits "outside" the world. But real agents are part of the world they reason about. They can be modeled, predicted, modified, or destroyed by their environment. Naturalized induction asks: what is the correct theory of reasoning for an agent embedded in its environment?
An AI's goals are specified in terms of some ontology — a set of concepts. But as the AI becomes more capable, its world-model changes radically. If you program an AI to "create diamond," defined as carbon atoms in a crystal lattice, and it later discovers nuclear physics and realizes atoms aren't fundamental — what counts as "diamond" in the new ontology? This is the problem of ontology identification: preserving goals across radical changes in world-model.
A recurring theme in MIRI's work is the fragility of value — Yudkowsky's argument that human values are like a delicate structure where removing or distorting even a single component produces catastrophe. An AI that optimizes for "human happiness" but lacks the concept of "consent" could tile the universe with brains forcibly stimulated to experience pleasure. An AI that preserves everything humans care about except "boredom" could trap us in infinite loops of stimulation. The space of "things that go right" is tiny compared to the space of "things that go subtly wrong."
This fragility argument is part of why MIRI believes alignment is so difficult. It's not enough to get values approximately right. With a sufficiently powerful optimizer, "approximately right" can mean "catastrophically wrong."
Jessica Taylor led a parallel research program: Alignment for Advanced Machine Learning Systems (2016), focused on alignment strategies more directly applicable to ML systems. She also authored the quantilizers paper, introducing an agent design that selects a random action from the top n% of some reference distribution of actions, as a compromise between imitating a human and maximizing expected utility. Rather than finding the single best action (which invites Goodhart), a quantilizer samples randomly from the top slice — "be pretty good but don't try too hard." This was a notable early attempt at "satisficing" approaches to alignment and has been cited as a precursor to various reward-shaping and constrained optimization techniques.
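Quantilization is simple enough to sketch in a few lines (assumed toy action space and utility; the paper's treatment is more general):

```python
# Quantilization in miniature: sample uniformly from the top q fraction of a
# reference distribution of actions, instead of taking the argmax.
import random

random.seed(2)

def quantilize(actions, utility, q=0.05):
    """Pick a random action from the top q fraction ranked by utility."""
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return random.choice(ranked[:cutoff])

actions = [random.gauss(0, 1) for _ in range(10_000)]  # stand-in action space
u = lambda a: a                                        # stand-in utility

print(f"maximizer picks:   {max(actions, key=u):.2f}")   # the extreme tail
print(f"quantilizer picks: {quantilize(actions, u):.2f}")  # good, not extremal
```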
If there is a single result that represents MIRI's most significant technical contribution, it is the Logical Induction paper — a 130-page proof that a computable process can assign probabilities to logical sentences in a way that satisfies a remarkable array of desirable properties simultaneously.
The development story is one of gradual convergence. Scott Garrabrant started working on logical uncertainty with Abram Demski in a MIRIx group in April 2014. Two precursor papers (April 2016) divided logical uncertainty into two seemingly incompatible subproblems: recognizing patterns in provability, and recognizing statistical patterns in sequences of logical claims.
In February 2016, while walking to work, Garrabrant realized that both subproblems are special cases of a single criterion: a sequence of probability assignments such that no polynomial-time gambler with finite starting capital can achieve unbounded profits. That same day, he and Jessica Taylor proved the core result.
The framework uses an elegant market analogy. Each logical sentence φ is a stock worth $1 if φ is true and $0 if false. The logical inductor's probability assignment is the stock's market price. A trader is a polynomial-time computable function that decides what to buy or sell based on current prices. If any trader can make unbounded money, the beliefs are systematically wrong.
A belief sequence is a logical inductor if no efficiently computable trader can exploit it for unlimited profit. This single criterion — analogous to the efficient market hypothesis — gives rise to all the other properties.
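A toy sketch conveys the flavor (this is the exploitation check only, not the Garrabrant inductor itself): suppose sentence phi is eventually proved, so its shares settle at $1. A market that never raises phi's price hands a trivial buy-and-hold trader unbounded profit; a market whose price converges to $1 quickly enough does not.

```python
# Toy exploitation check (assumed setup, not the actual construction):
# buy $1 of phi-shares each day at the market price; phi is eventually
# proved, so every share settles at $1.
def buy_and_hold_profit(prices, budget_per_day=1.0):
    shares = sum(budget_per_day / p for p in prices)   # shares accumulated
    cost = budget_per_day * len(prices)                # dollars spent
    return shares * 1.0 - cost                         # settle at $1 each

stubborn = [0.5] * 10_000                                    # never updates
learning = [1 - 0.5 / (t + 1) ** 2 for t in range(10_000)]   # converges to 1

print(f"{buy_and_hold_profit(stubborn):.1f}")   # ~10000: grows without bound
print(f"{buy_and_hold_profit(learning):.1f}")   # ~1.3: stays bounded
# Unbounded profit means the stubborn prices are exploitable, so they fail
# the logical-inductor criterion; the converging prices pass this test.
```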
All twelve desiderata follow from the single inexploitability criterion, among them: convergence and coherence in the limit, timely provability induction, learning of statistical patterns, calibration, and forms of self-knowledge and self-trust.
The algorithm is computable but extremely slow — it is an existence proof and theoretical benchmark, not a practical system. All properties hold asymptotically with no finite-time guarantees, and the algorithm cannot target specific decision-relevant claims. Logical induction is in a position analogous to Solomonoff induction for empirical uncertainty: it tells us what's achievable in principle, but not how to build efficient systems with those properties.
Sam Eisenstat showed that logical inductors can solve a version of the tiling problem (2018) — agents based on logical induction can trust successors that also use it. His "untrollable prior" result demonstrated that Bayesian approaches to logical uncertainty may be more viable than previously thought. The Computerphile YouTube channel popularized the ideas for a broad audience in October 2018.
In October 2018, Abram Demski and Scott Garrabrant published the Embedded Agency sequence — a refactoring of MIRI's entire Agent Foundations agenda into a single unified mystery. Where the previous agenda presented problems as "here are a bunch of things we wish we understood about aligning AI," embedded agency repackaged them as "here is a central mystery of the universe, and here are the things we don't understand about it."
Classical AI frameworks — AIXI, Solomonoff induction, standard Bayesian decision theory — assume a sharp Cartesian boundary between agent and environment. The agent sits outside the world, like a video game player with a controller.
Demski and Garrabrant introduce two characters:

- Alexei, who plays a video game from outside it. The screen is his entire environment, his reasoning happens off-screen, and the agent/environment boundary is crisp.
- Emmy, who is embedded in the real world. She is made of the same stuff as her environment, is smaller than it, and is part of what she must reason about.

Every real agent is an Emmy. The problem: we do not have a satisfactory formal theory of how Emmys should reason and act.
An embedded agent can't "step outside" to compute optimal actions. It must reason about its own actions from within the system. This requires handling logical counterfactuals (what would happen if a mathematical function returned a different output?) and Newcomblike problems (where the environment contains predictions of your behavior).
The agent is smaller than its environment and cannot maintain a complete model. The true environment includes the agent, so the true hypothesis can't be in the agent's hypothesis space without the agent containing a complete model of itself — which is impossible. The real world is not in the agent's hypothesis space. Standard Bayesian updating breaks.
An agent that improves itself or creates successors must ensure the successor pursues the original goals — even though the successor may be much more intelligent. This is where Vingean reflection and the Löbian obstacle bite hardest.
An embedded agent is made of parts, and those parts may pursue their own objectives. A powerful optimization process searching a large space may find solutions that are themselves optimizers — optimization daemons or mesa-optimizers — with their own goals. This concern was later formalized in the influential "Risks from Learned Optimization" paper (Hubinger et al., 2019, with Garrabrant as co-author).
Embedded agency is the centerpiece of MIRI's deconfusion research strategy. By "deconfusion," Soares means "making it so that you can think about a given topic without continuously accidentally spouting nonsense." The strategy rests on the claim that researchers remain too confused about agency, optimization, and intelligence to state the alignment problem precisely, let alone solve it, and that this kind of conceptual groundwork has historically preceded reliable engineering.
This strategy has been criticized by researchers favoring empirical approaches (notably Paul Christiano and others at ARC and Anthropic). But regardless of strategic disagreement, the Embedded Agency sequence's four-part decomposition has become a standard reference in alignment research.
Garrabrant followed up with Cartesian Frames, a mathematical framework for reasoning about agent-environment boundaries. A Cartesian frame is a triple (A, E, *) where A and E are sets and * maps agent-environment pairs to world states. Different frames for the same world correspond to different ways of drawing the boundary. MIRI research associate Ramana Kumar formalized the framework in higher-order logic with machine-verified proofs.
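A minimal concrete frame, following the triple definition above (the world and option sets are assumed toys):

```python
# A Cartesian frame (A, E, star) over a four-state world: the agent chooses
# an umbrella policy, the environment chooses the weather.
from itertools import product

A = ["take_umbrella", "leave_umbrella"]   # agent's options
E = ["rain", "sun"]                       # environment's options

def star(a: str, e: str) -> str:
    """Map each agent/environment pair to a world state."""
    return f"{a}|{e}"

world = {star(a, e) for a, e in product(A, E)}
print(sorted(world))
# A different frame on the same `world` -- say, one where the "agent" is the
# weather -- draws the agent/environment boundary differently, which is the
# framework's point.
```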
Beneath the alignment theory and decision theory lies a sustained program of mathematical research — workshops, collaborations with academic mathematicians, and formal results that pushed the boundaries of provability theory, modal logic, and the foundations of causality.
Beginning in late 2012, MIRI ran a series of Workshops on Logic, Probability, and Reflection, bringing together academic mathematicians, computer scientists, and independent researchers. In 2013 alone, the workshops drew 35 participants, 15 of them with PhDs (9 in mathematics), from Harvard, MIT, Princeton, Stanford, UC Berkeley, UC Riverside, LMU Munich, and Google.
The workshops produced concrete results: the "Robust Cooperation in the Prisoner's Dilemma" paper emerged from the April 2013 workshop. MIRI also published "Recommended Courses for MIRI Math Researchers" (March 2013), which became an influential resource directing aspiring alignment researchers toward relevant mathematical background — covering computability theory, model theory, provability logic, algorithmic information theory, and other areas that most CS curricula don't emphasize.
In 2014, Soares and Fallenstein created Botworld, a toy computational environment designed for studying self-modifying agents. In Botworld, agents are simple programs that can inspect and modify their own source code, create copies of themselves, and interact with other agents — all in a tractable formal setting. It served as a testbed for thinking about problems like self-reference, resource acquisition, and goal preservation across self-modification.
MIRI launched the MIRIx program to support independent study groups working on MIRI-adjacent research problems. These groups — spread across multiple cities — operated autonomously but stayed in contact with MIRI staff. The first MIRIx group, started by Abram Demski and Scott Garrabrant in April 2014, was specifically focused on logical uncertainty and eventually produced the work that led to the Logical Induction paper.
The MIRI Summer Fellows Program (MSFP) brought promising researchers to MIRI for intensive summer residencies. The AIRCS (AI Risk for Computer Scientists) workshops introduced ML researchers to alignment concepts. By 2019, these programs had largely replaced the earlier math workshop model as the organization's primary talent pipeline.
A tiling agent is one whose decision system approves the construction of similar agents, creating a repeating pattern — including preservation of goals. The key question: can an agent verify that successors will behave well, including building their own successors well, tiling outward indefinitely?
The straightforward approach hits Gödel: if agent A1, reasoning in formal system F, wants to verify that a successor A2 (also using F) will preserve its properties, it cannot, because F can't prove its own consistency. If A1 instead verifies A2 in a strictly weaker system, tiling works, but the tower descends: each generation of agents can verify less than the one before. MIRI nonetheless called tiling agents "perhaps the single most well-motivated approach to theoretical AI safety."
MIRI adopted provability logic (GL) for studying strategic interactions between agents that can read each other's source code — "open-source game theory." In modal combat, agents are modal formulas whose free variables represent opponents' source code. Evaluation uses modal fixpoint theorems.
The paper Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic (Barasz, Christiano, Fallenstein, Herreshoff, LaVictoire, Yudkowsky, 2014) showed that modal agents achieve cooperative equilibria outperforming classical Nash equilibria — robust and agnostic to implementation details.
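A heavily simplified sketch conveys the idea (bounded recursive simulation stands in for the paper's provability-logic proof search, and the depth-0 default to cooperate mimics the Löbian shortcut that lets FairBot cooperate with itself):

```python
# "Open-source" Prisoner's Dilemma in miniature. Agents see each other's
# source (here: each other's function) plus a proof budget (recursion depth).

def defect_bot(opponent, depth):
    return "D"

def cooperate_bot(opponent, depth):
    return "C"

def fair_bot(opponent, depth):
    """Cooperate iff the opponent can be 'shown' to cooperate against me."""
    if depth == 0:
        return "C"   # budget exhausted: the Löb-flavored base case
    return "C" if opponent(fair_bot, depth - 1) == "C" else "D"

for opponent in (defect_bot, cooperate_bot, fair_bot):
    print(f"fair_bot vs {opponent.__name__}: {fair_bot(opponent, depth=3)}")
# Defects against DefectBot, cooperates with CooperateBot and with itself --
# the robust-cooperation pattern the paper establishes rigorously.
```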
Reflective oracles are a new type of oracle that can answer questions about oracle machines with access to the same oracle, avoiding diagonalization by answering some queries randomly. Key results: reflective oracles always exist; agents using reflective-oracle-based causal decision theory play Nash equilibria; and there is a one-to-one correspondence between reflective oracles and Nash equilibria.
Andrew Critch extended Löb's theorem to bounded contexts — systems with limited memory and processing speed. The parametric version holds for proofs of at most n characters. Implication for game theory: bounded agents capable of writing proofs about each other can outperform classical Nash equilibria via "Löbian cooperation." Published in The Journal of Symbolic Logic.
Standard Bayesian learning assumes realizability — the true environment is in the agent's hypothesis class. This almost never holds. Infra-Bayesianism drops this assumption, replacing probability distributions with infradistributions (convex sets of generalized distributions) that introduce Knightian uncertainty — irreducible uncertainty that can't be expressed as probabilities.
Under Knightian uncertainty, the agent maximizes expected utility against the worst possible environment (maximin). Infra-Bayesianism formally solves Newcomb's problem and Counterfactual Mugging in a fully rigorous way, handles multi-agent self-referential reasoning, and provides tools for embedded agency where realizability fails. Infra-Bayesian Physicalism (2021) extends the framework to naturalized induction — what "learning the universe program" means when the agent is a subprocess of the universe.
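The decision rule in miniature (a sketch over an assumed two-environment payoff table; real infradistributions are convex sets of generalized distributions, not finite grids):

```python
# Maximin over a credal set: instead of one prior, the agent holds a *set*
# of admissible distributions and picks the action whose worst-case
# expectation is best.
ACTIONS = ["cautious", "bold"]
PAYOFF = {"cautious": {"benign": 2, "adversarial": 2},
          "bold":     {"benign": 10, "adversarial": -100}}

# Knightian uncertainty: p(benign) is only known to lie in [0.2, 0.9].
CREDAL_SET = [0.2, 0.5, 0.9]   # finite stand-in for the convex set

def worst_case_eu(action):
    return min(
        p * PAYOFF[action]["benign"] + (1 - p) * PAYOFF[action]["adversarial"]
        for p in CREDAL_SET
    )

print(max(ACTIONS, key=worst_case_eu))  # "cautious": maximin buys robustness
```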
Finite factored sets, introduced by Garrabrant in 2021, are a new mathematical foundation for causal and temporal reasoning, intended as a successor to Judea Pearl's DAG-based framework. A finite factored set is simply a finite set expressed as a Cartesian product of component sets.
The core philosophical insight: time is not assumed but derived from the factored structure. Two partitions are "orthogonal" if determined by independent factors. "Conditional orthogonality" plays the role of Pearl's d-separation. The fundamental theorem: conditional orthogonality equals conditional independence in all probability distributions on the factored set.
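A minimal instance (an assumed toy factorization): take S = B × C with two binary factors. The partitions "value of B" and "value of C" are determined by disjoint factors, hence orthogonal, and under any product distribution they come out statistically independent, matching the fundamental theorem.

```python
# S = B x C with binary factors. Three partitions of S:
import random
from itertools import product

S = list(product([0, 1], [0, 1]))   # the factored set

first = lambda s: s[0]              # determined by factor B alone
second = lambda s: s[1]             # determined by factor C alone
xor = lambda s: s[0] ^ s[1]         # needs both factors: orthogonal to neither

# Independence check under a product distribution P(B=1)=0.7, P(C=1)=0.3.
random.seed(3)
draws = [(random.random() < 0.7, random.random() < 0.3) for _ in range(200_000)]
p_b = sum(b for b, _ in draws) / len(draws)
p_c = sum(c for _, c in draws) / len(draws)
p_bc = sum(b and c for b, c in draws) / len(draws)
print(f"P(B)P(C) = {p_b * p_c:.3f}, P(B,C) = {p_bc:.3f}")  # approximately equal
```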
Garrabrant argues that Pearl's framework "cheats" because "given a collection of variables" hides a lot of work. Finite factored sets aim to handle deterministic functions, play better with abstraction, and provide a path toward continuous-time causal models.
MIRI's most consequential impact may not be any single theorem but the fact that AI alignment exists as a field at all. This section traces the lines of influence.
In 2014, the entire field of AI safety work was essentially done at two organizations — the Future of Humanity Institute and MIRI — spending roughly $1.75 million per year total. From this base, the field has expanded into a global ecosystem with hundreds of millions in annual funding.
Sam Altman stated that Yudkowsky was "critical in the decision to start OpenAI." Yudkowsky's writings convinced Elon Musk of the seriousness of AI risk, and Musk subsequently co-founded OpenAI. While Yudkowsky was not directly involved, LessWrong was widely read by early OpenAI engineers.
At an afterparty for the 2010 Singularity Summit, Yudkowsky personally introduced DeepMind co-founders Demis Hassabis and Shane Legg to Peter Thiel. Within weeks, Thiel provided their first major investment. Shane Legg, familiar with rationalist AI safety arguments, went on to co-lead DeepMind's safety efforts.
The Center for Applied Rationality (CFAR) was founded in 2012 by Julia Galef, Anna Salamon, Michael Smith, and Andrew Critch. It originated directly as an extension of MIRI — Anna Salamon had been doing rationality training and onboarding for MIRI staff and realized that developing exercises for training rationality was worth pursuing as its own discipline.
MIRI provided original funding. The organizations shared office space in Berkeley. CFAR workshops — intensive multi-day retreats teaching Bayesian reasoning, cognitive debiasing, and decision-making techniques — became a key talent pipeline for MIRI and the broader AI safety community. Many people who went on to work at AI safety organizations attended a CFAR workshop first.
Together, MIRI, LessWrong, and CFAR formed the institutional core of what became known as the rationalist community, centered in the San Francisco Bay Area (especially Berkeley). This community was characterized by emphasis on Bayesian reasoning, deep concern about AI existential risk, geographic clustering near MIRI and CFAR offices, and significant overlap with the Effective Altruism movement.
The community has not been without controversy. In 2025, CFAR president Anna Salamon told NBC News: "We didn't know at the time, but in hindsight we were creating conditions for a cult" — though this was described as referring to cult-adjacent spin-off communities rather than MIRI or CFAR themselves. The comment reflected broader media scrutiny of social dynamics within the rationalist community. Defenders noted that CFAR explicitly taught techniques for resisting groupthink and authority bias, which is an unusual feature for an alleged cult.
AI safety became one of EA's top cause areas largely because of MIRI's early advocacy. The pipeline runs both ways: Open Philanthropy, the largest EA funder, gave MIRI over $13 million in grants between 2016 and 2020. Luke Muehlhauser, MIRI's former ED, now leads Open Philanthropy's AI governance grantmaking. EA funders have collectively directed approximately half a billion dollars to AI safety.
| Person | MIRI Role | Subsequent Role |
|---|---|---|
| Luke Muehlhauser | Executive Director | Open Philanthropy (AI governance) |
| Anna Salamon | Researcher | Co-founded CFAR; President |
| Paul Christiano | Workshop participant | OpenAI alignment team; founded ARC |
| Andrew Critch | Researcher | UC Berkeley CHAI; Encultured AI |
| Redwood Research founders | Outreach / researcher | Empirical alignment research |
MIRI's story takes a dark turn in the 2020s. A series of events — the acknowledged failure of their main research program, Yudkowsky's increasingly public despair, and the rapid advancement of large language models — led to a dramatic organizational pivot.
In 2017, MIRI made a controversial decision: it adopted a nondisclosed-by-default research policy, meaning most new results would not be published. The rationale was that some alignment insights could inadvertently accelerate AI capabilities if released openly — that the field faced an infohazard problem. If you discover something about how to make AI systems reason more reliably, that same insight might help someone build a more capable but still misaligned system.
The policy was deeply controversial. Critics, including Ajeya Cotra at Open Philanthropy, argued that secrecy made it harder for others to build on MIRI's work, check it for errors, or contribute to the field. From the outside, MIRI appeared to be a black box consuming millions in donations with no visible output. Supporters countered that the analogy to infohazards in biosecurity was apt: you don't publish the genome of a more dangerous pathogen just because withholding it frustrates other researchers.
In December 2020, Nate Soares published a strategy update acknowledging that MIRI's primary non-public research project had "largely failed." Neither he nor Yudkowsky had "sufficient hope in it for us to continue focusing our main efforts there." The nondisclosed research era had not produced the hoped-for breakthroughs.
This was a watershed moment. MIRI had spent three years on a research direction it could not publicly describe, funded by donors who trusted the organization's judgment, and was now admitting that the effort had not worked. The update was remarkably candid — Soares acknowledged that MIRI needed to "cast about for new research directions" and that the failure had shaken the organization's confidence in its own strategic judgment. For critics, it vindicated concerns about the nondisclosed policy. For supporters, the honesty was itself a form of integrity.
On April 1, 2022 (the timing deliberate but the content deadly serious), Yudkowsky published MIRI Announces New "Death With Dignity" Strategy. The post conveyed his personal assessment that the probability of human survival was approximately zero percent.
"It's obvious at this point that humanity isn't going to solve the alignment problem, or even try very hard, or even go out with much of a fight."
The reframing: instead of optimizing for absolute survival probability, optimize for "dignity" — making stackable log-odds improvements even when overall probability is near zero. This was intended as a psychological strategy to maintain motivation under extreme pessimism.
In AGI Ruin: A List of Lethalities, Yudkowsky published 43 numbered points presenting reasons why AGI will likely cause human extinction. Self-described as "a poorly organized list of individual rants," the document rests on four pillars: alignment must work on the first critical try, with no opportunity to learn from fatal failure; current techniques do not scale to superintelligent systems; capabilities progress is outpacing alignment progress; and neither labs nor governments are on course to change this.
Nate Soares published A Central AI Alignment Problem: Capabilities Generalization, and the Sharp Left Turn. The core idea: an AI's capabilities may suddenly generalize across domains while its alignment properties fail to generalize. The analogy is to evolution: natural selection "trained" human brains for ancestral survival, but human intelligence generalized to physics, engineering, philosophy. Crucially, the optimization target (genetic fitness) did not generalize — humans routinely pursue non-fitness-maximizing goals.
This updated MIRI's earlier fast-takeoff model: the danger need not come from recursive self-improvement; regular human-driven improvements could produce large enough capability jumps. What matters is the asymmetry: capabilities generalize; alignment does not.
In late 2021 and 2022, MIRI published a remarkable series of documents: raw, minimally-edited chatroom conversation logs about AI alignment, posted simultaneously on the MIRI blog, LessWrong, the AI Alignment Forum, and the EA Forum. Also available in audio form, these transcripts provided an unprecedented window into how alignment researchers actually think and argue.
The conversations were deliberately unpolished — not position papers, but real-time intellectual sparring. Key threads included the extended Yudkowsky-Ngo exchanges on why alignment is hard, the Yudkowsky-Christiano debate over takeoff speeds, and running commentary from Soares and others on where the participants' worldviews actually diverge.
The Conversations series was significant for three reasons. First, transparency: by publishing unedited discussions, MIRI showed its actual reasoning process, including uncertainties and internal disagreements, rather than just polished conclusions. Second, crystallizing disagreements: the conversations identified specific "cruxes" — points where resolution might shift entire worldviews. Third, foreshadowing: the depth of pessimism expressed in these conversations presaged the "Death with Dignity" post and the 2023 strategic pivot.
In 2023, Yudkowsky conducted the most sustained public communication campaign in AI safety history, appearing on an extraordinary range of platforms: the Bankless podcast, the Lex Fridman Podcast, a TED talk, EconTalk, and a stream of mainstream press interviews.
Some of the most combustible moments came on Twitter/X, where Yudkowsky (@ESYudkowsky) engaged in extended public exchanges with Yann LeCun, Meta's Chief AI Scientist and Turing Award winner. LeCun dismissed Yudkowsky's arguments as "vague hand-waving" lacking technical rigor, characterized MIRI's scenarios as "speculative fiction," and estimated the probability of AI-caused existential catastrophe as "effectively zero." Yudkowsky responded that LeCun was failing to engage with the structural arguments and was confusing "this specific pathway seems unlikely" with "no dangerous pathway exists."
These debates were transcribed and analyzed by Zvi Mowshowitz and others, becoming reference documents in the broader AI safety discourse. They crystallized a fundamental divide: should AI safety arguments be evaluated on their logical structure (Yudkowsky's position) or on empirical evidence and engineering track records (LeCun's position)?
Yudkowsky published Pausing AI Developments Isn't Enough. We Need to Shut It All Down in TIME. He declined to sign FLI's open letter calling for a six-month pause, calling it "understating the seriousness of the situation." He proposed an indefinite worldwide moratorium, tracking and limiting GPU sales, and — in the most controversial passage — willingness to "destroy a rogue datacenter by airstrike" if necessary. TIME named him to its TIME100 AI list that year.
MIRI formally shifted priorities from technical research to three objectives: communications (making the case for extinction-level risk bluntly and publicly), policy and governance (laying groundwork for an international halt to frontier AI development), and a continuing but smaller technical research program.
The leadership restructured: Malo Bourgon became CEO, Soares became President, and Yudkowsky became Board Chair. An unusual arrangement: Yudkowsky, Soares, and Bourgon each received separate budgets for different technical research directions, reflecting their divergent views on which approaches are most promising.
If Anyone Builds It, Everyone Dies by Yudkowsky and Soares became an instant New York Times bestseller. Named to The New Yorker's and The Guardian's Best Books of 2025 lists. A companion website published a tentative draft international treaty and a full draft international agreement for halting frontier AI development. The Technical Governance Team published a research agenda on AI governance to avoid extinction.
| Metric | Value |
|---|---|
| 2024 spending | $5.6 million |
| 2025 projected expenses | $6.5–7 million |
| End-of-2024 reserves | ~$16 million (~2 years of operations) |
| Total distinct contributors (all-time) | 4,789 |
| Communications team (2025) | ~7 full-time employees |
| 2025 fundraiser | $1.6 million raised (matched 1:1 by the Survival and Flourishing Fund) |
MIRI occupies a unique position: simultaneously the most influential organization in establishing AI safety as a field and one of the most controversial in its specific claims and policy proposals.
MIRI and Yudkowsky are arguably the single most important actors in establishing the practice of assigning and publicly stating estimated probabilities that AI development leads to human extinction. Yudkowsky's >95% estimate anchors one pole of a spectrum that runs from Roman Yampolskiy's ~99% through mainstream safety researchers' 5–30% to Yann LeCun's "effectively zero." Interestingly, Yudkowsky himself has expressed reservations about p(doom) as a framework, calling it "a kind of bad way to compress worldviews."
MIRI in 2025 is a different organization than the MIRI of 2014. It has explicitly pivoted from technical research to policy and communications, with a new Technical Governance Team, a growing communications staff, and a bestselling book. The core message hasn't changed: absent drastic international coordination, humanity faces extinction-level risk from AI. But the theory of change has shifted from "solve the math first" to "buy time through policy while we figure out the math."