An Illustrated Compendium

The Architecture of Safe Intelligence

A survey of alignment research rendered in the spirit of the beautiful line — where organic form meets structured inquiry, and every curve serves a purpose.

Principal Observations

I

Alignment remains unsolved at scale

Current techniques for steering large language models show promise in narrow settings but have not been demonstrated to hold under recursive self-improvement or open-ended agency.

II

Interpretability advances but does not yet suffice

Mechanistic interpretability has uncovered meaningful internal structures, yet we can only explain a small fraction of any frontier model's behavior in human-legible terms.

III

Governance lags capability by years

International AI governance frameworks remain fragmented. Voluntary commitments have not been matched by enforceable standards, leaving a widening gap between capability and oversight.

IV

Economic incentives favor speed over caution

Competitive dynamics reward rapid capability gains. Safety-focused research accounts for roughly 2.7% of total AI R&D spending across leading organizations.

The Present Landscape, in Figures

347 Published Papers
2.7% Budget for Safety
58 Research Groups
12 Governance Proposals

Research Domains

Mechanistic Interpretability

Tracing the circuits of thought within neural networks, understanding how features form, compose, and occasionally deceive.

Interpretability

Alignment Techniques

From RLHF to debate protocols, the evolving toolkit for ensuring advanced systems pursue the objectives their creators intend; the preference-learning objective at the core of RLHF is sketched after this card.

Alignment
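
A note on the first item in that toolkit. RLHF typically begins by training a reward model on pairwise human preferences; a minimal sketch of the standard Bradley-Terry objective, where r_\phi is the learned reward, x a prompt, and y_w, y_l the preferred and rejected responses:

\mathcal{L}(\phi) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]

The policy is then optimized against r_\phi, usually with a KL penalty toward the base model to discourage reward hacking; the exact variant differs across labs.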

AI Governance

International frameworks, compute governance, and the institutional architecture needed to steward transformative AI development responsibly.

Governance

Scalable Oversight

Can humans supervise systems smarter than themselves? Exploring recursive reward modeling, debate, and market-based approaches.

Oversight

Existential Risk Modeling

Estimating the probability and pathways of catastrophic outcomes from advanced AI, and the interventions that could reduce them.

X-Risk

Value Learning

The deep challenge of specifying what we want: inverse reward design, cooperative inverse RL (sketched after this card), and the philosophy of human values.

Values
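
Of the approaches named above, cooperative inverse RL admits a crisp formal statement. As introduced by Hadfield-Menell et al. (2016), it is a two-player game between a human H and a robot R, both rewarded by the human's reward function r, whose parameters \theta only the human observes:

M \;=\; \big\langle S,\; \{A^H, A^R\},\; T(\cdot \mid s, a^H, a^R),\; \Theta,\; r(s, a^H, a^R; \theta),\; P_0(s_0, \theta),\; \gamma \big\rangle

Because both players maximize the same return \sum_t \gamma^t\, r(s_t, a^H_t, a^R_t; \theta), the robot's optimal policy involves inferring \theta from the human's behavior rather than optimizing a fixed proxy objective.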