Alignment remains unsolved at scale
A survey of the most pressing questions in artificial intelligence alignment, gathered like well-worn volumes on a mantelpiece and illuminated by the steady glow of inquiry.
Current steering techniques for large language models show promise in narrow settings but have not been demonstrated to hold under recursive self-improvement or open-ended agency.
Mechanistic interpretability has uncovered meaningful internal structures in transformer models, yet we can only explain a small fraction of any frontier model's behavior.
International frameworks remain fragmented. Voluntary commitments from major labs have not been matched by enforceable standards, leaving a widening oversight gap.
Competitive dynamics reward rapid capability gains. Safety research constitutes less than 3% of total AI R&D spending across leading organizations.
Interpretability: Tracing the circuits of thought within neural networks, understanding how features form, compose, and occasionally deceive.
Alignment: From RLHF to debate protocols, the evolving toolkit for ensuring advanced systems pursue human-intended objectives.
Governance: International frameworks, compute governance, and the institutional architecture needed to steward transformative AI.
Oversight: Can humans supervise systems smarter than themselves? Exploring recursive reward modeling, debate, and market-based approaches.
X-Risk: Modeling the probability and pathways of catastrophic outcomes from advanced AI, and the interventions that might reduce them.
Values: The deep challenge of specifying what we want: inverse reward design, cooperative inverse RL, and the philosophy of human values.