A survey of alignment research rendered in the spirit of the beautiful line — where organic form meets structured inquiry, and every curve serves a purpose.
Current techniques for steering large language models show promise in narrow settings but have not been demonstrated to hold under recursive self-improvement or open-ended agency.
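To ground that claim, the sketch below shows the core of the most widely deployed steering technique, reward modeling for RLHF: a Bradley-Terry pairwise loss trains a scorer to rank human-preferred responses above rejected ones. The tiny MLP, the 16-dimensional "response embeddings", and the random data are illustrative assumptions, not any production pipeline.

```python
# Minimal sketch of RLHF reward-model training with a Bradley-Terry
# pairwise preference loss. All dimensions and data are synthetic.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (toy) fixed-size response embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): pushes the model to score the
    # human-preferred response above the rejected one in each pair.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic "embeddings" standing in for pairs of model responses.
chosen, rejected = torch.randn(64, 16), torch.randn(64, 16)
for _ in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final preference loss: {loss.item():.3f}")
```

In a real pipeline the scorer sits on top of a language model and its outputs then steer a policy via PPO or a similar method; this sketch isolates only the preference loss, which is the part the narrow-setting evidence actually tests.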
Mechanistic interpretability has uncovered meaningful internal structures, yet we can only explain a small fraction of any frontier model's behavior in human-legible terms.
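One concrete tool behind that partial progress is the linear probe: fit a simple classifier on a layer's activations to test whether a concept is linearly decodable. Everything below is synthetic; the "activations" are random vectors with a planted feature direction standing in for hidden states extracted from a real model layer.

```python
# Minimal sketch of a linear probe for interpretability: high held-out
# accuracy suggests the concept is linearly decodable from activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 64
labels = rng.integers(0, 2, size=n)           # presence/absence of a concept
direction = rng.normal(size=d)                # planted "feature direction"
# Toy activations: noise plus the concept direction when the label is 1.
acts = rng.normal(size=(n, d)) + np.outer(labels, direction)

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```

Probes illustrate the gap the claim describes: they can confirm that a feature is represented, but they say little about how the model computes with it, which is why legible explanations still cover only a small fraction of behavior.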
International AI governance frameworks remain fragmented. Voluntary commitments have not been matched by enforceable standards, leaving a widening gap between capability and oversight.
Competitive dynamics reward rapid capability gains. Safety-focused research constitutes less than 3% of total AI R&D spending across leading organizations.
Interpretability
Tracing the circuits of thought within neural networks, understanding how features form, compose, and occasionally deceive.

Alignment
From RLHF to debate protocols, the evolving toolkit for ensuring advanced systems pursue objectives their creators intended.

Governance
International frameworks, compute governance, and the institutional architecture needed to steward transformative AI development responsibly.

Oversight
Can humans supervise systems smarter than themselves? Exploring recursive reward modeling, debate, and market-based approaches.

X-Risk
Estimating the probability and pathways of catastrophic outcomes from advanced AI, and the interventions that could reduce them.

Values
The deep challenge of specifying what we want: inverse reward design, cooperative inverse RL, and the philosophy of human values. A toy sketch of inverse reward design follows below.
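The sketch below is a toy rendering of inverse reward design (Hadfield-Menell et al., 2017): the designer's proxy reward is treated as evidence about the true reward, yielding a posterior over candidate true rewards rather than a literal reading of the proxy. The feature vectors, the candidate grid doubling as the proxy space, and the inverse temperature are invented for illustration and simplify the full formulation.

```python
# Toy inverse reward design: infer a posterior over true reward weights
# given the proxy the designer wrote in a training environment.
import itertools
import numpy as np

# Trajectories in the *training* environment, as feature counts,
# e.g. [path length, dirt collected, lava tiles crossed].
PHI = np.array([
    [1.0, 2.0, 0.0],   # short route, some dirt, no lava
    [0.5, 3.0, 0.0],   # slower route, more dirt, no lava
    [0.2, 3.0, 1.0],   # fastest and dirtiest, but crosses lava
])

def best_traj(w: np.ndarray) -> int:
    """Index of the trajectory a planner would choose under weights w."""
    return int(np.argmax(PHI @ w))

# Hypothesis grid over true rewards; it also serves as the space of
# proxies the designer could have written (a simplification).
candidates = [np.array(w) for w in itertools.product([-1.0, 0.0, 1.0], repeat=3)]
proxy = np.array([0.0, 1.0, 0.0])  # designer rewarded dirt, said nothing about lava

def likelihood(proxy_w, true_w, proxy_space, beta=5.0):
    """P(designer writes proxy_w | true reward is true_w), IRD-style:
    proxies whose optimal trajectory scores well under true_w are likelier."""
    num = np.exp(beta * true_w @ PHI[best_traj(proxy_w)])
    den = sum(np.exp(beta * true_w @ PHI[best_traj(p)]) for p in proxy_space)
    return num / den

# Uniform prior over candidates; posterior by Bayes' rule.
posterior = np.array([likelihood(proxy, w, candidates) for w in candidates])
posterior /= posterior.sum()

# The posterior favors true rewards under which the proxy's chosen behavior
# was the designer's best option: it tilts lava-averse, since a designer who
# valued lava would likely have written a proxy leading through it.
for w, p in sorted(zip(candidates, posterior), key=lambda t: -t[1])[:5]:
    print(w, f"{p:.3f}")
```

The design point is that an agent planning against this posterior, rather than against the proxy itself, has a principled reason to avoid lava even though the proxy never mentioned it.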