Fireside Compendium

Quiet Evenings with AI Safety Research

A survey of the most pressing open questions in artificial intelligence alignment, gathered like well-worn volumes on a mantelpiece and illuminated by the steady glow of inquiry.


Key Findings

🔒

Alignment remains unsolved at scale

Current steering techniques for large language models show promise in narrow settings but have not been demonstrated to hold under recursive self-improvement or open-ended agency.

🔍

Interpretability advances but does not yet suffice

Mechanistic interpretability has uncovered meaningful internal structures in transformer models, yet we can only explain a small fraction of any frontier model's behavior.

🏛

Governance lags behind capability

International frameworks remain fragmented. Voluntary commitments from major labs have not been matched by enforceable standards, leaving a widening oversight gap.

Economic incentives favor speed over caution

Competitive dynamics reward rapid capability gains. Safety research accounts for just 2.7% of total AI R&D spending across leading organizations.


By the Numbers

347 Published Papers
2.7% R&D on Safety
58 Research Groups
12 Governance Proposals

Research Volumes

🕯

Mechanistic Interpretability

Tracing the circuits of computation within neural networks: understanding how features form, compose, and occasionally deceive.

Interpretability
🔒

Alignment Techniques

From RLHF to debate protocols, the evolving toolkit for ensuring advanced systems pursue human-intended objectives.

Alignment
🏛

AI Governance

International frameworks, compute governance, and the institutional architecture needed to steward transformative AI.

Governance
🧠

Scalable Oversight

Can humans supervise systems smarter than themselves? Exploring recursive reward modeling, debate, and market-based approaches.

Oversight
🌍

Existential Risk

Modeling the probability and pathways of catastrophic outcomes from advanced AI, and the interventions that might reduce them.

X-Risk
📜

Value Learning

The deep challenge of specifying what we want: inverse reward design, cooperative inverse RL, and the philosophy of human values.

Values