Alignment remains unsolved at scale
A survey of the most pressing questions in artificial intelligence alignment, gathered like well-worn volumes on a mantelpiece and illuminated by the steady glow of inquiry.
Current steering techniques for large language models show promise in narrow settings but have not been demonstrated to hold under recursive self-improvement or open-ended agency.
Mechanistic interpretability has uncovered meaningful internal structures in transformer models, yet we can only explain a small fraction of any frontier model's behavior.
International frameworks remain fragmented. Voluntary commitments from major labs have not been matched by enforceable standards, leaving a widening oversight gap.
Competitive dynamics reward rapid capability gains. Safety research constitutes less than 3% of total AI R&D spending across leading organizations.
Interpretability: Tracing the circuits of thought within neural networks, understanding how features form, compose, and occasionally deceive.
Alignment: From RLHF to debate protocols, the evolving toolkit for ensuring advanced systems pursue human-intended objectives.
Governance: International frameworks, compute governance, and the institutional architecture needed to steward transformative AI.
Oversight: Can humans supervise systems smarter than themselves? Exploring recursive reward modeling, debate, and market-based approaches.
X-Risk: Modeling the probability and pathways of catastrophic outcomes from advanced AI, and the interventions that might reduce them.
Values: The deep challenge of specifying what we want: inverse reward design, cooperative inverse RL, and the philosophy of human values.