Research Highlights
Key Findings
1. Alignment remains unsolved. Despite rapid capability gains, the core problem of ensuring AI systems reliably pursue intended goals remains open. The gap between what models can do and what we can verify keeps widening.
2. Interpretability is accelerating. Sparse autoencoders have identified over 10,000 meaningful features in frontier models, enabling targeted behavioral interventions and bringing real legibility to neural network internals.
3. Governance lags behind deployment. Only 12 of 38 major AI-producing nations have adopted binding safety evaluation standards. International coordination remains fragmented and reactive.
4. Timelines are compressing. Expert surveys now place median AGI arrival at 2035, down from 2050 estimates five years ago, making safety research not just important but urgent.
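The sparse-autoencoder approach behind finding 2 can be illustrated with a minimal sketch: encode a model's activation vector into a wider, non-negative feature vector, reconstruct the activation from those features, and penalize feature activity with an L1 term so most features stay off. All sizes, weights, and coefficients below are hypothetical placeholders, not values from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: real SAEs use far wider feature dictionaries.
d_model, d_features = 16, 64
W_enc = rng.normal(0.0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0.0, 0.1, (d_features, d_model))

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps only "active" features
    x_hat = f @ W_dec                       # linear reconstruction from features
    return f, x_hat

def sae_loss(x, l1_coeff=0.01):
    """Training objective: reconstruction error plus an L1 sparsity penalty."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)        # fidelity to the original activation
    sparsity = l1_coeff * np.sum(np.abs(f))  # drives most feature activations to zero
    return recon + sparsity

x = rng.normal(size=d_model)  # stand-in for one activation vector from a model
f, x_hat = sae_forward(x)
```

Training minimizes `sae_loss` over many activations; the resulting feature directions are what interpretability work inspects and intervenes on.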
By the Numbers
Data Snapshot
12:1
Capability-Safety Ratio
The central tension: training compute for frontier models has grown 10x year-over-year, while safety teams remain roughly one-twelfth the size of capability teams. Closing this gap is the defining challenge of the field.
Explore
Research Areas
Mechanistic Interpretability
Reverse-engineering neural network computation to understand learned algorithms, circuits, and features at the individual neuron level.
Interpretability
RLHF & Value Learning
Anchoring model behavior to human preferences through reinforcement learning, debate protocols, and constitutional training methods.
Alignment
AI Governance
Institutional frameworks, compute governance, international coordination, and regulatory design for responsible AI development.
Policy
Robustness Testing
Adversarial red-teaming, distribution shift analysis, and formal verification to ensure reliable behavior under pressure.
Security
Evaluation Design
Building benchmarks for dangerous capabilities, deception detection, and safety property verification in frontier models.
Measurement
Existential Risk
Quantifying catastrophic risk through decision theory, historical analogues, and formal threat models for advanced AI systems.
Theory