← Back to Gallery

AI Safety Research

Alignment, Interpretability & Existential Risk — 2026 Overview
Key Findings
By the Numbers
340+ Safety Papers Published in 2025
$9.2B Annual Safety Research Funding
78% Researchers Expecting AGI by 2040
23 Countries with AI Safety Institutes
Research Areas
Interpretability

Mechanistic Interpretability

Reverse-engineering neural network computations to understand how models represent and process information internally, from individual features to full circuits.

Alignment

Scalable Oversight

Developing methods for humans to supervise AI systems on tasks too complex for direct human evaluation, including debate, recursive reward modeling, and market-based approaches.

Robustness

Adversarial Robustness

Ensuring AI systems behave reliably under distributional shift, adversarial inputs, and edge cases that fall outside the training distribution.

Governance

AI Policy & Regulation

Designing governance frameworks, international treaties, compute monitoring regimes, and responsible scaling policies for frontier AI development.

Evaluation

Dangerous Capability Evals

Building rigorous benchmarks for detecting hazardous capabilities such as autonomous replication, persuasion, cyber-offense, and deceptive alignment in frontier models.

Theory

Agent Foundations

Foundational research on decision theory, embedded agency, logical uncertainty, and the mathematical frameworks needed to reason about superintelligent systems.