AI Safety Research Hub

Exploring alignment, interpretability, and governance frameworks for advanced AI systems. An independent research compendium.

Alignment · Interpretability · Governance
2026 Edition
01 · Current RLHF approaches show systematic gaps in robustness, with adversarial prompts bypassing safety training in 23% of tested scenarios.
02 · Mechanistic interpretability has identified key attention patterns linked to deceptive reasoning in large language models under evaluation.
03 · Governance frameworks lag technical capabilities by an estimated 3-5 years, with no binding international agreements on frontier AI development.
04 · Constitutional AI methods reduce harmful outputs by 60% compared to base RLHF, but introduce measurable reductions in model helpfulness.
847 · Papers Analyzed (Corpus Size)
30+ · Researchers (Author Profiles)
12 · Domains (Research Areas)
5yr · Lag Estimate (Governance Gap)
ALN · Alignment Theory

Foundational research on ensuring AI systems pursue intended objectives and remain corrigible as their capabilities grow.

INT · Interpretability

Mechanistic and circuit-level analysis of neural network internals to understand model reasoning and behavior.

GOV · Governance

Policy frameworks, international coordination mechanisms, and regulatory approaches for frontier AI development.

EVL · Evaluations

Benchmarks and testing methodologies for measuring dangerous capabilities and alignment properties in AI systems.

ROB · Robustness

Research on adversarial attacks, jailbreaking, and techniques for making safety training more resilient.

FRC · Forecasting

Forecasts of AI capability timelines, transformative impact, and existential risk probabilities.
