Exploring alignment, interpretability, and governance frameworks for advanced AI systems. An independent research compendium.
Foundational research on ensuring AI systems pursue intended objectives and remain corrigible during capability gains.
Mechanistic and circuit-level analysis of neural network internals to understand model reasoning and behavior.
Policy frameworks, international coordination mechanisms, and regulatory approaches for frontier AI development.
Benchmarks and testing methodologies for measuring dangerous capabilities and alignment properties in AI systems.
Research on adversarial attacks, jailbreaking, and techniques for making safety training more resilient.
Predictions about AI capability timelines, transformative impact, and existential risk probability estimates.