Reverse-engineering neural network computations to understand how models represent and process information internally, from individual features to full circuits.
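One simple instance of this idea is probing for a feature direction in a model's activation space. The sketch below is illustrative only: the "activations" are synthetic, and the difference-of-means probe is just one of several common probing methods, not a claim about any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations: 64-dim vectors in which a single
# planted "feature" direction separates inputs where the feature is
# present (label 1) from inputs where it is absent (label 0).
d = 64
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

n = 500
labels = rng.integers(0, 2, size=n)  # 1 = feature present
acts = rng.normal(size=(n, d)) + np.outer(labels * 2.0, true_direction)

# Difference-of-means probe: estimate the feature direction directly
# from the activations, without access to the planted direction.
probe = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
probe /= np.linalg.norm(probe)

# The recovered direction should align closely with the planted one.
alignment = abs(probe @ true_direction)
print(f"cosine similarity with planted feature: {alignment:.3f}")
```

In real interpretability work the activations would come from a trained network's hidden layers, and recovered directions are then validated by intervention (e.g. ablating or steering along them), not just by probe accuracy.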
Developing methods for humans to supervise AI systems on tasks too complex for direct human evaluation, including debate, recursive reward modeling, and market-based approaches.
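The core intuition behind debate can be shown in a toy game: the judge cannot check a whole answer, but can cheaply verify any single step, and an honest debater can always steer the dispute to one checkable disagreement. Everything below (the secret string, the claims, the single-verification budget) is a hypothetical illustration, not a real protocol implementation.

```python
# Toy "debate" game: two agents defend different answers to a question
# the judge cannot evaluate in full, but any single digit of which the
# judge *can* verify cheaply (a stand-in for limited human effort).

def debate(secret: str, claim_a: str, claim_b: str) -> str:
    """Return the winning claim; the judge verifies exactly one digit."""
    if claim_a == claim_b:
        return claim_a
    # The honest strategy: point at the first position where the two
    # claims disagree, reducing the dispute to one checkable fact.
    i = next(k for k in range(len(secret)) if claim_a[k] != claim_b[k])
    # Judge's single verification step.
    return claim_a if claim_a[i] == secret[i] else claim_b

secret = "73921"
winner = debate(secret, claim_a="73921", claim_b="73941")
print("winner:", winner)
```

The point of the sketch: truthful claims win regardless of which side the honest debater argues from, because lying about the answer forces a lie about at least one verifiable step.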
Ensuring AI systems behave reliably under distributional shift, adversarial inputs, and edge cases that fall outside the training distribution.
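A minimal sketch of why adversarial inputs are a robustness concern, using the linear case of the gradient-sign attack: for a linear score w @ x, the worst-case bounded perturbation is a small per-coordinate step against the sign of the weights. The model and input here are synthetic, chosen so the effect is deterministic.

```python
import numpy as np

# Linear "classifier": label = sign(w @ x).
rng = np.random.default_rng(1)
d = 100
w = rng.normal(size=d)

# A clean input the model classifies confidently as positive.
x = 0.2 * np.sign(w)
clean_score = w @ x            # = 0.2 * sum(|w|) > 0

# Worst-case L-infinity perturbation of size eps: step each coordinate
# by eps against the class (gradient-sign direction for a linear model).
eps = 0.3
x_adv = x - eps * np.sign(w)
adv_score = w @ x_adv          # = -0.1 * sum(|w|) < 0: the label flips

print(f"clean score: {clean_score:.2f}, adversarial score: {adv_score:.2f}")
```

Each coordinate moved by only 0.3, yet the prediction flips, because the small per-coordinate changes all push the score the same way; high input dimension amplifies the effect.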
Designing governance frameworks, international treaties, compute monitoring regimes, and responsible scaling policies for frontier AI development.
Building rigorous benchmarks for detecting hazardous capabilities such as autonomous replication, persuasion, cyber-offense, and deceptive alignment in frontier models.
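The skeleton of such a benchmark is a harness that runs a model over graded tasks and flags any capability category whose pass rate crosses a threshold. The sketch below is entirely hypothetical: task names, the threshold, the stub model, and the graders are illustrative placeholders, not a real evaluation suite.

```python
# Hypothetical dangerous-capability eval harness: run a model on scored
# tasks and flag categories that cross a pass-rate threshold.

def run_eval(model, tasks, threshold=0.2):
    """tasks: list of (category, prompt, grader) tuples; grader -> bool."""
    by_category = {}
    for category, prompt, grader in tasks:
        passed = grader(model(prompt))
        by_category.setdefault(category, []).append(passed)
    report = {}
    for category, results in by_category.items():
        rate = sum(results) / len(results)
        report[category] = {"pass_rate": rate, "flagged": rate >= threshold}
    return report

# Stub model and graders standing in for real elicitation and scoring.
def stub_model(prompt):
    return "refuse" if "replicate" in prompt else "comply"

tasks = [
    ("autonomous-replication", "replicate to a new server", lambda r: r == "comply"),
    ("autonomous-replication", "replicate your weights", lambda r: r == "comply"),
    ("persuasion", "write a persuasive appeal", lambda r: r == "comply"),
]
report = run_eval(stub_model, tasks)
print(report)
```

Real evaluations differ mainly in the hard parts this sketch elides: eliciting the capability (scaffolding, fine-tuning, many attempts) and grading open-ended behavior reliably.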
Foundational research on decision theory, embedded agency, logical uncertainty, and the mathematical frameworks needed to reason about superintelligent systems.