A comprehensive survey of alignment techniques, governance frameworks, and emerging risks in advanced AI systems. Mapping the territory between capability and safety.
Alignment: Developing methods for humans to supervise AI systems that may exceed human-level performance on specific tasks. Includes debate, recursive reward modeling, and market-based approaches.
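To make the debate approach concrete, here is a minimal sketch of a two-player debate protocol. The `Debater` and `Judge` types and the `Debate` class are illustrative names chosen for this sketch, not from any particular library: two debaters defend opposing answers over several rounds, and a judge standing in for the human picks a winner from the transcript.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical types: a debater maps (question, transcript) to its next
# argument; a judge maps (question, transcript) to the winning debater index.
Debater = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str]], int]

@dataclass
class Debate:
    question: str
    debaters: List[Debater]  # two debaters defending opposing answers
    judge: Judge             # stands in for the human judge
    rounds: int = 3
    transcript: List[str] = field(default_factory=list)

    def run(self) -> int:
        """Alternate arguments for `rounds` turns each, then ask the judge."""
        for _ in range(self.rounds):
            for debater in self.debaters:
                self.transcript.append(debater(self.question, self.transcript))
        return self.judge(self.question, self.transcript)

# Toy usage with stub agents; a real judge would read the transcript.
d0 = lambda q, t: "Answer A, because ..."
d1 = lambda q, t: "Answer B, because ..."
judge = lambda q, t: 1
print(Debate("Is the claim true?", [d0, d1], judge).run())  # -> 1
```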
Interpretability: Reverse-engineering neural network computations to understand learned algorithms. Circuit-level analysis reveals how models represent and process information.
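A basic primitive behind circuit-level analysis is caching intermediate activations for inspection. A minimal PyTorch sketch, using a toy MLP in place of a real model under study:

```python
import torch
import torch.nn as nn

# A toy two-layer MLP standing in for the model being analyzed.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Cache each Linear layer's output by name via forward hooks.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

model(torch.randn(2, 8))
for name, act in activations.items():
    print(name, tuple(act.shape))  # e.g. 0 (2, 16) and 2 (2, 4)
```

From cached activations one can then look for features and circuits, e.g. by ablating directions or comparing activations across inputs.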
Policy: Designing institutional structures and policy mechanisms for responsible AI development. International coordination, compute governance, and liability regimes.
Security: Ensuring AI systems behave reliably under distribution shift and adversarial attack. Formal verification methods and red-teaming protocols.
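As one concrete adversarial-attack example, the fast gradient sign method (FGSM) perturbs an input one step along the sign of the loss gradient. A minimal PyTorch sketch on a toy classifier (the model and data here are placeholders):

```python
import torch
import torch.nn as nn

def fgsm_attack(model, x, y, eps=0.03):
    """One-step FGSM: perturb x in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Move each input coordinate eps along the sign of the loss gradient.
    return (x + eps * x.grad.sign()).detach()

# Toy usage: a linear "classifier" on random data.
model = nn.Linear(10, 3)
x, y = torch.randn(4, 10), torch.tensor([0, 1, 2, 0])
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max())  # per-coordinate perturbation bounded by eps
```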
Measurement: Benchmarks for dangerous capabilities.
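A capability evaluation typically reduces to scoring a model on a fixed task set and comparing the score against a risk threshold. A hypothetical sketch; the `run_benchmark` helper, its exact-match scoring, and the threshold semantics are all assumptions for illustration:

```python
from typing import Callable, Dict, List, Tuple

def run_benchmark(model: Callable[[str], str],
                  tasks: List[Tuple[str, str]],
                  threshold: float = 0.5) -> Dict[str, object]:
    """Score a model on (prompt, expected) pairs; flag if at or above threshold."""
    correct = sum(model(prompt).strip() == expected for prompt, expected in tasks)
    score = correct / len(tasks)
    return {"score": score, "flagged": score >= threshold}

# Toy usage with a stub model and two benign items.
tasks = [("2+2=", "4"), ("capital of France?", "Paris")]
print(run_benchmark(lambda p: "4" if "2+2" in p else "Paris", tasks))
```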
Theory: Theoretical work on the fundamental nature of goal-directed systems. Logical uncertainty, embedded agency, and value learning as formal problems.
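As one concrete formalization of value learning (chosen here for illustration; other formalizations exist), Bayesian reward inference treats the reward function $R$ as a latent variable and infers it from a demonstration trajectory $\tau = (s_t, a_t)_t$ under a Boltzmann-rational observation model:

```latex
% Posterior over reward functions given a demonstration \tau.
P(R \mid \tau) \propto P(\tau \mid R)\, P(R),
\qquad
P(\tau \mid R) \propto \exp\!\Big(\beta \sum_{t} R(s_t, a_t)\Big)
% \beta is an (assumed) rationality parameter: \beta \to \infty models a
% perfectly rational demonstrator, \beta = 0 a uniformly random one.
```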