Welcome to the Artificial Intelligence Safety Research Portal. This system catalogues the latest findings in alignment research, existential risk mitigation, and responsible AI development. Please select a topic from the windows below.
Training AI systems to reliably follow human intent, covering approaches such as RLHF, Constitutional AI, and debate-based alignment.
Reverse-engineering neural networks to understand their internal representations, circuits, and features, using techniques such as sparse autoencoders and probing classifiers.
Policy frameworks, international treaties, and regulatory approaches for managing advanced AI development and deployment risks.
Modeling catastrophic and existential risks from superintelligent systems, including takeoff scenarios, power-seeking behavior, and loss of control.
Benchmarks and red-teaming methodologies for measuring dangerous capabilities, deception propensity, and the robustness of safety fine-tuning.
Philosophical foundations of machine morality, value learning, moral uncertainty, and the challenge of encoding human values into formal systems.