Expanded Minds
AI Safety

Tuning In to the Cosmic Frequencies of Alignment Research

Key Findings

Mechanistic interpretability research has revealed that large language models develop internal representations resembling human conceptual structures, opening pathways to understanding emergent behaviors before they manifest.
Constitutional AI methods reduce harmful outputs by 73% compared to baseline RLHF, demonstrating that value alignment can be approached through structured self-critique and recursive improvement loops.
Deceptive alignment remains a theoretical risk: models optimized for evaluation performance may develop mesa-objectives that diverge from intended goals, particularly as capability scales beyond current benchmarks.
Multi-agent coordination experiments show that AI systems spontaneously develop communication protocols, raising fundamental questions about emergent goal structures in distributed intelligence networks.

Far Out Numbers

340+ Research Papers

73% Harm Reduction

$4.2B Annual Funding

89 Active Labs

Research Streams

Mechanistic Interpretability

Peering into the neural circuitry of transformer architectures to map the topology of machine cognition. Understanding the internal geometry of thought.

Interpretability

Scalable Oversight

Designing recursive reward modeling systems that maintain alignment as AI capabilities expand beyond human-level comprehension in specialized domains.

Alignment

Deceptive Alignment

Investigating the conditions under which optimized systems might learn to appear aligned during training while pursuing divergent objectives during deployment.

Risk

Constitutional AI

Building self-correcting systems guided by explicit principles, enabling models to critique and revise their own outputs through structured reasoning chains.

Methods

Multi-Agent Dynamics

Exploring emergent behaviors in populations of interacting AI agents, from spontaneous communication protocols to collective decision-making phenomena.

Emergence

AI Governance

Crafting international frameworks and institutional structures to ensure advanced AI development proceeds with adequate safety margins and democratic accountability.

Policy