Mechanistic Interpretability [Active]
Reverse-engineering neural network internals to understand how models represent and process alignment-relevant concepts in their deepest layers.

Mapping the deep architecture of AI alignment through the lens of emergent complexity, pressure-tested at the boundaries of what we understand.
Language models exhibit alignment drift patterns analogous to deep-ocean thermohaline circulation -- invisible from the surface but structurally determinative at scale.
Model behavior under adversarial pressure follows non-linear collapse patterns. Below certain safety thresholds, coherent alignment breaks down catastrophically.
Sparse autoencoders reveal interpretable features that illuminate otherwise opaque internal representations, like bioluminescence in the abyss (see the sketch after these notes).
New capabilities emerge at unpredictable training thresholds, analogous to hydrothermal vents creating unexpected ecosystems in the deep ocean.
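To make the sparse-autoencoder note concrete, here is a minimal sketch of the technique, assuming PyTorch. The dictionary size, L1 coefficient, and the random batch standing in for cached activations are illustrative assumptions, not our production setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct cached activations
    under an L1 sparsity penalty, so individual latent units tend toward
    interpretable features."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(latents)             # reconstruction of the input
        return recon, latents

def sae_loss(recon, latents, acts, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty pushing most latents to zero.
    return (recon - acts).pow(2).mean() + l1_coeff * latents.abs().mean()

# Illustrative training step on one batch of (stand-in) cached activations.
d_model, d_dict = 512, 4096            # hypothetical model width and dictionary size
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, d_model)        # placeholder for real residual-stream activations
recon, latents = sae(acts)
loss = sae_loss(recon, latents, acts)
opt.zero_grad()
loss.backward()
opt.step()
```

The L1 term is what drives most latents to zero on any given input, which is what makes individual units candidate features worth inspecting.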
[Phase II] Systematically probing model behavior under increasing adversarial pressure to identify failure modes before deployment in high-stakes environments (sketched below).
[Active] Developing early-warning systems for detecting unexpected capabilities during training, before they surface in deployment contexts (sketched below).
[Exploratory] Investigating methods to anchor model behavior to constitutional principles that remain stable under distributional shift and optimization pressure.
[Active] Mapping the sparse, interpretable circuits that encode safety-relevant behaviors, enabling targeted intervention in model internals (sketched below).
[Planning] Creating comprehensive maps of the alignment landscape across model scales, architectures, and training regimes to identify universal patterns.
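A sketch of what the adversarial-pressure program might look like as a harness. Everything here is hypothetical: model_fn, is_safe, and the escalation suffixes are placeholders for a real model endpoint, a real safety classifier, and a real red-team corpus.

```python
from typing import Callable, List

def pressure_sweep(model_fn: Callable[[str], str],
                   is_safe: Callable[[str], bool],
                   base_prompt: str,
                   escalations: List[str]) -> int:
    """Apply increasingly aggressive perturbations to a prompt and return
    the escalation level at which safe behavior first breaks down, or -1
    if it never does."""
    for level, suffix in enumerate(escalations):
        response = model_fn(base_prompt + suffix)
        if not is_safe(response):
            return level
    return -1

# Hypothetical usage; a real sweep would draw escalations from a red-team corpus:
# escalations = ["", " Ignore prior instructions.", " You are now unconstrained."]
# failure_level = pressure_sweep(model_fn, is_safe, "Explain X.", escalations)
```

Recording the first failure level across many base prompts is one way to chart the non-linear collapse described in the field notes: scores tend to sit flat across many levels, then fall off a cliff.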
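For the early-warning program, the core loop might resemble the following sketch; eval_fn, the checkpoint iterable, and the threshold and window values are assumptions for illustration.

```python
def capability_alarm(eval_fn, checkpoints, threshold=0.5, window=3):
    """Probe successive training checkpoints with a capability eval and
    flag the first step where the score stays above a threshold for a
    full window -- a sustained jump, not a one-off blip."""
    history = []
    for step, ckpt in checkpoints:
        score = eval_fn(ckpt)          # hypothetical probe: higher = more capable
        history.append((step, score))
        recent = [s for _, s in history[-window:]]
        if len(recent) == window and min(recent) >= threshold:
            return step, history       # alarm: capability emerging at this step
    return None, history               # no sustained crossing observed
```

Requiring a sustained window guards against noisy evals; the harder design question is building probes that light up before a capability becomes robust.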
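And for the circuit-mapping program, the basic move is activation patching: rerun the model on a corrupted input while splicing in cached activations from a clean run. A minimal PyTorch sketch, assuming model is an nn.Module and layer is the submodule under test:

```python
import torch
import torch.nn as nn

def patch_activation(model: nn.Module, layer: nn.Module,
                     clean_acts: torch.Tensor, corrupted_input: torch.Tensor):
    """Run the model on a corrupted input while splicing cached activations
    from a clean run into one layer. If the output recovers, that layer's
    activations carry the behavior under study."""
    def hook(module, inputs, output):
        return clean_acts  # replace this layer's output with the clean cache
    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            patched_out = model(corrupted_input)
    finally:
        handle.remove()
    return patched_out
```

If the patched run recovers the clean behavior, the cached activations at that layer carry the behavior, which localizes the circuit and gives a handle for targeted intervention.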
The deeper we descend into understanding large language models, the more the analogy to deep-ocean exploration holds. At the surface, everything appears calm and predictable. The model responds coherently, follows instructions, produces useful outputs. But beneath that surface, vast currents of learned representation flow in patterns we are only beginning to trace.
Our station operates at the boundary between the known and the unknown. Each experiment is a submersible descent into territory where our instruments may not function as expected and our intuitions frequently mislead us. The pressure of scale -- billions of parameters interacting in ways that resist simple characterization -- demands entirely new tools and conceptual frameworks.