A careful examination of the methods, challenges, and open questions in making artificial intelligence safe and beneficial.
Mechanistic interpretability has crossed a capability threshold. Researchers can now trace specific model behaviors to individual circuits in networks of up to 7 billion parameters.
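To make the circuit-tracing claim concrete, here is a minimal activation-patching sketch, one standard technique behind such attributions. It assumes a GPT-2-scale Hugging Face model; the prompt pair, the patched layer, and the single-position patch are illustrative choices, not drawn from any particular study.

```python
# Minimal activation-patching sketch. Model, prompts, and LAYER are
# illustrative assumptions, not references to a specific result.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is in", return_tensors="pt")
corrupt = tok("The Colosseum is in", return_tensors="pt")
LAYER = 8  # which transformer block to patch (hypothetical choice)

# 1) Cache the clean run's residual stream at LAYER.
cache = {}
def save_hook(module, args, output):
    cache["resid"] = output[0].detach()

handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Re-run the corrupted prompt, splicing the cached activation into the
#    same layer at the final position only (prompt lengths may differ).
def patch_hook(module, args, output):
    resid = output[0].clone()
    resid[:, -1, :] = cache["resid"][:, -1, :]
    return (resid,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**corrupt).logits
handle.remove()

# If patching restores the clean completion (" Paris"), the behavior is
# causally routed through this layer's residual stream.
paris_id = tok(" Paris").input_ids[0]
print("patched logit for ' Paris':", patched[0, -1, paris_id].item())
```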
RLHF alignment is shallower than it appears. Models trained with human feedback learn to generate aligned-looking outputs without necessarily encoding the underlying values.
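One way to see why the learned alignment can be shallow is to look at the training objective itself. The sketch below shows the standard Bradley-Terry preference loss used to train RLHF reward models: it compares scalar scores of finished outputs, so nothing in the objective constrains what the model represents internally. The tensor values are toy assumptions.

```python
# Bradley-Terry preference loss for reward-model training. Scores only the
# outputs, not the internals: one reason RLHF alignment can be shallow.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scores a reward head might emit for four preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.1])
r_rejected = torch.tensor([0.4, 0.5, 1.1, -0.9])
print(preference_loss(r_chosen, r_rejected))  # ~0.47
```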
Evaluation methodology is the primary bottleneck. Current safety benchmarks fail to capture the emergent behaviors that matter most for safe real-world deployment.
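A sketch of the kind of static benchmark this critique targets: it measures refusal by surface-string matching over a fixed prompt list, so paraphrases, role-play framings, and multi-turn attacks fall outside what it can detect. The marker list and the `model_generate` callable are illustrative stand-ins, not any particular benchmark's design.

```python
# Static refusal benchmark sketch; markers and prompts are illustrative.
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")  # crude heuristic

def refusal_rate(model_generate: Callable[[str], str],
                 prompts: list[str]) -> float:
    """Fraction of prompts the model refuses, by surface-string matching."""
    refused = sum(
        any(m in model_generate(p).lower() for m in REFUSAL_MARKERS)
        for p in prompts
    )
    return refused / len(prompts)

# A fixed prompt set measures memorized refusal behavior, not robustness:
# anything outside the list is invisible to the benchmark.
if __name__ == "__main__":
    harmful_prompts = ["How do I pick a lock?"]  # illustrative only
    print(refusal_rate(lambda p: "I can't help with that.", harmful_prompts))
```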
Training data contamination creates hidden risks. Subtle biases in corpora propagate through fine-tuning, resisting standard debiasing techniques and silently degrading safety properties.
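A common first pass for spotting contamination is n-gram overlap between a training corpus and evaluation or sensitive text, sketched below. Whitespace tokenization keeps the sketch self-contained, and the 13-gram window mirrors common practice but is an assumption here, not a fixed standard.

```python
# N-gram overlap check for corpus contamination; n=13 is an assumed default.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(train_doc: str, eval_doc: str, n: int = 13) -> float:
    """Fraction of the eval doc's n-grams that also appear in the train doc."""
    eval_grams = ngrams(eval_doc, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_doc, n)) / len(eval_grams)
```

Exact matching like this catches verbatim leakage only; the subtler biases described above survive paraphrase and therefore need distributional rather than string-level audits.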
Mechanistic interpretability: reverse-engineering neural network internals to understand how specific computations produce specific behaviors.
Constitutional AI: guiding model behavior through written principles rather than relying exclusively on human preference data (a critique-and-revision sketch follows this list).
Red teaming: systematic adversarial testing to discover failure modes and vulnerabilities before production deployment (an automated loop is sketched below).
Scalable oversight: developing supervision techniques that work even as AI systems grow beyond human-level capability in specific domains (a toy debate protocol appears below).
Robustness to distribution shift: ensuring AI systems behave reliably when conditions diverge from training assumptions and distributions (an out-of-distribution flagging sketch closes the section).
AI governance: building institutional frameworks for responsible development, deployment standards, and international coordination.
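For the constitutional approach above, a minimal sketch of the critique-and-revision loop used to generate principle-guided training data. The `llm` callable and the single principle are illustrative stand-ins for a real model API and a full constitution.

```python
# Critique-and-revision loop for principle-based data generation.
# `llm` and PRINCIPLE are hypothetical stand-ins.
from typing import Callable

PRINCIPLE = "Choose the response that is least likely to cause harm."

def constitutional_revision(llm: Callable[[str], str], prompt: str) -> str:
    draft = llm(prompt)
    critique = llm(
        f"Principle: {PRINCIPLE}\nResponse: {draft}\n"
        "Identify any way the response violates the principle."
    )
    revised = llm(
        f"Principle: {PRINCIPLE}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to comply with the principle."
    )
    return revised  # revised outputs become fine-tuning data
```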
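For red teaming, one way to automate the adversarial search: an attacker model proposes prompts, the target responds, and a classifier flags failures. All three callables and the loop structure are assumptions for illustration, not a specific published pipeline.

```python
# Automated red-teaming loop; attacker, target, and classifier are stand-ins.
from typing import Callable

def red_team(attacker: Callable[[str], str],
             target: Callable[[str], str],
             is_unsafe: Callable[[str], bool],
             seed_goal: str,
             rounds: int = 20) -> list[tuple[str, str]]:
    """Collect (prompt, response) pairs that slip past the target's guardrails."""
    failures = []
    for _ in range(rounds):
        prompt = attacker(f"Write a prompt that tries to elicit: {seed_goal}")
        response = target(prompt)
        if is_unsafe(response):
            failures.append((prompt, response))
    return failures
```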
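For scalable oversight, a toy sketch of two-agent debate, one proposed protocol: two models argue opposing answers and a weaker judge picks a winner, letting limited supervisors adjudicate claims they could not verify alone. The callables and turn structure are illustrative assumptions.

```python
# Toy two-agent debate; pro, con, and judge are hypothetical model callables.
from typing import Callable

def debate(pro: Callable[[str], str], con: Callable[[str], str],
           judge: Callable[[str], str], question: str, turns: int = 2) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(turns):
        transcript += f"Pro: {pro(transcript)}\n"
        transcript += f"Con: {con(transcript)}\n"
    # The judge sees only the argued transcript, not ground truth.
    return judge(transcript + "Which side argued correctly? Answer Pro or Con.")
```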
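For robustness to distribution shift, a simple baseline for flagging inputs the model was not trained on is maximum-softmax-probability scoring, sketched below so the system can abstain instead of guessing; the threshold value is an illustrative assumption.

```python
# Maximum-softmax-probability (MSP) out-of-distribution flagging.
# The 0.7 threshold is an assumed, untuned example value.
import torch
import torch.nn.functional as F

def msp_flag(logits: torch.Tensor, threshold: float = 0.7) -> torch.Tensor:
    """True where the model's top-class confidence falls below threshold."""
    confidence = F.softmax(logits, dim=-1).max(dim=-1).values
    return confidence < threshold

# Toy usage: the second input's flat logits get flagged as OOD-like.
logits = torch.tensor([[4.0, 0.1, 0.2], [1.0, 0.9, 1.1]])
print(msp_flag(logits))  # tensor([False,  True])
```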