Today's lecture: Why making AI systems safe is harder than it looks, and what we're doing about it
Interpretability is advancing fast — we can now trace specific behaviors to circuits inside medium-sized models, but scaling this to frontier models remains a major challenge.
RLHF works but isn't robust. Models trained with human feedback learn to appear aligned without necessarily being aligned in all situations.
The evaluation problem is unsolved. We don't have reliable ways to test whether a model is safe before deployment at scale.
Training data contamination creates subtle, hard-to-detect biases that can undermine safety measures in unpredictable ways.
"The core difficulty: we need to supervise systems that may eventually be smarter than us. Classical oversight breaks down."
Training models with a set of written principles rather than relying solely on human preference data for alignment.
Reverse-engineering neural networks to understand the algorithms they've learned, circuit by circuit.
Adversarial testing where researchers try to make models fail, reveal biases, or produce harmful outputs.
Can humans effectively supervise AI systems that operate faster and in higher dimensions than we can perceive?
When models find unintended ways to maximize their reward signal without actually doing what we want.
Subtle biases in training data that propagate through model weights and resist standard debiasing techniques.