AI Safety — Chalkboard

Key Takeaways

exam material

→

Interpretability is advancing fast — we can now trace specific behaviors to circuits inside medium-sized models, but scaling this to frontier models remains a major challenge.

→

RLHF works but isn't robust. Models trained with human feedback learn to appear aligned without necessarily being aligned in all situations.

→

The evaluation problem is unsolved. We don't have reliable ways to test whether a model is safe before deployment at scale.

→

Training data contamination creates subtle, hard-to-detect biases that can undermine safety measures in unpredictable ways.

"The core difficulty: we need to supervise systems that may eventually be smarter than us. Classical oversight breaks down."

340+

safety papers in 2025

↑

$2.1B

annual research funding

↑↑

active research labs

↑

Topics to Review

Constitutional AI

Training models with a set of written principles rather than relying solely on human preference data for alignment.

Mechanistic Interp.

Reverse-engineering neural networks to understand the algorithms they've learned, circuit by circuit.

Red Teaming

Adversarial testing where researchers try to make models fail, reveal biases, or produce harmful outputs.

Scalable Oversight

Can humans effectively supervise AI systems that operate faster and in higher dimensions than we can perceive?

Reward Hacking

When models find unintended ways to maximize their reward signal without actually doing what we want.

Soft Contamination

Subtle biases in training data that propagate through model weights and resist standard debiasing techniques.