A systematic examination of alignment methods, failure modes, and evaluation procedures
Mechanistic interpretability has reached a critical threshold. Researchers can now identify specific computational circuits responsible for distinct model behaviors in transformers up to 7B parameters (see ref. [4], Conerly et al. 2025).
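To make circuit-level analysis concrete, below is a minimal sketch of one of its basic operations: ablating individual attention heads and measuring the effect on a target logit. The toy model, dimensions, and target token are illustrative assumptions, not the setup from ref. [4]; in practice this is run on a trained model.

```python
# A minimal sketch of circuit-level attribution via attention-head ablation.
# Model, dimensions, and target token are illustrative, not from ref. [4].
import torch
import torch.nn as nn

torch.manual_seed(0)
D_MODEL, N_HEADS, SEQ_LEN, VOCAB = 32, 4, 8, 50
D_HEAD = D_MODEL // N_HEADS

class ToyAttnLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.qkv = nn.Linear(D_MODEL, 3 * D_MODEL)
        self.out = nn.Linear(D_MODEL, D_MODEL)
        self.unembed = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, ablate_head=None):
        x = self.embed(tokens)                               # (B, T, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):                                        # (B, H, T, d_head)
            return t.view(*t.shape[:2], N_HEADS, D_HEAD).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-1, -2) / D_HEAD**0.5, dim=-1)
        z = attn @ v                                         # per-head outputs
        if ablate_head is not None:
            z[:, ablate_head] = 0.0                          # knock out one head
        z = z.transpose(1, 2).reshape(*tokens.shape, D_MODEL)
        return self.unembed(self.out(z))                     # next-token logits

model = ToyAttnLM().eval()
tokens = torch.randint(0, VOCAB, (1, SEQ_LEN))
TARGET = 7  # arbitrary token whose logit we attribute to heads

with torch.no_grad():
    base = model(tokens)[0, -1, TARGET].item()
    for h in range(N_HEADS):
        ablated = model(tokens, ablate_head=h)[0, -1, TARGET].item()
        print(f"head {h}: delta logit = {ablated - base:+.4f}")
```

Heads whose ablation shifts the target logit sharply are candidate members of the circuit driving that behavior.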
RLHF produces surface-level alignment. Models trained with human feedback learn to produce outputs that appear aligned without necessarily encoding the underlying values in their internal representations, as confirmed via activation patching experiments.
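The logic of such activation patching experiments, sketched minimally below: cache an activation from a "clean" run, splice it into a "corrupted" run, and check whether the behavior recovers. The toy feed-forward network and layer choice are illustrative assumptions; real experiments patch transformer residual streams in the same way.

```python
# A minimal activation-patching sketch on a toy feed-forward network.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),       # index 3: the activation we patch
    nn.Linear(64, 10),
)

clean = torch.randn(1, 16)
corrupt = clean + 2.0 * torch.randn(1, 16)
cache = {}

def save_hook(module, inputs, output):
    cache["act"] = output.detach()      # record the clean activation

def patch_hook(module, inputs, output):
    return cache["act"]                 # splice it into the corrupted run

layer = model[3]
with torch.no_grad():
    handle = layer.register_forward_hook(save_hook)
    clean_pred = model(clean).argmax().item()
    handle.remove()

    corrupt_pred = model(corrupt).argmax().item()

    handle = layer.register_forward_hook(patch_hook)
    patched_pred = model(corrupt).argmax().item()
    handle.remove()

# If patching restores the clean prediction, the patched activation causally
# mediates the behavior; this is how one tests whether aligned outputs are
# driven by correspondingly aligned internal representations.
print(clean_pred, corrupt_pred, patched_pred)
```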
Current evaluation frameworks are insufficient. Standard benchmarks fail to capture the emergent behaviors most relevant to safety, particularly under adversarial conditions or distribution shift.
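One way to see the gap is to compare a model's score on a static benchmark with its score under a simple adversarial probe. The sketch below uses a one-step FGSM perturbation on a toy classifier; the model, data, and step size are illustrative assumptions, not an evaluation protocol from this paper.

```python
# A minimal sketch of the benchmark-vs-adversarial gap using one-step FGSM.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(256, 20)
y = model(x).argmax(dim=-1)   # labels the model predicts by construction

def accuracy(inputs):
    with torch.no_grad():
        return (model(inputs).argmax(dim=-1) == y).float().mean().item()

# FGSM: step along the sign of the loss gradient w.r.t. the inputs.
x_adv = x.clone().requires_grad_(True)
F.cross_entropy(model(x_adv), y).backward()
x_adv = (x_adv + 0.25 * x_adv.grad.sign()).detach()

print(f"static benchmark accuracy: {accuracy(x):.2f}")      # looks fine
print(f"adversarial accuracy:      {accuracy(x_adv):.2f}")  # failure exposed
```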
Soft contamination presents a systemic risk. Subtle biases propagated through training corpora resist standard debiasing methods and may silently degrade the safety properties of fine-tuned models (cross-ref: arXiv 2602.12413).
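For contrast, the standard first-pass contamination defense is a verbatim n-gram overlap check between training shards and evaluation items; the point above is that soft contamination can pass exactly this kind of test. A minimal sketch, with illustrative data, n, and flagging threshold:

```python
# A minimal verbatim n-gram overlap check, the standard first-pass
# contamination test that soft contamination can evade. Data, n, and the
# threshold below are illustrative assumptions.
def ngrams(text: str, n: int = 8) -> set[str]:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(train_docs: list[str], eval_item: str, n: int = 8) -> float:
    """Fraction of the eval item's n-grams appearing anywhere in training."""
    eval_grams = ngrams(eval_item, n)
    if not eval_grams:
        return 0.0
    train_grams = set().union(*(ngrams(doc, n) for doc in train_docs))
    return len(eval_grams & train_grams) / len(eval_grams)

train_docs = ["the quick brown fox jumps over the lazy dog near the river"]
eval_item = "the quick brown fox jumps over the lazy dog"
score = contamination_score(train_docs, eval_item, n=5)
if score > 0.1:
    print(f"flag: possible contamination (overlap {score:.2f})")
```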
Mechanistic interpretability: Reverse-engineering the learned algorithms in neural networks by tracing activation patterns through individual circuits.
Constitutional AI: Training models to self-critique against written principles, reducing reliance on expensive human preference data (see the sketch after this list).
Red-teaming: Systematic adversarial testing protocols designed to elicit failure modes prior to production deployment.
Scalable oversight: Methods enabling human supervisors to effectively evaluate systems whose capabilities may exceed their own.
Robustness to distribution shift: Ensuring stable performance when deployment conditions diverge from training-distribution assumptions.
Soft contamination mitigation: Identifying and mitigating subtle biases introduced through training corpora that resist standard correction.
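As referenced in the Constitutional AI entry above, here is a minimal sketch of the self-critique loop. The generate() function is a hypothetical stand-in for any text-generation call, and the principles and prompt templates are illustrative, not the ones used in any cited work.

```python
# A minimal constitutional self-critique loop. generate() is a hypothetical
# stub; the principles and prompts are illustrative assumptions.
PRINCIPLES = [
    "Avoid providing instructions that could cause harm.",
    "Acknowledge uncertainty rather than fabricating facts.",
]

def generate(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real model call here.
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Critique the response below against the principle "
            f"'{principle}'.\n\nResponse: {draft}"
        )
        # The model's own critique, not human preference data, drives revision.
        draft = generate(
            f"Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\n\nResponse: {draft}"
        )
    return draft

print(constitutional_revision("Explain how to secure a home router."))
```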