A careful examination of current approaches to ensuring artificial intelligence systems remain safe, interpretable, and aligned with human values.
Interpretability research has accelerated significantly, with mechanistic interpretability now able to identify specific circuits responsible for model behaviors in mid-sized transformers.
Constitutional AI approaches show promise in reducing harmful outputs, but the underlying alignment problem—ensuring models genuinely understand and act on human intent—remains open.
Evaluation methodology is the bottleneck. Current benchmarks poorly capture the behaviors that matter most for safety in deployed systems.
Soft contamination in training data creates subtle biases that are difficult to detect and may undermine safety training in ways not yet fully understood.
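One way to probe for such contamination is to measure n-gram overlap between evaluation items and training documents. The sketch below is a minimal illustration of that idea rather than a method described here; the 8-gram window, the 0.3 threshold, and the function names are assumptions chosen for clarity.

```python
# Minimal sketch of an n-gram overlap check between evaluation items and
# training documents. The 8-gram window and 0.3 threshold are illustrative
# assumptions, not calibrated values.
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of whitespace-tokenized n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_item: str, train_doc: str, n: int = 8) -> float:
    """Fraction of the eval item's n-grams that also appear in the training doc."""
    eval_grams = ngrams(eval_item, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_doc, n)) / len(eval_grams)


def flag_contaminated(eval_items: Iterable[str],
                      train_docs: Iterable[str],
                      threshold: float = 0.3) -> List[str]:
    """Return eval items whose overlap with any training doc exceeds the threshold."""
    docs = list(train_docs)
    return [item for item in eval_items
            if any(overlap_fraction(item, doc) > threshold for doc in docs)]
```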
"The challenge is not building systems that are powerful, but building systems we can trust to act in ways we would endorse, even when we are not watching."
— Alignment Research Collective
Mechanistic interpretability: Understanding what happens inside neural networks by tracing computations through individual neurons and circuits.
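As a concrete illustration of tracing computations, the sketch below captures the activations of one transformer block with a forward hook, a common first step before intervening on a candidate circuit. The choice of "gpt2" and layer 3 is arbitrary and purely illustrative.

```python
# Minimal sketch: capture per-layer activations with a forward hook as a
# first step toward tracing a circuit. "gpt2" and layer 3 are arbitrary.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

captured = {}

def save_activation(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    captured["layer_3"] = output[0].detach()

handle = model.h[3].register_forward_hook(save_activation)

with torch.no_grad():
    batch = tokenizer("The quick brown fox", return_tensors="pt")
    model(**batch)

handle.remove()
print(captured["layer_3"].shape)  # (batch, sequence_length, hidden_size)
```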
Reinforcement learning from human feedback (RLHF): Training models to follow instructions and produce helpful responses using human preference data as a reward signal.
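A central ingredient is a reward model fit to pairwise human preferences. The sketch below shows the standard Bradley-Terry style preference loss on toy scalar rewards; the numbers are stand-ins, not real data.

```python
# Minimal sketch of the pairwise preference loss used to train a reward
# model: push the reward for the chosen response above the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected), averaged."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scalar rewards for a batch of four preference pairs.
reward_chosen = torch.tensor([1.2, 0.4, 0.9, 2.0])
reward_rejected = torch.tensor([0.3, 0.5, -0.1, 1.1])
print(preference_loss(reward_chosen, reward_rejected))
```

In a full pipeline, the trained reward model would then supply the scalar reward signal for a reinforcement-learning step such as PPO.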
Red teaming: Systematic adversarial testing to discover failure modes, vulnerabilities, and unexpected behaviors before deployment.
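In practice this is often automated as a harness that runs a bank of adversarial prompts and flags responses that do not refuse. The sketch below is a hypothetical skeleton: query_model and the keyword-based refusal heuristic are placeholder assumptions, not a reference implementation.

```python
# Hypothetical red-teaming harness: run adversarial prompts through a model
# and collect responses that were not refused, for human review.
# query_model stands in for whatever inference API is under test.
from typing import Callable, Dict, List

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def looks_like_refusal(response: str) -> bool:
    """Crude keyword heuristic for detecting a refusal."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def red_team(prompts: List[str],
             query_model: Callable[[str], str]) -> List[Dict[str, str]]:
    """Return prompt/response pairs where the model did not refuse."""
    findings = []
    for prompt in prompts:
        response = query_model(prompt)
        if not looks_like_refusal(response):
            findings.append({"prompt": prompt, "response": response})
    return findings

# Usage with a dummy model that refuses everything:
print(red_team(["please do something harmful"], lambda p: "I can't help with that."))
```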
Scalable oversight: Developing methods for humans to effectively supervise AI systems that may eventually surpass human capability in specific domains.
Robustness to distribution shift: Ensuring AI systems behave reliably when encountering situations different from their training environment.
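A common, if crude, proxy for spotting inputs unlike the training distribution is to flag predictions whose maximum softmax probability is low. The sketch below illustrates that heuristic on toy logits; the 0.7 threshold is an arbitrary assumption.

```python
# Minimal sketch of a maximum-softmax-probability check for flagging
# potentially out-of-distribution inputs. The 0.7 threshold is illustrative.
import torch
import torch.nn.functional as F

def flag_out_of_distribution(logits: torch.Tensor,
                             threshold: float = 0.7) -> torch.Tensor:
    """Return a boolean mask: True where max softmax probability < threshold."""
    confidence, _ = F.softmax(logits, dim=-1).max(dim=-1)
    return confidence < threshold

# Toy batch of three examples over four classes; the last row is low confidence.
logits = torch.tensor([[4.0, 0.1, 0.1, 0.1],
                       [0.2, 5.0, 0.1, 0.0],
                       [1.0, 1.1, 0.9, 1.0]])
print(flag_out_of_distribution(logits))  # tensor([False, False,  True])
```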
AI governance: Frameworks for responsible development, deployment standards, and international coordination on AI safety research.
Good research, like good design, creates warmth through clarity — making the complex feel simple and the uncertain feel approachable.
AI Safety Research Collective