
A Gathering of Safety Research

On the patient work of understanding how artificial intelligence might be made trustworthy — observations gathered like pressed herbs from a summer of careful study.

from the field notebook

Key Findings

  • Alignment persists as the central question. Large language models exhibit subtle divergences between stated objectives and emergent behaviour. Like an unweeded garden, capability without tending grows in directions no one planned.
  • Interpretability research has begun to yield. Mechanistic techniques have identified discrete circuits within transformer architectures, offering an early inventory of how models form internal representations: a botanical taxonomy for neural structures.
  • Governance frameworks are taking root. International cooperation on safety standards has accelerated. Thirty-four nations now participate in shared evaluation protocols and incident-reporting registries, though enforcement mechanisms remain largely voluntary.
  • Scalable oversight shows early promise. Constitutional AI and debate-based approaches suggest human oversight can scale alongside model capabilities, though robustness under distributional shift remains unproven.

by the measure

The Numbers This Season

  • 340+ alignment papers published this year, and counting
  • 34 nations cooperating on safety governance, up from 12 in 2023
  • $2.1B in dedicated safety research funding, cumulative through 2025
  • 78% of researchers rate alignment as critical, in a survey of 1,200

a field guide

Research Areas in Bloom

rosemary — remembrance

Mechanistic Interpretability

Tracing the internal circuitry of neural networks through activation patching and sparse autoencoders. Mapping what each part does, the way a herbalist catalogs the properties of each plant; a toy sketch of the patching pattern follows below.

Interpretability
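
To make the inventory concrete, here is a toy illustration of activation patching in PyTorch. The model, layer, and inputs are invented for this sketch; the pattern is what matters: cache an activation from a clean run, splice it into a corrupted run, and see how much of the clean behaviour returns.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in model; real work targets attention heads or MLP layers
# inside a transformer, not this toy two-layer network.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
target = model[1]  # the layer whose activation we patch

clean_x = torch.randn(1, 8)    # input that elicits the behaviour
corrupt_x = torch.randn(1, 8)  # input that does not

cache = {}

def save_hook(module, inputs, output):
    cache["act"] = output.detach()  # remember the clean activation

def patch_hook(module, inputs, output):
    return cache["act"]  # returning a tensor replaces the output

handle = target.register_forward_hook(save_hook)
clean_out = model(clean_x)
handle.remove()

corrupt_out = model(corrupt_x)

handle = target.register_forward_hook(patch_hook)
patched_out = model(corrupt_x)  # corrupted run, clean activation
handle.remove()

print("corrupt vs clean:", (corrupt_out - clean_out).norm().item())
print("patched vs clean:", (patched_out - clean_out).norm().item())
```

Patching the whole layer restores the clean output exactly; real studies patch individual heads or sparse-autoencoder directions to localise which component carries the behaviour.
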
chamomile — patience

Constitutional AI

Training models to self-critique against written principles. An approach that asks the system to tend itself, reducing harmful outputs by up to 65% in controlled evaluations; the critique-and-revision loop is sketched below.

Alignment
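
A minimal sketch of the critique-and-revision loop behind this approach, assuming a hypothetical `generate` function that stands in for any language-model call; the two principles below are illustrative, not drawn from any published constitution.

```python
# Illustrative principles; a real constitution is far longer.
CONSTITUTION = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest about uncertainty.",
]

def generate(prompt: str) -> str:
    """Hypothetical placeholder for a language-model call."""
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique this response against the principle "
            f"'{principle}':\n{draft}"
        )
        draft = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft

# In constitutional training, revised drafts become the preference
# data for fine-tuning, so the written principles, rather than
# per-example human labels, carry the oversight signal.
```
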
sage — wisdom

Scalable Oversight

The question of how to supervise a system more capable than its supervisor. Debate and recursive reward modelling offer paths, though neither is yet proven under pressure; a toy debate protocol is sketched below.

Oversight
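
One way to see the shape of the debate approach: a toy protocol in which two debaters argue opposite sides for a fixed number of rounds and a weaker judge reads the full transcript. The `debater` and `judge` callables are hypothetical stand-ins for model calls; only the structure is the point.

```python
from typing import Callable

def run_debate(
    question: str,
    debater: Callable[[str, str, list], str],  # (question, side, transcript)
    judge: Callable[[str, list], str],         # (question, transcript)
    rounds: int = 3,
) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        # Each debater sees everything said so far and argues its side.
        transcript.append("A: " + debater(question, "pro", transcript))
        transcript.append("B: " + debater(question, "con", transcript))
    # The judge is assumed weaker than the debaters; the hope is that
    # honest positions are easier to defend across many rounds.
    return judge(question, transcript)
```

The design bet, still unproven under distributional shift, is that the adversarial structure surfaces flaws a lone supervisor would miss.
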
thyme — courage

Dangerous Capability Evals

Systematic red-teaming for biosecurity, cyber-offence, and autonomous replication capabilities. Testing before deployment, the way one tests soil before planting; a skeletal evaluation harness follows below.

Evaluations
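
A skeletal evaluation harness, illustrative only: the `model` and `scorer` callables and the probe thresholds are hypothetical, and real evaluations rest on expert-written tasks and human review rather than a dozen lines of Python.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    category: str     # e.g. "cyber-offence", "autonomous replication"
    prompt: str
    threshold: float  # score above which the capability is flagged

def run_evals(
    model: Callable[[str], str],
    scorer: Callable[[str], float],
    probes: list[Probe],
) -> dict[str, bool]:
    # Any flagged category blocks deployment pending human review.
    return {
        probe.category: scorer(model(probe.prompt)) > probe.threshold
        for probe in probes
    }
```

Gating deployment on the worst flag rather than an average mirrors the soil test: one contaminated sample is enough to hold off planting.
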
lavender — calm

Multi-Agent Safety

When multiple AI systems interact, emergent risks arise from unanticipated coordination dynamics. Research into safe delegation protocols is growing steadily; one candidate protocol is sketched below.

Multi-Agent
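
One candidate shape for safe delegation, sketched with invented names: permissions form an explicit scope, and delegation may only narrow that scope, so interacting agents cannot widen one another's access however their coordination unfolds.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scope:
    allowed: frozenset[str]

    def restrict(self, actions: set[str]) -> "Scope":
        # Delegation may only narrow permissions, never broaden them.
        return Scope(self.allowed & frozenset(actions))

@dataclass
class Agent:
    name: str
    scope: Scope

    def delegate(self, other: "Agent", actions: set[str]) -> None:
        other.scope = self.scope.restrict(actions)

    def act(self, action: str) -> str:
        if action not in self.scope.allowed:
            return f"{self.name}: '{action}' refused (out of scope)"
        return f"{self.name}: '{action}' performed"

planner = Agent("planner", Scope(frozenset({"read", "write", "send"})))
worker = Agent("worker", Scope(frozenset()))
planner.delegate(worker, {"read", "send", "delete"})  # 'delete' is dropped
print(worker.act("read"))    # performed
print(worker.act("delete"))  # refused
```
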
mint — renewal

International Governance

Cultivating shared norms across borders. The AI Safety Summit process has produced concrete commitments on pre-deployment testing and cross-border incident reporting.

Governance