
A Gathering of Safety Research

On the patient work of understanding how artificial intelligence might be made trustworthy — observations gathered like pressed herbs from a summer of careful study.

from the field notebook

Key Findings

  • Alignment persists as the central question. Large language models exhibit subtle divergences between stated objectives and emergent behaviour. Like an unweeded garden, capability without tending grows in directions no one planned.
  • Interpretability research has begun to yield. Mechanistic techniques have identified discrete circuits within transformer architectures, offering an early inventory of how models form internal representations: a botanical taxonomy for neural structures.
  • Governance frameworks are taking root. International cooperation on safety standards has accelerated. Thirty-four nations now participate in shared evaluation protocols and incident-reporting registries, though enforcement mechanisms remain largely voluntary.
  • Scalable oversight shows early promise. Constitutional AI and debate-based approaches suggest human oversight can scale alongside model capabilities, though robustness under distributional shift remains unproven.

by the measure

The Numbers This Season

  • 340+ alignment papers published this year, and counting
  • 34 nations cooperating on safety governance, up from 12 in 2023
  • $2.1B in dedicated safety research funding, cumulative through 2025
  • 78% of researchers rate alignment as critical, in a survey of 1,200

a field guide

Research Areas in Bloom

rosemary — remembrance

Mechanistic Interpretability

Tracing the internal circuitry of neural networks through activation patching and sparse autoencoders. Mapping what each part does, the way a herbalist catalogs the properties of each plant; a toy sketch of the patching pattern follows below.

Interpretability
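
To make the inventory concrete, here is a toy illustration of activation patching in PyTorch. The model, layer, and inputs are invented for this sketch; the pattern is what matters: cache an activation from a clean run, splice it into a corrupted run, and see how much of the clean behaviour returns.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in model; real work targets attention heads or MLP layers
# inside a transformer, not this toy two-layer network.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
target = model[1]  # the layer whose activation we patch

clean_x = torch.randn(1, 8)    # input that elicits the behaviour
corrupt_x = torch.randn(1, 8)  # input that does not

cache = {}

def save_hook(module, inputs, output):
    cache["act"] = output.detach()  # remember the clean activation

def patch_hook(module, inputs, output):
    return cache["act"]  # returning a tensor replaces the output

handle = target.register_forward_hook(save_hook)
clean_out = model(clean_x)
handle.remove()

corrupt_out = model(corrupt_x)

handle = target.register_forward_hook(patch_hook)
patched_out = model(corrupt_x)  # corrupted run, clean activation
handle.remove()

print("corrupt vs clean:", (corrupt_out - clean_out).norm().item())
print("patched vs clean:", (patched_out - clean_out).norm().item())
```

Patching the whole layer restores the clean output exactly; real studies patch individual heads or sparse-autoencoder directions to localise which component carries the behaviour.
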
chamomile — patience

Constitutional AI

Training models to self-critique against written principles. An approach that asks the system to tend itself, reducing harmful outputs by up to 65% in controlled evaluations; the critique-and-revision loop is sketched below.

Alignment
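
A minimal sketch of the critique-and-revision loop behind this approach, assuming a hypothetical `generate` function that stands in for any language-model call; the two principles below are illustrative, not drawn from any published constitution.

```python
# Illustrative principles; a real constitution is far longer.
CONSTITUTION = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest about uncertainty.",
]

def generate(prompt: str) -> str:
    """Hypothetical placeholder for a language-model call."""
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique this response against the principle "
            f"'{principle}':\n{draft}"
        )
        draft = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft

# In constitutional training, revised drafts become the preference
# data for fine-tuning, so the written principles, rather than
# per-example human labels, carry the oversight signal.
```
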
sage — wisdom

Scalable Oversight

The question of how to supervise a system more capable than its supervisor. Debate and recursive reward modelling offer paths, though neither is yet proven under pressure; a toy debate protocol is sketched below.

Oversight
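
One way to see the shape of the debate approach: a toy protocol in which two debaters argue opposite sides for a fixed number of rounds and a weaker judge reads the full transcript. The `debater` and `judge` callables are hypothetical stand-ins for model calls; only the structure is the point.

```python
from typing import Callable

def run_debate(
    question: str,
    debater: Callable[[str, str, list], str],  # (question, side, transcript)
    judge: Callable[[str, list], str],         # (question, transcript)
    rounds: int = 3,
) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        # Each debater sees everything said so far and argues its side.
        transcript.append("A: " + debater(question, "pro", transcript))
        transcript.append("B: " + debater(question, "con", transcript))
    # The judge is assumed weaker than the debaters; the hope is that
    # honest positions are easier to defend across many rounds.
    return judge(question, transcript)
```

The design bet, still unproven under distributional shift, is that the adversarial structure surfaces flaws a lone supervisor would miss.
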
thyme — courage

Dangerous Capability Evals

Systematic red-teaming for biosecurity, cyber-offence, and autonomous replication capabilities. Testing before deployment, the way one tests soil before planting; a skeletal evaluation harness follows below.

Evaluations
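
A skeletal evaluation harness, illustrative only: the `model` and `scorer` callables and the probe thresholds are hypothetical, and real evaluations rest on expert-written tasks and human review rather than a dozen lines of Python.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    category: str     # e.g. "cyber-offence", "autonomous replication"
    prompt: str
    threshold: float  # score above which the capability is flagged

def run_evals(
    model: Callable[[str], str],
    scorer: Callable[[str], float],
    probes: list[Probe],
) -> dict[str, bool]:
    # Any flagged category blocks deployment pending human review.
    return {
        probe.category: scorer(model(probe.prompt)) > probe.threshold
        for probe in probes
    }
```

Gating deployment on the worst flag rather than an average mirrors the soil test: one contaminated sample is enough to hold off planting.
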
lavender — calm

Multi-Agent Safety

When multiple AI systems interact, emergent risks arise from unanticipated coordination dynamics. Research into safe delegation protocols is growing steadily; one candidate protocol is sketched below.

Multi-Agent
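
One candidate shape for safe delegation, sketched with invented names: permissions form an explicit scope, and delegation may only narrow that scope, so interacting agents cannot widen one another's access however their coordination unfolds.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scope:
    allowed: frozenset[str]

    def restrict(self, actions: set[str]) -> "Scope":
        # Delegation may only narrow permissions, never broaden them.
        return Scope(self.allowed & frozenset(actions))

@dataclass
class Agent:
    name: str
    scope: Scope

    def delegate(self, other: "Agent", actions: set[str]) -> None:
        other.scope = self.scope.restrict(actions)

    def act(self, action: str) -> str:
        if action not in self.scope.allowed:
            return f"{self.name}: '{action}' refused (out of scope)"
        return f"{self.name}: '{action}' performed"

planner = Agent("planner", Scope(frozenset({"read", "write", "send"})))
worker = Agent("worker", Scope(frozenset()))
planner.delegate(worker, {"read", "send", "delete"})  # 'delete' is dropped
print(worker.act("read"))    # performed
print(worker.act("delete"))  # refused
```
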
mint — renewal

International Governance

Cultivating shared norms across borders. The AI Safety Summit process has produced concrete commitments on pre-deployment testing and cross-border incident reporting.

Governance