On the patient work of understanding how artificial intelligence might be made trustworthy — observations gathered like pressed herbs from a summer of careful study.
Interpretability
Tracing the internal circuitry of neural networks through activation patching and sparse autoencoders. Mapping what each part does, the way a herbalist catalogs the properties of each plant.

Alignment
Training models to self-critique against written principles. An approach that asks the system to tend itself, reducing harmful outputs by up to 65% in controlled evaluations.

Oversight
The question of how to supervise a system more capable than its supervisor. Debate and recursive reward modelling offer paths, though none is yet fully proven under pressure.

Evaluations
Systematic red-teaming for biosecurity, cyber-offence, and autonomous replication capabilities. Testing before deployment, the way one tests soil before planting.

Multi-Agent
When multiple AI systems interact, emergent risks arise from unanticipated coordination dynamics. Research into safe delegation protocols is growing steadily.

Governance
Cultivating shared norms across borders. The AI Safety Summit process has produced concrete commitments on pre-deployment testing and cross-border incident reporting.
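
Of the techniques named above, activation patching is perhaps the easiest to hold in the hand. The idea: cache a layer's activations from a run on one input, splice them into a run on another, and see how much of the first output is restored. A minimal sketch on a hypothetical toy two-layer network (the weights and inputs are invented for illustration; real interpretability work patches individual components of a trained transformer):

```python
import numpy as np

# Toy two-layer network with random weights (illustrative only).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

def forward(x, patch=None):
    """Run the network; optionally overwrite the hidden layer with `patch`."""
    h = np.tanh(x @ W1)
    if patch is not None:
        h = patch  # the "patch": splice in cached activations
    return h @ W2

clean = rng.normal(size=4)
corrupt = rng.normal(size=4)

# Cache the hidden activations from the clean run...
h_clean = np.tanh(clean @ W1)

# ...then patch them into the corrupted run.
patched_out = forward(corrupt, patch=h_clean)
clean_out = forward(clean)

# If patching restores the clean output, the patched layer carries the
# information that distinguishes the two inputs.
print(np.allclose(patched_out, clean_out))  # True: a full-layer patch restores it
```

In practice one patches a single head or neuron rather than a whole layer, so the degree of restoration localizes which component carries the behaviour under study.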