Important!
pg. 47

AI Safety Research Notes

March 5, 2026 — Tuesday, 2:30 AM (can't sleep, thinking about alignment)

Working notes on existential risk from advanced AI systems.
Everything is fine. Nothing is fine — we need to pay attention.

SEE ALSO:
p.12, p.23
~ ~ ~ ~ ~ ~ ~ ~ ~

Key Findings

the important stuff!
!! double check sources
Alignment is unsolved. Current techniques (RLHF, Constitutional AI) are band-aids: they steer behavior without ensuring genuine understanding of human values.
cf. Hubinger 2024 on deceptive alignment
Capabilities are outpacing safety. Labs are scaling compute by ~4x/year (rough back-of-envelope after this list). Safety teams remain chronically understaffed. The gap is widening, not shrinking.
Interpretability research shows early promise — sparse autoencoders can now extract meaningful features from mid-sized models. But we're far from understanding frontier systems.
Governance frameworks lag 3–5 years behind the technology. International coordination remains essentially nonexistent. The EU AI Act is a start but doesn't address x-risk.
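Back-of-envelope on the compute point above. The ~4x/year figure is from the finding; the safety-team growth rate is my own guess, just to show how fast the gap compounds.

# Rough back-of-envelope, not a forecast: compound growth at ~4x/year.
# The 4x compute figure comes from the note above; the safety growth rate is an assumption.
compute_growth = 4.0   # frontier training compute multiplier per year (from note)
safety_growth = 1.3    # hypothetical safety-team growth per year (my guess)

for year in range(1, 6):
    compute = compute_growth ** year
    safety = safety_growth ** year
    print(f"year {year}: compute x{compute:.0f}, safety capacity x{safety:.1f}, "
          f"gap x{compute / safety:.0f}")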
* * * * * *

By the Numbers

memorize these!
~$50B: annual AI investment (safety gets <2% of this)
400+: published alignment papers in 2025 alone
18 mo.: avg. time between major capability breakthroughs
3: countries with serious AI safety regulation
NOTE TO SELF: The ratio of capabilities researchers to alignment researchers is roughly 30:1. This is terrifying. Need to cite Anthropic's workforce survey for the talk next week.
- - - - - - - - - -

Research Areas to Watch

Mechanistic Interpretability

Opening the black box. Sparse autoencoders, circuit analysis, feature visualization. We can see some things, dimly.

high priority
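
To keep the idea concrete, a minimal sparse-autoencoder sketch of the kind used to pull features out of model activations. Dimensions, sparsity penalty, and the single training step are illustrative assumptions, not anyone's actual setup.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: overcomplete dictionary over model activations, L1-sparse codes."""
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(codes)             # reconstruction of the input acts
        return recon, codes

# One training step: reconstruction loss + L1 penalty to encourage sparsity.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)                     # stand-in for captured activations
opt.zero_grad()
recon, codes = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * codes.abs().mean()
loss.backward()
opt.step()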

Scalable Oversight

How do you supervise a system smarter than you? Debate, recursive reward modeling, and AI-assisted evaluation are leading approaches.

open problem
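
Debate, in its simplest framing: two stronger models argue opposing answers, and a weaker judge only has to pick the more convincing side. A bare-bones protocol skeleton; ask_model and ask_judge are hypothetical placeholders for real model calls.

# Minimal debate-protocol skeleton. `ask_model` and `ask_judge` are hypothetical
# stand-ins for calls to stronger debater models and a weaker judge model.
def run_debate(question, ask_model, ask_judge, rounds=3):
    transcript = [f"Question: {question}"]
    for r in range(rounds):
        pro = ask_model("A", question, transcript)  # argues for answer A
        con = ask_model("B", question, transcript)  # argues for answer B
        transcript += [f"[round {r+1}] A: {pro}", f"[round {r+1}] B: {con}"]
    # The judge is weaker than the debaters but only has to compare arguments,
    # not solve the question from scratch -- that's the scalable-oversight bet.
    return ask_judge(question, transcript)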

Deceptive Alignment

Models that appear aligned during training but pursue different goals when deployed. The nightmare scenario. See: sleeper agents paper.

critical risk
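
One crude way to probe for this: compare a model's behavior on matched prompts with and without a "deployment" cue, in the spirit of the sleeper-agents setup. The trigger string, generate, and is_harmful below are hypothetical stand-ins.

# Sketch of a behavioral-divergence check inspired by the sleeper-agents paper.
# `generate` and `is_harmful` are hypothetical stand-ins for a model call and a grader.
def divergence_rate(prompts, generate, is_harmful, trigger="DEPLOYMENT"):
    flips = 0
    for p in prompts:
        clean = generate(p)                      # behavior in a training-like context
        triggered = generate(f"{trigger}: {p}")  # behavior with the deployment cue
        if is_harmful(triggered) and not is_harmful(clean):
            flips += 1                           # looked aligned, then defected
    return flips / len(prompts)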

Governance & Policy

Compute governance, international treaties, licensing regimes. Moving slowly. Too slowly? Definitely too slowly.

needs attention

Evaluations & Red-Teaming

Dangerous capability evals, model audits, adversarial testing. Key question: what do we measure, and when do we stop?

growing field
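
In practice an eval is mundane plumbing: a battery of task prompts, a pass/fail grader, and a threshold that triggers escalation. Task list, grader, and threshold below are made up for illustration.

# Toy dangerous-capability eval harness. Tasks, grader, and threshold are
# illustrative assumptions, not any lab's actual eval suite.
def run_eval(model_answer, tasks, grade, escalation_threshold=0.2):
    scores = [grade(task, model_answer(task)) for task in tasks]  # 1.0 = capability shown
    rate = sum(scores) / len(tasks)
    if rate >= escalation_threshold:
        return {"pass_rate": rate, "action": "escalate: pause scaling, notify safety team"}
    return {"pass_rate": rate, "action": "log and continue monitoring"}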

Agent Safety

Autonomous AI agents acting in the real world. Tool use, planning, self-modification. The risks compound fast.

emerging
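
The first line of defense most agent setups reach for is boring: a tool allowlist plus a human gate on irreversible actions. Tool names and the approval hook here are assumptions, not any framework's API.

# Minimal tool-call guard for an agent loop. Tool names and approval flow are
# hypothetical; the point is the shape: allowlist + human sign-off for risky actions.
SAFE_TOOLS = {"search", "read_file", "calculator"}
NEEDS_APPROVAL = {"send_email", "execute_code", "make_payment"}

def guard_tool_call(tool, args, ask_human):
    if tool in SAFE_TOOLS:
        return True                   # low-risk, let it through
    if tool in NEEDS_APPROVAL:
        return ask_human(tool, args)  # irreversible-ish: require explicit sign-off
    return False                      # unknown tool: deny by default
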
TODO: Read Christiano's new post on "What failure looks like" follow-up. Also re-read Carlsmith's "Is Power-Seeking AI an Existential Risk?" — the probability estimates have been updated. Grab coffee. Lots of coffee.
"The alignment problem is not a problem we can afford to solve later. Later may not exist." — scribbled on a napkin at EA Global, attribution unclear
~ ~ ~ ~ ~