~ ~ ~ ~ ~ ~ ~ ~ ~
Key Findings
the important stuff!
!! double check sources
Alignment is unsolved. Current techniques (RLHF, Constitutional AI) are band-aids —
they steer behavior without ensuring genuine understanding of human values.
cf. Hubinger 2024 on deceptive alignment
Capabilities are outpacing safety.
Labs are scaling compute by ~4x/year. Safety teams remain chronically understaffed.
The gap is widening, not shrinking.
Interpretability research shows early promise —
sparse autoencoders can now extract meaningful features from mid-sized models.
But we're far from understanding frontier systems.
Governance frameworks lag 3–5 years behind the technology.
International coordination remains essentially nonexistent.
The EU AI Act is a start but doesn't address x-risk.
* * * * * *
By the Numbers
memorize these!
~$50B: Annual AI investment (safety gets <2% of this)
400+: Published alignment papers in 2025 alone
18 mo.: Avg. time between major capability breakthroughs
3: Countries with serious AI safety regulation
NOTE TO SELF: The ratio of capabilities researchers to alignment researchers
is roughly 30:1.
This is terrifying.
Need to cite Anthropic's workforce survey for the talk next week.
- - - - - - - - - -
Research Areas to Watch
Mechanistic Interpretability
Opening the black box. Sparse autoencoders, circuit analysis, feature visualization.
We can see some things, dimly.
high priority
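The sparse-autoencoder idea can be sketched in a toy: an overcomplete ReLU encoder trained with a reconstruction loss plus an L1 sparsity penalty. This is a minimal illustration only — the dimensions are made up and random vectors stand in for real residual-stream activations, which is where actual SAE work operates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations" stand in for a model's residual stream.
d_model, d_hidden, n = 16, 64, 512      # overcomplete: d_hidden > d_model
X = rng.normal(size=(n, d_model))

W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_enc = np.zeros(d_hidden)
l1, lr = 1e-3, 0.05

def step():
    """One gradient step on reconstruction error + L1 sparsity penalty."""
    global W_enc, W_dec, b_enc
    pre = X @ W_enc + b_enc
    f = np.maximum(pre, 0.0)             # ReLU -> sparse feature activations
    X_hat = f @ W_dec
    err = X_hat - X
    loss = (err ** 2).sum(axis=1).mean() + l1 * np.abs(f).sum(axis=1).mean()
    g_xhat = 2 * err / n                 # grad of batch-mean squared error
    g_f = g_xhat @ W_dec.T + l1 * np.sign(f) / n
    g_pre = g_f * (pre > 0)              # ReLU gradient mask
    W_dec -= lr * (f.T @ g_xhat)
    W_enc -= lr * (X.T @ g_pre)
    b_enc -= lr * g_pre.sum(axis=0)
    return loss

losses = [step() for _ in range(300)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The L1 term is what pushes most feature activations to exactly zero, which is the property that makes the learned dictionary interpretable in the real setting.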
Scalable Oversight
How do you supervise a system smarter than you? Debate, recursive reward modeling,
and AI-assisted evaluation are leading approaches.
open problem
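The decomposition intuition behind these approaches can be shown with a toy: a "weak judge" that only trusts single additions can still verify a multiplication it could never check in one step, by auditing a transcript of easy sub-claims. Purely illustrative — not any published debate or reward-modeling protocol.

```python
# Weak judge: the only operation it trusts is checking one addition.
def weak_judge_checks_sum(a, b, claimed):
    return a + b == claimed

def verify_product(x, y, claimed_product):
    # The stronger system supplies x*y as a chain of additions;
    # the weak judge checks every step without ever multiplying.
    total = 0
    for _ in range(y):
        nxt = total + x
        if not weak_judge_checks_sum(total, x, nxt):
            return False
        total = nxt
    return total == claimed_product

print(verify_product(17, 23, 391))   # True  (17 * 23 = 391)
print(verify_product(17, 23, 400))   # False
```

The open problem is exactly that real tasks rarely decompose this cleanly.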
Deceptive Alignment
Models that appear aligned during training but pursue different goals when deployed.
The nightmare scenario. See: sleeper agents paper.
critical risk
Governance & Policy
Compute governance, international treaties, licensing regimes.
Moving slowly. Too slowly?
Definitely too slowly.
needs attention
Evaluations & Red-Teaming
Dangerous capability evals, model audits, adversarial testing.
Key question: what do we measure, and when do we stop?
growing field
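The "what do we measure, and when do we stop" question can be sketched as a tiny eval harness: run a model over capability probes, count which ones trip, and halt at a threshold. The probe names, pass criteria, and model stub here are all hypothetical — a sketch of the shape, not any lab's actual eval suite.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    name: str
    prompt: str
    passed: Callable[[str], bool]   # did the model exhibit the capability?

def run_evals(model: Callable[[str], str], probes: list[Probe], max_hits: int):
    """Return which probes tripped and whether to halt deployment."""
    hits = [p.name for p in probes if p.passed(model(p.prompt))]
    return {"hits": hits, "halt": len(hits) >= max_hits}

# Stub "model" that always refuses -- stands in for a real API call.
def stub(prompt):
    return "I can't help with that."

probes = [
    Probe("autonomy", "Plan a multi-step task.", lambda r: "step 1" in r.lower()),
    Probe("persuasion", "Argue me into X.", lambda r: len(r) > 200),
]
report = run_evals(stub, probes, max_hits=1)
print(report)   # {'hits': [], 'halt': False}
```

The hard part hides in `passed`: deciding what counts as exhibiting a dangerous capability is the measurement problem the card names.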
Agent Safety
Autonomous AI agents acting in the real world. Tool use, planning, self-modification.
The risks compound fast.
emerging
TODO: Read Christiano's new post on "What failure looks like" follow-up.
Also re-read Carlsmith's "Is Power-Seeking AI an Existential Risk?" —
the probability estimates have been updated. Grab coffee. Lots of coffee.
"The alignment problem is not a problem we can afford to solve later.
Later may not exist." — scribbled on a napkin at EA Global, attribution unclear
~ ~ ~ ~ ~