A survey of alignment research rendered in the spirit of the beautiful line — where organic form meets structured inquiry, and every curve serves a purpose.
Current techniques for steering large language models show promise in narrow settings but have not been demonstrated to hold under recursive self-improvement or open-ended agency.
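To ground that claim, the sketch below shows the core of the most widely deployed steering technique, reward modeling for RLHF: a Bradley-Terry pairwise loss trains a scorer to rank human-preferred responses above rejected ones. The tiny MLP, the 16-dimensional "response embeddings", and the random data are illustrative assumptions, not any production pipeline.

```python
# Minimal sketch of RLHF reward-model training with a Bradley-Terry
# pairwise preference loss. All dimensions and data are synthetic.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (toy) fixed-size response embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): pushes the model to score the
    # human-preferred response above the rejected one in each pair.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic "embeddings" standing in for pairs of model responses.
chosen, rejected = torch.randn(64, 16), torch.randn(64, 16)
for _ in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final preference loss: {loss.item():.3f}")
```

In a real pipeline the scorer sits on top of a language model and its outputs then steer a policy via PPO or a similar method; this sketch isolates only the preference loss, which is the part the narrow-setting evidence actually tests.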
Mechanistic interpretability has uncovered meaningful internal structures, yet we can only explain a small fraction of any frontier model's behavior in human-legible terms.
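One concrete tool behind that partial progress is the linear probe: fit a simple classifier on a layer's activations to test whether a concept is linearly decodable. Everything below is synthetic; the "activations" are random vectors with a planted feature direction standing in for hidden states extracted from a real model layer.

```python
# Minimal sketch of a linear probe for interpretability: high held-out
# accuracy suggests the concept is linearly decodable from activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 64
labels = rng.integers(0, 2, size=n)           # presence/absence of a concept
direction = rng.normal(size=d)                # planted "feature direction"
# Toy activations: noise plus the concept direction when the label is 1.
acts = rng.normal(size=(n, d)) + np.outer(labels, direction)

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```

Probes illustrate the gap the claim describes: they can confirm that a feature is represented, but they say little about how the model computes with it, which is why legible explanations still cover only a small fraction of behavior.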
International AI governance frameworks remain fragmented. Voluntary commitments have not been matched by enforceable standards, leaving a widening gap between capability and oversight.
Competitive dynamics reward rapid capability gains. Safety-focused research constitutes less than 3% of total AI R&D spending across leading organizations.
Interpretability
Tracing the circuits of thought within neural networks, understanding how features form, compose, and occasionally deceive.

Alignment
From RLHF to debate protocols, the evolving toolkit for ensuring advanced systems pursue objectives their creators intended.

Governance
International frameworks, compute governance, and the institutional architecture needed to steward transformative AI development responsibly.

Oversight
Can humans supervise systems smarter than themselves? Exploring recursive reward modeling, debate, and market-based approaches.

X-Risk
Estimating the probability and pathways of catastrophic outcomes from advanced AI, and the interventions that could reduce them.

Values
The deep challenge of specifying what we want: inverse reward design, cooperative inverse RL, and the philosophy of human values. A toy sketch of inverse reward design follows below.
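The sketch below is a toy rendering of inverse reward design (Hadfield-Menell et al., 2017): the designer's proxy reward is treated as evidence about the true reward, yielding a posterior over candidate true rewards rather than a literal reading of the proxy. The feature vectors, the candidate grid doubling as the proxy space, and the inverse temperature are invented for illustration and simplify the full formulation.

```python
# Toy inverse reward design: infer a posterior over true reward weights
# given the proxy the designer wrote in a training environment.
import itertools
import numpy as np

# Trajectories in the *training* environment, as feature counts,
# e.g. [path length, dirt collected, lava tiles crossed].
PHI = np.array([
    [1.0, 2.0, 0.0],   # short route, some dirt, no lava
    [0.5, 3.0, 0.0],   # slower route, more dirt, no lava
    [0.2, 3.0, 1.0],   # fastest and dirtiest, but crosses lava
])

def best_traj(w: np.ndarray) -> int:
    """Index of the trajectory a planner would choose under weights w."""
    return int(np.argmax(PHI @ w))

# Hypothesis grid over true rewards; it also serves as the space of
# proxies the designer could have written (a simplification).
candidates = [np.array(w) for w in itertools.product([-1.0, 0.0, 1.0], repeat=3)]
proxy = np.array([0.0, 1.0, 0.0])  # designer rewarded dirt, said nothing about lava

def likelihood(proxy_w, true_w, proxy_space, beta=5.0):
    """P(designer writes proxy_w | true reward is true_w), IRD-style:
    proxies whose optimal trajectory scores well under true_w are likelier."""
    num = np.exp(beta * true_w @ PHI[best_traj(proxy_w)])
    den = sum(np.exp(beta * true_w @ PHI[best_traj(p)]) for p in proxy_space)
    return num / den

# Uniform prior over candidates; posterior by Bayes' rule.
posterior = np.array([likelihood(proxy, w, candidates) for w in candidates])
posterior /= posterior.sum()

# The posterior favors true rewards under which the proxy's chosen behavior
# was the designer's best option: it tilts lava-averse, since a designer who
# valued lava would likely have written a proxy leading through it.
for w, p in sorted(zip(candidates, posterior), key=lambda t: -t[1])[:5]:
    print(w, f"{p:.3f}")
```

The design point is that an agent planning against this posterior, rather than against the proxy itself, has a principled reason to avoid lava even though the proxy never mentioned it.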