AI Safety Research Hub

Exploring alignment, interpretability, and governance frameworks for advanced AI systems. An independent research compendium.

Alignment · Interpretability · Governance
2026 Edition
01 · Current RLHF approaches show systematic gaps in robustness, with adversarial prompts bypassing safety training in 23% of tested scenarios.
02 · Mechanistic interpretability has identified key attention patterns linked to deceptive reasoning in large language models under evaluation.
03 · Governance frameworks lag technical capabilities by an estimated 3-5 years, with no binding international agreements on frontier AI development.
04 · Constitutional AI methods reduce harmful outputs by 60% compared to base RLHF, but introduce measurable reductions in model helpfulness.
847 · Papers Analyzed (Corpus Size)
30+ · Researchers (Author Profiles)
12 · Domains (Research Areas)
5yr · Lag Estimate (Governance Gap)
ALN · Alignment Theory

Foundational research on ensuring AI systems pursue intended objectives and remain corrigible as their capabilities grow.

INT · Interpretability

Mechanistic and circuit-level analysis of neural network internals to understand model reasoning and behavior.

GOV · Governance

Policy frameworks, international coordination mechanisms, and regulatory approaches for frontier AI development.

EVL · Evaluations

Benchmarks and testing methodologies for measuring dangerous capabilities and alignment properties in AI systems.

ROB · Robustness

Research on adversarial attacks, jailbreaking, and techniques for making safety training more resilient.

FRC · Forecasting

Forecasts of AI capability timelines, transformative impact, and existential risk probabilities.
