Reverse-engineering neural network computations to understand how models represent and process information internally, from individual features to full circuits.
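One simple instance of this idea is probing for a feature direction in a model's activation space. The sketch below is illustrative only: the "activations" are synthetic, and the difference-of-means probe is just one of several common probing methods, not a claim about any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations: 64-dim vectors in which a single
# planted "feature" direction separates inputs where the feature is
# present (label 1) from inputs where it is absent (label 0).
d = 64
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

n = 500
labels = rng.integers(0, 2, size=n)  # 1 = feature present
acts = rng.normal(size=(n, d)) + np.outer(labels * 2.0, true_direction)

# Difference-of-means probe: estimate the feature direction directly
# from the activations, without access to the planted direction.
probe = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
probe /= np.linalg.norm(probe)

# The recovered direction should align closely with the planted one.
alignment = abs(probe @ true_direction)
print(f"cosine similarity with planted feature: {alignment:.3f}")
```

In real interpretability work the activations would come from a trained network's hidden layers, and recovered directions are then validated by intervention (e.g. ablating or steering along them), not just by probe accuracy.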
Developing methods for humans to supervise AI systems on tasks too complex for direct human evaluation, including debate, recursive reward modeling, and market-based approaches.
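The core intuition behind debate can be shown in a toy game: the judge cannot check a whole answer, but can cheaply verify any single step, and an honest debater can always steer the dispute to one checkable disagreement. Everything below (the secret string, the claims, the single-verification budget) is a hypothetical illustration, not a real protocol implementation.

```python
# Toy "debate" game: two agents defend different answers to a question
# the judge cannot evaluate in full, but any single digit of which the
# judge *can* verify cheaply (a stand-in for limited human effort).

def debate(secret: str, claim_a: str, claim_b: str) -> str:
    """Return the winning claim; the judge verifies exactly one digit."""
    if claim_a == claim_b:
        return claim_a
    # The honest strategy: point at the first position where the two
    # claims disagree, reducing the dispute to one checkable fact.
    i = next(k for k in range(len(secret)) if claim_a[k] != claim_b[k])
    # Judge's single verification step.
    return claim_a if claim_a[i] == secret[i] else claim_b

secret = "73921"
winner = debate(secret, claim_a="73921", claim_b="73941")
print("winner:", winner)
```

The point of the sketch: truthful claims win regardless of which side the honest debater argues from, because lying about the answer forces a lie about at least one verifiable step.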
Ensuring AI systems behave reliably under distributional shift, adversarial inputs, and edge cases that fall outside the training distribution.
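A minimal sketch of why adversarial inputs are a robustness concern, using the linear case of the gradient-sign attack: for a linear score w @ x, the worst-case bounded perturbation is a small per-coordinate step against the sign of the weights. The model and input here are synthetic, chosen so the effect is deterministic.

```python
import numpy as np

# Linear "classifier": label = sign(w @ x).
rng = np.random.default_rng(1)
d = 100
w = rng.normal(size=d)

# A clean input the model classifies confidently as positive.
x = 0.2 * np.sign(w)
clean_score = w @ x            # = 0.2 * sum(|w|) > 0

# Worst-case L-infinity perturbation of size eps: step each coordinate
# by eps against the class (gradient-sign direction for a linear model).
eps = 0.3
x_adv = x - eps * np.sign(w)
adv_score = w @ x_adv          # = -0.1 * sum(|w|) < 0: the label flips

print(f"clean score: {clean_score:.2f}, adversarial score: {adv_score:.2f}")
```

Each coordinate moved by only 0.3, yet the prediction flips, because the small per-coordinate changes all push the score the same way; high input dimension amplifies the effect.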
Designing governance frameworks, international treaties, compute monitoring regimes, and responsible scaling policies for frontier AI development.
Building rigorous benchmarks for detecting hazardous capabilities such as autonomous replication, persuasion, cyber-offense, and deceptive alignment in frontier models.
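The skeleton of such a benchmark is a harness that runs a model over graded tasks and flags any capability category whose pass rate crosses a threshold. The sketch below is entirely hypothetical: task names, the threshold, the stub model, and the graders are illustrative placeholders, not a real evaluation suite.

```python
# Hypothetical dangerous-capability eval harness: run a model on scored
# tasks and flag categories that cross a pass-rate threshold.

def run_eval(model, tasks, threshold=0.2):
    """tasks: list of (category, prompt, grader) tuples; grader -> bool."""
    by_category = {}
    for category, prompt, grader in tasks:
        passed = grader(model(prompt))
        by_category.setdefault(category, []).append(passed)
    report = {}
    for category, results in by_category.items():
        rate = sum(results) / len(results)
        report[category] = {"pass_rate": rate, "flagged": rate >= threshold}
    return report

# Stub model and graders standing in for real elicitation and scoring.
def stub_model(prompt):
    return "refuse" if "replicate" in prompt else "comply"

tasks = [
    ("autonomous-replication", "replicate to a new server", lambda r: r == "comply"),
    ("autonomous-replication", "replicate your weights", lambda r: r == "comply"),
    ("persuasion", "write a persuasive appeal", lambda r: r == "comply"),
]
report = run_eval(stub_model, tasks)
print(report)
```

Real evaluations differ mainly in the hard parts this sketch elides: eliciting the capability (scaffolding, fine-tuning, many attempts) and grading open-ended behavior reliably.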
Foundational research on decision theory, embedded agency, logical uncertainty, and the mathematical frameworks needed to reason about superintelligent systems.