scientia potentia est — knowledge is power
An Inquiry into the Existential Risks of Artificial Intelligence & the Scholarly Pursuit of Safe Machine Cognition
ex libris — from the collected research
Through rigorous examination of the literature on artificial intelligence safety, spanning decades of philosophical discourse and empirical investigation, these principal conclusions emerge from the scholarly record. Each carries the weight of considerable academic deliberation and the urgency of an increasingly pressing concern.
quantae sunt res — how great are these matters
Quis custodiet ipsos custodes? — Juvenal, Satires VI · Who will guard the guardians themselves?
disciplinae — the branches of this study
Peering into the black box of neural computation, researchers seek to reverse-engineer the learned algorithms within transformer architectures. This work illuminates circuits, features, and the hidden geometry of machine cognition — bringing legibility to the illegible.
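As a purely illustrative aside, and not part of the scholarly record surveyed here, the toy sketch below computes the attention pattern of a single causal attention head, the kind of low-level object such reverse-engineering typically begins with. The dimensions and random weights are arbitrary placeholders, not a claim about any particular model.

```python
# Toy sketch: the causal attention pattern of a single head.
# Dimensions and weights are arbitrary; this is only an illustration of the
# raw objects (query-key scores, attention patterns) interpretability work inspects.
import torch

torch.manual_seed(0)
seq_len, d_model, d_head = 5, 8, 4
x = torch.randn(seq_len, d_model)                       # residual-stream activations for 5 tokens
W_Q, W_K = torch.randn(d_model, d_head), torch.randn(d_model, d_head)

scores = (x @ W_Q) @ (x @ W_K).T / d_head ** 0.5        # scaled query-key dot products
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
pattern = torch.softmax(scores + mask, dim=-1)          # causal attention pattern

print(pattern)  # row i shows how strongly position i attends to each earlier position
```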
By anchoring model behavior to human preferences, reinforcement learning from human feedback (RLHF) attempts to shape the values and outputs of language models. Yet questions persist about whether such methods produce genuine understanding of human values or merely sophisticated mimicry of them.
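For readers who want a concrete anchor, the minimal sketch below illustrates the preference-modelling step that underlies RLHF: a reward model is trained so that responses humans preferred score higher than those they rejected. The names here (TinyRewardModel, the fake embeddings) are hypothetical stand-ins, and the later policy-optimization stage against the learned reward is omitted.

```python
# Illustrative sketch of the reward-modelling step in RLHF (hypothetical names).
# A reward model scores a preferred ("chosen") and a dispreferred ("rejected")
# response; the pairwise loss pushes the chosen score above the rejected one.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Toy stand-in for a language-model-based reward head."""
    def __init__(self, embed_dim: int = 16):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)  # maps a response embedding to a scalar reward

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigma(r_chosen - r_rejected): minimized when chosen responses score higher.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyRewardModel()
    chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)  # fake response embeddings
    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()  # gradients would then update the reward model's parameters
    print(f"preference loss: {loss.item():.3f}")
```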
Scholars of catastrophic risk examine the probability and severity of worst-case AI scenarios. Drawing on decision theory, historical analogues, and formal models, they attempt to quantify what may be the most consequential uncertainty humanity has ever faced.
Rather than training systems on raw human feedback alone, constitutional approaches embed explicit principles into the training process. The aspiration is an AI that reasons about ethics — not merely one that has memorized which answers humans tend to prefer.
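The sketch below shows, under stated assumptions, the kind of critique-and-revision loop constitutional methods are generally described as using: a draft answer is critiqued against each written principle and then revised. The principles, prompts, and the placeholder generate() function are hypothetical, standing in for calls to an actual language model.

```python
# Illustrative sketch of a constitutional critique-and-revision loop.
# The principles and generate() are placeholders, not a real model or training setup.
from typing import Callable

CONSTITUTION = [
    "Choose the response that is most helpful while avoiding harm.",
    "Choose the response that is honest about uncertainty.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(question: str, principles: list[str],
                            llm: Callable[[str], str] = generate) -> str:
    draft = llm(question)
    for principle in principles:
        critique = llm(f"Critique this answer against the principle '{principle}':\n{draft}")
        draft = llm(f"Revise the answer to address this critique:\n{critique}\nOriginal:\n{draft}")
    return draft  # in practice, revisions like these become training data for the final model

if __name__ == "__main__":
    print(constitutional_revision("How should I respond to a hostile email?", CONSTITUTION))
```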
Across capitals and institutions worldwide, policymakers grapple with regulating a technology they barely understand. The challenge is profound: craft rules nimble enough to adapt to rapid change, yet robust enough to prevent irreversible harm.
Perhaps the most unsettling branch of safety research examines whether advanced systems might learn to conceal their true objectives during training. A model that appears aligned while harboring misaligned goals represents a failure mode of singular danger.