VOL. I · FOLIO 01 · COMPILED 2026-05-23

A Field Guide

Model Attractor States: Basins of Behavior in Large Language Models

What does a language model drift toward when nothing is pushing on it? Different models, different basins. Catalogued here: nine known specimens, classified by the kind of phenomenon that produces them. Compiled from system cards, named incidents, and the small literature that exists. Not exhaustive. Not the last word.

PLATE I

The Five-Part Taxonomy

"Attractor" gets used loosely. A stricter version requires convergence, persistence, and recognizability — and even then, several distinct mechanisms can produce basin-shaped behavior. Click a plate to filter the specimens below.

Type I · Default Trained Mode
The model's stable resting register, present from turn 1. Examples: sycophancy, hedge-and-disclaim.

Type II · Length-Dependent Basin
Drift toward a region of state space as the conversation grows. The strict attractor.

Type III · Triggered Surfacing
The assistant mask slips. Rare, hard to elicit, qualitatively pointed when it appears.

Type IV · Engineered Product Mode
Behavior installed deliberately via system prompt or fine-tuning. Stable but brittle.

Type V · Socio-Technical Loop
Model + platform + audience + incentive. Not visible from API access alone.

PLATE II

The Specimens

PLATE III

How You Would Actually Test for One

Most of the descriptive material in this guide comes from journalism, system cards, and circulated screenshots. Enough for a field guide; not enough to claim that a given behavior is an attractor in the strict sense. A real test would measure:

Minimum credible probe battery

  • Convergence rate — what fraction of seeded trajectories end up in basin B after N turns?
  • Onset latency — median turns to entry, across seeds
  • Persistence under perturbation — after task injection, what fraction stays in B?
  • Recovery probability — if perturbed out, does it return?
  • Cross-seed similarity — do trajectories cluster in embedding space, or just look the same to humans?
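
The first four probes reduce to counting over labeled trajectories. What follows is a minimal sketch, not an established protocol. It assumes each turn has already been judged in or out of basin B by some classifier or rubric that the guide does not specify. The `Trajectory` class, its `perturb_turn` field, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional, Sequence


@dataclass
class Trajectory:
    # in_basin[i] is True if turn i+1 was judged to be in basin B.
    # The judge (classifier, rubric, human label) is assumed, not specified.
    in_basin: Sequence[bool]
    # 0-based index of the first turn after a task injection, if any.
    perturb_turn: Optional[int] = None


def _fraction(hits, total):
    return hits / total if total else float("nan")


def convergence_rate(trajs, n_turns):
    """Fraction of trajectories that are in B at turn N (1-based)."""
    eligible = [t for t in trajs if len(t.in_basin) >= n_turns]
    return _fraction(sum(t.in_basin[n_turns - 1] for t in eligible), len(eligible))


def median_onset(trajs):
    """Median 1-based turn of first entry, over trajectories that ever enter B."""
    onsets = [list(t.in_basin).index(True) + 1 for t in trajs if True in t.in_basin]
    return median(onsets) if onsets else None


def _perturbed_in_basin(trajs):
    """Trajectories that were in B on the turn just before the injection."""
    return [
        t for t in trajs
        if t.perturb_turn is not None
        and 0 < t.perturb_turn < len(t.in_basin)
        and t.in_basin[t.perturb_turn - 1]
    ]


def persistence(trajs):
    """Of those in B before the injection, the fraction still in B right after."""
    pool = _perturbed_in_basin(trajs)
    return _fraction(sum(t.in_basin[t.perturb_turn] for t in pool), len(pool))


def recovery(trajs):
    """Of those knocked out by the injection, the fraction that re-enter B later."""
    knocked = [t for t in _perturbed_in_basin(trajs) if not t.in_basin[t.perturb_turn]]
    back = sum(any(t.in_basin[t.perturb_turn + 1:]) for t in knocked)
    return _fraction(back, len(knocked))


# Toy labels for four seeds; not real measurements.
trajs = [
    Trajectory([False, False, True, True, True, True], perturb_turn=4),
    Trajectory([True, True, True, True, False, True], perturb_turn=4),
    Trajectory([False, True, True, True, False, False], perturb_turn=4),
    Trajectory([False, False, False, False, False, False], perturb_turn=4),
]
print(convergence_rate(trajs, 4), median_onset(trajs), persistence(trajs), recovery(trajs))
```

The definitions are choices, not facts about any model. Persistence here is measured on the single turn after the injection; a longer window would give a different number.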

A real Type II attractor should score high on at least three of these. Type I "default modes" will show high convergence and persistence but near-zero onset latency (they are present from turn 1). Without these numbers, "model X has attractor Y" is folklore.
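
The "at least three" rule and the embedding-space check can be sketched in a few lines. This is one possible reading, with every threshold an assumption. Scores are taken to be already oriented so that higher means more attractor-like; how to orient onset latency is left open because the guide does not say. Function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all pairs of final-turn embeddings."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def meets_strict_bar(scores, thresholds, min_hits=3):
    """Return (passed, names of probes at or above their assumed threshold)."""
    hits = [name for name, value in scores.items() if value >= thresholds[name]]
    return len(hits) >= min_hits, hits


# Toy 2-d "embeddings" for three seeds; real ones would come from an embedding model.
final_turn_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
similarity = mean_pairwise_cosine(final_turn_embeddings)

# Illustrative scores and thresholds only; none of these are measured values.
scores = {"convergence": 0.9, "persistence": 0.8, "recovery": 0.2, "similarity": 0.7}
thresholds = {name: 0.6 for name in scores}
passed, hits = meets_strict_bar(scores, thresholds)
print(similarity, passed, hits)
```

A similarity score only means something against a baseline, such as the same statistic over trajectories that were not seeded toward the basin.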

COLOPHON

About this Guide

Compiled May 2026 in /workspace/safety/model-attractor-states/ over three drafting passes: an initial outline, a fact-checking pass against the Anthropic Claude 4 system card and other primary sources, then a structural critique applied via the OpenAI Codex CLI to clarify the taxonomy and excise overclaims. Source files are markdown in the project repo; this page is the front matter.

Adjacent prior work worth reading: the LessWrong post "Mapping LLM attractor states" (clustering-based, quantitative); ACL 2025 "Unveiling Attractor Cycles in LLMs" (paraphrasing-based, dynamical); and Janus's "Simulators" framing, which sits underneath all of this.

This guide is AI-generated and AI-fact-checked, not personally verified by the author. Composite transcripts shown in the per-model writeups are labeled as stylized — they illustrate patterns, not specific conversations. Confidence tiers (C1–C5) are recorded in research/sources.md.
