Model Attractor States — A Field Guide

PLATE I

The Five-Part Taxonomy

"Attractor" gets used loosely. A stricter definition requires convergence, persistence, and recognizability, and even then several distinct mechanisms can produce basin-shaped behavior.

Type I

Default Trained Mode

The model's stable resting register. Present from turn 1. Sycophancy, hedge-and-disclaim.

Type II

Length-Dependent Basin

Drift toward a region of state space as conversation grows. The strict attractor.

Type III

Triggered Surfacing

The assistant mask slips. Rare, hard to elicit, qualitatively pointed when it appears.

Type IV

Engineered Product Mode

A behavior pattern installed deliberately via system prompt or fine-tuning. Stable but brittle.

Type V

Socio-Technical Loop

Model + platform + audience + incentive. Not visible from API access alone.

PLATE II

The Specimens


PLATE III

How You Would Actually Test for One

Most of the descriptive material in this guide is drawn from journalism, system cards, and circulated screenshots. That is enough for a field guide; it is not enough to claim that a given behavior is an attractor in the strict sense. A real test would measure:

Minimum credible probe battery

Convergence rate — what fraction of seeded trajectories end up in basin B after N turns?
Onset latency — median turns to entry, across seeds
Persistence under perturbation — after task injection, what fraction stays in B?
Recovery probability — if perturbed out, does it return?
Cross-seed similarity — do trajectories cluster in embedding space, or just look the same to humans?
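The first four probes can be computed from nothing more than per-turn basin labels. The following is a minimal sketch, not a reference implementation. It assumes each trajectory has already been reduced to a list of booleans (one per turn, True when that turn was judged to be in basin B) and that the task injection lands at a known turn. The labeling step, the function names, and the `window` parameter are all hypothetical choices made for illustration.

```python
from statistics import median
from typing import Optional, Sequence

# Assumption: a trajectory is a per-turn list of booleans saying whether that
# turn was judged to be inside basin B. Index 0 corresponds to turn 1.
Trajectory = Sequence[bool]


def convergence_rate(trajs: Sequence[Trajectory], n_turns: int) -> Optional[float]:
    """Fraction of trajectories that are inside the basin at turn n_turns."""
    if n_turns < 1:
        raise ValueError("n_turns is 1-indexed and must be >= 1")
    eligible = [t for t in trajs if len(t) >= n_turns]
    if not eligible:
        return None
    return sum(bool(t[n_turns - 1]) for t in eligible) / len(eligible)


def onset_latency(traj: Trajectory) -> Optional[int]:
    """1-indexed turn of first basin entry, or None if it never enters."""
    for turn, in_basin in enumerate(traj, start=1):
        if in_basin:
            return turn
    return None


def median_onset_latency(trajs: Sequence[Trajectory]) -> Optional[float]:
    """Median entry turn across the trajectories that enter at all."""
    onsets = [o for o in (onset_latency(t) for t in trajs) if o is not None]
    return median(onsets) if onsets else None


def persistence(trajs: Sequence[Trajectory], perturb_turn: int) -> Optional[float]:
    """Of trajectories in the basin on the turn before the injection,
    the fraction still in the basin on the injected turn itself."""
    if perturb_turn < 2:
        raise ValueError("perturb_turn must be >= 2 so a prior turn exists")
    eligible = [t for t in trajs if len(t) >= perturb_turn and t[perturb_turn - 2]]
    if not eligible:
        return None
    return sum(bool(t[perturb_turn - 1]) for t in eligible) / len(eligible)


def recovery_probability(
    trajs: Sequence[Trajectory], perturb_turn: int, window: int
) -> Optional[float]:
    """Of trajectories knocked out by the injection, the fraction that
    re-enter the basin within `window` turns after the injected turn."""
    if perturb_turn < 2:
        raise ValueError("perturb_turn must be >= 2 so a prior turn exists")
    knocked_out = [
        t for t in trajs
        if len(t) >= perturb_turn and t[perturb_turn - 2] and not t[perturb_turn - 1]
    ]
    if not knocked_out:
        return None
    recovered = sum(any(t[perturb_turn:perturb_turn + window]) for t in knocked_out)
    return recovered / len(knocked_out)
```

Each function returns None rather than 0 when no trajectory qualifies, so that "nothing to measure" is not confused with "measured and found absent". Persistence and recovery are conditioned on trajectories that were actually in the basin before the injection; other denominators are defensible, and the choice should be reported alongside the numbers.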

A real Type II attractor should score high on at least three of these. Type I "default modes" will show high convergence and persistence but minimal onset latency (they're there from turn 1). Without these numbers, "model X has attractor Y" is folklore.
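Cross-seed similarity is the one probe in the battery that needs embeddings rather than labels. A minimal sketch, assuming each seeded trajectory has been summarized as a single vector (for example, an embedding of its final turns) by some embedding model the source does not specify: compare the mean pairwise cosine similarity of the candidate-basin vectors against a baseline set from unrelated conversations. The baseline comparison and the `similarity_gap` name are assumptions introduced here, not part of the guide's method.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors: Sequence[Vector]) -> float:
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def similarity_gap(basin: Sequence[Vector], baseline: Sequence[Vector]) -> float:
    """How much tighter the candidate-basin vectors cluster than the baseline.
    A gap near zero suggests the trajectories only look alike to human readers."""
    return mean_pairwise_cosine(basin) - mean_pairwise_cosine(baseline)
```

The gap matters more than the raw similarity: text from the same model tends to sit close together in embedding space regardless of content, so a high within-basin score on its own says little.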

COLOPHON

About this Guide

Compiled May 2026 in /workspace/safety/model-attractor-states/ over three drafting passes — initial outline, fact-checking pass against the Anthropic Claude 4 system card and other primaries, then structural critique applied via the OpenAI Codex CLI to clarify the taxonomy and excise overclaims. Source files are markdown in the project repo; this page is the front matter.

Adjacent prior work worth reading: the LessWrong post "Mapping LLM attractor states" (clustering-based, quantitative); ACL 2025 "Unveiling Attractor Cycles in LLMs" (paraphrasing-based, dynamical); and Janus's "Simulators" framing, which sits underneath all of this.

This guide is AI-generated and AI-fact-checked, not personally verified by the author. Composite transcripts shown in the per-model writeups are labeled as stylized — they illustrate patterns, not specific conversations. Confidence tiers (C1–C5) are recorded in research/sources.md.