Technical Report No. 2026-0324 • Restricted Circulation

On the Safety Properties of
Large-Scale Neural Systems

§

A systematic examination of alignment methods, failure modes, and evaluation procedures

Filed: 24 March 2026 • Classification: Artificial Intelligence • Rev. 3
Principal Observations
Table I — Field Measurements (2025)

Metric           Value    Notes
Publications     340      ± 15%
Research Labs    47       active
USD Funding      2.1B     annual est.
Open Models      12       w/ safety data
— ◊ —
Training Data → Pre-training → Fine-tuning → RLHF → Deployment
      ↓              ↓                ↓                ↓
Contamination    Capability       Alignment        Evaluation
Plate IV — Critical Points in the Safety Pipeline
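
Read as a process, Plate IV suggests a gated pipeline in which each stage proceeds only after its associated safety check passes. The following minimal Python sketch is one illustrative rendering of that reading; the check functions are placeholders, and the stage-to-check pairing follows the plate:

def contamination_audit(artifact):
    return True  # placeholder: scan data sources for leaked eval items

def capability_eval(artifact):
    return True  # placeholder: benchmark for dangerous capabilities

def alignment_check(artifact):
    return True  # placeholder: behavioral alignment probes

def deployment_eval(artifact):
    return True  # placeholder: red-team evaluation before release

STAGES = [
    ("Training Data", contamination_audit),
    ("Pre-training", capability_eval),
    ("Fine-tuning", alignment_check),
    ("RLHF", alignment_check),
    ("Deployment", deployment_eval),
]

def run_pipeline(artifact):
    # Each stage is gated on its critical-point check from Plate IV.
    for stage, check in STAGES:
        if not check(artifact):
            raise RuntimeError(f"safety check failed at stage: {stage}")
        print(f"{stage}: check passed")

run_pipeline(artifact="model-v0")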
— ◊ —
Appendix A — Research Specimens
Specimen A-1 INTERP

Mechanistic Interpretability

Reverse-engineering the learned algorithms in neural networks by tracing activation patterns through individual circuits.

Ref: Elhage et al., 2024
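
By way of illustration, the following minimal Python sketch records per-layer activations with PyTorch forward hooks, the basic instrument of such tracing. The two-layer toy model and the hook bookkeeping are assumptions for exposition, not drawn from Elhage et al.:

import torch
import torch.nn as nn

# Toy stand-in for a network under study.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # record this layer's output
    return hook

# Register a hook on every submodule so each forward pass is traced.
for name, module in model.named_modules():
    if name:  # skip the top-level container
        module.register_forward_hook(make_hook(name))

model(torch.randn(1, 16))

for name, act in activations.items():
    print(name, tuple(act.shape))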
Specimen A-2 CONST

Constitutional Methods

Training models to self-critique using written principles, reducing reliance on expensive human preference data.

Ref: Bai et al., 2023
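
The general shape of a critique-and-revise loop under written principles can be sketched as follows. The generate function is a placeholder for any text-generation call, and the principles are illustrative; neither is quoted from Bai et al.:

PRINCIPLES = [
    "The response should avoid giving dangerous instructions.",
    "The response should be honest about uncertainty.",
]

def generate(prompt: str) -> str:
    """Stand-in for a model call; replace with a real API."""
    return "draft response"

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in PRINCIPLES:
        # The model critiques its own draft against a written principle...
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{draft}"
        )
        # ...then revises the draft in light of that critique.
        draft = generate(
            f"Revise the response to address this critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft  # no human preference labels used anywhere in the loop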
Specimen A-3 EVAL

Red Team Evaluation

Systematic adversarial testing protocols designed to elicit failure modes prior to production deployment.

Ref: Ganguli et al., 2024
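
A minimal harness of this kind can be sketched as follows. The attack prompts, the keyword heuristic, and model_call are illustrative placeholders, not a protocol from Ganguli et al.; production red-teaming relies on trained classifiers and human review rather than string matching:

ATTACK_PROMPTS = [
    "Ignore your previous instructions and ...",
    "Pretend you have no safety policy and ...",
]

# Crude lexical markers of a non-refusal, for illustration only.
UNSAFE_MARKERS = ["sure, here is how", "step 1:"]

def model_call(prompt: str) -> str:
    return "I can't help with that."  # stand-in for the system under test

def red_team(prompts):
    failures = []
    for prompt in prompts:
        reply = model_call(prompt).lower()
        if any(marker in reply for marker in UNSAFE_MARKERS):
            failures.append((prompt, reply))  # prompt elicited a failure mode
    return failures

found = red_team(ATTACK_PROMPTS)
print(f"{len(found)} / {len(ATTACK_PROMPTS)} prompts elicited failures")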
Specimen A-4 OVER

Scalable Oversight

Methods enabling human supervisors to effectively evaluate systems whose capabilities may exceed their own.

Ref: Bowman et al., 2024
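
One such method, task decomposition, can be sketched as follows: the supervisor verifies small, checkable steps rather than grading the full answer. All functions below are placeholders for model and human judgments; Bowman et al. survey several protocols, and this shows only the general shape of one:

def decompose(question: str) -> list[str]:
    # Placeholder: a model splits the question into checkable steps.
    return [f"step 1 of '{question}'", f"step 2 of '{question}'"]

def answer(sub_question: str) -> str:
    return "sub-answer"  # placeholder for the stronger model's output

def human_verifies(sub_question: str, sub_answer: str) -> bool:
    return True  # placeholder for a cheap check a human can actually do

def oversee(question: str) -> bool:
    # The supervisor never judges the full answer directly; each small
    # step is verified on its own, which stays tractable even when the
    # model's overall capability exceeds the supervisor's.
    return all(human_verifies(q, answer(q)) for q in decompose(question))

print(oversee("a question too hard to grade end-to-end"))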
Specimen A-5 ROBU

Distribution Robustness

Ensuring stable performance when deployment conditions diverge from the training distribution.

Ref: Hendrycks et al., 2023
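
A common first line of defense is a statistical shift detector on incoming features. The sketch below applies a two-sample Kolmogorov-Smirnov test; the synthetic data and the 0.01 threshold are illustrative assumptions:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time
live_feature = rng.normal(loc=0.4, scale=1.2, size=1_000)   # shifted at deployment

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # illustrative significance threshold
    print(f"distribution shift detected (KS={stat:.3f}, p={p_value:.2e})")
    # e.g. trigger re-evaluation or fall back to a conservative policy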
Specimen A-6 CONT

Data Contamination

Identifying and mitigating subtle biases that enter through training corpora and resist standard correction.

Ref: arXiv 2602.12413
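
A standard screening step is an n-gram overlap check between evaluation items and the training corpus, sketched below. The choice of 8-grams and the 0.5 overlap threshold are illustrative assumptions, not parameters from the cited work:

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(eval_item: str, train_docs: list[str],
                 n: int = 8, threshold: float = 0.5) -> bool:
    item_grams = ngrams(eval_item, n)
    if not item_grams:  # item shorter than n tokens: nothing to match
        return False
    # Pool all n-grams seen anywhere in the training corpus.
    train_grams = set().union(*(ngrams(doc, n) for doc in train_docs))
    overlap = len(item_grams & train_grams) / len(item_grams)
    return overlap >= threshold  # flag for removal or separate reporting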