
Transformers as Computational Circuits

The Transformer Circuits Framework reframes language models as deterministic computational systems - structured state-evolving circuits composed of identifiable, composable sub-computations. That reframing shifts the central question from what a network correlates with to what it computes.

Posted  08-Jan 2026
Framework At a Glance
Core abstraction: Residual stream
Attention decomposition: QK + OV circuits
Update rule: Additive only
Information flow: Linear superposition
Computation model: Algorithm execution

The Core Claim

Most interpretability work treats neural networks as opaque functions to be probed from the outside - feed inputs, observe outputs, draw correlations. This is the wrong frame. A transformer is a deterministic computational system. Its weights encode an algorithm. Its forward pass executes that algorithm. The question is not "what does this model correlate with?" but "what program is stored in these weights?"

The Transformer Circuits Framework, introduced by Elhage et al. at Anthropic, makes this precise. It provides a decomposition of transformers into interpretable sub-circuits - components with definable computational roles - and demonstrates that complex model behaviours arise from the composition of simple linear operations. The residual stream is a shared communication channel. Attention heads are information-routing and value-transformation units. Layers do not replace one another's output; each adds to a shared state.

The fundamental reframing: we are not analyzing a statistical model that "learned patterns." We are reverse engineering a program whose source code has been compressed into floating-point matrices.

The Residual Stream as Shared State

Additive updates, not overwriting

Every transformer layer receives a vector - the residual stream - and writes an additive update back to it. No component owns the stream. No component erases prior information. The residual stream is a shared register: every attention head and MLP layer reads from it and adds to it.

$x_{\ell+1} = x_\ell + \mathrm{Attn}_\ell(x_\ell) + \mathrm{MLP}_\ell\big(x_\ell + \mathrm{Attn}_\ell(x_\ell)\big)$

This structure is not accidental. It enforces linear superposition: information from different sub-computations coexists in the stream as a sum of vectors. The residual stream at any position is a superposition of all prior contributions - token embedding, positional information, every prior layer's update - and every subsequent layer reads from this full accumulated state.

This is the central architectural insight. The residual stream does not compress or gate or route. It simply accumulates. As a result, the transformer is structurally closer to a state-space system with additive update equations than to a classical feedforward network where each layer's output is the next layer's sole input.

Tokens:     [t₁]          [t₂]          [t₃]
             │             │             │
Embedding:  [e₁]          [e₂]          [e₃]
             │             │             │
Attn Δ₁:    [+a₁]         [+a₂]         [+a₃]        ← additive update, not replacement
             │             │             │
MLP Δ₁:     [+m₁]         [+m₂]         [+m₃]        ← again additive
             │             │             │
Stream₁:    [e₁+a₁+m₁]    [e₂+a₂+m₂]    [e₃+a₃+m₃]   ← superposition of all contributions
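The accumulation pictured above can be checked numerically. The following is a minimal sketch, not the framework's own code: it tracks a single position, and the `attn` and `mlp` functions are hypothetical stand-ins (fixed random maps with a tanh) rather than real attention or MLP sublayers. The only property it illustrates is that the final stream equals the sum of everything written to it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 3

# Hypothetical stand-ins for trained sublayers (assumption: random maps + tanh).
attn_w = [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(n_layers)]
mlp_w = [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(n_layers)]

def attn(layer, x):
    return np.tanh(attn_w[layer] @ x)

def mlp(layer, x):
    return np.tanh(mlp_w[layer] @ x)

x0 = rng.normal(size=d_model)   # embedding at one position
x = x0.copy()
contributions = [x0]            # every vector ever written to the stream

for layer in range(n_layers):
    a = attn(layer, x)          # attention reads the current stream
    m = mlp(layer, x + a)       # MLP reads the stream after the attention write
    x = x + a + m               # additive update: nothing is overwritten
    contributions += [a, m]

# The final stream is exactly the sum of all writes.
stream_is_sum = np.allclose(x, np.sum(contributions, axis=0))
print(stream_is_sum)
```

Each entry of `contributions` is one term of the superposition, which is what makes the per-component attribution discussed next possible.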

Why this structure matters

The additive structure means that the contributions of individual components are, in principle, isolable. If attention head h at layer ℓ writes vector v to the stream, and some later component reads the stream and produces output f(... + v + ...), we can ask: what does the model compute counterfactually if we ablate v? The linear structure makes such interventions clean. This is the foundation of mechanistic interpretability.

Compare this to architectures without skip connections. In a vanilla deep network, layer ℓ+1 receives only the output of layer ℓ - the entire representation has been non-linearly transformed. Attributing an output to any one earlier component then means untangling every intervening nonlinearity, which is far harder. With residual connections, the information from each sub-circuit is preserved as a separable additive term in the stream, making composition and decomposition tractable.
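A small sketch of the ablation argument, under a simplifying assumption: the downstream reader is a purely linear map (`W_read` is a hypothetical name, as are `other` and `v`). In that case removing a head's write `v` from the stream changes the reader's output by exactly `W_read @ v`, independent of everything else in the stream. For a nonlinear reader the counterfactual difference is still well defined, but it depends on the rest of the stream.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_out = 6, 4

other = rng.normal(size=d_model)            # sum of all other contributions
v = rng.normal(size=d_model)                # the write from the head under study
W_read = rng.normal(size=(d_out, d_model))  # hypothetical linear reader

full = W_read @ (other + v)     # reader output with the head present
ablated = W_read @ other        # same stream with v removed
effect = full - ablated         # counterfactual effect of the head

# For a linear reader the effect is the head's write seen through the reader.
print(np.allclose(effect, W_read @ v))
```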

Attention Heads: Routing and Value Transformation

The QK circuit: where to look

An attention head computes a distribution over positions: given query vector $q_i$ at position i and key vector $k_j$ at position j, the attention weight $\alpha_{ij}$ is determined by their inner product. The QK circuit is fundamentally a routing mechanism. It answers the question: which positions in the sequence should contribute to the update at position i?

$\alpha_{ij} = \mathrm{softmax}_j\!\left( \frac{x_i^\top W_Q^\top W_K \, x_j}{\sqrt{d}} \right)$    [QK circuit: determines attention pattern]

The product $W_Q^\top W_K$ - a single matrix - fully characterises the attention pattern. You do not need to reason about $W_Q$ and $W_K$ independently. Their product maps pairs of residual stream vectors to a scalar score, and the QK circuit is precisely this bilinear form. Crucially, this tells you nothing about what computation occurs at attended positions - only which positions get attended to.
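To make the "single matrix" point concrete, here is a minimal NumPy sketch comparing the usual query/key computation with the merged bilinear form. The shapes, the random weights, and the choice of the head dimension as the scaling `d` are assumptions for illustration, and causal masking is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pos, d_model, d_head = 5, 16, 4

X = rng.normal(size=(n_pos, d_model))      # one residual stream vector per row
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))

# Standard route: project to queries and keys, then take inner products.
Q = X @ W_Q.T                              # row i is W_Q x_i
K = X @ W_K.T                              # row j is W_K x_j
scores_qk = Q @ K.T / np.sqrt(d_head)

# Circuit route: one (d_model x d_model) matrix acting on pairs of stream vectors.
W_QK = W_Q.T @ W_K
scores_bilinear = X @ W_QK @ X.T / np.sqrt(d_head)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

pattern = softmax(scores_bilinear)         # row i is the distribution alpha_i.
print(np.allclose(scores_qk, scores_bilinear))
```

Note that `W_QK` has rank at most `d_head`, so the merged matrix is large but low-rank.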

The OV circuit: what to do

Once attention weights are fixed, the head computes a weighted sum of value vectors, then projects back to the residual stream. The OV circuit - $W_O W_V$, again a single matrix - characterises what the head actually computes. It maps a source position's residual stream vector to the update written at the destination position.

$\Delta x_i = \sum_j \alpha_{ij} \, W_O W_V \, x_j$    [OV circuit: determines what gets written]

This decomposition is not merely notational. QK and OV circuits can be analyzed independently. An attention head might have a QK circuit that detects syntactic subject-verb agreement (routing from verb to subject position) and an OV circuit that copies specific token embeddings. These are two orthogonal properties of the same head, discoverable by examining the respective matrix products.

Circuit | Matrix product | Computational role | Question answered
QK | $W_Q^\top W_K$ | Attention routing | Where to look?
OV | $W_O W_V$ | Value transformation | What to write?
Full head | QK ∘ OV | Information transport | Move what, from where?
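The OV half of the decomposition can be sketched the same way. This example assumes the column-vector convention used in the equations above, random weights, and an arbitrary row-stochastic attention pattern `A` supplied from outside; it only checks that the value-then-output computation equals a single application of the merged matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n_pos, d_model, d_head = 5, 16, 4

X = rng.normal(size=(n_pos, d_model))      # one residual stream vector per row
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

# Any row-stochastic matrix will do here: the pattern is set by the QK circuit.
A = rng.random(size=(n_pos, n_pos))
A = A / A.sum(axis=1, keepdims=True)

# Standard route: values, attention-weighted sum, output projection.
V = X @ W_V.T                              # row j is W_V x_j
delta_std = (A @ V) @ W_O.T

# Circuit route: one merged matrix from source stream to destination update.
W_OV = W_O @ W_V
delta_circuit = A @ X @ W_OV.T             # row i is sum_j A[i, j] * W_OV x_j

print(np.allclose(delta_std, delta_circuit))
```

Because `A` enters only as a set of mixing weights, the routing question and the "what gets written" question really are separable in the weights.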

Attention as information transport

An attention head, viewed through this lens, is an information transport mechanism. It selects a source position (via QK routing), reads a function of that position's stream vector (via the OV linear map), and writes the result as an additive update to the destination position's stream. Once the attention weights are fixed, the entire operation is linear in the stream vectors; the only nonlinearity is the softmax that produces those weights.

This means that for a fixed attention pattern, the head's contribution is a linear function of the residual stream. And because the stream is a linear superposition of prior contributions, the head's output is a linear function of all prior sub-circuit outputs. Complex multi-head attention is, at its core, a sum of linear transport operations.
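The linearity claim is easy to test on a toy multi-head layer. In this sketch the heads are random (an assumption, with small weights so the softmax is not saturated), and the helper names `make_head`, `attn_frozen`, and `attn_full` are hypothetical. With the attention patterns frozen, the layer obeys superposition; when the patterns are recomputed from the input, it does not.

```python
import numpy as np

rng = np.random.default_rng(4)
n_pos, d_model, d_head, n_heads = 4, 12, 3, 2

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_head():
    # Returns the merged (W_QK, W_OV) pair for one random head.
    W_Q, W_K, W_V = (rng.normal(scale=0.3, size=(d_head, d_model)) for _ in range(3))
    W_O = rng.normal(scale=0.3, size=(d_model, d_head))
    return W_Q.T @ W_K, W_O @ W_V

heads = [make_head() for _ in range(n_heads)]

def pattern(X, W_QK):
    return softmax(X @ W_QK @ X.T / np.sqrt(d_head))

def attn_frozen(X, patterns):
    # Sum of per-head linear transports under the given attention patterns.
    return sum(A @ X @ W_OV.T for A, (_, W_OV) in zip(patterns, heads))

def attn_full(X):
    # Patterns recomputed from X: this is where the nonlinearity enters.
    return attn_frozen(X, [pattern(X, W_QK) for W_QK, _ in heads])

X1, X2 = rng.normal(size=(2, n_pos, d_model))
frozen = [pattern(X1, W_QK) for W_QK, _ in heads]

lhs = attn_frozen(2.0 * X1 + 3.0 * X2, frozen)
rhs = 2.0 * attn_frozen(X1, frozen) + 3.0 * attn_frozen(X2, frozen)
linear_when_frozen = np.allclose(lhs, rhs)

linear_when_recomputed = np.allclose(attn_full(2.0 * X1), 2.0 * attn_full(X1))
print(linear_when_frozen, linear_when_recomputed)
```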

Computation as Circuit Composition

Layers compose, they do not replace

In classical feedforward thinking, layer ℓ+1 processes the output of layer ℓ. In the transformer circuits view, layers compose over the shared residual stream. Layer ℓ+1 reads the accumulated state and adds a further update. The model's computation is not a pipeline - it is an accumulation.

This matters because it enables multi-layer circuits: sub-computations that span multiple heads across multiple layers. A head at layer 2 can read a feature that was written to the stream by a head at layer 1. This creates a computation graph where nodes are attention heads and MLP sublayers, and edges represent which components read features written by prior components. Elhage et al. call these circuits - identifiable subgraphs of the full computation that implement recognisable algorithms.

Two-layer induction head circuit (conceptual pseudocode)

# Layer 1: Previous-Token Head (PTH)
#   QK circuit: attends from position i to position i−1
#   OV circuit: copies the token identity of the source into the destination stream
#
# Layer 2: Induction Head (IH)
#   QK circuit: the query at position i asks "where did my current token occur before?"
#     → compares the current token t_i with the PTH-written
#       previous-token information at every earlier position k
#     → attends to positions k where t_{k−1} = t_i
#   OV circuit: copies the token t_k found at the attended position
#
# Result: IH(position i) predicts the token that followed the most recent
#         earlier occurrence of the current token
#   → implements in-context sequence completion

The induction head is the canonical example from the paper. It is a two-layer circuit: one head writes a "previous token" signal to the stream, and a second head reads that signal to implement a pattern-matching operation. Neither head alone implements the full algorithm. The computation is distributed across the circuit topology.
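The two-step algorithm can be written out at the level of tokens rather than weights. This is a symbolic sketch of the behaviour only: the function names are hypothetical, hard matching replaces soft attention, and nothing here corresponds to actual model weights. It assumes the standard reading of the induction pattern, in which position i attends to an earlier position k whose previous token equals the current token.

```python
def previous_token_head(tokens):
    """Layer 1: at each position, record the identity of the token before it."""
    return [None] + list(tokens[:-1])

def induction_head(tokens):
    """Layer 2: for each position i, find the most recent position k <= i whose
    recorded previous token equals tokens[i], and predict tokens[k] as next."""
    prev = previous_token_head(tokens)
    predictions = []
    for i, current in enumerate(tokens):
        guess = None
        for k in range(i, 0, -1):          # scan backwards, most recent first
            if prev[k] == current:         # layer-1 signal matches current token
                guess = tokens[k]          # copy the token that followed it
                break
        predictions.append(guess)
    return predictions

tokens = ["A", "B", "C", "A", "B"]
predictions = induction_head(tokens)
print(predictions)  # [None, None, None, 'B', 'C']
```

Neither function alone completes the sequence: the second relies entirely on the signal the first writes, which mirrors the point that the computation lives in the composition.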

The compositionality principle: simple linear components, composed through the shared residual stream, implement algorithms that are qualitatively more complex than any individual component. This is the mechanism by which transformers exhibit emergent behaviour - not magic, but circuit composition.

What Mechanistic Interpretation Means

Reverse engineering, not correlation mining

Mechanistic interpretability is the program of identifying the specific circuits - sub-networks with definable computational roles - that implement observed model behaviours. It is fundamentally different from the probing and correlation paradigm that dominates interpretability literature.

Probing asks: does some representation contain information about feature X? It does not ask whether that information is used or how. A probe can succeed even if the information is a byproduct of some other computation, never read by any downstream component. Mechanistic interpretability requires identifying the causal path: which component writes feature X, which components read it, and what computation do those reading components perform as a result?

Approach | Question | Evidence | Limitation
Probing | Is X represented? | Linear classifier accuracy | Correlation, not causation
Attention viz. | Where does head look? | Attention weight patterns | QK without OV is incomplete
Circuits | What algorithm runs? | Weight matrix analysis + ablations | Causally grounded

The circuit identification methodology

Identifying a circuit requires three things: (1) isolating the behaviour - specifying precisely what input-output relationship you want to explain; (2) locating the components - determining which heads and MLP neurons causally affect that behaviour, typically via activation patching; (3) validating the mechanism - showing that the identified components implement the claimed algorithm by analysing their weight matrices directly.
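Step (2) can be illustrated with a deliberately simple sketch of activation patching. The model here is a hypothetical toy, not a transformer: three named components write in parallel to a stream and a single vector reads it out. Because the components are parallel and the readout is linear, the per-component patching effects sum exactly to the clean-minus-corrupted gap; in a real multi-layer model, downstream components react to the patch and no such exact sum should be expected.

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_model = 4, 6

# Hypothetical toy model: three components write to a stream, a vector reads it.
W = {name: rng.normal(size=(d_model, d_in)) for name in ("head_a", "head_b", "mlp")}
readout = rng.normal(size=d_model)

def run(x, patch=None):
    """Return (output, activations); `patch` maps a component name to a forced activation."""
    patch = patch or {}
    acts = {name: patch.get(name, np.tanh(W_c @ x)) for name, W_c in W.items()}
    return readout @ sum(acts.values()), acts

clean_x, corrupt_x = rng.normal(size=(2, d_in))
clean_out, clean_acts = run(clean_x)
corrupt_out, _ = run(corrupt_x)

# Patch one component at a time from the clean run into the corrupted run
# and record how much of the clean behaviour each patch restores.
recovered = {
    name: run(corrupt_x, patch={name: clean_acts[name]})[0] - corrupt_out
    for name in W
}
print(recovered)
```

Components with a large entry in `recovered` are the candidates to carry forward into step (3).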

Step (3) is where the framework's mathematical structure pays off. Because the QK and OV matrices have direct interpretations as bilinear routing forms and linear value maps, we can read off what a head computes by examining these matrices - their eigenvectors, their relationship to the embedding matrix, whether they implement operations like copy, shift, or negate. This is reverse engineering, not statistics.
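As one illustration of reading an operation off a matrix, the sketch below computes an eigenvalue-based summary: the sum of the eigenvalues divided by the sum of their magnitudes. This is a hedged example of the kind of statistic such analysis can use, and the function name `copying_score` and the three test matrices are assumptions for illustration. A matrix that maps directions to positive multiples of themselves scores near +1, one that negates scores near -1, and a cyclic shift scores near 0.

```python
import numpy as np

def copying_score(M):
    """Sum of eigenvalues over sum of their magnitudes, in [-1, 1]."""
    eig = np.linalg.eigvals(M)
    return float(eig.sum().real / np.abs(eig).sum())

d = 6
copy_like = 0.8 * np.eye(d)               # scaled identity: pure copy
negate_like = -0.8 * np.eye(d)            # scaled negative identity: negate
shift = np.roll(np.eye(d), 1, axis=0)     # cyclic shift: eigenvalues are roots of unity

scores = {
    "copy": copying_score(copy_like),
    "negate": copying_score(negate_like),
    "shift": copying_score(shift),
}
print(scores)
```

In practice the matrix of interest would be a head's OV map expressed in token space, which requires the embedding and unembedding matrices of the model under study.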

The Deep Structure: State Evolution and Algorithm Execution

Core Insight

A transformer's forward pass is a discrete-time state evolution system. The residual stream at each layer is the state. Each layer applies a structured update to that state. The final state is decoded into a prediction. This is not a metaphor - it is precisely what the architecture computes.

The key structural property is that updates are additive and the state is a linear superposition of contributions from all prior operations. This means the system evolves by accumulating structured information, rather than by repeatedly transforming a compressed representation. It is closer to a deterministic discrete-time dynamical system over a high-dimensional vector space than to a classical deep network.

Analogy: recurrence relations and Pell-type systems

Consider a linear recurrence of the form $x_{n+1} = A x_n + b$. The state at step n is a superposition of the initial state and all accumulated forcing terms. The evolution is deterministic, structured, and - crucially - the contribution of any single forcing term is traceable through the system via the transition matrix A.

The residual stream evolves analogously, with the additional structure that each additive update is itself a function of the current state (through the attention pattern computation). The update at layer ℓ is not a fixed forcing term - it is computed from $x_\ell$. But the additive accumulation structure is preserved, and this is what makes individual contributions isolable.
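The traceability of forcing terms in the linear recurrence can be verified directly. This sketch uses an arbitrary random `A`, `b`, and initial state (all assumptions) and compares the iterated state against the unrolled sum, in which the forcing term injected at step k reaches the final state through the matrix power `A^(n-1-k)`.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n_steps = 3, 5
A = rng.normal(scale=0.5, size=(d, d))
b = rng.normal(size=d)
x0 = rng.normal(size=d)

# Iterate the recurrence x_{n+1} = A x_n + b.
x = x0.copy()
for _ in range(n_steps):
    x = A @ x + b

# Unroll it: x_n = A^n x_0 + sum_k A^(n-1-k) b, one term per forcing step.
mp = np.linalg.matrix_power
initial_term = mp(A, n_steps) @ x0
forcing_terms = [mp(A, n_steps - 1 - k) @ b for k in range(n_steps)]
closed_form = initial_term + np.sum(forcing_terms, axis=0)

print(np.allclose(x, closed_form))
```

The residual stream corresponds to the special case where the transition matrix is the identity, so every term passes through unchanged, with the difference that each forcing term is computed from the state rather than fixed in advance.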

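State evolution: $x_{\ell+1} = x_\ell + f_\ell(x_\ell)$
Expanded over all layers: $x_L = x_0 + \sum_{\ell=0}^{L-1} f_\ell(x_\ell)$
Each $f_\ell$ decomposes as: $f_\ell(x_\ell) = \sum_h \mathrm{Attn}_{\ell,h}(x_\ell) + m_\ell$, where $m_\ell = \mathrm{MLP}_\ell\big(x_\ell + \sum_h \mathrm{Attn}_{\ell,h}(x_\ell)\big)$
The final logits are linear in $x_L$: $\mathrm{logits} = W_U x_L = W_U x_0 + \sum_{\ell,h} W_U \, \mathrm{Attn}_{\ell,h}(x_\ell) + \sum_\ell W_U \, m_\ell$
Logits as a sum of per-head and per-MLP contributions - the virtual weights formulation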

The final equality is striking: the model's output logits decompose exactly into a sum of per-head and per-layer contributions, each mapped to logit space by the unembedding matrix $W_U$. Setting aside the final layer normalisation, which the equations above omit, there is no approximation here: this is an exact identity. It means every head has a direct, isolable contribution to the output - a fact that would be completely obscured by viewing the model as a monolithic function.
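The identity can be checked on a toy residual model. This is a sketch under stated assumptions: a single position, no layer normalisation, and random tanh maps standing in for heads and MLPs (`head_w`, `mlp_w`, and the term labels are hypothetical). Each component's write is projected through `W_U` as it is produced, and the projected terms are compared against the logits of the final stream.

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, d_vocab, n_layers, n_heads = 8, 10, 2, 3

W_U = rng.normal(size=(d_vocab, d_model))
# Hypothetical stand-ins for trained components (one position, no layer norm).
head_w = rng.normal(scale=0.3, size=(n_layers, n_heads, d_model, d_model))
mlp_w = rng.normal(scale=0.3, size=(n_layers, d_model, d_model))

x0 = rng.normal(size=d_model)
x = x0.copy()
logit_terms = {"embed": W_U @ x0}

for layer in range(n_layers):
    attn_total = np.zeros(d_model)
    for h in range(n_heads):
        out = np.tanh(head_w[layer, h] @ x)         # head reads the current stream
        logit_terms[f"L{layer}H{h}"] = W_U @ out    # its direct logit contribution
        attn_total += out
    m = np.tanh(mlp_w[layer] @ (x + attn_total))    # MLP reads stream + attention
    logit_terms[f"L{layer}MLP"] = W_U @ m
    x = x + attn_total + m                          # additive update

logits = W_U @ x
decomposition_exact = np.allclose(sum(logit_terms.values()), logits)
print(decomposition_exact)
```

Each entry of `logit_terms` is the direct contribution of one component. A component's indirect influence, through what later components compute from its write, is carried by those later components' own terms.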

From pattern learning to algorithm execution

The standard ML framing posits that training causes a model to "learn patterns" in data - a statistical compression of observed co-occurrence statistics. The circuits framework replaces this with a sharply different picture: training causes gradient descent to store algorithms in weight matrices. The forward pass does not recall patterns; it executes a program.

This is not a semantic distinction. Algorithms are composable, generalizable, and interpretable in ways that statistical patterns are not. When an induction head implements "copy the token that followed the last occurrence of the current token," it is executing a subroutine that works identically on novel sequences. The behaviour is not interpolation over a training distribution - it is an algorithm that generalises by construction.

The emergence of complex capabilities with scale, the sharp capability thresholds, the in-context learning: these phenomena are natural consequences of circuit composition. As models grow larger, they can store and compose more complex sub-algorithms. Phase transitions occur when gradient descent discovers a new circuit that enables qualitatively different computation - not when a statistical threshold is crossed.

A Minimal Example: The Copy Mechanism

What copying looks like in weight space

Consider the simplest non-trivial algorithm a transformer could implement: copy a token from one position to another. What does this look like in the framework?

The QK circuit must route attention from the destination position (where we want the copy to appear) to the source position (where the token we want to copy lives). For this, the query at the destination must have high inner product with the key at the source. If we expand this out: the bilinear form $W_Q^\top W_K$ must map (destination stream, source stream) pairs to a high score whenever source and destination have the right positional or content relationship.

The OV circuit must implement the actual copy. The matrix $W_O W_V$ maps the source position's residual stream to the update written at the destination. For a pure copy, this matrix should behave as a scaled identity on the token embedding subspace - it should map the source token's embedding to itself, so that the destination's stream accumulates a copy of the source token's information.

Copy head: weight-space signature
QK circuit signature: High score for (target position, source position) pairs satisfying the copy condition
OV circuit signature: $W_O W_V \approx \lambda I$ on the embedding subspace (approximately a scaled identity)
Result written to stream: $\Delta x_{\mathrm{dest}} \approx \alpha \cdot x_{\mathrm{source}}$
Logit contribution: $W_U \cdot \Delta x_{\mathrm{dest}} \approx \alpha \cdot W_U x_{\mathrm{source}} = \alpha \cdot \mathrm{logits}(\text{source token})$
Interpretation: Head predicts "repeat source token" with weight α

The key insight from this example is that the copying behaviour is fully visible in the matrices. You do not need to run the model on inputs to know that a head implements copying - you can read it from $W_O W_V$ directly. And the contribution to the output logits is exactly $W_U W_O W_V$ applied to the source stream - a chain of three matrices, a single linear map from source stream to output logits, entirely characterisable from the weights.
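A constructed example shows what this chain looks like for an idealised copy head. The setup is an assumption made for clarity, not a property of trained models: token embeddings are orthonormal, the unembedding is tied to the embedding, and `W_OV` is built by hand as a scaled identity on the embedding subspace. Under those assumptions the token-to-logit map is a scaled identity, so each source token boosts only its own logit.

```python
import numpy as np

rng = np.random.default_rng(8)
d_vocab, d_model = 5, 8

# Assumption: orthonormal token embeddings and a tied unembedding (W_U = W_E^T).
W_E = np.linalg.qr(rng.normal(size=(d_model, d_vocab)))[0]   # orthonormal columns
W_U = W_E.T

lam = 0.7
W_OV = lam * W_E @ W_E.T    # lam * identity on the embedding subspace, zero elsewhere

# Chain of three matrices: source token -> stream -> update -> logits.
token_to_logits = W_U @ W_OV @ W_E      # shape (d_vocab, d_vocab)

# Column t is the logit update caused by attending to source token t.
boosted_token = token_to_logits.argmax(axis=0)
print(boosted_token)  # [0 1 2 3 4]
```

With attention weight α on the source position, the actual update is α times the corresponding column. A trained head would only approximate this structure, which is why summary statistics over the matrix are used rather than an exact equality check.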

Why This Matters for Quantitative Systems

Signal transformation and hidden structure

Quantitative practitioners work with systems where signals are transformed through complex pipelines: order flow aggregation, factor model composition, risk decomposition. The circuits framework offers a methodological parallel: just as a transformer's output decomposes exactly into per-head contributions to logits, a portfolio's P&L decomposes exactly into per-factor contributions. Both are linear superpositions; both admit exact attribution.

The deeper parallel is structural. A transformer is a state-evolution system where each step applies a learned, data-dependent linear transformation to a shared state. Many financial systems share this architecture: a latent state (order book imbalance, regime indicator, factor exposure) evolves by accumulating structured signals. The circuits framework demonstrates that such systems, despite their apparent complexity, can be decomposed into interpretable components with well-defined computational roles.

The value of exact decompositions

The virtual weights formulation - where output logits decompose exactly into a sum of per-head contributions - has a direct analogue in factor models: a portfolio's expected return and risk decompose exactly into a sum of factor contributions. In both cases, the exact decomposition is more powerful than an approximate one because it enables causal interventions: ablate a head (or factor), observe the exact change in output.
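The factor-model side of the analogy can be sketched for the return decomposition. All numbers here are synthetic and the names (`B`, `f`, `eps`, `weights`) are assumptions for a generic linear factor model. The portfolio return splits exactly into per-factor contributions plus a specific term, and zeroing one factor return changes the total by exactly that factor's contribution.

```python
import numpy as np

rng = np.random.default_rng(9)
n_assets, n_factors = 6, 3

weights = rng.dirichlet(np.ones(n_assets))        # portfolio weights, sum to 1
B = rng.normal(size=(n_assets, n_factors))        # hypothetical factor loadings
f = rng.normal(scale=0.01, size=n_factors)        # factor returns for one period
eps = rng.normal(scale=0.002, size=n_assets)      # asset-specific returns

asset_returns = B @ f + eps
portfolio_return = weights @ asset_returns

exposure = weights @ B                  # portfolio exposure to each factor
factor_contrib = exposure * f           # one additive term per factor
specific_contrib = weights @ eps

# "Ablate" factor 0 and recompute the portfolio return.
f_ablated = f.copy()
f_ablated[0] = 0.0
ablated_return = weights @ (B @ f_ablated + eps)

print(np.isclose(portfolio_return, factor_contrib.sum() + specific_contrib))
print(np.isclose(portfolio_return - ablated_return, factor_contrib[0]))
```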

More broadly, the circuits framework demonstrates that the apparent opacity of deep systems is not fundamental - it is an artefact of analysis method. Systems that appear black-box under correlation analysis often have transparent, decomposable structure under the right analytical lens. The residual stream formulation provides that lens for transformers. Finding analogous decompositions for other complex systems - including financial ones - is an open and commercially significant problem.

Emergence from composition

The induction head example carries a quantitative lesson: behaviours that appear to require global information can sometimes be implemented by composing two local operations. The induction head combines a "previous-token writer" with a "pattern matcher" - neither head alone does anything remarkable, but their composition implements in-context sequence completion. In signal processing, the analogous insight is that convolutions compose: two simple filters in sequence implement a more complex filter. In factor models, factor interactions compose: a momentum-quality interaction is the product of two simpler signals.
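The filter-composition remark can be checked in a few lines. The signal and the two filters are arbitrary illustrative choices: applying a moving average and then a first difference gives the same result as applying their convolution once.

```python
import numpy as np

signal = np.array([1.0, 4.0, 2.0, 8.0, 5.0, 7.0])
smooth = np.array([0.5, 0.5])      # two-tap moving average
diff = np.array([1.0, -1.0])       # first difference

# Two simple filters applied in sequence...
two_stage = np.convolve(np.convolve(signal, smooth), diff)

# ...equal one combined filter applied once.
combined_filter = np.convolve(smooth, diff)
one_stage = np.convolve(signal, combined_filter)

print(combined_filter)  # [ 0.5  0.  -0.5]
print(np.allclose(two_stage, one_stage))
```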

Understanding which apparent complexities in a system reduce to circuit composition - and which do not - is a powerful analytical primitive. The circuits framework makes this precise for transformers. The methodology transfers.

Closing: Computation, Not Correlation

The Transformer Circuits Framework is not, ultimately, a paper about transformers. It is a paper about what it means to understand a computational system. The central claim - that a transformer is a circuit, not a black box - is a methodological commitment: to treat model behaviours as having explanations at the level of specific weight matrices and their mathematical properties, not just correlations in activation space.

This distinction matters beyond interpretability research. If a model's behaviour is implemented by an identifiable circuit, it is predictable, auditable, and modifiable in principled ways. If the behaviour is merely "what the model does," it is none of those things. The former is engineering; the latter is superstition dressed in statistical language.

What makes the framework technically deep is not its specific results - the induction head is elegant but not surprising once you see it. What is deep is the decomposition structure it reveals: that a 100-billion-parameter model's output logits are an exact linear superposition of contributions from thousands of attention heads, each characterisable by a pair of matrix products, each implementing a specific routing and value-transformation function. Complexity emerges from composition, not from inscrutability.

The final insight: gradient descent, applied to the right architecture with sufficient compute, converges on algorithms. The residual stream is a shared register. Attention heads are instruction executors. Layers are program steps. What appeared to be a statistical model is, on inspection, a computed program - one we are only beginning to learn how to read.