---
title: (DL Reading Group) Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
tags: 
author: [Deep Learning JP](https://docswell.com/user/DeepLearning2023)
site: [Docswell](https://www.docswell.com/)
thumbnail: https://bcdn.docswell.com/page/87DKRM3NJG.jpg?width=480
description: (DL Reading Group) Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small by Deep Learning JP
published: June 04, 26
canonical: https://docswell.com/s/DeepLearning2023/K3J7YP-2026-06-16-115609
---
# Page. 1

![Page Image](https://bcdn.docswell.com/page/87DKRM3NJG.jpg)

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
LIANG RUIQI, Matsuo-Iwasawa Lab M1


# Page. 2

![Page Image](https://bcdn.docswell.com/page/VJPKWR4NE8.jpg)

Bibliography Information
Title: Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
Authors: Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
Affiliations: Redwood Research; UC Berkeley
Publication: ICLR 2023 (arXiv, Nov 2022)
arXiv: https://arxiv.org/abs/2211.00593
Summary
• Fully reverse-engineers how GPT-2 small solves a real language task (IOI): the largest end-to-end circuit found 'in the wild' at the time.
• Identifies 26 attention heads in 7 functional classes using causal interventions (path patching, knockouts).
• Proposes 3 quantitative criteria (faithfulness, completeness, minimality) to validate circuit explanations.


# Page. 3

![Page Image](https://bcdn.docswell.com/page/2EVV8RXYEQ.jpg)

Outline
• 1. Background: what is mechanistic interpretability & why circuits
• 2. The task: Indirect Object Identification (IOI)
• 3. Tools: residual stream, attention heads, knockouts, path patching
• 4. The discovered circuit: 26 heads, 7 classes
• 5. Validation: faithfulness / completeness / minimality
• 6. Surprises, limitations & takeaways


# Page. 4

![Page Image](https://bcdn.docswell.com/page/57GL5MVWEL.jpg)

Mechanistic Interpretability: Goal & Gap
• Goal: reverse-engineer model weights into human-understandable algorithms, i.e. explain behavior via internal components.
• Why it matters: predict out-of-distribution behavior, find & fix errors, anticipate emergent capabilities → safer deployment.
• The gap this paper fills:
– Prior work explained simple behaviors in tiny models, OR large models only in 'broad strokes'.
– Here: a complete, end-to-end mechanism for a natural-language task in a real LM (GPT-2 small).
• Approach: find a circuit, a subgraph of the model's computational graph responsible for the task.


# Page. 5

![Page Image](https://bcdn.docswell.com/page/4EQYZR6QJP.jpg)

The IOI Task
• Indirect Object Identification: complete a sentence with the correct name.
• Human algorithm: of the two names, output the one that is NOT the subject of the last clause.
• Built from 15 templates with random single-token names / places / items.
Example
"When Mary and John went to the store, John gave a drink to ___"
→ correct completion: "Mary"
IO = Mary · S1/S2 = John (subject, repeated)
• Two metrics:
– Logit difference = logit(IO) − logit(S); mean 3.56 (IO > S in 99.3% of examples).
– IO probability; mean 49% (over 100,000 examples).
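As a rough illustration of the two metrics, the sketch below computes them from a batch of next-token logits with NumPy. The function name `ioi_metrics`, the array layout, and the toy numbers are assumptions made for this example, not code from the paper.

```python
import numpy as np

def ioi_metrics(logits, io_ids, s_ids):
    """Compute IOI metrics from next-token logits at the END position.

    logits: (batch, vocab) array; io_ids / s_ids: (batch,) token ids of the
    indirect object and the subject. All names here are hypothetical.
    """
    rows = np.arange(logits.shape[0])
    # Logit difference = logit(IO) - logit(S), one value per prompt.
    logit_diff = logits[rows, io_ids] - logits[rows, s_ids]
    # Numerically stable softmax to get the probability assigned to IO.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return {
        "mean_logit_diff": float(logit_diff.mean()),
        "frac_io_gt_s": float((logit_diff > 0).mean()),
        "mean_io_prob": float(probs[rows, io_ids].mean()),
    }

# Toy batch of two prompts over a 3-token vocabulary (made-up numbers).
toy_logits = np.array([[2.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
toy_metrics = ioi_metrics(toy_logits, io_ids=np.array([0, 0]), s_ids=np.array([1, 1]))
print(toy_metrics)
```

On real data the same function would be fed the model's logits at the final position of each IOI prompt.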


# Page. 6

![Page Image](https://bcdn.docswell.com/page/KJ4W384Y71.jpg)

Building Blocks: Residual Stream & Attention Heads
• GPT-2 small: decoder-only, 12 layers × 12 heads = 144 attention heads.
• Residual stream: shared 'workspace' every layer reads from and writes to.
• Each head = two low-rank maps:
– QK circuit → where to attend (the attention pattern).
– OV circuit → what information to write into the residual stream.
• A circuit C is a subgraph of the full computational graph M responsible for the task.
• Scope: this work studies attention heads only (MLPs / LayerNorm left for future work).
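The QK / OV split can be made concrete with a minimal single-head sketch. It uses random weights with GPT-2 small's dimensions (d_model = 768, d_head = 64) and omits LayerNorm and biases for brevity; the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 768, 64, 5  # GPT-2 small sizes; seq_len is arbitrary

# Random stand-ins for one head's weights (not real GPT-2 parameters).
W_Q, W_K, W_V = (rng.normal(scale=0.02, size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(scale=0.02, size=(d_head, d_model))
resid = rng.normal(size=(seq_len, d_model))  # residual stream, one row per token

# QK circuit: decides WHERE each position attends (causal attention pattern).
scores = (resid @ W_Q) @ (resid @ W_K).T / np.sqrt(d_head)
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)

# OV circuit: decides WHAT is written; W_OV has rank at most d_head.
W_OV = W_V @ W_O
head_out = pattern @ (resid @ W_OV)

# The head adds its output back into the shared residual stream.
resid_after = resid + head_out
```

Because `W_OV` is a product of a 768×64 and a 64×768 matrix, its rank is at most 64, which is what "low-rank map" refers to.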


# Page. 7

![Page Image](https://bcdn.docswell.com/page/LE1Y124N7G.jpg)

Knockouts: Turning Heads Off
• To test if a node matters, 'knock it out' and measure the performance drop.
• Zero ablation (set output = 0): noisy, because 0 is an arbitrary value and downstream components may rely on a head's average output as an implicit bias.
• Mean ablation (this paper): replace output with its average activation.
– Averaged over p_ABC: same templates but THREE unrelated names (A, B, C).
– Removes task-relevant info (which name) while preserving grammar / structure.
– Mean computed per-template, so grammatical role stays constant.
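A minimal sketch of per-template mean ablation, assuming a head's outputs have already been cached as arrays. The function `mean_ablate` and its arguments are hypothetical names, and the sketch assumes every template in the batch also appears in the reference set.

```python
import numpy as np

def mean_ablate(head_out, template_ids, ref_out, ref_template_ids):
    """Replace a head's output with its per-template mean over a reference set.

    head_out:         (batch, seq, d) outputs of one head on p_IOI prompts
    template_ids:     (batch,) template index of each p_IOI prompt
    ref_out:          (n_ref, seq, d) outputs of the same head on p_ABC prompts
    ref_template_ids: (n_ref,) template index of each p_ABC prompt
    """
    ablated = head_out.copy()
    for t in np.unique(template_ids):
        # Average over reference prompts that share the template, so each
        # position keeps the same grammatical role.
        ref_mean = ref_out[ref_template_ids == t].mean(axis=0)
        ablated[template_ids == t] = ref_mean
    return ablated

# Toy data: 2 templates, 3 reference prompts each (random stand-in activations).
rng = np.random.default_rng(0)
ref_out = rng.normal(size=(6, 4, 8))
ref_template_ids = np.array([0, 0, 0, 1, 1, 1])
head_out = rng.normal(size=(2, 4, 8))
template_ids = np.array([0, 1])

ablated = mean_ablate(head_out, template_ids, ref_out, ref_template_ids)
```

The ablated output no longer depends on which names appear in the prompt, only on its template.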


# Page. 8

![Page Image](https://bcdn.docswell.com/page/GEWG8RZMJ2.jpg)

Path Patching — the Core Technique
• Goal: separate a head's DIRECT effect on a target from indirect effects through other heads.
• Take two inputs: x_orig (from p_IOI) and x_new (same template with three unrelated names, from p_ABC).
• Run forward on x_orig, but along chosen paths h → R inject h's activation from x_new.
• Measure the change in logit difference:
– A large drop ⇒ that path is critical to solving IOI.
• Iterate backwards from the logits to trace the whole circuit, head by head.
Fig.1: circuit (orange) + causal validation (path patching, knockout)
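The difference between patching one path and patching a whole activation can be seen in a toy graph. The sketch below is not a transformer: `h1`, `h2` and `readout` are made-up scalar functions standing in for two heads and the logit readout, chosen only to show which activations are swapped.

```python
# Toy computational graph: h1 feeds both the readout (direct path)
# and h2 (indirect path). All functions are hypothetical stand-ins.
def h1(x):
    return 2.0 * x

def h2(x, h1_out):
    return x + 0.5 * h1_out

def readout(h1_out, h2_out):
    return h1_out + h2_out

def forward(x):
    a = h1(x)
    return readout(a, h2(x, a))

def patch_direct_path(x_orig, x_new):
    """Patch only the path h1 -> readout; h2 still sees h1(x_orig)."""
    a_orig, a_new = h1(x_orig), h1(x_new)
    b = h2(x_orig, a_orig)      # downstream head computed on the original run
    return readout(a_new, b)    # only the readout receives the patched value

def patch_all_paths(x_orig, x_new):
    """Ordinary activation patching: every consumer of h1 sees h1(x_new)."""
    a_new = h1(x_new)
    return readout(a_new, h2(x_orig, a_new))

x_orig, x_new = 1.0, 0.0
baseline = forward(x_orig)
direct_effect = patch_direct_path(x_orig, x_new) - baseline
total_effect = patch_all_paths(x_orig, x_new) - baseline
print(direct_effect, total_effect)
```

The gap between `total_effect` and `direct_effect` is the part of h1's influence that travels through h2, which is what path patching is designed to separate out.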


# Page. 9

![Page Image](https://bcdn.docswell.com/page/47ZL8R1MJ3.jpg)

The Human-Interpretable Algorithm
• The circuit implements a simple 3-step algorithm:
– 1. Identify all previous names in the sentence (Mary, John, John).
– 2. Remove the name that is duplicated (John).
– 3. Output the remaining name (Mary).
• Each step maps onto a class of attention heads; the next slide shows the full circuit.
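The three steps can be written directly as ordinary Python. This is the human-level algorithm only, not how the heads compute it; `ioi_answer` and the `known_names` set are hypothetical names for this sketch, and the tokenization is a plain whitespace split.

```python
from collections import Counter

def ioi_answer(tokens, known_names):
    # Step 1: identify all names that appear in the sentence, in order.
    names = [tok for tok in tokens if tok in known_names]
    # Step 2: remove the name that is duplicated (the subject).
    counts = Counter(names)
    remaining = [name for name in dict.fromkeys(names) if counts[name] == 1]
    # Step 3: output the remaining name (the indirect object).
    if len(remaining) != 1:
        raise ValueError("expected exactly one non-repeated name")
    return remaining[0]

prompt = "When Mary and John went to the store , John gave a drink to"
answer = ioi_answer(prompt.split(), known_names={"Mary", "John"})
print(answer)  # Mary
```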


# Page. 10

![Page Image](https://bcdn.docswell.com/page/YJ6WPRL5JV.jpg)

The Discovered Circuit: 26 Heads, 7 Classes
Fig.2 — 26 heads (1.1% of all head–position pairs) implement IOI in GPT-2 small
• Information flows left→right: detect duplicate name → inhibit it → copy the other name to the output.


# Page. 11

![Page Image](https://bcdn.docswell.com/page/GJ5MK81GJ4.jpg)

Name Mover Heads — write the answer
Heads 9.9, 9.6, 10.0 · active at END
• Found by path patching directly back from the logits.
• (i) Attend to a name token (avg attention on IO = 0.59), and
• (ii) copy whatever they attend to into the output.
• Copy score > 95% (vs < 20% for an average head).
• Thanks to S-Inhibition, they attend to IO over S → output the correct name.
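A simplified sketch of a copy score: push each name token's vector through a head's OV matrix and the unembedding, then check whether the same token lands in the top-k logits. This toy version uses unit-norm random embeddings with tied unembedding and skips the first-layer MLP and LayerNorm, so it illustrates the idea rather than reproducing the paper's measurement; all names and sizes are assumptions.

```python
import numpy as np

def copy_score(W_E, W_OV, W_U, name_ids, k=5, sign=1.0):
    """Fraction of name tokens that appear in their own top-k output logits.

    sign=-1.0 scores the negated logits instead (a 'negative' copy score).
    """
    hits = 0
    for tok in name_ids:
        logits = sign * (W_E[tok] @ W_OV @ W_U)
        top_k = np.argsort(logits)[-k:]
        hits += int(tok in top_k)
    return hits / len(name_ids)

rng = np.random.default_rng(0)
vocab, d_model = 50, 16                      # toy sizes, not GPT-2's
W_E = rng.normal(size=(vocab, d_model))
W_E /= np.linalg.norm(W_E, axis=1, keepdims=True)  # unit-norm token vectors
W_U = W_E.T                                  # tied unembedding (assumption)
name_ids = [3, 7, 11]                        # hypothetical name token ids

copying_head = np.eye(d_model)               # OV map that copies its input
anti_copying_head = -np.eye(d_model)         # OV map that writes the opposite

print(copy_score(W_E, copying_head, W_U, name_ids))
print(copy_score(W_E, anti_copying_head, W_U, name_ids, sign=-1.0))
```

With unit-norm embeddings, an identity OV map always ranks the input token first, and a negated identity always ranks it last, which mirrors the positive and negative copying behaviors in idealized form.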


# Page. 12

![Page Image](https://bcdn.docswell.com/page/LE3WZ81PE5.jpg)

Negative Name Mover Heads
Heads 10.7, 11.10 · write AGAINST the correct answer
• Same behavior as Name Movers, but with opposite sign: they DECREASE the logit of the name they attend to.
• Large negative copy score (98%).
• Interpretation (the authors' speculation): the model 'hedges', softening confidence to avoid huge cross-entropy loss when wrong.
• Lesson: components can actively work against the task, so explanations must account for them.


# Page. 13

![Page Image](https://bcdn.docswell.com/page/8EDKRMX37G.jpg)

S-Inhibition Heads — suppress the subject
Heads 7.3, 7.9, 8.6, 8.10 · active at END
• Found by path patching the QUERY of the Name Mover Heads.
• Active at END, attend to the S2 token.
• Write a signal into the Name Movers' query that suppresses attention to the subject tokens (S1, S2).
• Net effect: Name Movers are biased toward IO instead of S, which makes the copy step correct.
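The net effect on attention can be illustrated with a toy softmax. The scores and the inhibition signal below are made-up numbers, and subtracting a fixed amount from the subject positions is only a stand-in for how the S-Inhibition output changes the Name Movers' query-key scores.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

positions = ["IO", "S1", "S2"]
# Hypothetical pre-softmax attention scores of a Name Mover Head at END:
# without any extra signal, all three name tokens look equally attractive.
scores = np.array([2.0, 2.0, 2.0])
# Hypothetical S-Inhibition effect: lower the scores of the subject tokens.
inhibition = np.array([0.0, 1.5, 1.5])

attn_without = softmax(scores)
attn_with = softmax(scores - inhibition)

for name, before, after in zip(positions, attn_without, attn_with):
    print(f"{name}: {before:.2f} -> {after:.2f}")
```

Attention mass moves from S1 and S2 to IO, so a head that copies whatever it attends to now writes the indirect object.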


# Page. 14

![Page Image](https://bcdn.docswell.com/page/V7PKWRPPJ8.jpg)

Detecting the Duplicate Name
• Duplicate Token Heads (0.1, 3.0): active at S2, attend to S1, signal 'this token already appeared'.
• Induction Heads (5.5, 6.9): reach the same 'duplicate' signal via an induction mechanism: active at S2, they attend to S1+1, the token right after S1.
• Previous Token Heads (2.2, 4.11): copy info from S1 to the next token S1+1, enabling induction.
• Together they feed the S-Inhibition Heads, telling them WHICH name is the repeated subject.
• Note: known 'induction heads' appear here in an unexpected role; a head's main function ≠ the full picture.
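The two routes to the "duplicate" signal differ in which position is attended to. The sketch below mimics this at the token level: the direct route looks for an earlier copy of the current token, while the induction route looks for a position whose previous token matches (the information a Previous Token Head would have written there). Function names and the whitespace tokenization are assumptions for illustration.

```python
def duplicate_token_target(tokens, i):
    """Direct route: earliest earlier position holding the same token as i."""
    for j in range(i):
        if tokens[j] == tokens[i]:
            return j
    return None

def induction_target(tokens, i):
    """Induction route: earliest position j < i whose PREVIOUS token equals
    tokens[i]; position j carries that info via a previous-token head."""
    for j in range(1, i):
        if tokens[j - 1] == tokens[i]:
            return j
    return None

tokens = "When Mary and John went to the store , John gave a drink to".split()
s1 = tokens.index("John")
s2 = len(tokens) - 1 - tokens[::-1].index("John")

print(duplicate_token_target(tokens, s2) == s1)   # attends to S1
print(induction_target(tokens, s2) == s1 + 1)     # attends to S1+1
```

Both routes fire only at S2, which is how the downstream S-Inhibition Heads can learn which name is repeated.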


# Page. 15

![Page Image](https://bcdn.docswell.com/page/2JVV8R2VJQ.jpg)

Backup Name Mover Heads
Heads 9.0, 9.7, 10.1, 10.2, 10.6, 10.10, 11.2, 11.9
• Normally they do NOT move the IO name to the output.
• But if the regular Name Mover Heads are ablated, they 'wake up' and take over the job.
• ⇒ The model has built-in redundancy / self-repair (the 'Hydra effect').
• Implication: ablating a component reveals a DIFFERENT structure than is normally used, complicating the search for complete mechanisms.


# Page. 16

![Page Image](https://bcdn.docswell.com/page/5EGL5MR1JL.jpg)

Is the Circuit Really Correct? — 3 Criteria
• Faithfulness: the circuit alone performs the task about as well as the full model.
• Completeness: the circuit contains ALL nodes used for the task; no important node is missing.
• Minimality: the circuit contains NO irrelevant nodes; every node plays a role.
Built on F(C) = logit difference recovered by circuit C; completeness/minimality use F over subsets K.
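The three criteria can be phrased as small functions of F. In the sketch below, F is a toy additive score over made-up node names, purely so the code runs; in the paper F(C) is the average logit difference when everything outside C is knocked out, which is not additive in general.

```python
def faithfulness_gap(F, M, C):
    """|F(M) - F(C)|: small means the circuit alone matches the full model."""
    return abs(F(M) - F(C))

def incompleteness(F, M, C, K):
    """|F(C \\ K) - F(M \\ K)| for a knocked-out subset K of the circuit;
    completeness asks this to stay small for every K."""
    return abs(F(C - K) - F(M - K))

def minimality_score(F, C, v, K):
    """|F(C \\ (K ∪ {v})) - F(C \\ K)|: minimality asks that each node v has
    some K for which removing v changes F a lot."""
    return abs(F(C - (K | {v})) - F(C - K))

# Toy additive F over hypothetical nodes (illustrative numbers only).
contrib = {"a": 2.0, "b": 1.0, "c": 0.5, "d": 0.25}

def F(nodes):
    return sum(contrib[n] for n in nodes)

M = frozenset(contrib)             # full model
C = frozenset({"a", "b", "c"})     # proposed circuit

print(faithfulness_gap(F, M, C))
print(incompleteness(F, M, C, frozenset({"a"})))
print(minimality_score(F, C, "b", frozenset({"a"})))
```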


# Page. 17

![Page Image](https://bcdn.docswell.com/page/4JQYZRVN7P.jpg)

Validation Results
• Faithfulness: ✓ the circuit alone recovers 87% of the full model's logit difference (gap |F(M)−F(C)| = 0.46 out of 3.56).
• Completeness: partial. Under random / by-class knockouts the circuit looks complete and clearly beats the naïve circuit, BUT a greedy adversarial search still finds subsets with incompleteness up to 3.09 (≈87% of F(M)); the naïve circuit fails this greedy test too.
• Minimality: mostly OK. For most heads, removing the head's class shows it matters; a few need carefully chosen subsets.
• Honest conclusion: the criteria support the circuit but also expose real remaining gaps in our understanding.
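The two percentages quoted above follow directly from the reported numbers; the short check below only redoes that arithmetic with the values from this slide.

```python
F_M = 3.56                   # full model's mean logit difference
gap = 0.46                   # |F(M) - F(C)| for the circuit
worst_incompleteness = 3.09  # largest incompleteness found by greedy search

recovered = (F_M - gap) / F_M
incompleteness_fraction = worst_incompleteness / F_M

print(f"logit difference recovered by the circuit: {recovered:.0%}")
print(f"worst-case incompleteness relative to F(M): {incompleteness_fraction:.0%}")
```

Both figures round to 87%, but they measure opposite things: the first is how much of the behavior the circuit reproduces, the second is how badly the circuit and the model can disagree under an adversarially chosen knockout.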


# Page. 18

![Page Image](https://bcdn.docswell.com/page/K74W38M3E1.jpg)

Three Surprises for Interpretability
• Redundancy: Backup Name Movers take over when Name Movers are ablated → ablation can mislead.
• Repurposed structure: induction heads are used here for duplicate-token detection, not their 'usual' job.
• Anti-helpful components: Negative Name Movers deliberately write against the correct answer.
• Takeaway: real circuits are messier than clean diagrams suggest, so rigorous causal validation is essential.


# Page. 19

![Page Image](https://bcdn.docswell.com/page/LJ1Y128ZEG.jpg)

Limitations & Future Work
• Single small model (GPT-2 small) and a single, narrow task (IOI).
• MLPs, LayerNorm and embeddings are not analyzed: attention heads only.
• Circuit fails the hardest completeness test → not a fully complete explanation.
• Manual, labor-intensive discovery → motivated later automation.
• Open question: do these motifs scale to larger models and more complex tasks?


# Page. 20

![Page Image](https://bcdn.docswell.com/page/GJWG8RZ672.jpg)

Takeaways
• First end-to-end reverse-engineering of a natural-language task in a real LM: 26 heads, 7 classes.
• Path patching = a reusable causal tool for tracing circuits; mean ablation over p_ABC for clean knockouts.
• Faithfulness / completeness / minimality give interpretability a falsifiable standard of evidence.
• A landmark for mechanistic interpretability, and a reminder that full, scalable understanding is still open.


# Page. 21

![Page Image](https://bcdn.docswell.com/page/4EZL8R1R73.jpg)

Thank You for Listening