[DL Reading Group] Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

June 04, 26

Deep Learning JP

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
LIANG RUIQI, Matsuo-Iwasawa Lab, M1

Bibliography Information
• Title: Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
• Authors: Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
• Affiliations: Redwood Research; UC Berkeley
• Publication: ICLR 2023 (arXiv, Nov 2022)
• arXiv: https://arxiv.org/abs/2211.00593
Summary
• Fully reverse-engineers how GPT-2 small solves a real language task (IOI): the largest end-to-end circuit found 'in the wild' at the time.
• Identifies 26 attention heads in 7 functional classes using causal interventions (path patching, knockouts).
• Proposes 3 quantitative criteria (faithfulness, completeness, minimality) to validate circuit explanations.

Outline
• 1. Background: what is mechanistic interpretability & why circuits
• 2. The task: Indirect Object Identification (IOI)
• 3. Tools: residual stream, attention heads, knockouts, path patching
• 4. The discovered circuit: 26 heads, 7 classes
• 5. Validation: faithfulness / completeness / minimality
• 6. Surprises, limitations & takeaways

Mechanistic Interpretability: Goal & Gap
• Goal: reverse-engineer model weights into human-understandable algorithms, explaining behavior via internal components.
• Why it matters: predict out-of-distribution behavior, find & fix errors, anticipate emergent capabilities → safer deployment.
• The gap this paper fills:
– Prior work explained simple behaviors in tiny models, OR large models only in 'broad strokes'.
– Here: a complete, end-to-end mechanism for a natural-language task in a real LM (GPT-2 small).
• Approach: find a circuit, i.e. a subgraph of the model's computational graph responsible for the task.

The IOI Task
• Indirect Object Identification: complete a sentence with the correct name.
• Human algorithm: of the two names, output the one that is NOT the subject of the last clause.
• Built from 15 templates with random single-token names / places / items.

Example
"When Mary and John went to the store, John gave a drink to ___"
→ correct completion: "Mary"
IO = Mary · S1/S2 = John (subject, repeated)

• Two metrics:
– Logit difference = logit(IO) − logit(S); mean 3.56 (IO > S in 99.3% of examples).
– IO probability; mean 49% (over 100,000 examples).
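
The two metrics above can be sketched on a toy next-token distribution. The logit values and the four-token vocabulary below are made up for illustration and are not taken from GPT-2 small; the helper names are hypothetical.

```python
import math

def logit_difference(logits, io_token, s_token):
    """Logit difference = logit(IO) - logit(S)."""
    return logits[io_token] - logits[s_token]

def io_probability(logits, io_token):
    """Softmax probability of the IO token over the (toy) vocabulary."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - top) for tok, value in logits.items()}
    return exps[io_token] / sum(exps.values())

# Hypothetical next-token logits after "... John gave a drink to"
toy_logits = {" Mary": 5.0, " John": 1.5, " the": 4.0, " them": 3.0}

diff = logit_difference(toy_logits, " Mary", " John")
prob = io_probability(toy_logits, " Mary")
print(f"logit difference = {diff:.2f}, IO probability = {prob:.2f}")
```

A positive logit difference means the model prefers IO over S; the IO probability additionally depends on every other token in the vocabulary, which is why the two metrics can diverge.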

Building Blocks: Residual Stream & Attention Heads
• GPT-2 small: decoder-only, 12 layers × 12 heads = 144 attention heads.
• Residual stream: shared 'workspace' every layer reads from and writes to.
• Each head = two low-rank maps:
– QK circuit → where to attend (the attention pattern).
– OV circuit → what information to write into the residual stream.
• A circuit C is a subgraph of the full computational graph M responsible for the task.
• Scope: this work studies attention heads only (MLPs / LayerNorm left for future work).
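
As a rough illustration of the QK / OV split, the sketch below implements one causal attention head in plain Python. The tiny dimensions (d_model = 2, d_head = 1) and all weight values are assumptions chosen for readability, and layer norm and biases are left out.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_head(residual, w_q, w_k, w_v, w_o):
    """Return (attention patterns, vectors written back to the residual stream)."""
    queries = [matvec(w_q, x) for x in residual]
    keys = [matvec(w_k, x) for x in residual]
    values = [matvec(w_v, x) for x in residual]
    scale = math.sqrt(len(queries[0]))
    patterns, writes = [], []
    for t, query in enumerate(queries):
        # QK circuit: causal scores, position t only sees positions <= t
        pattern = softmax([dot(query, keys[s]) / scale for s in range(t + 1)])
        # OV circuit: mix the attended values, then project to d_model
        mixed = [sum(p * values[s][i] for s, p in enumerate(pattern))
                 for i in range(len(values[0]))]
        patterns.append(pattern)
        writes.append(matvec(w_o, mixed))
    return patterns, writes

# Toy residual stream: 3 positions, d_model = 2 (made-up numbers)
residual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q, w_k, w_v = [[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]  # d_head x d_model
w_o = [[1.0], [0.0]]                                       # d_model x d_head

patterns, writes = attention_head(residual, w_q, w_k, w_v, w_o)
# The head adds its output to the residual stream rather than replacing it.
updated = [[r + w for r, w in zip(res, wr)] for res, wr in zip(residual, writes)]
```

Only `w_q` and `w_k` influence `patterns`, and only `w_v` and `w_o` influence what is written, which is what makes it possible to study "where to attend" and "what to write" separately.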

Knockouts: Turning Heads Off
• To test if a node matters, 'knock it out' and measure the performance drop.
• Zero ablation (set output = 0): noisy, because 0 is arbitrary and breaks implicit biases.
• Mean ablation (this paper): replace output with its average activation.
– Averaged over p_ABC: same templates but THREE unrelated names (A, B, C).
– Removes task-relevant info (which name) while preserving grammar / structure.
– Mean computed per-template, so grammatical role stays constant.
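
A minimal sketch of per-template mean ablation, assuming a head's output at one position has already been recorded as a plain list of floats for each p_ABC prompt. The function names, template ids, and activation values are hypothetical.

```python
from collections import defaultdict

def per_template_mean(abc_runs):
    """abc_runs: iterable of (template_id, activation vector) from p_ABC prompts."""
    grouped = defaultdict(list)
    for template_id, activation in abc_runs:
        grouped[template_id].append(activation)
    # Average each coordinate separately within a template
    return {
        template_id: [sum(column) / len(vectors) for column in zip(*vectors)]
        for template_id, vectors in grouped.items()
    }

def mean_ablate(template_id, means):
    """Knock out a head: its output is replaced by the mean for this template."""
    return list(means[template_id])

# Made-up recorded outputs of one head on p_ABC prompts
abc_runs = [(0, [1.0, 3.0]), (0, [3.0, 5.0]), (1, [10.0, 0.0])]
means = per_template_mean(abc_runs)

ioi_activation = [7.0, -2.0]                      # head output on an IOI prompt (template 0)
ablated = mean_ablate(template_id=0, means=means)  # what the model sees instead
```

Keying the mean on the template keeps position-dependent, grammatical information intact while averaging away which names appeared.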

Path Patching: the Core Technique
• Goal: separate a head's DIRECT effect on a target from indirect effects through other heads.
• Take two inputs: x_orig (from p_IOI) and x_new (names swapped, from p_ABC).
• Run forward on x_orig, but along chosen paths h → R inject h's activation from x_new.
• Measure the change in logit difference:
– A large drop ⇒ that path is critical to solving IOI.
• Iterate backwards from the logits to trace the whole circuit, head by head.
Fig.1: circuit (orange) + causal validation (path patching, knockout)
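
The idea of patching a single path can be illustrated on a toy two-node graph. Everything below (the functions `h1`, `h2`, the scalar inputs, the weights) is invented for illustration and is not the paper's implementation; it only shows how the direct path h1 → logits is separated from the indirect path h1 → h2 → logits.

```python
def h1(x):
    return 2.0 * x

def h2(x, h1_out):
    # h2 reads both the input and h1's output (the indirect path)
    return x + 0.5 * h1_out

def logit_diff(h1_out, h2_out):
    # The "logits" read h1 directly and also read h2
    return h1_out + h2_out

def run(x_orig, x_new=None, patch_direct=False, patch_via_h2=False):
    """Forward pass on x_orig; selected paths out of h1 carry x_new's activation."""
    h1_orig = h1(x_orig)
    h1_new = h1(x_new) if x_new is not None else h1_orig
    h1_seen_by_h2 = h1_new if patch_via_h2 else h1_orig
    h1_seen_by_logits = h1_new if patch_direct else h1_orig
    return logit_diff(h1_seen_by_logits, h2(x_orig, h1_seen_by_h2))

clean = run(1.0)
direct_only = run(1.0, x_new=0.0, patch_direct=True)
indirect_only = run(1.0, x_new=0.0, patch_via_h2=True)

print("drop via direct path:  ", clean - direct_only)
print("drop via indirect path:", clean - indirect_only)
```

Patching every path out of `h1` at once would mix the two drops together; keeping them apart is what lets the search proceed backwards from the logits one receiver at a time.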

The Human-Interpretable Algorithm
• The circuit implements a simple 3-step algorithm:
– 1. Identify all previous names in the sentence (Mary, John, John).
– 2. Remove the name that is duplicated (John).
– 3. Output the remaining name (Mary).
• Each step maps onto a class of attention heads (the next slide shows the full circuit).
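
Written out as ordinary code, the three steps look like this. It is a sketch of the human-level description only, assuming whitespace tokenization and a known set of names; it says nothing about how the heads implement each step.

```python
def ioi_answer(tokens, known_names):
    # Step 1: identify all previous names in the sentence
    previous_names = [tok for tok in tokens if tok in known_names]
    # Step 2: remove the name that is duplicated
    duplicated = {name for name in previous_names if previous_names.count(name) > 1}
    remaining = [name for name in previous_names if name not in duplicated]
    # Step 3: output the remaining name
    return remaining[0]

prompt = "When Mary and John went to the store, John gave a drink to"
answer = ioi_answer(prompt.split(), known_names={"Mary", "John"})
print(answer)
```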

The Discovered Circuit: 26 Heads, 7 Classes
Fig.2: 26 heads (1.1% of all head–position pairs) implement IOI in GPT-2 small
• Information flows left→right: detect duplicate name → inhibit it → copy the other name to the output.

Name Mover Heads: write the answer
Heads 9.9, 9.6, 10.0 · active at END
• Found by path patching directly back from the logits.
• (i) Attend to a name token (avg attention on IO = 0.59), and
• (ii) copy whatever they attend to into the output.
• Copy score > 95% (vs < 20% for an average head).
• Thanks to S-Inhibition, they attend to IO over S → output the correct name.

Negative Name Mover Heads
Heads 10.7, 11.10 · write AGAINST the correct answer
• Same behavior as Name Movers, but with opposite sign: they DECREASE the logit of the name they attend to.
• Large negative copy score (98%).
• Interpretation: the model 'hedges', softening confidence to avoid huge cross-entropy loss when wrong.
• Lesson: components can actively work against the task, and explanations must account for them.
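
The notion of a copy score, and of its negative counterpart, can be illustrated with a deliberately simplified sketch: push a name's vector through a head's OV map, unembed, and check whether that same name lands among the top logits. The 3-token vocabulary, identity matrices, `top_k=1`, and the `sign` flag are all assumptions for this toy; the measurement in the paper involves more of the model than this.

```python
def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def copy_score(name_states, w_ov, w_unembed, top_k=1, sign=1.0):
    """Fraction of names that rank in the top_k logits after OV + unembed.

    sign=-1.0 scores the negated logits, i.e. how reliably the head
    pushes the attended name DOWN (a 'negative copy score').
    """
    hits = 0
    for token_id, state in name_states.items():
        written = matvec(w_ov, state)
        logits = [sign * value for value in matvec(w_unembed, written)]
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        hits += token_id in ranked[:top_k]
    return hits / len(name_states)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
negated = [[-value for value in row] for row in identity]
# Toy residual-stream states for three name tokens (one-hot, made up)
name_states = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}

mover_score = copy_score(name_states, identity, identity)                 # copies names
negative_as_copy = copy_score(name_states, negated, identity)             # fails to copy
negative_score = copy_score(name_states, negated, identity, sign=-1.0)    # suppresses names
```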

S-Inhibition Heads: suppress the subject
Heads 7.3, 7.9, 8.6, 8.10 · active at END
• Found by path patching the QUERY of the Name Mover Heads.
• Active at END, attend to the S2 token.
• Write into the Name Movers' query a signal that removes attention to the subject (S1, S2).
• Net effect: Name Movers are biased toward IO instead of S, which makes the copy step correct.
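
A toy picture of the net effect: if the Name Movers start out with equal attention scores on all three name positions, subtracting an inhibition term on S1 and S2 shifts the softmax toward IO. The scores and the size of the inhibition are invented numbers, and modelling the signal as a simple subtraction is an assumption of this sketch.

```python
import math

def attention_over_names(scores, inhibition=None):
    """Softmax over name positions after subtracting per-position inhibition."""
    inhibition = inhibition or {}
    adjusted = {pos: s - inhibition.get(pos, 0.0) for pos, s in scores.items()}
    top = max(adjusted.values())
    exps = {pos: math.exp(s - top) for pos, s in adjusted.items()}
    total = sum(exps.values())
    return {pos: e / total for pos, e in exps.items()}

# Made-up pre-softmax scores of a Name Mover head at END
scores = {"IO": 2.0, "S1": 2.0, "S2": 2.0}

before = attention_over_names(scores)
after = attention_over_names(scores, inhibition={"S1": 3.0, "S2": 3.0})
print(f"attention on IO: {before['IO']:.2f} -> {after['IO']:.2f}")
```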

Detecting the Duplicate Name
• Duplicate Token Heads (e.g. 0.1, 3.0): active at S2, attend to S1, signal 'this token already appeared'.
• Induction Heads (e.g. 5.5, 6.9): active at S2, attend to S1+1 (the token after S1), and reach the same 'duplicate' signal via an induction mechanism.
• Previous Token Heads (2.2, 4.11): copy information about S1 to the next token S1+1, enabling induction.
• Together they feed the S-Inhibition Heads, telling them WHICH name is the repeated subject.
• Note: known 'induction heads' appear here in an unexpected role, so a head's main function is not the full picture.
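
The two routes to the 'duplicate' signal can be sketched at the token level. This is a symbolic stand-in, assuming exact string matching in place of learned attention; the function names are hypothetical.

```python
def duplicate_token_signal(tokens):
    """Duplicate-token route: does the current token appear earlier?"""
    return [tok in tokens[:i] for i, tok in enumerate(tokens)]

def induction_targets(tokens):
    """Induction route, in two stages.

    Stage 1 (previous-token heads): every position stores the token before it.
    Stage 2 (induction heads): position i attends to an earlier position whose
    stored 'previous token' equals the current token.
    """
    previous = [None] + tokens[:-1]
    targets = []
    for i, tok in enumerate(tokens):
        matches = [j for j in range(i) if previous[j] == tok]
        targets.append(matches[-1] if matches else None)
    return targets

tokens = ["When", "Mary", "and", "John", "went", "to", "the", "store", ",",
          "John", "gave", "a", "drink", "to"]
s1, s2 = 3, 9  # positions of the first and second "John"

is_duplicate = duplicate_token_signal(tokens)
attends_to = induction_targets(tokens)
print(is_duplicate[s2], tokens[attends_to[s2]])
```

At S2 the induction route lands on position S1+1 ("went"), which only carries information about "John" because the previous-token stage put it there.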

Backup Name Mover Heads
Heads 9.0, 9.7, 10.1, 10.2, 10.6, 10.10, 11.2, 11.9
• Normally they do NOT move the IO name to the output.
• But if the regular Name Mover Heads are ablated, they 'wake up' and take over the job.
• ⇒ The model has built-in redundancy / self-repair (the 'Hydra effect').
• Implication: ablating a component reveals a DIFFERENT structure than is normally used, complicating the search for complete mechanisms.

Is the Circuit Really Correct? 3 Criteria
• Faithfulness: the circuit alone performs the task about as well as the full model.
• Completeness: the circuit contains ALL nodes used for the task; no important node is missing.
• Minimality: the circuit contains NO irrelevant nodes; every node plays a role.
• All three build on F(C) = logit difference recovered by circuit C; completeness and minimality evaluate F over subsets K.
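
One way to make the three criteria concrete is to score them on a toy F defined over sets of node names. The node names, the numbers inside `toy_F`, and the exhaustive subset search are assumptions for illustration; with real heads F would be measured by knocking out everything outside the given set, and enumerating all subsets K would be infeasible.

```python
from itertools import combinations

MODEL = frozenset({"name_mover", "backup_mover", "other"})

def toy_F(nodes):
    """Made-up logit difference when only `nodes` are active."""
    if "name_mover" in nodes:
        mover = 3.0
    elif "backup_mover" in nodes:   # backup only matters once the mover is gone
        mover = 2.5
    else:
        mover = 0.0
    return mover + (0.5 if "other" in nodes else 0.0)

def subsets(nodes):
    items = sorted(nodes)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def faithfulness_gap(F, model, circuit):
    return abs(F(model) - F(circuit))

def worst_incompleteness(F, model, circuit):
    # Completeness: C and M should behave alike under every knockout K of C
    return max(abs(F(circuit - K) - F(model - K)) for K in subsets(circuit))

def minimality_score(F, circuit, node):
    # Minimality: some K must exist for which removing `node` changes F
    return max(abs(F(circuit - K - {node}) - F(circuit - K))
               for K in subsets(circuit - {node}))

naive = frozenset({"name_mover"})
circuit = frozenset({"name_mover", "backup_mover"})

print(faithfulness_gap(toy_F, MODEL, circuit))
print(worst_incompleteness(toy_F, MODEL, naive), worst_incompleteness(toy_F, MODEL, circuit))
print(minimality_score(toy_F, circuit, "backup_mover"))
```

In this toy, the naive circuit looks just as faithful as the fuller one but fails completeness once the mover is knocked out, and the backup node only shows a non-zero minimality score for the specific choice K = {name_mover}.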

Validation Results
• Faithfulness: ✓ the circuit alone recovers 87% of the full model's logit difference (gap |F(M)−F(C)| = 0.46 out of 3.56).
• Completeness: partial. Under random / by-class knockouts the circuit looks complete and clearly beats the naïve circuit, BUT a greedy adversarial search still finds subsets with incompleteness up to 3.09 (≈87% of F(M)); the naïve circuit fails this greedy test too.
• Minimality: mostly OK. For most heads, removing the head's class shows that it matters; a few need carefully chosen subsets.
• Honest conclusion: the criteria support the circuit but also expose real remaining gaps in our understanding.
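
The percentages on this slide follow from the quoted numbers by simple arithmetic. The only assumption in the check below is that the circuit scores lower than the full model, so that F(C) = F(M) − 0.46.

```python
F_M = 3.56                      # mean logit difference of the full model
faithfulness_gap = 0.46         # |F(M) - F(C)| from the slide
F_C = F_M - faithfulness_gap    # assumes F(C) < F(M)

recovered = F_C / F_M           # share of the logit difference the circuit recovers
worst_incompleteness = 3.09     # found by the greedy adversarial search
relative_incompleteness = worst_incompleteness / F_M

print(f"F(C) = {F_C:.2f}, recovered = {recovered:.1%}")
print(f"worst incompleteness = {relative_incompleteness:.1%} of F(M)")
```

Both ratios round to 87%, but they mean opposite things: the first says the circuit is faithful on average, the second says that some knockout sets make the circuit and the full model differ by almost the entire effect.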

Three Surprises for Interpretability
• Redundancy: Backup Name Movers take over when Name Movers are ablated → ablation can mislead.
• Repurposed structure: induction heads are used here for duplicate-token detection, not their 'usual' job.
• Anti-helpful components: Negative Name Movers deliberately write against the correct answer.
• Takeaway: real circuits are messier than clean diagrams suggest, so rigorous causal validation is essential.

Limitations & Future Work
• Single small model (GPT-2 small) and a single, narrow task (IOI).
• MLPs, LayerNorm and embeddings are not analyzed; attention heads only.
• Circuit fails the hardest completeness test → not a fully complete explanation.
• Manual, labor-intensive discovery → motivated later automation.
• Open question: do these motifs scale to larger models and more complex tasks?

Takeaways
• First end-to-end reverse-engineering of a natural-language task in a real LM: 26 heads, 7 classes.
• Path patching = a reusable causal tool for tracing circuits; mean ablation over p_ABC for clean knockouts.
• Faithfulness / completeness / minimality give interpretability a falsifiable standard of evidence.
• A landmark for mechanistic interpretability, and a reminder that full, scalable understanding is still open.

Thank you for listening.