[DL Reading Group] Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

June 04, 26

Text of each slide
1.

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
LIANG RUIQI, Matsuo-Iwasawa Lab M1

2.

Bibliography Information
Title: Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
Authors: Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
Affiliations: Redwood Research; UC Berkeley
Publication: ICLR 2023 (arXiv, Nov 2022)
arXiv: https://arxiv.org/abs/2211.00593
Summary
• Fully reverse-engineers how GPT-2 small solves a real language task (IOI); at the time, the largest end-to-end circuit found 'in the wild'.
• Identifies 26 attention heads in 7 functional classes using causal interventions (path patching, knockouts).
• Proposes 3 quantitative criteria (faithfulness, completeness, minimality) to validate circuit explanations.

3.

Outline
• 1. Background: what mechanistic interpretability is & why circuits
• 2. The task: Indirect Object Identification (IOI)
• 3. Tools: residual stream, attention heads, knockouts, path patching
• 4. The discovered circuit: 26 heads, 7 classes
• 5. Validation: faithfulness / completeness / minimality
• 6. Surprises, limitations & takeaways

4.

Mechanistic Interpretability: Goal & Gap
• Goal: reverse-engineer model weights into human-understandable algorithms, explaining behavior via internal components.
• Why it matters: predict out-of-distribution behavior, find & fix errors, anticipate emergent capabilities → safer deployment.
• The gap this paper fills:
– Prior work explained simple behaviors in tiny models, OR large models only in 'broad strokes'.
– Here: a complete, end-to-end mechanism for a natural-language task in a real LM (GPT-2 small).
• Approach: find a circuit, i.e. a subgraph of the model's computational graph responsible for the task.

5.
The IOI Task
• Indirect Object Identification: complete a sentence with the correct name.
• Human algorithm: of the two names, output the one that is NOT the subject of the last clause.
• Built from 15 templates with random single-token names / places / items.

Example
"When Mary and John went to the store, John gave a drink to ___"
→ correct completion: "Mary"
IO = Mary · S1/S2 = John (subject, repeated)

• Two metrics (means over 100,000 examples):
– Logit difference = logit(IO) − logit(S); mean 3.56 (IO > S in 99.3% of examples).
– IO probability; mean 49%.
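The two metrics can be sketched in a few lines. This is a minimal illustration only: the three-token vocabulary and the logit values in `toy_logits` are made up and are not taken from GPT-2 small.

```python
import math

def logit_difference(logits, io_token, s_token):
    """logit(IO) - logit(S); a positive value means IO is preferred over S."""
    return logits[io_token] - logits[s_token]

def io_probability(logits, io_token):
    """Softmax probability of the IO token over the whole (toy) vocabulary."""
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in logits.items()}
    return exps[io_token] / sum(exps.values())

# Hypothetical next-token logits for "... John gave a drink to ___"
toy_logits = {" Mary": 5.0, " John": 1.5, " the": 4.0}

ld = logit_difference(toy_logits, " Mary", " John")
p_io = io_probability(toy_logits, " Mary")
print(f"logit difference = {ld:.2f}, IO probability = {p_io:.3f}")
```

Note that the logit difference only compares the two names, while the IO probability also depends on every other token in the vocabulary, which is why the two metrics can disagree.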

6.

Building Blocks: Residual Stream & Attention Heads
• GPT-2 small: decoder-only, 12 layers × 12 heads = 144 attention heads.
• Residual stream: shared 'workspace' every layer reads from and writes to.
• Each head = two low-rank maps:
– QK circuit → where to attend (the attention pattern).
– OV circuit → what information to write into the residual stream.
• A circuit C is a subgraph of the full computational graph M responsible for the task.
• Scope: this work studies attention heads only (MLPs / LayerNorm left for future work).
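The QK / OV split can be made concrete with a toy single head. This is a sketch under strong simplifying assumptions: a 2-dimensional residual stream, hand-picked matrices `w_qk` and `w_ov`, and no LayerNorm, biases, or score scaling. None of these values come from GPT-2 small.

```python
import math

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(resid, dst, w_qk, w_ov):
    """One toy head writing to position `dst`, reading positions 0..dst."""
    sources = resid[: dst + 1]  # causal mask: current and earlier positions only
    # QK circuit: a bilinear score between destination and each source.
    pattern = softmax([dot(resid[dst], matvec(w_qk, src)) for src in sources])
    # OV circuit: what each source position would write if fully attended.
    moved = [matvec(w_ov, src) for src in sources]
    width = len(resid[dst])
    out = [sum(p * vec[i] for p, vec in zip(pattern, moved)) for i in range(width)]
    return pattern, out

# Toy residual stream: three positions, two dimensions (made-up values).
resid = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_qk = [[4.0, 0.0], [0.0, 0.0]]  # attend to sources sharing dimension 0
w_ov = [[0.0, 0.0], [1.0, 0.0]]  # copy dimension 0 of the source into dimension 1

pattern, out = head_output(resid, dst=2, w_qk=w_qk, w_ov=w_ov)
new_resid = [r + o for r, o in zip(resid[2], out)]  # the head ADDS to the stream
```

Here `pattern` depends only on `w_qk` and `out` is the pattern-weighted sum of what `w_ov` produces, which is the sense in which the two maps can be studied separately.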

7.

Knockouts: Turning Heads Off
• To test if a node matters, 'knock it out' and measure the performance drop.
• Zero ablation (set output = 0): noisy, because 0 is an arbitrary value and later components may rely on a head's average activation as an implicit bias.
• Mean ablation (this paper): replace output with its average activation.
– Averaged over p_ABC: same templates but THREE unrelated names (A, B, C).
– Removes task-relevant info (which name) while preserving grammar / structure.
– Mean computed per-template, so grammatical role stays constant.
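The difference between zero ablation and mean ablation can be sketched with a toy model. Everything here is an assumption for illustration: head outputs are scalars, the "model" is just their sum, the head names `name_mover` and `grammar_head` are hypothetical, and the per-template averaging used in the paper is not modeled.

```python
from statistics import mean

def run_model(head_outputs):
    """Toy 'model': the logit difference is the sum of scalar head outputs."""
    return sum(head_outputs.values())

def zero_ablate(head_outputs, head):
    """Knock a head out by setting its output to 0."""
    patched = dict(head_outputs)
    patched[head] = 0.0
    return patched

def mean_ablate(head_outputs, head, reference_runs):
    """Knock a head out by replacing its output with its reference average."""
    patched = dict(head_outputs)
    patched[head] = mean(run[head] for run in reference_runs)
    return patched

# Made-up activations on one IOI prompt.
ioi_run = {"name_mover": 3.0, "grammar_head": 1.0}
# Made-up activations on reference prompts with three unrelated names (p_ABC).
abc_runs = [
    {"name_mover": 0.25, "grammar_head": 1.0},
    {"name_mover": -0.25, "grammar_head": 1.0},
]

baseline = run_model(ioi_run)
zero_grammar = run_model(zero_ablate(ioi_run, "grammar_head"))
mean_grammar = run_model(mean_ablate(ioi_run, "grammar_head", abc_runs))
mean_mover = run_model(mean_ablate(ioi_run, "name_mover", abc_runs))
```

In this toy setup, zero ablation makes `grammar_head` look important even though its output carries no name information, while mean ablation leaves the score unchanged for that head and only drops it for `name_mover`.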

8.

Path Patching: the Core Technique
• Goal: separate a head's DIRECT effect on a target from indirect effects through other heads.
• Take two inputs: x_orig (from p_IOI) and x_new (same template with unrelated names, from p_ABC).
• Run forward on x_orig, but along chosen paths h → R inject h's activation from x_new.
• Measure the change in logit difference:
– A large drop ⇒ that path is critical to solving IOI.
• Iterate backwards from the logits to trace the whole circuit, head by head.
Fig.1: circuit (orange) + causal validation (path patching, knockout)
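The idea of patching one path rather than a whole activation can be sketched on a two-head toy graph. This is an illustration under stated assumptions, not the paper's implementation: `head_a`, `head_b`, and the scalar inputs are hypothetical, and head A reaches the output both directly and through head B.

```python
def head_a(x):
    return 2.0 * x

def head_b(x, a_out):
    return x + 0.5 * a_out

def logit_diff(x, a_for_logits=None, a_for_b=None):
    """Toy forward pass on input x.

    The optional arguments override head A's output along the path
    A -> logits and along the path A -> B, respectively.
    """
    a = head_a(x)
    a_direct = a if a_for_logits is None else a_for_logits
    a_into_b = a if a_for_b is None else a_for_b
    return a_direct + head_b(x, a_into_b)

x_orig, x_new = 1.0, 0.0       # stand-ins for a p_IOI and a p_ABC input
a_new = head_a(x_new)          # head A's activation on the corrupted input

clean = logit_diff(x_orig)
direct_only = logit_diff(x_orig, a_for_logits=a_new)   # patch path A -> logits
via_b_only = logit_diff(x_orig, a_for_b=a_new)         # patch path A -> B
everything = logit_diff(x_orig, a_for_logits=a_new, a_for_b=a_new)

direct_effect = clean - direct_only
indirect_effect = clean - via_b_only
```

Patching every outgoing path of A at once (`everything`) mixes both effects, whereas patching a single path isolates the contribution that flows along it.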

9.

The Human-Interpretable Algorithm
• The circuit implements a simple 3-step algorithm:
– 1. Identify all previous names in the sentence (Mary, John, John).
– 2. Remove the name that is duplicated (John).
– 3. Output the remaining name (Mary).
• Each step maps onto a class of attention heads; the next slide shows the full circuit.
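The three steps above can be written as a short function. This is a sketch of the human-readable algorithm only, not of how the heads compute it, and it assumes the names have already been extracted from the sentence in order.

```python
from collections import Counter

def ioi_algorithm(names_in_order):
    """Return the name that appears exactly once among the previous names."""
    counts = Counter(names_in_order)                      # step 1: collect all names
    remaining = [n for n, c in counts.items() if c == 1]  # step 2: drop the duplicate
    if len(remaining) != 1:
        raise ValueError("expected exactly one non-repeated name")
    return remaining[0]                                   # step 3: output what is left

answer = ioi_algorithm(["Mary", "John", "John"])
print(answer)
```

The order of IO and S1 in the first clause does not matter for this function, so `["John", "Mary", "John"]` gives the same answer.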

10.

The Discovered Circuit: 26 Heads, 7 Classes
Fig.2: 26 heads (1.1% of all head–position pairs) implement IOI in GPT-2 small
• Information flows left→right: detect duplicate name → inhibit it → copy the other name to the output.

11.

Name Mover Heads: write the answer
Heads 9.9, 9.6, 10.0 · active at END
• Found by path patching directly back from the logits.
• (i) Attend to a name token (avg attention on IO = 0.59), and
• (ii) copy whatever they attend to into the output.
• Copy score > 95% (vs < 20% for an average head).
• Thanks to S-Inhibition, they attend to IO over S → output the correct name.

12.

Negative Name Mover Heads
Heads 10.7, 11.10 · write AGAINST the correct answer
• Same behavior as Name Movers, but with opposite sign: they DECREASE the logit of the name they attend to.
• Large negative copy score (98%).
• Possible interpretation: the model 'hedges', softening its confidence to avoid a huge cross-entropy loss when it is wrong.
• Lesson: components can actively work against the task, so explanations must account for them.

13.

S-Inhibition Heads: suppress the subject
Heads 7.3, 7.9, 8.6, 8.10 · active at END
• Found by path patching the QUERY of the Name Mover Heads.
• Active at END, attend to the S2 token.
• Write into the Name Movers' query a signal that removes attention to the subject (S1, S2).
• Net effect: Name Movers are biased toward IO instead of S, which makes the copy step correct.
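The net effect on the Name Movers' attention can be illustrated with a toy softmax. This is a deliberately crude abstraction: the real signal is written into the Name Movers' query, whereas here it is modeled as a hypothetical constant `inhibition` subtracted from the attention score on S1, with made-up score values.

```python
import math

def attention(scores):
    """Softmax over a dict of attention scores."""
    top = max(scores.values())
    exps = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Without S-Inhibition, a name mover has no reason to prefer IO over S1.
raw_scores = {"IO": 2.0, "S1": 2.0}

# Assumed inhibition strength (hypothetical value).
inhibition = 2.0
inhibited_scores = dict(raw_scores, S1=raw_scores["S1"] - inhibition)

before = attention(raw_scores)
after = attention(inhibited_scores)
```

With equal raw scores the head splits its attention evenly, and lowering only the S1 score shifts most of the attention to IO, which is the bias the copy step relies on.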

14.

Detecting the Duplicate Name
• Duplicate Token Heads (0.1, 3.0): active at S2, attend to S1, signal 'this token already appeared'.
• Induction Heads (5.5, 6.9): reach the same 'duplicate' signal via an induction mechanism; active at S2, they attend to S1+1, the token right after S1.
• Previous Token Heads (2.2, 4.11): copy info from S1 to the next token S1+1, enabling induction.
• Together they feed the S-Inhibition Heads, telling them WHICH name is the repeated subject.
• Note: known 'induction heads' appear here in an unexpected role; a head's main function ≠ the full picture.

15.

Backup Name Mover Heads
Heads 9.0, 9.7, 10.1, 10.2, 10.6, 10.10, 11.2, 11.9
• Normally they do NOT move the IO name to the output.
• But if the regular Name Mover Heads are ablated, they 'wake up' and take over the job.
• ⇒ The model has built-in redundancy / self-repair (the 'Hydra effect').
• Implication: ablating a component reveals a DIFFERENT structure than is normally used, complicating the search for complete mechanisms.

16.

Is the Circuit Really Correct? 3 Criteria
• Faithfulness: the circuit alone performs the task about as well as the full model.
• Completeness: the circuit contains ALL nodes used for the task; no important node is missing.
• Minimality: the circuit contains NO irrelevant nodes; every node plays a role.
• All three are built on F(C) = logit difference recovered by circuit C; completeness / minimality compare F with subsets K of nodes knocked out.
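In symbols, the three criteria can be sketched as follows. The notation is an assumption made for this summary: M is the full model, C the circuit, and C \ K means the circuit with the nodes in K knocked out (mean-ablated). "Small" and "large" are left informal, as on the slide.

```latex
\begin{align*}
\text{Faithfulness:} \quad & |F(M) - F(C)| \ \text{is small} \\
\text{Completeness:} \quad & \forall K \subseteq C:\ 
    |F(C \setminus K) - F(M \setminus K)| \ \text{is small} \\
\text{Minimality:} \quad & \forall v \in C,\ \exists K \subseteq C \setminus \{v\}:\ 
    |F(C \setminus (K \cup \{v\})) - F(C \setminus K)| \ \text{is large}
\end{align*}
```

Completeness is the hardest to check because it quantifies over every subset K, which is why it is probed with random, by-class, and greedy searches rather than exhaustively.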

17.

Validation Results
• Faithfulness: ✓ the circuit alone recovers 87% of the full model's logit difference (gap |F(M)−F(C)| = 0.46 out of 3.56).
• Completeness: partial. Under random / by-class knockouts the circuit looks complete and clearly beats the naïve circuit, BUT a greedy adversarial search still finds subsets with incompleteness up to 3.09 (≈87% of F(M)); the naïve circuit fails this greedy test too.
• Minimality: mostly OK. For most heads, removing its class shows it matters; a few need carefully chosen subsets.
• Honest conclusion: the criteria support the circuit but also expose real remaining gaps in our understanding.
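The two percentages on this slide follow directly from the reported numbers. The snippet below only redoes that arithmetic; the variable names are chosen here for illustration.

```python
F_M = 3.56                   # mean logit difference of the full model
faithfulness_gap = 0.46      # |F(M) - F(C)|
worst_incompleteness = 3.09  # largest score found by the greedy search

F_C = F_M - faithfulness_gap               # logit difference of the circuit alone
recovered_fraction = F_C / F_M             # share of F(M) the circuit recovers
incompleteness_fraction = worst_incompleteness / F_M

print(f"circuit recovers {recovered_fraction:.1%} of F(M)")
print(f"worst incompleteness is {incompleteness_fraction:.1%} of F(M)")
```

Both ratios round to 87%, but they mean opposite things: the first says the circuit is faithful, the second says the worst-case knockout gap is almost as large as the full model's whole logit difference.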

18.

Three Surprises for Interpretability
• Redundancy: Backup Name Movers take over when Name Movers are ablated → ablation can mislead.
• Repurposed structure: induction heads are used here for duplicate-token detection, not their 'usual' job.
• Anti-helpful components: Negative Name Movers deliberately write against the correct answer.
• Takeaway: real circuits are messier than clean diagrams suggest, so rigorous causal validation is essential.

19.

Limitations & Future Work
• Single small model (GPT-2 small) and a single, narrow task (IOI).
• MLPs, LayerNorm and embeddings are not analyzed; attention heads only.
• Circuit fails the hardest completeness test → not a fully complete explanation.
• Manual, labor-intensive discovery → motivated later work on automating circuit discovery.
• Open question: do these motifs scale to larger models and more complex tasks?

20.

Takeaways
• First end-to-end reverse-engineering of a natural-language task in a real LM: 26 heads, 7 classes.
• Path patching = a reusable causal tool for tracing circuits; mean ablation over p_ABC for clean knockouts.
• Faithfulness / completeness / minimality give interpretability a falsifiable standard of evidence.
• A landmark for mechanistic interpretability, and a reminder that full, scalable understanding is still open.

21.

Thank you for listening.