[DL Reading Group] Circuits for Analyzing the Internal Behavior of Neural Networks

July 04, 24

#NeuralNetworks #DeepLearning #ModelInterpretability #Transformer #Circuit

Deep Learning JP

@DeepLearning2023

Text of each slide

DEEP LEARNING JP [DL Papers]
Circuits for Analyzing the Internal Behavior of Neural Networks
Keno Harada, the University of Tokyo
http://deeplearning.jp/

Analyzing Transformer behavior
Ferrando et al., 2024, “A Primer on the Inner Workings of Transformer-Based Language Models”

https://arxiv.org/pdf/2405.00208

Analyzing and using attention heads
• Retrieval heads determine how well the context is used
  Wu et al., 2024, “Retrieval Head Mechanistically Explains Long-Context Factuality”
• Attention sink phenomenon → when using a sliding window, also keep the initial tokens
  Xiao et al., 2023, “Efficient Streaming Language Models with Attention Sinks”
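
A minimal sketch of the sliding-window idea in the second bullet: which positions stay in the KV cache when the first few "sink" tokens are kept alongside the most recent window. The function name and the parameters `n_sink` and `window` are illustrative assumptions, not the API of any released implementation.

```python
def streaming_cache_positions(seq_len: int, n_sink: int, window: int) -> list[int]:
    """Positions kept in the cache: the first `n_sink` tokens plus the last `window` tokens."""
    # Sink tokens: the very first positions, kept no matter how long the sequence grows.
    sinks = list(range(min(n_sink, seq_len)))
    # Recent tokens: the sliding window, never overlapping the sink positions.
    recent_start = max(seq_len - window, len(sinks))
    recent = list(range(recent_start, seq_len))
    return sinks + recent


print(streaming_cache_positions(seq_len=10, n_sink=2, window=3))  # [0, 1, 7, 8, 9]
```

A plain sliding window would keep only positions 7 to 9 here; the sketch differs only in also retaining positions 0 and 1.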

Analyzing and using neurons in the feed-forward block
• Some neurons fire on specific tokens, n-grams, or positions
  Voita et al., 2023, “Neurons in Large Language Models: Dead, N-gram, Positional”
• Language-specific neurons can be identified → the output language can be controlled through their activation patterns
  Tang et al., 2024, “Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models”

Analyzing and using the residual stream
• Sparse autoencoders improve interpretability, and feature steering makes it possible to control the output
  Templeton et al., 2024, “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
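
A toy sketch of the two ideas on this slide, under stated assumptions: a sparse autoencoder encodes a residual-stream vector into non-negative feature activations, and steering clamps one feature to a chosen value before decoding, while the reconstruction error is added back unchanged. The weights, shapes, and helper names below are made up for illustration and are not taken from the cited work.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sae_encode(x, W_enc, b_enc):
    """Feature activations: ReLU(W_enc x + b_enc)."""
    return [max(0.0, a + b) for a, b in zip(matvec(W_enc, x), b_enc)]


def sae_decode(f, W_dec, b_dec):
    """Reconstruction: b_dec + sum_k f[k] * W_dec[k], one decoder row per feature."""
    return [b_dec[i] + sum(f[k] * W_dec[k][i] for k in range(len(f)))
            for i in range(len(b_dec))]


def steer(x, W_enc, b_enc, W_dec, b_dec, feature, value):
    """Clamp one feature to `value`, decode, and add back the untouched error term."""
    f = sae_encode(x, W_enc, b_enc)
    error = [xi - ri for xi, ri in zip(x, sae_decode(f, W_dec, b_dec))]
    f[feature] = value
    return [ri + ei for ri, ei in zip(sae_decode(f, W_dec, b_dec), error)]


# Hypothetical 2-dimensional residual stream with 3 features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

x = [1.0, 0.0]
print(sae_encode(x, W_enc, b_enc))                  # [1.0, 0.0, 0.0]
print(steer(x, W_enc, b_enc, W_dec, b_dec, 1, 2.0))  # [1.0, 2.0]
```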

Analyzing Transformer behavior
Ferrando et al., 2024, “A Primer on the Inner Workings of Transformer-Based Language Models”

https://arxiv.org/pdf/2405.00208

• How can circuits be found?
• What kinds of circuits have been found?
• What are circuits useful for?
• How are circuits acquired?

Circuits
• “A subgraph of a neural network. Nodes correspond to neurons or directions (linear combinations of neurons). Two nodes have an edge between them if they are in adjacent layers. The edges have weights which are the weights between those neurons (or n_1 W n_2^T if the nodes are linear combinations). For convolutional layers, the weights are 2D matrices representing the weights for different relative positions of the layers.”
Olah et al., 2020, “Zoom In: An Introduction to Circuits”

https://distill.pub/2020/circuits/zoom-in/
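
As a small worked example of the edge weight n_1 W n_2^T in the definition above, the sketch below treats n_1 and n_2 as row vectors over the neurons of two adjacent layers. The indexing convention (W[i][j] connects neuron i of the first layer to neuron j of the second) and the numbers are assumptions chosen for illustration.

```python
def edge_weight(n1, W, n2):
    """Compute n1 W n2^T, where W[i][j] links neuron i (first layer) to neuron j (second layer)."""
    return sum(n1[i] * W[i][j] * n2[j]
               for i in range(len(n1))
               for j in range(len(n2)))


W = [[1.0, 2.0],
     [3.0, 4.0]]

# One-hot vectors select single neurons, so the edge weight is just one entry of W.
print(edge_weight([1.0, 0.0], W, [0.0, 1.0]))  # 2.0

# General directions (linear combinations of neurons) mix several entries.
print(edge_weight([0.5, 0.5], W, [1.0, 1.0]))  # 5.0
```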

Examples of circuits
Olah et al., 2020, “Zoom In: An Introduction to Circuits”

https://distill.pub/2020/circuits/zoom-in/

Induction Heads
Elhage et al., 2021, “A Mathematical Framework for Transformer Circuits”
Heads that continue a sequence of the form …[A][B]…[A] with [B]
A connection to in-context learning has been pointed out
Ferrando et al., 2024, “A Primer on the Inner Workings of Transformer-Based Language Models”
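
The input-output behavior described above can be sketched as a plain lookup rule. This is only a behavioral illustration of "…[A][B]…[A] → [B]" and says nothing about how attention heads implement it; the function name is hypothetical.

```python
def induction_predict(tokens):
    """Return the token that followed the most recent earlier occurrence of the last token."""
    current = tokens[-1]
    # Scan backwards over earlier positions looking for the same token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # The current token has not appeared before.


print(induction_predict(["A", "B", "C", "A"]))  # B
```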

Induction Heads
Olsson et al., 2022, “In-context Learning and Induction Heads”

https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

Well-known circuits
Conmy et al., 2023, “Towards Automated Circuit Discovery for Mechanistic Interpretability”

https://arxiv.org/pdf/2304.14997

IOI circuit in GPT-2
Wang et al., 2022, “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small”

https://arxiv.org/abs/2211.00593

Indirect Object Identification (IOI) task
“When Mary and John went to the store, John gave a drink to ___”
Steps needed to solve the task above
• Identify every name that appears in the sentence (Mary, John, John)
• Remove the name that is duplicated (John)
• Output the remaining name (Mary)
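
The three steps above can be sketched as an ordinary algorithm. The list of names is passed in directly, which is an assumption that skips the name-detection part of step 1; the function name is illustrative.

```python
from collections import Counter


def ioi_answer(names_in_order):
    """Return the name that is mentioned exactly once, i.e. the indirect object."""
    counts = Counter(names_in_order)                            # step 1: collect all names
    remaining = [n for n in names_in_order if counts[n] == 1]   # step 2: drop duplicated names
    return remaining[0] if remaining else None                  # step 3: output what is left


# Names from "When Mary and John went to the store, John gave a drink to ___"
print(ioi_answer(["Mary", "John", "John"]))  # Mary
```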

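Important heads in the IOI circuit
• Duplicate Token Heads
  • Identify tokens that have already appeared in the sentence
  • Emit a signal indicating that token duplication has occurred
• S-Inhibition Heads
  • Remove the duplicate tokens from the attention of the Name Mover Heads
• Name Mover Heads
  • Output the remaining name
  • Attend to the names in the sentence
• Negative Name Mover Heads
  • Push in the direction opposite to the correct answer
  • Hedge so that the cross-entropy loss does not blow up when the output is wrong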

Heads that support the important heads
• Previous Token Heads
  • Copy information about the subject token to the token that follows it
• Induction Heads
• Backup Name Mover Heads
  • Backups for the Name Mover Heads

Finding circuits: path patching
Replace part of the forward pass with the activations from a different input
Look at the change in the logits and narrow down to the heads where patching causes a large change
Analyze the attention patterns of the heads that remain
Wang et al., 2022, “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small”

https://arxiv.org/abs/2211.00593
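
A toy sketch of the patching loop described on this slide. It is the simpler activation-patching variant (the whole output of one head is swapped), not path patching proper, which restricts the swap to specific paths between components. The "model" is a hypothetical sum of scalar head outputs standing in for a logit difference; all names and numbers are assumptions.

```python
def run(heads, x, patch=None):
    """Toy forward pass: sum the head outputs; `patch` maps head name -> replacement activation."""
    patch = patch or {}
    return sum(patch[name] if name in patch else head(x)
               for name, head in heads.items())


def patching_effects(heads, clean, corrupted):
    """For each head, swap in its activation from the corrupted input and record the output change."""
    base = run(heads, clean)
    effects = {}
    for name, head in heads.items():
        patched = run(heads, clean, patch={name: head(corrupted)})
        effects[name] = patched - base
    return effects


# Hypothetical heads: only h0 depends on the part of the input that differs.
heads = {
    "h0": lambda x: 2.0 * x[0],
    "h1": lambda x: 0.5 * x[1],
    "h2": lambda x: 0.0,
}
effects = patching_effects(heads, clean=(1.0, 1.0), corrupted=(0.0, 1.0))
print(effects)  # {'h0': -2.0, 'h1': 0.0, 'h2': 0.0}

# Heads with the largest absolute effect are the candidates to inspect further.
print(max(effects, key=lambda name: abs(effects[name])))  # h0
```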

What influences the heads that were narrowed down?
• Run path patching again to identify the heads that influence the q, k, and v of the selected heads
Wang et al., 2022, “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small”

https://arxiv.org/abs/2211.00593

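What happens after the identified important heads are knocked out?
• Other heads turn out to play a role similar to that of the knocked-out heads
• Possibly an effect of dropout during training?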

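Checking the validity of circuits
• Faithfulness
  • Can the circuit on its own solve the task as well as the full model?
• Completeness
  • The circuit contains every node needed to solve the task
  • When part of the circuit is knocked out, the performance drop should not differ much between removing it from the full model and removing it from the circuit
• Minimality
  • The circuit contains no irrelevant nodes
  • How much does performance change when something in the circuit is knocked out?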

Automatic Circuit DisCovery (ACDC)
Conmy et al., 2023, “Towards Automated Circuit Discovery for Mechanistic Interpretability”

https://arxiv.org/pdf/2304.14997
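
A simplified sketch of a greedy pruning loop in the spirit of ACDC as I understand it: visit edges from the output side toward the input, tentatively remove each one, and keep the removal if a divergence metric rises by less than a threshold. The metric here is a hypothetical stand-in (the summed importance of removed edges); in practice the metric would come from running the model with the removed edges patched. All names and numbers are assumptions.

```python
def prune_edges(edges, metric, tau):
    """Greedily drop edges whose removal increases `metric` by less than `tau`."""
    kept = list(edges)
    for edge in edges:  # `edges` is assumed to be ordered from output side to input side
        candidate = [e for e in kept if e != edge]
        if metric(candidate) - metric(kept) < tau:
            kept = candidate  # removal barely hurts, so make it permanent
    return kept


# Hypothetical computational graph and edge importances.
IMPORTANCE = {
    ("h1", "out"): 1.0,
    ("h2", "out"): 0.0625,
    ("in", "h1"): 0.5,
    ("in", "h2"): 0.0,
}


def toy_metric(kept_edges):
    """Stand-in for a divergence from the full model: importance mass of the removed edges."""
    return sum(w for e, w in IMPORTANCE.items() if e not in kept_edges)


circuit = prune_edges(list(IMPORTANCE), toy_metric, tau=0.1)
print(circuit)  # [('h1', 'out'), ('in', 'h1')]
```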

Finding circuits efficiently
• Identify them from attribution (importance) rather than from patching
Ferrando et al., 2024, “Information Flow Routes: Automatically Interpreting Language Models at Scale”

https://arxiv.org/abs/2403.00824

Finding circuits efficiently: computing the attribution
• The distance between z_j and y is small → y carries much of the information in z_j → z_j is important
Ferrando et al., 2024, “Information Flow Routes: Automatically Interpreting Language Models at Scale”

https://arxiv.org/abs/2403.00824
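
A minimal sketch of a distance-based importance score, assuming y is the sum of the component vectors z_j and using an L1 proximity, max(||y||_1 - ||z_j - y||_1, 0), normalized over all j. This is my recollection of the formulation used in this line of work and should be checked against the paper; the vectors are made up.

```python
def l1(v):
    """L1 norm of a vector."""
    return sum(abs(x) for x in v)


def importances(parts):
    """Importance of each z_j for y = sum_j z_j: closer to y means more important."""
    y = [sum(column) for column in zip(*parts)]
    proximity = [max(l1(y) - l1([a - b for a, b in zip(z, y)]), 0.0) for z in parts]
    total = sum(proximity)
    return [p / total if total else 0.0 for p in proximity]


# Hypothetical decomposition of y = [2, 1] into three component vectors.
parts = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
scores = importances(parts)
print(scores)
```

In this toy case the first component is closest to y and receives two thirds of the importance, the second receives one third, and the zero vector receives none.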

Finding circuits efficiently: FFN edges
Ferrando et al., 2024, “Information Flow Routes: Automatically Interpreting Language Models at Scale”

https://arxiv.org/abs/2403.00824

Finding circuits efficiently: attention edges
Ferrando et al., 2024, “Information Flow Routes: Automatically Interpreting Language Models at Scale”

https://arxiv.org/abs/2403.00824

Finding circuits efficiently
• Keep the edges whose importance is at or above a given threshold
Ferrando et al., 2024, “Information Flow Routes: Automatically Interpreting Language Models at Scale”

https://arxiv.org/abs/2403.00824
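
The thresholding step can be sketched as a graph traversal that starts at the output node and follows only edges whose importance reaches the threshold. The graph layout, node names, and the parameter `tau` are illustrative assumptions.

```python
def extract_route(incoming, output_node, tau):
    """Collect edges reachable from `output_node` whose importance is at least `tau`.

    `incoming` maps a node to a list of (source_node, importance) pairs.
    """
    kept = set()
    stack = [output_node]
    seen = {output_node}
    while stack:
        node = stack.pop()
        for source, importance in incoming.get(node, []):
            if importance >= tau:
                kept.add((source, node))
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
    return kept


# Hypothetical importances for a tiny graph.
incoming = {
    "out": [("attn1", 0.6), ("ffn1", 0.05), ("resid", 0.35)],
    "attn1": [("tok0", 0.9), ("tok1", 0.1)],
    "ffn1": [("tok1", 1.0)],
}
route = extract_route(incoming, "out", tau=0.2)
print(sorted(route))  # [('attn1', 'out'), ('resid', 'out'), ('tok0', 'attn1')]
```

Note that the edge into `ffn1` is never visited: a node whose edge to the output falls below the threshold is dropped together with everything that only feeds it.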

Finding circuits: Causal Abstraction
• Hypothesize an ideal circuit, then find what corresponds to it in the network
Geiger et al., 2021, “Causal Abstractions of Neural Networks”

https://arxiv.org/abs/2106.02997

What are circuits useful for?
• Improving reliability and safety through output control
• Weight initialization, model merging, distillation/pruning, and data strategy
Samragh et al., 2023, “Weight Subcloning: Direct Initialization of Transformers Using Larger Pretrained Ones”
Cheng et al., 2024, “Instruction Pre-Training: Language Models are Supervised Multitask Learners”

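The relationship between circuit acquisition and task performance
Example: Syntactic Attention Structure (SAS)
• Heads that handle specific dependency relations are acquired → accuracy rises on questions that test knowledge of grammatical structure
Chen et al., 2024, “Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs”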

https://arxiv.org/pdf/2309.07311

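The relationship between circuit acquisition and task performance
Example: Syntactic Attention Structure (SAS)
• In some cases the acquired structure is not in a form that humans can easily interpret
Chen et al., 2024, “Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs”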

https://arxiv.org/pdf/2309.07311

Recommended resources
• https://transformer-circuits.pub/
• https://distill.pub/2020/circuits/
• https://arxiv.org/abs/2404.14082
• https://arxiv.org/abs/2405.00208