2023/6/23
Deep Learning JP
http://deeplearning.jp/seminar-2/
DL輪読会 (paper reading group) material
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Shohei Taniguchi, Matsuo Lab
Deep Transformers without Shortcuts
Paper information
• Authors: Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, Yee Whye Teh (DeepMind)
Overview
• Modifies the Transformer so that it can be trained without layer normalization or skip connections
• Accepted at ICLR 2023
Outline
• Background
• Related work
• Method
• Experimental results
• Summary
Background: Transformer
• A Transformer is a stack of alternating attention and MLP blocks
• It is standard to apply a skip connection and layer normalization in each module
• What role these components actually play is still unclear
• They are treated as techniques for making training work well
Related work: Normalization-free networks
• For MLPs and CNNs, methods are known for training deep networks without skip connections or normalization
• Basically, if the weights are initialized appropriately so that gradients neither vanish nor explode, normalization and the like are not needed
• The concept of dynamical isometry is particularly important
Isometry
• Consider an L-layer MLP with pre-activations $h^l = W^l x^{l-1} + b^l$ and activations $x^l = \phi(h^l)$. The input-output Jacobian $J$ is the product of the per-layer matrices:
$$J = \frac{\partial x^L}{\partial h^0} = \prod_{l=1}^{L} D^l W^l$$
where $D^l$ is the diagonal matrix with $D^l_{ij} = \phi'(h^l_i)\,\delta_{ij}$.
Isometry
$$J = \frac{\partial x^L}{\partial h^0} = \prod_{l=1}^{L} D^l W^l$$
• If this Jacobian neither vanishes nor explodes, training should be stable → its singular values should be close to 1
• $W^l$ satisfies isometry when the mean of its singular values is 1
• As for $D^l$, it is isometric when the activation function behaves like the identity near the origin (e.g. tanh)
Dynamical Isometry
$$J = \frac{\partial x^L}{\partial h^0} = \prod_{l=1}^{L} D^l W^l$$
• Furthermore, when all singular values equal 1, dynamical isometry is satisfied
• This holds when the weights are orthogonal matrices → with orthogonal initialization, gradients neither vanish nor explode (see the sketch below)
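As a quick numerical illustration of this claim (my own sketch, not from the slides), the code below builds a deep tanh MLP with orthogonally initialized weights and checks that the singular values of the input-output Jacobian stay close to 1; the depth, width, and input scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 128

def orthogonal(n, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

weights = [orthogonal(width, rng) for _ in range(depth)]  # orthogonal init, no biases

x = 0.01 * rng.standard_normal(width)  # small input keeps tanh near its linear regime
J = np.eye(width)                      # accumulates J = prod_l D^l W^l
for W in weights:
    h = W @ x
    D = np.diag(1.0 - np.tanh(h) ** 2)  # tanh'(h) on the diagonal
    J = D @ W @ J
    x = np.tanh(h)

sv = np.linalg.svd(J, compute_uv=False)
print(f"singular values: min={sv.min():.3f}, mean={sv.mean():.3f}, max={sv.max():.3f}")
```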
Related work [1]
• CIFAR-10 classification with an MLP
• Orthogonal initialization + tanh converges faster than the alternatives
Related work [2]: The CNN case
• A deep CNN can also be trained without normalization if it is initialized to satisfy dynamical isometry
• Only the center of each convolutional kernel is orthogonally initialized; all remaining entries are set to 0
• Equivalently, a 1x1 conv is orthogonally initialized and zero-padded around it (see the sketch below)
• Viewed as one big matrix operation, the whole convolution then becomes an orthogonal matrix
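A minimal sketch of this "orthogonal center, zeros elsewhere" kernel (often called delta-orthogonal initialization), assuming an odd kernel size and equal input/output channel counts; the function name and sizes are my own choices.

```python
import numpy as np

def delta_orthogonal_kernel(channels, ksize, rng):
    """Kernel of shape (ksize, ksize, c_in, c_out) that is zero everywhere
    except the spatial center, which holds a random orthogonal matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((channels, channels)))
    kernel = np.zeros((ksize, ksize, channels, channels))
    kernel[ksize // 2, ksize // 2] = q
    return kernel

rng = np.random.default_rng(0)
k = delta_orthogonal_kernel(channels=64, ksize=3, rng=rng)
print(k.shape)                                       # (3, 3, 64, 64)
print(np.allclose(k[1, 1] @ k[1, 1].T, np.eye(64)))  # center slice is orthogonal
```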
Related work [2]: The CNN case
• Trained a 4,000-layer CNN on MNIST
• No normalization or skip connections
• Training is faster than with Gaussian initialization
Related work [2]: The CNN case
• Trained models of various depths on MNIST and CIFAR-10
• Training still works with the depth increased to 10,000 layers
• However, test accuracy drops on CIFAR-10 → suggests that normalization and skip connections contribute more to generalization than to stabilizing training
Related work [3]: ReZero
• Even when skip connections are used, initializing so as to satisfy dynamical isometry should improve performance further
$$x_{i+1} = x_i + \alpha_i F(x_i)$$
• Normally $\alpha_i = 1$; ReZero instead initializes $\alpha_i = 0$ and makes $\alpha_i$ a learnable parameter (see the sketch below)
• At initialization $x_{i+1} = x_i$, so dynamical isometry clearly holds
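A minimal PyTorch sketch of a ReZero-style residual block; the inner function F here is an arbitrary two-layer MLP chosen only for illustration.

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block x + alpha * F(x) with alpha initialized to 0 (ReZero)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable residual gate, starts at 0

    def forward(self, x):
        return x + self.alpha * self.f(x)

block = ReZeroBlock(dim=256, hidden=1024)
x = torch.randn(8, 256)
assert torch.allclose(block(x), x)  # the block is the identity map at initialization
```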
Related work [3]: ReZero
• Trained a 32-layer MLP on CIFAR-10
• Training becomes considerably faster even without normalization
Related work [3]: ReZero
• Trained a ResNet on CIFAR-10
• Training becomes faster and performance also improves
Related work [4]: The ReLU case
• With ReLU, dynamical isometry can be achieved by flipping the sign of part of an orthogonal weight matrix
• Intuitively, ReLU cuts every negative input signal to 0, so it suffices to flip signs in a way that cancels this out
Related work [5]: Rank collapse in Transformers
• For an attention-only Transformer without MLPs, skip connections, or LayerNorm, it can be shown theoretically that, already at initialization, the matrix computed by the whole model loses rank doubly exponentially with depth
• Suggests that a Transformer cannot be trained with attention alone (toy illustration below)
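As a toy illustration of the intuition only (not the paper's doubly-exponential analysis), multiplying random row-stochastic attention patterns quickly drives the product toward a rank-1 matrix; sequence length and depth here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 32  # sequence length

def random_attention(T, rng):
    # Row-stochastic matrix, like a softmax attention pattern at random initialization
    logits = rng.standard_normal((T, T))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

P = np.eye(T)
for depth in range(1, 9):
    P = random_attention(T, rng) @ P
    s = np.linalg.svd(P, compute_uv=False)
    print(f"depth {depth}: 2nd / 1st singular value = {s[1] / s[0]:.2e}")  # -> 0 as depth grows
```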
Deep Transformers without Shortcuts
• Can a Transformer also be trained without normalization or skip connections? → With some effort, yes
• Simply removing normalization and skip connections makes the gradients explode
• With the proposed method, this is largely suppressed
Deep Transformers without Shortcuts
$$\mathrm{Attn}(X) = A(X)\,V(X), \qquad A(X) = \mathrm{softmax}\!\left( M \circ \frac{1}{\sqrt{d_k}} Q(X)K(X)^\top - \Gamma(1 - M) \right)$$
where $\Gamma$ is a sufficiently large constant.
• The paper focuses on causal masked attention as used in GPT-style sequence models
• Masking with $M_{i,j} = \mathbb{1}_{i \ge j}$ prevents attention to future positions (see the sketch below)
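A small numpy sketch of this masked attention with the $-\Gamma(1-M)$ trick; the dimensions and the concrete value used for $\Gamma$ are arbitrary illustration choices.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V, gamma=1e9):
    T, d_k = Q.shape
    M = np.tril(np.ones((T, T)))                       # M_ij = 1 if i >= j, else 0
    logits = M * (Q @ K.T / np.sqrt(d_k)) - gamma * (1 - M)
    A = softmax(logits)                                # rows sum to 1; upper triangle ~ 0
    return A @ V

rng = np.random.default_rng(0)
T, d = 8, 16
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
print(causal_attention(Q, K, V).shape)  # (8, 16)
```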
Deep Transformers without Shortcuts
• First, consider an attention-only model without MLPs. The features at layer $L$ are
$$X_L = \left[A_L A_{L-1} \cdots A_1\right] X_0 W, \qquad W = \prod_{l=1}^{L} W_l^V W_l^O$$
• Writing $\Sigma_l = X_l X_l^\top$ and $\Pi_l = A_l A_{l-1} \cdots A_1$, when $W$ is orthogonal
$$\Sigma_l = \Pi_l \, \Sigma_0 \, \Pi_l^\top$$
Deep Transformers without Shortcuts
• With $\Sigma_l = X_l X_l^\top$ and $\Pi_l = A_l A_{l-1} \cdots A_1$ as above, $\Sigma_l = \Pi_l \, \Sigma_0 \, \Pi_l^\top$ when $W$ is orthogonal
• If $\Sigma_l$ stays close to the identity, the gradients remain stable → we want to design $A_l$ so that this happens
• However, $A_l$ is constrained to be a lower-triangular matrix with non-negative entries
Deep Transformers without Shortcuts
• Setting $A_l = L_l L_{l-1}^{-1}$, and assuming $L_0^{-1} \Sigma_0 L_0^{-\top} = I_T$ holds, we get
$$\Sigma_l = L_l L_l^\top$$
• This is exactly a Cholesky factorization → design a suitable target $\Sigma_l$, compute its Cholesky factor $L_l$, and an $A_l$ satisfying the conditions can be constructed (sketch below)
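A numerical sketch of this construction with a toy target of my own choosing (the targets actually used in the paper appear on the next slides): take Cholesky factors of consecutive targets and form $A_l = L_l L_{l-1}^{-1}$, which is automatically lower triangular and maps $\Sigma_{l-1}$ to $\Sigma_l$.

```python
import numpy as np

def attention_from_targets(sigma_prev, sigma_next):
    """Build A_l = L_l @ inv(L_{l-1}) from Cholesky factors of the target matrices."""
    L_prev = np.linalg.cholesky(sigma_prev)
    L_next = np.linalg.cholesky(sigma_next)
    return L_next @ np.linalg.inv(L_prev)

T = 6
sigma0 = np.eye(T)                                  # toy choice: start from the identity
sigma1 = 0.8 * np.eye(T) + 0.2 * np.ones((T, T))    # toy target with mild correlations

A1 = attention_from_targets(sigma0, sigma1)
print(np.allclose(np.triu(A1, k=1), 0))             # lower triangular
print(np.allclose(A1 @ sigma0 @ A1.T, sigma1))      # reproduces the target Sigma_1
```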
Deep Transformers without Shortcuts: U-SPA
$$\Sigma_l(\rho_l) = (1 - \rho_l)\, I_T + \rho_l \mathbf{1}\mathbf{1}^\top$$
• A matrix with 1 on the diagonal and $\rho_l$ everywhere else
• The conditions are satisfied if $0 \le \rho_0 \le \rho_1 \le \cdots \le \rho_L < 1$
• Rank collapse is also prevented
Deep Transformers without Shortcuts: E-SPA
$$\left[\Sigma_l(\gamma_l)\right]_{i,j} = \exp\left(-\gamma_l \, |i - j|\right)$$
• A matrix with 1 on the diagonal, whose off-diagonal entries are determined by the distance from the diagonal
• The conditions are satisfied if $\gamma_0 \ge \gamma_1 \ge \cdots \ge \gamma_L > 0$
• Rank collapse is also prevented (see the sketch below)
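Continuing the sketch above, both target families are easy to construct; the concrete $\rho$ and $\gamma$ values here are arbitrary and only need to respect the orderings stated on these slides. The resulting matrices can then be fed into attention_from_targets from the previous sketch.

```python
import numpy as np

def sigma_uspa(T, rho):
    # U-SPA target: ones on the diagonal, rho everywhere else
    return (1 - rho) * np.eye(T) + rho * np.ones((T, T))

def sigma_espa(T, gamma):
    # E-SPA target: Sigma_ij = exp(-gamma * |i - j|)
    idx = np.arange(T)
    return np.exp(-gamma * np.abs(idx[:, None] - idx[None, :]))

T = 6
print(sigma_uspa(T, rho=0.1))     # e.g. from a schedule 0 <= rho_0 <= ... <= rho_L < 1
print(sigma_espa(T, gamma=0.5))   # e.g. from a schedule gamma_0 >= ... >= gamma_L > 0
```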
Deep Transformers without Shortcuts: Redefining attention
• Decompose the $A$ constructed by working backwards from the $\Sigma$ above as $A = DP$
• $D$ is a positive diagonal matrix and $P$ is a lower-triangular matrix whose rows each sum to 1
• Setting $B = \log(P)$, attention is redefined as (sketch below)
$$\mathrm{Attn}(X) = D\,P(X)\,V(X), \qquad P(X) = \mathrm{softmax}\!\left( M \circ \left[ \frac{1}{\sqrt{d_k}} Q(X)K(X)^\top + B \right] - \Gamma(1 - M) \right)$$
• Initializing the weights $W^Q$ of $Q(X)$ to 0 makes $\Sigma$ take the desired form at initialization
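A rough numpy sketch of what this looks like at initialization, under my reading of the slide: with $W^Q = 0$ the query-key term vanishes, the masked softmax of $B = \log(P)$ returns $P$ itself, and the layer realizes exactly $A = DP$. Clipping $P$ before the log to avoid $\log 0$ is my own simplification, and the uniform causal target $A$ is a toy choice.

```python
import numpy as np

def decompose_DP(A):
    """Split a non-negative lower-triangular A into A = D @ P with row-stochastic P."""
    row_sums = A.sum(axis=1)
    return np.diag(row_sums), A / row_sums[:, None]

def modified_attention_at_init(A_target, V, gamma=1e9):
    T = A_target.shape[0]
    M = np.tril(np.ones((T, T)))
    D, P = decompose_DP(A_target)
    B = np.log(np.clip(P, 1e-30, None))            # B = log(P); clipping avoids log(0)
    logits = M * (0.0 + B) - gamma * (1 - M)       # W^Q = 0 at init, so Q(X)K(X)^T = 0
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    P_x = e / e.sum(axis=1, keepdims=True)
    return D @ P_x @ V, P_x

T, d = 6, 4
A_target = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]  # uniform causal attention
rng = np.random.default_rng(0)
out, P_x = modified_attention_at_init(A_target, rng.standard_normal((T, d)))
print(np.allclose(P_x, decompose_DP(A_target)[1], atol=1e-6))  # softmax recovers P at init
```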
Experiments: WikiText-103
• Trained a 36-layer Transformer
• Simply removing the skip connections makes training fail completely
• The proposed method does train properly
• However, training is considerably slower than the standard model with skip connections + LN
Experiments: The C4 dataset
• Trained a 32-layer Transformer
• With a longer training schedule, it reaches the performance of the skip + LN model
• This takes roughly 5x as long
• In Transformers, do skip connections and LN mainly contribute to speeding up training?
Experiments: The C4 dataset
• When skip connections are added, the proposed method beats the skip + LN baseline
• So skip connections are what matters in Transformers after all?
Summary
• For MLPs and CNNs, deep networks can be trained without normalization or skip connections if the weights are initialized to satisfy dynamical isometry
• This work showed that, with similarly careful initialization, a Transformer can likewise be trained without skips or LN
• However, training takes considerably longer
Impressions
• It does feel somewhat forced
• In the end, the point should simply be to make the attention map close to the identity at initialization
• It feels like there should be a simpler way to achieve this
• It is also not really clear where the slowdown in training comes from
References
[1] Pennington, Jeffrey, Samuel Schoenholz, and Surya Ganguli. "Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice." Advances in Neural Information Processing Systems 30 (2017).
[2] Xiao, Lechao, et al. "Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks." International Conference on Machine Learning. PMLR, 2018.
[3] Bachlechner, Thomas, et al. "ReZero is all you need: Fast convergence at large depth." Uncertainty in Artificial Intelligence. PMLR, 2021.
[4] Burkholz, Rebekka, and Alina Dubatovka. "Initialization of ReLUs for dynamical isometry." Advances in Neural Information Processing Systems 32 (2019).
[5] Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. "Attention is not all you need: Pure attention loses rank doubly exponentially with depth." International Conference on Machine Learning. PMLR, 2021.
[6] He, Bobby, et al. "Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation." The Eleventh International Conference on Learning Representations. 2023.