[DLHacks Implementation] The Statistical Recurrent Unit

September 28, 17

#deep learning #SRU #GRU #LSTM #Statistics #Sequence Model

Slide overview

Deep Learning JP:
http://deeplearning.jp/hacks/

Text of each slide

The Statistical Recurrent Unit
Akuzawa Kei
DLHacks, August 28, 2017

Contents
1. Bibliographic information
2. Introduction
3. Model
4. Experiments
5. Discussion
6. Implementation notes

Bibliographic information
- Authors: Junier B. Oliva, Barnabas Poczos, Jeff Schneider
- Conference: ICML 2017
- Reason for selection: a simple yet accurate sequence model. The comparison with LSTM and GRU is the interesting part.

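Introduction
Conventional models (GRU, LSTM)
- When sequences are long, training a sequence model risks vanishing gradients.
- Memory cells and gates let these models retain long-term dependencies.
Proposed model (SRU)
- The hidden layer holds only moving averages of statistics (no gates are needed).
- The moving averages are taken with several different weights.
- Intuitive advantage: combinations of moving averages can represent a variety of statistics of the past.
- Outperforms GRU and LSTM in many settings.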

Model: graph and equations
- $\mu = [\mu^{(\alpha_1)}; \mu^{(\alpha_2)}; \ldots; \mu^{(\alpha_m)}]$
- $f(\cdot)$ is ReLU.
- $\mu$ plays the role of the hidden layer.
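The equations themselves were part of the slide's figure and did not survive as text. As a rough sketch, one SRU step as I understand it from the paper computes recurrent statistics $r_t$ from $\mu_{t-1}$, new statistics $\varphi_t$ from $r_t$ and $x_t$, the moving averages $\mu_t^{(\alpha)}$, and an output $o_t$. The parameter names (`W_r`, `W_phi`, `W_x`, `W_o`), the layer sizes, and the random initialization below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sru_step(x_t, mu_prev, params, alphas):
    """One SRU step; mu_prev has shape (len(alphas), phi_dim)."""
    # Recurrent statistics r_t, computed from the previous moving averages.
    r_t = relu(params["W_r"] @ mu_prev.ravel() + params["b_r"])
    # New statistics phi_t, computed from r_t and the current input.
    phi_t = relu(params["W_phi"] @ r_t + params["W_x"] @ x_t + params["b_phi"])
    # Exponential moving average of phi_t at every scale alpha.
    a = alphas[:, None]
    mu_t = a * mu_prev + (1.0 - a) * phi_t[None, :]
    # Output features from the concatenated moving averages.
    o_t = relu(params["W_o"] @ mu_t.ravel() + params["b_o"])
    return mu_t, o_t

def init_params(x_dim, r_dim, phi_dim, out_dim, n_alphas, seed=0):
    """Small random weights; the scale 0.1 is an arbitrary assumption."""
    rng = np.random.default_rng(seed)
    mu_dim = n_alphas * phi_dim

    def w(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))

    return {
        "W_r": w(r_dim, mu_dim), "b_r": np.zeros(r_dim),
        "W_phi": w(phi_dim, r_dim), "W_x": w(phi_dim, x_dim),
        "b_phi": np.zeros(phi_dim),
        "W_o": w(out_dim, mu_dim), "b_o": np.zeros(out_dim),
    }

alphas = np.array([0.0, 0.5, 0.9, 0.99, 0.999])
params = init_params(x_dim=1, r_dim=4, phi_dim=8, out_dim=6, n_alphas=len(alphas))
mu = np.zeros((len(alphas), 8))
for x in np.linspace(0.0, 1.0, 10):
    mu, o = sru_step(np.array([x]), mu, params, alphas)
```

Note that the row for $\alpha = 0$ simply stores the latest $\varphi_t$, while rows with $\alpha$ close to 1 change slowly.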

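Model: interpretations
Data-driven statistics
- Think of $\mu^{(\alpha)}$ and $\varphi$ as statistics (roughly, variables that summarize the dataset).
- Unlike statistics fixed a priori, these are learned automatically from the data, which is preferable.
Multi-scaled statistics
$$\mu_t^{(\alpha)} = \alpha\,\mu_{t-1}^{(\alpha)} + (1-\alpha)\,\varphi_t = (1-\alpha)\left(\varphi_t + \alpha\,\varphi_{t-1} + \alpha^2\,\varphi_{t-2} + \cdots\right)$$
- From this expansion, the smaller $\alpha$ is, the more weight is placed on the most recent statistics.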
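The equivalence between the recursive update and the expanded weighted sum can be checked numerically. The sketch below uses scalar statistics and assumes $\mu_0 = 0$; the function names are hypothetical.

```python
def ema_recursive(phis, alpha):
    """Apply mu_t = alpha * mu_{t-1} + (1 - alpha) * phi_t, starting from mu_0 = 0."""
    mu = 0.0
    for phi in phis:
        mu = alpha * mu + (1.0 - alpha) * phi
    return mu

def ema_expanded(phis, alpha):
    """Evaluate (1 - alpha) * sum_k alpha^k * phi_{t-k} directly."""
    t = len(phis)
    return (1.0 - alpha) * sum(alpha ** k * phis[t - 1 - k] for k in range(t))

def lag_weights(alpha, n_lags):
    """Weight placed on phi_t, phi_{t-1}, ..., phi_{t-n_lags+1}."""
    return [(1.0 - alpha) * alpha ** k for k in range(n_lags)]

phis = [1.0, 2.0, 3.0, 4.0]
for alpha in (0.0, 0.5, 0.9):
    print(alpha, ema_recursive(phis, alpha), ema_expanded(phis, alpha))
    print("  lag weights:", lag_weights(alpha, 4))
```

With $\alpha = 0$ all weight sits on $\varphi_t$; as $\alpha$ grows, the weights spread further into the past.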

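Model: interpretations
Viewpoints of the past
- With suitable weights $w_j, w_k$, consider $w_j\,\mu^{(\alpha_j)} - w_k\,\mu^{(\alpha_k)}$.
- Such combinations let the model refer to various points in the past. Writing the averages without their $(1-\alpha)$ normalization:
$$(\varphi_t + 0.2\,\varphi_{t-1} + \cdots) - (\varphi_t + 0.1\,\varphi_{t-1} + \cdots) \approx 0.1\,\varphi_{t-1}$$
$$5\,(\varphi_t + 0.1\,\varphi_{t-1} + 0.1^2\,\varphi_{t-2} + \cdots) - 10\,(\varphi_t + 0.2\,\varphi_{t-1} + 0.2^2\,\varphi_{t-2} + \cdots) + 5\,(\varphi_t + 0.3\,\varphi_{t-1} + 0.3^2\,\varphi_{t-2} + \cdots) \approx c\,\varphi_{t-2}$$
- In the second line the weights sum to zero and also cancel the $\varphi_{t-1}$ terms, so the leading term is $\varphi_{t-2}$ with $c = 0.1$.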
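To see which past time step a weighted combination of moving averages emphasizes, one can tabulate the coefficient it places on each $\varphi_{t-k}$. The sketch below uses the unnormalized series $\varphi_t + \alpha\varphi_{t-1} + \alpha^2\varphi_{t-2} + \cdots$; the weights `[5, -10, 5]` are an illustrative choice that cancels the first two lags for $\alpha \in \{0.1, 0.2, 0.3\}$, and the function name is hypothetical.

```python
def combined_lag_coeffs(weights, alphas, n_lags):
    """Coefficient on phi_{t-k}, k = 0..n_lags-1, in
    sum_j w_j * (phi_t + a_j * phi_{t-1} + a_j**2 * phi_{t-2} + ...)."""
    return [
        sum(w * a ** k for w, a in zip(weights, alphas))
        for k in range(n_lags)
    ]

# Two averages: the phi_t terms cancel, leaving mostly phi_{t-1}.
first = combined_lag_coeffs([1.0, -1.0], [0.2, 0.1], 4)

# Three averages: phi_t and phi_{t-1} both cancel, leaving mostly phi_{t-2}.
second = combined_lag_coeffs([5.0, -10.0, 5.0], [0.1, 0.2, 0.3], 4)

print([round(c, 4) for c in first])
print([round(c, 4) for c in second])
```

The remaining coefficients on older lags are small but not zero, which is why the slide writes an approximation rather than an equality.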

Model: interpretations
Vanishing gradients
Two devices help avoid vanishing gradients:
1. ReLU activations
2. Controlling BPTT through $\alpha$
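One way to read the second point: along the direct path through the moving average, the derivative of $\mu_t^{(\alpha)}$ with respect to $\mu_{t-k}^{(\alpha)}$ carries a factor $\alpha^k$, so $\alpha$ close to 1 keeps that path alive over long spans. The sketch below only evaluates this factor and ignores the additional path through $r_t$ and $\varphi_t$; the span of 784 steps matches the pixel-by-pixel MNIST setting.

```python
def direct_path_factor(alpha, k):
    """Factor alpha**k on the direct path from mu_{t-k} to mu_t."""
    return alpha ** k

span = 784  # one MNIST image read pixel by pixel
factors = {alpha: direct_path_factor(alpha, span)
           for alpha in (0.0, 0.5, 0.9, 0.99, 0.999)}
for alpha, value in factors.items():
    print(f"alpha={alpha}: alpha**{span} = {value:.3e}")
```

Only the largest scales retain a non-negligible factor over the whole sequence, which suggests why a mix of scales including values near 1 is useful.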

Experiments: MNIST
- Each 28x28 image is treated as a sequence $x_1, x_2, \ldots, x_{784}$ and classified.
- Hyperparameters are tuned with hyperopt (Bayesian optimization).
- SRU outperforms GRU and LSTM.
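A minimal sketch of the pixel-by-pixel framing, assuming a row-major scan order (the slides do not state the order) and using a synthetic array in place of a real MNIST image:

```python
import numpy as np

def image_to_pixel_sequence(image):
    """Flatten a 28x28 image into 784 time steps with one feature each."""
    image = np.asarray(image, dtype=np.float32)
    if image.shape != (28, 28):
        raise ValueError(f"expected a 28x28 image, got {image.shape}")
    # Row-major order: left to right within a row, rows from top to bottom.
    return image.reshape(-1, 1)

fake_image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)
sequence = image_to_pixel_sequence(fake_image)
print(sequence.shape)
```

Each time step carries a single pixel, so the model must carry information across hundreds of steps to classify the digit.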

Experiments: MNIST
- Results when the scale set $A = \{0, 0.5, 0.9, 0.99, 0.999\}$ is varied.
- Performance is sensitive to changes in $A$.

Experiments: MNIST
- iid: rdims $= 0$ and $A = \{0.99\}$
- recur: $A = \{0.99\}$
- multi: rdims $= 0$
- These results show that both the recurrent statistics ($r$) and the multi-scaled statistics (multiple values of $\alpha$) are needed.

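Paper summary
- Introduces statistics that retain sequence information.
- Multiple values of $\alpha$ make it possible to refer to various points in the past.
- With these devices, long-term dependencies are handled well.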

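Implementation notes: the mu update
- Goal: apply the $\mu$ update for all values of $\alpha$ at once.
$$\mu_t^{(\alpha)} = \alpha\,\mu_{t-1}^{(\alpha)} + (1-\alpha)\,\varphi_t$$
$$\mu_t = [\mu_t^{(0)}; \mu_t^{(0.5)}; \mu_t^{(0.9)}; \mu_t^{(0.99)}; \mu_t^{(0.999)}] = (A \otimes I_\varphi) \odot \mu_{t-1} + ((1 - A) \otimes I_\varphi) \odot (I_A \otimes \varphi_t)$$
- Here $\otimes$ is the Kronecker product and $\odot$ the element-wise product; $I_\varphi$ and $I_A$ are read as all-ones vectors of length $\dim(\varphi)$ and $|A|$.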
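A sketch of this simultaneous update with NumPy, compared against a plain loop over $\alpha$. It assumes the second term uses $(1 - A)$ so that it matches the per-$\alpha$ update, and that $I_\varphi$ and $I_A$ are all-ones vectors; the function names are hypothetical.

```python
import numpy as np

def mu_update_loop(mu_prev, phi_t, alphas):
    """Reference version: update each alpha block separately."""
    phi_dim = phi_t.shape[0]
    out = np.empty_like(mu_prev)
    for j, a in enumerate(alphas):
        block = slice(j * phi_dim, (j + 1) * phi_dim)
        out[block] = a * mu_prev[block] + (1.0 - a) * phi_t
    return out

def mu_update_kron(mu_prev, phi_t, alphas):
    """Vectorized version using Kronecker products with all-ones vectors."""
    alphas = np.asarray(alphas, dtype=mu_prev.dtype)
    ones_phi = np.ones(phi_t.shape[0], dtype=mu_prev.dtype)
    ones_a = np.ones(alphas.shape[0], dtype=mu_prev.dtype)
    decay = np.kron(alphas, ones_phi)        # A (x) 1_phi
    gain = np.kron(1.0 - alphas, ones_phi)   # (1 - A) (x) 1_phi
    tiled_phi = np.kron(ones_a, phi_t)       # 1_A (x) phi_t
    return decay * mu_prev + gain * tiled_phi

alphas = [0.0, 0.5, 0.9, 0.99, 0.999]
rng = np.random.default_rng(0)
phi_t = rng.random(3)
mu_prev = rng.random(len(alphas) * 3)
same = np.allclose(mu_update_loop(mu_prev, phi_t, alphas),
                   mu_update_kron(mu_prev, phi_t, alphas))
print(same)
```

Since `decay` and `gain` depend only on $A$ and the size of $\varphi$, they can be computed once and reused at every time step.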

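Implementation notes: parameter tuning
- Tuning was done with hyperopt.
- 30 parameter settings were tried, each with a 50-epoch run (the paper uses 100 trials of 10k iterations each; its batch size is not stated).
- The main purpose here was to practice using hyperopt, so some parameters were fixed to the values reported in the paper.
- The best parameters found were then trained for 200 epochs.
- With sequences of length 784, vanishing and exploding gradients occur easily, so a few measures are needed:
1. Use a large forget-gate bias (specific to GRU and LSTM).
2. Add gradient clipping.
3. RNNs are very slow to train, so stop a run early when the cost explodes or no learning progress is visible.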
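Measures 2 and 3 can be sketched independently of any deep learning framework. The version below clips by the global gradient norm and stops a run when the loss is non-finite, exceeds a threshold, or has not improved recently. The clipping variant, the thresholds, and the function names are assumptions for illustration; the slides do not say which settings were used.

```python
import math
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total <= max_norm or total == 0.0:
        return [g.copy() for g in grads], total
    scale = max_norm / total
    return [g * scale for g in grads], total

def should_stop_early(losses, patience=5, blowup=1e3):
    """Stop when the loss diverges or shows no improvement for `patience` epochs."""
    last = losses[-1]
    if not math.isfinite(last) or last > blowup:
        return True
    if len(losses) > patience:
        best_before = min(losses[:-patience])
        if min(losses[-patience:]) >= best_before:
            return True
    return False

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(float(np.sum(g * g)) for g in clipped))
print(norm_before, norm_after)
print(should_stop_early([2.3, 2.1, 1.9]), should_stop_early([2.3, float("nan")]))
```

Stopping unpromising runs early matters most during the hyperparameter search, where each of the 30 trials would otherwise run for its full 50 epochs.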

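Experimental results
- SRU: 95.6, GRU: 98.4, LSTM: 97.8
- Training does not appear to have converged yet, so SRU may still come out ahead (apologies that the experiments did not finish in time).
- SRU does seem able to reach reasonable accuracy, so whether it catches on will likely depend on which tasks it suits and how hard it is to tune.
- Advantage: weight initialization is simpler than for GRU and LSTM.
- Disadvantage: there are many hyperparameters, such as phi-size, r-size, out-size, and A.
- Lesson learned: parameter tuning takes a very long time, so preparation should have started earlier.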

References
- Oliva, J. B., Poczos, B., and Schneider, J. The Statistical Recurrent Unit. ICML 2017. (Source of the figures, except on slide 13.)
- Le, Q. V., Jaitly, N., and Hinton, G. E. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. (Origin of the pixel-by-pixel sequence MNIST task.)