[DL Reading Group] Training RNNs as Fast as CNNs

October 2, 2017

Slide overview

2017/10/2
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Deep Learning JP

@DeepLearning2023

Text of each slide

DEEP LEARNING JP [DL Papers] “Training RNNs as Fast as CNNs” Hiroki Kurotaki, Matsuo Lab http://deeplearning.jp/

Contents
• Overview
• 1 Introduction
• 2 Method
• 3 Related Works
• 4 Experiments
• 5 Conclusion

Bibliographic information
• Training RNNs as Fast as CNNs
• Tao Lei, Yu Zhang
• 9/12/2017 (v1: 9/8/2017)
• https://arxiv.org/abs/1709.02755v2
• https://github.com/taolei87/sru
• #2 in Arxiv Sanity's "top hype" for the last month (329 tweets)
• The first author has had papers accepted at ICML, EMNLP, and other venues
  – Deriving neural architectures from sequence and graph kernels

Proposed method
• The RNN gates take no information from the previous time step
  – This allows extensive parallelization
• 5-10x faster than a cuDNN-optimized LSTM
• Open-sourced for PyTorch and CNTK

Main results
• Comparison of average running time
• 10x faster than the cuDNN LSTM implementation

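1 Introduction
• Running time is a major obstacle in deep learning research and development
• LSTMs cannot take full advantage of parallelization
  – Because h_t depends on h_{t-1}, the time steps cannot be computed in parallel
• The paper proposes the Simple Recurrent Unit (SRU), which cuts this dependency term
• An implementation optimized at the CUDA level has been released
  – It reaches speed comparable to conv2d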

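2.1 SRU implementation
• Base: LSTM plus two commonly used techniques
  – Highway connection
    • This is the h'_t equation; r_t is what is called the reset gate
  – Variational dropout
    • A mask on the input x_t that does not change over time
• Details
  – Forget gate: the input gate is set to i = 1 - f
  – Standard dropout is applied to h_t
  – g(·) is the activation function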
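The equation image this slide refers to is not part of the extracted text. As a reference sketch, assuming the formulation in the paper (written from memory of the arXiv version, so numbering and notation may differ in detail), the SRU equations are roughly:

```latex
\begin{align}
\tilde{x}_t &= W x_t \tag{3} \\
f_t &= \sigma(W_f x_t + b_f) \tag{4} \\
r_t &= \sigma(W_r x_t + b_r) \tag{5} \\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{x}_t \tag{6} \\
h_t &= r_t \odot g(c_t) + (1 - r_t) \odot x_t \tag{7}
\end{align}
```

Under this reading, (6) is the cell update with i = 1 - f, and (7) is the highway connection controlled by the reset gate r_t. Dropout is omitted here.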

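2.2 Speeding-up the recurrence
• The conventional bottleneck
  – Each dimension of the hidden state refers to the other dimensions, which prevents parallelization
  – h_t cannot be computed until all of h_{t-1} has been computed
• Proposal: cut the references to time t-1 from the gates
  – The only remaining bottleneck is the matrix computations in (3)-(5)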
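A minimal plain-Python sketch of this idea, assuming the SRU form in which the gates depend only on x_t. The function names, the toy weights, and the use of tanh for g are illustrative assumptions, not the authors' code. Every matrix product is computed up front for all time steps, and only a cheap element-wise loop over time remains.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def precompute_gates(xs, W, W_f, b_f, W_r, b_r):
    """All matrix products, for every time step.
    No iteration depends on another, so they could run in any order or in parallel."""
    pre = []
    for x in xs:
        x_tilde = matvec(W, x)
        f = [sigmoid(z + b) for z, b in zip(matvec(W_f, x), b_f)]
        r = [sigmoid(z + b) for z, b in zip(matvec(W_r, x), b_r)]
        pre.append((x_tilde, f, r))
    return pre

def sru_recurrence(xs, pre, c0, g=math.tanh):
    """The only sequential part: element-wise updates of the cell state c_t."""
    c, hs = list(c0), []
    for x, (x_tilde, f, r) in zip(xs, pre):
        c = [fj * cj + (1.0 - fj) * xj for fj, cj, xj in zip(f, c, x_tilde)]
        hs.append([rj * g(cj) + (1.0 - rj) * xj for rj, cj, xj in zip(r, c, x)])
    return hs, c

# Toy 2-dimensional example; all values are made up for illustration.
W = [[0.5, -0.1], [0.2, 0.3]]
W_f = [[0.1, 0.0], [0.0, 0.1]]
W_r = [[0.0, 0.2], [0.2, 0.0]]
b_f = [0.0, 0.0]
b_r = [0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pre = precompute_gates(xs, W, W_f, b_f, W_r, b_r)
hs, c_last = sru_recurrence(xs, pre, [0.0, 0.0])
```

In an LSTM the gate computation would need h_{t-1} as an input, so `precompute_gates` could not be separated from the time loop in this way.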

2.3 CUDA level optimization
• The matrix operations in equations (3)-(5) are combined into one
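One way to picture this fusion, as a sketch only: stack the three weight matrices and multiply once for all time steps, then split the result. The names `matmul` and `fused_projection` are hypothetical, and the plain-Python product stands in for a single GPU matrix multiplication; biases and the sigmoid would be applied element-wise afterwards.

```python
def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def fused_projection(X, W, W_f, W_r):
    """X is (T x d) with one row per time step; each weight matrix is (d x d).
    Returns the three (T x d) projections W x_t, W_f x_t, W_r x_t."""
    d = len(W)
    W_all = W + W_f + W_r                        # stacked rows: (3d x d)
    W_all_T = [list(col) for col in zip(*W_all)] # transpose: (d x 3d)
    U = matmul(X, W_all_T)                       # one product: (T x 3d)
    return ([u[:d] for u in U],
            [u[d:2 * d] for u in U],
            [u[2 * d:] for u in U])

# Toy values, made up for illustration.
X = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
W_f = [[2.0, 0.0], [0.0, 2.0]]
W_r = [[0.0, 1.0], [1.0, 0.0]]
X_tilde, F_pre, R_pre = fused_projection(X, W, W_f, W_r)
```

A single large product tends to use the GPU better than three small ones, which is the motivation the slide gives for combining them.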

2.3 CUDA level optimization
• The computation can then be parallelized
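The following sketch illustrates why the remaining recurrence parallelizes: once the gates are known, the update of c_t for hidden dimension j uses only dimension j. The function names are hypothetical, and mapping each column to a GPU thread is an assumption made for illustration; the authors' CUDA kernel is not reproduced here.

```python
def recur_one_dim(f_j, x_tilde_j, c0_j):
    """Scan over time for a single hidden dimension j."""
    c, out = c0_j, []
    for f, xt in zip(f_j, x_tilde_j):
        c = f * c + (1.0 - f) * xt
        out.append(c)
    return out

def recur_all_dims(F, X_tilde, c0):
    """F and X_tilde are (T x d). Each column is processed on its own,
    so the d scans could run concurrently."""
    cols = [
        recur_one_dim([row[j] for row in F], [row[j] for row in X_tilde], c0[j])
        for j in range(len(c0))
    ]
    return [list(step) for step in zip(*cols)]  # back to (T x d)

# Toy values, made up for illustration.
F = [[0.5, 0.0], [0.5, 1.0]]
X_tilde = [[2.0, 3.0], [4.0, 5.0]]
C = recur_all_dims(F, X_tilde, [0.0, 1.0])
```

The same independence holds across examples in a batch, so the amount of parallel work grows with batch size times hidden size.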

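3 Related Work
• Making sequence processing more efficient
  – Recurrent convolution (RCNN) (Lei et al., 2015, 2016)
  – Kernel network (KNN) (Lei et al., 2017)
  – Quasi-RNN (Bradbury et al., 2017)
• Whether cutting the dependency reduces representational power
  – A study of the capacity of simplified RNNs (Balduzzi and Ghifary, 2016)
  – SRU and word-level CNNs embed a sequence similarity function into the hidden space (Lei et al., 2017)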

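4 Experiments
• The proposed SRU is compared with prior work and with the cuDNN LSTM implementation
• A version of SRU with additional stacked layers achieved good accuracy and speed
• All implementations use PyTorch except 4.5, which uses CNTK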

4.1 Classification
• Datasets
  – Movie reviews (MR) (Pang and Lee, 2005)
  – Subjectivity data (SUBJ) (Pang and Lee, 2004)
  – Customer reviews (CR) (Hu and Liu, 2004)
  – TREC questions (Li and Roth, 2002)
  – Opinion polarity from MPQA data (Wiebe et al., 2005)
  – Stanford sentiment treebank (SST) (Socher et al., 2013)
• Model and setup
  – 2 layers, 128 hidden dimensions
    • 4 layers for the SST dataset
  – Also compared against a CNN
    • (Convolutional neural networks for sentence classification)

4.1 Classification
• Good accuracy and speed were obtained

4.1 Classification (continued)

4.2 Question answering
• Dataset
  – Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)
    • 100,000 QA pairs from Wikipedia
• Model and setup
  – Document Reader model (Chen et al., 2017)
    • An LSTM version and an SRU version were built and compared
  – 50 epochs, batch size 32, 128 hidden dimensions
  – Dropout: 0.5 on the input, 0.2 for SRU, 0.3 for LSTM

4.2 Question answering
• LSTM: 69.6% exact match, 78.9% F1 score
• SRU: 70.3% exact match, 79.5% F1 score
• 6-10x speedup
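For readers unfamiliar with the two metrics, here is a simplified sketch of how exact match and token-level F1 are commonly computed for a single predicted answer. It assumes lowercasing and whitespace tokenization only; the official SQuAD evaluation script also normalizes punctuation and articles and takes the maximum over several reference answers, which is not done here. The function names are illustrative.

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the two answers are identical after trimming and lowercasing."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall between two answers."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, with counts
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the simple recurrent unit", "simple recurrent unit")
f1 = token_f1("the simple recurrent unit", "simple recurrent unit")
```

A prediction with one extra word scores zero on exact match but still gets partial credit under F1, which is why the F1 numbers on the slide are higher than the match numbers.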

4.3 Language modeling
• Dataset
  – Penn Treebank corpus (PTB)
    • 1M tokens, 10k-word vocabulary
    • Trained with truncated BPTT
• Model and setup
  – Truncated BPTT over 35 steps, batch size 32, dropout 0.75
  – Trained for 300 epochs
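A rough sketch of how a token stream is typically cut up for truncated BPTT, assuming the 35 on this slide is the window length in time steps and 32 is the number of parallel streams. The helper names `batchify` and `bptt_windows` are hypothetical and the token list is toy data.

```python
def batchify(tokens, batch_size):
    """Split one long token list into batch_size equal-length streams,
    dropping any leftover tokens at the end."""
    n = len(tokens) // batch_size
    return [tokens[i * n:(i + 1) * n] for i in range(batch_size)]

def bptt_windows(streams, bptt):
    """Yield (inputs, targets) windows of at most bptt steps.
    Targets are the inputs shifted by one token; gradients would be
    cut at each window boundary."""
    n = len(streams[0])
    for start in range(0, n - 1, bptt):
        length = min(bptt, n - 1 - start)
        inputs = [s[start:start + length] for s in streams]
        targets = [s[start + 1:start + 1 + length] for s in streams]
        yield inputs, targets

# Toy data: 10 token ids, 2 streams, windows of 3 steps.
streams = batchify(list(range(10)), batch_size=2)
windows = list(bptt_windows(streams, bptt=3))
```

With the settings on the slide, the call would use `batch_size=32` and `bptt=35`.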

4.3 Language modeling
• SRU achieves better perplexity than prior work and the cuDNN LSTM
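As a reminder of the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch, where `token_probs` is a hypothetical list of the probabilities a model assigned to each correct next token:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always gives the correct token probability 1/4
# is as uncertain as a uniform choice among 4 words.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```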

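4.4 Machine translation
• Dataset
  – WMT’14 English→German translation
  – 4M translation pairs
• Model and preprocessing
  – The OpenNMT translation system was extended to support SRU
  – seq2seq w/ attention
    • h_{t-1} is not added to the input of the next time step, because that would prevent parallelization
  – 15 epochs, batch size 64, word embeddings size 500
  – The dropout rate was lowered to 0.1, smaller than commonly used values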

4.4 Machine translation
• SRU outperforms the original paper's results in BLEU score

4.5 Speech recognition
• Dataset
  – Switchboard-1 corpus (Godfrey et al., 1992)
    • 4,870 conversations (300 hours), 520 speakers
• Model and setup
  – Uses MFCC features and Kaldi
  – Implemented with the Computational Network Toolkit (CNTK)

4.5 Speech recognition
• State-of-the-art results

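5 Conclusion
• The paper proposes the Simple Recurrent Unit (SRU)
  – The h_{t-1} terms are cut from the gates
• Performance was confirmed on five tasks
• Up to 10x speedup over existing implementations such as the cuDNN LSTM
  – Accuracy also improved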
