[DL Reading Group] Data-Efficient Reinforcement Learning with Self-Predictive Representations

April 12, 2021

#@deep learning jp #Deep Learning #Reinforcement Learning #Self-Predictive Representations #Data Efficiency #Matsuo Lab

Slide overview

2021/04/09
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Deep Learning JP

@DeepLearning2023

Text of each slide

DEEP LEARNING JP [DL Papers]
Data-Efficient Reinforcement Learning with Self-Predictive Representations
Xin Zhang, Matsuo Lab
http://deeplearning.jp/

Contents
1. Bibliographic information
2. Introduction
3. Self-Predictive Representation
4. Related Works
5. Experiment Evaluation
6. Discussion

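Bibliographic information
● Title:
  ○ Data-Efficient Reinforcement Learning with Self-Predictive Representations
● Authors:
  ○ Max Schwarzer, Ankesh Anand, Rishab Goel, R Devon Hjelm, Aaron Courville, Philip Bachman
● Affiliations: Mila, Université de Montréal, Microsoft Research
● Submitted: 2020/7/12 (arXiv), ICLR 2021 Spotlight (7776)
● Summary:
  ○ To improve the sample efficiency of reinforcement learning, representation learning is done in a self-supervised way.
  ○ A state-prediction dynamics model is learned that predicts the state k steps ahead.
  ○ The prediction is made in the latent space of states, which lowers the complexity of the problem.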

https://arxiv.org/abs/2007.05929

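Introduction
The sample-efficiency problem in reinforcement learning
- Atari games: 10 to 50 years. OpenAI Five: 45,000 years of experience.
- This is not acceptable in the real world, so sample efficiency has to be improved.
- In CV and NLP, self-supervised representation learning is effective and has produced strong results.
- Representation learning is also effective in reinforcement learning and has been studied for some time.
  - State representation learning for reinforcement learning (Matsushima-san, DL Reading Group)
Can we learn a state representation from which future states can be predicted?
- in a self-supervised way..
- in a way that can use data augmentation..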

https://www.slideshare.net/DeepLearningJP2016/dl-124128933

Self-Predictive Representations (SPR)
A state representation trained so that the representation K steps ahead can be predicted.
1. Online encoder and target encoder
2. Transition Model
3. Projection Heads
4. Prediction Loss

Self-Predictive Representations (SPR): 1. Online encoder and target encoder
The target encoder uses an EMA (exponential moving average) of the online encoder.
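
The EMA update of a target encoder can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ema_update`, the coefficient name `tau`, the convention that `tau` weights the old target parameters, and the toy parameter dictionaries are all assumptions made for this example.

```python
import numpy as np

def ema_update(target_params, online_params, tau):
    """Return new target parameters: tau * target + (1 - tau) * online.

    Assumed convention: tau close to 1 keeps the target almost fixed,
    tau = 0 makes the target an exact copy of the online parameters.
    """
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

# Toy parameters standing in for encoder weights (hypothetical shapes).
online = {"w": np.ones((2, 2)), "b": np.full(2, 3.0)}
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

slow = ema_update(target, online, tau=0.99)  # target moves 1% toward online
copy = ema_update(target, online, tau=0.0)   # target becomes a copy of online
```

Under this convention the only difference between a slowly moving target and a target that simply mirrors the online encoder is the value of `tau`.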

Self-Predictive Representations (SPR): 2. Transition Model
The state representations for K steps are predicted one step at a time.
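
A step-by-step latent rollout can be sketched roughly as below. Everything here is a toy stand-in: the transition model is a single random linear layer with `tanh`, and the names `transition`, `rollout`, `LATENT_DIM`, and `NUM_ACTIONS` are hypothetical. The talk does not specify the architecture of the transition model, so this only illustrates the iteration pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, NUM_ACTIONS, K = 8, 4, 5  # illustrative sizes only

# Toy transition weights: map [latent, one-hot action] to the next latent.
W = rng.normal(scale=0.1, size=(LATENT_DIM + NUM_ACTIONS, LATENT_DIM))

def transition(z, action):
    """Predict the next latent from the current latent and an action index."""
    one_hot = np.eye(NUM_ACTIONS)[action]
    return np.tanh(np.concatenate([z, one_hot]) @ W)

def rollout(z0, actions):
    """Apply the transition model one step at a time and collect every prediction."""
    preds, z = [], z0
    for a in actions:
        z = transition(z, a)  # each step consumes the previous prediction
        preds.append(z)
    return np.stack(preds)

z0 = rng.normal(size=LATENT_DIM)                # latent of the current observation
actions = rng.integers(0, NUM_ACTIONS, size=K)  # the K actions that were taken
predicted_latents = rollout(z0, actions)        # shape (K, LATENT_DIM)
```

The point of the sketch is that only the first latent comes from an encoder; every later latent is produced by feeding the model its own previous output.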

Self-Predictive Representations (SPR): 3. Projection Heads
The projection compresses the representation to a lower dimension; the prediction head then makes a further prediction on top of it.
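
The projection and prediction steps can be sketched as two small maps applied in sequence. This is an assumption-laden toy: both heads are plain linear layers with random weights, the sizes are made up, and applying the prediction head only on the online side (with a separate target-side projection) is an assumption borrowed from BYOL-style setups rather than something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
K, LATENT_DIM, PROJ_DIM = 5, 64, 16  # hypothetical sizes

# Hypothetical linear heads; real heads would be trained networks.
W_proj_online = rng.normal(scale=0.1, size=(LATENT_DIM, PROJ_DIM))
W_proj_target = W_proj_online.copy()  # stand-in for a target-side projection
W_pred = rng.normal(scale=0.1, size=(PROJ_DIM, PROJ_DIM))

def project(z, weights):
    """Compress latents to a lower-dimensional space."""
    return z @ weights

def predict(y, weights):
    """Extra prediction step applied on top of the projection."""
    return y @ weights

z_predicted = rng.normal(size=(K, LATENT_DIM))  # latents from the transition model
z_target = rng.normal(size=(K, LATENT_DIM))     # latents from the target encoder

y_online = predict(project(z_predicted, W_proj_online), W_pred)  # shape (K, PROJ_DIM)
y_target = project(z_target, W_proj_target)                      # shape (K, PROJ_DIM)
```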

Self-Predictive Representations (SPR): 4. Prediction Loss
A cosine similarity loss is taken at each step.
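
A per-step cosine similarity loss can be written in a few lines. The sketch below assumes the loss is the negative cosine similarity summed over the K steps; the function name `negative_cosine_loss` and the small `eps` term for numerical safety are choices made for this example.

```python
import numpy as np

def negative_cosine_loss(pred, target, eps=1e-8):
    """Sum over steps of the negative cosine similarity between pred and target.

    pred, target: arrays of shape (K, D), one row per prediction step.
    """
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cos = np.sum(pred_n * target_n, axis=-1)  # cosine similarity per step
    return -np.sum(cos)

pred = np.array([[1.0, 0.0], [0.0, 2.0]])     # K = 2 toy predictions
orth = np.array([[0.0, 1.0], [1.0, 0.0]])     # orthogonal to pred at every step

loss_same = negative_cosine_loss(pred, pred)      # perfectly aligned
loss_opposite = negative_cosine_loss(pred, -pred) # pointing the opposite way
loss_orth = negative_cosine_loss(pred, orth)      # no alignment
```

Because both vectors are normalized first, the loss depends only on direction, not on the scale of the representations.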

Self-Predictive Representations.

Related Works
● Data-Efficient RL
  ○ SimPLe: pixel-level transition model.
  ○ Data-Efficient Rainbow (DER) and OTRainbow
  ○ Latent-space models learned with a reconstruction loss
  ○ DrQ, RAD: with image augmentation, these outperform many model-based methods
  ○ Data augmentation is effective for improving generalization in multi-task and transfer learning
SPR's approach can make even more effective use of data augmentation.

Related Works
● Representation Learning in RL:
  ○ CURL: image augmentation + contrastive loss.
    ■ Is it the image augmentation that matters more? (per RAD)
  ○ CPC, ST-DIM, DRIML: temporal contrastive losses.
  ○ DeepMDP: trains a transition model with an L2 loss.
    ■ Uses the online encoder to produce the prediction targets; prone to representational collapse.
    ■ Adds an observation reconstruction objective.
  ○ PBL: directly predicts representations of future states.
    ■ Two target networks. Focus on multi-task generalization. Uses 100 times as much data as SPR.
SPR is self-supervised, is trained in latent space, and uses a normalized loss, a target encoder, and augmentations.

Experiments
Atari human-normalized scores: a metric that evaluates agents with the human score set to 1.0.
SPR is SOTA even without data augmentation. (* denotes data augmentation. 100k steps, or 400k frames, per game.)
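
The human-normalized score can be illustrated with a short calculation. The sketch below assumes the commonly used definition in which a random policy maps to 0.0 and the human reference maps to 1.0; the raw scores are made-up numbers for illustration, not results from the paper.

```python
def human_normalized_score(agent, random, human):
    """Rescale a raw game score so that random play is 0.0 and human play is 1.0."""
    return (agent - random) / (human - random)

# Hypothetical raw scores for a single game (not taken from the paper).
random_score, human_score, agent_score = 100.0, 1100.0, 600.0

hns = human_normalized_score(agent_score, random_score, human_score)
print(hns)  # 0.5: the agent closes half of the gap between random and human play
```

Normalizing per game makes scores comparable across games with very different raw score ranges, so they can be aggregated with a mean or a median.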

Experiments
SimPLe also looks good, but the picture is clearer when looking at the distribution of results. SPR is SOTA.

Experiments
Dynamics modeling consistently improves performance.

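Discussion
Observations
- The target encoder is important.
  - With data augmentation, τ = 0: the two encoders are trained in parallel.
  - Without augmentation, τ = 0.99: the target encoder is almost fixed.
- Dynamics modeling is important, K = 5.
- SPR works better than the currently popular contrastive losses.
Future directions
- Judging from CV and NLP, the pattern of pretraining on large datasets and then fine-tuning may come to RL as well.
- Use the model learned with SPR for model-based learning.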

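Impressions
- I read this paper because I found the approach of learning a model with self-supervised learning to tackle the sample-efficiency problem interesting.
- There is more prior work than I expected, so how does one establish novelty?
- I think the combination of self-supervised and model-based learning has high potential.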

References
- https://zhuanlan.zhihu.com/p/164842371
- https://arxiv.org/pdf/2006.07733.pdf