[DL輪読会]Learning to Adapt: Meta-Learning for Model-Based Control

>100 Views

May 11, 18

#deep learning #Deep Learning #Meta-Learning #Model-Based RL #Adaptive Control #Reinforcement Learning

スライド概要

2018/05/11
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Deep Learning JP

@DeepLearning2023

スライド一覧

DL輪読会資料

またはPlayer版

埋め込む »CMSなどでJSが使えない場合

（ダウンロード不可）

関連スライド

【DL輪読会】KAN: Kolmogorov–Arnold Networks

Deep Learning JP 83.3K

【DL輪読会】Evolutionary Optimization of Model Merging Recipes モデルマージの進化的最適化

Deep Learning JP 59.3K

【拡散モデル勉強会】拡散モデルの数理

Deep Learning JP 53K

【拡散モデル勉強会】Introduction to Diffusion Models

Deep Learning JP 38.3K

【拡散モデル勉強会】拡散モデルのサンプラーまとめ

Deep Learning JP 33.3K

【DL輪読会】Cosmos World Foundation Model Platform for Physical AI

Deep Learning JP 30.6K

各ページのテキスト

DEEP LEARNING JP “Learning to Adapt: Meta-Learning for Model-Based Control", Ignasi Clavera, Anusha Nagabandi, Ronald S. Fearing, Pieter Abbeel, [DL Papers] Sergey Levine, Chelsea Finn Presentater: Kei Akuzawa 1 http://deeplearning.jp/

http://deeplearning.jp/

書誌情報 • 投稿先: arxiv, 2018/03 • プロジェクトページ: https://sites.google.com/berkeley.edu/metaadaptivecontrol • 著者: Ignasi Clavera, Anusha Nagabandi, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, Chelsea Finn • 選定理由: • メタ学習への興味 • 実環境で動くエージェントを作るためにオンラインで適応させるのは筋が良いように思えた

https://sites.google.com/berkeley.edu/metaadaptivecontrol

概要 • モチベーション: 深層強化学習(DRL)において，環境にオンライン適応するエージェントを育てたい • なぜオンライン適応が必要か • テスト環境が訓練環境と違う • テストエピソード中に環境に急激な変化が生じる • 人間はオンラインで適応しているっぽい • 未知の重さの物体を持ち上げる • 雪の上を歩く • 提案手法: • DRL + オンライン適応では，サンプル効率が問題となる • メタ学習を用いて，環境のモデルを効率的に適応させる

イントロ • 実環境では急激な環境の変化が起きる • ロボットが一部故障 • 坂道での勾配の変化

イントロ • 貢献: • こうした変化に素早く適応するために，二つの手法を組み合わせた • Model-Based RL • 環境のモデル: 𝑠Ƹ 𝑡 = 𝑓𝜃 (𝑠 𝑡 , 𝑎𝑡 ) • 教師信号𝑠 𝑡+1 が各ステップごとに得られる • Meta-Learning • ここでは，「別の環境に効率的に適応できるような学習則(メタ知識)を学習する」くらいの意味

提案手法 1. 環境のモデル𝜃と， 𝜃の更新則𝑢のメタ訓練 2. 𝜃を用いたアクションの選択 3. 更新則𝑢について 1. Recurrence-Based Adaptive Control 2. Gradient-Based Adaptive Control

環境のモデル𝜃と， 𝜃の更新則𝑢のメタ訓練 • Meta-Agentは更新則𝑢を用いて，直近の遷移(s, a) をもとに，少し先の未来の予測誤差が少なくなるような環境のモデル𝜃 ′ を得る ③少し先の予測誤差を最小化 ②環境のモデルを更新し ①最近の遷移をもとに • 各タイムステップごとに𝜃 ′ を更新する(オンライン適応) • 更新則𝑢の詳細は後述

𝜃を用いたアクションの選択(Model Predictive Control, MPC) • 環境のモデル𝜃 ′ をもとに，t期からt+H期までのアクションをPlanning する環境のダイナミクスf𝜃′ に従うという制約下で累積報酬が最大になるアクションを選ぶ • しかし，Planning時の予測𝑠Ƹ 𝑡+ℎ と実際の𝑠 𝑡+ℎ は当然ずれる • そこで，一度アクションを取るたびにPlanningを再度行う(MPC) • 直感的説明: Hタイムステップ先までのアクションを計画するけど，計画通りに行かないことはわかっているので，各タイムステップでプランを練り直す • (MPC自体は提案ではない)

ここまでのアルゴリズムまとめ ①それぞれのタイムステップで ②環境のモデルを更新し ③アクションを決定する

10.

更新則𝑢について • 二つの更新則𝑢について検証する • Recurrence-Based Adaptive Control (RBAC) • Santoro et al., 2016 • Duan et al., 2016 • Gradient-Based Adaptive Control (GBAC) • Finn et al., 2017

11.

Recurrence-Based Adaptive Control (RBAC) • (s, a)を入力としてRNNを更新していく • 重みパラメータ: 𝑢 • Hidden State: 𝜃 • 詳細は載っていないが，参考になりそうなもの • • • • Santoro et al., 2016 Duan et al., 2016 Mishra et al., 2018 Finn and Levin, 2018

12.

Gradient-Based Adaptive Control (GBAC) • Model Agnostic Meta-Learning(MAML)を基盤 • 更新則𝑢を以下で定める • 直感的には，直近の環境のモデルの予測誤差を修正するように𝜃を更新 • 実験では𝛼を固定した(適応的にすることもできる) • 利点: GBACはメタ訓練環境の分布の外にも適応できそう • c.f., Finn and Levin, 2018

13.

実験 • 目的 • 様々な環境の変化にオンラインで適応できるか確認 • GBACとRBACを様々な設定で比較 • 訓練環境の分布の外での振る舞いを確認 • 設定 (ビデオ: https://sites.google.com/view/metalearning4ac/) • half-cheetah における環境の変化: • トルクを固定する, 床の傾きを変える, Dynamics(摩擦，湿度?)を変える • Ant: • 足の長さを変える，トルクを固定する • 7-DoF arm: • 運んでいる物体に様々な大きさの力(重力?)をかける

https://sites.google.com/view/metalearning4ac/

14.

実験 • 比較手法: • MB: モデルベース(Nagabandi et al., 2017) • MB + DE: モデルベースとDynamic Evaluation(Krause et al., 2017) の組み合わせ • MF: モデルフリー(TRPO)

15.

実験: Static • 訓練環境とテスト環境のパラメータを同じ分布からサンプリングする設定

16.

実験: Dynamic • テストエピソードの実行中に急に環境が変化する設定 • e.g.，途中で急に足が故障する • 訓練時にはこのような急激な変化は起こさない

17.

実験: Generalize • 訓練環境の分布の外側のパフォーマンスを比較 • e.g., めっちゃ急な坂道 • GBACは，変化が大きい(e.g., アリの足が使えない + 長さが縮む)ときに， RBACに対して優位になりそう

18.

結論 • メタ学習とModel-Based RLを組み合わせたオンライン適合手法を提案 • メタ学習の二つの手法(Gradient-Based, Recurrent-Based)について検証 • Future Works • Gradient-Based と Recurrent-Basedを組み合わせる • 実環境での実験

19.

Reference • Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016. • Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L., Sutskever, Ilya, and Abbeel, Pieter. Rl$ˆ2$: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779, 2016. • Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR, abs/1703.03400, 2017. • Nagabandi, Anusha, Kahn, Gregory, Fearing, Ronald S., and Levine, Sergey. Neural network dynamics for model-based deep reinforcement learning with model-free finetuning. CoRR, abs/1708.02596, 2017. • Krause, Ben, Kahembwe, Emmanuel, Murray, Iain, and Renals, Steve. Dynamic evaluation of neural sequence models. CoRR, abs/1709.07432, 2017. • Finn, Chelsea and Levine, Sergey. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. International Conference on Learning Representations(ICLR), 2018. • Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. International Conference on Learning Representations (ICLR), 2018.