[DL Reading Group] Deep Dynamics Models for Learning Dexterous Manipulation

October 11, 2019

#deep learning #Deep Learning #Reinforcement Learning #Dexterous Manipulation #PDDM #Model Predictive Control

Slide overview

2019/10/11
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Text of each page

DEEP LEARNING JP [DL Papers]
Deep Dynamics Models for Learning Dexterous Manipulation (PDDM)
Keno Harada, UT, B3
http://deeplearning.jp/

Bibliographic information
● Authors: Anusha Nagabandi, Kurt Konolige, Sergey Levine, Vikash Kumar
  ○ Google Brain
● Paper: https://arxiv.org/pdf/1909.11652.pdf (CoRL 2019?)
● Blog:
  ○ Google: https://sites.google.com/view/pddm/
  ○ BAIR: https://bair.berkeley.edu/blog/2019/09/30/deep-dynamics/
● Lectures 10 and 11 of CS285 (http://rail.eecs.berkeley.edu/deeprlcourse/) explain the techniques related to PDDM in detail

Demo
(GIF from https://sites.google.com/view/pddm/)

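Research overview
● Dexterous manipulation tasks with a multi-fingered hand are hard
  ○ They are difficult to accomplish unless forces can be applied to the object from several directions at once
  ○ Many joints must be controlled to apply complex forces
  ○ Contacts are repeatedly made and broken, so analytical methods that need an accurate physics model struggle -> learning-based methods have a chance of succeeding
● Model-based reinforcement learning
  ○ Learns the dynamics of the environment
  ○ Needs less data than model-free methods, which makes it practical
  ○ Has not yet been applied much to hard tasks such as dexterous manipulation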

Research overview
● Online planning with deep dynamics models (PDDM)
  ○ Model Predictive Control
    ■ Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning (https://arxiv.org/pdf/1708.02596.pdf)
  ○ Ensembles for model uncertainty estimation
    ■ Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (https://papers.nips.cc/paper/7725-deep-reinforcement-learning-in-a-handful-of-trials-using-probabilistic-dynamics-models.pdf)
● In one sentence: a bootstrap ensemble predicts the dynamics while accounting for uncertainty, and MPC selects the actions
● The individual techniques already exist, but the authors present the combination as new and as the key to the method

Outline
● Learning the Dynamics
  ○ Challenges of model-based reinforcement learning
  ○ Accounting for uncertainty
  ○ Bootstrap ensemble
● Model Predictive Control
  ○ Random Shooting
  ○ Iterative Random-Shooting with Refinement
  ○ Filtering and Reward-Weighted Refinement
● PDDM
● Experimental results

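Learning the Dynamics: challenges of model-based reinforcement learning
(image from CS285 Lecture 11 slide)
● Performance is worse than that of model-free methods
  ○ Model-based methods plan on top of the learned model
    ■ Even when the dynamics model is wrong, the planner picks actions that earn high reward under that model
    ■ The higher the dimensionality, the more likely the model is to make wrong predictions (reportedly)
    ■ We want to know where the model is not confident in its predictions -> account for uncertainty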

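Learning the Dynamics: accounting for uncertainty
(image from CS285 Lecture 11 slide)
● Aleatoric (stochastic) uncertainty
  ○ Uncertainty inherent in the environment
  ○ Uncertainty in the data
    ■ The data itself is noisy
● Epistemic (model) uncertainty
  ○ Uncertainty that arises because too few environment transitions were collected and the NN is not trained well enough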

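Learning the Dynamics: accounting for uncertainty
(image from CS285 Lecture 11 slide)
● Handling the uncertainty inherent in the environment
  ○ -> Have the NN output the parameters of a probability distribution and sample from it
● Handling the uncertainty that arises because too few environment transitions were collected and the NN is not trained well enough
  ○ -> Train several dynamics models (bootstrap ensemble)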
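
The two remedies above can be sketched in a few lines of standard-library Python. This is only a toy illustration, not the authors' implementation: the 1-D environment, the one-parameter Gaussian model standing in for a neural network, and the names `bootstrap_resample`, `fit_model` and `sample_next_state` are all assumptions made for the example.

```python
import math
import random

def bootstrap_resample(data, rng):
    """Draw len(data) transitions with replacement (one bootstrap dataset)."""
    return [rng.choice(data) for _ in data]

def fit_model(data):
    """Stand-in for NN training: fit s' - s ~ N(k * a, var) by least squares."""
    k = sum(a * (s2 - s) for s, a, s2 in data) / sum(a * a for _, a, _ in data)
    var = sum((s2 - s - k * a) ** 2 for s, a, s2 in data) / len(data)
    return k, var

def sample_next_state(model, s, a, rng):
    """Sample s' from the predicted Gaussian (covers aleatoric uncertainty)."""
    k, var = model
    return rng.gauss(s + k * a, math.sqrt(var))

rng = random.Random(0)
# Assumed toy environment: s' = s + 0.5 * a + noise, with noise ~ N(0, 0.1^2).
data = []
for _ in range(200):
    s, a = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    data.append((s, a, s + 0.5 * a + rng.gauss(0.0, 0.1)))

# One model per bootstrap dataset; disagreement between the members
# reflects epistemic uncertainty.
ensemble = [fit_model(bootstrap_resample(data, rng)) for _ in range(5)]
predictions = [sample_next_state(m, 0.0, 1.0, rng) for m in ensemble]
```

Each member sees a slightly different dataset, so the fitted gains `k` differ a little; with less data the spread between members would grow.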

10.

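Learning the Dynamics: bootstrap ensemble
(image from CS285 Lecture 11 slide)
● Predict the transitions with several dynamics models, and evaluate a candidate action sequence by the average reward obtained when that sequence is executed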
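
A minimal sketch of this evaluation step, assuming deterministic ensemble members and a reward computed on the state reached after each action. The three hand-written models, the reward function and the function names are hypothetical and exist only to make the averaging concrete.

```python
def rollout_return(model, reward_fn, state, actions):
    """Sum of rewards when `actions` is rolled out under a single model."""
    total = 0.0
    for a in actions:
        state = model(state, a)
        total += reward_fn(state, a)
    return total

def ensemble_score(models, reward_fn, state, actions):
    """Average the predicted return of one action sequence over all members."""
    returns = [rollout_return(m, reward_fn, state, actions) for m in models]
    return sum(returns) / len(returns)

# Hypothetical ensemble: the members disagree slightly about the gain k.
models = [lambda s, a, k=k: s + k * a for k in (0.4, 0.5, 0.6)]

def reward_fn(s, a):
    """Assumed reward: negative distance of the state to the goal 1.0."""
    return -abs(s - 1.0)

score = ensemble_score(models, reward_fn, 0.0, [1.0, 1.0])
```

An action sequence that looks good only under one member gets a lower average score, which makes it harder for the planner to exploit the errors of a single model.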

11.

Learning the Dynamics: bootstrap ensemble

12.

Model Predictive Control
(slide from CS285 Lecture 11)

13.

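Model Predictive Control: Random shooting
(slide from CS285 Lecture 10, 11)
● Propose several candidate action sequences of a given length
● Adopt the candidate that earns the highest reward
  ○ The reward each candidate would earn is estimated with the learned model
  ○ In Model Predictive Control only the first action is executed, and random shooting is run again at the next step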
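
The loop described above can be sketched as follows. The 1-D dynamics, the goal at 1.0, the action bounds of [-1, 1] and all function names are assumptions for illustration; here the learned model and the real environment are taken to be the same function, which would not hold in practice.

```python
import random

def predicted_return(model, reward_fn, state, actions):
    """Sum of predicted rewards when `actions` is rolled out under the model."""
    total = 0.0
    for a in actions:
        state = model(state, a)
        total += reward_fn(state)
    return total

def random_shooting(model, reward_fn, state, horizon, n_candidates, rng):
    """Sample candidate sequences uniformly and keep the best one."""
    best_seq, best_ret = None, float("-inf")
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        ret = predicted_return(model, reward_fn, state, seq)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq

def run_mpc(model, env_step, reward_fn, state, n_steps, horizon, n_candidates, rng):
    """MPC: plan, execute only the first action, then replan from the new state."""
    for _ in range(n_steps):
        plan = random_shooting(model, reward_fn, state, horizon, n_candidates, rng)
        state = env_step(state, plan[0])
    return state

def dynamics(s, a):
    """Assumed toy dynamics, used both as model and as environment."""
    return s + 0.5 * a

def reward(s):
    """Assumed reward: negative distance to the goal state 1.0."""
    return -abs(s - 1.0)

rng = random.Random(0)
final_state = run_mpc(dynamics, dynamics, reward, 0.0,
                      n_steps=10, horizon=3, n_candidates=500, rng=rng)
```

Replanning at every step lets the controller correct for model error, at the cost of running the sampler once per control step.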

14.

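Model Predictive Control: Iterative Random-Shooting with Refinement
(image from CS285 Lecture 10 slide)
● Draw the candidate action sequences from the region where high reward was obtained, increasing accuracy step by step
  ○ Sample several times, then settle on the final action sequence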
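
One common way to realize this refinement is a cross-entropy-method style loop: sample from a Gaussian, keep the best candidates, and refit the Gaussian to them. The sketch below follows that pattern under assumed toy dynamics; the elite count, iteration count, action bounds and function names are illustrative choices, not values taken from the slides.

```python
import random
import statistics

def predicted_return(model, reward_fn, state, actions):
    """Sum of predicted rewards when `actions` is rolled out under the model."""
    total = 0.0
    for a in actions:
        state = model(state, a)
        total += reward_fn(state)
    return total

def iterative_refinement(model, reward_fn, state, horizon,
                         n_candidates, n_elites, n_iters, rng):
    """Repeatedly sample, keep the elites, and refit the sampling Gaussian."""
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(n_iters):
        candidates = [
            [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(n_candidates)
        ]
        candidates.sort(
            key=lambda seq: predicted_return(model, reward_fn, state, seq),
            reverse=True,
        )
        elites = candidates[:n_elites]
        # Refit each time step independently to the elite samples.
        mean = [statistics.fmean(e[t] for e in elites) for t in range(horizon)]
        std = [statistics.pstdev([e[t] for e in elites]) + 1e-6
               for t in range(horizon)]
    return mean

def dynamics(s, a):
    """Assumed toy dynamics."""
    return s + 0.5 * a

def reward(s):
    """Assumed reward: negative distance to the goal state 1.0."""
    return -abs(s - 1.0)

rng = random.Random(0)
plan = iterative_refinement(dynamics, reward, 0.0, horizon=3,
                            n_candidates=100, n_elites=10, n_iters=5, rng=rng)
plan_return = predicted_return(dynamics, reward, 0.0, plan)
```

Compared with plain random shooting, later iterations spend their samples near sequences that already scored well, so fewer candidates are wasted.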

15.

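Model Predictive Control: Filtering and Reward-Weighted Refinement
(figure annotations: update the distribution with reward weighting; filtering to account for correlation between time steps (?))
● Takes the correlation between time steps into account, and updates the distribution that narrows down the sampled action sequences more effectively by using all of the samples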
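
A rough sketch of how the two ideas can fit together: the sampling noise is low-pass filtered across time steps so neighbouring actions are correlated, and the new mean is a softmax-weighted average over all candidates rather than an average over a hard elite set. The toy dynamics and the constants `sigma`, `beta` and `gamma` are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random

def predicted_return(model, reward_fn, state, actions):
    """Sum of predicted rewards when `actions` is rolled out under the model."""
    total = 0.0
    for a in actions:
        state = model(state, a)
        total += reward_fn(state)
    return total

def sample_filtered(mean, n_candidates, sigma, beta, rng):
    """Perturb `mean` with noise n_t = beta * u_t + (1 - beta) * n_{t-1}."""
    candidates = []
    for _ in range(n_candidates):
        noise, seq = 0.0, []
        for mu_t in mean:
            noise = beta * rng.gauss(0.0, sigma) + (1.0 - beta) * noise
            seq.append(max(-1.0, min(1.0, mu_t + noise)))
        candidates.append(seq)
    return candidates

def reward_weighted_mean(candidates, returns, gamma):
    """Average all candidates with weights proportional to exp(gamma * return)."""
    top = max(returns)  # subtract the maximum for numerical stability
    weights = [math.exp(gamma * (r - top)) for r in returns]
    total = sum(weights)
    horizon = len(candidates[0])
    return [sum(w * c[t] for w, c in zip(weights, candidates)) / total
            for t in range(horizon)]

def refine(mean, model, reward_fn, state, n_candidates, sigma, beta, gamma, rng):
    """One sample-evaluate-update step around the current mean sequence."""
    candidates = sample_filtered(mean, n_candidates, sigma, beta, rng)
    returns = [predicted_return(model, reward_fn, state, c) for c in candidates]
    return reward_weighted_mean(candidates, returns, gamma)

def dynamics(s, a):
    """Assumed toy dynamics."""
    return s + 0.5 * a

def reward(s):
    """Assumed reward: negative distance to the goal state 1.0."""
    return -abs(s - 1.0)

rng = random.Random(0)
mean = [0.0, 0.0, 0.0]
for _ in range(5):
    mean = refine(mean, dynamics, reward, 0.0, 200, 0.5, 0.6, 10.0, rng)
```

A larger `gamma` concentrates the weight on the best candidates and approaches taking the single best sequence, while a small `gamma` approaches a plain average over all samples.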

16.

PDDM
(figure labels: Model Predictive Control, bootstrap ensemble)

17.

Experimental results (model design)

18.

Experimental results
● Valve Turning: turn a valve with a 9-DoF hand
● In-hand Reorientation: bring a cube to a specified orientation
● Handwriting: requires precise manipulation
● Baoding Balls: rotate two balls without dropping them

19.

Valve Turning

20.

In-hand reorientation

21.

Handwriting

22.

Baoding Balls

23.

Baoding Balls (real)

24.

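Summary
● The paper proposes PDDM, a model-based reinforcement learning method that combines existing techniques so that dexterous manipulation tasks can be solved practically: a bootstrap ensemble accounts for uncertainty, and MPC is run with action sequences selected by Filtering and Reward-Weighted Refinement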

25.

Experimental setup details