[DL輪読会] Off-Policy Meta-Reinforcement Learning



Slide Overview

2019/04/05
Deep Learning JP:
http://deeplearning.jp/seminar-2/


Text of each slide
1.

Off-Policy Meta-Reinforcement Learning Methods • Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables • Guided Meta-Policy Search. Presenter: Tatsuya Matsushima @__tmats__, Matsuo Lab

2.

About This Presentation • Methods for off-policy meta-reinforcement learning were recently published on arXiv in quick succession • Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables [Rakelly+ 2019] (2019/3/19) • Guided Meta-Policy Search [Mendonca+ 2019] (2019/4/1) • Existing MAML-based meta-RL methods require a large amount of trial and error on the meta-training tasks, so the development of off-policy methods may help the field move forward

3.

Background

4.

Meta-Learning. Meta-learning in machine learning: "A base learner is an ordinary learner that acquires an appropriate hypothesis, given examples, from a fixed bias, i.e., a hypothesis space. Meta-learning acquires, on top of that, meta-knowledge for deciding the learner's bias according to the target task or domain." ⇒ We want to decide the learner's bias in a data-driven way • Source: the Toki no Mori Wiki (ibisforest.org) http://ibisforest.org/index.php?%E3%83%A1%E3%82%BF%E5%AD%A6%E7%BF%92 • A detailed explanation is given in [DL輪読会] Meta-Learning Probabilistic Inference for Prediction (阿久澤さん) • https://www.slideshare.net/DeepLearningJP2016/dlmetalearning-probabilistic-inference-forprediction-126167192

5.

MAML (Model-Agnostic Meta-Learning) [Finn+ 2017] • A meta-learning method based on gradient descent • Starting from the initial parameters θ, task-specific parameters φ_T are obtained by gradient descent (adaptation) • MAML training objective: $\min_\theta \sum_{\mathcal{T}} \mathcal{L}\left(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{tr}_{\mathcal{T}}), \mathcal{D}^{val}_{\mathcal{T}}\right) = \min_\theta \sum_{\mathcal{T}} \mathcal{L}\left(\phi_{\mathcal{T}}, \mathcal{D}^{val}_{\mathcal{T}}\right)$ • In implementation, second-order gradients appear • There is also work on a first-order approximation [Nichol+ 2018] • At meta-test time, the task is solved with the parameters updated by gradient descent: $\phi_{\mathcal{T}_{test}} = \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{tr}_{\mathcal{T}_{test}})$
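As a concrete illustration of the objective above, here is a minimal sketch of the MAML inner/outer update in PyTorch on a toy 1-D regression task family. The model, tasks, and learning rates are illustrative assumptions, not the setup of any paper discussed here.

```python
import torch

torch.manual_seed(0)
theta = torch.zeros(2, requires_grad=True)      # meta-parameters [w, b]
meta_opt = torch.optim.SGD([theta], lr=1e-2)
alpha = 0.1                                     # inner-loop step size

def predict(params, x):
    return params[0] * x + params[1]

def task_batch(slope):
    x = torch.randn(16)
    return x, slope * x                         # toy task: regress y = slope * x

for step in range(500):
    meta_loss = 0.0
    for slope in (-2.0, -1.0, 1.0, 2.0):        # a few meta-training tasks
        x_tr, y_tr = task_batch(slope)
        x_val, y_val = task_batch(slope)
        # Inner update: phi = theta - alpha * grad L(theta, D_tr)
        inner_loss = ((predict(theta, x_tr) - y_tr) ** 2).mean()
        (grad,) = torch.autograd.grad(inner_loss, theta, create_graph=True)
        phi = theta - alpha * grad              # create_graph keeps second-order terms
        # Outer objective: L(phi, D_val), summed over tasks
        meta_loss = meta_loss + ((predict(phi, x_val) - y_val) ** 2).mean()
    meta_opt.zero_grad()
    meta_loss.backward()                        # backprop through the inner update
    meta_opt.step()
```

Backpropagating through the inner step is what produces the second-order gradients mentioned above; first-order variants such as [Nichol+ 2018] drop them.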

6.

Meta-Reinforcement Learning. MAML-based meta-reinforcement learning • The loss used is the RL loss (the negative sum of rewards): $\mathcal{L}_{RL}(\phi, \mathcal{D}_{\mathcal{T}_i}) = -\frac{1}{|\mathcal{D}_{\mathcal{T}_i}|} \sum_{s_t, a_t \in \mathcal{D}_{\mathcal{T}_i}} r_i(s_t, a_t) = -\mathbb{E}_{s_t, a_t \sim \pi_\phi, q_{\mathcal{T}_i}}\left[\frac{1}{H}\sum_{t=1}^{H} r_i(s_t, a_t)\right]$ • There is also work applying MAML to model-based RL [Nagabandi+ 2018] and to exploration strategies [Gupta+ 2018] • A detailed explanation of meta-RL is given in [DL輪読会] Meta Reinforcement Learning (初谷さん) • https://www.slideshare.net/DeepLearningJP2016/dl-130067084
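A small sketch of this RL loss, assuming reward and log-probability tensors already collected from rollouts (the tensor shapes and names are illustrative). The second function is the standard REINFORCE-style surrogate one would actually differentiate; it is not specific to any paper here.

```python
import torch

def negative_return(rewards):
    """rewards: tensor [n_traj, H]; the loss above, -(1/H) E[sum_t r_t]."""
    return -rewards.sum(dim=1).mean() / rewards.shape[1]

def rl_surrogate_loss(log_probs, rewards):
    """log_probs, rewards: tensors [n_traj, H].
    Differentiable surrogate whose gradient is the REINFORCE estimate of the
    gradient of negative_return w.r.t. the policy parameters."""
    returns = rewards.sum(dim=1).detach()
    return -(log_probs.sum(dim=1) * returns).mean() / rewards.shape[1]
```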

7.

(Reference) On-Policy vs. Off-Policy. On-policy • Methods in which the behavior policy and the target policy (the policy whose value is estimated) are the same • i.e., the policy being learned and the policy generating the samples are the same • Example: an ε-greedy policy. Off-policy • Methods in which the behavior policy and the target policy differ • i.e., the policy being learned and the policy generating the samples differ. ※ In MAML's case, the task-specific parameters are produced at training time from a small number of samples per task, so it is a problem if the distribution of that data differs greatly from test time (= it cannot naively be made off-policy), I believe.

8.

① Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

9.

Paper ①: Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables • https://arxiv.org/abs/1903.08254 (Submitted on 19 Mar 2019) • Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, Sergey Levine • The familiar UC Berkeley (BAIR) • Deep RL is increasingly becoming a field that might as well be called "UC Berkeley" • Author implementation: https://github.com/katerakelly/oyster • (Unusually for BAIR) written in PyTorch, building on rlkit

10.

TL;DR • Proposes an off-policy method (PEARL) for meta-learning in reinforcement learning • Infers a probabilistic latent context variable • By aggregating experience in a permutation-invariant way, it optimizes quickly without overfitting, even on long tasks • Achieved 20-100x higher sample efficiency than existing meta-RL methods

11.

Motivation. Drawbacks of existing (mainly MAML-based) meta-RL methods • Both meta-training and adaptation depend on on-policy data, so sample efficiency is low • In MAML, the same procedure must be performed at meta-train and meta-test time, so it cannot naively be made off-policy • When adapting to a new task, the method cannot reason about task uncertainty • This becomes a problem when rewards are sparse

12.

Proposed Method

13.

Overview of the Proposed Method • Proposes PEARL, a method that infers a probabilistic context variable online on top of an off-policy RL algorithm (soft actor-critic, SAC [Haarnoja+ 2018]) • Aims both to improve the sample efficiency of meta-training and to adapt quickly • At meta-train time, the encoder uses past experience to infer a probabilistic context variable that lets the policy carry out the task • At meta-test time, a context variable is sampled and held fixed within an episode, and the conditioned policy is used to adapt to the new task • As a result, the policy is optimized with off-policy data, while the encoder is optimized with (near) on-policy data so as to reduce the distribution mismatch between meta-train and meta-test

14.

Problem Setting. Consider a distribution p(T) over a space of MDP tasks • Each task T consists of T = {p(s_0), p(s_{t+1} | s_t, a_t), r(s_t, a_t)} • i.e., the initial-state distribution, the transition probability, and the reward function • With this setup, the task distribution can include tasks that differ in their transition probabilities or in their reward functions • Example of differing transition probabilities: robots with different dynamics • Example of differing reward functions: navigation to different locations • A one-step transition in task T is written c^T_n = (s_n, a_n, r_n, s'_n) • The experience collected so far is c^T = c^T_{1:N} • At test time, a new task is sampled from p(T)

15.

Proposed Method: Learning a Probabilistic Latent Context • To adapt, the latent variable z must encode task-specific information • Variational inference is used to infer z • An approximate posterior (inference network) q_φ(z | c) is defined • Taking the log-likelihood as the objective, the variational lower bound is $\mathbb{E}_{\mathcal{T}}\left[\mathbb{E}_{z \sim q_\phi(z | c^{\mathcal{T}})}\left[R(\mathcal{T}, z) + \beta D_{KL}\left(q_\phi(z | c^{\mathcal{T}}) \,\|\, p(z)\right)\right]\right]$ • The prior p(z) is a Gaussian • The parameters φ of q_φ(z | c) are optimized at meta-train time; at meta-test time, z is inferred from the collected experience
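The KL term in this bound acts as an information bottleneck on the context. A minimal sketch of that regularizer in PyTorch, assuming the inference network outputs a diagonal Gaussian (the function name and shapes are illustrative):

```python
import torch
from torch.distributions import Normal
from torch.distributions.kl import kl_divergence

def kl_to_prior(mu, sigma, beta=0.1):
    """mu, sigma: tensors [batch, latent_dim] produced by the inference network q_phi(z|c)."""
    posterior = Normal(mu, sigma)                                    # diagonal Gaussian q_phi(z | c)
    prior = Normal(torch.zeros_like(mu), torch.ones_like(sigma))    # p(z) = N(0, I)
    return beta * kl_divergence(posterior, prior).sum(dim=-1).mean()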

16.

Proposed Method: Learning a Probabilistic Latent Context • Since the environment is assumed to be an MDP, the task can be inferred from the set of transitions {s_i, a_i, s'_i, r_i} regardless of their order • i.e., a permutation-invariance assumption can be made • The inference network can therefore be factored into a product of independent terms: $q_\phi(z | c_{1:N}) \propto \prod_{n=1}^{N} \Psi_\phi(z | c_n)$ • Each factor is a Gaussian: $\Psi_\phi(z | c_n) = \mathcal{N}\left(f^\mu_\phi(c_n), f^\sigma_\phi(c_n)\right)$
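Since a product of Gaussian factors is itself a Gaussian, the permutation-invariant posterior can be computed in closed form by summing precisions. A sketch under the assumption that per-transition means and standard deviations come from a small MLP (names and shapes are illustrative):

```python
import torch

def product_of_gaussians(mus, sigmas):
    """mus, sigmas: tensors [N, latent_dim], one row per transition c_n.
    Returns the mean and std of the Gaussian proportional to the product of the factors."""
    sigmas_sq = torch.clamp(sigmas ** 2, min=1e-7)
    var = 1.0 / (1.0 / sigmas_sq).sum(dim=0)       # combined variance = inverse of summed precisions
    mu = var * (mus / sigmas_sq).sum(dim=0)        # precision-weighted mean
    return mu, torch.sqrt(var)
```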

17.

Proposed Method: Off-Policy Meta-RL • The data used to train the encoder q_φ(z | c) does not have to be the same data used to train the policy • The actor and critic are trained on data sampled from the entire replay buffer B • The sampler S_c used to train the encoder draws from the most recent trajectory data in the replay buffer • The encoder data does not have to be strictly on-policy, but using the entire replay buffer would create too large a mismatch with the on-policy test data

18.

Proposed Method: Off-Policy Meta-RL • Soft Actor-Critic (SAC) [Haarnoja+ 2018] is extended to include the probabilistic context variable z • SAC is an off-policy actor-critic method for maxEnt RL (entropy regularization is included in the objective) • The encoder and the actor/critic are optimized jointly using the reparameterization trick • Critic loss: $\mathcal{L}_{critic} = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B},\, z \sim q_\phi(z|c)}\left[\left(Q_\theta(s, a, z) - \left(r + \bar{V}(s', z)\right)\right)^2\right]$ • Actor loss: $\mathcal{L}_{actor} = \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi_\theta,\, z \sim q_\phi(z|c)}\left[D_{KL}\left(\pi_\theta(a | s, z) \,\Big\|\, \frac{\exp\left(Q_\theta(s, a, z)\right)}{\mathcal{Z}_\theta(s)}\right)\right]$
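A sketch of how these context-conditioned losses could look in PyTorch. The networks `q_net`, `target_v_net`, and `policy` (with a `sample` method returning a reparameterized action and its log-probability) are assumed placeholders, not the authors' API; the actor loss is written in the equivalent E[log π − Q] form of the KL objective above.

```python
import torch
import torch.nn.functional as F

def critic_loss(q_net, target_v_net, batch, z):
    """batch = (s, a, r, s_next); z ~ q_phi(z|c). Gradients from this loss also
    train the encoder, because z is not detached in the Q input."""
    s, a, r, s_next = batch
    q = q_net(torch.cat([s, z], dim=-1), a)                          # Q_theta(s, a, z)
    with torch.no_grad():
        target = r + target_v_net(torch.cat([s_next, z], dim=-1))    # r + V_bar(s', z)
    return F.mse_loss(q, target)

def actor_loss(policy, q_net, s, z):
    """SAC actor objective E[log pi - Q], equal to the KL form above up to a constant.
    The context is detached so the policy update does not train the encoder."""
    z = z.detach()
    a, log_pi = policy.sample(torch.cat([s, z], dim=-1))             # reparameterized action sample
    q = q_net(torch.cat([s, z], dim=-1), a)
    return (log_pi - q).mean()
```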

19.

Experiments and Results

20.

Experiment ①: Comparison with Existing Meta-RL Methods • Experiments on six MuJoCo environments • Half-Cheetah, Humanoid, Ant, Walker (two variants each of Half-Cheetah and Ant) • Settings where either the reward function or the dynamics differ across tasks • Adaptation is therefore required • 20-100x the sample efficiency of the baselines, and the final performance is also higher • x-axis: meta-training samples; y-axis: average return

21.

Experiment ②: Sampling from the Encoder • Comparison with an on-policy method (MAESN [Gupta+ 2018]) under sparse rewards • Evaluated on sparse navigation • At meta-test time, the agent receives a reward only once it enters the dark blue circle marking the goal • Confirms that sampling from the approximate posterior is effective when rewards are sparse • The return increases as the number of context transitions grows • Performance is higher than MAESN

22.

Experiment ③: Ablation Study • Experiment on the encoder architecture • Evaluated on Half-Cheetah-Vel • Compared with RNN-based encoders • RNN-tran: samples de-correlated transitions • RNN-traj: samples whole trajectories • The permutation-invariant encoder architecture performed best

23.

Experiment ③: Ablation Study • Experiment on how data is sampled • Evaluated on Half-Cheetah-Vel • Varies which data is fed to the encoder • off-policy: fully off-policy (sampled from the entire buffer) • off-policy RL-batch: uses the same batch as the policy • The proposed method (PEARL), which samples the encoder data from the recent part of the buffer, performed best

24.

Experiment ③: Ablation Study • Difference between probabilistic and deterministic context • Evaluated on sparse navigation • With a deterministic context, the return is dramatically lower • Because task uncertainty is not modeled, the agent cannot explore effectively

25.

Summary

26.

Summary • Proposed an off-policy meta-RL method (PEARL) • By inferring a probabilistic context variable online from past experience and conditioning the policy on that context, off-policy training using the entire replay buffer becomes possible • Showed experimentally that meta-training sample efficiency is higher than with existing meta-RL methods

27.

② Guided Meta-Policy Search

28.

Paper ②: Guided Meta-Policy Search • https://arxiv.org/abs/1904.00956 (Submitted on 1 Apr 2019) • Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, Chelsea Finn • UC Berkeley (BAIR) • No surprise there… • Author implementation: https://github.com/RussellM2020/GMPS • Website: https://sites.google.com/berkeley.edu/guided-metapolicy-search

29.

TL;DR • Proposes an off-policy method (GMPS) for meta-learning in reinforcement learning • Observes that, at meta-train time, it is usually not necessary to learn the policy directly with RL • The meta-objective (outer objective) of meta-training is trained as imitation learning (behaviour cloning), which improves stability and learning efficiency • To this end, meta-training is explicitly split into two phases: task learning and meta-learning

30.

Motivation. Drawbacks of existing (mainly MAML-based) meta-RL methods • Both meta-training and adaptation depend on on-policy data, so sample efficiency is low • Same issue as the first paper [Rakelly+ 2019] • In particular, meta-training is not directly tied to the meta-test policy, so one would like to improve sample efficiency by using rich rewards or demonstrations

31.

Proposed Method

32.

Overview of the Proposed Method • The meta-objective (outer objective) of meta-training is made a supervised learning objective (behaviour cloning), aiming for more stable training and better sample efficiency • Meta-training is split into two phases • ① task learning: learn a policy for each individual meta-training task • These policies are not the ones used at meta-test time; expert data as in imitation learning is also fine • ② meta-learning: using the policies learned in ①, perform supervised learning at the meta level

33.

Problem Setting. Essentially the same as in the first paper [Rakelly+ 2019]. Consider a distribution p(T) over the task space • Each task T consists of T = {p(s_0), p(s_{t+1} | s_t, a_t), r(s_t, a_t)} • i.e., the initial-state distribution, the transition probability, and the reward function • At test time, a new task is sampled from p(T)

34.

Proposed Method. ① Task learning phase • Solve each meta-training task T_i to obtain a set of optimal/near-optimal policies {π*_i} • These can be treated as experts. ② Meta-learning phase • As in MAML-based meta-RL, the goal is to optimize $\mathcal{L}_{RL}(\phi_i, \mathcal{D}_i)$ • The parameters φ_i are those obtained by adapting to task T_i with gradient descent • The inner objective is therefore the same as in MAML-based meta-RL • The outer objective becomes supervised (behaviour cloning): $\mathcal{L}_{BC}(\phi_i, \mathcal{D}_i) \triangleq -\sum_{(s_t, a_t) \in \mathcal{D}_i} \log \pi_{\phi_i}(a_t | s_t)$

35.

Proposed Method. ② Details of the meta-learning phase • For each meta-training task T_i, roll out the policy π*_i to build a dataset of expert trajectories D*_i • Using this dataset, update the policy according to the meta-objective $\min_\theta \sum_{\mathcal{T}_i} \sum_{\mathcal{D}^{val}_i \sim \mathcal{D}^*_i} \mathbb{E}_{\mathcal{D}^{tr}_i \sim \pi_\theta}\left[\mathcal{L}_{BC}\left(\theta - \alpha \nabla_\theta \mathcal{L}_{RL}(\theta, \mathcal{D}^{tr}_i), \mathcal{D}^{val}_i\right)\right]$ • From the updated parameters θ, the task-specific parameters φ_i can be computed and rolled out for each task T_i, which grows the dataset D*_i ⇒ this addresses the compounding error of behaviour cloning
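A minimal sketch of this meta-objective: an inner policy-gradient step on on-policy data followed by an outer behaviour-cloning loss on expert rollouts. `policy_log_prob` and the per-task data dictionary are illustrative assumptions, not the authors' implementation.

```python
import torch

def gmps_meta_loss(theta, tasks, policy_log_prob, alpha=0.1):
    """theta: list of parameter tensors. Each task dict holds on-policy data
    (s_tr, a_tr, returns_tr) and expert rollouts (s_expert, a_expert) from D*_i."""
    meta_loss = 0.0
    for task in tasks:
        # Inner step: one policy-gradient (REINFORCE) update on D_tr_i
        logp = policy_log_prob(theta, task["s_tr"], task["a_tr"])
        inner_loss = -(logp * task["returns_tr"]).mean()
        grads = torch.autograd.grad(inner_loss, theta, create_graph=True)
        phi = [p - alpha * g for p, g in zip(theta, grads)]
        # Outer step: behaviour cloning of the expert actions under the adapted policy
        logp_expert = policy_log_prob(phi, task["s_expert"], task["a_expert"])
        meta_loss = meta_loss - logp_expert.sum()
    return meta_loss    # minimize w.r.t. theta with any optimizer
```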

36.

Proposed Method: Characteristics • The meta-learning problem is explicitly split into a task learning phase and a meta-learning phase. Advantages of the proposed method • This makes it possible to reuse previously trained policies or demonstrations • Making the outer objective supervised stabilizes training • Information available only at meta-training time can be used for meta-RL • e.g., reward shaping, or low-dimensional state representations such as object positions • As with MAML-based meta-RL, training can continue as more data accumulates

37.

Implementation of the Proposed Method. Optimizing the expert policies • Algorithmically, a separate policy could be learned for each task, but learning a single contextual policy π_θ(a_t | s_t, ω) is more efficient • ω is a variable describing the task (if the task is known, it can be the goal position, a task ID, etc.) • This is fine because it is only used during meta-training • At meta-test time, the rewards injected during meta-training are not used; only the raw reward is used • In the experiments of this paper, soft actor-critic (SAC) [Haarnoja+ 2018] is used as the base algorithm

38.

Implementation of the Proposed Method. Optimization algorithm • With the behaviour-cloning meta-objective, multiple gradient updates can be performed • However, each update changes the base parameters θ, so φ_i must be recomputed • We want to do this without sampling new data from the policy π_θ ⇒ apply importance weighting to the gradient: $\phi_i = \theta + \alpha\, \mathbb{E}_{\tau \sim \pi_{\theta_{init}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{init}}(\tau)} \nabla_\theta \log \pi_\theta(\tau)\, A_i(\tau)\right]$ • A_i is the advantage function • Behaviour-cloning update: $\theta \leftarrow \theta - \beta \nabla_\theta \mathcal{L}_{BC}(\phi_i, \mathcal{D}^{val}_i)$
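A sketch of that importance-weighted recomputation of the adapted parameters, reusing trajectories collected from π_θinit. `log_prob_traj` and `advantage` are illustrative assumptions; the weight is detached so that the gradient matches the weighted policy-gradient form above.

```python
import torch

def weighted_inner_update(theta, theta_init, trajs, log_prob_traj, advantage, alpha=0.1):
    """Recompute phi_i from trajectories sampled earlier from pi_{theta_init},
    without collecting new data from the current pi_theta."""
    surrogate = 0.0
    for traj in trajs:
        logp_new = log_prob_traj(theta, traj)                 # log pi_theta(tau)
        logp_old = log_prob_traj(theta_init, traj)            # log pi_theta_init(tau)
        weight = torch.exp(logp_new - logp_old).detach()      # pi_theta(tau) / pi_theta_init(tau)
        surrogate = surrogate + weight * logp_new * advantage(traj)
    surrogate = surrogate / len(trajs)
    grads = torch.autograd.grad(surrogate, theta, create_graph=True)
    return [p + alpha * g for p, g in zip(theta, grads)]      # phi_i = theta + alpha * weighted PG
```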

39.

Experiments and Results

40.

Experiments. Experimental setup • Robot arm • Pushing (full state) • Push a block to a specific goal; the goal position must be inferred through trial and error • The end-effector and block positions are given • Pushing (vision) • Image observations only • Door opening • Open a door to a specific angle • The goal angle must be inferred through trial and error • Quadruped locomotion (Ant) • Reach a goal location. Videos of the tasks are available at https://sites.google.com/berkeley.edu/guided-metapolicy-search

41.

Experiment ①: Meta-RL • Comparison with existing meta-RL methods • Assumes access to a task context (information that uniquely identifies the task) during meta-training • The SAC-based proposed method has good sample efficiency • x-axis: number of meta-training samples; y-axis: average return

42.

Experiment ②: Meta-Learning with Demonstrations • Comparison with existing meta-RL methods • Door Opening and Ant are used as sparse-reward settings • Confirms that the proposed method achieves high performance through exploration even under sparse rewards • Arm pushing is used as an image-based task • Confirms that the proposed method achieves high performance stably

43.

Summary

44.

Summary • Proposed an off-policy meta-RL method (GMPS) • By splitting meta-training into two phases, task learning and meta-learning, the method can use supervised learning (behaviour cloning), which trains more stably and with better sample efficiency • Showed experimentally that meta-training sample efficiency is higher than with existing meta-RL methods

45.

Closing Remarks

46.

Thoughts • Two recent streams of meta-learning • Models that adapt via a one-step gradient-descent update (mainly BAIR) • e.g., MAML [Finn+ 2017] and related methods • Models that adapt by conditioning on a latent variable (mainly DeepMind) • e.g., Neural Processes [Garnelo+ 2018], GQN [Eslami+ 2018] • Both are different models driven by similar motivations • A discussion from a unified viewpoint is given in [DL輪読会] Meta-Learning Probabilistic Inference for Prediction (阿久澤さん) • https://www.slideshare.net/DeepLearningJP2016/dlmetalearning-probabilistic-inference-forprediction-126167192 • In the end, which approach is better in which cases? It seems we need to discuss the pros and cons of both.

47.

Appendix

48.

References

[Eslami+ 2018] S. M. Ali Eslami, Danilo Jimenez Rezende, Frédéric Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil C. Rabinowitz, Helen King, Chloe Hillier, Matthew M. Botvinick, Daan Wierstra, Koray Kavukcuoglu and Demis Hassabis. "Neural scene representation and rendering." Science 360 (2018): 1204-1210. http://science.sciencemag.org/content/360/6394/1204

[Finn+ 2017] Chelsea Finn, Pieter Abbeel and Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." Proceedings of the 34th International Conference on Machine Learning, PMLR 70:1126-1135, 2017. http://proceedings.mlr.press/v70/finn17a.html

[Garnelo+ 2018] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami and Yee Whye Teh. "Neural Processes." https://arxiv.org/abs/1807.01622

[Gupta+ 2018] Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel and Sergey Levine. "Meta-Reinforcement Learning of Structured Exploration Strategies." In Advances in Neural Information Processing Systems, 2018. https://nips.cc/Conferences/2018/Schedule?showEvent=12658

[Haarnoja+ 2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel and Sergey Levine. "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1861-1870, 2018. http://proceedings.mlr.press/v80/haarnoja18b.html

[Mendonca+ 2019] Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine and Chelsea Finn. "Guided Meta-Policy Search." https://arxiv.org/abs/1904.00956

[Nagabandi+ 2018] Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine and Chelsea Finn. "Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning." https://arxiv.org/abs/1803.11347

[Nichol+ 2018] Alex Nichol, Joshua Achiam and John Schulman. "On First-Order Meta-Learning Algorithms." https://arxiv.org/abs/1803.02999

[Rakelly+ 2019] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn and Sergey Levine. "Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables." https://arxiv.org/abs/1903.08254