2019/04/05
Deep Learning JP:
http://deeplearning.jp/seminar-2/
DL輪読会 (DL Paper Reading Group) slides
Off-Policy Meta-Reinforcement Learning Methods
・Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
・Guided Meta-Policy Search
Presenter: Tatsuya Matsushima @__tmats__ , Matsuo Lab
About This Presentation
• Two methods for off-policy meta-reinforcement learning were recently published on arXiv in quick succession:
  • Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables [Rakelly+ 2019] (2019/3/19)
  • Guided Meta-Policy Search [Mendonca+ 2019] (2019/4/1)
• Existing MAML-based meta-RL methods require a great deal of trial and error on the meta-training tasks, so the development of off-policy methods may push this line of research forward
Background
Meta-Learning
Meta-learning in machine learning
• "An ordinary learner that acquires an appropriate hypothesis, according to the examples, from within a fixed bias, i.e., a hypothesis space, is called a base learner. Meta-learning sits one level above it and acquires the meta-knowledge for deciding the learner's bias according to the task or domain being learned." ⇒ we want to decide the learner's bias in a data-driven way
• Source: 朱鷺の杜Wiki http://ibisforest.org/index.php?%E3%83%A1%E3%82%BF%E5%AD%A6%E7%BF%92
• A detailed explanation is in [DL輪読会] Meta-Learning Probabilistic Inference for Prediction (by 阿久澤さん)
  • https://www.slideshare.net/DeepLearningJP2016/dlmetalearning-probabilistic-inference-forprediction-126167192
MAML
MAML (Model-Agnostic Meta-Learning) [Finn+ 2017]
• A gradient-based meta-learning method
• Starting from the initial parameters $\theta$, gradient descent yields the task-specific parameters $\phi_{\mathcal{T}}$ used for adaptation
• MAML training (see the sketch after this slide):
$$\min_\theta \sum_{\mathcal{T}} \mathcal{L}\big(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{tr}_{\mathcal{T}}),\ \mathcal{D}^{val}_{\mathcal{T}}\big) = \min_\theta \sum_{\mathcal{T}} \mathcal{L}\big(\phi_{\mathcal{T}}, \mathcal{D}^{val}_{\mathcal{T}}\big)$$
• In practice, second-order gradients appear in the implementation
  • there is also work on first-order approximations [Nichol+ 2018]
• At meta-test time, the task is solved with parameters updated by the same gradient step:
$$\phi_{\mathcal{T}_{test}} = \theta - \alpha \nabla_\theta \mathcal{L}\big(\theta, \mathcal{D}^{tr}_{\mathcal{T}_{test}}\big)$$
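As a concrete illustration, here is a minimal PyTorch sketch of one MAML meta-update on a toy regression task. The model, dummy data, and step sizes are assumptions for illustration, not the authors' code; the point is the inner step taken with `create_graph=True` so that the second-order terms reach $\theta$.

```python
# Minimal MAML meta-update sketch (toy regression; illustrative only).
import torch

model = torch.nn.Linear(1, 1)                     # meta-parameters theta
meta_opt = torch.optim.SGD(model.parameters(), lr=1e-3)
alpha = 0.01                                      # inner-loop step size

def task_loss(params, x, y):
    # functional forward pass so adapted parameters phi can be used
    w, b = params
    return torch.nn.functional.mse_loss(x @ w.t() + b, y)

meta_opt.zero_grad()
for _ in range(4):                                # a batch of tasks T ~ p(T)
    x_tr, y_tr = torch.randn(8, 1), torch.randn(8, 1)    # D^tr_T (dummy)
    x_val, y_val = torch.randn(8, 1), torch.randn(8, 1)  # D^val_T (dummy)
    theta = [model.weight, model.bias]
    # inner step: phi_T = theta - alpha * grad_theta L(theta, D^tr_T)
    grads = torch.autograd.grad(task_loss(theta, x_tr, y_tr), theta,
                                create_graph=True)  # keep 2nd-order terms
    phi = [t - alpha * g for t, g in zip(theta, grads)]
    # outer loss: L(phi_T, D^val_T); backprop reaches theta through phi
    task_loss(phi, x_val, y_val).backward()
meta_opt.step()
```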
Meta-Reinforcement Learning
MAML-based meta-reinforcement learning
• Uses the RL loss (the negative reward) as the loss (toy computation below):
$$\mathcal{L}_{RL}\big(\phi, \mathcal{D}_{\mathcal{T}_i}\big) = -\frac{1}{|\mathcal{D}_{\mathcal{T}_i}|}\sum_{s_t, a_t \in \mathcal{D}} r_i(s_t, a_t) = -\mathbb{E}_{s_t, a_t \sim \pi_\phi, q_{\mathcal{T}_i}}\Big[\frac{1}{H}\sum_{t=1}^{H} r_i(s_t, a_t)\Big]$$
• MAML has also been applied to model-based RL [Nagabandi+ 2018] and to exploration strategies [Gupta+ 2018]
• [DL輪読会] Meta Reinforcement Learning (by 初谷さん) explains meta-RL in detail
  • https://www.slideshare.net/DeepLearningJP2016/dl-130067084
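A toy computation of the $\mathcal{L}_{RL}$ above, with dummy reward data; only the sign convention and the averaging are the point here.

```python
# L_RL as the negative mean reward over H-step rollouts (dummy data).
import numpy as np

H, n_rollouts = 100, 5
rewards = np.random.rand(n_rollouts, H)   # r_i(s_t, a_t) per step and rollout
loss_rl = -rewards.mean()                 # = -E[(1/H) * sum_t r_i(s_t, a_t)]
```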
(Reference) On-Policy vs. Off-Policy
On-policy
• Methods where the behavior policy and the target policy (the policy used for value estimation) are the same
  • = the policy being learned and the policy generating samples are the same
  • e.g., an ε-greedy policy
Off-policy
• Methods where the behavior policy and the target policy differ
  • = the policy being learned and the policy generating samples differ
※ In MAML's case, the adapted parameters are generated from a small number of per-task samples at train time, so a large gap between that data distribution and the test-time one causes trouble (= it cannot naively be made off-policy), as I understand it
① Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
Paper ①
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
• https://arxiv.org/abs/1903.08254 (Submitted on 19 Mar 2019)
• Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, Sergey Levine
• The familiar UC Berkeley (BAIR)
  • deep RL is increasingly becoming synonymous with "UC Berkeley"
• Author implementation
  • https://github.com/katerakelly/oyster
  • PyTorch (unusual for BAIR), built on rlkit
TL;DR
• Proposes PEARL, an off-policy method for meta-learning in RL
• Infers a probabilistic latent variable over the context
• Aggregating experience in a permutation-invariant way enables fast optimization without overfitting to the new task
• 20-100x higher sample efficiency than existing meta-RL methods
Motivation
Drawbacks of existing (mainly MAML-based) meta-RL methods
• Both meta-training and adaptation depend on on-policy data, so sample efficiency is low
  • MAML must perform the same operation at meta-train and meta-test time, so it cannot naively be made off-policy
• When adapting to a new task, they cannot reason about task uncertainty
  • this becomes a problem when rewards are sparse
Proposed Method
Overview of the Proposed Method
• Proposes PEARL, which infers a probabilistic context variable online on top of an off-policy RL algorithm (soft actor-critic, SAC [Haarnoja+ 2018])
• Aims at both sample-efficient meta-training and fast adaptation
• At meta-train time, the encoder uses past experience to infer a probabilistic context variable under which the policy can execute the task
• At meta-test time, a context variable is sampled and held fixed for the episode; the conditioned policy adapts to the new task
• As a result, the policy is optimized with off-policy data, while the encoder is optimized on-policy so as to reduce the distribution mismatch between meta-train and meta-test
Problem Setup
Consider a distribution $p(\mathcal{T})$ over a space of MDP tasks
• Each task consists of $\mathcal{T} = \{p(s_0), p(s_{t+1} \mid s_t, a_t), r(s_t, a_t)\}$
  • i.e., an initial-state distribution, transition probabilities, and a reward function
• This setup admits task distributions that differ in transition probabilities or in reward function
  • different transition probabilities: e.g., robots with different dynamics
  • different reward functions: e.g., navigation to different locations
• A single transition in task $\mathcal{T}$ is $c^{\mathcal{T}}_n = (s_n, a_n, r_n, s'_n)$
• The experience collected so far is $c = c_{1:N}$
• At test time, a new task is sampled from $p(\mathcal{T})$ (a small code sketch of this task distribution follows)
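A minimal sketch of what such a task distribution can look like in code, using a target-velocity task family in the spirit of Half-Cheetah-Vel; all names here are hypothetical.

```python
# Tasks share states/actions but differ in reward function (illustrative).
import random
from dataclasses import dataclass

@dataclass
class Task:
    target_vel: float                       # what distinguishes this task
    def reward(self, state_vel: float) -> float:
        # the reward function r(s_t, a_t) differs per task
        return -abs(state_vel - self.target_vel)

def sample_task() -> Task:
    # T ~ p(T): e.g., sample a target velocity per task
    return Task(target_vel=random.uniform(0.0, 3.0))

task = sample_task()
print(task.reward(state_vel=1.5))
```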
Proposed Method: Learning a Probabilistic Latent Context
• To adapt, the latent variable $z$ must encode the distinguishing information about the task
• Variational inference is used to infer $z$
• Define an approximate posterior (inference network) $q_\phi(z \mid c)$
• Taking the log-likelihood as the objective, the variational lower bound is (sketch below):
$$\mathbb{E}_{\mathcal{T}}\Big[\mathbb{E}_{z \sim q_\phi(z \mid c^{\mathcal{T}})}\big[R(\mathcal{T}, z) + \beta D_{KL}\big(q_\phi(z \mid c^{\mathcal{T}}) \,\|\, p(z)\big)\big]\Big]$$
• The prior $p(z)$ is a Gaussian
• The parameters $\phi$ of $q_\phi(z \mid c)$ are optimized at meta-train time; at meta-test time, $z$ is inferred from the collected experience
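A minimal PyTorch sketch of the KL-regularized objective above, assuming a diagonal-Gaussian posterior and a unit-Gaussian prior; the $R(\mathcal{T}, z)$ term is left as a placeholder since in the paper it comes from the RL objective.

```python
# KL term of the variational objective (dummy encoder outputs).
import torch
from torch.distributions import Normal, kl_divergence

z_dim = 5
mu = torch.zeros(z_dim, requires_grad=True)        # f^mu_phi(c) (dummy)
log_sigma = torch.zeros(z_dim, requires_grad=True) # log f^sigma_phi(c)

posterior = Normal(mu, log_sigma.exp())            # q_phi(z|c)
prior = Normal(torch.zeros(z_dim), torch.ones(z_dim))  # p(z), unit Gaussian

beta = 0.1
kl_term = beta * kl_divergence(posterior, prior).sum()
z = posterior.rsample()            # reparameterized z, would condition pi/Q
rl_loss = torch.tensor(0.0)        # placeholder for the R(T, z) term
(rl_loss + kl_term).backward()     # gradients reach phi through both terms
```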
Proposed Method: Learning a Probabilistic Latent Context
• Since the environment is assumed to be an MDP, the task can be inferred given access to the set of transitions $\{s_i, a_i, s'_i, r_i\}$, regardless of their order
  • so a permutation-invariance assumption can be made
• The inference network can therefore be treated as a product of independent factors (sketch below):
$$q_\phi(z \mid c_{1:N}) \propto \prod_{n=1}^{N} \Psi_\phi(z \mid c_n)$$
• each factor is Gaussian: $\Psi_\phi(z \mid c_n) = \mathcal{N}\big(f^\mu_\phi(c_n), f^\sigma_\phi(c_n)\big)$
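Because each factor is Gaussian, the product can be computed in closed form: precisions add, and the mean is precision-weighted. A short numerical sketch (dummy factor parameters):

```python
# Product of N Gaussian factors Psi_phi(z|c_n) -> a single Gaussian posterior.
import torch

def product_of_gaussians(mus, sigmas):
    """mus, sigmas: (N, z_dim) per-factor means and standard deviations."""
    precisions = sigmas.pow(-2)
    var = 1.0 / precisions.sum(dim=0)           # combined variance
    mu = var * (mus * precisions).sum(dim=0)    # precision-weighted mean
    return mu, var.sqrt()

N, z_dim = 10, 5
mu, sigma = product_of_gaussians(torch.randn(N, z_dim),
                                 torch.rand(N, z_dim) + 0.1)
# Shuffling the N factors leaves (mu, sigma) unchanged: permutation invariance.
```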
Proposed Method: Off-Policy Meta-Reinforcement Learning
• The data used to train the encoder $q_\phi(z \mid c)$ and the data used to train the policy need not be the same
• The actor and critic are trained on data sampled from the entire replay buffer $\mathcal{B}$
• The sampler $\mathcal{S}_c$ that feeds the encoder draws from the most recent trajectories in the replay buffer
  • it need not be fully on-policy, but using the entire replay buffer would create too large a mismatch with the on-policy test data (a sketch of the two samplers follows)
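A toy sketch of the two samplers, with a plain list standing in for the replay buffer; the "recent" window size is an assumed value.

```python
# RL batches come from all of B; context batches only from recent data.
import random

buffer = list(range(100_000))        # stand-in for time-ordered transitions
recent_window = 5_000                # "recent" cutoff (assumed value)

def sample_rl_batch(batch_size=256):
    return random.sample(buffer, batch_size)    # off-policy: whole buffer B

def sample_context(batch_size=64):
    # sampler S_c: near on-policy, from recently collected trajectories
    return random.sample(buffer[-recent_window:], batch_size)
```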
Proposed Method: Off-Policy Meta-Reinforcement Learning
• Extends Soft Actor-Critic (SAC) [Haarnoja+ 2018] with the probabilistic context variable $z$
  • SAC is an off-policy actor-critic method for maxEnt RL (entropy regularization is part of the objective)
• The encoder and the actor/critic are optimized jointly via the reparameterization trick
• Critic loss (sketch below):
$$\mathcal{L}_{critic} = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B},\ z \sim q_\phi(z \mid c)}\Big[\big(Q_\theta(s, a, z) - (r + \bar{V}(s', z))\big)^2\Big]$$
• Actor loss:
$$\mathcal{L}_{actor} = \mathbb{E}_{s \sim \mathcal{B},\ a \sim \pi_\theta}\Big[D_{KL}\Big(\pi_\theta(a \mid s, z)\ \Big\|\ \frac{\exp\big(Q_\theta(s, a, z)\big)}{\mathcal{Z}_\theta(s)}\Big)\Big]$$
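A minimal sketch of the critic loss above, with dummy tensors and linear stand-ins for $Q_\theta$ and the target value network; per the joint-optimization bullet, $z$ is drawn with `rsample()` so the critic loss can also train the encoder.

```python
# Context-conditioned critic loss (shapes and networks are illustrative).
import torch

s = torch.randn(256, 20); a = torch.randn(256, 6)
r = torch.randn(256, 1);  s_next = torch.randn(256, 20)

# z ~ q_phi(z|c) via rsample(), so critic-loss gradients also reach phi
mu = torch.zeros(256, 5, requires_grad=True)       # encoder outputs (dummy)
z = torch.distributions.Normal(mu, torch.ones(256, 5)).rsample()

q_net = torch.nn.Linear(20 + 6 + 5, 1)             # Q_theta(s, a, z)
v_target = torch.nn.Linear(20 + 5, 1)              # target network V-bar

with torch.no_grad():
    target = r + v_target(torch.cat([s_next, z], dim=-1))
q = q_net(torch.cat([s, a, z], dim=-1))
critic_loss = ((q - target) ** 2).mean()           # L_critic from the slide
critic_loss.backward()                             # updates theta and phi
```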
Experiments & Results
Experiment ①: Comparison with Existing Meta-RL Methods
• Experiments in six MuJoCo environments
  • Half-Cheetah, Humanoid, Ant, Walker (two variants each of Half-Cheetah and Ant)
  • tasks differ either in reward function or in dynamics
  • so adaptation is required
• 20-100x better sample efficiency than the baselines; final performance is also higher
  • x-axis: number of meta-training samples
  • y-axis: average return
Experiment ②: Sampling from the Encoder
• Comparison with an on-policy method (MAESN [Gupta+ 2018]) under sparse rewards
• Evaluated on sparse navigation
  • at meta-test time, the agent receives reward only once it enters the dark-blue goal circle
• Confirms that sampling from the approximate posterior is effective when rewards are sparse
  • return increases as the number of context transitions grows
  • performance is higher than MAESN
Experiment ③: Ablation Study
• On the encoder architecture
• Evaluated on Half-Cheetah-Vel
• Compared against RNN-based encoders
  • RNN-tran: sampling de-correlated transitions
  • RNN-traj: sampling whole trajectories
• The permutation-invariant encoder architecture performed best
Experiment ③: Ablation Study
• On the data-sampling strategy
• Evaluated on Half-Cheetah-Vel
• Varies which data is fed to the encoder
  • off-policy: fully off-policy (sampled from the whole buffer)
  • off-policy RL-batch: the same batch as the policy
• The proposed method (PEARL), which feeds the encoder samples from the recent buffer, performed best
Experiment ③: Ablation Study
• Probabilistic vs. deterministic context
• Evaluated on sparse navigation
• With a deterministic context, returns are overwhelmingly lower
  • because task uncertainty is not modeled, so effective exploration is impossible
Summary
Summary
• Proposed PEARL, an off-policy method for meta-RL
• Inferring a probabilistic context variable online from past experience and conditioning the policy on it makes off-policy training over the entire buffer possible
• Experimentally showed higher meta-training sample efficiency than existing meta-RL methods
② Guided Meta-Policy Search
Paper ②
Guided Meta-Policy Search
• https://arxiv.org/abs/1904.00956 (Submitted on 1 Apr 2019)
• Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, Chelsea Finn
• UC Berkeley (BAIR)
  • no surprise there…
• Author implementation
  • https://github.com/RussellM2020/GMPS
• Website
  • https://sites.google.com/berkeley.edu/guided-metapolicy-search
TL;DR
• Proposes GMPS, an off-policy method for meta-learning in RL
• Builds on the observation that, at meta-train time, the policies usually do not have to be learned with RL
• Trains the meta-objective (the outer objective) of meta-training as imitation learning (behaviour cloning), improving stability and learning efficiency
• To this end, meta-training is explicitly split into two phases: task learning and meta-learning
Motivation
Drawbacks of existing (mainly MAML-based) meta-RL methods
• Both meta-training and adaptation depend on on-policy data, so sample efficiency is low
  • same as the other paper [Rakelly+ 2019]
• In particular, since meta-training is decoupled from the meta-test policy, we would like to improve sample efficiency by exploiting rich rewards or demonstrations
Proposed Method
Overview of the Proposed Method
• Turns the meta-objective (the outer objective) of meta-training into supervised learning (behaviour cloning), aiming for more stable training and higher sample efficiency
• Splits meta-training into two phases
  • ① task learning: learn a policy for each individual meta-training task
    • these policies are not used at meta-test time; expert data as in imitation learning also works
  • ② meta-learning: use the policies learned in ① to train at the meta level in a supervised manner
Problem Setup
Essentially the same as the other paper [Rakelly+ 2019]
• Consider a distribution $p(\mathcal{T})$ over the task space
• Each task consists of $\mathcal{T} = \{p(s_0), p(s_{t+1} \mid s_t, a_t), r(s_t, a_t)\}$
  • i.e., an initial-state distribution, transition probabilities, and a reward function
• At test time, a new task is sampled from $p(\mathcal{T})$
Proposed Method
① Task-learning phase
• Solve each individual meta-training task $\mathcal{T}_i$ to obtain a set of optimal/near-optimal policies $\{\pi^*_i\}$
  • these can be treated as experts
② Meta-learning phase
• As in MAML-based meta-RL, the goal is to optimize $\mathcal{L}_{RL}(\phi_i, \mathcal{D}_i)$
  • $\phi_i$ is the parameter obtained by adapting $\theta$ to task $\mathcal{T}_i$ with gradient descent
  • so the inner objective is the same as in MAML-based meta-RL
• The outer objective becomes supervised (behaviour cloning; sketch below):
$$\mathcal{L}_{BC}(\phi_i, \mathcal{D}_i) \triangleq -\sum_{(s_t, a_t) \in \mathcal{D}} \log \pi_\phi(a_t \mid s_t)$$
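A minimal sketch of $\mathcal{L}_{BC}$, assuming a Gaussian policy head with unit variance (an assumption, not the paper's architecture):

```python
# Behaviour-cloning loss: negative log-likelihood of expert actions.
import torch

states = torch.randn(64, 20)              # (s_t, a_t) pairs from expert data
expert_actions = torch.randn(64, 6)

policy_mean = torch.nn.Linear(20, 6)      # pi_phi(a|s) = N(mean(s), I)
dist = torch.distributions.Normal(policy_mean(states), 1.0)
bc_loss = -dist.log_prob(expert_actions).sum(dim=-1).mean()   # L_BC
```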
Proposed Method
② Meta-learning phase in detail
• For each meta-training task $\mathcal{T}_i$, roll out the policy $\pi^*_i$ to build a dataset $\mathcal{D}^*_i$ of expert trajectories
• Use this dataset to update the policy according to the meta-objective (sketch below):
$$\min_\theta \sum_{\mathcal{T}_i} \sum_{\mathcal{D}^{val}_i \sim \mathcal{D}^*_i} \mathbb{E}_{\mathcal{D}^{tr}_i \sim \pi_\theta}\Big[\mathcal{L}_{BC}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{RL}(\theta, \mathcal{D}^{tr}_i),\ \mathcal{D}^{val}_i\big)\Big]$$
• Based on the updated parameters $\theta$, the per-task parameters $\phi_i$ can be recomputed and rolled out to grow the dataset $\mathcal{D}^*_i$
⇒ this addresses the compounding error of behaviour cloning
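Putting the pieces together, a sketch of one meta-update: an inner policy-gradient step on $\theta$, then the BC loss on the adapted $\phi_i$ against expert data. The simple REINFORCE-style inner loss and all dummy data are assumptions, not the authors' implementation.

```python
# One GMPS-style meta-update: inner RL step, outer behaviour cloning.
import torch

policy = torch.nn.Linear(20, 6)                    # mean of pi(a|s)
alpha = 0.01

def log_prob(params, s, a):
    w, b = params
    return torch.distributions.Normal(s @ w.t() + b, 1.0).log_prob(a).sum(-1)

theta = [policy.weight, policy.bias]

# inner step on on-policy data D^tr_i: phi_i = theta - alpha * grad L_RL
s_tr, a_tr = torch.randn(32, 20), torch.randn(32, 6)
returns = torch.randn(32)                          # rewards-to-go (dummy)
loss_rl = -(log_prob(theta, s_tr, a_tr) * returns).mean()
grads = torch.autograd.grad(loss_rl, theta, create_graph=True)
phi = [t - alpha * g for t, g in zip(theta, grads)]

# outer step: behaviour cloning on expert data D^val_i ~ D*_i, through phi_i
s_val, a_val = torch.randn(32, 20), torch.randn(32, 6)
bc_loss = -log_prob(phi, s_val, a_val).mean()
bc_loss.backward()                                 # gradient w.r.t. theta
```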
Characteristics of the Proposed Method
• Explicitly splits the meta-learning problem into a task-learning phase and a meta-learning phase
Merits of the proposed method
• This makes it possible to reuse previously trained policies or demonstrations
• Turning the outer loop into supervised learning stabilizes training
• Information available only at meta-training time can be exploited for meta-RL
  • e.g., reward shaping, or low-dimensional state representations such as object positions
• Like MAML-based meta-RL, it can keep learning as more data accumulates
Implementation of the Proposed Method
Optimizing the expert policies
• Algorithmically, a separate policy could be trained per task, but training a contextual policy $\pi_\theta(a_t \mid s_t, \omega)$ is more efficient
  • $\omega$ is a variable describing the task (if the task is known, a goal position or a task ID suffices)
  • since it is only used during meta-training anyway
• At meta-test time, only the raw reward is used, not the rewards injected during meta-training
• The experiments in this paper build on soft actor-critic (SAC) [Haarnoja+ 2018]
Implementation of the Proposed Method
Optimization algorithm
• The behaviour-cloning meta-objective allows multiple gradient-based updates
• However, every update changes the base parameters $\theta$, so $\phi_i$ must be recomputed
• We would like to do this without sampling new data from the policy $\pi_\theta$
⇒ reweight the gradient with importance weights (sketch below):
$$\phi_i = \theta + \alpha\, \mathbb{E}_{\tau \sim \pi_{\theta_{init}}}\Big[\frac{\pi_\theta(\tau)}{\pi_{\theta_{init}}(\tau)}\, \nabla_\theta \log \pi_\theta(\tau)\, A_i(\tau)\Big]$$
  • $A_i$ is the advantage function
• The behaviour-cloning update: $\theta \leftarrow \theta - \beta \nabla_\theta \mathcal{L}_{BC}(\phi_i, \mathcal{D}^{val}_i)$
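A sketch of the importance-weighted recomputation of $\phi_i$: stored rollouts from $\pi_{\theta_{init}}$ are reweighted by $\pi_\theta(\tau)/\pi_{\theta_{init}}(\tau)$ so the inner gradient can be re-evaluated without fresh samples. All tensors and the Gaussian trajectory likelihood are illustrative assumptions.

```python
# Importance-weighted inner gradient, without new rollouts (illustrative).
import torch

policy = torch.nn.Linear(20, 6)          # current theta
states = torch.randn(16, 50, 20)         # 16 stored trajectories of length 50
actions = torch.randn(16, 50, 6)
advantages = torch.randn(16)             # A_i(tau), precomputed per trajectory
logp_init = torch.randn(16)              # log pi_theta_init(tau), stored

dist = torch.distributions.Normal(policy(states), 1.0)
logp_theta = dist.log_prob(actions).sum(dim=(-1, -2))   # log pi_theta(tau)

weights = (logp_theta - logp_init).exp().detach()       # pi_theta/pi_theta_init
surrogate = (weights * logp_theta * advantages).mean()
grad = torch.autograd.grad(surrogate, policy.parameters())
# phi_i = theta + alpha * grad, then reused in the outer BC update
```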
Experiments & Results
Experimental Setup
• Robot arm
  • Pushing (full state)
    • push a block to a specific goal; the goal position must be inferred by trial and error
    • end-effector and block positions are given
  • Pushing (vision)
    • image observations only
  • Door opening
    • open a door to a specific angle
    • the goal angle must be inferred by trial and error
• Quadruped locomotion (Ant)
  • reach a goal
• Videos of the tasks are at https://sites.google.com/berkeley.edu/guided-metapolicy-search
Experiment ①: Meta-Reinforcement Learning
• Comparison with existing meta-RL methods
• Assumes access to the task context (information that uniquely identifies the task) during meta-training
• The SAC-based proposed method is more sample efficient
• x-axis: number of meta-training samples; y-axis: average return
Experiment ②: Meta-Learning with Demonstrations
• Comparison with existing meta-RL methods
• Evaluated on Door Opening and Ant as sparse-reward settings
  • confirms the proposed method attains high performance via exploration even under sparse rewards
• Evaluated on arm pushing as an image-based task
  • confirms the proposed method attains high performance stably
Summary
Summary
• Proposed GMPS, an off-policy method for meta-RL
• Splitting meta-training into a task-learning phase and a meta-learning phase made it possible to introduce supervised learning (behaviour cloning), which is more stable and more sample efficient
• Experimentally showed higher meta-training sample efficiency than existing meta-RL methods
Closing Remarks
Thoughts
• Two recent streams in meta-learning:
  • models that adapt with a one-step gradient-descent update (mainly BAIR)
    • e.g., MAML [Finn+ 2017] and related methods
  • models that adapt by conditioning on a latent variable (mainly DeepMind)
    • e.g., Neural Processes [Garnelo+ 2018], GQN [Eslami+ 2018]
• Both are different models built on similar motivations
• A unified perspective is discussed in 阿久澤さん's presentation
  • [DL輪読会] Meta-Learning Probabilistic Inference for Prediction
  • https://www.slideshare.net/DeepLearningJP2016/dlmetalearning-probabilistic-inference-forprediction-126167192
• In the end, which is better for which cases? A discussion of the pros and cons of both seems necessary
Appendix
References
[Eslami+ 2018] S. M. Ali Eslami, Danilo Jimenez Rezende, Frédéric Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil C. Rabinowitz, Helen King, Chloe Hillier, Matthew M. Botvinick, Daan Wierstra, Koray Kavukcuoglu and Demis Hassabis. "Neural scene representation and rendering." Science 360 (2018): 1204-1210. http://science.sciencemag.org/content/360/6394/1204
[Finn+ 2017] Chelsea Finn, Pieter Abbeel and Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." Proceedings of the 34th International Conference on Machine Learning, PMLR 70:1126-1135, 2017. http://proceedings.mlr.press/v70/finn17a.html
[Garnelo+ 2018] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami and Yee Whye Teh. "Neural Processes." https://arxiv.org/abs/1807.01622
[Gupta+ 2018] Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel and Sergey Levine. "Meta-Reinforcement Learning of Structured Exploration Strategies." In Advances in Neural Information Processing Systems, 2018. https://nips.cc/Conferences/2018/Schedule?showEvent=12658
[Haarnoja+ 2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel and Sergey Levine. "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1861-1870, 2018. http://proceedings.mlr.press/v80/haarnoja18b.html
[Mendonca+ 2019] Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine and Chelsea Finn. "Guided Meta-Policy Search." https://arxiv.org/abs/1904.00956
[Nagabandi+ 2018] Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine and Chelsea Finn. "Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning." https://arxiv.org/abs/1803.11347
[Nichol+ 2018] Alex Nichol, Joshua Achiam and John Schulman. "On First-Order Meta-Learning Algorithms." https://arxiv.org/abs/1803.02999
[Rakelly+ 2019] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn and Sergey Levine. "Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables." https://arxiv.org/abs/1903.08254