2020/03/06
Deep Learning JP:
http://deeplearning.jp/seminar-2/
DL paper reading group (DL輪読会) material
Efficient Reinforcement Learning via Relabeling, Applying Hindsight Experience Replay
2020.03.06
Presenter: Tatsuya Matsushima @__tmats__ , Matsuo Lab
About This Presentation

RL with sparse rewards
• Long action sequences are needed before any reward is obtained (long horizon)
• The cost of policy exploration is large, which makes learning difficult
• In the extreme case, only a binary reward (success/failure) at the end of the episode is obtained from the environment
• Example: manipulation with a robot arm, ...

Commonly used approaches
• Rewrite the collected data and attach new labels
  • Hindsight Experience Replay (HER): relabeling of goals in the training data
• Use demonstrations without reward labels (imitation learning)
About This Presentation

Recently released papers on these topics:

Using HER
1) Generalized Hindsight for Reinforcement Learning
• https://arxiv.org/abs/2002.11708, https://sites.google.com/view/generalized-hindsight
• 2020/2/26, authors include Pieter Abbeel
2) Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement
• https://arxiv.org/abs/2002.11089
• 2020/2/25, authors include Sergey Levine

Using demonstrations (omitted today)
3) Learning Latent Plans from Play (CoRL 2019)
• https://arxiv.org/abs/1903.01973, https://learning-from-play.github.io/
• CoRL 2019, authors include Sergey Levine
4) Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning (CoRL 2019)
• https://arxiv.org/abs/1910.11956, https://relay-policy-learning.github.io/
• CoRL 2019, authors include Sergey Levine
Hindsight Experience Replay (HER)

Hindsight Experience Replay
• https://arxiv.org/abs/1707.01495
• Experience replay that uses hindsight in goal-conditioned RL
• When the task is not achieved, a goal for which the actions actually taken would have been meaningful is set after the fact, and the episode is included in training under that goal
• Example: besides the original goal, treat the final state of each episode as if it had been the goal
• DL reading group slides (Nakamura): https://www.slideshare.net/DeepLearningJP2016/dlhindsight-experience-replay
• Pieter Abbeel's NIPS 2017 talk: https://www.youtube.com/watch?v=TyOooJC_bLY
Hindsight Experience Replay (HER)
• The policy and Q-function are models conditioned on the goal
• In HER, training uses data whose goals and rewards have been rewritten (relabeled)
Source: slides from Pieter Abbeel's NIPS 2017 talk
Hindsight Experience Replay (HER)
• Not only the original episodes but also data whose goals and rewards have been relabeled are added to the replay buffer
• Many choices are possible for the goal-selection strategy 𝕊 (discussed later); a minimal sketch of the simplest one follows
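To make this concrete, below is a minimal Python sketch of HER's "final" strategy, which reuses an episode's last achieved state as the relabeled goal. The episode layout, `distance_fn`, and the 0/−1 sparse reward are illustrative assumptions, not the paper's exact interface.

```python
import numpy as np

def her_relabel_final(episode, distance_fn, eps=0.05):
    """Relabel an episode with its own final achieved state as the goal
    (HER's 'final' strategy). `episode` is a list of
    (state, action, next_state, original_goal) tuples (illustrative)."""
    new_goal = episode[-1][2]  # last achieved state becomes the goal
    relabeled = []
    for state, action, next_state, _ in episode:
        # Recompute the sparse reward under the new goal:
        # 0 when the goal is reached, -1 otherwise.
        reward = 0.0 if distance_fn(next_state, new_goal) < eps else -1.0
        relabeled.append((state, action, reward, next_state, new_goal))
    return relabeled

# Tiny usage example with 1-D states; both the original and the relabeled
# transitions would be pushed into the replay buffer.
episode = [(np.array([0.0]), 0, np.array([1.0]), np.array([5.0])),
           (np.array([1.0]), 1, np.array([2.0]), np.array([5.0]))]
print(her_relabel_final(episode, lambda s, g: float(np.linalg.norm(s - g))))
```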
Hindsight Experience Replay (HER)
[Four slides of figures reproduced from the DL reading group HER slides (Nakamura): https://www.slideshare.net/DeepLearningJP2016/dlhindsight-experience-replay]
Recent Idea in the Papers Introduced Here

HER was a method that relabels goals represented as states.
So that the idea can be used even when goals are not defined as states, research has appeared that uses inverse RL (IRL) to relabel rewards in hindsight
• As a result, it becomes applicable to multi-task RL

1) Generalized Hindsight for Reinforcement Learning
• https://arxiv.org/abs/2002.11708 (2020/2/26)
2) Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement
• https://arxiv.org/abs/2002.11089 (2020/2/25)

Papers in the same direction came out of the same university (UC Berkeley) one day apart (coincidence?)
• The latter seems to propose the deeper framework
① Generalized Hindsight for Reinforcement Learning

Generalized Hindsight for Reinforcement Learning
• Alexander C. Li, Lerrel Pinto, Pieter Abbeel
• Submitted on 26 Feb 2020
• arXiv: https://arxiv.org/abs/2002.11708
• website: https://sites.google.com/view/generalized-hindsight
• Proposes Generalized Hindsight, which uses IRL to extend HER so that it can be applied to multi-task RL
① Generalized Hindsight for Reinforcement Learning

Problem with HER
• HER was a method for goal-conditioned RL
• The goal must be representable as a state, but this setting is not general

Goal of this paper
• Apply HER-style ideas to multi-task RL settings that share dynamics but differ in reward function, improving sample efficiency
• MDPs whose reward function is written r(⋅ | z) (task distribution 𝒯, task variable z ∼ 𝒯)
① Generalized Hindsight for Reinforcement Learning

Hindsight Relabeling
• Let 𝕊 denote the strategy for choosing the hindsight task variable
• Proposes Approximate IRL relabeling (AIR) and Advantage relabeling
• As is often the case in RL, the form of the reward function is treated as known (presumably)
  • r can be evaluated for arbitrary s, a, z
• Setting r(s, a | z = g) = 𝟙[d(s, g) < ε] recovers HER (a sketch of such a reward follows)
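As a sketch, the indicator reward just mentioned might look as follows in Python; the Euclidean distance and the threshold ε are illustrative choices.

```python
import numpy as np

def reward(state, action, z, eps=0.05):
    """Task-conditioned reward r(s, a | z). With this indicator form,
    relabeling the task variable z is exactly HER's goal relabeling."""
    return 1.0 if np.linalg.norm(state - z) < eps else 0.0
```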
① Generalized Hindsight for Reinforcement Learning

Relabeling method 1: Approximate IRL relabeling (AIR), 𝕊_IRL
• Sample K tasks {v_j}_{j=1}^K from the task distribution 𝒯, and return the m task variables for which the trajectory's return R(τ, v_j) has the highest percentile P̂(τ, v_j) within the set 𝒟 of N trajectories containing τ
• For large K, several tasks may end up with the same percentile, so in practice the advantage Â(τ, z) = R(τ | z) − V^π(s_0, z) is used instead of the raw return
• This counts as IRL because IRL, in its simplest form, can be viewed as finding an r* that satisfies 𝔼[∑_{t=0}^{T−1} γ^t r*(s_t) | π_E] ≥ 𝔼[∑_{t=0}^{T−1} γ^t r*(s_t) | π] for all π
(A sketch of AIR follows.)
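A rough Python sketch of AIR as described on this slide. Trajectories are assumed to be lists of (state, action) pairs, and `sample_task` and `reward_fn` stand in for z ∼ 𝒯 and r(s, a | z); this follows the slide's description, not the paper's exact implementation.

```python
import numpy as np

def air_relabel(traj, dataset, sample_task, reward_fn, K=100, m=1):
    """Approximate IRL relabeling (AIR): return the m candidate tasks
    under which `traj` ranks highest (by return percentile) among the
    N trajectories in `dataset`."""
    def ret(t, v):  # trajectory return R(t | v)
        return sum(reward_fn(s, a, v) for s, a in t)

    candidates = [sample_task() for _ in range(K)]
    percentiles = []
    for v in candidates:
        returns = np.array([ret(t, v) for t in dataset])
        # Fraction of trajectories whose return under v does not exceed
        # traj's return under v, i.e. traj's percentile for task v.
        percentiles.append((returns <= ret(traj, v)).mean())
    best = np.argsort(percentiles)[-m:]
    return [candidates[i] for i in best]
```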
① Generalized Hindsight for Reinforcement Learning

Relabeling method 2: Advantage relabeling, 𝕊_A
• AIR is computationally expensive (𝒪(NT))
• Instead, return the m task variables with the largest advantage Â(τ, z) = R(τ | z) − V^π(s_0, z)
• Empirically this works well
• For SAC, V^π(s, z) = min(Q_1(s, π(s | z), z), Q_2(s, π(s | z), z)) (sketched below)
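A corresponding sketch of advantage relabeling; the function handles q1, q2, and policy are illustrative stand-ins, with the SAC-style value taken as the minimum of the two Q networks, as on the slide.

```python
import numpy as np

def advantage_relabel(traj, candidates, reward_fn, q1, q2, policy, m=1):
    """Return the m candidate tasks z maximizing the advantage
    A(traj, z) = R(traj | z) - V(s0, z), with V approximated SAC-style
    by the minimum of two Q networks at the policy's action."""
    s0 = traj[0][0]
    scores = []
    for z in candidates:
        ret = sum(reward_fn(s, a, z) for s, a in traj)
        v0 = min(q1(s0, policy(s0, z), z), q2(s0, policy(s0, z), z))
        scores.append(ret - v0)
    best = np.argsort(scores)[-m:]
    return [candidates[i] for i in best]
```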
① Generalized Hindsight for Reinforcement Learning

Experiments
• In (a) and (b), the reward heatmap differs across tasks
• (c) combines a reward based on the position of the hand's tip with energy and safety rewards; the rewarded position and the respective weights differ across tasks
• (d) uses rewards for speed, direction, height, and energy consumption, with different weights per task
• (e) varies the reward with the correctness of the direction of travel
① Generalized Hindsight for Reinforcement Learning

Results
• AIR and Advantage relabeling outperformed the other baselines in sample efficiency and final return
• The IU baseline selects tasks at random
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement
• Benjamin Eysenbach, Xinyang Geng, Sergey Levine, Ruslan Salakhutdinov
• Submitted on 25 Feb 2020
• arXiv: https://arxiv.org/abs/2002.11089
• Proposes Hindsight Inference for Policy Improvement (HIPI), which uses IRL to extend HER so that it can be applied to multi-task RL
• The motivation is the same as ①
• Combines MaxEnt RL and MaxEnt IRL
  • This is where it differs from ①
They even use the same figure... (though this paper was released first)
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Notation from here on
• Multi-task RL: the reward function r_ψ(s, a) depends on a task variable ψ ∈ Ψ
  • The same as what ① wrote as z ∼ 𝒯
• Task prior p(ψ)
• Trajectory distribution under policy q: q(τ) = p_1(s_1) ∏_t p(s_{t+1} | s_t, a_t) q(a_t | s_t)
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Background: MaxEnt RL (single task)
• Training should move the policy's trajectory distribution q(τ) toward p(τ) ≜ (1/Z) p_1(s_1) ∏_t p(s_{t+1} | s_t, a_t) e^{r(s_t, a_t)}
• Viewed as minimizing the reverse KL between q(τ) and p(τ), the objective becomes maximization of the entropy-regularized reward:
  −D_KL(q ∥ p) = 𝔼_q[∑_t (r_t − log q(a_t | s_t))] − log Z
• The partition function does not depend on the policy, so RL algorithms need not consider it
• Example: Soft Actor-Critic (SAC)
(A toy numeric check of this identity follows.)
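The identity above is easy to verify numerically in a one-step, discrete-action case; the numbers below are arbitrary.

```python
import numpy as np

# One-step check of -D_KL(q||p) = E_q[r - log q] - log Z
# for p(a) = e^{r(a)} / Z over a discrete action set.
r = np.array([1.0, 0.5, -0.2])   # rewards r(a)
q = np.array([0.6, 0.3, 0.1])    # an arbitrary policy q(a)
Z = np.exp(r).sum()
p = np.exp(r) / Z

lhs = -(q * np.log(q / p)).sum()                # -D_KL(q || p)
rhs = (q * (r - np.log(q))).sum() - np.log(Z)   # E_q[r - log q] - log Z
assert np.isclose(lhs, rhs)
```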
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Background: MaxEnt IRL
• Given task ψ, the trajectory distribution is taken to be p(τ | ψ) = (1/Z(ψ)) p_1(s_1) ∏_t p(s_{t+1} | s_t, a_t) e^{r_ψ(s_t, a_t)}
• where the partition function is Z(ψ) ≜ ∫ p_1(s_1) ∏_t p(s_{t+1} | s_t, a_t) e^{r_ψ(s_t, a_t)} dτ
• By Bayes' rule, the task posterior is p(ψ | τ) = p(τ | ψ) p(ψ) / p(τ) ∝ p(ψ) e^{∑_t r_ψ(s_t, a_t) − log Z(ψ)}
• Computing the partition function is hard (an integral over all states and actions), but from the MaxEnt RL view, log Z(ψ) = max_{q(τ|ψ)} 𝔼_{q(τ|ψ)}[∑_t r_ψ(s_t, a_t) − log q(a_t | s_t, ψ)]
• Single-task IRL is a framework for recovering the reward function; here it is used to infer ψ (see the sketch below)
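A sketch of the resulting task posterior over a finite set of candidate tasks, with log Z(ψ) approximated by the MaxEnt RL soft value as on the slide; the array-based interface is an illustrative assumption.

```python
import numpy as np

def task_posterior(log_prior, traj_rewards, soft_values):
    """Posterior over K candidate tasks psi_k, following
    log p(psi_k | tau) ∝ log p(psi_k) + sum_t r_{psi_k}(s_t, a_t) - log Z(psi_k),
    where soft_values[k] approximates log Z(psi_k). All inputs are
    length-K arrays."""
    logits = log_prior + traj_rewards - soft_values
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Usage with two candidate tasks and a uniform prior:
print(task_posterior(np.log([0.5, 0.5]),
                     np.array([3.0, 1.0]),    # sum of rewards along tau
                     np.array([2.0, 2.0])))   # soft-value estimates of log Z
```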
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

MaxEnt RL (multi-task)
• Training should move the policy q(τ, ψ) toward p(τ, ψ) ≜ (1/Z) p_1(s_1) ∏_t p(s_{t+1} | s_t, a_t) e^{r_ψ(s_t, a_t)}
• Factorizing q(τ, ψ) = q(τ | ψ) p(ψ), the task-conditioned policy q(τ | ψ) is obtained by maximizing 𝔼_{ψ∼q(ψ), τ∼q(τ|ψ)}[∑_t (r_ψ(s_t, a_t) − log q(a_t | s_t, ψ))]
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Hindsight relabeling via MaxEnt IRL
• Factorizing the multi-task policy instead as q(τ, ψ) = q(ψ | τ) q(τ), the relabeling distribution q(ψ | τ) is obtained by maximizing the same objective
• Solving this gives q(ψ | τ) ∝ p(ψ) e^{∑_t r_ψ(s_t, a_t) − log Z(ψ)}
  • This is exactly the task posterior from MaxEnt IRL
• Considering a single-step state transition instead of a whole trajectory: q(ψ | s_t, a_t) ∝ p(ψ) e^{Q̃_q(s_t, a_t, ψ) − log Z(ψ)} (sketched below)
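Per-transition relabeling then amounts to a softmax over candidate tasks; in the sketch below, `soft_q[k]` stands in for Q̃(s_t, a_t, ψ_k) and the arrays are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def relabel_transition(log_prior, soft_q, log_z):
    """Sample a hindsight task index for a single (s_t, a_t) from
    q(psi | s_t, a_t) ∝ p(psi) exp(Q~(s_t, a_t, psi) - log Z(psi))."""
    logits = log_prior + soft_q - log_z
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```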
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Hindsight Relabeling
• The paper's claim: "hindsight relabeling is inverse RL"
• If the reward function is taken to be a sparse goal-reaching indicator, the relabeling distribution reduces to relabeling with goal states, which is the same as HER
• This matches paper ①
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Using hindsight relabeling
• Proposes a variant that relabels and then runs RL (HIPI-RL) and one that runs behavioral cloning (HIPI-BC); a sketch of the two pipelines follows
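A schematic of how the two variants might consume the same relabeled data; every function handle here is an illustrative stand-in, not the paper's API.

```python
def hipi_update(batch, relabel_fn, sac_update=None, bc_update=None):
    """Relabel a batch of transitions, then feed it to either pipeline:
    HIPI-RL passes it to an off-policy RL learner (e.g. a SAC update),
    HIPI-BC treats it as (state, task) -> action supervision."""
    relabeled = [relabel_fn(transition) for transition in batch]
    if sac_update is not None:
        sac_update(relabeled)  # HIPI-RL
    if bc_update is not None:
        bc_update(relabeled)   # HIPI-BC
```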
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Experiments
• As in ①'s experiments, settings where the reward changes with the task variable ψ
• ψ specifies things like the target direction of travel, target coordinates, and speed
• Some are tasks specified by goals
② Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Results
• On goal-specified tasks, using IRL improves sample efficiency
  • These are the conditions under which a comparison with HER is possible
• Both HIPI-RL and HIPI-BC improve performance over random relabeling
Summary and Impressions

Summary
• By using IRL, HER can be extended to multi-task learning in which rewards are specified by something other than a goal state

Impressions
• Learning efficiently from data across multiple tasks seems like a realistic direction, but
  • I am not sure how realistic it is for data from multiple tasks to arrive online
  • Carrying the idea over to offline data seems plausible?
• Is the setting "the reward function is known but not goal-conditioned" really that common?
  • The reward function must be engineered (subjective human annotation would make this hard)
  • Perhaps one could build a separate reward-estimation model, though...