January 15, 2021
Slide overview
Published on Jan 15, 2021
2021/01/15
Deep Learning JP:
http://deeplearning.jp/seminar-2/
DL reading group material
DEEP LEARNING JP [DL Papers] Why Deep RL fails? A brief survey of recent works. Presenter: Kei Ota (@ohtake_i). http://deeplearning.jp/
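• Deep reinforcement learning (DRL) is often unstable compared with classification and regression.
• This talk gives a broad, shallow survey of papers that clarify, theoretically and empirically, what causes DRL's instability and how it can be mitigated.
• Papers covered:
  – Deep Reinforcement Learning that Matters (AAAI)
  – Deep Reinforcement Learning and the Deadly Triad (arXiv)
  – Diagnosing Bottlenecks in Deep Q-learning Algorithms (ICML)
  – Revisiting Fundamentals of Experience Replay (ICML)
  – Implicit under-parameterization inhibits data-efficient deep reinforcement learning (ICLR)
  – D2RL: Deep Dense Architectures in Reinforcement Learning (arXiv)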
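Deep Reinforcement Learning that Matters
• Shows empirically that DRL has the following properties:
  – Results change substantially with hyperparameters, activation function, and implementation.
  – Unlike regression and classification, making the network larger does not necessarily improve performance.
• Why is deep RL hard?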
Deadly Triad
• Deadly triad: the failures of deep Q-learning (DQL) are attributed to three techniques used in learning:
  – Bootstrapping (TD learning)
  – Function approximation
  – Off-policy learning
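Deadly Triad
• An example in which the value function diverges because of the deadly triad:
  – Two states $s_1$ and $s_2$ with features $\phi(s_1) = 1$, $\phi(s_2) = 2$.
  – Linear value estimate $v(s) = w \times \phi(s)$, so $v(s_1) = w$ and $v(s_2) = 2w$.
  – Repeated TD updates of $v(s_1)$ make $w$ diverge when $\gamma > 1/2$.
• Intuitive interpretation of this problem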
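The two-state example can be checked numerically. The sketch below is a minimal simulation under stated assumptions: the reward is zero, only the transition from $s_1$ to $s_2$ is ever updated (the off-policy part), and the step size `alpha`, the initial weight `w0`, and the function name `simulate_td_divergence` are hypothetical choices made for illustration.

```python
def simulate_td_divergence(gamma, alpha=0.1, w0=1.0, steps=200):
    """Apply the semi-gradient TD(0) update for the s1 -> s2 transition only."""
    phi_s1, phi_s2 = 1.0, 2.0  # features from the example
    w = w0
    for _ in range(steps):
        # TD error with an assumed reward of 0: gamma * v(s2) - v(s1)
        td_error = 0.0 + gamma * w * phi_s2 - w * phi_s1
        # Semi-gradient step: the gradient of v(s1) with respect to w is phi(s1)
        w += alpha * td_error * phi_s1
    return w


if __name__ == "__main__":
    for gamma in (0.4, 0.5, 0.9):
        print(gamma, simulate_td_divergence(gamma))
```

Each step multiplies $w$ by $1 + \alpha(2\gamma - 1)$, so the weight shrinks for $\gamma < 1/2$, stays fixed at $\gamma = 1/2$, and grows without bound for $\gamma > 1/2$.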
Deadly Triad
• Can the deadly triad be avoided in the first place?
  – Bootstrapping
  – Function approximation
  – Off-policy
• Escaping the deadly triad looks difficult, so we want to understand its properties.
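Deep Reinforcement Learning and the Deadly Triad
• This paper varies the influence of each component of the deadly triad:
  – Bootstrapping
  – Function approximation
  – Off-policy
• Under these settings, the authors observe how learning behavior changes as each component is varied, and investigate empirically when the algorithm becomes unstable and how the components relate to performance.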
Deep Reinforcement Learning and the Deadly Triad
• Before running the experiments, the authors stated the following hypotheses and then checked them against the results.
Deep Reinforcement Learning and the Deadly Triad
• F: With DQL, unbounded divergence (Q-values growing without limit) is unlikely.
  – $\gamma = 0.99$, so $1/(1 - \gamma) = 100$
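The value $1/(1 - \gamma) = 100$ is the largest discounted return that is possible when every reward has magnitude at most 1. The sketch below illustrates that bound. Reward clipping to $[-1, 1]$ is an assumption here, and the function names are hypothetical.

```python
def q_value_bound(gamma, max_abs_reward=1.0):
    """Upper bound on |Q| when every reward satisfies |r| <= max_abs_reward."""
    return max_abs_reward / (1.0 - gamma)


def discounted_sum(rewards, gamma):
    """Discounted return of a reward sequence, accumulated from the last step back."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    gamma = 0.99
    print(q_value_bound(gamma))                  # approximately 100
    print(discounted_sum([1.0] * 10000, gamma))  # approaches the bound
```

Under these assumptions, a learned Q-value whose magnitude exceeds this bound cannot be a correct return estimate, which makes the bound a simple divergence check.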
Deep Reinforcement Learning and the Deadly Triad
• B: Using a target network for bootstrapping makes divergence less likely.
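As a rough illustration of a target network, the sketch below computes the bootstrap target from a separate copy of the parameters that is refreshed only periodically. The function names, the list-based parameters, and the `sync_every` interval are hypothetical; this is a simplified sketch, not the paper's implementation.

```python
def td_target(reward, next_q_values_target, gamma, done):
    """One-step TD target that bootstraps from the target network's Q-values."""
    if done:
        return reward
    return reward + gamma * max(next_q_values_target)


def maybe_sync(step, online_params, target_params, sync_every):
    """Copy the online parameters into the target network every `sync_every` steps."""
    if step % sync_every == 0:
        return list(online_params)
    return target_params


if __name__ == "__main__":
    online, target = [0.3, 0.7], [0.0, 0.0]
    for step in range(1, 5):
        target = maybe_sync(step, online, target, sync_every=2)
    print(target, td_target(1.0, [0.5, 2.0], 0.9, done=False))
```

Because the target parameters stay fixed between syncs, the regression target does not move with every gradient step on the online network.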
Deep Reinforcement Learning and the Deadly Triad
• B: Correcting the overestimation of Q makes divergence less likely.
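One common way to correct Q overestimation is Double Q-learning, which selects the next action with the online network but evaluates it with the target network. The sketch below compares that target with the standard max-based target. The function names and the toy Q-values are hypothetical.

```python
def standard_q_target(reward, next_q_target, gamma, done):
    """Standard target: select and evaluate the action with the same Q-values."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_q_target(reward, next_q_online, next_q_target, gamma, done):
    """Double Q-learning target: select with online Q-values, evaluate with target Q-values."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


if __name__ == "__main__":
    online, target = [1.0, 3.0], [5.0, 2.0]
    print(standard_q_target(0.0, target, 0.5, done=False))
    print(double_q_target(0.0, online, target, 0.5, done=False))
```

In this toy case the double target is lower because the action preferred by the online values is not the one the target values overrate.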
Deep Reinforcement Learning and the Deadly Triad
• B: The length of the multi-step return changes how easily learning diverges (longer returns: smaller bias, larger variance).
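A multi-step (n-step) return sums n real rewards before bootstrapping from a value estimate, so a longer n relies less on the estimate. The sketch below shows the computation. The function name and the example numbers are hypothetical.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """n-step return: discounted sum of len(rewards) rewards plus a discounted bootstrap value."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    # Three real rewards, then bootstrap from an estimated value of 10.
    print(n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
```

With n rewards the bootstrap value is weighted by $\gamma^n$, which is why errors in the value estimate matter less as n grows.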
Deep Reinforcement Learning and the Deadly Triad
• F: Larger networks diverge less easily.
Deep Reinforcement Learning and the Deadly Triad
• O: Increasing the degree of prioritization in the prioritized replay buffer makes divergence more likely.
  – Prioritization exponent in $\{0, 1, 2\}$
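The effect of the prioritization exponent can be seen from the sampling distribution of prioritized experience replay, where transition $i$ is drawn with probability $p_i^\alpha / \sum_k p_k^\alpha$. The sketch below evaluates that formula. Reading the set $\{0, 1, 2\}$ as values of $\alpha$ is an assumption, and the priorities are made-up numbers.

```python
def sampling_probabilities(priorities, alpha):
    """Sampling probabilities p_i**alpha / sum_k p_k**alpha for prioritized replay."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


if __name__ == "__main__":
    priorities = [1.0, 2.0, 1.0]  # hypothetical priorities, e.g. absolute TD errors
    for alpha in (0, 1, 2):
        print(alpha, sampling_probabilities(priorities, alpha))
```

With $\alpha = 0$ sampling is uniform. Larger $\alpha$ concentrates sampling on high-priority transitions, which moves the sampled data further from the distribution the behavior policy generated.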
Deep Reinforcement Learning and the Deadly Triad
• Summary: the influence of each component of the deadly triad can be mitigated with several techniques.
  – Bootstrapping
  – Function approximation
  – Off-policy
Diagnosing Bottlenecks in Deep Q-learning Algorithms
• To investigate the latent problems of DQL, the authors ran "unit tests" and answered the following questions empirically.
Diagnosing Bottlenecks in Deep Q-learning Algorithms
• F: How does the function approximator affect convergence?
Diagnosing Bottlenecks in Deep Q-learning Algorithms
• B: Does overfitting occur?
Diagnosing Bottlenecks in Deep Q-learning Algorithms
• B: How can overfitting be mitigated?
  – Settings compared: $\{0.5, 1.0, 2.0, 4.0, 8.0\}$
Diagnosing Bottlenecks in Deep Q-learning Algorithms
• B: What is the effect of nonstationary regression targets?
Diagnosing Bottlenecks in Deep Q-learning Algorithms
• Summary: the components of the deadly triad are mitigated by several techniques.
  – Bootstrapping
  – Function approximation
  – Off-policy
Revisiting Fundamentals of Experience Replay
• Investigates how replay buffer parameters affect RL (this corresponds to the O, off-policy, component of the deadly triad).
Revisiting Fundamentals of Experience Replay
• O: How do replay buffer hyperparameters relate to RL performance?
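To make "replay buffer hyperparameters" concrete, the sketch below implements a minimal first-in-first-out buffer whose only hyperparameter is its capacity. Choosing capacity as the example is an assumption, and the class `ReplayBuffer` and its methods are hypothetical names, not taken from the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer with uniform sampling."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)  # oldest transitions are evicted first
        self._rng = random.Random(seed)

    def add(self, transition):
        self._items.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._items), batch_size)

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3, seed=0)
    for t in range(5):
        buffer.add(t)
    print(len(buffer), buffer.sample(2))
```

A larger capacity keeps older transitions around for longer, so the sampled data comes from older policies and the updates become more off-policy.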
Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• When the policy and value function are trained with TD methods, the effective rank decreases and control performance degrades.
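As a rough illustration of effective rank, the sketch below uses one common definition: the smallest $k$ such that the top $k$ singular values of the feature matrix account for a fraction $1 - \delta$ of the sum of all singular values. Both this definition and the default $\delta = 0.01$ are assumptions, and the function takes precomputed singular values to stay dependency-free.

```python
def effective_rank(singular_values, delta=0.01):
    """Smallest k whose top-k singular values hold a (1 - delta) share of their total sum."""
    svals = sorted((abs(s) for s in singular_values), reverse=True)
    total = sum(svals)
    if total == 0:
        return 0
    cumulative = 0.0
    for k, s in enumerate(svals, start=1):
        cumulative += s
        if cumulative / total >= 1.0 - delta:
            return k
    return len(svals)


if __name__ == "__main__":
    print(effective_rank([1.0, 1.0, 1.0, 1.0]))  # evenly spread spectrum
    print(effective_rank([10.0, 0.05, 0.01]))    # one dominant direction
```

A low effective rank means the learned features occupy only a few directions, even when the layer is nominally wide.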
Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• B: How does the network's effective rank relate to control performance (return)?
Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• B: How does the network's effective rank relate to control performance (return)? (continued)
Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• B: Is bootstrapping (TD methods) to blame?
Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• Why not add a loss term that keeps the effective rank from dropping?
D2RL: Deep Dense Architectures in Reinforcement Learning
• Uses DenseNet-style connections in the policy and value function networks.
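A minimal sketch of a DenseNet-style MLP follows, assuming the dense connection means concatenating the network input to the input of every hidden layer after the first, with a plain linear output layer. The layer sizes, the initialization, and all function names are hypothetical, and plain Python lists are used so the example has no dependencies.

```python
import random


def linear(vec, weight, bias):
    """Affine map: one output per (row, bias) pair."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def init_layer(rng, in_dim, out_dim):
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def build_dense_mlp(in_dim, hidden_dim, num_hidden, out_dim, seed=0):
    rng = random.Random(seed)
    hidden = [init_layer(rng, in_dim, hidden_dim)]
    for _ in range(num_hidden - 1):
        # Later hidden layers see the previous activation plus the raw input.
        hidden.append(init_layer(rng, hidden_dim + in_dim, hidden_dim))
    output = init_layer(rng, hidden_dim, out_dim)
    return hidden, output


def dense_mlp_forward(x, hidden, output):
    h = relu(linear(x, *hidden[0]))
    for weight, bias in hidden[1:]:
        h = relu(linear(h + x, weight, bias))  # list concatenation = feature concatenation
    return linear(h, *output)


if __name__ == "__main__":
    hidden, output = build_dense_mlp(in_dim=3, hidden_dim=4, num_hidden=3, out_dim=2)
    print(dense_mlp_forward([0.1, 0.2, 0.3], hidden, output))
```

Re-injecting the input at each layer lets deeper networks keep direct access to the state instead of relying only on increasingly transformed features.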
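• Starting from cases where deep RL (DQL in particular) fails to learn well, this talk covered why learning fails and what causes the instability.
• In particular, it showed that the combination of three elements known as the "deadly triad" destabilizes learning, and introduced research that can reduce the influence of these elements:
  – Bootstrapping
    • Adopting multi-step returns, a target network, Double Q-learning, DenseNet, and early stopping
    • Visualizing performance through the effective rank
  – Function approximation
    • Adopting larger networks
  – Off-policy
    • Using samples that are more on-policy and more diverse
• However, these findings depend (sometimes strongly) on the environment, the RL algorithm, and the hyperparameters, so they should be applied appropriately case by case.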