Slide overview
2017/12/8
Deep Learning JP:
http://deeplearning.jp/seminar-2/
DL paper reading group (DL輪読会) material
DEEP LEARNING JP [DL Papers] Deep Reinforcement Learning that Matters Reiji Hatsugai http://deeplearning.jp/ 1
About this talk
• Toward actually using deep reinforcement learning:
– reproducibility and instability problems
– how the difficulty of a task should be judged
• One paper uses a broad set of comparison experiments to make explicit the reproducibility and instability issues that anyone working on deep RL has run into
• The other paper asks how task difficulty is determined, a point that is rarely discussed even though results depend strongly on the task
• Today's material is empirical rather than mathematical
• Hopefully it serves as a seed for discussion 2
Tasks appearing in this talk: HalfCheetah 5
Tasks appearing in this talk: Hopper 6
a_t ~ π(a | s_t),  s_{t+1} ~ p(s' | s_t, a_t),  r_{t+1} = r(s_t, a_t, s_{t+1}) 11
a_t ~ π(a | s_t),  s_{t+1} ~ p(s' | s_t, a_t),  r_{t+1} = r(s_t, a_t, s_{t+1}),  π* = arg max_π E_π[ Σ_{τ=0}^{∞} γ^τ r_τ ] 12
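To make the objective above concrete, here is a minimal sketch (plain Python; the reward list is a hypothetical rollout) of the discounted return Σ_{τ} γ^τ r_τ that the optimal policy maximizes in expectation:

```python
# Minimal sketch: discounted return of a single episode.
# `rewards` is a hypothetical list r_0, r_1, ..., r_T collected by rolling out a policy.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):      # accumulate from the end: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # 0.9801 with gamma = 0.99
```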
Deep RL in recent years: Soft Q, TRPO, PCL, UNREAL, ACER, DDQN, DQN, SAC, A3C, Q-Prop, D4PG, ACKTR, PPO, IPG, DDPG 13
Deep RL in recent years: Soft Q, TRPO, PCL, UNREAL, ACER, DDQN, DQN, SAC, A3C, Q-Prop, D4PG, ACKTR, PPO, IPG, DDPG. Since RL became "deep", a large number of methods have been developed. 14
Working on deep RL
• A common pattern:
1. Pick a method
2. Try to reproduce the experiments
3. It does not work (and the implementation is the first suspect)
4. Doubts start to creep in about the paper's results 15
Deep Reinforcement Learning that Matters
• Follow-up to "Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control" from the ICML 2017 reproducibility workshop
• Accepted at AAAI 2018
• Examines the reproducibility of deep RL in terms of:
– extrinsic factors (hyperparameters and implementation)
– intrinsic factors (environment properties and random seeds)
• For each factor, experimental settings are varied and the results compared
• Proposes that significance checks be carried out when a new method is reported 16
Deep Reinforcement Learning that Matters
• Methods compared in the experiments:
– ACKTR (Wu et al. 2017)
– PPO (Schulman et al. 2017)
– DDPG (Lillicrap et al. 2015)
– TRPO (Schulman et al. 2015)
• ACKTR and PPO are recent methods
• DDPG and TRPO are the baselines that new methods are usually compared against
• The concrete comparisons appear in the parts that follow 17
Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics 18
Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics
(extrinsic factors) 19
Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics
(intrinsic factors) 20
Network Architecture
• Commonly used hidden-layer sizes differ between deep RL implementations:
– (64, 64) (rllab)
– (100, 50, 25) (Q-Prop)
– (400, 300) (DDPG)
• These architectures are compared for each algorithm
• The activation function is also varied and compared 21
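As a rough illustration of what is being varied on this slide, a minimal PyTorch-style sketch that builds the three policy MLP sizes with a switchable activation; the layer sizes come from the slide, while obs_dim/act_dim and everything else are assumptions:

```python
import torch.nn as nn

# Hidden-layer sizes from the slide; obs_dim / act_dim below are placeholder values.
ARCHS = {"rllab": (64, 64), "Q-Prop": (100, 50, 25), "DDPG": (400, 300)}

def make_policy_mlp(obs_dim=17, act_dim=6, hidden=(64, 64), activation=nn.Tanh):
    layers, in_dim = [], obs_dim
    for h in hidden:                              # hidden layers with the chosen activation
        layers += [nn.Linear(in_dim, h), activation()]
        in_dim = h
    layers.append(nn.Linear(in_dim, act_dim))     # e.g. the mean of a Gaussian policy
    return nn.Sequential(*layers)

# Sweep architectures x activations, roughly what the comparison varies.
policies = {(name, act.__name__): make_policy_mlp(hidden=sizes, activation=act)
            for name, sizes in ARCHS.items()
            for act in (nn.Tanh, nn.ReLU, nn.LeakyReLU)}
```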
Policy Architecture 22
Activation Function 23
Network Architecture
• For PPO, larger networks work better
• Tanh does well as the activation
• As the PPO results show, hyperparameters have a large effect on reproducibility
• "This also suggests a possible need for hyper parameter agnostic algorithms" 24
Reward Scale
• A technique commonly used when learning a Q function (DQN uses clipping)
• r̂ = σ·r, with σ = 0.1 a typical choice
• Learning from targets with a large scale is known to become unstable (LeCun et al. 2012; Glorot and Bengio 2010; Vincent, de Brebisson, and Bouthillier 2015)
• The idea is to keep the rewards within a limited range 25
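A minimal sketch of what reward scaling (r̂ = σ·r) and DQN-style clipping look like as Gym reward wrappers; σ = 0.1 follows the slide, while the wrapper names and the environment id are illustrative assumptions:

```python
import gym  # assumes the classic Gym API

class ScaleReward(gym.RewardWrapper):
    """r_hat = sigma * r, the reward-scale hyperparameter discussed on the slide."""
    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def reward(self, reward):
        return self.sigma * reward

class ClipReward(gym.RewardWrapper):
    """DQN-style clipping of rewards to [-1, 1]."""
    def reward(self, reward):
        return float(max(-1.0, min(1.0, reward)))

# Hypothetical usage: env = ScaleReward(gym.make("HalfCheetah-v1"), sigma=0.1)
```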
Reward Scale 26
Reward Scale
• Reward scale has a large effect
• In particular, it strongly affects learning of the value function
• The appropriate reward scale also differs from environment to environment
• Layer norm softens the effect
• Learning values across many orders of magnitude (Hado van Hasselt et al. 2016)
– learns to normalize the targets adaptively, so it works regardless of the environment
• In HumanoidStandup-v1, where rewards are on the order of 100x larger, the effect of reward scale is especially visible
• In practice the reward scale has to be tuned appropriately 27
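The van Hasselt et al. (2016) idea is adaptive normalization of the value-learning targets; the following is a much-simplified running-statistics stand-in for that idea, not the actual Pop-Art update:

```python
import numpy as np

class RunningTargetNormalizer:
    """Simplified running-statistics normalizer for value targets
    (a stand-in for the adaptive normalization idea, NOT the full Pop-Art update)."""
    def __init__(self, momentum=0.99):
        self.mean, self.var, self.momentum = 0.0, 1.0, momentum

    def update(self, targets):
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * float(np.mean(targets))
        self.var = m * self.var + (1 - m) * float(np.var(targets))

    def normalize(self, targets):
        return (np.asarray(targets) - self.mean) / np.sqrt(self.var + 1e-8)
```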
Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics
(intrinsic factors) 28
Random Seeds and Trials
• 10 runs with identical hyperparameters, differing only in the random seed
• The 10 runs are split into two groups of 5
• The average of each group is plotted 29
Random Seeds and Trials 30
Random Seeds and Trials 31
Random Seeds and Trials (p < 0.05) 32
Random Seeds and Trials
• Even though the runs were only split randomly into two groups,
– each group is the same algorithm,
– yet a "statistically significant" difference can appear
• When a paper claims its new method is superior, with few trials the difference may simply be due to the seeds
• The number of trials should be estimated with a power analysis
• More careful verification is needed 33
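A minimal sketch of the seed experiment described above: 10 hypothetical final returns from the same algorithm are split into two groups of 5 and compared with Welch's t-test (the numbers are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
returns = rng.normal(3000.0, 800.0, size=10)   # hypothetical final returns of 10 seeds

group_a, group_b = returns[:5], returns[5:]    # same algorithm, just split by seed
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.3f}")  # with few seeds and high variance, small p can occur by chance
```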
Environment
• The methods are applied to Hopper, HalfCheetah, Swimmer, and Walker2D
• Compares how the results of each method change with the environment 34
HalfCheetah 35
Hopper 36
HalfCheetah
• On tasks with stable dynamics such as HalfCheetah, DDPG is strong
• On tasks with unstable dynamics such as Hopper, DDPG is weak
• (For the stable/unstable distinction, see "Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control")
• When the environment is highly unstable, methods such as DDPG that learn a Q function have a hard time
• Put the other way around, HalfCheetah is easy for DDPG, so evaluating DDPG-based algorithms on HalfCheetah alone is unfair 37
Swimmer 38
Swimmer
• As the plots show, TRPO is the only method that does well here
• The other methods' policies fall into a local optimum
• It helps to look at the actual behaviour as well as the reward curve
• (perhaps the reward design itself is at fault?) 39
Code base
• For TRPO and DDPG there are the original authors' implementations as well as the rllab and baselines implementations
• The hyperparameters differ between implementations 40
Code base 41
Code base
• Even with the same hyperparameters, the choice of codebase has "dramatic impacts on performance"
• Fine implementation details that are not written in the papers have a large effect 42
Reporting Evaluation Metrics
• So far, deep RL has mostly been evaluated by comparing plots of a small number of trials
• As we have seen, deep RL results have high variance
• Proposed evaluation practice:
– report confidence intervals
– decide the significance level and the number of trials in advance
– evaluate statistically whether differences between algorithms are significant 43
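One way to act on the "report confidence intervals" recommendation is a percentile bootstrap over per-seed returns; a minimal sketch with hypothetical numbers:

```python
import numpy as np

def bootstrap_ci(per_seed_returns, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean return over seeds."""
    rng = np.random.default_rng(seed)
    values = np.asarray(per_seed_returns, dtype=float)
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return values.mean(), (lo, hi)

mean, (lo, hi) = bootstrap_ci([3100, 2400, 2900, 3300, 2600])  # 5 hypothetical seeds
print(f"mean {mean:.0f}, 95% CI [{lo:.0f}, {hi:.0f}]")
```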
Deep Reinforcement Learning that Matters: summary
• The reproducibility of deep RL is affected by both extrinsic and intrinsic factors
• Recommended evaluation practice for deep RL from now on:
– evaluate over many random seeds
– report statistical significance
– publish the hyperparameters
– publish implementation details and experimental settings
• What we would like from future deep RL:
– hyperparameter agnostic algorithms
• "There is often no clear winner among all benchmark environments." 44
What I want to look at
• The paper evaluates HalfCheetah and Hopper with DDPG and classifies environments as stable or unstable with respect to learning
• But should task difficulty really be something decided only as a (task, algorithm) pair?
• Simple Nearest Neighbor Policy Method for Continuous Control Tasks
– represents the policy with nearest neighbours instead of a learned function approximator
– looks at task difficulty as a property of the task itself rather than of a particular learning algorithm
– how well such a trivial policy does is used as a gauge of the task's own difficulty 45
Method
• Two variants, NN-1 and NN-2, are proposed
• Trajectories whose cumulative reward exceeds a fixed threshold are stored
• NN-1
1. Among the stored trajectories, pick the one whose initial state is closest to the current initial state
2. Execute that trajectory's actions to the end of the episode
• NN-2
1. Among all stored states, pick the one closest to the current state
2. Execute the stored action for that state for one step, then return to 1
• Reward is given at the end of the episode; the target setting is sparse reward 46
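A minimal sketch of NN-1 and NN-2 as described on this slide; the buffer format, the Euclidean distance metric, and the array shapes are assumptions:

```python
import numpy as np

class NearestNeighborPolicy:
    """Buffer holds (states, actions) arrays of trajectories whose cumulative
    reward exceeded a threshold; no parameters are learned."""
    def __init__(self, buffer):
        self.buffer = buffer  # list of (states: [T, obs_dim], actions: [T, act_dim])

    def nn1_episode(self, initial_state):
        # NN-1: pick the stored trajectory whose initial state is closest,
        # then replay its actions open-loop until the episode ends.
        dists = [np.linalg.norm(states[0] - initial_state) for states, _ in self.buffer]
        _, actions = self.buffer[int(np.argmin(dists))]
        return actions

    def nn2_step(self, state):
        # NN-2: at every step, find the stored state closest to the current state
        # and execute the action that was taken there, then repeat next step.
        all_states = np.concatenate([s for s, _ in self.buffer])
        all_actions = np.concatenate([a for _, a in self.buffer])
        idx = int(np.argmin(np.linalg.norm(all_states - state, axis=1)))
        return all_actions[idx]
```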
NN results 47
Simple Nearest Neighbor
• Can solve Sparse Mountain Car
• According to the paper, even HalfCheetah can be handled this way
• HalfCheetah does seem a bit too easy after all
• The way of looking at task difficulty is an interesting perspective
• The NN policy itself is close to previously proposed methods, so the novelty is thin
• ICLR reviews: 3, 4, 4
• Still, the fact that such a trivial policy can solve these tasks is worth noting 48
Is the difficulty we are looking at real difficulty?
• The tasks are judged hard from how learning goes, but is that the right way to look at it? (is HalfCheetah really hard?)
• Is the representational power of deep learning even needed for these tasks?
– if the inputs are physical quantities, probably not
– but what if the inputs were raw sensors??
• In the first place, the "deep" networks used here are only 3-layer MLPs
• Towards Generalization and Simplicity in Continuous Control
– parameterizes the policy as a linear or RBF model
– trains it with a simple natural-gradient algorithm
– results comparable to neural network policies (Humanoid ...)
– incidentally, the authors include Todorov of MuJoCo and Kakade of natural gradient fame 49
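A minimal sketch of that policy class: random Fourier (RBF) features followed by a linear map; the feature count, the bandwidth, and the training loop (omitted here) are assumptions:

```python
import numpy as np

class RBFLinearPolicy:
    """Linear policy on top of random Fourier features, roughly the 'RBF' policy class."""
    def __init__(self, obs_dim, act_dim, n_features=500, bandwidth=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.P = rng.normal(0.0, 1.0, size=(n_features, obs_dim)) / bandwidth  # random projections
        self.phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.W = np.zeros((act_dim, n_features))  # the only learned parameters (e.g. via natural PG)

    def features(self, obs):
        return np.sin(self.P @ np.asarray(obs) + self.phase)

    def act(self, obs):
        return self.W @ self.features(obs)         # mean action; exploration noise omitted
```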
Towards Generalization and Simplicity in Continuous Control 50
Impressions from reading papers that point out problems in deep RL
• The reproducibility and instability problems of deep RL are serious
• Solving genuinely complex environments, such as ones with raw sensor inputs, should require the approximation power of deep learning
• (or is it rather making the problem harder?? needs more study)
• Aren't we making simple problems unnecessarily hard?
– framing a practical problem as reward maximization easily produces sparse-reward problems that are hard to learn from
– hand-designing reward functions to define every task probably has its limits
• IL, IRL??
– given how strongly the reward scale affects learning, how should reward functions be decided in the first place, and should they be normalized against something? 51