[DL Reading Group] Bridging the Gap Between Value and Policy Based Reinforcement Learning

March 10, 2017

Slide overview

2017/3/10
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Text of each page
1.

Bridging the Gap Between Value and Policy Based Reinforcement Learning
初谷怜慈

2.

Meta information
• Google Brain
• Ofir Nachum

3.

Classification of RL algorithms
• RL algorithms can be broadly divided into four groups along two axes:
• On-policy or off-policy
• Value-based or policy-based

4.

RL algorithms matrix
• On-policy, value-based: SARSA
• Off-policy, value-based: DQN, Retrace
• On-policy, policy-based: A3C, TRPO
• Off-policy, policy-based: DDPG, ACER, Off-PAC

5.

What is on-policy?
• The agent is updated using only trajectories obtained with the current policy
• It usually estimates a Q or V that depends on the current policy $\pi$

7.

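SARSA
• $Q^\pi$: the Q-function when the policy is fixed to $\pi$
• Bellman equation:
$$Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_\pi\left[Q^\pi(s',a')\right]$$
• Minimize $L$ with respect to $\theta$:
$$L = \left(r(s,a) + \gamma\, Q^\pi_\theta(s',a') - Q^\pi_\theta(s,a)\right)^2$$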
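As a minimal sketch of the squared TD error above, assuming a tabular Q stored in a dict: the names `q` and `sarsa_loss` and the toy transition are illustrative assumptions, not taken from the slides.

```python
def sarsa_loss(q, s, a, r, s_next, a_next, gamma=0.99):
    """Squared one-step SARSA TD error for one transition (s, a, r, s', a')."""
    # a_next is the action the current policy actually took in s_next.
    target = r + gamma * q[(s_next, a_next)]
    return (target - q[(s, a)]) ** 2


# Hypothetical toy values, for illustration only.
q = {("s0", "left"): 0.5, ("s1", "right"): 1.0}
loss = sarsa_loss(q, "s0", "left", r=1.0, s_next="s1", a_next="right", gamma=0.9)
```

Because the target uses the action that was really taken, the transition has to come from the policy whose Q-function is being estimated.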

8.

What is off-policy?
• The agent is also updated using trajectories obtained from policies other than the current one
• It usually estimates the Q or V of the optimal policy directly

10.

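Q-learning (DQN)
• $Q^\circ$: the Q-function of the optimal policy
• Bellman equation:
$$Q^\circ(s,a) = r(s,a) + \gamma \max_{a'} Q^\circ(s',a')$$
• Minimize $L$ with respect to $\theta$:
$$L = \left(r(s,a) + \gamma \max_{a'} Q^\circ_\theta(s',a') - Q^\circ_\theta(s,a)\right)^2$$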
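The same kind of sketch for the Q-learning loss, again assuming a tabular Q in a dict. `q_learning_loss`, `actions`, and the toy numbers are hypothetical names and values chosen for illustration.

```python
def q_learning_loss(q, actions, s, a, r, s_next, gamma=0.99):
    """Squared one-step Q-learning TD error; the target maximizes over a'."""
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    return (target - q[(s, a)]) ** 2


# Hypothetical toy values, for illustration only.
actions = ["left", "right"]
q = {("s0", "left"): 0.5, ("s0", "right"): 0.0,
     ("s1", "left"): 2.0, ("s1", "right"): 1.0}
loss = q_learning_loss(q, actions, "s0", "left", r=1.0, s_next="s1", gamma=0.5)
```

No next action is passed in: the target depends only on (s, a, r, s'), which is why transitions collected by any policy can be reused.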

11.

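On-policy vs. off-policy
• On-policy methods
  Pro: multi-step returns can be applied to the update rule directly
  Con: trajectories for learning have to be sampled anew each time
• Off-policy methods
  Pro: all trajectories collected so far can be used for learning
  Con: because of the max operator, only one step can be used in the update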

13.

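N-step SARSA
• Bellman equation:
$$Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_\pi\left[Q^\pi(s',a')\right]$$
• Use a trajectory segment of n steps:
$$L = \left(\sum_{i=0}^{n-1} \gamma^i r(s_i,a_i) + \gamma^n Q^\pi_\theta(s_n,a_n) - Q^\pi_\theta(s_0,a_0)\right)^2$$
• Increasing the number of steps reduces the bias of the target estimate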
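A minimal sketch of the n-step loss, assuming a tabular Q and a trajectory segment given as plain lists. The function name and the toy segment are illustrative assumptions.

```python
def n_step_sarsa_loss(q, states, actions, rewards, gamma=0.99):
    """Squared n-step SARSA error for one trajectory segment.

    states and actions have length n + 1 (s_0..s_n, a_0..a_n);
    rewards has length n (r(s_0, a_0)..r(s_{n-1}, a_{n-1})).
    """
    n = len(rewards)
    discounted = sum(gamma ** i * rewards[i] for i in range(n))
    target = discounted + gamma ** n * q[(states[n], actions[n])]
    return (target - q[(states[0], actions[0])]) ** 2


# Hypothetical two-step segment, for illustration only.
q = {(0, "a"): 1.0, (2, "a"): 4.0}
loss = n_step_sarsa_loss(q, states=[0, 1, 2], actions=["a", "a", "a"],
                         rewards=[1.0, 2.0], gamma=0.5)
```

With n = 1 this reduces to the one-step SARSA loss; larger n puts more weight on observed rewards and less on the bootstrapped estimate.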

15.

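Q-learning (DQN)
$$L = \left(r(s,a) + \gamma \max_{a'} Q^\circ_\theta(s',a') - Q^\circ_\theta(s,a)\right)^2$$
• The $a'$ selected by the max operator differs from the $a'$ actually taken in the trajectory, so the update cannot be made multi-step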

16.

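Value-based and policy-based
• Value-based
  Goal: find the optimal value function
  Learning method: fitting the value function
• Policy-based
  Goal: find the optimal policy directly
  Learning method: policy gradient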

17.

Contribution
• Proposes a generalized Bellman optimality equation for entropy-regularized policies
• Building on it, proposes PCL (Path Consistency Learning), an off-policy, multi-step algorithm
• Explains value-based and policy-based methods from a unified viewpoint

18.

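What is entropy regularization?
• The policy is trained while maximizing its entropy, so that the policy distribution does not collapse to one-hot
• In A3C and similar methods it is used as one part of the loss function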
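To make the effect of the entropy term concrete, here is a small sketch for a single state with two actions. The function names, the temperature `tau`, and the toy returns are assumptions for illustration; real implementations apply the bonus inside a neural-network loss.

```python
import math


def entropy(probs):
    """Shannon entropy of a categorical policy (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(probs, returns, tau):
    """Expected return plus tau times the policy entropy."""
    expected = sum(p * g for p, g in zip(probs, returns))
    return expected + tau * entropy(probs)


one_hot = [1.0, 0.0]
uniform = [0.5, 0.5]
returns = [1.0, 0.9]  # hypothetical per-action returns

with_bonus = (regularized_objective(one_hot, returns, tau=0.5),
              regularized_objective(uniform, returns, tau=0.5))
without_bonus = (regularized_objective(one_hot, returns, tau=0.0),
                 regularized_objective(uniform, returns, tau=0.0))
```

With `tau=0.0` the one-hot policy scores higher, while with `tau=0.5` the uniform policy does, which is the sense in which the bonus keeps the policy from becoming one-hot.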

19.

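General Bellman equation
• $Q^*$: the entropy-regularized optimal Q-function
• Bellman equation:
$$Q^*(s,a) = r(s,a) + \gamma \tau \log \sum_{a'} \exp\left(Q^*(s',a')/\tau\right)$$
• Generalizes the Bellman optimality equation by replacing its max operator with log-sum-exp
• Coincides with the max operator as $\tau \to 0$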

20.

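The case $\tau \to 0$
$$\tau \log \sum_{a'} \exp\left(Q^*(s',a')/\tau\right) = \tau \log\left(\exp\left(Q^*(s_M,a_M)/\tau\right) \sum_{a'} \exp\left(\left(Q^*(s',a') - Q^*(s_M,a_M)\right)/\tau\right)\right)$$
$$= \max_{a'} Q^*(s',a') + \tau \log\left(\sum_{a'} \exp\left(\left(Q^*(s',a') - Q^*(s_M,a_M)\right)/\tau\right)\right)$$
• Here $Q^*(s_M,a_M) = \max_{a'} Q^*(s',a')$, so every exponent is at most 0 and the second term vanishes as $\tau \to 0$
• The Appendix shows that this equation holds for entropy-regularized policies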
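The limit can also be checked numerically. The sketch below factors out the maximum exactly as in the derivation; `soft_max_value` and the Q-values are hypothetical names and numbers.

```python
import math


def soft_max_value(q_values, tau):
    """tau * log(sum_a exp(Q(a) / tau)), computed by factoring out the max."""
    q_max = max(q_values)
    # Every exponent is <= 0, so each term lies in (0, 1].
    residual = sum(math.exp((q - q_max) / tau) for q in q_values)
    return q_max + tau * math.log(residual)


q_values = [1.0, 2.0, 3.0]  # hypothetical Q*(s', a') values
for tau in (1.0, 0.1, 0.01):
    print(tau, soft_max_value(q_values, tau))
```

The printed values should approach `max(q_values)` from above as `tau` shrinks. Factoring out the maximum is also the usual way to avoid overflow when `tau` is small.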

21.

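Before going multi-step
• The current optimal state-value function is estimated by maximizing an objective over $\pi$ at a single step
• In short, is this viewing what the derivation of the Bellman equation does as the maximization of an objective function?
• The following slides explain that, for the entropy-regularized $V^*$ and $\pi^*$,
$$V^*(s) = -\tau \log \pi^*(a \mid s) + r(s,a) + \gamma V^*(s')$$
holds. The detailed proof is in the Appendix of the paper.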

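22.

The case of the optimal state-value function
• Current state $s_0$ with value $v_0$; actions $\{a_1,\dots,a_n\}$; next states $\{s_1,\dots,s_n\}$ with values $\{v_1,\dots,v_n\}$
• Objective:
$$O_{MR}(\pi) = \sum_{i=1}^{n} \pi(a_i)\left(r_i + \gamma v_i^\circ\right)$$
• When the objective is maximized, $\pi$ becomes one-hot:
$$v_0^\circ = O_{MR}(\pi^\circ) = \max_i \left(r_i + \gamma v_i^\circ\right)$$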

23.

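The entropy-regularized case (1)
• Objective:
$$O_{ENT}(\pi) = \sum_{i=1}^{n} \pi(a_i)\left(r_i + \gamma v_i^* - \tau \log \pi(a_i)\right)$$
• Rearranging, with $S = \sum_{i'=1}^{n} \exp\left((r_{i'} + \gamma v_{i'}^*)/\tau\right)$:
$$O_{ENT}(\pi) = -\tau \sum_{i=1}^{n} \pi(a_i) \log \frac{\pi(a_i)}{\exp\left((r_i + \gamma v_i^*)/\tau\right)/S} + \tau \log S$$
• Maximizing the objective → minimizing the KL divergence:
$$\pi^*(a_i) = \frac{\exp\left((r_i + \gamma v_i^*)/\tau\right)}{\sum_{i'=1}^{n} \exp\left((r_{i'} + \gamma v_{i'}^*)/\tau\right)}$$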

24.

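The entropy-regularized case (2)
$$v_0^* = O_{ENT}(\pi^*) = \tau \log \sum_{i=1}^{n} \exp\left((r_i + \gamma v_i^*)/\tau\right)$$
$$\pi^*(a_i) = \frac{\exp\left((r_i + \gamma v_i^*)/\tau\right)}{\sum_{i'=1}^{n} \exp\left((r_{i'} + \gamma v_{i'}^*)/\tau\right)}$$
• Taking the logarithm of $\pi^*(a_i)$ and substituting $v_0^*$ gives, for every $i$:
$$v_0^* = -\tau \log \pi^*(a_i) + r_i + \gamma v_i^*$$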
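The last identity says that every action yields the same right-hand side. A quick numerical check, assuming hypothetical rewards and next-state values (`soft_backup` is an illustrative name):

```python
import math


def soft_backup(rewards, next_values, gamma, tau):
    """Return (v0, pi) for the entropy-regularized one-step problem."""
    x = [(r + gamma * v) / tau for r, v in zip(rewards, next_values)]
    x_max = max(x)
    weights = [math.exp(xi - x_max) for xi in x]  # shifted for stability
    total = sum(weights)
    pi = [w / total for w in weights]
    v0 = tau * (x_max + math.log(total))  # tau * log-sum-exp of x
    return v0, pi


# Hypothetical one-step problem with three actions.
rewards, next_values = [1.0, 0.0, 0.5], [0.0, 2.0, 1.0]
gamma, tau = 0.9, 0.5
v0, pi = soft_backup(rewards, next_values, gamma, tau)

# -tau * log pi(a_i) + r_i + gamma * v_i, evaluated for each action.
rhs = [-tau * math.log(p) + r + gamma * v
       for p, r, v in zip(pi, rewards, next_values)]
```

Every entry of `rhs` should match `v0` up to floating-point error, including the entries for actions the policy rarely picks.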

25.

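Consistency
• In general, the following equality holds for every (s, a, r) (the proof is in the Appendix):
$$V^*(s) = -\tau \log \pi^*(a \mid s) + r(s,a) + \gamma V^*(s')$$
• Applying it inductively along a path:
$$-V^*(s_1) + \gamma^{t-1} V^*(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi^*) = 0$$
$$R(s_{m:n}) = \sum_{i=0}^{n-m-1} \gamma^i r(s_{m+i}, a_{m+i})$$
$$G(s_{m:n}, \pi) = \sum_{i=0}^{n-m-1} \gamma^i \log \pi(a_{m+i} \mid s_{m+i})$$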
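The multi-step identity can be verified on a tiny example. The sketch below assumes a hypothetical deterministic, acyclic MDP (the `MDP` table, `GAMMA`, and `TAU` are made up for illustration), solves for the soft-optimal values backwards, and evaluates the left-hand side on a path.

```python
import math

GAMMA, TAU = 0.9, 0.5  # hypothetical discount and temperature
# Hypothetical MDP: (state, action) -> (reward, next_state); state 2 is terminal.
MDP = {(0, 0): (1.0, 1), (0, 1): (0.0, 2), (1, 0): (0.5, 2), (1, 1): (2.0, 2)}


def soft_value(s, values):
    """V*(s) = tau * log sum_a exp((r + gamma * V*(s')) / tau)."""
    return TAU * math.log(sum(
        math.exp((r + GAMMA * values[s2]) / TAU)
        for (s1, _), (r, s2) in MDP.items() if s1 == s))


V = {2: 0.0}
V[1] = soft_value(1, V)  # solve backwards; the MDP has no cycles
V[0] = soft_value(0, V)


def log_pi(s, a):
    """log pi*(a|s) = (r + gamma * V*(s') - V*(s)) / tau."""
    r, s2 = MDP[(s, a)]
    return (r + GAMMA * V[s2] - V[s]) / TAU


def path_consistency(states, actions):
    """-V(s_1) + gamma^(t-1) V(s_t) + R(s_1:t) - tau * G(s_1:t, pi*)."""
    steps = len(actions)  # equals t - 1
    R = sum(GAMMA ** i * MDP[(states[i], actions[i])][0] for i in range(steps))
    G = sum(GAMMA ** i * log_pi(states[i], actions[i]) for i in range(steps))
    return -V[states[0]] + GAMMA ** steps * V[states[-1]] + R - TAU * G


residual = path_consistency(states=[0, 1, 2], actions=[0, 1])
```

The residual should be zero up to rounding for any path through this MDP, not only for paths the optimal policy prefers; that property is what allows off-policy data to be used.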

26.

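PC: parameterization
$$C_{\theta,\phi}(s_{1:t}) = -V_\phi(s_1) + \gamma^{t-1} V_\phi(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi_\theta)$$
• Using the square of $C$ as the loss reduces the problem to an optimization over $\theta$ and $\phi$
• Policy-based and value-based learning are formulated in a unified way
$$\Delta\theta \propto C_{\theta,\phi}(s_{1:t})\, \nabla_\theta G(s_{1:t}, \pi_\theta)$$
$$\Delta\phi \propto C_{\theta,\phi}(s_{1:t}) \left(\nabla_\phi V_\phi(s_1) - \gamma^{t-1} \nabla_\phi V_\phi(s_t)\right)$$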
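A minimal sketch of one such update on a single path, under simplifying assumptions that are not from the slides: a tabular $V_\phi$, a per-state softmax policy over logits, one shared step size `LR`, and no replay buffer or batching. All names and numbers are hypothetical.

```python
import math

GAMMA, TAU, LR = 0.9, 0.5, 0.1  # hypothetical hyperparameters


def policy(theta, s):
    """Softmax policy pi_theta(.|s) over the logits theta[s]."""
    m = max(theta[s])
    w = [math.exp(x - m) for x in theta[s]]
    z = sum(w)
    return [x / z for x in w]


def consistency(theta, phi, states, actions, rewards):
    """C = -V(s_1) + gamma^(t-1) V(s_t) + R(s_1:t) - tau * G(s_1:t, pi_theta)."""
    steps = len(actions)  # equals t - 1
    R = sum(GAMMA ** i * rewards[i] for i in range(steps))
    G = sum(GAMMA ** i * math.log(policy(theta, states[i])[actions[i]])
            for i in range(steps))
    return -phi[states[0]] + GAMMA ** steps * phi[states[-1]] + R - TAU * G


def pcl_update(theta, phi, states, actions, rewards):
    """Apply one in-place step along the two update directions; return C."""
    c = consistency(theta, phi, states, actions, rewards)
    steps = len(actions)
    # Policy direction: C * grad_theta G. For a softmax, the gradient of
    # log pi(a|s) w.r.t. the logits is onehot(a) - pi(.|s).
    grads = []
    for i in range(steps):
        pi = policy(theta, states[i])
        grads.append([GAMMA ** i * ((b == actions[i]) - pi[b])
                      for b in range(len(pi))])
    for i in range(steps):
        for b, g in enumerate(grads[i]):
            theta[states[i]][b] += LR * c * g
    # Value direction: C * (grad V(s_1) - gamma^(t-1) grad V(s_t)), tabular V.
    phi[states[0]] += LR * c
    phi[states[-1]] -= LR * c * GAMMA ** steps
    return c


theta = {s: [0.0, 0.0] for s in range(3)}  # logits per state
phi = {s: 0.0 for s in range(3)}           # tabular state values
path = dict(states=[0, 1, 2], actions=[0, 1], rewards=[1.0, 2.0])
c_before = pcl_update(theta, phi, **path)
c_after = consistency(theta, phi, **path)
```

On this toy path, `abs(c_after)` should be smaller than `abs(c_before)`, since both directions are descent directions for the squared consistency error. A practical implementation would use function approximators and sample paths from a replay buffer.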

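28.

Comparison with A3C
• Consistency:
$$C_{\theta,\phi}(s_{1:t}) = -V_\phi(s_1) + \gamma^{t-1} V_\phi(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi_\theta)$$
• A3C update rule:
$$A_{\theta,\phi}(s_{1:d+1}) = -V_\phi(s_1) + \gamma^{d} V_\phi(s_{d+1}) + R(s_{1:d+1})$$
$$\Delta\theta \propto \mathbb{E}_{s_{0:T}}\left[\sum_{i=0}^{T-1} A_{\theta,\phi}(s_{i:i+d})\, \nabla_\theta \log \pi_\theta(a_i \mid s_i)\right]$$
$$\Delta\phi \propto \mathbb{E}_{s_{0:T}}\left[\sum_{i=0}^{T-1} A_{\theta,\phi}(s_{i:i+d})\, \nabla_\phi V_\phi(s_i)\right]$$
• With $d = t - 1$, $C_{\theta,\phi}(s_{1:t}) = A_{\theta,\phi}(s_{1:t}) - \tau G(s_{1:t}, \pi_\theta)$: the consistency is the A3C advantage plus a discounted log-probability term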

29.

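Comparison with DQN
• Extends to multiple steps an update that, because of the max operator, could only incorporate a single step
• Adding a log-prob term to n-step Q-learning, which was already working well empirically, turns it into a theoretically grounded algorithm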

30.

Experiments
• Algorithmic tasks
• Compared against A3C and Prioritized DDQN
• On every task, equal to or better than DQN and A3C

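32.

Experiments: expert trajectories
• Expert trajectories are put into the replay buffer (RB) for training
• Unlike methods that use importance sampling, this can be used even when the distribution that generated the trajectories is unknown
• Performance improved dramatically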

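34.

Summary
• Off-policy, multi-step learning is possible
• Formulated as the minimization of a loss function that unifies value function approximation and policy gradient
• Outperformed existing methods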

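35.

Personal thoughts
• Why were the experiments run only on toy-like tasks?
  → Tried it on CartPole
• https://github.com/rarilurelo/pcl_keras
• The update rule looks like it can be used as-is for continuous control tasks; a proof looks difficult, but it seems worth trying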

36.

CartPole