11. Q-learning [Watkins and Dayan 1992]
▶ The basis of DQN
▶ Converges to Q∗ under certain conditions
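The convergence conditions are not spelled out on this slide. As a hedged summary of the usual statement for the tabular case (finite state and action sets, bounded rewards, every state-action pair updated infinitely often), the step sizes α_t are typically assumed to satisfy:

```latex
\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty,
\qquad
\sum_{t=0}^{\infty} \alpha_t(s, a)^2 < \infty
\quad \text{for all } (s, a)
\;\;\Longrightarrow\;\;
Q_t(s, a) \to Q^{*}(s, a) \ \text{with probability 1}.
```

A constant α, as in the pseudocode on this slide, does not satisfy the second condition; it is a common practical simplification.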
Input: γ, α
Initialize Q(s, a) arbitrarily
loop
    Initialize s
    while s is not terminal do
        Choose a from s using policy derived from Q
        Execute a, observe reward r and next state s′
        Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
        s ← s′
    end while
end loop
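The pseudocode above can be sketched as tabular Python. This is a minimal illustration under stated assumptions, not the implementation from the cited paper: `ChainEnv` is a hypothetical toy environment invented here, the "policy derived from Q" is assumed to be ε-greedy, and the hyperparameter defaults are arbitrary.

```python
import random
from collections import defaultdict


class ChainEnv:
    """Hypothetical toy task: states 0..n-1, start at 0, state n-1 is terminal."""

    def __init__(self, n=5):
        self.n = n

    def reset(self):
        return 0

    def step(self, s, a):
        # a == 0 moves left, a == 1 moves right; the left wall clamps at 0.
        s_next = max(0, s - 1) if a == 0 else min(self.n - 1, s + 1)
        done = s_next == self.n - 1
        reward = 1.0 if done else 0.0
        return s_next, reward, done


def q_learning(env, actions, gamma=0.9, alpha=0.5, epsilon=0.1,
               episodes=500, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # "arbitrary" initialization: all zeros (assumption)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Assumed behaviour policy: epsilon-greedy with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s_next, r, done = env.step(s, a)
            # The value of a terminal state is taken to be 0.
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q


if __name__ == "__main__":
    Q = q_learning(ChainEnv(5), actions=[0, 1])
    for s in range(4):
        print(s, round(Q[(s, 0)], 3), round(Q[(s, 1)], 3))
```

In this deterministic chain, the optimal value of moving right from state s is γ^(3−s), so the learned Q(s, right) should approach 1, 0.9, 0.81 and 0.729 for s = 3, 2, 1, 0.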
50. References I
[1] Farnaz Abtahi and Ian Fasel. “Deep belief nets as function approximators for
reinforcement learning”. In: AAAI 2011 Lifelong Learning Workshop (2011),
pp. 183–219.
[2] Bram Bakker. “Reinforcement Learning with Long Short-Term Memory”. In:
NIPS 2001. 2001.
[3] Marc C. Bellemare et al. “The arcade learning environment: An evaluation
platform for general agents”. In: Journal of Artificial Intelligence Research 47
(2013), pp. 253–279.
[4] Xiaoxiao Guo et al. “Deep learning for real-time Atari game play using offline
Monte-Carlo tree search planning”. In: Advances in Neural Information
Processing Systems (NIPS) 27 (2014), pp. 1–9.
[5] Sergey Levine and Vladlen Koltun. “Guided Policy Search”. In: ICML 2013.
Vol. 28. 2013, pp. 1–9.
[6] Volodymyr Mnih et al. “Human-level control through deep reinforcement
learning”. In: Nature 518.7540 (2015), pp. 529–533.
[7] Volodymyr Mnih et al. “Playing Atari with Deep Reinforcement Learning”.
In: NIPS 2013 Deep Learning Workshop. 2013, pp. 1–9. arXiv: 1312.5602v1.
51. References II
[8] Arun Nair et al. “Massively Parallel Methods for Deep Reinforcement
Learning”. In: ICML Deep Learning Workshop 2015. 2015.
[9] Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. Language
Understanding for Text-based Games Using Deep Reinforcement Learning.
2015. arXiv: 1506.08941v1.
[10] John Schulman et al. High-Dimensional Continuous Control Using
Generalized Advantage Estimation. 2015. arXiv: 1506.02438v1.
[11] John Schulman et al. “Trust Region Policy Optimization”. In: ICML 2015.
2015. arXiv: 1502.05477v1.
[12] David Silver. Deep Reinforcement Learning. ICLR 2015 Keynote.
http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf. 2015.
[13] David Silver et al. “Deterministic Policy Gradient Algorithms”. In: ICML
2014. 2014, pp. 387–395.
[14] Nathan Sprague. “Parameter Selection for the Deep Q-Learning Algorithm”.
In: RLDM 2015. 2015.
[15] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing Exploration
In Reinforcement Learning With Deep Predictive Models. 2015. arXiv:
1507.00814v2.
52. References III
[16] Gerald Tesauro. “TD-Gammon, A Self-Teaching Backgammon Program,
Achieves Master-Level Play”. In: Neural Computation 6.2 (1994),
pp. 215–219.
[17] Christopher J. C. H. Watkins and Peter Dayan. “Q-learning”. In: Machine
Learning 8.3-4 (1992), pp. 279–292.