Reference:
L. Busoniu, A. Lazaric, M. Ghavamzadeh, R. Munos, R. Babuska, and
B. De Schutter, "Least-squares methods for policy iteration," in
Reinforcement Learning: State-of-the-Art (M. Wiering and M. van
Otterlo, eds.), vol. 12 of Adaptation, Learning, and Optimization,
pp. 75-109, Heidelberg, Germany: Springer, 2012.
ISBN 978-3-642-27644-6.
Abstract:
Approximate reinforcement learning deals with the essential problem of
applying reinforcement learning in large and continuous state-action
spaces, by using function approximators to represent the solution.
This chapter reviews least-squares methods for policy iteration, an
important class of algorithms for approximate reinforcement learning.
We discuss three techniques for solving the core policy evaluation
component of policy iteration: least-squares temporal difference,
least-squares policy evaluation, and Bellman residual minimization.
We introduce these techniques starting from their
general mathematical principles and detailing them down to fully
specified algorithms. We pay attention to online variants of policy
iteration, and provide a numerical example highlighting the behavior
of representative offline and online methods. For the policy
evaluation component as well as for the overall resulting approximate
policy iteration, we provide guarantees on the performance obtained
asymptotically, as the number of samples processed and iterations
executed grows to infinity. We also provide finite-sample results,
which apply when a finite number of samples and iterations are
considered. Finally, we outline several extensions and improvements to
the techniques and methods reviewed.
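
As a concrete illustration of the offline methods surveyed, the following
is a minimal Python sketch of least-squares policy iteration using LSTD-Q
for policy evaluation on a toy two-state, two-action MDP with indicator
(tabular) features. The MDP, feature map, sample budget, and regularization
constant are illustrative assumptions and do not reproduce the chapter's
numerical example.

# Minimal sketch: offline approximate policy iteration with LSTD-Q
# on an assumed toy MDP (not the chapter's example).
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, gamma = 2, 2, 0.9
# Transition probabilities P[s, a, s'] and rewards R[s, a] (assumed toy values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def phi(s, a):
    """Indicator (tabular) features for the state-action pair (s, a)."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def lstd_q(samples, policy, k=n_states * n_actions):
    """LSTD-Q: build and solve A w = b from transition samples (s, a, r, s')."""
    A = 1e-6 * np.eye(k)          # small regularization to keep A invertible
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        a_next = policy[s_next]   # action taken next by the evaluated policy
        A += np.outer(phi(s, a), phi(s, a) - gamma * phi(s_next, a_next))
        b += r * phi(s, a)
    return np.linalg.solve(A, b)

def collect_samples(n):
    """Transitions generated from uniformly sampled state-action pairs."""
    out = []
    for _ in range(n):
        s = rng.integers(n_states)
        a = rng.integers(n_actions)
        s_next = rng.choice(n_states, p=P[s, a])
        out.append((s, a, R[s, a], s_next))
    return out

# Offline policy iteration: evaluate with LSTD-Q, then improve greedily.
policy = np.zeros(n_states, dtype=int)
samples = collect_samples(2000)
for it in range(10):
    w = lstd_q(samples, policy)
    q = w.reshape(n_states, n_actions)   # with indicator features, w is the Q-table
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("greedy policy:", policy, "Q estimates:\n", q)

With indicator features and enough samples, the LSTD-Q solution approaches
the true Q-function of the evaluated policy and the greedy step reproduces
exact policy iteration; with richer, non-tabular features the same loop
yields approximate policy iteration, whose asymptotic performance is
governed by bounds of the classical form
limsup_k ||Q^{pi_k} - Q*||_inf <= 2*gamma*epsilon / (1 - gamma)^2,
where epsilon bounds the per-iteration policy evaluation error (a standard
result of the type the chapter's guarantees build on, not quoted verbatim
from it).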