Online Least-Squares Policy Iteration for Reinforcement Learning Control
Reference
L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška, "Online Least-Squares Policy Iteration for Reinforcement Learning Control," Proceedings of the 2010 American Control Conference, Baltimore, Maryland, pp. 486-491, June-July 2010.
Abstract
Reinforcement learning is a promising paradigm for learning optimal
control. We consider policy iteration (PI) algorithms for
reinforcement learning, which iteratively evaluate and improve control
policies. State-of-the-art least-squares techniques for policy
evaluation are sample-efficient and have relaxed convergence
requirements. However, they are typically used in offline PI, whereas
a central goal of reinforcement learning is to develop online algorithms. Therefore, we propose an online
PI algorithm that evaluates policies with the so-called least-squares
temporal difference for Q-functions (LSTD-Q). The crucial difference
between this online least-squares policy
iteration (LSPI) algorithm and its offline counterpart is that,
in the online case, policy improvements must be performed once every
few state transitions, using only an incomplete evaluation of the
current policy. In an extensive experimental evaluation, online LSPI
is found to work well for a wide range of its parameters, and to learn
successfully in a real-time example. Online LSPI also compares
favorably with offline LSPI and with a different flavor of online PI,
which instead of LSTD-Q employs another least-squares method for
policy evaluation.
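The scheme described in the abstract (accumulate LSTD-Q statistics at every state transition, and perform a greedy policy improvement once every K transitions, from a still-incomplete evaluation of the current policy) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 3-state chain problem, the tabular one-hot basis functions, the exploration rate, and the improvement interval K are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical toy problem: a 3-state chain; reaching state 2 ends the
# episode with reward 1, and the agent restarts in state 0.
n_states, n_actions, gamma = 3, 2, 0.9
dim = n_states * n_actions

def phi(s, a):
    # One-hot (tabular) basis function for the state-action pair.
    v = np.zeros(dim)
    v[s * n_actions + a] = 1.0
    return v

def step(s, a):
    # Deterministic dynamics: action 0 moves left, action 1 moves right.
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

rng = np.random.default_rng(0)
A = 0.01 * np.eye(dim)   # small ridge term keeps A invertible early on
b = np.zeros(dim)
w = np.zeros(dim)        # Q(s, a) is approximated by phi(s, a) @ w

def greedy(s):
    # Policy improvement: act greedily w.r.t. the current Q estimate.
    return int(np.argmax([phi(s, a) @ w for a in range(n_actions)]))

s, K, eps = 0, 10, 0.3   # K = policy-improvement interval (in transitions)
for t in range(5000):
    # Epsilon-greedy exploration around the current greedy policy.
    a = int(rng.integers(n_actions)) if rng.random() < eps else greedy(s)
    s2, r = step(s, a)
    f = phi(s, a)
    done = s2 == n_states - 1
    f_next = np.zeros(dim) if done else phi(s2, greedy(s2))
    A += np.outer(f, f - gamma * f_next)   # LSTD-Q accumulation
    b += r * f
    if (t + 1) % K == 0:
        # Improve the policy from a partial (ongoing) policy evaluation.
        w = np.linalg.solve(A, b)
    s = 0 if done else s2

policy = [greedy(s) for s in range(n_states - 1)]
print(policy)  # greedy actions in the two non-terminal states
```

With a tabular basis and enough transitions, the greedy policy in both non-terminal states should settle on "move right," the shortest path to the rewarding terminal state. The key contrast with offline LSPI is visible in the loop: A and b keep accumulating across policy improvements, so each improvement uses samples gathered under a mixture of past policies rather than a complete evaluation of the current one.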
Downloads
- Corresponding technical report: pdf file (454 KB)
Bibtex entry
@inproceedings{BusErn:10-009,
author={L. Bu{\c{s}}oniu and D. Ernst and B. {D}e Schutter and R.
Babu{\v{s}}ka},
title={Online Least-Squares Policy Iteration for Reinforcement Learning
Control},
booktitle={Proceedings of the 2010 American Control Conference},
address={Baltimore, Maryland},
pages={486--491},
month=jun # {--} # jul,
year={2010}
}
This page is maintained by Bart De Schutter.
Last update: February 21, 2026.