Reference:
L. Busoniu, B. De Schutter, R. Babuska, and D. Ernst, "Exploiting policy
knowledge in online least-squares policy iteration: An empirical study,"
Automation, Computers, Applied Mathematics, vol. 19, no. 4, pp. 521-529, 2010.
Abstract:
Reinforcement learning (RL) is a promising paradigm for learning
optimal control. Traditional RL algorithms work only with discrete variables,
so to handle the continuous variables that appear in control problems,
approximate representations of the solution are necessary. The field
of approximate RL has expanded tremendously over the last decade, and
a wide array of effective algorithms is now available. However, RL is
generally envisioned as working without any prior knowledge about the
system or the solution, whereas such knowledge is often available and
can be exploited to great advantage. Therefore, in this paper we
describe a method that exploits prior knowledge to accelerate online
least-squares policy iteration (LSPI), a state-of-the-art algorithm
for approximate RL. We focus on prior knowledge about the monotonicity
of the control policy with respect to the system states. Such
monotonic policies are appropriate for important classes of systems
appearing in control applications, including, for instance, nearly
linear systems and linear systems with monotonic input nonlinearities.
In an empirical evaluation, online LSPI with prior knowledge is shown
to learn much faster and more reliably than the original online LSPI.
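
To make the monotonicity prior concrete, the sketch below shows one plausible
way to combine LSTD-Q policy evaluation (the evaluation step inside LSPI) with
a greedy policy that is projected onto monotone policies over a one-dimensional
state grid. It is a minimal illustration under assumed choices: the feature
set, constants, toy dynamics, and the running-maximum projection are all
assumptions made for the example, not the authors' implementation or their
online sample-collection scheme.

import numpy as np

# Hypothetical setup: 1-D state x in [-1, 1], three discrete actions
# u in {-1, 0, 1}.  Q(x, u) = phi(x, u)^T theta, with one block of
# Gaussian radial-basis features per action.  All constants are assumed.
ACTIONS = np.array([-1.0, 0.0, 1.0])
CENTERS = np.linspace(-1.0, 1.0, 5)          # RBF centers over the state space
N_FEAT = len(CENTERS) * len(ACTIONS)         # one feature block per action
GAMMA = 0.95                                 # assumed discount factor

def phi(x, a_idx):
    # Block-sparse features: state RBFs placed in the block of the chosen action.
    feat = np.zeros(N_FEAT)
    feat[a_idx * len(CENTERS):(a_idx + 1) * len(CENTERS)] = \
        np.exp(-((x - CENTERS) ** 2) / 0.1)
    return feat

def greedy_action(theta, x):
    # Unconstrained greedy policy: argmax of Q(x, .) over the discrete actions.
    return int(np.argmax([phi(x, i) @ theta for i in range(len(ACTIONS))]))

def monotone_policy(theta, grid):
    # Greedy policy on a sorted state grid, projected onto nondecreasing
    # policies by a running-maximum pass, one simple way to encode the prior.
    idx = [greedy_action(theta, x) for x in grid]
    for k in range(1, len(idx)):
        idx[k] = max(idx[k], idx[k - 1])     # enforce u(x) nondecreasing in x
    return idx

def lstdq(samples, policy_on_grid, grid):
    # LSTD-Q policy evaluation from (x, a_idx, r, x') samples, bootstrapping
    # with the monotone policy defined on the grid.
    A = 1e-3 * np.eye(N_FEAT)                # small regularization term
    b = np.zeros(N_FEAT)
    for x, a_idx, r, x_next in samples:
        a_next = policy_on_grid[int(np.argmin(np.abs(grid - x_next)))]
        f = phi(x, a_idx)
        A += np.outer(f, f - GAMMA * phi(x_next, a_next))
        b += r * f
    return np.linalg.solve(A, b)

# Toy policy-iteration loop on random transitions of a stable linear system.
rng = np.random.default_rng(0)
grid = np.linspace(-1.0, 1.0, 21)
theta = np.zeros(N_FEAT)
for _ in range(5):
    policy = monotone_policy(theta, grid)
    samples = []
    for _ in range(200):
        x = rng.uniform(-1.0, 1.0)
        a_idx = policy[int(np.argmin(np.abs(grid - x)))]
        x_next = float(np.clip(0.9 * x + 0.1 * ACTIONS[a_idx], -1.0, 1.0))
        samples.append((x, a_idx, -x_next ** 2, x_next))
    theta = lstdq(samples, policy, grid)

The running-maximum pass is only one possible monotonicity projection; the
paper's own way of constraining the policy-improvement step, and its online
exploration scheme, are not reproduced here.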