Reinforcement Learning: Comparing Exploration Strategies
Abstract
Reinforcement learning (RL) is an area of
machine learning in which an agent interacts with an
environment to learn a task and maximise a cumulative reward
through trial and error using an RL algorithm. The trade-off
between exploitation and exploration is a common challenge
in RL. Exploitation refers to the agent behaving
greedily by selecting well-explored actions with high
estimated values. Conversely, exploration involves the agent
selecting less-known actions, or actions that have never been
chosen up to that point in the learning process, in order to
gain more knowledge. The problem with pure exploitation is
that the agent may become trapped in a local optimum, whereas
persistent exploration diminishes the learning performance of
RL algorithms.
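To make the trade-off concrete, the following minimal sketch shows the widely used ε-greedy rule over a tabular action-value table (the dictionary-based Q-table layout and the value of ε are illustrative assumptions, not the exact settings used in this study):

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # Explore with probability epsilon by picking a random action;
        # otherwise exploit by picking the action with the highest estimate.
        if random.random() < epsilon:
            return random.choice(actions)                     # exploration
        return max(actions, key=lambda a: Q[(state, a)])      # exploitation

With ε = 0 the rule is purely greedy (pure exploitation), whereas with ε = 1 it selects actions uniformly at random (persistent exploration).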
This research presents three new action-selection methods
that combine popular optimisation algorithms with existing
exploration strategies. These methods are LBSA-states-counts,
which is based on the LBSA optimisation algorithm and
count-based exploration; Tabu decaying ε-greedy, which is
grounded in tabu search and decaying ε-greedy; and UCB-1+,
which is anchored in the UCB-1 action-selection method. A
wide variety of existing action-selection methods, including
ε-greedy, decaying ε-greedy, softmax, pursuit, UCB-1, SA and
ε-UCB, were also implemented to compare the proposed
algorithms against these established approaches.
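For reference, the baseline UCB-1 rule included in the comparison selects the action that maximises the value estimate plus a count-based confidence bonus; the sketch below assumes tabular estimates and per-state-action visit counts, and it shows the standard baseline rule rather than the UCB-1+ variant proposed here:

    import math

    def ucb1(Q, counts, state, actions, c=2.0):
        # UCB-1: value estimate plus an exploration bonus that shrinks as a
        # (state, action) pair is visited more often; c is a tunable constant.
        total = sum(counts[(state, a)] for a in actions)
        def score(a):
            n = counts[(state, a)]
            if n == 0:
                return float("inf")   # untried actions are selected first
            return Q[(state, a)] + c * math.sqrt(math.log(total) / n)
        return max(actions, key=score)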
All of these strategies were implemented in conjunction with
the Sarsa(λ) algorithm to investigate the effectiveness of the
proposed methods and to ascertain which strategy provided the
best performance and the largest cumulative reward, as well as
which one was the least sensitive to, or most consistent over,
different parameter settings. The comparison was carried out
in different environmental settings. Maze dimensions of 10×10
and 25×25 were adopted to extract comprehensive insights on
both small- and large-scale problems. A test was also performed on the
windy gridworld test problem, which is characterised by an
action space larger than that of the maze. The stochastic
version of this environment was also used to verify
performance in both stationary and non-stationary
environments. The results showed that although ε-greedy is
the most frequently used selection method, its performance
was one of the worst, if not the worst, in all the different test
problems. However, adding a decaying factor to its
exploration parameter enhanced its performance, which
improved further with the incorporation of a tabu feature.
The softmax and pursuit strategies both performed well;
however, tuning the temperature parameter of softmax was difficult.
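To illustrate the two mechanisms referred to above, the sketch below shows one common way of decaying the ε exploration parameter over episodes, together with a softmax (Boltzmann) rule in which the temperature is the parameter that proved difficult to tune; the decay schedule and the constants are illustrative assumptions rather than the exact experimental settings:

    import math
    import random

    def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.99):
        # One common schedule: multiplicative decay of epsilon per episode,
        # bounded below so that some exploration always remains.
        return max(eps_min, eps_start * decay ** episode)

    def softmax_action(Q, state, actions, temperature=1.0):
        # Boltzmann selection: higher-valued actions receive exponentially
        # higher probability; low temperature is near-greedy, high is near-uniform.
        prefs = [Q[(state, a)] / temperature for a in actions]
        m = max(prefs)                                  # stabilise the exponentials
        weights = [math.exp(p - m) for p in prefs]
        return random.choices(actions, weights=weights, k=1)[0]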
Amongst the three UCB-based approaches, UCB-1 and UCB-1+ obtained similar results in regard to average cumulative
reward, with UCB-1 generating a somewhat higher value in
the stochastic windy domain. Nonetheless, UCB-1+ was
slightly faster in the two stationary domains. The ε-UCB
method exhibited the worst performance amongst the UCB
approaches. Finally, LBSA-states-counts showed remarkable
performance and was the strategy that maintained consistent
learning performance while remaining robust to different
parameter settings.
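For context on how the compared strategies were evaluated, the sketch below outlines one episode of tabular Sarsa(λ) with accumulating eligibility traces, into which any of the selection rules sketched above can be plugged; the environment interface (reset/step returning state, reward and a done flag) and the hyperparameter values are illustrative assumptions rather than the exact experimental setup:

    from collections import defaultdict

    def sarsa_lambda_episode(env, select_action, actions, Q,
                             alpha=0.1, gamma=0.99, lam=0.9):
        # One episode of tabular Sarsa(lambda) with accumulating traces.
        # Q is a defaultdict(float) keyed by (state, action);
        # select_action(Q, state, actions) is any exploration strategy.
        E = defaultdict(float)                           # eligibility traces
        state = env.reset()
        action = select_action(Q, state, actions)
        done = False
        total_reward = 0.0
        while not done:
            next_state, reward, done = env.step(action)
            next_action = select_action(Q, next_state, actions)
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            delta = target - Q[(state, action)]          # TD error
            E[(state, action)] += 1.0                    # accumulate the trace
            for key in list(E):                          # propagate credit backwards
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam
            state, action = next_state, next_action
            total_reward += reward
        return total_reward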