Reinforcement Learning: Comparing Exploration Strategies

Reinforcement learning (RL) is an area of machine learning in which an agent interacts with an environment to learn a task and maximise a cumulative reward through trial and error using an RL algorithm. The trade-off between exploitation and exploration is a common challenge in RL. Exploitation refers to a situation wherein an agent behaves greedily by selecting well-discovered actions that have high values. Conversely, exploration involves the agent selecting less-known actions, or ones that have never been chosen up to that point in the learning process, to gain more knowledge. The problem encountered in pure exploitation is that an agent may be trapped in local optima; meanwhile, persistent exploration diminishes the learning performance of RL algorithms. This research presented three new action-selection methods that are a fusion of a popular optimisation algorithm and existing exploration strategies. These methods are LBSA-states-counts, which is based on the LBSA optimisation algorithm and count-based exploration; Tabu decaying ε-greedy, which is grounded in tabu search and decaying ε-greedy; and UCB-1+, which is anchored in the UCB-1 action-selection method. A wide variety of current action-selection methods were also implemented and used to compare the proposed algorithms with existing approaches, including ε-greedy, decaying ε-greedy, softmax, pursuit, UCB-1, SA and ε-UCB. All the strategies were implemented in conjunction with the Sarsa(λ) algorithm to investigate the effectiveness of the proposed methods and ascertain which strategy provided the best performance and the largest cumulative reward, as well as which one was the least sensitive, or most consistent, over different parameter settings. The comparison was carried out in different environmental settings: maze dimensions of 10×10 and 25×25 were adopted to extract comprehensive insights into both small- and large-scale problems.
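As a minimal sketch of the exploration–exploitation trade-off described above, the following shows the standard ε-greedy rule and a decaying exploration parameter. This is an illustrative implementation only, not the thesis's exact code; the function names and the geometric decay schedule are assumptions for the example.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, explore (random action);
    otherwise exploit (greedy action with the highest value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decaying_epsilon(initial_epsilon, decay, episode):
    """Decaying epsilon-greedy: exploration shrinks as learning
    progresses (here, a geometric decay per episode is assumed)."""
    return initial_epsilon * (decay ** episode)
```

With `epsilon=0` the rule is purely greedy (pure exploitation), while `epsilon=1` is purely random; decaying ε starts exploratory and gradually shifts towards exploitation, which is the behaviour the abstract reports as improving plain ε-greedy.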
A test was also performed on the windy gridworld test problem, which is characterised by an action space larger than that of the maze. The stochastic version of this environment was also used to verify performance in both stationary and non-stationary environments. The results showed that although ε-greedy is the most frequently used selection method, its performance was one of the worst, if not the worst, in all the different test problems. However, adding a decaying factor to its exploration parameter enhanced its performance, which improved further upon the incorporation of a tabu feature. The softmax and pursuit strategies both performed well; however, tuning the parameter of softmax was difficult. Amongst the three UCB-based approaches, UCB-1 and UCB-1+ obtained similar results in terms of average cumulative reward, with UCB-1 generating a somewhat higher value in the stochastic windy domain. Nonetheless, UCB-1+ was slightly faster in the two stationary domains. The ε-UCB method exhibited the worst performance amongst the UCB approaches. Finally, LBSA-states-counts showed remarkable performance and was the strategy that maintained a consistent learning performance that was also robust to different parameter settings.
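For reference, the standard UCB-1 rule that the UCB-based strategies above build on can be sketched as follows. This shows the classic form with a √(2 ln t / n) bonus; the abstract does not detail how UCB-1+ or ε-UCB modify it, so only the baseline is illustrated here and the function name is an assumption.

```python
import math

def ucb1(q_values, counts, total_steps):
    """UCB-1 action selection: value estimate plus an exploration
    bonus that shrinks as an action's selection count grows.
    Any untried action (count == 0) is selected first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + math.sqrt(2.0 * math.log(total_steps) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

Because the bonus term grows with the total step count but shrinks with an action's own count, rarely tried actions are revisited occasionally without a separate exploration parameter, which is why UCB-style methods need less tuning than softmax.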