Sampling Strategies for Tackling Imbalanced Data in Human Activity Recognition

dc.contributor.advisor: Jamie A Ward
dc.contributor.author: FAYEZ SULAIMAN MOHIA ALHARBI
dc.date: 2021
dc.date.accessioned: 2022-05-26T20:36:42Z
dc.date.available: 2022-05-26T20:36:42Z
dc.degree.department: Computer Science
dc.degree.grantor: Goldsmiths, University of London
dc.description.abstract: Human activity recognition (HAR) using wearable sensors is an active research topic in machine learning. Smart, sensor-embedded devices that collect detailed movement data, such as smartphones, fitness trackers, and smartwatches, are now widely available. HAR may be applied in areas such as healthcare, physiotherapy, and fitness to assist users of these devices in their daily lives. However, one of the main challenges facing HAR, particularly in supervised learning, is obtaining balanced data for algorithm optimisation and testing. Because users engage in some activities more than others, e.g. walking more than running, HAR datasets are typically imbalanced. The under-representation of minority classes therefore hinders the ability of HAR classifiers to recognise new instances of those activities. Inspired by the concept of data fusion, this thesis introduces three new hybrid sampling methods, each of which combines the output of separate sampling methods to enhance the diversity of the synthesised samples. The advantage of the hybrid approach is that it draws diverse synthetic data from different sampling approaches to increase the size of the training data, which improves the generalisation of the learned activity recognition model. The first strategy, the distance-based method (DBM), combines the synthetic minority oversampling technique (SMOTE) with Random SMOTE, both of which are built around the k-nearest neighbours algorithm. The second, the noise detection-based method (NDBM), combines Tomek links (SMOTE Tomeklinks) and the modified synthetic minority oversampling technique (MSMOTE). The third, the cluster-based method (CBM), combines cluster-based synthetic oversampling (CBSO) and the proximity weighted synthetic oversampling technique (ProWSyn).
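The idea of pooling synthetic samples from more than one oversampler can be illustrated with a minimal sketch. It is not the thesis's DBM implementation: `smote_like` is a simplified SMOTE-style interpolator written for illustration, `pooled_oversample` is a hypothetical helper, and the even split of the sample budget between samplers is an assumption.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style sampler (illustrative, not a library API).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


def pooled_oversample(X_min, n_new, samplers):
    """Hypothetical hybrid: split the budget evenly and pool the outputs."""
    share = n_new // len(samplers)
    return np.vstack([sampler(X_min, share) for sampler in samplers])


# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = pooled_oversample(
    X_min,
    10,
    [
        lambda X, n: smote_like(X, n, k=1, seed=1),
        lambda X, n: smote_like(X, n, k=3, seed=2),
    ],
)
```

Here the two samplers differ only in neighbourhood size and random seed. In practice each entry in `samplers` would be a different oversampling method, which is what gives the pooled set its diversity.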
The performance of the proposed hybrid methods is compared with that of existing methods using accelerometer data from three commonly used benchmark datasets. The results show that the DBM, NDBM and CBM can significantly reduce the impact of class imbalance and improve the F1 scores of the multilayer perceptron (MLP) by as much as 9% to 20% compared with their constituent sampling methods. The Friedman statistical significance test was also conducted to compare the effect of the different sampling methods, and its results confirm that the CBM is more effective than the other sampling approaches. This thesis also introduces a method based on the Wasserstein generative adversarial network (WGAN) for generating different types of human activity data. The WGAN is more stable to train than a standard generative adversarial network (GAN) because it uses a stable metric, the Wasserstein distance, to compare the real data distribution with the generated data distribution. WGAN is a deep learning approach and, in contrast to the six existing sampling methods referred to previously, it can operate on raw sensor data because convolutional and recurrent layers can act as feature extractors. WGAN is therefore used to generate raw sensor data, overcoming the limitation of the traditional machine learning-based sampling methods, which can only operate on extracted features. The synthetic data produced by WGAN is then used to oversample the imbalanced training data. This thesis demonstrates that this approach significantly enhances the ability of the convolutional neural network (CNN) to learn from imbalanced human activity datasets, by as much as 5% to 6%. This thesis concludes that the proposed sampling methods based on traditional machine learning are efficient when human activity training data is imbalanced and small. These methods are less complex to implement, require less human activity trainin
dc.identifier.uri: https://drepo.sdl.edu.sa/handle/20.500.14154/33578
dc.language.iso: en
dc.title: Sampling Strategies for Tackling Imbalanced Data in Human Activity Recognition
sdl.thesis.level: Doctoral
sdl.thesis.source: SACM - United Kingdom