Applications of Hyper-parameter Optimisations for Static Malware Detection

dc.contributor.advisor: Clark, John
dc.contributor.author: ALgorain, Fahad
dc.date.accessioned: 2023-06-01T07:30:20Z
dc.date.available: 2023-06-01T07:30:20Z
dc.date.issued: 2023-05-30
dc.description: Thesis
dc.description.abstract: Malware detection is a major security concern, and a great deal of academic and commercial research and development is directed at it. Machine Learning (ML) is a natural technology to harness for malware detection, and many researchers have investigated its use. However, drawing comparisons between different techniques is a fraught affair. For example, the performance of ML algorithms often depends significantly on parametric choices, so the question arises as to which parameter choices are optimal. In this thesis, we investigate the use of a variety of ML algorithms for building malware classifiers and how best to tune the parameters of those algorithms, a process generally known as hyper-parameter optimisation (HPO). Firstly, we examine the effects of some simple (model-free) ways of parameter tuning together with a state-of-the-art Bayesian model-building approach. We demonstrate that optimal parameter choices may differ significantly from default choices and argue that hyper-parameter optimisation should be adopted as a ‘formal outer loop’ in the research and development of malware detection systems. Secondly, we investigate the use of covering arrays (combinatorial testing) as a way to combat the curse of dimensionality in Grid Search. Four ML techniques were used: Random Forests, XGBoost, LightGBM and Decision Trees. cAgen (a tool used for combinatorial testing) is shown to be capable of generating high-performing subsets of the full parameter grid of Grid Search and so provides a rigorous but highly efficient means of performing HPO. This may be regarded as a ‘design of experiments’ approach. Thirdly, evolutionary algorithms (EAs) were used to enhance machine learning classifier accuracy. Baseline accuracy is recorded for six traditional machine learning techniques. Two evolutionary algorithm frameworks, Tree-Based Pipeline Optimization Tool (TPOT) and Distributed Evolutionary Algorithms in Python (DEAP), are compared. DEAP shows very promising results for our malware detection problem. Fourthly, we compare the use of Grid Search and covering arrays for tuning the hyper-parameters of Neural Networks. Several major hyper-parameters were studied over a range of values. We achieve significant improvements over the benchmark model. Our work is carried out using EMBER, a major published malware benchmark dataset of Windows Portable Executable (PE) metadata samples, and a smaller dataset from kaggle.com (also comprising Windows Portable Executable metadata). Overall, we conclude that HPO is an essential part of credible evaluations of ML-based malware detection models. We also demonstrate that high-performing hyper-parameter values can be found by HPO and that these can be found efficiently.
dc.format.extent: 111
dc.identifier.uri: https://hdl.handle.net/20.500.14154/68244
dc.language.iso: en
dc.publisher: Saudi Digital Library
dc.subject: hyperparameter optimisation
dc.subject: static malware detection
dc.subject: neural network
dc.subject: deep learning
dc.subject: grid search
dc.subject: cAgen
dc.subject: combinatorial testing
dc.subject: covering arrays
dc.subject: automated machine learning
dc.subject: tree parzen estimators
dc.subject: bayesian optimisation
dc.subject: random search
dc.subject: machine learning
dc.title: Applications of Hyper-parameter Optimisations for Static Malware Detection
dc.type: Thesis
sdl.degree.department: Engineering
sdl.degree.discipline: Computer Science
sdl.degree.grantor: University of Sheffield
sdl.degree.name: Doctor of Philosophy
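
Illustrative note on the covering-array idea mentioned in the abstract: the sketch below shows, under stated assumptions, how a pairwise (strength-2) subset of a hyper-parameter grid can be far smaller than the full grid while still containing every pair of parameter values at least once. It is not the thesis's method and does not use cAgen; it substitutes a simple greedy selection written with the Python standard library only. The grid, parameter names and values are hypothetical and chosen purely for illustration.

```python
from itertools import combinations, product

# Hypothetical hyper-parameter grid (names and values are illustrative only).
GRID = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 8, 16],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", "log2", None],
}


def pairs_of(config, names):
    """Return every pair of (parameter, value) settings present in one configuration."""
    return {((a, config[a]), (b, config[b])) for a, b in combinations(names, 2)}


def pairwise_subset(grid):
    """Greedily pick configurations until every pair of values is covered.

    This is a simple greedy stand-in for a covering-array generator,
    not the algorithm used by cAgen.
    """
    names = sorted(grid)
    # The full Cartesian product is what plain Grid Search would evaluate.
    full = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]
    uncovered = set().union(*(pairs_of(c, names) for c in full))
    chosen = []
    while uncovered:
        # Take the configuration that covers the most still-uncovered pairs.
        best = max(full, key=lambda c: len(pairs_of(c, names) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best, names)
    return full, chosen


full_grid, subset = pairwise_subset(GRID)
print(len(full_grid), "configurations in the full grid")
print(len(subset), "configurations in the pairwise subset")
```

Each configuration in `subset` would then be trained and scored in the same way as a Grid Search candidate. The saving grows quickly as parameters are added, because the full grid grows multiplicatively while a pairwise subset grows much more slowly.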
