Applications of Hyper-parameter Optimisations for Static Malware Detection

Date

2023-05-30

Publisher

Saudi Digital Library

Abstract

Malware detection is a major security concern, and a great deal of academic and commercial research and development is directed at it. Machine learning (ML) is a natural technology to harness for malware detection, and many researchers have investigated its use. However, drawing comparisons between different techniques is a fraught affair. For example, the performance of ML algorithms often depends significantly on parametric choices, so the question arises as to which parameter choices are optimal. In this thesis, we investigate the use of a variety of ML algorithms for building malware classifiers and how best to tune the parameters of those algorithms, a process generally known as hyper-parameter optimisation (HPO). Firstly, we examine the effects of some simple (model-free) ways of parameter tuning together with a state-of-the-art Bayesian model-building approach. We demonstrate that optimal parameter choices may differ significantly from default choices and argue that hyper-parameter optimisation should be adopted as a ‘formal outer loop’ in the research and development of malware detection systems. Secondly, we investigate the use of covering arrays (combinatorial testing) as a way to combat the curse of dimensionality in Grid Search. Four ML techniques were used: Random Forests, XGBoost, LightGBM and Decision Trees. cAgen (a tool for combinatorial testing) is shown to be capable of generating high-performing subsets of the full Grid Search parameter grid and so provides a rigorous but highly efficient means of performing HPO. This may be regarded as a ‘design of experiments’ approach. Thirdly, evolutionary algorithms (EAs) were used to enhance machine learning classifier accuracy. The baseline accuracy of six traditional machine learning techniques is recorded. Two evolutionary algorithm frameworks, Tree-Based Pipeline Optimization Tool (TPOT) and Distributed Evolutionary Algorithms in Python (DEAP), are compared. DEAP shows very promising results for our malware detection problem. Fourthly, we compare the use of Grid Search and covering arrays for tuning the hyper-parameters of Neural Networks. Several major hyper-parameters were studied over a range of values. We achieve significant improvements over the benchmark model. Our work is carried out using EMBER, a major published malware benchmark dataset of Windows Portable Executable (PE) metadata samples, and a smaller dataset from kaggle.com (also comprising Windows Portable Executable metadata). Overall, we conclude that HPO is an essential part of credible evaluations of ML-based malware detection models. We also demonstrate that high-performing hyper-parameter values can be found by HPO and that these can be found efficiently.
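
The covering-array idea mentioned in the abstract can be illustrated with a small sketch. This is not cAgen and not the thesis's procedure: it is a simple greedy stand-in, using only the Python standard library, that selects rows from a full grid until every pair of values for every pair of hyper-parameters appears at least once (a strength-2 covering array). The function name `pairwise_covering_array` and the example grid (parameter names and values) are hypothetical and chosen only for illustration.

```python
from itertools import combinations, product


def pairwise_covering_array(grid):
    """Greedily select rows of the full grid until every pair of values
    for every pair of parameters is covered by at least one row.

    Illustrative stand-in only; dedicated tools such as cAgen use more
    sophisticated construction algorithms.
    """
    names = list(grid)

    def pairs_of(row):
        # All two-parameter value combinations that a single row covers.
        return {((a, row[a]), (b, row[b])) for a, b in combinations(names, 2)}

    # Every two-parameter value combination that must appear somewhere.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in grid[a]
        for vb in grid[b]
    }
    # Candidate rows are the rows of the exhaustive grid.
    candidates = [dict(zip(names, values)) for values in product(*grid.values())]

    rows = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda row: len(pairs_of(row) & uncovered))
        rows.append(best)
        uncovered -= pairs_of(best)
    return rows


# Hypothetical grid for a tree-ensemble classifier (values are assumptions).
grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 8, 16],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
}

rows = pairwise_covering_array(grid)
full_size = len(list(product(*grid.values())))
print(f"full grid: {full_size} configurations, pairwise subset: {len(rows)}")
```

Each selected row would then be evaluated exactly as a Grid Search configuration would be, so the saving comes purely from evaluating fewer configurations while still exercising every pairwise interaction between hyper-parameter values.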

Description

Thesis

Keywords

hyperparameter optimisation, static malware detection, neural network, deep learning, grid search, cAgen, combinatorial testing, covering arrays, automated machine learning, tree parzen estimators, bayesian optimisation, random search, machine learning

Copyright owned by the Saudi Digital Library © 2024