AN UNSUPERVISED FRAMEWORK FOR ANALYSING HETEROGENEOUS LOG-FILES TO IDENTIFY MULTI-STAGE ATTACKS
AHMED ABDULRAHMAN ALGHAMDI
Cyberattacks have become increasingly advanced and prevalent on a global scale. One of the most detrimental types of cyberattack is the multi-stage attack, often referred to as an Advanced Persistent Threat (APT), which combines espionage and sabotage, often over long time periods. Detecting these attacks is extremely challenging because of their deceptive approaches: the sequential events of an attack may appear benign when performed individually or from different sources. Furthermore, existing tools often restrict their attention to single sources or rely on known patterns of behaviour. Thus, there is a need for approaches that employ empirical behaviour analysis to overcome the limitations of existing tools and enhance existing multi-layered defence strategies.
This research develops a novel framework to identify patterns and correlations between malicious behaviours of multi-stage attacks such as APTs. The framework applies unsupervised learning to heterogeneous logs and is therefore called the Unsupervised Analysis for Heterogeneous Log-files (UAHL) framework. It uses machine learning to investigate multi-origin heterogeneous log files in three main phases, extracting the inner-behaviours of log files and constructing patterns of those behaviours across the analysed files. Finally, an Action Centre is developed to present the sequential behaviours of attacks using a custom visualisation method. The Action Centre also allows administrators to browse and filter attack profiles, and shows similarity rates between those profiles in terms of the behaviours they contain.
The framework utilises a dynamic method that eliminates the need to manually pre-define the clustering parameters, a task that requires substantial domain knowledge and significantly affects results. Moreover, to evaluate the framework, we have produced a publicly available labelled version of the SotM43 dataset and have used a second dataset alongside it.
Our results demonstrate that the framework can (i) efficiently cluster inner-behaviours of security-related logs with high accuracy, (ii) extract patterns of malicious behaviour, and correlations between those patterns, from real-world data, and (iii) present results in a meaningful format and effectively measure similarities between attack profiles.
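As a rough illustration of what a similarity rate between two attack profiles "in terms of their contained behaviours" could look like, the sketch below compares two profiles as sets of behaviour labels using the Jaccard index. This is an assumption made for illustration only: the abstract does not state which similarity measure the Action Centre uses, and the function name and behaviour labels here are hypothetical.

```python
def jaccard_similarity(profile_a, profile_b):
    """Share of behaviours common to both profiles, in the range 0.0 to 1.0.

    Assumption: each attack profile is treated as an unordered set of
    behaviour labels. The framework's actual measure may differ.
    """
    a, b = set(profile_a), set(profile_b)
    if not a and not b:
        # Two empty profiles are treated as identical by convention.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical behaviour labels, not taken from the SotM43 dataset.
profile_a = {"port_scan", "ssh_brute_force", "privilege_escalation"}
profile_b = {"port_scan", "ssh_brute_force", "data_exfiltration"}

similarity = jaccard_similarity(profile_a, profile_b)
print(f"similarity rate: {similarity:.2f}")
```

A set-based measure ignores the order in which behaviours occur; a sequence-aware measure would be needed if the ordering of attack stages should influence the score.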