AN UNSUPERVISED FRAMEWORK FOR ANALYSING HETEROGENEOUS LOG-FILES TO IDENTIFY MULTI-STAGE ATTACKS
Abstract
Cyberattacks have become increasingly advanced and prevalent on a global scale. One
of the most detrimental types of cyberattacks is the multi-stage attack, often referred
to as an Advanced Persistent Threat (APT), which combines espionage and sabotage,
often over long time periods. Detection of these attacks is extremely challenging due to
their deceptive approaches. The individual events of these attacks might appear benign
when examined in isolation or when originating from different sources. Furthermore, existing tools often
restrict their attention to single sources or rely on known patterns of behaviour. Thus,
there is a need for approaches that employ empirical behaviour analysis to overcome
the limitations of current tools and strengthen existing multi-layered defence strategies.
This research develops a novel framework to identify patterns and correlations between
malicious behaviours of multi-stage attacks such as APTs. This framework applies
unsupervised learning to heterogeneous logs and is therefore called the Unsupervised
Analysis for Heterogeneous Log-files (UAHL) framework. This framework investigates
multi-origin heterogeneous log files, using machine learning, in three main phases to
extract inner-behaviours of log files and construct patterns of those behaviours over the
analysed files. Finally, an Action Centre is developed to present sequential behaviours
of attacks, utilising a custom visualisation method. In addition, the Action Centre
allows administrators to browse and filter attack profiles along with the ability to show
similarity rates between those profiles in terms of their contained behaviours.
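As an illustration of what a similarity rate between attack profiles could look like, the following minimal sketch treats each profile as a set of behaviour labels and compares profiles with the Jaccard index. This is an assumption made for illustration only: the abstract does not state which similarity measure UAHL uses, and the function names, profile names and behaviour labels below are hypothetical.

```python
from __future__ import annotations

from itertools import combinations


def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Return the share of behaviours common to both profiles (0.0 to 1.0)."""
    if not a and not b:
        # Convention chosen for this sketch: two empty profiles are identical.
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_similarity(
    profiles: dict[str, set[str]],
) -> dict[tuple[str, str], float]:
    """Compute the similarity for every unordered pair of profiles."""
    return {
        (p, q): jaccard_similarity(profiles[p], profiles[q])
        for p, q in combinations(sorted(profiles), 2)
    }


# Hypothetical attack profiles; the behaviour labels are invented examples,
# not output of the UAHL framework.
profiles = {
    "profile_a": {"port_scan", "ssh_brute_force", "privilege_escalation"},
    "profile_b": {"port_scan", "ssh_brute_force", "data_exfiltration"},
    "profile_c": {"web_probe"},
}

similarities = pairwise_similarity(profiles)
for pair, score in similarities.items():
    print(pair, round(score, 2))
```

A set-based measure ignores the order in which behaviours occur; a sequence-aware measure might be preferable where the ordering of attack stages matters.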
The framework utilises a dynamic method to eliminate the need for manually pre-defining
the clustering parameters, a task that requires considerable domain knowledge and
significantly affects results. Moreover, to evaluate the framework, we have produced a (publicly
available) labelled version of the SotM43 dataset and have also used a second dataset for
the evaluation. Our results demonstrate that the framework can (i) efficiently cluster
inner-behaviours of security-related logs with high accuracy, (ii) extract patterns of
malicious behaviour and the correlations between those patterns from real-world data, and
(iii) present results in a meaningful format and effectively measure similarities
between the attack profiles.