Investigating the Detection of Stored Scripting Attacks Using Machine Learning
Abstract
Web applications now play an essential role in daily life: through them we can
make bank transfers, purchase products, and make bookings online. This
makes them a target for attackers who will attempt to exploit security vulnerabilities
in web applications in order to obtain access to sensitive user information or
gain unauthorized privileges. One of the most common attacks aimed at stealing
user information is Cross-Site Scripting (XSS), which is ranked among the top 10
security vulnerabilities in web applications. Traditional defense systems rely on a
signature database describing known attacks; however, XSS attacks written in
JavaScript are highly variable and do not take a single form. The most common cause
of XSS security vulnerabilities is weak validation of user input. This
provides the motivation for finding a method for identifying malicious code, written
in JavaScript, that an attacker attempts to have executed on the server.
Machine learning has contributed to the security of web applications. Several
studies have been conducted in relation to Intrusion Detection Systems (IDS), which
detect and prevent attacks against web applications. Cross-Site Scripting is one of
the attacks that has been studied using a number of methods: for example,
using features to identify obfuscated scripts, using JavaScript keywords, and
evaluating machine learning algorithms such as random forest and SVM in terms of
their ability to detect attacks against web applications. These studies have achieved
highly accurate results by using machine learning to detect XSS attacks, and have
often attained better results than dynamic and static analysis when acting as a
protection layer for web applications.
The present study demonstrates the use of machine learning methods incorporated
into a web application at the user input validation stage, before the
request is passed to the application server. Classifiers are used to prevent
persistent or stored XSS attacks, which are caused by malicious code injections
via an input point in the web application. This study relies on supervised machine
learning and the application of Boolean feature sets, in order to achieve ease and
speed of classification. Furthermore, this study examines the use of such methods
on two other types of injection attack: SQL injection (SQLi) and LDAP injection.
Cascading classifiers and ensemble techniques are used to reduce complexity while
maintaining accuracy and speed. To understand how the classifier reaches a decision,
an approximate Boolean function is extracted from it, drawing on techniques that
have been employed to extract rules from black-box classifiers.
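
As a rough illustration of classifying user input on Boolean features at the
validation stage, the following minimal sketch maps each input string to a vector
of 0/1 features and trains a simple perceptron on a handful of labelled examples.
Everything in it is an assumption made for illustration: the four features, the
toy training strings, the choice of a perceptron, and the names FEATURES, extract,
train_perceptron and is_malicious do not come from the study, which does not list
its feature sets or classifiers here.

```python
import re

# Hypothetical Boolean features (assumed for illustration; the study's
# actual feature sets are not given in this abstract).
FEATURES = {
    "has_script_tag": re.compile(r"<\s*script", re.I),
    "has_event_handler": re.compile(r"\bon\w+\s*=", re.I),
    "has_js_scheme": re.compile(r"javascript\s*:", re.I),
    "has_angle_bracket": re.compile(r"[<>]"),
}


def extract(text):
    """Map an input string to a tuple of 0/1 feature values."""
    return tuple(int(bool(p.search(text))) for p in FEATURES.values())


def train_perceptron(samples, epochs=20):
    """Train a perceptron on (text, label) pairs; label 1 means malicious."""
    w = [0] * len(FEATURES)
    b = 0
    for _ in range(epochs):
        for text, y in samples:
            x = extract(text)
            pred = int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
            err = y - pred
            if err:
                # Standard perceptron update on a misclassified sample.
                w = [wi + err * xi for wi, xi in zip(w, x)]
                b += err
    return w, b


def is_malicious(text, model):
    """Return True if the trained model flags the input as malicious."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, extract(text))) + b > 0


# Toy labelled data (invented for this sketch, not taken from the study).
TRAIN = [
    ("<script>alert(1)</script>", 1),
    ("<img src=x onerror=alert(1)>", 1),
    ('<a href="javascript:alert(1)">x</a>', 1),
    ("Nice article, thanks!", 0),
    ("2 < 3 is true", 0),
    ("I love <b>bold</b> text", 0),
]

MODEL = train_perceptron(TRAIN)

if __name__ == "__main__":
    for candidate in ["<svg onload=alert(1)>", "Hello world"]:
        verdict = "reject" if is_malicious(candidate, MODEL) else "accept"
        print(f"{candidate!r}: {verdict}")
```

Because every feature is Boolean, a small linear model of this kind can be read
off as a Boolean rule. On this toy data the learned weights behave like an OR of
the first three features, which hints at the sort of approximate Boolean function
the study extracts, although its own extraction technique may differ.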