Classifying Imbalanced Data for DDoS Attack Detection

Saudi Digital Library
In the first quarter of 2021, researchers observed over 2.8 million Distributed Denial of Service (DDoS) attacks, a 32% increase over the same period in 2020, as reported by Info-Security magazine on May 18, 2021. The magazine also noted that the number of attacks against educational institutions had increased by 41% over the preceding three quarters. DDoS has become a serious issue for many organizations and individuals. The evolution of networks has ushered in a level of complexity that is the enemy of security. Attacks are now more prevalent and at the same time more noticeable because of the variety of features that exist on networks, a consequence of the constant escalation between attackers and defenders. Machine learning algorithms (MLAs) have become a tool for thickening the layers of defense. To be effective, MLAs must be trained in ways that provide high confidence for detection and prevention, which boils down to precision and accuracy (i.e., low false positives and/or high true positives). This work develops a setup for building a measured intrusion detection system (IDS) that helps identify and understand the distinguishing features of a network in order to prevent DoS and DDoS attacks from succeeding. The goal is to develop models that classify different types of DoS/DDoS attacks with high precision and accuracy and with low false positive and false negative rates. In addition to handling multiclass classification and extreme class imbalance, the derived model uses two feature selection techniques to reduce the number of features in the dataset and improve the model's execution time, thereby reducing the IDS complexity. Under-sampling combined with weight adjustment was applied to handle the imbalance problem. The extracted data was evaluated using supervised MLAs, including Random Forest, Decision Tree, Naive Bayes, Logistic Regression, and ensemble methods.
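The combination of under-sampling and weight adjustment mentioned above can be illustrated with a minimal sketch. It is not the procedure used in this work: the per-class cap, the "balanced" weighting formula n_samples / (n_classes * class_count), the function names, and the toy label counts are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def undersample(labels, max_per_class, seed=0):
    """Return sorted indices that keep at most max_per_class samples per class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    kept = []
    for indices in by_class.values():
        # Only classes above the cap are randomly down-sampled.
        if len(indices) > max_per_class:
            indices = rng.sample(indices, max_per_class)
        kept.extend(indices)
    return sorted(kept)


def balanced_class_weights(labels):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical label counts, not taken from NSL-KDD or CICIDS2017.
labels = ["normal"] * 9000 + ["dos"] * 900 + ["rare_attack"] * 10

kept = undersample(labels, max_per_class=1000)
resampled = [labels[i] for i in kept]
weights = balanced_class_weights(resampled)

print(Counter(resampled))
print(weights)
```

Under-sampling alone still leaves the rarest class far smaller than the others here, so the remaining gap is covered by the weights. A dictionary of this shape could, for example, be passed to a classifier that accepts per-class weights, such as the `class_weight` parameter of scikit-learn's `RandomForestClassifier`.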
Ensemble methods, which combine the outcomes of the supervised models, aim to improve overall classification performance. The experiments used the popular benchmark datasets NSL-KDD and CICIDS2017. Random Forest achieved the best performance, reducing training and testing time by 37%. In addition to solving the imbalance problem caused by feature selection, it improved accuracy by 6.25% and the false positive rate (FPR) by 21%. The Random Forest model achieved 99% accuracy and a false positive rate of 0.0001. Furthermore, with this setup, minor classes can be detected with more than 80% accuracy.
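Because overall accuracy can hide poor performance on minor classes, per-class figures are useful alongside it. The sketch below shows one common way to compute them in a multiclass setting, treating each class one-vs-rest. The function name and the toy predictions are hypothetical, and reading per-class detection as recall is an assumption, not necessarily the definition used in this work.

```python
def per_class_rates(y_true, y_pred):
    """Return overall accuracy plus one-vs-rest recall and FPR for each class."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    total = len(pairs)
    correct = sum(t == p for t, p in pairs)
    rates = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # class c, predicted c
        fp = sum(t != c and p == c for t, p in pairs)  # other class, predicted c
        fn = sum(t == c and p != c for t, p in pairs)  # class c, predicted other
        tn = total - tp - fp - fn
        rates[c] = {
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
        }
    return correct / total, rates


# Toy predictions (hypothetical): 8 normal flows and 2 attack flows.
y_true = ["normal"] * 8 + ["dos"] * 2
y_pred = ["normal"] * 7 + ["dos"] + ["dos", "normal"]

accuracy, rates = per_class_rates(y_true, y_pred)
print(accuracy)
print(rates)
```

In this toy case overall accuracy is 0.8 while recall on the minority class is only 0.5, which is why minor-class results are worth reporting separately from the headline accuracy.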