An Exploration of Word Embedding Models for Phishing Email Detection

University of Southampton
Phishing emails are dangerous cyberattacks that attackers use to steal information. Manual solutions such as blacklists can be used to detect phishing emails. However, the emergence of machine learning solutions has made phishing email detection faster and easier. This study explored and compared the performance of three deep learning models for detecting text-based phishing emails. Each model used a different word embedding technique (Word2Vec, FastText, or GloVe), and all three used the same Long Short-Term Memory (LSTM) classifier. Two publicly available datasets were merged to create a balanced dataset of phishing and legitimate emails, using only the body text of the emails and excluding the headers. The first dataset is the Fraudulent E-mail Corpus - Nigerian Letter or "419" Fraud, which contains phishing emails. The second dataset is the Enron Email Dataset, which contains legitimate emails. The Word2Vec-LSTM model achieved the best performance, with an F1 score of 98.62% and an accuracy of 98.62%. The FastText-LSTM model also performed well but scored slightly lower, with an F1 score of 95.73% and an accuracy of 95.73%. The GloVe-LSTM model performed poorly, with an F1 score of 55.79% and an accuracy of 60.53%. We therefore conclude that using different embedding techniques with the same classifier can yield substantially different performance when classifying phishing and legitimate emails.
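
The dataset construction described above can be illustrated with a minimal sketch. It assumes the email bodies have already been extracted into two Python lists, and it assumes the classes are balanced by randomly downsampling the larger one; the abstract does not state how balance was achieved. The function name `build_balanced_dataset`, the label convention (1 for phishing, 0 for legitimate), and the toy strings are hypothetical and are not taken from either corpus.

```python
import random


def build_balanced_dataset(phishing_bodies, legitimate_bodies, seed=0):
    """Merge two lists of email body texts into a shuffled, class-balanced
    list of (text, label) pairs, where 1 = phishing and 0 = legitimate."""
    rng = random.Random(seed)
    # Assumption: balance by downsampling both classes to the smaller size.
    n = min(len(phishing_bodies), len(legitimate_bodies))
    phishing = rng.sample(phishing_bodies, n)
    legitimate = rng.sample(legitimate_bodies, n)
    dataset = [(text, 1) for text in phishing] + [(text, 0) for text in legitimate]
    rng.shuffle(dataset)
    return dataset


# Toy placeholder texts, invented for illustration only.
phishing_bodies = [
    "Dear friend, I need your help transferring funds.",
    "You have won a lottery, please send your bank details.",
]
legitimate_bodies = [
    "The meeting has moved to 3pm.",
    "Please review the attached report.",
    "Are you free for lunch on Friday?",
]

dataset = build_balanced_dataset(phishing_bodies, legitimate_bodies)
```

With equal class sizes, accuracy is not inflated by a majority class, which makes the reported accuracy and F1 scores easier to compare across the three models.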
data science, machine learning, AI, phishing emails, deep learning