An Exploration of Word Embedding Models for Phishing Email Detection
Date
2023-09-21
Publisher
University of Southampton
Abstract
Phishing emails are dangerous cyberattacks that attackers use to steal information. Manual
solutions such as blacklists can be used to detect phishing emails. However, the emergence
of machine learning solutions has made phishing email detection faster and easier. This study
explored and compared the performance of three deep learning models for detecting text-based
phishing emails. The models used different word embedding techniques: Word2Vec, FastText,
and GloVe. All three models used a Long Short-Term Memory (LSTM) classifier. Two publicly
available datasets were merged to create a balanced dataset of phishing and legitimate emails
using only the body text of the emails, excluding the header. The first dataset is the Fraudulent
E-mail Corpus - Nigerian Letter or "419" Fraud, which contains phishing emails. The
second dataset is the Enron Email Dataset, which contains legitimate emails. The
Word2Vec-LSTM model achieved the best performance, with an F1 score of 98.62% and an
accuracy of 98.62%. The FastText-LSTM model also performed well, though slightly below the
Word2Vec-LSTM model, with an F1 score of 95.73% and an accuracy of 95.73%. The
GloVe-LSTM model performed poorly, with an F1 score of 55.79% and an accuracy of 60.53%.
We therefore conclude that the choice of embedding technique, with the classifier held fixed,
can markedly change performance in classifying phishing and legitimate emails.
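
The dataset construction and the two reported metrics can be illustrated with a minimal sketch. It is not the study's code: the function names `build_balanced_dataset` and `accuracy_and_f1` are hypothetical, balancing by downsampling the larger class is an assumption (the abstract does not say how balance was achieved), and the F1 shown is the binary F1 for the phishing class, which may differ from the averaging used in the study.

```python
import random
from typing import List, Sequence, Tuple

PHISHING, LEGITIMATE = 1, 0  # label encoding chosen for this sketch


def build_balanced_dataset(
    phishing_bodies: Sequence[str],
    legitimate_bodies: Sequence[str],
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Merge two corpora of email body text into one shuffled, class-balanced list."""
    rng = random.Random(seed)
    n = min(len(phishing_bodies), len(legitimate_bodies))
    # Assumption: balance is obtained by downsampling the larger class.
    phishing = rng.sample(list(phishing_bodies), n)
    legitimate = rng.sample(list(legitimate_bodies), n)
    merged = [(text, PHISHING) for text in phishing]
    merged += [(text, LEGITIMATE) for text in legitimate]
    rng.shuffle(merged)
    return merged


def accuracy_and_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = PHISHING
) -> Tuple[float, float]:
    """Return (accuracy, binary F1 for the positive class)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy stand-ins for the two corpora (invented strings, not real data).
toy_phishing = ["claim your inheritance", "urgent wire transfer", "verify your account"]
toy_legitimate = [
    "meeting at 3pm",
    "quarterly report attached",
    "lunch tomorrow?",
    "see revised schedule",
    "thanks for the update",
]
toy_dataset = build_balanced_dataset(toy_phishing, toy_legitimate, seed=42)
```

On a balanced dataset, accuracy and F1 tend to stay close when a classifier's errors are spread evenly across both classes, and drift apart when errors concentrate in one class.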
Keywords
data science, machine learning, AI, phishing emails, deep learning