Metadata-Centric Cybersecurity Classification: A Fair Benchmark of LLMs and Classical Models

dc.contributor.advisor: Chaudhry, Umair Bilal
dc.contributor.author: Binothman, Elyas
dc.date.accessioned: 2025-11-18T15:39:11Z
dc.date.issued: 2025
dc.description.abstract: Cybersecurity breach classification supports triage and risk response but is hindered by heterogeneous reporting, class imbalance, and limited semantic coverage in traditional pipelines. Prior work has relied on rule-based heuristics and classical models (SVM, Random Forest) with heavy feature engineering, while recent LLM studies rarely evaluate breach metadata under identical, fair splits; severity labels are often absent or not reproducibly constructed. We present a metadata-centric benchmark on the Privacy Rights Clearinghouse chronology spanning two tasks: breach-type classification and severity tiering in three and five labels, with severity derived reproducibly from native fields using a Breach Level Index style mapping. All models share one preprocessing recipe and a single stratified 80/20 train–test split. We compare parameter-efficient transformers (DistilBERT and T5 with LoRA) against tuned tabular baselines (Linear SVM, Random Forest, compact ANN). On breach type, DistilBERT achieves the strongest results (Accuracy 0.943; Macro–F1 0.840), surpassing tabular baselines. For severity, a class-weighted ANN on TF–IDF and categorical features attains the highest Macro–F1 at both granularities, while T5 shows high accuracy but low Macro–F1, indicating majority-class bias. The study contributes a unified PRC schema with transparent severity construction, a fair head-to-head comparison under identical conditions, and an efficiency-oriented training recipe suitable for modest hardware.
dc.format.extent: 12
dc.identifier.uri: https://hdl.handle.net/20.500.14154/77035
dc.language.iso: en
dc.publisher: Saudi Digital Library
dc.subject: cybersecurity
dc.subject: data breaches
dc.subject: metadata
dc.subject: breach type classification
dc.subject: severity classification
dc.subject: Privacy Rights Clearinghouse
dc.subject: parameter-efficient fine-tuning
dc.subject: LoRA
dc.subject: multi-model benchmarking
dc.subject: Artificial Intelligence
dc.subject: LLM
dc.subject: fine-tuning
dc.subject: Neural Networks
dc.title: Metadata-Centric Cybersecurity Classification: A Fair Benchmark of LLMs and Classical Models
dc.type: Thesis
sdl.degree.department: School of Electronic Engineering and Computer Science - Department of Computer Science
sdl.degree.discipline: Cybersecurity
sdl.degree.grantor: Queen Mary University of London
sdl.degree.name: Master of Science with Distinction in Artificial Intelligence
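
The abstract reports that a model can show high accuracy but low Macro-F1 when it favors the majority class. The sketch below illustrates that effect on made-up data. It is not the thesis's code or data: the label names, the 90/7/3 class counts, and the helper functions `accuracy` and `macro_f1` are all assumptions chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true positives contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, heavily imbalanced three-tier severity labels.
y_true = ["low"] * 90 + ["medium"] * 7 + ["high"] * 3
# A degenerate classifier that always predicts the majority class.
y_pred = ["low"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro-F1 = {mf1:.3f}")
```

Because Macro-F1 averages per-class scores without weighting by class frequency, the two minority classes that are never predicted pull the score down even though most individual predictions are correct.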

Files

Original bundle

Name:
SACM-Dissertation.pdf
Size:
1.96 MB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
1.61 KB
Format:
Item-specific license agreed to upon submission
