Metadata-Centric Cybersecurity Classification: A Fair Benchmark of LLMs and Classical Models
| dc.contributor.advisor | Chaudhry, Umair Bilal | |
| dc.contributor.author | Binothman, Elyas | |
| dc.date.accessioned | 2025-11-18T15:39:11Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Cybersecurity breach classification supports triage and risk response but is hindered by heterogeneous reporting, class imbalance, and limited semantic coverage in traditional pipelines. Prior work has relied on rule-based heuristics and classical models (SVM, Random Forest) with heavy feature engineering, while recent LLM studies rarely evaluate breach metadata under identical, fair splits; severity labels are often absent or not reproducibly constructed. We present a metadata-centric benchmark on the Privacy Rights Clearinghouse (PRC) chronology spanning two tasks: breach-type classification and severity tiering at three and five levels, with severity derived reproducibly from native fields using a Breach Level Index-style mapping. All models share one preprocessing recipe and a single stratified 80/20 train-test split. We compare parameter-efficient transformers (DistilBERT and T5 with LoRA) against tuned tabular baselines (Linear SVM, Random Forest, compact ANN). On breach type, DistilBERT achieves the strongest results (Accuracy 0.943; Macro-F1 0.840), surpassing the tabular baselines. For severity, a class-weighted ANN on TF-IDF and categorical features attains the highest Macro-F1 at both granularities, while T5 shows high accuracy but low Macro-F1, indicating majority-class bias. The study contributes a unified PRC schema with transparent severity construction, a fair head-to-head comparison under identical conditions, and an efficiency-oriented training recipe suitable for modest hardware. | |
| dc.format.extent | 12 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14154/77035 | |
| dc.language.iso | en | |
| dc.publisher | Saudi Digital Library | |
| dc.subject | cybersecurity | |
| dc.subject | data breaches | |
| dc.subject | metadata | |
| dc.subject | breach type classification | |
| dc.subject | severity classification | |
| dc.subject | Privacy Rights Clearinghouse | |
| dc.subject | parameter-efficient fine-tuning | |
| dc.subject | LoRA | |
| dc.subject | multi-model benchmarking | |
| dc.subject | Artificial Intelligence | |
| dc.subject | LLM | |
| dc.subject | fine-tuning | |
| dc.subject | Neural Networks | |
| dc.title | Metadata-Centric Cybersecurity Classification: A Fair Benchmark of LLMs and Classical Models | |
| dc.type | Thesis | |
| sdl.degree.department | School of Electronic Engineering and Computer Science - Department of Computer Science | |
| sdl.degree.discipline | Cybersecurity | |
| sdl.degree.grantor | Queen Mary University of London | |
| sdl.degree.name | Master of Science with Distinction in Artificial Intelligence |
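
The abstract reads high accuracy combined with low Macro-F1 as a sign of majority-class bias. The sketch below illustrates why that reading is reasonable, using made-up severity labels rather than the thesis data. The helper names `accuracy` and `macro_f1`, the label counts, and the choice to average F1 over the classes present in the true labels are all assumptions for illustration, not the author's evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced three-level severity labels (not from the thesis).
y_true = ["low"] * 90 + ["medium"] * 7 + ["high"] * 3
# A degenerate classifier that always predicts the majority class.
y_pred = ["low"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

In this toy case the always-majority classifier is right 90% of the time, yet its Macro-F1 is only about 0.32 because the two minority classes each score zero. Macro-F1 weights every class equally, which is why it exposes this failure mode while accuracy hides it.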
