Metadata-Centric Cybersecurity Classification: A Fair Benchmark of LLMs and Classical Models
Date
2025
Authors
Publisher
Saudi Digital Library
Abstract
Cybersecurity breach classification supports triage
and risk response but is hindered by heterogeneous reporting, class
imbalance, and limited semantic coverage in traditional pipelines.
Prior work has relied on rule-based heuristics and classical
models (SVM, Random Forest) with heavy feature engineering,
while recent LLM studies rarely evaluate breach metadata
under identical, fair splits; severity labels are often absent or
not reproducibly constructed. We present a metadata-centric
benchmark on the Privacy Rights Clearinghouse chronology
spanning two tasks: breach-type classification and severity tiering
at three-label and five-label granularities, with severity derived reproducibly
from native fields using a Breach Level Index-style mapping. All models
share one preprocessing recipe and a single stratified 80/20
train–test split. We compare parameter-efficient transformers
(DistilBERT and T5 with LoRA) against tuned tabular baselines
(Linear SVM, Random Forest, compact ANN). On breach type,
DistilBERT achieves the strongest results (Accuracy 0.943; Macro-F1
0.840), surpassing tabular baselines. For severity, a class-weighted
ANN on TF-IDF and categorical features attains the
highest Macro-F1 at both granularities, while T5 shows high
accuracy but low Macro-F1, indicating majority-class bias. The
study contributes a unified PRC schema with transparent severity
construction, a fair head-to-head comparison under identical
conditions, and an efficiency-oriented training recipe suitable for
modest hardware.
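
The gap between high accuracy and low Macro-F1 noted above can be illustrated with a minimal sketch. The labels below are hypothetical toy data, not drawn from the Privacy Rights Clearinghouse chronology, and the helper names `accuracy` and `macro_f1` are illustrative only; the study's own evaluation code is not shown in this record.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced three-tier severity labels (assumption, not PRC data).
y_true = ["low"] * 90 + ["medium"] * 7 + ["high"] * 3
# A degenerate classifier that always predicts the majority class.
y_pred = ["low"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  macro_f1={mf1:.3f}")
```

On this toy example the always-majority classifier reaches 0.900 accuracy but only about 0.316 Macro-F1, because the two minority tiers each score an F1 of zero. A similar pattern is what the abstract reads as majority-class bias in the T5 severity results.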
Keywords
cybersecurity, data breaches, metadata, breach type classification, severity classification, Privacy Rights Clearinghouse, parameter-efficient fine-tuning, LoRA, multi-model benchmarking, Artificial Intelligence, LLM, fine-tuning, Neural Networks
