Metadata-Centric Cybersecurity Classification: A Fair Benchmark of LLMs and Classical Models

Date

2025

Publisher

Saudi Digital Library

Abstract

Cybersecurity breach classification supports triage and risk response but is hindered by heterogeneous reporting, class imbalance, and limited semantic coverage in traditional pipelines. Prior work has relied on rule-based heuristics and classical models (SVM, Random Forest) with heavy feature engineering, while recent LLM studies rarely evaluate breach metadata under identical, fair splits; severity labels are often absent or not reproducibly constructed. We present a metadata-centric benchmark on the Privacy Rights Clearinghouse (PRC) chronology spanning two tasks: breach-type classification and severity tiering at three and five levels, with severity derived reproducibly from native fields using a Breach Level Index-style mapping. All models share one preprocessing recipe and a single stratified 80/20 train-test split. We compare parameter-efficient transformers (DistilBERT and T5 with LoRA) against tuned tabular baselines (Linear SVM, Random Forest, compact ANN). On breach type, DistilBERT achieves the strongest results (Accuracy 0.943; Macro-F1 0.840), surpassing the tabular baselines. For severity, a class-weighted ANN on TF-IDF and categorical features attains the highest Macro-F1 at both granularities, while T5 shows high accuracy but low Macro-F1, indicating majority-class bias. The study contributes a unified PRC schema with transparent severity construction, a fair head-to-head comparison under identical conditions, and an efficiency-oriented training recipe suitable for modest hardware.
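The gap between accuracy and Macro-F1 noted in the abstract can be illustrated with a minimal sketch. The label names, class proportions, and the always-predict-majority classifier below are hypothetical assumptions chosen for illustration; they are not the study's data, models, or results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, heavily imbalanced three-tier severity labels (assumption).
y_true = ["low"] * 90 + ["medium"] * 7 + ["high"] * 3
# A degenerate classifier that always predicts the majority class.
y_pred = ["low"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  macro_f1={mf1:.3f}")
```

In this toy setting the majority-class predictor scores 0.900 accuracy but only about 0.316 Macro-F1, because the two minority tiers each contribute an F1 of zero to the unweighted mean. This is the pattern the abstract reads as majority-class bias.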

Keywords

cybersecurity, data breaches, metadata, breach type classification, severity classification, Privacy Rights Clearinghouse, parameter-efficient fine-tuning, LoRA, multi-model benchmarking, Artificial Intelligence, LLM, fine-tuning, Neural Networks

Copyright owned by the Saudi Digital Library (SDL) © 2026