Evaluating Static, Contextual, and End-to-End Embedding Techniques for Malware Detection on Dynamic API Call Data

Basfar, Mohammed Raed

Files

Primary SACM-Dissertation.pdf (1.08 MB)

Date

2026

Authors

Basfar, Mohammed Raed

Publisher

Saudi Digital Library

Abstract

The pace of malware development continues to challenge cybersecurity, with traditional signature- and heuristic-based techniques overwhelmed by polymorphic and zero-day attacks. Natural language processing (NLP) offers a promising direction: by treating dynamic API call sequences as linguistic data with semantic structure, it allows embedding and sequence-learning methods to be applied to malware detection. This dissertation compares three representative embedding approaches (static, contextual, and end-to-end task-learned representations) under a shared experimental framework. Specifically, it evaluates Word2Vec embeddings with a Convolutional Neural Network (CNN), contextual BERT embeddings with a CNN, and a Bidirectional Long Short-Term Memory (BiLSTM) network with a trainable embedding layer and a weighted loss function to address class imbalance. The experiments were conducted on a dynamic API call dataset of around 44,000 malware and 1,000 benign samples, each represented by the first 100 API calls executed under sandboxed conditions. Results indicate that the Word2Vec + CNN pipeline achieved the highest overall accuracy and malware detection precision but the lowest benign recall. The BERT + CNN model provided more balanced performance across classes, at the cost of added computational overhead. The BiLSTM achieved the highest benign recall, recognizing non-malicious activity most reliably, but had the lowest precision and substantially higher resource use. The findings highlight the competing trade-offs among detection accuracy, benign recall, and processing efficiency, and underline the importance of aligning model selection with the resource constraints and priorities of real security contexts. The study contributes a systematic comparison of embedding approaches for malware detection and offers insight into performance versus efficiency trade-offs. Beyond its scientific significance, it demonstrates the broader potential of NLP-based approaches to support malware detection and to inform the design of responsive, resource-aware cybersecurity systems.
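
To make the class imbalance concrete, the following is a minimal sketch of two steps the abstract mentions: keeping only the first 100 API calls of a trace, and weighting the loss by class. The abstract does not say which weighting scheme the dissertation uses, so the inverse-frequency ("balanced") formula here is an assumption, as are the function names, the padding token, and the example API call name.

```python
import math
from collections import Counter

MAX_CALLS = 100   # the abstract states each sample is its first 100 API calls
PAD = "<pad>"     # hypothetical padding token for traces shorter than 100 calls


def truncate_or_pad(calls, max_len=MAX_CALLS, pad=PAD):
    """Keep the first max_len API calls and right-pad shorter traces."""
    head = list(calls[:max_len])
    return head + [pad] * (max_len - len(head))


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumed scheme; the dissertation may weight classes differently.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_nll(prob_true_class, label, weights):
    """Weighted negative log-likelihood of a single sample."""
    return -weights[label] * math.log(prob_true_class)


# Approximate class counts taken from the abstract.
labels = ["malware"] * 44_000 + ["benign"] * 1_000
weights = balanced_class_weights(labels)

# An illustrative three-call trace, padded to the fixed length of 100.
trace = truncate_or_pad(["NtCreateFile", "NtWriteFile", "NtClose"])
```

Under this assumed scheme the benign class receives a weight of 22.5 and the malware class roughly 0.51, so a misclassified benign sample contributes about 44 times as much to the loss as a misclassified malware sample with the same predicted probability.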

Keywords

Natural Language Processing (NLP), Artificial Intelligence (AI), Machine Learning, Embeddings

URI

https://hdl.handle.net/20.500.14154/78355

Collections

SACM - United Kingdom