LARGE LANGUAGE MODELS FOR TEST CODE COMPREHENSION

dc.contributor.advisor: Do, Hyunsook
dc.contributor.author: Aljohani, Ahmed
dc.date.accessioned: 2025-12-21T15:22:22Z
dc.date.issued: 2025
dc.description.abstract:

Context: Unit testing is critical for software reliability, yet challenges persist in ensuring the quality, comprehensibility, and maintainability of test code. Transformer/LLM-based approaches have improved the readability of generated tests, but they also introduce new risks and omissions. Empirically, transformer-based test generation (e.g., AthenaTest) produces tests with nontrivial smell rates, mirroring issues present in the training corpus (e.g., Methods2Test). In parallel, assertion messages, which are essential for debugging and failure interpretation, are often missing or generic in both developer- and LLM-generated tests, and test-level summaries frequently omit the core predicate unless test-aware structure is made explicit. Beyond code artifacts, the rise of LLM APIs has created new forms of Self-Admitted Technical Debt (SATD) centered on prompt design, hyperparameter configurations, and framework orchestration.

Objective: This dissertation systematically investigates (i) how transformer/LLM pipelines affect test quality, with a focus on test smells; (ii) whether and how LLMs can be guided to produce useful assertion messages and concise, faithful test summaries; and (iii) what LLM-specific SATD emerges in real projects and how to manage it. The overarching goal is to improve test comprehensibility and maintainability while establishing practices that reduce long-term technical debt in LLM-enabled development.

Method: We conduct three complementary empirical studies. First, we analyze transformer-generated tests (AthenaTest) against a curated set of test smells, then trace likely causes to two factors: properties learned from the Methods2Test training corpus and the model's design tendencies (e.g., assertion density and test size). Second, we evaluate the contribution of lightweight documentation to comprehension through two test-aware tasks. For assertion messages, we benchmark four FIM-style code LLMs on a dataset of 216 Java test methods whose developer-written messages serve as ground truth, comparing model outputs to the human messages with semantic-similarity and human-likeness scoring. For test code summarization, we introduce a benchmark of 91 Java unit tests paired with developer-written comments and run an ablation over seven prompt variants (varying the test code, the method under test (MUT), assertion messages, and assertion semantics) across four code LLMs, assessed with BLEU, METEOR, ROUGE-L, BERTScore, and an LLM-as-a-judge rubric. Third, we study maintainability at the application layer by mining and classifying SATD in LLM-based projects, identifying LLM-specific categories (prompt debt, hyperparameter debt, framework debt, cost debt, and learning debt) and quantifying which prompt techniques (e.g., instruction-first, few-shot) accrue the most debt in practice.
dc.format.extent: 138
dc.identifier.uri: https://hdl.handle.net/20.500.14154/77598
dc.language.iso: en_US
dc.publisher: Saudi Digital Library
dc.subject: Code Intelligence
dc.subject: Code LLMs / LLMs for Code
dc.subject: Test Code
dc.subject: NLP
dc.subject: SE
dc.title: LARGE LANGUAGE MODELS FOR TEST CODE COMPREHENSION
dc.type: Thesis
sdl.degree.department: Department of Computer Science and Engineering
sdl.degree.discipline: Computer Science, NLP
sdl.degree.grantor: The University of North Texas (UNT)
sdl.degree.name: PhD in Computer Science and Engineering
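
A minimal sketch of the fill-in-the-middle (FIM) prompting setup described in the abstract's assertion-message study: the assertion message argument in a Java test is masked, and a code LLM is asked to infill it. The sentinel tokens (<fim_prefix>, <fim_suffix>, <fim_middle>) assume a StarCoder-style FIM format; other code LLMs use different markers. The example test, the <MSG> hole marker, and the query_code_llm call are hypothetical illustrations, not artifacts of the benchmark.

# Minimal sketch: build a fill-in-the-middle (FIM) prompt asking a code LLM to
# infill a missing assertion message in a Java test method. The FIM sentinel
# tokens below assume a StarCoder-style format; adapt them to the model in use.

JAVA_TEST = """\
@Test
public void testWithdrawReducesBalance() {
    Account account = new Account(100.0);
    account.withdraw(40.0);
    assertEquals(60.0, account.getBalance(), 0.001, <MSG>);
}
"""

def build_fim_prompt(test_method: str, hole_marker: str = "<MSG>") -> str:
    """Split the test around the masked message and wrap the parts in FIM tokens."""
    prefix, suffix = test_method.split(hole_marker, maxsplit=1)
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt(JAVA_TEST)
print(prompt)

# In the benchmark setting, the prompt would be sent to each code LLM and the
# generated message compared against the developer-written one, e.g.:
#   candidate = query_code_llm(prompt, max_new_tokens=32)  # hypothetical API

A plausible completion here would be a string literal such as "balance should shrink by the withdrawn amount", which the study would then score against the developer-written message.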

Files

Original bundle

Name: SACM-Dissertation.pdf
Size: 1.29 MB
Format: Adobe Portable Document Format
