LARGE LANGUAGE MODELS FOR TEST CODE COMPREHENSION

dc.contributor.advisor: Do, Hyunsook
dc.contributor.author: Aljohani, Ahmed
dc.date.accessioned: 2025-12-21T15:22:22Z
dc.date.issued: 2025
dc.description.abstract:

Context: Unit testing is critical for software reliability, yet challenges persist in ensuring the quality, comprehensibility, and maintainability of test code. Transformer/LLM-based approaches have improved the readability of generated tests, but they also introduce new risks and omissions. Empirically, transformer-based test generation (e.g., AthenaTest) produces tests with nontrivial smell rates, mirroring issues present in the training corpus (e.g., Methods2Test). In parallel, assertion messages, which are essential for debugging and failure interpretation, are often missing or generic in both developer- and LLM-generated tests, and test-level summaries frequently omit the core predicate unless test-aware structure is made explicit. Beyond code artifacts, the rise of LLM APIs has created new forms of Self-Admitted Technical Debt (SATD) centered on prompt design, hyperparameter configurations, and framework orchestration.

Objective: This dissertation systematically investigates (i) how transformer/LLM pipelines affect test quality, with a focus on test smells; (ii) whether and how LLMs can be guided to produce useful assertion messages and concise, faithful test summaries; and (iii) what LLM-specific SATD emerges in real projects and how to manage it. The overarching goal is to improve test comprehensibility and maintainability while establishing practices that reduce long-term technical debt in LLM-enabled development.

Method: We conduct three complementary empirical studies. First, we analyze transformer-generated tests (AthenaTest) against a curated set of test smells, then trace likely causes to two factors: properties learned from the Methods2Test training corpus and the model's design tendencies (e.g., assertion density and test size). Second, we evaluate the contribution of lightweight documentation to comprehension through two test-aware tasks. For assertion messages, we benchmark four FIM-style code LLMs on a dataset of 216 Java test methods whose developer-written messages serve as ground truth, comparing model outputs to the human messages with semantic-similarity and human-likeness scoring. For test code summarization, we introduce a benchmark of 91 Java unit tests paired with developer-written comments and run an ablation over seven prompt variants (varying the test code, the method under test (MUT), assertion messages, and assertion semantics) across four code LLMs, assessed with BLEU, METEOR, ROUGE-L, BERTScore, and an LLM-as-a-judge rubric. Third, we study maintainability at the application layer by mining and classifying SATD in LLM-based projects, identifying LLM-specific categories (prompt debt, hyperparameter debt, framework debt, cost debt, and learning debt) and quantifying which prompt techniques (e.g., instruction-first, few-shot) accrue the most debt in practice.
dc.format.extent: 138
dc.identifier.uri: https://hdl.handle.net/20.500.14154/77598
dc.language.iso: en_US
dc.publisher: Saudi Digital Library
dc.subject: Code Intelligence
dc.subject: Code LLMs / LLMs for Code
dc.subject: Test Code
dc.subject: NLP
dc.subject: SE
dc.title: LARGE LANGUAGE MODELS FOR TEST CODE COMPREHENSION
dc.type: Thesis
sdl.degree.department: Department of Computer Science and Engineering
sdl.degree.discipline: Computer Science, NLP
sdl.degree.grantor: The University of North Texas (UNT)
sdl.degree.name: PhD in Computer Science and Engineering
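
A minimal sketch of the fill-in-the-middle (FIM) prompting setup described in the abstract's assertion-message study: the assertion message argument in a Java test is masked, and a code LLM is asked to infill it. The sentinel tokens (<fim_prefix>, <fim_suffix>, <fim_middle>) assume a StarCoder-style FIM format; other code LLMs use different markers. The example test, the <MSG> hole marker, and the query_code_llm call are hypothetical illustrations, not artifacts of the benchmark.

# Minimal sketch: build a fill-in-the-middle (FIM) prompt asking a code LLM to
# infill a missing assertion message in a Java test method. The FIM sentinel
# tokens below assume a StarCoder-style format; adapt them to the model in use.

JAVA_TEST = """\
@Test
public void testWithdrawReducesBalance() {
    Account account = new Account(100.0);
    account.withdraw(40.0);
    assertEquals(60.0, account.getBalance(), 0.001, <MSG>);
}
"""

def build_fim_prompt(test_method: str, hole_marker: str = "<MSG>") -> str:
    """Split the test around the masked message and wrap the parts in FIM tokens."""
    prefix, suffix = test_method.split(hole_marker, maxsplit=1)
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt(JAVA_TEST)
print(prompt)

# In the benchmark setting, the prompt would be sent to each code LLM and the
# generated message compared against the developer-written one, e.g.:
#   candidate = query_code_llm(prompt, max_new_tokens=32)  # hypothetical API

A plausible completion here would be a string literal such as "balance should shrink by the withdrawn amount", which the study would then score against the developer-written message.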

Files

Original bundle

Name: SACM-Dissertation.pdf
Size: 1.29 MB
Format: Adobe Portable Document Format
