LARGE LANGUAGE MODELS FOR TEST CODE COMPREHENSION
Date
2025
Authors
Publisher
Saudi Digital Library
Abstract
Context:
Unit testing is critical for software reliability, yet challenges persist in ensuring the quality, comprehensibility, and maintainability of test code. Transformer- and LLM-based approaches have improved the readability of generated tests, but they also introduce new risks and omissions. Empirically, transformer-based test generation (e.g., AthenaTest) produces tests with nontrivial rates of test smells, mirroring issues present in the training corpus (e.g., Methods2Test). In parallel, assertion messages, which are essential for debugging and failure interpretation, are often missing or generic in both developer-written and LLM-generated tests, and test-level summaries frequently omit the core predicate being checked unless test-aware structure is made explicit. Beyond code artifacts, the rise of LLM APIs has created new forms of Self-Admitted Technical Debt (SATD) centered on prompt design, hyperparameter configuration, and framework orchestration.
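To make the assertion-message gap concrete, the following minimal JUnit 5 sketch contrasts a bare assertion with one carrying a descriptive message; the Invoice class, its values, and the method names are hypothetical illustrations, not material from the dissertation's data.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    // Hypothetical class under test, defined inline so the sketch is self-contained.
    class Invoice {
        private final int netCents;
        private final int taxPercent;

        Invoice(int netCents, int taxPercent) {
            this.netCents = netCents;
            this.taxPercent = taxPercent;
        }

        int totalCents() {
            return netCents + netCents * taxPercent / 100;
        }
    }

    class InvoiceTest {

        @Test
        void totalWithoutMessage() {
            Invoice invoice = new Invoice(10_000, 15);
            // Bare assertion: a failure reports only "expected: <11500> but was: <...>".
            assertEquals(11_500, invoice.totalCents());
        }

        @Test
        void totalWithDescriptiveMessage() {
            Invoice invoice = new Invoice(10_000, 15);
            // A descriptive assertion message states the intent, so a failure explains what broke.
            assertEquals(11_500, invoice.totalCents(),
                    "totalCents() should add 15% tax to the 10,000-cent net amount");
        }
    }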
Objective:
This dissertation systematically investigates (i) how transformer/LLM pipelines affect test quality (with a focus on test smells), (ii) whether and how LLMs can be guided to produce useful assertion messages and concise, faithful test summaries, and (iii) what LLM-specific SATD emerges in real projects and how to manage it. The overarching goal is to improve test comprehensibility and maintainability while establishing practices that reduce long-term technical debt in LLM-enabled development.
Method:
We conduct three complementary empirical studies. First, we analyze transformer-generated tests (AthenaTest) against a curated set of test smells and trace likely causes to two factors: properties learned from the Methods2Test training corpus and the model's design tendencies (e.g., assertion density and test size). Second, we evaluate the contribution of lightweight documentation to comprehension through two test-aware tasks. For assertion messages, we benchmark four fill-in-the-middle (FIM) code LLMs on a dataset of 216 Java test methods in which developer-written messages serve as ground truth, comparing model outputs to the human messages with semantic-similarity and human-likeness scoring. For test code summarization, we introduce a benchmark of 91 Java unit tests paired with developer-written comments and run an ablation over seven prompt variants, varying the test code, the method under test (MUT), assertion messages, and assertion semantics, across four code LLMs, assessed with BLEU, METEOR, ROUGE-L, BERTScore, and an LLM-as-a-judge rubric. Third, we study maintainability at the application layer by mining and classifying Self-Admitted Technical Debt (SATD) in LLM-based projects, identifying LLM-specific categories (prompt debt, hyperparameter debt, framework debt, cost debt, and learning debt) and quantifying which prompt techniques (e.g., instruction-first, few-shot) accrue the most debt in practice.
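As a rough illustration of a fill-in-the-middle setup for assertion-message generation, the test method can be split into a prefix and a suffix around the missing message and the model asked to generate the middle. The sketch below is an assumption-laden example, not the dissertation's exact protocol: the test snippet, the split point, and the sentinel tokens (StarCoder-style <fim_prefix>/<fim_suffix>/<fim_middle>; other code LLMs use different special tokens) are illustrative.

    // Sketch of building a FIM-style prompt that asks a code LLM to fill in an
    // assertion message, which would then be compared against the developer-written
    // ground-truth message.
    public class AssertionMessageFimPrompt {

        public static void main(String[] args) {
            // Everything before the (removed) assertion message, including the opening quote.
            String prefix = String.join("\n",
                    "@Test",
                    "void totalWithDescriptiveMessage() {",
                    "    Invoice invoice = new Invoice(10_000, 15);",
                    "    assertEquals(11_500, invoice.totalCents(), \"");

            // Everything after the message: closing quote, call terminator, and method end.
            String suffix = "\");\n}";

            // The model is expected to generate only the middle span, i.e., the message text.
            String prompt = "<fim_prefix>" + prefix + "<fim_suffix>" + suffix + "<fim_middle>";

            System.out.println(prompt);
        }
    }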
Description
Keywords
Code Intelligence, Code LLMs / LLMs for Code, Test Code, NLP, SE
