Lam, Wing; Ammann, Paul; Alshammari, Abdulrahman Turqi
2024-08-06; 2024-08-06; 2024-07-26
https://hdl.handle.net/20.500.14154/72785

A critical component of modern software development practices, particularly continuous integration (CI), is halting development activities in response to test failures, which then require further investigation and debugging. As software changes, regression testing becomes vital to verify that new code does not break existing functionality. However, this process is often delayed by the presence of flaky tests: tests that yield inconsistent results on the same codebase, alternating between pass and fail. Test flakiness erodes trust in testing outcomes and undermines the reliability of the CI process.

The typical approach to identifying flaky tests has been to execute them multiple times; if a test yields both passing and failing results without any modifications to the codebase, it is flaky, as discussed by Luo et al. in their empirical study. This approach, while straightforward, is resource-intensive and time-consuming, imposing considerable overhead on development teams. Moreover, it may not reveal flakiness in tests whose behavior varies across different execution environments. Given these challenges, the research community has been actively seeking more efficient and reliable alternatives to repeated test execution for flakiness detection, aiming for methods that can accurately detect flaky tests without multiple reruns and thereby reduce the time and resources required for testing.

This dissertation addresses three principal dimensions of test flakiness. First, it presents a machine learning classifier that detects which tests are flaky, based on previously detected flaky tests. Second, it proposes three deduplication-based approaches to help developers determine whether a test failure is due to flakiness. Third, it highlights the impact of test flakiness on other testing activities, particularly mutation testing, and discusses how to mitigate those effects.

The dissertation explores the detection of test flakiness by conducting an empirical study on the limitations of rerunning tests as a method for identifying flaky tests, which produces a large dataset of flaky tests. This dataset is then used to develop FlakeFlagger, a machine learning classifier that automatically predicts the likelihood of a test being flaky through static and dynamic analysis. The objective is to leverage FlakeFlagger to identify flaky tests without reruns by detecting patterns and symptoms common among previously identified flaky tests. In addressing the challenge of detecting whether a failure is due to flakiness, the dissertation demonstrates how developers can better manage flaky tests within their test suites, proposing three deduplication-based methods to help developers determine whether a specific failure is genuinely flaky. Furthermore, the dissertation examines the effects of test flakiness on mutation testing, a critical activity for assessing the quality of test suites. It includes an extensive rerun experiment on the mutation analysis of the flaky tests identified earlier in the study, highlighting the significant impact of flaky tests on the validity of mutation testing.
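For illustration only, a minimal sketch of the rerun-based detection approach described above; the run_test callable, its signature, and the rerun count are illustrative assumptions and not part of the dissertation:

    # Minimal sketch of rerun-based flaky-test detection (illustrative only).
    # run_test is a hypothetical callable that executes one test against an
    # unchanged codebase and returns True on pass, False on fail.
    def is_flaky(run_test, test_id, reruns=10):
        outcomes = {run_test(test_id) for _ in range(reruns)}
        # Observing both a pass and a fail on the same code marks the test as flaky.
        return len(outcomes) > 1

Each additional rerun multiplies test-execution time, which is the overhead that a prediction-based approach such as FlakeFlagger aims to avoid.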
112
en-US
Software Testing; Flaky Tests; Machine Learning; Data Science; Flaky Failures
Detecting Flaky Tests Without Rerunning Tests
Thesis