The Terminator: An AI-Based Framework to Handle Dependability Threats in Large-Scale Distributed Systems

dc.contributor.advisor: Juhmika, Arshad
dc.contributor.author: Alharthi, Khalid Ayed Budayai
dc.date.accessioned: 2023-07-06T08:29:57Z
dc.date.available: 2023-07-06T08:29:57Z
dc.date.issued: 2023-06-28
dc.description.abstract: With the advent of resource-hungry applications such as scientific simulations and artificial intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming more pressing. HPC systems are typically characterised by the scale of their resources: they contain a large number of sophisticated, tightly integrated hardware components. This scale and design complexity are inherent sources of uncertainty, i.e., they give rise to dependability threats that perturb the system during application execution. While running, HPC systems generate a massive number of log messages that capture the health status of their components. Several previous works have leveraged these system logs for dependability purposes, such as failure prediction, with varying results. In this work, three novel AI-based techniques are proposed to address two major dependability problems: (i) error detection and (ii) failure prediction. The proposed error detection technique leverages the sentiments embedded in log messages in a novel way, making the approach HPC system-independent, i.e., the technique can be used to detect errors in any HPC system. For failure prediction, two novel self-supervised transformer neural networks are developed, thereby obviating the need for labels, which are notoriously difficult to obtain in HPC systems. The first transformer technique, called Clairvoyant, accurately predicts the location of the failure, while the second technique, called Time Machine, extends Clairvoyant by also accurately predicting the lead time to failure (LTTF). Time Machine recasts LTTF prediction, typically treated as a regression problem, as a multi-class classification problem, using a novel oversampling method for online time-based task training. Results on datasets from six real-world HPC clusters show that our approaches significantly outperform state-of-the-art methods on various metrics.
dc.format.extent: 243
dc.identifier.uri: https://hdl.handle.net/20.500.14154/68532
dc.language.iso: en
dc.publisher: University of Warwick
dc.subject: artificial intelligence (AI)
dc.subject: HPC systems
dc.subject: dependability
dc.subject: failure prediction
dc.subject: Deep learning
dc.subject: Machine learning
dc.subject: Transformer
dc.subject: Natural Language processing (NLP)
dc.title: The Terminator: An AI-Based Framework to Handle Dependability Threats in Large-Scale Distributed Systems
dc.type: Thesis
sdl.degree.department: Computer Science
sdl.degree.discipline: Computer Science - Artificial Intelligence - Distributed Systems
sdl.degree.grantor: University of Warwick
sdl.degree.name: Doctor of Philosophy in Computer Science
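
The abstract describes treating lead time to failure (LTTF) as a multi-class classification problem rather than a regression problem, combined with oversampling. The sketch below illustrates only that general idea and is not the thesis's implementation. The bin edges in `BIN_EDGES_MIN`, the function names, and the sample data are all hypothetical, and plain random oversampling stands in for the thesis's own oversampling method, which this record does not describe.

```python
from bisect import bisect_right
from collections import Counter
import random

# Hypothetical bin edges in minutes (assumption, not taken from the thesis).
# They define four classes: [0, 5), [5, 15), [15, 60), [60, inf).
BIN_EDGES_MIN = [5, 15, 60]


def lead_time_to_class(lead_time_min):
    """Map a continuous lead time in minutes to a discrete class index."""
    return bisect_right(BIN_EDGES_MIN, lead_time_min)


def oversample(samples, labels, seed=0):
    """Naive random oversampling (a stand-in technique): duplicate
    minority-class samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    return balanced


# Made-up lead times in minutes, skewed towards short lead times.
lead_times = [2, 3, 4, 10, 30, 120]
labels = [lead_time_to_class(t) for t in lead_times]
balanced = oversample(lead_times, labels)
class_counts = Counter(label for _, label in balanced)
```

With these made-up inputs, the regression target becomes one of four class labels, and each class ends up with the same number of training samples after oversampling.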
