The Terminator: An AI-Based Framework to Handle Dependability Threats in Large-Scale Distributed Systems

Alharthi, Khalid Ayed Budayai

dc.contributor.advisor	Juhmika, Arshad
dc.contributor.author	Alharthi, Khalid Ayed Budayai
dc.date.accessioned	2023-07-06T08:29:57Z
dc.date.available	2023-07-06T08:29:57Z
dc.date.issued	2023-06-28
dc.description.abstract	With the advent of resource-hungry applications such as scientific simulations and artificial intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming more pressing. HPC systems are typically characterised by their scale: they contain a large number of sophisticated, tightly integrated hardware components. This scale and design complexity inherently introduce sources of uncertainty, i.e., dependability threats that perturb the system during application execution. During operation, HPC systems generate a massive volume of log messages that capture the health status of their components. Several previous works have leveraged these system logs for dependability purposes, such as failure prediction, with varying results. In this work, three novel AI-based techniques are proposed to address two major dependability problems: (i) error detection and (ii) failure prediction. The proposed error detection technique leverages the sentiments embedded in log messages in a novel way, making the approach HPC system-independent, i.e., the technique can be used to detect errors in any HPC system. For failure prediction, two novel self-supervised transformer neural networks are developed, obviating the need for labels, which are notoriously difficult to obtain in HPC systems. The first transformer technique, called Clairvoyant, accurately predicts the location of the failure, while the second, called Time Machine, extends Clairvoyant by also accurately predicting the lead time to failure (LTTF). Time Machine recasts LTTF prediction, typically treated as a regression problem, as a multi-class classification problem, using a novel oversampling method for online time-based task training. Results on datasets from six real-world HPC clusters show that our approaches significantly outperform the state-of-the-art methods on various metrics.
dc.format.extent	243
dc.identifier.uri	https://hdl.handle.net/20.500.14154/68532
dc.language.iso	en
dc.publisher	University of Warwick
dc.subject	artificial intelligence (AI)
dc.subject	HPC systems
dc.subject	dependability
dc.subject	failure prediction
dc.subject	Deep learning
dc.subject	Machine learning
dc.subject	Transformer
dc.subject	Natural language processing (NLP)
dc.title	The Terminator: An AI-Based Framework to Handle Dependability Threats in Large-Scale Distributed Systems
dc.type	Thesis
sdl.degree.department	Computer Science
sdl.degree.discipline	Computer Science - Artificial Intelligence - Distributed Systems
sdl.degree.grantor	University of Warwick
sdl.degree.name	Doctor of Philosophy in Computer Science

Collections

SACM - United Kingdom
