Monitoring Spark Application in Real Time

Abstract

Real-time analytics on Big Data make it possible to reach sound decisions and take purposeful action at the right time. Spark is an open-source distributed computing platform, deployed here with Hadoop YARN as the resource scheduler and HDFS (Hadoop Distributed File System) as the storage layer. Real-time monitoring enables operators to identify complications before they develop into problems, and to maintain high availability and high service performance. As the complexity of Big Data analysis applications grows, so does the need to monitor clusters and applications for auditing, accounting, and performance assessment. Spark application monitoring should occur at three levels: Spark running in YARN client mode, which includes the master and the workers; the applications running in the Spark driver and workers; and the cluster nodes, i.e. CPU, memory, and disk. For the current application, information is collected on the status of the Spark application and its jobs, stages, and tasks, and active and dead executors are detected. Additionally, the percentage of CPU, memory, and disk utilization consumed by the JVM processes in which Spark runs is collected on the master node and on each slave node. We visualize the metrics in a form that is readable and easy for users to understand, and we notify users when a metric crosses a threshold.
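The executor-detection and threshold-notification steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes executor records shaped like the JSON returned by Spark's monitoring REST API endpoint `/api/v1/applications/[app-id]/allexecutors` (fields `id`, `isActive`, `memoryUsed`, `maxMemory`). The function name `summarize_executors`, the 0.8 threshold, and the sample records are hypothetical. Note that `memoryUsed` is the storage memory Spark reports, not the JVM process utilization the abstract refers to, so it stands in only as an example metric.

```python
def summarize_executors(executors, memory_threshold=0.8):
    """Split executors into active and dead, and flag active ones whose
    memoryUsed / maxMemory ratio exceeds memory_threshold (hypothetical value)."""
    active, dead, alerts = [], [], []
    for ex in executors:
        if not ex.get("isActive", False):
            dead.append(ex["id"])
            continue
        active.append(ex["id"])
        max_mem = ex.get("maxMemory", 0)
        if max_mem > 0:  # avoid division by zero when no limit is reported
            ratio = ex.get("memoryUsed", 0) / max_mem
            if ratio > memory_threshold:
                alerts.append((ex["id"], round(ratio, 2)))
    return {"active": active, "dead": dead, "alerts": alerts}


# Hypothetical sample records, mimicking parsed REST API output.
sample = [
    {"id": "driver", "isActive": True, "memoryUsed": 100, "maxMemory": 1000},
    {"id": "1", "isActive": True, "memoryUsed": 900, "maxMemory": 1000},
    {"id": "2", "isActive": False, "memoryUsed": 0, "maxMemory": 1000},
]

summary = summarize_executors(sample)
print(summary)
```

In a live deployment the list would come from polling the REST API of the running driver at a fixed interval, and each entry in `alerts` would trigger a user notification.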

Copyright owned by the Saudi Digital Library (SDL) © 2025