Monitoring Spark Applications in Real Time
Abstract
Real-time analytics over Big Data make it possible to take the best decisions and purposeful
action at the right time. Spark is an open-source distributed computing platform, typically deployed
with Hadoop YARN as the resource scheduler and HDFS (Hadoop Distributed File System) as the storage layer.
Real-time monitoring enables operators to identify complications before they develop into
problems and to maintain high availability and high service performance. As the complexity of
Big Data application analysis grows, so does the need to monitor clusters and Big Data applications
for auditing, accounting, and performance assessment. Spark application monitoring
should occur at three levels: the Spark framework running in YARN client mode, including the
master and workers; applications running in the driver and executors; and cluster nodes, i.e.
CPU, memory, and disk. For the current application, information is collected on the status of
the Spark application, its jobs, stages, and tasks, including the detection of active and dead
executors. Additionally, the percentage of CPU, memory, and disk utilization consumed by JVM
processes is measured on each master and slave node where Spark is running. We visualize the
metrics in a readable, easy-to-understand form and notify users when thresholds are exceeded.
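As a minimal sketch of the application-level monitoring described above: the Spark driver exposes a REST monitoring API (by default under `/api/v1` on the UI port, e.g. `http://<driver>:4040/api/v1/applications/<app-id>/executors`) that returns one JSON object per executor. Counting active versus dead executors is then a simple pass over that list. The helper and the sample payload below are illustrative assumptions; the exact fields returned vary with the Spark version, so the `isActive` field should be checked against the deployed release.

```python
import json

def count_executors(executors):
    """Return (active, dead) counts from a list of executor-summary dicts,
    as returned by Spark's /api/v1/.../executors REST endpoint."""
    active = sum(1 for e in executors if e.get("isActive", False))
    return active, len(executors) - active

# Sample payload in the shape of the /executors response (illustrative values,
# not captured from a real cluster).
sample = json.loads("""
[
  {"id": "driver", "isActive": true,  "memoryUsed": 52428800},
  {"id": "1",      "isActive": true,  "memoryUsed": 104857600},
  {"id": "2",      "isActive": false, "memoryUsed": 0}
]
""")

active, dead = count_executors(sample)
print(f"active={active} dead={dead}")  # → active=2 dead=1
```

In a live deployment the payload would be fetched over HTTP from the driver (or from the YARN history server for completed applications) and the counts fed into the visualization and threshold-alerting layer.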