Monitoring Spark Applications in Real Time
Abstract
Real-time analytics over Big Data make it possible to take the best decisions and purposeful
action at the right time. Spark is an open-source distributed computing platform, typically deployed
with Hadoop YARN as the resource scheduler and HDFS (Hadoop Distributed File System) as the storage layer.
Real-time monitoring enables operators to identify complications before they develop into
problems and to maintain high availability and high service performance. As the complexity of
Big Data application analysis grows, so does the need to monitor clusters and Big Data applications
for auditing, accounting, and performance assessment. Spark application monitoring
should occur at three levels: the Spark framework running in YARN client mode, including the
master and workers; applications running in the driver and executors; and cluster nodes, i.e.
CPU, memory, and disk. For the current application, information is collected on the status of
the Spark application, its jobs, stages, and tasks, including the detection of active and dead
executors. Additionally, the percentage of CPU, memory, and disk utilization consumed by JVM
processes is measured on each master and slave node where Spark is running. We visualize the
metrics in a readable, easy-to-understand form and notify users when thresholds are exceeded.
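As a minimal sketch of the application-level monitoring described above: the Spark driver exposes a REST monitoring API (by default under `/api/v1` on the UI port, e.g. `http://<driver>:4040/api/v1/applications/<app-id>/executors`) that returns one JSON object per executor. Counting active versus dead executors is then a simple pass over that list. The helper and the sample payload below are illustrative assumptions; the exact fields returned vary with the Spark version, so the `isActive` field should be checked against the deployed release.

```python
import json

def count_executors(executors):
    """Return (active, dead) counts from a list of executor-summary dicts,
    as returned by Spark's /api/v1/.../executors REST endpoint."""
    active = sum(1 for e in executors if e.get("isActive", False))
    return active, len(executors) - active

# Sample payload in the shape of the /executors response (illustrative values,
# not captured from a real cluster).
sample = json.loads("""
[
  {"id": "driver", "isActive": true,  "memoryUsed": 52428800},
  {"id": "1",      "isActive": true,  "memoryUsed": 104857600},
  {"id": "2",      "isActive": false, "memoryUsed": 0}
]
""")

active, dead = count_executors(sample)
print(f"active={active} dead={dead}")  # → active=2 dead=1
```

In a live deployment the payload would be fetched over HTTP from the driver (or from the YARN history server for completed applications) and the counts fed into the visualization and threshold-alerting layer.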