The Spark Job Details page contains the following panels:
- Job Trends
- Spark Stages
- Timeseries Information
- Application Logs
The top panel displays the following information.
|User||The name of the user that ran the job.|
|Final Status||Status of the job. The state can be one of the following: Succeeded, Failed, Finished, Finishing, Killed, Running, Scheduled.|
|Start Time||The time at which the application started running.|
|Duration||The time taken to run the jobs in the application.|
|# of Jobs||The number of jobs in the application.|
|# of Stages||The number of stages of the Spark job.|
|# of Tasks||The total number of tasks the stages are broken into.|
|Avg Memory||The average memory used by the application of the selected user.|
|Avg VCore||The average VCore used by the application of the selected user.|
|Scheduling Delay||The time taken to start a task.|
The Job Trends panel displays a chart showing the pattern of jobs running over time, based on the following factors.
Note: The x-axis denotes the time at which the user ran a job.
|Elapsed Time||The time taken to run the jobs at a particular time.|
|VCores||The number of VCores used to run the job.|
|Memory||The amount of memory used to run the job.|
|Input Read||The size of the input dataset.|
|Output Written||The size of the output written to a file format.|
Switching Job Trends View
You can switch between the bar chart view and the line chart view. To choose a view, click the corresponding icon in the top left corner of the Job Trends tile.
With Configurations, you can view the Job Configurations and Anomalous hosts.
Job Configurations displays Current Value and Recommended Value for the following parameters.
- #Cores: Number of cores in the current job.
- #Executors: Number of executors in the current job.
- Executor Memory: Amount of memory used by a job executor.
- Driver #Cores: Number of driver cores.
- Driver Memory: Amount of memory used by the driver.
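To apply a recommended value on the next run, the parameters above can be passed to Spark as configuration properties. The following is a minimal sketch, assuming the parameters map to the standard Spark properties shown; this page does not state which properties the panel reads, so the mapping is a guess. The `to_spark_submit_args` helper and the sample values are hypothetical.

```python
# Assumed mapping from the panel's parameter names to Spark properties.
# The property names are standard Spark settings; the mapping is an assumption.
PARAM_TO_PROPERTY = {
    "#Cores": "spark.executor.cores",
    "#Executors": "spark.executor.instances",
    "Executor Memory": "spark.executor.memory",
    "Driver #Cores": "spark.driver.cores",
    "Driver Memory": "spark.driver.memory",
}


def to_spark_submit_args(values):
    """Turn {parameter name: value} into a list of spark-submit --conf arguments."""
    args = []
    for name, value in values.items():
        prop = PARAM_TO_PROPERTY[name]  # raises KeyError for unknown parameters
        args += ["--conf", f"{prop}={value}"]
    return args


# Hypothetical recommended values copied from the Job Configurations panel.
recommended = {"#Executors": 8, "Executor Memory": "4g"}
print(" ".join(to_spark_submit_args(recommended)))
```

The resulting arguments can be appended to an existing `spark-submit` command line.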
The Anomalies board displays system metrics for the hosts used by the Spark job for the duration of that job. A host can be impacted by CPU, memory, network, or disk usage.
With anomaly data, you can monitor host performance and make predictions on memory, CPU, network I/O, and disk usage.
To view more details about Anomalous hosts, click the host link in the Anomalies tab.
You can detect anomalies based on the following metrics.
Note: If an anomaly exists, the associated chart is highlighted with the number of anomalies detected.
|CPU Usage||The processor capacity usage of the job on the host.|
|Memory Usage||The RAM usage of the job on the host.|
|Network I/O||The network status of the job on the host displaying Sent Bytes and Received Bytes.|
|Disk Usage||The host storage currently in use by the Spark job. The data is displayed in Write Bytes and Read Bytes.|
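This page does not describe how anomalies are detected. As an illustration only, the sketch below flags outliers in a metric series using the modified z-score (median and median absolute deviation), a common general-purpose technique. The `find_anomalies` function, its threshold, and the sample data are assumptions, not the product's method.

```python
from statistics import median


def find_anomalies(samples, threshold=3.5):
    """Return indices of samples whose modified z-score exceeds the threshold."""
    if len(samples) < 3:
        return []  # too few points to judge
    med = median(samples)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(x - med) for x in samples])
    if mad == 0:
        return []  # no measurable spread, so no point stands out by this test
    return [
        i for i, x in enumerate(samples)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical CPU usage samples (in %) for one host during a job.
cpu_usage = [10, 11, 9, 10, 95]
anomalies = find_anomalies(cpu_usage)
print(f"{len(anomalies)} anomalies at sample indices {anomalies}")
```

The count of returned indices corresponds to the kind of number that would be shown on a highlighted chart.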
Stages are the units into which a job is divided; each stage is further broken into tasks. You can view Spark Stages in the form of a List or a Timeline. Click More Details to view the details of a particular stage.
In a List view, you can see the following fields in Spark Stages.
|Stage Id||The ID of the stage.|
|Task Count||The number of tasks in the stage.|
|Timeline||The graphical representation of the duration of the tasks.|
|Duration||The time taken to complete tasks in that stage.|
|Max Task Memory||The maximum memory occupied by tasks.|
|IO Percentage||The rate of input/output operations (in %).|
|Shuffle Write||The amount of shuffle data written.|
|Shuffle Read||The amount of shuffle data read.|
|PRatio||Ratio of parallelism in the stage. A higher PRatio is better.|
|Task Skew||The task skewness value, reported when it is less than -1 or greater than +1 (refer to the dashboard).|
|Failure Rate||The rate at which the tasks in the stage fail.|
|Status||The status of the stage.|
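To make the Task Skew thresholds concrete, the sketch below computes the skewness of a set of task durations and checks it against the -1 and +1 bounds. It assumes the standard moment-based (Fisher-Pearson) skewness coefficient applied to task durations; this page does not state which formula or which task measurement the dashboard uses, and the function names are hypothetical.

```python
from statistics import fmean


def task_skewness(durations):
    """Moment-based skewness of task durations (0.0 for a symmetric spread)."""
    n = len(durations)
    if n < 2:
        return 0.0
    mean = fmean(durations)
    m2 = sum((d - mean) ** 2 for d in durations) / n  # second central moment
    m3 = sum((d - mean) ** 3 for d in durations) / n  # third central moment
    if m2 == 0:
        return 0.0  # all tasks took the same time
    return m3 / m2 ** 1.5


def is_skewed(durations, bound=1.0):
    """True when skewness falls outside the range [-bound, +bound]."""
    return abs(task_skewness(durations)) > bound


# Hypothetical task durations in seconds: one straggler among short tasks.
durations = [1, 1, 1, 1, 10]
print(task_skewness(durations), is_skewed(durations))
```

A large positive value like this typically points to a few long-running tasks, which is often a sign of uneven data partitioning.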
In a Timeline view, you can see the timeframe in which the tasks in the stage executed. The timeline also includes the driver execution time. In addition to the default view, you can sort the timeline of these tasks by Duration or Start Time.
Timeseries Information displays timeseries metrics of the application you are currently viewing. Within the time duration, you can see the time spent by the drivers, denoted by a red box. The driver coordinates a Spark application, which runs as a set of processes on a cluster.
Note: You can see the name of the application you are currently viewing above the user name in the top panel.
Other Timeseries Charts
|Schedule Information||The number of tasks running at a particular time and the number of tasks that were yet to be executed.|
|IO||The chart describes the number of input bytes read and the number of output bytes written over the duration of the task.|
|Driver Memory Usage||The amount of memory consumed by the driver.|
|Executor Memory Usage||The amount of memory used by the executor.|
|GC and CPU Distribution||The amount of garbage collection (in %) and amount of CPU used (in %) to execute jobs.|
|Shuffle Information||The chart describes the following shuffle information: Shuffle Bytes Written, Shuffle Local Bytes Read, Shuffle Remote Bytes Read, and Shuffle Remote Bytes Read to Disk.|
|Storage Memory||The chart describes the amount of the following types of memory: Block Disk Space Used, Block Off Heap Memory Used, Block On Heap Memory Used, Block Max Off Heap Memory, and Block Max On Heap Memory.|
|HDFS Information||The chart describes the number of HDFS reads and writes.|
|Efficiency Statistics||The time spent in the driver versus the executors indicates how well the Spark program is written and whether the right amount of parallelism is achieved.|
|Simulation||An estimate of the ideal number of executors for the Spark program, and of the effect that changing the number of executors would have on the overall run time and utilization.|
|YARN Diagnostics||Details of the YARN application that the user ran during that period.|
|Aggregate Metrics||The aggregated usage of different metrics in that application.|
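The trade-off behind the Efficiency Statistics and Simulation charts can be illustrated with a toy model. The sketch below is not the product's simulation. It assumes that driver time is fixed, that task work divides evenly across executor cores, and that no run can finish faster than its longest task. All names and numbers are hypothetical.

```python
def simulate(executors, cores_per_executor, driver_time, total_task_time, longest_task):
    """Estimate (total run time, executor utilization) for an executor count.

    Times are in seconds. total_task_time is the sum of all task durations.
    """
    if executors < 1 or cores_per_executor < 1:
        raise ValueError("executors and cores_per_executor must be at least 1")
    slots = executors * cores_per_executor  # tasks that can run at once
    # Perfectly even spread of work, bounded below by the longest single task.
    executor_time = max(total_task_time / slots, longest_task)
    total_time = driver_time + executor_time
    # Share of the reserved core-seconds that actually ran tasks.
    utilization = total_task_time / (slots * total_time)
    return total_time, utilization


# Hypothetical job: 10 s in the driver, 400 s of task work, longest task 5 s.
for n in (5, 10, 40):
    total, util = simulate(n, 4, driver_time=10, total_task_time=400, longest_task=5)
    print(f"{n:>2} executors: {total:.1f} s total, {util:.0%} utilization")
```

In this model, adding executors shortens the run only until the driver time and the longest task dominate, while utilization keeps falling. That is the kind of trade-off to look for when choosing an executor count.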
The Application Logs section displays the application logs for failed Spark jobs, which let you identify the exact reason for the failure.