Spark Thrift Details
The Query Details page displays the details of a single query executed by a user in Spark.
The following information is displayed for a query.
Metric | Description |
---|---|
User | The name of the user who executed the query. |
State | The final state of the query execution. The state can be one of the following: Failed, Finished, Compiled, Succeeded. |
Duration | The duration of the query execution. |
Start Time | The time at which the query execution started. |
End Time | The time at which the query execution ended. |
# of Stages | The number of stages of the query. |
Shuffle Write Time | The time taken to write shuffle bytes. |
Disk Bytes Spilled | The size, in deserialized form, of the data spilled to disk. |
Scheduling Pool | The name of the fair scheduler pool the query belongs to. |
Cores Used/Allocated | The number of cores used by the query out of the number of cores allocated to it. |
Query Trends
The Query Trends panel displays a chart showing the pattern of queries running at a particular time, based on the following factors.
Factor | Description |
---|---|
Output Bytes | The number of output bytes written to a file format while executing the query at a given time. |
Input Bytes | The number of input bytes read while executing the query at a given time. |
Duration | The elapsed time of the query at a given time. |
Query
The Query panel displays the query that was executed. You can perform the following operations in the Query panel.
- To copy the query to the clipboard, click the Copy icon in the top right corner of the Query panel.
- To beautify the query, click the Beautify Query icon in the top right corner of the Query panel.
Data Details
The Data Details panel displays information about the tables used in the query. You can view the following data details.
- Tables Used: Displays the names of the tables used in the query.
- Tables with full table scan: Displays the names of the tables that were scanned in full, that is, every row was read sequentially.
- Cached RDDs: Displays the RDDs that were cached in the Spark Thrift server.
Query Stage Distribution
This panel shows the distribution of the query into execution stages. In Stage Duration (in seconds) distribution, you can see the number of stages executed at different time intervals.
The stage duration is distributed across the following metrics.
Metric | Description |
---|---|
Input Bytes Read | The number of input bytes read while executing the query across a given number of stages. |
Output Bytes Written | The number of output bytes written to a file format while executing the query across a given number of stages. |
Shuffle Read Bytes | The amount of shuffle data read, in bytes, across a given number of stages. |
Shuffle Write Bytes | The number of bytes written in shuffle operations across a given number of stages. |
Reading the Query Stage Distribution Graph
Hover over any stage distribution bar to know more about the metric distribution of that stage. Consider the following screenshot as an example.
- The first graph represents the grouping of stages when the query is executed. The x-axis represents the stage duration and the y-axis represents the number of stages.
- In the graph, there are three groups of two stages each.
- Hovering over the third group highlights the metrics that are a part of the selected stage group and greys out the remaining.
- The Input Bytes Read of size 503 KB - 79 MB belongs to only one stage of the third stage group.
- The number of bytes in Shuffle Read Bytes belongs to only one stage of the third stage group.
- The number of bytes in Shuffle Write Bytes belongs to only one stage of the third stage group.
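The grouping shown in the chart is a fixed-width histogram of stage durations. A minimal sketch of that bucketing in Python (the durations and the 5-second bucket width are hypothetical assumptions, not values from the product):

```python
from collections import Counter

def bucket_stages(durations_s, bucket_width_s=5):
    """Group stage durations (in seconds) into fixed-width buckets,
    mirroring the Stage Duration distribution chart."""
    counts = Counter(int(d // bucket_width_s) for d in durations_s)
    return {
        f"{b * bucket_width_s}-{(b + 1) * bucket_width_s}s": n
        for b, n in sorted(counts.items())
    }

# Hypothetical durations for six stages, giving three groups of two,
# as in the example described above.
groups = bucket_stages([1, 3, 6, 8, 12, 14])
print(groups)
# {'0-5s': 2, '5-10s': 2, '10-15s': 2}
```

Each bar in the chart then corresponds to one bucket, and hovering over a bar drills into the metrics of the stages that fall in it.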
Query Execution Metrics
This panel displays the following information about a query execution.
- Stats
Metric | Description |
---|---|
Shuffle Read Records Read | The number of shuffle records read. |
Shuffle Write Records Written | The number of shuffle records written to disk. |
Shuffle Read Bytes Read | The amount of shuffle data read, in bytes. |
Input Bytes Read | The amount of input data read, in bytes. |
Shuffle Read Remote Blocks | The number of remote blocks fetched in shuffle read operations. |
Shuffle Read Local Blocks | The number of local blocks fetched in shuffle read operations. |
Shuffle Read Fetch Wait Time | The time spent in waiting for remote blocks in shuffle read operations. |
Output Bytes Written | The number of bytes written by the query executor. |
Memory Bytes Spilled | The size, in deserialized form, of the data in memory at the time it is spilled. |
Shuffle Write Bytes Written | The number of bytes written in shuffle operations. |
Result Size | The number of bytes an executed query sends back to the driver as the TaskResult. |
- Execution Metrics
Metric | Description |
---|---|
Total Duration | The time elapsed to execute the query. |
Cumulative Executor Runtime | The cumulative time taken by the executor to run the query. |
Task Duration | The time elapsed to complete the task. |
JVM GC Time | Time spent by the JVM in garbage collection while executing a task. |
Query DAG and Plan
The panel displays the distribution of query logic in the form of a DAG and a physical execution plan.
DAG
The Spark Driver builds a logical flow of operations so that tasks can be computed in parallel on partitioned data across the cluster. This flow is represented as a Directed Acyclic Graph (DAG).
A DAG is a work-scheduling graph with a finite set of vertices connected by directed edges. In Spark, the vertices are RDDs (Resilient Distributed Datasets), which are fault-tolerant by nature, and the edges are the operations applied to them.
The order in which the jobs in the DAG are executed is specified by the directions of the edges. The graph is acyclic because it has no loops or cycles.
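The way edge directions fix the execution order can be illustrated with a topological sort over a toy stage graph. The stage names and dependencies below are hypothetical, chosen only for illustration; this is not the Spark API:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical stage DAG: each stage maps to the stages it depends on.
dag = {
    "scan":    [],
    "filter":  ["scan"],
    "shuffle": ["filter"],
    "join":    ["shuffle", "scan"],
    "collect": ["join"],
}

# A valid execution order must respect every edge direction; the
# absence of cycles is what guarantees such an order exists.
order = list(TopologicalSorter(dag).static_order())
print(order)
# ['scan', 'filter', 'shuffle', 'join', 'collect']
```

If the graph contained a cycle, `TopologicalSorter` would raise a `CycleError`, which is exactly why a scheduling graph must be acyclic.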
Plan
A plan is a representation of how Spark executes the query: the query is first broken into logical plans, which are then converted into a physical execution plan.
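As a rough illustration of how an optimizer rewrites one logical plan into another, consider a predicate-pushdown pass on a toy plan tree. This is a simplified model for illustration only, not Spark's Catalyst API; the operator names and the plan encoding are assumptions:

```python
# Toy logical plan: nested tuples of (operator, child/children, args).
# A Filter above a Project can be pushed below it when the filter
# only references columns the projection keeps (assumed here).

def push_down_filter(plan):
    """Rewrite Filter(Project(child, cols), cond)
    into Project(Filter(child, cond), cols)."""
    if plan[0] == "Filter" and plan[1][0] == "Project":
        _, (_, child, cols), cond = plan
        return ("Project", ("Filter", child, cond), cols)
    return plan

logical = ("Filter",
           ("Project", ("Scan", "events"), ["user", "ts"]),
           "user = 'a'")
optimized = push_down_filter(logical)
print(optimized)
# ('Project', ('Filter', ('Scan', 'events'), "user = 'a'"), ['user', 'ts'])
```

Applying the filter before the projection reads fewer rows through the rest of the plan, which is the kind of rewrite an optimized logical plan captures before it becomes a physical plan.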