Spark Thrift Details

The Query Details page displays the details of a single query executed by a user in Spark.

The following information is displayed for a query.

Metric | Description
User | The name of the user who executed the query.
State | The final state of the query execution. The state can be one of the following: Failed, Finished, Compiled, Succeeded.
Duration | The duration of the query execution.
Start Time | The time at which the query execution started.
End Time | The time at which the query execution ended.
# of Stages | The number of stages in the query.
Shuffle Write Time | The time taken to write shuffle bytes.
Disk Bytes Spilled | The size of the serialized form of the data on disk after it is spilled.
Scheduling Pool | The name of the fair scheduler pool the query belongs to.
Cores Used/Allocated | The number of cores used by the query compared with the number of cores allocated to it.

Query Trends

The Query Trends panel displays a chart showing the pattern of queries running at a particular time, based on the following factors.

Factor | Description
Output Bytes | The number of output bytes written to a file format while executing the query at a given time.
Input Bytes | The number of input bytes read while executing the query at a given time.
Duration | The elapsed time of the query at a given time.

Query

The Query panel displays the query that was executed. You can perform the following operations in the Query panel.

  • To copy the query to the clipboard, click the Copy icon in the top right corner of the Query panel.
  • To beautify the query, click the Beautify Query icon in the top right corner of the Query panel.

Data Details

The Data Details panel displays information about the tables used in the query. You can view the following data details.

  • Tables Used: Displays the names of the tables used in the query.
  • Tables with full table scan: Displays the names of the tables in which every row and column was scanned sequentially.
  • Cached RDDs: Displays the RDDs that were cached on the Spark Thrift server.

Query Stage Distribution

This panel shows how the query is distributed into execution stages. In the Stage Duration (in seconds) distribution, you can see the number of stages executed at different time intervals.

The stage duration is distributed across the following metrics.

Metric | Description
Input Bytes Read | The number of input bytes read while executing the query across a given number of stages.
Output Bytes Written | The number of output bytes written to a file format while executing the query across a given number of stages.
Shuffle Read Bytes | The number of shuffle bytes read across a given number of stages.
Shuffle Write Bytes | The number of bytes written in shuffle operations across a given number of stages.
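The grouping of stages by duration can be pictured as a simple histogram. The following is a minimal sketch, not the product's implementation: the function name `bucket_stage_durations`, the 5-second bucket width, and the sample durations are all assumptions made for illustration.

```python
from collections import Counter


def bucket_stage_durations(durations_s, bucket_width_s):
    """Count stages per duration bucket.

    Returns a dict that maps (bucket_start, bucket_end) to the number of
    stages whose duration falls in the half-open range [start, end).
    """
    if bucket_width_s <= 0:
        raise ValueError("bucket_width_s must be positive")
    # Floor-divide each duration by the bucket width to get its bucket index.
    counts = Counter(int(d // bucket_width_s) for d in durations_s)
    return {
        (i * bucket_width_s, (i + 1) * bucket_width_s): n
        for i, n in sorted(counts.items())
    }


# Hypothetical stage durations in seconds (not taken from a real query).
stage_durations = [1.2, 3.8, 6.5, 7.1, 12.0, 13.4]
histogram = bucket_stage_durations(stage_durations, bucket_width_s=5)

for (start, end), n in histogram.items():
    print(f"{start}-{end} s: {n} stage(s)")
```

With these sample values, the six stages fall into three duration groups of two stages each. The x-axis of such a chart corresponds to the bucket ranges and the y-axis to the stage counts.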

Reading the Query Stage Distribution Graph

Hover over any stage distribution bar to learn more about the metric distribution of that stage. Consider the following screenshot as an example.

  • The first graph represents the grouping of stages when the query is executed. The x-axis represents the stage duration and the y-axis represents the number of stages.
  • In the graph, there are three groups of two stages each.
  • Hovering over the third group highlights the metrics that are part of the selected stage group and greys out the rest.
    • The Input Bytes Read in the 503 KB to 79 MB range belongs to only one stage of the third stage group.
    • The Shuffle Read Bytes belong to only one stage of the third stage group.
    • The Shuffle Write Bytes belong to only one stage of the third stage group.

[Screenshot: Query Stage Distribution]

Query Execution Metrics

This panel displays the following information about a query execution.

  • Stats
Metric | Description
Shuffle Read Records Read | The number of shuffle records read.
Shuffle Write Records Written | The number of shuffle records written to disk.
Shuffle Read Bytes Read | The number of shuffle bytes read.
Input Bytes Read | The number of input bytes read.
Shuffle Read Remote Blocks | The number of remote blocks fetched in shuffle read operations.
Shuffle Read Local Blocks | The number of local blocks fetched in shuffle read operations.
Shuffle Read Fetch Wait Time | The time spent waiting for remote blocks in shuffle read operations.
Output Bytes Written | The number of bytes written to the query executor.
Memory Bytes Spilled | The size of the deserialized form of the data in memory at the time it is spilled.
Shuffle Write Bytes Written | The number of bytes written in shuffle operations.
Result Size | The number of bytes an executed query sends back to the driver as the TaskResult.
  • Execution Metrics
Metric | Description
Total Duration | The time elapsed to execute the query.
Cumulative Executor Runtime | The cumulative time taken by the executor to run the query.
Task Duration | The time elapsed to complete the task.
JVM GC Time | The time spent by the JVM in garbage collection while executing a task.

Query DAG and Plan

This panel displays the query logic in the form of a DAG and a physical execution plan.

DAG

The Spark Driver builds a logical flow of operations that can be computed in parallel on partitioned data in the cluster. This flow is represented as a Directed Acyclic Graph (DAG).

A DAG is a work-scheduling graph with a finite set of vertices connected by edges. The vertices represent RDDs (Resilient Distributed Datasets), and the edges represent the operations applied to them. RDDs are fault-tolerant.

The order of execution of the jobs in the DAG is specified by the direction of the edges in the graph. The graph is acyclic because it has no loops or cycles.
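To illustrate how edge direction determines execution order, and why the graph must be free of cycles, here is a minimal sketch using the `graphlib` module from the Python standard library (Python 3.9 and later). It does not use Spark. The stage names and the helper `execution_order` are hypothetical and chosen only for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical stage graph: each key depends on the stages in its set.
stage_dependencies = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}


def execution_order(dependencies):
    """Return one valid execution order; raise CycleError if a cycle exists."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(stage_dependencies)
print(order)

# Making "join" depend on "aggregate" introduces a cycle, so no valid
# execution order exists and the sorter rejects the graph.
cyclic = dict(stage_dependencies)
cyclic["join"] = {"scan_orders", "scan_customers", "aggregate"}
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

In any order the sketch returns, both scans come before the join and the join comes before the aggregation. The two scans have no edge between them, so either may come first, which is what allows them to run in parallel.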

[Figure: DAG]

Plan

The plan is a logical representation of how Spark executes the query, in which the query is broken down into different logical plans.