Introduction to EXPLAIN Plans
An EXPLAIN plan, often simply called an “explain”, is a tool in SQL-based database systems such as PostgreSQL that shows how the database will execute a given query. When you submit a query through the EXPLAIN command, the database returns the execution plan the query optimizer intends to use. This plan includes the specific steps the query will take, the order of those steps, cost estimates in terms of disk I/O and CPU usage, the estimated number of rows processed at each step, and how tables are joined.
EXPLAIN plans are crucial tools for developers and database administrators for understanding the performance characteristics of their queries, allowing them to find bottlenecks, optimize their SQL statements, and improve database design to ensure efficient and performant data operations.
Generating an EXPLAIN Plan
EXPLAIN plans are generated by simply prepending a query with the EXPLAIN keyword.
```sql
EXPLAIN <Your Query>
```
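For instance, the sample plan discussed in the next section could have been produced by a query along these lines (the table and column names are taken from the plan itself; the exact query is an assumption):

```sql
EXPLAIN
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.id;
```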
Interpreting an EXPLAIN Plan
To illustrate, let’s consider the following PostgreSQL EXPLAIN plan result:
```
Hash Join  (cost=8.33..16.61 rows=5 width=160)
  Hash Cond: (orders.customer_id = customers.id)
  ->  Seq Scan on orders  (cost=0.00..6.70 rows=170 width=112)
  ->  Hash  (cost=4.60..4.60 rows=260 width=524)
        ->  Seq Scan on customers  (cost=0.00..4.60 rows=260 width=524)
```
When interpreting a PostgreSQL EXPLAIN plan, it’s critical to understand the basic components that comprise it. Here are the steps to take and the components that require focus.
Basic EXPLAIN plan components
Query Plan Nodes
The EXPLAIN output is structured as a tree, with each node representing a step in the execution plan. Starting at the top, we read through each node, taking note of the operation type such as a sequential scan, index scan, or sort.
In our example, the nodes are Hash Join, Seq Scan (on two tables), and Hash. These nodes indicate the operations that the database will perform to execute the query.
Row Estimates
Each node includes an estimate of the number of rows it will process. If the actual number of rows processed is very different from the estimate, this could indicate a problem with database statistics, leading to inefficient query plans.
In our example, the estimated rows for the Hash Join is 5, for the Seq Scan on the ‘orders’ table it’s 170, and for the Seq Scan on the ‘customers’ table it’s 260. If actual row counts differ significantly, it might indicate outdated or inaccurate statistics.
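To compare estimates against reality, EXPLAIN ANALYZE actually executes the query and reports both figures; if they diverge, ANALYZE refreshes the planner’s statistics. A sketch using the sample tables:

```sql
-- EXPLAIN ANALYZE runs the query and shows actual row counts
-- alongside the planner's estimates (rows=... vs. actual rows=...).
EXPLAIN ANALYZE
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.id;

-- If the estimates are far off, refresh statistics for the tables:
ANALYZE orders;
ANALYZE customers;
```

Note that EXPLAIN ANALYZE really executes the statement, so wrap data-modifying queries in a transaction you can roll back.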
Cost Estimates
The cost estimate of each node, shown in the form cost=startup_cost..total_cost, is important. These are relative units, not milliseconds; the planner chooses the plan with the lowest total estimated cost. Inefficient queries often have high-cost estimates, and understanding this helps identify which part of the query to optimize.
In our example, the total cost for the Hash Join is estimated to be 16.61, for the Seq Scan on the ‘orders’ table it’s 6.70, and for the Seq Scan on ‘customers’ it’s 4.60.
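The cost units are derived from configurable planner parameters. For a sequential scan, the estimate is roughly (pages read × seq_page_cost) + (rows scanned × cpu_tuple_cost); the current settings can be inspected directly:

```sql
-- Planner cost parameters (PostgreSQL defaults shown in comments):
SHOW seq_page_cost;    -- 1.0: cost of reading one page sequentially
SHOW random_page_cost; -- 4.0: cost of a non-sequential page read
SHOW cpu_tuple_cost;   -- 0.01: cost of processing one row
```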
Order of Execution
Understanding the order of execution is key. Even though the plan is displayed top-down, it’s executed bottom-up. For instance, the output may display a “Sort” operation on top, but the data fetching operation listed below is executed first.
In our example plan, the Seq Scans on the orders and customers tables are executed first. Their results are then used in the Hash and Hash Join operations.
Node Dependencies
We take note of dependencies between nodes. In general, nodes that feed data into other nodes should be optimized first because improvements can cascade through the plan.
Here, the Hash Join node depends on the two Seq Scan nodes and the Hash node. Optimizing lower nodes can have a cascading effect on the overall plan efficiency.
Indexes
We look at whether the query uses indexes efficiently or if there’s a sequential scan where an index could be useful.
In this example, both scans are sequential (Seq Scan on ‘orders’ and ‘customers’), suggesting no indexes are being used. If these tables are large, using indexes could increase efficiency.
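As a sketch, an index on the join column of the larger table may let the planner replace a Seq Scan with an index-based access path (the index name is illustrative; the column comes from the Hash Cond in the sample plan):

```sql
-- Index the join column, then re-run EXPLAIN to see if the plan changes:
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

EXPLAIN
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.id;
```

Keep in mind that for small tables the planner may still prefer a sequential scan, since reading a few pages directly can be cheaper than traversing an index.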
Join types
Understanding the types of joins used (like Nested Loop, Hash Join, or Merge Join) and their implications on performance is crucial.
Our sample plan uses a Hash Join, which is efficient for larger datasets. However, its effectiveness depends on available memory and the distribution of the data in the joining columns.
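When experimenting, you can temporarily disable a join strategy to see what alternative the planner would pick and how the cost estimates compare. This is a diagnostic technique, not something to leave enabled in production:

```sql
-- For experimentation only: forbid hash joins for this session so the
-- planner must consider alternatives (nested loop or merge join):
SET enable_hashjoin = off;

EXPLAIN
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.id;

-- Restore the default planner behavior:
RESET enable_hashjoin;
```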
After understanding these elements, optimization strategies would typically include rewriting the query for better performance, adding or modifying indexes, updating database statistics, or considering changes to the database schema or configuration.