Introduction to EXPLAIN Plans
An EXPLAIN plan, often simply called an “explain”, is a tool in SQL-based database systems such as PostgreSQL that shows how the database will execute a given query. When a SQL query is submitted through the EXPLAIN command, the database returns the detailed execution plan that the query optimizer has chosen for it. This plan includes information such as the specific steps the query will take, the order of those steps, estimates of the cost in terms of disk I/O and CPU usage, the number of rows processed at each step, and how tables are joined.
EXPLAIN plans are crucial tools for developers and database administrators: they expose the performance characteristics of queries, making it possible to find bottlenecks, optimize SQL statements, and improve database design to ensure efficient and performant data operations.
Generating an EXPLAIN Plan
An EXPLAIN plan is generated by simply prefixing a query with the EXPLAIN keyword.
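As a minimal sketch (the orders table and customer_id column are placeholders, not tables from the original text):

```sql
-- EXPLAIN shows the estimated plan without executing the query
EXPLAIN
SELECT * FROM orders WHERE customer_id = 42;

-- EXPLAIN ANALYZE executes the query and adds actual timings and row counts
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;
```

Note that EXPLAIN ANALYZE really runs the statement, so data-modifying statements should be wrapped in a transaction that is rolled back afterwards.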
Interpreting an EXPLAIN Plan
For the purpose of answering this question, let’s consider the following PostgreSQL EXPLAIN plan result.
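The plan output itself is not reproduced in the text, so the following is a reconstructed sketch, assembled to be consistent with the row and cost estimates discussed below; the table names, join condition, width values, and the Hash Join’s startup cost are assumptions:

```sql
--                               QUERY PLAN
-- ---------------------------------------------------------------------------
--  Hash Join  (cost=7.85..16.61 rows=5 width=72)
--    Hash Cond: (orders.customer_id = customers.id)
--    ->  Seq Scan on orders  (cost=0.00..6.70 rows=170 width=40)
--    ->  Hash  (cost=4.60..4.60 rows=260 width=32)
--          ->  Seq Scan on customers  (cost=0.00..4.60 rows=260 width=32)
```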
When interpreting a PostgreSQL EXPLAIN plan, it’s critical to understand the basic components that comprise it. Here are the steps to take and the components that require focus.
Basic EXPLAIN plan components
Query Plan Nodes
The EXPLAIN output is structured as a tree, with each node representing a step in the execution plan. Starting at the top, we read through each node, taking note of the operation type such as a sequential scan, index scan, or sort.
In our example, the nodes are a Hash Join, two Seq Scans (one on each table), and a Hash. These nodes indicate the operations that the database will perform to execute the query.
Each node includes an estimate of the number of rows it will process. If the actual number of rows processed is very different from the estimate, this could indicate a problem with database statistics, leading to inefficient query plans.
In our example, the estimated rows for the Hash Join is 5, for the Seq Scan on the ‘orders’ table it’s 170, and for the Seq Scan on the ‘customers’ table it’s 260. If actual row counts differ significantly, it might indicate outdated or inaccurate statistics.
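If statistics look stale, refreshing them is straightforward; a minimal sketch, assuming the two tables from the example:

```sql
-- Refresh the planner's statistics for the example tables
ANALYZE orders;
ANALYZE customers;

-- Then compare the new estimates against actual counts
EXPLAIN ANALYZE
SELECT * FROM orders;
```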
The cost estimate of each node, shown in the form cost=startup_cost..total_cost, is important. Costs are expressed in arbitrary planner units (by default anchored to the cost of one sequential page read), and the planner prefers the plan with the lowest total cost. Inefficient queries often have high cost estimates, and understanding this helps identify which part of the query to optimize.
In our example, the total cost for the Hash Join is estimated to be 16.61, for the Seq Scan on the ‘orders’ table it’s 6.70, and for the Seq Scan on the ‘customers’ table it’s 4.60.
Order of Execution
Understanding the order of execution is key. Even though the plan is displayed top-down, it is executed bottom-up: the output may show a “Sort” operation at the top, but the data-fetching operation listed below it is executed first.
In our example plan, the Seq Scans on the orders and customers tables are executed first. Their results are then used in the Hash Join operation.
We take note of dependencies between nodes. In general, nodes that feed data into other nodes should be optimized first because improvements can cascade through the plan.
In our example, the Hash Join node depends on the two Seq Scan nodes and the Hash node. Optimizing lower nodes can have a cascading effect on the overall plan efficiency.
We look at whether the query uses indexes efficiently or if there’s a sequential scan where an index could be useful.
In this example, both scans are sequential (Seq Scan on ‘orders’ and on ‘customers’), suggesting that no indexes are being used. If these tables are large, adding indexes could improve efficiency.
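If the join column is known, an index could be added; a sketch, assuming the join is on an orders.customer_id column:

```sql
-- A plain B-tree index on the assumed join column
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Re-run EXPLAIN to see whether the planner now chooses an index scan
EXPLAIN
SELECT * FROM orders WHERE customer_id = 42;
```

Note that on small tables the planner may still prefer a sequential scan, since reading a handful of pages outright can be cheaper than traversing an index.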
Understanding the types of joins used (like Nested Loop, Hash Join, or Merge Join) and their implications on performance is crucial.
Our sample plan uses a Hash Join, which is efficient for larger datasets. However, its effectiveness depends on available memory and the distribution of the data in the joining columns.
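To see how the planner would behave with a different join strategy, its enable_* settings can be toggled for the current session; this is a diagnostic technique, not a production setting, and the join condition below is an assumption:

```sql
-- Temporarily discourage hash joins in this session to compare alternatives
SET enable_hashjoin = off;

EXPLAIN
SELECT *
FROM orders
JOIN customers ON customers.id = orders.customer_id;

-- Restore the default
RESET enable_hashjoin;
```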
After understanding these elements, optimization strategies would typically include rewriting the query for better performance, adding or modifying indexes, updating database statistics, or considering changes to the database schema or configuration.
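When verifying any of these changes, the extended option syntax reports actual run-time figures alongside buffer usage; the column names here are placeholders:

```sql
-- ANALYZE adds real timings and row counts; BUFFERS adds shared-buffer I/O
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;
```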