sql query debug
March 6, 2022

5 steps to debug your complex SQL queries in Firebolt

Replacing application code logic with SQL statements is very common. It saves developers time, effort, and also mistakes. However, substituting code with SQL can oftentimes result in complex and long SQL statements that are not performing as fast as expected. If your query is running slow and you want to understand why - you’ve come to the right place.

In this article, we will guide you through the process of identifying the performance bottlenecks in your query in just 5 simple steps. Keep on going through the steps until you identify a query part that takes a significant amount of time out of the total execution time. Keep in mind that there could be more than one bottleneck in your query so you might need to repeat this process a few times.


We will be using the following query to demonstrate each of the steps:


Step 1: make sure no other queries are running

It’s important to verify that no other queries are running in the background while you are debugging your query. Other queries might compete for the same resources as your query (RAM, CPU, and even disk), thus impacting its execution time. Run this query to make sure no other queries are running:

SELECT * FROM catalog.running queries;  

Step 2: measure compilation duration

Before getting to the execution phase, every query needs to be compiled by the engine. In most cases, compilation runs in sub-seconds, but in some cases, it might take longer, and even take the majority of the query run time. To get an approximate estimation of the query compilation time, run an EXPLAIN over your query, like this:

EXPLAIN <ORIGINAL QUERY>;

Step 3: measure the fetch duration

The process of retrieving fetch result-set from the server from clients (AKA fetch) is heavily impacted by many factors such as the client’s network, number of records, number of columns, and the size of the column values. We are going to measure the fetch duration: the time from when the query execution is finished to when the result-set is available to the client. 

Remember that FETCH is extremely inefficient for very large result sets. To get the number of fetched records, run this query: SELECT COUNT(*) FROM (<ORIGINAL QUERY>);

To understand the FETCH duration of your query - compare the original query run time with the run time of the following query: SELECT CHECKSUM(*) FROM (<ORIGINAL QUERY LINES>);

Step 4: setup a warmup execution baseline

Next, we want to get the execution time of the query. It’s important to run the query at least twice in order to ensure at least one warm execution (where relevant data was loaded into the disk). We recommend running this baseline benchmark from a fresh session to ensure that there aren’t any unwanted settings that impact the performance (unless they are planned to be used). 

The most accurate way to get a query execution time is by fetching its duration from the catalog.query_history table, like this:  


While you’re at it, it is also recommended to keep an eye on some other metrics that can point you to the actual bottleneck:

  1. A high number of scanned rows might indicate that the bottleneck is scan time.
  2. A high number of bytes scanned from cold storage, might indicate that the bottleneck is cold execution.
  3. A high number of RAM bytes consumed might indicate that the bottleneck is some memory-based process, like aggregation or join.

Once you got a consistent execution time for the query, we are ready to move forward to the next step and dive deeper into the syntax of your SQL query, and identify the bottleneck.

Step 5: query reduction

We will now start stripping elements from the query, top to bottom. The idea is to reduce the query until a certain reduction results in significant performance improvement - which means the last reduced element is the bottleneck. To remove certain query parts, we recommend commenting the part out so it is easy to bring it back when required. In case of a more complex query, which contains subqueries, we’ll start with the top-level, and then make our way down until we find the bottleneck. Let’s now analyze the query step by step and list the potential bottlenecks.

ORDER BY clause

Not a common time-waster, and yet something we have to rule out before moving forward. There are multiple factors that might affect ORDER BY step duration:

  1. The size of the sorted result set.
  2. Whether or not the sorted fields are part of the primary index.
  3. The number of fields in the ORDER BY and their granularity.

In our specific example there is an ORDER BY in the final SELECT statement:


As the ORDER BY step executed as one of the very last steps of the query execution, we can just comment it out and check performance without it: 

SELECT clause

Next, we should inspect the result set target list. A straightforward SELECT should not take too much time, but there are some functionalities that take longer to compute and might end up as bottlenecks:

Note: aggregation functions part of the SELECT statement, accompanied by GROUP BY clause, are addressed separately in the GROUP BY clause section.
  1. Window functions or analytical functions - might be a huge time consumer. Keep in mind that the actual under the hood implementation of window functions, often includes aggregations (NEST and UNNEST is one approach) thus exceeded the boundaries of SELECT clause computation.
  2. JSON functions - JSON parsing and extract functions durations might accumulate to a bottleneck, depending on the number of parsed JSONS, the size of the JSON, and the number of extracted keys from each JSON.
  3. Lambda functions - might also accumulate to a bottleneck, depending mainly on the size of the array/arrays it is running over.
  4. Case statements with long conditions list.
  5. Fields with very long values.
  6. A large number of fields in the target list.


In the case of a long target list, we recommend commenting out half of the list recursively, until you are able to focus on the problematic fields. In case a function is involved in the field that is suspected as the bottleneck, replace the function with a plain fetch of the processed underlying value. In our example:


We should examine reducing the window function into a regular fetch:

Then check the performance again.

UNION clause

There are 2 typical patterns that might cause a UNION clause to become a bottleneck:

  1. A UNION is performed (rather than UNION ALL) over very large data sets, thus the engine takes a long time to compute the final deduplicated result set.
  2. The UNION is performed over complexed / many subqueries, thus the computation of all the subqueries (even if executed in parallel) accumulates into a performance bottleneck.


Going back to our example, we finished analyzing the final SELECT, so now we are going one level in, into subquery_b:


As a first inspection, we should replace the UNION with UNION ALL and check if it improves run time. In case you are seeing some improvement - run SELECT COUNT(*) over the UNION ALL clause, to see how many records are part of the UNION operation.

Then, we go over the UNION subqueries and execute them one by one separately. If nothing stands out we should look into the accumulated time spent on the UNION by adding more made up similar subqueries to query, so in our example, we could do: 

If adding more subqueries results in linear increase in run time - that means the UNION indeed is a bottleneck. In this case, we recommend checking whether your UNIONs are essentially subqueries accessing the same data set each time for different results, or each subquery accesses a different data set (that would impact the optimization technique).

GROUP BY clause

We can divide the typical aggregation-related performance issues into 2 groups:

  1. Functions-related issues - performance issues related to the aggregation functions themselves:
  1. The number of calculated functions.
  2. The complexity of the aggregated values - aggregation over values that are computed on the fly like case statements, text parsing, and semi-structured data parsing.
  3. The specific aggregation function that is being used - some functions are more complex and require more resources to compute than others (for example count vs count distinct).
  1. Group by related issues -  performance issues related to the fields listed in the GROUP BY clause:
  1. Number of fields in the clause.
  2. The granularity of the listed fields.
  3. The types and length of the listed fields.


Back to our query, let’s take one of the subqueries under the union in subquery_b as an example:


We start by inspecting the aggregation functions. While the SUM is straightforward and does not reflect special complexity COUNT(DISTINCT) over a CASE STATEMENT just might. To verify whether the function itself is a potential bottleneck, we can replace the COUNT(DISTINCT) with a simpler COUNT and/or replace the computed CASE STATEMENT with a plain value, like this: 


Next move on to inspect the GROUP BY clause itself. We start by identifying how many unique values does the clause produces by running this type of query: 


If we suspect that one of the fields in the GROUP BY clause is the main contributor to the clause complexity, we can try and mark it out / replace it with another field. So for example, in our query we have field col4 which contains long string values, so we can try and replace it and see the effect of it on the execution time: 

JOIN clause

This is considered as one of the most common bottlenecks. The idea of the JOIN clause inspection is to verify whether the actual operation of joining the 2 or more data sets is what’s hearting the performance (rather than the individual scanning of each data set). There are multiple factors that can affect JOIN performance:

  1. The number of joined data sets.
  2. The number of rows of each data set.
  3. The complexity of the JOIN condition.
  4. The JOIN type (INNER / LEFT / CROSS)


Back to our query, as we finished inspecting subquery_b  it’s now time to inspect subquery_a:


As we can see, there is a JOIN between 3 tables in this subquery. As a first step, we recommend replacing the JOIN with a UNION ALL which should provide a clear indication of whether the issue is within the actual JOIN operation (and based on this operation we’ll know if we need to continue diving in). So in our query, we can do something like this:


Note that this method isn’t flawless, and you might need to make some compromises regarding each table scan (in our example we had to compromise table_c scan to make the UNION work). Nevertheless, if the UNION query is significantly faster than the JOIN query, then it’s most likely that the JOIN itself is the bottleneck, and it’s recommended to continue with its inspection.

As a next step, we can start by trying to identify if there’s a specific table that is responsible for the bottleneck, by commenting out each table at at time. First, table_b:


And then table_c:


Once you identify the specific table JOIN that takes most of the time, try inspecting the JOIN condition (in case it’s not straight forward) by marking out complexed conditions: 

 

If there is no issue with the JOIN condition itself, the bottleneck is probably a result of the number of records that need to be crossed as part of the JOIN operation. Run SELECT COUNT(*) over each side of the JOIN starting with table_a:


Then table_c:


Finally, run a count over the JOIN product, to check wether there is a one to many / many to many JOIN:


In case the result of the COUNT on the JOIN product is significantly higher than the COUNT on each side of the JOIN - the JOIN itself is most likely a bottleneck. 

FROM and WHERE CLAUSES

Last 2 clauses we would inspect are the FROM and the WHERE clauses. We should inspect them together, as they are tightly connected, and are essentially the heart of every query execution.


We refer to the time spent on these 2 clauses as scan time which is determined by the number of records that have to be scanned in order to return the desired results. If you got to a point where you are able to narrow down your query to a single scan, and still performance does not improve - there is only one more potential bottleneck you need to clear before you can determine that the bottleneck is the actual scan - WHERE clause related bottleneck. A WHERE clause bottleneck is an applied filter that makes the query run slower. Some of the typical situations where you might encounter this:

  1. WHERE IN clauses which are based on another subquery - where the subquery computation might be a bottleneck.
  2. WHERE IN clauses which are based on very long lists of values - where the long list of values parsing might be a bottleneck.
  3. Complexed text searches over long strings - like regular expressions and other string functions
  4. Semi-structured objects filtering - filters applied on JSONs / arrays


Back to our query and subquery_a:


As we cleared out the JOIN clause earlier, now it’s time to inspect each table scan individually. In terms of potential WHERE clause related bottleneck, we can see that only table_a is filtered, so we can focus on it:


Our goal is to identify whether there is a filter that is applied and has a bad impact on the overall run time. We do that by simply commenting out one of the filters at a time, and checking if the run time improves. For example:


If run time indeed improved by this move, the filtering subquery is most likely your bottleneck (you can check run time of it to verify):  


If you followed all the steps, and got here without finding your bottleneck (your reduced query is still performing slow) - then by this process of elimination, it is most likely that your query bottleneck is the actual table scan.

Read all the posts

Intrigued? Want to read some more?

Curious to learn more?

Discover how Firebolt achieves extreme performance at a fraction of the cost