<- Back to all posts
July 28, 2025
July 30, 2025

Eliminating Redundant Joins in Firebolt for Faster SQL

July 28, 2025
July 30, 2025

Eliminating Redundant Joins in Firebolt for Faster SQL

No items found.

Listen to this article

Powered by NotebookLM
Listen to this article

Tl;dr

In this blog post, we provide an explanation of how our query optimizer is able to detect and remove redundant joins. Eliminating redundant joins is critical for real-world SQL queries, especially those generated by query generators which often contain a significant amount of redundancy. It is also essential for Firebolt's ability to optimize complex correlated subqueries. We’ll start by explaining what we call a redundant join operation, and then we’ll introduce the data structures we use for detecting them in a complex query plan.

Introduction

In Firebolt, we are obsessed with removing redundancy from complex SQL queries. We recently presented a paper at BTW 2025 about how we are able to optimize correlated aggregated subqueries in a way no other system can. In the paper, we explain our approach for optimizing the following query pattern with aggregated correlated subqueries in the projection of the query scan the same data:

SELECT c.*,
       (SELECT min(price) FROM orders WHERE status='O' AND c.key=ckey) "min",
       (SELECT max(price) FROM orders WHERE status='O' AND c.key=ckey) "max"
FROM customers c;


Our optimizer is able to rewrite the query in such a way that the orders table is only scanned once. That is achieved through a combination of query rewrites, all of which are explained in the paper. One such rewrite, and the focus of this blog, is our ability to remove redundant join operations. The usefulness of this query transformation is not limited to the query pattern presented above, but also in other scenarios as well, for example, the result of inlining views accessing the same set of tables may lead to this type of redundancy in the queries.

Eliminating redundant joins is crucial for several reasons in real-world SQL query optimization. Firstly, we have observed that many complex SQL queries, especially those generated by automatic tools, often contain a significant amount of redundancy. Identifying and removing redundant joins in such queries lead to faster query execution and improved overall system efficiency. Secondly, redundant join removal is an essential step in Firebolt's de-correlation process. Our ability to optimize complex correlated aggregate subqueries, as detailed in our BTW 2025 paper, heavily relies on simplifying the query plan by eliminating redundant operations. This not only keeps the final execution plan concise and manageable, but also enables further optimizations that would otherwise be impossible, ensuring highly efficient query processing.

Redundant inner joins

A redundant join is a join that could be removed from the query plan without affecting the query results. Eliminating redundant joins reduces the amount of data the engine needs to process, which leads to faster query execution. In other words, you get the same correct results with less computational effort – improving overall performance. Additionally, removing redundant operations can simplify the query plan in ways that enable further optimizations that wouldn’t have been possible otherwise.

Let’s consider the following example:

-- joining t with itself
SELECT x.*, y.* FROM t AS x INNER JOIN t AS y ON x.a = y.a;

Any non-null value of x.a is guaranteed to have at least one matching partner from y, since x.a and y.a have the same provenance, t.a. If a was a unique key in t, then the following query would produce the same results without performing any join:

-- self join of t is removed
SELECT x.*, x.* FROM t AS x WHERE x.a IS NOT NULL;

See how the entire join has been replaced with just a filter removing the null values from a, as the former join condition used to do. Also, notice that all the columns selected from y have been replaced with the corresponding columns from x. In our terminology, we call y the redundant join operand, as it is the join operand being removed, and we call x the replacing operand, since we are replacing all the appearances of columns from y with the corresponding columns from x.

This self-join on a unique key is the most basic example of a removable join. Let’s consider now the following query where column a is no longer required to be a unique key of t:

SELECT x.*,
       y.a1
FROM   t AS x
       INNER JOIN (SELECT DISTINCT a AS a1
                   FROM   t) AS y
               ON x.a = y.a1;

Similarly, y can be removed from the query because every tuple from x with a non-null value in column a is guaranteed to have a unique matching partner from y. x and y are equi-joined on columns with the same provenance and the joining columns form a unique key in the redundant operand y. The equivalent query without the redundant join is:

SELECT x.*, x.a AS a1 FROM t AS x WHERE x.a IS NOT NULL;

The join needs to be on columns that form a unique key in the redundant join operand. Otherwise, the output cardinality of the rewritten plan is not guaranteed to be the same as with the original plan with the join.

Another required condition is that the replacing join operand should be able to provide all the required columns from the redundant join operand. Consider the following example where y cannot be replaced with x even though they are joined on a column with the same provenance that forms a unique key on y, because x cannot provide the value for y.max after removing the join.

SELECT x.*,
       y.*
FROM   t AS x
       INNER JOIN (SELECT a      AS a1,
                          Max(b) AS max
                   FROM   t
                   GROUP  BY a) AS y
               ON x.a = y.a1;

Finally, the common source between the replacing and the redundant join operands must be unfiltered in the redundant side. For example, the replacement in the following case wouldn’t be valid, even though the first two requirements are satisfied, because the rows from t1 are filtered in y.

SELECT x.*
FROM   t AS x
       INNER JOIN (SELECT DISTINCT a AS a1
                   FROM   t
                   WHERE  b > 10) AS y
               ON x.a = y.a1;

The query above is effectively a semi-join filtering the rows from x where there is at least one row in t1 with the same value for with a value for b greater than 10. Rewriting the filter to filter on x.b, as in the following query, would not be valid unless b is a functional dependency of a, where the value of column b is fully determined by the value of column a.

-- if b is functionally dependent on a
SELECT x.*, x.* FROM t AS x WHERE x.a IS NOT NULL AND x.b > 10;

Redundant outer joins

Query optimizer people know that the real fun always starts with outer joins. There are two types of redundant outer joins we can identify and remove. The first type is shown by the following example where a is again required to be a unique key of t:

SELECT x.*, y.* FROM t AS x LEFT JOIN t AS y ON x.a = y.a;

This query could be rewritten as the following:

-- if a is a unique key of t, remove the left join
SELECT x.*,
       CASE WHEN x.a IS NOT NULL THEN a ELSE NULL END,
       CASE WHEN x.a IS NOT NULL THEN b ELSE NULL END,
       ...
FROM t AS x;

Note that the filter “x.a IS NOT NULL” is no longer added since left outer joins must always preserve the rows from its left input. Instead, we need to replace all columns from y with columns from x, but making them NULL when column a is NULL, using CASE WHEN operator, since that is the only case where a row from the left-hand side won’t find any matching partner in the right-hand side of the left join.

For an equi-join whose join condition does not reject nulls, for example, a join using IS NOT DISTINCT FROM comparison, there is no need to wrap the columns from x within these CASE WHEN expressions when replacing the columns from y. Implementation-wise though, we always add this wrapping CASE WHEN expressions, use the original join condition as their condition, and let the optimizer simplify them. This way, the code has less special paths to maintain. The resulting optimization steps for these two queries are presented below.

-- example 1
SELECT x.*, y.* FROM t AS x LEFT JOIN t AS y ON x.a = y.a;
-> (redundant join removal)
SELECT x.*,
       CASE WHEN x.a = x.a THEN x.a ELSE NULL END,
       CASE WHEN x.a = x.a THEN x.b ELSE NULL END,
       ...
FROM t AS x;
-> (expression reduction)
SELECT x.*,
       CASE WHEN x.a IS NOT NULL THEN x.a ELSE NULL END,
       CASE WHEN x.a IS NOT NULL THEN x.b ELSE NULL END,
       ...
FROM t AS x;
-> (expression reduction, if a is not nullable)
SELECT x.*,
       x.a,
       x.b
       ...
FROM t AS x;
-- example 2
SELECT x.*, y.* FROM t AS x LEFT JOIN t AS y ON x.a IS NOT DISTINCT FROM y.a;
-> (redundant join removal)
SELECT x.*,
       CASE WHEN x.a IS NOT DISTINCT FROM x.a THEN x.a ELSE NULL END,
       CASE WHEN x.a IS NOT DISTINCT FROM x.a THEN x.b ELSE NULL END,
       ...
FROM t AS x;
-> (expression reduction)
SELECT x.*,
       x.a,
       x.b
       ...
FROM t AS x;

What if there are more conditions on the join? As long as the unique key from the redundant operand is equi-joined with columns from the replacing operand through columns with the same provenance, we can simply use these wrapping CASE expressions, using the entire join condition as their condition. The following example shows this.

SELECT x.*, y.* FROM t AS x LEFT JOIN t AS y ON x.a = y.a AND x.b > 10;
-> (redundant join removal)
SELECT x.*,
       CASE WHEN x.a = x.a AND x.b > 10 THEN x.a ELSE NULL END,
       CASE WHEN x.a = x.a AND x.b > 10 THEN x.b ELSE NULL END,
       ...
FROM t AS x;
-> (expression reduction)
SELECT x.*,
       CASE WHEN x.a IS NOT NULL AND x.b > 10 THEN x.a ELSE NULL END,
       CASE WHEN x.a IS NOT NULL AND x.b > 10 THEN x.b ELSE NULL END,
       ...
FROM t AS x;
-> (expression reduction, if a is not nullable)
SELECT x.*,
       CASE x.b > 10 THEN x.a ELSE NULL END,
       CASE x.b > 10 THEN x.b ELSE NULL END,
       ...
FROM t AS x;

Let’s now consider the second type of redundant outer join shown in the query below, which is the one that we need to cover for optimizing the queries in our BTW 2025 paper.

SELECT t2.*,
       x.*,
       y.*
FROM   t2
       LEFT JOIN t1 AS x
              ON t2.i = x.a
       LEFT JOIN t1 AS y
              ON t2.i = y.a;

In this case, we can remove y and replace it with x (or vice versa) as long as column a forms a unique key of t1 even though x and y are not directly joined but indirectly through t2. The resulting rewritten query is:

-- remove redundant join operand y
SELECT t2.*, x.*,
       CASE t2.i = x.a THEN x.a ELSE NULL END,
       CASE t2.i = x.a THEN x.b ELSE NULL END,
       ...
FROM t2 LEFT JOIN t1 AS x ON t2.i = x.a

Which can be further simplified to:

SELECT t2.*, x.*,
       x.a,
       x.b,
       ...
FROM t2 LEFT JOIN t1 AS x ON t2.i = x.a;

Since the columns from x are already guaranteed to be NULL when the predicate doesn’t evaluate to true.

In the example above, both the replacing and the redundant join operands are in the non-preserving side of an outer join. In order for the redundant join removal to be valid, the outer join predicates of the replacing operand must be a subset of the outer join predicates of the redundant operand. In the following example, the join with y can still be removed from the query, replacing its referenced columns with columns from x, as long as column a forms a unique key in t1. However, x cannot be replaced with y due to this new requirement.

-- cannot remove the join to x due to the extra join condition on y
SELECT t2.*,
       x.*,
       y.*
FROM   t2
       LEFT JOIN t1 AS x
              ON t2.i = x.a
       LEFT JOIN t1 AS y
              ON t2.i = y.a
                 AND t2.j > 10; -- extra join condition

Again, the predicates of the removed operand become the condition of the wrapping CASE expressions:

SELECT t2.*, x.*,
       CASE t2.i = x.a AND t2.j > 10 THEN x.a ELSE NULL END,
       CASE t2.i = x.a AND t2.j > 10 THEN x.b ELSE NULL END,
       ...
FROM t2 LEFT JOIN t1 AS x ON t2.i = x.a

In order for these rewrites to be valid, we’ve mentioned earlier that there needs to be a join directly or indirectly joining columns with the same provenance from both sides covering a unique key from the redundant join operand. And that alone is not sufficient. We need to make sure the common source between the redundant and the replacing operands must remain unfiltered in the redundant side. Consider the following example:

SELECT *
FROM   t3
       LEFT JOIN (t1
                  INNER JOIN t2
                          ON t1.a = t2.i) AS x
              ON t3.m = t1.a
       LEFT JOIN t1 AS y
              ON t3.m = y.a

In this case, the inner join between t1 and t2 is the non-preserving side of the outer join with t3. That inner join may filter rows from t1. As a result, we cannot remove any outer join as a redundant join.

Other types of joins

Performing redundant join removal for semi joins is trivial since they are just a special type of inner joins. There is no extra consideration needed for them. In fact, for any equi-semi-join, the columns from the semi-side involved in the join conditions always form a unique key. In the following example, removing the semi-join is possible even if a is not a unique key of t, thanks to the semi-join semantics.

SELECT * FROM t AS x WHERE x.a IN (SELECT y.a FROM t AS y);
->
SELECT * FROM t AS x WHERE x.a IS NOT NULL;

On the other hand, anti joins are not simple. If fun starts with outer joins, pain starts with anti joins. One reason is that there are effectively two types of anti-joins with different semantics on nulls. The following example shows how the two most basic types of redundant anti-joins could be rewritten.

-- NOT IN
SELECT * FROM t AS x WHERE x.a NOT IN (SELECT y.a FROM t AS y);
->
SELECT * FROM t AS x WHERE FALSE;

-- NOT EXISTS
SELECT * FROM t AS x WHERE NOT EXISTS (SELECT * FROM t AS y WHERE x.y = y.a)
->
SELECT * FROM t AS x WHERE x.a IS NULL;

Although we all have rewritten some IN predicate to EXISTS or vice versa in SQL queries, the negated version the two predicates have a subtle difference in null semantic, which is often overlooked.

A simple example is shown below. x and y are indirectly joined. It is obvious that one of the joins is redundant and can be removed.

SELECT *
FROM   t2
WHERE  t2.i NOT IN (SELECT x.a FROM t1 AS x)
  AND  t2.i NOT IN (SELECT y.a FROM t1 AS y);  -- redundant (either x or y)
->
SELECT *
FROM   t2
WHERE  t2.i NOT IN (SELECT x.a FROM t1 AS x);

In the following example, y can still be removed since it doesn’t filter out any row from t2 that x isn’t already filtering.

SELECT *
FROM   t2
WHERE  t2.i NOT IN (SELECT x.a FROM t1 AS x)  -- cannot remove
  AND  t2.i NOT IN (SELECT y.a FROM t1 AS y WHERE y.b > 10);  -- still redundant
->
SELECT *
FROM   t2
WHERE  t2.i NOT IN (SELECT x.a FROM t1 AS x);

For NOT IN anti-joins, the third condition is a bit different as it is the replacing operand, instead of the redundant operand, where the common source of the data must remain unfiltered.

Implementation details

Our redundant join removal implementation searches for redundant join operands by analyzing the join graph. Within a join graph, the process of removing a redundant join corresponds to removing a vertex (which represents a plan) and an associated edge (which represents the join) from the graph.

Below is a trace from our optimizer after removing the redundant join from one of the examples we saw before.

SELECT x.*,
       y.a1
FROM   t AS x
       INNER JOIN (SELECT DISTINCT a AS a1
                   FROM   t1) AS y
               ON x.a = y.a1;

Sub-plan before rule: RedundantJoinRemovalRule
[0] [Projection] t.a, t.b, a1
|   [JoinGraph]
|    - Projection: class_0, vertex_1.t.b, class_0
|    - Vertices:
|        Vertex 0 -> Node 3, provided classes: 0
|        Vertex 1 -> Node 2, provided classes: 0
|    - Edges:
|        type: Inner, predicate: [equals(class_0, class_0)]
|    - Equivalence classes:
|        Class 0: vertex_0.a1, vertex_1.t.a
 \_[1] [Join] Mode: Inner [equals(t.a, a1)]
    \_[2] [StoredTable] Name: "t"
    \_[3] [Aggregate] GroupBy: [a1: t.a] Aggregates: []
      |   [Unique Keys]: [a1]
       \_[4] [StoredTable] Name: "t"

Sub-plan after rule: RedundantJoinRemovalRule
[6] [Projection] t.a, t.b, a1: t.a
 \_[7] [Filter] equals(t.a, t.a)
    \_Recurring Node --> [2]

Our plan representation uses a DAG representation with binary join operators. However, every node that represents the root node of a join tree is annotated on demand with a simple join graph representing the join graph under it. In the example above, the join graph contains two vertices, which correspond to nodes 2 and 3 in the plan, and one edge, joining the two vertices. 

This join graph representation is also used for other optimizations, such as join reordering or aggregation push down.

Equality predicates among expressions from different vertices are represented as equivalence classes. This way we can represent the relation among several vertices with a single edge. The following snippet is the result of adding an extra join operand to the previous example, which is joined to the other two through the same equivalence class, leading to a single edge.

SELECT x.*,
       y.a1
FROM   t AS x
       INNER JOIN (SELECT DISTINCT a AS a1
                   FROM   t) AS y
               ON x.a = y.a1
       INNER JOIN t2
               ON t2.i = x.a;

Sub-plan before rule: RedundantJoinRemovalRule
[0] [Projection] t.a, t.b, a1
|   [JoinGraph]
|    - Projection: class_0, vertex_0.t.b, class_0
|    - Vertices:
|        Vertex 0 -> Node 3, provided classes: 0
|        Vertex 1 -> Node 4, provided classes: 0
|        Vertex 2 -> Node 7, provided classes: 0
|    - Edges:
|        type: Inner, predicate: [equals(class_0, class_0)]
|    - Equivalence classes:
|        Class 0: vertex_0.t.a, vertex_1.a1, vertex_2.t2.i
 \_[1] [Join] Mode: Inner [equals(t2.i, t.a)]
    \_[2] [Join] Mode: Inner [equals(t.a, a1)]
    |  \_[3] [StoredTable] Name: "t"
    |  \_[4] [Aggregate] GroupBy: [a1] Aggregates: []
    |    |   [Unique Keys]: [a1]
    |     \_[5] [Projection] a1: t.a
    |        \_[6] [StoredTable] Name: "t"
    \_[7] [StoredTable] Name: "t2"

Sub-plan after rule: RedundantJoinRemovalRule
[8] [Projection] t.a, t.b, a1: t.a
|   [JoinGraph]
|    - Projection: class_0, vertex_0.t.b, class_0
|    - Vertices:
|        Vertex 0 -> Node 10, provided classes: 0
|        Vertex 1 -> Node 11, provided classes: 0
|    - Edges:
|        type: Inner, predicate: [equals(class_0, class_0)]
|    - Equivalence classes:
|        Class 0: vertex_0.t.a, vertex_1.t2.i
 \_[9] [Join] Mode: Inner [equals(t.a, t2.i)]
    \_[10] [Projection] t.a, t.b
    |  \_Recurring Node --> [3]
      \_[11] [Projection] t2.i
       \_Recurring Node --> [7]

If the extra vertex is joined to only one the other two vertices, i.e. through a different column, our simple join graph will contain two inner join edges corresponding to the two join equivalence classes, as shown in the example below.

SELECT x.*,
       y.a1
FROM   t AS x
       INNER JOIN (SELECT DISTINCT a AS a1
                     FROM t) AS y
               ON x.a = y.a1
       INNER JOIN t2
               ON t2.i = x.b;

Sub-plan before rule: RedundantJoinRemovalRule
[0] [Projection] t.a, t.b, a1
|   [JoinGraph]
|    - Projection: class_0, class_1, class_0
|    - Vertices:
|        Vertex 0 -> Node 3, provided classes: 0, 1
|        Vertex 1 -> Node 4, provided classes: 0
|        Vertex 2 -> Node 7, provided classes: 1
|    - Edges:
|        type: Inner, predicate: [equals(class_0, class_0)]
|        type: Inner, predicate: [equals(class_1, class_1)]
|    - Equivalence classes:
|        Class 0: vertex_0.t.a, vertex_1.a1
|        Class 1: vertex_0.t.b, vertex_2.t2.i
 \_[1] [Join] Mode: Inner [equals(t2.i, t.b)]
    \_[2] [Join] Mode: Inner [equals(t.a, a1)]
    |  \_[3] [StoredTable] Name: "t"
    |  \_[4] [Aggregate] GroupBy: [a1] Aggregates: []
    |    |   [Unique Keys]: [a1]
    |     \_[5] [Projection] a1: t.a
    |        \_[6] [StoredTable] Name: "t"
    \_[7] [StoredTable] Name: "t2"

Sub-plan after rule: RedundantJoinRemovalRule
[8] [Projection] t.a, t.b, a1: t.a
|   [JoinGraph]
|    - Projection: vertex_0.t.a, class_0, vertex_0.t.a
|    - Vertices:
|        Vertex 0 -> Node 10, provided classes: 0
|        Vertex 1 -> Node 12, provided classes: 0
|    - Edges:
|        type: Inner, predicate: [equals(class_0, class_0)]
|    - Equivalence classes:
|        Class 0: vertex_0.t.b, vertex_1.t2.i
 \_[9] [Join] Mode: Inner [equals(t.b, t2.i)]
    \_[10] [Projection] t.a, t.b
    |  \_[11] [Filter] equals(t.a, t.a)
    |     \_Recurring Node --> [3]
    \_[12] [Projection] t2.i
       \_Recurring Node --> [7]

As explained before, in order for this optimization to happen we need to check that the redundant vertex is joined to the replacing vertex through one of its unique keys. As shown in the examples above, we keep track of the unique keys of each node in the plan as computed information that gets attached to the query plan.

The last remaining bit is how to check whether two columns joined in the join graph have the same source. For that, we use another computed property that, for every node, keeps track of the steps its projected columns went through. The following snippet shows our first example again with this extra annotation.

SELECT x.*,
       y.a1
FROM   t AS x
       INNER JOIN (SELECT DISTINCT a AS a1
                   FROM   t) AS y
               ON x.a = y.a1;

Sub-plan before rule: RedundantJoinRemovalRule
[0] [Projection] t.a, t.b, a1
|   [JoinGraph]
|    - Projection: class_0, vertex_1.t.b, class_0
|    - Vertices:
|        Vertex 0 -> Node 3, provided classes: 0
|        Vertex 1 -> Node 2, provided classes: 0
|    - Edges:
|        type: Inner, predicate: [equals(class_0, class_0)]
|    - Equivalence classes:
|        Class 0: vertex_0.a1, vertex_1.t.a
 \_[1] [Join] Mode: Inner [equals(t.a, a1)]
    \_[2] [StoredTable] Name: "t"
    |     [Column Provenance]:
    |     - source: [2]
    |       column expressions: t.a, t.b
    |       inverse path: 
    |     - source: t1
    |       column expressions: t.a, t.b
    |       inverse path: 
    \_[3] [Aggregate] GroupBy: [a1] Aggregates: []
      |   [Column Provenance]:
      |   - source: [3]
      |     column expressions: a1
      |     inverse path: 
      |   - source: [4]
      |     column expressions: a1
      |     inverse path: 0
      |   - source: [5]
      |     column expressions: t.a
      |     inverse path: 0, 0
      |   - source: t
      |     column expressions: t.a
      |     inverse path: 0, 0
      |   [Unique Keys]: [0]
       \_[4] [Projection] a1: t.a
          \_[5] [StoredTable] Name: "t"

By using this property, we can quickly check that vertex_0.a1 and vertex_1.t.a have a common source, which is the t.a column.

This column provenance property is effectively a flattened version of the subtree under each node, which makes it a memory intensive property which is also costly to compute. However, it is quite useful for several other optimizations.

To all posts

Intrigued? Want to read some more?