Indexes are the primary way for users to accelerate query performance in Firebolt. In practice, they are smaller in size than the base tables and often outperform any other query acceleration methods, including full in-memory caching of entire data sets. Using a book as an analogy, a book index helps you navigate and locate data faster without browsing every page of the book. A database index is similar and enables faster queries by scanning less data.
Firebolt supports three types of indexes:
- primary index (clustered)
- join index (non-clustered)
- aggregating indexes (non-clustered)
Primary Indexes (clustered)
Primary indexes help extract only those much smaller, exact ranges needed for any given query by efficiently pruning data ranges at query runtime. Primary indexes are defined at table creation within your DDL (SQL). During ingestion, data is physically sorted on disk and indexed based on the primary index. You can specify only one primary index per table, containing one or multiple columns.
Behind the scenes, Firebolt primary indexes are sparse indexes. A sparse index is a smaller construct, which holds only one entry per smaller data block in a file or partition, as compared to a dense index, which maps every key value. Sparse indexes also require less maintenance. With Firebolt, data along with all indexes are stored together in F3 (Firebolt File Format pronounced “Triple F”), and data is always accessed using the primary index.
Firebolt also supports table partitioning. In this case, the partition columns are the first stage of data pruning before the primary index. A primary index can contain one or multiple columns and is mandatory for fact tables and highly recommended for dimension tables. If using a multi-column primary index, the ordering of the columns is critical.
We will start with the store_sales fact table from the TPC-DS 10TB version:
We will go ahead and insert 28+ billion rows from the TPC-DS 10TB version.
The table has the primary index set on ss_sold_date_sk and ss_item_sk columns.
If we perform a simple query like the one below, filtering by ss_sold_date_sk, it will return instantly (0.01s):
Adding more filters will only help the query navigate within the segment. Let’s say we are only interested in items sold on that day with a price of over 200:
This is why it’s critical to choose primary indexes based on your query patterns. Columns that show up frequently in the where clause and filter down drastically the result set. Additionally, consider columns used in join clauses. If your analytical queries involve often filtering by user, customer, or product id, those are good candidates for the primary index. Our recommendation is to also order columns based on their cardinality: from lowest (few unique values) to highest. The best order will depend on your queries and their relative usage (how many times they’re called) as well, so make sure to benchmark actual usage patterns.
Join indexes (non-clustered)
Join indexes are cached in RAM to accelerate joins. The query optimizer uses join indexes instead of performing the actual join between tables. Unlike primary indexes, you can define as many join indexes as you want on your dimension tables. We strongly encourage you to set the join key as unique in dimensions and use only the columns that play an active role in your queries.
We will need to add a couple of dimension tables:
and create the following join indexes:
I have intentionally added multiple columns, even though, for our example, we won’t need all of them. You can add as many columns as you want and define as many join indexes as you want on a dimension. The query optimizer will figure out the most appropriate index to use at runtime.
Let’s attempt joining the four tables and find all the sales for a given customer in a specific month:
Keep in mind this is joining a table containing 28 billion rows, with three other dimensions, one of which has 65 million rows.
It finished in 0.69s.
Were join indexes effective? Or did we achieve this time only by setting primary indexes? Let’s find out!
Turning join indexes off and rerunning the query, we get a response time of 78s. That’s a 113x improvement with join indexes enabled.
Another way to confirm the above query is leveraging the join indexes is to look at the explain plan.
But join indexes are just one of the many techniques you can employ to speed up queries in Firebolt.
Another technique that can be combined with join indexes is leveraging aggregating indexes.
Aggregating indexes (non-clustered)
Aggregating indexes are precomputed versions of your data maintained by Firebolt from the base table to accelerate queries with aggregate functions. The query optimizer automatically uses aggregating indexes instead of scanning the base table whenever possible. As with join indexes, you can define as many as you want on your fact tables. We recommend choosing columns that give you an aggregating-index to table rows ratio of at least 1:10. Therefore avoid using high cardinality columns as group keys in aggregating indexes.
Let’s calculate the sales as sum(ss_ext_sales_price) and the discounted amount as sum(ss_ext_discount_amt) by item for a given date.
We can define an aggregating index as below:
And run the following query:
It returns in 0.06s. On a table with 28 billion rows, and on that given day, a total of 29 million rows and 201000 items.
With the above aggregating index, any of the following variations would return in under 0.1s.
The query optimizer determines if replacing the base table with the aggregating index version at query runtime makes sense. The definition of the aggregating index doesn’t have to match exactly the query, as long as all the columns in the query are described by the aggregating index definition.
Was the aggregating index effective? Let’s turn it off and see. We obtain a response time of 0.21s, which is 3x slower than with aggregating index turned on (0.06s).
Are aggregating indexes automatically maintained? Let’s find out!
Re-running the select query again, we can see the two metrics values have been increased by 20000:The query plan for the above query with the aggregating index on and off:
The query plan for the above query with the aggregating index on and off:
We’ve seen how indexes can improve the performance of analytical queries. All three indexes can give you an extreme performance boost depending on your query patterns and how you choose to model your indexes based on those query patterns. The query optimizer rapidly scans all available indexes and checks if it can leverage any or multiple indexes at runtime.
The definition of the index doesn’t have to match exactly the query pattern. The index can contain more columns than the query requires, and the optimizer will know how to leverage the index without hitting the base table. Furthermore, you can define as many join and aggregating indexes as you want. All indexes are automatically maintained behind the scenes by Firebolt.
The best part? The indexes are maintained automatically by Firebolt when you ingest new data or make changes to it, so you don’t have to worry about it!
For the geeky readers, here are the sizes of the indexes used:
Firebolt query optimizer employs many other techniques to provide the fastest possible experience, such as deep filters or join elimination. We’ll go through those capabilities in other articles.