Introduction
Efficient data management is essential for modern data warehouses, and Firebolt's primary indexing strategy plays a vital role. Inspired by ClickHouse, Firebolt has refined primary indexing to meet the low-latency needs of large-scale analytics workloads.
This blog post aims to provide a comprehensive understanding of primary indexing in Firebolt, covering the basics, how primary indexes are built and managed, and the advantages and limitations involved. Using a real-world example of a 1TB query history dataset with over 90 million rows, we'll illustrate how Firebolt optimizes data handling. For this demonstration, the data is stored in S3 and managed by a small 8-node engine that processes 341 tablets. Below is a snippet showing how the `QueryHistory` table is created in Firebolt:
CREATE TABLE QueryHistory(
    QueryID TEXT,
    QueryText TEXT,
    SubmitDate DATE,
    EngineName TEXT,
    SubmitTime DATE,
    Latency INT,
    -- additional columns...
);
By the end of this post, you will clearly understand how Firebolt leverages primary indexes to boost query performance, along with insights into the trade-offs that come with their implementation. With this knowledge, you’ll be better equipped to design effective indexing strategies tailored to your data needs.
Fundamentals of the primary index
Before we explore how the primary index functions in Firebolt, let's revisit the fundamental concepts. A primary index is a crucial data structure employed by databases to efficiently organize and optimize the retrieval of records from a table. Its primary role is to streamline the process of searching and retrieving data based on the values in one or more columns, commonly referred to as the index key or keys.
While the primary index is often associated with the primary key columns of a table, it doesn't have to be limited to them. In traditional relational database management systems, the primary index usually matches the primary key, meaning there is one index entry for each row in the table. For instance, with a dataset of 10 million rows, the primary index would contain 10 million entries. This design enables the rapid location of specific rows, resulting in highly efficient lookup queries and point updates. However, this efficiency comes at a cost: the primary index requires additional disk and memory, and inserting new rows, along with their corresponding index entries, becomes more resource-intensive. This overhead grows as the volume of data increases, potentially impacting overall system performance.
While it's typical to have a primary index on primary key columns for data integrity and efficient lookup, it's not always optimal for performance. For instance, if a table has a composite primary key, or if we frequently filter on columns beyond the primary key, relying solely on the primary index may lead to suboptimal query performance.

Let's consider a running example of a query history table that records the execution history of various queries. Each entry might have a unique identifier (query ID) as its primary key, along with additional columns such as query text, execution time, user ID, and timestamp. While the primary index on the query ID ensures uniqueness and makes it quick to retrieve a specific query execution, it may not efficiently support other query patterns. For example, if there's a common requirement to retrieve all queries executed by a particular user within a specific time range, querying solely through the primary index is unlikely to be the most efficient approach. By creating an additional index on columns like user ID and timestamp, tailored to that pattern, the database can access the relevant entries directly without scanning the entire table. This illustrates the importance of considering query patterns and access requirements when designing indexing strategies, even when the primary key itself is straightforward. Factors like uneven data distribution, concurrency requirements, and storage overhead should also be weighed; the goal is to strike a balance between data integrity, query performance, and resource utilization.
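In a conventional relational database, this pattern would typically be addressed with a secondary index. The following is a minimal sketch, assuming hypothetical table and column names (query_history, user_id, executed_at); this is generic SQL, not Firebolt syntax:

-- Secondary index tailored to the "queries by user within a time range" pattern
CREATE INDEX idx_query_history_user_time ON query_history (user_id, executed_at);

-- This query can now be answered via the index instead of a full table scan
SELECT query_id, query_text
FROM query_history
WHERE user_id = 42
  AND executed_at BETWEEN '2024-01-01' AND '2024-01-31';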
Primary Index in Firebolt
In the context of Firebolt, which is known for its cloud-native and high-performance analytics capabilities, the implementation of primary indexes has optimizations and variations compared to traditional systems. Firebolt is designed to handle large-scale analytics workloads, and its approach to indexing involves considerations for both performance and resource efficiency.
Unlike traditional database systems, where primary indexes are created solely on primary key columns, Firebolt offers users flexibility: a primary index can be created on any column. Note also that Firebolt today doesn't support the concept of a primary key. Crafting an effective primary index is a nuanced decision that requires weighing several factors. In Firebolt, an efficient primary index is built from columns frequently used in WHERE and GROUP BY clauses, with a focus on high-cardinality columns to enhance selectivity and performance.
CREATE TABLE QueryHistory(
    QueryID TEXT,
    QueryText TEXT,
    SubmitDate DATE,
    EngineName TEXT,
    Latency INT
) PRIMARY INDEX SubmitDate, EngineName;
When users insert data into tables managed by Firebolt, the system initiates a process that involves breaking down the data into manageable units known as tablets, as depicted in Figure 1.
Each of these tablets typically has a compressed size of approximately 3GB. Within each tablet, the data is sorted by the primary index keys. The sorted data is then further segmented into logical chunks, or intra-tablet ranges, with each tablet range holding approximately 8,000 rows (Figure 2). Reads happen at the tablet-range level, which contributes significantly to the overall efficiency of data handling. Data retrieval is optimized by carefully managing the use of S3 storage and local SSDs, so that only the necessary slices of column ranges are read from S3. In practice this means column- and range-specific SSD caching: by retrieving only the required segments directly from S3, Firebolt avoids extensive in-memory operations and handles data in a precise, resource-efficient way. This minimizes resource usage and accelerates data access, further improving overall efficiency.
The subsequent step in this process is the creation of a primary index for each tablet. This primary index encapsulates primary index key values along with their associated offsets, designed to facilitate the swift and efficient retrieval of data, as illustrated in Figure 3.
In summary, each tablet has an index file that stores the first record of each tablet-range, capturing key column values to efficiently represent the data distribution within the tablet, as shown in Figure 4. The table metadata incorporates essential elements such as sorting information, statistics, and additional indexing structures like aggregating indexes alongside the primary index, all tailored to deliver the efficient data pruning and high-performance analytics that are core to Firebolt.
To offer a tangible example, envision a scenario where tablet 1 of a table contains 400,000 rows distributed across 50 tablet-ranges. The corresponding index file for tablet 1 will have one entry for each tablet-range, resulting in a total of 50 records within the index file. This one-to-one relationship between rows in the index file and tablet-ranges keeps metadata small while retaining the tangible performance benefits of the primary index.
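As a hedged illustration of the pruning this enables, consider a query that filters on the primary index columns from the earlier CREATE TABLE example (the literal values below are made up):

-- Only tablet ranges whose first-record keys can contain this SubmitDate/EngineName
-- combination need to be fetched from S3 or the SSD cache; the rest are pruned
SELECT QueryID, Latency
FROM QueryHistory
WHERE SubmitDate = '2024-03-01'
  AND EngineName = 'prod_engine';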
Now, let's delve into how Firebolt manages updates to the primary index, such as when new rows are inserted, deleted, or modified, using an example of a query history table. Firebolt's primary indexing system efficiently handles these operations, ensuring minimal disruption to the data structure.
As mentioned earlier, using a query history table as an example, each row corresponds to an individual query execution identified by a unique query ID. Suppose a user wishes to remove from the query history table a query that ran on a particular engine on a specific date. Firebolt uses a deletion mask vector to indicate that the corresponding row is marked for deletion. The deletion doesn't immediately remove the row but rather flags it for eventual cleanup, as shown in Figure 5 below, ensuring consistency and avoiding an immediate physical removal of the data.
In the above figure, the user ran a query to delete records from the query history table where the query_id was QID-8101 and the query ran on a given engine. If the user then requests to delete all records where the query latency was less than 5 or more than 1000, this action introduces a new version of the deletion mask while preserving the previous deletions: rows meeting the new deletion criteria are marked in the new version of the mask, while rows previously marked for deletion remain unchanged. This ensures that both the earlier deletions and the new deletion criteria are maintained, as depicted in the figure below by the green arrow and the version-n deletion mask of each tablet.
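In SQL terms, the two deletions described above might look roughly like the following (the engine name and date are illustrative, since the exact values aren't shown here):

-- First deletion: a single query on a given engine and date
DELETE FROM QueryHistory
WHERE QueryID = 'QID-8101'
  AND EngineName = 'prod_engine'
  AND SubmitDate = '2024-03-01';

-- Second deletion: all queries outside the acceptable latency window;
-- this adds a new version of the deletion mask on the affected tablets
DELETE FROM QueryHistory
WHERE Latency < 5 OR Latency > 1000;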
Similarly, if a user wants to update the details of a specific query, Firebolt handles this by marking the original row for deletion within its current tablet and simultaneously adding a new row with the updated values to a new tablet. This approach maintains data integrity while accommodating changes.
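A minimal sketch of such an update, assuming an illustrative query ID and column value:

-- The original row is marked as deleted in its current tablet and the
-- updated row is appended to a new tablet
UPDATE QueryHistory
SET Latency = 42
WHERE QueryID = 'QID-8200';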
In both scenarios, Firebolt's incremental update mechanism ensures that modifications are seamlessly integrated without disrupting the primary index's efficiency. Additionally, the use of separate tablets for new data and deletion marks allows for efficient management of updates while preserving query performance.
To monitor the impact of deletion marks on data and assess fragmentation levels, users can utilize the fragmentation metric available through information_schema.tables. Despite the presence of deletion marks, query performance remains largely unaffected. However, if fragmentation issues arise, users can employ the VACUUM command to optimize system performance by cleaning up deleted rows and recreating the index files per tablet.
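A hedged sketch of this workflow, assuming the table created earlier (the exact columns exposed by information_schema.tables, and identifier casing, may differ in your account):

-- Check how fragmented the table has become due to deletion marks
SELECT table_name, fragmentation
FROM information_schema.tables
WHERE table_name = 'queryhistory';

-- If fragmentation is high, clean up deleted rows and rebuild per-tablet index files
VACUUM QueryHistory;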
Navigating the Terrain
Handling large datasets efficiently is a key challenge in modern data management systems. Our approach described above aims to streamline data insertion processes, optimize query performance, and mitigate resource constraints commonly encountered in large-scale data management. However, while this approach offers several potential advantages, its implementation also introduces complexities and trade-offs that must be carefully considered. Moreover, when sparse indexes are utilized—storing only a fraction of records in the index—additional nuances emerge in how filter and join queries are executed. Below, we highlight the benefits and drawbacks of our approach during data insertion and examine its impact on downstream queries, particularly in the context of sparse indexing.
Benefits During Data Insertion:
Firebolt’s approach to data insertion leverages a tablet-based system, which offers several significant advantages. By breaking down large files into smaller, more manageable units, Firebolt enhances the efficiency and performance of data and index handling processes. Here are the key benefits:
- Efficient Insertion: Breaking a large file into tablets means smaller portions of data are processed at a time, reducing the time required for insertion.
- Parallel Processing: Tablets can be processed concurrently, leveraging multiple threads and/or nodes, thereby speeding up the insertion process.
- Improved Sorting Performance: Sorting each tablet independently requires less memory and completes faster, reducing data insertion latency and resource usage.
- Reduced Memory Requirements: Sorting at the tablet level reduces memory footprint, particularly beneficial for engines with limited memory resources.
- Incremental Updates: The ability to update or replace individual tablets instead of the entire dataset reduces downtime and resource usage during updates.
- Ease of New Data Insertion and Deletion: Thanks to the append-only nature of the storage, inserting new data is straightforward and does not require restructuring of existing data or index files; new tablets are simply added and new index files created. Deletions are handled by marking data for deletion on the side, so they do not block the insertion process (see the sketch below).
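As a simple illustration of the append-only path, a plain insert only creates new tablets and index files and leaves existing ones untouched (the values below are made up, and any columns not listed are assumed to be nullable):

INSERT INTO QueryHistory (QueryID, QueryText, SubmitDate, EngineName, SubmitTime, Latency)
VALUES ('QID-9001', 'SELECT 1', '2024-03-02', 'prod_engine', '2024-03-02', 12);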
Benefits When Querying the Inserted Data:
Firebolt’s tablet-based system optimizes data insertion and enhances query performance. By dividing and organizing data into smaller, focused units, Firebolt enables faster, more efficient querying, particularly in complex and large-scale datasets. Here are the key benefits when querying the data:
- Faster Initial Reads: Smaller index files load into memory faster, improving initial query response times.
- Targeted Reads: Queries with filters aligned with specific tablets can limit data access to relevant tablets, improving query efficiency by reducing unnecessary data retrieval.
- Parallel Processing: Tablets can be read and processed concurrently, which speeds up overall query execution, especially on multi-core systems.
Despite these benefits, this approach also has limitations:
- No Primary Key Enforcement: Because Firebolt has no primary key concept, it cannot enforce a primary key constraint.
- Potential for Fragmentation: As new data is continually appended, fragmentation within tablets may occur over time, leading to decreased storage efficiency and potentially impacting query performance.
- Query Performance Overhead: Sparse indexes enable targeted reads and parallel processing, which improves query performance. However, even for very selective filters, Firebolt might have to read one tablet range from each of many tablets. This can lead to scanning more data than if a globally sorted index on the column existed.
- Limited Flexibility in Data Organization: Once data is organized into tablets, making structural changes or optimizing data organization may be challenging. This can constrain the system's ability to adapt to evolving data requirements or optimize query performance over time.
Guidelines for Creating a Primary Index
Now that you’re familiar with Firebolt’s indexing approach, let’s address one of the most common questions: “Which columns should be part of the primary index?” Creating an effective primary index in Firebolt is all about making strategic decisions on column selection to optimize query performance and data management efficiency. By following these key guidelines, users can improve data pruning, reduce scan sizes, and boost query execution times:
- Focus on Selective Predicates
- Select columns that are frequently used in the WHERE clauses of your queries, particularly those that significantly reduce the number of rows returned. These are known as highly selective predicates. For example, if your queries are often filtered by user_id or timestamp, including these columns in the primary index can help Firebolt prune data more effectively and improve query performance.
- Avoid columns with low selectivity, such as those used in broad filters like event_ts < now(), as they do not sufficiently reduce the data scan size.
- Start with Low-Cardinality Columns
- Begin your primary index with low-cardinality columns—those with a small number of distinct values. This approach leads to long ordered runs of data, which enhances Firebolt’s pruning efficiency. For instance, if you have a column like month in your data, where the number of distinct values is limited, placing it at the start of the primary index can be advantageous.
- In compound primary indexes, start with low-cardinality columns and then include higher-cardinality columns that are involved in highly selective predicates (see the sketch after this list).
- Include Join Key Columns in Fact Tables
- In a star schema, where a fact table references dimension tables, including the join key columns (foreign keys) in the primary index of the fact table can accelerate queries. For example, if your fact table records each query and references a dimension table containing user information, including the user_id (join key) in the fact table's primary index could significantly improve query performance.
- For dimension tables, only include the join key in the primary index if it is frequently used as a filter.
- Consider the Impact of WHERE Clause Transformations
- Ensure that the columns in your primary index are not subjected to complex transformations in the WHERE clause, as Firebolt cannot effectively utilize the primary index if the column values are altered.
- If transformations are necessary, consider creating a virtual column to store the transformed data and use this virtual column in your primary index.
- Include as Many Columns as Necessary
- Don’t hesitate to include multiple columns in your primary index to support your query workload effectively. While adding more columns might slightly impact ingestion performance, the trade-off is typically worth it for the improved query performance and flexibility.
- However, be mindful of the potential trade-offs, such as increased ingestion time, and balance them with the need for efficient querying.
- Using Partitions with Primary Indexes
- Partitioning is generally not necessary due to the efficiency of primary indexes. However, if partitioning is used, the partition column becomes the first level of sorting, with the primary index applied within each partition for further pruning and organization.
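To tie these guidelines together, here is a hedged sketch of a fact table and two query shapes; the table, columns, and values are illustrative rather than a prescribed design:

-- Low-cardinality EngineName first, then the more selective SubmitDate,
-- then UserID, the join key to a users dimension table
CREATE TABLE QueryHistoryFact(
    QueryID TEXT,
    UserID TEXT,
    EngineName TEXT,
    SubmitDate DATE,
    Latency INT
) PRIMARY INDEX EngineName, SubmitDate, UserID;

-- Prunes well: primary index columns are compared directly
SELECT COUNT(*)
FROM QueryHistoryFact
WHERE EngineName = 'prod_engine'
  AND SubmitDate BETWEEN '2024-03-01' AND '2024-03-07';

-- Prunes poorly: wrapping SubmitDate in a function hides it from the primary index
SELECT COUNT(*)
FROM QueryHistoryFact
WHERE DATE_TRUNC('month', SubmitDate) = DATE '2024-03-01';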
To summarize, mastering the fundamentals of primary indexing in Firebolt opens doors to enhanced data management and query optimization. Understanding how Firebolt constructs and utilizes primary indexes empowers users to harness the full potential of their data infrastructure. While the advantages of primary indexing in Firebolt are clear, it's essential to remain mindful of potential trade-offs. However, it's worth noting that our team is actively working on addressing these limitations without compromising on the benefits. By leveraging these insights and advancements, users can navigate the intricacies of primary indexing with confidence, driving efficiency and unlocking valuable insights from their data.
Acknowledgment
Firebolt's primary index concept, a core element in optimizing database performance, is built upon the foundational work of ClickHouse, the widely respected open-source columnar database management system. We acknowledge and appreciate ClickHouse's contributions to the database community, which have inspired us to further refine and expand upon this idea, tailoring it to fit the advanced data processing needs and performance demands of Firebolt's users.
Conclusion
This document explains the key concepts of primary indexing in Firebolt, including how it is constructed and managed, and its benefits and limitations. Using the query history dataset as an example, we showed how Firebolt efficiently handles data by dividing large datasets into tablets and organizing index files.
Firebolt’s flexible approach allows indexing on any column and supports various query patterns, providing efficiency in large-scale operations. Understanding these principles helps users design indexes that balance performance with resource use. While primary indexes offer significant performance advantages, users should be aware of trade-offs and adjust strategies as data and queries evolve. With this knowledge, users can optimize both data management and query performance using Firebolt's indexing system.