<- Back to all posts

Data Management Lifecycle in Firebolt

June 5, 2023

June 5, 2023

Data Management Lifecycle in Firebolt

VP of Product

No items found.

Listen to this article

Powered by NotebookLM

Listen to this article

Introduction

Firebolt is a Cloud Data Warehouse (CDW) for engineers. We’ve developed our system to help SQL developers create sub-second analytics experiences.

We have customers across many industries, and a common pattern and behavior that we see across all those companies is the dynamic nature of their business and a need to manage and handle constant changes that are happening. Changes are typically reflected as data changes that need to be captured, preserved, accessed, and analyzed. Examples range from simple data corrections that happen in various systems of record, to implementing slowly changing dimensions in data warehouses to ensure historical data is preserved, and all the way to compliance and regulatory requirements/needs (GDPR and other regulations).

To address the above needs, one of the core aspects of any Cloud Database Warehouse (CDW) system and Relational Database Management System (RDBMS) for that matter is data management capability. The ability to manage the full lifecycle of data as changes occur is expected. Not only do full Create Read Update Delete (CRUD) operations need to be provided, but the expectation is also that each operation comes with full ACID (Atomicity Consistency Isolation Durability) guarantees. At Firebolt, we do not look at these capabilities and requirements any differently. In addition to delivering next-generation efficiency at scale, full ACID compliance for all CRUD operations is built into the system for all operations that Firebolt supports.

In this article, we will explain how Firebolt delivers full data management capabilities, while providing insights into some internal components and how we enabled this functionality. We will explain simple CRUD operations to illustrate how Firebolt works while introducing and explaining necessary Firebolt concepts where needed.

Data management lifecycle in Firebolt

In sections below, we will demonstrate the data management lifecycle in Firebolt, and how to: 1) create a simple table, 2) ingest data into it using bulk insert and singleton insert capabilities, and finally, 3) modify the data by performing delete and update operations. We’ll also explain what happens behind the scenes, how data is being stored internally and how internal data structures evolve as data gets manipulated and changed, as well as how firebolt handles concurrency, managing indexes and optimizing storage.

We’ll provide code snippets to illustrate user experiences as we go through various scenarios and examples. One of our core principles as a database product is to align behind PostgreSQL as a dialect of choice. While the supported dialect is expected to align with the standard PostgreSQL dialect, Firebolt, as a distributed system, comes with extensions to the PostgreSQL language.

Transactions and concurrency

Before we get into how data manipulation and changes work in Firebolt, it is important that we first explain transactions and concurrency controls that are implemented within the system.

In Firebolt, each request/statement is treated as a new implicit transaction. As such, every statement has a transaction ID and timestamp associated with it. Furthermore, Firebolt leverages an optimistic concurrency method and implements Multi Version Concurrency Control (MVCC) to handle concurrent requests coming into the system.

Optimistic concurrency comes with increased throughput, because multiple transactions can complete frequently without interfering with each other. Each transaction verifies that no other transaction has modified the same data concurrently. If the check reveals conflicting modifications, the committing transaction is rolled back. MVCC implementation keeps multiple copies of data items to handle concurrent and potentially contradicting data changes. This way, each transaction sees a snapshot of the database at a particular instant in time. Any changes made by one transaction will not be seen by other transactions until the changes have been completed.

All of this is implemented in the product to ensure Firebolt can satisfy performance, scale, and throughput requirements that data intensive applications need.

Creating tables

Let’s start by creating a simple table called rankings.

CREATE TABLE IF NOT EXISTS rankings (
    GameID INTEGER,
    PlayerID INTEGER,
    MaxLevel INTEGER,
    TotalScore BIGINT,
    PlaceWon INTEGER,
    TournamentID INTEGER,
    SOURCE_FILE_NAME TEXT,
    SOURCE_FILE_TIMESTAMP TIMESTAMP) 
PRIMARY INDEX GameID, TournamentID, PlayerID;

A few things to notice in the above CREATE TABLE statement:

Firebolt supports a variety of data types, including integer, text, date and timestamp data types and many others. This provides lots of flexibility to create data models that suit your needs. For a full list of supported data types, refer to Firebolt documentation.
The statement also contains a primary index definition. While primary index creation is optional in Firebolt, we encourage defining primary indexes on all tables. Primary indexes, and indexes in general, are critical for delivering performance and achieving efficient query execution.

Each table in Firebolt is composed of internal objects called tablets. Tablets store data in columnar format with built-in data compression, reducing disk I/O and resulting in efficient, performant query processing. Input data gets automatically converted into this columnar data format as the data gets ingested into the table. The size of individual tablets and the number of tablets per table is automatically determined and managed by the Firebolt system. When a table is defined as a partitioned table, individual tablets are bound to a particular partition, as tablets cannot span multiple partitions.

Data management lifecycle - creating tables — "F3" File Format Explained

How do indexes help with delivering performance?

Defining a primary index on a table forces data sorting within each tablet. Primary indexes can be defined using one or multiple columns, and data ordering follows that index definition. Implementing proper primary indexes and having data ordered correctly is critical for many reasons. First, since each tablet contains granular metadata information about data stored within the tablet itself, scalar predicates and semi-join queries that use primary index columns can be optimized by scanning only the data that is required (data pruning process). Second, only tablets needed to satisfy the query can be touched, even though a table may contain many tablets behind the scenes – unneeded tablets are skipped during querying time (tablet pruning process). Finally, the ordered dataset typically compresses to a smaller size, leading to reduced IO operations.

In addition to above, in situations where users create partitioned tables, queries containing filters on partitioned columns can eliminate unnecessary data scans by querying only partitions needed for results (partition pruning process).

All techniques mentioned above lead to efficiency when querying and processing data.

Inserting data

To support loading data from a data lake (Amazon S3) into Firebolt tables, Firebolt provides utilities for loading data in bulk. Users can perform both initial and incremental data ingestion using either a COPY statement or INSERT INTO SELECT FROM <external_table> syntax.

In addition, Firebolt supports singleton inserts via INSERT INTO VALUES statement. Typical scenarios where singleton inserts come handy are:

Refreshing tables with small amounts of dimensional data, and
Implementing slowly changing dimension patterns.

Let’s review how these utilities work in more detail, and their corresponding performance.

Bulk inserts

In Firebolt, we designed bulk data ingestion utilities to be inherently scalable. There are 4 major stages that each loading process goes through: 1) reading data from Amazon S3, 2) converting data into tablets and columnstore, 3) merging tablets to optimize tablet quality (needed for efficient tablet pruning), and 4) uploading tablets into Amazon S3. To scale loading processes, Firebolt supports both parallel and pipelined data processing. At the node level, Firebolt fully employs intra-node parallelism by having multiple reader/writer processes. Data loading parallelism automatically increases as the engine increases in scale (multi-node parallelism). Pipelined execution ensures that all deployed resources are kept busy while data is being streamed across transformation stages mentioned above. This way, users can ingest large amounts of data into Firebolt tables while utilizing deployed resources as efficiently as possible.

data management lifecycle - Bulk inserts

As the bulk data load gets executed, Firebolt creates new tablets behind the scenes while preserving necessary metadata information about the data being ingested. Factors such as min/max values, tablet size, data offsets and ranges, tablet location, etc., are kept as metadata artifacts and used during query execution to deliver differentiated performance characteristics.

Performing data ingestion comes with transactional semantics and ACID guarantees. Each COPY execution is treated as a separate transaction. With snapshot isolation support by default, data reads (and other operations) can be executed as the data are being ingested into tables. Different data readers will not see ingested data until the data ingestion process finishes and the transaction gets committed. Similarly, the same table can be loaded simultaneously with multiple ingestion processes.

Simplifying ingestion with “Copy From”

Data ingestion from S3 into Firebolt managed storage can be performed by using the “COPY FROM” command. Schema inference is provided to quickly create tables as a part of ingestion. Additionally, incremental ingestion is supported through the use of metadata fields: $source_file_name , $source_file_timestamp. Other capabilities include filters for file and record level ingestion, error file specification and more. A simple example of ingestion is shown below.

COPY INTO rankings 
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/' 
WITH CREDENTIALS = (AWS_KEY_ID = … AWS_SECRET_KEY = … ) 
PATTERN = '*rankings/.parquet' AUTO_CREATE = TRUE 
TYPE = PARQUET  ERROR_FILE = 's3://firebolt-publishing-public/help_center_assets/errors/' 
ERROR_FILE_CREDENTIALS = (AWS_KEY_ID = … AWS_SECRET_KEY = … );

Additionally, Firebolt provides table-valued functions to explore objects directly in Amazon S3. By using Firebolt's LIST_OBJECTS,READ_CSV, and READ_PARQUET functions, you can explore and validate your data directly from S3, ensuring that any issues are caught early in the process.
‍

Singleton insert

Data inserted by a singleton insert statement gets stored within a single tablet. From the tablet quality perspective, situations where many singleton inserts are being executed can cause table fragmentation. This in turn can lead to suboptimal query performance. One way to mitigate this situation is to use a mini batch pattern (a collection of INSERT INTO VALUES statements separated by a comma) whenever possible. When a batch of INSERT INTO VALUES statements are executed, Firebolt tries to create a single tablet which improves tablet quality and reduces table fragmentation.

To minimize operational overhead and system maintenance that table fragmentation can cause, Firebolt implements a built-in optimization process that merges tablets with suboptimal size. This optimization process is fully autonomous and runs in the background. The background process wakes up periodically to search for suboptimal tablets and merges them while keeping optimal tablet size in mind. In addition, Firebolt supports the VACUUM command that allows users full control to defragment tables on-demand too.

While singleton insert is a supported pattern in Firebolt, to optimize performance of large data ingestion we recommend using bulk insert utilities.

Singleton insert statement example

INSERT INTO rankings (GameID, PlayerID, MaxLevel, TotalScore, PlaceWon, TournamentID) VALUES (10, 231, 10, 1000020, 1, 5);

Bulk insert statement example

INSERT INTO rankings SELECT * FROM rankings_ext;

Note: rankings_ext object referenced in above SQL statement is an external table pointing to S3 data

Deleting data

We built a system that supports storing as much data as needed for as long as needed. However, there are situations where data does need to be deleted. Situations like data corrections that occur in the systems of records, or GDPR compliance where a single (or multiple) customer record(s) must be deleted to preserve privacy, have caused us to enable DELETE statements in Firebolt.

Firebolt supports deleting a single record in a table or deleting data in bulk. In addition to the tablet level metadata, each tablet comes with a delete log. As the records get deleted, the delete log is maintained. Given rich tablet-level metadata, only tablets that are affected by the delete operation are touched. In cases where DELETE query uses a primary key(s) as a predicate, Firebolt leverages primary key information to quickly find tablets and records that are affected by the DELETE operation, leading to performant deletes. Similarly, deleting all data that belongs to one or more partitions is almost instantaneous as deleting data in these cases is only a metadata operation in Firebolt.

Having frequent deletes in the system could lead to tablet fragmentation. For these reasons, we designed a background process that continuously and automatically optimizes tablets behind the scenes. For more information, refer to the optimizing data section of this document.

Note: For optimal performance, leverage primary key(s) for deleting the data whenever possible.

Bulk delete statement example

DELETE FROM rankings WHERE PlayerID = 231;

Singleton delete statement example

DELETE FROM rankings WHERE GameID = 10 AND TournamentID = 5 AND PlayerID = 231;

Updating data

In real life, data updates happen often. Individuals regularly update their contact information as their residency changes, prospects change their contact preference for marketing campaigns, an order changes its status as it goes through the fulfillment process - all these scenarios cause data stored in the data warehouses to be updated.

In Firebolt, we could not claim full data management support without enabling full and proper support for updates.

Since an update can be represented as a composite operation of DELETE followed by an INSERT operation, UPDATE comes as a relatively simple extension to the system. This holds true for both singleton as well as bulk update statements. Firebolt offers transactional guarantees and ACID compliance for the update operations too.

Firebolt supports both simple as well as complex update functionality, including complex predicates and multi-table joins. Similarly, any column defined in a Firebolt table can be updated, including partitioning columns. While updating values for a partitioning column may lead to a longer execution time (depending on the number of records that exist in the partition, as updated data may need to be moved to newly assigned partitions), this functionality simplifies usage and avoids the need for manual workarounds to be done by our users.

Update statement example

UPDATE rankings SET GameID = 11 WHERE GameID = 10 AND TournamentID = 5 AND PlayerID = 231;

Optimizing storage

As mentioned in earlier sections, certain data modification scenarios (deletes/updates) could lead to situations where tables get fragmented. To help spot fragmentation situations and identify tables that are fragmented, Firebolt provides a view called information_schema.tables.

The information_schema.tables view provides a number of useful columns, such as the number of rows, compressed and uncompressed table size, among others. For the purposes of this article, two columns are of interest as fragmentation metrics: 1) number_of_tablets and 2) fragmentation. As the name suggests, the number_of_tablets column projects information about the total number of tablets in each table. The fragmentation column specifies the table fragmentation percentage (number between 0-100).

Note: As best practice and general guidance, we recommend keeping fragmentation below 80%.

Tables that are fragmented have been identified, now what?

There are two paths that Firebolt provides to tackle fragmentation state: 1) the user-driven option of using a VACUUM statement, and 2) system-driven clean up that runs in the background. Both options lead to optimizing fragmented tablets and merging them into tablets with ideal sizes. VACUUM functionality leverages the user-defined engine to perform this optimization task, while system-driven behavior runs completely autonomously and uses Firebolt-provided capacity. The column number_of_tablets can be used to identify the total number of tablets before and after optimization tasks are performed. Similarly, the fragmentation metrics from information_schema.tables can be used to monitor progress and assess effectiveness of the background process to keep the table in a healthy state.

Due to ease of the management and administration, we recommend using the background process. However, we provide both options to our users to fit their desired behavior and needs.

Maintaining indexes

Aggregating indexes (AIs) are one of the key features in Firebolt that help accelerate query performance. Users can create these indexes while using one or more Firebolt supported aggregate functions (such as COUNT, SUM, AVG, MIN, MAX, COUNT(DISTINCT) …). However, data modifications (and specifically deletes) are not the friendliest operations when it comes to maintaining aggregated indexes. Aggregated functions such as COUNT, SUM, AVG are composable aggregates, and as such can be easily updated/modified as data changes. However, aggregates such as MIN, MAX, COUNT(DISTINCT) do not fall into that category. One of the key promises that databases provide is that indexes are automatically maintained and updated as the data in the base table changes. Firebolt keeps this promise and delivers on it by automatically maintaining aggregated indexes even in cases where AIs contain non-update-friendly aggregates.

Traditional materialized views are very vulnerable to the same problems described above. For certain non-composable aggregates (such as MIN/MAX/COUNT(DISTINCT)) other data warehousing systems fall short as they require the whole materialized view to be refreshed or recreated. Depending on the size of a base table, refreshing materialized view could be a very costly and lengthy operation. More importantly, queries that are executed while the materialized view is being refreshed are slower in performance. Not in Firebolt.

Summary

At Firebolt, the focus is on delivering the highest levels of performance and efficiency while minimizing the heavy lift that needs to be done by an end user. Through this article, we have illustrated the foundational elements of Firebolt that provide the full data management capabilities needed by modern data applications. These include our optimized storage layer, parallel-pipelined data ingestion, primary and aggregated indexes, streamlined deletes and updates, and core database semantics. Delivering consistent performance and efficiency requires careful considerations, modeling and optimal indexing strategy. With automated index maintenance, Firebolt enables a hands-free approach to delivering consistent performance without the operational burden, even for cases where data modification functionality is needed. Various types of DML statements can impact system performance as it can result in storage fragmentation, negating the effectiveness of indexes. However, Firebolt addresses this with the choice of user-controlled or system-controlled backend optimization.

Table of Contents

This is some text inside of a div block.

This is some text inside of a div block.

Technical Deep Dive: Automated Column Statistics

Collect statistics about the values in your columns to improve query plans.

Hans-Peter Lehmann

Why 99% of Data Teams Give Up on Real-Time And How Artie Changes That

Robin Tang explains how Artie simplifies real-time data streaming and CDC for teams at ClickUp, Substack, and Alloy.

Firebolt Team

The $100M Problem: How Lyft's Data Platform Prevents ML Failures with Ritesh Varyani at Lyft

Lyft's Ritesh Varyani details their polyglot data strategy unifying Spark, Trino, and ClickHouse with AI.

Firebolt Team

Intrigued? Want to read some more?