June 10, 2025

Querying Apache Iceberg with Sub-Second Performance
As the industry moves toward open table formats, Apache Iceberg is rapidly gaining popularity as the preferred technology for managing massive datasets in data lakes. By bringing ACID transactions to data lakes, it enables multiple systems to operate concurrently and safely on the same data without vendor lock-in. For Iceberg users, this opens up new ways to optimize latency and cost: you can mix and match query engines and choose the best one for each workload. Today, we are adding Firebolt to that list, releasing a public preview of native support for querying Iceberg tables with the extreme performance and efficiency you've come to expect from us.

Iceberg emerged as an open table format for large-scale batch processing, so it was never optimized for low latency. Yet low-latency analytics has become essential for many modern data applications, which today forces people to ingest a copy of their Iceberg tables into dedicated query accelerators. This blog post is a deep dive into the work we're doing at Firebolt to deliver low latency and high performance directly on top of your Iceberg tables.

The Elephant in the Room: Why is Low Latency on Iceberg So Hard?

Querying Iceberg tables with the speed needed for interactive analytics or AI-driven exploration runs into a few common roadblocks:

Metadata Overhead: Iceberg's architecture involves a hierarchy of metadata: a catalog points to the current metadata file, which points to a manifest list, which in turn points to manifest files that finally track the actual data files. Each step in traversing this hierarchy, especially over a network to cloud storage, introduces latency. For large tables with many partitions and files, this can easily sum up to multiple seconds before even touching the data. But even for smaller tables, object storage latency and the sheer number of hops means this can easily add 0.5 seconds to every query.

Data Access Latency: Iceberg tables often reside in cloud object storage like Amazon S3, which has orders of magnitude higher latencies for accessing data compared to local storage. Reading numerous small files or even just the footers of larger Parquet files can quickly become a bottleneck.

Query Engine Inefficiencies: Traditional query engines might not be optimized for the fine-grained metadata understanding and aggressive caching required to mitigate these latencies effectively.

Firebolt's Approach: Engineered for Speed on Iceberg

We've engineered our Iceberg integration from the ground up to tackle these challenges head-on, enabling you to unlock the value in your Iceberg tables at blistering speeds. At the heart of this is our new READ_ICEBERG table-valued function (TVF).

SELECT *
FROM READ_ICEBERG(
    LOCATION => 'my_iceberg_location',
    MAX_STALENESS => INTERVAL '10 seconds'
)
LIMIT 100;

This function serves as your gateway to Iceberg data, allowing you to read Iceberg tables from file-based and REST catalogs (e.g., Snowflake Polaris) as well as the Databricks Unity Catalog. We recommend using a LOCATION object, which encapsulates Iceberg parameters and credentials, for secure, reusable access – more on that a bit later.

Intelligent Metadata Management: Slashing Latency at the Source

Constantly re-fetching and re-parsing Iceberg metadata is a big source of slowdowns. Firebolt employs several strategies to get metadata access out of the critical query path:

  • In-Memory Caching of Iceberg Snapshots: We aggressively cache Iceberg metadata (metadata files, manifest lists, and manifest files). Not only do we cache the raw files on disk; to avoid re-reading them over and over, we also cache the deserialized metadata in memory using our subresult reuse machinery, which handles Iceberg transparently and in a fully transactional way. This means that after the initial read, subsequent queries can often find the metadata they need almost instantly, moving metadata resolution out of the query path and cutting latencies dramatically.
  • Configurable Data Freshness with MAX_STALENESS: For many interactive use cases, data that is a few seconds or minutes out of date is perfectly acceptable if it means dramatically faster queries. The MAX_STALENESS parameter lets you define this tolerance (e.g., INTERVAL '30 seconds'). When specified, Firebolt can use a cached metadata file as long as it is within the allowed staleness, avoiding costly catalog checks. The default is 0 seconds, ensuring queries always see the latest snapshot when no staleness is specified – see the sketch after this list.
  • Asynchronous Snapshot Refreshing (coming soon): To keep the metadata cache warm and up-to-date without impacting foreground query latency, Firebolt will soon asynchronously refresh snapshots before they expire.
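
To make the freshness trade-off concrete, here is a minimal sketch reusing the hypothetical LOCATION from the first example. The first query always resolves the latest snapshot; the second may serve metadata up to a minute old and skip the catalog round trip entirely:

-- Default MAX_STALENESS of 0 seconds: always resolve the latest snapshot
SELECT count(*)
FROM READ_ICEBERG(LOCATION => 'my_iceberg_location');

-- Accept metadata up to 60 seconds old to avoid the catalog check
SELECT count(*)
FROM READ_ICEBERG(
    LOCATION => 'my_iceberg_location',
    MAX_STALENESS => INTERVAL '60 seconds'
);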

Accelerated Data Access: Bringing Data Closer, Faster

Beyond metadata, accessing the actual Parquet data files efficiently is key:

  • Optimized Parquet Reading: Firebolt is designed for efficient processing of Parquet files, the de facto standard for columnar data in Iceberg tables. Our Parquet reader fully decouples (network) I/O from file decoding, and keeps only as much data in memory as is required to achieve maximum throughput (provided that row group sizes are reasonable). Its extensible architecture lets us bring exciting features previously available only on managed tables to Iceberg and Parquet workloads, with some launching today and more coming in future releases.
  • Multi-Tiered Caching: Firebolt utilizes multi-tiered caching for data, spanning memory and local disk (SSD). This means frequently accessed data from your Iceberg tables can be served from much faster local tiers, significantly reducing the need to go to remote object storage for every request. Because the security of your data is paramount, cached data is tied to the credentials used for access. Data caching is also fully independent of MAX_STALENESS and catalog caching, as data files for Iceberg are immutable and new snapshots typically reuse the vast majority of the previous snapshots’ data files. Added some new rows? Firebolt will read those from object storage and serve the rest from cache.

Smart Query Optimization: Leveraging Iceberg's Structure

Firebolt's query optimizer is now Iceberg-aware, using the rich metadata within Iceberg tables to make smarter decisions:

  • Subresult Caching: Because Iceberg gives ACID guarantees, Iceberg queries can leverage Firebolt’s subresult and result caches (FireCache) just like managed tables can. If a portion of a query, e.g., a complex join build side, has been computed before and the underlying data (considering MAX_STALENESS) hasn't changed, Firebolt can reuse that cached subresult. This is incredibly effective for dashboarding workloads and iterative query refinement, where queries often share common sub-structures. The same applies to Firebolt’s query result cache. Combined, these are hugely powerful and far exceed what you could achieve with your own caching layered on top of Firebolt.
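
As a minimal sketch – the LOCATION names are hypothetical and the schema is TPC-H-style – a dashboard issuing variations of the same join can reuse the cached build side, as long as the underlying snapshots (within MAX_STALENESS) are unchanged:

-- First run builds and caches the join hash table as a subresult
SELECT c.c_name, sum(o.o_totalprice)
FROM READ_ICEBERG(LOCATION => 'orders_location', MAX_STALENESS => INTERVAL '30 seconds') o
JOIN READ_ICEBERG(LOCATION => 'customer_location', MAX_STALENESS => INTERVAL '30 seconds') c
  ON o.o_custkey = c.c_custkey
GROUP BY c.c_name;

-- A later run filtering only the probe side can reuse the cached build side
SELECT c.c_name, sum(o.o_totalprice)
FROM READ_ICEBERG(LOCATION => 'orders_location', MAX_STALENESS => INTERVAL '30 seconds') o
JOIN READ_ICEBERG(LOCATION => 'customer_location', MAX_STALENESS => INTERVAL '30 seconds') c
  ON o.o_custkey = c.c_custkey
WHERE o.o_orderdate >= DATE '1998-01-01'
GROUP BY c.c_name;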

Co-Located Joins and Aggregations:

  • When Iceberg tables share compatible partitioning schemes, Firebolt can leverage this information to perform co-located joins, eliminating data shuffling and dramatically speeding up join operations. For example, for an inner join between tables that are both bucketed on the join key, Firebolt can distribute entire partitions to nodes and run the join fully locally.
  • Similar to joins, if an aggregation key matches the Iceberg table's partitioning key, Firebolt can optimize the aggregation process, potentially making local aggregations more efficient and even eliminating entire global aggregation stages.
  • Use the enable_iceberg_partitioned_scan setting to enable these optimizations when the number of partitions is high enough, and their sizes balanced enough, to make this worthwhile. In the future, we plan to apply these optimizations automatically.
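
As a sketch of how this looks in practice – the LOCATION names are hypothetical, and we assume both tables are bucketed on the join key – enabling the setting lets the join run without any shuffle:

SET enable_iceberg_partitioned_scan = true;

-- Both tables bucketed on custkey: Firebolt distributes whole partitions
-- to nodes and executes the join fully locally
SELECT c.c_name, o.o_totalprice
FROM READ_ICEBERG(LOCATION => 'orders_location') o
JOIN READ_ICEBERG(LOCATION => 'customer_location') c
  ON o.o_custkey = c.c_custkey;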

Metadata Used in Query Planning:

  • Firebolt's optimizer applies its state-of-the-art join ordering algorithm to Iceberg tables, taking table row counts from the metadata in manifest files. Join ordering works across all scenarios, whether joining Iceberg tables with each other or with managed tables.
  • Iceberg table metadata, such as table row counts, is also used in smart query rewrites. For example, a count(*) query on an Iceberg table is answered purely from metadata; no data file is scanned at all.
  • File-Level Data Pruning: Iceberg metadata often includes statistics about columns within each data file (e.g., min/max values). Firebolt leverages these statistics to perform aggressive data pruning. For instance, if a query filters the TPC-H orders table on o_orderdate = '1998-01-01', Firebolt will only read data files whose order date ranges include this date, potentially skipping massive amounts of irrelevant data.
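
You can see both rewrites on the public sample table used in the next section (the l_shipdate filter is illustrative):

-- Answered purely from Iceberg metadata: no data files are scanned
SELECT count(*)
FROM READ_ICEBERG(
    URL => 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_iceberg/tpch/iceberg/lineitem'
);

-- Min/max column statistics in the manifests let Firebolt skip every
-- data file whose l_shipdate range cannot contain this date
SELECT count(*)
FROM READ_ICEBERG(
    URL => 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_iceberg/tpch/iceberg/lineitem'
)
WHERE l_shipdate = DATE '1998-01-01';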

Simplicity and Interoperability

Getting started is straightforward. Here’s an example reading from a public Iceberg table in a file-based catalog on S3. This is a real example you can try!

SELECT *
FROM READ_ICEBERG(
    URL => 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_iceberg/tpch/iceberg/lineitem',
    MAX_STALENESS => INTERVAL '30 seconds'
)
LIMIT 5;

Firebolt also supports reading from Iceberg REST catalogs, including the ability to query tables in Databricks Unity Catalog via its Iceberg REST API. You can create a LOCATION object to store the location and credential information, using either generic REST catalog syntax or syntax specializations tailored to the type of catalog. For example, you can create a LOCATION for a table in Databricks Unity:

-- Example: Creating a Databricks Unity table LOCATION
CREATE LOCATION my_uc_location_name WITH
  SOURCE = 'ICEBERG'
  CATALOG = 'DATABRICKS_UNITY'
  CATALOG_OPTIONS = (
    WORKSPACE_INSTANCE = '<your-workspace>.cloud.databricks.com'
    CATALOG = 'my_uc_catalog_name'
    SCHEMA = 'my_uc_schema_name'
    TABLE = 'my_table_name'
  )
  CREDENTIALS = (
    OAUTH_CLIENT_ID = '<client_id>'
    OAUTH_CLIENT_SECRET = '<client_secret>'
  );

This LOCATION can then be queried using the simple syntax:

SELECT *
FROM READ_ICEBERG(
    LOCATION => 'my_uc_location_name',
    MAX_STALENESS => INTERVAL '60 seconds'
);

Current Support and Transparency

We are launching with support for Iceberg tables with Parquet data files in S3, covering the most important features of versions 1 and 2 of the Apache Iceberg specification. It's important to be transparent about current capabilities; as of this release, the following are not yet supported:

  • Row-level deletes (position or equality deletes).
  • Schema evolution.
  • Partition evolution.
  • Reading past snapshots with time travel (but you can specify an older metadata file manually – see the sketch after this list).
  • Iceberg v3 features and data types.
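
As a rough sketch of the manual workaround for time travel: assuming the URL parameter can point directly at a specific metadata JSON file (the path below is entirely hypothetical – consult the documentation for the exact form), an older snapshot can be read like this:

-- Hypothetical path to an older metadata file of the table;
-- pointing at it reads the table as of that snapshot
SELECT *
FROM READ_ICEBERG(
    URL => 's3://my-bucket/my_table/metadata/00042-abc.metadata.json'
)
LIMIT 100;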

It’s also important to note that cross-region data transfer costs may be incurred depending on the storage location of the underlying Iceberg files.

Conclusion: Unlock Your Iceberg Data with Firebolt

Firebolt's new READ_ICEBERG capability does a lot of heavy lifting to provide low-latency access to your Iceberg tables. It leverages multiple caching tiers – from MAX_STALENESS for catalog responses, to fully transactional caching of metadata files and manifest lists on disk and of their deserialized contents in memory – to cut metadata overhead to the absolute minimum. With Firebolt’s subresult caching, commonly used join hash tables are transparently cached and reused while maintaining full transactional integrity. And our advanced query planner makes sure complex query plans on Iceberg are fully optimized.

We are confident that our approach delivers a significant leap in performance, and we're excited for you to experience it. We believe in building exceptional systems with an exceptional team, and in telling that story. This public preview is just the beginning of our Iceberg journey, and we are committed to continuously enhancing our support and performance.

Dive into the documentation, try out the READ_ICEBERG function on your own tables, and let us know what you think! Sign up now for $200 in free credits and try it out – no card needed.
