TL;DR: With read TVFs (table-valued functions), Firebolt makes it easy to explore CSV and Parquet data in your data lake or object storage. You call one of our read functions and Firebolt takes care of the rest: locating the objects, inferring the schema, querying them efficiently, and returning the results as a table. This blog post describes how TVFs work under the hood and how to use them effectively.
The read_parquet() and read_csv() functions are straightforward to use:
- Provide either:
  - A URL to your data, with credentials if required, or
  - A Firebolt Location object.
- Firebolt infers the schema from the most recently modified file in the provided location.
- The function returns structured rows of data.
- Profit.
This simplicity makes the functions well suited to ad-hoc data exploration and incremental data loading: there is no overhead of defining a persistent schema object (as external tables require) or ingesting the data up front (as with COPY FROM).
Show Me What To Do
Here are some examples. All of these work on public S3 buckets. You can provision a Firebolt engine in us-east-1 on AWS and run them right away:
1. Query public data directly. Use glob patterns to query multiple files at once.
SELECT * FROM
read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/*.parquet')
LIMIT 10;
It is often helpful to use the list_objects() TVF to discover your file names and directory structure.
SELECT * FROM
list_objects('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/');
2. Create permanent tables from query results while customizing CSV parsing options:
CREATE TABLE levels AS
SELECT * FROM read_csv('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv',
header => TRUE,
empty_field_as_null => FALSE);
3. To access non-public data, specify the credentials using a location object. This approach is recommended for both ease of use and security.
CREATE LOCATION my_location WITH
SOURCE = AMAZON_S3
CREDENTIALS = (AWS_ROLE_ARN = '<YOUR_ROLE_ARN>')
URL = 's3://foo/bar.csv';
SELECT * FROM
read_csv(location => 'my_location');
Alternatively, you can specify your credentials using named parameters.
SELECT * FROM
read_csv(url => 's3://foo/bar.csv',
aws_role_arn => 'YOUR_AWS_ROLE');
Working with TVF Data
After retrieving data with a TVF, you can filter, aggregate, and transform it using standard SQL operations. If desired, you can also ingest the data into managed storage with CREATE TABLE AS SELECT or INSERT INTO SELECT, as in the sketch below.
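For instance, here is a minimal sketch of aggregating TVF output and persisting the result with CREATE TABLE AS SELECT. The column names come from the public rankings files used in the examples below; the table name and aggregation are illustrative.
-- Illustrative: aggregate directly over the TVF and persist the result.
-- rankings_by_game is an arbitrary table name for this sketch.
CREATE TABLE rankings_by_game AS
SELECT
    "GameID",
    COUNT(*) AS "Games",
    AVG("TotalScore") AS "AvgScore"
FROM read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/*.parquet')
GROUP BY "GameID";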
1. Apply Transformations: TVFs like read_parquet() not only access data but also enable immediate transformations. Here, we're analyzing gaming performance data with column-level transformations applied directly within the query:
The files in this example use capital letters in their column names, so we quote the identifiers when selecting columns.
-- Which cars correlate with best player performance?
SELECT
p."SelectedCar",
COUNT(DISTINCT p."PlayerID") AS "PlayerCount",
MEDIAN(r."PlaceWon") AS "MedianPlacement",
AVG(r."TotalScore") AS "AvgScore",
AVG(p."CurrentSpeed") AS "AvgSpeed"
FROM read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/*') p
JOIN read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/*') r
ON p."GameID" = r."GameID" AND p."PlayerID" = r."PlayerID"
GROUP BY p."SelectedCar"
ORDER BY "MedianPlacement", "AvgScore" DESC;
Notice how the TVF allows you to join data from different sources without first materializing it into tables. This query demonstrates joining two separate Parquet file collections, performing aggregations, and formatting the results - all in a single operation.
2. Filter by metadata: Firebolt lets you filter TVF results by file metadata. One useful application is making data pipelines resilient to transient network failures: by deduplicating files based on their names, you can safely retry a load without inserting the same file twice.
Here’s how to implement it:
- Add a source_file_name column to your table.
- Use the following pattern when inserting data.
INSERT INTO table_name
SELECT *, $source_file_name FROM read_parquet(...)
WHERE $source_file_name NOT IN
(SELECT DISTINCT source_file_name FROM table_name);
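The pattern above assumes the destination table carries a source_file_name column after its data columns. Here is a minimal sketch of such a table, reusing the column names from the rankings files in this post; the declared types are illustrative.
-- Illustrative destination table: data columns plus the file-name column
-- used for deduplication on retries.
CREATE TABLE rankings_ingested (
    "GameID" BIGINT,
    "PlayerID" BIGINT,
    "PlaceWon" BIGINT,
    "TotalScore" BIGINT,
    source_file_name TEXT
);
With this in place, rerunning the INSERT after a failure only picks up files that have not already been loaded.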
Supported File Formats
Today, the supported formats are Parquet (via read_parquet()) and CSV (via read_csv()), the two formats most commonly used with Firebolt.
We are actively working on support for additional data formats, including Apache Iceberg. Check back soon for more.
Performance Benefits
TVFs offer scan performance similar to standard query operations. Their advantage is immediate, flexible data access – ideal for exploration and analytics where the schema validation and indexing of formal ingestion can be implemented later if needed.
- Direct Data Access: TVFs read directly from external storage without ingestion delays.
- Note that direct reads can be slower than querying indexed tables. For repeated analytical queries on static data, consider ingesting into managed storage.
- Efficient Schema Inference: We infer the schema in a two-step process: first, we download the most recently modified file to detect its structure; then, we convert its columns to Firebolt columns.
- The conversion step adds minimal latency (only ~70 ms for a 50 MB file with 10 columns in our tests).
- For most exploratory workloads, this additional download is acceptable. If you need to minimize latency and network calls, consider an ingestion approach with a predefined schema, such as COPY FROM. A quick way to inspect the inferred schema is shown right after this list.
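If you just want to see what schema Firebolt inferred before committing to a destination table, the simplest check is to pull a single row and look at the columns that come back. This sketch reuses the public rankings files from earlier:
-- Peek at the inferred column names and a sample of their values
-- before deciding on destination table types.
SELECT *
FROM read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/*.parquet')
LIMIT 1;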
Under the Hood: Schema Inference and Listing
To make it easy to explore your data, Firebolt does quite a bit of heavy lifting. This section describes what happens behind the scenes when you call read_csv() or read_parquet().
Before working with your data, Firebolt needs to understand its structure so that it can validate your SQL query and make sure it is well formed. Let’s look at some examples:
-- Firebolt needs to be able to detect whether the column names
-- reference valid names in the underlying files.
-- This query throws an exception: Line 1, Column 8: Column
-- 'this_column_does_not_exist' does not exist.
SELECT this_column_does_not_exist
FROM
read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/*.parquet');
-- Even when you mention valid column names, you might use operations
-- that aren’t well defined on the underlying data types. For example,
-- you should not be able to call text functions on integer columns.
-- This query throws an exception: Line 1, Column 8: function signature 'upper(bigint)' not found, supported signatures are
-- upper(text)
SELECT upper("PlayerID")
FROM
read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/*.parquet');
To perform this validation, Firebolt needs to understand the underlying structure of your data. First, we execute a file discovery process at the specified URL. File listing runs on a single coordinator node, so we've created optimizations to make this process fast:
- Firebolt uses recursive prefix iteration to traverse your storage, breaking down large directories into manageable parallel operations.
- When you apply metadata filters, we exclude any unneeded directory paths from the file listing, as in the example below.
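For instance, a filter on the $source_file_name metadata column (introduced in the deduplication pattern above) can let Firebolt skip directory paths that cannot match. The '%2021%' pattern is hypothetical and stands in for whatever naming scheme your files use; the bucket is the public sample dataset used throughout this post.
-- Illustrative metadata filter: only objects whose names match the
-- hypothetical '%2021%' pattern are read.
SELECT COUNT(*)
FROM read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/*')
WHERE $source_file_name LIKE '%2021%';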
Next, Firebolt's data reader examines the most recent file in your dataset for schema inference. This file typically represents your current structure and allows us to quickly infer schema across multi-file queries, while still giving predictable results. Firebolt uses the Apache Arrow library to parse and infer the file's types, then converts these Arrow types into Firebolt's native type system.
What happens if the inferred schema isn't exactly what you need? No problem. You can either create a destination table with your preferred types (allowing automatic casting during insertion) or leverage external tables for more control.
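Here is a minimal sketch of the first option, reusing the rankings column names from earlier. The declared types are illustrative and do not need to match the inferred ones exactly; values are cast to the target column types during insertion.
-- Create the destination table with the types you prefer, then insert
-- from the TVF; values are cast to these types as they are inserted.
CREATE TABLE rankings_typed (
    "GameID" BIGINT,
    "PlayerID" BIGINT,
    "PlaceWon" BIGINT,
    "TotalScore" DOUBLE PRECISION
);

INSERT INTO rankings_typed
SELECT "GameID", "PlayerID", "PlaceWon", "TotalScore"
FROM read_parquet('s3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/rankings/*.parquet');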
Once the above steps are completed, Firebolt’s planner has validated that your SQL query is well-formed. We can now begin crunching data. To do this as efficiently as possible, we want to use every node in your engine and every core on each node.
Once we've cataloged all files, Firebolt distributes them across your engine's nodes. Our algorithm minimizes the maximum data volume any one node must process, ensuring no node becomes a bottleneck. After the files are distributed, every node can work on individual files in isolation. Each node uses multi-threading to keep all CPU cores busy. By default, we assign different files to different cores. For Parquet, we get even smarter: we can have multiple cores work together on reading a single Parquet file. We distribute work at the granularity of row groups within the Parquet files. This way, even a single large Parquet file can properly utilize all available resources.
To demonstrate scale-out performance, we ran a query that scans about 130 GB of data on Type S engines with one cluster and varying node counts.
The improvement is near linear, with only a small constant time used for planning, file listing, and coordination.
Summary
The read TVFs allow you to query external data directly. They automatically detect data structure, support Parquet and CSV formats, and integrate seamlessly with Firebolt's SQL functionality. This capability lets you explore external datasets without full ingestion, saving time and resources during data discovery.
Explore data directly from your data lake using Firebolt's read_parquet() and read_csv() TVFs - no ingestion required.
If you’re new to Firebolt, you can sign up for a free trial here.