semi structured data
May 3, 2021

JSON the SQL: Choosing the best data warehouse for semi-structured data

If you have worked with semi-structured data using an on-premises data warehouse and then moved to a modern cloud data warehouse, you probably understand: working with semi-structured data can be more like a Jason (horror movie) Sequel than JSON SQL.

The reasons are basically the same; if you wanted performance you usually had to flatten your semi-structured data into a (wide) table. The only other option has been to store JSON as “unstructured” text, and then do full scans through the text to perform any functions.

Snowflake is better than most, but it still has the same limitations.  Snowflake offers a VARIANT type to store raw JSON.  While Snowflake builds the metadata for the query engine to enable faster access, it requires all of the JSON to be loaded into RAM, and the engine needs to perform full scans (until it gets the data.) This may seem reasonable if it is in RAM, but for larger data sets the JSON data can quickly spill over into disk, and the performance can plummet.  You can choose a big enough warehouse size, but that requires doubling the number of nodes and the cost with each warehouse size just to get a little more RAM. You can also (partially) flatten the JSON, the old approach, or a combination of the two.

Firebolt offers a very different approach. First, it allows you to store semi-structured data natively as a nested array structure. You have the choice to perform any combination of flattening or loading the semi-structured data as a nested array structure. Flattening is done within SQL using an UNNEST(tags) command that supports GROUP BY, ORDER BY and indexing. For each part that is flattened, you can turn on indexing for blazingly fast performance and query using standard SQL.

Second, Firebolt provides a Lambda-style set of native array functions that can be invoked within SQL to query the semi-structured data. Firebolt automatically loads the semi-structured data such as JSON into an internal representation. As these array functions are invoked based on the schema, the Firebolt query engine directly traverses the underlying nested structure without having to do full scans. The result is sub-second performance against semi-structured data without having to do extensive preprocessing, and without having to provision RAM-heavy nodes.

If you want to learn how it works with specific examples, read through a deeper explanation with some examples.

Or you can watch the webinar where the brains behind the JSON mask at Firebolt explains how it works and gives examples.


Read all the posts

Intrigued? Want to read some more?