Native Semi-Structured Analytics

Semi-Structured Data is a Key Source of Insights

The advent of cheap, infinite cloud storage and massive data collection from the web and IoT has dramatically transformed the nature of raw data subject to analytical query workloads. With endless streams of JSON log files and other nested serialization formats like Avro and Parquet, the ability to handle semi-structured data is a must.
‍
This paper reviews the challenges posed by semi-structured data and demonstrates different approaches, from NoSQL document stores and Snowflake’s VARIANT to Firebolt’s tabular schema with support for nested types. Based on concrete implementations, it will walk you through making semi-structured data analytics fast, real-time and cost-effective as an alternative to complex and expensive ETL-like transformations that flatten the data into columns, or to raw unstructured data, neither of which perform as well.

NoSQL Document Stores: Unacceptable Response Times

While NoSQL solutions like MongoDB are designed to support the varying nature of the schema, running queries usually requires full scans on the data. The use of complex aggregations and filters often result in unacceptable response times. A partial solution might be to add a search technology like Elasticsearch. This may improve the query response time for searches but not general-purpose analytics for two reasons:

Elasticsearch takes text and indexes it instead of structuring it for analytics, which is very expensive. For one company, indexing 250GB in Elasticsearch took 10 hours. In addition, until the reindexing is finished the query results will lag behind and not include the new data.
Most of the indexes need to be defined in advance to support specific queries. This does not work well when the queries are ad hoc in nature.

Snowflake VARIANT and Flattening: A Huge Resource Hog

Snowflake provides two options to deal with semi-structured data:

ETL and flatten: In this first option, each field in the nested structure that will be used for analytics becomes a separate column. Flattening semi-structured data can result in a huge, denormalized table that requires nested queries to extract the relevant data with painfully slow performance.
Dump raw JSON into a VARIANT: In the second option as implemented in Snowflake, the semi-structured data is stored in a single column, and metadata is stored to optimize access to the data when querying with SQL. While metadata enables faster access, querying still isn’t performant enough. The JSON must first be loaded from the VARIANT field entirely into RAM. The query engine then needs to perform full scans across all the JSON. Unless the server has an incredible amount of free RAM, the JSON will quickly spill over into disk, which will dramatically slow down the query. The only options for improving performance are flattening, pairing down the JSON, or both.

Firebolt’s Tabular Schema with Support for Nested Types: The Best of All Worlds

Firebolt allows SQL experts to combine native array manipulation functions with partial flattening for the best performance.

Native array manipulation functions

Array manipulation functions directly embed objects with their full structure in the database and transform them into Firebolt’s native nested types. Users no longer work with objects as a whole, they transform them into an efficient format which maintains their nested structures. This enables more compact queries that run in SQL. Since the arrays are stored in a columnar format, querying does not require loading the entire object into memory.

Partial flattening

Firebolt enables flattening on the fly. Users have the flexibility to balance flattening and manipulation functions, and work on a table which has parts that are fully flattened, parts that are partially flattened and parts that are left as they are.

When partially flattened data is natively supported in your database, it also results in fewer ETL steps. JSON manipulation functions are used to seamlessly cleanse, fix and organize semi-structured data as it’s ingested through the Firebolt pipeline. Partial flattening can then be performed to transform the nested types to an ordinary table that can be further manipulated with familiar SQL techniques, and indexed on each column or join to improve performance. It also enables creating views without nested types on tables containing nested types, which is useful when working with BI tools that cannot access nested types directly. Partial flattening as opposed to full flattening provides the benefits mentioned above while avoiding huge denormalized tables.

Sub-second performance: Firebolt provides extreme performance across structured and semi-structured data, regardless of the amount of attributes and arrays it contains. For flattened data this is achieved with the various optimizations of a modern columnar database, including vectorized execution, aggressive compression and encoding techniques as well as indexing. For semi-structured data, Firebolt provides native Lambda expressions that can be invoked within SQL. These functions are optimized to traverse the nested structures without loading the data into RAM or performing full scans. The end result is sub-second performance at scale without requiring nearly as many resources and therefore costs.

Substantial cost reduction: Firebolt’s flattening is extremely cost-efficient compared to Snowflake’s, since there's no need to duplicate a column’s data for each nested value. Although this duplication can have less of an impact on storage with column-based compression, it may require additional GROUP BY or FILTER operations, which can have a huge impact on performance. With arrays and direct array manipulation functions, GROUP BY and FILTER operations can be used only during the final result set generation. In addition, since the arrays are already stored in an efficient, columnar and compressed format, the UNNESTING is just a matter of presentation, not a transformation of the underlying storage representation.

Hands-On Example

Consider data collected from an e-commerce website. Each visit (session) is a record. Each visit has an Id, Start Time, and Duration (regular scalar values), but also zero or more tags, various properties of the user agent and zero or more “events of interest” like registration, adding an item to cart and leaving a review. Such events, in addition to scalar values (Id, Type and Timestamp) may include zero or more additional properties which will vary according to the event type and may differ even between events of the same type.
‍
Assuming that the raw data is stored in a JSON format, here are examples of two records (usually they will appear in the same file in a “JSON per line” format) :

Source JSON

Record 1:

{
"id": 1,
"StartTime": "2020-01-06 17:00:00",
"Duration": 450,
"tags": ["summer-sale","sports"],
"user_agent":{
"agent": "Mozilla/5.0",
"platform": "Windows NT 6.1",
"resolution": "1024x4069"
},
"events":[
{
"EventId": 547,
"type":"login",
"Timestamp":"2020-01-06 17:00:10",
"EventProperties" :
{
"UserName":"John Doe",
"Successful": true
}
},
{
"EventId": 548,
"type":"add-to-cart",
"Timestamp":"2020-01-06 17:01:15",
"EventProperties" :
{
"ProductID":"xy123",
"items": 2
}
}
]
}

Record 2:

{
"id": 2,
"StartTime": "2020-01-05 12:00:00",
"Duration": 959,
"tags": ["gadgets","audio"],
"user_agent":{
"agent": "Safari",
"platform": "iOS 14"
},
"events":[
{
"EventId": 747,
"type":"registration",
"Timestamp":"2020-01-06 19:05:10",
"EventProperties" :
{
"UserName":"Joe Schmoe",
"Account Type": "Premium"
}
},
{
"EventId": 875,
"type":"added review",
"Timestamp":"2020-01-06 19:09:17",
"EventProperties" :
{
"ProductID":"r2d2",
"rating":5,
"text": "Friendly, compact, and cute robot. Excellent build quality"
}
}
]}

From nested to tabular semi-structured

In order to enable efficient analytical queries which will be used by BI tools (and BI users) who natively speak SQL, the nested structure of the raw data has to be at least partially transformed into a semi-structured, SQL-friendly schema.
‍
Here’s an illustration of the first record in Firebolt’s semi-structured table (some data omitting for brevity):

Visit

Id: 1

Start_time: 2020-01-06 17:00:00

Duration: 450

SQL implementation

CREATE FACT TABLE Visits
(
Id INT,
StartTime TIMESTAMP,
Duration INT,

-- zero or more tags
tags Array(TEXT),
-- user agent properties will consist of two arrays,
-- one for names and one for values,
-- they will alway in the same length per specific row

user_agent_properties_names Array(TEXT),
user_agent_properties_values Array(TEXT),

-- Events consist of an array for IDs, array for types,
-- and two nested arrays for event property names and event property values
events_ids Array(INT),
events_type Array(TEXT),
events_properties_names Array(Array(TEXT)),
events_properties_values Array(Array(TEXT))
)

Resultant (illustrated physical) schema:

In order to illustrate the querying of the data in both of Firebolt’s approaches (UNNEST construct and special array manipulation functions), let’s consider the following query written first in plain English: List all the IDs of the visits whose tags contained “sale” or “fashion” in descending order of duration.

Using UNNEST

UNNEST flattens the nested data and adds a column having one row for each element of the nested type.

SELECT id
FROM visits
UNNEST (tags)
WHERE tags = ‘sale’ OR tags = ‘fashion’
GROUP BY id
ORDER BY duration DESC

The GROUP BY clause here is needed because each row will appear as many times as the number of elements in its tags column

Using array functions

Here we use the array function “has” which returns “true” if the array supplied as the first argument contains an element equal to the second argument.
‍
SELECT id
FROM visits
WHERE has(tags, ‘sale’) OR has (tags,’fashion’)
ORDER BY duration DESC
‍
Note that we do not need the GROUP BY clause here because each row appears just once and we operate directly on the array without “exploding” it.

More Examples

The following examples will use array functions and UNNEST interchangeably. While the same query can typically be written in either one of the approaches, there is usually a more suitable way to do it, and of course it’s a matter of personal preference.

‍Count all the visits whose user agent string contained “safari”, group by event type in descending order of number of visits.
‍
Here we’ll use both UNNEST and an array function. For the grouping, UNNEST is a more suitable option, while for the filtering the Array function is more compact and readable.
‍
SELECT COUNT(id) as number_of_visits
FROM visits
UNNEST (events_type)

WHERE
any_match({(prop_name, prop_val)-> prop_name = ‘agent_string’ AND
prop_val LIKE % ‘safari’}, user_agent_properties_names,
user_agent_properties_values)
GROUP BY events_type
ORDER BY number_of_visits

Here we introduce the concept of Lambda expressions in array functions. The function any_match accepts a Lambda expression, and one or more arrays. It then runs over the elements of the arrays (which have to be of the same length), passes the current elements to the Lambda expression and if for any reason there’s at least one call to Lambda that evaluates to true - the whole function will evaluate to true, otherwise it will evaluate to false.

A simpler example would be to rewrite our first example where we used the disjunction of two “has” functions: “has(tags, ‘sale’) OR has(tags,’fashion’)”, with a single call to any_match, as follows:

any_match({(t) -> t = ‘sale’ OR t = ‘fashion’}, tags)

How flattening impacts the ETL

In the inventory of items below, there are several sizes for each item, and prices in different currencies for each size. So, for a single item, its prices object will look like this:

{
"ItemID": 1,
"sizes": {
"38": {
"prices": {
"EUR": "286.00",
"GBP": "245.00",
}
},
"33": {
"prices": {
"EUR": "286.00",
"GBP": "245.00",
"USD": "318.00",
"RUB": "19619.00"
}
},
"32": {
"prices": {
"EUR": "286.00",
"GBP": "245.00",
"SGD": "430.00",
"AUD": "459.00"
}
},
"40": {
"prices": {
"EUR": "286.00",
"JPY": "34699",
"USD": "318.00"
}
}
}
}

In order to convert it into a fully flattened table like the one below, Snowflake has to run the “LATERAL FLATTEN” operator twice - first to get sizes, then to get prices, and finally JOIN three tables.

Size

Currency

Price

Item

EUR

286.00

....

GBP

389

...

245.00

With Firebolt, using arrays, you simply run (the subquery is here just for readability). The query may look complex, but it entirely eliminates the need to materialize intermediate tables and join them.
‍
If you want a fully flattened result, you can augment the query with a double UNNEST clause (left as an exercise to the reader) - once again, no intermediate materialization, no JOINs and a huge performance gain.

SELECT
ItemID,
sizes,
-- The following results in an array of array - of each size,
-- an array of currencies
transform(
{(curr_price) -> JSON_EXTRACT_KEYS(curr_price)},
prices
) as currency,

-- The following results in an array of array - of each size,
-- an array of currencies
-- each 'VALUE' extracted from the object inside for a given price
-- is just a single string, but we still have to convert it to FLOAT
transform(
{(curr_price) -> CAST(JSON_EXTRACT_VALUES(curr_price) AS FLOAT)},
prices
) as prices

FROM (
SELECT
JSON_EXTRACT_INT(json_object, 'ItemID') as ItemID,

-- The following results in an array of sizes
JSON_EXTRACT_KEYS(json_object, sizes) as sizes
-- The following results in an array of JSON objects with prices for each size
JSON_EXTRACT_VALUES(json_object, sizes) as prices
FROM inventory
)

Conclusion

Firebolt allows SQL experts to combine native semi-structured storage and Lambda expressions with partial flattening to achieve a seamless analytics experience delivering sub-second performance. It provides:

Native Lambda expressions that can be involved directly in SQL
A flexible combination of nested storage and flattening that can be queried in SQL without the need to create huge, costly tables
Sub-second performance across structured and semi-structured data, regardless of the amount of attributes and arrays it contains
Substantial cost reduction by avoiding the need to load all JSON or other semi-structured data entirely in RAM and do full scans or complex nested queries.

WHITEPAPER

Semi-Structured Analytics Without Compromising Performance

With endless streams of JSON log files and other nested serialization formats like Avro and Parquet, the ability to handle semi-structured data is a must