semi structured data
May 15, 2023

A primer on analyzing semi-structured data (Part 1)

No items found.

In today's data-driven world, organizations are constantly generating, analyzing, and utilizing massive amounts of data to fuel their businesses. As the data landscape evolves, businesses encounter a wide range of data types, from structured to unstructured and everything in between. One such data type is semi-structured data, which holds a unique position in the data ecosystem, blending the features of both structured and unstructured data. In this introductory guide, we will explore the realm of semi-structured data, focusing on its handling and management in the context of databases. 

Specifically, we will examine the JSON format, a widely adopted standard for representing semi-structured data, and discuss how to harness its potential in building data applications. From ingesting and processing to visualizing the data with two popular frameworks, Streamlit and Metabase, this guide will provide you with the fundamental knowledge necessary to handle semi-structured data effectively. 

Parts

Overture: The dataset, setup, and loading data into Firebolt
Sonata: Analyzing the raw data
Minuet: Array manipulation in SQL, including Lambda functions and flattening data
Allegro: Visualizing data with Streamlit & Metabase
Finale: Conclusions

Overture

The dataset

In this blog post, we will explore some of the key JSON and array functions available in SQL, including Lambda functions, and how to use them to manipulate semi-structured data. For this, we will leverage the New York Philharmonic dataset containing all their concerts since the Philharmonic’s first concert at the Apollo Room conducted by Ureli Corelli Hil on December 7, 1842. 

The dataset spans over 175 years of performance history, covering 11.000 unique works and 20.000 performances. The full dataset, when flattened, has over 800.000 rows. 

Which composers are most popular? Have there been any trends in popularity by a conductor or by season? How many concerts were performed each year? We will answer all these questions and more using JSON and array functions. 

Apollo Rooms venue on the left. The first page of the Philharmonic’s first concert is on the right. 

Setup

Our tech quartet is composed of the following: 

  • Storage: S3 – Object storage on AWS 
  • Database: Firebolt – Cloud Data Warehouse for data apps
  • Visualization 1: Streamlit – Python-based framework for front-end visualization
  • Visualization 2: Metabase – Open-source visualization tool

The raw JSON file, “raw_nyc_phil.json”, is loaded into AWS S3. This file is accessed from the Firebolt data warehouse in the form of an external table, “ex_nyc_phil”, to enable direct querying. This data can be ingested “as-is” or flattened to the “nyc_phil” table in Firebolt to be analyzed and visualized.

Loading data into Firebolt

Step 1a: Upload to S3. Since there is only one input file and no preprocessing is needed, we will use the AWS management console to upload the JSON file to our bucket.

Loading data into Firebolt

Step 1b: Create a database and a C2 general-purpose engine in Firebolt for this demo. Firebolt provides various types of granular engine sizes to address various workloads. A C2 engine with 4 vCPU, and 8GB RAM is more than sufficient for this exercise.

Create a database and a C2 general-purpose engine in Firebolt

Step 2: Create an external table pointing to the file we just uploaded to S3. 

Here’s a breakdown of the syntax: 

  1. Define the table as external. External tables serve as connectors to your data sources (in this case, S3). External tables contain no data.
  2. URL and OBJECT_PATTERN specify the S3 location and the file naming pattern 
  3. TYPE specifies the file type Firebolt expects to ingest given the OBJECT_PATTERN.
  4. PARSE_AS_TEXT: If True, Firebolt wraps the entire JSON object literal into a single column of TYPE TEXT. If False, we must map each key in a JSON object literal to a column in the table definition. 

Step 3. We can either: a) ingest the data as is (one-single field containing the JSON object) in a Firebolt-managed table or b) transform the raw_data using array and JSON functions before inserting it into a Firebolt-managed table. We will go with option b). 

Sonata: Analyzing the raw data

First, let’s ensure we can read the input data via the external table we just defined:

The schema of the JSON object looks like this: 

At the top level, we have an array of programs. Each program has multiple attributes:

Minuet

Array manipulations in SQL, including Lambda functions

Moving on to the next part of our play, we want to extract the top element: programs.  To accomplish this, we will leverage the JSON_EXTRACT_ARRAY_RAW function:

JSON_EXTRACT_ARRAY_RAW(<json>, '<json_pointer_expression>')

Next, we will answer some questions about the data using ARRAY functions

How many programs has the Philharmonic played? 

To find out, we can use the LENGTH function:

Let’s now convert the above array to a standard tabular format, having each program on a separate line. To do this, we will need to use the UNNEST clause:

Array manipulations in SQL, including Lambda functions

If we perform a COUNT(*) in the outer SELECT, we expect to see the same number of rows (13954) matching the LENGTH() of the array from above.

What time do the concerts usually start? 

The first step is to extract the “Time” attribute from the “concerts” array. 

We’ll begin by extracting the concerts array using the JSON_EXTRACT_ARRAY_RAW function and UNNEST it - similar to the programs array above. 

We will extract the Date, eventType, Season, Venue, 

Location, and Time for each concert using the JSON_EXTRACT function. 

 Back to our questions (What time do concerts usually start): 

8:00 and 8:30 pm are the most frequent options. Oddly enough, there were four times when the concert time was between 2:00 AM and 3:00 AM 

Most popular venue? 

Which composers are most popular? 

Similarly, we need to access the works array now instead of the concerts array.

Extract all conductors' names per season into an array.

We can either:

Let’s explore both options. 

Option A: UNNEST, extract the composer name, and ARRAY_AGG/NEST back.

Continuing on, we will group by season, and nest (into an array) the composers. Lastly, we will remove any duplicate names using the ARRAY_DISTINCT function. We can optionally return the number of unique values in an array using the ARRAY_UNIQ function. 

Option B: Use TRANSFORM Lambda Function 

TRANSFORM applies a function to each element of an array. Therefore, we don’t need to UNNEST anymore in the FROM clause. We can directly work with the array and extract the composer name in the SELECT clause. 


Similarly to Option A, we can apply ARRAY_DISTINCT and ARRAY_UNIQ. Please note the above query doesn’t contain the GROUP BY clause. If you need to obtain one row per season, you will need to GROUP BY season, and FLATTEN(NEST()) the composer_name column before applying ARRAY_DISTINCT and ARRAY_UNIQ. 

Lambda functions can be chained together and become extremely powerful when applying changes iteratively to each element of the array. In the above case, we can apply another transformation and replace all the letters from each composer's last name with asterisks except the first one. 


If you are still wrapping your head around how Lambda functions work or are applied, you can take a look at this blog post that covers Lambda functions in depth. 

Identify the works which were played at piano

To find out which works were played at piano, we have to extract the instruments from the soloists' array. We will wrap the extraction of the soloists’ array into a view. 

View definition:

Query to extract the soloists' details: 


To find out the works played at Piano, we can leverage the CONTAINS array function that returns a boolean if it’s a match or not.


Sort the soloists' names for each work by their instruments

To answer this question, we can reuse the view and CTE from above. We will turn our attention to a different array function: ARRAY_SORT, which can take two arrays as input and sort one by the values in the other. 

Given the following entry:

soloist_names: Otto, Boulard, Munson, Mayer

soloist_instruments: Soprano, Alto, Tenor, Bass

It will resort the soloist_names array based on soloist_instruments values, as such: Boulard, Mayer, Otto, Munson


Firebolt supports a wide range of array and lambda functions similar to the above ones, that you can check by following this link to our docs.

Firebolt supports a wide range of array and lambda functions

Flattening data

Next, we’ll tackle ingestion. We will apply the same concepts from above to UNNEST all the arrays, starting with programs, and continuing to concerts, and works, and finally, UNNEST the soloists' arrays within works.

Finally, we will wrap the SELECT query and create a CTAS (create table as select):

SELECT * FROM nyc_phil

So far, we have covered loading and analyzing semi-structured data with various JSON and Array functions. In Part 2, we’ll analyze and visualize the data using Streamlit and Metabase. On to Part 2!

Get started with Firebolt

Free credits. Free support. No credit card needed.

Read all the posts

Intrigued? Want to read some more?