Event streams in Firebolt
February 22, 2024

Event streams in Firebolt

Event streams have always been problematic to analyze in SQL.  They're all around us, monitoring where a user goes on your website by session, click streams in the adtech world (impressions, conversions etc.), and now with the advent of IOT, it's gotten even more exciting.

But that doesn't have to be with modern systems.

Having come from the adtech world and spending far too much time and likely millions of dollars of compute on this issue, I'm going to use the issue of click fraud for this example.  Click fraud occurs because content providers get paid for conversions of ads shown on their content.  A conversion occurs when a user actually clicks the ad from the impression.  Sometimes, content providers get quite brilliant and figure out how to fire conversions in the ad system programmatically.  These are found by looking for conversions without an impression in a session.  If you're an ad network, and fraud is occuring, your advertisers will see it in lower quality audiences.  This becomes expensive as they'll pay you less per volume.

Often, the schema for the stream would look something like:

This data can be large and needs to be sifted through rapidly to identify how many clicks happened with no requests or impressions. So, what would the SQL look like? It generally goes something like:

You just joined a billion rows to a billion rows, this is never going to complete before a report times out, so we run off to generate yet another summary table. I have lived this fact-to-fact join life enough on various data warehouse technologies. The last one cost me over half a million dollars a year and ate up a few warehouses - virtual ones at that. I have always wanted to see a better way to do this. A way that doesn’t squeeze the infrastructure or myself for all we got. It is refreshing to see a solution when it comes to solving the same problem on Firebolt. 

For starters, Firebolt is efficient in terms of how it processes data. Columnar storage, vectorized processing, de-coupled storage and compute - the usual Cloud Data Warehouse buzzwords, obviously.  But the other aspect is the efficient data pruning through indexes - multiple ones at that. But there is more to the story than indexes, which I have seen deliver 100X faster response times over and over again.

The big news here is that Firebolt has array data structures, Lambda expressions and a cool aggregating index. View the aggregating index as a no nonsense take on materialized views. How do these things come together? Simple … let me show you.

In Firebolt, you just create an aggregate index using array functions and you're done:

What does this aggregating index do?  An aggregating index will materialize an aggregate behind the scenes, so if any user writes a query against the underlying table that the aggregating index can serve, the query interpreter will automatically shift to reading from the index rather than the base table.  In this case, nest() is an aggregate function that will be automatically maintained at ingestion.  This gives us ease of use.  There’s no need for summary tables or indexed views.  The end user doesn’t have to query some other table,  It also gives us extreme performance as we’ve effectively “pre-joined” all the data we need into one physical unit, so we don’t have to join at query time.  Fact to fact joins are always expensive.

How does this change the billion row self join that cost me a lot of money? Look below …  What is the trick here?  For one, creating arrays that can be efficiently processed through Lambda expressions.  If you want to learn more about how Firebolt offers Lambda expressions to process arrays and arrays of arrays, please check out the SQL:Thinking in Lambdas blog.

All done with no joins so it'll go plenty fast enough to wire directly to a dashboard. If it’s not fast  Of course this is just the start.  In other event stream analysis we may want to see how many sessions visited web page N as well as page Y and not page Z.  This too is easily handled through similar indexes and lambda expressions. How cool is that ?

Interested in testing this out with all the SQL tables, ingestion scripts and more. Go to my GitHub repo.

Read all the posts

Intrigued? Want to read some more?