Explorium provides an external data acquisition and management platform that enables companies to make better business decisions by automatically discovering, connecting and matching their own data with hundreds of curated data sources, and thousands of external data signals.
As Explorium grew, so did their data and performance challenges.
Explorium captures and stores its external data across many different data warehouses and specialized databases. Each customer request gets routed across these databases, and then the results are assembled and returned.
All time series data is stored in Databricks Delta Lake. Previously, all customer requests for time series data were processed using a Presto cluster maintained on AWS, with the most recent data kept in an in-memory table to help improve performance. At any given time Explorium gets a broad range of small to very large data enrichment requests. Each request was converted into workflows involving a series of federated queries. Explorium also supported some internal requests using Amazon Athena.
The challenge was that performance would periodically become very slow, and it was very hard to predict when that would happen.
The root cause was that under higher loads, a large job could require significant resources, which impacted all the other jobs. The Presto cluster was a shared service that did not isolate requests from each other. Whenever Presto received requests that required it to scan a large number of partitions, it could cause a slowdown. Explorium established some policies, such as timing out requests that were taking too long, to help improve the response times for the other requests. But it meant that during high loads, long jobs and timeouts could adversely affect all requests, and impact customer satisfaction. Given that both data volumes and requests were expected to triple over the next year, Explorium decided they needed a new solution to handle customer requests for time series data enrichment.
The evaluation process
Explorium evaluated other Presto options, including Trino, as well as Amazon Redshift. None of the Presto options helped isolate workloads or dramatically improve performance. Redshift, while it did offer greater performance, required significantly more effort. It also had many of the limitations of more traditional data warehouses, including a lack of decoupled storage and compute, which Explorium felt would cause issues as they grew. They published their initial benchmark results on Medium.
Then Explorium evaluated Firebolt. The entire process, which involved implementing several of the queries against a large data set, took a few weeks. Explorium loaded 1.5TB of time series data into a Firebolt engine (compute resource) on AWS. They started with a larger memory-optimized engine that was roughly 20% more expensive than the Redshift cluster they had benchmarked, and added aggregating and join indexes to improve performance even more.
The result? Every query ran in less than one second, 17-102x faster than Redshift across the evaluated queries.
They decided Firebolt could easily support their future growth with its decoupled storage and compute architecture, and started implementing a solution.
The solution - Firebolt for time-series data enrichment
The entire implementation process took 2 months, although much of the time was spent doing paperwork and getting resources for the project. The time spent implementing was much shorter, with most of the work done using SQL. After extensive testing, Explorium went into production.
Explorium first processes its raw data using Apache Spark and loads it into Delta Lake. It then loads the data into Firebolt using an ELT process that runs within Firebolt using SQL. Explorium then split requests into small and large ones. All small requests execute directly against Firebolt data via the Firebolt REST API. Primary indexes in Firebolt, which sort and index the data in a table based on the columns and the order of the columns in the index, dramatically improved the performance of these live queries.
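The live-query pattern can be sketched in SQL. Firebolt fact tables declare a primary index at creation time; the table and column names below are hypothetical, not Explorium's actual schema:

```sql
-- Hypothetical time series table; names and types are illustrative only.
CREATE FACT TABLE ts_signals (
    entity_key   BIGINT,           -- numeric lookup key
    signal_date  DATE,
    signal_name  TEXT,
    signal_value DOUBLE PRECISION
)
-- The primary index sorts and indexes rows by these columns, so point
-- lookups and date-range scans read far less data.
PRIMARY INDEX entity_key, signal_date;

-- A typical small "live" request submitted via the REST API:
SELECT signal_name, signal_value
FROM ts_signals
WHERE entity_key = 42                      -- illustrative key
  AND signal_date >= '2021-01-01';
```

Because the primary index leads with the lookup key, queries of this shape can prune most of the table before scanning.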
For the larger offline, scheduled data enrichment requests, Firebolt executes a federated query directly against each submitted file by treating it as an external table. It joins and enriches the submitted data with other data within Firebolt. Explorium then returns the results as an enriched file to the customer.
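A sketch of the offline pattern, assuming a Parquet file submitted to S3: the file is exposed as an external table and joined to data already in Firebolt. All names, paths, and the internal table below are hypothetical:

```sql
-- Hypothetical external table over a customer-submitted file on S3.
CREATE EXTERNAL TABLE customer_upload (
    record_id  TEXT,
    entity_key BIGINT
)
URL = 's3://example-bucket/uploads/'   -- illustrative location
OBJECT_PATTERN = '*.parquet'
TYPE = (PARQUET);

-- Enrich the submitted records with time series data held in Firebolt
-- (ts_signals is a hypothetical internal fact table).
SELECT u.record_id, t.signal_name, t.signal_value
FROM customer_upload u
JOIN ts_signals t ON t.entity_key = u.entity_key;
```

The result set of the join is what gets written back to the customer as the enriched file.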
While Explorium liked the additional performance that aggregating and join indexes provided, they decided they didn't yet need it. They ended up deploying a lower-cost three-node engine and chose to rely on primary indexes alone for now.
Even without aggregating and join indexes, Explorium got the performance needed to make all live queries truly live: every query runs in 2 seconds or less, 15-50x faster than Redshift.
Even for the remaining larger offline data enrichment requests, Firebolt improved the performance 3-5x on average compared to their original Presto deployment, with roughly the same cost of ownership.
Explorium used a large alphanumeric field in the data to ensure unique records. Firebolt leveraged the CityHash function to generate a unique numeric value for each alphanumeric value. Firebolt then leveraged that hashed value in the primary index, which provided faster lookups within the time series data.
In this design, Firebolt uses a partitioned table to handle the time series data; partitioning provides the first level of data pruning. Using the column that converts the h310 data into a unique number as the primary index then allows fast lookup of the data stored in Firebolt's TripleF file format.
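The hashing-plus-partitioning pattern might look like the sketch below. CITY_HASH is a Firebolt function; the table, columns, and staging source are hypothetical:

```sql
-- Hypothetical: hash the large alphanumeric identifier to a numeric key
-- at load time, and lead the primary index with it for fast lookups.
CREATE FACT TABLE ts_enriched (
    id_hash     BIGINT,            -- CITY_HASH of the alphanumeric id
    raw_id      TEXT,
    event_month INT,
    event_date  DATE,
    value       DOUBLE PRECISION
)
PRIMARY INDEX id_hash, event_date
-- Partition pruning happens before the primary index is consulted.
PARTITION BY event_month;

INSERT INTO ts_enriched
SELECT CITY_HASH(raw_id),
       raw_id,
       EXTRACT(MONTH FROM event_date),
       event_date,
       value
FROM staging_events;               -- hypothetical staging table
```

A lookup then filters on the partition column and the hashed key, so both pruning levels apply.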
Once the performance in Firebolt was validated, the next use case was to quickly read the enrichment data, dynamically build the query, and export the data to the end user. The CLOUD_STORAGE function was used to handle this request: it reads data directly from S3 and provides it as an in-memory table that can be joined to the data residing in Firebolt.
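A minimal sketch of that step, reading a file from S3 and joining it to data in Firebolt. The exact argument list of CLOUD_STORAGE shown here is an assumption, and all other names are illustrative:

```sql
-- Hypothetical: expose an S3 file as an in-memory table and join it
-- to an internal table. The CLOUD_STORAGE signature is assumed here,
-- not taken from Firebolt documentation.
SELECT u.record_id,
       t.signal_value
FROM CLOUD_STORAGE('s3://example-bucket/requests/request.csv') AS u
JOIN ts_signals t                   -- hypothetical internal table
  ON t.entity_key = CITY_HASH(u.record_id);
```

The joined result is then exported back to the end user as the enriched data set.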
With Firebolt, Explorium no longer has to rely on timeouts for slower queries, or other forms of throttling. Firebolt has consistently delivered predictable performance for every query since going into production, even under high loads.
Easy to manage, with linear scale
Supporting the ever-increasing data volumes and loads has also been easy. Since its initial deployment, Firebolt has not only been easy to manage; it has also delivered linear scalability and costs, with plenty of room for growth. Scaling engines takes just a few clicks.
In short, Explorium was able to leverage Firebolt to serve enriched data to their customers in production in a way that is consistently fast, production-grade, and reliable, knowing the solution can continue to scale linearly as their customer base grows. Firebolt's under-the-hood file format and indexing proved highly efficient at pulling filtered data faster than the other solutions they evaluated.