Imagine you could analyze how the Internet is used by all of us: something like Google Analytics for the entire internet. This is Similarweb.
Similarweb is a big data powerhouse, collecting enormous amounts of web-related data to help marketers, brands, salespeople, and many others analyze how their audiences interact with websites. For example, you can easily track which keyword searches drive traffic to your website, which sub-pages visitors land on, how often they land on a competitor's website, where they're from, which mobile OS they're using, whether they clicked on organic or paid links, and much more.
If you go to their website, the first thing you see is a search box allowing you to directly research any website:
For Similarweb, it's all about delivering a great end-user experience: letting users analyze, slice and dice, and find insights in the data through a broad spectrum of analyses.
For example, here you can see an overview of the behavior of godaddy.com:
You can go deeper into the analysis with more views:
Something very cool is the ability to compare multiple sites head to head, in this case, godaddy.com vs wix.com:
Data size and data stack
Similarweb is ingesting 5TB of new data a day.
In production, the dataset relevant to the segment use case we'll dive into below holds roughly 1 petabyte of data today.
Everything is on AWS. Processing is mostly done over the data lake with Spark and Databricks, orchestrated by Airflow. Then queries are mostly served to the UI through either DynamoDB or Hive (before Firebolt was added to the mix).
The data pipeline from collection to serving is split into these four steps:
- Data Collection
This team is in charge of collecting data from a big variety of sources, such as public data, partner-originated data, directly measured data, and more.
- Data Synthesis
Here the data is cleaned and processed to remove private information, noise, and other dirty data.
- Data Modeling
This is where a lot of the magic happens: not only is the data modeled, it also goes through heavy ML processes to create predictive models that can draw conclusions on internet-wide behavior from partial data points.
- Data Delivery
This is where the data is served to the end-users through packaged analytics-rich experiences.
The ‘Segment Analysis’ use case
Imagine that you're a marketer at FootLocker, and you want to understand how FootLocker.com performs compared to Amazon.com. Obviously it's not an apples-to-apples comparison, because shoes are just one of many things that Amazon sells. Similarweb wanted to let its users analyze ‘segments’ within bigger websites, so that you could compare FootLocker.com's traffic with only the shoe searches on Amazon.com. Similarweb knew this feature would be one of the most complex they had ever tackled from an analytics perspective.
Some of the challenges Similarweb ran into when trying to implement this use case are not hard to guess: Data volumes are huge, which makes everything tougher, and ETLs are very costly and take a lot of time to develop and maintain.
But an even bigger challenge was the dynamic input from users that the analysis needs to take into account. Pre-processing every combination users might want to compare would mean materializing an exponential number of segment combinations, which is infeasible.
Each day, Amazon alone generates 150GB of data in Similarweb, and users want to analyze up to 2 years' worth of it. Looking up URLs that match dynamic patterns means scanning a lot of data, which becomes painfully slow. On top of that, each entry groups all the URLs from a single session into one array, making pattern matching even tougher.
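To get a feel for why pre-computing segments is off the table, here is a back-of-the-envelope sketch. The pattern counts are illustrative assumptions, not Similarweb's real numbers:

```python
from math import comb

# Illustrative assumption: a large site exposes 50 candidate URL patterns
# (brands, categories, search terms) and a segment combines up to 5 of them.
PATTERNS = 50
MAX_PER_SEGMENT = 5

# Number of distinct segments that would each need a pre-computed rollup.
segments = sum(comb(PATTERNS, k) for k in range(1, MAX_PER_SEGMENT + 1))
print(segments)        # 2369935 distinct segments for a single site
print(segments * 730)  # 1730052550: ~1.7 billion daily rollups over 2 years
```

Even under these modest assumptions, one site alone would need millions of pre-computed segments, multiplied again by every day in the analysis window and by every site in the catalog.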
Similarweb considered Athena, since they already used it for internal analytics, but quickly ruled it out: it isn't fast enough to deliver the low latency a responsive end-user experience requires. Similarweb also ruled out DynamoDB because, even though it is a fast document store, it doesn't work well with SQL and does not support dynamic grouping.
Attempting to solve the challenge with Auto-scaled Lambda functions
Another approach that was tested was auto-scaled Lambda functions. The idea was to trigger one Lambda function per day of the queried range. That required the data to be stored in JSON, so an ETL process was implemented to convert the existing ORC format into JSON. This essentially meant keeping a full copy of the 1PB dataset to support all the possible segments users might request. Second, the performance wasn't great: even though the Lambda functions are parallelized, a few of them typically ran slower, making the overall wait time long. And lastly, the approach doesn't support SQL for the further grouping and aggregations that are needed.
Selection of Firebolt
Similarweb narrowed down the competition to BigQuery and Firebolt, and conducted a benchmark that also included Athena. These were the results:
Firebolt showed the best performance but also didn’t require any additional pre-processing. Raw data was loaded and was ready for dynamic querying at sub-second performance. Additionally, Similarweb saw a lot of value in Firebolt’s approach to decoupling of storage & compute, which allows for easy workload isolation. This meant that they could isolate the workloads for the new feature set, and deliver queries that are consistently fast and predictable. It also allowed for easy continued development on different compute clusters (called “engines” in Firebolt) over the same data, without affecting the production experience.
The last consideration was cost, and Similarweb found Firebolt to have the best price-performance and lowest TCO of the alternatives.
Therefore Firebolt was selected, and within a few weeks was fully programmatically orchestrated in production with Airflow and Firebolt’s REST APIs. Since then, Similarweb went on to deliver more features using Firebolt as the backend.
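As a rough sketch of what such orchestration could look like, here is a hypothetical helper that an Airflow task might call. The endpoint path, payload fields, and engine name below are placeholders invented for illustration; Firebolt's actual REST API differs, so consult its documentation for the real calls:

```python
import json

def build_query_request(base_url: str, engine: str, sql: str) -> dict:
    """Build (but do not send) a REST call that runs a query on a named
    Firebolt engine. All field names here are illustrative placeholders,
    not Firebolt's real API."""
    return {
        "method": "POST",
        "url": f"{base_url}/query",  # placeholder path, not the real endpoint
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"engine": engine, "query": sql}),
    }

# An Airflow PythonOperator could wrap a call like this on each pipeline run,
# pointing each workload at its own isolated engine.
req = build_query_request("https://api.firebolt.example", "segments_engine",
                          "SELECT 1")
```

The point of the sketch is the shape of the integration: Airflow drives the pipeline, and each task talks to a dedicated engine over REST, so the production-serving engine is never touched by development or ingestion work.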
To wrap it up, here's an example of the Similarweb app showing traffic and engagement over time for PlayStation 5 (“PS5”) on amazon.com, using Firebolt as the backend. Behind the scenes, multiple TBs of data were scanned and dynamically filtered for “PS5” while delivering a ~1-second load time in the UI.
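Behind a view like this, the core query pattern is dynamic substring filtering over URL data. A sketch of what such a query might look like, wrapped in Python; the table and column names are invented for illustration, and real code should bind parameters rather than interpolate strings:

```python
def segment_traffic_sql(domain: str, pattern: str, start: str, end: str) -> str:
    # Illustrative schema only; in production, use parameter binding to
    # avoid SQL injection instead of f-string interpolation.
    return (
        "SELECT visit_date, COUNT(*) AS visits\n"
        "FROM page_views\n"
        f"WHERE domain = '{domain}'\n"
        f"  AND url ILIKE '%{pattern}%'\n"
        f"  AND visit_date BETWEEN '{start}' AND '{end}'\n"
        "GROUP BY visit_date\n"
        "ORDER BY visit_date"
    )

sql = segment_traffic_sql("amazon.com", "ps5", "2021-01-01", "2022-12-31")
print(sql)
```

Because the pattern arrives at query time from the user, nothing about the segment can be pre-aggregated; the speed of this scan-and-filter step is exactly what the engine choice had to optimize.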