Big Data Analytics for Gaming

Big Data Analytics for Gaming

In our recent ‘Big Data Analytics for Gaming Workshop’ we let the audience do a lot of the talking. We can’t show the recorded breakout rooms for privacy reasons, but here’s a summary of what was on their minds:

  • Integrations are too complicated. With such a big pool of tools to choose from, how do you manage it all?
  • Planning for growth is challenging. How do you scale from less than a TB to 100s of TBs?
  • Everyone wants performance. An average query takes minutes, and some take much longer!

Luka Lovosevic, Firebolt’s Solution Architect was impressed, and explained how Firebolt addresses each of these challenges.


Transcript

Luka: My name is Luka. I'm a Solution Architect at Firebolt, and today, we're going to focus on big data analytics for gaming. So, the kind of trends and challenges we see in gaming, especially as the data volumes rise and the kind of the need for performance and scalability goes up and how Firebolt can help you there to achieve scalability and performance. We are also going to do a demo so just a couple of slides, maybe a use case or two, and then we'll go into a live demo and, of course, let's keep this interactive. So, if you have any questions, please interrupt me and let's talk about it. All right.

To set the scene, we at Firebolt definitely see this as a great decade for cloud analytics. There's Redshift, there's Snowflake, BigQuery, and a lot of products that have really helped the cloud analytics space in terms of making it available literally in a few clicks. Today, you can just do a couple of clicks. You get a full data warehouse at your disposal, which was not the same story as 10 years ago. So, it's full SaaS. We see the decoupled storage and compute, which is something that enables scalability, enables good performance, enables concurrency, and definitely scalability and elasticity are things that we need today, with the rise of data volumes and the rise of what we call instant analytics.

With that said, we definitely see new challenges ahead, talking to our prospects, they are already today, either at the terabyte scale or larger. But if not, they also tell us that they anticipate this to happen in the next either quarter or two or the next couple of months. So, definitely, terabytes are the new gigabyte as we say and this of course brings in some challenges. It can be an operational overhead to manage this. It can be the pure scale of the data, which doesn't allow you to do things as you have been doing them before. It can also influence the cost as well.

With the rise of data apps and instant analytics, it can be in-house BI, it can be end-user-facing analytics, and the SaaS users don't want to wait anymore. Google, the internet, all of the things, 5G have spoiled us and we expect things to happen instantly. We don't want to wait for 15 or 20 seconds for a dashboard to load. So definitely this, this is something that we've been seeing it in the while, and end-user experience and subsecond performance is something we expect from platforms today. 

And then, with the rise of semi-structured data, JSON, we see kind of a trend that not everybody wants to have a fully-fixed data model. People want to have kind of the flexibility to change stuff, and this is where JSON plays nicely. So, you can have your Schema which is kind of partly fixed, partly in JSON format, and that allows you to change things and evolve things over time. So, custom solutions go around this and people use them today, but they are costly. They give you kind of the overhead of either data engineering effort or you lose some of the things that Tamar mentioned in the beginning. So, you either pre-aggregate data to handle the kind of rising volumes, but then you lose the raw events or the raw data underneath. So, you cannot do a deep analysis. You cannot go deeper than like a monthly or a weekly aggregate or you analyze smaller time windows. So, you can only analyze, let's say, the last two months of data. But then everything older than that gets deleted or removed from the system. Again, another good situation to be in, especially today, is when you want to do things like trend analysis, do things like support data science work in terms of finding seasonality. So, all sorts of issues arise there. And then to do this, of course, there are complex pipelines to manage. Again, it's a data engineering effort.  Imagine you have five tables that you need to pre-aggregate, you need to get them in sync. Whenever new data is flowing in, you need to again run your aggregation scripts to keep everything in sync there. 

Not really a good place to be in, especially as you need to cater for both speed and data. Of course, you can also accept poor performance, especially in high concurrency but this is not where you want to go today, especially with so many products focusing on speed, and the user experience becoming kind of central to all of this. 

So, how does this relate to gaming? Of course, we've seen a large number of KPIs that the gaming industry focuses on and these can be internal catering for things like average revenue per user, retention rates, churn rates, things where kind of internal teams want to focus on either on the financial side or purely on the gaming experience to make it better things like, conversion rates, free-to-play versus pay-to-play. So, there are many different things that gaming companies are trying to follow and report on, but those are not easy to manage and to have at your disposal, given the challenges before. So, the data volume is rising. You have high concurrency and then you need to do all of these things and have a good experience there. Things like the final analysis. Those can be quite complex to do, trend analysis, and so on.

So, how does Firebolt come into play here? So, Firebolt was designed with speed in mind. So, speed or performance is one of our core tenants. Everything we do, the number one thing is we want to be faster at doing it. So, how we do that in essence, we focus on indexing style. So, it's kind of back to the basics. But, we have some very interesting concepts here and I'm going to go through them and then, we can also see it in a demo.

The first thing is we want to be very efficient at finding data.  This is the concept of our primary or sparse index. So, whenever you do a SQL query, we as Firebolt engine, we want to be very fast at finding those data, those rows. Being fast at finding them and being very specific at finding those. So, not finding things that we don't need, but really focusing on the things we need means that we are very efficient and this gets reflected not only on the performance but also on the cost side because we require less resources to do this.

How do we do it? We actually use index small data ranges. So, we index ranges of data within your database. And whenever you do a query that has kind of an aware part, you'll see in the demo, that has part of the index there, we are very efficient at finding those exact rows. Some other products focus on doing things on the partition level. We think that this is not adequate for today's analytics. The petitions are quite big. It requires moving data that you don't actually need. And this, of course, affects the performance. The building block for all of this is called TripleF. It's something that you as a user of Firebolt don't actually see directly, but it's a way we store data within Firebolt. So this is a format through which we index the data, we sort it and compress it into TripleF or Firebolt cloud format. The way things work within Firebolt, we try to be really NCC for compliance. So, whenever you create a table, it's going to be the very similar syntax to what you're already used to and the detailed statements. And as you can see here in the bottom right corner, really defining an index is quite simple. So, you create a table, you list some of the columns there, the data types, etc, and then you list your primary index. So that's one type of thing. And actually, it's kind of the core element to Firebolt.

Even with primary indexes, we are quite fast, but then we have two more things that accelerate,  especially in scenarios where we have a lot of aggregations or you want to have your data model in kind of a normalized or maybe it's kind of star schema format 

First is the Aggregating Index. This is very similar to materialized views in other platforms. So, the essence of agg indexes is Firebolt will pre-aggregate things and keep them in this format and serve it from there instead of going to the raw data and that kind of pre-aggregate or calculating things on the fly. How it works is you have your raw data in a table. You have your kind of raw events, let's say, gaming events or anything of that sort. Then, you create an aggregating index on top of this table. You see the definition here. It's quite simple. You see the game code, player code and currency code. These are free dimensions that we want to aggregate by and then we list the different calculations. It can be anything like sum, average, counts, you name it. Any of the aggregate functions. What will happen here is once I run this statement, Firebolt will create this index. It'll populate it and whenever your query is grouping by some of these columns and doing some of these calculations, Firebolt will serve data from the index.

How this helps us? Of course, this helps with, first of all, the resources required are less because some of the data is pre-aggregated for you. The other thing is you can cater for concurrency here because we don't need to calculate so many things on the fly, we can serve much more things in parallel. So, we can cater for higher concurrency, and we really see this in kind of data apps, customer-facing analytics, or even plain old dashboards and reports where you have a lot of, kind of, repeating workloads or a lot of concurrent SQL queries that are hitting your engine. The aggregating index helps here dramatically. What's nice here is you create the index only once and Firebolt will maintain it for you. There's no need to do any complex data pipelines for this. It's a detailed statement whenever new data flows into your base table essentially what happens is Firebolt populates the base table and also calculates the aggregating index for it or updates. And that's it.  You can have one or more and it really helps with performance. It's as simple as that. 

Another index type is the Join Index. This one helps if you want to add additional data to your base table. So, if you want to join things, no need to denormalize the tables or the schemas as we've seen it in some scenarios. So, actually, here you can focus on the kind of having a star schema, you can have your dimensions, you can have a fact table, and do things properly. So, you can join things, join tables together and get really, really good performance there. 

How this one works is basically join indexes are you can view them almost as hash tables. They're kept in memory in RAM. And, so whenever you do a join, you see here in this example, I'm doing a join index on three columns here, game-code, game-studio, game_currency. Whenever I join two tables, let's say, I join them in one game code, studio and currency will get fetched from memory. So, they will be kept in memory as part of the dimension table. And what this helps is with kind of, again, using less resources on the engine, no need to calculate things, just serve them from memory, which is of course the fastest you can get. So, these two things can work together. You can use both right, Aggregating and Join index in the same query and this is where we really see high levels of scalability, low sub-second performance, and really less resources used. All of these things, of course, help with the before-mentioned calculating the KPIs, the trends, and different final analyses that we've seen in gaming. All right! Any questions here?

Tamar: I see a question in the chat Luka. So, I'll interfere with Kirill. It looks like you need to define the metrics in the Aggregating index, not just the aggregating dimensions, Kirill also feel free to interfere and ask.

Luka: Yes, I mean the Aggregating index definition will contain these calculations. But in your query, you can do additional things. So, your query doesn't need to be as simple as this one. It can also combine things here. You can do a join as we have discussed. So, it doesn't limit you to only this type of definition. You'll see the demo, actually that we create an Aggregating index that's kind of wider, but we only use a subset of the columns and still, the query optimizer is smart enough to use the data from the Aggregating index.

But Kirill, if you have any additional questions or anything you want to go deeper into, let's do it.

Kirill: Thank you, Luka. Yeah, this is great. Just want to learn more and I understand the trade-off. What does it mean when I'm writing data into the warehouse? How much does it slow down?

Luka: Yeah. There are two things to consider. So, when you create an Aggregating index, it will use some of the storage because it's a new object. Usually, we see like 1 to 10 ratio. Let's say, our base table has a hundred million rows, we tend to go into a kind of, recommending an Agg index when the aggregation is, let's say 10 million rows, so 1 to 10. Storage is just quite cheap, we use S3. We really don't do anything, on top of that, we kind of propagate the cost of S3. So, this is one thing to consider. The other one is the impact of the insert. So, we've seen and, of course, that depends on the Aggregating index definition and the data volumes, but somewhere between 10% to 20%, slowdown of inserts when you have Aggregating indexes on top of the table. Not huge, especially when you consider that inserts in Firebolt are quite a performance. But yeah, definitely it'll depend on the number of agg indexes you have, the data volumes and etc. Hope this answers the question.

Kirill: Yeah. Thank you.

Luka: Awesome! All right. So, just quickly to touch upon the architecture, on a high level, we've mentioned before. So, Firebolt is a cloud data warehouse fully managed as a software service. We run this in AWS and the architecture is really decoupled storage and compute, kind of, in essence. So, if you look at it from a very high level, you have a storage layer at the bottom here, which is data plus indexes. This is kept on S3 and then you have to compute engines or Firebolt engines at the top. So, engines are basically the kind of endpoints that you use to query your data. So, you can have one or more engines running. There are specific engines for ingestion, what we call general-purpose engines and engines used for analytics. So, for reporting, kind of, read-only engines, you can have one or more running. They can be different scales, different types of engines, etc, we can show this in the demo. And then on top you have your use cases. So, reporting, integration, and anything that uses Firebolt will basically interact with the data through these engines. Awesome!

So, just to focus quickly on a Customer Use Case. This is actually a mobile gaming company that has recently moved to Firebolt and an interesting use case because it touches upon the challenges that we've seen. Definitely, they had issues with operational dashboards on Looker. They were incurring a lot of costs because they were very cumbersome to manage and to calculate. So, on the one side because of the pre-aggregation because of the small time windows, a lot of data engineering efforts were spent there and the other pain point was actually the performance. Still, the users were quite unhappy because a lot of times were in the 15 to 30 seconds, which is not something you want to see today. They rewired their solution and moved the data over to Firebolt with Looker running on top of Firebolt and they achieved subsecond performance there. So, saving time, saving resources as well because we are much more efficient and you can see some of the numbers here in terms of how many queries they cater per day, and the reduction of cost. Awesome! Is there something you guys can relate to? We can also touch upon these things as we go into the breakout, but, yeah, I think, it's actually kind of reflects the poll that we've had at the beginning. There is one item there on the integrations that we can also touch upon in the break room but also have a slide on how Firebolt can help with integrations.

Okay, let's do a quick demo of Firebolt. Enough talking. Let's actually see how this works.

I hope you can still see my screen. All right! This is our main UI. So, at Firebolt, you access it through your browser, of course, that's the kind of user interface. There are many other ways to interact with it. There are rest APIs as is the case. There are a lot of way to actually programmatically work with Firebolt. So once I log in, I have my databases here. I have my SQL pane, the results will appear here once we run any queries. And then on the right side, we have like an Object Explorer, so you have your tables, views, columns, indexes, everything there. So, let's see what we have in this database. We have one interesting table, it's called lifetime value and this demo is actually coming from one of our AdTech customers who gave us an obfuscated data set that we use for demos. It's quite interesting because again, it's a pretty large data set you have. You see here the FACT table with 51 columns, 31 terabytes of uncompressed data before being adjusted into Firebolt and then the size within Firebolt almost 2 terabytes. So, we can also see the count from this table here. So, it's 57 billion rows, right? So, quite large, this is the lifetime values, so different kinds of events, and different metrics that are calculated for this AdTech company, but quite a large dataset and it was struggling before while being queried for _____[00:19:09.04]. So, ____ does scans of course. It doesn't have the sort of indexes that Firebolt has and so the Looker dashboard that was running on top of this table was performing really badly. Once ingested into Firebolt, we can see a typical query that was issued. So here, I'm switching over to another SQL pane and I'm going to run a typical query that was coming from Looker to calculate some of the different KPIs that the customer was expecting. So, here you can see the typical things that Firebolt is good at. So, Joins you see, we have our fact table, we join it with 2 dimensions. You can see the where clause with the different filters applied, and then you can see different groupings and the calculations, right? So, the different kinds of clicks, impressions, all of these KPIs are calculated on the fly here. So, if I do this and I execute, this is now running on top of, kind of, two weeks of data with another filter on the media source. And what happens is you see that performance is quite good in subsecond, but what's more interesting here is the actual scanned bytes. Out of the almost 2 terabytes of data, we're only scanning less than 4 gigs. Really a small percentage, and really see how efficient Firebolt is at finding rows that are needed. So, this, of course, is where the primary index kicks in. So. on our base table, we have both of these, the timestamp and the media source as part of the index. And, you can see, the kind of speed that you get and also the efficiency or the effectiveness of the platform. So, really fast at finding data. 

If I do an increase here, so I go into one month, you can see it's still quite a good, linear scale. If I go to two months, we're going to see something around the second. So, still, still quite nice. This scenario uses both the primary index and the aggregating index. So, if I do an explain here, this is how actually, kind of the query optimizer and the planner sees this and what it produces as an output of our query is actually the query is hitting the aggregating index.  So, it's not hitting the base table, it's hitting the agg index. And, it's using some of the primary index columns, to find data, but actually, all of the calculations are done beforehand, they're kept in the aggregating index.

If I show you the DDL for this agg index, it's going to be pretty clear. So, here it is. This is the DDL. We define it on our base table. We define the columns, you see here, those six that we grew by, and then you can see the calculations. You can notice here, for example, Count Distinct was not in our original query, but still at this level of granularity, the optimizer is smart enough to serve data from the agg index because everything is there. In your agg index, you can have one or more calculations, not necessarily all of them need to be in your SQL query. What's also nice is, for the agg index to kick in, I don't need to change anything on the query side. So, if you see any slowdowns in the behavior of your queries, either from your data or from any of the, Looker or any other dashboards, you can simply add additional indexes. On top of this LTV table, you can add additional agg indexes and no need to change anything on the end user or query side. So this is another benefit of this approach.  So here in this demo, we've actually seen two things. So primary index, very efficient at finding data and then aggregating index, no need to calculate things on the fly, but basically serve data from the aggregate index directly.

Another thing that I want to show you here is, if I come in this side. So, let's try it again. You remember that this was around 0.9 seconds, on two months of data. Because we are a decoupled storage and compute architecture, what I can do here is I can switch engines. So, this was currently running on a $9 per hour engine. You can see the resources of this engine here. So, it's a C18, what we call _____[00:23:36.24] CPUs, 72 gigs of RAM and 900 gigs of SSD. I can switch this to a smaller engine. This one will cost me $4 per hour. It has less resources available but it also helps me save some money. So. what I can do here is I can switch over, run the query again, and now I'm at 1.5 seconds. But it's also costing me much less. This architecture enables you to make some much different engines to your use cases, and, you do not have maybe, really subsecond performance where you actually are kind of, exposing things to your end users, in data apps, external customer facing content, or, if you really, don't care so much about subsecond performance for internal BI, you can use a smaller engine and save some money there.

The final thing I want to show you here is the engine selection screen. I switch tabs, to our edit database screen. So here I'm in this demo database, you can see the engines that are attached here. So, five engines here, two of them are running. You can see them in green, and then I can add additional engines and really, this screen helps me configure, the different options, different specifications for the engine. So, I can choose from different engine types. We have engines that are optimized for concurrency with more CPU available. We have engines optimized for very intensive queries or for ingestion which are more memory-optimized, or RAM-optimized. And basically, you can pick and choose between a different very granular selection of engines. You don't need to go, in kind of T-shirt sizing, you can really adapt your engine towards your use cases and really focus on being efficient.

The other thing is for each engine, you can choose the engine scale. So, you can go between, 1 and 128 nodes. And whenever you kind of change the engine specs and the engine scale, you can see very transparently how much this will cost you per hour. So, we charge by the hour of the engine up time. Whenever the engine is inactive, it can be shut down automatically or you can do it from the rest API. But, in essence, the cost of this solution then is very, very transparent. This allows you to really focus on what matters, which is, in our case, scalability, performance, and kind of catering for high volumes of data at high frequency. Yeah, that was it. A very short demo. Any comments or questions here.

Tamar: I'm seeing some questions in the chat guys. Do you want to go ahead and ask.

Luka: Don't be shy? 

Tamar: Well, I can just read them out loud. So, there's one question from Dianyu, does Firebolt also support data storage in Azure? 

Luka: Not as of now. So, definitely Azure is on the roadmap. Our current, cloud provider is Amazon so AWS, but GCP and both Azure are on the roadmap. 

Tamar: Another question. Do we have the freedom to define the S3 bucket type for storage standard or IA or glacier, also can recreate backup of the data directly from S3?

Luka: So, how ingestion works in Firebolt is, data is coming in from your  S3 bucket. This is where you can actually change the tiers and choose the type of storage. Once it's in Firebolt, it's managed by us. So, you cannot play with those kinds of levers yet. Regarding data export, it can be done.  We have a copy to command, which actually some of our customers use to periodically export or back up some of the data. How it works is, you define a SQL query. You wrap it with a kind of copy to the clause, and then you export this to an S3 bucket. So, you can choose your file format and S3 bucket where data will be exported. 

Tamar: Cool! Giving another opportunity to just interfere and ask a question, otherwise, continuing to read them out myself.

Luka: Thank you, Tamar. 

Tamar: Another one is, are we only paying for a query run time or engine standby time as well?

Luka: So, it's the time that the engine is up.  So, imagine I have this engine up for 30 minutes. It's going to be $4.5. So, it's the engine uptime plus the S3 storage, which is a best price from Amazon. Once the engine is up, we don't care if you issue a hundred queries or two queries, it's up to you.

Tamar: How long does it usually take to activate an engine? 

Luka: That will depend on two things. One is the warmup method. Once the engine is starting, you can select to warm up indexes, which will bring the indexes into the SSD of the engine or you can warm up all data which is, again, bringing all of the data from S3 coping to the SSD and this means that all of the queries will be warm from the very beginning. So very, very fast. This is one thing that affects the startup time and the other is the data volumes. So, on a kind of blank canvas, it is going to take around five minutes to start an engine, and then, it can be 5 to 10 depending on the data volumes.

Tamar: Cool! Last one, before we go into the breakout rooms, how well integrated is Firebolt with DBT, especially usage of your indexes?

Luka: Awesome! Let's check this slide, which is our Ecosystem slides, so we can touch upon DBT and other things here. We have a GitHub page. I can find the link and share it later on where we actually list all of the different features we support from DBT. The primary index and agg index definitely we support, I'm not sure about the join index. I can check this and come back to you. But internally, we use DBT as part of our cloud data warehouse solution and we've been very successful in using DBT so far. Of course, as an industry, there's a large trend in using DBT and our focus is definitely there. So, the thing, we will get the join index and other features that might be missing will come for sure.  Apart from DBT, of course, we support different toolings. And really, I mean, when I look at this, I joined Firebolt around a year ago, almost none of it was here. We had a couple of integrations, we had Looker, etc. We had JDBC, but then you can see that really our ecosystem team is doing amazing work.

We are adding new connectors and new tools,  literally, weekly. We just published, for example, .Net and a Go SDK last week. But you can see there's a plethora of different tools and then SDKs that are available. so our focus is definitely to enable integrations, and enable usage of FireBolt from different tools. But also keeping in mind that we want to support programmatic access to Firebolt through rest API, through CLI, JDBC and SDK. So, this is one of the kind of core things that we focus.

Read all the posts

Intrigued? Want to read some more?

Curious to learn more?

Discover how Firebolt achieves extreme performance at a fraction of the cost