Data Architects Unplugged: Survival Stories From the Field

Firebolt architects are constantly on the front line. Their job is to take the biggest data analytics and engineering challenges from customers and come up with innovative solutions. Rob, Chris, Matthew, and Jay got together to talk about the craziest challenges they've faced, the main trends customers are chasing, and the 80s music albums that keep them going.


Jay: Hello folks. I'm Jay Rajendran, a product marketing manager at Firebolt. Today, we wanted to share some of our experiences from the field. This is from the field, for the field: how customers look at Firebolt and how customers are using Firebolt. I'm joined by three of our solution architects (SAs). This is a cast of characters here. They are a rowdy bunch, they have their own opinions, and I'm supposed to be playing moderator with these folks. So, having said that, let's get quick introductions from them and then we'll get started. Robert! 

Robert: Hi guys. I'm Robert Harmon. I'm one of the solution architects at Firebolt. I've been here a year as of today. 

Chris: Wow! 

Robert: So, that's a big day. It seems like so much more than a year because we've gone through so many amazing things. Prior to coming here, I was in the consulting space, strangely with Chris. He used to be my boss once. Before that I was out in the AdTech space as an individual contributor, managing some really large data warehouse projects. 

Jay: Over to Matt. 

Matthew: Yeah, it's Matthew Darwin, and I really don't mind Matt either. I'm also in solution architecture with Firebolt. I've been here for just nine months now. Prior to that, I was also in consulting for a little bit on cloud data platforms, and before that I was head of data and a DBA, primarily on SQL Server, for quite a lot of companies around here in the Manchester area. So, you can probably tell from my accent I'm from the UK, not from the US. That's the Manchester where it rains, the one in the UK, not the one in the US. 

Jay: By the way, the English accent provides so much credibility in tech. I have to say that. Chris. 

Chris: Chris Honcoop, been here a year and a half. So, I guess I'm the elder statesman and can still tell you what to do, Rob. Before that: data engineer, data architect, data warehousing, all that stuff, for longer than my lack of gray hair may show. 

Jay: Thank you, Chris. I know you guys are all very opinionated about stuff, and I want to get your thoughts on where the industry is today from a cloud analytics standpoint. I know we're going to see a vast array of opinions, but how do you see the landscape today? Any one of you can take this. Volunteers? 

Matthew: I'll jump in first. Typically, when you look at the data landscape at the moment, you're going to come across the modern data stack everywhere, and all you're going to see alongside it is logo after logo after logo of vendors and products. Rather than actually thinking about the problems people are trying to solve with data and getting back to the basics of data architecture, the approach a lot of people take is: what quick product can I pick off the shelf to solve this problem now and get moving forward quickly? That's where we see these data stacks where people have 20, 25, 30 products, and it becomes slightly uncontrollable. Perhaps at the moment there's a bit of a backlash against that approach, and it gets blamed on the vendors rather than on the people actually making the selections the vendors are selling to. So, I like to call it "easy mode": somebody wants to just build an entire data platform without thinking about any of the problems. That would be my take. 

Jay: Okay. Anybody else? 

Robert: I think Matthew hit on one of the primary issues that we're facing today. I hate to age myself, but I remember the nineties and two thousands, when things were manageable, and they're not anymore. The difference is partially our products, but we've also forgotten a lot of the strategies we used to apply to data to make it as simple and manageable as possible. Rather than face the hard work of "I need to rebuild the schema because it just does not work for what I'm doing," we throw products at it. And then we end up spending more. The tough part is that not only do we end up with more spend, we just added a product which complicated our platform. So, now we're going to go look for another product to solve the problem the last product just created, and this stacks up. To bring it back to Firebolt real quick, one of the things Firebolt does is that its efficiency allows us to simplify on so many levels. If we embrace that, we can collapse a bit of that product debt and get back to doing the jobs we were hired to do: managing that data and getting the accuracy that we need. 

Jay: I like that. One of the follow-ups I was going to have for Matthew was exactly that: is Firebolt becoming another logo in that stack, and are we compounding the problem? But you preempted that, so I appreciate it. So, my follow-up question is, if that's the case and we are truly solving a problem, what is driving customers towards Firebolt? What are you seeing out there? Maybe we can start there. 

Chris: Well, I wanted to step back a little bit. That's not to say there isn't a place for multiple tools. Firebolt does something very, very well, like finding the needle in the haystack or giving you aggregates. I mean, we're amazing there. But that's not to say we should replace Spark or something for that workload. Still, Matthew is right, it has gotten out of hand. You should have maybe three or four data things in your platform, not ten. And to follow up on what Rob said, half the time people look to new tech as an easy button when it's not an easy button, and it just makes things worse. We do see that. A lot of people come to Firebolt to solve their problems, and I think one of the main reasons I love working as an SA at Firebolt is that that's what we do. We work with you to solve your problems. Firebolt is part of that, but sometimes it's more than Firebolt, and we can take a fresh look at your whole stack and say, here's a different model, have you thought about this? We're really nice, I promise; we don't just come and dictate and tell you what to do. But that's one thing I like: I'm not just here to set up Firebolt for you and run your query. We take a broader look at things and say, hey, where's the problem? Let's attack the problem, not just shove Firebolt into your stack. 

Jay: Essentially, you are meeting customers where they are and providing value by understanding their problems. I love that. That's great! Going back to the key value propositions we typically talk about: speed, scale, and efficiency. Let's talk about speed first. What have you seen out there in your engagements, specific results? We don't have to mention competitors' names or other providers' names, but generally, what are you seeing in terms of improvements you've delivered? 

Robert: It's really hard to say straight up, here's the number, because every engagement is going to be different. Your data set is not going to be like the next guy's. So, we see a range of improvements on different workloads. And quite honestly, there are some workloads where we're not going to improve things, and that's okay; that's part of the game. On the workloads we do improve, we generally see anywhere from 10X to, what was the one we were talking about yesterday, Chris? We only made it to 80X faster, the best we could do. So, it really depends on the workload, and 80X is probably not to be expected in most places, but that group had a really well-designed schema. It was logical, it was tight, they knew what they were doing. So, it played to the product's strengths, it fell right in the wheelhouse, and off we go. Sometimes we don't have that. We published a recent blog article on an engagement I was in where the best we could do was 10 to 15 second queries. But when you look under the hood on that engagement, it's an insane request, a massive amount of data. You're not going to make it happen instantly. So, we're normally a little squirrely when we get into 10 second queries, but compared to the half hour it took on the previous platform, we're looking pretty good. 

Jay: Yeah, that sounds good. 

Matthew: I think these are great points as well. The fastest one I saw went from taking about 38 minutes down to about 0.15 seconds, which was fantastic, and the reason for that was just down to what we aim at, which is pruning. Rather than calculating over the entire huge dataset and then, right at the very end, limiting it down to the 50 rows they wanted to display in their app, we focused on getting those 50 rows as early as possible and then playing them back, and it's huge. That was just hilarious. We sat there going, can we compare the results? Can we make sure this is right? There's something not right here. How can you go from that to this? It was a freak query, very specific to why we would do well in that space compared to how they were running things, but that's where it comes down to, and that's the benefit of Firebolt. We're trying to process as little data as possible to get your results as fast as possible. It's not magic. We're not breaking the laws of physics or anything like that. We just process as few zeros and ones as we can, as quickly and efficiently as we can, and return that for you. On the flip side, I've seen people say, oh, I want this query to run sub-second to a mobile device in a dashboard, and then you look at the actual result set for that query and it's three gigs of data, and you start having to quote connection speeds and the speed of light. At that point you need to either reassess the size of the data you're producing or have different expectations. It's just not going to work, because of the aforementioned speed of light. 
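The pruning idea Matthew describes can be sketched in SQL. This is an illustrative example with invented table and column names, not the customer's actual query: the first shape aggregates everything and throws most of it away at the end, while the second filters on the columns the table is sorted by, so the engine can skip whole ranges of data before doing any work.

```sql
-- Before: aggregate the entire dataset, then keep only the 50 rows
-- the app actually displays.
SELECT   customer_id, SUM(amount) AS total
FROM     events
GROUP BY customer_id
ORDER BY total DESC
LIMIT    50;

-- After: filter on the leading primary-index columns first, so the
-- engine prunes down to the relevant ranges and aggregates only
-- what survives.
SELECT   customer_id, SUM(amount) AS total
FROM     events
WHERE    event_date >= '2022-01-01'
  AND    country = 'US'
GROUP BY customer_id
ORDER BY total DESC
LIMIT    50;
```

The point is not the specific predicates but the shape: the less data the engine has to touch before producing those 50 rows, the closer 38 minutes gets to 0.15 seconds.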

Chris: Sometimes we do get proof of concepts where it's, I'm joining a billion rows to 100 million, no filtering at all, and tell me how fast you can go. Nine times out of ten, that isn't a real workload. It's somebody saying, hey, this is the test I'm going to run to really see what Firebolt can do. And then they're upset. They say, you guys promised sub-second response times. Like Matt was saying, there are limits to how fast electrons can travel. Where we excel generally tends to be the real use cases. Like I said, nine times out of ten, those weird ones are folks trying to hypothetically see what the software can do. Firebolt tends to really focus on what you actually do: you look at data, and data by itself is meaningless. You're trying to get something out of it, and that usually means either getting an aggregate view or looking at small subsets to find the things you need. I think that's where Firebolt fits in, on BI, not running through the entire stack and joining giant tables. But in that one out of ten where the giant thing really is needed, sometimes we just say, hey, why not do this over here and pre-join it there, or maybe there's a way we can filter things and look at it differently as well. 

Jay: That's cool! So, there are a few things that I heard. One, it's all about real-life use cases and how we deliver value there, but at the same time there also need to be realistic expectations. We are not solving the speed-of-light problem. There are other things we focus on from a data perspective, we're really efficient there, and that's what produces results. Worth reminding customers of, for sure. But I want to expand on that. Everybody talked about response time, but there are other workloads beyond that. One of the things we talk about is concurrency; it comes up quite a bit. What's your view on concurrency? How is Firebolt delivering there today? 

Matthew: I've seen some really good concurrency numbers from Firebolt, and I think again it's down to that same point: because we're processing small chunks of data, there are an awful lot of small chunks you can process at the same time. So, say those query patterns are in that space where you have multiple dashboard requests, maybe a finance exchange chart or so on, and the actual query is super fast because of the way we're using an aggregating index to produce it, taking 0.2 seconds, 0.15 seconds, whatever it is. Then you put that into a real workload, see what people are doing with their phones and how they're accessing it, and actually look at what those concurrency numbers really are in the application. There's usually quite a stark contrast between what our customer thinks the concurrency is and what their usage pattern really is. We're at a point where we've got a query returning in 0.2 seconds, and the next query that comes in returns in 0.2 seconds. They think those are concurrent queries, and they're not; they're two sequential queries happening in the space of a second, with two users hitting a button at slightly different moments. We get them out of the door fast enough that concurrency's not even an issue. But then, we do have tools where we throw queries simultaneously at Firebolt, maybe not thousands, but hundreds of queries, to see how it handles them and where those latency periods come in. And if you hit a bottleneck on a single engine, at that point you can scale: you can add more nodes to your engine, or you can add more engines to service those different workloads. So, from a concurrency perspective, there are a lot of tools to work with. One is to make the queries fast enough that you're not even having a concurrency issue anymore when you look at what your users are really doing, and then look at those scaling options from there. 

Jay: Obviously the efficiency side of things does make a difference. If you're inefficient, you end up relying on crutches like auto-scaling to handle concurrency; if you're really, truly efficient, maybe you don't need those things. It changes the game quite a bit. So, that's cool. Let's switch gears a little bit. We talked about performance. Now, let's talk about efficiency. How are customers looking at efficiency? There's a lot of focus in the industry on cost itself. Where do you see Firebolt from an efficiency perspective: cost, processing, and all that? 

Chris: That's always a difficult dance, figuring out where you want to be on that: how many knobs you want to have and how many things you want to tweak, versus how much you want it to be the easy button. As we've matured as a company, we've gone around and around on that. Some people want to just throw anything at it, not think about indexes, not think about anything, and have it come back fast. Other people really like to turn the knobs, get under the hood, see exactly how it's working, and get the absolute most out of their database. I would say we generally try to be in the middle. You said not to mention names, but there is some technology where you're in there all the way down to choosing how the storage engine is going to interact, and you really have to set up everything. Then there are some other well-established, well-polished databases where you really can throw anything at them, and they tend to be very expensive. We're in the middle: you should be able to throw things at Firebolt and have it come back and at least match, if not beat, our competitors, but if you put in a little extra time and work, you can also be faster. And in general, having a few more knobs than some of our competition is what allows you to be cheaper. Take, for example, choosing the type of EC2 instance we run on, whether it's compute focused or RAM focused. Almost none of the customers I've ever sat with, profiled, and helped pick the right engine have chosen the balanced option. We have a balanced option, which is a jack of all trades, master of none. There are two customers I can think of that do use balanced, but almost everyone's patterns fall into an area that either needs more RAM or needs more compute; JSON needs a lot of compute, for example. 
And so that's one way I think that we can be more cost effective than some of our biggest competitors is really providing the right machine for their problem. 

Matthew: I'm thinking of an example we had recently with a customer, and it's a concurrency question. They had a particular query that is quite expensive in terms of RAM on the machine; one run of it takes about a quarter of the RAM available. So, if they run four of those concurrently, they get an out-of-memory issue on the engine they're running on. But because we have those different suites of engines available, we pointed it out to them and said, hey, we could go on and tune this query some more, we could do all of that stuff, which is cool. Or there's a switch here where we just change the engine from compute optimized to RAM optimized, and the difference in cost is 10 cents an hour. So, let's just do that rather than spend any more effort on it, because effort costs you time, costs you money, and you could be doing something else. We can switch you over to something that's 10 cents extra an hour, and now you can run eight of those queries, because the RAM has doubled. And you never, ever see eight of those running concurrently. So, there we go. 

Robert: I think there's more to it, though, than just Firebolt. Firebolt opens up opportunities for efficiencies that are difficult to achieve on other platforms. If your queries are running slow, or your BI platform is not achieving the performance your consumers want, what do we do? We fire up some kind of third-party app that creates a summary table out of our base table to make it go fast. Well, now there's a new use case that's based on that last use case. Now we end up with a pipeline on a pipeline on a pipeline; we've all seen this. All of that starts to accumulate more debt. Not only do I need all of these pipelines to make my end users happy, I have to keep track of them all. So, now there's more need for man hours looking into things like lineage and data quality, and this manpower explosion happens at that level. I think Firebolt is a really interesting platform for that. I can get that speed on the raw tables. I can toss an aggregating index out in like two minutes and the end user is happy, and there's no secondary pipeline. There are no update anomalies that happen because I did summarization. All of that stuff that wakes the pager guy disappears, and I can put my engineers back on something that's revenue generating rather than chasing the pager. I personally think that's a pretty cool trick. 

Chris: Firebolt is a lot cheaper per hour than people. We get some customers that are like, ah, it's got to be $2 an hour? I'm like, well, that's it. 

Robert: I used to say it as a joke, but it turned out not to be funny. Can you find me a data engineer? They're all gone. 

Chris: And they're a lot more than $2 an hour. 

Matthew: What was that you were telling me about today? Pineapples versus data engineers, what's the difference? Like pineapples on seats. Such a weird thing to say. 

Jay: So, you guys bring up some really important points. I just want to recap a few things I heard. For one, granular provisioning is important, and so is the control you get with Firebolt; on certain platforms you may not get that, and you end up overspending because the platform is inefficient. Then there's the people aspect: if I can move certain logic into the database and that completely simplifies the process, I save money. But there's another side to it as well. Operationally, as a SaaS platform, we're different, and that helps too. So, I wanted to explore the productivity front a little. What skills do you need to get started on Firebolt? If I'm walking in off the street, what do I need to have? Go ahead. 

Robert: It's surprisingly straightforward. Underneath all of the glitz and glamor, it's a SQL engine. It does SQL. If you know SQL, we're off to a good start. Now, unlike some platforms, we've reintroduced the idea of the index, so there's a little learning curve there. We've got three basic index types. One of them is very, very physical: the primary index, which is how we order every table. Then we have aggregating indexes, which seem simple at first, but once you start getting inventive with them, they can get really exciting. All they really do is let us maintain aggregates in real time for the end user, so they don't have to build those pipelines, they don't have to build materialized views and all that noise. And the third one is the join index. If you're old like me, this is not a new trick; join indexes work like covering indexes in Oracle or SQL Server, the same exact thing. So, it's not really exciting from the developer perspective, but from an operational perspective these things get really neat if you apply them in the right place. Again, the manpower sink elsewhere would be denormalizing all of our lookups because joins are hard on other platforms. Well, we have a join index, so we don't have to do that. We see a query that's joining something and it's slow: create a join index, call it a day. That learning curve is a really low barrier. If you can spend three hours with me, I can explain this and turn you into a pro. 
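As a rough sketch of the three index types Robert lists, here is what the DDL looks like. The table, column, and index names are invented for illustration, and the exact syntax should be checked against the Firebolt documentation for your version.

```sql
-- 1. Primary index: the physical sort order of the table, which is
--    what makes pruning possible.
CREATE FACT TABLE page_views (
    view_date  DATE,
    site_id    INT,
    user_id    BIGINT,
    duration_s INT
) PRIMARY INDEX view_date, site_id;

-- 2. Aggregating index: aggregates maintained in real time as data
--    lands, so there is no summary-table pipeline or materialized
--    view to babysit.
CREATE AGGREGATING INDEX page_views_agg ON page_views (
    view_date,
    site_id,
    COUNT(*),
    SUM(duration_s)
);

-- 3. Join index: like a covering index, it keeps the dimension
--    lookup hot so joins against it stay cheap.
CREATE JOIN INDEX site_lookup ON sites (site_id, site_name);
```

Queries don't reference any of these by name; the engine picks them up automatically when a query's filters, groupings, or joins match.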

Jay: Perfect. 

Chris: Yeah. I remember being pretty intimidated when I started. The two guys that came before me on the team, one came from AWS, another from a vendor, and they were asking lots of detailed questions in the interview. I was like, oh no, I'm not going to get the job, and I really wanted this job. But then I got in, and I had come from Azure, Postgres, even some Oracle; I came from that world. Two weeks later I was running my own POCs and doing work on Firebolt. I remember the day I went, wait a minute, it's still just SQL. The approach is a little different, and you can do some cool things with column-based databases that you couldn't do with row-based databases. But at the end of the day, if you know SQL and you know how to structure data, you can do Firebolt. That's the key thing: really understanding data. 

Matthew: But I guess that's the skill set that's been disappearing. Understanding data and understanding data modeling is disappearing. There is an element of: do your homework on data models. Go and read some books written in the eighties about data modeling, because a lot of that material solves problems, in terms of the modeling, in terms of why you do the things that you do. So, go read some of those books, then read some of the later books, and follow it through from a historical perspective. I think there's a whole cohort of people that started being data engineers around, say, 2012 or 2013, who have never worked with databases. They've never worked with relational models. They do all this stuff in Scala and think that's how data engineering and processing works, and they jump through all kinds of hoops, dealing with immutable files in cloud storage objects and all that kind of stuff, to produce their pipelines. Actually, you don't need to worry about that if you can sort your model out to begin with. So, there's an element of upfront design with Firebolt, but when you're looking at what you're doing with Firebolt, with your database, you also get those guys there helping you to do that. Or, if you're lucky, you get Rob or Chris. 

Chris: Yeah. A year ago I had a customer, for example, a major retailer, and they brought us their dashboard. Half the improvements we made were not because of Firebolt. I said, here, go run this on your Redshift. They ran it on Redshift, and their times were cut in half. The good news is their times were cut to 20% or less on all queries by going to Firebolt, so they said, well, if we're going to do this lift in Redshift, we might as well lift and shift to Firebolt at the same time. But there's a lot you can do just by taking a fresh look, and it was nothing amazing, no crazy tricks with indexes, nothing. It was just a function in the WHERE clause, that's all it was, wrapping the column so the engine couldn't use an index to filter. So, there is a little bit of a lost art; I know Rob and I talk about this a lot in the evenings, beers in hand. But it is exciting to see the industry overall swinging back towards data engineering. Maybe 5 or 10 years ago everything went full stack, and then people realized, wait a minute, there are skill sets we want in each area; no one can be a master of everything. Just like the balanced engine: jack of all trades, master of none, that's what you get. So, it is exciting to see the weight put on data engineering right now. Data science got really big and stole a whole lot of people out of data engineering, because they went, oh, this is the new cool thing. That's partly how we got here too. 
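For readers who want to see the shape of that WHERE-clause fix, here is a hedged sketch with invented table and column names. Wrapping the filtered column in a function hides it from the index; rewriting the predicate as a range over the bare column lets the index prune.

```sql
-- Before: the function is applied to every row's column value, so
-- the sort order of created_at can't be used to skip data.
SELECT order_id, amount
FROM   orders
WHERE  DATE_TRUNC('day', created_at) = '2022-06-01';

-- After: move the work to the constant side. The bare column is
-- compared against a range, and the index can do the filtering.
SELECT order_id, amount
FROM   orders
WHERE  created_at >= '2022-06-01'
  AND  created_at <  '2022-06-02';
```

The same rewrite pays off on most engines, Redshift and Firebolt alike, which is why half the improvement in Chris's story had nothing to do with changing platforms.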

Robert: I have to admit, I'm thrilled to death to see trends on social media where data engineering is now in vogue. We're getting a lot of press around modeling, and as a former data architect, this is great, although I did have a moment of, I told you so. 

Chris: A moment, just a moment. 

Robert: Maybe a few moments. 

Chris: I think you said that to me daily, Rob, come on. 

Jay: I've been trying very hard to keep him off the soapbox. I don't want him getting on....  

Jay: The last piece is that you guys are such purists. But what you're pointing out is that the fundamentals do matter. If you do the right things up front, there's a lot less pain afterwards. And the industry as a whole has, in some ways, forgotten that. In some ways it goes back to chasing the next new tool or the next new aspect because you're so scared of the big data problem; you forget the basics sometimes. I do believe that. So, what transitioning to Firebolt takes is SQL skills and data modeling, fundamental stuff. You should have that. 

Robert: Yes, and data modeling is no more important on Firebolt than it is on any other product. It's just that we're in a strange world where we're really trying to help the customer; whether they're on Firebolt or not, we want them to go fast. So, we often engage with the data model, and sometimes it works for us. Sometimes we make it go so fast that they don't need Firebolt anymore. That doesn't happen very often, but it does happen, and that's okay. 

Chris: And sometimes that's not people's fault. A lot of times what we see is that they set up a very good model in the beginning, then someone makes a change. Often they're changing their BI tool, not realizing that the BI tool is completely changing the way it writes the SQL, and all these little changes to the BI tool add up over time. That's why I'm a big proponent of a once-a-year audit of your data estate: just take a look at it instead of continuously adding new features, new dashboards, new widgets. A tune-up, I guess you could say, at least once a year; once a quarter would be better. 

Jay: All right. So again, switching gears. Rob, one of the things you said was that you have an AdTech background. My question is, we've seen a lot of AdTech customers leveraging Firebolt. Why is that the case? What is so different about AdTech? 

Robert: It's big! There are two issues I ran into, well, maybe three or four, but I'll start with a couple I hit in the AdTech space. About 20 years ago, I joined an AdTech startup. I didn't know the first thing about AdTech other than this is where all the data is. So, me being the egotist, I went chasing it, and it's an interesting problem because of the volume and the market. Data volumes are always increasing: the internet expands, which means more ad views, more ad clicks, and so on. At the same time, Cost per View (CPV), what you're getting paid for moving ads, is constantly dropping, and this has been true for 20 years. Say CPVs have dropped from 30 cents to 8 cents; I'm making these numbers up, but you get the idea. If you're an AdTech provider, this gets to be a very difficult game, because the data is increasing but the value of each row of data is constantly decreasing. So, you need to be very proactive about data ROI. That doesn't sound like a huge issue when you're dealing with just terabytes of data; you can work your way through it. When you've got five petabytes of data coming in every month and you're distilling it down to a petabyte every month, that compute is massive, and you really do have to dial in on the ROI of every row and every query, because your margins are dead thin. If you're a data warehouse architect in this world, believe it or not, you're spending more time watching your financials than the data warehouse itself, because you've got to make all this work. And that's where Firebolt can come in and help a lot. One place we run into trouble in the AdTech space is analyzing clickstreams. I know Chris has done this before; we've talked about it. Analyzing clickstreams on SQL Server back in the day was brutal, because there's no good way. You've got a million views of an ad. 
You've got another, well, probably 800,000 clicks. You've got 2,000 conversions, and now I need to find all of the impressions that became clicks that became conversions. Worse, I need to find all the conversions that didn't have a click or an impression, because those are click fraud. So, now I'm joining this huge clickstream table, billions and billions of rows, to itself three times to find my answer. That is an impossible ask for most platforms to do efficiently. In Firebolt, we were playing around with it. I was doing it not because I'm working in AdTech, but because it seemed like a perfectly good thing to throw at Firebolt and see what happens. Matthew saw it. I did some quick aggregating indexes over an array and crushed that down to a single select, no join, everything back in real time, instantly. That reduces the cost of that query a thousandfold, and that can affect an AdTech business. That's likely why they're drawn to it. I went too far. 
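The transcript doesn't include Robert's actual queries, so the following is only a hypothetical reconstruction of the shape of the problem, with invented names. The first query is the brutal version, the clickstream joined to itself; the second is one way an aggregating index could pre-compute the funnel so the question becomes a single select with no self-join.

```sql
-- The brutal version: a multi-billion-row clickstream joined to
-- itself to stitch impression -> click -> conversion chains.
SELECT cv.ad_id, COUNT(*) AS conversions
FROM   clickstream i
JOIN   clickstream c
       ON  c.user_id = i.user_id AND c.ad_id = i.ad_id
       AND c.event_type = 'click'
JOIN   clickstream cv
       ON  cv.user_id = c.user_id AND cv.ad_id = c.ad_id
       AND cv.event_type = 'conversion'
WHERE  i.event_type = 'impression'
GROUP BY cv.ad_id;

-- One way to collapse it: keep per-ad, per-user counts of each
-- event type maintained as data lands. Funnels, and conversions
-- with no click or impression (the fraud case), then fall out of
-- a single select against the index, with no self-join.
CREATE AGGREGATING INDEX clickstream_funnel ON clickstream (
    ad_id,
    user_id,
    SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END),
    SUM(CASE WHEN event_type = 'click'      THEN 1 ELSE 0 END),
    SUM(CASE WHEN event_type = 'conversion' THEN 1 ELSE 0 END)
);
```

Whether a given Firebolt version accepts these exact expressions inside an aggregating index should be verified against its documentation; the point is the technique of trading a triple self-join for an incrementally maintained aggregate.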

Jay: No, that's cool. I think this is important. Look, the massive volume of data and the processing complexity, you've got that, I think we're there. But then you have the other extreme of customers who basically say, hey, you know what? We are not an AdTech business. I have my 200 GB Postgres database and we're going to add some analytics. My point is, in AdTech, from a big data perspective, we have value-add technologies and it's a great fit. What do you say to the customers on the other end of the spectrum? 

Chris: The customers I've found go that way because of how easy it is. They're sometimes coming from on-premises, where they'd have to go buy new hardware and upgrade things — or even from some of the older cloud technologies. And they're just like, oh really, I can change that in a click? They're excited, because most of the time they have big dreams. No one's going to be at 200 GB forever. They plan on growth, as they should, and they like Firebolt for that part of it.

Matthew: Yeah, but there are other things that come into play as well. If you're looking at that 200 GB of data, Firebolt is going to compress it really well — it's going to shrink down to maybe 20, 30, 40 GB, something like that. And you can start running this on a really small, cheap engine. Like I said, we've got customers running on maybe a $1.30 engine, and they can load their entire data set into that engine, run any query they can think of on it, and it's fast. Whereas if you take that workload elsewhere and look at how you'd run it on something else, it will be more expensive. So it's not only that easy — quite often it comes down to cost effectiveness with Firebolt. An example: I've worked with customers that look at using, say, serverless technologies to do their querying. They start off by chucking a lot of Parquet files into cloud storage and running a query engine over the top of it. And they will immediately not get a query that runs faster than about two seconds — with that particular technology, it's just not going to go faster. Which is fine if they run that query once a day. Cool, brilliant. But then they stick a dashboard over the top of it, and now they're running that two-second query 10, 20, 30 times an hour, and each time they run it, they're paying by the scan. All of a sudden, this starts to add up — and they haven't optimized it. They haven't thought about how to partition and structure their data to reduce the size of those scans, so the cost keeps climbing. At that point you have to refactor. Compare that to taking the whole thing and running it at, say, $1.40 an hour, with the engine up for eight hours a day — you're at something like $10 to $12 a day in running costs.
It's actually quite a compelling argument, I think, for those small use cases. So yeah, Firebolt is going to do really well with the big stuff, but because it does well with the big stuff, it's also going to do really well with the small stuff, and we let you scale accordingly.
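Matthew's back-of-the-envelope comparison can be written out with the figures he mentions. The engine rate and hours come from the conversation; the per-scan price is an assumed illustrative number, since he doesn't quote one:

```python
# Always-on small engine, billed per hour (numbers from the conversation)
engine_rate = 1.40          # $ per hour
hours_up = 8                # engine up 8 hours a day
engine_daily = engine_rate * hours_up
print(f"engine: ${engine_daily:.2f}/day")       # engine: $11.20/day

# Serverless pay-per-scan, with a dashboard re-running the same query.
# cost_per_scan is a hypothetical figure for illustration only.
cost_per_scan = 0.05        # assumed $ per query scan
runs_per_hour = 30          # dashboard refreshes
serverless_daily = cost_per_scan * runs_per_hour * hours_up
print(f"serverless: ${serverless_daily:.2f}/day")  # serverless: $12.00/day
```

The point is less the exact crossover and more the shape of the curve: per-scan cost scales linearly with query frequency, while the engine cost is flat for however many queries fit in those eight hours.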

Jay: Good for the big stuff — the bigger problems, like AdTech, which are hard to solve, and then from a business perspective: if your margins are shrinking year over year, you have to focus on the data more. I like that data point. And then, obviously, the smaller stuff — easy to use, compression, all these things that add up as well. That's cool in terms of what we deliver. There is one thing you mentioned that I want to go back to, Matthew. You talked about a data lake with Parquet files and being able to use a query engine. I want you to expand on that a little more, because what we typically see is a conversation around a data lake or even a Delta Lake, and the expectation is that the Delta Lake is going to deliver sub-second query performance. What are you actually seeing there?

Matthew: Yeah, so Delta Lake is a fantastic approach to dealing with a problem. In cloud storage you have immutable files, and they're usually going to be Parquet files, because that's what people write there — and if you're using Delta Lake specifically, they have to be; it uses Parquet files under the hood. All Delta Lake is really doing is adding DML semantics to immutable files. So if I do an update, what actually happens under the hood is that I rewrite the file. I now have two files in the same place — double the volume there, and so on — and when I come to actually read, there's a manifest and a log that point to the current files to read from. In that space, we've had a customer who has embraced Delta Lake. I think they're using Spark streaming for that; they've got a lot of stuff coming in for their silver tables, as they call them, and it's great. They then wanted dashboarding on top of it. And this is where, with the particular tool they were using, the cache was great — they were getting good query performance — but as changes came into their tables, and this is the entire point of Delta Lake in the first place, those changes invalidated the cache, and their query slowed down from about 2 seconds to 30 or 40 seconds. The first thing that had to happen was a trip back to S3 to find out which are the current files, load those back into cache, and only then run the report. That was no good for them at all. So when we brought in Firebolt for them, we had some challenges to overcome, because at this point we don't have a direct Delta Lake connector — we were viewing all of the files in that partition to load into Firebolt.
So we did a bit of work there, actually using the manifest file — joining directly to the files it lists — which was great: load that in, really fast performance, and now all of those queries are fast, even with the updates they were having. They're keeping their update latency where they expected it, but now all of the queries are in the 0.1 to 0.3 second bracket and their dashboards are all lightning fast. So that's the space they've gone into — taking that existing, great technology forward. Spark is brilliant, and Delta Lake, I think, is a great thing for having a repository of all your data. There's nothing wrong with taking some data and dumping it there, so long as you know what you're dumping and why. If you don't have the downstream use case just yet, that's okay — keep it there. But as soon as you want to start using that data, you're going to want to do something with it, and that's where Firebolt really comes in. Firebolt can play perfectly nicely in that lakehouse architecture, over the top of a Delta Lake or a data lake. I've not looked at Iceberg yet — I need to pull Iceberg out and check how that works — but it's going to be the same kind of thing, because it's the same problem: immutable files are, guess what, immutable. You can't do updates and deletes on them, so at some point it's just writing a new file out. You need to load that in, and we'll be able to handle that with Firebolt and roll forwards from there.
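The "manifest and log that point to the current files" idea Matthew describes can be sketched by replaying a Delta Lake `_delta_log` directory. This is a simplified illustration, not what Firebolt does internally: it reads only the JSON commit files and applies `add`/`remove` actions, ignoring Parquet checkpoints and protocol/metadata actions that a real Delta reader must also handle:

```python
import json
from pathlib import Path

def current_files(table_path: str) -> set[str]:
    """Replay a Delta Lake transaction log to find the live Parquet files.

    Simplified sketch: reads the JSON commit files in _delta_log in commit
    order and applies 'add'/'remove' actions. Real readers also process
    checkpoint files and other action types.
    """
    live: set[str] = set()
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live
```

This is also why an update invalidates a naive cache: after a rewrite, the log removes the old file and adds a new one, so any reader that cached the previous file list has to go back to storage and re-resolve before it can answer queries again.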

Jay: Cool. I think a lot of customers are looking at these technologies, so it's great to hear there's an option from a Firebolt perspective within the lakehouse architecture. I only have a couple more questions, real quick. If a customer is starting down the Firebolt route, how would you help them navigate that? They're new, they're coming in. What is a typical approach from an SA standpoint? How do you guys approach it? Do I pick a person? Chris, let's go.

Chris: How we approach it at Firebolt right now — not saying this will always be the case — is very hands-on. One of the concerns we get from a lot of our customers, for example, is: I can't just Google it if something breaks in Firebolt. So right there, that means we are their Google. We are their Stack Overflow. We are whatever they need. As I say, we take a very hands-on approach right now, and it really depends on the customer. Like you said, we have people with 200 GB. We have people coming from Postgres, from on-premises. We have people coming from all over the place, with completely different backgrounds and completely different expectations of Firebolt. Our first step is to find out whether Firebolt is a good fit for them. We don't want to spend all those hours working with them if we can say right away, hey, this isn't the best technology for you. There are times we will recommend other things if it's not the right fit. Then we go through a POC process. None of us on the team likes — sorry, marketing — throwing out a slide that says 600x improvement, all that stuff. We hate that stuff. Instead, 95% of the time we load a potential customer's data directly onto our servers and run some queries, ideally provided by the customer. But sometimes a customer doesn't really know — it's a new workload they haven't done before — and then we write the queries, so there's no hiding. And we show you: this is what it's running; this is what it looks like with your data. Yeah, that's a lot of work. That's why the three of us on this call, plus half a dozen more, are very busy folks, but we believe it's the best approach. There's no hiding, and when you're paying by the hour, most of the time you can't hide anyway. So if we do all this work and somebody says, yeah, I like Firebolt —
There's no dotted line to sign on saying here's my five-year contract or whatever. You're paying by the hour. So, this is what works for us right now. 

Jay: Anybody else? 

Matthew: I think it's great as well when you do those playback sessions and the customer actually sees: oh, these are my queries, and it looks like my data, and hang on a minute — I'm comparing these results to my dashboard that takes 40, 50 seconds to load, and you've run that query and got the same results in, you know, a second, sub-second, whatever it is. It's really cool, and the response is always very positive. The other thing it does is act as a step forwards for getting that workload into production. So it helps us and it helps the customers — it's a win-win situation. We're at a point where we can say: here's everything we did, here's the code we built for you; you can take this, roll forwards with it, start to implement it, and really reap the benefits, which is cool. So I think it's a great process. It's not always easy — we sometimes get some odd challenges and have to rework things to make them work — but it's pretty cool. I have one where the POC was a single query, building the results for a data science training model, and it took like 30 minutes in Firebolt; that was the best I could do with this query. But I was processing a range of data 200 or 300 times larger than they'd ever been able to run before. So it was like: okay, it takes 30 minutes, sorry guys, I wanted to get it faster. You've never done this before? It just doesn't work on anything else. They went from producing a model on an hour's worth of data to producing a week's worth in exactly the same way. So yeah, it's cool. It's a fun process.

Robert: I love those, Matthew. Every so often I get one — I just had one recently. I said, look, guys, I can only squeeze this down to three seconds; it's the best I can do. What's your current query runtime? Well, it's impossible now, so I guess we'll go with yours.

Chris: I have had customers give me fake numbers before — here's what it was. And then we show them, and I'm like, I'm sorry, we barely squeaked out above the numbers you get now. And they're like, that's okay, we made up those numbers — you crushed it. And I was like, oh, come on, guys.

Robert: In this case, it was real data. It's just that it was so big they couldn't push it to production anywhere else, so they came to Firebolt, I gave it a shot, and — I guess that'll work. But yeah, sometimes you ask how much improvement you can get; sometimes we start with the impossible and get to good, not necessarily great. The jump from impossible was a little tough.

Jay: We covered a lot of different topics, and I really want to thank you guys for jumping in and sharing your experiences. So, my last question for you — nothing data engineering related — your favorite eighties music album or group is …

Robert: Anthrax. 

Matthew: I'm going with Sonic Youth, probably. 

Robert: Oh, wow. 

Jay: And Chris? 

Robert: He's too young to remember 80s. 

Chris: I shouldn't be ashamed to say it, but I did listen to a lot of Depeche Mode. I liked them a lot, and I listened to U2 — even though they were overplayed at the time, when I go back now and listen to The Joshua Tree, it doesn't sound like the eighties. So at least eighties U2 — I don't know about current U2. Sorry, guys, if you're watching.

Jay: I love it. I love this because you get a very diverse group, cross continent, diverse music preferences. So, on the Depeche Mode note, I'm gonna finish this up. Let's enjoy the silence. Thank you very much for joining guys. 
