According to Yoav Shmaria, VP R&D Platform at Similarweb, the best way to manage data warehouse costs is to tag every table, database or ETL running to have good granularity over every feature. Besides handy cost management tips, Yoav walks the bros through the tech stack he implemented to analyze 100s of TBs of web data to serve fast customer-facing analytics. Full disclosure, Similarweb is a Firebolt customer, but the bros kept it objective, and there’s no Firebolt talk in this episode. Let's go!
Listen on Spotify or Apple Podcasts
Guest: Yoav Shmaria - VP R&D, SaaS Platform - SimilarWeb
Hosts: The Data Bros, Eldad and Boaz Farkash, CEO and CPO at Firebolt
Boaz: Welcome to another episode of the Data Engineering Show.
Eldad: Welcome everyone.
Boaz: Eldad, How are you?
Eldad: I am great. Yeah, it is always good to visit and feel again.
Boaz: Yeah, you did not come over for Shabbat dinner last night.
Eldad: It is the only place we meet, on podcasts.
Boaz: With us today is Yoav Shmaria - VP R&D, of the SaaS Platform at SimilarWeb. Yoav, how are you?
Yoav: Never better. How are you guys?
Boaz: Thanks for joining. You are from SimilarWeb. If you have not heard of it, SimilarWeb is an amazing data product: a market intelligence and research platform that has been publicly traded since last year. Yoav, tell us a little bit about your journey. How did you get to where you are at SimilarWeb, and how did your career evolve?
Yoav: My personal journey in SimilarWeb?
Boaz: Yes. Even before. Because you have a nice mix of software engineering, data technologies, and now management of R&D groups.
Yoav: Yes. I started my journey as an entrepreneur during my studies. I had a hand in a bit of everything, from full stack development to marketing and sales, and we had some specialty around payments and registration platforms. My career at SimilarWeb started as a Frontend Developer, nothing to do with data. I have a very strong connection with product management and business management, and I think these capabilities led me to start managing within the organization. I have been at SimilarWeb for six years already; I joined when my older daughter was six months old and now she is six years old, so that is quite a journey. For the last two and a half years, I have been leading the R&D group of our main B2B product. We call it Pro internally. When I started my current role, we were a bit fewer than 30 employees. Today, we are over 80; 15 of them are located in Ukraine, the rest in Tel Aviv. I think my biggest leap was when I started leading the entire group. I had to deep dive into the entire data ecosystem in our organization, which was, and still is, an amazing journey. In my group, I have two main positions: backend engineers and frontend engineers. What is special about our group in SimilarWeb is that the backend developers are doing a kind of full stack backend engineering. We call it data server. It is a mix of data engineering, focusing on ingesting data into different databases, and the backend layer for the API.
Boaz: Before we dive into that super interesting stack which we want you to dive deep into, tell us in your own words about what SimilarWeb does and the role data plays at SimilarWeb?
Yoav: The key to understanding SimilarWeb is that we are not giving you insights about your own data. We are giving you insights about everyone else's data, and the simplest way to describe it is Google Analytics for the entire internet, which it is. We provide analysis for almost any domain and also for mobile apps. We divide our use cases into five separate solutions, starting with general research. Today, most of the marketing budget goes to digital, not to print and not to radio. The journey usually starts with research, when we make strategic decisions based on what my market looks like. I want to get into a market: who are the main players, in which countries they operate, and what the audience looks like.
Boaz: If I am, for example, a digital marketer at Zava, a clothing company, what do I do with the SimilarWeb application?
Yoav: First of all, you would like to explore the fashion industry worldwide, maybe to spot trends around rising countries or rising audiences, to find new competitors, or to understand that the market is getting a lot of traffic from, let us say, display ads. Let us say that, on average, most fashion websites get 10% of their traffic from display ads and I am getting only 2%. So, maybe I am doing something wrong. Maybe the fashion audience loves banners. This is very, very high-level market insight. Then we go into competitive research. We call it digital marketing: optimizing each channel, for the SEO manager, the PPC manager, maybe an affiliate manager. These personas take tactical actions, like tracking search engine positions, tracking campaigns, understanding competitors' performance, benchmarking Zava.com against any other brand, and understanding the differences in terms of traffic performance.
Boaz: Given this is such a data-centric product, let us go back again to your team. What do you call them, the data backend team?
Yoav: Data server.
Boaz: Data server.
Yoav: In my group, we are doing everything. I mean, it is data engineering, backend, frontend, and QA, but I do not have a single backend developer doing only API coding, with separate data engineers next to them. It is a full stack data server engineer, and we changed to this four years ago. We used to have a separate group for data engineering and a web group for backend and frontend, and we realized that, especially for a product like ours, it is super important that the person who models the data and the person who serves the data are the same person. This is how we started to bring the data engineering stack to our backend developers.
Boaz: What titles do the data engineers in that group have? Do they call themselves software engineers or do they call themselves data engineers?
Yoav: In our HR system, it is data server engineer, but on LinkedIn, I guess most of them are somewhere between backend engineers and software engineers.
Boaz: So it is a software engineer, building data-centric products.
Yoav: Yeah, we do separate the frontend from the backend, but the backend is not a trivial backend at all.
Boaz: That is super interesting. We love that. Go back, let us say, 10-15 years. If you asked a software engineer to build a database, a data warehouse, or a query engine kind of project, people used to look down on that as something beneath their pay grade. Today, the most interesting software projects, the hottest thing in software, is to build data-rich applications, and it is a huge shift in the market.
Yoav: Yeah, I love to say that in SimilarWeb we are doing data engineering upside down. In most organizations, I, and probably you also, find that data engineering is some offline processing. It is reserved for BI and analytics. It is not part of the real production. Maybe we are hosting some database with a lot of data to serve something, but we do not have an entire data engineering operation for production. In SimilarWeb, it is the opposite. We are analyzing the entire internet, and we have this funnel to give real-time insights to our customers.
Boaz: Let us talk about the data stack. What runs there in the data server?
Yoav: Everything. I will not deep dive into the entire stack of our data collection methods and machine learning and everything; as I said, it is data engineering all over the place here. But in my group, we share the Data Lake, which is a very nice name for tons of files, and we are still on AWS. We are using the AWS Glue Catalog. And a very important detail for SimilarWeb: we are managing a branch system, a data versioning methodology for our data. That is because we have multiple algorithms running to calculate our traffic estimations, and we change them from time to time. We are always running with some master branch for our data, and then we have a mechanism in production to show our customers a better branch. So, look, we upgraded our estimations, take a look. We validate it for some period of time, and when we decide it is good enough, we need to make a decision: either we start a new breaking point with the new algorithm, or we rerun three months or three years of data. Very soon, in the next quarter, we are going to rerun the entire three years of our mobile web data.
Eldad: What you are saying is the customers are moving away from software versions to data versions. To them, that is much more interesting to be able to play with versions and to be able to kind of understand the data that is driving the results, and that is also a very big shift from how engineering looks at the world. I wanted to ask someone with your background, what was the experience in trying to really reorganize an engineering organization that is more traditional, looks at data as something that is decoupled? How do you reorg from 20 engineers to 80, but also not just grow in size, but really rewire? How does that organization operate on data?
Boaz: How was the experience?
Yoav: I think something that helped me is that we are running, in my group, a very strong horizontal teams methodology. That means we have a matrix organization, and there are a lot of methodologies for that in HR development. What we do is that I have a data server team, and each team member, or a couple of them, takes part in a different squad's delivery. This helped me keep the knowledge in this team. So we are not losing control, but in every squad everybody does whatever they understand best. So we have a very solid layer of data engineering. Now, what did we change when I got into the role? I think we had two ETLs running on a very immature Airflow. What we did, and that is a great lesson, is that we started to take two or three engineers every quarter from all around the squads to deal with the infrastructure. The two main tasks we took on when I entered the role were these. The first was creating solid infrastructure for an ETL abstraction, so that every new engineer can pretty easily run a trivial ETL. We are not talking about a very complex stack, but, let us say, we are using DynamoDB a lot. If I want to just take a very simple collection from S3 buckets and write it to DynamoDB, it would be like two hours of work, and you have a running ETL supporting branching, everything. It was a big deal. We invested a lot in it. The second was to find more databases that we can serve in production, because we use HBase a lot, which is amazing for concurrency and really, really fast for querying the data, but the structure is very key-value oriented. You are keeping a very dumb data structure for a specific key. You can scale, but you cannot really run complex analyses over the data. So, these were the two main initiatives that we took with a kind of virtual infrastructure team.
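A minimal sketch of what such an ETL abstraction could look like, assuming a job is just a reader, an optional transform, and a writer, with the branch baked into the target table name. Everything here (`EtlJob`, the `name__branch` naming, the in-memory stand-ins for the S3 bucket and the DynamoDB table) is hypothetical and is not SimilarWeb's actual framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def _identity(record: Record) -> Record:
    """Default transform: pass the record through unchanged."""
    return record


@dataclass
class EtlJob:
    """Hypothetical 'trivial ETL': read records, transform them, write to a branch-aware table."""
    name: str
    branch: str
    read: Callable[[], Iterable[Record]]
    write: Callable[[str, List[Record]], None]
    transform: Callable[[Record], Record] = _identity

    def target_table(self) -> str:
        # Assumed naming convention: the branch is a suffix of the table name.
        return f"{self.name}__{self.branch}"

    def run(self) -> int:
        rows = [self.transform(r) for r in self.read()]
        self.write(self.target_table(), rows)
        return len(rows)


# In-memory stand-ins for the source bucket and the serving tables.
fake_bucket: List[Record] = [
    {"domain": "example.com", "visits": 120},
    {"domain": "example.org", "visits": 45},
]
fake_tables: Dict[str, List[Record]] = {}


def read_bucket() -> Iterable[Record]:
    return list(fake_bucket)


def write_table(table: str, rows: List[Record]) -> None:
    fake_tables[table] = rows


job = EtlJob(name="traffic", branch="master", read=read_bucket, write=write_table)
written = job.run()
```

In a real setup the `read` and `write` callables would wrap the storage clients; keeping them as plain functions is what lets a new engineer define a job in a few lines.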
Boaz: What data volumes do you guys deal with?
Yoav: It depends. It is basically per feature. I think our largest database in production is around 150 terabytes, compressed, of course. This is for a part of our big data research product and in the digital marketing for the pure dataset. We also have a data set that is some dozens of terabytes, but it is crossing the trillion rows of data. Other than that, we have more than 100 ETLs of data ingestions running every day or month. We have, I think, over 30 or 40 tables in DynamoDB, and in general, I think, we are running over a petabyte in production and that is, of course, only what we sell, which is three years of data and we have more in our Data Lake.
Boaz: You mentioned the versioning of the data earlier, and I think that is something that is becoming more interesting. What would you recommend for people who want to go in that direction? Because it feels like there is no clear market standard on how to go about it. People are still trying to figure it out, and there are even startups and companies being built around that area, like lakeFS. What is your take? What would you recommend?
Yoav: Wow! A take on data versioning. First of all, let us understand the challenge. Because, theoretically, you can say, I have a new version, let us overwrite. The problem is that sometimes you would like to show customers two different versions in production, like we do. And, of course, before you release to production, you would like to have some staging product that you can test. So, you want to be able to work seamlessly with those two versions, maybe more. We are doing a lot of infrastructure work on it right now, but I think the main area to plan for is the actual serving layer, the databases, because in the Data Lake, as you say, with lakeFS and so on, you have a lot of methods to manage data versions. But, eventually, let us take DynamoDB as an example. I have one table, and now I have another version. So, do I want to manage a table version? And then, if I want to take 12 months from one and some months from another, that is one kind of complexity. We use prefixes or suffixes a lot, like the branch name. But you need to design it in advance, because otherwise you will get into a situation, and again, I am talking about the serving itself, where you want to write only part of the data, for the new branch, and you need to think about how the two merge together. In big data, if you want to do a union or a join, it is not always a trivial task, and you can lose performance for that. So, I would say modeling the serving data is the most complex part, because, as you say, the Data Lake now has some solutions for versions. But when we are talking about serving, that is not trivial at all. We are still struggling, by the way, and it is even more relevant and more complex when you are running some analytics engine on the data.
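The branch-suffix idea, and the problem of serving one series stitched from several branches, can be sketched roughly as follows. This is only an illustration under assumptions: the `base__branch` naming, the month-to-branch mapping, and the in-memory `tables` dictionary are all hypothetical stand-ins for real serving tables.

```python
from typing import Dict, List, Tuple


def table_for(base: str, branch: str) -> str:
    """Assumed convention: the branch name is a suffix of the table name."""
    return f"{base}__{branch}"


def plan_reads(
    base: str,
    months: List[str],
    branch_by_month: Dict[str, str],
    default_branch: str = "master",
) -> List[Tuple[str, str]]:
    """Decide, per month, which branch table should serve it."""
    return [
        (month, table_for(base, branch_by_month.get(month, default_branch)))
        for month in months
    ]


def stitch(
    plan: List[Tuple[str, str]], tables: Dict[str, Dict[str, int]]
) -> List[Tuple[str, int]]:
    """Build one series from several branch tables, following the plan."""
    return [(month, tables[table][month]) for month, table in plan]


# Only the rewritten month lives in the new branch; the rest stays on master.
tables = {
    "visits__master": {"2022-01": 100, "2022-02": 110, "2022-03": 120},
    "visits__algo_v2": {"2022-03": 131},
}
plan = plan_reads(
    "visits", ["2022-01", "2022-02", "2022-03"], {"2022-03": "algo_v2"}
)
series = stitch(plan, tables)
```

The sketch keeps the merge in application code for clarity. With large tables the same union may have to happen inside the query engine, which is where the performance cost Yoav mentions shows up.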
Boaz: What do you mean by struggling? You told me everything was perfect.
Yoav: Nothing is perfect.
Eldad: The versions are related to the models, right? Different model versions require different data versions, which are translated into a feature version, and the challenge is there at every step. You mentioned the serving part, but it starts much earlier. This is why versioning in lakes is becoming such a hot topic. It is connected to ML, it is connected to models.
Yoav: And, of course, you need to hold some metadata store for your versions. Now, I have multiple versions all over the place, somebody needs to hold this information.
Boaz: Which, in your case, is what metadata store?
Yoav: We are using an in-house solution. It is a simple relational database with some services on top that we maintain. But again, think about it. It is something that is in your critical path, right? Because, unless you have some cache in place, you always need to go through this service to ask, wait, where do I get the data from? This is your router. So that is also a critical part.
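Because the metadata store sits on the critical path, one common way to take pressure off it is a small time-to-live cache in front of the lookup. The sketch below assumes a hypothetical `lookup` function standing in for the metadata service; the class name, the TTL value, and the injectable clock are illustrative choices, not a description of SimilarWeb's router.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedRouter:
    """Resolve dataset -> serving table, caching answers for a short TTL."""

    def __init__(
        self,
        lookup: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup          # stand-in for the metadata service call
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so the expiry can be tested
        self._cache: Dict[str, Tuple[float, str]] = {}

    def table_for(self, dataset: str) -> str:
        now = self._clock()
        hit = self._cache.get(dataset)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]              # fresh entry: skip the metadata service
        table = self._lookup(dataset)
        self._cache[dataset] = (now, table)
        return table


# Fake metadata service that records how often it is called.
calls: List[str] = []


def fake_lookup(dataset: str) -> str:
    calls.append(dataset)
    return f"{dataset}__master"


fake_now = [0.0]
router = CachedRouter(fake_lookup, ttl_seconds=60.0, clock=lambda: fake_now[0])

first = router.table_for("visits")
second = router.table_for("visits")   # served from the cache
calls_before_expiry = len(calls)

fake_now[0] = 61.0                    # move the fake clock past the TTL
third = router.table_for("visits")    # triggers a fresh lookup
calls_after_expiry = len(calls)
```

The trade-off is staleness: a branch switch only becomes visible to a given process once its cached entry expires, so the TTL has to be chosen with the release process in mind.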
Eldad: Wired into the product versus just kind of looking at different versions to figure out which report.
Yoav: I do not always want to rewrite my entire data. I would rather write only parts of it. So, I want to hold the changes somewhere from which I can seamlessly serve, let us say, a perfect graph that is actually combined from five different versions of data.
Boaz: Let us go down that path, testing. So, how do you go about testing your data?
Yoav: Again, testing is all over the place, as part of the modeling, of course, data rating, and everything. In our part, we have a very large stack of automation testing that runs on the product and looks at the final results. What is important, again, in our life is that we have a window release and a snapshot release: a daily release of daily data, and then a snapshot where all the monthly calculations take place. We have nightly tests making sure, first of all, that we have a full sync between our data lake and our production databases, and that we are showing new data. It is bound to happen, if you have, I do not know, hundreds of data sets, that somewhere you will get a zero. So, we are always testing that we are not missing anything. This is the critical test I am always talking about with my teams. Please, when you are releasing a feature with new data, make sure that next month we will see numbers. It is the dumbest test, but you always lose something. I think this is the main area. We have multiple tests in place for Linux, but I do not think it is the most relevant now.
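The "dumbest test" described here, that every dataset in the lake shows up in production with the same non-zero row count, could be sketched like this. The function name and the dictionaries of row counts are hypothetical; in practice the counts would come from queries against the lake and the serving databases.

```python
from typing import Dict, List


def nightly_sync_check(
    lake_counts: Dict[str, int], prod_counts: Dict[str, int]
) -> List[str]:
    """Compare per-dataset row counts between the lake and production."""
    problems: List[str] = []
    for dataset, expected in sorted(lake_counts.items()):
        actual = prod_counts.get(dataset)
        if actual is None:
            problems.append(f"{dataset}: missing in production")
        elif actual == 0:
            problems.append(f"{dataset}: zero rows in production")
        elif actual != expected:
            problems.append(
                f"{dataset}: lake has {expected} rows, production has {actual}"
            )
    return problems


# Illustrative counts only.
lake = {"keywords": 1000, "traffic": 500, "referrals": 200, "audience": 50}
prod = {"keywords": 1000, "traffic": 0, "referrals": 198}

report = nightly_sync_check(lake, prod)
```

An empty report means the release can go out; anything else names the dataset to look at first.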
Eldad: Testing is moving from units to data units and results.
Boaz: At the end of the day, you are delivering an experience to users. They slice and dice, they look at data from different angles, and the user experience is great and fast. But what are your guiding, I would say, product principles? How do you think about the user experience? How slow does a query have to be before it is not good enough to be in the UI? How much does engineering need to be pushed to come up with ideas to make things smooth from an experience perspective?
Yoav: There is general user experience science, and there is also the part of how easy the product is to use. In SimilarWeb, we are basically talking about time to insight: how many actions, and how long, does it take to get an insight from those huge data sets? In terms of performance, I believe, and Google has a lot of articles about it, that at some point, let us say around five to ten seconds, you are starting to lose the user, especially now that we are so used to instant messaging. The user will wait for a report, but they will not play with the data, slice it over and over again, and drill down if every action takes 10 to 15 seconds. It will not happen, and we see it in the data. We did several evolutions in the last year or so, taking a dataset from operational databases, or starting a POC with a slow engine, and then moving it to a more robust cluster or whatever the solution is, and we see the difference. We see more queries per user as we increase the performance. That is the performance side. The other part is how we take the most relevant piece of data and expose it to the customer as soon as possible. We have not cracked it perfectly yet, but we are taking more and more steps toward it, and the onboarding experience will be that you define who you are, what your market is, and who your competitors are, and we will give you the most interesting part of the data.
Boaz: Another question I had in mind. How do you guys hire engineers for the data server team, given that the skills you require are not trivial? What is the strategy there?
Yoav: So, in general, hiring is not easy these days. Hiring backend developers or data engineers is really tough, especially experienced ones. We are focusing on hiring smart people, good engineers with one of the aspects of our stack and we have an understanding that we will have to have this ramp up and learning curve for the other part. So, if we have a strong backend engineer, we want to believe that he will be able to pretty quickly take over our backend stack. And, then we will give him all he needs to know about data engineering and if we are getting a strong data engineer, we believe that he will be able to track with our backend stack. So, I do not think we ever hired someone that is fully stacked with both data engineering and backend development.
Boaz: So for that team, essentially, you are looking for both, you would get both data engineering-oriented people and backend engineers.
Yoav: I think I will tell my backend team that part of my problem is that once you are in SimilarWeb, we are like a superstar. You can do everything. So, it is a very, very desired material in the market. But yeah, our stack is not trivial and again, I think also, BI developments got more and more complex over the years, but they usually have a different mission because most of the processing is happening offline and we are taking care of a really big stack of the production data stack, which makes the challenge bigger.
Boaz: Over all these years of running a data stack in production, which were the most sensitive, error-prone, or risky areas, the ones that ended up causing more errors in production than you had hoped for? If you could go back in time, knowing what you know now, which errors would you have tackled to have fewer incidents in production?
Yoav: You were asking like, what should we improve?
Boaz: I mean, at the end of the day, the incidents you guys had that affected the experience, were they more about data versioning, the ETLs, serving, schema mismatch issues? I do not know. What were the most sensitive areas in retrospect?
Yoav: I think, above all, it is important to say that we are running fully redundant. We maintain two regions, with a hot backup across them, because things happen. We are managing an HBase cluster, and things happen, so we have a very, very structured methodology for shifting from one region to another. That is something we know how to do. It usually happens with machine errors, and it happens at other times too. We also realized that data validation, again, between the Data Lake and the actual database, is a very, very critical thing. Again, our customers are buying data from us. That said, it is a very bad experience if you query data and get one response, and the next day you get a different number. This is also something that we experienced, by the way, in various areas. It can be that we had some historical bug when loading data, because we write in batches. We have a daily batch or a monthly batch. Writing to DynamoDB, we had a bug in the Spark client, and we realized that we were just losing rows on the way to Dynamo, and it was not logged very well. In that case, you have keys in the Data Lake that do not exist in production. It also happened when we loaded data that was not synced well, so we did not have the same row count or the same aggregation count between the query engine and our Data Lake. So, we added many tests around these areas. And, eventually, you would be surprised, even the UI can create issues, because sometimes, doing teamwork, someone says, "okay, it's just a simple average. I will do it in JavaScript." But then the customer may get a different experience, because one developer wrote the code for the client, for the graph, one for the Excel export in the backend, and one for the API, and then you have some tiny number differences where you would usually say, who cares, but our customer cares.
It is important to mention some of our customers, they have many data analysts that get our data in batches, from API and they are doing the math. They do not need the UI. So, in their world, if we have tiny biases all over the place, it is a big deal.
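One way to avoid those tiny differences is to compute each aggregate once, in one shared backend function, and have the graph, the Excel export, and the API all consume its result. The sketch below is only an illustration: `shared_average` and its rounding policy are hypothetical, and the naive version stands in for a second implementation written separately for the client.

```python
from math import fsum
from typing import Sequence


def shared_average(values: Sequence[float], digits: int = 4) -> float:
    """Single source of truth for an average shown in UI, export, and API.

    Uses math.fsum for accurate summation and one fixed rounding policy.
    """
    if not values:
        raise ValueError("cannot average an empty series")
    return round(fsum(values) / len(values), digits)


# Ten daily traffic shares of 0.1 each (illustrative numbers).
shares = [0.1] * 10

# A separately written "simple average" accumulates floating point error.
naive = sum(shares) / len(shares)

# The shared implementation returns the value every surface should display.
shared = shared_average(shares)
```

The difference between `naive` and `shared` is far below anything a chart would show, but an analyst comparing API output with an export will notice it, which is the point made above.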
Boaz: Okay. What parts of the stack today do you consider legacy and plan to phase out in the next year or so?
Yoav: I do not know what we will phase out. We are doing a lot of exploration around changing the way we maintain the HBase cluster today. It is technical debt accumulated over the years. We used to have some very nice hacks in the old versions for how we prepare the data files for HBase and how we manage the clusters. So, this is something we are working on intensively. Other than that, we need to be able to serve as much as we can from the same data source for features that use the same data. Pre-calculation, again, is great in terms of performance, but the developer experience can be very bad. It happens a lot that you need one more aggregation over an existing dataset. So you have this dilemma: either we calculate it again, and then you have hundreds of these all over the place and it is really hard to track, or you execute the query on the fly, and then the performance penalty is always the question.
Eldad: AAP, aggregation analysis paralysis.
Boaz: Awesome. What did we cover, by the way?
Eldad: Did we cover huge failures while I was away.
Boaz: Maybe you could say because we talked about a variety of things that have gone wrong but we can do that again. If there is one single sort of meltdown that you remember, that you would want to warn others to learn from your mistake and to not get into it again, what would that be throughout the years?
Yoav: Meltdown.
Eldad: I was having a quick question given that you are like, aggregating everything that is out there in terms of data. Can you give us or share with us how fast the universe of data is expanding? Meaning that kind of it is constantly growing. How much of a challenge is that? How fast is data growing, given that it is not per user, it is just massive, massive granular data points across so many industries? How fast is it growing?
Yoav: So you are asking if we are storing more and more data. I can tell you that we have data sets where we load, let us say, two terabytes every month into a single database. We are also seeing an increase in coverage, because we want to cover more and more of the internet. I believe we are not seeing everything, all the darknet and areas like that, but we have more and more websites, and on the internet you have new websites every day. I think this is where the main challenge is, because you do not have an absolute number of entities. The entities are growing. So, maybe the data growth is not always exponential, but the number of entities is endless. Take keywords: all three of us may want to buy jeans, but we will search for different terms.
Eldad: I search for skinny.
Yoav: Yeah. I had Eldad down for the skinny, but yeah. So, these data sets are growing massively because the entities keep changing. Of course, you have the baseline, but, by the way, we are exploring it a lot, and we see that keywords especially change on a daily basis. Think about our lives. Today we have the Russia-Ukraine war. Ten years ago, you had something else. Today you have one fashion trend and tomorrow you will have another, and you have NFTs today and then SpaceX and Tesla, and every day it is something else. So, the information is endless. For that reason, you cannot assume that most of your data stays the same. It is the opposite: most of your data will change.
Boaz: Cardinality is a nightmare.
Yoav: But that reminds me: eventually we serve the entire internet, but our customers are not querying the entire internet. So, something that we want to be able to do in the next couple of years is to find solutions where we can optimize for what our customers are most interested in: a very robust, more cost-effective solution for the data customers are actually querying, and some cold storage for whatever just needs to be there and can be queried some time, somewhere.
Boaz: Product cost we did not cover and it is interesting for a company like SimilarWeb.
Yoav: Yeah.
Boaz: How predictable is the cost these days? How tough is it to manage?
Yoav: Also from colleagues in SimilarWeb, I know it is tough. This was one of the first actions I took when I got into this role. I am really glad that I did it. Because we did some gross action again for the data server team. And now, we are tagging every Spark cluster, every database, every table, each ETL running. So, I have a very, very good granularity to know for a specific feature, the entire cost from the ETL and ingest and the data serving, the databases, the staging, the development. So, I have very good visibility. But, otherwise, it is impossible. I mean, I think the number of roles and tags that I have today in our stack is something that you cannot manage if you do not have a standard.
Eldad: You need a data warehouse to manage your tags.
Yoav: That is a very good tip. I think, as soon as you do that, you can control it. Again, shit happens, but at least you can know in real time what happened. It happens to me every week. I think there is no week that I am not Slacking one of our engineers: what the hell is that? Why is that on? I see some anomaly in a specific table for a specific team. So that is very, very important, because otherwise you do not know what happened. You just know that you are off by $10,000, now go figure it out. So, manage your tags.
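With consistent tags in place, per-feature cost and the weekly "why is that on?" question reduce to a small aggregation. The sketch below assumes billing line items have already been exported as dictionaries; the field names, the required tags, and the 1.5x week-over-week threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

REQUIRED_TAGS = ("feature", "team")  # assumed tagging standard


def cost_by_feature(line_items: List[dict]) -> Tuple[Dict[str, float], List[str]]:
    """Sum cost per feature tag and list resources that break the standard."""
    totals: Dict[str, float] = defaultdict(float)
    untagged: List[str] = []
    for item in line_items:
        tags = item.get("tags", {})
        if any(tag not in tags for tag in REQUIRED_TAGS):
            untagged.append(item["resource"])
            continue
        totals[tags["feature"]] += item["cost_usd"]
    return dict(totals), untagged


def flag_anomalies(
    current: Dict[str, float], previous: Dict[str, float], max_ratio: float = 1.5
) -> List[str]:
    """Features whose cost grew by more than max_ratio week over week."""
    return sorted(
        feature
        for feature, cost in current.items()
        if cost > max_ratio * previous.get(feature, 0.0)
    )


this_week = [
    {"resource": "spark-cluster-1", "cost_usd": 400.0,
     "tags": {"feature": "keyword_research", "team": "data-server"}},
    {"resource": "dynamodb-keywords", "cost_usd": 100.0,
     "tags": {"feature": "keyword_research", "team": "data-server"}},
    {"resource": "etl-traffic-daily", "cost_usd": 200.0,
     "tags": {"feature": "traffic_overview", "team": "data-server"}},
    {"resource": "hbase-staging", "cost_usd": 50.0,
     "tags": {"team": "data-server"}},  # missing the feature tag
]
last_week = {"keyword_research": 250.0, "traffic_overview": 190.0}

totals, untagged = cost_by_feature(this_week)
anomalies = flag_anomalies(totals, last_week)
```

Listing untagged resources separately matters as much as the totals: any spend that cannot be attributed to a feature is exactly the "go figure it out" case.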
Eldad: Keep your tags close.
Yoav: Yeah.
Boaz: Yeah, close, as close as can be. This is awesome, Yoav. Super inspiring. It is rare to see a product that is so deeply driven by data. I have personally played with SimilarWeb products and they are absolutely amazing. The level of experience you get there, the breadth of analysis, coupled with such a great user experience in terms of speed and insight, is definitely something to learn from. So thank you for joining.
Yoav: Thank you, Boaz.
Boaz: And see you around.
Eldad: See you around. Thank you, Yoav.
Yoav: Thank you, guys. Take care.