November 23, 2021

How did Agoda scale its data platform to support 1.5T events per day?

Scaling a data platform to support 1.5T events per day requires complicated technical migrations and alignment between hundreds of engineers. Want to see how Agoda did it?

Boaz:  Okay, so ready to get started.  Eldad! Are you ready?

Eldad:  Yes!

Boaz:  Okay. So hello, everybody.

Amir:  Hello, everyone!

Boaz:  Welcome to another episode of the Data Engineering Show, presented by Eldad Farkash, right here!

Eldad:  Hi there.

Boaz:  And Boaz Farkash, that is me! We are related.  We have the same parents, which makes us the data bros, woohoo! So, with us today, from Agoda, we are lucky enough to have been joined by Amir Arad - Director of Machine Learning and Shaun. Shaun your last name, please?

Shaun:  Shaun Sit.

Boaz:  Shaun Sit.

Shaun:  Yeah.

Boaz:  I had a hard time finding it because your name overlaps with a lot of options there, but I did find it eventually.  So Shaun Sit, a Senior Dev Manager at Agoda, currently managing the data platform.  So, we are having you guys on the show and, you know, I wonder before we start: at Agoda, because you are in travel and all that, is it frustrating working with travel data all the time while you guys are at home? I mean, don't you want to travel all the time? Are there any travel-related perks? Can we get some inside information on how it is to be frustrated but still enjoy the travel life at Agoda?

Amir:  So yeah!  You must travel in order to be working in Agoda.  It is not allowed to be stationary. You get fired if you stay at the same place.  And lots of perks and everybody at Agoda travels all the time, kind of hard during Covid.

Boaz:  So, yeah, exactly. So bummer for those who joined Agoda during Covid and did not get to enjoy the perks.  Any recent, exciting trips that you guys were on?

Shaun:  I recently went to Phuket. It was lovely. Absolutely lovely.  You guys should come whenever travel opens up.

Boaz:  I'm jealous. I am jealous.

Eldad:  A few moments of silence!

Boaz: I recently came to the office after working from home for a long time. That was refreshing and I went back home and then went back to the office, not fair!

Amir:  Similar, right?

Boaz:  These guys go to Phuket and all these exciting places. Okay! So, let us start with a short intro about you guys. Tell us what you do at Agoda beyond the fancy titles? Who goes first?

Shaun:  Sure! I can go first. As you mentioned, I'm a Senior Dev Manager at Agoda.  I manage the data platform teams here.  I am currently managing four teams.  Together, they deal with anything that is data-related in Agoda itself. So, the four teams manage the pipelines, our data lakes, and the self-service data applications that we built for anyone in Agoda to use; and then lastly, we have a team that manages the UI, creating a UI that makes the experience cohesive across all these tools. So that is what I do.

Boaz:  How many people are all of these teams combined?

Shaun:  In my area, we have close to 30 people.

Boaz:  Awesome!  And Amir, how about yourself?

Amir:  So I am complementing Shaun's effort by doing the machine learning part.  We have the machine learning platform that we build in Agoda, and a lot of tools and good stuff that help the data scientists get their pipelines and their models into production.  And other than that, I have a few teams that do actual business applications, like personalization and marketing efforts, that use all of the amazing data platform tools that Shaun's team builds to improve the experience of Agoda customers.

Boaz:  And how many people over there?

Amir:  It is changing all the time. But at Agoda overall, we have about 4,000 people, and I think in the data platform, we have about 300.

Eldad:  Wow!

Boaz:  Wow. Wow!  Okay! That's great! So, let's talk data. I mean, at Agoda, I would imagine data volumes are through the roof. What kind of data volumes are we even talking about?

Shaun:  If you're asking about like, messages on Kafka, which is our main data pipeline, like how data moves around in Agoda itself, we do about a trillion or so messages a day.

Boaz:  Okay!

Shaun:  And then if you're talking about the data lake, that comprises tens of petabytes of data.

Boaz:  So Amir, on the ML front, a typical slice of data challenges you guys look at, how far back do you go? How many events are you looking at? How much data volume?

Amir:  In terms of, let's say, predictions done daily, we just passed 60 billion predictions per day from our models.  And all of that is based on historical events and future predictions that we make. So huge pipelines processing billions of rows every time.

Boaz:   So let's break down, you know, the data stack for a second. I wonder, Shaun, if you could elaborate. Agoda has been around for quite some time, so what does the data stack look like? Or how many data stacks do you guys have? And how did it evolve through the years? What does it look like today, and how distant is it from how it was in the past?

Shaun:  I think today, like I mentioned, for the data pipeline we are using Kafka.  We are using Elasticsearch for logging.  We have Grafana with a custom time-series database called White Falcon.  It is built in-house. Our data lake solution is HDFS.  Then, we have YARN, Livy, Oozie, Spark, and a lot of custom ETL tools. Then, we also have some data governance and discoverability tools. We have a Schema Registry that works nicely with Kafka.  Data Market is our discoverability tool.  And then, we have custom data validation and data quality tools as well.  In terms of querying, we are using Impala and Vertica.  So that is our ad hoc query story.  And then, for UI, we are using Hue and a custom unified data portal.   The team that I mentioned earlier makes everything more cohesive.  And then for visualization tools, we have Metabase and Tableau, and some custom dashboarding stuff for funnels that we built in-house as well.  And that's the data stack from my side.  Amir, what does the machine learning side look like?

Amir: Yeah. First of all, we are using a lot of this; and on top of that, we have some cool stuff built in-house, like a notebook platform and Python tooling.  We are using MLflow, which is a very cool model life cycle management tool, and a lot of Spark jobs that are used for machine learning as well.

Boaz:  Are you guys on the public cloud, or are you self-hosted, self-managed?

Shaun:  We are On-Premise.

Boaz:  Everything On-Premise.

Amir:  We live inside the data center.  We connect the cables ourselves...

Eldad:  So, that's the real background for a second. Yeah!

Boaz:  That's why you guys travel all the time.  The travel perk includes a must-stop at a data center to do the wiring.

Eldad:  Minus 15 in New Jersey data center.

Boaz:  You mentioned a lot of homegrown tools, like the time-series database. What do you call it? Falcon, did you say, Shaun?

Shaun:  White Falcon.

Boaz:  White Falcon. First, beautiful name.  The other names are not as impressive.

Eldad:  Almost as good as Falcon.

Boaz:  That is a good reason to build products yourself: you get to name them yourself. But getting straight to White Falcon, why not go for something off the shelf?

Shaun:  I think it is for several reasons.  We obviously do try out other time-series databases. We always take a look at what is out there, benchmark ourselves against those, and do the feature set comparison.  It always comes down to the cost-performance ratio. We built it in-house, we have had it for a long while, and it has been very much a key enabler for us to store huge amounts of application metrics across the multiple data centers that we have around the world. So it's pretty good. It's pretty great.  Shout out to the White Falcon team!

Boaz:  Awesome! The data stack, I would imagine, serves a lot of use cases. What are the most interesting ones or the bigger ones running on top of the platform?

Shaun:  The interesting one, I think, would be, I want to say, Data Market. It is kind of similar to, I guess, DataHub or Amundsen from other companies.  I think DataHub is from LinkedIn and Amundsen is from Lyft, I believe.  Did I get it right? I don't know.  But basically, yeah, Data Market is our discoverability tool, and it has been instrumental in our data democratization story.  We send a lot of data.  It does not make any sense if nobody uses it.  So, we have got to make sure that there are tools out there that make it easy to find the data you are looking for, and that the data makes sense, is of high quality, and is usable.  It has really been one of the main drivers of data usage in the company.

Boaz:  When was the project launched? How long has it been in place?

Shaun:  I want to say, let's see, 2 or 3 years ago, that is when we built it. Yeah!

Boaz:  In your current stack, how much would you consider modern, something you are happy with, versus legacy? We are always in the process of trying to modernize.

Shaun:   I think that is like a tricky question, I guess, for any engineer.  You are never fully happy with a solution that you have. You always want to improve.

Boaz:  I never met a happy engineer.  They are never truly happy.  It is always almost there, almost there.

Eldad:  Always forward-looking.

Boaz:  Yeah!

Amir:  It is job security, right? We are never done.

Shaun:  I think for us, areas which we definitely can improve on, I would say is, around the area of decoupled storage and compute.  I will be honest we are a little bit behind in that regard.  We have not made that shift, and that is something that we are actively working on right now and it is going to give us that next-generation data platform.  So that is something that, yeah!

Boaz:  How do you go about such a project with those data volumes? How long does it take even to evaluate given the massive lift and shift it would involve?

Shaun:  So that is one of the pain points, I guess, when you work with data.  I think the biggest pain point is always migration.  All the technologies are pretty cool; but in order to use them, you have to migrate from what you are using currently to something new, right? And that is where things become fairly complicated.  Migrations always take an extremely long time.  I guess the key here is to plan, like plan, plan, plan, plan, plan, plan, plan, plan everything, and then try to move forward as quickly as you can, figure things out along the way, and just adapt to the situation.  I would say migration is always the biggest pain point, the biggest challenge.

Amir:  Yeah! And also, in Agoda we have so many different types of data users. It could be the data scientists, who can write their own code; they are wizards and they do everything on their own.  And it could be BI analysts who only know SQL. During these migrations, or when you make these changes, you need it to be seamless for them.

So that makes every change even harder, because nobody should even know that something changed behind the scenes. Usually, it is impossible.

Eldad:  The users are basically preventing you from making progress and moving forward.

Amir: Exactly. We always ask to just eliminate the users.

Boaz:  Which teams are your early adopters for new tech?  Do you have some internal teams that typically champion going next-gen, even at the expense of a painful migration?

Amir:  Machine learning, Always!

Boaz:  Machine learning, always.

Amir:  Like, they get the GPU, they SSH into it, and they have already started downloading stuff from the internet and running the latest GPU stuff or TensorFlow.

Boaz:  If you look at your tech stack evolution, and today you are On-Premise, does that involve moving, becoming hybrid? Does that involve moving to something like S3? Because I guess storage for you is a big thing and a big part of the challenge of migrating somewhere, or how do you think about it moving forward?

Shaun:  It is an interesting question. I think that the thing is we have always explored the cloud, right?  We do constantly explore the cloud. We do have some stuff running on the cloud; but for data specifically, we have always done our research and POC, and we have never found the right motivations or the right reasons or the right…, how would I call it? Like the right thing…

Eldad:  The one big thing that helps everyone make that transition.

Shaun: Yeah, yeah! Exactly, we have never found that big push, right, that pushes us in that direction.  So far, we are pretty happy with On-Premise.  I think also the hardware game has changed quite a bit.  CPU is now the bottleneck, right?  Storage is becoming way, way cheaper, and way faster.  Similarly with networks, right? And so being On-Premise does give us some advantage to leverage these systems as they come along and to explore them. So I would not say one is better than the other. As with anything, I think, you just do what makes sense for you. It is just that in Agoda, we have the right people, the right expertise, and the right history as well.  We came from On-Prem. So, we have a lot of knowledge in that area, and so far it makes sense for us.  That said, if in the future the cloud makes more sense, we would be 100% on board.  At this point in time, we are still On-Premise, but we are constantly exploring.

Eldad:  Makes perfect sense.

Boaz:  Amir, What about you? Which use cases are the ones that are of the highest-profile?

Amir:  Yeah. So I think for us, one thing where we got maybe a bit late to the train is that we were mainly a Scala shop, so we are doing a lot of Spark jobs, hugely parallel.  We have jobs using like 13,000 cores for five hours; amazing, huge jobs.  But people had to be Scala experts in order to tune them, to get used to them, to drive them. So I think we were kind of late to see that Python is now really the go-to language for machine learning.  And so we kind of regret not building tools for that sooner; but now, we are already on par.  I think very soon Python will win over Scala, at least in the use cases of data applications.  A lot of, let's say, the less advanced users already write their scripts and their notebooks in Python, and it made the time to market for data projects a lot shorter.  So that's cool, and this is one thing that I think we could have tuned in to sooner.

Boaz:  From a user perspective, say I am an Agoda customer or an Agoda visitor: what kind of things happen in the background all the time that I am not even aware of? Can you share some cool things there?

Amir:  Yeah, so we do a lot.  If you and I both open the Agoda website, we will get completely different experiences based on the past. Like, if I like breakfast, I will see more photos of eggs and other breakfast stuff.

Eldad:  Last time we got bacon all the time on your website.

Amir: Exactly, exactly.

Amir:  So, we do try to fit the content to what we think you would like and what you care about.  If you are more sensitive to price, then you will get the best offers. We have the best offers anyway, right? But then if we know that that is what you care about in your current trip, then the whole experience will be optimized for that.  We do, for example, try to cut snippets from reviews that make sense for you.  So if you care about cleanliness, then we will take the review that talked about it, Hey! this hotel is very clean, and then we show that one too.  So, a lot of personalization effort goes there, but even before you come to Agoda, right? All of the marketing that goes on behind the scenes, the emails, the pop-up notifications, everything is kind of optimized to make sure that you get what you want and the information you need on the Agoda website.

Boaz:  Where does the initiative for a new ML-based project typically come from? From your team? Is it sometimes product-driven? What does the thinking around new ML projects look like?

Amir:  Yeah! So, that's something that I think we do in a very cool way.  The scrums that we have in machine learning are always a mix of ML engineers, data scientists, and the PO together. And then it's like a three-headed beast that kind of tries to set the direction. Sometimes it comes from the data scientists; they say, Hey! this is something that we can easily optimize. Sometimes it is the PO or PM who says, Hey! the business should go that way. And sometimes it is the engineering manager who says, Hey! other teams did this, or we see other products doing that. So it is kind of a mix of ideas coming from three different directions, and then they get swirled together and the best one wins.

Boaz: Interesting.  Thanks.

Boaz:  Shaun, I saw a piece you published a few months back on Medium called “How Agoda manages 1.5 Trillion Events per day on Kafka.”

Shaun:  Yeah!

Boaz:  Can you share the backstory there a little bit? Was this published to follow some sort of architectural change or just something that was there for so long and you decided to pass the knowledge out there?

Shaun:  Yeah. Yeah!  I think I wanted to write a blog piece and contribute to the Agoda Tech blog.  So I thought this would be a good topic.  I think it is an interesting thing because if you read the blog post, it is not only about the technical stuff, right? A lot of it is about the human process, because you have to remember that at Agoda there are thousands of employees, right? So, certain things you need to think about in terms of scale: not only that the technologies scale well, but that the human processes scale well too. And I think that is something we sometimes forget as data engineers, that you have to ensure the human processes scale. So that is a lot of what drives how Agoda manages 1.5 trillion events.  There is all the stuff around that, like cost management and attribution, even simple stuff like auditing and monitoring, and giving developers the confidence that what they send is exactly what they will receive, and giving them the self-service ability to check on that kind of stuff on their own. And then the cost attribution as well, I think, is a major portion that really allows developers to manage their costs on their own, right? You always have to remember that, after all, these are company resources. You have got to make sure that what you are sending has some business use case, right? You do not want to just send stuff and turn the data lake into a data swamp, right? That is not very useful.  So yeah!
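
The auditing idea Shaun describes, checking that what was sent matches what was received, can be illustrated with a minimal Python sketch. It is not Agoda's tooling: the `audit_counts` function, the event shape, and the topic names are all assumptions made up for illustration, and a real audit would compare counts per time window rather than over in-memory lists.

```python
from collections import Counter

def audit_counts(sent_events, received_events):
    """Compare per-topic counts of events sent vs. events that landed.

    Both arguments are iterables of dicts with a "topic" key
    (a hypothetical event shape chosen for this sketch).
    """
    sent = Counter(e["topic"] for e in sent_events)
    received = Counter(e["topic"] for e in received_events)
    report = {}
    for topic in sorted(set(sent) | set(received)):
        s, r = sent[topic], received[topic]
        # A positive "missing" value means events were lost on the way.
        report[topic] = {"sent": s, "received": r, "missing": s - r}
    return report

# Hypothetical data: one "searches" event never reached the data lake.
sent = [{"topic": "bookings"}] * 5 + [{"topic": "searches"}] * 3
received = [{"topic": "bookings"}] * 5 + [{"topic": "searches"}] * 2
report = audit_counts(sent, received)
```

A self-service check could then surface any topic whose `missing` value is non-zero to the team that owns it.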

Boaz:  It's super interesting.  So how do you foster a culture where developers are mindful of that? Is that something that from day one they are mentored to think about? Or which guardrails do you put in place to make sure that it does not get avoided?

Shaun:  I think there are several ways. One way that we figured out works pretty well: we just took a page out of the book of the cloud providers, who charge you for every single thing, right? So if you mess up a query, you end up paying for it. However much data you store, you pay for it. So I think it is about building that visibility to allow developers to see, Hey! this is how much you are costing the company for sending this much data or, you know, for sending some unoptimized query to the query engines that we have.  So that itself has driven the awareness that, oh! there is a cost allocated to all of these actions, and I have to take that into consideration as well.
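
As a rough illustration of that cloud-style cost visibility, here is a minimal Python sketch that attributes a cost to each team from its storage and query usage. The prices, the record fields, and the `attribute_costs` function are hypothetical and invented for this example; they are not taken from Agoda's system.

```python
from collections import defaultdict

# Hypothetical internal prices, invented for this sketch.
COST_PER_GB_STORED = 0.02
COST_PER_TB_SCANNED = 5.00

def attribute_costs(usage_records):
    """Sum an estimated cost per team from usage records.

    Each record is a dict with "team", "gb_stored" and "tb_scanned"
    keys (an assumed shape, not a real schema).
    """
    costs = defaultdict(float)
    for rec in usage_records:
        storage_cost = rec["gb_stored"] * COST_PER_GB_STORED
        query_cost = rec["tb_scanned"] * COST_PER_TB_SCANNED
        costs[rec["team"]] += storage_cost + query_cost
    return dict(costs)

usage = [
    {"team": "search", "gb_stored": 1000, "tb_scanned": 2},
    {"team": "search", "gb_stored": 500, "tb_scanned": 0},
    {"team": "pricing", "gb_stored": 100, "tb_scanned": 10},
]
costs = attribute_costs(usage)
```

In this made-up data the "pricing" team stores far less than "search" but costs more, because its queries scan much more data. That is the kind of signal that nudges a developer to look at an unoptimized query.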

Boaz:  How distributed is the engineering team in Agoda? Which locations are you guys spread out through?

Amir:  So currently it's a lot more, right? Because we are working from home, a lot of people are spread around the world; but usually, we have three main hubs.  Bangkok is the biggest one, where most of Agoda sits.  Singapore is also big, and we have a small office in Israel that is right next to you. You can jump over, maybe eat a sandwich there.  We have a lot of smart data scientists there.  I think these are the main three, and now we have also opened another office in India to increase the diversity and the strength of our developers.

Boaz:  Got it. So, what are your main challenges today? What costs you the most sleep in your day-to-day? What do you worry about?

Amir:  A lot of things.

Boaz:  We are not talking about your personal life. We will take that offline.

Amir: Ah! Okay. Okay!  That's right.  I think one thing we find hard, I mean, take the success that Shaun talked about, making people accountable for data and owning up to the fact that they cannot just send as much data as they want.  That is one thing, but it is still hard when you try to change the way that people work with data. For example, in machine learning, each team is working with raw data, sometimes just sending SQL or building data frames. And we are trying to shift everyone to more of a feature store approach, where you first model your data as a feature and then you start thinking in this abstraction: okay! this is my feature, how it behaved, how it looked a month ago. Making this change, one team after the other, is usually the hard part.  Like, I wish I could just press a button and everybody in Agoda would just move to work in that way, right? But usually, it is a process that takes a lot of time.

Eldad: By the way, this is huge and super interesting, and actually one of the biggest things happening to engineering at cloud-native or data-driven companies: moving from engineers just building software to having engineers directly connected to the business feature they are building, its cost, and its value. It is a kind of broader thinking on how engineering is done, and it is fascinating.  So yes! It is everywhere.  I can see it at Firebolt as well.  This is a big thing: being transparent, opening up the data, and giving visibility. As you said, an engineer generates a payload, and that payload ends up in the hands of the user in some form, and that changes a lot of things. So, thanks for sharing; that is super, super interesting!  We should actually drill down on that in future shows as well.

Boaz: Yeah! Shaun, what about yourself? What keeps you awake at night?

Shaun:  It is interesting. For me, I think what keeps me awake is trying to figure out what the next-generation data platform will look like, and trying to see if we are making the right decisions, if we have made the right bets along the way. Because in the data space, yes, we are agile, but things still take time to migrate.  There is some time element involved.  So that's what keeps me up, right? Is object storage the way to go? Is it not?  Are distributed file systems coming back? You know, this kind of stuff, right?  Where is Hadoop going?  Where is YARN going? Right? All that kind of stuff.
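
The idea of treating a feature as an abstraction with a history ("how it looked a month ago") can be sketched as a point-in-time lookup. This is only a minimal Python illustration under stated assumptions: the `FeatureHistory` class, its methods, and the example values are hypothetical and do not describe Agoda's feature tooling.

```python
from bisect import bisect_right
from datetime import date

class FeatureHistory:
    """Keeps dated values of one feature and answers 'what was it on day X?'."""

    def __init__(self):
        self._dates = []   # sorted list of effective dates
        self._values = []  # value that became effective on the matching date

    def record(self, as_of, value):
        # Insert while keeping both lists sorted by date.
        idx = bisect_right(self._dates, as_of)
        self._dates.insert(idx, as_of)
        self._values.insert(idx, value)

    def value_at(self, when):
        # Latest value recorded on or before `when`, or None if there is none.
        idx = bisect_right(self._dates, when)
        return self._values[idx - 1] if idx else None

# Hypothetical feature: a user's average nightly price.
avg_price = FeatureHistory()
avg_price.record(date(2021, 10, 1), 80.0)
avg_price.record(date(2021, 11, 1), 95.0)

today_value = avg_price.value_at(date(2021, 11, 23))
month_ago_value = avg_price.value_at(date(2021, 10, 23))
```

Looking up the value as of a past date, instead of always reading the latest raw data, is what lets training and monitoring see the feature as it was at that time.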

Boaz: By the way I'm going to reference your LinkedIn account for the second time, looking at your LinkedIn account, your top-line message there says, “I'm hiring, we're building our next-gen data platform.  Come join that.”  And I was wondering if that is there for 5 years or 1 year or a few months.

Eldad:  It's a good tagline.

Boaz:  It is always true. As you are thinking about your next-gen platform, regardless of what that really means, because that could mean a lot of things, what is the objective? I mean, what would you like it to be capable of in 4 years that it is not capable of now? Is it more about being future-ready, or do you have concrete challenges you want to solve in the near term?

Shaun:  I think one of the things that we want to solve is the agility of the systems, the agility of the architecture.  If you think about the Hadoop space, a lot of things in the Hadoop space are very coupled together, right?  YARN is coupled with HDFS.  Oozie only works on Hadoop, right?  A lot of things are very coupled together.  So, I think for us, like for me, I am less worried about which systems we end up picking, as long as in the future we are in a much better position to have that kind of agility, to change something out whenever we need to without incurring those huge, long migration timelines, right?  So that's where I want to be.

Boaz:  Got it. Okay! Now, we are going to do a blitz question round.  We are going to ask you a few questions real quick.  Don't overthink, just answer and feel free to cut into each other's answers because there are two of you.

Amir:  I was waiting for that from the beginning.

Boaz:  We will keep count of whoever answers first, you know, whom we love more than the other.  We are like the parents loving one kid more than the other.   Okay! Let's start - commercial or open-source?

Amir:  Open-source.

Shaun:  Both. I say both, do what makes sense for your requirement.

Boaz:  That's like cheating.

Eldad:  That's because of Vertica; that is why both.

Boaz:  Batch or streaming?

Amir:  Batch for now.

Shaun:  I'm going to say both again.  No, the company will use one or the other.

Eldad:  If you had to choose one, if you had to choose one.

Boaz:  No, it is something like what makes you feel better?

Eldad:  Exactly. There is no good answer.

Shaun:  Okay.

Boaz:  Are you a Batch person or streaming?

Shaun:  Streaming.

Amir:  Exactly.

Shaun:  From Kafka streaming.

Eldad:  Everyone wants to be a streaming person.

Boaz:  We need to highlight the instructions a bit more for Shaun.  Shaun! Don't overthink the answer!

Shaun:  Okay.

Boaz:  Whatever you are feeling is right for you.  Write your own SQL or use a drag and drop visualization tool?

Shaun:  Drag and drop.

Amir:  Write your own.

Boaz:  ML team versus the data development team - interesting insights!  So far, no answer that was the same.

Eldad:  No.

Boaz:  We are not agreeing on anything.

Amir:  Everyone wants to make it interesting.

Boaz:  Work from home or work from the office?

Amir:  Work from home.

Shaun:  Home.

Boaz:  That is the first agreement.

Amir:  Yeah! But with a little bit of office here and there.

Boaz:  To Oozie or not to Oozie? You mentioned Oozie so often.

Eldad:  The original question was - AWS, Google Cloud, or Azure?  So Boaz changed that.

Boaz:  That's true.

Amir:  Yes! Oozie.

Shaun:  No Oozie.

Boaz:  No Oozie. Why not?

Eldad: Too coupled.

Shaun:  I guess too coupled.  I cannot run Oozie jobs outside of Hadoop.

Eldad:  Fair enough.

Eldad:  To DBT or not to DBT?

Boaz:  Do you guys use DBT?

Shaun:  DBT.

Boaz:  When did you guys start using DBT?

Shaun:  We have only started exploring it.  We have not adopted it yet, but there are a lot of concepts, approaches, and ideas in it that we like a lot, and we are going to use them in the development of our internal tools to match, you know, what DBT can do.

Boaz:  What do you use for ETL? Is it mostly Spark, or other stuff too?

Shaun:  It is driven by Spark, but it's mainly in-house, right? We built an ETL tool that is based around Spark, but it is extremely easy to use, with a whole UI and so on and so forth. Anyone in the company just goes in, writes a few SQL statements, and is done.

Boaz:  And no commercial, sort of traditional Informatica-style, On-Prem ETL integration tools?

Shaun:  No.

Amir:  Not needed.

Boaz:  Yeah. Nice.  Okay!  So now, after you guys have been trying to be so cool for the listeners, it is time to get real and tell us about one project that was horrible for you guys, that didn't go well at all, so we can all learn from your mistakes.  Who goes first?

Amir:  Yeah, that is a horrible one.

Boaz:  What mistake are you not going to repeat again? Tell us about a project like that.

Shaun:  I think for me, the failure is that we did not get into decoupled storage earlier. A lot of our systems are still coupled together, and that has hurt us. Because obviously with decoupled storage, you can always scale compute and storage independently, whereas in the old ways everything has to be uniform.  So, being slower to get on the bandwagon definitely has had some impact on us.

Boaz:  Yeah.  Amir, any glorious failures on your end?

Amir:  There are too many; every day there are a few. But again, not about my personal life, right?  It is just always feeling that you are moving too slowly. I do not think of anything specific, but yeah, getting to, for example, PySpark, I think we should have done it a lot sooner. And in terms of GPU, we invested a lot in building an amazing project around Oozie that would allow you to use Oozie to get outside of Hadoop and run stuff on some Kubernetes cluster with SSH commands and stuff.  We tried to break this coupling that Shaun mentioned, and we failed miserably. It ended up being unusable and we threw it away, and we are now trying to rebuild a new solution. We tried to make Oozie do stuff that it did not like to do, so it pushed back.

Boaz:  But you still voted for Oozie before.

Amir:  Yeah. But this is why we insisted, right? Because Oozie is a great tool, even though it's XML-based and I think it was built in the eighties and the UI is, like, very old.

Boaz:  As old as the eighties.

Amir:  Yeah, something like that, but it is very robust and battle-tested, it gives you a lot of features, and there is a lot of cool stuff you can do with it. So I wish there was an Oozie replica that is not coupled with Hadoop, but we couldn't find one so far.

Boaz:  Yeah. We seem to have another issue with Shaun, so maybe you're going to have to answer all questions on your own until we get it back.

Amir:  I will take it from here. Okay!

Boaz:  Now on a better note, tell us about the project that did go extremely well or something you are proud of, that is exciting that you want to share.

Amir: Yeah, recently we built a very cool tool for model monitoring.  Usually, I just talk about machine learning, but this one is also very close to data engineering, I guess, hardcore data engineering. What we saw is that, okay, people build a model, they send it to production.  They do maybe an AB test to see that there is business value and that it is actually better than the previous approach. But then after the AB test is over, nobody is watching over it. Okay. So people see the top-line numbers.  People see that traffic is flowing.  Bookings are made.  People are happy with the product in general, but you don't know what is going on within your model.  How good are the predictions that it is making? What we built is based on the great Kafka pipelines that we have with Shaun: we are sending the model monitoring data all the way back to Hadoop.  So the inputs and outputs of the model are sent back, and we have a Spark job that crunches the statistics of all this data.  So, kind of trying to find a needle in a haystack, we go column by column and calculate all kinds of stuff, similar to what Deequ is doing, or maybe, if you know it, TensorFlow Data Validation: calculating the shape of the data that is flowing and trying to come to insights about how it changed compared to a week ago, or compared to what we thought we had when we trained the model. And this turned out to be a very cool tool. So everything is now connected to Grafana with automatic alerts, and we try to find these anomalies and get them back. Getting Shaun back... So that was a very cool win, and the cool part was the human part.  The model owners just click a few buttons, register here and there, and that's it, their model is monitored.  Everything happens automatically for them, and they get these amazing alerts for free, and that was very cool.

Boaz:  How did you build the justification to go after a project like that? And how many people were involved?

Amir:  Yeah, that's cool.  So we work closely with the data science department, and we searched with them for what the pain points were. In the beginning, none of them actually complained about this, because it was kind of falling between the chairs, between the ML engineers and the data scientists. Data scientists care about, Hey, I want tools to work fast. I want to be able to have a lot of resources to crunch a lot of data.  And usually, once the model is in production, they care a little less about it. Meanwhile, the ML engineers kind of felt that, okay, this is a machine learning model, so it is the data scientists' responsibility. So we realized that nobody really owned that area, that maybe it should be owned by the platform, and that was a reason for us to go in and do it. And also, there was a streak of failures and actual incidents that caused platform degradation. And we said that, okay, it is justified to build a tool and a platform for that.

Boaz:  Awesome.   Shaun, you missed the question because we lost you for a minute.
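
The column-by-column comparison Amir describes can be sketched in plain Python. This is a deliberately simplified illustration and not the Spark job or the Deequ or TensorFlow Data Validation APIs: the `column_stats` and `drift_alerts` functions, the threshold, and the sample data are all assumptions made up for this example.

```python
from statistics import mean, pstdev

def column_stats(rows, column):
    """Basic shape statistics for one numeric column of logged model data."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "null_rate": 1 - len(values) / len(rows),
    }

def drift_alerts(baseline_rows, current_rows, columns, max_shift=3.0):
    """Flag columns whose mean moved more than `max_shift` baseline std devs."""
    alerts = []
    for col in columns:
        base = column_stats(baseline_rows, col)
        cur = column_stats(current_rows, col)
        scale = base["std"] or 1.0  # avoid dividing by zero for constant columns
        shift = abs(cur["mean"] - base["mean"]) / scale
        if shift > max_shift:
            alerts.append((col, round(shift, 2)))
    return alerts

# Hypothetical logged inputs/outputs: "price" drifted, "score" did not.
baseline = [{"price": p, "score": s} for p, s in
            zip([100, 110, 90, 105, 95], [0.5, 0.6, 0.4, 0.5, 0.5])]
current = [{"price": p, "score": s} for p, s in
           zip([300, 310, 290, 305, 295], [0.5, 0.55, 0.45, 0.5, 0.5])]
alerts = drift_alerts(baseline, current, ["price", "score"])
```

Here the baseline could be the data the model was trained on, or last week's traffic, and the resulting alerts would be what gets pushed to a dashboard with automatic alerting.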

Shaun:  Yeah! Sorry, my place got a blackout like it happens in Bangkok.

Eldad:  It happens.

Shaun:  Yeah.

Boaz:  So now it's your turn to share a project that you were proud of or a great win that you're happy with.

Shaun:  I think it is the Data Market, the stuff that I mentioned previously. Beyond the discoverability that it gives to everyone in Agoda, it also serves as a central place to get information about a piece of data, so you can get data quality information from there as well.  You get to figure out who is sending it, from where, all that kind of cool information, all in one single place.  And that really has been instrumental.

Boaz:  Awesome!  Okay guys, I think, we are reaching the end of the show. You've been great. Absolutely, exciting to see what is happening behind the scenes with data at Agoda.  I'm going to think of you guys next time I book a trip or something. 
