Data engineers are not paid to do support. Liran Yogev, Director of Engineering at ZipRecruiter, and Doron Porat, Director of Infrastructure at Yotpo, talk about building resilient self-service products that keep customers happy and engineers calm. They walked the bros through their data stacks and explained how ZipRecruiter is completely rebuilding its data layer from scratch.
Benjamin: Hi, and welcome back, everyone to The Data Engineering Show. It's super cool. Today, we have another set of experienced podcasters joining us, Doron and Liran.
Liran: I love it.
Benjamin: Thank you.
Liran: You tried really hard. It's just...
Benjamin: I speak German, that's my native language, so we have some Rs in there as well. So I practiced really hard before the episode.
Doron: Sounds good.
Benjamin: Doron is a director of infrastructure at Yotpo. I hope I pronounced that correctly as well.
Doron: Perfect. You're the only person in the world that can pronounce this.
Benjamin: Awesome. Liran is a director of engineering at ZipRecruiter and was also at Yotpo before. Do you guys just kind of want to give a brief intro, tell us what you're up to, tell us about your podcast, and then we'll dive right in?
Doron: Yeah. Go ahead. I'll start. Yeah. I'll start with me, and then we'll go to the podcast.
Liran: Born in Yavne.
Doron: Yeah. I was born in Yavne.
Liran: 47 years ago.
Doron: 1923. So, I worked at Yotpo for 25 years, just a very long time.
Eldad: Was born at floor four.
Doron: Yeah. We started at floor number one and we reached up to floor number 26, right above Firebolt. But I worked at Yotpo for a long time. I started as a team leader for the data engineering team. Later on, we became an infrastructure group and the team grew. I became the group leader for the data infra. We built an amazing data platform here together, Liran and I.
He was actually my predecessor. He was the infrastructure group manager, and I replaced him recently.
Liran: Doing a much better job.
Doron: Yes. So I do everything better. Now, everything is better here.
Liran: Less people, right?
Doron: With less people. Yeah. We're running very lean.
Eldad: Benjamin, don't get any ideas in your head.
Doron: No. It works only if you're a woman. We have a joint child. We don't have a child.
Eldad: Oh my God.
Doron: We have a podcast that we've been doing for about a year and a half. It's also about data engineering and all the surrounding world.
Liran: But it's in Hebrew.
Doron: So it's in Hebrew. So we're not really competing. It's not the same.
Benjamin: I tried preparing for the show and Tamar sent me your podcast, and then I was like, shit.
Liran: So, there is one episode in English, which...
Eldad: So he learned Hebrew.
Liran: There you go.
Eldad: Hebrew. Then you didn't hear the second half. Yeah, he's still...
Liran: If you even started, it's so fast,
Eldad: It's too modest.
Liran: How fast we talk in the episode, so I don't think it's good for you, as a beginner, as a way to learn Hebrew.
Liran: Sorry. With friends or something in Hebrew. I don't know if there is a version.
Now me. I'm kind of always around the one, so all the stories intersect. I was at Yotpo; before that, I did something else. I created the data platform there, but I also held a lot of other roles. In my last position I was running all of platform engineering at Yotpo, so backend, data, and front end, which is what Doron is doing right now.
Then about eight months ago, I moved to ZipRecruiter. I'm doing the same there, but with somewhat different teams. I also run platform engineering, or enablement as you might call it, plus the area of ML experience, as we call it, or ML tools, and everything around ML data. And also experimentation, which is a team that is building an experimentation platform internally for our organization, so doing heavy testing and measuring everything. So, it's really been fun. And, the podcast.
Eldad: Data sets. Data sets, data sets. More output. More output.
Benjamin: Awesome. For listeners who have never heard about Yotpo or ZipRecruiter before, do you want to give us a quick, high-level overview of what products the companies are actually building?
Doron: Yeah, sure. So, I think it's funny because I've been at Yotpo for a while. We kind of started off branding ourselves as a marketing e-commerce platform, and we really became a real platform only in the past few years, where we offer a set of different products under the same platform to help e-commerce businesses just do better, get bigger, and stuff like that.
But recently, I think given the latest changes in the ecosystem and the financial macro environment, Yotpo is trending towards being a retention platform for e-commerce businesses. We do this with the same set of tools, but we're really focused on how we can help e-commerce businesses, online businesses, preserve their customers and enlarge their customer base through our products, which are review solutions, loyalty programs, referral programs, communication channels, a customer data platform, and more and more products like this.
Liran: ZipRecruiter is a hiring marketplace, I think that's what you call it. It basically helps both job seekers and employers find matches, and we do that for customers from really small mom-and-pop shops up to customers like Amazon. We do have different approaches for each of those customers, from enterprises to small businesses, and we heavily rely on AI to do the matching and other things in science systems. So, that's our forte. Helping them really find good matches for both sides. And also balancing the market itself just completely. So, that's the gist of it. And we're based both in Israel and in the US.
Benjamin: Awesome. Cool. Nice. This is The Data Engineering Show, and you guys obviously have a bunch of experience in data at a variety of companies. Take us through the types of data challenges you guys have in your day-to-day and maybe tell us a bit about your stack.
Doron: Okay, cool. So, maybe I'll start with the stack and then I'll go on to the challenges. I think that might make more sense.
Stack, I'll start from the bottom up. That's comfortable for me. So we're ingesting data into the data platform from all sorts of different data sources, whether it's operational databases, third parties, event data, whatever, and it's basically all streaming into the data lake. So, the whole solution or platform is built around the data lake and it's very data lake centric. It's based on AWS; that's where all the workloads are running.
Then in the data lake we're storing data in different formats, and using different techniques and different engines for transforming the data, mostly Spark these days. I think we've been running with Spark for a long time. What Liran and I did together at Yotpo for many years is making Spark...
Eldad: Buy a perpetual license, gives you the ability to use it for free on unlimited resources forever.
Eldad: Sorry, go ahead.
Doron: Our challenge years ago was how to make big data tooling available for the generalist developer, and later on also for BI developers and stuff like that. That was a big challenge: how to democratize data sets as well as data tooling. I think we had a different approach for this a few years ago, which changed and evolved over time as the platform grew older and the company also evolved and had different needs and requirements.
For orchestration, we're using Airflow. Most of the batch pipelines are running using this framework that we built internally at Yotpo. It's also open source. It's called Metorikku, where we write YAML files with SQL statements to describe the data pipelines.
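To make the idea of "YAML files with SQL statements" a bit more concrete, here is a minimal sketch of a pipeline declared as an ordered list of named SQL steps. It is not Metorikku's actual file format or API: the `PIPELINE` dictionary stands in for what a parsed YAML file might look like, the step and table names are hypothetical, and SQLite is used in place of Spark only so the example runs anywhere.

```python
import sqlite3

# Hypothetical pipeline description: roughly what a parsed YAML file of
# named SQL steps could look like. Not Metorikku's real schema.
PIPELINE = {
    "steps": [
        {
            "name": "paid_orders",
            "sql": "SELECT customer_id, amount FROM orders WHERE status = 'paid'",
        },
        {
            "name": "revenue_per_customer",
            "sql": "SELECT customer_id, SUM(amount) AS revenue "
                   "FROM paid_orders GROUP BY customer_id",
        },
    ],
    "output": "revenue_per_customer",
}


def run_pipeline(conn, pipeline):
    """Register each step as a temp view so later steps can select from earlier ones."""
    for step in pipeline["steps"]:
        conn.execute(f"CREATE TEMP VIEW {step['name']} AS {step['sql']}")
    return conn.execute(f"SELECT * FROM {pipeline['output']} ORDER BY 1").fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10.0, "paid"), (1, 5.0, "paid"), (2, 7.0, "refunded"), (2, 3.0, "paid")],
    )
    return run_pipeline(conn, PIPELINE)
```

The appeal of this style is that a generalist developer only writes SQL and a little configuration, while the framework owns execution.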
Also, with streaming, it's mostly Spark Structured Streaming, and also Flink pipelines recently. Then there's the whole analytics side of things.
Plus we have dbt, which we started using at Yotpo over the past year or so, well over a year. But we built this whole framework around dbt to visualize the new way of thinking about how we should manage data in the organization. It's also an internal tool that we built, and it really connects data producers and data consumers. On the other end we have Looker, which we use for internal analytics or external B2B analytics, either embedded dashboards or API, which is also something we are leaning towards more and more as time goes by. Ad hoc analytics is also a big part of the thing, making data available for everyone. So everyone is using Databricks clusters to work on top of the data lake. And that means engineers, BI, analysts, support engineers, solution engineers, everyone working on top of the data lake.
Yeah, I think that that's the big picture.
And if you ask about challenges, I think that in the podcast, we talk a lot with people, and a lot of the time we talk to people that have big data challenges. It's still a thing. I mean, you'd think that it's solved already, but people have big, big data challenges.
I think at Yotpo, what keeps me busy is more data manageability and how to architect this thing to work well at scale, serving a lot of people. Here we have an R&D of 250-260 people and more and more outer circles using the data. And how do we optimize this huge machine of money into something that's much more coherent, robust, scalable, and resilient over time?
Data manageability is, I think, a wider term, and I can talk for hours about this. I would say that's where I am focused at the moment.
Eldad: Quick question. It seems like data and everything you do with data is part of the feedback loop within engineering, within product building. Everything you do as engineering and product is using data. How is that translated into user experience, the data? A lot of what you've mentioned is internal, right? So, it's for building products. How does that get translated into Yotpo's business, for example?
Doron: I think Yotpo is not per se a data product. I think that ZipRecruiter is maybe more of an example of how data is central within the product. Data is a big thing, and when I talk about experience, I mostly talk about, and that's what my group is focused on, we talked about front end, backend, and data, but we are very, very focused on developer experience. I think that where our customers, B2B customers, make the data is, the places where we make it .
Well, I'm not talking about all the machine learning, data science part of things; it's not really under my responsibility. But analytics is becoming bigger and bigger, and I think it's also a part of what's going on in the world, that we are really focused on observability and demonstrating ROI and helping them make the right decisions. Yotpo is a complicated machine, and part of it is, I think, almost like actionable BI, but external. It's the way that we organize the data in a way that helps them understand how to navigate through the different products to bring more value. So, I think that's where they touch on data the most.
Liran: So I want to add to that. I think that what we went through in the last couple of years is that more and more use cases were added to the world of the data lake. First, what types of consumers do we have for data? We are mostly focused on internal, but also external as well. The data meets the customers in both companies. You don't have to be a data company to have data somehow reach production systems.
So, there are more and more use cases just being added. More and more types of consumers. They require different things and their experience or even their capabilities are different. How can they access data? How can they produce data that is high quality? So I think that's what I'm focused on and what is always changing in our ecosystem.
I can talk about our ZipRecruiter stack, but I don't know if we care about that, or if we just moved on. It's okay. Move on.
Eldad: So, what you're saying is if your data platform shuts down, then internally engineering, product, business won't be able to operate. It's so much embedded. It's so much intertwined. It's, as you're saying, self-serve. What is self-serve? Is it opening the data for as many as possible? From your experience, is that real? Is that kind of something...?
Liran: Yeah. So we try to build a decoupled system, right? I don't think it's great if the big data platform, which again, in both of our companies is very data lake centric, drops, and as soon as it's down, everything goes down.
I don't want to be in that position. So we need to have some kind of differentiation between all of these backend processes, batch processing, even the stream processing that happens, both for ML or for analytics, and the production systems, which actually need to serve something to the customer that they actually see.
So again, the business will suffer, and maybe some data will arrive late to the system. It may be visible to the users, but in the end, we don't want to be where something actually is not working anymore in production. That's my opinion.
Eldad: I've seen many people saving data sets, eventually, after they use all the stack, all the tools, they save it in Excel. So, it's a backup, but it's also a failover mechanism, and eventually it all ends up in some report. So, we hear it all the time, and I think companies went all in on data. You'll be surprised how dependent they are on internal data.
I'm not talking about external. External is easy to justify, but justifying internal, asking how do I optimize internally? What do I optimize for? Those are new questions we're hearing more about.
Liran: It's even more than that. We need to ask a question, do we need all this data? And it's a question I don't think a lot of companies are asking.
Doron: No, no, we don't need it.
Liran: We don't need all this. Yeah.
Liran: Because Doron and I have been in a lot of discussions about retention periods, for example, for our big data. At some point, just delete it. What will happen? Well, we'll be fine, and I think in a lot of... I'm not actually saying that.
But the discussion needs to happen, because at some point it reaches such a high complexity in cost and so many moving parts that you need to ask yourself, do I really need all this? And I think that's something that each company needs to always reiterate on, and ask these questions. I think that's a good culture.
Doron: I would like to add something. You were talking about questions that we ask ourselves, and I can say that after many years working in data, it sounds dorky.
But I find myself asking different questions as time goes by. My concerns shift. Recently we started talking about how we should better structure the infrastructure group, and then I started asking myself, where do the lines cross? Where does data start and end, where does backend infra start and end, and we have all those interfaces between them. And I think it's a really fascinating question to ask. It's also about how the data stack connects to the APIs or the event-based architecture, and where the lines cross. I think it's very, very interesting.
Liran: Yeah. I think we're actually seeing the world move a lot. It's becoming more an engineering world than it was before. It's less and less about just being data. So data is coming from somewhere. It has been produced by someone. It's even been managed by someone that's probably in the product or engineering world.
And I think before we used to have these silos where we had analytics teams just hand-managed data and I think that we're more mature companies. Question is, can they really manage it? Do they really have enough information? Are they really taking the responsibility off of the engineering teams?
I think it's because we just talked about boundaries, I think those are also changing all the time.
Eldad: Crazy times, huh?
Liran: Oh yeah. We love it. It's great.
Benjamin: Nice. Maybe take us through how these boundaries look at the specific companies you are working for. In a sense, you're providing core data infrastructure for your company. Say I'm a neighboring engineering team, I'm trying to build this new, I don't know, data application or internal data experience, whatever. Where do I interface with your teams, and what are the types of services you provide them?
Doron: Do you want to start?
Liran: Yeah. We are trying to build the methodologies, processes, and tools around producing and consuming data by all those types of customers you just talked about. That's where we are. And that means that in most cases, when you build a data application or some data experience, you'll be interacting with one of the tools that we have. So it's either going to be something that we bought and basically implemented and integrated, or something that we built that's very specific to our organization. And what we like to do is optimize that all the time. So that's what we do.
So we figure out, okay, we have this customer and we want them to have, to create the best data set or the best data application, so it's going to be the highest quality. It's going to be really fast to create, it's going to be really easy to consume for consumers. So how do we get there? And that's where we add all the different layers of the tools that we either build or buy. So, I think that's kind of where I see the interaction, specifically around the data. Do you want to add some?
Doron: Well, for us, first of all, we talked about teams being really lean, and we're not kidding. The teams are very, very small and we support a lot of people, giving services. So, we are really focused on building a self-service data experience. I'm going to borrow your words because I like them, but it's all about the experience and how we make this experience better, and help the developers and engineers be more self-sufficient and free to operate within their domains.
It's always maintaining this balance between allowing them to run freely and, I'm trying to find the nice word, destroying our vision for how the...
Liran: And also, taking care of our people. We don't want to be just giving out support every day. We are product teams, in a way that we are actually creating internal products, just like Firebolt, for example, does for its customers. But in the end, we both are lean. So, we cannot do support. I mean, no one pays us to do support all the time, and I think it would really be a waste of our time. Our engineers want to build things and not have to support them. So self-service is a really big thing, and providing our customers with the tools to debug the systems, to run them, to own them, to have...
Doron: And to enforce best practices as well, in a way that will not ruin their lives and experience, because it can be a real drag to be blocked in CI with every step that you try. So, it's a matter of how to enforce these in a smart, elegant way, and through this, create the things that you believe in and you think are instrumental for the data platform.
Benjamin: Got you. So going back to this previous example of retention periods. Say I'm an engineer, I want to use the awesome data utilities you and your teams provide, and I say, wow, I really need a 10-year retention period here. At what point does someone question this choice? Is this part of the core education you guys are doing inside of the companies, or is this ultimately up to the consumers?
Liran: I think it's both. We always have a choice between creating validation in CI/CD, or wherever it is you do it, adding some rules on top of it, or getting some alerts and monitoring. So we can do everything, and we can also do education. We can also make sure that we have really good guides, and we do sessions with everyone and talk to them about what is actually happening.
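As an illustration of the "validation in CI/CD" option, here is a minimal sketch of a check that could run against a dataset's configuration before it is merged. Everything specific here is an assumption made for the example: the `retention_days` and `retention_exception` keys, the two-year limit, and the function name are hypothetical and not taken from either company's tooling.

```python
# Hypothetical org-wide policy; the real limit would be a team decision.
MAX_RETENTION_DAYS = 365 * 2


def check_retention(dataset_config, max_days=MAX_RETENTION_DAYS):
    """Return a list of readable problems; an empty list means the config passes."""
    problems = []
    days = dataset_config.get("retention_days")
    if days is None:
        problems.append("retention_days is missing")
    elif days > max_days and not dataset_config.get("retention_exception"):
        problems.append(
            f"retention_days={days} exceeds the {max_days}-day policy; "
            "add a retention_exception with a justification"
        )
    return problems
```

Allowing a documented exception rather than a hard block keeps the rule from becoming a drag: the engineer asking for ten years can still get it, but has to say why.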
Eldad: And it depends on the weather. It depends on how they feel on that certain day. So for example, if they're angry that day, it will be more about consolidation and measuring whether best practices are applied. When they're happy, then it's about self-serve, pushing more data tools. It's a never-ending cycle.
But the truth is, even myself, you constantly ask, does it need to be centralized? Does it need to be decoupled? All of those complicated terms. The truth is you need a team that knows better, and that team knows better most of the time, not all of the time. And that team learns faster because they're domain experts. And if they're actually capable of translating that into "best practices", which are amazing if they work, then they make everyone better. Because the truth is most teams don't have a DNA for data. And the truth is that if you look at data, how it's being applied today versus a year ago, then most teams are not even close to having a DNA for data.
So, I think, The IT Crowd, I don't know, maybe some of you know the British show, that is how it all started, right? It was all about conflict. And I think teams that win with data, they get addicted and they start depending on experts and domain experts like you.
First, we salute you because I think the reason we asked it, it's because it's hard for those teams and in those times, it's even harder because they're now inbound, outbound, they do a lot of stuff.
So, first, we salute you, second, you are important, and third, if you don't deliver, then yeah, the business is shut down in so many ways that you can't even imagine.
Liran: I think it close down and there are issues going on. We just help protect and make sure that people do good things and not cause...
Doron: No, I like Eldad's version. I feel very important.
Liran: Okay, we save the world.
Doron: I feel very important.
Liran: Yeah. We're on the critical path. Get that.
Doron: Yeah. I want the critical path.
I wanted to add something. I wanted to add two things, but I forgot one.
Liran: But I'll let you add your things first.
Doron: No, but I think that another conflict that we have, following what you said, is that we do data all the time. We practice data. We breathe this. We eat this, and that's all we do. We do more stuff now, but basically.
Liran: We also have hobbies.
Doron: That's our DNA. We have that DNA, but the problem is working with generalist developers. They have an epic on some data pipeline or something that they have to do with data infrastructure. But the specific developer might not encounter anything related to data for a long time. And by the time they get back to that feature, to fix something, to add something, two years can go by and a lot of the time this stack has completely changed within those two years. And they're like, what the hell just happened?
And I think the culture and the DNA are different between ZipRecruiter and Yotpo, but at Yotpo, the full stack developers do pipeline building if they need to. All the teams, all the product lines, all the domains, they all have features and stuff that are related to data and data infrastructure. But it's not a day-to-day thing like it is with, I don't know, infrastructure related to Java, for example, because it is not something they do and practice every day. And it's also a battle that we keep trying to solve and get better at: how to engage our users to understand what the pain points are.
I remember what, the other thing I wanted to say is that...
Liran: I want to react to what you just said,
Doron: But I'll forget. I'll just say it and then you.
Liran: Okay. Alright. Very fast, I'll forget mine, so.
Doron: One of the things that we use more and more now is creating the right observability around data. It started mostly around cost because it's the big thing, right? In the past six months at least. But it's not only cost. It's cost and it's performance, and creating this observability in a way that is actionable for the teams. First of all, it's engaging, and it also really helps to bring them closer to the material in a way that they can comprehend and understand and relate to their actions. So I also think it's something we invest in very strongly.
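One simple way to picture observability that teams can act on is a per-team rollup of what their pipelines cost and how long they ran. The sketch below assumes a hypothetical list of job-run records with `team`, `cost_usd`, and `runtime_s` keys; the record shape and function name are illustrative only and do not describe Yotpo's actual tooling.

```python
from collections import defaultdict


def summarize_runs(job_runs):
    """Aggregate cost and runtime per owning team, most expensive team first."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "runtime_s": 0.0, "runs": 0})
    for run in job_runs:
        team_total = totals[run["team"]]
        team_total["cost_usd"] += run["cost_usd"]
        team_total["runtime_s"] += run["runtime_s"]
        team_total["runs"] += 1
    # Sort so the team with the highest spend is at the top of the report.
    return sorted(totals.items(), key=lambda item: item[1]["cost_usd"], reverse=True)
```

Attributing numbers to the owning team is what makes the report relatable: each team sees the effect of its own jobs instead of one platform-wide bill.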
Liran: So, I want to add to...
Doron: On the first thing or is the second...
Liran: I don't know. Not about a cost difference, but you now confuse me.
Anyways, I just want to add that Doron mentioned that the organizations are different, and I think when I got to Zip, I was under, like, we are the same person, basically, just different.
Doron: I'm a woman.
Liran: Yeah, she's a woman, I'm a man. But the same person.
But in general, we were trying to have our generalist developers use big data, which they didn't really know how to do. So we helped them and created a lot of structures and interfaces around it, so they can do it really easily.
Doron: In ZipRecruiter, you mean?
Liran: No, in Yotpo. But when I got to ZipRecruiter, I saw something else. They have a new persona that we didn't have before, called the data engineer. And I think that's related to what you said before, Eldad: in both my teams, actually maybe in all three of the teams that I have, we're not the experts, actually. We are experts in something. We know how to build infrastructure. We're good at that. We understand the product. We understand our users. We can collect feedback. We understand technologies.
Doron: The platform, architecture.
Liran: The platform, we're good at that. But how to actually, and what kind of data pipelines to write or ML even. We're not data scientists. We're building an ML platform. So that's really different. We actually try not to be the expert in that, in their day-to-day, but on our day-to-day and figure out, again, like Firebolt does for its customer. You don't have to be the best in the data or understand all the different types of data pipelines people use, but more understand what your customers need.
Eldad: Oh, no, our work is harder. Okay, so I apologize.
Liran: I get it
Doron: You need to do both. Yeah.
Eldad: No, it's nothing to say, right, Benjamin? Building a database. It's the hardest thing in the universe.
Benjamin: By far.
Liran: Thank you. So yeah, I think for me it's very challenging. I think in data it's a bit different because we do understand data pretty well, but in ML, we have to understand our data scientists, and they're not just one person. They are many different people with many different cultures and needs, and we have to create a platform that will help all of them at the same time. And that's very difficult, because we are not data scientists and have to really understand what they do. So I think that's also true in the data world, for us as well. So our experts are elsewhere in the company, and we know how to gather feedback and how to build a really good product. But at Yotpo, again, it was different. So, I'm just giving two perspectives.
Benjamin: Right. So, that sounds super interesting, right? Then, coming to ZipRecruiter and kind of adjusting to kind of this different company and different team structure. How's that going basically?
Liran: I need to go on basically, but that's what happens.
Doron: But we do podcasts together.
Liran: But we do it together. But it's not on a day-to-day basis, and I really miss it.
Eldad: So basically, you went from data engineering to data science and it's hard. And, there's a lot of science there, but it's the same data, so at least that, right?
Liran: Yeah. I have one word left from everything I did before. No, I think, whatever I did at Yotpo, it still helps me with the challenges, but it's a different organization, it's a different size.
Doron: The culture is a big part of what we do. It's really important to understand what culture you're stepping into and what's the culture you want to push.
Liran: And I think what I did... I'm going to be really blunt here. What I did when I entered ZipRecruiter, I was like, oh, I know how to do data. We're experts, we know what to do. And I tried to push my agenda, and I figured out really quickly that this is not what this organization needs. It needs something really different.
Maybe even if I am right, I cannot just push it. Culture is something, and culture change takes a lot of time, and maybe in some cases you can't even change it completely and you have to live with something else. And I think that's what I experienced when I joined ZipRecruiter.
Eldad: Do you have kids?
Liran: Oh, I have too many kids.
Doron: Too many kids. Too many.
Liran: Three for me, two for her.
Eldad: Your same experience. So, we trust you'll manage and you'll figure the kids out, because it's a new family, but you have lessons learned from the previous family. But it's amazing. Actually, we kind of never got, at least on our show, to hear that version of those challenges from that perspective. Thank you for that.
Okay. Benji, go to the... We need something formal. So, pick the next formal bullet that Tamar gave you.
Doron: Can we do something formal?
Eldad: We haven't even really opened the formal questions and an agenda yet.
Doron: We're not formal people.
Eldad: That was all intro up until now.
Benjamin: Then, to get more formal, tell us about your goals for the next year, because you've already hinted that over the past six months things changed, right? Things are focused much more, maybe, on cost efficiency now and those types of things. So, what are the things you're aiming for with your data teams over the next year?
Liran: Can I start? So I think Zip is in a strange position. We basically are rebuilding our entire data layer from scratch. It's quite a big company to do that at this time, so it's very challenging. So we're moving away from previous architectures and even technologies and just removing everything to just behave differently. So, right now our challenge is the quality of all of this migration and, in general, just quality.
Up until now, in a lot of the different use cases that were created, the data is not at the top quality. There has been testing before, and ways to measure the quality of data, and monitoring, but in the end, there were many, many unstructured things along the way. So, we are now trying to create this culture by technology. We are building the different tools to help make that into a structured process that's repeatable, that's easy to use, and that's a no-brainer and just gives you high-quality data. So, we started with schema.
So we built a way to document schema: you write the schema in Protobuf for each of your data sets and are able to document everything about that data set in a way that its consumers can actually, you know, understand what it is they're consuming.
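The idea of a schema that both documents a data set and lets tooling check records against it can be sketched as follows. The transcript says the real schemas are written in Protobuf; this example deliberately swaps in plain Python dataclasses as a stand-in so it stays self-contained, and the class names, fields, and `validate` method are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    py_type: type
    description: str  # human-readable documentation for consumers
    required: bool = True


@dataclass(frozen=True)
class DatasetSchema:
    name: str
    owner: str
    description: str
    fields: tuple

    def validate(self, record):
        """Return a list of problems found in one record; empty means valid."""
        errors = []
        for field in self.fields:
            value = record.get(field.name)
            if value is None:
                if field.required:
                    errors.append(f"missing required field: {field.name}")
            elif not isinstance(value, field.py_type):
                errors.append(
                    f"{field.name} should be {field.py_type.__name__}, "
                    f"got {type(value).__name__}"
                )
        return errors


# Hypothetical data set, invented for illustration.
JOB_POSTINGS = DatasetSchema(
    name="job_postings",
    owner="marketplace-team",
    description="One row per job posting published by an employer.",
    fields=(
        Field("job_id", str, "Unique identifier of the posting."),
        Field("salary", int, "Annual salary in USD, if disclosed.", required=False),
    ),
)
```

Keeping the descriptions next to the types means the same artifact serves producers (validation) and consumers (documentation).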
We're creating new ways to create data pipelines, based on events or in different ways, either in SQL or in Scala, and we're creating infrastructure around that, to help the producers create high-quality data as easily as possible without having to actually deal with a lot of the complexities of doing that by themselves. So, adding, I think, a lot of automation on top of that. We are going to build a new semantic layer.
We have a data set that's great but how do we actually use it? How do we aggregate on top of it? How do we join it to different other data sets? So creating that extra layer that explains the consumption patterns of the data. That's something big that we're going to build. We need to switch a query engine. We are right now using Athena, so we need to think of something else.
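A semantic layer, in the sense described here, records how a data set is meant to be aggregated and joined so that consumers do not have to rediscover it. The sketch below is only one possible shape, under stated assumptions: the metric and dimension definitions, table names, and `build_query` helper are all hypothetical, and SQLite stands in for whatever query engine is eventually chosen.

```python
import sqlite3

# Hypothetical semantic definitions: approved aggregations, groupings,
# and the join that connects the underlying data sets.
METRICS = {"applications": "COUNT(*)", "avg_salary": "AVG(j.salary)"}
DIMENSIONS = {"company": "j.company"}
BASE = "applications a JOIN jobs j ON a.job_id = j.job_id"


def build_query(metric, dimension):
    """Compose SQL from named definitions; unknown names raise KeyError."""
    dim_sql = DIMENSIONS[dimension]
    return (
        f"SELECT {dim_sql} AS {dimension}, {METRICS[metric]} AS {metric} "
        f"FROM {BASE} GROUP BY {dim_sql} ORDER BY {dimension}"
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (job_id INTEGER, company TEXT, salary INTEGER)")
    conn.execute("CREATE TABLE applications (app_id INTEGER, job_id INTEGER)")
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [(1, "Acme", 100), (2, "Acme", 200), (3, "Zed", 50)],
    )
    conn.executemany(
        "INSERT INTO applications VALUES (?, ?)",
        [(10, 1), (11, 1), (12, 2), (13, 3)],
    )
    return conn.execute(build_query("applications", "company")).fetchall()
```

Because consumers ask for a metric by name, the join and aggregation logic lives in one place and can change without every report being rewritten.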
Liran: Firebolt, right? That's an option.
Eldad: It's always the first and last option, but freeze for a second with that thought, and question. You talked about how you wire, you're using Protobuf, are those things that you embed within your legacy or existing, let's not call it legacy, with your production system, assuming that over the next year, step by step, you will kind of be able to unplug and plug and play new stuff or are those practices and principles that you apply on your new projects? Because you're saying you're moving the business to a new platform?
Liran: Moving everything. The end goal is basically deprecating everything that is old, and we have deadlines for it. It's quite aggressive. It's good in a way. We're actually moving. It's not that everything new is going to be greenfield and everything old is going to remain and stay crappy. We have to move everything. So in the end, we'll just have something new, and everything else that's not new will just be destroyed.
I think that's why I talked before about the question of what we actually need. These are questions that I'm not being asked, because I'm more on the technological side, but our BI developers and analysts are asking those questions all the time. Do we really need this report? Who actually uses it? Why is it so complicated? Do we really need all the data sets behind it? So building that data layer is not just about the infrastructure behind it, it's also about the data itself.
And I think all of those layers are being rebuilt, so it's interesting times.
Doron: I guess it's an entirely different story at Yotpo, but I think it can sound really similar when I say it. Looking a year from now, I think the challenge divides into two parts. One is in the world of analytics, and it's actually combined, you know. Again, we're also restructuring the data lake, and I think the whole escapade of dealing with cost very, very seriously has also got us to ask: are we doing stuff efficiently enough?
Are we doing a lot of things not in the right way, or suboptimally? I think that the re-architecting of the data platform, what I talked about before with dbt and the infrastructure that we build around it, really embodies the whole agenda that we have for how data should look in our organization. We're also working on migrating everything. This is very difficult, and it starts from the actual S3 buckets and where data lies, and it goes on to data contracts and data owners and how we preserve this.
It goes on to data quality, and all the way to exactly what everyone is talking about, semantic layers and data catalogs, and how we push this thing forward. I think that we could have worked on this stuff a year ago. Maybe not the semantic layer, because that's something that is really happening now; these days it's really, really hot.
But I think even talking about the data catalog, we could have started working on this two years ago or even three years ago. But I think that we needed to get to this level of understanding, and have the organization understand what they need from data, and for us to reach the same point where we see eye to eye on the importance of things. That's returning to what I said before about data manageability.
The other part of it, which goes together with that, is that currently a huge part of the data platform, or of all the raw data (most of it, not the big data by the way, but a lot of the data in terms of the number of data sets), comes from CDC, where we stream data from the operational databases with the BDM into the data lake. We understand that this method of replicating normalized databases and tables into a data lake means that people (maybe not everyone, it depends how you layer the data transformations) need to reconstruct the logic, understand how the team that built this architecture modeled that data for an application, and build it into something that makes sense for analytics, for example, or even for the product when you want to push data into a moderation view for B2B customers. It doesn't really make sense, and it doesn't make sense in so many ways.
Liran: It's also about the coupling. You have someone build an application that is used in production to serve some data or to serve something for the customers, and it's built in a way that suits MySQL: you have a couple of tables, you have some JOINs. That doesn't really make sense for analytics. It wasn't built for that. And when you couple them, that means the production team, or the team in charge of those tables, is stuck with that schema forever because someone in analytics actually utilizes it. But it's their tables. Why is anyone else using those tables? So I think it's also about that.
Doron: Yeah. It's really understandable that we need to publish these data contracts, or these data facades, and have the operational teams be in charge and be the owners of these data facades, and for the analytics to rely on that moving onwards. For us, it's probably going to go in the direction of using the outbox pattern, and it's a big thing because it means that we need to restructure the whole bronze layer. It's something quite big, more technical than what I talked about before, but I think it's a big part of it when you're talking about re-architecting the platform.
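[Editor's note: what Doron describes is commonly known as the transactional outbox pattern. Instead of replicating an application's internal tables, the owning team writes an explicit event to an outbox table in the same transaction as the business change, and a relay publishes those events downstream. The sketch below is a minimal illustration using SQLite, not Yotpo's implementation. The `reviews` table, the `review_created` event, and the function names are all hypothetical.]

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables: one operational table plus the outbox.
    conn.executescript(
        """
        CREATE TABLE reviews (
            id INTEGER PRIMARY KEY,
            product_id TEXT NOT NULL,
            score INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def create_review(conn: sqlite3.Connection, product_id: str, score: int) -> int:
    # The business row and its event are written in ONE transaction,
    # so the event exists if and only if the change was committed.
    with conn:
        cur = conn.execute(
            "INSERT INTO reviews (product_id, score) VALUES (?, ?)",
            (product_id, score),
        )
        review_id = cur.lastrowid
        event = {"review_id": review_id, "product_id": product_id, "score": score}
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("review_created", json.dumps(event)),
        )
    return review_id


def relay_once(conn: sqlite3.Connection, publish) -> int:
    # Publish unpublished events in order, marking each one afterwards.
    # A crash between publish and mark would re-send the event, so this
    # gives at-least-once delivery and consumers should be idempotent.
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


# Demo with an in-memory database and a list standing in for a message bus.
conn = sqlite3.connect(":memory:")
create_schema(conn)
create_review(conn, "sku-1", 5)
published = []
first_run = relay_once(conn, lambda kind, data: published.append((kind, data)))
second_run = relay_once(conn, lambda kind, data: published.append((kind, data)))
```

[In this shape the event payload is the data contract: the operational team can change the `reviews` table freely as long as the published event keeps its agreed structure.]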
Eldad: So, Benjamin to someone who is just an engineer, not a data engineer, even though you're building a database.
Benjamin: Which is the hardest thing in the world?
Eldad: A lot of the stuff that's being discussed is like when we used to read design pattern books in software engineering, right? You'd open a book and you had 25 patterns, and you knew there was so much experience embedded in each pattern that you even learned them by heart, because you just assumed that's how it should be done. Data doesn't have that. So, a lot of what that team is doing is trying to figure it out. A lot of the stuff is common, and you could say, okay, we could start treating that as a data design pattern. A lot of it is unknown, right? I've heard about semantic layers since I was 16, when I first started dealing with data. I'm 44. We invented the semantic layers. It was 20 years ago, but now...
Doron: But it belonged to a BI tool, and now it's also decoupled. Everything is getting decoupled.
Benjamin: Eldad's whole game for this podcast was just to show how visionary he is.
Benjamin: Set out, in the beginning, to just drop this.
Eldad: But the truth is we called it semantic and it was just because it was so nice...
Doron: Rebranding to make it cool.
Eldad: ...to have a diagram where you drag and drop the boxes and the JOIN line follows you, etc., etc. But you're right. I mean, back then data was so easy in terms of metadata, in terms of who's using it. It was nothing compared to now. So, just reinforcing everything you were saying: we need it, and most of it is not solved yet.
Eldad: But it's nice to see that there are patterns emerging.
Liran: Just to add, one of the things that we actually do is write playbooks for different data scenarios, with the Data Guild, which is another entity that we have at ZipRecruiter. That's part of the job, to actually create those. I think each company probably has its own, and that's probably why there's no single book for how to do it. Data Mesh tried, I think, to create some kind of standard around it. But it doesn't dive really deep into each technology and each kind of pattern.
Eldad: It's because it's culture-driven and it's kind of product-driven. What's the product right now? What's the culture right now? What did they go through? Did they go from Oracle three months ago? Where were they born? Did they go through the [00:42:49.08] _____ experience, cloud-native startup, day zero data-driven? It's how we win. There are so many ways to win or lose with data, so many practices to follow or not, but it's amazing to hear it. Benjamin, you see, there is a reason to wake up every morning. There is a reason to bend the data. Benjamin asks me sometimes, why do we need to build the database?
Doron: Why are we doing this?
Eldad: Do you see the pain?
Benjamin: It's so hard.
Doron: It is the hardest thing in the world. It's too hard. What's it for?
Liran: Exactly, exactly.
Doron: Say the world. Pinky.
Benjamin: Yeah. I appreciate you giving me back my sense of purpose.
Eldad: That was the purpose of this podcast.
Liran: What are you doing? Why are you talking to us?
Doron: Just leave everything and go build something.
Liran: Yeah. Build some nicer, neat index.
Eldad: Guys, we're frozen in time and space right now. We don't want to disconnect and go back to life.
Benjamin: All right. Liran and Doron any closing remarks from your end? Tips, tricks, anything else you wanted to say?
Liran: I don't know.
Doron: Yeah. I don't know if I have any tips or tricks. I think I've said it so many times, but I can repeat myself: for a very long time, Liran always used to laugh at me that I'm a data infrastructure engineer who doesn't like data.
Liran: It's a cool thing. It's special.
Doron: That's my mojo. It's good for nothing, though. But I think that recently, and again the motivation was cost, we really doubled down on analytics of what we do. I think it's fascinating, and it's also about self-measurement, about product measurement for the product that we build, and about observability, which I talked about before. It makes our job much more efficient, and it really helps to bridge with our internal customers.
That's my tip, I think.
Liran: If I'm also going to add to what you just said, those interfaces are what interests me right now. We have the analysts, who are called decision scientists, and we have BI, and we have data engineers, data scientists, and ML engineers. All of those need to work together, and they're all using the same platform. By the way, in some companies, they don't use the same platform. There is the Snowflake area that's only for the analysts, and then you have a data platform that's only for engineers.
I would like to have a world, and I think that's what happens in both of our companies, where we work really well together, and we are using the same platform and the same tools, and we are all enjoying it, because I think that's how it should work. And having those organizations disconnected, by the way, that's another thing: why are they disconnected while sitting under the same roof? They all need data somehow.
Eldad: Different database licenses are bought by different managers, mostly.
Liran: I guess that's good for you. Not good for the world. But yeah, I would like to see us help each other better, all those different departments. I think that's what fascinates me now, and I hope we have a better future.
Eldad: Boom! Amazing. Love it!
Benjamin: Awesome. Thank you so much for joining today. It was a total pleasure, and good luck with the super ambitious projects you guys have over the next year.
Liran: Thank you.
Doron: You too. You have the biggest ambition.
Eldad: Thank you so much.
Doron: Thank you.
Liran: Thank you.
Eldad: Thank you for joining us.