December 16, 2025

The $100M Problem: How Lyft's Data Platform Prevents ML Failures with Ritesh Varyani at Lyft



What if your data platform could serve AI-native workloads while scaling reliably across your entire organization? In this episode, Benjamin sits down with Ritesh, Staff Engineer at Lyft, to explore how to build a unified data stack with Spark, Trino, and ClickHouse, why AI is reshaping infrastructure decisions, and the strategies powering one of the industry's most sophisticated data platforms. Whether you're architecting data systems at scale or integrating AI into your analytics workflow, this conversation delivers actionable insights into reliability, modernization, and the future of data engineering. Tune in to discover how Lyft is balancing open-source investments with cutting-edge AI capabilities to unlock better insights from data.

Listen on Spotify or Apple Podcasts

Benjamin (00:01.337)

All right. Hello everyone and welcome back to the Data Engineering Show. It's really good to have you. Elad couldn't make it today because he's out traveling, but we're super happy to have Ritesh on the show today. Ritesh is a staff data engineer at Lyft, or actually a staff software engineer, sorry. It's great to have you on the show, welcome. Do you want to quickly introduce yourself, your journey into data and what you're working on today, Ritesh?

Ritesh Varyani (00:30.624)

Perfect. So I'm Ritesh. I'm a staff engineer here at Lyft. I've been at Lyft for about six years. Before that, my software journey actually began at Microsoft, within the data space itself, building some SaaS products in the CRM space, and then in the Azure space I did some platform work with Hadoop.

Then at Lyft, I have primarily been within the data platform space across those six years. I've jumped across products and at this point lead the Trino, Spark, and ClickHouse products at Lyft.

Benjamin (01:13.701)

Awesome. Okay. Very, very cool. So looking at the Lyft data stack, right? I assume you guys are processing kind of crazy amounts of data. Give us a bit more info on where your team actually works. Are you building and driving customer-facing analytics? Are you focused on internal analytics as well? Would love to understand the overall stack, where in the company your team works, et cetera.

Ritesh Varyani (01:32.334)

Ciao.

Ritesh Varyani (01:37.91)

Yeah, so the goal of our platform is to give our users access to the data as fast as possible, so that they can derive meaning from the data they are getting and make better data-driven decisions. The data products are very different and carry different kinds of goals around real-time accuracy, latencies, approximations, and predictions. So we serve a variety of customers through these products, with Spark being a massive batch processing engine.

Through that, we are primarily focused on customers who want to run ML training jobs, who want to process a lot of data and get a larger meaning out of it, or who want to run GDPR jobs that do full scans across lots of tables going back years and execute GDPR operations on them. For any kind of massive data-parallel processing, you would have customers using Spark as the platform. In that space we serve ML engineers, software engineers, and data engineers who are writing these pipelines or training jobs and trying to get meaning out of the data to make these decisions. If we go towards the Trino side of things, a typical customer is anywhere from a data scientist to a data analyst to a software engineer who is either dashboarding or doing, I would say, an average-sized ETL, nothing too heavy; Trino ETL works really great for us.
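For readers who want a concrete picture of the Spark side, here is a minimal PySpark sketch of the kind of GDPR-style full-scan job described above. The bucket paths, table layout, and user_id column are illustrative assumptions, not Lyft's actual pipeline.

```python
# Minimal sketch of a GDPR-style erasure job in PySpark.
# Paths, columns, and the staging convention are hypothetical;
# a production pipeline would also handle auditing, retries, and the
# final atomic swap from the orchestration layer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-erasure-sketch").getOrCreate()

USER_ID_TO_ERASE = "user-123"                                  # from a deletion-request queue
SOURCE_PATH = "s3://example-bucket/warehouse/rides/"           # hypothetical table location
STAGING_PATH = "s3://example-bucket/staging/rides_scrubbed/"   # hypothetical staging location

# Full scan: read every partition and drop the data subject's rows.
rides = spark.read.parquet(SOURCE_PATH)
scrubbed = rides.filter(rides.user_id != USER_ID_TO_ERASE)

# Write to staging first; the orchestrator would then swap it into place.
scrubbed.write.mode("overwrite").parquet(STAGING_PATH)
```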

Ritesh Varyani (03:27.352)

There in Trino we have experimented with Project Tardigrade, which is Trino's fault-tolerant execution. We didn't eventually turn it on. It yielded good results, and there was no problem in the technology per se or that particular feature; we just didn't feel the need, and we were able to successfully do Trino ETL without turning it on. The primary reason is that we developed our own ETL logic within the orchestration layer for Trino. So you would write into a temp table, run data-quality checks, and then do an atomic insert overwrite from the orchestration layer rather than the engine layer per se. We had already done that work before those features came out, and we were able to do a lot of, I would say, average-sized ETLs through Trino.
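A rough sketch of that orchestration-layer pattern, written against the trino Python client: write into a temp table, gate on a data-quality check, then publish with an insert overwrite. Host, catalog, and table names are invented for illustration, and the session-property overwrite shown here is just one way to get the atomic-replace behavior.

```python
# Sketch of Trino ETL driven from the orchestration layer:
# 1) materialize into a temp table, 2) run a data-quality gate,
# 3) atomically replace the target partitions with an insert overwrite.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical coordinator
    port=8080,
    user="etl",
    catalog="hive",
    schema="analytics",
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE analytics.daily_rides_tmp AS
    SELECT ride_date, region, count(*) AS rides
    FROM raw.rides
    WHERE ride_date = DATE '2025-12-01'
    GROUP BY ride_date, region
""")
cur.fetchall()

# Simple data-quality gate before publishing.
cur.execute("SELECT count(*) FROM analytics.daily_rides_tmp")
if cur.fetchone()[0] == 0:
    raise RuntimeError("DQ check failed: temp table is empty, aborting publish")

# The Hive connector can overwrite existing partitions via a session property;
# note it is the orchestrator, not the engine, sequencing these steps.
cur.execute("SET SESSION hive.insert_existing_partitions_behavior = 'OVERWRITE'")
cur.fetchall()
cur.execute("INSERT INTO analytics.daily_rides SELECT * FROM analytics.daily_rides_tmp")
cur.fetchall()
cur.execute("DROP TABLE analytics.daily_rides_tmp")
cur.fetchall()
```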

Another very important use case that Trino serves is BI dashboarding. Any person who is able to write SQL, any engineer, a product manager, even an exec, can write Trino SQL, get access to the data they want, pull it out, and basically get the meaning out of it.

ClickHouse sits in a slightly different space. It's engineered for the sub-second query latency space, where it does excellently well if you know what you are querying, so that you can organize the data accordingly. The speed we have personally seen delivered from ClickHouse comes when we know how the data is going to be queried; we can then help our customers store the data accordingly, which gives them sub-second latencies for their OLAP workloads, so they are able to quickly slice and dice the data. Primary use cases would be something like the marketplace, where you want to immediately see how supply and demand and a bunch of other marketplace metrics are doing over the last couple of hours in different regions. A couple of other use cases would be if anybody wants to forecast something, or wants to feed it into a real-time

Ritesh Varyani (05:48.584)

pipeline, and we have some of those use cases downstream that actually go off of ClickHouse and use that data to feed into downstream real-time ML systems.
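To make the "organize the data around known query patterns" point concrete, here is a small ClickHouse sketch using the clickhouse-connect client: a table ordered by (region, minute) so the "last couple of hours, per region" slice stays fast. The schema and metric names are hypothetical.

```python
# Hypothetical marketplace-metrics table, ordered for the known access
# pattern: recent time window, sliced by region.
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.example.internal")

client.command("""
    CREATE TABLE IF NOT EXISTS marketplace_metrics
    (
        event_minute DateTime,
        region       LowCardinality(String),
        supply       UInt32,
        demand       UInt32
    )
    ENGINE = MergeTree
    PARTITION BY toDate(event_minute)
    ORDER BY (region, event_minute)
""")

# "How are supply and demand doing in each region over the last two hours?"
result = client.query("""
    SELECT region, sum(supply) AS supply, sum(demand) AS demand
    FROM marketplace_metrics
    WHERE event_minute >= now() - INTERVAL 2 HOUR
    GROUP BY region
    ORDER BY demand DESC
""")
for row in result.result_rows:
    print(row)
```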

Benjamin (05:59.653)

Okay, awesome. And then in terms of the underlying infrastructure, do you folks run your own data centers? Is it all on public clouds? Is, yeah.

Ritesh Varyani (06:08.91)

Yeah, so Lyft is an AWS shop. Primarily, let's just consider 99% of our infrastructure and platform is on AWS. There are a few technologies for which we have gone to a vendor: for ClickHouse, we use ClickHouse Cloud, but Spark and Trino we continue to run in-house. We are a Kubernetes and an AWS shop, so it's everything on EKS at this point for us within the AWS environment.

Benjamin (06:11.557)

Okay.

Benjamin (06:38.725)

Okay. And then in terms of the underlying data flow, are you folks dumping it all into a data lake that serves as the source of truth for all of these systems? Are you using open table formats? Like, how do you actually move data around between the different systems, basically?

Ritesh Varyani (06:55.437)

Sure. So almost all of our data, not all of it, but let's say almost all of the data that's used for offline processing today, not online, is dumped into S3, and we have our Parquet tables defined on top of it. We are primarily a Hive format shop. We are going to be moving to other open table formats in the future, but at this point it's the Hive table format: you dump your data, you define a table on top of the set of Parquet files, and you're able to query that data through Spark and Trino. For ClickHouse, you need to make the data flow into its ecosystem so that it's arranged in a way that gives you the query performance. There are a few other places where data lives outside of S3. We also use Redshift very heavily, but that's more in the financial infrastructure space rather than the data platform space, though there is a lot of integration between the two spaces, moving data across those two links.
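As a rough illustration of that layout, here is a PySpark sketch of a Hive-format table defined over Parquet files in S3; with a shared metastore, the same table definition is what Trino would query. The bucket, schema, and columns are hypothetical.

```python
# Hive-style external table over Parquet files in S3 (names hypothetical).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-on-s3-sketch")
    .enableHiveSupport()   # assumes a configured Hive metastore
    .getOrCreate()
)

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.rides (
        ride_id  STRING,
        region   STRING,
        fare_usd DOUBLE
    )
    PARTITIONED BY (ride_date DATE)
    STORED AS PARQUET
    LOCATION 's3://example-bucket/warehouse/rides/'
""")

# Register partitions that other writers have already landed in S3.
spark.sql("MSCK REPAIR TABLE analytics.rides")

spark.sql("""
    SELECT region, count(*) AS rides
    FROM analytics.rides
    WHERE ride_date = DATE '2025-12-01'
    GROUP BY region
""").show()
```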

Benjamin (08:00.463)

Okay, super cool, nice. So yeah, that's a super mature stack and it's kind of cool how many different pieces of data infrastructure you folks use to serve different workloads within the organization. What has your main focus been throughout this year?

Ritesh Varyani (08:17.774)

This year the primary goal we had in mind was to understand our reliability problems, understand our scaling problems, and future-proof our platform. For a good amount of time we were doing a lot of open source work that spun off into a lot of, I would say, products to manage, maintain, deploy, upgrade, keep reliable, and keep from falling behind the industry. We did a lot of that around Trino, around Spark, even around ClickHouse with some Kubernetes operators when we were running it in-house. But our main goal at this point is primarily understanding how we see the data platform running, let's say, three to five years from now, and how we are able to future-proof it. In this world of AI, we should not be falling behind in any way, and we should be bringing AI into the right places within our platform.

We look at AI as something that integrates across the company. It doesn't remain a particular org or a particular domain or a particular kind of product within the company. Across the domains of the company, we have already had some initiatives outside of the data platform and a couple of them within the data platform. The goal is to integrate it wherever it makes sense so that you are able to derive insights from it.

Ritesh Varyani (09:53.6)

The second aspect is modernizing and unifying, which is what we are trying to think through at this point, again with the goal of giving users a reliable platform so that they are able to get meaning out of the data consistently and reliably without, you know, suffering through any of the hurdles in the picture.

Benjamin (10:15.877)

Okay, super interesting. So let me drill into the AI part, right? Because that's something I think is on the mind of a lot of listeners. How do you think about that as a data platform team? Because fundamentally, a lot of the work you folks are doing is giving the other teams at Lyft the right tools for the job to get value out of their data. So how do you think about that? How do you think about it in the context of actually running multiple different systems? Is it like, hey, now we infuse Trino and, let's say, I don't know, make it ready for RAG workloads, and then you do the same with Spark? Is it something you're actually driving more at the data lake layer, how the open table formats connect? A million questions here basically, and I would love your take on all of those.

Ritesh Varyani (11:02.094)

Yeah, so a couple of things, I mean, there are a lot more, but a couple of things that we immediately see working out for us, at least in the short to mid term, are specifically around how we view the semantic layer within the data space.

We have a semantic layer aspect, or product, at Lyft internally. We want to make it ready for the AI-native side of things. So think of it like semantic layer V2 with AI-native support, so that marketplace and a bunch of other orgs are able to derive the best meaning possible from the data that they see. It could be data in dashboards, it could be data joined across some ETL pipelines and derived tables, it could be real-time data they are seeing through ClickHouse, as an example. That's one layer where we see direct interfacing and direct investment happening from our side, because we want to be at the forefront of that and ensure we are able to get the best predictions possible, as fast as possible, for our use cases. The second aspect I see is that big data systems are very complicated. They are distributed systems by nature. There is a lot going on under the hood, and there is a curve to understanding all of those things. How do you simplify that for a new hire? How do you simplify that on a large scale for a platform with changing workloads? At Lyft, and probably at other companies as well,

Ritesh Varyani (12:48.622)

the kinds of queries that are being written, the kinds of data that are being queried, the kinds of data that are being brought into the system will always keep on changing. It's not about growth; growth is a separate aspect, but it will always keep on changing every six months to a year. So you did a certain capacity planning, as an example, for certain kinds of workloads, and now those workloads are different. Now your capacity planning doesn't work. Now either you are reactive, or, is it auto scaling? Do you have the right Spark queues? Do you have the right Trino queues? Do you have the right resourcing limits set, as an example? So auto scaling at a platform level is a different problem. Yes, it can be a little bit reactive, based on how much you see the queue size, CPU usage, memory usage, and so on and so forth. But where AI can help you is to very clearly understand, A, are the patterns changing, and if the patterns are changing, what is a good action to take on those patterns? Which is like an agentic workflow.
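For contrast with the pattern-aware approach Ritesh is describing, here is a toy sketch of the purely reactive baseline: scale on queue depth and utilization thresholds. The metric names and thresholds are invented for illustration.

```python
# Toy reactive autoscaling signal: thresholds on queue depth and CPU.
# An agentic approach would instead reason about how the workload mix
# itself is changing and pick an action accordingly.
from dataclasses import dataclass

@dataclass
class ClusterMetrics:
    queued_queries: int
    cpu_utilization: float      # 0.0 to 1.0
    memory_utilization: float   # 0.0 to 1.0

def desired_worker_delta(m: ClusterMetrics) -> int:
    """How many workers to add (positive) or remove (negative)."""
    if m.queued_queries > 50 or m.cpu_utilization > 0.85:
        return 5    # backlog or CPU pressure: scale out
    if m.queued_queries == 0 and m.cpu_utilization < 0.30:
        return -2   # idle capacity: scale in gently
    return 0

print(desired_worker_delta(ClusterMetrics(120, 0.90, 0.70)))  # -> 5
```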

Benjamin (13:48.387)

Right. Which is super interesting. That basically means you almost have these two orthogonal topics in which you're leveraging AI. Like A, you have this topic around how can a user infuse their analytics journey with AI, right? How can I ask a question about the data and kind of the data platform just is able to give me that answer, which is where the semantic models come in. And then you also have the second dimension around how can you leverage AI as a team to run your systems in a more efficient way?

Ritesh Varyani (14:17.411)

How do you run your platform better?

Benjamin (14:19.171)

Exactly. I think for us, right, building Firebolt, we've actually seen a similar pattern, that there are almost these two kinds of layers in which you want AI.

So we have things like an agent in the product that you can use to ask questions about your data or learn more about Firebolt. That's more of the user interacting with your product getting AI infused into the experience. And then there's of course the AI for the person building on top of that, right? Like how can we integrate with LLM providers in our SQL dialect, how can we add vector support, and all of these things. So it's quite interesting that for many people working around data, as they think about AI, there are different lenses through which AI comes into the overall picture.

Ritesh Varyani (15:03.148)

Yes, and to actually add to this, I would say there is an overlapping use case here as well. Think of the users who are using this platform to get their data. Now, how do you help them out if, as an example, a massive 2,000-line job or a PySpark job goes wrong? One of the things that has been pitched internally is having, let's say, an AI agent for Spark. We have not productionized it at this point, but let's say you have a job that ran for two and a half hours and it suddenly failed. Was it a platform issue? Was it a networking issue? Yes, you can go through a big chunk of log file and try to determine that, which is what we do today, which is what an on-call does today in the team. But can you just take that entire log file, take the set of Spark configs that the user passed, and dump it into an LLM to try to get the best meaning out of it? You share your Spark version, you share a bunch of internal metrics around that job that failed, and you get a response back that, hey, probably the config was in error, or the nodes were in error. So you have a two-pronged solution there: A, you go fix the config, or B, was the auto-scaling maybe slow at that point, did you not get enough capacity to run that kind of workload? Those are the aspects where we are also thinking at this point to invest and see if better meaning can be derived for users. Again, hitting the point of reliability and ensuring users are able to use AI in a more integrated fashion and get insights from the data consistently.
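A minimal sketch of that "AI agent for Spark" idea: bundle the failed job's log tail, Spark configs, and version into one prompt and ask an LLM whether it looks like a config, capacity, or networking problem. It assumes an OpenAI-compatible endpoint; the model name, paths, and prompt wording are hypothetical, and this is not Lyft's productionized tooling.

```python
# Hypothetical Spark failure triage via an LLM (OpenAI-compatible client).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def diagnose_spark_failure(log_path: str, spark_confs: dict, spark_version: str) -> str:
    with open(log_path) as f:
        log_tail = f.read()[-20_000:]  # keep the prompt within context limits

    conf_text = "\n".join(f"{k}={v}" for k, v in sorted(spark_confs.items()))
    prompt = (
        f"A Spark {spark_version} job failed after running for a while.\n"
        f"Spark configs:\n{conf_text}\n\n"
        f"Tail of the driver log:\n{log_tail}\n\n"
        "Was this most likely a user config issue, a platform/capacity issue, "
        "or a networking issue? Suggest one concrete next step."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Example (hypothetical inputs):
# print(diagnose_spark_failure("/tmp/driver.log", {"spark.executor.memory": "8g"}, "3.5.0"))
```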

Benjamin (16:52.547)

Right. Okay. Super interesting. But then in terms of your overall strategy on adopting, let's say, a new vendor somewhere within your data stack, right, how does AI change your buying decisions as well? Would you say you have new requirements of platforms in terms of explainability, telemetry, all of these things in terms of feature sets? I'd be super interested in how you think about that as you bring in new technology. It doesn't even have to be commercial, right? Same thing with an open source project you're bringing in, whether your lens on that is shifting at all.

Ritesh Varyani (17:31.926)

Yeah, so 100% yes. It does not mean every decision for a vendor product revolves around AI, but it becomes an important factor for us to consider. And to be honest, with all the vendor products that you go out to research today, everybody will give you an AI part in their system, because people across the industry are really, really passionate and driven, and from the business perspective as well they want to do something in that space. Now, does that mean that when you look out for a vendor product, you are, yeah, sorry. Yeah, so does that mean when you're looking for a vendor product, you're only interested in what AI work they are doing? No, but it becomes, I would say, part of your RFP package, to really understand, hey, do you,

Benjamin (18:13.413)

Sorry.

Ritesh Varyani (18:30.122)

Let's say, do you give any AI debuggability aspect? Do you give your own semantic layer? Or do you provide a few other features that allow for data discovery in a much better way, as an example, rather than a simple search? Where that integration lies and how fruitful it is, you need to test it to see, to be honest.

Benjamin (18:51.843)

Right, okay, that's super interesting. And then, kind of, yeah.

Benjamin (19:01.838)

In terms of your open source strategy, right, and how that relates to AI, is that something that you feel is becoming more important? Because you also adopted some managed offerings, right, and moved a bit away from being fully built around open source, but you're, of course, also contributing to Trino. I'd be really interested in how that is changing, and also how your perspective on open source has maybe been changing over time, whether AI actually plays a role in that, or whether that's completely unrelated. Would love to learn more there.

Ritesh Varyani (19:33.422)

So a couple of things come to my mind around this. Lyft has also evolved in all of this time. It's a question of really prioritizing what work is the most business-critical work, and how you go 100% behind it and deliver it for your customers. The second aspect then becomes: how do you accelerate that with AI?

So rather than thinking of this as an AI versus open source thing, it's a question of, hey, this is our goal. We want to modernize the platform. We want to build a platform that serves the next three to five years, or the next generation, for Lyft. How do we get there, given what we see, the current and the short-term future developments that are happening in the industry? Open source investments are still very heavy from our side; we do and want to keep contributing. But at the same time, we want to really understand where the business is going and ensure that our investments are in alignment with that.

So rather than viewing this as a conversation of AI versus open source, it would be more around, hey, according to our business strategy, where does it make sense to invest in the long term, and how do we fulfill it with AI in the picture?

Benjamin (21:05.293)

That's super interesting. And then one thing I actually see, if you think through it, right, is: where does your engineering time go as a team? If you look at it at the portfolio level, fundamentally more and more of that time is probably now going into unlocking value through AI, right? Because all of a sudden you woke up one morning, you have a new tool in your toolbox, and you can build things that weren't possible for you before that unlock different types of value, and that eats into other capacity you might have, right? Like, for example, evolving your own data platform or contributing back to the core Trino query engine. Is this basically something you're then trying to balance in your overall strategy, to also figure out how you can get more cycles to actually focus engineering time on your AI strategy?

Ritesh Varyani (22:00.46)

So the short answer to this is not everybody is working on AI initiatives at this point. And to be honest, it won't make sense if everybody did.

Benjamin (22:09.731)

Right, of course. You have a business to run, right? Customers to serve, kind of, yeah.

Ritesh Varyani (22:15.854)

Exactly. So where does it make sense? We as a company have taken a public shift as well towards being an AI-native company. So if it makes sense according to our business strategy, if it falls in that particular purview, if it aligns with it, then obviously we go and invest. In the initial phase of this, it was about: if you are the one who is going to take on the initiative, probably spend a few hours outside of what you're already working on. That is how you will discover AI and the tooling for it, and then you can probably get engineering leadership buy-in, if it makes sense to support it. Lyft overall is both a bottom-up and a top-down company; we need to align with business initiatives, but we are very much a bottom-up company in that way. So if I find some value in an experiment that I did, a few hours over a weekend or a couple of weekends, that can help towards the larger goal, then okay, what's the larger goal for the data platform? It's the reliability space, it's making better sense of the data. Okay, what AI initiatives can you bring in for it? Does it make sense? Do a POC, let's see how much it succeeds, what the success metrics look like, whether it will fly in production. So you go through those cycles on your own, and then all of a sudden it can very easily get funded in your planning cycle, because you're able to show that it aligns well with what we want to do, and it's the future. That's how it plays out in the space, and that works out very well. The second aspect is that leadership has also recognized it already, which is where the public shift of being an AI-native company comes from.

Ritesh Varyani (24:15.438)

Currently we are in the phase wherein we have some dedicated use cases all across Lyft that have AI tooling, AI chatbots, AI agents that are already in production and working. So if it aligns, again coming back to the same point, if it aligns with the strategy, if you're able to show results, if you have the passion to work on it as an engineer, because it's a difficult space to get started with. And the second aspect to that is it's changing a lot.

Benjamin (24:40.367)

Right.

Ritesh Varyani (24:43.822)

So you need to stay abreast of everything that's happening in the space as much as possible, so that you are not left behind as an org, as a company, or as a software engineer yourself. So if you are passionate enough, if you find the space, there's obviously encouragement to go and see. If you see the results, we'll fund it.

Benjamin (25:08.261)

Okay, very cool. Interesting. Yeah, it's super fascinating, right, to see these very large infrastructure stacks basically now adapting to the transformation. I mean, it's incredibly difficult to run these types of data platforms, right? And there are very few companies that actually have workloads at that scale. And then all of a sudden you wake up one morning, there's new technology, and you have to figure out how to, yeah, have it bring value within your overall stack. That's such an interesting transformation and seems like a massively exciting engineering challenge to be involved with as well.

Ritesh Varyani (25:44.672)

It is, it is. And I mean, a few of the teams were very open to the AI space. We have a deep integration with our agentic framework within our ML space as well; the ML org is doing pretty well on that particular side of deeply integrating and providing AI. And it's been a company-wide initiative now to ensure that it's deeply integrated with the developer tooling, deeply integrated within your GitHub PRs, deeply integrated as customer-facing tooling and agents that can help out customers. Not only that, it's deeply integrated in, let's say, a UI tool internally used at Lyft for people to derive meaning from the data that's in front of them.

Benjamin (26:31.439)

Right.

Ritesh Varyani (26:31.928)

So it's these different spaces where people initially invested their own time, off in their own orgs, trying to experiment and see if it works. And now, slowly, what's happening as a company is that it's consolidating into a single direction of providing different kinds of models, so that you are easily able to integrate, easily able to use an LLM prompt, get the result you want, and showcase it to the end user. You are then less worried about: how do I establish the connectivity to those models? How do I get the business subscription? How do I get through the entire approval and procurement process? All of that gets consolidated now. And what you can focus on is, hey, I just want to integrate this and test and measure whether it provides value to my customers.

Benjamin (27:30.063)

Right, okay, that's super exciting. Nice, very cool. kind of closing out, you think ahead, so 2026 is coming up, we're recording this on the 9th of December. What are you excited about next year? Kind of like at Lyft and the data space in general. Yeah, kind of what's on your mind?

Ritesh Varyani (27:48.098)

So two things. Point number one is how we are able to work on unification of our data stack, so that we can serve our customers better and not go towards an approach that becomes harder to manage, maintain, upgrade, and keep reliable for our customers. The second aspect is to use AI wherever it's pertinent within our data platform space to serve our customers better: add that layer of faster iteration and development, right from our engineers within the company through to the data space, so you are able to derive better meaning from the data that you see in front of you.

Benjamin (28:37.189)

Okay. Nice. Yeah. That's cool, that's exciting. Well, let's catch up, let's do another episode a year from now and see how your stack evolved. That would be amazing to learn about. It was really super awesome having you on the podcast today; I learned a lot about the stack at Lyft, and I'm sure our listeners did as well. The work you folks are doing is super impressive. Any closing words from your side?

Ritesh Varyani (29:03.534)

We are hiring for the positions, so please take a look at our website. The positions are very much up to date on the website, so if you're interested in joining Lyft, please do apply. Thank you, Benjamin, for the time. It was amazing collaborating with you and talking about all these stacks between companies.

Benjamin (29:25.763)

Yeah, amazing. Love it. It seems like a great team to work with. So if you're on the job market, go apply to be a part of Ritesh's team or the data org at Lyft. And then, yep. See you around. Awesome.

Ritesh Varyani (29:37.774)

See you, have a good one.

