October 20, 2025

Block Bad Data Before the Write with Nike’s Ashok Singamaneni


Nike’s Principal Data Engineer Ashok Singamaneni joins Benjamin and Eldad to discuss his open-source data quality framework, Spark Expectations. Ashok explains how the tool, which was inspired by Databricks DLT Expectations, shifts data quality checks to before the data is written to a final table. This proactive approach uses row-level, aggregation-level, and query-level data quality checks to fail jobs, drop bad records, or alert teams, ultimately saving huge costs on recompute and engineering effort in mission-critical data pipelines.

Listen on Spotify or Apple Podcasts

[00:00:00] Ashok:  DLT Expectations gave an idea to the industry that you can do data quality before actually writing the data into your final tables. As the scale of the product increases, it becomes even more difficult for us to find exactly where the issue went wrong, like, even if a production job fails.

[00:00:18] Benjamin:  Ashok is a principal data engineer at Nike, and he worked on a variety of cool, actually, open source projects as part of his work there.

[00:00:27] Ashok:  I think over the time, in my experience, what I learned is this ingestion layer and the transformation layer, you should treat that as a software product, not like a data engineering product.

[00:00:39] Benjamin:  Hi. This is Benjamin. Before we start with today's episode, I wanted to quickly reach out on a personal note. We've just launched Firebolt Core. Firebolt Core is the free self-hosted version of our query engine. You can run Core anywhere you want, from your laptop to your on-prem data center to public cloud environments. Core scales out, and you can run it in a multi-node configuration. And best of all, it's free forever and has no usage limits. So you can run as many queries as you want and process as much data as you want. Core is great for running either big data ELT jobs on, for example, Iceberg tables, or powering high-concurrency customer-facing analytics on big datasets. We'd love for you to give it a spin and send us feedback. You can either join our Discord, enter our GitHub discussions, or you can just shoot me an email at Benjamin@Firebolt.io. We'd love to hear from you. We added a link to Firebolt Core's GitHub repository to the show notes. And with that, let's jump straight into today's episode. Hi, everyone, and welcome back to the Data Engineering Show.

Today, Eldad and I are actually in person in Munich together, which is very nice. He's real, same room. We're both real. Exactly. And we're super happy to have Ashok on. Ashok is a principal data engineer at Nike, and he worked on a variety of cool, actually, open source projects as part of his work there. So one is called BrickFlow, and one is called Spark Expectations. Excited to have you on the show today, Ashok. Do you wanna quickly introduce yourself? Tell us about your background, how you got into data.

[00:02:04] Ashok:  Thank you. I am Ashok Singamaneni, and this might be interesting: I'm a mechanical engineer. I've done mechanical engineering, basically. And then I switched to data, like, probably twelve years ago, and it's been a long journey, and I've done my master's.

[00:02:21] Eldad:  The industrial revolution ends with data, unfortunately.

[00:02:24] Ashok:  Yeah. So I figured probably when I read the MapReduce paper, that's when it clicked that I think this is going to be good. And then Cloudera, Hortonworks, and all of that happened, and I read some books. And then I did my master's and did my thesis on big data analytics on vehicle data, and then later moved into the banking industry, then worked in health care, and now the retail domain. So it's been an interesting journey.

[00:02:52] Benjamin:  And so after twelve years of data, do you ever think back? Are you like, hey, kind of, in a couple of years, I wanna become a mechanical engineer again? Or are you, like, kind of

[00:03:01] Ashok:  No. I think it's a perfect blend right now. Look at Tesla. You can be a mechanical engineer as well as also be in software. And, also, you can be in robotics as well as in software and data.

[00:03:11] Benjamin:  That's awesome. Cool. So at Nike, right, you worked on both these kind of BrickFlow and Spark Expectations framework. Tell us a bit, like, how did those projects start? Kind of why did you decide to actually open source them? What are they all about? We'd love to learn more.

[00:03:27] Ashok:  Sure. I think coming to Spark Expectations, right, being in the industry for so long, I have been in a lot of production calls and misfires happening in production because of data changes, unexpected column changes, etcetera, upstream or downstream, and have been through a lot of recompute of the data. The majority of the time, this all happens because of data quality issues, which can be removed upfront while the data is being processed, as well as some processing or planning mistakes that happen regularly, and that's kind of common. But when I looked at the Databricks DLT pipelines that were released a while ago, they introduced DLT Expectations, which is kind of interesting, because if you have seen Great Expectations, Great Expectations is a tool that actually gives you a data quality report post-processing of the data. After you process the data, you will see the data quality report that you can run on the data, and the report is really fantastic. But DLT Expectations gave an idea to the industry that you can do data quality before actually writing the data into your final tables. And then I reached out to the Databricks team to see if we could work on something related to Spark as well, because we know that many, many companies are using Spark for data processing and transformations. If this can be available in Spark, that would be great. But the timelines didn't work out, so I started working on a project named along the same lines as DLT Expectations. Since this project is related to Spark, I named it Spark Expectations. And it has more functionality than DLT right now, but it is only supported for Spark. I wish that it supported all the frameworks like Pandas and other libraries as well, but right now we are supporting Spark. And what this does is, before you write the data into your final layers, the data quality checks happen and restrict the data that is not up to the data quality standards that you want, so that you won't need to recompute, and you will have alerts on the data that has gone bad, and then you'll be able to alert your upstream teams that this data is bad and needs to be resent and reprocessed.

[00:05:44] Benjamin:  How do you define these expectations? Is it like a YAML file, a JSON file? Like, take us through that part.

[00:05:50] Ashok:  I think it's up to the teams which are implementing, because right now the rules that Spark Expectations expects are provided as part of a DataFrame. So you can write it as a YAML file, a JSON file, or might as well have that in a table, and load that as a DataFrame and give it as an input to Spark Expectations. And there are also, like, three kinds of rules that you can provide: row-level data quality checks, table-level aggregation-level data quality checks, as well as query DQ, which we use for referential integrity checks or if you want to have, like, some complex rules that you can run on data quality.
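
To make that concrete, here is a minimal sketch of what such a rules DataFrame could look like in PySpark. The column names (rule_type, expectation, action_if_failed) and the example rules are illustrative assumptions for this write-up, not the exact schema Spark Expectations ships with:

```python
# Illustrative only: column names and rule values are assumptions, not the
# actual Spark Expectations schema. The point is that rules live in a DataFrame
# that can be loaded from YAML, JSON, or a table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rules = [
    # Row-level check: evaluated against every record.
    ("row_dq", "product_id_not_null", "product_id IS NOT NULL", "drop"),
    # Aggregation-level check: evaluated against the table as a whole.
    ("agg_dq", "minimum_sales", "SUM(sale_amount) > 100000", "ignore"),
    # Query-level check: e.g. a referential-integrity style rule across tables.
    ("query_dq", "orders_have_products",
     "(SELECT COUNT(*) FROM orders o LEFT ANTI JOIN products p "
     "ON o.product_id = p.product_id) = 0",
     "fail"),
]

rules_df = spark.createDataFrame(
    rules, ["rule_type", "rule_name", "expectation", "action_if_failed"]
)
rules_df.show(truncate=False)
```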

[00:06:27] Benjamin:  Okay. And then kind of that runs basically before I do any of my batch ELT jobs? Does an initial pass over the data, makes sure, yep, everything kind of looks great before I start processing a lot of

[00:06:39] Eldad:  like, is it sampling the data? Is it running the previous snapshot?

[00:06:44] Ashok:  So there are, like, different layers in Spark Expectations, right? So when you load the data, let's say, for example, if I'm ingesting the data and transforming it, when I load the data into the DataFrame, like the initial DataFrame when I load it from my source table, you can run source-level query DQ or aggregation-level checks that you want to do. And then you can run data quality or row DQ checks on each column. Let's say the date has to be greater than this level, or any other validations that you want to do on a particular column, you'll be able to do. It's not sampling; it runs on the whole dataset. It gathers all the information. And you can also have conditions like ignore, drop, and fail the job. If a rule fails, you can ignore it, but still alert that this rule is a soft failure: it failed, but you are still getting an alert. Or you can drop the record. Like, let's say a product table doesn't have a product ID or the product ID configuration is wrong, then that is something serious that should not be there in the data. So you drop that record altogether and put it in an error table, and give an alert to the engineering team that there is some error in the error table they can look at. And you can even fail the job. If it's mission critical, then you fail the job, don't process the data, and don't put that data into the final table, so that you don't need to recompute that again.
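
As a rough illustration of those three outcomes, the hypothetical helper below splits a DataFrame on a row-level rule and either alerts, quarantines the bad rows in an error table, or fails the job. It is a simplified sketch of the behavior Ashok describes, not code from the Spark Expectations library:

```python
# Hypothetical helper, not the Spark Expectations implementation.
from pyspark.sql import DataFrame

def apply_row_rule(df: DataFrame, expectation: str, action: str, error_table: str) -> DataFrame:
    """Split rows on a SQL expectation and apply an ignore/drop/fail action."""
    good = df.filter(expectation)
    bad = df.filter(f"NOT ({expectation})")
    failed = bad.count()

    if failed > 0:
        if action == "fail":
            # Mission-critical rule: stop the job so nothing lands in the final table.
            raise RuntimeError(f"{failed} records failed rule: {expectation}")
        if action == "drop":
            # Quarantine bad records in an error table and keep only the good ones.
            bad.write.mode("append").saveAsTable(error_table)
            return good
        # action == "ignore": soft failure, keep all rows but surface an alert.
        print(f"Soft failure: {failed} records failed rule: {expectation}")
    return df
```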

[00:08:04] Benjamin:  But then for me to understand this a bit better, like, does this actually run as a hook within the ELT job itself? So that, basically, every time I start scanning a table anyway for my big batch processing, I also run the data quality check. It's not like I have to scan the table twice now, right? Like, that I will first pass over all the data, which is really expensive, run the Spark Expectations checks, throw everything away, and only when that looks good, run the batch ELT. It basically overlaps. Is that accurate?

[00:08:35] Ashok:  Yes. I think we use the decorator pattern in Python for this. So whenever you tag the Spark Expectations decorator, there is something called with_expectations. If you tag that on your function and your function returns a DataFrame, then on that DataFrame, whatever rules you define, it runs all of them and generates the report.
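
Conceptually, that looks something like the sketch below: a decorator wraps a function that returns a DataFrame, runs the configured rules against it, and only then writes to the target table. The real with_expectations decorator in Spark Expectations takes more configuration than this; the code reuses the hypothetical rules_df and apply_row_rule from the earlier sketches, and the table names are made up:

```python
# Simplified, assumed shape of the decorator pattern; not the library's actual API.
import functools
from pyspark.sql import DataFrame

def with_expectations(rules_df: DataFrame, target_table: str):
    """Illustrative decorator: validate the returned DataFrame before writing it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)  # the user's own transformation logic
            # Run each configured row-level rule against the returned DataFrame.
            for rule in rules_df.filter("rule_type = 'row_dq'").collect():
                df = apply_row_rule(df, rule["expectation"], rule["action_if_failed"],
                                    error_table=f"{target_table}_error")
            # Only data that survived the checks reaches the final table.
            df.write.mode("append").saveAsTable(target_table)
            return df
        return wrapper
    return decorator

@with_expectations(rules_df, target_table="gold.orders")
def build_orders() -> DataFrame:
    return spark.table("silver.orders").where("order_date >= current_date() - 7")
```

The appeal of the pattern is that the quality gate and the write live in one place, so a pipeline cannot accidentally write to the final table without the checks running first.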

[00:08:54] Benjamin:  Very cool. What have you seen in terms of the, like, overhead this actually introduces for, like, bigger batch ELT jobs? Is it noticeable? Is it very fast? Take us through that.

[00:09:04] Ashok:  Yeah. I think the row DQ checks that happen are very fast. They run as pretty standard checks at scale. But, obviously, definitely, there will be an overhead. So you wouldn't want to put this on all the layers and all the jobs. You only want to put it at the final layer or the final write step where you are actually writing the data, and that is, like, the mission-critical production data that you want to have it on. And for the query DQ, as well as aggregation DQ, you need to be careful and optimize it well so that the time is not heavy. Like, you need to optimize your queries, obviously, like any Spark job, etcetera. But that is an overhead for sure. Like, if you're running data quality checks every day on top of, like, a huge scale of data, then obviously you'll have one. But it's also possible in streaming too. Like, if you're doing, like, a micro batch, then you would be able to do that on the micro batch too.
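
For the streaming case, one common way to get per-micro-batch behavior like this is Spark Structured Streaming's foreachBatch hook, sketched here with the same hypothetical helper and made-up table names as above; the actual Spark Expectations streaming integration may differ:

```python
# Assumed wiring: run the row-level check inside foreachBatch before each write.
from pyspark.sql import DataFrame

def validate_and_write(micro_batch_df: DataFrame, batch_id: int) -> None:
    # Run the same row-level check on every micro-batch before it reaches the sink.
    checked = apply_row_rule(micro_batch_df, "product_id IS NOT NULL", "drop",
                             error_table="gold.orders_error")
    checked.write.mode("append").saveAsTable("gold.orders")

(spark.readStream.table("silver.orders_stream")
      .writeStream
      .foreachBatch(validate_and_write)
      .option("checkpointLocation", "/tmp/checkpoints/orders")
      .start())
```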

[00:09:55] Benjamin:  Okay. And so let's say I have, like, I don't know, a medallion architecture. I go, like, bronze layer, silver layer, gold layer. What you're basically saying is, like, on the write to the gold layer, I would basically hook in Spark Expectations. And at that point, which then connects with the framework and can also error out, I could really make sure that, like, no insert into the gold layer even finishes, um, if there's not a certain kind of data quality check passing.

[00:10:24] Ashok:  Yes. Exactly.

[00:10:26] Benjamin:  Very nice. Now would it

[00:10:27] Eldad:  be fair to kind of compare that to constraints in SQL or kind of that domain assertions constraints?

[00:10:36] Ashok:  Yeah. I think it's something similar to constraints, right? But constraints also have a limitation: you can only put constraints on that particular column's values. Like, this is what this column expects. But Spark Expectations does something more than that. Like, you can have, like, aggregation DQ rules. Like, let's say, for example, the whole count or the sales in that table should be greater than 100,000. If, let's say, there is a big sale that is going on and the sales are, like, less than 10,000, then that doesn't make sense; there is something wrong in the computation, etcetera. Or referential integrity checks: like, if you have, like, multiple tables in the data model and you want to check whether the correlations are happening properly or not, whether the business logic makes sense or not for this data to be there. So those kinds of checks can also be done.
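
To make the contrast with column constraints concrete, here is a hedged sketch of the two kinds of checks Ashok mentions, written as plain Spark SQL against made-up table and column names, reusing the spark session from the earlier sketches:

```python
# Aggregation-level check: total sales across the table should exceed a threshold.
total = spark.sql("SELECT SUM(sale_amount) AS total FROM gold.sales").first()["total"]
if total is None or total < 100_000:
    raise RuntimeError(f"Aggregation DQ failed: total sales {total} is below 100,000")

# Referential-integrity style query check: every order must point at a known product.
orphans = spark.sql("""
    SELECT COUNT(*) AS orphan_count
    FROM gold.orders o
    LEFT ANTI JOIN gold.products p ON o.product_id = p.product_id
""").first()["orphan_count"]
if orphans > 0:
    raise RuntimeError(f"Query DQ failed: {orphans} orders reference unknown products")
```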

[00:11:25] Eldad:  So if someone is, you know, for people who own production-grade pipelines, specifically kind of at those stages where the data is not fully cleansed, right? Like, at the end of the pipeline, everything went through cleansing. It's always safe. There's a value for every product. But as you go closer to where data is being born, it's nastier, right? It's not clean. There's no formal way to do it. So every data engineer I know of has a set of toolboxes based on intuition, experience. Obviously, it depends if they're engineers and they're writing Spark or getting deep into how those queries run. What would you recommend to us? Like, when should we seriously start looking into those frameworks, and kind of what would be your guideline on transitioning to that mindset?

[00:12:17] Ashok:  Yeah. I think over time, in my experience, what I learned is this ingestion layer and the transformation layer, you should treat that as a software product, not like a data engineering product per se. You will have all the data modeling, data governance, data observability, all of those tools that are there. But I think the general mindset should be that the ingestion layer, or the raw and the bronze layer, and the silver layer should be like a software product. You should have all the checks and balances in place, like data quality and unit testing, integration testing. All of those should be in place so that any mishaps that happen are easier to find and debug. Because as the scale of the product increases, it becomes even more difficult for us to find exactly where the issue went wrong. Like, even if a production job fails, it takes time for you to debug and see; there's a lot of human effort also involved, not only the recompute that is happening and the compute costs, but also a couple of engineers and a product person have to be involved for a week or a week and a half to fix the issue. But the final gold layer, where mostly people use SQL, that can be treated as, like, pure data engineering. Like, write SQL, do it fast, build dashboards, break them, and then fix them as much as you can. But, like, the initial two layers are mission critical and have to be treated like a software product. That's at least my experience, what I have seen.

[00:13:49] Benjamin:  Very nice. Take us through your experience, like open sourcing a data project, right? It's kind of like, did you go to some meetups? Kind of like, do you actually know some of your users? We'd love to learn more about that.

[00:14:01] Ashok:  Yeah. I think it's been a fantastic journey for open source. It's my first project to do that, but I have really great mentors. A big shout out to Aditya Chaddhivedi, who's a distinguished engineer at Nike, and also Scott Haines, who was an author at O'Reilly, and other people who have helped me through the journey. My senior director, Joe Hollow, who helped me through the process. These guys have helped me through the process of, like, open source. And, also, there is an open source community effort that is happening at Nike as well. So getting through the approvals and getting the org set up and the project set up and publishing it, it has been great. And I also had an opportunity to talk at the Databricks Data + AI Summit. Um, so networking happened. A couple of engineers also looked at the project and helped out. And I think there is more that can be done from the project side, and I wish there didn't need to be a Spark Expectations. There should be something native to Spark so that anyone can actually use it directly from the native Spark module.

[00:15:06] Benjamin:  Very cool. So how are you now using, like, Cursor or, like, any of these Gen AI tools to actually accelerate kind of progress within your own open source project? Like, are you vibe coding a good-looking UI?

[00:15:20] Ashok:  I think I've been using Cursor and Claude Code since, like, over one and a half years' time. I mean, time flies. It feels like eternity that I've been using them for a long time now. But vibe coding is really good if you know exactly what you're doing, at least from the data engineering standpoint. Like, for the regular software engineering, building websites, etcetera, that can be very quick. But from the data engineering standpoint, we have to put some guidelines in place. I've seen people using Claude Code or Cursor on production data directly when they're building and drop the datasets. So misfires happen a lot if you are using vibe coding directly on data engineering projects; there are guidelines that need to be followed. Use service principals or AWS roles, etcetera, so that they have restrictive permissions when you are using Claude Code or Cursor. That is something that I have learned myself as well. I was working on a small pet project, and I was using SQLMesh. And there were some commands which I didn't know would destroy the whole dataset. And it was a sandbox environment, but it destroyed everything, right? So having those guidelines in place is really critical from the permission standpoint. And as well, from the vibe coding side, I think there are different patterns people are using these days, like creating a PRD document first, having a GitHub issue, and setting up a process for yourself so that there is a journey from what is the problem that you're trying to solve, creating a proper GitHub issue for all the issues that you are working on, and then going through the journey of, like, breaking that data issue into small tasks as checklists in your Claude Code or Cursor, and then checking through each one of them, writing unit tests as you go, and having integration tests, and a human in the loop who has to review the code that has been written. Ultimately, at the end of the day, you are responsible when you're checking in the code. It's not Claude or Cursor that will be blamed if something goes wrong.

[00:17:26] Eldad:  Listening to your thinking, it actually makes perfect sense to have those cleansing and expectation libraries being used more often nowadays, when no one actually knows who's behind each and every change. So as it starts, right, forking, modifying, changing, updating production eventually, setting expectations makes more sense. So I think there's a bright future for, kind of, right away saying, like, focus on the expectations, because we can't control the modifications themselves, so at least we can own the expectations. It will be super interesting to see how this kind of semi-automated yet highly expectation-driven pipeline will turn out.

[00:18:08] Ashok:  Yeah. And it's also the trend that I'm seeing these days from data governance teams and the data governance standpoint across the industry. Data observability and quality are becoming prime because of the AI integrations that are happening. There are tools that are coming out where a CEO or CTO from a company can directly ask questions in natural language and hit the production data and get the data for themselves, rather than someone goes,

[00:18:37] Eldad:  codes it, writes it, generates the reports.

[00:18:40] Ashok:  Yeah. So the leadership is directly looking at the data, and if there is something wrong in the data, then there can be some serious repercussions on the business decisions. So that's one of the reasons why I see the industry moving towards the trend that, rather than having bad data in the tables and then recomputing or reclarifying things, let's not put that data there in the first place, and then redo whatever needs to be done from the upstream and fix the data.

[00:19:10] Benjamin:  Cool. So if you look ahead, like, year to year, what are you most excited about in the data space?

[00:19:17] Ashok:  I think from the data standpoint, I'm looking at how vibe coding improves and the tools that come for vibe coding in the data engineering space exclusively. Because right now, it's all generic patterns that are evolving in the industry. Like, if you look at Reddit or any other blogs, there are certain patterns that people are following for vibe coding UIs, iOS, Android apps, but there is nothing specific that I have seen for the data engineering space, like, this is how you exactly do vibe coding. I think I'm more excited about that, like, how that can evolve as a tool or a framework. If there can be something like a spec kit, like GitHub released recently, on, like, how to do vibe coding. But if there is something like that for data engineering, that could be awesome. I would like to collab on that. Nice. Awesome.

[00:20:10] Benjamin:  So to all our listeners, if you want to work with Ashok on vibe coding for data quality and data engineering, reach out to him. It was so great having you on the show. Seriously, kind of, thanks for sharing your experiences. It's always exciting to learn about new open source projects in the data space. So thank you for being on.

[00:20:29] Ashok:  Thank you. Thanks for having me. Have a good day.

[00:20:32] Unknown:  The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for AI apps and low latency analytics. Get your free credits and start your trial at firebolt.io.
