How Zendesk manages customer-facing data applications


This time on the data engineering show, Eldad abandoned his brother Boaz but it’s ok because Boaz got the full 30 minutes to talk to one of the most interesting people in the data space.

Ananth Packkildurai is Principal Software Engineer at Zendesk and runs one of the strongest newsletters in data – Data Engineering Weekly. He talked about data applications at Zendesk and how they’re built, technologies that excite him like data lineage and data catalog, and the best routes for software engineers to get their hands dirty in the data world.

INTERVIEWER: Boaz Farkash.

ZENDESK GUEST: Ananth Packkildurai - Principal Software Engineer.


Boaz: Welcome everybody! Thank you for joining us for another episode of the Data Engineering Show. Today, sitting next to me is nobody. Eldad, my co-host and brother, disappointed me. He could not make it today, and he trusted me. Ananth is here with us. Help me welcome Ananth Packkildurai, who is a Principal Software Engineer at Zendesk and previously worked at Slack for 4 years. He has a bunch of very interesting stories for us today, from both. Ananth is also somewhat of a veteran in the industry: he has been in software for a long time and moved into data many years ago. I would like to talk about that as well. But Ananth is also a mini-celebrity in the data space. He runs the Data Engineering Weekly newsletter. Tell us about that. How many subscribers do you have today?

Ananth: Okay. Great! First-of-all, thank you so much for having me. It is really amazing to talk about data all the time.

Boaz: Thank you for joining.

Ananth: Subscribers, I think we just crossed 1700 or so. I think we are around that mark.

Boaz: Nice. If you are not subscribed to Ananth's newsletter, Data Engineering Weekly, you have to be. If you are in data, you have to. It is a must-read. I follow it. Everything you need to know is in there, so look it up. Are you by yourself, or do you have a lot of people involved there? How does it work?

Ananth: Thank you so much for the kind words, first of all. No, it is all on my own. The background of why I started: towards the end of my time at Slack, I started working on our observability and monitoring infrastructure, essentially applying data principles to the monitoring stack. I started to feel I was missing out. There was a pretty good data engineering weekly newsletter before that, which got stopped. That was my go-to source of information before everything else, and that is how my mornings started; I had a special alert to read that newsletter. When it stopped, I felt that I was going to miss out on something, so I thought, okay, I am going to read something, and I just started writing. So it came out of my own learning, and I still consider this a good way to learn. I would encourage everyone to write about what they learn, because that is how you give structure to your own learning. So, it is really, really amazing. Anyone listening, please start your own data engineering newsletter.

Boaz: But I am sure it has been getting tougher and tougher, because there is much more content out there to curate from. I think it is time you found help with that newsletter. It is too much on your shoulders.

Ananth: Yeah, that is true. Last week, when I was curating, I had at least 18 articles shortlisted. I do not have time to write about all of them, so I shortlisted 9 or 10 items. For anyone viewing this, we already have a Data Engineering Weekly GitHub link: if anyone reads an article and likes it, they can create a pull request and we can add it to the newsletter. So, there is also a way for more community contribution to the data engineering newsletter.

Boaz: Nice. Awesome! Ananth, I would love to hear about your personal career. You started from software and over time, by me looking at your history, it seems that data crept in slowly, slowly, until eventually, you could say, took over. So walk us through that. So when did data become such an important part of your career? What was the tipping point?

Ananth: Yeah, I started my career in backend engineering, mostly writing code in Java and the like. It is funny, because when I was starting my career, data warehouses were not that well known. In most organizations, there was very little visibility into them, and the work was often viewed as just using some tool, maybe SSIS packages or something similar. I thought, I will never go into that, it is not worth my time; that was my thinking at the beginning of my career. I would say I got carried away and pulled into the whole big data wave, the buzzwords, so I started looking at developing systems in Hadoop as part of an experiment in one of my jobs. I was looking at Hadoop 0.2 or something like that at the time. It was a very early stage, and that was the first time I was reading a lot about the MapReduce pattern. I realized that most analytics problems can be solved if you start to think in MapReduce patterns. So I think that was the pulling point. At that point the work was still more backend engineering, because you had to write a bunch of Java code to build any kind of analytical ecosystem. I just jumped on the bandwagon of that hype and then slowly realized, okay, this is what we are going back to; I had just reinvented the wheel. So I went back and learned about data warehouse concepts, and I took a course from the Kimball Group. I think that was the last training program he gave, so it was very fascinating to learn. You go back, learn the basics, and then continue from there.

Boaz: I think it is interesting, you mentioned that SSIS part and if we are to be completely honest, I think in the past, software engineers used to, to some extent, even look down at data technologists, thinking it is a different position. That is not for us, you deal with your data warehouse. Whereas now, it has become such a big deal, and most of our engineers are actually looking to be more involved with data technologies. It has become one of the most interesting parts of software engineering in general. Thanks for sharing!

Let us start with Zendesk. You have been there for a year and a half. What did you come to do there? or what do you do at Zendesk?

Ananth: Yeah, Zendesk is a very interesting company, a leader in customer experience platforms and, essentially, support platforms. What I am trying to do right now is evolve our customer-facing analytical platform. Zendesk as a company has grown by acquiring more companies, so there are disparate systems involved, but what our clients want to see is a unified view of how the customer interacts with their sales system and their support ticket system. And not only do they want to view the customer experience, they also want to go to a deeper level: "My product catalog is there. I want to send it to you, and you tell me which product is doing well or not." They want more than simple support ticket analytics. So my primary goal when I started at Zendesk was to build a data platform and analytics solution to address that particular problem. As with any data infrastructure, when you start to solve the last-mile problem, you realize the problem actually exists upstream: How do we make sure that we instrument data properly? How can we enable scalable analytics systems on top of it? Right now, I am playing a bridge role to make sure that we are gathering sufficient data, building more domain ownership around it, and then building scalable solutions.

Boaz: Zendesk has been around for many years. I, myself, have been a client of Zendesk for years. For a company that has been around that long, I wonder what the data stack looks like. How much is legacy versus modernized? How do you go about modernizing a stack, and how has it evolved throughout the years?

Ananth: Yeah, to my surprise, I would say Zendesk has some legacy. We have an analytical system on MongoDB that is actually still running. There are some systems like that, but surprisingly, for the most part, the technologies are pretty much up to date. We have some systems using Google BigQuery: all our enterprise analytics runs on Google BigQuery and Cloud Storage. And we recently started to adopt Apache Hudi to build out this use case.

Boaz: Are you guys running multi-cloud or is everything on GCP?

Ananth: Yeah! So our enterprise analytics actually runs on Google Cloud. Our product analytics runs on AWS; Zendesk took on products from various different companies at some point in time, and you could call that legacy. We have two different cloud services for sure.

Boaz: Walk us through some of the more challenging use cases and workloads currently in place that you are involved with. How far along is that new product you described?

Ananth: For unifying the system.

Boaz: Yes.

Ananth: We are just barely scratching the surface. I am looking at this problem not necessarily from the perspective of unifying the cloud services, but of how we do data sharing between these two disparate services. The challenge, or the complaint from stakeholders, is not necessarily that we are running two different cloud services, because the two systems are atomic in nature and they are doing a fine job. The problem is: if I share data from AWS to Google Cloud, is there any context that we are missing? Is there any lineage that we are missing? How does the consumer on the other side trust whatever you are sending? The business logic exists on one side, you just send the derived data set to the other cloud, and the context goes missing in between. So we kick-started this whole data lineage and data catalog project to build the full bridge, establish the full context, and give the consumer side more and more understanding. I hope at some point we will be able to merge these cloud services, but that is a large project to take on.

Boaz: That was my next question. Has consolidating into one cloud been on the table?

Ananth: No, not now, at least. It works, but I think adding additional context right now will give us much more visibility to what is going on.

Boaz: Tell us how the data-related teams are structured at Zendesk? What kind of teams are there? How big are they? How are the responsibilities split?

Ananth: Yeah, that's a good question. There is a common pattern I started to see at Slack and then at Zendesk. We have a product analytics team that focuses on instrumenting data from our product usage, collecting the data, and building on it; mostly they end up using a kind of Lakehouse architecture. Then there is a whole bunch of business operations analytics, sales analytics, and marketing analytics, and these analytics teams have self-contained data engineers and data platforms to support the business operations side. The other teams focus, in most cases, on the SaaS application, where you have to deliver customer-facing analytics on top of it. That requires more coding than a SQL kind of workload, so they are a separate team. Zendesk data teams are organized along these three business orientations.

Boaz: What data volumes are you guys dealing with?

Ananth: I do not have a number top of mind, but I think we all log everything all the time. So I do not have a very finite number, because it is not like one stream of data that we have been consuming.

Boaz: Yeah! Let us say customer-facing stuff, for example. How is that managed? Are these dev teams at the end with customized UI running the show, applying it into APIs, to run queries? or Is it more of an embedded analytics solution? What is going on there on the customer-facing workloads?

Ananth: That is a good question. The way we look at customer-facing analytics, it is more of an application. It is a product on its own. We have a product manager who defines the SLA that will be applied there. We source information from our customers, and then we build those data pipelines over time. Largely, customer-facing analytics involves more backend engineering than pure data engineering. I am slowly introducing data pipelining concepts to the team, but it is very backend-engineering focused.

Boaz: What is the query engine behind the scenes?

Ananth: It is interesting. Right now we have two query engines: we are running Redshift and Postgres, which we keep running for historical reasons. There is a large project going on right now to unify the data store across both real-time and batch infrastructure for customer-facing analytics. So we are looking at that particular solution.

Boaz: Yeah. I think trying to combine historical and real-time is something humans cannot stop trying to crack. What is the solution? Any selected approaches already?

Ananth: I mean, we are looking at various solutions right now. Potential contenders are Druid or ClickHouse, and Pinot is one of the exciting ones.

Boaz: Which is the last one?

Ananth: Pinot.

Boaz: Pinot, yeah

Ananth: It is a really exciting one. The question right now is that in our real-time path we do a lot of joins on the stream side. So I am debating back and forth: join in the stream versus join in the database. I like joining in the database now, because that is what a database is supposed to do. We can always do optimization on the stream. But why would you want to reinvent the wheel in your stream processing? Enrichment, yes! But do you need to build a full-blown join system? Those are the interesting concepts we are exploring right now as we unify our system. Hopefully, we will do something.
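To make the stream-versus-database trade-off concrete, here is a minimal sketch of the two options on the same data. It is not Zendesk's pipeline: the tables, field names, and the use of an in-memory SQLite database as the "analytics store" are all assumptions chosen only for illustration.

```python
import sqlite3

# Hypothetical reference data and raw events (illustrative names only).
agents = {1: "Ada", 2: "Lin"}
events = [
    {"ticket_id": 100, "agent_id": 1, "status": "open"},
    {"ticket_id": 101, "agent_id": 2, "status": "solved"},
]


def enrich_in_stream(event, lookup):
    """Stream-side join: copy the agent name onto the event before it is stored."""
    return {**event, "agent_name": lookup.get(event["agent_id"])}


# Option 1: every event is widened while it is in flight.
stream_joined = [enrich_in_stream(e, agents) for e in events]

# Option 2: store raw events and reference data separately, join at query time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE agents (agent_id INTEGER PRIMARY KEY, agent_name TEXT)")
db.execute("CREATE TABLE ticket_events (ticket_id INTEGER, agent_id INTEGER, status TEXT)")
db.executemany("INSERT INTO agents VALUES (?, ?)", list(agents.items()))
db.executemany(
    "INSERT INTO ticket_events VALUES (:ticket_id, :agent_id, :status)", events
)
db_joined = db.execute(
    "SELECT e.ticket_id, e.status, a.agent_name "
    "FROM ticket_events e JOIN agents a USING (agent_id) "
    "ORDER BY e.ticket_id"
).fetchall()
```

Both paths produce the same answer here. The difference is where the work and the state live: the stream-side version has to keep the lookup data available to the stream processor, while the database-side version leaves the join to the query engine.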

Boaz: You mentioned you are looking at Pinot as well. I ran into a piece you wrote in your Slack days about Pinot; you adopted it over there as well. Was it over Druid or something? Tell us about that project back then.

Ananth: Yeah, Druid vs Pinot. Some of these things might no longer be relevant, because every system improves over time. I think most of the real-time systems came out of ad-serving use cases: an ad-serving engine or ad-serving capabilities. Druid was kind of an ad engine solution, and ClickHouse, to an extent, the same thing. The data in those systems is mostly immutable in nature. Events are always immutable; there is no upsert operation that you want to do. You just see the event and continuously run some time series over a period of time. That kind of solution works really well there. But in companies like Zendesk or Slack, the SaaS application fundamentally tries to solve a workflow over a period of time, right? Tickets are created, tickets are deleted, and a ticket goes through its full lifecycle. We are trying to capture an object and produce insight into the object's lifecycle. So we want to produce the current state of the object, and we also want a historical view of how the data changed. When we want to preserve the current state of an object, the upsert operation becomes absolutely crucial. Pinot does support upserts and they work reasonably well, but it is still kind of an afterthought, because Pinot added upsert operations to solve a use case for Uber Eats, which is a similar business process engine. Maybe that is why we have not been able to bridge real-time and batch: that would mean building a real-time system from the ground up that supports both mutability and immutability. That could be one reason, I do not know.
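The distinction Ananth draws between an append-only event history and an upserted "current state" can be sketched in a few lines. This is a toy model under stated assumptions, not how Pinot or any other engine implements upserts: the event shape and the rule "the latest timestamp wins" are hypothetical.

```python
from collections import defaultdict

# Hypothetical ticket lifecycle events (illustrative fields only).
events = [
    {"ticket_id": 1, "ts": 1, "status": "new"},
    {"ticket_id": 2, "ts": 2, "status": "new"},
    {"ticket_id": 1, "ts": 3, "status": "open"},
    {"ticket_id": 1, "ts": 4, "status": "solved"},
]

history = defaultdict(list)  # immutable view: every event is appended and kept
current = {}                 # upsert view: one row per ticket, latest event wins

for ev in events:
    # Append-only path: nothing is ever overwritten.
    history[ev["ticket_id"]].append(ev["status"])

    # Upsert path: insert if the key is new, otherwise replace the older row.
    prev = current.get(ev["ticket_id"])
    if prev is None or ev["ts"] >= prev["ts"]:
        current[ev["ticket_id"]] = ev
```

After the loop, `history[1]` holds the full lifecycle of ticket 1, while `current[1]` holds only its latest state. A store built purely for immutable events gives you the first view cheaply; the second view is the one that needs upsert support.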

Boaz: Yeah, this is a tough nut to crack.

Ananth: Yeah.

Boaz: You joined Slack back in 2016 and you were there for 4 years or so. How different was the data stack at Slack when you joined versus when you left? Because I know you pretty much helped build it from the ground up.

Ananth: At least at the point when I left, it had predominantly remained the same. One thing, I do not know if it is a good thing or a bad thing: the early people involved in the Slack data infrastructure came with very good previous experience building data infrastructure at scale, and from the get-go we chose tools that were well supported: Kafka, Presto, storing everything in the Parquet format, using Airflow to author our pipelines programmatically, and so on. Also how you do modelling, and using structured logging, using Thrift for logging, not JSON. These are things we got right. We spent less time on detecting whether something is an integer or a string, which is the kind of problem I see more and more companies dealing with. To anyone approaching me asking, how do I build a data platform? I would say first, please do not use JSON as your data format. So these are the things we got really right: programmatically authoring our data pipelines, structured event logging, and so on.

Boaz: Because of this JSON, that was an interesting comment, because all the new data platforms essentially are encouraging people to use JSON. Everybody is releasing JSON first features, JSON capabilities, so you are saying stay away from that! be wary!

Ananth: Yeah. Maybe, I do not know. Maybe they are thinking of a schema registry to solve this problem. That could be one reason. But again, the compiler is good at type checking, so why would you want to type-check after the fact? It is better to do it at compilation time; you can reduce a lot of errors that way. I think most of the challenge at Slack at the time was scalability, because of the rate at which the company grew: every assumption we made was invalidated within three months. So we had to constantly reinvent or scale our system to cope with the volume we were getting. When I started at Slack, we were ingesting maybe 10K events per second, and that is what the Kafka cluster looked like. Then rapidly, nine months down the line, we were ingesting 3 million to 4 million events per second, and we had barely supported one type of use case, on the business side. We still wanted the operational side and all the other aspects. So scalability was the bigger challenge in this case.
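The point about checking types up front rather than after the fact can be illustrated with a small producer-side check. Python has no compile step, so this only approximates what a schema-based format with generated code would enforce at build time; the schema, field names, and `validate` helper are hypothetical and do not reflect Slack's actual logging format.

```python
# Hypothetical event schema: field name -> required Python type.
SCHEMA = {"user_id": int, "channel": str, "sent_at_ms": int}


def validate(raw: dict) -> dict:
    """Reject a record at the producer if a field is missing or has the wrong type."""
    for name, expected in SCHEMA.items():
        if name not in raw:
            raise ValueError(f"missing field: {name}")
        if type(raw[name]) is not expected:
            raise TypeError(
                f"{name}: expected {expected.__name__}, "
                f"got {type(raw[name]).__name__}"
            )
    # Keep only the declared fields so downstream consumers see a fixed shape.
    return {name: raw[name] for name in SCHEMA}


good = validate({"user_id": 42, "channel": "general", "sent_at_ms": 1000})

try:
    # With free-form JSON this record would be accepted, and someone
    # downstream would have to guess whether user_id is a number or a string.
    validate({"user_id": "42", "channel": "general", "sent_at_ms": 1000})
    rejected = False
except TypeError:
    rejected = True
```

The design choice is simply to move the failure to the place where the bad record is produced, instead of discovering the mismatch later in the warehouse.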

Boaz: So maybe it sounds like if there ever was a scalability challenge, it is this one at Slack.

Ananth: Yeah.

Boaz: I am sure not everything went smoothly. This is a good part of our beloved epic failure corner. We grow and learn from things that did not work. Any memories to share things that are a good lesson learned?

Ananth: Yeah, many things, though I do not know how well the lesson was learned. We had an Airflow incident one time that stopped us from running any code for 8 hours, along with a lot of small accidental problems. This one is very interesting. Airflow at the time (I do not know whether it is still true) had a problem where the scheduler could not keep up with the rate of tasks it had to spin off. So we introduced a cron job running in the background that just restarted the scheduler: in case the scheduler was exhausted, we restarted it every hour. What we did not realize was that the cron job had a bug. Instead of waking up at the top of the hour and restarting once, it kept restarting continuously for a minute. We did not notice, because things kept working fine and the difference was very minimal. Then one day we started migrating Airflow from one version to another, still unaware of this bug. As we monitored whether the upgrade was successful, we kept seeing this weird behavior and we were so confused. What is happening? We read through all the source code, read through all our deployments. Nothing. That took us about a day to figure out. Oh my God, there is a cron bug, and we had almost forgotten that we put that cron job there.
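The transcript does not give the actual cron expression, and the wording leaves the exact failure mode open. As an illustration only, one common way a "restart once an hour" job ends up firing far more often than intended is a `*` left in the minute field. The sketch below expands the minute field of two assumed expressions to show the difference; the helper names are hypothetical and it handles only `*`, `N`, and `*/N`.

```python
def minutes_fired(minute_field: str) -> list[int]:
    """Expand a cron minute field; supports only '*', 'N', and '*/N'."""
    if minute_field == "*":
        return list(range(60))
    if minute_field.startswith("*/"):
        return list(range(0, 60, int(minute_field[2:])))
    return [int(minute_field)]


def runs_per_hour(cron_expr: str) -> int:
    """Count how many times per hour the expression fires, judged by its minute field."""
    minute_field = cron_expr.split()[0]
    return len(minutes_fired(minute_field))


intended = "0 * * * *"  # assumed intent: once, at the top of every hour
buggy = "* * * * *"     # assumed bug: every minute of every hour

intended_runs = runs_per_hour(intended)
buggy_runs = runs_per_hour(buggy)
```

A check like this, or simply logging each restart, would have made the extra restarts visible long before the version migration.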

Boaz: The last suspect that nobody even bothered to think about was the one. Thanks for sharing.

Ananth: The good news is that now I am able to read all the Airflow code in just one hour, because we had to read it and understand it.

Boaz: If you go back in time, we put you back in 2016 at Slack, starting from scratch, knowing what you know now, what would you have done differently?

Ananth: One thing I would have done differently comes from what I saw towards the end of my time at Slack, when people started to lean more towards cloud data warehouses. In 2016, cloud data warehouses might not have been mature enough to handle our scale. At this point in time, I feel that systems like Firebolt or Snowflake could have been our first choice, relying more on existing tools rather than trying to build everything ourselves. That would have given us much more velocity. Because we had to manage all of our EMR clusters, our Presto cluster, and our Airflow cluster, it took a toll on our ability to support the growth the company was going through, and that reduced the trust and velocity of the people adopting the platform.

Boaz: Which technologies or tools did you get your hands on in recent years, or which ones are you excited about?

Ananth: In recent years, I am excited about a lot of the development going on in data lineage and data catalogs. This is something we had not thought through when we started our data team. At most companies, the data catalog and data lineage are always an afterthought. I am so excited to see so much literature and so many talks about data lineage, data discovery systems, and the data quality aspect. That is one of the things I am very excited about. I think adopting these means we have finally acknowledged what this is: we came a long way from the Hadoop world of heavy, hacky backend engineering to acknowledging that this is a data management system, with the properties of a data management system. I think we are coming full circle.

Boaz: Which tools have you been looking at so far at Zendesk with lineage, logging, and quality?

Ananth: There is pretty good tooling available right now, and it is a good thing that consumers have a lot of options. DataHub from LinkedIn is one example. Amundsen is another great one. Atlan is another interesting tool. What I am really looking at is how the tool embeds into the workflow of our analysts and data engineers, rather than introducing a data catalog as a checklist item: we have data discovery, checkmark done, and then nothing. I am sure you are aware of Apache Atlas, an open-source data lineage tool developed long back, in 2012 or something like that. We have had Apache Atlas at Zendesk since 2017, except that no one uses it, and half of the people are not even aware of it. The pattern I noticed in why people are not able to use it is that it is incredibly hard to use in the first place.

I observed the analysts' behavior. An analyst comes and says, "Hey! What does this table look like? I want to get an insight. Tell me which tables I should query for that." And someone points out, well, this is the lineage tool you should use. That is a separate effort. They have to go to a different UI, understand that UI, and then query it. It may or may not have the information, and if they cannot find it, they immediately ask who the senior person on the team is and put the same question to them. There is a high chance that person knows the answer from experience and answers the question. The next time a similar problem comes up, the analyst will not look at the data discovery system. They go directly to the person, because their workflow has now changed, because the tool did not solve their problem. So I look at this as a workflow problem rather than a lineage problem. Tools need to sufficiently solve that problem, and that will be really helpful.

Boaz: Awesome! What do you recommend if I am in software, but I have not been involved with data so much and I want to start? What is a good route for a software engineer to get his hands dirty in the data world? How do you approach such an educational program?

Ananth: My take is that SQL is, by default, the easy language of data, right? In terms of skill set, the first thing I would say is to pick up SQL; it is a first-class tool for navigating data. And if you are a software engineer, one good thing you can do is look inward rather than outward. Many people who want to get into data engineering think: I am going to solve a business problem, or do predictions, sales analytics, or marketing analytics. But if you are a software engineer, you are most likely working with some kind of system, some software you deploy to production. That production system emits logs. I think the very first step is to take those logs, apply SQL, and try to understand your system from a different perspective.

Boaz: Interesting.

Ananth: We do have observability tooling right now, metrics and logs, where you search and try to find the needle in the haystack. But taking those logs and trying to understand the long-term perspective is particularly helpful, because you already know the domain and the system, so you have a very good intuitive sense to work from while you improve your skills. Once you have understood that, the techniques are transferable, right? The domain is more a matter of acquiring knowledge; the tools and your way of thinking, the patterns, remain the same. So that would be my suggestion: start with the tool, pick SQL, and then look inward at your own domain to figure things out.
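As one possible starting point for the "take your own logs and apply SQL" suggestion, the sketch below loads a handful of log lines into an in-memory SQLite table and summarizes them per service. The log format, the service names, and the `load_logs` helper are made up for illustration; a real system would need its own parsing.

```python
import sqlite3

# Hypothetical log lines: "<timestamp> <level> <service> <latency_ms>"
LOG_LINES = [
    "2021-06-01T10:00:00 INFO checkout 120",
    "2021-06-01T10:00:05 ERROR checkout 950",
    "2021-06-01T10:01:00 INFO search 40",
    "2021-06-01T10:02:00 ERROR search 700",
    "2021-06-01T10:03:00 INFO checkout 130",
]


def load_logs(lines):
    """Parse whitespace-separated log lines into a queryable SQLite table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE logs (ts TEXT, level TEXT, service TEXT, latency_ms INTEGER)"
    )
    rows = []
    for line in lines:
        ts, level, service, latency = line.split()
        rows.append((ts, level, service, int(latency)))
    db.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", rows)
    return db


db = load_logs(LOG_LINES)

# Per-service request count, error count, and average latency.
summary = db.execute(
    """
    SELECT service,
           COUNT(*)              AS requests,
           SUM(level = 'ERROR')  AS errors,
           AVG(latency_ms)       AS avg_latency_ms
    FROM logs
    GROUP BY service
    ORDER BY service
    """
).fetchall()
```

The same query shape (group, count, aggregate over time) carries over to business data later, which is the transferable part Ananth describes.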

Boaz: It is beautiful. So, a log-driven approach. If you master the logs, you know everything about your platform. Awesome!

Boaz: Okay, great, Ananth! It has been amazing, with so many great insights. Thank you so much for joining and again, keep doing what you do with the newsletter. We love it!

Ananth: Awesome!

Boaz: Stay safe with the virus and all.

Ananth: Yeah. Thank you so much.

Boaz: My pleasure.

Ananth: Bye.

Boaz: Bye.
