A Deep Dive into Slack's Data Architecture

Growing from a startup to a public and then an acquired company meant that Slack's sales org was scaling rapidly.

Apun Hiran, Slack's Director of Software Engineering, explains how the data stack and architecture evolved to support this growth with more reliable and timely metrics.

Transcript:

Boaz: Thank you everybody for joining us here in San Francisco. We are super excited to be here. This is Eldad.

Eldad: Hi, everyone!

Boaz: I am Boaz, from Firebolt. Some background for those of you who have not heard of the podcast. What is this event? The backstory is this. This guy here loves data and decided to found a data company called Firebolt. It is pretty good. Check us out. But it is not really about Firebolt. To avoid that, marketing came up with a good idea. Let's do podcasts. We started doing this Data Engineering Show podcast, which is all about bringing in great people from the industry, data practitioners, people who work with data on a daily basis, and just interrogating them about their day-to-day, their challenges, their visions, their points of happiness, and everything data related.

Eldad: Do not cut anything out today, all the embarrassing moments that we have on LinkedIn.

Boaz: Yeah. You cannot just skip your speaker today and it counts everything is there to say. The podcast took off. We had a good time doing it, have a lot of listeners, and then we decided, why not take it on the road? So, this is the first attempt to do the podcast on tour. We are very excited to be in San Francisco doing this for the first time. We have great speakers lined up. So, are you ready?

Eldad: I am ready.

Boaz: The first speaker to join us, is Apun Hiran. Come on Apun! Where are you? See how he comes.

Apun: First time doing this.

Eldad: He is doing it every week.

Apun: Hot seat. It is really warm.

Boaz: Apun, we all worked from home for a couple of years. All of us tried a variety of Zoom backgrounds. Some of us went for animated backgrounds. Some of us went for weird backgrounds. Apun sat very close to his huge Jeep, which was his Zoom background in the garage. The best background I have ever seen.

Apun: Yeah! That is my favorite place in the house. Quiet. My garage is where I worked from home.

Boaz: And he keeps the car spotless also, shiny.

Apun: You need to keep it like that. It is going to be the background.

Eldad: We have known each other for so long. We started to know each other in Covid, and it is the first time we are meeting in person. So there are quite a few people here that met us when we started. So, just want to quickly say "Thank You" for believing in us and being with us, first when we started.

Eldad: It is great to physically see a lot of you here.

Boaz: Apun is the software engineering director at Slack, with a heavy, heavy focus on data. He has been at Slack for around a year and a half, spent a few years prior at Dropbox doing data stuff, and also has a long career behind him in data, at Yahoo and Oracle. He even used to have a DBA title, but that seems to be disappearing.

Eldad: You want to add further things to this.

Boaz: Things you would like to say about yourself that I did not include?

Apun: Yeah. Sure. My name is Apun Hiran and I have been in the data space for about 20 plus years now. I started my career as a DBA right out of college. The first thing I was doing was Oracle DB and I did that for a very long time, almost 10-12 years, before moving into the big data space and writing scripts. Then I did a stint at a company called AppDynamics, doing software engineering. That was my first time writing code in Java, for about two and a half years, building their database monitoring products, and then I moved to Dropbox for about two and a half years doing data engineering. The last five years for me have been more around functional data engineering, primarily working with business stakeholders in finance, sales, and marketing, and that has been an interesting change for me, moving from the platform side of things and building large databases or Hadoop platforms to doing functional data engineering.

Boaz: Awesome! Let us start with this: you are at Slack, first day in the office. What is your impression of what is going on with data at Slack?

Apun: I think my first impressions of data at Slack were pretty good. I was pretty excited to see the data platform that was already in place and the way it was scaling, and the team that I was going to lead was amazing. We already had a pretty good data stack, and I think my job was much easier at Slack than before. From there, we are just trying to get to the next level.

Boaz: You are saying you worked very hard before?

Apun: Yes, it was.

Apun: At Dropbox, when I joined, the team and I built out a data stack, an analytics data stack, over there.

Eldad: Found Dropbox.

Apun: Yeah. We did not get our files on Dropbox, but we used to ingest them, but that was it.

Apun: Everybody is amazing. The technology behind it is really impressive. I think we did load data from Dropbox. There were folks who dropped data in Dropbox and we would pull the data into the system.

Eldad: But it is hard to see if you are moving back to Dropbox where you went to release a provider on Dropbox needs to cold scan.

Boaz: You were brought into Slack in which position?

Apun: Yes, I joined Slack a year and a half ago, and my role was leading the data engineering team, which is the functional data engineering team. Basically, looking at sales, marketing, finance, customer success, customer experience, HR, all the different analytics that you can do other than just product. That was my role, and I had a wonderful team to lead there, and we have been building a lot of data products. Over the last couple of months, my role has expanded.

Boaz: Most speakers on our podcast, get to hold a pretty good team.

Apun: Yeah, now I also lead the data platform team and the enterprise integration team at Slack.

Boaz: Tell us about the various teams that do the data at Slack?

Apun: Sure. At Slack, we have a pretty large self-service data platform based on Hive and Presto, where anybody at Slack who has access and can write SQL queries can go and do that. We have a data engineering team which looks purely at the product data, the usage metrics, and stuff like that. Then we have a data science team that works very closely with the product data engineering team, looking at all the new features that are being released, or beta releases, or A/B testing, and things like that. And then we have business analysts spread across all the different business units who are pretty SQL savvy; they log into the system, write their own queries, and sometimes make our lives difficult because we fix those queries. So, we have a lot of data folks and a pretty robust self-serve platform at Slack.

Boaz: During that year and a half, when you joined and since, what were the priorities at Slack?

Apun: I think the biggest priority for us was to have reliable metrics, as Slack sales were growing rapidly, and to support the sales organization with reliable metrics, timely metrics. That was like the first...

Boaz: We all give a contribution to the sales.

Eldad: Allowed the free versions.

Apun: There were pretty lofty goals, right from the top and across the sales organization, around growing sales at a very fast rate. So, the first thing that we came in to do was build a robust sales analytics system, with all the important metrics and the user experience that comes with those metrics. That was probably the first big challenge. But for us, the other challenge was the acquisition. Then there was a second tier of things: how do you look at metrics in the Salesforce space so that we can align on metrics? Things like accuracy became pretty critical. At the same time, we were also looking at some gaps in the platform, in terms of not having a very robust data catalog or proper data quality tools. We were having a lot of issues with data quality, and we would usually get notified by the end-users that this does not look right, which is...

Boaz: What was a typical example?

Apun: A typical example would be a data load coming from, say, a CRM application. We load the data, and during the load a few of the files get missed because of some error during the copy command. By the time that reaches on-call through the alerting system, the data has already been published and the dashboard has already been refreshed. Then there were other issues with missing data: a job that was supposed to run in 2 hours runs for 24 hours, and the end-user does not know why there is no data or why the data is half-baked.
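
The overrunning-job case Apun describes can be illustrated with a small check that turns a silent gap into an explicit status. This is a minimal sketch, not Slack's implementation; the `JobRun` type, the function names, and the message wording are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobRun:
    """Hypothetical record of one pipeline run."""
    name: str
    started_at: datetime
    expected_duration: timedelta
    finished_at: Optional[datetime] = None


def is_overdue(run: JobRun, now: datetime) -> bool:
    """True when a still-running job has exceeded its expected duration."""
    if run.finished_at is not None:
        return False
    return now - run.started_at > run.expected_duration


def status_message(run: JobRun, now: datetime) -> str:
    """Status text an end user could see instead of an unexplained gap."""
    if run.finished_at is not None:
        return f"{run.name}: data refreshed at {run.finished_at:%Y-%m-%d %H:%M}"
    if is_overdue(run, now):
        late_by = now - run.started_at - run.expected_duration
        hours = late_by.total_seconds() / 3600
        return f"{run.name}: refresh is delayed by {hours:.1f}h, data may be stale"
    return f"{run.name}: refresh in progress"
```

A scheduler or monitoring job could call `status_message` periodically and post the result wherever the dashboard's users will see it.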

Boaz: How did you go about fixing that?

Apun: We use Slack a lot for alerting, prioritizing, and all that, so we put Slack in the middle of everything that we do. All the alerts started to go through channels.

Eldad: Who gave you this idea by the way.

Apun: We definitely leveraged Slack and a lot of workflows in Slack to be proactive in messaging that there are delays and things like that. One thing was building the ecosystem: better alerting, better monitoring, and better on-call support for all of that stuff. The other was looking at the root cause of these issues. Can you be more proactive and publish data only after the data loads are completed and we know that there are no errors? We rethought the entire ETL pipeline. I must say we are in a much, much better situation right now, and we are at the place where we are actually trying to identify the issues that are not pipeline-related but are actual data issues, where the source data itself has problems, and building alerts for that. We have come a long way in one year.
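
The "publish only after the loads are complete and error-free" idea can be sketched as a gate in front of the publish step. The sketch below is an assumption-laden illustration rather than Slack's pipeline: `LoadResult`, `can_publish`, and `run_pipeline` are hypothetical names, and the alert is only built as a dictionary in the `{"text": ...}` shape that Slack incoming webhooks accept; nothing is actually sent.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class LoadResult:
    """Hypothetical summary of one load: what was expected vs. what arrived."""
    expected_files: Set[str]
    loaded_files: Set[str]
    errors: List[str] = field(default_factory=list)

    @property
    def missing_files(self) -> Set[str]:
        return self.expected_files - self.loaded_files


def can_publish(result: LoadResult) -> bool:
    """Publish only when every expected file loaded and no errors were recorded."""
    return not result.missing_files and not result.errors


def slack_alert_payload(source: str, result: LoadResult) -> dict:
    """Build a message body in the {"text": ...} shape used by incoming webhooks."""
    missing = ", ".join(sorted(result.missing_files)) or "none"
    text = f"{source} load held back: missing files: {missing}; errors: {len(result.errors)}"
    return {"text": text}


def run_pipeline(source: str, result: LoadResult) -> dict:
    """Either mark the data as publishable or return an alert instead."""
    if can_publish(result):
        return {"published": True, "alert": None}
    return {"published": False, "alert": slack_alert_payload(source, result)}
```

The design point is that the dashboard refresh depends on the gate, so end users see yesterday's complete data plus a delay notice rather than today's partial data.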

Boaz: Most of that fell to the data engineering team, your team?

Apun: That's correct. There were functional data issues, so we had to work with the functional teams, because the data needed to be fixed upstream. We became the team that was identifying issues in source systems and reporting them, rather than just being the folks who get alerts from our ETL.

Boaz: Tell us about the data stack?

Apun: As I mentioned, we have a data lake, which is on S3 with a Presto environment, where people can write their own queries and publish dashboards as well.

Boaz: Which kind of users typically run their own queries there?

Apun: Almost all users run their queries because that is the data lake, you get all of the regular data available over there. But I would say data science is probably the biggest use case. Almost everybody in the company who needs any kind of product information would go there and look at the data.

Boaz: They are analysts or people embedded within the different departments?

Apun: Yes, they are. Every team has them: product analysts, marketing analysts, and sales analysts are all pretty SQL savvy. They go and run queries. At times, we have written queries for them and handed them over, and they just run them every now and then to get their reports or numbers.

Boaz: You mentioned before the lack of data cataloging. In a company like Slack, with analysts embedded across departments, how does knowledge sharing work? How do they know which datasets to look at?

Apun: I mean, that is a problem. It has always been a problem. There is a nice search, and people enter keywords, find a table which looks like the right thing to look at, write queries, and look at the data, and then the data does not match what has been published. Then they come back to the data engineering team and ask why the numbers do not look the same, and we tell them which is the right table to look at. That is a problem with any self-serve platform. But there has been a lot of investment in metadata management on the Presto and Hive side as well. People are putting in comments, column definitions, and table definitions, and creating the right schemas where you have conformed datasets, and somebody is monitoring those datasets. Those are some of the changes that have happened on the self-serve side of data analytics.

Boaz: What data volumes are you dealing with?

Apun: I do not have the exact number. The whole data warehouse is about 30 petabytes in size. I do not have the daily ingestion number, but you can imagine: every message itself is captured and put somewhere else, which is more secure, but the fact that somebody sends a message to somebody is an event, and we want to see how many messages you sent and how many messages Firebolt is sending. So all of that comes in, and there are a lot of messages.

Eldad: This guy is counting your messages.

Apun: And your emojis too!!

Boaz: Don't you do something for the more popular datasets, like having those aggregated...?

Apun: Of course. All the user-related metrics are aggregated daily. The account or company-level metrics, say for a company like Firebolt: how many messages, how many weekly active users, and monthly active users. Those are aggregated on a daily basis.
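
As a rough illustration of the daily company-level aggregates mentioned here, the sketch below counts messages per company per day and computes weekly active users from message-sent events. The event shape and the metric definition (distinct senders in the 7 days ending on a given date) are assumptions made for the example; the real definitions are not given in the conversation.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, List, Set, Tuple

# Hypothetical event shape: (day, company, user_id), one row per message sent.
Event = Tuple[date, str, str]


def daily_message_counts(events: List[Event]) -> Dict[Tuple[date, str], int]:
    """Number of messages per (day, company)."""
    counts: Dict[Tuple[date, str], int] = defaultdict(int)
    for day, company, _user in events:
        counts[(day, company)] += 1
    return dict(counts)


def weekly_active_users(events: List[Event], company: str, as_of: date) -> int:
    """Distinct users with at least one message in the 7 days ending on as_of."""
    window_start = as_of - timedelta(days=6)
    active: Set[str] = set()
    for day, comp, user in events:
        if comp == company and window_start <= day <= as_of:
            active.add(user)
    return len(active)
```

At real volumes this would be a scheduled SQL job over the event tables rather than in-memory Python; the point is only that the aggregate is computed once per day and queried many times.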

Boaz: Tell us a little bit about the place for those jobs?

Apun: We use Airflow as our orchestration platform, and we create ETL jobs over there. Basically, it is the data engineering teams that create these ETL jobs and publish them as conformed datasets on the self-serve platform. That is one part of our data infrastructure, which caters to product-related metrics. Then there is the other part, which is all of the business-facing datasets and dashboards, where we use Snowflake, Matillion as the ETL platform, and Looker and Tableau as the BI platforms. That is where the functional data engineering happens: there are maybe over 100 different data sources at this point, and bringing them all in happens in that second part of the platform.

Boaz: As for the data warehouse with Snowflake is the data engineering team solely in charge of everything that is going on there or is there also a level of self-serve that people can use Snowflake on their own?

Apun: I think when we initially started, it was pretty tightly held by the data engineering team. But a lot of people wanted to ingest Excel spreadsheets to do analytics and stuff, so we created a separate environment in Snowflake which is self-serve, with infrastructure around it: you have a Google Sheet, you want to load it and do some analytics, and you can do that by providing some metadata in the spreadsheet. So we do that as well.
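
A metadata-driven spreadsheet load of the kind described could look roughly like this. The sketch uses the standard-library `sqlite3` module as a stand-in for the warehouse and CSV text as a stand-in for an exported sheet; the `load_sheet` function and the metadata format (column name mapped to column type) are assumptions, since the actual mechanism is not described in the conversation.

```python
import csv
import io
import sqlite3
from typing import Dict

_ALLOWED_TYPES = {"TEXT", "INTEGER", "REAL"}


def load_sheet(conn: sqlite3.Connection, table: str,
               columns: Dict[str, str], csv_text: str) -> int:
    """Create `table` from the column metadata and load the CSV rows into it.

    Returns the number of rows loaded.
    """
    # Names are interpolated into SQL, so restrict them to plain identifiers.
    if not table.isidentifier() or not all(c.isidentifier() for c in columns):
        raise ValueError("table and column names must be simple identifiers")
    if not set(columns.values()) <= _ALLOWED_TYPES:
        raise ValueError("unsupported column type")

    col_defs = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_defs})")

    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [tuple(row[name] for name in columns) for row in reader]
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)
```

The user supplies only the sheet and a small description of its columns; table creation and loading are handled by shared infrastructure, which is what keeps the environment self-serve.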

Eldad: What happens when someone edits a message? Do you keep the same ID? Do you run different dbt scripts that update every day?

Apun: I have no idea, Eldad.

Eldad: Is someone concerned now that Twitter is going to release an edit feature on Twitter, is that something the board is discussing?

Apun: I am sure a lot of people have had that on their minds since yesterday.

Boaz: How layman should write the thing in Slack as you already mentioned, Presto, Snowflake, BI tools like Looker and Tableau? How does that go along with testing?

Apun: You are talking about a new feature being introduced. At Slack, the product engineering team has its own roadmap in terms of what product features it wants to introduce. There is a component where the data teams are involved, in terms of what kind of metrics and usage information we would want to derive from a particular feature. It is a more collective thing. You have folks from sales involved, like how do you want to sell the product, and what kind of metrics you would need from that product, in this kind of discussion before something goes into production. There is a beta phase where people are testing, A/B testing with a few customers, and once it goes into production, it is more like a collective agreement: you fill in those data requirements, and then it propagates all the way through to analytics as well.

Boaz: Can you give a recent example of the process to support the current feature?

Apun: Of course. I think Slack introduced Huddle and Slack Connect not too long ago, I think a year and a half ago. Those two features were built that way in particular, because Slack Connect is one of those features where you can connect with anybody who is on the Slack platform, beyond your organization. It was one of the very critical features released by Slack. So, we wanted to make sure that we can track the usage and how much time people are spending. How are permissions and approvals and all of that stuff being taken care of? That was a collective discussion, even with sales: how do you want to sell this feature, and how will you track that? The same with Huddle. Huddle is another very popular new feature of Slack in terms of people using it, and we can see the usage going up every day. All those features go through that process.

Boaz: You guys played the full game and the metrics are agreed upon in advance?

Apun: It is not as perfect as it sounds, but there is definitely a variation of that. We do go in with requirements: what would you need at bare minimum to launch something, what are the data requirements? Because we are talking about software engineers and data engineers. The software engineers building a particular product feature sometimes think differently about what is critical from a metrics or logging perspective than the data engineers and the business do. So it is a good idea to come together.

Boaz: Is there any collaboration between software engineering and data engineering, can you please tell me that?

Apun: I have no idea about that. I have only been there for a year and a half. I definitely see that we work together a lot, even on something as simple as having to update the website with certain information which is critical to marketing. We do work very closely with the web team: How do you want to present this? How will I get that information from the logs? Where will it arrive, and how can I push it to the marketing team and sales? We do a lot of collaboration that way, but not on the product as such.

Boaz: What is like a nightmare situation you were in? What is the worst complaint you got into the data engineering department? This does not work. This is wrong. What is a horror story?

Apun: One horror story for me was just coming from a different company to Slack and figuring out that there are no emails. You do not get any emails, everything is on Slack, and then slowly in the first week you are added to some 200 channels across seven workspaces, and it is a total nightmare. You do not know where to look.

Boaz: It is a horror story.

Apun: It was very difficult for me for the first month and a half. I think it is still difficult on some days. But from a data perspective, we also do a lot of bots, like the Slack bot that I mentioned, where we publish company performance data to execs on a daily basis, from both the sales and product usage metrics. That is a very critical bot. There are other bots, a marketing bot, a sales bot, but the exec bot is very critical.

Eldad: You are sending that over an email.

Apun: No, all Slack, everything at Slack. That is something that everybody has eyes on at 9:00 AM like it is published at 9:00 AM. And those numbers are very critical. Everybody is looking at it.

Boaz: Specific that.

Apun: Yeah, specifically. That particular thing has broken at certain times, which is very critical, where everybody from the CTO to the CIO to the CEO is pinging: why are the numbers wrong? Why are these numbers not showing correctly? If it is a simple data issue, it is fine. But if it is an infrastructure issue, it becomes a nightmare. So, that was one of them, around publishing business metrics. That is something I check every morning.

Boaz: How do you guys go about if something like that happens? How do you go about learning from your mistakes?

Apun: We do maintain a lot of documentation, purely from a runbook and on-call perspective, and certain critical pipelines like this exec bot, as you mentioned, have pretty solid documentation. As part of all our stories, we have one story that we create around documentation: make sure you are updating those runbooks. Beyond that, it has been through brown bag sessions, especially when this kind of visible event happens; we make sure that people are across it. We sit with the whole team, both offshore and onshore, to educate folks on what exactly happened, record those sessions, document them, and then also update the runbooks.

Boaz: What are your plans now for the future? So 12 months forward, what do you want to achieve in data?

Apun: Right now, if you look at it, it is a pretty simple data platform from my perspective. You have a data warehouse, you have a BI tool, you have an ETL platform, but then there are a lot of questions around what you have in your data warehouse, like the catalog. That is something very critical that we are working on collectively right now. We are evaluating tools to build a more cohesive data catalog with proper data lineage, extending specifically to things like Tableau. The second thing is purely around data quality checks. We are in the final stages of evaluating certain tools to do this in an automated way, like Monte Carlo and things like that. Those are critical pieces. And the third piece that we are working on collectively as a team is how you present the data that you have in the data warehouse to different platforms. How do you present it through an API? How do you present this data more programmatically? In certain use cases, we have requirements to build microservices for marketing, where we have enrichment processes and all of that stuff. So, in my mind, the next 12 months are going to be about programmatic data access, and then data catalog and data quality.

Boaz: Can you give examples of programmatic data?

Apun: Yeah. Sure. I will give you an example from marketing events. Take an event like this: you have all the email addresses of the folks that come here, and you want to send email campaigns to them, but you do not know who everybody here is, who is a manager, who is an engineer. Different event platforms generate this kind of data, and people get it as an Excel spreadsheet or CSV. Right now, part of our team's job is to ingest the data, enrich it with information about the company you come from and your role and responsibility, and then check our marketing database to see if this person already exists and, if so, whether we got any new information, and update that. This whole pipeline needs to be fairly quick, from ingestion to the time you can see it in a targeting dashboard showing who I need to target for marketing.
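
The ingest, enrich, and update flow for event attendees can be sketched as two small steps. Everything here is hypothetical: the record shape, the domain-to-company lookup used for enrichment, and the in-memory dictionary standing in for the marketing database.

```python
from typing import Dict

Record = Dict[str, str]


def enrich(attendee: Record, company_by_domain: Dict[str, str]) -> Record:
    """Normalize the email and attach a company based on the email domain."""
    domain = attendee["email"].split("@")[-1].lower()
    enriched = dict(attendee)
    enriched["email"] = attendee["email"].lower()
    enriched["company"] = company_by_domain.get(domain, "unknown")
    return enriched


def upsert(marketing_db: Dict[str, Record], record: Record) -> str:
    """Insert a new contact, or merge in only the fields that changed.

    Returns the action taken: "inserted", "updated", or "unchanged".
    """
    key = record["email"]
    existing = marketing_db.get(key)
    if existing is None:
        marketing_db[key] = record
        return "inserted"
    # Ignore "unknown" values so a failed enrichment never overwrites good data.
    changed = {k: v for k, v in record.items()
               if v != "unknown" and existing.get(k) != v}
    if not changed:
        return "unchanged"
    existing.update(changed)
    return "updated"
```

Returning the action taken makes it easy to report, per upload, how many contacts were new and how many carried new information.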

Eldad: You had to be around all street when you got from home?

Apun: It happens every day; it is a daily pipeline. But we are trying to build all of these more programmatically. A similar example would be consent management, from a compliance perspective. If I have to target somebody, I want to make sure that I have the latest consent information for that person. That is the API-based, programmatic data access that I want to provide to the business. Just in marketing, we have about 40 different platforms being used for email marketing, web marketing, and whatnot. How do you integrate them, and how do you build one single source of truth for this kind of information? That would be one example.

Boaz: Within the central data engineering team, you guys serve the marketing team and the sales team. Does each have a dedicated team, essentially, within your data team?

Apun: Yes, we have folks on the team who have a lot of experience working with sales organizations, at Slack and at previous organizations, and they have very rich experience with CRM and things like that. So, they fill the sales data engineering and finance data engineering roles. It is similar with marketing. Marketing is a very complex trade, so we have a lot of folks who come with that background, multi-touch attribution models and things like that. So yes, we do have dedicated folks, but people also get bored doing the same things, so we have to move people around.
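
The consent lookup Apun mentions amounts to "find the most recent consent record for this person across all platforms, then decide". A minimal sketch, assuming a hypothetical `ConsentEvent` shape and that the latest record wins regardless of which platform produced it:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class ConsentEvent:
    """Hypothetical consent record collected from one marketing platform."""
    email: str
    channel: str          # e.g. "email"
    granted: bool
    recorded_at: datetime
    source: str           # which platform reported it


def latest_consent(events: Iterable[ConsentEvent], email: str,
                   channel: str) -> Optional[ConsentEvent]:
    """Most recent consent record for this person and channel, if any."""
    matching = [e for e in events if e.email == email and e.channel == channel]
    return max(matching, key=lambda e: e.recorded_at, default=None)


def may_contact(events: Iterable[ConsentEvent], email: str, channel: str) -> bool:
    """Contact is allowed only when the latest record exists and grants consent."""
    latest = latest_consent(events, email, channel)
    return latest is not None and latest.granted
```

Behind an API, `may_contact` would be the single question every marketing platform asks, which is one way to read the "single source of truth" goal; absence of any record is treated as no consent.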

Eldad: What about quality? Do you also analyze quality, or is that something that is being done only by engineering, like the quality of product or quality of service?

Apun: My team does not look at that part. What we look at is customer experience: we do analytics on whether people are logging issues, whether people are complaining about something, putting those datasets on a dashboard and presenting it to the right team so that the support folks can act. So that part, product quality, is not something we do. I am sure that is the product team.

Boaz: How do all the different data teams work together and collaborate?

Apun: We do work quite independently. If my team needs product data for a particular metric, I go talk to the product managers on the product data team and give them the details of the information I need. An example would be Huddle: I want usage stats and Huddle metrics. I provide them with that information, and they usually work in 2-week sprints, so depending on how big that particular requirement is, they prioritize it, and once that data is available we use it and publish it. But engineers like to collaborate with each other, and there is a community where the data engineers talk to each other. The more formal channel is through the managers.

Boaz: Awesome.

Eldad: Question coming from the crowd.

Boaz: Now it is time for a few questions from the crowd. You can ask Apun anything, personal matters too.

Audience Speaker 1: Thank you for being here. I wanted to ask: when you joined, and now that you have transitioned into functional data engineering, what were some of the challenges that you faced? And why did you choose the stack you had? Was it just because of past experience, or was there any kind of vision involved?

Apun: Yeah, from what I understand there are two parts to the question: what are the challenges in what we do, and why did I move more towards functional data?

Audience Speaker 1: Yeah.

Apun: The challenges, I would say (I did prepare some slides), are just the famous three V's: volume, velocity, and variety, and the number of data sources. You get data as CSV, XML, JSON. You get flat files, spreadsheets, and whatnot; it is just crazy. Those are the kind of challenges that our team works on daily, figuring out how to ingest data more effectively and build scalable data models. As for me, I came from the data platform side, doing databases for a long time, and big data. So I was on the side where I had no idea what the business does. All I did was data. I built the platform, I served the data, and that was all. And I was always curious about how the business works: how do you sell things? What is important there? That is why I decided to move into the functional data engineering space, and I am definitely loving it. The challenges are similar but different.

Audience Speaker 2: From an organizational perspective, you mentioned software engineering, and data engineering, they are part of the job that is happening for data engineers and also a lot of business logic, like the example mentioned, we want to some other even more complex operations. So, your data engineering team stays away from the business logic or you make sense too because if we move the data engineer to the software engineer team for the knowledge of the business logic, a lot of engineers are not that capable to handle all the data engineer challenge?

Apun: I think business logic in general usually lies in the platform that you are using. When you say business logic, if it is business logic built into a CRM platform, it is part of the platform. But as data engineers, we are always looking at business logic to generate metrics. A simple example: weekly active users is a metric. It sounds simple, but it is a very complex metric. You need to get the definition, and you need to build that business logic as a data engineer and publish those metrics. If you are looking at Slack more as a product, and the business logic in the product, I think that is more software engineering in my mind. Data engineering is more like an input to that process: based on the metrics we see or generate, maybe the business logic needs to change somehow. I do not have a great example for that, but that is where I think the delineation typically is.

Audience Speaker 2: Thank You.