September 6, 2023

Joe Reis and Matt Housley on the fundamentals of data engineering

After co-writing the best-selling book ‘Fundamentals of Data Engineering’, Joe Reis and Matt Housley joined the bros for some much-needed ranting, priceless data advice, and good laughs. So why are we still talking about providing business value and dashboards, even though we don’t really have anything new to say? If there are so many great tools in the data stack, why are we still so troubled? How can we focus more on things like data governance and data quality that’ll actually push the industry forward?

Listen on Spotify or Apple Podcasts

Benjamin: Hi, and welcome back, everyone, to the Data Engineering Show. Today, we again have two amazing guests, Joe and Matt. You probably know them from the book Fundamentals of Data Engineering, which is really well known. And we again have our awesome co-host, Robert, joining in today, because I know you are super excited about getting to talk to Joe and Matt again.

Joe Reis: I'm just getting off to a good start. All right.

Benjamin: So yeah, welcome everyone to the show. Perfect. Rob, take it away.

Robert Harmon: Oh, straight to me, huh? Well, it's great to see you guys. You again, Joe, and it's great to meet you for the first time, Matt. Really, we try and keep things fairly low key in these and try and stay out of trouble if we can, but that doesn't always happen. But I do have some questions, and I figured, I've got you guys here and you really do work the entire industry rather than a single product or a single content, you know.

Matt Housley: Great to meet you.

Joe Reis: Hehehe

Robert Harmon: consulting firm, etc. So you've got a different visibility than many of the people that we get to talk to. The other thing I noticed is you guys both have a very long background in data analytics, but you decided to jump over to the dark side with, you know, guys like me and get into data engineering and data architecture.

Joe Reis: Yeah, that's... we can talk about that.

Robert Harmon: So that doesn't sound like a subject you want to talk about.

Joe Reis: No, it's fine.

Robert Harmon: So how did this occur?

Matt Housley: Joe, do you want to go first? Do you want to take this one this time?

Joe Reis: I mean, I think we both jokingly call ourselves recovering data scientists. I mean, I think Matt and I had also, I mean, we could, we could write code before. I mean, we kind of grew up around computers and all that fun stuff. But what we realized was, especially with data science and analytics and so forth and the popularity of data, especially the 2010s, you know, you'd see a lot of companies hiring data scientists and I think forgetting to build the foundation that would help data scientists succeed. And so that's, I think it was a lot of our.

Robert Harmon: Mm-hmm.

Joe Reis: I suppose. So we joined the dark side, right? We found data engineering is actually, I guess it wasn't really called that back then, but it's a sudden necessity. You have to, you know, you could basically not do your job or you could figure out ways to make yourself successful. And I suppose that's at least how I envision it. I don't know. What about you, Matt?

Matt Housley: Yeah, yeah, I think it was a combination of factors. So when I was doing a lot of analytics and data science work a few years ago, we had Teradata and Hadoop both on-prem. And both of those systems at some point started becoming a bottleneck for me. And this sort of coincided with this era when a lot of data processes were moving to the cloud. And so I got to participate on big data POCs in both AWS and GCP and see the possibilities of those technologies in terms of scalability.

So now I didn't have to buy all the nodes that I needed to run a really big job, or I could dynamically scale up capacity to support more analytics instead of having just a static system that I pay for upfront. And so that's kind of what pulled me over to the engineering side of seeing the possibilities of those tools and how those could support analytics and data science. And so...

Like both Joe and I are familiar with on-prem systems and we do trainings around those as needed where companies have a need for those, but I think our real emphasis too is the possibilities of the cloud and scalability, et cetera. And how those really, when we say we're recovering data scientists, it's not out of disdain for data science, it's rather out of the possibilities that data engineering can bring to machine learning, data science and analytics.

Benjamin: Thank you.

Joe Reis: Speak for yourself now, just kidding. Ha ha ha. I'm joking, geez. Sort of.

Matt Housley: I mean, disdain is fun, right? It can be a good brand online.

Robert Harmon: If it helps any, recently I got to use the phrase recovering architect for the first time in 20 years. I'm not the architect of anything. I'm not an integration architect. I'm not a solution architect. I'm not even a data architect anymore. So I'm recovering as well. I may, you know, touch base with you later for notes on how to get through this transition.

Joe Reis: Mmm. What's up? Yeah, yeah. I mean, we could, there's a support group out there. You know, we can all get together and, you know, really, just help each other out here in our, in our recovery. But no, it's recovering architect. Did it feel cathartic when you started calling yourself that?

Matt Housley: Several, yeah. Cry into our beer, yeah. Yeah.

Robert Harmon: Uh, it's new, man. I'm still trying to grow into the idea. So it's, you know, it's like buying a new car. It works kind of like the last car. It's just really, really different. We'll figure it out.

Joe Reis: The recovering title sort of has a new car smell to it too, for a bit. And it wears off. Then you're, then you're sick. Then you're just a curmudgeon. Yeah. So.

Matt Housley: Yeah, yeah.

Robert Harmon: Bye.

Matt Housley: Yeah, give it time. Just like modern anything. Yep, yep.

Robert Harmon: Hey, okay, so I do have some of those tendencies. You got me there already. Maybe that's part of why I'm looking to walk away from that title. Speaking of, you bring up the term curmudgeon and this may touch a little bit on that. We've been through some stuff in the data industry. You and I, we're of similar vintage. We've seen some things. I'd like to think we're kind of coming off that whole big data thing and we're...

you know, strategically shifting as an industry. So, you know, phrases are popping around, things like we need to deliver value, for instance. We've seen this a lot lately. Now, I like to think I've been delivering value for a lot of years in different ways, different methods, because my job as a data architect is to get as close to the customer's experience as possible and truly try and influence, reach out from my little data pit to make that happen.

Where do you guys kind of see this going in the upcoming year to five?

Joe Reis: I don't know, Matt, you're writing a book on data value. You should speak on this.

Matt Housley: Yeah, I like what you say about getting really close to the customer. I think in terms of delivering value, that's one of the main channels, is like really connecting with the customer. And I think that's where we've seen a lot of really exciting work in data recently. It's also getting close to supply chain. It's getting close to the C-levels. It's getting close to all the goals of your business, right. And actually trying to deliver value in those areas. I think where we saw data science, big data engineering, data engineering go astray over time

Benjamin: Thank you.

Matt Housley: is the kind of gee whiz aspect of technology and new domains, right? So data science craze a few years ago, everyone wanted to be a data scientist, companies wanted to hire data scientists, but that customer focus was often lacking. So it's just like, oh, what cool thing can I do with Kaggle data? Now can I do that with my data? But wait, what am I actually trying to do for the business or what am I trying to do for the customer? That often got left out of the conversation.

So I hope that now as the market is tightening, as the job market is a bit more tight in the tech industry, we're actually thinking about those questions both for data science and data engineering and trying to do things that the business actually needs. I don't know, what are your thoughts on this, Joe?

Joe Reis: I don't know, I'm tired of the word value, right? And sorry that you're writing a whole book on this, but I was talking to, I think, Malcolm Hawker yesterday and he asked me about business value. And I said, if I keep hearing this or reading this on LinkedIn, I'm probably gonna jump off a bridge pretty soon. It's just, we've been talking about the same stuff for decades now, right, Rob? And it's like, it's the same, I mean, I feel like I'm just in, like, an old folks' home where you just need to keep talking about the same old war stories over and over again. And it's...

Robert Harmon: Yeah.

Matt Housley: Good old days. Yeah.

Joe Reis: Yeah, yeah. And I just hope that we can move past this. I mean, my dream actually is that we just stop talking about data. Right. And that's, I think, when you've finally delivered value: when you don't have to acknowledge it, because you're just delivering it. You know, you don't have to like rant about it like, hey, I'm delivering value. Because it's like, if you have to scream that far, you know, scream that loud from the rooftops, and you know, I think that there's an inverse correlation to the amount of data or value that you're probably delivering, in my opinion. You know, I mean, you don't see your, your

Robert Harmon: Right, right.

Joe Reis: accountant saying, oh, I deliver value, right. And they just do books and then your books are done. That's pretty easy. And I, I just hope, you know, over the next few years, I just hope that, you know, data becomes, um, a lot more silent. Right. And I wrote about this in one of my blog posts where I, I think I said it's, um, you know, stop using the word data when you're talking to the business. I think that's, if we can reach that goal, you know, or reach that ideal where data sort of, um, it just happens and we're just delivering whatever value we say we're delivering. Then I think that's a win.

Robert Harmon: Yeah.

Joe Reis: If that happens in the next five years, super duper. Now I do feel, um, we are at sort of an interesting moment as an industry, right? There's a lot of attention being paid to things like AI now, you know, AI sort of jumped the shark and I think this is, if we're ever going to get anything right in this industry, now's the time to do it. We don't have this opportunity that often, where things like data quality, governance, management, all these enterprisey things are suddenly like, you know, you need to get these things right for AI to work. Um, then my concern is if we can't get this right.

Robert Harmon: Mm-hmm.

Joe Reis: When are we gonna get it right?

Matt Housley: Yeah, and that's the question, right? Can we do something different this time or are we just gonna repeat history again and again as we have with every new data fad?

Joe Reis: We'll see. I mean, I was joking with somebody, I think it's Malcolm's post too. He posted something about like large language models and data governance. And at this point, I'm kind of like, whatever Hail Mary you need to do to make data governance work, go for it. Cause I think at this point, we keep trying to do the same stuff over and over again. And that's the definition of insanity really. The success rate on these kinds of projects is not that great. So I mean, I'm hopeful that we can finally figure things out, but we'll see. We'll see.

Matt Housley: Yeah, I guess we will.

Robert Harmon: You know, you raise an interesting issue there, because when I jumped into the data warehouse world way back in the day, the success rate on data warehouse projects was dismal, 10%, 15% tops. It just didn't happen. And I'm looking at the world today, and you do bring up something interesting, Joe, is are we that much better today?

Joe Reis: Well, 10, 15%, that's 85, 90% failure rate, right? According to whatever criteria, and that's sort of the stats that Gartner always throws out and all the other pundits, you know, so, well, you just need to move the goalpost for what success looks like. So, you're suddenly just killing it. So, I don't know, but you're absolutely right. I mean, but you know, walk me through this, Robert. I mean, back in the day when you got into the industry,

Robert Harmon: Well that's not hopeful.

Joe Reis: whenever that was data warehousing, you know, what was, what were some of the contributing factors to success and failure back then.

Robert Harmon: I can't speak for the entire industry because I was so busy in my own project that that's all I could see. I put blinders on because it was such a pain. Um, very, you know, we did really well. It was just a really big project. I got through it because I had somewhat of a background in process management. And that married really well with the business that we were, we were managing. So I could take that process management background, marry it with the wonderful things I learned from

Joe Reis: Yeah.

Joe Reis: Mm.

Robert Harmon: Bill Inmon and company through reading and start to build out structured strong data warehouses that met the business's needs. So when I started really, what I was looking at was the outcomes. You know, everybody talks about all kinds of things like models and structures and streaming and whatnot. And the customer doesn't care. All the customer cares about is that they get what they want on time. With high quality.

Joe Reis: Mm-mm.

Robert Harmon: and that we can do it repeatedly. So that's where I set my sights with that project is, okay, I'm gonna map out everything that touches a customer in this company. And that's all sorts of things, whether that's customer service calls or product delivery or any of these things. So I map out these processes and then I apply those same four basic metrics to it. Did it happen on time? What's the backlog? What's the volume? What's the rate? The very boring mathematic process.
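
The four measures Robert lists here (on time, backlog, volume, rate) can be written down in a few lines. What follows is only a minimal sketch under stated assumptions: the `WorkItem` shape, the `process_metrics` name, and the exact definition chosen for each measure are hypothetical illustrations, not a description of the warehouse he built.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class WorkItem:
    """One unit of customer-facing work, e.g. a service call (hypothetical shape)."""
    opened: datetime
    due: datetime
    closed: Optional[datetime] = None  # None means the item is still open


def process_metrics(items, start, end):
    """Compute the four measures for the window [start, end).

    Assumed definitions (one reasonable reading, not the only one):
      volume        = items opened inside the window
      backlog       = items opened before `end` and not closed before `end`
      on_time_share = share of items closed in the window that met their due date
      rate_per_day  = items closed in the window per day
    """
    opened_in_window = [i for i in items if start <= i.opened < end]
    closed_in_window = [
        i for i in items if i.closed is not None and start <= i.closed < end
    ]
    on_time = [i for i in closed_in_window if i.closed <= i.due]
    backlog = [
        i for i in items
        if i.opened < end and (i.closed is None or i.closed >= end)
    ]
    days = (end - start) / timedelta(days=1)
    return {
        "volume": len(opened_in_window),
        "backlog": len(backlog),
        "on_time_share": (
            len(on_time) / len(closed_in_window) if closed_in_window else None
        ),
        "rate_per_day": len(closed_in_window) / days,
    }


# Made-up sample data for one week.
d = lambda day, month=9: datetime(2023, month, day)
items = [
    WorkItem(opened=d(1), due=d(3), closed=d(2)),          # closed on time
    WorkItem(opened=d(2), due=d(4), closed=d(6)),          # closed late
    WorkItem(opened=d(5), due=d(7)),                        # still open
    WorkItem(opened=d(28, 8), due=d(30, 8), closed=d(3)),  # old item, closed late
]
metrics = process_metrics(items, start=d(1), end=d(8))
print(metrics)
```

The point of the sketch is that nothing in it depends on a particular database or tool: the same four numbers can be computed for any process that touches a customer, which is the latitude Robert describes next.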

But since it was so business process oriented rather than technology oriented, it gave me a lot of latitude within the warehouse itself. Things like software and hardware, nobody cares; we just need to hit these, you know, these things. So that was the goal way back then. And it continued to be that way, at least in my world, for at least another decade.

But then things went a little silly for a while. I like to think we're slowly trying to scratch that back. Because if I can deliver on those things, then the customer's happy. The business has no choice but to succeed. Now it's not gonna happen overnight. They'll improve, the company will improve performance 3% per month forever. But if I can get them on that path, I've got at least a path to success. Does that answer your question, Joe?

Joe Reis: Yeah, it's a good perspective, I think. Yeah, it's interesting. I wrote a blog post a couple of weeks ago too about, you know, we have no shortage of great tools at this point in time. I think we sort of have too many great tools actually. It's almost a paradox, right? But even amidst this, you know, embarrassment of riches, as Matt's always fond of saying, why is it that we're still troubled?

So I wrote that, you know, in a lot of cases, I feel like from a practitioner standpoint, I really feel like we need to do better using the tools we have. This comes through like upskilling, learning best practices and so forth. I feel like that's largely ignored. You know, we focus too much on learning to work with, quote, the business and all this stuff. I think if we can start focusing on that, I think that this is actually one of the biggest issues in the industry right now, is really the gap

between the capability that we have with the tools and our ability to properly execute on using these tools. So.

Matt Housley: Yeah, the tools are almost too easy, right? And that's part of the problem: they've almost become toys rather than professional tools that we use toward achieving a goal. It's like, uh, yeah. Like, uh, what are we trying to do with these tools? Sometimes we don't have a good answer to that question. It's just like Snowflake is cool or EMR is cool or whatever the happening thing is. Yeah, well, Firebolt is awesome, but sometimes we don't answer the question of like, what we're actually trying to accomplish. And like you said, Joe, fundamentals, like data fundamentals, like

Joe Reis: Professional Tools.

Joe Reis: Firebolt is awesome. Yeah, it's from the podcast. Yeah.

Matt Housley: quality, for example, and how do we ingest data properly? What do data contracts look like? Those are often missing.

Joe Reis: Mm-hmm.

Robert Harmon: That is a subject that I've been thinking about and I've been studying on, and I haven't quite wrapped my head around where exactly they came from, but data contracts. Again, I'm from a previous generation. Quality to me is guaranteed by correct schema. Those were the rules from 1990.

Obviously, that's a little more challenging today, especially when we have these massive cloud data warehouses and constraining a thing on one node when it's happening on another node in this giant warehouse is a mathematics trick that, well, I'll leave for Benjamin in another day. It's hard. So these things aren't available. So I can see how an idea like data contracts would come in. I just haven't quite wrapped my head around it. Any help?

Joe Reis: I mean, software engineers have been doing this forever though, right? With schema registries and stuff. And, you know, and I think that that's, I think if you want to know where data is going, like with contracts and whatever else, just look at where software engineering has been for the past 10, 20 years and just adopt those practices there. It's, you know, so I think it was Andrew Jones. I think he, he claims he was the first person to come up with the idea of data contracts or the term. Um, so we'll trace it back to him. Um, he's publishing a new book on Packt, which I still need to read. I think we

Robert Harmon: Right.

Joe Reis: actually were editing it, or reviewing it, for a bit there, but maybe he did more of it. A really cool guy, and then obviously Chad Sanderson, you know, has just sort of taken the idea of data contracts and, I think, really popularized it. You know, he's built a big community around that whole idea and data quality. But it was interesting, you know, one of my software engineering friends, you know, she was at a conference listening to a talk on data contracts and she was laughing the entire time.

She's like, this is sad. Like we've been doing, what's new about this? We've been doing this stuff for forever. Like this is nothing, this is nothing new. And so I think she was just kind of shocked at how far behind the data world really is compared to software engineering. So I thought that was really interesting.
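
For readers in Robert's position, one way to picture a data contract is as the schema rules he describes, moved out of the database and checked where the data is produced. The sketch below is a minimal illustration under assumptions: `Field`, `ORDER_CONTRACT`, `violations`, and the order fields are all hypothetical names made up for this example, and it does not reproduce any schema registry or data contract tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Field:
    """One agreed-upon column in the contract (hypothetical shape)."""
    name: str
    type_: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None  # extra rule, like a CHECK constraint


# The agreement between the producing dev team and the consuming data team.
ORDER_CONTRACT = [
    Field("order_id", str),
    Field("amount_cents", int, check=lambda v: v >= 0),
    Field("coupon", str, required=False),
]


def violations(record: dict, contract: list) -> list:
    """Return human-readable contract violations for one record."""
    problems = []
    for f in contract:
        value = record.get(f.name)
        if value is None:
            # Missing or null: only a problem when the field is required.
            if f.required:
                problems.append(f"{f.name}: missing")
            continue
        if not isinstance(value, f.type_):
            problems.append(
                f"{f.name}: expected {f.type_.__name__}, got {type(value).__name__}"
            )
        elif f.check is not None and not f.check(value):
            problems.append(f"{f.name}: failed check")
    return problems


# A producer would run this before publishing, and reject or quarantine bad records.
good_problems = violations({"order_id": "A-1", "amount_cents": 1250}, ORDER_CONTRACT)
bad_problems = violations({"order_id": 17, "amount_cents": -5}, ORDER_CONTRACT)
print(good_problems)
print(bad_problems)
```

Seen this way, the idea is close to what Joe's friend was laughing about: it is the same type and constraint checking that software teams already apply at their interfaces, applied to the data handed downstream.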

Robert Harmon: Right.

Robert Harmon: Well, and maybe this is another artifact of a previous subject we were discussing. If I'm not reaching out to the customer, if I don't live in the customer's world, how am I going to understand what some dev team is doing with some application?

Joe Reis: Bingo. You won't, right?

Robert Harmon: So here we are.

Joe Reis: Here we are. I mean, the last blog post I had last week, it's a Dev and Data Divide. And I have this picture that I like to show. It's Dev on the left, Data on the right. And it's crap flows downhill, to put it more euphemistically. But that's kind of how it is. It's a one-way street typically right now. Data's on the receiving end of a lot of stuff. And hopefully that changes, especially as we start developing more data products and the feedback loop goes back to.

Robert Harmon: Yeah.

Joe Reis: the dev side. Now, to me it's an artificial divide. It's a divide that had to happen back in the day, 'cause data was typically an IT function, but that's disappearing as data becomes more front and center for everything. So I think that's, that's what's going to change. If you kind of rewind to your first question of where things are going to go over the next few years, I think that's definitely one of them. It has to. This artificial divide between dev and data, it's really crippling and, I think, it's stupid. So

Robert Harmon: I agree, Joe. And honestly, for me, it was quite shocking because I spent way too many years at a single organization, well, later than the previous warehouse I spoke of, where I was the data guy for both sides. So, I control the entire BI environment and I control the entire operational environment. So I live with the development staff all day. This is what we do. So there couldn't be a divide.

So I had to play both sides of that game. And when I came out of that environment, out into the real world, it was honestly quite shocking. I hadn't seen this before. I didn't imagine it existed. And I got a lot of lessons in a hurry. But I do think, yes, we need to work more toward that homogenous type environment where data people are embedded everywhere.

They don't have to have a sign on their head that says, I'm the data person, but at least if we've got data people everywhere, then we've got a chance.

Matt Housley: Yeah, just moving toward the assumption that a lot of data is going to be customer facing rather than just appearing in reports and quite often, frankly, stale reports traditionally where you get a report after 24 hours or after 48 hours when maybe there are actions you could have taken and it's too late to take those actions now. I think the idea that your data can show up directly in an application, that the customer can get an idea of what's going on with their account or other places and that's all tied into analytics has really taken off in Silicon Valley in the last...

Robert Harmon: Mm.

Matt Housley: 20 years, but we're still kind of behind in certain areas.

Matt Housley: Oh Joe, you're muted.

Joe Reis: I was just gonna keep talking like that the whole time. You should let it go for a while, it'll become funny. No, I mean, the world's moved beyond reports at this point, BI reports and stuff. It's just like, if we're still struggling with that, I don't know, I'll go do something else with my time, go become a veterinarian or something, it's more fun. But no, I mean, that's kind of where we are. I mean, we're still talking about dashboards and stuff. I'm like, seriously? Like, this is, so it's interesting.

Matt Housley: There we go.

Robert Harmon: Yeah.

Joe Reis: We can talk about solutions though, right? I mean, I'm good at being a curmudgeon at this point and cranky and, you know, uh, irritated and stuff. And I don't know it's, uh, but you know, solutions are interesting. And, um, you know, I think that's where the conversation needs to go. Cause it's again, it's just the same old tropes you see on LinkedIn all the time, especially where it's like, you know, deliver business value and all this stuff. And you need to have a data strategy in place and all the, all the stuff. And I'm like, dude, like we've been talking about this for, for ages. Like let's, uh,

Robert Harmon: Yeah, and the other half is, consider your audience. This kid just came out of college, it's his first year in a data role and you're gonna tell him he needs to deliver value. How's he gonna do that?

Joe Reis: Hmm. Good point. And not that data value doesn't matter. I know, again, I know you're working on a book and stuff, but it's one of these things where I hope you can nail the topic too, in a way that, you know, pushes the industry forward. The stuff I've seen so far, it's like, it's good, but it's like, yeah, it's a tricky subject to tackle. Right. So.

Matt Housley: It is, and I've seen way too much vague consultant speak that I want to avoid. I mean, I think there are, I think if you've worked in data, you've seen very concrete ways of serving customer needs, for example, and that's what we need to talk more about. We need to talk about things that are frustrating in the customer experience and how people working in data can help with other IT teams to improve those experiences. That's what we're talking about when we're talking about business value, like things that make the customer happier, that make the business happier,

Joe Reis: Yeah, that's my problem with it.

Matt Housley: Yeah, that's value very concretely. And the problem is that on the one hand, it's very concrete for data people. On the other hand, it can be a little bit vague from the accounting side, right? Like what is the value of a customer who is happier because they can see what's going on with their account very quickly? It's a little hard to measure, but if we have a strong customer service focus, then there's definite value attached to that.

Benjamin: So one thing I'm curious about in general, right? Kind of like coming out of this is we're saying, hey, our tools got much better, right? But we still have many of the same problems we used to have. We still keep cycling around the same kind of topics, and both you, Joe and Matt, kind of, right? You're teaching a lot, you're doing thought leadership, you're writing blogs, you wrote that super well-known book, you're affiliated with the University of Utah, you're consulting, kind of like.

Arguably you could say, okay, if we have all of those amazing tools now and we're still cycling around the same kind of types of problems, right? Maybe we're just not teaching it well enough. So what does that mean for your approach to kind of delivering these things to professionals, students, those types of things?

Joe Reis: We failed.

Matt Housley: Hahaha

Robert Harmon: The kid comes out swinging.

Benjamin: Hahaha

Joe Reis: It's a good question.

Matt Housley: I mean, I think part of the problem, and this is not to trash vendors too much. I think vendors build fantastic products. Yeah, yeah, yeah. But, but I mean, if, if I'm in sales for a vendor, I'm not necessarily focused on how, how I use the tool. I just want to get the tool out there and get people using it. Right. And that's where there is more need for people on kind of the meta level to come in and say, all right, you've decided on X, Y, and Z tools, how can we actually use these to help the company?

Joe Reis: Firebolt's awesome, yeah, for example. Um.

Matt Housley: And do that training all along. I mean, I think Joe and I have complained a lot about the lack of training for undergraduates in data specifically. And part of that training as we build it out needs to be, obviously they need to learn data fundamentals like data modeling, FinOps, cost management, but also what it's like to work inside of a business and the kinds of things that businesses care about and how they can communicate better. I mean, communications are notoriously difficult to teach, right? Because how do you teach someone out of a textbook

how to communicate with someone. But we need to keep thinking about these problems and figure out how to give students practical concrete experience with communicating with businesses and stakeholders.

Joe Reis: I completely agree. Yep.

Benjamin: So how do you do that? Because that was also a very abstract answer.

Matt Housley: Fair, fair. I mean, I think from our perspective, a lot of this comes down to building better collaboration between undergrad and master's programs and businesses. You know, it's shockingly often we see that you've sort of got this MBA world that operates almost in a vacuum separate from the business world. And that's not ideal, right? You want... Yeah.

Joe Reis: Well, the academic world operates separate from the business world too. I mean, in some cases that's good. In a lot of cases, I think it's, um, it's pretty bad. It does a disservice to, to students. So that's one thing I'd like to see change. Right. So you talk about concrete stuff. I would also like to see more apprenticeship type programs. I think that the notion of a university being a necessity, I think is absolutely the wrong way to go. Um, so I think more people could be trained on this, uh, from practical things like apprenticeships. Um, you know, I'm, uh, creating a new MOOC.

Robert Harmon: Absolutely.

Joe Reis: class right now, a course, for a really big MOOC. One of the things I'm doing is, it's a simulator. It's your first day on the job as a data engineer. You get to go do business requirement gathering. You get to go find out what stakeholders want, and part of it is identifying, okay, so you're given this list of requirements. What are people actually asking for? So I think that's the other topic. We spent too much time teaching tools and not enough time teaching the techniques, right? So I think those are concrete ways that we could address it.

Um, because, uh, it's easy to do like the, you know, PySpark tutorials and stuff. I think, but that's the wrong way to teach data. I think the way we teach it is absolutely, uh, it's backwards, right? Know the techniques and then learn the tools. That's why we wrote the book the way we did. It's, it's, um, technology agnostic, for example, right. And, um, pretty much every company in the universe is using it for their data teams right now, right. Almost every university that we know, it's increasingly, uh, being used as a default textbook for data engineering. So

To me that's part of the process, right? But it's not gonna be an overnight thing. But I think the way we approached our book, it's similar to how Martin Kleppmann approached his book. It's agnostic, it stands the test of time, and that's kind of where we need to get to. Yeah, so hopefully I answered your question. We are making an effort. It is slow, especially universities are slow. They're so slow. And that's part of the problem with them.

Robert Harmon: Yeah.

Robert Harmon: Mm-hmm. And, you know, I've experienced this myself and I've had the awesome opportunity to work with some great kids that came straight out of college. They were bright and worked with me for a couple of years and that builds that mentorship type relationship. And then of course I achieved my goal. They get all full of themselves and they quit on me and go somewhere else.

which is absolutely awesome. This is the best day in anyone's life as a data monkey when you send another one off to open his own shop. The problem is, he's not on my team.

Joe Reis: Benjamin, when are you quitting? Just kidding. Oh, OK. Sorry.

Matt Housley: Yeah, yeah. He's announcing it right now.

Robert Harmon: So no, those days do happen. The problem that we run into is how do you do that at scale? And that I haven't figured out yet.

Joe Reis: It's an interesting one. Mentorship is that's the other key component of it. I think is mentorship, right? And, um, you know, that that's, it's a huge, huge thing. So I think all the above really, how do you scale it though? I don't know. Right. It's hard because mentorship is inherently kind of a one-to-one type thing. So it's like, I don't know. And a lot of people don't want to be mentors, for example, like it's work.

Benjamin: Very encouraging.

Robert Harmon: Um, well, from, yeah, it's work. So there's that.

Joe Reis: Yeah, it's an interesting one though, but it is something that Matt and I think a lot about. I mean, you know, we're both educators and I think, um, you know, we did our small contribution to the universe with our book and we're writing new books, but you know, books, books will only take you so far, right? This is very much a practice oriented field. So.

Robert Harmon: right. You do have to be out in the pits doing the job to get it. And that's hard to explain. There's a million things you'll get hit with on any given day as a data guy that are not in the books anywhere. And the other thing that I'm really, you've heard me or you've seen me write it, Joe, data is not a technical game.

Joe Reis: It helps.

Joe Reis: Yep.

Robert Harmon: It's a social club. It really is. I need to know everything that's going on in my team, in other teams. I need to keep that socialization going or I can't achieve the technical solutions. And that I don't think is coming out of college.

Joe Reis: It truly is.

Joe Reis: Nope, not at all. So it'll be interesting, but you know, we were talking yesterday, uh, with Hala Nelson about this. It was a rare opportunity, uh, we were doing a podcast with her: three math nerds, um, three professors, uh, three O'Reilly authors in one podcast. So we were talking about the, you know, the idea of tenure, really. And it's a double-edged sword. Because tenured professors, you know, on one hand it's great. It allows you the, um, academic freedom and the

Robert Harmon: Wow.

Joe Reis: psychological safety to pursue your work. On the other hand, I think it also provides incentives not to do things in the student's interest. I've seen this happen where some professors, in data programs especially, won't update their stuff because it's too much work. Or maybe they have other things going on. So these students are learning outdated stuff like Hadoop and all this other crap. I'm just like, why are you teaching this? This has nothing to do with reality at this point. So, but that's what it is.

Matt Housley: Yeah. And it's tough to stay current and you know, it's tough to find people from, from business who want to teach as well. I mean, they, if you have a job in data, you're very busy and you're probably well compensated for your time. And so teaching is almost a charity exercise. Maybe isn't so feeling. Yeah. You know about this.

Joe Reis: Especially adjuncting. Dear God, I mean, it's like the worst job in the universe. It's the best and worst job. You've done it before?

Robert Harmon: I have not. Yeah. Though I have been involved in a number of K-12 tech institutions and it's the same game.

Joe Reis: Yep. Yeah. I will say the best talk I gave this year, though, was at my sixth grade, or my kid's sixth grade class. We talked about AI, and that was really fun. You know, so I think, like, teaching younger kids is almost easier than teaching college in some ways, because they're just more fun to talk to, for one. But yeah, it's an interesting one. I think there's a lot of anti-patterns established at this point, though, that you could probably improve on with respect to teaching.

But to me, this is the crux in the industry right now. Like again, we have all the tools in the universe. That's not the consideration at this point. You can solve any problem you want to basically, but it's like, you don't even know what problem to solve because you don't know how to think through problem solving. That's a fundamentally different thing. So.

Robert Harmon: Well, and I do believe that part of that is just the youth of our industry. Now, sure, we've been collecting data on stone tablets since Mesopotamia, but not like this. This is a new

Joe Reis: That's when you got started in data warehousing, right? Just kidding. Um, okay. Okay.

Robert Harmon: I may have been there when stone tablets were invented. But not like this. This is a new industry. It's not like I'm dealing with architecture or finance or manufacturing where they've got ideas on how to do their job. We're still working out the kinks on this thing. So, you know, I think some of that is to be expected.

Joe Reis: Yeah.

Joe Reis: We talked to Bill Inmon too, you know, and I always ask him, like, geez, Bill, what was it like back in the day? Um, he's like, oh, we're a very immature industry. I remember calling him, kind of just in angst one day. I was just like, I'm kind of tired of this industry, Bill, you know? Like, why is it that we keep repeating ourselves over and over? He's like, well, Joe, we're very immature as an industry. We haven't been around that long. I'm like, okay.

Robert Harmon: Well, and I think intrinsically, we as data professionals have extremely short memories. Because if we could remember anything, we wouldn't run systems to memorize stuff for us. That's.

Joe Reis: But easy.

Benjamin: Thank you.

Matt Housley: Maybe it's the same thing they say about software engineers basically working very hard to be very lazy, or what they say about mathematicians, right? It's like you do all this work to write code so you don't have to do the work day to day.

Joe Reis: Yeah. But I don't know. You know, when you talk to people like Bill, though, who's been around, I mean, in my opinion, he is the industry. He kind of helped, you know, he's the godfather of the data industry. And so, you know, um, he's been programming since 1960, I think. So that's a long time. That's basically stone tablets at that point, or punch cards, which are basically the same thing. So, you know, he convinced me to stick around.

Robert Harmon: Yeah.

Joe Reis: So I was I was really gonna leave. I was like, I'm tired of this. This is just the dumbest industry I've ever seen. Who knows, I still might, but it's just, you know, I think it depends. And like, if, if we can like make tangible efforts to move the industry forward, you know, I think that's a good thing. But if we're still here talking about the same crap in like five years, like I'm gonna go find something else to do, I don't have time for this, like, you know what I'm saying? So

Robert Harmon: I actually did try and exit the data warehouse space after my second data warehouse, I got a new job. They had a whole lot of, they had a whole lot of operational database issues. I figured I could go work on that for a while. So when they, when they hired me, one of my contingencies was I'm not touching a data warehouse as long as I'm here. And they agreed. And then two years later, I rebuilt the data warehouse. So.

Joe Reis: It was that bad, huh?

Matt Housley: And were you coerced to or you just like, no, I've got to fix this mess.

Robert Harmon: No, I can't deal with this chaos anymore. I have to fix this. Yeah, that was pretty much it. Yeah, I tried to quit. I couldn't get out. So here I am. It's just can't finish. So yeah, we've talked about a lot of somewhat curmudgeonly topics and woe is me in the industry, but the upside here is we're all still here and we're all still fighting.

Matt Housley: So it's your own damn fault basically. Okay.

Joe Reis: That's pretty funny actually.

Matt Housley: So it's like smoking, essentially.

Joe Reis: Yeah, he's recovering now.

Matt Housley: quits thousands of times.

Joe Reis: Hahaha.

Robert Harmon: Um, I'm not sure why, but here we are. I do want to kind of move to some lighter subjects, if you don't mind. I know Joe and Matt, you guys are everywhere lately. What's coming up next?

Joe Reis: Yeah, that's fine.

Joe Reis: What about you Matt?

Matt Housley: Let's see, I have a couple of things upcoming over the next couple of months. So I'll be on a couple of panels at Big Data London in September. And then there's a conference in Budapest at the beginning of October called CrunchConf. And so I'll be speaking there as well. So I can, if you guys want to put that in the show notes, I can put a couple of links out there. And I'll be at your barbecue on Friday as well. That's right.

Robert Harmon: outstanding.

Joe Reis: He'll be at my barbecue on Friday. So yeah, that's like the highlight of the year. Um, yeah. And then I guess you're writing a book too, right? So that's, uh, ongoing. But, um, yeah, for me, I'm starting a world tour this Saturday, actually, to Australia, and then, um, I do the dbt and Joe Reis roadshow too. So, uh, dbt and I have a

kind of traveling circus that goes to different cities. We'll actually be in your neck of the woods, in Seattle, in September. So you better... Yeah, yeah, I'd be like, where's Robert? I can't have a party without Robert.

Robert Harmon: Well, then I'll have to stop by. I have no choice, right?

I could be totally, yeah, I could be unsociable. I mean, that's in my nature, right? Everybody knows I'm shy.

Joe Reis: Yeah, you'll just send your kid. His kids keep stealing his Fundamentals of Data Engineering book. So it's almost a constant prank.

Robert Harmon: Oh, that was no, it wasn't just him. So there is a story here. I wasn't going to bring it up, but since you did Joe, before, before we ever talked to each other, I ordered a copy of the book and then, uh, my sister-in-law and nephew came to hang out for a week over Christmas. They leave. I go looking for the book. It's gone.

Matt Housley: Here we go.

Joe Reis: That's a funny story.

Robert Harmon: So, and then a month later, they come back to visit, the book reappears for a day, and then it's gone. Cause the son now stole it. And I finally, I mentioned this to Joe, and he was so kind, he sent me an autographed copy and gave the kids instructions not to steal it. It says so right here. So I finally got through the book the hard way.

Joe Reis: No! That's awesome. That made my day though. And it made my day too that your kid I think was interested in data engineering too. Like that was one of the coolest things I'd heard.

Robert Harmon: Yeah, you kind of fired. I can't explain it. He's seen me go through an entire career of pain and misery. And then he's decided he wants to do the same. I don't suggest it. But what am I going to do? He's soon an adult. I can't stop him. So in a country where I'm not going to be able to do anything

Matt Housley: Yeah, tell us about that. Okay.

Joe Reis: Yeah. It's like that old 80s anti-drug, you know, drug ad: parents who use drugs have children who use drugs. That was my childhood.

Robert Harmon: Oh, yeah. Yeah, I learned it from you, dad. I learned it from you. And you know, there's some really attractive things about this industry. So I'm not going to tell him no. But I am going to be there, you know, the day when he falls into the industry trap, the consulting trap, to tell him I told you so. That's a guarantee. So he's off to school here in a couple weeks. He goes off to

Matt Housley: say no. Yeah.

Robert Harmon: be on his own in a computer science program, and we'll deal with the rest later.

Joe Reis: That's cool. That's pretty awesome. Yeah. So I think, uh, you know, the book has impacted kids. It's awesome. Um, you know, uh, other stuff I'm working on: I got a, um, course for O'Reilly coming out. It's a mini course, it's just basically the greatest hits of the book. It's, like, not that long of a course. Um, and then, uh, I can't announce it yet, but there'll be a big announcement in November for something I'm working on. So, uh, this is a ways away. But, uh,

can wait. So then I got a new book too, on data modeling, that should have been done by now, but I think a few things, one, courses, two, speaking, and then three, just large language models and ChatGPT, kind of threw me for a loop with respect to data modeling. So I was like, okay, I don't know what this does. Does it do anything? Is it a nothing burger? But you know, it'd be kind of weird to, like, write a book and not acknowledge that. So you have to really think through the consequences. So I spent the last

I think I'm finally coming to a resolution on it, but it was one of these events that I think it was too big to ignore. You know, so just, yeah.

Robert Harmon: Mm-hmm. So even I started playing with ChatGPT and data modeling because I've written on this before. I'm a big nerd when it comes to business rules. Business rules drive data models. I write them almost compilable, kind of like Chris Date taught us back in the day. What I've found over the years though is that people write them poorly, so they're not exact. And what I found with...

LLMs is I can take my business rules, feed them to the LLM and say, generate me a schema, and I can see immediately where I screwed up on my business rules. And then I can go back, tune up my business rules. And once I've got that, I can support a strong data dictionary, master data management, all that fun stuff. But the reality is ChatGPT is an idiot. And if I can convince an idiot to get it right, then I got my business rules right.

Robert Harmon: So that's my little foray into that world.

Joe Reis: That's pretty cool. It's really cool. I've been spending the week nerding out on vector databases too. I think that's a really fascinating technology. Because I think, again, with data modeling, I feel it's one of these topics where you mentioned Chris Date and relational model and stuff. And as an industry, we're still talking about things like relational models and Kimball, all this stuff. And I'm like, why are you?

People are just arguing about star schemas. And it's like, they came out in like the nineties, like let's move on. Like use it or not use it. I don't really care at this point. Use one big table. I don't really care, you know, but what's happened in the meantime, right? You got streaming, NoSQL, machine learning, all these new things. And you know, the modeling practices around that still need to congeal. And so that's, you know,

Robert Harmon: Yeah.

I, for the life of me, have not figured out how to pull off integration and time variance with a streaming source. This seems like a Doctor Who moment. You're going to have to fold time. Not that anyone cares just yet. But

Joe Reis: Well, walk us through, walk the audience through what you're thinking, because I know you and I have talked about this, I think, but you know, it's an interesting topic.

Robert Harmon: It's just a lot of workload to pull that off. Integration itself, any consulting firms run from integration projects for good reason. They're hard. They're just hard and they're work. You can't buy a product to make this happen. So if I've got like a salesperson table in one application, I've got an employee table in another application.

And really, this individual in both applications is the same individual in a subject-oriented world. I need to put him together before I can make sense of all the data that applies to him, right? Now, try this in a streaming environment. I dare you. It's hard enough in a batch environment, because if there's an interference, at least in a batch environment, I've got an hour to fix this before the next batch comes through. In a streaming environment, I'm lost.

And then you start thinking about things like time variance, where we're looking at temporalities of that employee. I want to see what that employee state was over time and quickly grab the activity data over time for that employee. So he was in Montana, now he's in Wyoming. Did his sales numbers change? This is a tough question to start with because now we've got all this temporality to pull together.

Joe Reis: potentially.

Robert Harmon: Okay, so now I've got to integrate the employee, pull all the temporality off and do that while that Kafka queue is busy chewing my butt. I don't know how to do this yet. It's a problem I think the industry will have to solve. Probably not today.
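
To make the two problems Robert describes concrete, here is a minimal batch-style sketch in Python. It is an illustration only, not something discussed on the show: the table contents, the choice of email as the match key, and the names `integrate`, `territory_as_of`, and `sales_by_territory` are all assumptions. It first integrates a salesperson record and an employee record into one person, then answers the Montana-versus-Wyoming question by looking up the employee's state as of each sale date.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical rows from two applications that describe the same person.
salespeople = [{"sales_id": "S-17", "email": "pat@example.com"}]
employees = [{"emp_id": 4021, "email": "Pat@Example.com"}]


def integrate(salespeople, employees):
    """Integration: map each sales key to the employee key, matching on normalized email."""
    by_email = {e["email"].strip().lower(): e["emp_id"] for e in employees}
    key_map = {}
    for s in salespeople:
        email = s["email"].strip().lower()
        if email in by_email:
            key_map[s["sales_id"]] = by_email[email]
    return key_map


# Time variance: per-person history of (effective_from, state), sorted by date.
territory_history = {
    4021: [(date(2023, 1, 1), "Montana"), (date(2023, 6, 1), "Wyoming")],
}

sales = [
    {"sales_id": "S-17", "sold_on": date(2023, 3, 15), "amount": 100},
    {"sales_id": "S-17", "sold_on": date(2023, 7, 2), "amount": 250},
]


def territory_as_of(person, day):
    """Return the state that was in effect for this person on the given day."""
    history = territory_history[person]
    idx = bisect_right([start for start, _ in history], day) - 1
    return history[idx][1] if idx >= 0 else None


def sales_by_territory(sales, key_map):
    """Attribute each sale to the state the integrated person was in on the sale date."""
    totals = {}
    for sale in sales:
        person = key_map[sale["sales_id"]]
        state = territory_as_of(person, sale["sold_on"])
        totals[state] = totals.get(state, 0) + sale["amount"]
    return totals


key_map = integrate(salespeople, employees)
print(sales_by_territory(sales, key_map))
```

In a batch job both lookups see a complete, settled snapshot. The difficulty Robert points at is that a stream offers no such snapshot: the matching employee row or the territory change may not have arrived yet when the sale does.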

Joe Reis: We were talking to the guys from Estuary yesterday about streaming, kind of a deep dive.

Matt Housley: Yeah, you might check out, yeah, they had an interesting idea about this. So you might check out our Monday morning data chat from yesterday, actually. Yeah. So for the audience listening after the fact, that was an August 14th episode with Estuary, and the idea there was, you keep all the incoming data in sort of a raw JSON schema, and then when you have problems with your schema, for example, you actually pause the kind of more refined side of your data.

Robert Harmon: can do.

Joe Reis: Yeah.

Matt Housley: So you have an incoming pipeline, you're ingesting, collecting. You have a transformation pipeline. You basically pause the transformation pipeline and say, hey, what's going on here? You set off an alarm, you have engineers check the schema changes and say, okay, here's what happened. They can make a schema change and then replay from the point where the failure happened. Ideally it happens within like a day or something like this. And so that way as schema and data are changing, you do kind of have pauses on the output side.

The real time side, you want to focus more on the raw data basically, but you're able to go back in time using replay capabilities to fix issues as they arise fairly quickly. And the idea is to make this as agile as possible. So you don't accumulate several problems before you fix them. You're very operational with your fixes.
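
The pause-and-replay pattern Matt outlines can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Estuary's implementation or API: `raw_log`, `ingest`, `Transformer`, and the two `normalize_*` functions are hypothetical names, and an in-memory list stands in for a durable log. Ingestion keeps appending raw JSON, the transformation stops at the first record that breaks its schema, and after the fix it resumes from that same offset.

```python
import json

raw_log = []  # append-only list of raw JSON strings; ingestion never stops


def ingest(event):
    """Ingestion side: always accept and store the raw event."""
    raw_log.append(json.dumps(event))


class Transformer:
    """Refined side: processes the raw log in order and pauses on a schema problem."""

    def __init__(self, normalize):
        self.normalize = normalize  # function mapping a raw record to a refined one
        self.offset = 0             # next raw_log position to process
        self.paused = False
        self.output = []

    def run(self):
        self.paused = False
        while self.offset < len(raw_log):
            record = json.loads(raw_log[self.offset])
            try:
                refined = self.normalize(record)
            except KeyError:
                # Raise the alarm and stop; the offset stays on the failing record.
                self.paused = True
                return
            self.output.append(refined)
            self.offset += 1


def normalize_v1(record):
    return {"id": record["id"], "amount": record["amount"]}


def normalize_v2(record):
    # Engineers' fix: accept the renamed upstream field as well as the old one.
    amount = record["amount"] if "amount" in record else record["total"]
    return {"id": record["id"], "amount": amount}


ingest({"id": 1, "amount": 10})
ingest({"id": 2, "total": 20})      # upstream renamed the field
pipeline = Transformer(normalize_v1)
pipeline.run()                       # processes record 1, pauses on record 2
ingest({"id": 3, "total": 30})      # raw ingestion continues during the pause
pipeline.normalize = normalize_v2    # schema fix applied
pipeline.run()                       # replays from the point of failure
print(pipeline.output)
```

Because the offset is only advanced after a record is transformed successfully, nothing is lost or duplicated across the pause. The cost is the one Matt names: the refined output lags while the raw side stays current.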

Robert Harmon: Right.

Joe Reis: So really too, I mean, if you think about bounded and unbounded time as well, right? I mean, that should solve a temporality problem. If you're keeping an append-only log, for example, right? I mean, inherently it should be there, unless you have late-arriving data, in which case, you know, you need to know when did the event happen and when did you get it, right? So, bitemporal. Tritemporal would be, like, you know, what did I do with it? But that's a different subject. But yeah, in theory, it should be easier with streaming. But

You know, in practice, um, I don't know. We'll see.
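
Joe's distinction between when an event happened and when it was received can be illustrated with a small Python sketch. The log entries, field names (`event_time`, `ingest_time`), and the function `state_as_of` are assumptions made for this example. Each entry in an append-only log carries both timestamps, so a query can ask what state was in effect at one moment given only what had arrived by another.

```python
from datetime import datetime

# Append-only log: each entry records when the event happened and when it arrived.
log = [
    {"employee": 4021, "state": "Montana",
     "event_time": datetime(2023, 1, 1), "ingest_time": datetime(2023, 1, 1)},
    # Late-arriving data: the move happened June 1 but was received June 20.
    {"employee": 4021, "state": "Wyoming",
     "event_time": datetime(2023, 6, 1), "ingest_time": datetime(2023, 6, 20)},
]


def state_as_of(log, employee, valid_at, known_at):
    """State in effect at valid_at, using only entries received by known_at."""
    visible = [
        e for e in log
        if e["employee"] == employee
        and e["event_time"] <= valid_at
        and e["ingest_time"] <= known_at
    ]
    if not visible:
        return None
    latest = max(visible, key=lambda e: (e["event_time"], e["ingest_time"]))
    return latest["state"]


june_10 = datetime(2023, 6, 10)
# What we believed on June 10 about June 10: the move had not arrived yet.
print(state_as_of(log, 4021, valid_at=june_10, known_at=june_10))
# What we know on June 30 about June 10: the late record corrects the picture.
print(state_as_of(log, 4021, valid_at=june_10, known_at=datetime(2023, 6, 30)))
```

The two calls return different answers for the same business date, which is the practical consequence of late-arriving data: any report produced between June 10 and June 20 was computed against the older belief.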

Robert Harmon: That's where I'm not seeing it as easier yet. I'm sure if we mature a little bit again as an industry, we'll figure it out. But right now that one's, yeah, it absolutely does.

Joe Reis: It'll have to happen though. It has to happen. I mean, Matt and I predicted this in our book. Uh, we call it the live data stack, and it's really a feedback loop between applications and events and, um, analytics and data and machine learning. That feedback loop shrinks, right? And actually, yesterday we were talking about how this might bridge the, uh, dev and data divide too, because if you can shorten the time, uh, between working with devs, right, uh, the feedback loop is there, in which case you have to collaborate.

Robert Harmon: Yeah, yeah, right, right. And exactly. And one of the things that may help there is isolation of some of these ideas. Historically, we had operational data stores as part of our BI environment, where we can grab operational data from operational applications, bring it into that operational data store to co-locate it, and sometimes apply temporality there, but we wouldn't do integration in that place.

Joe Reis: It's a forcing function. So.

Robert Harmon: And that can really accelerate that communication between Dev and the BI space because this becomes our DMZ. What you guys do over there, you do over there. What we do over here, we do over here. We're going to discuss what's going on in this operational data store as a negotiation. And that helped a lot. The problem is making that jump from an operational data store to a real data warehouse. That's hard to do in streaming. I don't know quite how to do that yet. I'm working on it.

Joe Reis: Yeah, I think Druid does a pretty good job at this, but it kind of changes, I think, the nature of what, you know, we might consider data warehousing as well. It's still analytics, but maybe, you know, in some ways it fits Bill's original definition; in other ways, you're probably going to have to morph it a bit too, just because of utter necessity.

Robert Harmon: Yeah.

Right, right. Or break it out into smaller structures that, like I said, make more sense or leverage the physical stuff underneath. But I think there's opportunity there as opposed to our previous conversations where we were all grumpy. I think there's definitely, yeah. Well, that was kind of random.

Matt Housley: We think we can actually make progress here.

Great. I like random conversations.

Joe Reis: But no, it's fun. Um, yeah, that's all of Matt's and my conversations, they're just, uh, super random, but, uh, it happens. I don't know. When we get on podcasts, for example, right, we don't ever have scripts. And if people want scripts, I usually tell them I don't want a script. Like, do not have a script. I don't know, cause it's a, yeah.

But this is what you get. You get kind of our meandering, uh, you know, um, cranky guy fest here. And, um, you know, except for Benjamin, he, uh, smiles too much. Um, just kidding.

Benjamin: Sorry, maybe in 10, 20 years, I'll smile less. Yeah. Perfect. So yeah, I think that's an awesome conclusion to the episode. Just be more bitter. Joe, Matt, I'll work on that. Thank you so much for joining today. Yeah, it was great having you on the show. See you around.
