In this episode of The Data Engineering Show, Benjamin sits down with Artie CTO and co-founder Robin Tang to explore the complexities of high-performance data movement. Robin shares his journey from working on Maxwell at Zendesk to scaling data systems at Opendoor, highlighting the gap between business-oriented SaaS connectors and the rigorous demands of production database replication. Robin dives deep into Artie's architecture, explaining how they leverage a split-plane model (control plane and data plane) to provide a "Bring Your Own Cloud" (BYOC) experience that engineering teams actually trust. You'll hear about the technical nuances of CDC, from handling Postgres TOAST columns to the "economy of scale" challenges of processing billions of rows for Substack, Artie's first customer. Whether you're struggling with real-time ingestion costs or curious about the future of platform-agnostic partitioning, this conversation provides a masterclass in modern data movement.
Listen on Spotify or Apple Podcasts
[Intro:] Let me just build this myself. Cause like I've touched CDC, this shouldn't be that bad.
The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for AI apps and low latency analytics. Get your free credits and start your trial at firebolt.io.
Benjamin (00:02.548)
All right. Hi, everyone, and welcome back to the Data Engineering Show. It's great to have you. Today I'm super excited because we actually have a data platform vendor on again, which we haven't had in a while. And I always love these technical conversations. So Robin is the CTO and co-founder of Artie. And Artie is a data movement platform, which Robin can give us all of the details on now. So welcome to the show, Robin. It's great to have you. Do you want to quickly introduce yourself and tell us what Artie is doing?
Robin (00:32.034)
Yeah, and thanks for having me. So in a nutshell, Artie helps companies make data streaming accessible. I started the company primarily around the fact that, I don't know, I wanted data in my warehouse really, really fast, I didn't want to make any sort of compromises, and it just turned out to be a really hard problem. So then we started a company around this.
Benjamin (00:52.126)
Nice. And how did you actually come up with that being an important problem to solve? Because to me, or to someone listening to this for the first time, there are mature tools like Fivetran, and performance-focused vendors like Estuary, who we've had on the show before. So why do I need another data movement vendor, basically?
Robin (01:12.876)
Yeah, so maybe I can just quickly start with my career so far. Like most of my career I've spent working on large scale data systems. About 10 years ago, I was at another like small YC startup that was moving a bunch of events over to do like events automation. At that point, I was primarily building out the database architectures to be able to ingest billions of rows and be able to do that low latency processing. That company then got acquired by Zendesk.
Benjamin (01:16.598)
Sure.
Robin (01:41.43)
I joined Zendesk to work on Maxwell. And the reason I bring up Maxwell is that Maxwell is an open source CDC framework for MySQL that reads the binlog into Kafka. There are two really popular open source frameworks that I'm sure you guys know. The most famous one is Debezium, right, which basically handles any sort of database into some sort of Kafka. I was using Maxwell at the time for more product integrations, so that Zendesk-specific portfolio products could talk to each other.
Benjamin (02:02.794)
Right.
Robin (02:12.342)
After that, I joined Opendoor to run growth, stopped working on anything CDC-related, and started being a downstream consumer of Snowflake. And I was really missing the low latency data that, you know, I had when I was querying from source DBs. I would ask my data team at the time, even at Zendesk: hey, give me this data a little bit faster in BigQuery or Databricks, depending on the company I was working at at the time. And they would always tell me some version of: this is too hard.
We need more engineering resources; unless this is a company P0, we can't do it. Some derivative of that. And then at Opendoor, I tried to buy a tool. I discovered the managed vendors out there, the Fivetrans and the Airbytes. I found that they were really good for SaaS data sources, like Zendesk, HubSpot, Salesforce. It's been CRM data, more or less. And the thing is,
that makes sense, because your ICP there is more business-oriented users, or marketing or growth-marketing-oriented users. Whereas production databases have a completely different type of volume and nuance to them. And you know, when you're talking about the upper echelon of volume, you're primarily selling to infra and platform engineers who have a completely different job to be done, where they care a lot more about failure recovery modes, telemetry, observability of the platform, advanced use cases they can enable, like, I don't know, doing
partitioning at the column level or schema exclusion, things like that. I basically couldn't find a good tool out there. I tried a bunch of different tools and I was like, okay, let me just build this myself. Cause like I've touched CDC, this shouldn't be that bad. And then I had a team of seven engineers try to build this for a year, and by the end of it, it wasn't production ready. So then I was like, okay, this makes no sense at this point.
Does every company that hits a certain volume scale have to just spend a year or two building a Postgres-to-Snowflake connector? That just doesn't make sense. So then I started asking around, and eventually that line of questioning got me to start Artie.
Benjamin (04:11.424)
Okay, nice. That's an awesome story. I think what you're touching upon, Robin, is a thing we've also observed at Firebolt, which is like...
If you think about real-time ingestion into a system, there are almost these two very different types of use cases that historically were all handled by one type of company, the Fivetrans of the world, even though they're totally different, right? Like, okay, if I connect my internal Jira, or Linear, or whatever other project management tool I use into my data warehouse,
fine, what's it going to be, a couple of thousand tickets, tens of thousands of tickets? Maybe it's fine for it to be a bit stale, maybe there are five updates every minute. These are not things done at volume, which makes it much easier to build these connectors. Building a tool like Fivetran is still hard, because all of a sudden you need to add 2000 source connectors and you're basically in this very large integration project to add support for a bunch of different tools.
The reality, though, is that for teams building products or data platforms with production data, that's not a very important problem to solve, right? I don't care about my Linear issue history being in the production analytical database that powers the business or anything like that. And that's why I think it's been interesting to see this
fork in the road of basically like best in breed vendors popping up that really focus on.
Benjamin (05:42.892)
focusing on few sources, few destinations, but getting those right. If you actually zoom into the future, do you think these things will converge again? Do you envision a future in which Artie adds 2000 sources, including Linear, Zendesk, or whatever? Or do you actually think this is a permanent fork, and it's just going to be: data engineering teams work with tools like Fivetran, and data platform teams or engineering teams work with tools like Artie?
Robin (06:11.948)
Yeah, you bring up an interesting point, and this is also something I failed to mention, so thank you for bringing it up. What I found was basically that the managed vendors out there are really good at BFS: hundreds, thousands of connectors out there. And that's great. But the key insight I had was doing DFS on the data sources that actually matter. And that's why I started with databases. And to answer your question, it's unclear to me, like,
in my opinion, I think this is more of a go-to-market problem, less of a technical problem. Sure, we could build just as many connectors as possible, but in reality, Linear, Zendesk, Salesforce, they're also trying to build their own connectors, right? They also have enterprise customers who don't want to use their analytical dashboard. They'd just rather dump this into a Snowflake and then either put an LLM on top of it, or give their data scientists or business analysts the ability to just query it directly and join it with their core KPI dashboards, right?
So to me, it's unclear. I feel like the answer might be twofold: maybe we need to implement some sort of open API spec to be able to ingest this type of data more freely. But on the other side, I would also imagine just working more closely with these companies so that we can build a better integration, rather than using the publicly available APIs, which may not, you know, actually capture every single edge case.
Benjamin (07:36.82)
When you talk about an open API, isn't that kind of what the open source vendors in that space do already? I think of Airbyte, for example, and I think there are a few others. That is effectively your open API for data movement, right?
Robin (07:45.528)
Yeah.
Robin (07:52.066)
Yeah, I think so, I think so. My biggest thing is more like, we're not so much focused on this particular problem right now, and I'm not really sure how the future will shake out. I think what you're saying, if it's just an open API spec: at this point, code generation tools out there are really good already. There are plenty of LLM-based tools that can basically just take an OpenAPI spec and generate client libraries.
And then it's just one step away from mapping that to basically your data model, and then you can sync it pretty easily. So I don't think this is a particularly hard problem. However, when I think about the problems that, for instance, George Fraser was talking about around Stripe and Salesforce, there's a lot more nuance there, given that the available APIs are actually not perfect. Like the one he mentioned for Stripe, where he had to build his own pagination, or
Benjamin (08:25.386)
Right.
Robin (08:51.032)
you have to reverse engineer Apex to be able to pull data from Salesforce, or regenerate some of the formulas, as an example. I feel like that type of stuff is still unsolved. Maybe with open APIs things are a lot easier, but at the same time, if it's so easy, I don't really see a point in having a vendor, because it's so easy. So yeah.
Benjamin (09:13.12)
Right. Yep. Okay. Makes sense. So zooming back in on Artie and the high-performance part of data movement, because, well, that's the part I care most about and am most interested in, of course: take me through a reference, like the perfect use case for Artie. How do I deploy it? Is it a managed service? Is it more like bring your own cloud? What data volumes does it handle? How fresh is the data? Yeah, I'd be very interested in all of that.
Robin (09:33.964)
Thank you.
Robin (09:42.168)
Yeah, so maybe I can just walk through a cookie-cutter example. The one that I really started to build first was the happy path that I was trying to solve for at Opendoor, which is Postgres to Snowflake, right? And from the get-go, my whole thing was: I wanted the same ease of use as Fivetran, because they nailed it. It's super intuitive, and the onboarding flow should just be as simple as possible. Any unnecessary clicks that deviate
from the happy path should be avoided. And we have done that for our onboarding flow. It's really three steps to set up. First, you pick your source, Postgres as an example, and you put in all your connection details. We'll validate to make sure, for instance, that you have the right write-ahead log level, that we have the right permissions for replication, that you have a publication enabled, things like that. We explain this to the customer, and at the end of the day we provide a service account script so that customers can basically just copy and paste.
And the comments are inline with every single step so that they know exactly what's going on. Once you figure out your source details, then we have a second tab, which is the tables tab. Pick all the tables that you care about. There are advanced settings such as, you know, column-level exclusion or inclusion, or column hashing. Or this table might not have a primary key, so you might want to use a unique index instead, things like that. Once that's all done, you pick the destination. And then again, similar to your source,
you fill out all the required destination details. We have the service account script on the right, fill that out, you hit deploy, and then everything just works out of the box. And the key insight here really is: I used to be a Fivetran user too, I loved the onboarding flow, so I wanted to bring as much of that through as possible. Then the second thing, why we're different, is our orchestration logic, what happens underneath the hood. That's magical. So, for instance, when we do backfills, we follow the DBLog framework, which does online backfills.
So instead of doing what I call theoretically correct and practically wrong, which is what some tools out there do, Debezium as an example: it starts a read transaction, snapshots the table, and once it's done, it commits the transaction and reads the write-ahead log. That's fine if you have a test database, but if you have a production database and your backfill takes an hour, a hundred hours, your write-ahead log is just going to blow up.
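For readers who want something concrete, here is a minimal sketch of the chunked, "online" backfill pattern Robin is referring to, written in Go and loosely in the spirit of Netflix's DBLog. It is an illustration, not Artie's actual code; the table, columns, and emit sink are made up.

```go
// A minimal sketch of a chunked, online backfill: the table is read in
// primary-key ranges while the CDC stream keeps flowing, so no long-lived
// snapshot transaction has to hold back the write-ahead log.
package backfill

import (
	"context"
	"database/sql"
	"fmt"
	"log"
)

// Backfill reads the (hypothetical) orders table in chunks of ascending primary
// key and hands each row to emit, which in a real pipeline would write to Kafka
// alongside the CDC events.
func Backfill(ctx context.Context, db *sql.DB, emit func(id int64, payload []byte)) error {
	const chunkSize = 10_000
	lastID := int64(0) // resume point; persist this to make the backfill restartable

	for {
		// Each chunk is its own short-lived query, so the WAL never piles up
		// behind a multi-hour snapshot transaction.
		rows, err := db.QueryContext(ctx,
			`SELECT id, payload FROM orders WHERE id > $1 ORDER BY id LIMIT $2`,
			lastID, chunkSize)
		if err != nil {
			return fmt.Errorf("chunk query: %w", err)
		}

		n := 0
		for rows.Next() {
			var id int64
			var payload []byte
			if err := rows.Scan(&id, &payload); err != nil {
				rows.Close()
				return err
			}
			// DBLog proper also brackets each chunk with low/high watermarks in
			// the change stream so concurrent updates win over stale chunk rows;
			// that reconciliation step is omitted here for brevity.
			emit(id, payload)
			lastID = id
			n++
		}
		if err := rows.Close(); err != nil {
			return err
		}

		if n == 0 {
			return nil // reached the end of the table
		}
		log.Printf("backfilled chunk up to id=%d", lastID)
	}
}
```

Because each chunk is independent, the backfill can be paused, resumed, or run in parallel per table without holding open a transaction for the duration of the snapshot.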
Robin (12:06.752)
What would also be bad is if one table has an error and that stops the whole entire pipeline. So we have parallelism at the table level, because things like this, I found, are really important. Another thing is that in typical SaaS, when you talk about BYOC, VPC, or on-prem type deployments, they're typically bolted on. They're not a first-class citizen, so the UX feels wonky. For us, it was built in from day one.
We have a very similar architecture to Astronomer's: a split-plane architecture, control plane and data plane. And the pipelines live at the data plane level. So a company that wants its own BYOC, as an example, can have as many data planes as it wants, and when they create a pipeline, they can specify the data plane they want it to be in.
Benjamin (12:48.98)
Right. Okay. So you basically... it's always that deployment model? It's not like some customers use a managed version of Artie; it's always that split control plane / data plane, bring-your-own-cloud deployment, basically?
Robin (13:03.339)
It's like BYOC, but for customers that don't care about BYOC, as an example, we have a cloud data plane they can choose to use. At the pipeline creation level, they can just choose one of our cloud data planes.
Benjamin (13:09.792)
Gotcha. Okay.
Benjamin (13:16.076)
All right. That's pretty neat. So then at that point, you basically just take the data plane that you built for the BYOC deployment and deploy it internally to make it look like, or turn it into, a fully managed service, basically. Neat. All right. That's a nice trick I actually hadn't seen before. Very cool. So what happens under the hood? I know that some vendors, like Estuary for example, make heavy use of object storage under the hood for resiliency and performance and cost efficiency.
Robin (13:29.08)
That's it. That's it. Yeah.
Benjamin (13:45.642)
Take us through the actual engineering architecture of what makes Artie fast.
Robin (13:50.306)
Yeah. So the philosophy I had was basically that I wanted to use the best-in-breed class of tooling out there, and I wanted to make everything as pluggable as possible. What that means specifically is: a data plane is a Helm chart. If it's a gen-pop (shared) data plane, as an example, then multiple customers live inside it; each customer is a subchart inside the Helm chart. And we deploy just using a helm
upgrade-type command, so that if a customer wants to own their own CI/CD process, they can. Also, all the data is encrypted and stored in a Git repo. So if they want to host everything, all we need is their public key. We encrypt with the public key, they can decrypt it, and then they just need to run helm upgrade themselves. That's pretty standard. Inside a data plane, instead of using an object store, we use Kafka. So a data plane really comprises two things:
Kubernetes for compute, and Kafka. And typically, most of the time, we run Kafka inside Kubernetes using Strimzi. So it's really just Kubernetes at the end of the day. And what makes it fast is really that we follow a very similar pub/sub model. We have a publisher, and all it does is read from the database journal. It doesn't do any sort of fancy business logic; all it does is read and immediately dump it to Kafka. Once that's there, we then have multiple consumers.
We have a consumer per table that reads these messages and then writes it to the downstream destination.
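To make the pub/sub shape concrete, here is a stripped-down sketch of a per-table consumer in Go using the segmentio/kafka-go client. The broker address, group naming, flush interval, and the flushToDestination callback are assumptions for illustration, not Artie's implementation; note that offsets are only committed after the destination confirms the flush, which matches the delivery guarantee Robin describes later in the episode.

```go
// One consumer per table: read the table's topic, batch messages, flush to the
// destination on a timer, and only then commit offsets.
package consumer

import (
	"context"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

// ConsumeTable drains one table's topic and flushes batches downstream.
func ConsumeTable(ctx context.Context, topic string, flushToDestination func([]kafka.Message) error) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka:9092"}, // assumption: in-cluster broker address
		GroupID: "pipeline-" + topic,    // assumption: one consumer group per table pipeline
		Topic:   topic,                  // one topic per source table
	})
	defer r.Close()

	var batch []kafka.Message
	flushEvery := time.NewTicker(time.Minute) // e.g. one-minute syncs
	defer flushEvery.Stop()

	for {
		fetchCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		msg, err := r.FetchMessage(fetchCtx)
		cancel()
		if err == nil {
			batch = append(batch, msg)
		}

		select {
		case <-flushEvery.C:
			if len(batch) == 0 {
				continue
			}
			// Flush first, commit after: if the warehouse write fails we never
			// commit, so after a restart or rebalance the same offsets are
			// re-read, i.e. at-least-once delivery paired with an idempotent merge.
			if err := flushToDestination(batch); err != nil {
				log.Printf("flush for %s failed, retrying next tick: %v", topic, err)
				continue
			}
			if err := r.CommitMessages(ctx, batch...); err != nil {
				return err
			}
			batch = batch[:0]
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
	}
}
```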
Benjamin (15:22.699)
Interesting.
And if you look at Kafka itself, there's been so much innovation around Kafka recently, because you have that whole mindset emerging of: I can make Kafka radically cheaper if I relax some of the latency constraints I have, right? And I think WarpStream, which is now part of Confluent, is a perfect example of that. Basically saying, hey, let's start moving things to object storage, and maybe 90% of Kafka use cases out there actually don't need ones or tens
of milliseconds of freshness; maybe they're fine with 200 milliseconds of freshness. Do you see something similar in this real-time data movement world, where over time there are basically going to be trade-offs between cost, real-time data freshness, all of these things? Zooming out a year or two from now, how do you think that will evolve, and do you think engineering teams will actually start getting a lot of flexibility around that?
Robin (16:11.566)
Mm-hmm.
Robin (16:23.81)
I think there is, but I think it's more of a chicken-and-egg problem, and the thing that needs to be solved first is tooling. All of this is accessible today, right? If you wanted to hack Strimzi to use an EFS backend, it's not recommended, but you can, and that's more or less your budget WarpStream, right? You could do all of this before. The problem is it's just really hard. So unless you have specialized industry knowledge about how to do this, you won't be able to do it.
Another thing is that Kafka on MSK also has tiered storage, right, that you can enable, which immediately offloads everything to S3. So this is all possible. It's just really, really hard and locked behind knowledge silos. I'll give you an example. We help a lot of customers. When it comes to BigQuery, the first thing we tell them is: change your pricing model. Because by default you get on-demand pricing, and immediately every single customer that's not sophisticated just gets charged an insane amount, because Google is really good at making a query
that's 10 terabytes versus one gigabyte run at basically the same speed, because of super parallelism. And most people don't need that. So being able to educate people, I think, is the number one thing. I do think in the future, real time will be real. Basically, I think the biggest bottleneck for real time right now is accessibility. When people think about real time, there are two things that come up. One, they immediately think it's not worth it, because they implicitly have a cost associated with it. And then number two,
Benjamin (17:23.872)
Good.
Robin (17:49.89)
when they think about real time, they typically think about building first, not buying first. So once we change those two behaviors, the consumer behavior that results then becomes: what else can I build on top? And instead of fiddling around with low-level settings, they can play around with, say, a concept we built called Eco Mode. During business hours, you run at one flush rule; outside of business hours, you run another flush rule. So during business hours, do one-minute syncs;
outside of business hours, maybe do a sync every six hours, as an example. There will just be more and more of these things. Another one is recommendations. One thing that we also have is helping customers with recommendations, like: hey, your BigQuery table needs to be partitioned, because this is heavily unoptimized, and OLAP DBs, for merges, unless there's a merge condition, are going to do table scans. And if you're on the BigQuery default pricing model, then you're paying for bytes scanned. So it's not optimized.
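A tiny sketch of what an "Eco Mode"-style schedule could look like; the hours and intervals here are invented for illustration, and the real product presumably exposes this as pipeline configuration.

```go
// Pick a flush interval based on whether we're inside business hours.
package main

import (
	"fmt"
	"time"
)

type EcoMode struct {
	BusinessStart, BusinessEnd int           // local hours, e.g. 9 to 18 (assumed)
	PeakInterval               time.Duration // e.g. one-minute syncs during the day
	OffPeakInterval            time.Duration // e.g. every six hours overnight
}

// FlushInterval returns how often the pipeline should flush right now.
func (e EcoMode) FlushInterval(now time.Time) time.Duration {
	if h := now.Hour(); h >= e.BusinessStart && h < e.BusinessEnd {
		return e.PeakInterval
	}
	return e.OffPeakInterval
}

func main() {
	eco := EcoMode{BusinessStart: 9, BusinessEnd: 18, PeakInterval: time.Minute, OffPeakInterval: 6 * time.Hour}
	fmt.Println("next flush in:", eco.FlushInterval(time.Now()))
}
```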
Benjamin (18:19.019)
Right.
Robin (18:46.254)
It just requires more and more of these things to happen. And naturally, I think the premise of real time will eventually happen.
Benjamin (18:55.466)
Right, that also kind of means that what...
Benjamin (19:04.256)
Sorry, kind of what you're saying is that your sync pipeline, or the way your producer ultimately behaves, is very tailored to the underlying target data warehouse as well. And...
Robin (19:18.84)
Yes.
Benjamin (19:20.138)
That makes it very, very difficult. We, for example, recently built a merge statement for Firebolt, and that's deeply integrated with our sort keys in the object storage layer, which means that if you do a merge on a small number of values, we can actually use the sort keys on the base table to make things more efficient. Now that's pretty neat, but honestly, if you guys were now building out a target connector for Firebolt, that requires a certain level of Firebolt expertise
to really understand the performance implications of the merge statement, how to tune it, et cetera. One trend you're seeing in the industry right now, and now I'll intentionally be a bit pushy, is that some of these vendors are adding managed CDC offerings as well, right? I think Snowflake actually recently launched their Postgres CDC offering. Do you think that...
Robin (20:06.478)
Yep.
Benjamin (20:13.352)
the world is going to end up with like best in breed data movement vendors? Or do you actually think the company building the data warehouse has like an asymmetric advantage basically on like building a best in breed tool for their own data warehouse?
Robin (20:31.66)
Yeah, and that's a good point. If you think about it: Databricks just last year acquired Arcion, so I'm sure they'll come out with their own; if not, they already have Delta Live Tables. PeerDB was acquired by ClickHouse for ClickPipes. Redshift has Zero-ETL. Snowflake has their own. Google has Datastream. They're all trying to build their own. And at the same time, the problem is,
the nuance here is so great that the complexity is not just at the destination level; the complexity is also at the source level. To give you an example, Postgres has TOAST columns, whereas other databases don't. Another one is that the majority of enterprises are still using SQL Server and Oracle. And for SQL Server, to build the best-in-class CDC tool out there, the only way to do it is reverse engineering their code, because the publicly available docs for CDC are not great.
Benjamin (21:24.79)
Bye.
Robin (21:29.67)
I think there are way too many complications here. And it's not just about CDC. For instance, one thing that we're also doing is, given where we're sitting, which is consuming from the CDC logs, asking what additional things we can be doing that we can do better than others. One very obvious case here is real-time monitoring, right? We detect that there's a source delta, and before we land it into your destination, we can immediately notify you. As well as: what's the number one issue that we deal with for CDC pipelines?
Most likely something related to schemas. So we have a data catalog built into our product so that we can observe those behaviors. I feel like the job to be done for a data movement vendor, more or less, is so great that it's not just about dragging and dropping one particular CDC source and getting it done. In reality, organizations have...
plenty of different database types, either through necessity, because maybe they use DynamoDB because Postgres doesn't scale, or because they acquired another company that uses MySQL 5.7. I don't think you'll be able to capture all of them. And there's so much nuance with every single data source that it's very, very difficult. It becomes a whack-a-mole problem. And, yeah.
Benjamin (22:41.6)
Right. Yeah. That makes a lot of sense. The other thing you're seeing is that I think, especially when you sell to engineering teams, like they're very conscious of...
keeping architectures that allow them to move off a specific vendor in the future, right? They're very conscious about minimizing lock-in, et cetera, all of those things. And if you start buying into one platform and your data movement vendor comes as part of that, it also starts to become harder, because you start getting worried about, hey, now I'm really fully locked into that single platform.
Robin (23:14.808)
Yeah, as well as, you know, some really, really large organizations actually run both Snowflake and Databricks. So copying the data twice just doesn't follow DRY. As a result, they're looking more into an Iceberg-neutral architecture, where the data lives in Iceberg and the two platforms just read from there.
Benjamin (23:32.832)
Nice. All right. Cool. So moving away a bit from the concrete technology: how's your experience building a startup in this space, right? How's it working with customers, working with real workloads, getting paged in the middle of the night? Take us through some of your experiences. Are you enjoying it?
Robin (23:42.286)
Mm-hmm.
Robin (23:52.599)
Yeah. In short, it's fun, but extremely challenging work. Every pipeline that we touch is mission critical for customers, or else they would just use either their existing pipeline or a managed vendor that's out there. The majority of our customers care about real-time, low latency, care about mission-critical use cases, and have high volumes of data. Our first customer was Substack, and
immediately we went from a theoretical thing that I had tested out roughly with a couple thousand rows, because I didn't have that much synthetic data, to onboarding them, and in the first month they processed a billion rows. There were a lot of learnings there. The thing is, it's kind of like economies of scale, where databases basically allow you to do anything, which is great and scary at the same time. It's great if you're just inserting data into it, but if you're reading data out,
Benjamin (24:33.397)
Wow.
Robin (24:51.246)
there's a whole slew of user errors and edge cases that you just don't think about. My favorite one last week was: MongoDB allows you to store a timestamp as yyyy-mm-dd, but for the mm part, it allows you to store values greater than 12. There's no month 49. How did this data come in? And I was like, wait, what? But every single day, there's something new like this that I feel like...
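Robin's month-49 example, as a hypothetical defensive check a pipeline might run before casting to a destination date type. Go's standard time.Parse already rejects out-of-range months, so the sketch just surfaces the error so the row can be routed somewhere safe instead of failing the whole sync.

```go
// Flag calendar-impossible dates coming out of a loosely-typed source.
package main

import (
	"fmt"
	"time"
)

func parseLooseDate(s string) (time.Time, error) {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		// e.g. route the raw value to a dead-letter column instead of erroring the pipeline
		return time.Time{}, fmt.Errorf("unparseable date %q: %w", s, err)
	}
	return t, nil
}

func main() {
	if _, err := parseLooseDate("2024-49-01"); err != nil {
		fmt.Println("flagged:", err) // there is no month 49
	}
}
```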
Benjamin (25:09.65)
Yeah.
Robin (25:20.334)
It makes the job really interesting. One thing we also spend a lot of time doing, kind of going back to making things a little bit simpler: we're also doing that for our team. We're going really deep into Kafka protocols and learning about them. One bug that we spent the past year debugging: we were like, wait, Kafka is ordered, but somehow the data we're landing for this one customer is not ordered. And we couldn't repro the issue. It turns out the SDK that we were using was not...
Well, first, it's on life support right now. And second, during Kafka rebalances, the batches that need to retry are not retried in order. They basically just have an async await that waits for any single batch to finish. But that makes it so that if there's a rebalance and you're retrying batches, it's not guaranteed they will be in order. So then we're like, wait, what? These are the types of errors that we're dealing with. They're fun, but really challenging.
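The ordering bug reduced to its essence: a sketch of processing one partition's batches strictly in sequence, retrying the current batch before ever starting the next, rather than awaiting whichever batch happens to finish first. Batch and the apply callback are hypothetical stand-ins, not the SDK in question.

```go
// Apply batches for a single partition in strict order, retrying in place.
package ordering

import (
	"log"
	"time"
)

type Batch struct {
	Offset int64    // first offset in the batch
	Rows   [][]byte // serialized change events
}

// ProcessPartition drains one partition's batches in order.
func ProcessPartition(batches <-chan Batch, apply func(Batch) error) {
	for b := range batches {
		// Never start batch N+1 until batch N has been applied, so per-partition
		// ordering survives retries and consumer-group rebalances.
		for attempt := 1; ; attempt++ {
			err := apply(b)
			if err == nil {
				break
			}
			log.Printf("batch at offset %d failed (attempt %d): %v", b.Offset, attempt, err)
			time.Sleep(time.Duration(attempt) * time.Second) // simple linear backoff
		}
	}
}
```

Retrying with "whichever finishes first" semantics across in-flight batches is exactly what loses ordering; serializing per partition keeps the Kafka guarantee intact end to end.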
Benjamin (26:20.266)
Yeah, as an earlier-stage startup in that space, right? Because you're saying, hey, you're selling to engineering teams, it's mission-critical apps and so on. I think if you look at much of the traditional modern data stack, dbt, Fivetran, Looker:
okay, your data movement tool has a quick outage, it blips for 10 minutes, you're going to be fine. Obviously it hurts the business and you might get escalations and an incident and all of these things, but in most cases it's not like your product is now broken if you use that tool. And you guys are really in that mission-critical world, where Artie breaking means your customer's product isn't working, or your customer's ad monetization platform isn't working, things like that. So how do you build
trust as an early-stage vendor, especially as you start talking to bigger organizations? Because in our journey at Firebolt, we've been through similar things, and I actually think that's something I'd love to hear your take on. I have my own opinions that I can share afterwards. So yeah, give me your thoughts.
Robin (27:26.178)
Yeah. And also, part of the reason we're mission critical: on top of the destinations we talked about, other things that make it mission critical are, what's it called, we have Postgres and MySQL and SQL Server as destinations. So we're helping customers migrate data between databases, right? That requires real-time OLTP to OLTP. Another one is, because of our architecture, we have customers that just directly read from Kafka. That also needs to be real time. So, like,
Benjamin (27:45.856)
Right.
Robin (27:56.086)
as we build more of this, mission criticality only goes up, not down. And going back to your question about how you build trust as an early-stage startup: for us, our first few sets of customers were really large. How we built trust with them is, I literally show them the architecture. I show them our design patterns. And then the last one is,
before I even started writing code, I was thinking: if I was at Zendesk or at Opendoor, if I'm buying this type of tool, what does the architecture need to look like for me to trust this product, and what are all the possible failure recovery modes that need to be there? So that's kind of how we did it. As an example, I don't need to really sell people on Kafka, because they all already use it or are about to use it, and a publisher/consumer framework is pretty standard.
And then on the other hand, how we guarantee that we don't lose data is pretty straightforward. Number one, we use Kafka transactions, so we do not commit offsets until the destination tells us: hey, this data has actually been flushed. On the other hand, we flush data using SQL transactions, so it's atomic. And then the last thing is, okay, how do we make sure the data doesn't just keep piling up? Well, because we're not acking these messages, the...
Benjamin (29:07.957)
Right.
Robin (29:18.862)
Kafka offset will just keep going up, and you'll be able to see consumer lag very easily. Then you just build a monitor, slap the monitor on top of that, and it will page our team. So it was not that hard, if that makes sense. I broke it down into first principles, we talked about all the concerns and all the potential failure modes, and we could walk through every single one and address them. That addressed most of the issue. And then the next thing is, during trials, we also encourage them
to just load as much data as they want to stress test the system. And then after that, it's pretty self-explanatory at that point.
Benjamin (29:57.802)
Right, okay, nice. think, look, one thing you almost...
glossed over as a side comment, right, which is part of something I think we've learned here: especially in the beginning, there has to be deep partnership between the engineering teams, right? Because you don't build trust through having 10 incredible reference customers at scale; you get those over time. In the beginning, there has to be just deep partnership and trust between the engineering teams on both sides, and I think this is how you then go from zero to one, basically. And it seems like that's very similar for you guys ultimately, when you're saying, hey, you're sitting there going through the architecture,
kind of explaining how you thought about all of the failure modes, right? Like that's ultimately building trust between the engineering teams on both sides.
Robin (30:40.11)
Yeah. And I think another thing is that when we first launched, nothing worked. Everything was just broken. But that customer experience of feedback, like, hey, this is the exact error and we're fixing it... And it's funny, because I remember Substack gave us an engineer to help us with this problem, because we were just stuck on this Postgres lack-of-optimization issue. And the head of data at the time was just like, hey,
let me just give you an engineer, let's run some queries together and we can help you optimize. Because they understood that the problem we're trying to solve is hard enough, and they really liked the mission. Whereas, you know, the one insight that also got me to start the company was that a lot of the vendors out there for CDC, or for this real-time aspect, don't actually provide plug and play. They more or less provide a component and expect the engineers to either build or attach additional pieces:
Benjamin (31:15.574)
as well.
Benjamin (31:33.089)
Bye.
Robin (31:38.574)
they will land the data into Snowflake, but then they require you to materialize the table and then merge the table, as an example. We handle the whole thing. So I feel like some of our early adopters appreciated that we were trying to handle everything. And when we fell short, it's not like we weren't trying; we would just keep iterating with them until we fixed it.
Benjamin (31:43.371)
Yeah.
Benjamin (31:51.531)
Right.
Benjamin (31:57.63)
Sure. And when there's sufficient pain on the other side, right, and the other side buys into your vision, it's also easier to overcome obstacles during technical implementation, because hopefully you have such a compelling vision of the future end state, once you've overcome those obstacles, that it's worth going through them. Awesome. Cool. So if you look ahead in 20...
Robin (32:26.816)
I think what I'm excited about is spending even more time making this even more accessible. To give you an example: you mentioned merging data and understanding the intricacies, and how there might be platform-specific nuances. We're also building things at the ingestion level so that we don't have to understand platform nuances, as an example. So one thing that we really like is that BigQuery has time partitioning
that you can specify at the table level, and then they optimize at the storage level. Snowflake doesn't have this, Redshift doesn't have this, Databricks doesn't have this. So we came up with the concept of soft partitioning, which is platform agnostic and allows us to create these partitions out of the box, slap a view on top, and any DDLs that happen will update the partitions, update the view, and email the customer. I would expect us to spend even more time developing these things. We're also spending more time thinking about: after CDC, what's next?
Aside from monitoring, the next obvious question is data consistency and data reliability. So we're spending a lot of time implementing this paper called Deequ, D-E-E-Q-U, which Amazon came up with for monitoring data quality across datasets at large scale. And that's a really hard problem, because you can do a select star on both ends, but that's not performant.
So how do you do a performant check while still being thorough? That's also hard. So we're trying to tackle that problem. We're trying to build the best SQL Server connector out there. That's the kind of work we're currently thinking about. We're also trying to expand into different destinations where real time makes sense. So we just launched an events API. The idea is you hit this events API, which is similar to Segment, and we'll land that data into your Snowflake in 200 milliseconds using Snowpipe Streaming,
instead of you using Segment and doing hourly warehousing.
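On the data consistency point a moment earlier: here is a rough sketch of the kind of cheap comparison Robin is describing, pushing an aggregate (row count plus an order-independent key hash) down to source and destination instead of doing a SELECT * on both ends. The query shape and hash expression are assumptions for illustration, not Artie's implementation or Deequ's API.

```go
// Compare compact summaries of a table on both sides instead of full scans.
package consistency

import (
	"context"
	"database/sql"
	"fmt"
)

type TableSummary struct {
	Rows    int64
	KeyHash int64 // sum of per-row hashes; commutative, so order and parallelism don't matter
}

// Summarize computes the summary on one side. hashExpr is a SQL expression over the
// primary key that must evaluate to the same value on both engines for the check to
// be meaningful (finding a portable expression is the hard part in practice).
func Summarize(ctx context.Context, db *sql.DB, table, hashExpr string) (TableSummary, error) {
	var s TableSummary
	// Note: table and hashExpr are interpolated for brevity; a real tool would
	// validate or allow-list these identifiers.
	q := fmt.Sprintf(`SELECT COUNT(*), COALESCE(SUM(%s), 0) FROM %s`, hashExpr, table)
	err := db.QueryRowContext(ctx, q).Scan(&s.Rows, &s.KeyHash)
	return s, err
}

// Consistent compares the two summaries; a mismatch would trigger a deeper, chunked diff.
func Consistent(ctx context.Context, src, dst *sql.DB, table, hashExpr string) (bool, error) {
	a, err := Summarize(ctx, src, table, hashExpr)
	if err != nil {
		return false, err
	}
	b, err := Summarize(ctx, dst, table, hashExpr)
	if err != nil {
		return false, err
	}
	return a == b, nil
}
```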
Benjamin (34:23.318)
Nice.
Benjamin (34:27.948)
Okay, awesome. Hey, look, I'm excited to follow all of that along and kind of read through your announcements as you launch new features throughout this year. It was great having you on the show, seriously. You guys are building very hard to build things. Moving data at scale and at volume is super difficult. So it was great to get that deep dive into Artie today. I appreciate your time and yeah, see you around.
Robin (34:53.006)
Thank you. Thanks for having me.
