Data Mesh is hot stuff. But from a technology perspective it’s still not very well defined, and confusion appears to reign. I hear all manner of nonsense that such and such a tool will/won’t fit into a data mesh architecture, whereas really, almost any data and database technology can be used within your data mesh. It’s just that a single technology isn’t going to cut it.
Let’s dive in, and spoilers: I will be referencing Firebolt in this blog. Probably a lot. So yeah.
What’s Data Mesh again? (Promise I’ll be real quick…)
Very quick catch up on the relevant bits of Data Mesh.
- We’re looking to solve problems, specifically bottlenecks and reduced domain knowledge within the data, caused by having a centralized data team and corresponding centralized data pipelines and data repository.
- We want domain teams to take ownership of their data and deliver it, in product form (i.e. data products or data-as-a-product), to the business consumers; instead of not caring and leaving it up to the aforementioned central data team to hoover up and sort out.
- In order to make a lot of this easier, self-service data platform components need to be made available to the domain teams, who are less familiar with the technology in this space.
If you want to read more on that, Zhamak Dehghani’s book and the articles on Martin Fowler’s architecture site are great material to read and digest.
A quick note about nomenclature for teams here: if I refer to “the data team” I mean the historical, central and all-encompassing data team that organizations typically have. I’ll be more specific when talking about teams as described in the data mesh material: the “data platform team” is the team who build, maintain and own the data platform infrastructure, but not the data itself, while the “domain team” is the team who work with a domain product, and who will now be taking ownership of the data product under data mesh.
“Modern” Data Architecture
We also need a starting point: what do current data architectures typically look like?
At the top, we have our production applications, owned, built and operated by their domains. Typically these will have some form of data store, such as a relational database, a document store, a graph database, an object store, etc. The concern of these data stores is to hold data pertinent to the ongoing running of the application, and to deliver that data as fast and as consistently as possible for the application to function. Temporal concerns are not often required here; usually it’s more a case of “current state”, e.g. what’s in my basket, what was in my order, what is the price of this product now, etc. Running analytical queries directly on these databases is generally frowned upon, as those queries tend to require heavier processing and therefore put the operational focus of the application at risk: no business wants to lose orders because some analyst wants to know whether Blue Monday sells better on a Monday!
Therefore, in order to segregate those analytics workloads, data is typically copied to another repository. Mechanisms for this might include a whole host of ETL and data integration tools, or processes such as Change Data Capture (CDC), which replicates each change in the source database into another repository. It could be a streaming process emitting each event that occurs in real time. And so on. Whatever the mechanism, the goal should effectively be the same: dump each change to a record into a repository as an immutable event recording the data as it was at that point in time. The storage for this could be an Operational Data Store (ODS) database or, more typically, the “raw zone” of a Data Lake, likely cloud storage such as AWS S3, GCP Cloud Storage or Azure Blob Storage. The advantage of the data lake over an ODS is that schema change over time is easier to deal with - you don’t have to rewrite all your data when you want to add or modify a column. Immutability is a key concept here - no matter what, that event happened at that point in time, and shouldn’t be changed (other than deletion for regulatory reasons). Usually, the data team is responsible for this process.
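The append-only idea above can be sketched in miniature: a hypothetical CDC event landing in the raw zone as an immutable record. The `ChangeEvent` shape, field names and in-memory list are illustrative, not any particular tool’s format:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: the event cannot be mutated after creation
class ChangeEvent:
    table: str
    operation: str   # "insert" | "update" | "delete"
    payload: dict
    captured_at: str  # event time, recorded once at capture

raw_zone: list[ChangeEvent] = []  # stands in for S3 / GCS / Azure Blob Storage

def capture(table: str, operation: str, payload: dict) -> ChangeEvent:
    event = ChangeEvent(table, operation, payload,
                        datetime.now(timezone.utc).isoformat())
    raw_zone.append(event)  # append-only: history accumulates, nothing is updated
    return event

capture("orders", "insert", {"order_id": 1, "status": "placed"})
capture("orders", "update", {"order_id": 1, "status": "shipped"})
# raw_zone now holds both versions of order 1 - the update did not
# overwrite the earlier state.
```

The point of the sketch is the shape, not the plumbing: every change lands as a new record, so the full history of each row is preserved for later temporal modeling.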
This data will then go through some form of transformation into one or more data models - these could be relational, dimensional, data vault and so on - and could reside in a transformed zone of a data lake, a data warehouse database, or, if fairly small, even a relational database. The data team is again responsible for this process, both in terms of the modeling and the ETL processing required to perform the transformations themselves.
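As a rough sketch of that transform step, here is how immutable order events might be folded into a simple dimensional shape: a fact table that keeps every event, plus a “current state” dimension. Table and column names are invented for illustration:

```python
# Raw change events, as they might arrive from the raw zone (illustrative).
events = [
    {"order_id": 1, "status": "placed",  "amount": 20.0, "ts": "2024-01-01T10:00:00"},
    {"order_id": 1, "status": "shipped", "amount": 20.0, "ts": "2024-01-02T09:00:00"},
    {"order_id": 2, "status": "placed",  "amount": 35.0, "ts": "2024-01-02T11:00:00"},
]

# Fact table: every event is kept, preserving the temporal history.
fact_order_events = [
    {"order_id": e["order_id"], "status": e["status"], "ts": e["ts"]}
    for e in events
]

# Dimension: latest state per order, derived by replaying events in time order.
dim_order = {}
for e in sorted(events, key=lambda e: e["ts"]):
    dim_order[e["order_id"]] = {"status": e["status"], "amount": e["amount"]}
```

In a real pipeline this fold would be a SQL or dbt transformation over the lake or warehouse; the sketch just shows how the immutable history and the “current state” views coexist.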
Analytics queries can then be run against this data; but often a further step is also taken to build out specific data marts for certain business use cases. The data might be further aggregated and flattened in order to speed up serving to end users in dashboards or file dumps or all manner of delivery mechanisms.
Lastly, peripheral components like data catalogs, data dictionaries and (hopefully) some data observability tools will sit over all of this, to allow data to be found and governed.
How does Data Mesh change this Architecture?
In a nutshell: it doesn’t actually have to, all that much. The main concerns are a shift in ownership, and where boundaries exist between components that will be shared. You could well take a single domain data concern, say Orders, and build an architecture as above around it.
Your application exists much as it did in the example above. It will have its own data store, with the data modeled appropriately to ensure that the application functions correctly and performantly. This data may then have a pipeline, using say CDC, to emit the changes as immutable events to a separate storage database for analytics purposes. The data will be curated here, and any modeling required to handle its now temporal nature completed. The output may well be a dimensional model, with the usual facts and dimensions created. From a consumption perspective, the so-called data mesh “output ports” - or, as I prefer them, endpoints - can be created on top of this model, perhaps giving point-in-time access to the data, or differing levels of granularity, much like data marts. Access methods might be SQL to the dimensional model, or an API to various pre-canned queries for that data, or dashboards, or data extracts… you get the picture!
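To make the point-in-time idea concrete, here is a minimal sketch of such an output port: replaying the immutable event history up to a requested timestamp to reconstruct each order as it looked then. The names and in-memory list are illustrative stand-ins; a real port might be a SQL view or an API:

```python
# Immutable order history (illustrative).
order_events = [
    {"order_id": 1, "status": "placed",  "ts": "2024-01-01"},
    {"order_id": 1, "status": "shipped", "ts": "2024-01-03"},
]

def orders_as_of(ts: str) -> dict:
    """Point-in-time port: the state of every order at timestamp `ts`."""
    state = {}
    for e in sorted(order_events, key=lambda e: e["ts"]):
        if e["ts"] <= ts:  # only replay events up to the asked-for time
            state[e["order_id"]] = e["status"]
    return state
```

Because the events are immutable, the same query for the same timestamp always returns the same answer - a useful property for a data product with consumers you don’t control.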
So: what’s different?
The main drivers behind implementing data mesh are to remove the bottleneck a centralized data team brings, and to preserve domain knowledge within the analytical datasets. And this is where we see the difference between the more traditional centralized platform and the data mesh data product as described above. Whereas in the traditional sense the data team picks up everything from the CDC process onwards, under data mesh everything described above should be carried out entirely by the domain team.
The big gotcha here is that this means we aren’t going to be producing a conformed model across all our data. But the goal is not to do this. If a requirement exists to produce data utilizing more than one domain’s source data, then a new data product will be created, using the data product outputs of those sources to build it. This is one of the most contentious parts of data mesh. First: who actually does this, in terms of teams, now that the old data team is gone? Second: is there a risk of chaos from having multiple possible ways of combining data, and from losing the control a conformed model brings? These are questions that need to be asked and answered, and weighed against the issues that you are attempting to solve.
Firebolt and Data Mesh
So the preamble here is largely to get the idea across that, from a technology perspective, not all that much has changed. The tooling is very similar, albeit from a provisioning perspective it needs to meet a self-service implementation standard. But otherwise, we still have the same problems to work with: a datastore optimized for supporting an application, and the need to get that data into a shape that enables analytics without negatively impacting the application. The difference is that this happens within a node on the data mesh rather than for the entire enterprise data space.
Where does Firebolt fit in?
With the introduction of the “product” concept, performance becomes a far more critical concern than before in the data analytics space. Data is delivered as a product, and that simply changes expectations of how it will work. The consumers aren’t likely just analysts, and when connecting applications and so on, the likelihood of the data having a customer-facing impact increases dramatically. Run times of minutes are no longer acceptable - everything needs to be sub-second.
Firebolt’s inherent performance therefore makes it a great candidate to consider when selecting a database to deliver these datasets, especially over larger volumes, where it becomes increasingly challenging to deliver sub-second query performance.
By utilizing the optimizations Firebolt provides, interactions with a data product can move into real-time responsiveness and answer questions much faster.
Tailored Data Product
A data product delivers data in a shape, or several shapes, suited to the queries it serves. The endpoints will be defined in a manner that delivers data to meet certain questions and requirements that are largely already known, and certainly within a set of bounds. Even if that endpoint is a SQL endpoint, you will be limiting the possibilities for the consumer to a simple, predictable model.
Delivering a tailored experience can imply a heavy lift in terms of planning and day 2 operations. In many cases, day 2 operations will require continual optimization of data models and data access. Firebolt’s SaaS abstraction, smart storage and tailored indexes eliminate that heavy lift. This translates to a consistent product experience for consumers with minimal operational overhead.
Segregation of workloads
Firebolt’s underlying architecture is based on a separation of storage and compute. The same underlying data store can be accessed via different compute engines, so contention between workloads can be completely eliminated. If you are delivering a data product, the last thing you need is to be babysitting a bunch of nosy, noisy neighbors.
Imagine that your data product has three endpoints: an API delivering point in time data requests and alerting, a dashboard providing a prebaked visual view of the data, and a SQL endpoint allowing much more ad-hoc querying of that data (enabled via views to ensure the model remains consistent). It’s imperative to ensure that the first two workloads are not impacted by an analyst running a particularly large workload, and indeed with Data Mesh you should have SLAs to guarantee delivery of your data.
If those workloads can be separated onto different compute engines, then those guarantees can be easily achieved. What’s more, from a cost perspective, each engine can be sized and scheduled according to its use case - if the dashboards are only run during working hours, the engine servicing them can be shut down outside them. Rather than sizing for the peak of all workloads, you can size for the peaks of each particular workload, and even amend those sizings based on your consumption patterns (e.g. a small engine overnight dealing with few requests, and much larger resources when concurrency ramps up during busy periods).
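A toy sketch of that per-workload sizing and scheduling idea: each workload type maps to its own engine, and outside its scheduled hours no engine is returned at all (i.e. it is shut down). Engine names, sizes and hours are invented for the example:

```python
# Hypothetical per-workload engine configuration for one data product.
ENGINES = {
    "api":       {"size": "S", "hours": (0, 24)},  # always on, latency-critical
    "dashboard": {"size": "M", "hours": (8, 18)},  # working hours only
    "adhoc":     {"size": "L", "hours": (8, 18)},  # analysts' heavy queries
}

def engine_for(workload: str, hour: int):
    """Return the engine name serving `workload` at `hour`, or None if it
    is scheduled to be shut down at that time."""
    cfg = ENGINES.get(workload)
    if cfg is None:
        raise ValueError(f"unknown workload: {workload}")
    start, end = cfg["hours"]
    return workload if start <= hour < end else None
```

The design point is that the API engine never shares compute with the ad-hoc engine, so an analyst’s large query cannot break the SLA of the customer-facing endpoints, and each engine’s size and schedule is paid for independently.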
This segregation also allows costs to be easily ascribed to a data product, and potentially even passed on to the consumers.
In order to accelerate the build process of data products, self-service tooling is paramount. All components of Firebolt can be fully created, configured and automated through the API and SDKs, making it easy for your data platform team to create the wrappers that allow your domain teams to quickly build out their data products. Firebolt’s extensibility, with ecosystem integrations for ELT, orchestration, transformations, reverse ETL, data observability, BI and so on, gives your teams the flexibility to build for their needs.
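As an illustration of the kind of wrapper a data platform team might expose, here is a hypothetical self-service provisioning helper. The `DatabaseClient` below is a stand-in, not the real Firebolt SDK; all method names and parameters are assumptions for the sketch:

```python
class DatabaseClient:
    """Stand-in for a vendor SDK/API client - NOT the real Firebolt SDK."""
    def __init__(self):
        self.resources = []
    def create(self, kind: str, name: str, **opts):
        # In reality this would call out to the platform's API.
        self.resources.append((kind, name, opts))

def provision_data_product(client: DatabaseClient, product: str,
                           workloads: dict) -> None:
    """One call for a domain team: a database plus one engine per workload."""
    client.create("database", f"{product}_db")
    for workload, size in workloads.items():
        client.create("engine", f"{product}_{workload}", size=size)

client = DatabaseClient()
provision_data_product(client, "orders", {"api": "S", "dashboard": "M"})
```

The value of a wrapper like this is that the domain team declares intent (“a product called orders with these workloads”) while the data platform team owns the underlying API details, naming conventions and guardrails.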
To conclude: once we establish that, within a data mesh node, the tooling will largely remain the same as in an existing data platform, the selection of those tools can be assessed on a case-by-case basis. Each team will first need to understand the needs of their data product, and only then can the tools, including a datastore for analytical purposes, be selected.
Those considerations need to take into account the user experience of the data product, from both the end-user and the developer perspective. Ultimately, Firebolt is well placed for both by providing excellent cost/performance for the delivery of data alongside a strong set of features for self-service data product development.