Data is continuously evolving, and with it, we’re seeing rapid changes in streaming architecture, data warehousing, platform clustering and more. Exponential growth in the volume and velocity of available data has made centralized organization and management extremely difficult, with bottlenecks emerging in transforming and governing data workloads for near real-time applications across multiple business domains. Every organization today wants to be data-driven, and understandably so: data holds the key to advanced insights, trend analysis, business transformation and personalization. Evolving business requirements, however, call for a contemporary, dynamic architectural solution that can scale with evolving data objectives.
A data mesh is designed to address scalability and complexity issues with a domain-driven approach to data management and analytics. It shifts the responsibility for data integration, retrieval and analytics from a centralized data team to the respective domains, resulting in large-scale decentralization that can scale with enterprise BI (business intelligence) needs.
What is a data mesh and how does it work?
At a conceptual level, a data mesh is very similar to a microservices architecture. It enables individual domains to approach data as the primary product, allowing them to connect to other domains and perform cross-domain queries and analytics, much like how APIs (application programming interfaces) allow services and applications to communicate with each other.
In her groundbreaking publication, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, Zhamak Dehghani outlined four primary data mesh principles for a distributed data architecture:
Domain-oriented ownership:
While domain ownership is a concept most businesses are familiar with at an operational level, data and pipeline ownership is very often centralized under monolithic platforms. A data mesh federates data ownership to individual domains, making each one responsible for how it collects, transforms and distributes its data as a product. While the domains collectively share a standardized set of capabilities for managing the entire data pipeline, including ingestion, cleansing, transformation and distribution, they are individually responsible for aligning the data parameters to business standards.
Data as a product:
Democratizing data ownership contributes to making it ‘bite-sized’ and more manageable, which leads to the next principle; treating the data itself as a product. It imprints the awareness that the data is being consumed and applied outside of the domains, encouraging data owners to improve data quality and interoperability for the ‘product’. Similar to how public APIs work, projecting a product mindset on data encourages the domains responsible to make it more discoverable, addressable, trustworthy, accurate, accessible, interoperable and secure, adding value throughout the entire pipeline.
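To make the product mindset concrete, here is a minimal sketch of how a domain team might declare a data product's contract so that it is discoverable, addressable and trustworthy to other domains. The field names and the `mesh://` address scheme are illustrative assumptions, not the API of any specific data mesh platform.

```python
from dataclasses import dataclass

# Hypothetical data product contract; every field name here is an
# illustrative assumption, not a standard.
@dataclass
class DataProduct:
    name: str                    # discoverable: listed in a shared catalog
    domain: str                  # owning business domain
    address: str                 # addressable: stable URI for consumers
    schema: dict                 # interoperable: published, versioned schema
    freshness_sla_minutes: int   # trustworthy: agreed update frequency
    owner_contact: str           # accountable: who to contact when it breaks

orders = DataProduct(
    name="orders_daily",
    domain="sales",
    address="mesh://sales/orders_daily/v1",
    schema={"order_id": "string", "amount": "decimal", "placed_at": "timestamp"},
    freshness_sla_minutes=60,
    owner_contact="sales-data@example.com",
)
```

Publishing a contract like this is what makes the "product" consumable: downstream teams can evaluate the schema and SLA before ever touching the underlying data.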
Self-service infrastructure platform:
A self-service infrastructure is key to securing stakeholder engagement in a data mesh architecture. Individual business domains need domain-agnostic tools, resources and systems to develop a product mindset and to create and consume data as a product. A self-serve platform goes a long way towards making analytical data products accessible to general developers.
Federated computational governance:
Implementing a data mesh is a significant cultural shift for any organization, and securing stakeholder buy-in can pose a significant challenge. Domain heads might not be comfortable with the additional workload and processes required to streamline a distributed data architecture. A federated governance framework ensures there is a centralized set of policies and standards in place to align data processes with business rules and industry standards, while giving individual domains the flexibility to modify certain parameters without breaking interoperability. These parameters can be defined and mutually agreed upon through SLAs (service level agreements).
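The "computational" part of federated governance means the agreed policies are enforced automatically rather than by review. A minimal sketch, assuming a hypothetical rule set (required metadata fields and an SLA ceiling agreed across domains; neither comes from a real platform):

```python
# Assumed, mutually agreed governance rules -- illustrative values only.
REQUIRED_FIELDS = {"name", "domain", "owner_contact", "schema"}
MAX_FRESHNESS_SLA_MINUTES = 24 * 60  # hypothetical SLA ceiling

def check_governance(product: dict) -> list[str]:
    """Return a list of policy violations (an empty list means compliant)."""
    violations = []
    missing = REQUIRED_FIELDS - product.keys()
    if missing:
        violations.append(f"missing required fields: {sorted(missing)}")
    sla = product.get("freshness_sla_minutes")
    if sla is not None and sla > MAX_FRESHNESS_SLA_MINUTES:
        violations.append("freshness SLA exceeds the agreed ceiling")
    return violations

# A compliant product passes with no violations.
result = check_governance({
    "name": "orders_daily",
    "domain": "sales",
    "owner_contact": "sales-data@example.com",
    "schema": {"order_id": "string"},
    "freshness_sla_minutes": 60,
})
```

Anything outside the shared rule set is left to the domain, which is what keeps governance federated rather than top-down.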
While the general approaches to these data mesh principles will differ from organization to organization based on maturity and business goals, any basic data platform with mesh capabilities will have functions for operating the technology stack and creating, accessing, and discovering data products. Assigning data owners and engineers to individual domains can secure developer engagement towards adopting a distributed data mesh system.
Data platforms in the context of data mesh
A traditional data platform covers all aspects of the data pipeline, from ingestion and transformation to consumption and reporting. In most instances, data engineering teams are responsible for designing ETL (extract, transform and load) pipelines, running reports, evaluating data quality, loading data into data warehouses and online analytical processing (OLAP) databases, and more.
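The ETL pattern described above can be sketched in a few lines of plain Python. This is deliberately toy-sized: a real pipeline would use an orchestrator and a warehouse client, and all the data values here are made up for illustration.

```python
def extract() -> list[dict]:
    # Stand-in for reading rows from an operational source system.
    return [
        {"order_id": "A1", "amount": "19.99", "region": " eu "},
        {"order_id": "A2", "amount": "5.00", "region": "US"},
    ]

def transform(rows: list[dict]) -> list[dict]:
    # Cleanse: cast amounts to numbers, trim and standardize region codes.
    return [
        {
            "order_id": r["order_id"],
            "amount": float(r["amount"]),
            "region": r["region"].strip().upper(),
        }
        for r in rows
    ]

def load(rows: list[dict], warehouse: list) -> None:
    # Stand-in for writing to a warehouse or OLAP database.
    warehouse.extend(rows)

warehouse: list = []
load(transform(extract()), warehouse)
```

In a traditional setup, a central data engineering team owns every one of these stages for every domain, which is exactly where the bottlenecks discussed below come from.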
Since the architecture is primarily monolithic, with centralized data lakes for storing and transforming streaming data, most of the data ultimately ends up combined in one place. Because all queries go through the central data team, that team owns most of the reports and data products, creating bottlenecks for downstream applications. In some instances, shadow IT teams develop and manage their own demarcated data platforms to meet functional and technical requirements quickly, without explicit IT approval.
Data mesh, on the other hand, is built around a self-service infrastructure model that prioritizes building data infrastructure components which domain teams can then use to create and serve their own data products (hence, self-serve). Since data mesh depends on business domain ownership, the platform can function with just the metadata of the data products. This simplifies integration at the metadata level, where uses such as reporting, warehousing and analytics become downstream applications of the data as a product.
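Metadata-level integration can be sketched as a tiny catalog: domains publish only metadata about their products, and consumers discover a product and then pull data from its advertised address. Function and field names here are illustrative assumptions, not a real platform's API.

```python
# Hypothetical shared catalog: stores product metadata, never the data itself.
catalog: dict[str, dict] = {}

def publish(metadata: dict) -> None:
    """A domain team registers its data product's metadata."""
    catalog[metadata["name"]] = metadata

def discover(domain: str) -> list[dict]:
    """A consumer (reporting, warehousing, analytics) finds products by domain."""
    return [m for m in catalog.values() if m["domain"] == domain]

publish({"name": "orders_daily", "domain": "sales",
         "address": "mesh://sales/orders_daily/v1"})
publish({"name": "shipments", "domain": "logistics",
         "address": "mesh://logistics/shipments/v2"})

sales_products = discover("sales")
```

The point of the sketch is the division of labor: the platform only brokers metadata, while the data itself stays with, and is served by, the owning domain.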
Primary points of divergence between a data mesh and a data lake

Data mesh:
- Domain-centric ownership of data
- Uses a federated computational governance model
- Treats data as a product
- Manages the lifecycle of interconnected data products and ensures domain-agnostic interoperability
- A self-serve platform processes both operational and analytical data
- Data workers are spread across all the domains to ensure localized autonomy

Data lake:
- Data ownership revolves around data engineering teams
- Relies on a top-down governance model
- Treats data as an asset
- Requires centralized management of data pipelines, code and associated policies
- Operational and analytical systems communicate via point-to-point integrations
- All data requests and processes are routed through specialized data teams
In addition, a distributed data infrastructure platform should have tools and technologies for data observability, policy and compliance automation, monitoring functions and educational resources for training domain teams on accessing and exposing analytical data products. The federated computational governance framework should give the domains enough flexibility to maneuver around the standardized data components while adhering to the relevant product roadmaps.
Depending on the data requirements within an organization, the self-serve platform can be configured to serve different purposes, including data provisioning, lifecycle management and supervision. This model narrows the gap between operational and analytical data while respecting the autonomy of individual domains. The domain-centric approach allows businesses to scale their data architecture along with their business needs and maintain data quality and interoperability within a federated governance structure.
Case studies: Nokia and Netflix
Nokia needed a data framework that would provide real-time insights into the performance of business-critical domains to support advanced 5G services. Traditional data lakes, however, can’t effectively manage distributed data computation, data streaming from disparate sources and fragmented data handling at the same time.
In order to realize their vision of a data mesh, Nokia used their existing Nokia Open Analytics (NOA) framework to create a low footprint, analytical and unified architecture providing complete visibility across multiple application use cases and data products.
Figure: Data products under Nokia’s data mesh
(Source: Importance of Data Mesh Architecture for Telcom Operators | by Adnan Khan | Medium)
With a self-serve data infrastructure and a federated governance model, the NOA management plane made it significantly easier for data teams to access the data they need while automating relevant compliance tasks like data standardization, alerting, logging, data lineage, product monitoring and more.
Netflix began experimenting with data mesh to explore alternate avenues for increasing visibility into studio data at scale. In the entertainment industry, it’s even more important to be able to react to market changes and adapt production parameters accordingly; Netflix needed near real-time visibility into activities taking place under various business domains.
With a data mesh platform, Netflix tried to overcome existing challenges with latency in ETL pipelines, stale data, inconsistent security controls, broken entity onboarding and several other issues. By using a data mesh to streamline data movement and consume, transform and retrieve CDC (change data capture) events, they have managed to successfully address most of the pressing pain points.
Netflix has used reusable processors (a unit of data processing logic for CDC events) and schema evolution to keep reporting pipelines up to date with upstream data changes. With Apache Iceberg as the data warehouse sink, they have provided data workers with the functionality to build analytical views directly on top of data tables. The data mesh platform has built in metrics and dashboards at both the processor and the pipeline level to maintain data quality. Trackers provide near real-time reports through downstream applications to provide maximum business impact for data workers, leading to improved data observability and change management.
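In the spirit of the reusable processors described above, here is a hedged sketch of the core idea: a small unit of logic that applies insert, update and delete CDC events from an upstream source to a downstream materialized view. The event shape and example data are assumptions for illustration; Netflix's actual platform, schema evolution handling and Iceberg sink are far more involved.

```python
def apply_cdc_event(view: dict, event: dict) -> None:
    """Apply one change-data-capture event to an in-memory view.

    Assumed event shape (illustrative only):
      {"op": "insert" | "update" | "delete", "key": ..., "row": {...}}
    """
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        view[key] = event["row"]   # upsert the latest version of the row
    elif op == "delete":
        view.pop(key, None)        # remove the row if it exists

view: dict = {}
events = [
    {"op": "insert", "key": "m1", "row": {"title": "Pilot", "status": "draft"}},
    {"op": "update", "key": "m1", "row": {"title": "Pilot", "status": "final"}},
    {"op": "insert", "key": "m2", "row": {"title": "Ep 2", "status": "draft"}},
    {"op": "delete", "key": "m2"},
]
for e in events:
    apply_cdc_event(view, e)
```

Because the processor is a self-contained unit, the same logic can be reused across pipelines, and metrics can be attached at the processor boundary, which is what makes the quality dashboards described above possible.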
Since their initial successes with data mesh, Netflix has continued to expand its scope to a growing number of applications.
Choosing the right data architecture for your business needs
While data mesh’s role in shaping the future of data architecture is not yet settled, it is well positioned to become the data management framework of choice for businesses working with enormous amounts of data. Successful data mesh implementation requires considerable expertise and a large data engineering team. For organizations that don’t need to reckon with extensive volumes of data for day-to-day operational reporting, data integration and data virtualization are preferable frameworks for meeting business data needs in terms of complexity and cost optimization.
When it comes to data, it’s difficult for most businesses to evaluate their maturity level, especially if the existing data architecture is siloed. Large-scale procedural and structural changes require a solid foundation and a comprehensive roadmap; therefore, it’s essential to understand where you are in the pipeline.
While implementing and following data mesh best practices will eventually give you more ownership of your data and make you more self-sufficient, you should consider a data lake or data warehouse for your OLAP systems if:
- You are a small to mid-sized organization with clear points of integration between analytical and operational data
- Your data engineering team is not extensive enough to lend resources to autonomous domain teams
- You need to store, process and archive massive volumes of raw, structured and unstructured data outside of a restricted time window
- You need a cost-effective solution for big data
However, you should consider upgrading to a data mesh if:
- AI/ML and data-driven decision making are strategic differentiators for your business and can provide a guaranteed competitive advantage
- Your data engineering team has the maturity and experience to design and implement DevOps, CI/CD, AIOps and more to augment the technology stack
- You have enough data workers to support a domain-centric approach to data management
- Your current data infrastructure is bottlenecking your ability to extract real-time insights from disparate sources at scale
From our experience, most businesses can solve a majority of their immediate problems with an outcome-oriented approach to data integration. Data mesh strategy and execution require a longer timeline, stakeholder buy-in and comprehensive resources. In most cases, data virtualization can bring speed and scale to most business needs while giving organizations the time to prepare a roadmap towards implementing a distributed data architecture in the long run.
It’s a good idea to consult a data specialist to figure out where your business falls within the data management spectrum. The clearer that picture, the easier it will be to ascertain what kinds of solutions would be a good fit for your business data needs.