
Importance of Data Discovery in Data Mesh Architecture


Data Mesh/Discovery — Panel Recap

Recently, I came across a great panel, hosted by Data Mesh Learning in collaboration with the Open Source Data podcast, that discussed the significance of data discovery in data mesh architecture and other important issues surrounding data mesh delivery.

The panel consisted of experts in the field: Shinji Kim, CEO of Select Star; Sophie Watson, Principal Data Scientist at Red Hat; Mark Grover, Founder of Stemma; and Shirshanka Das, CEO of Acryl Data.

The objective of the discussion was to understand how data discovery addresses frequent data management issues: How do we make it easier to access data? How do we help consumers understand the data they are using? And how do we give data producers the mechanisms they need to produce high-quality data products?

This blog recaps the highlights of that discussion.

Why Is Data Discovery Important?

Data discovery is a business-user-oriented process to visually navigate data and understand different patterns through analytics. However, access to data is a hurdle that every data scientist, software developer, product manager, or business analyst encounters every day.

Whether we are producers of data or consumers, data discovery affects us all. To use and analyze data, we need to access it, but accessing data means knowing what exists and where before we can analyze and operationalize it. That makes data discovery critical for data professionals and organizations that want to query data and make informed business decisions.

Why Now?

The field of data discovery is changing rapidly; we can no longer catalog a schema once and reuse it indefinitely. This change stems largely from the rise of the modern data stack. Today, companies collect enormous amounts of data from a wide range of sources.

Connecting this dynamically sourced data in one place has become a major challenge because it is no longer just a centralized data team using the data. Now it is engineers, analysts, marketing and sales ops teams, and other functional teams.

The concept of what counts as data has also changed dramatically. On the consumption side, it has expanded from tables in the data warehouse to machine learning (ML) models, analytical reports, business intelligence (BI) dashboards, and more; on the production side, it now includes operational databases, APIs, and systems such as Postgres and Kafka upstream of the warehouse.

Additionally, the migration of the centralized data warehouse to the cloud has shifted ingestion and processing from extract, transform, load (ETL) to extract, load, transform (ELT), which leaves businesses with far more data sets. Add to that the decentralized ownership and distributed data access of data mesh architecture, and data discovery becomes more difficult today than ever before.
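
To make the shift concrete, here is a minimal ELT sketch in Python: raw records are landed first and transformed inside the warehouse afterward. The table names and the sqlite3 stand-in for a cloud warehouse are hypothetical illustrations, not something from the panel.

```python
# Minimal ELT sketch: land raw data first, transform inside the warehouse.
# sqlite3 stands in for a cloud warehouse; all table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: store the raw records untouched.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 19.99, "shipped"), (2, 5.50, "cancelled")],
)

# Transform: derive a cleaned data set inside the warehouse itself.
# The raw table stays behind, which is one reason ELT multiplies data sets.
conn.execute(
    "CREATE TABLE orders_clean AS "
    "SELECT id, amount FROM raw_orders WHERE status != 'cancelled'"
)
print(conn.execute("SELECT * FROM orders_clean").fetchall())  # [(1, 19.99)]
```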

This hyperspecialization and the steady growth of data have left us not knowing what data exists, why it exists, or where it lives, all of which prevents organizations from using their data. That makes it all the more important to solve this problem now.

Role of Discovery in Data Mesh

The core notion of data mesh is the recognition that how we model, produce, and consume data is decoupled. With decoupled data, the common concern is: if users need to access data or services they did not create, how will they find them and learn to use them? It is this part of the data mesh that affects data discovery the most.

Data mesh splits centralized data into data domains and allows users to apply high-quality data product thinking to how data is shared. Data discovery is essentially a capability of the data and control planes of the mesh, and it creates a better environment for discovering and tagging data.

Companies that already have a data mesh model first need a data discovery platform to find and understand their data; that is where discovery starts with data mesh. Then, as data teams begin owning their data by adding tags and ownership, the mesh allows them to invite other users through democratized access while maintaining full governance and control over a source of truth with distributed ownership. This is the main intersection of discovery and its role in data mesh.
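
As a rough sketch of what tags and ownership on a data domain might look like, the snippet below models a tiny discovery catalog in plain Python. All class, field, and team names are hypothetical and do not reflect any particular vendor's API.

```python
# Hypothetical sketch of ownership and tagging in a discovery catalog.
# The classes and team names are illustrative, not a real vendor API.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    domain: str                 # owning data domain, per data mesh
    owner: str                  # team accountable for quality and docs
    tags: list = field(default_factory=list)
    description: str = ""

catalog = {}

def register(product):
    """Publish a data product so other domains can discover it."""
    catalog[product.name] = product

def discover(tag):
    """Find data products across domains by tag."""
    return [p for p in catalog.values() if tag in p.tags]

register(DataProduct(
    name="orders_daily",
    domain="sales",
    owner="sales-data-team",
    tags=["source-of-truth", "pii-free"],
    description="Daily order aggregates, one row per order.",
))
print([p.name for p in discover("source-of-truth")])  # ['orders_daily']
```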

Data governance is also about visibility: it gives data teams context on what is in progress and what other teams have already done, eliminating the need to rediscover or rebuild everything anew.

Issues and Opportunities Around Data Mesh

Data mesh with discovery makes it possible for teams to know what data is being produced, so they do not reinvent the wheel. It prevents two common scenarios in which data teams spend a lot of time rediscovering metadata: first, when a business hires new experts who know how to make data-driven decisions but lack data context; second, when someone moves to a different business unit for a while and, upon return, finds the metadata has completely changed.

At any given time, organizations have many different data models running to log data into the warehouse and make it available to users. A company's data warehouse may have 200 columns and dashboards all related to a single operational aspect, which makes it nearly impossible for users to tell what the single source of truth is.

Discovery in data mesh helps establish the balance between data producers and consumers to make data more discoverable and reliable through the following practices:

Open-Source-Inspired Shared Ownership

As in open-source communities, ownership of data reliability and discovery lies with everyone who interacts with the data. The main reason data discovery fails is that the data lacks the documentation users need to derive value from it. A shared sense of responsibility, borrowed from the open-source approach, incentivizes users to fix the data issues they discover in order to save others the trouble.

Integration of Automated Insights

Data documentation is vital for better discovery, especially on the producer side, but by itself it just creates more tables to maintain. What we need is automation that pulls existing operational metadata to augment the discovery picture. Automated insights can foster better documentation and create lineage that propagates information across the mesh.
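
As one hedged illustration of mining operational metadata automatically, the sketch below infers table-level lineage from warehouse query logs. The log format and regular expressions are simplified assumptions for illustration only.

```python
# Sketch: infer table-level lineage from warehouse query logs.
# The log format and regexes are simplified assumptions, not production-grade SQL parsing.
import re

query_log = [
    "CREATE TABLE orders_clean AS SELECT id, amount FROM raw_orders",
    "INSERT INTO revenue_daily SELECT date, SUM(amount) FROM orders_clean GROUP BY date",
]

lineage = {}  # downstream table -> set of upstream tables

for sql in query_log:
    target = re.search(r"(?:CREATE TABLE|INSERT INTO)\s+(\w+)", sql, re.I)
    sources = re.findall(r"FROM\s+(\w+)", sql, re.I)
    if target:
        lineage.setdefault(target.group(1), set()).update(sources)

print(lineage)
# {'orders_clean': {'raw_orders'}, 'revenue_daily': {'orders_clean'}}
```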

Simplified User Experience

For a simplified user experience, it is important to understand how and where the data is being used: is it primarily feeding sales reports, or is it powering product analytics? Once data analyst or business intelligence teams define a structure for how data should be categorized, other people can contribute to and maintain that protocol. A simplified user experience can also kick-start the initial documentation effort that usually has to happen alongside data discovery.

Treating Data as Code

Treating data and metadata as code is common in the data mesh community. When we create a data product, there should be rules and documentation defining what makes it valid, and those rules should be applied as part of the build system. Each product needs documentation with it, including compliance tags, automated ID checks, and so on. Such systems, integrated into the data discovery platform, considerably reduce the likelihood of producing bad data.
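
A minimal sketch of this idea, assuming a hypothetical data product spec format: validation runs as part of the build and fails when documentation or compliance tags are missing.

```python
# Sketch: validate a data product's metadata as part of CI, failing the
# build on missing documentation or compliance tags.
# The spec structure is a hypothetical example, not a standard format.
REQUIRED_FIELDS = {"description", "owner", "compliance_tags"}

def validate_data_product(spec):
    """Return a list of rule violations; an empty list means the product is valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS - spec.keys()]
    if not spec.get("compliance_tags"):
        errors.append("at least one compliance tag (e.g., 'pii', 'gdpr') is required")
    for column in spec.get("columns", []):
        if not column.get("description"):
            errors.append(f"column '{column.get('name')}' lacks a description")
    return errors

spec = {
    "name": "orders_daily",
    "owner": "sales-data-team",
    "description": "Daily order aggregates.",
    "compliance_tags": ["pii-free"],
    "columns": [{"name": "order_id", "description": "Unique order key"}],
}
assert validate_data_product(spec) == [], "build fails on an invalid data product"
```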

Code-Centric Discovery

For effective data governance, which often leads to data compliance, data discovery should be code-centric as well as user-centric. It needs programmatic abstractions so that the discovery abstractions offered to users also apply to code, e.g., a feature or model registry. These all need a backend that can reliably answer relevant queries at runtime, so the right policies can be applied at query time instead of shipping data back.
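
A hedged sketch of what such a code-facing registry might look like, answering a policy question at query time; the registry contents, asset names, and roles are hypothetical.

```python
# Sketch: a code-facing model/feature registry that answers policy
# questions at runtime, so callers never pull raw restricted data.
# The registry contents and policy rules are hypothetical.
REGISTRY = {
    "churn_features_v2": {"contains_pii": True, "allowed_roles": {"ml-prod"}},
    "orders_daily": {"contains_pii": False, "allowed_roles": {"ml-prod", "analyst"}},
}

def resolve(asset, caller_role):
    """Look up an asset and enforce access policy at query time."""
    meta = REGISTRY.get(asset)
    if meta is None:
        raise KeyError(f"unknown asset: {asset}")
    if caller_role not in meta["allowed_roles"]:
        raise PermissionError(f"{caller_role} may not read {asset}")
    return meta

print(resolve("orders_daily", "analyst"))    # permitted
try:
    resolve("churn_features_v2", "analyst")  # blocked by policy
except PermissionError as err:
    print(err)
```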

Watch the full panel discussion here

