
The Rise of DataOps (from the ashes of Data Governance)


Legacy Data Governance is broken in the ML era. Let’s rebuild it as an engineering discipline to drive orders-of-magnitude improvements.


Companies know they need data governance, but aren’t making any progress in achieving it

These days, executives are interested in data governance because of articles like these:

  1. Recent Gartner research found that organizations believe poor data quality is responsible for an average of $15 million per year in losses.
  2. The first major GDPR fine was Google’s $57 million penalty from the French data protection authority.
  3. The Equifax data breach has cost the firm $1.4 billion (and counting), despite the fact that the stolen data has never been found.

On the other hand, the vast majority of data governance initiatives fail to move the needle, with Gartner also categorizing 84% of companies as low maturity in data governance. Despite the fact that nearly every organization recognizes the need for data governance, many companies are not even starting data governance programs due to the strong negative connotation of the term within the executive ranks.


Current data governance “best practices” are broken

In my experience, the reason for the lack of progress is that we have been doing data governance the wrong way, making it dead on arrival. Stan Christiaens got this right in his Forbes article, despite the fact that it was essentially an ad (a very effective one) for his company. I agree with him that the primary reason governance has failed in the past is because the technology just wasn’t ready, and organizations couldn’t find ways to motivate people to follow the processes that filled the technology gaps. However, I disagree that modern data catalog tools provide the complete technology answer we need to be successful (although they are a step in the right direction).

If Data Catalog tools aren’t the answer, what is?

Recent advances in data lake tooling (specifically, the ability to version data at scale) have put us at a tipping point where we can reimagine the way we govern data (i.e. the culture, structures, and processes in place to achieve the risk mitigation and cost reduction that governance promises). At the end of the transformation, data governance will look a lot more like DevOps, with data stewards, scientists, and engineers working closely together to codify governance policies throughout the data analytics lifecycle. Companies that adopt these changes early will create a huge competitive advantage.

To understand how I came to that conclusion, we will have to go back through some of the history of software engineering, where two core technical innovations enabled process and eventually cultural changes that transformed coding from a hobby into a world-eating revolution. We’ll then see how similar innovations were the primary enablers of the DevOps movement, which has similarly transformed IT infrastructure in the cloud era. Finally, we’ll see how these innovations are poised to drive similar process and cultural changes in data governance. It’ll take a little while to build the case, but I haven’t found a better way to get the point across, so please stick with me.

Background: How Source Control and Compilation created Software Engineering


The core innovations that created the discipline of software engineering are:

  1. The ability to compile a set of inputs to executable outputs
  2. Version control systems to keep track of the inputs

Before these systems, back in the 1960s, software development was a craft, where a single craftsman had to deliver an entire working system. These innovations enabled new organizational structures and processes to be applied to the creation of software, and programming became an engineering discipline. This is not to say that the art of programming is not extremely important; it’s just not the topic of this article.


The first step in moving from craft to engineering was the ability to express programs in higher-level languages through compilers. This made programs easier to understand for the people writing them, and easier to share across a team, because a program could be broken down into multiple files. Additionally, as compilers got more advanced, they added automated improvements to the code by passing it through many intermediate representations.


By adding a consistent version control system across all of the changes that ended up producing the system, the art of coding became “measurable” over time (in the sense of Peter Drucker’s famous quote: “you cannot manage what you cannot measure”). From there, all sorts of incremental innovations, like automated tests, static analysis for code quality, refactoring, continuous integration, and many others, were added to define additional measures. Most importantly, teams could file and track bugs against specific versions of code and make guarantees about specific aspects of the software they were delivering. Obviously there have been many other innovations to improve software development, but it is hard to think of ones that aren’t dependent in some way on compilers and version control.

Everything-as-code: Applying Software Engineering’s core innovations elsewhere

In recent years, these core innovations have been applied to new areas, leading to a movement aptly titled everything-as-code. While I wasn’t personally there, I can only assume that software developers met the first version control systems (like SCCS) back in the ’70s with a skeptical eye. In much the same way, many new areas consumed by the everything-as-code movement have garnered similar skepticism, some even claiming that their discipline could never be reduced to code. Then, within a few years, everything within the discipline is reduced to code, and this leads to many-fold improvements over the “legacy” way of doing things.

Turning code into infrastructure using a “compiler” layer of virtualization and configuration management

The first area of expansion was infrastructure provisioning. In this case, the code is a set of config files and scripts specifying the infrastructure configuration across environments, and the compilation happens within a cloud platform, where the config is read and executed alongside scripts against the cloud service APIs to create and configure virtual infrastructure. While it may seem like the Infrastructure as Code movement swept through all infrastructure teams overnight, a ton of amazing innovations (virtual machines, software-defined networks, resource management APIs, etc.) went into making the “compilation” step possible. This likely started with proprietary solutions from firms like VMware and Chef, but it became widely adopted when public cloud providers made the core functionality free to use on their platforms. Before this shift, infrastructure teams managed their environments to ensure consistency and quality because they were hard to recreate. This led to layers of governance designed to apply control at various checkpoints in the development process. Today, DevOps teams engineer their environments, and the controls can be built into the “compiler”. This has created an orders-of-magnitude improvement in the ability to deploy changes, going from months or weeks to hours or minutes.
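As a rough illustration (not tied to any particular product), here is a minimal infrastructure-as-code sketch in Python, assuming AWS and the boto3 SDK; the environment names, AMI ID, instance types, and tags are hypothetical placeholders:

```python
import boto3

# The "source code": a small config describing the desired environments.
ENVIRONMENTS = {
    "dev":  {"instance_type": "t3.micro", "count": 1},
    "prod": {"instance_type": "m5.large", "count": 3},
}

def provision(env_name: str, ami_id: str = "ami-0123456789abcdef0") -> None:
    """The "compilation" step: read the config and call the cloud APIs."""
    spec = ENVIRONMENTS[env_name]
    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId=ami_id,                      # placeholder AMI ID
        InstanceType=spec["instance_type"],
        MinCount=spec["count"],
        MaxCount=spec["count"],
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "environment", "Value": env_name}],
        }],
    )

if __name__ == "__main__":
    provision("dev")
```

The point is that the environment is now described by data sitting in version control and recreated by calling APIs, rather than configured by hand at a checkpoint.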

This enabled a complete rethink of the possibilities for improving infrastructure. Teams started to codify each of the stages for creating their systems from scratch, making the compilation, unit testing, analysis, infrastructure setup, deployment, and functional and load testing a fully automated process (Continuous Delivery). Additionally, teams started testing that the system was secure both before and after deployment (DevSecOps). As each new component moves into version control, the evolution of that component becomes measurable over time, which will inevitably lead to continuous improvement, because we can now make guarantees about specific aspects of the environments we deliver.

Getting to the point: the same thing will happen to data governance

The next field to be consumed by this phenomenon will be data governance / data management. I’m not sure what the name will be (DataOps, Data as Code, and DevDataOps all seem a bit off), but its effects will likely be even more impactful than DevOps/infrastructure as code.

Data pipelines as compilers

“With Machine Learning, your data writes the code.” — Kris Skrinak, ML Segment Lead at AWS


The rapid rise of Machine Learning has provided a new way to build complex software (typically for classifying or predicting things, but it’s going to do more over time). This mindset shift to thinking of the data as the code will be a key first step to converting data governance to an engineering discipline. Said another way:

“Data pipelines are simply compilers that use data as the source code.”

There are three things about these “data compilers” that are different from, and more complex than, those for software or infrastructure:

  1. Data teams own both the data processing code and the underlying data. But if the data is now the source code, it’s as if each data team is writing its own compiler to build something executable from the data.
  2. With data, we have been specifying the structure manually through metadata, because this helps the teams writing the data compiler understand what to do at each step. Software and infrastructure compilers typically infer the structure of their inputs.
  3. We still don’t really understand how data writes code. This is why data scientists experiment to figure out the logic of the compilers, and data engineers come in later to build the optimizers.
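To make the compiler metaphor concrete, here is a minimal sketch in Python using pandas and scikit-learn; the file name and column names are hypothetical placeholders:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The "source code": a raw dataset (file and column names are placeholders)
raw = pd.read_csv("customer_events.csv")

# An "intermediate representation": the feature-engineered dataset
features = raw[["tenure_days", "logins_last_30d"]].fillna(0)
labels = raw["churned"]

# The "compile step": fitting turns the data into an executable artifact
model = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])
model.fit(features, labels)

# The "executable": the trained model makes predictions on new inputs
print(model.predict(features.head()))
```

The raw file is the source, the feature-engineered DataFrame is the intermediate representation, and the fitted model is the executable output of the “compile”.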

The current data management platforms (Collibra, Waterline, Tamr, etc.) are built to enable this workflow, and they do a pretty good job. However, the workflow they support still makes the definition of data governance a manual process handled in review meetings, which blocks the type of improvements we saw after the advent of DevOps and Infrastructure as Code.

The missing link: Data Version Control

Applying data version control. Credit to the DVC Project: https://dvc.org/

Because data is generated “in the real world,” not by the data team, data teams have focused on controlling the metadata that describes it. This is why we draw the line between data governance (trying to manage something you can’t directly control) and data engineering (which actually engineers the data compilers rather than the data itself). Currently, data governance teams attempt to apply manual checks at various points to control the consistency and quality of the data. Introducing version tracking for the data itself would allow data governance and engineering teams to engineer the data together: filing bugs against data versions, applying quality-control checks to the data compilers, and so on. This would allow data teams to make guarantees about the system components that the data delivers, which history has shown will inevitably lead to orders-of-magnitude improvements in the reliability and efficiency of data-driven systems.
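As a sketch of what engineering against data versions could look like day to day, the DVC project shown above exposes a Python API for reading a dataset exactly as it existed at a given Git revision; the repository layout, tag, and column names below are hypothetical placeholders:

```python
import dvc.api
import pandas as pd

# Open the dataset exactly as it existed at Git tag "v1.2.0"
# (path, tag, and column names are hypothetical).
with dvc.api.open("data/customers.csv", rev="v1.2.0") as f:
    customers = pd.read_csv(f)

# A simple quality gate that could run inside the pipeline: fail the build
# (and file a bug against this data version) if a required column has nulls.
assert customers["customer_id"].notnull().all(), "null customer_id in data v1.2.0"
```

Once checks like this run against a specific, reproducible data version, a failed check becomes a bug that can be filed, tracked, and fixed like any other.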

The data version control tipping point has arrived

Platforms like Palantir Foundry already treat the management of data in much the same way as developers treat the versioning of code. Within these platforms, datasets can be versioned, branched, and acted upon by versioned code to create new datasets. This enables data-driven testing, where the data itself is tested in much the same way that the code that modifies it might be tested by a unit test. As data flows through the system in this way, the lineage of the data is tracked automatically, as are the data products produced at each stage of each data pipeline. Each of these transformations can be considered a compile step, converting the input data into an intermediate representation, before machine learning algorithms convert the final intermediate representation (which data teams usually call the feature-engineered dataset) into an executable form to make predictions. If you have $10M-$40M lying around and are willing to go all in with a vendor, the integration of all of this in Foundry is pretty impressive (disclaimer: I don’t have a ton of hands-on experience with Foundry; these statements are based on demos I’ve seen of real implementations at clients).

The Databricks Delta Lake open source project enables data version control for data lakes

For the rest of us, there are now open source alternatives. The Data Version Control (DVC) project is one option, focused on data scientist users. For big data workloads, Databricks has taken the first step in open sourcing a true version control system for data lakes with the release of the Delta Lake project. These projects are brand new, so branching, tagging, lineage tracking, bug filing, etc. haven’t been added yet, but I’m pretty sure the community will add them over the next year or so.
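For a sense of what data versioning looks like in practice, here is a minimal Delta Lake sketch in PySpark, assuming the pyspark and delta-spark packages are installed; the table path and data are placeholder examples:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

# Assumes `pip install pyspark delta-spark` with compatible versions.
builder = (SparkSession.builder
           .appName("delta-versioning")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Each write to the table produces a new, queryable version of the dataset.
events = spark.range(0, 1000).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("overwrite").save("/tmp/events")

# "Time travel": read the dataset exactly as it existed at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")

# Inspect the commit history of the dataset, roughly `git log` for data.
DeltaTable.forPath(spark, "/tmp/events").history().show()
```

Being able to read and test a dataset at a specific version is the building block that the branching, tagging, and bug-filing workflows described above would sit on top of.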

The next step is to rebuild data governance

The arrival of technology for versioning and compiling data puts the onus on data teams to start rethinking how their processes can take advantage of this new capability. Those who actively leverage it to make guarantees will likely create a massive competitive advantage for their organization. The first step will be killing off the checkpoint-based governance process and instead asking the data governance, science, and engineering teams to work closely together to enable continuous governance of data as it is compiled by data pipelines into something executable. Somewhere behind that will be the integration of the components compiled from data alongside the pure software and infrastructure as a single unit, although I don’t think the technology to enable this exists yet. The rest will emerge over time (and in another post). I know it sounds crazy to say, but this is an exciting time to be in data governance.

