There’s been a lot of discussion lately about systems for doing version control for data. Most recently, Ryan Gross wrote a blog post “The Rise of DataOps” where he lays out how data version control is the most obvious next step in moving data pipelines from something that’s “maintained” to something that’s “engineered.” I enjoyed this blog post and I like many of the analogies that Gross draws, but I’ve encountered this idea of “git for data” in a number of other places and I think that the notion obfuscates more than it clarifies when it comes to thinking about the properties of an ideal data platform.
So to start, let’s lay out what we mean when we talk about version control for data. Where I see this term used most is when someone is talking about the ability to “time travel” through a dataset in order to see its state at different points in time (potentially marked with something equivalent to a named “commit”). In this post, I’m not going to be talking about version control for data science research (e.g., version control for Jupyter notebooks), which I do believe is an important topic, but one that should be solved in a very different manner.
I’ll directly quote the description from Gross’s article:
Within these platforms, datasets can be versioned, branched, acted upon by versioned code to create new data sets. This enables data driven testing, where the data itself is tested in much the same way as that the code that modifies it might be tested by a unit test. As data flows through the system in this way, the lineage of the data is tracked automatically by the system as are the data products that are produced at each stage of each data pipeline. Each of these transformations can be considered a compile step, converting the input data into an intermediate representation, before machine learning algorithms convert the final Intermediate Representation (which data teams usually call the Feature Engineered dataset) into an executable form to make predictions.
That is, in this paradigm, many different versions of a dataset are saved such that a user can easily reproduce the data from a given point in time (or, more abstractly, from any given point in the work tree regardless of calendar time).
Where Version Control for Data is Useful
The description Gross provides is an interesting one, and I can definitely see how such a system would be useful in a narrow set of use cases. Specifically, use cases where:
- Data quality for machine learning models is the most important data-quality problem
- The machine learning models are trained statically and not on “live” data (i.e., the training, compiling, and deploy process is happening at some rate much less frequent than “continuous”)
- You’re working in a field or industry where the data you work with doesn’t change frequently (e.g., you get a static dataset of survey results that you work with over a long period of time). That is, you’re not getting “new” data frequently and the structure of the data (the data model) is not evolving rapidly.
In this situation, I can see how a data version control system could help you improve the reproducibility and legibility of machine learning models in a way that would be valuable.
However, I believe that the majority of practitioners working with data today are not working in that environment, and the idea of data version control doesn’t address the most important way that data changes cause issues for practitioners.
How Data Change
In the majority of organizations that I’ve worked with, the data that we’re using on a day-to-day basis is constantly increasing even though it’s generally not changing. That is to say, the raw data that we’re receiving from our systems do not alter historical data but do add new rows to business activity tables (new orders were created, boxes were shipped, user accounts were created, buttons were clicked, etc.).
When it comes to the data that are being used to operate the business, the most important changes in the data come from new data that arrive and look different from historic data, not from new logic that changes how data were processed in the past.
So a canonical example: you come into the office one Tuesday morning, you look at your company’s sales dashboard, and you notice that there were no sales in California yesterday. “Hmm, that’s strange,” you might say to yourself. And now you need to go investigate why California isn’t showing any sales for yesterday — for the sake of the story, let’s posit that a software engineer changed the state-code for California from CA to Ca, and so all new orders placed in California yesterday have that new short-code.
So, in the situation where we have our data version control system, what do we do? Do we … roll back our dataset to yesterday? We’d still show California as having no sales today (in fact, now every state has zero sales, but at least they’re consistent!). And then what do we do tomorrow? Let’s assume that the software engineers push a fix to the website and new orders start coming in with the correct CA state-code. Well, we still have at least a day’s worth of orders with the “incorrect” short-code that we’re going to have to do something about.
Data version control does nothing to help us in this situation. And that’s okay! But I’d contend that the majority of issues that data practitioners face are of this type, where the source data is arriving in a way that is new / different / surprising and will need to be fixed or addressed in some way.
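What the bad rows actually call for is a fix in the transformation layer (or a one-off backfill), not a rollback. As a minimal sketch, assuming a hypothetical raw_orders table with state_code, order_total, and ordered_at columns, the fix might be as simple as normalizing the code on the way through:

```sql
-- Hypothetical fix in the transformation layer: normalize state codes so
-- the stray 'Ca' rows roll up with the 'CA' rows. Table and column names
-- are illustrative, not taken from any particular system.
create or replace table clean_orders as
select
    order_id,
    upper(state_code) as state_code,  -- maps 'Ca' back to 'CA'
    order_total,
    ordered_at
from raw_orders;
```

And because that fix is just code, it lives in version control like any other change.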
When transformation logic changes
When the data transformation logic changes, it is useful to have version control, but not of the data! The result of a data transformation is a combination of:
- The raw data
- The transformation code
If you have the raw data and you have the transformation code in version control, you can re-create the transformed data from any point in time just by checking out the correct version of the transformation code and applying it to the raw data. Assuming your transformations are deterministic (please, please make sure this is the case), you don’t need to version your data because you can always trivially recreate it.
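To make “deterministic” a little more concrete, here’s a minimal sketch of the kind of transformation this works for; the raw_orders_events table and its columns are assumptions for illustration. It reads only from the raw, append-only source and avoids non-deterministic functions, so checking out any historical version of the file and re-running it against the raw data reproduces that version’s output exactly:

```sql
-- A deterministic transformation: it depends only on the raw, append-only
-- source table and uses no non-deterministic functions (no current_timestamp,
-- no random sampling, etc.), so any committed version of this file can be
-- re-run to reproduce its output. Names are illustrative.
create or replace table daily_sales as
select
    cast(ordered_at as date) as order_date,
    state_code,
    count(*) as orders,
    sum(order_total) as total_sales
from raw_orders_events
where event_type = 'order_placed'
group by 1, 2;
```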
In the use cases that I’ve worked on, a system that can efficiently and easily do the above is much better than a system that actually versions the data itself because, as we noted above, we’re always getting new raw data in. I want to compare two different transformations against the same (ideally updated and fully complete) data, not data from today against data from two days ago. If we’re comparing data from today with data from two days ago, then we’re making a comparison where we’ve changed both the underlying data and the transformation logic, so it’s much harder to reason about.
Changing Source Data
The system I’m describing above only works in situations where the raw source data you’re working with are actually append-only. If you’re writing data transformations against a stateful database that’s managed by another team, then you’re going to have a bad time and I can see how the idea of “version control for data” would be appealing.
However, I’d advocate that the solution to that problem is not to build a complicated git-like data-versioning system, but rather to transition to an append-only style of data logging so that you can always deterministically re-create the state of the database at a given time just by using a timestamp filter (e.g., `select * from table where timestamp < $some_timestamp;`).
The notion of table-log duality gives us a clear roadmap for shifting between immutable events and stateful tables in a way that supports all of the goals of “version control for data” without introducing any new tools or concepts that we need to learn.
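As an illustration of that duality, here’s a sketch of rebuilding a stateful “current status of every order” table from an append-only log as of an arbitrary point in time. The orders_events schema (order_id, status, event_timestamp) is assumed for illustration:

```sql
-- Reconstruct the latest known status of every order as of a chosen
-- timestamp, purely from the append-only event log.
-- Assumed schema: orders_events(order_id, status, event_timestamp).
select order_id, status
from (
    select
        order_id,
        status,
        row_number() over (
            partition by order_id
            order by event_timestamp desc
        ) as recency_rank
    from orders_events
    where event_timestamp < '2019-06-01'  -- the point in time to "travel" to
) ranked
where recency_rank = 1;
```

Change the timestamp and you get a different historical “version” of the table, with no new tooling involved.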
It turns out that Databricks’s Delta Lake product, referenced in Gross’s article, takes exactly that approach — it’s not a new git-like system for data, but rather a familiar log-table system applied to less-structured data.
The Full System
Just to pull all of the pieces together, I believe that you can achieve effectively the same aims of “version control for data” with a straightforward system that I’ve seen implemented (and implemented myself) in many organizations:
- Data warehouse collects data in an append-only event format (i.e., we don’t receive an `orders` table with a `status` column (e.g., `shipped`) that changes over time, but rather we have an `orders_events` table that tracks how an order moves from started to paid to shipped to received).
  - With the `orders_events` table we can of course easily re-create the state of an order at any given time.
- A programmatic data transformation layer in version control that allows us to selectively apply transformations to raw source data
- A toolset that allows data practitioners to easily:
  - Select a timepoint at which to view the data (i.e., exclude all data in the events table after a timestamp)
  - Apply transformations from certain points in the version history to data (nothing more sophisticated than `git checkout hash && tool run transformations`)
  - Easily compare the results of those transformations in whatever way makes the most sense for the task at hand. Often that can be as easy as a `select * from table_v1 EXCEPT select * from table_v2` (sketched more fully below) but of course could include the running of downstream machine-learning models and comparing the fit.
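For that last comparison step, one thing worth remembering is that EXCEPT only looks in one direction, so a full diff of two transformation runs checks both; a sketch, assuming the two versions were materialized as table_v1 and table_v2:

```sql
-- Rows produced by v1 of the transformation but not by v2 ...
(
    select * from table_v1
    except
    select * from table_v2
)
union all
-- ... plus rows produced by v2 but not by v1. An empty result means the
-- two versions of the transformation agree on this data.
(
    select * from table_v2
    except
    select * from table_v1
);
```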
The system outlined here, when designed and implemented correctly, gives the practitioner the ability to achieve a sense of “data version control” in a way that’s more flexible than a git-like system, easier to reason about, and doesn’t require any new software or tools.
Conclusion
I’m broadly sympathetic to the goals of the people working on “git for data” projects. However, I continue to believe that it’s important to keep code separate from data, and that if your data system is deterministic and append-only, then you can achieve all of your goals by using version control for your code and then selectively applying transformations to subsets of the data to re-create the data state at any time. The motto remains:
Keep version control for your code, and keep a log for your data.
Working on a version control system for data instead of solving the problem of not having an immutable log of your source data seems like investing a lot of time and tooling into solving the wrong problem (and is what I assume these systems are effectively doing under the hood anyway).