In the past few years, much has been written about problems like discoverability, observability, data quality, and the need for data teams to become more “engineering oriented” in their mindset. Movements like analytics engineering and open source tooling like dbt, Dagster, and Great Expectations have done an amazing job arming data practitioners with the tools that they need to start adopting the best practices of software engineering like modularity, testing, and release management. This shift in mindset has resulted in very real and very exciting progress in data as a discipline over the past 3-5 years, and will likely be remembered the way React is remembered for changing the frontend (Laurie Voss does a good job articulating this). It is clear that much has improved in data land, yet many of the core problems outlined at the start of this post remain quite painful. Why is this?
It’s About The Who, Not The How
I believe that the answer to this lies in the way that scaled companies organize their data teams. While much has been said about how data teams can borrow low-level process/tooling ideas from software teams, much less has been said about what data teams can learn from the organizational principles that allow these processes to flourish. Software engineers have built some truly incredible tooling for making the process of developing software extraordinarily efficient and fault tolerant over the years. This is widely talked about, but only tells part of the story. What is not as widely talked about is the incredible organizational systems that software engineers have developed to facilitate the same efficiency. A lot of this organizational efficiency is attributed to buzzy development methodologies like Agile, but I think they actually have very little to do with what makes software teams work at scale. We have been developing large scale software since well before Agile was popularized. We were also doing it well before the fancy tooling that we all have the benefit of using today. While these two things contribute to overall development velocity, I believe the core driver of developer productivity is something much more mundane and far less original: organizational structure. To be specific, I believe software development teams have a few key characteristics that make them hyper efficient, even at scale:
- Specialization — The process of developing software is broken into concrete “roles” that are relatively specialized. There is some overlap, but it is well understood where the responsibilities of one role end and those of another begin. Frontend and backend are the most obvious manifestation.
- Modularization — The problem is broken into clear, self contained chunks. The chunks are often designed to be extensible or “lego-able”, which allows for flexibility in solving not just what you need to solve today, but what you might need to solve tomorrow. It is known that chunks should not repeat major work from other chunks, and all widely applicable code should be separated out and placed somewhere shared. This is enforced at the code level by creation of services, APIs, etc. This is enforced at the organizational level via the creation of teams mapped to these services, and specialists work on the modules corresponding to their skillset. This means the question of “Who owns what” is usually at least relatively clear in scaled software orgs.
- Clarity — Interactions between modules/teams have very clear contracts at assorted “pass off points.” APIs are the contract that governs this boundary in code; team organization is the contract that governs it organizationally. Upstream teams are expected to understand who depends on their code and communicate with those stakeholders clearly and regularly. Breaking changes to these contracts are made only with fair warning and are not treated lightly.
- Buy In — There are certain cultural expectations around building software that are accepted even by non software roles. It is widely understood that shipping hacky software is quicker up front and slower in the long term, and the entire organization agrees that technical debt is (for the most part) bad.
Combined, these principles create clear ownership within software engineering teams, allowing them to ship secure, stable, and flexible software at breakneck speed. That said, it’s worth mentioning that these principles are not unique to software. They are the logical end state for the efficient operationalization of any complex system involving interactions between groups of human beings at scale and the core principles of effective, scaled industrialization. The same “assembly line” style of thinking that allowed Henry Ford to crank out Model Ts is what allows a software team to crank out releases.
Where Is Data Today?
Clearly, the aforementioned systems have worked quite well for software teams over the years, allowing them to reach massive scale and even collaborate on code across companies. But what about in data?
Let’s look at where each principle stands today:
- Specialization — This is probably the furthest along of the four principles, though there is still work to be done. The data world has crafted specialized roles, and this is a great start. The issue is that there is still a tremendous amount of overlap between these roles. It’s pretty clear what backend owns and what front end owns in software. It’s a little more ambiguous with data engineering and analytics engineering.
- Modularity — There is also some exciting movement here, and I think things like dbt and Dagster have gone a long way in solving this technically. It’s just a matter of pairing these tools with the right organizational design. Data teams today have a lot of repeated work / code across sub-teams, which creates an enormous headache with things like mismatched values for the same abstract metrics (e.g. revenue). It is also often the case that numerous sub-teams are touching / altering the same underlying asset or code, which makes bugs more likely. In an ideal world, no two sub-teams should be touching the same code or re-writing the same functionality. Code should have clear ownership, and shared functionality should be put into shared libraries that themselves have a clear owner. Don’t repeat yourself!
It’s worth noting that in many ways this is much more challenging in data than it is in software, so it requires a lot of thoughtful attention. Software architecture is (in general) a problem for the software team alone, and thus can be decided on within the confines of that team. Data/information architecture is an entire company problem. This particular post will be largely focused on interactions within the data team, but solving the “entire company” problem is the real holy grail. It is an even more complex organizational challenge, and deserves a post of its own. Solving this problem is Modern Data Act 3.
- Clarity — Data observability tooling, particularly tools that allow testing/contracts at the code level, has kicked us in the right direction here. There has also been great conversation around creating data products recently, but it’s mostly been focused on the data producer to data consumer angle. As data teams scale, this ethos should apply to data sub-team to data sub-team deliverables just as much as it does to data team to data consumer deliverables. I feel very strongly that data contracts are the key to this, but they need to be organizationally enforced and managed the way traditional APIs are: clear ownership, an expectation that they cannot break without good reason and fair warning, and semantic versioning.
- Buy In — Most organizations have been convinced of the value of data in decision making (that is why you are constantly being pinged to do a “quick data pull”). The next step is to convince them of the long term value of scalable, efficient data systems. Today, data consumers often push data teams to do hacky things to release quickly because they need the data “right now”. The same folks would never demand that the engineering team release a feature tomorrow. Convincing data consumers of the long term efficiency benefits of “going slow to go fast” is key to the continued success of the data function. This is a complicated meta-org challenge.
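To make the “managed the way traditional APIs are” idea a bit more concrete, here is a minimal sketch of a data contract that carries an owner and a semantic version, and that refuses a breaking change unless the major version is bumped. Everything in it (the `DataContract` class, the `is_breaking` and `validate_release` helpers, the field names) is hypothetical and invented for illustration; it is not the API of dbt, Dagster, Great Expectations, or any other tool, and the definition of “breaking” here is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract for one dataset handed between data sub-teams."""
    dataset: str
    owner: str                     # team accountable for this dataset
    version: Tuple[int, int, int]  # semantic version: (major, minor, patch)
    columns: Dict[str, str]        # column name -> type name


def is_breaking(old: DataContract, new: DataContract) -> bool:
    """Assumption: a change is breaking if an existing column is removed or retyped."""
    return any(new.columns.get(name) != dtype for name, dtype in old.columns.items())


def validate_release(old: DataContract, new: DataContract) -> None:
    """Reject a release whose version bump does not match the kind of change."""
    if new.version <= old.version:
        raise ValueError("version must increase")
    if is_breaking(old, new) and new.version[0] == old.version[0]:
        raise ValueError("breaking change requires a major version bump")
```

Under these assumptions, adding a column and bumping the minor version passes `validate_release`, while dropping a column without a major bump raises an error. The tooling half is small; the organizational half is agreeing that the `owner` field means something and that a major bump comes with fair warning to downstream teams.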
We are definitely making progress, but there is still a good bit left to be done. The larger the organization, the more critical these principles become, and as data teams continue to grow, so too will the pain of not having them. The solution is likely one part tooling and three parts people, but the two are related. The best B2B products and developer tools are usually just organizational and operational best practices eternalized in code, and I believe that the most effective and successful tools will be those targeted at the facilitation of the above principles. But this is the easy part. Organizing data is a hard problem, but organizing people is an even harder one. And until we solve it, I don’t think we will live in the harmonious data world that we all desire.
I think the data world is at a tipping point, and that the time to start building towards this world is now. The lack of these organizational principles has not posed as much of a problem for data teams historically because A) they have been relatively small teams and B) the data landscape was comparatively simple. Today, neither A nor B is true, and the growth in both the size of data teams and the complexity of data as a discipline has made the lack of these principles apparent.
When viewed through an organizational lens, it becomes clear that the concepts of data observability and the existing tooling around it are papier-mâché solutions to a much larger problem. They provide a starting point to understand the landscape, and will work well for smaller data teams where one group is responsible for the entire pipeline. But without solving the ownership problem, the question of “who is responsible for this” is still a shrug at scale. A stack trace is only useful if you know who to talk to about which part.
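As a rough illustration of the “who do I talk to about which part” problem, the sketch below walks upstream from a broken asset and pairs each asset with its owning team, flagging anything nobody owns. The lineage graph, asset names, team names, and the `owners_upstream` helper are all made up for the example; real lineage would come from whatever orchestration or catalog tooling a team already uses.

```python
from typing import Dict, List, Tuple

# Hypothetical lineage: each asset maps to the assets it reads from.
UPSTREAM: Dict[str, List[str]] = {
    "revenue_dashboard": ["fct_revenue"],
    "fct_revenue": ["stg_orders", "stg_payments"],
    "stg_orders": [],
    "stg_payments": [],
}

# Hypothetical ownership map; "stg_payments" is deliberately missing.
OWNERS: Dict[str, str] = {
    "revenue_dashboard": "analytics",
    "fct_revenue": "analytics-engineering",
    "stg_orders": "data-platform",
}


def owners_upstream(
    asset: str, upstream: Dict[str, List[str]], owners: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (asset, owner) for the asset and everything upstream of it."""
    seen = set()
    stack = [asset]
    trace: List[Tuple[str, str]] = []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # Assets without a registered owner are surfaced, not silently skipped.
        trace.append((node, owners.get(node, "UNOWNED")))
        stack.extend(upstream.get(node, []))
    return trace
```

Calling `owners_upstream("revenue_dashboard", UPSTREAM, OWNERS)` returns all four assets, with `stg_payments` marked `UNOWNED`. The traversal is trivial; the hard part is the organizational work of making sure the ownership map is complete and that the listed teams actually accept the responsibility.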
I believe that there needs to be a “Modern Data Act 2” to the “engineering-ization” of data. Phase one was to draw inspiration from the toolchain of software engineers, adding “on the ground” best practices like testing, version control, and a separation between staging and production. But just as you cannot out-train a bad diet, you can’t out-tool poor organization, and though Data Ops is a critical piece of the future-of-data puzzle, we cannot get to where we want to be on the back of tooling alone. Solving these organizational problems needs to be Act 2 of the modern data stack.
This has, in many ways, already begun. Taylor and Emilie’s post and talk about thinking of data as a product is one of the most exciting movements in this direction, and I believe it is the right framework for coming at this. We know that we need to think about building data products the way that we build software products. We also need to think about building data organizations the way we build software organizations.
We now have the tools we need to build the data world we want to live in. Now we just need a little organization. It’s time to bring the assembly line to data.
Huge thanks to Emilie Schario, Joe Reis, Chad Sanderson, Allegra Holland, Michael Kaminsky, Caitlin Moorman, and Sarah Krasnik for the brainstorming, editing, feedback, and overall support on this article.