DataOps Principles: How Startups Do Data The Right Way

If you have been trying to harness the power of data science and machine learning — but, like many teams, struggling to produce results — there’s a secret you are missing out on. All of those models and sophisticated insights require lots of good data, and the best way to get good data quickly is by using DataOps.

What is DataOps? It’s a way of thinking about how an organization deals with data. It’s a set of tools to automate processes and empower individuals. And it’s a new DataOps Engineer role designed to make that thinking real by managing and building those tools.

DataOps Principles

DataOps was inspired by DevOps, which brought the power of agile development to operations (infrastructure management and production deployment). DevOps transformed the way that software development is done, and now DataOps is transforming the way that data management is done.

For larger enterprises with a dedicated data engineering team, DataOps is about breaking down barriers and re-aligning priorities. For smaller startup teams like ours, DataOps makes it possible to tackle large and complex data problems that were previously beyond reach.

There have been some other takes on the principles that make up DataOps, but this is our list:

  • Self-Service Data Over Data Requests
  • Autonomation Over Manual Processes
  • Frequent Small Changes Over Infrequent Large Changes
  • Reproducibility at All Levels
  • Data Scientists / Analysts and DataOps Engineers Must be on the Same Team
  • Value From Data is the Primary Measure of Progress
  • Continuous Attention to Technical Excellence and Good Design
  • Simplicity As a Design Requirement
  • The Best Architectures, Requirements, and Designs Emerge From Self-Organizing Teams

Self-Service Data Over Data Requests

This is one of the key distinctions between data engineering and DataOps.

If it’s someone’s job to handle all data requests by writing a new SQL query or by downloading data from external systems, your team is headed in the wrong direction. The person in this role will be frustrated by the repetitive nature of those requests and overwhelmed by the large number of “high priority” requests. And the stakeholders who need that data will be frustrated by the amount of time their requests spend in queues, waiting for someone to get to them. Eventually each new problem will be met with fewer questions and even fewer answers. The fact that bureaucracy and red tape are often stacked on top of this process makes it even more painful.

There needs to be a data platform where data scientists and analysts can run their own queries against the source data, as well as build their own reusable data transformations and views. Many business intelligence tools, such as Looker with its persistent derived tables, can turn queries into reusable data sets. Data warehouses like Snowflake support the creation of derived views. And big data systems like Spark support the creation of ad-hoc data tables sourced from data lakes. Furthermore, a variety of tools exist to aid in data discovery and schema tracking.
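As a rough illustration of what a reusable, self-service transformation can look like, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a real warehouse. The table, view, and column names (orders, customer_totals, and so on) are made up for this example, and a production setup would use your warehouse's own view or derived-table mechanism instead.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical source table with a few sample rows.
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 35.0), (2, 10.0)],
)

# An analyst defines the transformation once, as a named view...
conn.execute(
    """
    CREATE VIEW customer_totals AS
    SELECT customer_id,
           SUM(amount) AS total_spent,
           COUNT(*)    AS order_count
    FROM orders
    GROUP BY customer_id
    """
)

# ...and anyone can reuse it later without filing a data request.
rows = conn.execute(
    "SELECT customer_id, total_spent, order_count "
    "FROM customer_totals ORDER BY customer_id"
).fetchall()
print(rows)
```

The point of the view is that the aggregation logic lives in one shared place, so the next person who needs per-customer totals queries customer_totals rather than rewriting the SQL.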

Letting data users efficiently create and reuse their own transformations of data unlocks a whole new level of capabilities.  This takes the value of data democratization and goes beyond it to make everyone both a producer and a consumer of data. DataOps is about empowering and trusting the people who rely on data through the smarter use of modern tools.

Autonomation Over Manual Processes

Autonomation is not a misspelling of automation. It’s a specific method for thoughtfully leveraging the abilities of automation. And it’s key to creating data processes that are reliable and scalable.

Start with manual processes — like SQL queries or API pulls — to understand the problem space. Next, automate the repetitive parts and start manually monitoring that automation. Finally, automate the actions taken to correct issues found via monitoring and manually check performance metrics.

My own personal saying on how to do this right is: “One, Two, Automate”. That is, whenever there is a new process, do it manually at least twice, and only then introduce automation and abstraction. This avoids creating premature abstractions that solve the wrong problems, without missing out on the power of automation.

Some examples of things to autonomate include:

  • Infrastructure using “infrastructure as code” or serverless
  • Data availability and latency
  • Data schema validation
  • Data quality checks
  • Business logic validation
  • Data governance requirements
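To make a couple of these items concrete, here is a minimal sketch of an automated schema validation and data quality check in plain Python. The expected schema, the column names, and the "amount must not be negative" rule are all assumptions invented for this example; in practice the rules would come from your own data and business logic.

```python
from typing import Any

# Hypothetical expected schema: column name -> required Python type.
EXPECTED_SCHEMA = {"customer_id": int, "amount": float}


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for i, row in enumerate(rows):
        # Schema check: every expected column must be present.
        missing = EXPECTED_SCHEMA.keys() - row.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        # Schema check: every column must have the expected type.
        for col, typ in EXPECTED_SCHEMA.items():
            if not isinstance(row[col], typ):
                problems.append(f"row {i}: {col} is not {typ.__name__}")
        # Quality check (assumed business rule): amounts are never negative.
        if isinstance(row["amount"], float) and row["amount"] < 0:
            problems.append(f"row {i}: amount is negative")
    return problems


sample = [
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": "2", "amount": -1.0},
    {"amount": 3.0},
]
for problem in validate_rows(sample):
    print(problem)
```

In the spirit of autonomation, a check like this would first be run and reviewed by hand, then scheduled, and only later wired up to take corrective action (for example, quarantining a bad batch) on its own.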

Frequent Small Changes Over Infrequent Large Changes

Change can be scary, but it doesn’t have to be. When a new data source is introduced, a new data store migration is rolled out, or a new use case for data is implemented, it can have unintended consequences and cause failures. Traditionally, the experience of facing those failures has led enterprises to restrict how frequently changes are made and to add extensive manual testing stages. But these lockdowns have a huge opportunity cost: the value that could have been derived by implementing changes quickly.

The DataOps alternative is to batch changes in small amounts, to continually monitor quality through automated tests, and to build in “undo” buttons when rolling out change. Small changes are less likely to have complex unintended consequences, so any issues they do cause can be quickly diagnosed. Then, with a pre-built “undo” button (perhaps a version rollback in your infrastructure), you can easily fix the issue and revisit ways to implement the change without the pressure of broken systems. Continuous and automatic monitoring ensures that as data systems interact with real-world data and usage, data remains reliable and available as changes are pushed out.
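The pattern of "small change, automated check, undo button" can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the state is a plain dictionary, and apply_with_rollback, raise_batch_size, and batch_size_is_sane are hypothetical names made up for this example. A real rollout would rely on your deployment or infrastructure tooling for versioning and rollback.

```python
import copy
from typing import Callable


def apply_with_rollback(
    state: dict,
    change: Callable[[dict], dict],
    check: Callable[[dict], bool],
) -> tuple[dict, bool]:
    """Apply one small change and keep it only if the automated check passes."""
    # Work on a deep copy so the untouched original serves as the "undo".
    candidate = change(copy.deepcopy(state))
    if check(candidate):
        return candidate, True
    # Check failed: roll back by returning the previous state unchanged.
    return state, False


def raise_batch_size(cfg: dict) -> dict:
    """Hypothetical small change: increase a pipeline's batch size."""
    cfg["batch_size"] = 500
    return cfg


def batch_size_is_sane(cfg: dict) -> bool:
    """Hypothetical automated test run after every change."""
    return 0 < cfg["batch_size"] <= 1000


config = {"batch_size": 100}
config, applied = apply_with_rollback(config, raise_batch_size, batch_size_is_sane)
print(config, applied)
```

Because each change is small and is checked on its own, a failed check points directly at the change that caused it, and the previous state is always at hand.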

DataOps gives you the ability to fearlessly plan and implement changes to your data infrastructure. And once you can quickly implement many small steps, you’ll find that you cover much more distance than you would with slow, large steps.

Reproducibility at All Levels

Any good data analysis, data pipeline stage, or infrastructure setup must be reproducible. That way, it can be reused and leveraged to improve scale. And it can be taken into isolation for troubleshooting.

For data analysis, that means you should build out the capability for using snapshots of data, and locked dependencies for any code used in that analysis. This can be achieved by leveraging inexpensive object storage such as Amazon S3, and tools such as Anaconda or Docker.

For data pipelines, you can troubleshoot a failing pipeline stage by feeding it the same input data repeatedly. It should also be easy to reproduce the implementation of that stage as it was at an earlier time, so that you can troubleshoot past data transformations.
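One way to check that a pipeline stage really is reproducible is to fingerprint its input snapshot and its output, then replay it. The sketch below assumes the stage is a pure function of its input and that records are JSON-serializable; the stage itself and the field names are hypothetical.

```python
import hashlib
import json


def fingerprint(records: list[dict]) -> str:
    """Stable SHA-256 hash of a data snapshot (dict key order does not matter)."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def stage(records: list[dict]) -> list[dict]:
    """Hypothetical pipeline stage: converts amounts to integer cents."""
    return [
        {"customer_id": r["customer_id"], "amount_cents": round(r["amount"] * 100)}
        for r in records
    ]


# A saved input snapshot (in practice this would be loaded from storage).
snapshot = [
    {"customer_id": 1, "amount": 20.5},
    {"customer_id": 2, "amount": 10.0},
]

# Replaying the stage on the same snapshot should give identical output.
first_run = stage(snapshot)
second_run = stage(snapshot)
print(fingerprint(first_run) == fingerprint(second_run))
```

Storing the input fingerprint alongside each run makes it easy to tell, later on, whether a surprising result came from changed data or from changed code.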

Reproducibility in data and data manipulation is essential for DataOps.

Data Scientists / Analysts and DataOps Engineers Must be on the Same Team

The number one cause of failure in data projects is miscommunication. A data scientist who doesn’t understand how his or her data was collected and prepared will waste time trying to make inferences out of noise, or will be blinded by data bias. A data engineer who doesn’t understand how his or her data is being used will make unusable data schemas and miss crucial data quality issues. Assigning these tasks to different teams with different priorities will result in half-complete initiatives, awaiting re-prioritization from another team.

The skills needed to use data and to prepare data (and data systems) are distinct, yet complementary. Every data team needs at least one DataOps team member and one data scientist or analyst. And data projects should be completed by small data teams composed of members with an appropriate set of diverse skills who are aligned on objectives.

This seems like common sense, but it is not a common reality. Data scientists are often forced to spend large amounts of time struggling with APIs and big data systems. Data engineers are often outsourced and kept in the dark about how data is being used.

Value From Data is the Primary Measure of Progress

Deriving business value from data is the main goal of any data project. A team that is implementing a DataOps approach focuses on maximizing that value and getting to it faster.

It is important to keep in mind that the main valuable output of a data project is not documentation; it is also not databases, data tables, plots, or models. The main valuable output is the insights and actions that an organization can take based on data.

A proper approach to DataOps always ties the value of something being implemented to the end business value. The reason for implementing an automated data quality check is to ensure that data is used to generate meaningful insights about the state of a business. The only reason for creating documentation about a dataset is that the dataset has business value. The reason for processes and procedures around data is that they ensure secure, reliable, and high quality data insights. Anything beyond what is required for business value from data should be considered waste.

Historically, large amounts of effort have gone into data engineering projects, with limited results. Startups and modern enterprises don’t have the luxury of wasting time and effort, and should not mistake the way that things have been done for real business value. DataOps, when done right, applies the principles of lean engineering in order to reduce waste by focusing on the goal.

Continuous Attention to Technical Excellence and Good Design

Quality and technical excellence are the keys to any technical endeavor. In DataOps, if a data platform is not available or reliable, or if its data is unclean and untrusted, then it is unusable. Any data pipeline is only as good as its weakest link, so it is vital to ensure that all of the parts of the pipeline are well engineered.

When a team using DataOps builds out high-quality data infrastructure, it reduces the amount of time spent troubleshooting problems and can focus on confidently building out new capabilities. Teams that instead sacrifice quality to “save costs” inevitably end up paying more in the future in troubleshooting and bug fixes.

Simplicity As a Design Requirement

Every well-designed data architecture makes complex data simple to access.

Every badly designed data architecture makes simple data complex to access.

The difference between ordinary data management and great DataOps engineering is the thoughtful application of simple solutions to complex problems. It will mean different things in different contexts, but inevitably, when you draw the architecture block diagram, there will be fewer blocks and lines — and more capabilities.

Simple data systems are easier to understand, troubleshoot, and change. Make sure to always choose the simpler solution.

The Best Architectures, Requirements, and Designs Emerge From Self-Organizing Teams

At the end of the day, DataOps is all about people. A self-motivated team with appropriately diverse skills, clearly tasked with deriving business value from data, will generate the rest. They will pick the data architectures that best meet their needs in the simplest manner. They will write, prioritize, and implement the most important data requirements. And they will design the right data products and analyses.

Any top-down efforts to micromanage who does what or to pre-plan large chunks of tasks and stages too far into the future will result in siloing, misalignment, and failure.  This principle comes from the Agile Manifesto and recognizes that the best results don’t come from great plans; they come from great teams.

DataOps at Retina

Speaking of teams: I’ve been leveraging technology to manage data for well over a decade, and leading the teams that make it happen. At Retina, we apply a DataOps approach to what we do, using autonomation to empower our data scientists to derive value from data in a self-service manner.

If this sounds like the type of team you’d like to join, we are hiring!