If you have been trying to harness the power of data science and machine learning but, like many teams, struggling to produce results, there’s a secret you are missing out on. Building models and extracting sophisticated insights requires lots of good data, and the best way to get good data quickly is by using DataOps.
What is DataOps? It’s a way of thinking about how an organization deals with data, paired with a set of tools to automate processes and empower individuals. To make this philosophy real in your organization, you may need a new DataOps Engineer role to design, build, and manage those tools.
DataOps was inspired by DevOps, which brought the power of agile development to operations, meaning infrastructure management and production deployment. DevOps revolutionized software development, and now DataOps is transforming data management.
For larger enterprises with a dedicated data engineering team, DataOps is about breaking down barriers and re-aligning priorities. For smaller startup teams like ours, DataOps enables tackling large and complex data problems that previously were beyond our reach.
There have been some other takes on the principles that make up DataOps, but this is our list:
- Self-Service Data Over Data Requests
- Autonomation Over Manual Processes
- Frequent Small Changes Over Infrequent Large Changes
- Reproducibility at All Levels
- Data Scientists/Analysts and DataOps Engineers Must be on the Same Team
- Value From Data is the Primary Measure of Progress
- Continuous Attention to Technical Excellence and Good Design
- Simplicity As a Design Requirement
- The Best Architectures, Requirements, and Designs Emerge From Self-Organizing Teams
Self-Service Data Over Data Requests
This is one of the key distinctions between data engineering and DataOps.
If it’s someone’s job to write a new SQL query or download data from external systems to handle all data requests, your team is headed in the wrong direction. That’s because the person in this role will quickly become frustrated by the repetitive nature of those requests and overwhelmed by the large number of “high priority” requests. And, the stakeholders who need that data will be frustrated by the amount of time their requests spend in queues. Eventually each new problem will be met with fewer questions and even fewer answers. Bureaucracy and red tape are often stacked on top of this process, which naturally makes it even more painful.
A data platform that allows data scientists and analysts to run their own queries against the source data and build their own reusable data transformations and views is a necessity. Many business intelligence tools can turn queries into reusable data sets. For example, Looker has persistent derived tables. Data warehouses like Snowflake support the creation of derived views. Finally, big data systems like Spark support the creation of ad-hoc data tables sourced from data lakes. Furthermore, a variety of tools exist to aid in data discovery and schema tracking.
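As a minimal sketch of what a reusable, self-service transformation can look like, the example below uses Python’s built-in sqlite3 module as a stand-in for a real warehouse. The orders table, its columns, and the customer_revenue view are hypothetical names chosen for illustration; a production setup would use whatever derived-table or view feature your own platform provides.

```python
import sqlite3

# In-memory database standing in for a warehouse; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 35.0), (2, 10.0)],
)

# An analyst defines a reusable transformation once, as a view...
conn.execute(
    """
    CREATE VIEW customer_revenue AS
    SELECT customer_id, SUM(amount) AS total_revenue, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    """
)

# ...and anyone can query it later without filing a data request.
rows = conn.execute(
    "SELECT customer_id, total_revenue, order_count "
    "FROM customer_revenue ORDER BY customer_id"
).fetchall()
print(rows)
```

The point of the view is that the aggregation logic lives in one named place, so the next person who needs revenue per customer reuses it instead of rewriting the query.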
Letting data users efficiently create and reuse their own transformations of data unlocks a whole new level of capabilities. This takes the value of data democratization and goes beyond it to make everyone both a producer and a consumer of data. DataOps is about empowering and trusting the people who rely on data through the smarter use of modern tools.
Autonomation Over Manual Processes
Autonomation is not a misspelling of automation. It’s a specific method for thoughtfully leveraging the abilities of automation. And, it’s key to creating data processes that are reliable and scalable.
Start with manual processes like SQL queries or API pulls to understand the problem space. Next, automate the repetitive parts and start manually monitoring that automation. Finally, automate the actions taken to correct issues found via monitoring and manually check performance metrics.
My own personal saying on how to do this right is: “One, Two, Automate”. That is, whenever there is a new process, do it manually at least two times first. Then, introduce automation and abstraction after that. When you begin with automation straight away, you risk creating premature abstractions that solve the wrong problems.
Some examples of things to autonomate include:
- Infrastructure using “infrastructure as code” or serverless
- Data availability and latency
- Data schema validation
- Business logic validation
- Data quality checks
- Data governance requirements
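To make a couple of these concrete, here is a rough sketch of an automated schema check combined with one business-logic check, written in plain Python. The expected schema, the column names, and the "no negative amounts" rule are all assumptions made up for this example; in practice you would encode the checks you first ran by hand.

```python
from typing import Any

# Hypothetical expected schema: column name -> required Python type.
EXPECTED_SCHEMA = {"customer_id": int, "amount": float}


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []
    for i, row in enumerate(rows):
        # Schema check: every expected column is present with the expected type.
        for column, expected_type in EXPECTED_SCHEMA.items():
            if column not in row:
                problems.append(f"row {i}: missing column '{column}'")
            elif not isinstance(row[column], expected_type):
                problems.append(f"row {i}: '{column}' is not {expected_type.__name__}")
        # Business-logic check (an assumed rule): order amounts are never negative.
        amount = row.get("amount")
        if isinstance(amount, float) and amount < 0:
            problems.append(f"row {i}: negative amount {amount}")
    return problems


batch = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": -5.0},
    {"customer_id": "3", "amount": 12.5},
]
for problem in validate_rows(batch):
    print(problem)
```

Returning a list of problems rather than raising on the first one keeps the monitoring step separate from the corrective action, which matches the progression of automating the check first and the response later.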
Frequent Small Changes Over Infrequent Large Changes
Change can be scary, but it doesn’t have to be. When a new data source is introduced, a new data store migration is rolled out, or a new use case for data is implemented, it can have unintended consequences and cause failures. Traditionally, the experience of facing those failures has led enterprises to restrict how frequently changes are made and to add extensive manual testing stages. But these lockdowns have a huge opportunity cost: the value that could have been derived by implementing changes quickly.
The DataOps alternative is to make changes in small batches, continually monitor quality through automated tests, and build in “undo” buttons when rolling out change. Small changes are less likely to have complex unintended consequences, so any issues they do cause can be quickly diagnosed. Then, with a pre-built “undo” button (perhaps a version rollback in your infrastructure), you can easily fix the issue and revisit ways to implement the change without the pressure of broken systems. Monitor data systems continuously and automatically to ensure that data remains reliable and available as teams interact with real-world data and usage or push out changes.
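One way to picture the “undo” button is a configuration object that keeps every version it has been given. The sketch below is a toy illustration, not a real deployment tool: the VersionedConfig class, the healthy check, and the batch_size setting are all hypothetical names invented for this example.

```python
class VersionedConfig:
    """Keeps every applied version so that any change can be undone."""

    def __init__(self, initial: dict):
        self._history = [dict(initial)]

    @property
    def current(self) -> dict:
        return self._history[-1]

    def apply(self, change: dict) -> None:
        # Each small change becomes a new version layered on the previous one.
        self._history.append({**self.current, **change})

    def rollback(self) -> None:
        # The "undo" button: drop the latest version, but never the initial one.
        if len(self._history) > 1:
            self._history.pop()


def healthy(config: dict) -> bool:
    # Stand-in for an automated test; the rule itself is an assumption.
    return config["batch_size"] <= 1000


config = VersionedConfig({"batch_size": 500, "source": "orders_v1"})
config.apply({"batch_size": 5000})  # one small change
if not healthy(config.current):
    config.rollback()  # the automated check failed, so undo the change
print(config.current)
```

Because the change touched a single setting, the failed check points straight at the cause, and rolling back restores the last known-good version in one step.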
DataOps empowers you to fearlessly plan and implement changes to your data infrastructure. Once you can quickly implement many small steps, you’ll find that you cover much more distance than you would with slow, large steps.
Reproducibility at All Levels
Any good data analysis, data pipeline stage, or infrastructure setup must be reproducible. That means it can be reused and leveraged to be more scalable. It also means it can be taken into isolation for troubleshooting.
For data analysis, that means you should build out the capability for using snapshots of data and locked dependencies for any code used in that analysis. This can be achieved by leveraging inexpensive object storage such as Amazon S3 and tools such as Anaconda or Docker.
For data pipelines, you can troubleshoot a failing pipeline stage by feeding it the same input data repeatedly. It should also be easy to reproduce the implementation of that stage as it existed at an earlier time, so you can troubleshoot past data transformations.
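As a small sketch of what reproducibility can look like in code, the example below fingerprints a data snapshot with a SHA-256 hash and records it next to the result of a deterministic stage. The function names, the record layout, and the manifest format are assumptions for illustration; locking code dependencies is a separate step that this sketch does not cover.

```python
import hashlib
import json


def snapshot_fingerprint(records: list[dict]) -> str:
    """Hash a canonical JSON encoding so the same data always yields the same ID."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def total_revenue(records: list[dict]) -> float:
    # A deterministic stage: no clock, randomness, or external state.
    return sum(r["amount"] for r in records)


snapshot = [{"customer_id": 1, "amount": 20.0}, {"customer_id": 2, "amount": 10.0}]
manifest = {
    "snapshot_sha256": snapshot_fingerprint(snapshot),
    "result": total_revenue(snapshot),
}

# Re-running against the same snapshot should reproduce the manifest exactly.
rerun = {
    "snapshot_sha256": snapshot_fingerprint(snapshot),
    "result": total_revenue(snapshot),
}
print(manifest == rerun)
```

If a later run produces a different result, comparing fingerprints tells you immediately whether the input data changed or the stage itself did.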
Reproducibility in data and data manipulation is essential for DataOps.
Data Scientists/Analysts and DataOps Engineers Must be on the Same Team
The number one cause of failure in data projects is miscommunication. A data scientist who doesn’t understand how the engineer collected and prepared their data will waste time trying to make inferences out of noise. Data bias may also blind them to the insights they seek. A data engineer who doesn’t understand the use cases of their data will make unusable data schemas and miss crucial data quality issues. Assigning these tasks to different teams with different priorities will result in half-complete initiatives, awaiting re-prioritization from another team.
The skills needed to use and prepare data and data systems are distinct, yet complementary. This means every data team needs at least one DataOps team member and one data scientist or analyst. Similarly, small data teams including members with an appropriate set of diverse skills complete data projects most effectively—especially when they align on their objectives from the beginning.
This seems like common sense, but it is not a common reality. Usually, data scientists must spend large amounts of time struggling with APIs and big data systems. Meanwhile, companies often outsource data engineers and keep them in the dark about the use cases of data.
Value From Data is the Primary Measure of Progress
Deriving business value from data is the main goal of any data project. A team that is implementing a DataOps approach focuses on maximizing that value and getting to it faster.
It is important to keep in mind that the main valuable output of a data project is not documentation, nor is it a database, data table, plot, or model. In reality, the main valuable outputs are the insights and actions that an organization can take based on their data.
So, a proper approach to DataOps always ties the value of a project to the end business value. For example, you can implement an automated data quality check to ensure that data generates meaningful insights about the state of a business. Similarly, you create documentation about a dataset because it has value. You implement processes and procedures around data because they ensure secure, reliable, and high quality data insights. Anything that does not serve the goal of extracting business value from data is ultimately a wasteful endeavor.
Historically, large amounts of effort have gone into data engineering projects, with limited results. Startups and modern enterprises don’t have the luxury of wasting time and effort, and should not confuse the way that things have been done with producing real business value. DataOps, when done right, applies the principles of lean engineering in order to reduce waste by focusing on the goal.
Continuous Attention to Technical Excellence and Good Design
Quality and technical excellence are the keys to any technical endeavor. In DataOps, if a data platform is not available or reliable, or the data unclean and untrusted, it is unusable. Any data pipeline is only as good as its weakest link, so it is vital to ensure that all of the parts of the pipeline are well engineered.
When a team using DataOps builds out high-quality data infrastructure, they reduce the amount of time they spend troubleshooting problems and can focus on confidently building out new capabilities. Teams that instead sacrifice quality to “save costs” inevitably end up paying more for it in the future with troubleshooting and bugfixes.
Simplicity As a Design Requirement
Every well-designed data architecture makes complex data simple to access.
Every badly designed data architecture makes simple data complex to access.
The difference between ordinary data management and great DataOps engineering is the thoughtful application of simple solutions to complex problems. It will mean different things in different contexts, but inevitably, when you draw the architecture block diagram, there will be fewer blocks and lines — and more capabilities.
Simple data systems are easier to understand, troubleshoot, and change. Make sure to always choose the simpler solution.
The Best Architectures, Requirements, and Designs Emerge From Self-Organizing Teams
At the end of the day, DataOps is all about people. A self-motivated team with appropriately diverse skills, clearly tasked with deriving business value from data, will generate the rest. They will pick the data architectures that best meet their needs in the simplest manner. A good team will write, prioritize, and implement the most important data requirements. They also will design the right data products and analyses.
Any top-down efforts to micromanage who does what or to pre-plan large chunks of tasks and stages too far into the future will result in siloing, misalignment, and failure. This principle comes from the Agile Manifesto and recognizes that the best results don’t come from great plans; they come from great teams.
DataOps at Retina
Speaking of teams: I’ve been leveraging technology to manage data for well over a decade, and leading the teams that make it happen. At Retina, we apply a DataOps approach to what we do, using autonomation to empower our data scientists to derive value from data in a self-service manner.
If this sounds like the type of team you’d like to join, we are hiring!