How can DataOps help improve efficiency at an AI or machine learning company? At Retina, we follow several DataOps principles in order to empower our small team to quickly and reliably handle multiple client datasets and models. We’ve been asked many times how we do it, so to answer that question, we’ve gathered some of the techniques that work for us and that we hope will help you too.
Addressing Data Needs at Retina
Retina uses machine learning to make predictions and inferences based on our clients’ customer data. As such, the primary internal consumers of data at Retina are data scientists. Our team uses both R and Python, so our interactions with data are usually programmatic and leverage dataframes. We deal with both large and small data sets, and we run models in both automated and ad-hoc form.
We are fortunate to have started out cloud native, without the weight of legacy systems. Retina also has a small and cross-functional team, free of political or data silos.
Instead, our data challenges stem from the fact that our data sets arrive from our clients in diverse forms, in terms of both technology and schema. Our delivery dates are also ambitious, so there’s little time to tackle internal data quality issues.
Building a Data Platform
At Retina, we have chosen the technologies in our stack that best fulfill our own data needs. These choices reflect the needs of our team and involve some technical trade-offs.
We store our data “at rest” in AWS S3/Azure Blob Storage. This greatly decreases our storage costs by leveraging the separation of compute and storage. We can store multiple data set versions and scale up compute resources based on workload, rather than on the amount of data which we are processing. Also, having our primary data storage in cloud-native systems lets us access them directly on team members’ laptops as needed, as well as enforce security and data retention policies.
Retina uses Apache Spark, managed via Databricks, as our primary team workspace. Spark leverages the speed of in-memory and columnar data stores for big data processing. It also lets us tap into a large ecosystem of data connectors to help with data ingest and processing. With Spark, we can easily distribute parallel workloads to operate at scale.
Databricks provides us with a shared multi-language notebook environment where the team can use both R and Python to interact with Spark dataframes. Databricks notebooks can call other notebooks with various parameters, and we leverage this to create data pipelines out of notebooks. This setup allows us to validate, clean, and prepare data in the same environment where we conduct data exploration and modeling.
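As a rough illustration of notebooks calling notebooks with parameters, here is a minimal sketch rather than our actual pipeline code. It assumes each notebook returns the path of the data it wrote, and that the runner has the same shape as Databricks’ `dbutils.notebook.run(path, timeout_seconds, arguments)`. The names `run_pipeline` and `fake_runner`, the notebook paths, and the `client` parameter are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Shape of a notebook runner: (notebook_path, timeout_seconds, arguments) -> result string.
# Inside Databricks, dbutils.notebook.run could be passed in here (assumption).
NotebookRunner = Callable[[str, int, Dict[str, str]], str]


def run_pipeline(
    run_notebook: NotebookRunner,
    stages: List[Tuple[str, Dict[str, str]]],
    input_path: str,
    timeout_seconds: int = 3600,
) -> str:
    """Run notebooks in order, feeding each stage's output path to the next stage."""
    current_path = input_path
    for notebook_path, params in stages:
        # Every stage receives its own parameters plus the path produced so far.
        arguments = dict(params, input_path=current_path)
        current_path = run_notebook(notebook_path, timeout_seconds, arguments)
    return current_path


def fake_runner(path: str, timeout_seconds: int, arguments: Dict[str, str]) -> str:
    """Local stand-in for a notebook call; returns a made-up output path."""
    stage_name = path.rsplit("/", 1)[-1]
    return f"{arguments['input_path']}/{stage_name}"


# Hypothetical two-stage pipeline run locally with the stand-in runner.
stages = [
    ("/production/validate", {"client": "acme"}),
    ("/production/clean", {"client": "acme"}),
]
final = run_pipeline(fake_runner, stages, "/data/raw")
```

Injecting the runner as a parameter keeps the orchestration logic testable on a laptop, without a cluster.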
We separate our notebooks into three types:
- Common libraries, which reduce code repetition and speed up the implementation of new models or new datasets
- “Production” notebooks, which are version-controlled, tested, and kept clean and in reliable working order
- Ad-hoc and research notebooks, which are used for exploration and new model development
This separation is key, both to reducing friction while working and to providing automated, high-quality data outputs. In a typical workflow, we start a new model in an ad-hoc notebook and create a few variations of it. Then, once the variations are understood, we convert the notebook into a “production” notebook with parameterized inputs, more handling of edge cases, and improved testability. If common functions are needed across multiple notebooks, we build those functions into a common custom package that includes automated tests and stricter version control.
Creating Reproducible Environments
At Retina, we use a multi-pronged approach to avoid what we like to call “dependency hell.” Because we are a modern data science company that depends on multiple external packages, we are at increased risk of one of those package maintainers pushing new bugs or breaking backwards compatibility. To avoid wasting time tracking down code that works one day and breaks the next, we leverage different techniques for reproducible and deployable data science.
For Python data science code, we use conda for version-controlled base environments that capture the Python runtime, as well as exact versions of Python dependencies and their sub-dependencies. This provides a common base environment that is shared across the team. Ad-hoc notebooks sometimes install their own new packages within the conda environment, but these packages are eventually brought into the common environment when the notebook is productionized.
For R data science code, we use the MRAN repository snapshots as our base set of packages. This means that we install packages as they appeared on the CRAN repository at specific, date-locked points in time. Because not all R packages are kept up to date on CRAN, we also install directly from GitHub when necessary, using commit hashes to refer to specific versions of those packages.
To further ensure reproducibility — and that our environments can be deployed, along with code, into production — we leverage Docker images. Docker lets us capture and re-use the exact, compiled C code behind many Python and R packages, providing exact file replicas of runtime dependencies, such as the packages maintained by an operating system. These Docker containers can also be extended to include bundles of our own code in order to create deployed machine learning code using the same dependencies as in development.
Allowing for Different Data Formats
Retina uses Apache Parquet as our primary format for “at rest” data. It is well supported by both R and Python, is native to our Spark big data environment, and stores data efficiently. The Parquet format permits us to define and enforce data types for each column, leverages data compression, and uses columnar data stores, which are particularly efficient for the sparse data sets common to machine learning.
When we are dealing with large data sets, Parquet lets us partition our data across Spark workers. Then, when reading a partitioned Parquet file, we can run multiple Spark workers in parallel and load those Parquet partitions in a distributed manner quickly and efficiently.
Parquet does have some limitations that don’t hinder our work at Retina, but are worth noting. For one, Parquet doesn’t support streaming data well. It also isn’t designed for indexes of the sort used in databases for fast selection of single rows of data.
Ensuring Data Connectivity
To connect external environments such as Jupyter Notebooks and RStudio to our Spark clusters, and access Parquet files on S3, we use Databricks Connect. This permits faster data science iteration in the environments where our team is most productive. It’s then a simple transition to translate their code into the cloud-deployed notebooks that make up our data pipelines.
Protecting Data Reproducibility
We are also big proponents of the idea of reproducible data, especially when it comes to our pipelines. That is, we pass the paths to at-rest Parquet files between stages of our data pipelines — and new Parquet files are made each time there is an output from a pipeline stage. This does result in data duplication, but what we get in return is pipeline stages and notebooks that can be re-run on the same inputs.
Most stages of our pipelines are idempotent transformations that write to immutable data stores. Idempotency means that we can re-run a given pipeline stage or notebook without worrying that data will be corrupted during the debugging or development process. Immutable data stores are enforced by generating new paths each time data is output, so that when they are read as input, a single path will always refer to the same data.
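One way to generate a new path for every output is sketched below. This is an illustration under assumptions, not our exact scheme: the helper name `new_output_path` and the `run_<timestamp>_<id>` layout are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def new_output_path(base: str, stage: str) -> str:
    """Return a fresh path for one run of a pipeline stage.

    A UTC timestamp keeps runs sortable, and a random suffix keeps two runs
    started in the same second from writing to the same location.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_id = uuid.uuid4().hex[:8]
    return f"{base.rstrip('/')}/{stage}/run_{stamp}_{run_id}"


# Two runs of the same (hypothetical) stage get two distinct paths.
first = new_output_path("/data", "order_summary")
second = new_output_path("/data", "order_summary")
```

Because a path is never reused, anything that reads from it later sees exactly the data that was written there.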
Retina also has data stores which are idempotent and mutable. These are used when we want to accumulate multiple past runs of data into a single, larger data set. To do this, we generate sub-paths using named hive partitions and ensure that each run writes to a separate partition.
For example, we’d write to: “/data/order_summary/as_of_date=2019-08-01” for data processed on August 1, and adjust that path for each day of data being processed. Then, when reading, we read from “/data/order_summary/” and take advantage of the hive path conventions that convert “as_of_date” into what looks like another column in our dataset.
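The path convention can be sketched with two small helpers. Spark applies this convention itself when it reads a partitioned directory, so the parsing function is only there to show what the convention encodes; both function names are hypothetical.

```python
from datetime import date
from typing import Dict


def partition_path(base: str, as_of: date) -> str:
    """Build the hive-style sub-path that one day's run writes to."""
    return f"{base.rstrip('/')}/as_of_date={as_of.isoformat()}"


def parse_partitions(path: str) -> Dict[str, str]:
    """Recover the key=value pairs that a hive-aware reader turns into columns."""
    partitions: Dict[str, str] = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions


write_path = partition_path("/data/order_summary", date(2019, 8, 1))
columns = parse_partitions(write_path)
```

Re-running the job for the same day targets the same partition, which is what lets a mutable store stay idempotent as long as each run replaces its partition rather than appending to it.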
With these two approaches to data, we can flexibly handle multiple data needs while keeping our systems robust and re-usable.
Automating Quality Checks
To adhere to our DataOps autonomation principle, Retina also leverages automation to ensure data quality. We implement this across many parts of our data pipeline. Data comes in from client and external systems as “raw” data, which is then transformed only enough to get it into Parquet form. Then, we run data validation checks on it to create a set of “validated” data, which our DataOps team assesses for quality. This process results in a data lake where the data is democratized — our data science team can access the data as needed in a clean form.
We adjust and enhance our data quality checks on an ongoing basis, while always striving for a balance between strict standards and flexibility.
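To make the idea of automated validation concrete, here is a minimal sketch that splits rows into validated and rejected sets. The column names (`order_id`, `amount`) and the two rules are hypothetical examples, not our production checks, and real checks would run on Spark dataframes rather than lists of dictionaries.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Check = Tuple[str, Callable[[Row], bool]]

# Each check is a (description, predicate) pair; the predicate returns True when the row passes.
CHECKS: List[Check] = [
    ("order_id is present", lambda r: r.get("order_id") not in (None, "")),
    (
        "amount is a non-negative number",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]


def validate(
    rows: Iterable[Row], checks: List[Check] = CHECKS
) -> Tuple[List[Row], List[Tuple[Row, List[str]]]]:
    """Split rows into those passing every check and those failing at least one."""
    valid: List[Row] = []
    rejected: List[Tuple[Row, List[str]]] = []
    for row in rows:
        failures = [name for name, check in checks if not check(row)]
        if failures:
            # Keep the reasons next to the row so quality issues can be reviewed.
            rejected.append((row, failures))
        else:
            valid.append(row)
    return valid, rejected


raw_rows = [
    {"order_id": "A1", "amount": 10.0},
    {"order_id": "", "amount": -5},
]
validated, rejected = validate(raw_rows)
```

Keeping the rules in a plain list makes it easy to tighten or relax them over time without touching the validation loop.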
Looking to the Future
The approach that Retina is taking to DataOps continues to evolve. As our company’s data needs grow to encompass different types of data and applications, we hope to keep our data platform implementations simple and scalable. As we evolve, however, we will continue to be guided by our DataOps principles — while being open to new advancements in the data ecosystem.