Designing a Data Processing Pipeline
During your ML system design interview round, the interviewer expects you to discuss how to achieve and maintain a clean dataset before moving to model selection. To cover data pipelines successfully in an interview, you want to demonstrate your ability to:
- Reasonably scope the signals the pipeline will process (too small a set can hold back model quality, while too large a set is impractical to build and maintain).
- Anticipate realistic flaws in the data and design practical mitigations for them.
- Get a v1 out quickly while keeping a path to production-grade scale.
- Build in a future-proof way that’s maintainable and extensible.
- Probe your pipeline for failures/potential improvements once it’s running.
Designing a data pipeline is one of many steps involved in the ML system design interview. Check out How to Answer ML System Design Interview Questions to learn a framework that guides how and when to discuss data processing pipelines.
In this lesson, we describe the main steps in designing a data processing pipeline and real-world examples of different processing methods. We also list strategies to help you develop an efficient data pipeline in your interview. Before we dive in, check out the graphic below to understand the different components of a standard data pipeline.

In the graphic above, "Canonical Data Store" refers to general data stores that can be implemented in various ways. Usually it's best to start with a simple method, like a set of record files, and then move to a more robust method, like a database, as the system matures.
Why data quality matters
In the modern era of machine learning, high-quality, large-scale datasets are just as important as, if not more important than, smart algorithms.
Nowadays, state-of-the-art models such as GPT-4, Stable Diffusion, and DALL-E 2 are primarily self-supervised on massive datasets scraped from the Internet. While these datasets have enabled impressive performance across a variety of text and image generation tasks, they are not without their problems:
- Data poisoning can allow adversaries to change model behavior.
- Datasets containing private or copyrighted data can leak sensitive information.
- Low-quality datasets can cause models to generate toxic text or buggy code.
Other stages of training, such as fine-tuning or RL tuning, often use manually or programmatically annotated data, where the labels target a specific task and the data is generally cleaner and less noisy than pre-training datasets.
Given how important data quality is for an ML system, we’ve created a step-by-step guide on designing a robust data pipeline that you can use during your interview.
Data collection
Data can be numeric, textual, image-based, multimodal, etc.
Even if your model only needs to target one particular task, mixing in data from other tasks can be helpful because the model can transfer learned skills across tasks. For example, BERT was pre-trained on two language tasks (masked language modeling and next-sentence prediction) and then fine-tuned on eleven downstream NLP tasks.
However, if you're low on compute or annotation resources, build on a pre-trained model. In this case, you only need a small set of specialized data to tune the model. For example, many vision models start from a ResNet-50 encoder (or the whole network) and then fine-tune on task-specific data.
Data pre-processing
Most data will need to be pre-processed and cleaned using methods like:
- Data engineering-based pre-processing (e.g. stripping punctuation, joining tables, normalizing numbers)
- Tokenization
- Normalization
- Encoding categorical features in numerical form
- Removing low-quality data
It may also be necessary to impute missing values or augment your data through synthetic means.
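To make a few of these steps concrete, here's a minimal sketch in Python using pandas and scikit-learn. The input file and column names are hypothetical placeholders:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("raw_data.csv")  # hypothetical input file

# Basic text cleanup: lowercase and strip punctuation.
df["review_text"] = (
    df["review_text"].str.lower().str.replace(r"[^\w\s]", "", regex=True)
)

# Impute missing numeric values with the column median, then standardize.
df[["age"]] = SimpleImputer(strategy="median").fit_transform(df[["age"]])
df[["age"]] = StandardScaler().fit_transform(df[["age"]])

# Encode a categorical feature in numerical (one-hot) form.
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)
```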
The appropriate choice of data pre-processing is highly dependent on the task. In general, the data should:
- Contain correct information relevant to your task, with minimal incorrect or irrelevant information.
- Be represented in a convenient way (e.g. choosing normalized over unnormalized values, since they’re more convenient for training models).
A general framework for designing a data pre-processing pipeline includes:
- Unification. Do you have multiple sources of data? If so, unify them (e.g. via join operations on tables) using a defined strategy (e.g. deciding which field to key on).
- Filtering. Are there some irrelevant or even counterproductive fields in the data that you’ll need to filter out?
- Restructuring, or feature engineering. Are there ways you can transform the relevant information so it’s more useful for learning?
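Here's a minimal sketch of this unification / filtering / restructuring flow in pandas; the sources and column names are hypothetical:

```python
import pandas as pd

# Unification: join two hypothetical sources on a shared user ID.
users = pd.read_csv("users.csv")
events = pd.read_csv("events.csv")
df = events.merge(users, on="user_id", how="inner")

# Filtering: drop fields irrelevant to the task and obviously bad rows.
df = df.drop(columns=["internal_debug_id"])
df = df[df["event_duration_s"] > 0]

# Restructuring / feature engineering: derive a more useful signal.
df["events_per_day"] = df["lifetime_events"] / df["account_age_days"]
```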
Important considerations for data pre-processing
- Privacy. Privacy concerns are particularly relevant for big companies. Companies behind major consumer applications (e.g. Google, Facebook, Airbnb, Uber) hold a lot of user information. Companies behind serious enterprise products (e.g. Microsoft, Salesforce, Stripe) don't always have as much information, but what they have is usually very sensitive. Personalization tasks are particularly sensitive here because of the many legal restrictions involved. Many big companies have whole organizations dedicated to maintaining privacy standards. If privacy matters for the kind of data you're using, consider:
- Removing identifying information.
- Applying filtering or pre-processing techniques that induce k-anonymity for sufficiently large k (a sketch of a simple k-anonymity check follows below).
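A dataset is k-anonymous when every combination of quasi-identifier values (e.g. ZIP prefix, age bucket) is shared by at least k rows. A minimal sketch of auditing this in pandas, with hypothetical column names:

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the smallest group size over the quasi-identifier columns.

    The dataset is k-anonymous for any k up to this value.
    """
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical usage: keep generalizing columns (e.g. bucketing ZIP codes)
# until k_anonymity(df, ["zip_prefix", "age_bucket"]) >= 10.
```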
- Data contamination. Train-test data leakage can occur in a variety of ways, and it isn't always obvious or the result of exact duplicates. For example, if segments of your data are generated by the same process (e.g. the same spammer creates multiple spam emails in the dataset for your spam classification system), it's important to ensure those segments end up in the same split of your data. Many modern models, such as GPT-4, suffer from this problem and are therefore difficult to evaluate. To prevent data contamination, consider:
- Similarity analysis of datapoints. For example, say you have some representation of the elements in your domain (e.g. image embeddings for an image domain). For each validation datapoint, you can find the training datapoints whose embeddings are most similar to the embedding of the validation datapoint (under a metric like cosine similarity). Then, you can manually inspect the most similar datapoints to evaluate whether there's contamination. To mitigate the contamination, remove the training datapoints that are most similar to the validation datapoints (a sketch follows this list).
- Data-lineage based analysis of datapoints. Keep detailed metadata about how datapoints were generated, and then ensure training and validation splits have sufficient differences.
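Here's a minimal sketch of the similarity analysis above, assuming you already have embedding matrices for the training and validation splits. The 0.95 threshold is a hypothetical starting point to tune by manual inspection:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def flag_possible_contamination(train_emb, val_emb, threshold=0.95):
    """Return (val_idx, train_idx) pairs whose embeddings are suspiciously similar.

    train_emb: (n_train, d) array; val_emb: (n_val, d) array, from any
    encoder appropriate to your domain (hypothetical inputs).
    """
    sims = cosine_similarity(val_emb, train_emb)   # (n_val, n_train)
    nearest = sims.argmax(axis=1)                  # closest train point per val point
    flagged = np.where(sims.max(axis=1) >= threshold)[0]
    # Manually inspect these pairs; remove confirmed near-duplicates
    # from the training split.
    return [(int(v), int(nearest[v])) for v in flagged]
```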
- Data scarcity. Developing effective models without much data is especially desirable when human-labeled data is difficult or expensive to obtain. Data scarcity occurs most frequently in complex-task and long-tail problems; many long-tail domains, such as self-driving, have to make data-scarce situations work. To mitigate scarce data, consider the following techniques:
- Generative adversarial networks (GANs): a kind of generative model that produces synthetic datapoints. GANs are trained on samples from some domain (e.g. images of zebras) and then generate new samples that approximate that domain.
- Synthetic data: other techniques for generating synthetic data include diffusion models, simple augmentations, and simulation.
- Targeted data collection: selecting high value-add datapoints for labeling is an excellent way to mitigate data scarcity. The Pareto principle often applies to training datasets: the 20% of datapoints that are most useful provide 80% of the value. If you can target that 20%, you can maximize the value from a limited labeling budget (see the uncertainty-sampling sketch below).
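One common way to target high-value datapoints is uncertainty sampling: label the examples the current model is least confident about. A minimal sketch, assuming a scikit-learn-style classifier with a predict_proba method:

```python
import numpy as np

def select_for_labeling(model, unlabeled_X, budget=100):
    """Return indices of the `budget` least-confident unlabeled points."""
    probs = model.predict_proba(unlabeled_X)   # (n, n_classes)
    confidence = probs.max(axis=1)             # top-class probability per point
    return np.argsort(confidence)[:budget]     # lowest confidence first
```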
Data labeling
Many self-supervised models depend solely on massive corpora scraped from the internet, but other models may require labeled data. In fact, almost all production models will use at least some human-labeled data. For example, Tesla has a large human-labeled data organization, and their production models leverage this data.
Data labels can be collected in a variety of ways, such as:
- Human annotation: a human follows some instructions to create some labels (e.g. a human reads instructions about what kind of motorcycle is of interest, and then draws polygons on all such motorcycles in an image).
- Programmatic labeling: a human defines a process for generating labels, and then runs that process on machines. This process could be represented in code, as a software system, or in some other form (e.g. SnorkelAI's custom rules; a sketch follows this list).
- Synthetic data generation: a pre-existing system generates domain samples and labels for those samples (e.g. Applied Intuition's driving simulation platform, which yields simulated sensor data and corresponding labels).
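As an illustration, here's a minimal Snorkel-style labeling function for a spam classifier: a heuristic that votes SPAM, HAM, or abstains. The rule and its keywords are hypothetical:

```python
import re

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_prize_language(email_text: str) -> int:
    """Vote SPAM if the email uses common prize-scam language, else abstain."""
    if re.search(r"\b(free|winner|prize|claim now)\b", email_text, re.I):
        return SPAM
    return ABSTAIN

# Many such functions can be combined (e.g. by majority vote or a learned
# label model) to produce noisy programmatic labels at scale.
```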
Although more automated techniques offer the advantage of higher data quantities, label quality is usually lower than with human labeling.
The graphic below illustrates a standard flow for label generation, both human-labeled and auto-labeled. Auto-labeling refers to all non-human-labeling methods.

When designing a data pipeline, consider these factors to determine how you’ll label the data:
- What’s the cost of getting human labels for the full task you care about?
- Are there partial versions of the task that would also be useful, for which human-labeling isn’t as expensive? (e.g. if the full task is segmentation masks, a partial task might be bounding boxes).
- What’s the quality of the best programmatic labeling approach you currently have? You can evaluate this by computing metrics for your programmatic labels against human labels (see the sketch after this list).
- If there’s a simulation option available, how realistic does it seem? Even though it’s impossible to get a full answer to this, it’s helpful to consider.
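For the programmatic-label evaluation above, a minimal sketch: treat a small human-audited subset as gold and score the programmatic labeler against it. The label arrays here are hypothetical:

```python
from sklearn.metrics import classification_report

# Gold labels from human auditors vs. programmatic labels on the same rows.
human_labels        = [1, 0, 1, 1, 0, 0, 1, 0]
programmatic_labels = [1, 0, 0, 1, 0, 1, 1, 0]

# Per-class precision/recall/F1 of the programmatic labeler.
print(classification_report(human_labels, programmatic_labels))
```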
Feature engineering
Feature engineering is when a human designs a process for deriving new signals from base signals.
For example, let’s say a raw dataset has one row per person, and the columns include “self-reported life satisfaction score out of 10” and “country”. We could normalize the satisfaction scores on a per-country basis: S_new = (S_old - mean_country) / std_dev_country.
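In pandas, this per-country normalization is a short groupby-transform; the column names below are hypothetical stand-ins for the example:

```python
import pandas as pd

# One row per person, as in the example above.
df = pd.DataFrame({
    "country": ["US", "US", "JP", "JP"],
    "satisfaction": [7, 9, 5, 6],
})

# Per-country z-score: S_new = (S_old - mean_country) / std_dev_country
df["satisfaction_norm"] = (
    df.groupby("country")["satisfaction"]
      .transform(lambda s: (s - s.mean()) / s.std())
)
```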
There are many useful patterns in feature engineering, and they’re often informed by domain expertise. Some generally relevant patterns include:
- Normalize to zero mean and unit variance, like the above example. Another common case is transforming an image dataset so that each pixel location has mean 0 and variance 1 across the whole dataset.
- Take the log of high-range numbers. For example, if a feature varies from 0 to very high values (e.g. wealth), take its log to compress the range.
There are many more ways to do feature engineering. Research in whole fields, like computer vision, used to be dominated by this kind of activity. Now, neural network-based modeling essentially automates feature engineering and outperforms manual feature engineering when there is enough data.
There are recurring feature engineering patterns among certain domains. Some domains and their common features include:
- Audio data: the raw data is a waveform, often converted to spectrogram form. Physics-based features can be useful, like root-mean-square energy or the amplitude envelope.
- Image data: the raw data is usually RGB values. One common set of features is the scale-invariant feature transform (SIFT) set.
- Natural language data: sometimes it’s useful to run lightweight classifiers, like sentiment classifiers or part-of-speech taggers, to get features.
- Tabular data: feature engineering here will depend on the meaning of the data. It can be useful to normalize the values per column or use other core transforms.
When to use feature engineering
Although many modern models do well with raw dumps of unstructured data as inputs (e.g. GPT-4 or Stable Diffusion models), sometimes it’s still advantageous to manually engineer features.
If you’re working with a small dataset, have strong inductive biases the model must learn, or want a more interpretable model, consider feature engineering. For example, let’s say you have a dataset of 256 zoo animal images, and you want a lightweight classifier that predicts which zoo animal is in each image. You could use SIFT to extract keypoints, build a bag-of-words feature set from the colors at those keypoints, and then train a simple classifier (e.g. decision tree or logistic regression) on it. A sketch of a close variant follows.
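Here's a minimal sketch of this approach, swapping the color bag-of-words for the more standard bag-of-visual-words over SIFT descriptors. It assumes opencv-python >= 4.4 and scikit-learn; the paths and labels are hypothetical:

```python
import cv2                      # opencv-python >= 4.4 includes SIFT_create
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def sift_descriptors(image_paths):
    """Extract 128-dim SIFT descriptors for each image."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        per_image.append(desc if desc is not None else np.empty((0, 128)))
    return per_image

def bag_of_words_features(per_image_desc, n_words=64):
    """Cluster descriptors into a visual vocabulary, then represent each
    image as a normalized histogram of visual words."""
    vocab = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(per_image_desc))
    feats = []
    for desc in per_image_desc:
        words = vocab.predict(desc) if len(desc) else np.array([], dtype=int)
        hist = np.bincount(words, minlength=n_words).astype(float)
        feats.append(hist / max(hist.sum(), 1.0))
    return np.array(feats)

# Hypothetical usage on the 256-image zoo dataset:
# feats = bag_of_words_features(sift_descriptors(paths))
# clf = LogisticRegression(max_iter=1000).fit(feats, labels)
```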
Generally, the kind of dataset that will not benefit from feature engineering is a large dataset with properties that are not obvious to a human. For example, imagine you have a large dataset of images from the internet, and you’re trying to detect which images are AI-generated. In this situation, it’s best to rely on automated methods for feature extraction, like contrastive learning. The differences between AI-generated and natural images are nearly imperceptible to a human, and the dataset is likely too large to inspect manually.
Pipeline maintenance
At a high level, pipeline maintenance should consider the following:
- Infrastructure. This refers to the systems that support your pipeline (e.g. compute, storage, memory).
- Tool maintenance. Generally, pipelines are orchestrated by third-party tools (e.g. Prefect, Luigi, and Airflow). Airflow is the most common pipeline orchestration tool: it lets users define and deploy their data transformation pipelines while Airflow handles the operational heavy lifting (a minimal DAG sketch follows this list).
- Transformation logic. The purpose of a pipeline is to read raw data from a storage layer and then apply transformations on it (in real-time or in batches, depending on the use case). Depending on the volume, velocity, and variety of data you work with, it’s essential to build in optimizations and efficiencies in your transformation logic so everything can work smoothly.
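Here's a minimal sketch of what such an Airflow pipeline might look like (Airflow 2.4+ syntax; the task bodies are hypothetical placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():    ...   # read raw data from the storage layer
def transform():  ...   # apply cleaning / feature transformations
def load():       ...   # write processed data to the canonical data store

with DAG(
    dag_id="data_processing_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Wire the stages into a linear dependency chain.
    t_extract >> t_transform >> t_load
```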
Once these three areas are addressed, pipelines also require observability and data quality tooling. These tools ensure that pipelines perform as intended and proactively catch errors in production.
Now that you’ve built a data pipeline that produces clean data, it’s time to select the right ML model for your system.