

Building Machine Learning Pipelines: Common Pitfalls
source link: https://neptune.ai/blog/building-ml-pipelines-common-pitfalls

In recent years, rapid advancements in Machine Learning have led many companies and startups to delve into the field without understanding the pitfalls, and some of the most common ones arise when building ML pipelines.
Machine Learning pipelines are complex, and there are several ways they can fail or be misused. Stakeholders involved in ML projects need to understand how pipelines can fail, what the possible pitfalls are, and how to avoid them.
There are several pitfalls you should be aware of when building machine learning pipelines. The most common is the black-box problem, where the pipeline is too complex to understand. This can make it hard to identify what is wrong with a given system or why it isn't working as expected.
To understand other pitfalls, we will take a look at a typical ML pipeline architecture, including the steps involved, and the pitfalls to avoid under the various steps.
General ML pipeline architectures
Machine Learning pipelines help teams organize and automate ML workflows. The pipeline also gives ML (and data) engineers a way to manage data for training, orchestrate training and serving jobs, and manage models in production.
Let’s go over the typical process in an ML pipeline that defines its architecture.

An ML pipeline should focus on the following steps:
- Data ingestion: Collecting the necessary data is the first step in the entire procedure. The data to be used for training is defined by data scientists or people with business expertise working together with the data engineer.
- Data validation and preprocessing: The collected data goes through many transformations. It is frequently formatted, cleaned, labeled, and enriched manually to ensure acceptable data quality for the models. The model’s features are the data values it will use in both training and production.
- Model training: Training is one of the most important parts of the entire procedure. Data scientists fit the model to historical data so it learns patterns it can use to make predictions on unseen data.
- Model analysis and validation: To ensure high predictive accuracy, trained models are validated against test and validation data. Depending on how the results compare, the model may be tuned, modified, or retrained on different data.
- Model deployment: The final stage is to put the ML model into the production environment, where end-users can obtain predictions on real-time data.
- Pipeline orchestration: Pipeline orchestration tools provide a simple, collaborative interface to automate and manage all the pipeline processes and, in some cases, the infrastructure.
- Metadata management: This step tracks metadata like code versions, model versions, hyperparameter values, environments, and evaluation metric results, organizing them in a way that makes them accessible and collaborative inside your company.
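To make these steps concrete, here is a minimal, self-contained sketch of a pipeline that runs ingestion, validation, training, and evaluation in sequence on synthetic data. The function names and structure are illustrative rather than tied to any particular framework, and deployment and orchestration are left out for brevity.

```python
import json
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def ingest():
    # Data ingestion: the data here is synthetic; in practice this reads from your sources.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

def validate(X, y):
    # Data validation: fail fast on obviously bad inputs before training.
    assert not np.isnan(X).any(), "NaNs found in features"
    assert set(np.unique(y)) <= {0, 1}, "unexpected label values"

def train_and_evaluate(X, y):
    # Model training, analysis, and validation on a held-out split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    metrics = {"test_accuracy": float(accuracy_score(y_te, model.predict(X_te)))}
    return model, metrics

if __name__ == "__main__":
    X, y = ingest()
    validate(X, y)
    model, metrics = train_and_evaluate(X, y)
    # Metadata management: persist whatever you need to trace this run later.
    print(json.dumps(metrics))
```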
Several ML pipeline architectures are possible, and many teams rely on managed pipeline services from providers such as Google Cloud and AWS. Azure ML pipelines, for example, aid the creation, management, and optimization of machine learning workflows: each pipeline is an independently deployable ML workflow, is quite simple to use, and comes in several variants, each with a distinct function.
Common pitfalls in the ML pipeline steps
Running ML pipelines, from ingesting data through modeling to operationalizing the models, can be very tedious. Managing the pipelines is also a significant difficulty in the life cycle of an ML application. In this section, you will learn about the common pitfalls you may encounter in each step of building ML pipelines.
Data ingestion step
Dealing with a variety of data sources
Data ingestion is about moving data from many sources into a centralized database, often a data warehouse, where it can be consumed by downstream systems. This may be done in either real-time or batch mode. Data ingestion and data versioning form the central backbone of a data analytics architecture. The most common pitfalls in this step stem from the variety of data formats and types, which may need to be processed differently depending on their nature.
Batch mode
The most prevalent data ingestion model is batching. In batch processing, the ingestion layer collects source data on a defined schedule and moves it to a data warehouse or another database. Batching may be initiated by a timetable, a predefined logical sequence, or certain pre-defined criteria. Because batch processing is often less expensive, it is frequently employed when real-time data ingestion is not required.
Real-time streaming mode
Real-time streaming is a technique for ingesting data from a source to a target in real-time. Streaming has no periodic element, and this implies that data is ingested into the data warehouse as soon as it becomes accessible at the source. There is no waiting period. This requires a system that can constantly monitor the data producer for new data.
Solution
The most important rule here is to keep a consistent data layer throughout the pipeline.
Always focus on maintaining data with a similar format even when fetching from a variety of data sources.
- Reporting and other downstream analytics systems need good data quality and traceable data lineage to work well.
- Keep the data consistent across the different modes (real-time and batch) it may be ingested through; a minimal sketch of one way to do this follows.
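As one illustration, both the batch and the streaming paths can funnel records through a single normalization function, so everything landing in the warehouse shares one schema. This is only a sketch; the field names and sources are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical target schema shared by batch and streaming ingestion.
TARGET_FIELDS = ("event_id", "user_id", "amount", "source", "ingested_at")

def normalize(record: dict, source: str) -> dict:
    """Map a raw record from any source onto the shared schema."""
    return {
        "event_id": str(record.get("id") or record.get("event_id")),
        "user_id": str(record.get("user") or record.get("user_id")),
        "amount": float(record.get("amount", 0.0)),
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

# Batch path: normalize a whole export's worth of rows at once.
batch_rows = [{"id": 1, "user": "a1", "amount": "9.99"}]
normalized_batch = [normalize(r, source="nightly_export") for r in batch_rows]

# Streaming path: normalize each event as it arrives.
stream_event = {"event_id": "e-42", "user_id": "a1", "amount": 3.5}
normalized_event = normalize(stream_event, source="event_stream")

assert set(normalized_event) == set(TARGET_FIELDS)
```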
Data validation and processing
Choosing the wrong architecture
Data validation and data processing are two steps that can be hampered by issues with the pipeline. Because data sources change regularly, the formats and types of data gathered also change over time, so future-proofing a data ingestion system is a significant problem.
Speed can also be an issue in the data ingestion and processing steps. Building a real-time pipeline, for example, is incredibly expensive, so it is critical to assess what speed your business truly requires.
Neglecting data quality monitoring
Before a batch job can run, for example, all of its input data must be ready, which means it must be thoroughly examined. Data quality issues, data errors, and software failures that occur during batch jobs can bring the entire process to a standstill or, worse, cause silent model failures.
Solution
Data quality must be thoroughly monitored to ensure the subsequent steps in the pipeline work with quality data. Even minor data mistakes, such as date typos, can cause a batch process to fail.
That said, keep in mind that models running on the production server make predictions on real-world data, so you also need to monitor how the data distribution shifts over time.
Creating an ML model is not a simple task, and making it perform well in different settings requires high-quality data. Bad data entering the pipeline will not only cause your model to behave incorrectly, it can also be devastating when making crucial business decisions, particularly in mission-critical sectors such as healthcare or self-driving cars.
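Below is a minimal sketch of the kind of checks this implies: simple schema, null-rate, and date-format assertions run on a batch before it enters training or scoring. The column names and thresholds are hypothetical placeholders.

```python
import pandas as pd

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality problems; an empty list means the batch looks fine."""
    problems = []
    expected = {"user_id", "amount", "signup_date"}            # hypothetical schema
    missing = expected - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems
    if df["amount"].isna().mean() > 0.01:
        problems.append("more than 1% of 'amount' values are null")
    # A date typo should surface here rather than deep inside a batch job.
    if pd.to_datetime(df["signup_date"], errors="coerce").isna().any():
        problems.append("unparseable values in 'signup_date'")
    return problems

# Example: the second row has a date typo that would otherwise break a later batch job.
batch = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "amount": [19.99, 4.50],
    "signup_date": ["2021-06-01", "2021-13-07"],
})
print(validate_batch(batch))   # -> ["unparseable values in 'signup_date'"]
```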
Model training
Using unverified and unstructured data during model training
One of the most prevalent mistakes made by machine learning engineers is using unverified and unstructured data. Unverified data may contain problems such as duplication, conflicting records, inaccurate or incomplete labels, and other inconsistencies that can cause anomalies throughout the training process.
Solution
Of course, one way to remedy all these issues is to leverage an experiment tracking tool. That way you can keep track of all your pipeline runs and the multiple versions of the data you train with, and in production you can easily monitor your model versions and data streams with a few clicks. Neptune.ai is an appropriate tool for such contexts.
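As a minimal sketch of what that can look like with a recent version (1.x) of the neptune client library: the project name, API token, dataset URI, and logged fields below are all placeholders you would replace with your own.

```python
import neptune

# Placeholders: point these at your own workspace/project and token.
run = neptune.init_run(project="my-workspace/ml-pipelines", api_token="YOUR_API_TOKEN")

# Record which data version and hyperparameters this pipeline run used.
run["data/version"] = "s3://my-bucket/training/2021-09-01"     # hypothetical dataset URI
run["parameters"] = {"learning_rate": 0.01, "n_estimators": 200}

# Log metrics as training progresses, then close the run.
for epoch, acc in enumerate([0.71, 0.78, 0.82]):
    run["train/accuracy"].append(acc)
run.stop()
```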
Model validation and analysis
Careless preprocessing can introduce train/test leakage during model validation
Model validation assesses a model’s real-world performance before it is deployed. Keep the following points in mind:
- Model application: Our model may be used for mission-critical applications. Is it reliable?
- Model generalizability: We don’t want to achieve fantastic test-set performance just to be disappointed when our model is deployed and performs poorly in the real world.
- Model evaluation: We won’t always know the ground truth for new inputs during deployment. So measuring the model’s performance after deployment may be difficult.
Under this pitfall, there are 2 things to always keep in mind:
- A naive train/test split implicitly assumes that our data consists of iid samples.
- If our data violates this iid assumption, then the test-set performance may mislead us and cause us to overestimate our model’s predictive abilities.
Solutions
- If your data is iid, then you may use a standard train/test split or cross-validation; scikit-learn provides ready-made implementations such as train_test_split and KFold (see the sketch after this list).
- When your data has a sequential structure (like text streams, audio clips, or video clips), you ought to use a cross-validator suited to that situation, such as TimeSeriesSplit.
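Here is a small sketch of both cases with scikit-learn on synthetic data: a shuffled KFold for iid data, and a TimeSeriesSplit that only ever trains on the past and tests on the future, avoiding the leakage a shuffled split would introduce for ordered data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# iid data: shuffled K-fold cross-validation is fine.
iid_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Sequential data: each fold trains only on earlier samples and tests on later ones.
ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))

print("iid CV accuracy:        ", iid_scores.mean().round(3))
print("time-series CV accuracy:", ts_scores.mean().round(3))
```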
Model deployment
Thinking deployment is the final step
A prevalent misconception is that machine learning models automatically correct themselves after deployment and that little needs to be done to the model afterwards. This may be true in areas such as reinforcement learning, but even there, model parameters are updated over time to keep performance optimal.
That is naturally not the case with typical ML models, and many errors can arise in the deployment phase. One common mistake is to neglect monitoring model performance and usage cost in production.
Solution
- To ensure that the model is monitored, we can leverage various model monitoring tools, chosen based on their ease of use, flexibility, monitoring functionality, overhead, and alerting.
Root cause analysis can then be used to determine why a problem occurred and resolve it with an appropriate plan of action.
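As a rough illustration of what monitoring can mean in code, the sketch below logs prediction latency and checks whether a feature seen in production has drifted away from the training distribution. The stand-in model, feature, and thresholds are hypothetical, and a two-sample KS test is only one of many possible drift signals.

```python
import time
import numpy as np
from scipy.stats import ks_2samp

# Stand-in for the training-time distribution of one monitored feature.
TRAIN_REFERENCE = np.random.default_rng(0).normal(50, 10, size=5000)
recent_values, latencies = [], []

def predict(features: np.ndarray) -> float:
    """Stand-in for a real model call; replace with your deployed model."""
    return float(features.mean())

def monitored_predict(features: np.ndarray) -> float:
    start = time.perf_counter()
    prediction = predict(features)
    latencies.append(time.perf_counter() - start)
    recent_values.append(float(features[0]))   # track one feature for drift checks
    return prediction

def health_report() -> dict:
    _, p_value = ks_2samp(TRAIN_REFERENCE, np.array(recent_values))
    return {
        "p95_latency_s": float(np.percentile(latencies, 95)),
        "drift_suspected": bool(p_value < 0.01),
    }

# Simulate some production traffic with a deliberately shifted feature distribution.
for _ in range(500):
    monitored_predict(np.random.default_rng().normal(65, 10, size=4))
print(health_report())
```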
Metadata management
Neglecting pipeline metadata management
As you have learned in this article, working with ML pipelines can get pretty complex quickly. Each step produces metadata that, if not managed, can lead to potential problems such as not being able to trace and debug pipeline failures.
Solution
Use pipeline metadata management tools to track and manage the metadata produced by each step of the pipeline. One of the tools that do this quite well is Neptune.ai. Another tool that is adept at managing pipeline metadata is Kedro (it’s actually possible to easily integrate them both thanks to the Neptune-Kedro plugin).
With Neptune.ai you can track all your ML pipeline experiments and metadata with ease. Neptune can help you avoid problems when dealing with production settings.
Metadata management and experiment tracking
The ML metadata store is an essential component of the MLOps stack for managing model pipeline metadata. Neptune.ai is a centralized metadata store that can be used in any MLOps process.
A typical dashboard compares the metrics and training results from experiments in a pipeline. Experiment tracking extends beyond that, though, and lets you log metadata such as the following (a short logging sketch follows the list):
- Hyperparameters
- Learning curves
- Training code and configuration files
- Predictions (images, tables, etc.)
- Diagnostic charts (confusion matrix, ROC curve, etc.), including interactive charts logged with external libraries such as Plotly
- Console logs
- Hardware logs
- Model binary or location to your model asset
- Dataset versions
- Links to recorded model training runs and experiments
- Model descriptions and notes
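To give a flavor of how a few of the items above are logged, here is a hedged sketch using the neptune client; the project name, file paths, and values are placeholders, and the namespaces ("model", "charts", "notes") are just a convention, not something Neptune enforces.

```python
import pickle
import neptune
import plotly.express as px
from neptune.types import File

run = neptune.init_run(project="my-workspace/ml-pipelines")   # placeholder; token via NEPTUNE_API_TOKEN

# Model binary: upload any artifact saved earlier in the pipeline.
with open("model.pkl", "wb") as f:
    pickle.dump({"illustrative": "model object"}, f)
run["model/binary"].upload("model.pkl")

# Interactive diagnostic chart, e.g. built with Plotly.
fig = px.line(x=[0.0, 0.5, 1.0], y=[0.0, 0.8, 1.0], title="ROC curve (illustrative)")
run["charts/roc_curve"].upload(File.as_html(fig))

# Free-text description and notes for the model.
run["notes"] = "Baseline model for an illustrative churn pipeline."
run.stop()
```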
To learn more, you can check Neptune’s documentation, which is very thorough. There you can also learn how to integrate Neptune.ai with your pipelines.
Concluding thoughts
As you can see, a lot can go wrong when designing a pipeline that handles all the different stages of the ML process, especially in production, where unexpected issues can cause serious trouble and, in some cases, real business damage.
The pitfalls we have discussed here are quite common, and we have listed some solutions to remedy them, along with a glimpse of what can cause things to deviate from the original plan.
Finally, if it fits your workflow, I would strongly recommend Neptune.ai as your pipeline metadata store, regardless of where you or your colleagues run your pipelines: in the cloud, locally, in notebooks, or anywhere else.

Aymane Hachcham
Data Scientist at Spotbills | Machine Learning enthusiast.