5 Tools That Will Help You Setup Production ML Model Testing

Source: https://neptune.ai/blog/tools-ml-model-testing

Developing a machine learning or deep learning model seems like a relatively straightforward task. It usually involves research, collecting and preprocessing the data, extracting features, building and training the model, evaluation, and inference. Most of the time is consumed in the data-preprocessing phase, followed by the model-building phase. If the accuracy is not up to the mark, we reiterate the whole process until we reach satisfactory accuracy. 

The difficulty arises when we want to put the model into production in the real world. The model often does not perform as well as it did during the training and evaluation phase. This happens primarily because of concept drift or data drift and issues concerning data integrity. Therefore, testing an ML model becomes very important so that we can understand its strengths and weaknesses and act accordingly. 

In this article, we will discuss some of the tools that can be leveraged to test an ML model. Some of these tools and libraries are open-source, while others require a subscription. Either way, this article will explore tools that will come in handy for your MLOps pipeline. 

Why does model testing matter?

Building upon what we just discussed, model testing allows you to pinpoint a bug or area of concern that might cause the prediction capability of the model to degrade. This can happen gradually over time or in an instant. Either way, it is always good to know in which areas the model might fail and which features can cause it to fail. Testing exposes flaws, and it can also bring new insights to light. Essentially, the idea is to build a robust model that can efficiently handle uncertain data entries and anomalies. 

Some of the benefits of model testing are:

  1. Detecting model and data drift
  2. Finding anomalies in the dataset
  3. Checking data and model integrity
  4. Detecting possible root causes of model failure
  5. Eliminating bugs and errors
  6. Reducing false positives and false negatives
  7. Encouraging retraining of the model over a certain period of time
  8. Creating a production-ready model
  9. Ensuring robustness of the ML model
  10. Finding new insights within the model

Is model testing the same as model evaluation?

Model testing and evaluation are similar to what we call diagnosis and screening in medicine. 

Model evaluation is similar to screening, where the performance of the model is checked with metrics like the F1 score or MSE loss. These metrics do not point to a specific area of concern. 

Model testing is similar to diagnosis, where a specific test, like an invariance test or a unit test, aims to find a particular issue in the model. 
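
To make the distinction concrete, here is a minimal sketch of one such targeted test: an invariance test that checks whether predictions stay stable when a feature that should not matter is shuffled. The model, feature column, and tolerance are placeholders for illustration, not part of any specific framework.

import numpy as np

def test_invariance_to_irrelevant_feature(model, X, irrelevant_column, tolerance=0.01):
    # Shuffle a feature that, by domain knowledge, should not influence predictions
    X_perturbed = X.copy()
    X_perturbed[irrelevant_column] = np.random.permutation(X_perturbed[irrelevant_column].values)

    original = model.predict_proba(X)[:, 1]
    perturbed = model.predict_proba(X_perturbed)[:, 1]

    # The test fails if predictions move more than the allowed tolerance on average
    assert np.mean(np.abs(original - perturbed)) < tolerance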

What will a typical ML software testing suite include?

A machine learning testing suite often includes modules to detect different types of drift, like concept drift and data drift, which can include covariate drift, prediction drift, and so on. These issues usually originate in the dataset. Most of the time, the dataset's distribution changes over time, affecting the model's ability to accurately predict the output. You will find that the frameworks we discuss contain tools to detect data drift. 
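
As a minimal, framework-agnostic illustration of the data-drift idea (not tied to any of the tools below), a two-sample Kolmogorov-Smirnov test can compare a feature's training distribution against its production distribution. The column names and significance threshold here are placeholders.

import pandas as pd
from scipy.stats import ks_2samp

def detect_feature_drift(train_df: pd.DataFrame, prod_df: pd.DataFrame, alpha: float = 0.05):
    # Compare each numeric feature's training distribution against its production distribution
    drifted = {}
    for col in train_df.select_dtypes("number").columns:
        statistic, p_value = ks_2samp(train_df[col], prod_df[col])
        if p_value < alpha:        # a small p-value suggests the two distributions differ
            drifted[col] = round(statistic, 3)
    return drifted                 # e.g. {"age": 0.21, "income": 0.34}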

Apart from testing the data, an ML testing suite contains tools to test the model's predictive capability and to check for overfitting, underfitting, variance and bias, et cetera. The idea of the testing framework is to inspect the pipeline in the three major phases of development:

  • data ingestion,
  • data preprocessing,
  • and model evaluation.

Some frameworks, like Robust Intelligence and Kolena, automatically and rigorously test the given ML pipeline in these areas to ensure a production-ready model. 

In essence, a machine learning suite will contain:

  1. Unit tests that operate at the level of the codebase,
  2. Regression tests that replicate bugs fixed in a previous iteration of the model, to make sure they do not reappear,
  3. Integration tests that simulate conditions and are typically longer-running tests that observe model behavior. These conditions can mirror the ML pipeline, including the preprocessing phase, data distribution, et cetera (a minimal pytest sketch of such tests follows the figure below). 
A workflow of software development
The image above depicts a typical workflow of software development | Source
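
As an illustration of what such a suite can look like in plain pytest, the snippet below sketches one unit test and one regression test. The model loader, file paths, and saved failure cases are hypothetical placeholders, not taken from the article.

import joblib
import pandas as pd

def load_artifacts():
    # hypothetical paths; replace with your own serialized model and hold-out data
    model = joblib.load("model.joblib")
    df = pd.read_csv("holdout.csv")
    return model, df.drop(columns=["target"]), df["target"]

def test_prediction_shape_and_range():
    # unit test: the model returns one probability in [0, 1] per row
    model, X, _ = load_artifacts()
    proba = model.predict_proba(X)[:, 1]
    assert len(proba) == len(X)
    assert proba.min() >= 0.0 and proba.max() <= 1.0

def test_previously_fixed_failure_cases():
    # regression test: rows that exposed a bug in an earlier model version must stay correctly predicted
    model, X, y = load_artifacts()
    bug_rows = X.index[:10]   # placeholder for the saved failure cases
    assert (model.predict(X.loc[bug_rows]) == y.loc[bug_rows]).all()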

What are the best tools for machine learning model testing?

Now, let’s discuss some of the tools for testing ML models. This section is divided into three parts: open-source tools, subscription-based tools, and hybrid tools. 

Open-source model testing tools

1. DeepChecks

DeepChecks is an open-source Python framework for testing ML Models & Data. It basically enables users to test the ML pipeline in three different phases:

  1. Data integrity tests before the preprocessing phase,
  2. Data validation before training, mostly while splitting the data into training and testing sets, and
  3. ML model testing.
The image above shows the schema of three different tests that could be performed in an ML pipeline | Source

These tests can be performed all at once or independently, as shown in the schema above. 

Installation

Deepchecks can be installed using the following pip command:



pip install "deepchecks>0.5.0"

At the time of writing, the latest version of Deepchecks is 0.8.0. 
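
Before running checks on tabular data, the DataFrame is typically wrapped in Deepchecks' Dataset object so the library knows which column holds the label. A minimal sketch follows; the file name and label column are assumptions for illustration.

import pandas as pd
from deepchecks import Dataset   # in recent releases: from deepchecks.tabular import Dataset

df = pd.read_csv("my_data.csv")        # hypothetical file
data = Dataset(df, label="target")     # tell Deepchecks which column holds the label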

Structure of the framework 

DeepChecks introduces three important terms: Check, Condition and Suite. It is worth noting that these three terms together form the core structure of the framework. 

Check

It enables a user to inspect a specific aspect of the data and models. The framework contains various classes which allow you to check both of them. You can do a full check as well. Here are a couple of such checks:

  1. Data inspection involves checks around data drift, duplication, missing values, string mismatch, statistical analysis such as data distribution, et cetera. You can find the various data-inspection tools within the check module, which lets you precisely design the inspection methods for your datasets. These are some of the tools you will find for data inspection:
  •  ‘DataDuplicates’,
  •  ‘DatasetsSizeComparison’,
  •  ‘DateTrainTestLeakageDuplicates’,
  •  ‘DateTrainTestLeakageOverlap’,
  •  ‘DominantFrequencyChange’,
  •  ‘FeatureFeatureCorrelation’,
  •  ‘FeatureLabelCorrelation’,
  •  ‘FeatureLabelCorrelationChange’,
  •  ‘IdentifierLabelCorrelation’,
  •  ‘IndexTrainTestLeakage’,
  •  ‘IsSingleValue’,
  •  ‘MixedDataTypes’,
  •  ‘MixedNulls’,
  •  ‘WholeDatasetDrift’

In the following example, we will inspect whether the dataset has duplicates or not. We will import the class DataDuplicates from the checks module and pass the dataset as a parameter. This will return a table containing relevant information on whether the dataset has duplicate values or not. 



from deepchecks.checks import DataDuplicates, FeatureFeatureCorrelation

dup = DataDuplicates()
dup.run(data)
Inspection of dataset duplicates
An example of inspecting if the dataset has duplicates | Source: Author

As you can see, the table above provides the relevant information about the number of duplicates present in the dataset. Now let's see how DeepChecks uses visual aids to convey the same kind of information. 

In the following example, we will inspect feature-feature correlation within the dataset. For that, we will import the FeatureFeatureCorrelation class from the checks module.



ffc = FeatureFeatureCorrelation()
ffc.run(data)
Inspection of feature-feature correlation
An example of inspecting feature-feature correlation within the dataset | Source: Author

As you can see from both examples, the results can be displayed either in the form of a table or a graph, or even both to give relevant information to the user.  

  2. Model inspection involves checks for overfitting, underfitting, et cetera. Similar to data inspection, you can find the various model-inspection tools within the check module. These are some of the tools you will find for model inspection:
  • ‘ModelErrorAnalysis’,
  •  ‘ModelInferenceTime’,
  •  ‘ModelInfo’,
  •  ‘MultiModelPerformanceReport’,
  •  ‘NewLabelTrainTest’,
  •  ‘OutlierSampleDetection’,
  •  ‘PerformanceReport’,
  •  ‘RegressionErrorDistribution’,
  •  ‘RegressionSystematicError’,
  •  ‘RocReport’,
  •  ‘SegmentPerformance’,
  •  ‘SimpleModelComparison’,
  •  ‘SingleDatasetPerformance’,
  •  ‘SpecialCharacters’,
  •  ‘StringLengthOutOfBounds’,
  •  ‘StringMismatch’,
  •  ‘StringMismatchComparison’,
  •  ‘TrainTestFeatureDrift’,
  •  ‘TrainTestLabelDrift’,
  •  ‘TrainTestPerformance’,
  •  ‘TrainTestPredictionDrift’,

Example of a model check or inspection on Random Forest Classifier:



from deepchecks.checks import ModelInfo

info = ModelInfo()
info.run(RF)
A model check or inspection on Random Forest Classifier
An example of a model check or inspection on Random Forest Classifier | Source: Author 

Condition 

It is a function or attribute that can be added to a Check. Essentially, it contains a predefined parameter that can return a pass, fail, or warning result, and these parameters can be modified accordingly. Follow the code snippet below to get an understanding. 



from deepchecks.checks import FeatureLabelCorrelation

# add a PPS condition to the check (condition method name per the DeepChecks docs; it may differ between versions)
flc = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.8)
flc.run(data)
A bar graph of feature label correlation
An example of a bar graph of feature label correlation | Source: Author

The image above shows a bar graph of feature-label correlation. It essentially measures the predictive power score (PPS) of each independent feature, i.e., how well that feature can predict the target value by itself. When you add a condition to a check, as in the example above, the condition returns additional information listing the features that fall above and below the threshold. 

In this particular example, the condition returned the statement "Found 2 out of 4 features with PPS above threshold: {'petal width (cm)': '0.9', 'petal length (cm)': '0.87'}", meaning that these high-PPS features can predict the labels on their own. 

Suite 

It is an ordered collection of checks for data and models. All the built-in suites can be found in the suites module. Below is a schematic diagram of the framework and how it works. 

Schematic diagram of suite of checks
The schematic diagram of the suite of checks and how it works | Source 

As you can see from the image above, the data and the model can be passed into the suites which contain the different checks. The checks can be provided with the conditions for much more precise testing. 

You can run the following code to see the list of 35 checks and their conditions that DeepChecks provides:



from deepchecks.suites import full_suite

suites = full_suite()
print(suites)

Full Suite: [
    0: ModelInfo
    1: ColumnsInfo
    2: ConfusionMatrixReport
    3: PerformanceReport
        Conditions:
            0: Train-Test scores relative degradation is not greater than 0.1
    4: RocReport(excluded_classes=[])
        Conditions:
            0: AUC score for all the classes is not less than 0.7
    5: SimpleModelComparison
        Conditions:
            0: Model performance gain over simple model is not less than …
]

In conclusion, Check, Condition, and Suites allow users to essentially check the data and model in their respective tasks. These can be extended and modified according to the requirements of the project and for various use cases. 
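
For instance, a custom suite can be assembled from individual checks; here is a minimal sketch, assuming the train/test Dataset objects and the RF model from the earlier examples already exist (check names taken from the lists above).

from deepchecks import Suite
from deepchecks.checks import DataDuplicates, TrainTestFeatureDrift, PerformanceReport

custom_suite = Suite(
    "my data and model suite",
    DataDuplicates(),
    TrainTestFeatureDrift(),
    PerformanceReport(),
)
# train_data and test_data are Dataset objects, RF is the trained model from the examples above
result = custom_suite.run(train_dataset=train_data, test_dataset=test_data, model=RF)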

DeepChecks allows flexibility and instant validation of the ML pipeline with less effort. Their strong boilerplate code can allow users to automate the whole testing process, which can save a lot of time. 

Graph with distribution checks
An example of distribution checks | Source
Why should you use this?
  • It is open-source and free, and it has a growing community.
  • Very well-structured framework. 
  • Because it has built-in checks and suites, it can be extremely useful for inspecting potential issues in your data and models.
  • It is efficient in the research phase as it can be easily integrated into the pipeline.
  • If you are mostly working with tabular datasets, then DeepChecks is extremely good. 
  • You can also use it to check data and model drift, verify model integrity, and monitor models.
Methodology issues
An example of methodology issues | Source
Key features 
  1. It supports both classification and regression models, on both computer vision and tabular datasets.
  2. It can run a large group of checks with a single call.
  3. It is flexible, editable, and expandable.
  4. It yields results in both tabular and visual formats.
  5. It does not require a login dashboard; all the results, including the visualizations, are displayed instantly during execution, and it offers a good user experience on the go.
Performance checks
An example of performance checks | Source
Key drawbacks
  1. It does not support NLP tasks.
  2. Deep learning support, including computer vision, is in beta, so results can contain errors.

2. Drifter-ML

Drifter ML is an ML model testing tool specifically written for the Scikit-learn library. It can also be used to test datasets similar to DeepChecks. It has five modules, each very specific to the task at hand.

  1. Classification tests: enable you to test classification algorithms.
  2. Regression tests: enable you to test regression algorithms.
  3. Structural tests: this module has a bunch of classes that allow testing of clustering algorithms.
  4. Time-series tests: this module can be used to test for model drift. 
  5. Columnar tests: this module allows you to test your tabular dataset. Tests include sanity testing, mean and median similarity, Pearson's correlation, et cetera. 
Installation


pip install drifter-ml
Structure of the framework

Drifter-ML conforms to the Scikit-learn blueprint for models, i.e., the model must expose .fit and .predict methods. This means that you can also test deep learning models, since Keras ships a Scikit-learn-compatible wrapper (KerasClassifier). Check the example below.



# Source: https://drifter-ml.readthedocs.io/en/latest/classification-tests.html#lower-bound-classification-measures

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
import pandas as pd
import numpy as np
import joblib

# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=3, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# generate a synthetic dataset
df = pd.DataFrame()
for _ in range(1000):
    a = np.random.normal(0, 1)
    b = np.random.normal(0, 3)
    c = np.random.normal(12, 4)
    if a + b + c > 11:
        target = 1
    else:
        target = 0
    df = df.append({
        "A": a, "B": b, "C": c, "target": target
    }, ignore_index=True)

# split into input (X) and output (y) variables, create and fit the model
clf = KerasClassifier(build_fn=create_model, epochs=150, batch_size=10, verbose=0)
X = df[["A", "B", "C"]]
clf.fit(X, df["target"])
joblib.dump(clf, "model.joblib")
df.to_csv("data.csv")

The example above shows how easily you can build and persist a Keras model in the Scikit-learn style so that drifter-ml can test it. Similarly, you can also design a test case. In the test defined below, we check that the model's cross-validated precision does not fall below a lower boundary of 0.9. 



from drifter_ml.classification_tests import ClassificationTests  # module path assumed from the drifter-ml docs
import joblib
import pandas as pd

def test_cv_precision_lower_boundary():
    df = pd.read_csv("data.csv")
    column_names = ["A", "B", "C"]
    target_name = "target"
    clf = joblib.load("model.joblib")

    test_suite = ClassificationTests(clf, df, target_name, column_names)
    lower_boundary = 0.9
    return test_suite.cross_val_precision_lower_boundary(
        lower_boundary
    )
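
Since the test is just a function that returns a boolean (as described in the drifter-ml docs), pytest will collect it automatically from a test_*.py file, or it can be run directly; a minimal sketch:

if __name__ == "__main__":
    # the drifter-ml method returns True when cross-validated precision stays above the 0.9 boundary
    assert test_cv_precision_lower_boundary(), "cross-validated precision fell below the 0.9 lower boundary"
    print("precision lower-boundary test passed")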

Why should you use this?
  • Drifter-ML is specifically written for Scikit-learn, and this library acts as an extension to it. All the classes and methods are written in sync with Scikit-learn, so data and model testing become relatively easy and straightforward. 
  • On a side note, if you like to work on an open-source library, then you can extend the library to other machine learning and deep learning libraries such as Pytorch as well. 
Key features 
  1. Built on top of Scikit-learn.
  2. Offers tests for deep learning architectures, but only for Keras, since Keras provides a Scikit-learn wrapper.
  3. Open-source library, open to contributions.
Key drawbacks
  1. It is not up to date, and its community is not very active.
  2. It does not work well with other libraries.

Subscription-based tools

1. Kolena.io

Kolena.io is a Python-based framework for ML testing. It also includes an online platform where the results and insights can be logged. Kolena focuses mostly on the ML unit testing and validation process at scale. 

Kolena.io dashboard
Kolena.io dashboard example | Source
Why should you use this?

Kolena argues that the usual train/test-split methodology isn't as reliable as it seems. A split test set represents the global distribution of the entire population but fails to capture local representations at a granular level; this is especially true at the level of individual labels or classes. There are hidden nuances in the features that still need to be discovered. This leads to models failing in the real world even though they yield good scores on the performance metrics during training and evaluation. 

One way of addressing that issue is to create a much more focused dataset, which can be achieved by breaking a given class into smaller subclasses for focused results, or even by creating subsets of the features themselves. Such a dataset enables the ML model to extract features and representations at a much more granular level. This also improves the performance of the model by balancing bias and variance so that the model generalizes well to real-world scenarios. 

For example, when building a classification model, a given class in the dataset can be broken down into various subsets and those subsets into finer subsets. This can enable users to test the model in various scenarios. In the table below, the CAR class is tested against several test cases to check the model’s performance on various attributes. 

CAR class tested against several test cases
CAR class tested against several test cases to check the model’s performance on various attributes | Source

Another benefit is that whenever we face a new scenario in the real world, a new test case can be designed and tested immediately. Likewise, users can build more comprehensive test cases for a variety of tasks and train or build a model. Users can also generate a detailed report on a model's performance in each category of test cases and compare it to previous models with each iteration.
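
Kolena's client API is only available with a subscription, but the underlying idea can be illustrated in plain Python: evaluate the model separately on each sub-scenario instead of only on the global test set. The DataFrame below is assumed to have a label column and a hypothetical scenario tag.

import pandas as pd
from sklearn.metrics import accuracy_score

def per_scenario_report(model, df: pd.DataFrame, feature_cols, scenario_col="scenario"):
    # scenario_col is a hypothetical column tagging each row with a sub-case,
    # e.g. "night-time", "occluded", or "motion blur" images for a CAR class
    rows = []
    for scenario, group in df.groupby(scenario_col):
        preds = model.predict(group[feature_cols])
        rows.append({"scenario": scenario,
                     "n_samples": len(group),
                     "accuracy": accuracy_score(group["label"], preds)})
    # a model that looks fine globally may still score poorly on one scenario,
    # which is exactly what these focused test cases are meant to surface
    return pd.DataFrame(rows).sort_values("accuracy")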

To sum up, Kolena offers:

  • The ease of a Python framework
  • Automated workflow testing and deployment
  • Faster model debugging
  • Faster model deployment

If you are working on a large-scale deep learning model which will be complex to monitor, then Kolena will be beneficial. 

Key features 
  1. Supports deep learning architectures.
  2. Kolena Test Case Studio lets you curate customizable test cases for the model.
  3. It allows users to prepare quality tests by removing noise and improving annotations.
  4. It can automatically diagnose failure modes and find the exact underlying issue.
  5. Integrates seamlessly into the ML pipeline.
App Kolena.io
View from the Kolena.io app | Source
Key drawbacks
  1. Subscription-based model (pricing not mentioned).
  2. In order to download the framework, you need a CloudRepo pass:


pip3 install --extra-index-url "$CR_URL" kolena-client

2. Robust Intelligence

Robust Intelligence is an end-to-end ML platform that offers various services around ML integrity. The framework is written in Python and allows you to customize your code according to your needs. It also integrates with an online dashboard that provides insights from the various tests on data and model performance, as well as model monitoring. All these services target the ML model and data from training through the post-production phase. 

Robust intelligence
Robust intelligence features | Source
Why should you use this?

The platform offers services like:

1. AI stress testing, which includes hundreds of tests to automatically evaluate the performance of the model and identify potential drawbacks. 

AI stress testing
Evaluating the performance of the model | Source

2. AI Firewall, which automatically creates a wrapper around the trained model to protect it from bad data in real-time. The wrapper is configured based on the model. It also automatically checks both the data and model, reducing manual effort and time.

AI Firewall
Prevention of model failures in production | Source

3. AI Continuous Testing, which monitors the model and automatically tests the deployed model to check for updates and retraining. The testing involves data drift, error, root cause analysis, anomaly detection, et cetera. All the insights gained during continuous testing are displayed on the dashboard. 

AI continuous testing
Monitoring model in production | Source

Robust Intelligence enables model testing, model protection during deployment, and model monitoring after deployment. Since it is an end-to-end platform, all the phases can be easily automated, with hundreds of stress tests run on the model to make it production-ready. If the project is fairly large, then Robust Intelligence will give you an edge. 

Key features 
  1. Supports deep learning frameworks
  2. Flexible and easy to use
  3. Customizable
  4. Scalable
Key drawbacks
  1. Only for enterprise. 
  2. Few details are available online. 
  3. Expensive: a one-year subscription costs around $60,000.

(Source)

Hybrid frameworks

1. Etiq.ai

Etiq is an AI-observability platform that supports the AI/ML lifecycle. Like Kolena and Robust Intelligence, the framework offers ML model testing, monitoring, optimization, and explainability. 

Etiq.ai
The dashboard of Etiq.ai | Source

Etiq is considered to be a hybrid framework as it offers both offline and online implementation. Etiq has four tiers of usage:

  1. Free and public: It includes free usage of the library as well as the dashboard. Keep in mind that results and metadata will be stored in your dashboard instance the moment you log in to the platform, but you will receive the full benefits. 
  2. Free and limited: If you want a free but private testing environment for your project and don't want to share any information, you can use the library without logging into the platform. Keep in mind that you will not receive the full benefits you would have received had you logged in.  
  3. Subscribe and private: If you want the full benefits of Etiq.ai, you can subscribe to their plan and use their tools in your own private environment. Etiq.ai is already available on the AWS Marketplace, starting at around $3.00/hour or $25,000.00/year. 
  4. Personalized request: If you require functionality beyond what Etiq.ai provides, like explainability, robustness, or team-share functionality, you can contact them for your own personalized test suite.  
Structure of the framework 

Etiq follows a structure similar to DeepChecks. This structure remains the core of the framework:

  • Snapshot: It is a combination of dataset and model in the pre-production testing phase. 
  • Scan: It is usually a test that is applied to the snapshot.
  • Config: It is usually a JSON file that contains a set of parameters that will be used by the scan for running tests in the snapshot.
  • Custom test: It allows you to customize your tests by adding and editing various metrics to the config file. 

Etiq offers two types of tests: Scan and Root Cause Analysis (RCA); the latter is an experimental pipeline. The scan type offers the following (a hypothetical config is sketched after this list):

  • Accuracy: In some cases, high accuracy can indicate a problem just as low accuracy can, and an ‘accuracy’ scan can help. If the accuracy is suspiciously high, you might run a leakage scan; if it is low, you can run a drift scan. 
  • Leakage: It helps you find data leakage. 
  • Drift: It can help you find feature drift, target drift, concept drift, and prediction drift. 
  • Bias: This refers to algorithmic bias that can arise when automated decision-making causes unintended discrimination. 
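
To make the Snapshot/Scan/Config idea above concrete, here is a purely hypothetical config written as a Python dict. The keys and values are illustrative only, not Etiq's actual schema; see the Etiq docs for the real parameters.

# hypothetical parameters for a drift scan; Etiq expects an equivalent JSON config file
drift_scan_config = {
    "dataset": {"label": "target", "train_valid_split": 0.8},
    "scan": {"type": "drift", "thresholds": {"feature_drift": 0.15}},
}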
Why should you use this?

Etiq.ai offers a multi-step pipeline, which means you can monitor the tests by logging the results of each step in the ML pipeline. This allows you to identify and repair bias within the model. If you are looking for a framework that can do the heavy lifting of your AI pipeline, then Etiq.ai is the one to go with. 

Some other reasons why you should use Etiq.ai:

  1. It is a Python framework.
  2. Dashboard facility for multiple insights and optimization reporting.
  3. You can manage multiple projects.

All the points above are valid for free tier usage. 

One key feature of Etiq.ai is that it allows you to be very precise and straightforward in how you build and deploy your model. It aims to give users the tools that can help them achieve the desired model. At times, the development process drifts away from the original plan, mostly because of a lack of tools needed to shape the model. If you want to deploy a model that is aligned with the proposed requirements, then Etiq.ai is the way to go, because the framework offers similar tests at each step throughout your ML pipeline. 

Etiq.ai
Steps of the process when to use Etiq.ai | Source
Key features 
  1. A lot of functionality in the free tier.
  2. Tests each step of the pipeline for better monitoring.
  3. Supports deep learning frameworks like PyTorch and Keras/TensorFlow.
  4. You can request a personalized test library.
Key drawbacks
  1. At the moment, in production, they only provide functionality for batch processing.
  2. To apply tests to tasks pertaining to segmentation, regression, or recommendation engines, you must get in touch with the team.

Conclusion

The ML testing frameworks we discussed each target different user needs. All of them have their own pros and cons, but you can definitely get by with any one of them. ML model testing frameworks play an integral part in determining how a model will perform when deployed to a real-world scenario. 

If you are looking for a free and easy-to-use ML testing framework for structured datasets and smaller ML models, then go with DeepChecks. If you are working with DL algorithms, then Etiq.ai is a good option. But if you can spare some money, then you should definitely inquire about Kolena. And lastly, if you are working in a mid to large-size enterprise and looking for ML testing solutions, then hands-down, it has to be Robust Intelligence. 

I hope this article provided you with all the preliminary information needed for you to get started with ML testing. Please share this article with everyone who needs it. 

Thanks for reading!!!


Nilesh Barla

I am the founder of a recent startup perceptronai.net which aims to provide solutions in medical and material science through our deep learning algorithms. I also read and think a lot. And sometimes I put them in a form of a painting or a piece of music. And when I need to catch a breath I go for a run.

