5 Tools That Will Help You Setup Production ML Model Testing

Source: https://neptune.ai/blog/tools-ml-model-testing

Developing a machine learning or deep learning model seems like a relatively straightforward task. It usually involves research, collecting and preprocessing the data, extracting features, building and training the model, evaluation, and inference. Most of the time is consumed in the data-preprocessing phase, followed by the model-building phase. If the accuracy is not up to the mark, we reiterate the whole process until we reach satisfactory accuracy. 

The difficulty arises when we want to put the model into production in the real world. The model often does not perform as well as it did during the training and evaluation phase. This happens primarily because of concept drift or data drift and issues concerning data integrity. Therefore, testing an ML model becomes very important so that we can understand its strengths and weaknesses and act accordingly. 

In this article, we will discuss some of the tools that can be leveraged to test an ML model. Some of these tools and libraries are open-source, while others require a subscription. Either way, this article will explore tools that will come in handy for your MLOps pipeline. 

Why does model testing matter?

Building upon what we just discussed, model testing allows you to pinpoint a bug or area of concern that might cause the prediction capability of the model to degrade. This can happen gradually over time or in an instant. Either way, it is always good to know in which areas the model might fail and which features can cause it to fail. Testing exposes flaws, and it can also bring new insights to light. Essentially, the idea is to build a robust model that can efficiently handle uncertain data entries and anomalies. 

Some of the benefits of model testing are:

  1. Detecting model and data drift
  2. Finding anomalies in the dataset
  3. Checking data and model integrity
  4. Detecting possible root causes of model failure
  5. Eliminating bugs and errors
  6. Reducing false positives and false negatives
  7. Encouraging retraining of the model over a certain period of time
  8. Creating a production-ready model
  9. Ensuring robustness of the ML model
  10. Finding new insights within the model

Is model testing the same as model evaluation?

Model testing and evaluation are similar to what we call diagnosis and screening in medicine. 

Model evaluation is similar to screening, where the performance of the model is checked with metrics like the F1 score or MSE loss. These metrics do not point to a specific area of concern. 

Model testing is similar to diagnosis, where a specific test, like an invariance test or a unit test, aims to find a particular issue in the model. 
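
To make the distinction concrete, here is a minimal sketch of one such targeted test: an invariance test that checks whether predictions stay stable when a feature that should not matter is shuffled. The model, feature column, and tolerance are placeholders for illustration, not part of any specific framework.

import numpy as np

def test_invariance_to_irrelevant_feature(model, X, irrelevant_column, tolerance=0.01):
    # Shuffle a feature that, by domain knowledge, should not influence predictions
    X_perturbed = X.copy()
    X_perturbed[irrelevant_column] = np.random.permutation(X_perturbed[irrelevant_column].values)

    original = model.predict_proba(X)[:, 1]
    perturbed = model.predict_proba(X_perturbed)[:, 1]

    # The test fails if predictions move more than the allowed tolerance on average
    assert np.mean(np.abs(original - perturbed)) < tolerance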

What will a typical ML software testing suite include?

A machine learning testing suite often includes modules to detect different types of drift, like concept drift and data drift, which can include covariate drift, prediction drift, and so on. These issues usually originate in the dataset. Most of the time, the dataset's distribution changes over time, affecting the model's ability to accurately predict the output. You will find that the frameworks we discuss contain tools to detect data drift. 
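
As a minimal, framework-agnostic illustration of the data-drift idea (not tied to any of the tools below), a two-sample Kolmogorov-Smirnov test can compare a feature's training distribution against its production distribution. The column names and significance threshold here are placeholders.

import pandas as pd
from scipy.stats import ks_2samp

def detect_feature_drift(train_df: pd.DataFrame, prod_df: pd.DataFrame, alpha: float = 0.05):
    # Compare each numeric feature's training distribution against its production distribution
    drifted = {}
    for col in train_df.select_dtypes("number").columns:
        statistic, p_value = ks_2samp(train_df[col], prod_df[col])
        if p_value < alpha:        # a small p-value suggests the two distributions differ
            drifted[col] = round(statistic, 3)
    return drifted                 # e.g. {"age": 0.21, "income": 0.34}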

Apart from testing the data, an ML testing suite contains tools to test the model's predictive capability and to check for overfitting, underfitting, variance and bias, et cetera. The idea of the testing framework is to inspect the pipeline in the three major phases of development:

  • data ingestion,
  • data preprocessing,
  • and model evaluation.

Some frameworks, like Robust Intelligence and Kolena, automatically and rigorously test the given ML pipeline in these areas to ensure a production-ready model. 

In essence, a machine learning suite will contain:

  1. Unit tests that operate at the level of the codebase,
  2. Regression tests that replicate bugs fixed in a previous iteration of the model, to make sure they do not reappear,
  3. Integration tests that simulate conditions and are typically longer-running tests that observe model behavior. These conditions can mirror the ML pipeline, including the preprocessing phase, data distribution, et cetera (a minimal pytest sketch of such tests follows the figure below). 
A workflow of software development
The image above depicts a typical workflow of software development | Source
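
As an illustration of what such a suite can look like in plain pytest, the snippet below sketches one unit test and one regression test. The model loader, file paths, and saved failure cases are hypothetical placeholders, not taken from the article.

import joblib
import pandas as pd

def load_artifacts():
    # hypothetical paths; replace with your own serialized model and hold-out data
    model = joblib.load("model.joblib")
    df = pd.read_csv("holdout.csv")
    return model, df.drop(columns=["target"]), df["target"]

def test_prediction_shape_and_range():
    # unit test: the model returns one probability in [0, 1] per row
    model, X, _ = load_artifacts()
    proba = model.predict_proba(X)[:, 1]
    assert len(proba) == len(X)
    assert proba.min() >= 0.0 and proba.max() <= 1.0

def test_previously_fixed_failure_cases():
    # regression test: rows that exposed a bug in an earlier model version must stay correctly predicted
    model, X, y = load_artifacts()
    bug_rows = X.index[:10]   # placeholder for the saved failure cases
    assert (model.predict(X.loc[bug_rows]) == y.loc[bug_rows]).all()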

What are the best tools for machine learning model testing?

Now, let’s discuss some of the tools for testing ML models. This section is divided into three parts: open-source tools, subscription-based tools, and hybrid tools. 

Open-source model testing tools

1. DeepChecks

DeepChecks is an open-source Python framework for testing ML Models & Data. It basically enables users to test the ML pipeline in three different phases:

  1. Data integrity tests before the preprocessing phase,
  2. Data validation before training, mostly while splitting the data into training and testing sets, and
  3. ML model testing.
The image above shows the schema of three different tests that could be performed in an ML pipeline | Source

These tests can be performed all at once or independently, as shown in the schema above. 

Installation

Deepchecks can be installed using the following pip command:



pip install "deepchecks>0.5.0"

At the time of writing, the latest version of Deepchecks is 0.8.0. 
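
Before running checks on tabular data, the DataFrame is typically wrapped in Deepchecks' Dataset object so the library knows which column holds the label. A minimal sketch follows; the file name and label column are assumptions for illustration.

import pandas as pd
from deepchecks import Dataset   # in recent releases: from deepchecks.tabular import Dataset

df = pd.read_csv("my_data.csv")        # hypothetical file
data = Dataset(df, label="target")     # tell Deepchecks which column holds the label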

Structure of the framework 

DeepChecks introduces three important terms: Check, Condition and Suite. It is worth noting that these three terms together form the core structure of the framework. 

Check

It enables a user to inspect a specific aspect of the data and models. The framework contains various classes which allow you to check both of them. You can do a full check as well. Here are a couple of such checks:

  1. Data inspection involves checks around data drift, duplication, missing values, string mismatch, statistical analysis such as data distribution, et cetera. You can find the various data-inspection tools within the check module, which lets you precisely design the inspection methods for your datasets. These are some of the tools you will find for data inspection:
  •  ‘DataDuplicates’,
  •  ‘DatasetsSizeComparison’,
  •  ‘DateTrainTestLeakageDuplicates’,
  •  ‘DateTrainTestLeakageOverlap’,
  •  ‘DominantFrequencyChange’,
  •  ‘FeatureFeatureCorrelation’,
  •  ‘FeatureLabelCorrelation’,
  •  ‘FeatureLabelCorrelationChange’,
  •  ‘IdentifierLabelCorrelation’,
  •  ‘IndexTrainTestLeakage’,
  •  ‘IsSingleValue’,
  •  ‘MixedDataTypes’,
  •  ‘MixedNulls’,
  •  ‘WholeDatasetDrift’

In the following example, we will inspect whether the dataset has duplicates or not. We will import the class DataDuplicates from the checks module and pass the dataset as a parameter. This will return a table containing relevant information on whether the dataset has duplicate values or not. 



from deepchecks.checks import DataDuplicates, FeatureFeatureCorrelation

dup = DataDuplicates()
dup.run(data)
Inspection of dataset duplicates
An example of inspecting if the dataset has duplicates | Source: Author

As you can see, the table above provides the relevant information about the number of duplicates present in the dataset. Now let's see how DeepChecks uses visual aids to convey the same kind of information. 

In the following example, we will inspect feature-feature correlation within the dataset. For that, we will import the FeatureFeatureCorrelation class from the checks module.



ffc = FeatureFeatureCorrelation()
ffc.run(data)
Inspection of feature-feature correlation
An example of inspecting feature-feature correlation within the dataset | Source: Author

As you can see from both examples, the results can be displayed either in the form of a table or a graph, or even both to give relevant information to the user.  

  2. Model inspection involves checks for overfitting, underfitting, et cetera. Similar to data inspection, you can find the various model-inspection tools within the check module. These are some of the tools you will find for model inspection:
  • ‘ModelErrorAnalysis’,
  •  ‘ModelInferenceTime’,
  •  ‘ModelInfo’,
  •  ‘MultiModelPerformanceReport’,
  •  ‘NewLabelTrainTest’,
  •  ‘OutlierSampleDetection’,
  •  ‘PerformanceReport’,
  •  ‘RegressionErrorDistribution’,
  •  ‘RegressionSystematicError’,
  •  ‘RocReport’,
  •  ‘SegmentPerformance’,
  •  ‘SimpleModelComparison’,
  •  ‘SingleDatasetPerformance’,
  •  ‘SpecialCharacters’,
  •  ‘StringLengthOutOfBounds’,
  •  ‘StringMismatch’,
  •  ‘StringMismatchComparison’,
  •  ‘TrainTestFeatureDrift’,
  •  ‘TrainTestLabelDrift’,
  •  ‘TrainTestPerformance’,
  •  ‘TrainTestPredictionDrift’,

Example of a model check or inspection on Random Forest Classifier:



from deepchecks.checks import ModelInfo

info = ModelInfo()
info.run(RF)
A model check or inspection on Random Forest Classifier
An example of a model check or inspection on Random Forest Classifier | Source: Author 

Condition 

It is a function or attribute that can be added to a Check. Essentially, it contains a predefined parameter that can return a pass, fail, or warning result, and these parameters can be modified accordingly. Follow the code snippet below to get an understanding. 



from deepchecks.checks import FeatureLabelCorrelation

# add a PPS condition to the check (condition method name per the DeepChecks docs; it may differ between versions)
flc = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.8)
flc.run(data)
A bar graph of feature label correlation
An example of a bar graph of feature label correlation | Source: Author

The image above shows a bar graph of feature-label correlation. It essentially measures the predictive power score (PPS) of each independent feature, i.e., how well that feature can predict the target value by itself. When you add a condition to a check, as in the example above, the condition returns additional information listing the features that fall above and below the threshold. 

In this particular example, the condition returned the statement "Found 2 out of 4 features with PPS above threshold: {'petal width (cm)': '0.9', 'petal length (cm)': '0.87'}", meaning that these high-PPS features can predict the labels on their own. 

Suite 

It is an ordered collection of checks for data and models. All the built-in suites can be found in the suites module. Below is a schematic diagram of the framework and how it works. 

Schematic diagram of suite of checks
The schematic diagram of the suite of checks and how it works | Source 

As you can see from the image above, the data and the model can be passed into the suites which contain the different checks. The checks can be provided with the conditions for much more precise testing. 

You can run the following code to see the list of 35 checks and their conditions that DeepChecks provides:



from deepchecks.suites import full_suite

suites = full_suite()
print(suites)

Full Suite: [
    0: ModelInfo
    1: ColumnsInfo
    2: ConfusionMatrixReport
    3: PerformanceReport
        Conditions:
            0: Train-Test scores relative degradation is not greater than 0.1
    4: RocReport(excluded_classes=[])
        Conditions:
            0: AUC score for all the classes is not less than 0.7
    5: SimpleModelComparison
        Conditions:
            0: Model performance gain over simple model is not less than …
]

In conclusion, Check, Condition, and Suites allow users to essentially check the data and model in their respective tasks. These can be extended and modified according to the requirements of the project and for various use cases. 
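
For instance, a custom suite can be assembled from individual checks; here is a minimal sketch, assuming the train/test Dataset objects and the RF model from the earlier examples already exist (check names taken from the lists above).

from deepchecks import Suite
from deepchecks.checks import DataDuplicates, TrainTestFeatureDrift, PerformanceReport

custom_suite = Suite(
    "my data and model suite",
    DataDuplicates(),
    TrainTestFeatureDrift(),
    PerformanceReport(),
)
# train_data and test_data are Dataset objects, RF is the trained model from the examples above
result = custom_suite.run(train_dataset=train_data, test_dataset=test_data, model=RF)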

DeepChecks allows flexibility and instant validation of the ML pipeline with less effort. Their strong boilerplate code can allow users to automate the whole testing process, which can save a lot of time. 

Graph with distribution checks
An example of distribution checks | Source
Why should you use this?
  • It is open-source and free, and it has a growing community.
  • Very well-structured framework. 
  • Because it has built-in checks and suites, it can be extremely useful for inspecting potential issues in your data and models.
  • It is efficient in the research phase as it can be easily integrated into the pipeline.
  • If you are mostly working with tabular datasets, then DeepChecks is extremely good. 
  • You can also use it to check data and model drift, verify model integrity, and monitor models.
Methodology issues
An example of methodology issues | Source
Key features 
  1. It supports both classification and regression models, on both computer vision and tabular datasets.
  2. It can run a large group of checks with a single call.
  3. It is flexible, editable, and expandable.
  4. It yields results in both tabular and visual formats.
  5. It does not require a login dashboard; all the results, including the visualizations, are displayed instantly during execution, and it offers a good user experience on the go.
Performance checks
An example of performance checks | Source
Key drawbacks
  1. It does not support NLP tasks.
  2. Deep learning support, including computer vision, is in beta, so results can contain errors.

2. Drifter-ML

Drifter ML is an ML model testing tool specifically written for the Scikit-learn library. It can also be used to test datasets similar to DeepChecks. It has five modules, each very specific to the task at hand.

  1. Classification tests: enable you to test classification algorithms.
  2. Regression tests: enable you to test regression algorithms.
  3. Structural tests: this module has a bunch of classes that allow testing of clustering algorithms.
  4. Time-series tests: this module can be used to test for model drift. 
  5. Columnar tests: this module allows you to test your tabular dataset. Tests include sanity testing, mean and median similarity, Pearson's correlation, et cetera. 
Installation


pip install drifter-ml
Structure of the framework

Drifter-ML conforms to the Scikit-learn blueprint for models, i.e., the model must expose .fit and .predict methods. This means that you can also test deep learning models, since Keras ships a Scikit-learn-compatible wrapper (KerasClassifier). Check the example below.



# Source: https://drifter-ml.readthedocs.io/en/latest/classification-tests.html#lower-bound-classification-measures

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
import pandas as pd
import numpy as np
import joblib

# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=3, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# generate a synthetic dataset
df = pd.DataFrame()
for _ in range(1000):
    a = np.random.normal(0, 1)
    b = np.random.normal(0, 3)
    c = np.random.normal(12, 4)
    if a + b + c > 11:
        target = 1
    else:
        target = 0
    df = df.append({
        "A": a, "B": b, "C": c, "target": target
    }, ignore_index=True)

# split into input (X) and output (y) variables, create and fit the model
clf = KerasClassifier(build_fn=create_model, epochs=150, batch_size=10, verbose=0)
X = df[["A", "B", "C"]]
clf.fit(X, df["target"])
joblib.dump(clf, "model.joblib")
df.to_csv("data.csv")

The example above shows how easily you can build and persist a Keras model in the Scikit-learn style so that drifter-ml can test it. Similarly, you can also design a test case. In the test defined below, we check that the model's cross-validated precision does not fall below a lower boundary of 0.9. 



from drifter_ml.classification_tests import ClassificationTests  # module path assumed from the drifter-ml docs
import joblib
import pandas as pd

def test_cv_precision_lower_boundary():
    df = pd.read_csv("data.csv")
    column_names = ["A", "B", "C"]
    target_name = "target"
    clf = joblib.load("model.joblib")

    test_suite = ClassificationTests(clf, df, target_name, column_names)
    lower_boundary = 0.9
    return test_suite.cross_val_precision_lower_boundary(
        lower_boundary
    )
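
Since the test is just a function that returns a boolean (as described in the drifter-ml docs), pytest will collect it automatically from a test_*.py file, or it can be run directly; a minimal sketch:

if __name__ == "__main__":
    # the drifter-ml method returns True when cross-validated precision stays above the 0.9 boundary
    assert test_cv_precision_lower_boundary(), "cross-validated precision fell below the 0.9 lower boundary"
    print("precision lower-boundary test passed")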

Why should you use this?
  • Drifter-ML is specifically written for Scikit-learn, and this library acts as an extension to it. All the classes and methods are written in sync with Scikit-learn, so data and model testing become relatively easy and straightforward. 
  • On a side note, if you like to work on an open-source library, then you can extend the library to other machine learning and deep learning libraries such as Pytorch as well. 
Key features 
  1. Built on top of Scikit-learn.
  2. Offers tests for deep learning architectures, but only for Keras, since Keras provides a Scikit-learn wrapper.
  3. Open-source library, open to contributions.
Key drawbacks
  1. It is not up to date, and its community is not very active.
  2. It does not work well with other libraries.

Subscription-based tools

1. Kolena.io

Kolena.io is a Python-based framework for ML testing. It also includes an online platform where the results and insights can be logged. Kolena focuses mostly on the ML unit testing and validation process at scale. 

Kolena.io dashboard
Kolena.io dashboard example | Source
Why should you use this?

Kolena argues that the usual train/test-split methodology isn't as reliable as it seems. A split test set represents the global distribution of the entire population but fails to capture local representations at a granular level; this is especially true at the level of individual labels or classes. There are hidden nuances in the features that still need to be discovered. This leads to models failing in the real world even though they yield good scores on the performance metrics during training and evaluation. 

One way of addressing that issue is to create a much more focused dataset, which can be achieved by breaking a given class into smaller subclasses for focused results, or even by creating subsets of the features themselves. Such a dataset enables the ML model to extract features and representations at a much more granular level. This also improves the performance of the model by balancing bias and variance so that the model generalizes well to real-world scenarios. 

For example, when building a classification model, a given class in the dataset can be broken down into various subsets and those subsets into finer subsets. This can enable users to test the model in various scenarios. In the table below, the CAR class is tested against several test cases to check the model’s performance on various attributes. 

CAR class tested against several test cases
CAR class tested against several test cases to check the model’s performance on various attributes | Source

Another benefit is that whenever we face a new scenario in the real world, a new test case can be designed and tested immediately. Likewise, users can build more comprehensive test cases for a variety of tasks and train or build a model. Users can also generate a detailed report on a model's performance in each category of test cases and compare it to previous models with each iteration.
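
Kolena's client API is only available with a subscription, but the underlying idea can be illustrated in plain Python: evaluate the model separately on each sub-scenario instead of only on the global test set. The DataFrame below is assumed to have a label column and a hypothetical scenario tag.

import pandas as pd
from sklearn.metrics import accuracy_score

def per_scenario_report(model, df: pd.DataFrame, feature_cols, scenario_col="scenario"):
    # scenario_col is a hypothetical column tagging each row with a sub-case,
    # e.g. "night-time", "occluded", or "motion blur" images for a CAR class
    rows = []
    for scenario, group in df.groupby(scenario_col):
        preds = model.predict(group[feature_cols])
        rows.append({"scenario": scenario,
                     "n_samples": len(group),
                     "accuracy": accuracy_score(group["label"], preds)})
    # a model that looks fine globally may still score poorly on one scenario,
    # which is exactly what these focused test cases are meant to surface
    return pd.DataFrame(rows).sort_values("accuracy")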

To sum up, Kolena offers:

  • The ease of a Python framework
  • Automated workflow testing and deployment
  • Faster model debugging
  • Faster model deployment

If you are working on a large-scale deep learning model which will be complex to monitor, then Kolena will be beneficial. 

Key features 
  1. Supports deep learning architectures.
  2. Kolena Test Case Studio lets you curate customizable test cases for the model.
  3. It allows users to prepare quality tests by removing noise and improving annotations.
  4. It can automatically diagnose failure modes and find the exact underlying issue.
  5. Integrates seamlessly into the ML pipeline.
App Kolena.io
View from the Kolena.io app | Source
Key drawbacks
  1. Subscription-based model (pricing not mentioned).
  2. In order to download the framework, you need a CloudRepo pass:


pip3 install --extra-index-url "$CR_URL" kolena-client

2. Robust Intelligence

Robust Intelligence is an end-to-end ML platform that offers various services around ML integrity. The framework is written in Python and allows you to customize your code according to your needs. It also integrates with an online dashboard that provides insights from the various tests on data and model performance, as well as model monitoring. All these services target the ML model and data from training through the post-production phase. 

Robust intelligence
Robust intelligence features | Source
Why should you use this?

The platform offers services like:

1. AI stress testing, which includes hundreds of tests to automatically evaluate the performance of the model and identify potential drawbacks. 

AI stress testing
Evaluating the performance of the model | Source

2. AI Firewall, which automatically creates a wrapper around the trained model to protect it from bad data in real-time. The wrapper is configured based on the model. It also automatically checks both the data and model, reducing manual effort and time.

AI Firewall
Prevention of model failures in production | Source

3. AI Continuous Testing, which monitors the model and automatically tests the deployed model to check for updates and retraining. The testing involves data drift, error, root cause analysis, anomaly detection, et cetera. All the insights gained during continuous testing are displayed on the dashboard. 

AI continuous testing
Monitoring model in production | Source

Robust Intelligence enables model testing, model protection during deployment, and model monitoring after deployment. Since it is an end-to-end platform, all the phases can be easily automated, with hundreds of stress tests run on the model to make it production-ready. If the project is fairly large, then Robust Intelligence will give you an edge. 

Key features 
  1. Supports deep learning frameworks
  2. Flexible and easy to use
  3. Customizable
  4. Scalable
Key drawbacks
  1. Only for enterprise. 
  2. Few details are available online. 
  3. Expensive: a one-year subscription costs around $60,000.

(Source)

Hybrid frameworks

1. Etiq.ai

Etiq is an AI-observability platform that supports the AI/ML lifecycle. Like Kolena and Robust Intelligence, the framework offers ML model testing, monitoring, optimization, and explainability. 

Etiq.ai
The dashboard of Etiq.ai | Source

Etiq is considered to be a hybrid framework as it offers both offline and online implementation. Etiq has four tiers of usage:

  1. Free and public: It includes free usage of the library as well as the dashboard. Keep in mind that results and metadata will be stored in your dashboard instance the moment you log in to the platform, but you will receive the full benefits. 
  2. Free and limited: If you want a free but private testing environment for your project and don't want to share any information, you can use the library without logging into the platform. Keep in mind that you will not receive the full benefits you would have received had you logged in.  
  3. Subscribe and private: If you want the full benefits of Etiq.ai, you can subscribe to their plan and use their tools in your own private environment. Etiq.ai is already available on the AWS Marketplace, starting at around $3.00/hour or $25,000.00/year. 
  4. Personalized request: If you require functionality beyond what Etiq.ai provides, like explainability, robustness, or team-share functionality, you can contact them for your own personalized test suite.  
Structure of the framework 

Etiq follows a structure similar to DeepChecks. This structure remains the core of the framework:

  • Snapshot: It is a combination of dataset and model in the pre-production testing phase. 
  • Scan: It is usually a test that is applied to the snapshot.
  • Config: It is usually a JSON file that contains a set of parameters that will be used by the scan for running tests in the snapshot.
  • Custom test: It allows you to customize your tests by adding and editing various metrics to the config file. 

Etiq offers two types of tests: Scan and Root Cause Analysis (RCA); the latter is an experimental pipeline. The scan type offers the following (a hypothetical config is sketched after this list):

  • Accuracy: In some cases, high accuracy can indicate a problem just as low accuracy can, and an ‘accuracy’ scan can help. If the accuracy is suspiciously high, you might run a leakage scan; if it is low, you can run a drift scan. 
  • Leakage: It helps you find data leakage. 
  • Drift: It can help you find feature drift, target drift, concept drift, and prediction drift. 
  • Bias: This refers to algorithmic bias that can arise when automated decision-making causes unintended discrimination. 
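
To make the Snapshot/Scan/Config idea above concrete, here is a purely hypothetical config written as a Python dict. The keys and values are illustrative only, not Etiq's actual schema; see the Etiq docs for the real parameters.

# hypothetical parameters for a drift scan; Etiq expects an equivalent JSON config file
drift_scan_config = {
    "dataset": {"label": "target", "train_valid_split": 0.8},
    "scan": {"type": "drift", "thresholds": {"feature_drift": 0.15}},
}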
Why should you use this?

Etiq.ai offers a multi-step pipeline, which means you can monitor the tests by logging the results of each step in the ML pipeline. This allows you to identify and repair bias within the model. If you are looking for a framework that can do the heavy lifting of your AI pipeline, then Etiq.ai is the one to go with. 

Some other reasons why you should use Etiq.ai:

  1. It is a Python framework.
  2. Dashboard facility for multiple insights and optimization reporting.
  3. You can manage multiple projects.

All the points above are valid for free tier usage. 

One key feature of Etiq.ai is that it allows you to be very precise and straightforward in how you build and deploy your model. It aims to give users the tools that can help them achieve the desired model. At times, the development process drifts away from the original plan, mostly because of a lack of tools needed to shape the model. If you want to deploy a model that is aligned with the proposed requirements, then Etiq.ai is the way to go, because the framework offers similar tests at each step throughout your ML pipeline. 

Etiq.ai
Steps of the process when to use Etiq.ai | Source
Key features 
  1. A lot of functionality in the free tier.
  2. Tests each step of the pipeline for better monitoring.
  3. Supports deep learning frameworks like PyTorch and Keras/TensorFlow.
  4. You can request a personalized test library.
Key drawbacks
  1. At the moment, in production, they only provide functionality for batch processing.
  2. To apply tests to tasks pertaining to segmentation, regression, or recommendation engines, you must get in touch with the team.

Conclusion

The ML testing frameworks we discussed each target different user needs. All of them have their own pros and cons, but you can definitely get by with any one of them. ML model testing frameworks play an integral part in determining how a model will perform when deployed to a real-world scenario. 

If you are looking for a free and easy-to-use ML testing framework for structured datasets and smaller ML models, then go with DeepChecks. If you are working with DL algorithms, then Etiq.ai is a good option. But if you can spare some money, then you should definitely inquire about Kolena. And lastly, if you are working in a mid to large-size enterprise and looking for ML testing solutions, then hands-down, it has to be Robust Intelligence. 

I hope this article provided you with all the preliminary information needed for you to get started with ML testing. Please share this article with everyone who needs it. 

Thanks for reading!!!


Nilesh Barla

I am the founder of a recent startup perceptronai.net which aims to provide solutions in medical and material science through our deep learning algorithms. I also read and think a lot. And sometimes I put them in a form of a painting or a piece of music. And when I need to catch a breath I go for a run.

