Robustness Measurement of Machine Learning Models with Examples in Python

While all the focus is on maximizing accuracy when training a machine learning model, not enough attention is paid to model robustness. You may have a perfectly trained model with high accuracy, but how confident are you about that accuracy? The accuracy may not be stable. It may vary across different regions of the feature space, or the model may be very sensitive to moderately out-of-distribution data following production deployment.

The focus of this post is an overview of various robustness metrics, followed by some results for one particular metric. The implementation is available in my open source GitHub repository avenir.

Model Robustness

A model is robust when its accuracy does not change significantly from the baseline accuracy under various conditions. The cited article has a broader definition of robustness, e.g. it includes human bias. The baseline accuracy is the accuracy you get for a trained model when validation is done with data from the same distribution as the training data. Our focus will be on general robustness and not adversarial robustness.

Deep learning models have a peculiar behavior where a small perturbation of the data may cause the model to misclassify. This weakness can be exploited by cyber criminals, e.g. the video stream of an autonomous vehicle could be broken into and changes made to some of the frames to turn them adversarial, forcing the model to misclassify and crash the vehicle.

Lack of robustness in a model can arise for various reasons, e.g. inconsistency in the training data collection process, some other inadequacy in the process, or underspecification of the model (which typically happens in complex deep learning models).

There are various metrics for model robustness. For all these metrics except one, the mean shift of the accuracy with respect to the base accuracy and the variation of the accuracy are the measures of robustness.

  • Feature space partitioning: Data is partitioned by splitting the features, and accuracy is calculated for each partition.
  • Ensemble of models based on partitioning: Data is partitioned and a separate model is trained for each partition with the same parameters. Accuracy is calculated for each model.
  • Distribution shift: The data distribution is shifted by applying a transformation to the features. Accuracy is calculated for various shifts (see the sketch after this list).
  • Contrastive shift: Various transformations are applied to a selected data instance. The amount of disagreement in the predictions is a measure of the robustness. Unlike the other metrics, this is a local metric.
  • Random dropout: This is applicable to neural networks only. The model is run repeatedly on some batch of data, each time with a random set of neurons dropped. Accuracy is calculated for each case.
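To make the distribution shift metric concrete, here is a minimal sketch. It is not the avenir implementation; a trained model with an sklearn-style predict method, numpy arrays for validation data, and the shift amounts are all assumed for illustration.

import numpy as np
from sklearn.metrics import accuracy_score

def distribution_shift_accuracies(model, X, y, feature_idx, shifts):
    # shift one numeric feature by increasing amounts and record the
    # accuracy for each shifted copy of the validation data
    accuracies = []
    for s in shifts:
        X_shifted = X.copy()
        X_shifted[:, feature_idx] = X_shifted[:, feature_idx] + s
        accuracies.append(accuracy_score(y, model.predict(X_shifted)))
    return accuracies

# usage with a hypothetical trained model and validation set
# accs = distribution_shift_accuracies(model, X_val, y_val, feature_idx=1,
#                                      shifts=[0.0, 0.5, 1.0, 2.0])
# print(np.mean(accs), np.std(accs))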

Besides contrastive shift, feature space partitioning can be considered a semi-local metric. These two metrics are also useful to validate performance in specific regions of the feature space that you care more about. They might also encourage you to train separate models for different regions of the feature space.

Underspecification, especially in deep learning models, makes a model less robust. There are various stress tests for underspecified models. They can also be used for testing general robustness.

Heart Disease Data

We will use a neural network model for heart disease prediction to calculate model robustness. The data is synthetic, generated with ancestral sampling (a minimal sketch of this sampling scheme follows the field list). It has the following fields; the last field is the class label.

  • weight
  • systolic blood pressure
  • diastolic blood pressure
  • smoker
  • physical activity per week
  • education
  • ethnicity
  • has heart disease
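For illustration, here is a minimal sketch of ancestral sampling: parent variables are drawn first, and child variables are drawn conditioned on their parents, with the class label drawn last. The dependency structure and parameters below are purely illustrative assumptions, not the actual generator used for this data set, and the categorical fields (education, ethnicity) are omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)

def sample_record():
    # parents first, then children conditioned on them (illustrative only)
    smoker = rng.random() < 0.3
    activity = int(rng.integers(0, 8))             # hours of physical activity per week
    weight = rng.normal(80, 12)
    systolic = rng.normal(120 + 10 * smoker + 0.2 * (weight - 80), 10)
    diastolic = rng.normal(systolic - 40, 8)
    # class label conditioned on its ancestors
    risk = 0.02 * (systolic - 120) + 0.5 * smoker - 0.1 * activity
    has_disease = rng.random() < 1.0 / (1.0 + np.exp(-risk))
    return [weight, systolic, diastolic, int(smoker), activity, int(has_disease)]

records = [sample_record() for _ in range(1000)]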

Robustness Metric with Feature Space Partitions

We will be using the feature space partition based robustness metric. To prepare the training data, we have partitioned the data and added noise at various levels between 0.03 and 0.27 to each partition. Noise has been added by sampling a noise level for each partition and flipping the class label of each record with that probability.
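A minimal sketch of this noise injection is shown below, assuming the data is in a pandas DataFrame with a binary (0/1) class label; the helper and its column arguments are hypothetical and not part of the avenir code.

import numpy as np
import pandas as pd

def add_label_noise(df, partition_col, label_col, noise_range=(0.03, 0.27), seed=42):
    # for each partition, sample a noise level and flip the binary class
    # label of each record with that probability
    rng = np.random.default_rng(seed)
    df = df.copy()
    for _, idx in df.groupby(partition_col).groups.items():
        noise = rng.uniform(*noise_range)
        flip = rng.random(len(idx)) < noise
        df.loc[idx[flip], label_col] = 1 - df.loc[idx[flip], label_col]
    return df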

The overall error level from the predictions turns out to be about 0.15 on average, as verified from the validation results following training. So we have a base accuracy of 0.85.

A Python wrapper class, driven by a configuration file, has been used. The underlying feed forward network implementation is based on PyTorch. The configuration file contains all the training parameters, metadata and other parameters that enable code-free training of a feed forward neural network. The implementation of the robustness metric is in the avenir repository.
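The core of the metric is simple to sketch. The snippet below is not the avenir implementation; it assumes a trained model with an sklearn-style predict method, numpy arrays for the validation data, and categorical (or pre-binned) partitioning features identified by column index.

import numpy as np
from sklearn.metrics import accuracy_score

def partition_robustness(model, X, y, partition_cols):
    # build a partition key per row from the selected feature columns,
    # compute accuracy within each partition, then summarize
    keys = list(zip(*(X[:, c] for c in partition_cols)))
    accuracies = []
    for key in set(keys):
        mask = np.array([k == key for k in keys])
        accuracies.append(accuracy_score(y[mask], model.predict(X[mask])))
    return accuracies, float(np.mean(accuracies)), float(np.std(accuracies))

# usage with a hypothetical trained model and validation set
# accs, mean_acc, std_acc = partition_robustness(model, X_val, y_val, partition_cols=[3, 5, 6])
# print(accs)
# print("accuracy mean %.3f  std dev %.3f" % (mean_acc, std_acc))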

Here is the result for the robustness metric. Please refer to the tutorial document for details. It shows the accuracy of all partitions, followed by the mean and standard deviation of the accuracies.

[0.7865168539325843, 0.7689243027888446, 0.3333333333333333, 0.8414634146341463, 0.38461538461538464, 0.8133333333333334, 0.2807017543859649, 0.4444444444444444, 0.24, 0.5714285714285714, 0.8636363636363636, 0.8469945355191257, 0.24242424242424243, 0.8363636363636363, 0.22727272727272727, 0.8742331288343558, 0.29411764705882354, 0.2, 0.11764705882352941, 0.6842105263157895, 0.18072289156626506, 0.7647058823529411, 0.8936170212765957, 0.834070796460177, 0.2727272727272727, 0.9, 0.3333333333333333, 0.8137254901960784, 0.2857142857142857, 0.13333333333333333, 0.6363636363636364, 0.2247191011235955, 0.7333333333333333, 0.21348314606741572, 0.6428571428571429, 0.8507462686567164, 0.5625, 0.8144927536231884, 0.40476190476190477, 0.75, 0.275, 0.7934426229508197, 0.14814814814814814, 0.14285714285714285, 0.21138211382113822, 0.5384615384615384, 0.26126126126126126, 0.5]
accuracy mean 0.516  std dev 0.274

Mean accuracy is 0.516, compared to the baseline accuracy of 0.85. The mean accuracy shift from the base accuracy is significant. The standard deviation is also significant. Here is the histogram of the accuracies.
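The histogram can be reproduced from the per-partition accuracies listed above with a few lines of matplotlib (assumed here; this plotting helper is not part of the avenir output).

import matplotlib.pyplot as plt

def plot_accuracy_histogram(accuracies, bins=10):
    # distribution of per-partition accuracies; a wide or multi-modal
    # spread indicates a lack of robustness
    plt.hist(accuracies, bins=bins)
    plt.xlabel("partition accuracy")
    plt.ylabel("number of partitions")
    plt.title("Per-partition accuracy")
    plt.show()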

We see a bimodal distribution. Overall we can conclude that the model is not very robust. Given poor robustness like this, would you feel comfortable accepting the model's predictions at face value? Probably not, even though we had an acceptable level of overall accuracy.

I have also implemented a robustness metric based on distribution shift and will publish some results in a future post. I have used accuracy as the performance metric here; any other performance metric could easily be substituted.

Summing Up

A robust model will give you the confidence needed to deploy it in production. Mere consideration of model performance is short sighted, and there may be serious consequences of ignoring model robustness, especially for mission critical machine learning applications in domains such as healthcare and finance. For some critical domains, unless the model is robust, more human intervention may be warranted.

Although we want a model that is both highly performant and robust, sometimes there is a tradeoff between performance and robustness. When making deployment decisions, you may prefer a model that is slightly less performant but more robust.

