Know What You Don’t Know: Getting Reliable Confidence Scores When Unsure of a Pr...

Softmax prediction scores are often used as confidence scores in multi-class classification. In this post, we show that softmax scores can be meaningless when a model is trained with regular empirical risk minimization by gradient descent. We then apply the method presented in Deep Anomaly Detection with Outlier Exposure to mitigate this problem and make the softmax score more meaningful.

Discriminative classifiers (models that try to estimate P(y|x) from data) tend to be overconfident in their predictions, even when the input sample looks nothing like anything they saw during training. As a result, the output scores of such models cannot be used reliably as confidence scores, since the model is often confident where it should not be.
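To make the overconfidence concrete, here is a minimal sketch in plain Python. It assumes a hypothetical two-class linear model whose weights (`WEIGHTS`) are chosen purely for illustration and are not taken from the experiments in this post. Because the logits grow linearly with the input, the top softmax score approaches 1 as the input moves away from the decision boundary, whether or not any training data was ever seen in that region.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights for a 1-D, two-class linear model: logit_k = w_k * x.
WEIGHTS = [-1.0, 1.0]


def confidence(x):
    """Highest softmax score the toy model assigns to the input x."""
    probs = softmax([w * x for w in WEIGHTS])
    return max(probs)


# The score only depends on the distance to the decision boundary at x = 0,
# so a point very far from any plausible training data gets a score near 1.
for x in (0.0, 1.0, 5.0, 50.0):
    print(x, confidence(x))
```

The same effect shows up in deep networks in a less obvious form, which is what the synthetic example illustrates.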

Example:

In this synthetic example, we have one big cluster of class zero and another one of class one, plus two smaller groups of outlier points that were not present in the training set.

[Figure: Toy example]

If we train a regular classifier on this data, we get something like this:

[Figure: Confidence scores baseline]

We see that the classifier is overly confident everywhere: even the outlier samples are classified with a very high score. The confidence score is displayed as a heat map.

This is why it is not a good idea to use softmax scores directly as confidence scores. If a classifier is confident everywhere, including in regions where it saw no supporting evidence during training, then its confidence scores are probably wrong.

However, if we use the approach presented in Deep Anomaly Detection with Outlier Exposure, we can get much more reasonable softmax scores:

[Figure: Outlier exposure]

This score map is much more reasonable and shows where the model is rightly confident and where it is not. The outlier regions have a very low confidence of about 0.5, which is equivalent to no confidence at all in a two-class setting.

Description of the Approach

The idea presented in Deep Anomaly Detection with Outlier Exposure is to use external data that is mostly different from your training/test data and force the model to predict the uniform distribution on this external data.

For example, if you are trying to build a classifier that predicts cat vs dog in images, you can get a bunch of bear and shark images and force the model to predict [0.5, 0.5] on those images.
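The effect of the uniform target can be sketched with a few lines of plain Python. This is an illustration under assumptions, not the paper's or this post's exact loss: the helper names `cross_entropy` and `uniform_target` are hypothetical, and the two prediction vectors are made up. It shows that cross-entropy against a uniform target penalizes a confident prediction on an out-of-distribution image more than an unsure one.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def uniform_target(n_classes):
    """Target used for out-of-distribution samples: equal mass on every class."""
    return [1.0 / n_classes] * n_classes


# Two made-up cat-vs-dog predictions for a bear image.
confident = [0.99, 0.01]
unsure = [0.5, 0.5]

target = uniform_target(2)
loss_confident = cross_entropy(target, confident)
loss_unsure = cross_entropy(target, unsure)

# The loss is smallest when the prediction matches the uniform target,
# so gradient descent pushes outlier predictions toward [0.5, 0.5].
print(loss_confident, loss_unsure)
```

With this target, the minimum achievable loss on an outlier is log(n_classes), reached exactly when the model predicts the uniform distribution.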

Data And Model

We will use the 102 Flower dataset as the in-distribution dataset and a subset of the OpenImage dataset as the out-of-distribution dataset. The paper referenced in the introduction shows that training on one set of out-of-distribution samples generalizes well to other out-of-distribution sets.

We use MobileNetV2 as our classification architecture and initialize it with weights pretrained on ImageNet.

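# Imports assume the tf.keras API; the original post may have used standalone Keras.
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Concatenate, Dense, Dropout, Input
from tensorflow.keras.layers import GlobalAveragePooling2D, GlobalMaxPooling2D
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam


def get_model_classification(
    input_shape=(None, None, 3),
    weights="imagenet",
    n_classes=102,
):
    inputs = Input(input_shape)
    # Pretrained convolutional backbone without its classification head.
    base_model = MobileNetV2(
        include_top=False, input_shape=input_shape, weights=weights
    )
    x = base_model(inputs)
    x = Dropout(0.5)(x)
    # Concatenate max-pooled and average-pooled features into one vector.
    out1 = GlobalMaxPooling2D()(x)
    out2 = GlobalAveragePooling2D()(x)
    out = Concatenate(axis=-1)([out1, out2])
    out = Dropout(0.5)(out)
    out = Dense(n_classes, activation="softmax")(out)
    model = Model(inputs, out)
    model.compile(
        optimizer=Adam(0.0001), loss=categorical_crossentropy, metrics=["acc"]
    )
    return model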

We will use a generator to load the images from the hard drive batch by batch. In the baseline, we only load in-distribution images. In the outlier exposure model, we load half of each batch from in-distribution images with their correct labels and the other half from out-of-distribution images with a uniform target:

target = [1 / n_label for _ in range(n_label)]
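The batch composition described above can be sketched as follows. This is a simplified illustration, not the generator used for the experiments: `mixed_batches`, `one_hot`, and the file names are hypothetical, and the sketch yields file names where a real generator would load and preprocess the images into arrays.

```python
import random


def one_hot(label, n_label):
    """One-hot target for an in-distribution sample."""
    return [1.0 if i == label else 0.0 for i in range(n_label)]


def mixed_batches(in_samples, out_samples, n_label, batch_size=4, seed=0):
    """Yield batches that are half in-distribution, half out-of-distribution.

    in_samples: list of (item, label) pairs; out_samples: list of items.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    uniform = [1 / n_label for _ in range(n_label)]
    while True:
        batch_in = rng.sample(in_samples, half)
        batch_out = rng.sample(out_samples, batch_size - half)
        items = [x for x, _ in batch_in] + list(batch_out)
        targets = [one_hot(y, n_label) for _, y in batch_in]
        # Out-of-distribution items all share the same uniform target.
        targets += [list(uniform) for _ in batch_out]
        yield items, targets


# Placeholder file names and labels, for illustration only.
in_samples = [("flower_%d.jpg" % i, i % 3) for i in range(10)]
out_samples = ["other_%d.jpg" % i for i in range(10)]
items, targets = next(mixed_batches(in_samples, out_samples, n_label=3))
```

Since both halves use the same categorical cross-entropy loss, the uniform targets need no special handling in the training loop.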

Results

Both training configurations reach slightly above 90% accuracy on in-distribution samples. We predict “Don’t Know” when the highest softmax score is below 0.15, and thus abstain from making a class prediction.
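The abstention rule can be written as a small helper. This is a sketch assuming the threshold is applied to the highest softmax score; the function name `predict_or_abstain` and the class names are hypothetical.

```python
def predict_or_abstain(probs, class_names, threshold=0.15):
    """Return the top class name, or "Don't Know" if its score is too low."""
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "Don't Know"
    return class_names[best]


# Illustrative class names, not the actual dataset labels.
names = ["rose", "tulip", "daisy"]
sure = predict_or_abstain([0.7, 0.2, 0.1], names)

# A near-uniform output over ten classes stays below the threshold.
ten_names = ["class_%d" % i for i in range(10)]
unsure = predict_or_abstain([0.1] * 10, ten_names)
```

With 102 classes, a perfectly uniform output gives every class a score of about 0.0098, far below the 0.15 threshold, so a model that learned the uniform target on outliers will abstain on them.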

Now let us see how each model behaves!

Regular training:

You can run the web application with:

streamlit run main.py
