
Latest Winning Techniques for Kaggle Image Classification with Limited Data

source link: https://www.tuicool.com/articles/vmiYvej

Snapshot Ensembling

Ensemble methods are very powerful for improving a model’s overall performance. However, separately training several different models for an ensemble can be computationally expensive. This is why I chose to use snapshot ensembling with cyclic LR scheduling.

Snapshot ensembling saves the model’s parameters periodically during training. The idea is that during cyclic LR scheduling, the model converges to different local minima. Therefore, by saving the model parameters at different local minima, we obtain a set of models that can give different insights for our prediction. This allows us to gather an ensemble of models in a single training cycle.
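A minimal sketch of the cyclic cosine-annealed schedule behind snapshot ensembling: the learning rate restarts high at each cycle and decays toward zero, and a snapshot is saved at each cycle's end. The iteration count, number of cycles, and maximum LR below are illustrative placeholders, not the values used in the challenge.

```python
import math

def cyclic_cosine_lr(t, total_iters, n_cycles, lr_max):
    """Cyclic cosine-annealed learning rate, as in the snapshot
    ensembling paper: the LR resets to lr_max at the start of each
    cycle and decays to near zero by the end of the cycle."""
    cycle_len = math.ceil(total_iters / n_cycles)
    pos = t % cycle_len  # position within the current cycle
    return lr_max / 2 * (math.cos(math.pi * pos / cycle_len) + 1)

# Snapshots are taken at the last iteration of each cycle, where the
# LR is smallest and the model sits in a local minimum.
snapshot_iters = [c * math.ceil(1000 / 5) - 1 for c in range(1, 6)]
print(snapshot_iters)  # [199, 399, 599, 799, 999]
```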

For each image, we concatenate the class probability predictions of each of the “snapshot” models to form a new data point. This new data is then fed into an XGBoost model, which gives a prediction based on the snapshot models.
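The stacking step can be sketched as follows. The probability arrays here are random stand-ins for real snapshot-model outputs, and the counts (3 snapshots, 4 images, 15 classes) are assumptions for illustration.

```python
import numpy as np

# Hypothetical per-image class probabilities from 3 snapshot models,
# for 4 images over 15 classes (random stand-ins for real outputs).
rng = np.random.default_rng(0)
snapshot_probs = [rng.dirichlet(np.ones(15), size=4) for _ in range(3)]

# Concatenate each image's probabilities from all snapshots into one
# meta-feature vector: each image becomes a (3 * 15)-dim data point.
meta_features = np.concatenate(snapshot_probs, axis=1)
print(meta_features.shape)  # (4, 45)

# meta_features would then be the training matrix for a meta-learner
# such as xgboost.XGBClassifier.
```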

Subclass decision

Upon inspecting the confusion matrix of a single model on the validation set, we discover that certain classes are consistently confused with the same other classes. In fact, we find three subclasses whose members are often confused with one another:

  • “Rooms”: bedroom, kitchen, livingroom, office
  • “Nature”: coast, forest, mountain, opencountry, highway
  • “Urban”: insidecity, street, tallbuilding

Also, the model is already very good at differentiating these subclasses (and at finding suburbs). All that remains for great performance is for the model to classify accurately within each subclass.

To do so, we train three new separate models, one per subclass, using the same approach as before. Some classes have very little training data, so we increase the amount of data augmentation. We also tune new hyperparameters adjusted to each subclass.

During prediction, we first use the model trained on the entire dataset. Then, for each prediction obtained, if the class probability is lower than a certain threshold, we take the class predicted by the relevant subclass model instead.
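The two-stage decision can be sketched like this. The subclass mapping mirrors the lists above; the confidence threshold of 0.7 and the function names are hypothetical, since the article does not state the exact value.

```python
# Illustrative two-stage prediction: fall back to the relevant
# subclass "expert" model when the global model is unsure.
SUBCLASS = {
    "bedroom": "rooms", "kitchen": "rooms",
    "livingroom": "rooms", "office": "rooms",
    "coast": "nature", "forest": "nature", "mountain": "nature",
    "opencountry": "nature", "highway": "nature",
    "insidecity": "urban", "street": "urban", "tallbuilding": "urban",
}

def two_stage_predict(global_pred, global_prob, subclass_preds, threshold=0.7):
    """global_pred: class from the model trained on all data.
    global_prob: its class probability.  subclass_preds: dict mapping
    subclass name -> prediction of that subclass expert model."""
    if global_prob >= threshold or global_pred not in SUBCLASS:
        return global_pred
    return subclass_preds[SUBCLASS[global_pred]]

print(two_stage_predict("coast", 0.55, {"nature": "opencountry"}))  # opencountry
```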

Anti-aliasing

Most modern convolutional networks, such as ResNet18, are not shift-invariant. The network outputs can change drastically with small shifts or translations to the input. This is because the striding operation in the convolutional network ignores the Nyquist sampling theorem and aliases, which breaks shift equivariance.

I decided to apply the anti-aliasing method proposed in the April 2019 paper “Making Convolutional Networks Shift-Invariant Again”. This is done by simply adding a “BlurPool” layer (a blurring filter followed by a subsampling layer) after the convolution layers of the network. This method has been shown to improve both classification consistency between different shifts of the image and classification accuracy, thanks to better generalization.
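A conceptual 1-D sketch of why blur-then-subsample helps, assuming a normalized binomial kernel [1, 2, 1]/4 (the paper uses 2-D analogues of such kernels after strided layers; the signal here is a toy stand-in for an image row):

```python
import numpy as np

def blur_pool_1d(x, stride=2):
    """Conceptual 1-D BlurPool: low-pass filter with the normalized
    kernel [1, 2, 1] / 4, then subsample by `stride`."""
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    padded = np.pad(x, 1, mode="edge")       # preserve length before striding
    blurred = np.convolve(padded, kernel, mode="valid")
    return blurred[::stride]                 # anti-aliased subsampling

signal = np.array([0., 1., 0., 1., 0., 1., 0., 1.])
shifted = np.roll(signal, 1)
# Naive subsampling gives signal[::2] = [0,0,0,0] but
# shifted[::2] = [1,1,1,1]: a one-pixel shift flips the output.
# Blurring first makes the two subsampled results agree everywhere
# except at the boundary.
print(blur_pool_1d(signal), blur_pool_1d(shifted))
```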

I used the pre-trained anti-aliased ResNet18 model to fine-tune on the challenge’s dataset. With anti-aliasing, I hope to counter the overfitting caused by data scarcity by making the model generalize across image translations and shifts.

If you want to know more about this anti-aliasing method, I explain the “Making Convolutional Networks Shift-Invariant Again” paper in more detail here.

Summary of Results

The methodology used can be summarized as follows:

Fine-tuning a ResNet18 model for 5 epochs on data without any processing except resizing already gives a testing accuracy of 0.91442. This reveals the remarkable efficiency of transfer learning: with little data and computation, the model can already show good performance on relevant tasks.

Adding data augmentation and training longer, for 10 epochs, we obtain a testing accuracy of 0.93076. This confirms the importance of a large training dataset and the value of augmentation techniques when data is limited.

Adding class balancing and learning rate scheduling, the testing accuracy goes up to 0.94230. Moreover, the confusion matrices show that after balancing, the model predicts underrepresented classes more accurately. This also shows that the learning rate is an important parameter for the convergence of the model.
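One common way to do the class balancing mentioned above is to weight each class inversely to its frequency; the class counts below are made up for illustration, not the challenge's actual class sizes.

```python
import numpy as np

# Illustrative class balancing: weight each class inversely to its
# frequency so rare classes contribute more to the training loss.
class_counts = np.array([410, 120, 300, 60])  # hypothetical counts
weights = class_counts.sum() / (len(class_counts) * class_counts)

# These weights could be passed, for example, to a weighted
# cross-entropy loss; the rarest class gets the largest weight.
print(weights.round(2))
```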

Then, with snapshot ensembling on the model trained on all of the data, the testing accuracy improves to 0.95000. This illustrates how cyclic LR scheduling lets us obtain, through a single training cycle, models with different behaviour, and how an XGBoost meta-learner can extract useful information from their predictions.

By contrast-stretching all images, training models on specific subclasses, and combining their predictions, the testing accuracy rises to 0.95865. The confusion matrix shows an improvement in accurately classifying within the subclasses, especially for the “urban” subclass. Developing models that are “experts” on certain classes and using them alongside a model good at differentiating the subclasses proves to be very efficient.
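Contrast stretching itself is a simple linear rescaling of pixel intensities; the sketch below assumes the basic min-max variant, since the article does not specify which one was used.

```python
import numpy as np

def contrast_stretch(img, out_min=0, out_max=255):
    """Min-max contrast stretch: linearly rescale intensities so the
    darkest pixel maps to out_min and the brightest to out_max."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:                    # flat image: nothing to stretch
        return np.full_like(img, float(out_min))
    return (img - lo) / (hi - lo) * (out_max - out_min) + out_min

dim = np.array([[60, 80], [100, 120]])  # low-contrast toy "image"
print(contrast_stretch(dim))  # [[0. 85.] [170. 255.]]
```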

Finally, after anti-aliasing the ResNet18 network and combining the training and validation sets to use all annotated data available for training, the testing accuracy rises to 0.97115. Anti-aliasing is a powerful method to improve generalization, which is crucial when the image data is limited.

Yay!

Other ideas

Here are a few other ideas I had for tackling this challenge, which either did not work well or I did not have the means to try.

Single channel images

The images are greyscale, so although they are encoded into three channels when loaded, they can be represented as single-channel matrices. My idea was that this dimensionality reduction could speed up training while preserving all necessary information, but in my experiments this method lost accuracy without speeding up training significantly.

Other ensemble methods

I also tried ensembles of models obtained in other ways, such as models trained on images after different processing methods (with/without class balancing, different image enhancement techniques, different data augmentation methods), but these approaches are more computationally expensive and do not give significantly better accuracy.

Generative adversarial networks

Data augmentation and class balancing, as seen previously, play a key role in model performance. Beyond classic image processing, generative models can be used to synthesize annotated data. For example, DAGAN models can be used for data augmentation, while BAGAN can be used for class balancing.

Greyscale ImageNet pre-training

The images in the provided dataset have content similar to the natural images composing the ImageNet dataset; the difference is that our images are black and white. Therefore, a model pre-trained on greyscale images would be even more relevant for this task.

Artificial image colorization

If I could not obtain a model pre-trained on greyscale images, my next idea was to colorize the images artificially, hopefully adding information. Pre-trained models for artificial image colorization do exist and are publicly available; let me know if you give this method a go!

