
Google Landmark Recognition using Transfer Learning

Picture by Kevin Widholm


Image classification with 15k classes!

Project by Catherine McNabb, Anuraag Mohile, Avani Sharma, Evan David, Anisha Garg

Dealing with a large number of classes, many of which have very few images, is what makes this task really challenging!

The problem comes from a famous Kaggle competition, the Google Landmark Recognition Challenge. The training set contains over 1.2 million images spread across 14,951 classes of landmarks, varying from one to thousands of images per class. This kind of extreme classification problem has become very prevalent in the data science community with the advancement of deep learning. One important thing to note about this dataset is that the given test set included many images that were not landmarks at all, which we refer to as “junk” images.

The huge learning potential of this kind of dataset is what motivated our team to take this up as our course project for Advanced Predictive Modeling. This blog details our attempt at the Google Landmark Recognition Challenge. We will talk about the platform used, our approach to the problem, data processing, the models we tried, our results, and some of the challenges we faced along the way.

See this GitHub repo for the code.

Outline of Approach

To begin with, we worked on a data sample of 100 classes on our local systems to set the whole process up.

A lesson from this pilot run — model training was 6 times faster with images of smaller resolution.

Downloading images at different resolutions takes approximately the same time (most of the time is spent opening the URL). In the interest of time, we decided to work with images of resolution 96x96 rather than full resolution for our project. Final results were produced on a data sample of 2000 classes with resized images.


Next, we needed to choose a platform to run such intensive code on a huge dataset. General options are Amazon Web Services (AWS), Google Cloud Platform (GCP), or any other big data processing platform. We chose GCP because of the $300 free credit offered to new accounts. This blog is an excellent step-by-step guide to creating virtual instances in GCP specifically for image recognition CNNs. We ended up using one NVIDIA Tesla K80 GPU, 8 CPUs with 50 GB of RAM, and 500 GB of disk space.

Before we could implement CNNs from Keras for image classification, we needed to do some general data preprocessing. The images had to be organized by class within the training and validation sets, so we set up our datasets segregated accordingly.

After data pre-processing, our method used the VGG16 pre-trained model on the data along with DeLF for dealing with difficult test images. DeLF matches local features in images, which we discuss in detail in the later half of this post. In the end we used a sample test set drawn from the training data to compute an accuracy score, and our model was able to reach up to 83% accuracy (claps!/gasps!).

I. Obtaining the Image data

Data came in the form of CSVs with image URLs and landmark IDs. Our sample of 2000 classes contained class labels (Landmark IDs) from ‘1000’ to ‘2999’. Data pre-processing for this project can be broadly classified into the following steps:

a) Train-validation-holdout data: Since the files in the test folder were mostly junk and did not contain any landmarks, we had to make our own holdout set from training.csv for testing our final models. We took 1% of the images from each class to make the holdout dataset. Next was the train-validation split on the remaining 99% of the data: 20% of the images from each class were labelled as the validation set and the remaining 80% were used for training. At the end of this step, the data distribution was as follows.

The holdout (test) set had 1,183 rows, the validation data had 31,651 rows, and the training data had 130,551 rows.
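Here is a minimal sketch of this split in pandas, assuming the usual Kaggle columns (id, url, landmark_id); the random seeds and output file names are illustrative:

```python
import pandas as pd

df = pd.read_csv("train.csv")                             # image id, url, landmark_id
df = df[df["landmark_id"].between(1000, 2999)]            # our 2000-class sample

# 1% of each class goes to the holdout set
holdout = df.groupby("landmark_id", group_keys=False).apply(
    lambda g: g.sample(frac=0.01, random_state=42))
rest = df.drop(holdout.index)

# 20% of the remaining images per class become the validation set
val = rest.groupby("landmark_id", group_keys=False).apply(
    lambda g: g.sample(frac=0.20, random_state=42))
train = rest.drop(val.index)

for name, split in [("train", train), ("validation", val), ("holdout", holdout)]:
    split.to_csv(f"{name}_split.csv", index=False)
    print(name, len(split))
```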

b) Fetch the image files: After having these splits ready with image URLs and unique IDs, the next step was downloading the images into appropriate folders. Three separate folders were created, one each for the train, validation and holdout sets. Resized images (96x96) were downloaded to their respective folders (on GCP), which took about 9 hours to complete.
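A rough sketch of that download step (not our exact script; the paths and timeout are illustrative):

```python
import io
import os

import requests
from PIL import Image

def download_resized(url, image_id, out_dir, size=(96, 96)):
    """Fetch one image, resize it to 96x96 and save it under out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        img = Image.open(io.BytesIO(resp.content)).convert("RGB")
        img.resize(size, Image.BILINEAR).save(os.path.join(out_dir, f"{image_id}.jpg"))
    except Exception as exc:                      # broken URLs are simply skipped
        print(f"failed {image_id}: {exc}")

# e.g. for each row of a split: download_resized(row.url, row.id, "data/train")
```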

c) Make the directory structure: The train and validation data had to be in a specific directory format for us to be able to use Keras functionalities. We transformed our data to a structure where each class is a subfolder inside the Train/Validation folder, containing all the image files belonging to that class.
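The expected layout, and a sketch of moving the downloaded files into it (folder and column names here are illustrative):

```python
# Layout that Keras' flow_from_directory expects:
#
#   Train/
#     1000/  img_a.jpg  img_b.jpg ...
#     1001/  ...
#   Validation/
#     1000/  ...
import os
import shutil

import pandas as pd

def build_directory(split_csv, src_dir, dst_root):
    df = pd.read_csv(split_csv)
    for _, row in df.iterrows():
        class_dir = os.path.join(dst_root, str(row["landmark_id"]))
        os.makedirs(class_dir, exist_ok=True)
        src = os.path.join(src_dir, f"{row['id']}.jpg")
        if os.path.exists(src):                   # skip images that failed to download
            shutil.move(src, class_dir)

build_directory("train_split.csv", "data/train", "Train")
build_directory("validation_split.csv", "data/validation", "Validation")
```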

II. Data Preprocessing

Data Cleaning

Some of the URL links were broken and the downloaded files were corrupt. All such files were removed by filtering out anything of 1,000 bytes or smaller before moving the files to their directories. Next, we found that some classes went missing from the validation folder. This was because of the 80–20 split: classes with 4 or fewer images contributed nothing to the validation set (because 20% of 4 < 1). Missing folders in the validation data created problems with model training, so empty folders were created for those classes.
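Both cleaning steps are simple to script; a sketch, using the folder names above:

```python
import os

# 1) Drop corrupt downloads: anything of 1000 bytes or smaller is not a real image
for split in ("Train", "Validation", "Holdout"):
    for root, _, files in os.walk(split):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) <= 1000:
                os.remove(path)

# 2) Ensure every training class also has a (possibly empty) validation folder
for class_name in os.listdir("Train"):
    os.makedirs(os.path.join("Validation", class_name), exist_ok=True)
```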

Image Augmentation

The images in our data were taken at various angles and with various horizontal shifts, zoom levels, etc., and similar variations are expected in the test data. To account for them, we did image augmentation with non-zero values for height and width shift, zoom, brightness, rotation and shear. We did not store the augmented images on disk, but instead fed them directly to the model. Which parameters to use was decided by eyeballing the images, and their specific values were tuned based on the accuracy obtained during model training.
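A sketch of this on-the-fly augmentation with Keras' ImageDataGenerator (the ranges shown are illustrative, not our final tuned values):

```python
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,   # VGG16-style input preprocessing
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.1,
    zoom_range=0.1,
    brightness_range=(0.8, 1.2))

# Augmented batches are generated on the fly, never written to disk
train_generator = train_datagen.flow_from_directory(
    "Train", target_size=(96, 96), batch_size=240, class_mode="categorical")
```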

III. Model Training

Several CNN architectures are available in the Keras library, for example VGG16, VGG19, Inception, etc. While VGG19 and Inception are heavier architectures (more parameters to train), training VGG16 was doable in the time frame that we had.

An image classified as triumphal_arch using ImageNet

Starting simple, we used VGG16 pre-trained on the ImageNet dataset to predict the landmarks. We observed that the ImageNet weights successfully capture generic features from our images. This made VGG16 (with ImageNet weights) our model of choice for transfer learning.

Our approach was to use the bottleneck layers of VGG16 as feature extractors and train the top 3 layers to classify the landmarks using more specific features. After reading some blogs on best practices in transfer learning, we incorporated an additional step of initializing the weights of the top three layers before training the entire network.

So our final process looked like this (Keras sketches of these steps follow the list):

  1. Convert images (96x96) to feature vectors using the ImageNet weights on the bottleneck layers of VGG16.
  2. Initialize the weights of the top three layers. For this, a model with just three dense layers (two with ReLU activation, one with softmax) was trained, taking the image vectors obtained in the previous step as input. The weights thus obtained were saved to be used later.
  3. Compile the entire VGG16 model: the three initialized layers from the previous step were added on top of the VGG16 bottom layers. We chose to freeze only the bottom 16 layers as opposed to all 19 so the network could learn better on images outside ImageNet. This gave us a good jump in accuracy.
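A hedged sketch of steps 1 and 2 using the tf.keras API. The layer sizes, optimizer and the in-memory arrays (x_train, y_train, x_val, y_val) are placeholders; in practice we streamed images in batches of 240:

```python
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

NUM_CLASSES = 2000

# Step 1: extract bottleneck features with the VGG16 convolutional base
base = VGG16(weights="imagenet", include_top=False, input_shape=(96, 96, 3))
train_features = base.predict(preprocess_input(x_train), batch_size=240)
val_features = base.predict(preprocess_input(x_val), batch_size=240)

# Step 2: train a small top model (two ReLU layers, one softmax) on these
# features, then save its weights for the fine-tuning step
top_model = Sequential([
    Flatten(input_shape=train_features.shape[1:]),
    Dense(512, activation="relu"),        # layer sizes are illustrative
    Dense(256, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
top_model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
top_model.fit(train_features, y_train, epochs=15, batch_size=240,
              validation_data=(val_features, y_val))
top_model.save_weights("top_model_weights.h5")
```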

The variables we explored in the compiled model were the number of layers to be trained, the type of optimizer, the hyperparameters of the chosen optimizer, the image augmentation parameters, the batch size and the number of epochs. Details on these are discussed below. The loss function was fixed as categorical cross-entropy.

IIIa. Number of layers to train

We took the VGG16 architecture only up to the last convolutional layer, and added the fully-connected layers on top as sequential layers. This matters because using the fully-connected layers from VGG16 as-is forces a fixed input size for the model (224 x 224, the original ImageNet format). By keeping only the convolutional modules, our model can be adapted to arbitrary input sizes; in our case 96 x 96 (image height x image width).

With the newly compiled VGG16 architecture, we varied the number of layers with frozen weights while training on our data. In intermediate runs we also experimented with added dropout layers. Accuracy tended to drop with dropout for both the validation and training sets, so it was removed.

The best accuracies were achieved with only three dense layers (two ReLU, one softmax) and the ImageNet weights frozen in the bottom 16 layers.
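Continuing the sketch above (reusing base, top_model and train_generator from the earlier snippets), step 3 stacks the initialized top model on the convolutional base, freezes the bottom 16 layers and fine-tunes the rest; the optimizer settings are illustrative:

```python
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

val_generator = ImageDataGenerator(
    preprocessing_function=preprocess_input).flow_from_directory(
    "Validation", target_size=(96, 96), batch_size=240, class_mode="categorical")

top_model.load_weights("top_model_weights.h5")     # weights initialized in step 2
full_model = Model(inputs=base.input, outputs=top_model(base.output))

for layer in full_model.layers[:16]:               # keep ImageNet weights frozen below
    layer.trainable = False

full_model.compile(optimizer=Adam(learning_rate=1e-4),  # learning rate is illustrative
                   loss="categorical_crossentropy", metrics=["accuracy"])
full_model.fit(train_generator, validation_data=val_generator, epochs=30)
```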

IIIb. Optimizers

We tried various optimizers including SGD with momentum, Adam and Adagrad. The hyperparameters were first set to standard values and then varied in the direction of improving accuracy. No single learning rate worked for all optimizers. We observed that training time grew linearly with data size and that the same learning rate also worked on larger data. The hyperparameters we tuned were the learning rate, momentum and decay of these optimizers. After trials with a few adaptive-learning-rate gradient descent algorithms, we found Adam worked best for our problem. This article was of great help in the hyperparameter tuning process.
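For illustration, the kinds of optimizer configurations we swept over (the values shown are placeholders, not our final settings):

```python
from tensorflow.keras.optimizers import Adagrad, Adam, SGD

# Each candidate was plugged into the compile step of the model above for a
# separate training run, reloading the initial weights so the comparison is fair
candidates = {
    "sgd_momentum": SGD(learning_rate=1e-3, momentum=0.9),
    "adam": Adam(learning_rate=1e-4),
    "adagrad": Adagrad(learning_rate=1e-3),
}
```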

IIIc. Image Augmentation

After observing the effect of each parameter on the images, we looked at their cumulative effect on our model results. The chosen ranges conformed to our observations from eyeballing the pictures, which were sometimes selfies taken by people visiting the monument; for instance, a rotation range of 15–30 degrees would yield good accuracy levels.

IIId. Batch Size

A batch size had to be chosen in two places.

The first was for applying image augmentation when converting images to feature vectors (using the VGG16 bottleneck layers), where images are processed in batches. This implied that the number of training and validation images should be a multiple of this batch size. Our final model used a batch size of 240.

The second was during model training. Choosing the batch size is a tradeoff between training time and model accuracy: a batch size of 1 essentially does stochastic gradient descent, taking one observation at a time. With such a huge dataset, the batch size had to be decided keeping in mind the memory constraints of our systems. We were able to go up to a batch size of 240 with reasonable accuracy and run times.

IIIe. Number of Epochs

Once good accuracy levels were achieved with a specific setting of the above hyperparameters, we increased the number of epochs from 15 to 24 or 30, which yielded better results. Each epoch took roughly 15 minutes with a single NVIDIA Tesla K80 GPU and 8 CPUs on GCP (with a batch size of 240), so we increased the epoch count only selectively.


Final Model

Our model trained with the VGG16 architecture gave us an accuracy of 78.60% on the validation set. We worked exhaustively on increasing accuracy by studying the effects of the varied parameters before moving on to other models like Inception and ResNet. For our final model, the accuracy saturated to nearby values after 30 epochs. The final model had the parameters given below:

[Table: final model hyperparameters]

So now the model is trained and we are getting a good validation accuracy of 78.6%. What next? Let us move on to predictions on test images. BUT…

The test data mostly comprised unrelated images which were not even landmarks

This was a problem because deep neural networks can be fooled easily: even unrelated images are classified, with confidence, as one of the classes. So, how do we tell whether the prediction of the neural network is correct or not?

IV. DeLF (DEep Local Features)

To deal with this problem, we utilized a local feature descriptor for large-scale image retrieval called DeLF. It extracts local features from images and matches them. We used it for matching local features of test images to images known to be landmarks.

DeLF was recently developed at Google; the details can be found in this paper.

DeLF Pipeline

The DeLF architecture selects the features with the highest attention scores; the query image is then passed through the network and its features are matched with those of the database images using geometric verification. The matching of features is done through RANSAC (Random Sample Consensus), and the number of inliers is used to make the decision.

The threshold on the number of inliers (for the ‘landmark’ vs. ‘no landmark’ classification) is decided by us and can be varied depending on the resolution of the images: the higher the resolution, the more local features are available for matching. The predictions obtained from the deep network are then sent through DeLF, and the test image is compared to the database images from the predicted class.

There are 3 possible situations of matching:


The first case has the most inliers, the second has fewer, and in the case of no landmark there are none to only a few inliers (again, the actual numbers depend on the resolution of the images we decide to work with).

Thus, by setting an appropriate threshold, we were able to decide whether a test image was actually a landmark or not.
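A sketch of this verification step, assuming DeLF descriptors and keypoint locations have already been extracted (e.g. with the pre-trained DeLF module) for the query image and a database image of the predicted class; the thresholds below are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree
from skimage.measure import ransac
from skimage.transform import AffineTransform

def count_inliers(query_locs, query_descs, db_locs, db_descs,
                  descriptor_threshold=0.8):
    """Match DeLF descriptors, then geometrically verify the matches with RANSAC."""
    # 1) Putative matches: nearest-neighbour search in descriptor space
    tree = cKDTree(db_descs)
    dists, idx = tree.query(query_descs, distance_upper_bound=descriptor_threshold)
    matched = dists != np.inf
    if matched.sum() < 3:                 # too few matches to fit an affine model
        return 0
    # 2) Geometric verification: fit an affine transform and count inliers
    _, inliers = ransac((query_locs[matched], db_locs[idx[matched]]),
                        AffineTransform, min_samples=3,
                        residual_threshold=20, max_trials=1000)
    return 0 if inliers is None else int(inliers.sum())

INLIER_THRESHOLD = 30                     # tuned by hand; depends on image resolution
# is_landmark = count_inliers(q_locs, q_descs, d_locs, d_descs) >= INLIER_THRESHOLD
```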

V. Results

On the test set, i.e. the holdout set that was kept aside in the beginning, the model reached up to 83% accuracy.

To get a better understanding of model behavior, we checked some misclassified images and gathered interesting observations. A couple of misclassified test images looked very similar to the class they were assigned to, so the misclassification was understandable.

Images that confused our model

VI. Conclusion

Summary

We attempted to train CNNs to achieve a good accuracy on the Google Landmark Recognition Challenge. For this purpose, Google Cloud Platform was used to get machines with the required capabilities. First, the training image dataset was split into train, validation and test images, and the images were resized before downloading to make the challenge more tractable. For model training, the VGG16 neural network was used with transfer learning from ImageNet. We tried variations in #layers, optimizers, hyperparameters, image augmentation, batch size and #epochs to improve the validation accuracy. Finally, we used DeLF to deal with the actual test dataset, which comprised mostly unrelated images.

Next Steps

For this project, we faced many constraints due to the short period of time and the available compute resources. Definite next steps will be to try heavier models like Inception Net (which kept throwing OOM errors on our configured systems), increase the batch size and work on a bigger sample (or the entire dataset). We also tried increasing the image resolution to 224x224, but this increased the training time for one epoch to 90 minutes. Given more time, we would definitely explore the impact of image resolution on accuracy.

Other possible steps to improve or build upon our current work could be:

  • Creating a pipeline of DeLF and the CNN for test images to ensure that only valid landmarks are fed to the CNN for classification.
  • Once more CNNs like Inception Net and ResNet are trained, trying ensembles of these models.

VII. Learnings from the process

As we look back, our team had numerous learnings from this entire process. Here we jot down a few key takeaways which we wish we had known before starting the project.

  1. Make use of pre-built disk images for image recognition. Trying to configure your own system by installing all dependencies is a nightmare!
  2. Check if TensorFlow is actually using GPU acceleration (see the snippet after this list) before waiting for hours for your epochs to run.
  3. ALWAYS do a pilot run (from the very beginning to the end) of the entire process on a smaller dataset.
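With a recent TensorFlow this check is a one-liner:

```python
import tensorflow as tf

# An empty list here means training will silently fall back to the CPU
print(tf.config.list_physical_devices("GPU"))
```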

Many thanks to our professor Dr. Joydeep Ghosh, who has been a great help in this journey. Let us know if you have any comments or suggestions. Hope this blog helps you in designing your own image classification project!

