
What Is Deep Transfer Learning and Why Is It Becoming So Popular?

source link: https://towardsdatascience.com/what-is-deep-transfer-learning-and-why-is-it-becoming-so-popular-91acdcc2717a?gi=87c55c881e1c


A man sitting on a bridge in Austria

Introduction

As we already know, large and effective deep learning models are data-hungry. They require training on thousands or even millions of data points before they can make plausible predictions.

Training is very expensive, both in time and resources. For example, the popular language representation model BERT, developed by Google, was trained on 16 Cloud TPUs (64 TPU chips in total) for 4 days. Put into perspective, that is roughly 60 desktop computers running non-stop for 4 days.

The biggest problem, though, is that models like this one perform only a single task. Every future task requires a new set of data points as well as an equal or greater amount of resources.


Photo by Rachel on Unsplash

However, the human brain does not work that way. It does not discard previously obtained knowledge when solving a new task. Instead, it makes decisions based on things learnt from the past.

Transfer learning aims to mimic this behaviour.

What is transfer learning?

Transfer learning is an approach in deep learning (and machine learning) where knowledge is transferred from one model to another.

Def: Model A is successfully trained to solve a source task T.a using a large dataset D.a. However, the dataset D.b for a target task T.b is too small, preventing Model B from training effectively. Thus, we reuse part of Model A to make predictions for task T.b.

This goes against the common assumption that training and testing data must come from the same source or follow the same distribution.

Using transfer learning, we are able to solve a particular task using all or part of a model that has already been pre-trained on a different task.
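A minimal conceptual sketch of the definition above (my own illustration; the layer sizes, class counts and task names are hypothetical):

```python
# Conceptual sketch of the definition above (layer sizes and tasks are hypothetical).
import torch.nn as nn

model_a = nn.Sequential(                 # stand-in for large Model A, trained on D.a for T.a
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),                  # output head for source task T.a (10 classes)
)
# ... assume model_a's weights have already been trained on the large dataset D.a ...

body = nn.Sequential(*list(model_a.children())[:-1])   # keep everything except the T.a head
model_b = nn.Sequential(body, nn.Linear(128, 3))        # new head for target task T.b (3 classes)
# model_b can now be trained on the small dataset D.b, starting from Model A's knowledge.
```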

The renowned AI leader, Andrew Ng, explains this concept very well in the video below.

When to use transfer learning?

Transfer learning is becoming the go-to way of working with deep learning models. The reasons are explained below.

Lack of data

Deep learning models require a LOT of data to solve a task effectively. However, it is not often the case that so much data is available. For example, a company may wish to build a very specific spam filter for its internal communication system but does not possess a lot of labelled data.

In that case, a specific target task can be solved using a pre-trained model for a similar source task.

The tasks can be different but their domains should be the same.

In other words, you are unable to do transfer learning between speech recognition and image classification tasks since the input datasets are of different types.

What you can do is use an image classifier pre-trained on dog photos to classify cat photos.


Source: “How to build your own Neural Network from scratch in Python” by James Loy

Speed

Transfer learning cuts out a large percentage of the training time and allows solutions to be built far more quickly. In addition, it can remove the need to set up complex and costly cloud GPU/TPU instances.

Social good

Using transfer learning positively impacts the environment.

According to a study covered in the MIT Technology Review, a large neural network (200M+ parameters) trained on cloud TPUs produces CO2 emissions equivalent to those of ~6 cars over their lifetimes. Transfer learning can reduce the extensive use of these powerful processing units.

Deep Transfer Learning Strategies

Transfer learning can be applied through several different strategies, both in the deep learning and the broader machine learning space. In this article, I will cover only the deep learning techniques, referred to as Deep Transfer Learning Strategies.

There are 3 main strategies for doing transfer learning on deep learning models.

Direct use of pre-trained models

The simplest strategy is to solve a target task by directly applying a model from a source task.

Such models are typically large (millions of parameters) neural networks, trained for days or even weeks on state-of-the-art machines.

Large organisations (companies, universities, etc.) tend to release such models to the public, aiming to accelerate the development of the field.

Some pre-trained models used directly include the aforementioned BERT as well as YOLO (You Only Look Once), GloVe, UnsupervisedMT and more.
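To make this concrete, here is a minimal sketch of the direct-use strategy (my own illustration, not from the article), using a pre-trained torchvision ResNet-50 as a stand-in for the models listed above; the image path is hypothetical.

```python
# Minimal sketch: applying a pre-trained torchvision classifier directly,
# with no additional training (the image path is hypothetical).
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(pretrained=True)      # weights learnt on ImageNet
model.eval()                                  # inference only, no training

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg")             # any input image
batch = preprocess(image).unsqueeze(0)        # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)                     # scores for the 1,000 ImageNet classes
    predicted_class = logits.argmax(dim=1).item()
print(predicted_class)
```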

Leveraging feature extraction from pre-trained models

Instead of using the model end-to-end as in the previous example, we can treat the pre-trained neural network as a feature extractor by discarding the last fully-connected output layer.

This approach allows us to apply the pre-trained network directly to a new dataset and solve an entirely different problem.

It brings 2 main advantages:

  • Allows for specifying the dimensions of the last fully-connected layer.

For example, the pre-trained network may produce a 7 x 7 x 512 output from the layer prior to the last fully-connected one. We can flatten this to a vector of 25,088 values, which results in a new N x 25,088 feature matrix (N — the number of data points).

  • Allows for using a lightweight linear model (e.g. Linear SVM, Logistic Regression).

Since the complex pre-trained neural network is used only to produce features for the new task, we can train a simpler and faster linear model on top of those features with the new dataset.

The feature extraction strategy is best suited for situations when the target task dataset is very small.
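Here is a rough sketch of this strategy (my own illustration, not from the article), assuming the convolutional part of a pre-trained VGG16 as the extractor producing the 7 x 7 x 512 maps mentioned above; the training data is a random placeholder standing in for a small target dataset.

```python
# Minimal sketch: a pre-trained VGG16 as a fixed feature extractor,
# with a lightweight linear model trained on top (all data is placeholder).
import numpy as np
import torch
from torchvision import models
from sklearn.linear_model import LogisticRegression

vgg = models.vgg16(pretrained=True)
vgg.eval()
extractor = vgg.features                      # convolutional layers only; FC head discarded

def extract_features(images):
    """images: (N, 3, 224, 224) tensor -> (N, 25088) NumPy array."""
    with torch.no_grad():
        maps = extractor(images)              # (N, 512, 7, 7)
        return maps.flatten(start_dim=1).numpy()   # 7 * 7 * 512 = 25,088 features

# A tiny placeholder "target dataset" standing in for the small dataset D.b.
X_small = torch.randn(32, 3, 224, 224)
y_small = np.array([0, 1] * 16)               # two hypothetical classes

clf = LogisticRegression(max_iter=1000)
clf.fit(extract_features(X_small), y_small)   # fast to train on a small dataset
print(clf.score(extract_features(X_small), y_small))
```

The linear classifier trains in seconds on features like these, which is exactly why this strategy suits very small target datasets.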

Fine-tuning last layers of pre-trained models

We can go one step further by not only training the output classifier but also fine-tuning the weights of some layers of the pre-trained model.

Typically, the earlier layers of the network (especially in CNNs) are frozen, whereas the last ones are freed up for tuning.

This allows us to keep training the existing model while modifying only the parameters of its very last layers.

We choose to modify only the last layers since it has been observed that earlier layers in a network capture more generic features while later ones are very dataset-specific.

Let’s say our initial pre-trained model recognizes Mercedes cars with very high accuracy. The initial layers of this model tend to capture information about the position of the wheels, the shape of the car, its curves, etc. We can keep those when working on the next task of recognizing Ferrari cars. However, for the more specific Ferrari features, we should retrain the last layers with the new dataset.

Having said that, the fine-tuning strategy is best used when the target task dataset is reasonably large and shares a similar domain with the source one.
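Below is a minimal sketch of the fine-tuning strategy (my own illustration), assuming a torchvision ResNet-18 and a hypothetical two-class target task (e.g. Mercedes vs. Ferrari); the data batch and hyperparameters are placeholders.

```python
# Minimal sketch: fine-tuning only the last layers of a pre-trained ResNet-18.
# The two-class head, data batch and learning rate are placeholders.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)

# Freeze everything first: earlier layers keep their generic features.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last residual block and attach a new output layer
# for the target task (e.g. Mercedes vs. Ferrari -> 2 classes).
for param in model.layer4.parameters():
    param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, 2)   # new head, trainable by default

# Only the unfrozen parameters are optimised, with a small learning rate
# so the previously learnt knowledge is not destroyed.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)            # placeholder batch
labels = torch.randint(0, 2, (8,))

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```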

Resources

This article was inspired by a collection of papers and tutorials on transfer learning.

Thank you for reading. Hope you enjoyed this article. ❤️

