
What Is Deep Transfer Learning and Why Is It Becoming So Popular?

source link: https://towardsdatascience.com/what-is-deep-transfer-learning-and-why-is-it-becoming-so-popular-91acdcc2717a?gi=87c55c881e1c


A man sitting on a bridge in Austria

Introduction

As we already know, large and effective deep learning models are data-hungry. They require training on thousands or even millions of data points before they can make plausible predictions.

Training is very expensive, both in time and resources. For example, the popular language representation model BERT, developed by Google, was trained on 16 Cloud TPUs (64 TPU chips in total) for 4 days. Put into perspective, that is roughly 60 desktop computers running non-stop for 4 days.

The biggest problem, though, is that models like this one perform only a single task. Every future task requires a new set of data points as well as an equal or greater amount of resources.


Photo by Rachel on Unsplash

However, the human brain does not work that way. It does not discard previously obtained knowledge when solving a new task. Instead, it makes decisions based on things learnt from the past.

Transfer learning aims to mimic this behaviour.

What is transfer learning?

Transfer learning is an approach in deep learning (and machine learning) where knowledge is transferred from one model to another.

Def: Model A is successfully trained to solve a source task T.a using a large dataset D.a. However, the dataset D.b for a target task T.b is too small, preventing Model B from training effectively. Thus, we reuse part of Model A to make predictions for task T.b.

This goes against the common assumption that training and testing data must come from the same source or follow the same distribution.

Using transfer learning, we are able to solve a particular task using all or part of a model that has already been pre-trained on a different task.
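A minimal conceptual sketch of the definition above (my own illustration; the layer sizes, class counts and task names are hypothetical):

```python
# Conceptual sketch of the definition above (layer sizes and tasks are hypothetical).
import torch.nn as nn

model_a = nn.Sequential(                 # stand-in for large Model A, trained on D.a for T.a
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),                  # output head for source task T.a (10 classes)
)
# ... assume model_a's weights have already been trained on the large dataset D.a ...

body = nn.Sequential(*list(model_a.children())[:-1])   # keep everything except the T.a head
model_b = nn.Sequential(body, nn.Linear(128, 3))        # new head for target task T.b (3 classes)
# model_b can now be trained on the small dataset D.b, starting from Model A's knowledge.
```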

The renowned AI leader, Andrew Ng, explains this concept very well in the video below.

When to use transfer learning?

Transfer learning is becoming the go-to way of working with deep learning models. The reasons are explained below.

Lack of data

Deep learning models require a LOT of data to solve a task effectively. However, it is not often the case that so much data is available. For example, a company may wish to build a very specific spam filter for its internal communication system but does not possess a lot of labelled data.

In that case, a specific target task can be solved using a pre-trained model for a similar source task.

The tasks can be different but their domains should be the same.

In other words, you are unable to do transfer learning between speech recognition and image classification tasks since the input datasets are of different types.

What you can do is use an image classifier pre-trained on dog photos to classify cat photos.


Source: “How to build your own Neural Network from scratch in Python” by James Loy

Speed

Transfer learning cuts out a large percentage of the training time and allows solutions to be built far more quickly. In addition, it can remove the need to set up complex and costly cloud GPU/TPU instances.

Social good

Using transfer learning positively impacts the environment.

According to a study covered in the MIT Technology Review, a large neural network (200M+ parameters) trained on cloud TPUs produces CO2 emissions equivalent to those of ~6 cars over their lifetimes. Transfer learning can reduce the extensive use of these powerful processing units.

Deep Transfer Learning Strategies

Transfer learning can be applied through several different strategies, both in the deep learning and the broader machine learning space. In this article, I will cover only the deep learning techniques, referred to as Deep Transfer Learning Strategies.

There are 3 main strategies for doing transfer learning on deep learning models.

Direct use of pre-trained models

The simplest strategy is to solve a target task by directly applying a model from a source task.

Such models are typically large (millions of parameters) neural networks, trained for days or even weeks on state-of-the-art machines.

Large organisations (companies, universities, etc.) tend to release such models to the public, aiming to accelerate the development of the field.

Some pre-trained models used directly include the aforementioned BERT as well as YOLO (You Only Look Once), GloVe, UnsupervisedMT and more.
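To make this concrete, here is a minimal sketch of the direct-use strategy (my own illustration, not from the article), using a pre-trained torchvision ResNet-50 as a stand-in for the models listed above; the image path is hypothetical.

```python
# Minimal sketch: applying a pre-trained torchvision classifier directly,
# with no additional training (the image path is hypothetical).
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(pretrained=True)      # weights learnt on ImageNet
model.eval()                                  # inference only, no training

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg")             # any input image
batch = preprocess(image).unsqueeze(0)        # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)                     # scores for the 1,000 ImageNet classes
    predicted_class = logits.argmax(dim=1).item()
print(predicted_class)
```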

Leveraging feature extraction from pre-trained models

Instead of using the model end-to-end as in the previous example, we can treat the pre-trained neural network as a feature extractor by discarding the last fully-connected output layer.

This approach allows us to apply the pre-trained network directly to a new dataset and solve an entirely different problem.

It brings 2 main advantages:

  • Allows for specifying the dimensions of the last fully-connected layer.

For example, the pre-trained network may produce a 7 x 7 x 512 output from the layer prior to the last fully-connected one. We can flatten this to a vector of 25,088 values, which results in a new N x 25,088 feature matrix (N — the number of data points).

  • Allows for using a lightweight linear model (e.g. Linear SVM, Logistic Regression).

Since the complex pre-trained neural network is used only to produce features for the new task, we can train a simpler and faster linear model on top of those features with the new dataset.

The feature extraction strategy is best suited for situations when the target task dataset is very small.
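Here is a rough sketch of this strategy (my own illustration, not from the article), assuming the convolutional part of a pre-trained VGG16 as the extractor producing the 7 x 7 x 512 maps mentioned above; the training data is a random placeholder standing in for a small target dataset.

```python
# Minimal sketch: a pre-trained VGG16 as a fixed feature extractor,
# with a lightweight linear model trained on top (all data is placeholder).
import numpy as np
import torch
from torchvision import models
from sklearn.linear_model import LogisticRegression

vgg = models.vgg16(pretrained=True)
vgg.eval()
extractor = vgg.features                      # convolutional layers only; FC head discarded

def extract_features(images):
    """images: (N, 3, 224, 224) tensor -> (N, 25088) NumPy array."""
    with torch.no_grad():
        maps = extractor(images)              # (N, 512, 7, 7)
        return maps.flatten(start_dim=1).numpy()   # 7 * 7 * 512 = 25,088 features

# A tiny placeholder "target dataset" standing in for the small dataset D.b.
X_small = torch.randn(32, 3, 224, 224)
y_small = np.array([0, 1] * 16)               # two hypothetical classes

clf = LogisticRegression(max_iter=1000)
clf.fit(extract_features(X_small), y_small)   # fast to train on a small dataset
print(clf.score(extract_features(X_small), y_small))
```

The linear classifier trains in seconds on features like these, which is exactly why this strategy suits very small target datasets.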

Fine-tuning last layers of pre-trained models

We can go one step further by not only training the output classifier but also fine-tuning the weights of some layers of the pre-trained model.

Typically, the earlier layers of the network (especially in CNNs) are frozen, whereas the last ones are freed up for tuning.

This allows us to keep training the existing model while modifying only the parameters of its very last layers.

We choose to modify only the last layers since it has been observed that earlier layers in a network capture more generic features while later ones are very dataset-specific.

Let’s say our initial pre-trained model recognizes Mercedes cars with very high accuracy. The initial layers of this model tend to capture information about the position of the wheels, the shape of the car, its curves, etc. We can keep those when working on the next task of recognizing Ferrari cars. However, for the more specific Ferrari features, we should retrain the last layers with the new dataset.

Having said that, the fine-tuning strategy is best used when the target task dataset is reasonably large and shares a similar domain with the source one.
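Below is a minimal sketch of the fine-tuning strategy (my own illustration), assuming a torchvision ResNet-18 and a hypothetical two-class target task (e.g. Mercedes vs. Ferrari); the data batch and hyperparameters are placeholders.

```python
# Minimal sketch: fine-tuning only the last layers of a pre-trained ResNet-18.
# The two-class head, data batch and learning rate are placeholders.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)

# Freeze everything first: earlier layers keep their generic features.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last residual block and attach a new output layer
# for the target task (e.g. Mercedes vs. Ferrari -> 2 classes).
for param in model.layer4.parameters():
    param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, 2)   # new head, trainable by default

# Only the unfrozen parameters are optimised, with a small learning rate
# so the previously learnt knowledge is not destroyed.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)            # placeholder batch
labels = torch.randint(0, 2, (8,))

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```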

Resources

This article was inspired by a collection of papers and tutorials on transfer learning.

Thank you for reading. Hope you enjoyed this article. ❤️

