

Language Modeling II: ULMFiT and ELMo
source link: https://towardsdatascience.com/language-modelingii-ulmfit-and-elmo-d66e96ed754f?gi=1613bf1960bb

This is Part 2 of a 4-part series on language modeling.
Introduction
In the previous post, we covered the concept of language modeling and how it differs from regular pre-trained embeddings like word2vec and GloVe.
On our journey towards REALM (Retrieval-Augmented Language Model Pre-Training), we will briefly walk through these seminal works on language models: ELMo and ULMFiT.
ELMo: Embeddings from Language Models (2018)
Pre-trained word embeddings like word2vec and GloVe are a crucial component of many neural language understanding models. If we stick to GloVe embeddings for our language modeling task, the word 'major' gets the same representation regardless of the context it appears in. Yet context plays a major role in how humans perceive what a word means.
E.g., 'major: an army officer of high rank' and 'major: important, serious, or significant' would be assigned the same GloVe vector for the word 'major'.
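As a quick illustration (using made-up toy vectors, not real GloVe values), a static lookup returns the same vector for 'major' in both senses, no matter what surrounds it:

```python
import numpy as np

# Toy static embedding table (made-up 4-d vectors, not real GloVe values).
static_embeddings = {"major": np.array([0.2, -0.1, 0.7, 0.4])}

sent_a = "the major ordered the troops to advance".split()
sent_b = "this is a major breakthrough in physics".split()

# A static lookup ignores the surrounding words entirely:
vec_a = static_embeddings["major"]
vec_b = static_embeddings["major"]
print(np.array_equal(vec_a, vec_b))  # True -- identical vector in both contexts

# A contextual encoder (ELMo-style) would instead encode the whole sentence,
# so 'major' in sent_a and sent_b would receive different vectors.
```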
The task of creating such high-quality representations is hard. To make it concrete, any word representation should model:
- Syntax and Semantics: complex characteristics of word use
- Polysemy: the coexistence of many possible meanings for a word or phrase across linguistic contexts
ELMo introduces a deep contextualized word representation that tackles the challenges we defined above while still being easy to integrate into existing models. It achieved state-of-the-art results on a range of demanding language understanding problems such as question answering, NER, coreference resolution, and textual entailment (SNLI).
Contextualized word embeddings
Representations that capture the word's meaning along with the information available in its context are referred to as contextual embeddings. Unlike word2vec or GloVe, which use a static representation per word, ELMo uses a bi-directional LSTM that looks at the whole sentence before encoding a word. Much like we observed in the previous article, ELMo's LSTM is trained on an enormous text dataset (in the same language as our downstream task). Once this pre-training is done, we can reuse the distilled word representations as a building block for other NLP tasks.
How do we train the model on this huge dataset?
We simply train our model to predict the next word given a sequence of words, i.e. language modeling itself. We can easily do this because such a dataset needs no explicit labels of the kind required in other supervised learning tasks.
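As a rough illustration of this next-word-prediction objective (a minimal PyTorch sketch, not ELMo's actual biLM training code), the model is trained to predict token t+1 from the tokens up to t, with the text itself providing the targets:

```python
import torch
import torch.nn as nn

# Minimal next-word-prediction setup (illustrative sizes, single forward LSTM).
vocab_size, embed_dim, hidden_dim = 10_000, 128, 256

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 12))   # a fake tokenized sentence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens <= t

hidden_states, _ = lstm(embed(inputs))           # (1, 11, hidden_dim)
logits = to_vocab(hidden_states)                 # (1, 11, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # no labels needed beyond the raw text
```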
ELMo Architecture
Consisting of one forward and one backward language model, ELMo's hidden states have access to both the next word and the previous word. Each hidden layer consists of a forward and a backward LSTM, so the language model can view hidden states from either direction. The figure in the original article illustrates how these LSTMs have access to the other hidden states.
Once the forward and backward language models have been trained, ELMo concatenates their hidden states at each layer into a single vector. Each such concatenated layer vector is then multiplied by a task-specific weight.
ELMo then takes a weighted summation of these concatenated layer vectors and assigns the result to the particular token being processed from the input text. In other words, ELMo represents a token t_k as a linear combination of its corresponding hidden layers (including the token's embedding), so each token in the input text gets its own context-dependent embedding from ELMo.
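A minimal sketch of this linear combination, following the paper's formulation ELMo_k = γ · Σ_j s_j · h_{k,j} (the layer values below are random stand-ins; in practice the weights s_j and scalar γ are learned jointly with the downstream task):

```python
import numpy as np

# Stand-in biLM outputs for one token t_k: L+1 layers (token embedding plus two
# LSTM layers), each a concatenation of forward and backward states.
num_layers, dim = 3, 8
h_k = np.random.randn(num_layers, dim)        # h_{k,j} for j = 0..L

# Task-specific softmax-normalized layer weights s_j and scalar gamma.
raw_weights = np.random.randn(num_layers)
s = np.exp(raw_weights) / np.exp(raw_weights).sum()
gamma = 1.0

# ELMo_k = gamma * sum_j s_j * h_{k,j}
elmo_k = gamma * (s[:, None] * h_k).sum(axis=0)
print(elmo_k.shape)  # (8,) -- one contextual vector for token t_k
```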
Once ELMo's biLMs (bi-directional language models) have been trained on a huge text corpus, they can be integrated into almost all neural NLP tasks by simply concatenating the ELMo vectors to the task model's embedding layer.
The higher layers seem to capture semantics, while the lower layers capture syntactic features. Additionally, ELMo-enhanced models can make use of small datasets more efficiently.
You can read more about ELMo in the original paper, "Deep Contextualized Word Representations" (Peters et al., 2018).
ULMFiT (2018)
Before ULMFiT, inductive transfer learning was widely used in computer vision, but existing approaches in NLP still required task-specific modifications and training from scratch. ULMFiT proposed an effective transfer learning method that can be applied to any NLP task and demonstrated techniques that are key to fine-tuning a language model.
Instead of random initialization of model parameters, we can reap the benefits of pre-training and speed up the learning process.
Regular LSTM units are used for the 3-layer architecture of ULMFiT, taking a cue from AWD-LSTM.
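A minimal sketch of such a 3-layer LSTM backbone (the sizes roughly follow the paper's setup; the real AWD-LSTM adds weight-dropping and other regularization not shown here):

```python
import torch.nn as nn

# Simplified 3-layer LSTM language model in the spirit of ULMFiT's backbone.
class SimpleLSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=30_000, embed_dim=400, hidden_dim=1150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,
                            batch_first=True, dropout=0.3)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.decoder(hidden)  # next-word logits at every position
```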
ULMFiT comprises three stages:
- General Domain LM Pre-Training: The language model is trained on a general-domain corpus to capture general features of language in different layers
- Target task Discriminative Fine-Tuning: The trained language model is fine-tuned on a target task dataset using discriminative fine-tuning and learning rate schedules (slanted triangular learning rate) to learn task-specific features
- Target task Classifier Fine-Tuning: Fine-tuning the classifier on the target task using gradual unfreezing and repeating stage 2. This helps the network to preserve low-level representations and adapt to high-level ones.
As the figure in the original article shows, Stage 1 uses the same learning rate across all layers, whereas Stages 2 and 3 use layer-wise slanted triangular learning rate schedules. Also note how the layer weights gradually approach their optimal values across the three-stage process (the darker color denotes values closer to optimal, for illustration purposes).
Discriminative fine-tuning (the learning-rate scheme for Stages 2 and 3, combined with the slanted triangular learning rate) is a major contribution of this paper. It draws on the intuition that different layers of a model capture different types of features, so it makes sense to use a different learning rate for each of them. As in computer vision, the initial layers of a language model capture the most general information about the language, and hence, once pre-trained, they require the least fine-tuning.
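A minimal sketch of these two ideas, using the slanted triangular schedule and the per-layer decay factor of 2.6 suggested in the paper (the wiring below is illustrative, not the authors' fastai implementation):

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate: short linear warm-up, long linear decay."""
    cut = math.floor(total_steps * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Discriminative fine-tuning: deeper (earlier) layers get smaller learning rates,
# decayed by a factor of 2.6 per layer, with the last layer getting the base rate.
base_lr = 0.01
layer_lrs = [base_lr / (2.6 ** depth) for depth in range(3)]  # last layer first
print(layer_lrs)
print([round(slanted_triangular_lr(t, total_steps=100), 5) for t in (0, 5, 10, 50, 99)])
```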
After Stage 2, the model is already very close to the weights needed for the target task, which makes target-task classifier fine-tuning very sensitive: if fine-tuning changes the weights too aggressively at this stage, all the benefits of pre-training are lost. To address this issue, the paper proposes gradual unfreezing (a minimal sketch follows the list):
- To start, only the last layer is unfrozen and the model is fine-tuned for one epoch
- Next, the layer before it is also unfrozen and fine-tuned
- This process is repeated, unfreezing one more layer at a time, until all layers are fine-tuned to convergence
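A minimal PyTorch-style sketch of gradual unfreezing (the layer grouping and the train_one_epoch callable are hypothetical placeholders, not the authors' fastai code):

```python
import torch.nn as nn

def gradual_unfreezing(layer_groups, train_one_epoch):
    """Unfreeze one layer group per stage, starting from the output side.

    layer_groups: modules ordered from input side to output side.
    train_one_epoch: hypothetical callable that fine-tunes the trainable params.
    """
    for group in layer_groups:                # freeze everything first
        for param in group.parameters():
            param.requires_grad = False
    for group in reversed(layer_groups):      # unfreeze last group, train, repeat
        for param in group.parameters():
            param.requires_grad = True
        train_one_epoch()

# Example with hypothetical layer groups: embedding, LSTM stack, classifier head.
groups = [nn.Embedding(1000, 32), nn.LSTM(32, 64, batch_first=True), nn.Linear(64, 2)]
gradual_unfreezing(groups, train_one_epoch=lambda: None)  # stand-in training step
```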
You can read the ULMFiT paper, "Universal Language Model Fine-tuning for Text Classification" (Howard and Ruder, 2018).
Hopefully, this blog helped you build a basic understanding of the exciting field of pre-trained language models!
In the next blog, we will be discussing Transformers and BERT for learning fine-tunable pre-trained models.