Deep Learning for NLP: Word Embeddings

 3 years ago
source link: https://towardsdatascience.com/deep-learning-for-nlp-word-embeddings-4f5c90bcdab5?gi=6a22412d3cbb

Simple and intuitive Word Embeddings

Introduction

Word embeddings have become one of the most widely used tools in Artificial Intelligence, and one of the main drivers of its achievements on tasks that require processing natural language, such as speech or text.

In this post, we will unveil the magic behind them, see what they are, why they have become a standard in the Natural Language Processing (NLP hereinafter) world, how they are built, and explore some of the most used word embedding algorithms.

Everything will be explained in a simple and intuitive manner, avoiding complex maths and keeping the content of the post as accessible as possible.

It will be broken down into the following subsections:

  1. What are word embeddings?
  2. Why should we use word embeddings?
  3. How are word embeddings built?
  4. What are the most popular word embeddings?

Once you are ready, let's start by seeing what word embeddings are.

1) What are word embeddings?

Computers break everything down to numbers, or more specifically to bits (zeros and ones). What happens when a piece of software inside a computer (a Machine Learning algorithm, for example) has to operate on or process a word? Simple: the word needs to be given to the computer in the only form it can understand, as numbers.

In NLP, the simplest way to do this is by creating a vocabulary with a huge number of words (100,000 words, let’s say), and assigning a number to each word in the vocabulary.

The first word in our vocabulary (‘apple’, maybe) will be number 0. The second word (‘banana’) will be number 1, and so on up to number 99,998 for the second-to-last word (‘king’) and 99,999 for the last word (‘queen’).

Then we represent every word as a vector of length 100,000, where every single item is a zero except one of them: the item at the index that the word is associated with, which is a one.

Vector representations of some of the examples from the previous paragraphs.

This is called one-hot encoding for words.
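
To make the idea concrete, here is a minimal sketch of one-hot encoding in plain Python. It assumes a toy vocabulary of only four words instead of the 100,000 imagined above, and the names vocabulary, word_to_index and one_hot are ours, chosen for illustration.

```python
# Toy vocabulary (an assumption for illustration; the post imagines 100,000 words).
vocabulary = ["apple", "banana", "king", "queen"]

# Assign each word a number: its position in the vocabulary, starting at 0.
word_to_index = {word: index for index, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


# Print the one-hot vector of every word in the vocabulary.
for word in vocabulary:
    print(word, one_hot(word))
```

Each printed vector has the same length as the vocabulary, and contains exactly one 1.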

One-hot encoding has several issues related to efficiency and context, which we will see in just a moment.

Word embeddings are just another way of representing words through vectors, one that successfully solves many of the issues of one-hot encoding by somehow abstracting the context or high-level meaning of each word.

The main takeaway here is that word embeddings are vectors that represent words, built so that words with similar meanings have similar vectors.

2) Why should we use word embeddings?

Consider the previous example but with only three words in our vocabulary: ‘apple’, ‘banana’ and ‘king’. The one-hot vector representations of these words would be the following.

One hot vector representations of our vocabulary

If we then plotted these word vectors in a 3-dimensional space, we would get a representation like the one shown in the following figure, where each axis represents one of the dimensions that we have, and the icons represent where the end of each word vector would be.

Representation of our one hot encoded word vectors in a 3 dimensional space.

As we can see, the distance from any vector (the position of its icon) to each of the others is the same: two steps of size 1 along different axes, which gives a straight-line distance of √2. This would still hold if we expanded the problem to 100,000 dimensions: there would be many more axes, but every pair of word vectors would remain the same distance apart.
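
We can check this with a few lines of Python. The sketch below assumes the three-word vocabulary from this section and uses the ordinary straight-line (Euclidean) distance; the helper name euclidean_distance is ours.

```python
import math
from itertools import combinations

# One-hot vectors for the three-word vocabulary used in this section.
one_hot_vectors = {
    "apple": [1, 0, 0],
    "banana": [0, 1, 0],
    "king": [0, 0, 1],
}


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Compute the distance between every pair of words.
distances = {
    (first, second): euclidean_distance(one_hot_vectors[first], one_hot_vectors[second])
    for first, second in combinations(one_hot_vectors, 2)
}

for pair, distance in distances.items():
    print(pair, round(distance, 4))
```

Every pair comes out at the same distance (the square root of 2), so ‘apple’ is no closer to ‘banana’ than it is to ‘king’.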

Ideally, we would want vectors for words that have similar meanings or represent similar items to be close together, and far away from those that have completely different meanings: we want apple to be close to banana but far away from king.

Also, one-hot encodings are very inefficient. If you think about it, they are huge, almost empty vectors with only one non-zero item. They are very sparse, and can greatly slow down our calculations.
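
To get a feeling for how empty these vectors are, here is a small sketch that builds a single one-hot vector for the 100,000-word vocabulary from our example and counts its non-zero items. The index 2 is an arbitrary choice for illustration.

```python
vocabulary_size = 100_000  # vocabulary size used in the example above
word_index = 2             # arbitrary word index, chosen for illustration

# Build the one-hot vector: all zeros except a single 1.
one_hot_vector = [0] * vocabulary_size
one_hot_vector[word_index] = 1

# Count how many items actually carry information.
non_zero_items = sum(1 for value in one_hot_vector if value != 0)
zero_fraction = 1 - non_zero_items / vocabulary_size

print("vector length:", len(one_hot_vector))
print("non-zero items:", non_zero_items)
print("fraction of zeros:", zero_fraction)
```

We store and process 100,000 numbers per word, yet only one of them is different from zero.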

In conclusion: one-hot encodings don’t take into account the context or meaning of the words, all the word vectors are the same distance from one another, and they are highly inefficient.

Word embeddings solve these problems by representing each word in the vocabulary with a fairly small, fixed-size vector (150, 300 or 500 dimensions, for example), called an embedding, which is learned during training.

These vectors are created in such a way that words that appear in similar contexts or have similar meanings are close together, and they are not sparse vectors like the ones derived from one-hot encodings.
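
The sketch below illustrates what "close together" means for such vectors. The 2-dimensional values are made up by hand for illustration; they are not learned embeddings, and real embeddings would have many more dimensions.

```python
import math

# Hand-picked 2-dimensional vectors (an assumption for illustration,
# NOT embeddings learned from data).
embeddings = {
    "apple": [0.9, 0.1],
    "banana": [0.8, 0.2],
    "king": [0.1, 0.9],
}


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


apple_to_banana = euclidean_distance(embeddings["apple"], embeddings["banana"])
apple_to_king = euclidean_distance(embeddings["apple"], embeddings["king"])

print("apple to banana:", round(apple_to_banana, 3))
print("apple to king:", round(apple_to_king, 3))
```

With these made-up values, apple ends up much closer to banana than to king, which is exactly the property that one-hot vectors cannot give us.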

If we had a 2-dimensional word embedding representation of the four words from our earlier examples (apple, banana, king and queen), and plotted it on a 2D grid, it would look something like the following figure.

