NLP Chronicles: spaCy, the NLP Library Built for Production

The Up-and-Coming Champion of Natural Language Processing


If you’re familiar with natural language processing or starting to learn about it, you might have come across NLTK (Natural Language Toolkit), Stanford CoreNLP, etc.

But have you heard about spaCy?

Think of it this way. Natural language processing (in Python) is a kingdom, and NLTK is the king. Everyone admires the king and respects what he has done for the kingdom. But there comes a time when every king should step down and make way for the next generation.

spaCy, the prince, is an emerging champion built to succeed the reigning king. This library has gained popularity over the past couple of years and is steadily gaining the admiration of NLP practitioners.

In this article, I’ll show you what spaCy is, what makes it special, and how you can use it for NLP tasks.

[Image Credits: https://explosion.ai/blog/introducing-spacy]

What is spaCy?

“spaCy is an Industrial-Strength Natural Language Processing library” — spaCy.io

spaCy is a fairly new library to join the NLP world. But it’s gaining popularity quite steadily, and there are some really good reasons for this momentum.

What makes spaCy special?

spaCy claims to be an industrial-strength NLP library. This means a few things:

  • You can use spaCy in production environments, and it will work as efficiently as expected.
  • It’s very fast. Of course, it should be; after all, it’s written in Cython (a superset of Python that compiles to C, bringing Python performance close to C level).
  • It’s very accurate. In fact, it’s one of the most accurate NLP libraries to date.

Its syntactic parser is the fastest available, and its accuracy is within 1% of the best available.

These are not just idle claims—there are facts and figures to back them up. Head over to the spaCy benchmarks for more information.

  • It’s minimalistic and opinionated. spaCy doesn’t bombard you with many options to choose from. It just provides one algorithm for each task. And that algorithm is often the best (and it constantly gets perfected and improved). So instead of choosing what algorithm to use, you can be productive and just get your work done.
  • It’s highly extensible. With the bloom of machine learning (ML) and deep learning (DL), text data comes into play for many of its applications. spaCy can be used alongside other popular ML and DL libraries such as scikit-learn, TensorFlow, and more.
  • It supports several languages. Currently, spaCy supports German, Spanish, Greek, French, Italian, Dutch, and Portuguese, in addition to English. For a complete list, follow this link.
  • It’s customizable. You can add custom components or your own implementations where needed with spaCy.

How is spaCy different from NLTK?

Purpose

The primary difference between spaCy and NLTK is the purposes that they were built for.

NLTK was built with learning in mind. It is a great toolkit for teaching, learning, and experimenting with NLP. But spaCy was built with production-readiness in mind, focusing more on efficiency and performance.

Ease of use and learning time

NLTK provides several algorithms for each task, and you have to choose which one suits the task at hand. This selection process can be time-consuming and sometimes unnecessary.

However, spaCy doesn’t make you choose an algorithm. Instead, it provides the best and most efficient algorithm available for a particular task, without wasting any of your time.

Approach to handling text

NLTK processes and manipulates strings to perform NLP tasks. It has a method for each task—sent_tokenize for sentence tokenizing, pos_tag for part-of-speech tagging, etc. You have to select which method to use for the task at hand and feed in the relevant inputs.

On the other hand, spaCy follows an object-oriented approach in handling the same tasks. The text is processed in a pipeline and stored in an object, and that object contains attributes and methods for various NLP tasks. This approach is more versatile and in alignment with modern Python programming.
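
For instance, here’s a minimal sketch of the two styles side by side (the sample strings are illustrative, and NLTK’s punkt tokenizer data must be downloaded first):

# NLTK: standalone functions that operate on strings
from nltk.tokenize import word_tokenize
print(word_tokenize("NLTK processes plain strings."))

# spaCy: one processed Doc object exposes everything
import spacy
nlp = spacy.load('en')
doc = nlp("spaCy stores the processed text in a Doc object.")
print([token.text for token in doc])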

How to use spaCy for NLP tasks

Now comes the exciting part where you get to see spaCy in action. In this section, I’ll demonstrate how to perform basic NLP tasks with spaCy using practical examples.

But before we get started, you might want to brush up on the basics of natural language processing.

If you need a refresher, head on over to the first article in this series: NLP Chronicles: Intro to NLP with NLTK.

Here’s what we’ll cover (you can jump to a given section by following its link):

  • Sentence Boundary Detection
  • Part-of-Speech Tagging
  • Named Entity Recognition
  • Word Vectors Similarity

Prerequisites

You need to have the following libraries installed on your machine:

  • Python 3+
  • Jupyter Notebook

Also, you can use Google Colab instead of setting up a machine on your own. Here’s a getting started guide for Colab.

spaCy Installation

spaCy installation is quite easy. Just run the following commands, and you’ll have spaCy installed in no time.

# pip installation
pip install -U spacy
# anaconda installation
conda install -c conda-forge spacy

But installing spaCy alone won’t be enough, because in order to work with spaCy, you need to download its language-specific models manually.

The following code will download the English language model. In the same way, you can download models for other available languages:

python -m spacy download en

NOTE : I’ll be explaining aspects related to spaCy’s architecture and design in quote sections, where necessary.


Tokenization

Tokenization is the process of segmenting text into words, punctuation marks, etc. spaCy tokenizes the text, processes it, and stores the data in the Doc object.

The Doc object contains all the information about the text—the attributes, methods, and properties that give access to the requested linguistic information of the text.

The following figure shows the process of tokenization in spaCy.

[Tokenization process — Image Credits: https://spacy.io/usage/spacy-101#annotations-token]

In the following code snippet, you can observe that it’s possible to access tokens using token.text on the Token object.

The Token object acts as a view that points into the Doc object. It contains details and features about a token.

First, you have to import spaCy and load the model we downloaded earlier to create the nlp object (details about the model are explained in the next section).

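A minimal sketch of the tokenization example (the sample sentence is illustrative):

import spacy

# Load the English model downloaded earlier
nlp = spacy.load('en')

# Calling nlp on a string returns a processed Doc object
doc = nlp("spaCy is an industrial-strength NLP library.")

# Each Token is a view into the Doc; token.text gives the raw text
for token in doc:
    print(token.text)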

Here’s something interesting—after processing the text, spaCy keeps all the information about the original text intact within the Doc object.

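Using the nlp object created above, here’s a sketch showing that the original text stays recoverable (the sample string, with its extra spaces, is illustrative):

doc = nlp("Hello    world  !")

# token.idx holds each token's character offset in the original text
for token in doc:
    print(repr(token.text), token.idx)

# The original text, whitespace included, is preserved
print(repr(doc.text))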

token.idx holds each token’s character offset in the original text. As you can see from the code snippet above, all the extra and trailing spaces are also preserved.

Because of this clever design, you can reconstruct the original text, white spaces included. This also helps in situations where you need to replace words in the original text or annotate it.

For more information about tokenization, follow this link.

Dependency Parsing

Dependency parsing is the process of assigning syntactic dependency labels that describe the relationships between individual tokens, like subject or object.

After you call nlp in spaCy, the input text is first tokenized and the Doc object is created.

The Doc object goes through several phases of processing in a pipeline. This pipeline, unsurprisingly, is called the processing pipeline.

The processing pipeline uses a statistical model. The default pipeline consists of a tagger, a parser, and an entity recognizer. Each component does its own processing, then returns the same Doc object and passes it to the next component. We can add our own components as well.

[Processing Pipeline — Image Credits: https://spacy.io/usage/processing-pipelines#_title]

Statistical models are used for most tasks in spaCy. A statistical model is a neural network trained on large amounts of text data. The tagging, parsing, and entity recognition components use the model for their predictions. spaCy provides different models for different languages and different usages.

The dependency parser also uses the statistical model in order to compute the dependency labels for the tokens.

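A minimal sketch of inspecting the dependency parse (the sentence is illustrative):

doc = nlp("The quick brown fox jumps over the lazy dog.")

# For each token: text, dependency label, head, head's POS, and children
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child.text for child in token.children])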
  • text: The original token text.
  • dep_: The syntactic relation connecting child to head.
  • head.text: The original text of the token head.
  • head.pos_: The part-of-speech tag of the token head.
  • children: The immediate syntactic dependents of the token.

spaCy provides a convenient way to view the dependency parser in action, using its own visualization library called displaCy.

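In a Jupyter notebook, the visualization can be rendered along these lines:

from spacy import displacy

doc = nlp("The quick brown fox jumps over the lazy dog.")

# style='dep' draws the dependency tree; jupyter=True renders it inline
displacy.render(doc, style='dep', jupyter=True)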

The dependency parser is also used in sentence boundary detection, and it lets you iterate over computed noun chunks.

Follow this link for more information about dependency parsing in spaCy.

Chunking

Chunking is the process of extracting noun phrases from the text.

spaCy can identify noun phrases (or noun chunks), as well. You can think of noun chunks as a noun plus the words describing the noun. It’s also possible to identify and extract the base-noun of a given chunk.

For example, in the sentence “Tall big tree is in the vast garden” → the words “tall” and “big” describe the noun “tree”, and “vast” describes the noun “garden”.

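A minimal sketch of extracting noun chunks, using the example sentence above:

doc = nlp("Tall big tree is in the vast garden.")

# doc.noun_chunks yields each noun phrase as a Span
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)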
  • text: The original noun chunk text.
  • root.text: The original text of the word connecting the noun chunk to the rest of the parse.
  • root.dep_: The dependency relation connecting the root to its head.
  • root.head.text: The text of the root token’s head.

You can find more information about chunking at this link.

Sentence Boundary Detection

This is the process of identifying and splitting text into individual sentences.

Typically, most NLP libraries use a rule-based approach to obtain sentence boundaries. However, spaCy follows a different approach for this task.

spaCy uses dependency parsing in order to detect sentences using the statistical model. This is more accurate than the classical rule-based approach.

Traditional rule-based sentence splitting will work on general purpose text, but may not work as intended when it comes to social media or conversational text. Since spaCy uses a prediction-based approach, the accuracy of sentence splitting tends to be higher.

By accessing the Doc.sents property of the Doc object, we can get the sentences as in the code snippet below.

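A minimal sketch (the sample text, emojis included, is illustrative):

doc = nlp("This is the first sentence. This is another one 😀 And here is the last one!")

# doc.sents yields each detected sentence as a Span
for sent in doc.sents:
    print(sent.text)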

Notice the emojis added to the text. As you can see, spaCy identifies the emojis correctly and splits the text into sentences as intended.

Part-of-Speech (POS) Tagging

POS tagging assigns word types to tokens, like verb or noun.

After tokenization, the text goes through parsing and tagging. With the use of the statistical model, spaCy can predict the most likely tag/label for a token in a given context.

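A minimal sketch of inspecting the token attributes (the sentence is illustrative):

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Print the main linguistic attributes of each token
for token in doc:
    print(token.text, token.pos_, token.tag_, token.shape_,
          token.is_alpha, token.is_stop)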

In the above code snippet, the attributes of the Token object represent the following:

  • text: The original word text.
  • pos_: The simple part-of-speech tag.
  • tag_: The detailed part-of-speech tag.
  • shape_: The word shape (capitalization, punctuation, digits).
  • is_alpha: Is the token an alpha character?
  • is_stop: Is the token part of a stop list, i.e., one of the most common words of the language?

You can get a description of the pos_ or tag_ by using the following command:

spacy.explain("NNP")
# OUTPUT --> 'noun, proper singular'
spacy.explain("VBD")
# OUTPUT --> 'verb, past tense'

Named Entity Recognition

NER labels words/tokens that name “real-world” objects, like persons, companies, or locations.

spaCy’s statistical model has been trained to recognize various types of named entities, such as names of people, countries, products, etc.

The predictions of these entities might not always work perfectly because the statistical model may not be trained on the examples that you require. In such a case, you can tune the model to suit your needs.

Follow this link for a full list of named entities supported by spaCy.

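A minimal sketch of extracting named entities (the sentence is illustrative):

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# doc.ents holds the recognized entity spans along with their labels
for ent in doc.ents:
    print(ent.text, ent.label_)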

As you can see, spaCy can accurately identify most entities. Using displaCy, you can view the identified entities:

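The entity visualizer can be invoked like this in a notebook:

from spacy import displacy

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# style='ent' highlights the recognized entities inline
displacy.render(doc, style='ent', jupyter=True)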

Lemmatization

Lemmatization is the process of assigning words their base forms. For example: “was” → “be”, or “cats” → “cat”.

To perform lemmatization, the Doc object needs to be parsed. The processed Doc object contains the lemmas of its words.

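A minimal sketch (the sentence is illustrative):

doc = nlp("She was reading while the cats were sleeping.")

# token.lemma_ gives the base form of each word
for token in doc:
    print(token.text, token.lemma_)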

Word Vectors Similarity

Word vector similarity is determined by comparing the vector representations of words. Word vectors can be generated using an algorithm like word2vec.

This feature also needs a statistical model. However, the default model doesn’t come with word vectors, so you’ll have to download a larger model for that.

The following command downloads the model.

python -m spacy download en_core_web_lg

You can load the larger model and access the vector of a word as follows:

# Load the larger model, which ships with word vectors
nlp = spacy.load('en_core_web_lg')
print(nlp.vocab['man'].vector)

The model’s vocabulary contains vectors for most words in the language. Words like “man”, “vehicle”, and “school” are fairly common words, and their vectors can be accessed as shown below.

If a word isn’t in the vocabulary, then it doesn’t have a vector representation. In the following example, the word “jfido” is such a word.

We can identify if a word is out of the vocabulary using the is_oov attribute.

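A minimal sketch of checking vocabulary coverage (“jfido” is the made-up out-of-vocabulary word from the example above):

tokens = nlp("man vehicle school jfido")

# has_vector: does the token have a vector; is_oov: is it out of vocabulary
for token in tokens:
    print(token.text, token.has_vector, token.is_oov)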

spaCy can also compare two objects and predict how similar they are.

Doc, Span, and Token objects contain a method called .similarity to compute similarity.

As you can see from the snippet below, the similarity between “laptop” and “computer” is 0.677216, while the similarity between “bus” and “laptop” is 0.2695869.

Related objects have a greater similarity score, while less related objects have a lower similarity score.

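A sketch of word-level similarity (the exact scores depend on the model and its version):

tokens = nlp("laptop computer bus")

# Token.similarity compares the word vectors of two tokens
print(tokens[0].similarity(tokens[1]))  # laptop vs. computer
print(tokens[0].similarity(tokens[2]))  # laptop vs. bus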

In a similar way, we can also find the similarity of sentences:

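Sentence similarity works the same way, this time on Doc objects (the sentences are illustrative):

doc1 = nlp("I like fast food.")
doc2 = nlp("I like pizza.")

# Doc.similarity compares the averaged word vectors of the two texts
print(doc1.similarity(doc2))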

Conclusion

Even though NLTK and other NLP libraries are great, spaCy is likely to emerge as a favored library because it shows amazing potential, especially for production-level tasks and applications.

As a recap, spaCy provides:

  • Fast performance and efficiency
  • Industrial-grade usage
  • Better accuracy
  • State-of-the-art algorithms for NLP tasks

spaCy keeps evolving and improving, which makes it more exciting to work with. I personally have fallen in love with spaCy and its capabilities.

If you are an NLP practitioner who hasn’t tried spaCy yet, you should definitely give it a try. I’m sure you will start loving it as well.

In this article, we barely scratched the surface of spaCy’s abilities. We can do much more with spaCy, and I plan to discuss these more advanced features and usages in a future article.

If you have any problems or questions regarding this article, please do not hesitate to leave a comment below or drop me an email ([email protected]).

If you guys enjoyed this article, don’t forget to give a clap.
