Text Preprocessing With NLTK
source link: https://towardsdatascience.com/nlp-preprocessing-with-nltk-3c04ee00edc0?gi=b3153b7ae22d
Photo by Carlos Muza on Unsplash
Intro
Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. Deep learning models cannot use raw text directly, so it is up to us researchers to clean the text ourselves. Depending on the nature of the task, the preprocessing methods can differ. This tutorial covers the most common preprocessing steps, which fit a wide variety of NLP tasks, using NLTK (the Natural Language Toolkit).
Why NLTK?
- Popularity: NLTK is one of the leading platforms for working with language data.
- Simplicity: It provides easy-to-use APIs for a wide variety of text preprocessing methods.
- Community: It has a large and active community that supports and improves the library.
- Open source: It is free and open source, available for Windows, macOS, and Linux.
Now that you know the benefits of NLTK, let’s get started!
Tutorial Overview
- Lowercase
- Removing Punctuation
- Tokenization
- Stopword Filtering
- Stemming
- Part-of-Speech Tagger
All code displayed in this tutorial can be accessed in my GitHub repo.
Import NLTK
Before preprocessing, we first need to install the NLTK library.
pip install nltk
Then, we can import the library in our Python notebook and download its contents.
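Assuming a standard NLTK setup, the import and resource downloads might look like the following sketch; the resource names (`punkt`, `stopwords`, `averaged_perceptron_tagger`) are the ones needed by the steps in this tutorial:

```python
import nltk

# Download the data packages used in this tutorial:
# tokenizer models, the English stopword list, and the POS tagger model.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
```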
Lowercase
As an example, we grab the first sentence from the book Pride and Prejudice as the text. We convert the sentence to lowercase via text.lower().
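A minimal sketch of this step, using the opening sentence of Pride and Prejudice as the working text:

```python
# First sentence of Pride and Prejudice as the working text.
text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife.")

# Convert every character to lowercase.
lowercased = text.lower()
print(lowercased)
```

Lowercasing ensures that, for example, "Truth" and "truth" are treated as the same token downstream.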
Removing Punctuation
To remove punctuation, we keep only the characters that are not punctuation, which can be checked using string.punctuation.
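One way to implement this, keeping every character not listed in `string.punctuation`:

```python
import string

text = ("it is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife.")

# Keep only the characters that are not in string.punctuation.
no_punct = "".join(ch for ch in text if ch not in string.punctuation)
print(no_punct)
```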
Tokenization
Strings can be tokenized into tokens via nltk.word_tokenize.
Stopword Filtering
We can use nltk.corpus.stopwords.words('english') to fetch a list of stopwords in the English dictionary. Then, we remove the tokens that are stopwords.
Stemming
We stem the tokens using nltk.stem.porter.PorterStemmer to get the stemmed tokens.
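For example, applying the Porter stemmer to the filtered tokens from the previous step:

```python
from nltk.stem.porter import PorterStemmer

filtered = ["truth", "universally", "acknowledged", "single", "man",
            "possession", "good", "fortune", "want", "wife"]

# Reduce each token to its Porter stem (e.g. "universally" and
# "universal" map to the same stem).
stemmer = PorterStemmer()
stemmed = [stemmer.stem(t) for t in filtered]
print(stemmed)
```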
POS Tagger
Lastly, we can use nltk.pos_tag to retrieve the part of speech of each token in a list.
The full notebook can be seen here.
Combining all Together
We can combine all the preprocessing methods above and create a preprocess function that takes in a .txt file and handles all the preprocessing. We print out the tokens, filtered words (after stopword filtering), stemmed words, and POS tags, one of which is usually passed on to the model or used for further processing. We use the Pride and Prejudice book (accessible here) and preprocess it.
This notebook can be accessed here.
Conclusion
Text preprocessing is an important first step for any NLP application. In this tutorial, we discussed several popular preprocessing approaches using NLTK: lowercasing, punctuation removal, tokenization, stopword filtering, stemming, and part-of-speech tagging.