Deep Learning for Natural Language Processing Using word2vec-keras


A deep learning approach for NLP by combining Word2Vec with Keras LSTM

Natural language processing (NLP) is a research subfield shared by linguistics, computer science, information engineering, artificial intelligence, and other fields. NLP is concerned with the interactions between computers and human natural languages in general, and in particular with how to use computers to process and analyze natural language data (e.g., text and voice). Major challenges in NLP include speech recognition, natural language understanding, and natural language generation.

Text is one of the most widespread forms of NLP data. It can be treated as either a sequence of characters or a sequence of words, but with the advance of deep learning the trend is to work at the level of words. Given a sequence of words, it must somehow be converted into numerical values before it can be understood by a machine learning or deep learning algorithm/model such as an LSTM. One straightforward way is one-hot encoding, which maps each word to a sparse vector of the length of the vocabulary. The other approach (e.g., Word2vec) uses a word embedding to convert each word into a compact dense vector.
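
To make the contrast concrete, here is a toy sketch of the two encodings; the five-word vocabulary and the random embedding values are invented purely for illustration:

import numpy as np

vocab = ["free", "prize", "call", "now", "win"]
word_index = {w: i for i, w in enumerate(vocab)}

# One-hot: a sparse vector as long as the vocabulary, with a single 1
one_hot = np.eye(len(vocab))[word_index["prize"]]
print(one_hot)  # [0. 1. 0. 0. 0.]

# Embedding: a compact dense vector (random here; Word2Vec learns real ones from context)
embedding_dim = 3
embedding_matrix = np.random.rand(len(vocab), embedding_dim)
print(embedding_matrix[word_index["prize"]])  # e.g., [0.42 0.11 0.87]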

In NLP for traditional machine learning [1], both textual data preprocessing and feature engineering are required. Recently a new deep learning model, the Word2Vec-Keras Text Classifier [2], was released for text classification. It combines the Word2Vec model of Gensim [3] (a Python library for topic modeling, document indexing, and similarity retrieval with large corpora) with a Keras LSTM through an embedding layer as input.
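
As a rough illustration of this combination (a minimal sketch under assumed hyperparameters, not the library's actual internals), Word2Vec vectors can seed a Keras Embedding layer that feeds an LSTM:

import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Train Word2Vec on tokenized text (toy corpus here; gensim 4.x API)
sentences = [["free", "prize", "call", "now"], ["see", "you", "at", "lunch"]]
w2v = Word2Vec(sentences, vector_size=300, min_count=1, epochs=10)

# Copy the learned vectors into an embedding matrix (index 0 reserved for padding)
vocab_size = len(w2v.wv) + 1
weights = np.zeros((vocab_size, 300))
for i, word in enumerate(w2v.wv.index_to_key, start=1):
    weights[i] = w2v.wv[word]

# The embedding layer maps word indices to their Word2Vec vectors, an LSTM
# reads the sequence, and a sigmoid head makes the binary ham/spam call
model = Sequential([
    Embedding(vocab_size, 300, weights=[weights], trainable=False),  # classic tf.keras 2.x idiom for pretrained vectors
    LSTM(512),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])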

In this article, similarly to [1], I use the public Kaggle SMS Spam Collection Dataset [4] to evaluate the performance of the Word2VecKeras model in SMS spam classification without feature engineering. The following two scenarios are covered:

  • SMS spam classification with data preprocessing
  • SMS spam classification without data preprocessing

The following code imports all the necessary Python libraries:

from sklearn.datasets import fetch_20newsgroups
from word2vec_keras import Word2VecKeras
from pprint import pprint
import pandas as pd
import matplotlib.pyplot as plt
import itertools
import numpy as np
import nltk
import string
import re
import ast # abstract syntax tree: https://docs.python.org/3/library/ast.html
from sklearn.model_selection import train_test_split
import mlflow
import mlflow.sklearn
%matplotlib inline

Once the SMS dataset file spam.csv has been downloaded onto a computer, the following code loads the local dataset file into a Pandas DataFrame:

column_names = ['label', 'body_text', 'missing_1', 'missing_2', 'missing_3']
raw_data = pd.read_csv('./data/spam.csv', encoding="ISO-8859-1")
raw_data.columns = column_names
raw_data.drop(['missing_1', 'missing_2', 'missing_3'], axis=1, inplace=True)
raw_data = raw_data.sample(frac=1.0)
raw_data.head()

Note that loading this dataset requires the encoding format ISO-8859-1 rather than the default encoding format UTF-8.

1. Spam classification with data preprocessing

In this section, first a data preprocessing procedure similar to [1] is applied to clean the SMS dataset. Then the resulting clean dataset is fed into the Word2VecKeras model for model training and the prediction of spam SMS. mlflow [5][6] is used to track the history of model executions.

1.1 Data Preprocessing

The preprocessing() method of the Preprocessing class preprocesses the SMS raw data in the following steps:

  • remove punctuation
  • tokenization
  • remove stopwords
  • apply stemming
  • apply lemmatizing
  • join tokens into string
  • drop intermediate data columns

class Preprocessing(object):
    def __init__(self, data, target_column_name='body_text_clean'):
        self.data = data
        self.feature_name = target_column_name

    def remove_punctuation(self, text):
        # Discard all punctuation characters
        text_nopunct = "".join([char for char in text if char not in string.punctuation])
        return text_nopunct

    def tokenize(self, text):
        # Split on one or more non-word characters
        tokens = re.split(r'\W+', text)
        return tokens

    def remove_stopwords(self, tokenized_list):
        # Remove all English stopwords
        stopword = nltk.corpus.stopwords.words('english')
        text = [word for word in tokenized_list if word not in stopword]
        return text

    def stemming(self, tokenized_text):
        ps = nltk.PorterStemmer()
        text = [ps.stem(word) for word in tokenized_text]
        return text

    def lemmatizing(self, tokenized_text):
        wn = nltk.WordNetLemmatizer()
        text = [wn.lemmatize(word) for word in tokenized_text]
        return text

    def tokens_to_string(self, tokens_string):
        # After the CSV round trip, a token list is stored as its string
        # representation; ast.literal_eval converts it back to a list
        try:
            list_obj = ast.literal_eval(tokens_string)
            text = " ".join(list_obj)
        except (ValueError, SyntaxError):
            text = None
        return text

    def dropna(self):
        feature_name = self.feature_name
        if self.data[feature_name].isnull().sum() > 0:
            column_list = [feature_name]
            self.data = self.data.dropna(subset=column_list)
        return self.data

    def preprocessing(self):
        self.data['body_text_nopunc'] = self.data['body_text'].apply(lambda x: self.remove_punctuation(x))
        self.data['body_text_tokenized'] = self.data['body_text_nopunc'].apply(lambda x: self.tokenize(x.lower()))
        self.data['body_text_nostop'] = self.data['body_text_tokenized'].apply(lambda x: self.remove_stopwords(x))
        self.data['body_text_stemmed'] = self.data['body_text_nostop'].apply(lambda x: self.stemming(x))
        self.data['body_text_lemmatized'] = self.data['body_text_nostop'].apply(lambda x: self.lemmatizing(x))

        # save the cleaned dataset into a csv file and load it back
        self.save()
        self.load()

        self.data[self.feature_name] = self.data['body_text_lemmatized'].apply(lambda x: self.tokens_to_string(x))

        self.dropna()

        drop_columns = ['body_text_nopunc', 'body_text_tokenized', 'body_text_nostop', 'body_text_stemmed', 'body_text_lemmatized']
        self.data.drop(drop_columns, axis=1, inplace=True)
        return self.data

    def save(self, filepath="./data/spam_cleaned.csv"):
        self.data.to_csv(filepath, index=False, sep=',')

    def load(self, filepath="./data/spam_cleaned.csv"):
        self.data = pd.read_csv(filepath)
        return self.data

The resulting clean text is saved in a new column body_text_clean, which can be inspected as shown below.
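
A minimal usage sketch (mirroring how the SpamClassifier class below calls Preprocessing) inspects the cleaned column:

pp = Preprocessing(raw_data)
clean_data = pp.preprocessing()
clean_data[['label', 'body_text', 'body_text_clean']].head()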

In the above data preprocessing, both the stopwords and the wordnet data files of the Natural Language Toolkit (NLTK) are required; on Mac they had to be downloaded manually (available here), since the nltk.download() method did not work appropriately.
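
For environments where the programmatic route does work, the standard NLTK calls are:

import nltk
nltk.download('stopwords')  # English stopword list
nltk.download('wordnet')    # lemmatizer data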

1.2 Modeling

The prepare_data() method of the SpamClassifier class prepares the SMS data for modeling as follows:

  • load the dataset file spam.csv into Pandas DataFrame
  • use the Preprocessing class to preprocess the raw data (see the body_text_clean column)
  • split the clean data after data preprocessing into training and testing datasets
  • reformat the training and testing datasets as Python lists to be aligned with the model Word2VecKeras API [2]

Once the data is prepared for modeling, the train_model() method can be called to train the Word2VecKeras model. Then the evaluate() and predict() methods can be called to obtain model performance metrics (e.g., accuracy) and to make predictions, respectively.

The mlFlow() method combines the above method calls, the tracking of model execution results, and the logging of the trained model to a file into one workflow.

class SpamClassifier(object):
    def __init__(self):
        self.model = Word2VecKeras()

    def load_data(self):
        column_names = ['label', 'body_text', 'missing_1', 'missing_2', 'missing_3']
        data = pd.read_csv('./data/spam.csv', encoding="ISO-8859-1")
        data.columns = column_names
        data.drop(['missing_1', 'missing_2', 'missing_3'], axis=1, inplace=True)
        self.raw_data = data.sample(frac=1.0)
        return self.raw_data

    def split_data(self):
        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(self.x, self.y, test_size=0.25, random_state=42)

    def numpy_to_list(self):
        self.x_train = self.x_train.tolist()
        self.y_train = self.y_train.tolist()
        self.x_test = self.x_test.tolist()
        self.y_test = self.y_test.tolist()

    def prepare_data(self, feature, label='label'):
        self.load_data()
        pp = Preprocessing(self.raw_data)
        self.data = pp.preprocessing()
        self.x = self.data[feature].values
        self.y = self.data[label].values
        self.split_data()
        self.numpy_to_list()
        return self.data

    def train_model(self):
        self.w2v_size = 300
        self.w2v_min_count = 1  # 5
        self.w2v_epochs = 100
        self.k_epochs = 5  # 32
        self.k_lstm_neurons = 512
        self.k_max_sequence_len = 1000

        self.model.train(self.x_train, self.y_train,
                         w2v_size=self.w2v_size,
                         w2v_min_count=self.w2v_min_count,
                         w2v_epochs=self.w2v_epochs,
                         k_epochs=self.k_epochs,
                         k_lstm_neurons=self.k_lstm_neurons,
                         k_max_sequence_len=self.k_max_sequence_len,
                         k_hidden_layer_neurons=[])

    def evaluate(self):
        self.result = self.model.evaluate(self.x_test, self.y_test)
        self.accuracy = self.result["ACCURACY"]
        self.clf_report_df = pd.DataFrame(self.result["CLASSIFICATION_REPORT"])
        self.cnf_matrix = self.result["CONFUSION_MATRIX"]
        return self.result

    def predict(self, idx=1):
        print("LABEL:", self.y_test[idx])
        print("TEXT :", self.x_test[idx])
        print("\n============================================")
        print("PREDICTION:", self.model.predict(self.x_test[idx]))

    def mlFlow(self, feature='body_text_clean'):
        np.random.seed(40)
        with mlflow.start_run():
            self.prepare_data(feature=feature)  # use 'body_text' if no preprocessing is needed
            self.train_model()
            self.evaluate()
            self.predict()
            mlflow.log_param("feature", feature)
            mlflow.log_param("w2v_size", self.w2v_size)
            mlflow.log_param("w2v_min_count", self.w2v_min_count)
            mlflow.log_param("w2v_epochs", self.w2v_epochs)
            mlflow.log_param("k_lstm_neurons", self.k_lstm_neurons)
            mlflow.log_param("k_max_sequence_len", self.k_max_sequence_len)
            mlflow.log_metric("accuracy", self.accuracy)
            mlflow.sklearn.log_model(self.model, "Word2Vec-Keras")

The following code shows how to instantiate a SpamClassifier object and call the mlFlow() method for modeling and prediction with data preprocessing:

spam_clf = SpamClassifier()
spam_clf.mlFlow(feature='body_text_clean')
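
Each run logged this way can be inspected with mlflow's standard tracking UI: running the mlflow ui command from the project directory serves the run history (the logged parameters such as w2v_size and k_lstm_neurons, plus the accuracy metric) at http://localhost:5000 by default.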

1.3 Comparison

In [1], a similar data preprocessing procedure was applied to the same Kaggle SMS spam dataset first. Feature engineering was then performed on the preprocessed dataset to obtain modeling features such as text message length and the percentage of punctuation in the text, and a scikit-learn RandomForestClassifier model was trained for prediction. The accuracy obtained is about 97.7%.
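
For context, here is a rough sketch of such a traditional pipeline (illustrative only, not the exact code from [1]; the TF-IDF vectorizer and the number of trees are assumptions):

import string
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def engineered_features(texts):
    # Hand-crafted features as described in [1]: message length and punctuation percentage
    body_len = [len(t) for t in texts]
    punct_pct = [100 * sum(c in string.punctuation for c in t) / max(len(t), 1)
                 for t in texts]
    return np.column_stack([body_len, punct_pct])

tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(x_train).toarray()  # x_train/y_train from the split above
features = np.hstack([engineered_features(x_train), x_tfidf])

rf = RandomForestClassifier(n_estimators=150, n_jobs=-1)
rf.fit(features, y_train)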

In this article, after data preprocessing, the Word2VecKeras model is directly trained on the preprocessed dataset for prediction without any feature engineering. The achieved accuracy is about 98.4%.

2. Spam classification without data preprocessing

In this section, once the Kaggle SMS spam dataset is loaded, the raw data (see the body_text column) is directly fed into the Word2VecKeras model for model training and the prediction of spam SMS. Neither data preprocessing nor feature engineering is used. As in the previous section, mlflow [5][6] is used to track the history of model executions. This is achieved as follows:

spam_clf = SpamClassifier()
spam_clf.mlFlow(feature='body_text')

The obtained accuracy is about 98.6%, which is on par with the accuracy the Word2VecKeras model achieved with data preprocessing in the previous section.

Summary

In this article, the public Kaggle SMS Spam Collection Dataset [4] was used to evaluate the performance of the new Word2VecKeras model in SMS spam classification without feature engineering. Two scenarios were covered. One applied the common textual data preprocessing to clean the raw dataset and then used the clean dataset to train the model for prediction. The other directly used the raw dataset for model training and prediction, without any data preprocessing. The model accuracy results show that the Word2VecKeras model outperformed the traditional NLP method in [1] and performed similarly in both of the two scenarios above. This indicates that the new Word2VecKeras model has the potential to be applied directly to raw textual data for text classification, without either textual data preprocessing or feature engineering.

All of the pieces of source code in this article are available on GitHub [7].

References

[1]. B. Shetty, Natural Language Processing (NLP) for Machine Learning

[2]. Word2Vec-Keras Text Classifier

[3]. Gensim

[4]. Kaggle SMS Spam Collection Dataset

[5]. mlflow

[6]. Y. Zhang, Object-Oriented Machine Learning Pipeline with mlflow for Pandas and Koalas DataFrames

[7]. Y. Zhang, Jupyter notebook on GitHub

