
Text Mining with the Democratic Debates

[Figure: top 5 trigrams by tf-idf for each candidate]

There have been six official Democratic debates so far, with more to come. The debates are long and, frankly, very boring. The argument for the debates is to introduce the candidates to America and explain their policy positions. However, America has a short attention span, and with a three-hour run time, that is a lot of time spent watching TV. In total, a person would have spent 18 hours watching the debates. That is not realistic at all, and there must be a better way to summarize the debates in an objective manner.

Introducing Text Mining

Text mining is the practice of extracting information from text. Techniques such as n-grams, bag-of-words, and topic modeling are some of its basic building blocks. I believe text mining will become a more important field than machine learning because of the rising integration of social media with business. We live in an age where collecting and storing data is extremely easy and cheap. Businesses are now exploring ways to process and understand that data, which has driven machine learning, deep learning, and AI. However, unstructured data is growing even faster, especially text. Consumer reviews, tweets, and emails are examples of unstructured data that businesses want to capitalize on. All of this sounds extremely complicated: machine learning with unstructured data? Neural networks for understanding language? How is it possible? Believe it or not, extracting insights from text can be as simple as counting words. The hardest part of text mining is collecting the data.

Luckily, after every debate a news organization posts the debate transcript. I downloaded the transcripts and cleaned them into a simple csv file that can be found here. Both R and Python have many text mining packages. I will be using tidytext, which has a great textbook for learning the basics of text mining; this article is essentially a summary of those techniques.

N-grams and Bag-of-words

N-grams and bag-of-words are the foundation of most text mining techniques. The idea behind both is simply to count words. Given the sentence "I like dogs.", the bag-of-words features would look like this.

[Table: bag-of-words counts for "I like dogs."]

Now let’s see what two sentences would look like with the bag-of-words features.

[Table: bag-of-words counts for "I like dogs." and "I like cats"]

Note how, for "I like cats", the dogs feature is 0. Bag-of-words features map text data into a traditional numerical format.

N-grams are like bag-of-words with one extra step. Instead of counting single words, n-grams count sequences of words, where n is the number of words in each sequence. So a 2-gram, or "bigram", is a two-word sequence. Let's see what bigram features look like for the previous sentences.

[Table: bigram counts for "I like dogs." and "I like cats"]

At first glance the table doesn't look any different from the bag-of-words table. However, bigrams add more context: we can see that dogs and cats are paired with the word like, which means both sentences are positive toward those two animals. You can go even further with n-grams such as trigrams, four-grams, five-grams, etc. N-grams add context, but they can also start to get very specific, which can be a good or a bad thing.
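To make this concrete, here is a minimal sketch of both sets of features in R, using the tidytext package introduced in the next section. The two sentences are toy data made up for illustration; note that unnest_tokens lowercases tokens by default.

library(tidyverse)
library(tidytext)

toy <- tibble(sentence = c("sentence 1", "sentence 2"),
              text = c("I like dogs", "I like cats"))

# Bag-of-words: one row per sentence and word, with its count
toy %>% 
  unnest_tokens(word, text) %>% 
  count(sentence, word)

# Bigrams: one row per sentence and two-word sequence, with its count
toy %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  count(sentence, bigram)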

N-grams and Bag-of-Words feature engineering in R

N-gram and bag-of-words features can be generated with regex functions, but regex gets messy and there are packages that do most of the legwork. Tidytext contains a function called unnest_tokens. Tokens are mentioned a lot in text mining: a token is the unit of text you feed into an analysis, so both single words (bag-of-words) and n-grams count as tokens.

First, we're going to load the packages and do some formatting. There are three columns in the dataset: debate, character, and text. I filtered out the moderators and other unknown speakers from the transcripts so that the data only contains the presidential candidates.

library(tidyverse)
library(tidytext)

df <- read.csv("DemDebates.csv",
               colClasses = c("factor", "factor", "character"))

df <- df %>% 
  filter(!character %in% c("(Unknown)", "Announcer", "Bash", "Bridgewater", "Burnett",
                           "Cooper", "Davis", "Diaz-Balart", "Guthrie", "Holt",
                           "Jose Diaz-Balart", "Lacey", "Lemon", "Lester Holt",
                           "Maddow", "Muir", "Protesters", "Protestor", "Ramos",
                           "Savannah Guthrie", "Stephanopoulos", "Tapper", "Todd",
                           "Unknown"))

Now we'll use the unnest_tokens function to extract the bag of words. The first argument names the token column, the second is the column containing the text, the third specifies the type of token, and the last is the n for the n-gram. So for bag of words we specify n = 1.

df %>% unnest_tokens(word, "text", token = "ngrams", n = 1)

Now that we have the words, we need to count them. This is a simple count function, counting by character. I will also filter to the top 5 words for each candidate and sort by candidate and word frequency. All of this is done in one piped command, although you could assign the bag-of-words to a variable along the way.

df %>% 
  unnest_tokens(word, "text", token = "ngrams", n = 1) %>% 
  count(character, word) %>% 
  group_by(character) %>% 
  top_n(5, n) %>%           # keep each candidate's 5 most frequent words
  arrange(character, -n)

[Table: top 5 words per candidate, dominated by stop words]

As you can see, the top 5 words for each candidate are not very useful. The list is full of words that carry little to no meaning. These words are known as stop words and are usually filtered out. The tidytext package contains a list of common stop words that we can use to filter the bag-of-words list. This is done with anti_join from dplyr, although you could also use the filter function. Tidytext integrates many of its functions with the tidyverse, which makes it very intuitive.

df %>% 
  unnest_tokens(word, "text", token = "ngrams", n = 1) %>% 
  anti_join(stop_words) %>%    # drop common stop words
  count(character, word) %>% 
  group_by(character) %>% 
  top_n(5, n) %>% 
  arrange(character, -n)

[Table: top 5 words per candidate after removing stop words]

Now some interesting things are happening. The words look somewhat relevant and could give some insight into each candidate. Let's plot the words by candidate. We're going to use ggplot to make a simple bar chart showing each candidate's top 5 words and their frequencies. I will also use facet_wrap, which creates a separate plot for each value of a column, giving us one panel per candidate. A common problem with plotting words or categorical variables is ordering them, but tidytext comes with functions that help: reorder_within orders the words within each facet, and scale_x_reordered must be used with it to format the x axis.

df %>% 
  unnest_tokens(word, "text", token = "ngrams", n = 1) %>% 
  anti_join(stop_words) %>% 
  count(character, word) %>% 
  group_by(character) %>% 
  top_n(5, n) %>% 
  arrange(character, -n) %>% 
  ggplot(aes(x = reorder_within(word, n, character), # Reorders words by frequency
             y = n,
             fill = character)) + 
  geom_col() + 
  scale_x_reordered() + # Cleans up the reordered axis labels
  facet_wrap(~character, scales = "free") + # Creates individual graphs
  coord_flip() + 
  theme(legend.position = "none")

[Figure: bar charts of each candidate's top 5 words]

This graph is very nice and reveals a lot about the current pool of candidates. Democrats are talking about people, the president, Trump, America, government, healthcare, climate, etc. All of this makes sense; however, the candidates largely share the same words, which doesn't tell us much about each individual candidate. Let's look at the bigrams for each candidate.

Although you should really rename the token column to bigram, we can just change n = 1 to n = 2. See how simple it is to work with n-grams in tidytext?

df %>% 
  unnest_tokens(word, "text", token = "ngrams", n = 2) %>% 
  anti_join(stop_words) %>% 
  count(character, word) %>% 
  group_by(character) %>% 
  top_n(5, n) %>% 
  arrange(character, -n) %>% 
  ggplot(aes(x = reorder_within(word, n, character), # Reorders bigrams by frequency
             y = n,
             fill = character)) + 
  geom_col() + 
  scale_x_reordered() + # Cleans up the reordered axis labels
  facet_wrap(~character, scales = "free") + # Creates individual graphs
  coord_flip() + 
  theme(legend.position = "none")

[Figure: bar charts of each candidate's top 5 bigrams, still full of stop words]

Okay, maybe it isn't that easy. The anti_join command matches single words, not bigrams, so the stop words slip through. We can handle this with the separate and unite functions: split each bigram into two columns, word1 and word2, keep a bigram only if at least one of its words is not a stop word, and then paste the columns back together.

df %>% 
  unnest_tokens(word, "text", token = "ngrams", n = 2) %>% 
  separate(word, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word | !word2 %in% stop_words$word) %>% 
  unite("bigram", c(word1, word2), sep = " ") %>% 
  count(character, bigram) %>% 
  group_by(character) %>% 
  top_n(5, n) %>% 
  arrange(character, -n) %>% 
  ggplot(aes(x = reorder_within(bigram, n, character), # Reorders bigrams by frequency
             y = n,
             fill = character)) + 
  geom_col() + 
  scale_x_reordered() + # Cleans up the reordered axis labels
  facet_wrap(~character, scales = "free") + # Creates individual graphs
  coord_flip() + 
  theme(legend.position = "none")

[Figure: bar charts of each candidate's top 5 bigrams after removing stop words]

As you can see, the bigrams are more informative about each candidate. Beto talks about El Paso, and Sanders talks about Medicare for All (as implied by the bigram "medicare for"). We can also evaluate the debates themselves and see whether there were any major themes across them.

df %>% 
  unnest_tokens(bigram, "text", token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word | !word2 %in% stop_words$word) %>% 
  unite("bigram", c("word1", "word2"), sep = " ") %>% 
  count(debate, bigram) %>% 
  group_by(debate) %>% 
  top_n(5, n) %>% 
  ggplot(aes(x = reorder_within(bigram, n, debate), 
             y = n, 
             fill = debate)) + 
  geom_col() + 
  scale_x_reordered() +
  facet_wrap(~debate, scales = "free") + 
  coord_flip() +
  theme(legend.position = "none")

[Figure: top 5 bigrams per debate]

At first glance the graph shows few differences; in fact, the bigrams and their rankings are very similar across debates. This is one of the faults of simply counting n-grams or bag-of-words: it doesn't account for words that are common across every document. You could create a separate list of common words for each document and filter them out, but that takes a lot of time. This is where TF-IDF shines.

Term Frequency and Inverse Document Frequency (TF-IDF)

Term frequency-inverse document frequency is my preferred way of counting n-grams or bag-of-words within groups. TF-IDF surfaces the words that are distinctive for each group by comparing a group's word frequencies against the entire set of documents. The math is pretty simple too.

tf(term, document) = (times the term appears in the document) / (total terms in the document)
idf(term) = ln(total number of documents / number of documents containing the term)
tf-idf(term, document) = tf(term, document) × idf(term)
Source: https://skymind.ai/wiki/bagofwords-tf-idf
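To make the math concrete, here is a rough sketch that computes tf-idf by hand with dplyr, treating each debate as one document. It assumes the df data frame from earlier and is only meant to illustrate the formula; the tidytext function used below is the better choice in practice.

total_debates <- n_distinct(df$debate)

df %>% 
  unnest_tokens(word, "text", token = "ngrams", n = 1) %>% 
  count(debate, word) %>%                     # n = how often a word appears in a debate
  group_by(debate) %>% 
  mutate(tf = n / sum(n)) %>%                 # term frequency within each debate
  ungroup() %>% 
  group_by(word) %>% 
  mutate(idf = log(total_debates / n()),      # words appearing in fewer debates get higher idf
         tf_idf = tf * idf) %>% 
  ungroup()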

Tidytext contains a bind_tf_idf function, so you won't need to implement the formula manually. What's great about tf-idf is that you technically don't need to remove stop words, because stop words are shared across all of the documents and therefore score very low. Let's see it in action.

df %>% 
  unnest_tokens(word, "text", token = "ngrams", n = 1) %>% 
  count(debate, word) %>% 
  bind_tf_idf(word, debate, n) %>% 
  group_by(debate) %>% 
  top_n(5, tf_idf) %>%   # keep each debate's 5 highest tf-idf words
  ggplot(aes(x = reorder_within(word, tf_idf, debate), y = tf_idf, fill = debate)) + 
  geom_col() + 
  scale_x_reordered() + 
  facet_wrap(~debate, scales = "free") + 
  coord_flip() + 
  theme(legend.position = "none")

bind_tf_idf goes right after the count function and takes the term column, the document column, and the count column n.

[Figure: top 5 tf-idf words per debate]

The graph shows a lot of different words, and some of them don't make any sense. Tf-idf did help; however, there are still some tokens that add no value, like in debates 4 and 2a. Why is there an "h"? Probably an artifact of how the data was cleaned, where there might have been an unwanted space. Still, we can see very distinct words. Although tf-idf pushed down a lot of stop words on its own, it is still better to remove them explicitly. Let's see what tf-idf shows us for each candidate.

df %>% 
  unnest_tokens(word, "text", token = "ngrams", n = 1) %>% 
  count(character, word) %>% 
  bind_tf_idf(word, character, n) %>% 
  group_by(character) %>% 
  top_n(5, tf_idf) %>%   # keep each candidate's 5 highest tf-idf words
  ggplot(aes(x = reorder_within(word, tf_idf, character), y = tf_idf, fill = character)) + 
  geom_col() + 
  scale_x_reordered() + 
  facet_wrap(~character, scales = "free") + 
  coord_flip() + 
  theme(legend.position = "none")

[Figure: top 5 tf-idf words per candidate]

Now this is very interesting. The top 5 tf-idf words reveal a lot about each candidate. Yang talks about 1,000, which references his Freedom Dividend. Beto talks about Texas and Paso, which makes sense because El Paso is his hometown. For Eric Swalwell, torch is the most identifiable word, referencing his infamous "pass the torch" speech; it even has the highest value on the x axis, which shows how strongly it identifies him. Tf-idf can be applied to any n-gram, too. You may have noticed that the very first graph in this article shows the top 5 trigrams for each candidate.

df %>% 
  unnest_tokens(trigram, "text", token = "ngrams", n = 3) %>% 
  count(character, trigram) %>% 
  bind_tf_idf(trigram, character, n) %>% 
  group_by(character) %>% 
  top_n(5, tf_idf) %>% 
  mutate(rank = rank(desc(tf_idf), ties.method = "random")) %>% # break ties randomly
  arrange(character, rank) %>% 
  filter(rank <= 5) %>% 
  ggplot(aes(x = reorder_within(trigram, tf_idf, character), 
             y = tf_idf, 
             color = character, 
             fill = character)) + 
  geom_col() +
  scale_x_reordered() +
  facet_wrap(~character, scales = "free") +
  coord_flip() + 
  theme(legend.position = "none")

[Figure: top 5 trigrams by tf-idf for each candidate]

Conclusion

Bag-of-words and n-grams are the foundation of text mining and other NLP topics. They are a simple, elegant way of converting text into numerical features, and tf-idf is another tool that adds to their effectiveness. These basic concepts can take you far into the world of text mining. Consider building a model on bag-of-words and n-grams, seeing how tf-idf changes the model's accuracy, using lexicons to apply sentiment values to the bag-of-words, or handling negations using bigrams. Text mining is a relatively unexplored frontier compared to its counterparts in data mining. However, with the rise of smartphones, online business, and social media, text mining will be needed to process the millions of texts posted, shared, and retweeted every day.
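As one example of the lexicon idea above, here is a hedged sketch that joins the bag-of-words to the Bing sentiment lexicon bundled with tidytext and tallies positive versus negative words per candidate. It reuses the df data frame from earlier, and raw word counts are only one of many possible ways to score sentiment.

df %>% 
  unnest_tokens(word, "text", token = "ngrams", n = 1) %>% 
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words found in the lexicon
  count(character, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(net_sentiment = positive - negative) %>% 
  arrange(-net_sentiment)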

