
How are Americans reacting to Covid-19?

source link: https://towardsdatascience.com/how-are-americans-reacting-to-covid-19-700eb4d5b597?gi=69e744142b0c

Using Twitter and sentiment analysis to answer the question


Image credit: Twitter

The Covid-19 pandemic poses an unprecedented challenge to the entire world. With the most confirmed cases and deaths, America is one of the countries hardest hit by the virus. As states begin to partially reopen, the nation has become very polarized on the topic. Some firmly advocate for this measure, citing the importance of the country’s economic health. Yet, others have a strong objection to this, contending that the human cost of reopening can’t be justified. At a time when tensions are high, I sought to gain a better understanding of how exactly Americans feel about the current state of affairs surrounding Covid-19.

In an attempt to answer this question, Srihan Mediboina and I worked together to scrape tweets related to Covid-19 from Twitter and perform sentiment analysis on them. To find out how reactions varied across America, we used tweets from New York, Texas, and California. Let’s get into the project!

Getting Twitter Data


Image credit: Tweepy

Before we can get our Twitter API credentials, we need to apply for a Twitter developer account. Once our application is approved, we can use Tweepy to access the API and download all the tweets for a hashtag. Calling the search_for_hashtag function lets us quickly scrape data across hashtags (#coronavirus, #Covid-19, #NewYork, #California, and #Texas were some of the hashtags we used). For a more in-depth look at Tweepy, check out this article.
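The scraping code is omitted here, but a minimal sketch of what a search_for_hashtag helper might look like is below. The credentials are placeholders, the 'tweet text' column name and the default count are assumptions, and recent Tweepy releases expose the search endpoint as api.search_tweets (older versions called it api.search):

```python
import tweepy
import pandas as pd

# Placeholder credentials from the Twitter developer dashboard
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def search_for_hashtag(hashtag, count=500):
    # Pull up to `count` recent English tweets mentioning the hashtag
    tweets = tweepy.Cursor(api.search_tweets, q=hashtag, lang="en",
                           tweet_mode="extended").items(count)
    return pd.DataFrame({"tweet text": [t.full_text for t in tweets]})

ca_df = search_for_hashtag("#California")
```

Calling the helper once per hashtag and concatenating the results gives one dataframe per state.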

We performed the sentiment analysis using a Naive Bayes classifier, which requires labeled data because it’s a supervised learning algorithm. Thus, we manually labeled 500 tweets from each of the three states for a total of 1,500 tweets. Each tweet received either a -1 for negative sentiment, a 0 for neutral sentiment, or a 1 for positive sentiment. If you’re interested in performing your own analysis, here’s a link to the data.


First 5 rows of the California tweet dataset
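If you download the labeled data yourself, loading it might look something like this (the file names are assumptions; the label column follows the -1/0/1 scheme described above):

```python
import pandas as pd

# One CSV per state: the tweet text plus a hand-assigned label
# (-1 = negative, 0 = neutral, 1 = positive)
ca_df = pd.read_csv("california_tweets.csv")
ny_df = pd.read_csv("newyork_tweets.csv")
tx_df = pd.read_csv("texas_tweets.csv")

print(ca_df.head())  # first 5 rows, as shown above
```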

Tokenizing

Now we tokenize the tweets by splitting them into individual words (called tokens). Without tokens, we can’t carry out the subsequent steps involved in sentiment analysis. This process becomes simple when we import TweetTokenizer from the Natural Language Toolkit (nltk). The tokenize_tweets function is only two lines of code, and we can apply it to the dataframes to break up the tweets. nltk is a very powerful package for sentiment analysis, so we’ll be using it throughout the article.
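A sketch of what that two-line helper could look like, assuming each dataframe keeps its tweets in a 'tweet text' column:

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

def tokenize_tweets(df):
    # Replace each tweet string with its list of tokens
    df["tweet text"] = df["tweet text"].apply(tokenizer.tokenize)
    return df

ca_df = tokenize_tweets(ca_df)
ny_df = tokenize_tweets(ny_df)
tx_df = tokenize_tweets(tx_df)
```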


CA dataset after tokenization

Stopwords

Stopwords are common words such as “the”, “a”, and “an”. Since these words don’t further our understanding of the sentiment of the text, we filter them out. By importing stopwords from nltk, this step becomes pretty simple: the remove_stopwords function is also just two lines of code.
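Again as a rough sketch, assuming the same 'tweet text' column of token lists:

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))  # needs nltk.download("stopwords") once

def remove_stopwords(df):
    # Drop tokens that are common English stopwords
    df["tweet text"] = df["tweet text"].apply(
        lambda tokens: [w for w in tokens if w.lower() not in stop_words])
    return df

ca_df = remove_stopwords(ca_df)
```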


Some of the stopwords that were removed from the first few rows of our California dataset include “some”, “can”, “just”, and “for”.

Cleaning Text

In addition to removing stopwords, we want to make sure that any stray characters in our data frames are also removed. For example, several characters such as ‘x97’ and ‘xa3’ appeared in the CSV files after we scraped the tweets. After iterating through to find these miscellaneous characters, we copy-pasted them into the CleanTxt function. Then, we applied the function to each data frame to remove them.
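A simplified version of such a cleaning function might look like this; the pattern below only covers the two artifacts mentioned above plus the ‘#’ symbol, whereas the real function listed every artifact found in the CSVs:

```python
import re

# Only the example artifacts from the text plus '#'; the authors' list was longer
JUNK = re.compile(r"x97|xa3|#")

def CleanTxt(tokens):
    cleaned = [JUNK.sub("", t) for t in tokens]
    return [t for t in cleaned if t]  # drop tokens left empty after cleaning

ca_df["tweet text"] = ca_df["tweet text"].apply(CleanTxt)
```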


As we can see, hashtags were the most prevalent characters that were removed. By cleaning the text, we can improve our model’s performance.

Lemmatizing

Often, words referring to the same thing appear in different forms (ex. trouble, troubling, troubled, and troubles all essentially refer to trouble). By lemmatizing the text, we group various inflections of a word together to analyze them as the word’s lemma (how it appears in the dictionary). This process prevents the computer from mistaking different forms of a word for different words. We import WordNetLemmatizer from nltk and call the lemmatize_tweets function for this purpose.
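A possible shape for that helper, again assuming the 'tweet text' column of token lists (note that without part-of-speech tags, WordNetLemmatizer treats each word as a noun):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # needs nltk.download("wordnet") once

def lemmatize_tweets(df):
    # Reduce each token to its dictionary (lemma) form
    df["tweet text"] = df["tweet text"].apply(
        lambda tokens: [lemmatizer.lemmatize(w) for w in tokens])
    return df

ca_df = lemmatize_tweets(ca_df)
```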

Master Dataset

Since we’re done with the preprocessing steps, we can move on to creating a master dataset that encompasses all 1,500 tweets. Using df.itertuples, we can iterate over dataframe rows as tuples and append the ‘tweet text’ and ‘values’ attributes to our dataset. Then, we shuffle the dataset using random.shuffle so the later train/test split isn’t biased toward any one state.
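Roughly, the master dataset could be built like this; the sketch assumes the tweet tokens are the first column of each dataframe and the label the second:

```python
import random

# Combine the three state dataframes into one list of (tokens, label) pairs
dataset = []
for df in (ca_df, ny_df, tx_df):
    for row in df.itertuples(index=False):
        dataset.append((row[0], row[1]))  # 'tweet text' tokens, 'values' label

random.shuffle(dataset)  # mix the states so the split isn't ordered by state
```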

Following that step, we iterate through all of the data frames and add every single word to the all_words list. Next, we use nltk.FreqDist to create a frequency distribution of the words. Since some words are more common than others, we want to ensure that the most relevant words are used to train our Naive Bayes classifier. Currently, each tweet is a list of words. However, we can represent each tweet as a dictionary instead: the keys are the word features and the values are True or False depending on whether the tweet contains that word feature. This dictionary representation of a tweet is known as a feature set. We generate a feature set for each tweet and train our Naive Bayes classifier on the feature sets.
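A sketch of the feature-set construction; the cutoff of 2,000 word features is an assumption, not a number from the original code:

```python
import nltk

# Frequency distribution over every token in the corpus
all_words = [w.lower() for tokens, _ in dataset for w in tokens]
freq = nltk.FreqDist(all_words)

# Keep the most common words as the features the classifier will look at
word_features = [w for w, _ in freq.most_common(2000)]

def find_features(tokens):
    # A tweet becomes {word_feature: True/False} based on which words it contains
    present = {w.lower() for w in tokens}
    return {w: (w in present) for w in word_features}

feature_sets = [(find_features(tokens), label) for tokens, label in dataset]
```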

Training/Testing the Model

The feature_sets will be split 80/20 into the training and testing set, respectively. After training the Naive Bayes classifier on the training set, we can check its performance by comparing its predictions for the sentiment of tweets (results[i]) against the labeled sentiment of the tweets (testing_set[i][0]).
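In outline, the training and evaluation step might look like the following (here each testing example is a (features, label) pair, so the label sits at index 1; the original code may order the tuple differently):

```python
import nltk

# 80/20 split (the data was already shuffled, so slicing is fine)
split = int(0.8 * len(feature_sets))
training_set, testing_set = feature_sets[:split], feature_sets[split:]

classifier = nltk.NaiveBayesClassifier.train(training_set)

# Compare predictions with the hand-labeled sentiments
results = [classifier.classify(feats) for feats, _ in testing_set]
labels = [label for _, label in testing_set]
error_pct = 100 * sum(p != a for p, a in zip(results, labels)) / len(labels)
print(f"error percentage: {error_pct:.0f}%")
```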


Our output shows the predicted value on the left and the actual value on the right. The error percentage of 40% is very high: the model is right only about 3 times out of 5. Improvements that could make the model more accurate include using a larger training set or using a validation set to compare different models before selecting the best-performing one.

Using the Model

With the model trained and tested, we can now use it to make predictions on a fresh batch of tweets. We scrape more tweets and run them through the same preprocessing steps as before, giving us the new dataframes ca_new_df, ny_new_df, and tx_new_df. The predictions of our classifier are stored in results_new_ca, results_new_ny, and results_new_tx. Our last step is to use the sentiment_percent function to quantify the percentages, as sketched below.
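The sentiment_percent helper isn’t shown in the article, but a minimal version could simply count the predicted labels:

```python
from collections import Counter

def sentiment_percent(results):
    # Share of predicted -1 / 0 / 1 labels, as percentages of all new tweets
    counts = Counter(results)
    total = len(results)
    return {label: round(100 * counts[label] / total, 1) for label in (-1, 0, 1)}
```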

sentiment_percent(results_new_ca)
sentiment_percent(results_new_ny)
sentiment_percent(results_new_tx)

In our results, only around 6% of California tweets were positive, while around 27% of Texas tweets were negative. California and New York both had about 73% of their tweets classified as neutral, with their positive and negative percentages differing by around 4%. Texas had the highest share of negative tweets, but also the highest share of positive tweets at around 10%, since a smaller portion of its tweets were neutral. It’s important to keep in mind that our model was only about 60% accurate, so these results probably aren’t the most reliable indicator of the real sentiment expressed in these tweets.

Some code was omitted from this article for the sake of brevity. Click here for the full code.


