Decoding the Covid-19 tweets using NLP and Graph Database

 3 years ago
source link: https://towardsdatascience.com/decoding-the-covid-19-tweets-using-nlp-and-graph-database-35e1b406f439

Building the Twitter graph to find insights into covid tweets

Everything is connected: leveraging a graph to connect information and reveal insights from COVID-19 tweets. Image source: Unsplash, by Pietro Jeng.

COVID-19 vaccination is a hot topic on social media, especially Twitter, so why not use Natural Language Processing (NLP) and a graph database (Neo4j) to find insights in vaccine-related tweets?

Neo4j provides excellent tools and libraries for working with connected data, and many scientific projects build on them. This article focuses on how to use NLP in Neo4j for text analysis. The neo4j-nlp libraries let us run basic sentiment analysis to understand people's opinions about vaccination.

So let's start building from scratch.

Graph network of tweets. Image by Author

Analysis Flow :

Data processing flow. Image source Author

Data source: I took the Twitter data set corona_tweets_268.csv from IEEE Dataport. It contains 3,190,245 corona-related tweets collected from December 11, 2020, 08:00 AM to December 12, 2020, 10:10 AM.

Twitter does not allow data sets to store or share entire user posts, so once you have the tweet IDs you have to use a tool such as Hydrator to retrieve the post information for each ID.

Once you have the Twitter post information in a JSONL file, you can convert it to JSON, load it in any programming language for data cleaning and massaging, and then import the result into Neo4j. This is one of the more tedious tasks.
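The conversion step can be sketched in Python as follows. This is a minimal sketch, not the author's script. It assumes that Hydrator wrote one tweet object per line in the Twitter API v1.1 layout (`id_str`, `full_text`, `created_at`, and a nested `user` object), and the file names in the usage comment are placeholders. The output keys match the properties read by the import query in Step 1.

```python
import json


def flatten_tweet(raw):
    """Map one hydrated tweet to the flat record used by the import query.

    The input keys (id_str, full_text, created_at, user.*) are an assumption
    about the Hydrator output; adjust them to match your file.
    """
    user = raw.get("user") or {}
    return {
        "twitter_id": raw.get("id_str"),
        # Prefer full_text and fall back to text if it is missing.
        "tweet": raw.get("full_text") or raw.get("text") or "",
        "post_created_at": raw.get("created_at"),
        "user_id": user.get("id_str"),
        "user_name": user.get("name"),
        "user_screen_name": user.get("screen_name"),
        "user_description": user.get("description"),
    }


def jsonl_to_records(lines):
    """Parse JSONL lines, skip blank lines, and flatten each tweet."""
    return [flatten_tweet(json.loads(line)) for line in lines if line.strip()]


def convert_file(jsonl_path, json_path):
    """Read a JSONL file, write a JSON array, and return the record count."""
    with open(jsonl_path, encoding="utf-8") as src:
        records = jsonl_to_records(src)
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst, ensure_ascii=False)
    return len(records)


# Usage (placeholder file names):
# convert_file("tweets.jsonl", "twitter.json")
```

Writing a single JSON array keeps the import simple, because apoc.load.json yields one row per array element.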

Note: for the sake of simplicity, you can download twitter.json, which contains 5000 tweets, from the GitHub link in the references below.

Core Tools and Libraries:

We are going to use the Neo4j database where we will store tweets and do further analysis.

Plugins to be installed in neo4j:

APOC plugin
Graph Data Science plugin

GraphAware: GraphAware offers many libraries that add NLP capabilities to Neo4j, so we will use them.

List of libraries provided by GraphAware:

neo4j-framework
neo4j-nlp
neo4j-nlp-stanfordnlp
stanford-english-corenlp (language model): this is the language model file and needs to be downloaded separately

Setup is one of the most difficult parts because of the many dependent libraries and versions.

Follow the installation steps strictly in the order given in the guidelines to set up the GraphAware libraries in Neo4j.

Because the documentation is sparse, you may get lost trying to find the right libraries and versions, so I prepared a document to help you find the libraries and their matching versions.

I use the Bloom tool for visualization, which is optional.

At the end of the setup, your neo4j.conf file should look like this:

# nlp settings
apoc.import.file.enabled=true
dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
com.graphaware.module.NLP.2=com.graphaware.nlp.module.NLPBootstrapper
dbms.security.procedures.whitelist=ga.nlp.*,gds.*,apoc.*
The plugins folder should contain these jar files
The import folder should contain the twitter.json file that you created earlier, or that you downloaded from my GitHub link.

Text Analysis :

We will follow these steps for text analysis:

1: Bulk upload the tweets using apoc.load.json and create the Tweet nodes
2: Create the Hashtag nodes from tweets using regex
3: Create the User nodes from tweets using regex
4: Annotate tweets using the NLP pipeline
5: Query the graph to analyse tweets using sentiment analysis, Named Entity Recognition, etc.
In the end, you will be able to build this beautiful graph

Step 1: Bulk upload and Tweet node creation:

// Use apoc.load.json to bulk upload the tweets
CALL apoc.load.json("file:///twitter.json")
YIELD value as t
CREATE (tweet:Tweet {
user_id: t.user_id,
user_name: t.user_name,
user_screen_name: t.user_screen_name,
tweets: t.tweet,
twitter_id: t.twitter_id,
post_created_at: t.post_created_at,
user_description: t.user_description
})
RETURN count(t);
User tweets

Step 2: Create the Hash Tag Nodes from tweets using regex

// Extract hashtags from tweets
MATCH (t:Tweet)
WHERE t.tweets CONTAINS "#" // keep only tweets that contain a # character
// regexGroups returns one list per match: [full match, group 1]
UNWIND apoc.text.regexGroups(t.tweets, "(#\\w+)") AS match
WITH t, match[1] AS hashtag // group 1 holds the hashtag text
MERGE (h:Hashtag { name: toUpper(hashtag) }) // create the hashtag node
MERGE (h)<-[:hashtag { used: hashtag }]-(t)
RETURN count(h)
Hash tag Nodes

Step 3: Create the User Node mentioned in tweets using regex

// Extract user mentions from tweets
MATCH (t:Tweet)
WHERE t.tweets CONTAINS "@" // keep only tweets that contain an @ character
// regexGroups returns one list per match: [full match, group 1]
UNWIND apoc.text.regexGroups(t.tweets, "(@\\w+)") AS match
WITH t, match[1] AS mention // group 1 holds the mention text
MERGE (u:User { name: mention }) // create the User node
MERGE (u)<-[:mention]-(t)
RETURN count(u);
The user mentioned in tweets
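
Before running the regular expressions from Steps 2 and 3 against the graph, it can help to check what they match on a sample string. The sketch below uses the same two patterns in Python. It is an illustration only: the sample tweet is made up, and `re.ASCII` is set on the assumption that the Cypher side uses Java's default ASCII-only meaning of `\w`.

```python
import re

# Same patterns as in the Cypher queries; re.ASCII mirrors Java's default \w.
HASHTAG_RE = re.compile(r"#\w+", re.ASCII)
MENTION_RE = re.compile(r"@\w+", re.ASCII)


def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


def extract_hashtags(text):
    """Return all hashtags, upper-cased like the Hashtag node names."""
    return unique_in_order(tag.upper() for tag in HASHTAG_RE.findall(text))


def extract_mentions(text):
    """Return all @mentions exactly as written in the tweet."""
    return unique_in_order(MENTION_RE.findall(text))


# Made-up sample tweet, for illustration only.
sample = "Got my #vaccine today, thanks @WHO! #Vaccine #COVID19"
print(extract_hashtags(sample))  # ['#VACCINE', '#COVID19']
print(extract_mentions(sample))  # ['@WHO']
```

Upper-casing before de-duplication mirrors the toUpper call in Step 2, so #vaccine and #Vaccine end up as the same hashtag.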

Step 4: The core NLP work starts here. First we build a pipeline that annotates text and performs Named Entity Recognition and sentiment analysis in a single query.

// Create the NLP pipeline
CALL ga.nlp.processor.addPipeline({textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor',
  name: 'customStopWords', processingSteps: {tokenize: true, ner: true, dependency: false, sentiment: true},
  stopWords: '+,result, all, during', threadNumber: 20})

This creates a pipeline named customStopWords that uses com.graphaware.nlp.processor.stanford.StanfordTextProcessor as its text processor, with the tokenize, ner, and sentiment steps enabled.

CALL ga.nlp.processor.getPipelines() // To query about your pipeline
NLP Pipeline

Do not annotate the whole text data set in a single query. If the data set is large, you will end up in a deadlock. Instead, use apoc.periodic.iterate to annotate the tweets one by one.

CALL apoc.periodic.iterate(
"MATCH (t:Tweet) return t",
"CALL ga.nlp.annotate({text: t.tweets, id: id(t)})
YIELD result MERGE (t)-[:HAS_ANNOTATED_TEXT]->(result)", {batchSize:1, iterateList:true})

Note: Annotation can run into memory consumption and deadlock issues, so read up on how to increase the Neo4j heap size and how to add indexes.
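
As a starting point, the heap can be raised in neo4j.conf. The setting names below are the ones used by Neo4j 3.5 and 4.x, the same generation as the dbms.security.procedures.whitelist setting shown earlier. The sizes are placeholder assumptions and should be chosen to fit the memory of your machine.

```properties
# memory settings (the sizes are illustrative assumptions)
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
```

Restart Neo4j after changing these values so that they take effect.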

This is the annotated text: each sentence is tokenized and broken into words tagged with their part of speech (noun, verb, and so on)
Annotated Text and their relationships with others Image by Author

Named Entity Recognition :

MATCH (n:NER_Person) RETURN n LIMIT 25
The person mentioned in Tweets
MATCH (n:NER_Organization) RETURN n LIMIT 25
Organisation mentioned in Tweets

Now the fun part is querying the data for insights.

As you have seen, Pfizer is one pharma company that is mentioned quite often in tweets about its vaccine development. Let’s see the sentiment (positive or negative) of posts that mention this entity.

MATCH (:NE_Organization{value: 'Pfizer'})-[]-(s:Sentence:Negative)-[]-(:AnnotatedText)-[]-(tweet:Tweet) 
RETURN distinct tweet.tweets;
Most people talk negatively about it. I have not found a single positive sentiment about Pfizer's recent development.

How about people's opinions about vaccination?

People’s tweets supporting vaccination:

MATCH (n:NER_O)-[]-(s:Sentence:Positive)-[]-(:AnnotatedText)-[]-(tweet:Tweet) 
where n.value starts with 'vacc'
RETURN distinct tweet.tweets;
There is still hope.

People’s tweets against the vaccination :

MATCH (n:NER_O)-[]-(s:Sentence:Negative)-[]-(:AnnotatedText)-[]-(tweet:Tweet) 
where n.value starts with 'vacc'
RETURN distinct tweet.tweets;
2nd tweet is too good

Wow, I am having fun exploring the power of the graph to analyse text data. Let’s see more

Organizations mentioned for the wrong (negative) reasons:

MATCH (n:NER_Organization)-[]-(s:Sentence:Negative)-[]-(:AnnotatedText)-[]-(tweet:Tweet) 
RETURN distinct n.value,tweet.tweets;
Trump has earned his place, last tweet by FDA is really funny :P

People mentioned with positive sentiment:

MATCH (n:NER_Person)-[]-(s:Sentence:Positive)-[]-(:AnnotatedText)-[]-(tweet:Tweet) 
RETURN distinct n.value as name, tweet.tweets as tweets
The sentiment analysis is not too intelligent: it treats a covid-positive patient as positive sentiment, e.g. Neet Kapoor, who is mentioned in these tweets
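
One rough way to reduce this kind of false positive is to filter the query results for phrases where "positive" describes a test result. The sketch below is a simple keyword heuristic in Python, not part of the GraphAware pipeline and not the author's method. The pattern and the sample tweets are assumptions made for illustration, and the heuristic will miss many phrasings.

```python
import re

# Phrases where "positive" describes a test result rather than an opinion.
TEST_RESULT_RE = re.compile(
    r"\btest(?:s|ed|ing)?\s+(?:covid\s+)?positive\b"
    r"|\bcovid(?:-19)?[\s-]positive\b",
    re.IGNORECASE,
)


def is_test_result_positive(text):
    """Return True if the text talks about a positive covid test."""
    return TEST_RESULT_RE.search(text) is not None


def drop_test_result_tweets(tweets):
    """Keep only tweets that do not mention a positive test result."""
    return [t for t in tweets if not is_test_result_positive(t)]


# Made-up sample tweets, for illustration only.
samples = [
    "The actor tested positive for COVID-19 today",
    "Great news, the vaccine trial results look positive",
    "He is covid positive and in isolation",
]
print(drop_test_result_tweets(samples))
```

Only the second sample survives the filter, because the other two use "positive" for a test result.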

People and organizations talking about the vaccine:

These are a few of the many tweets that mention vaccines. It seems the world is on a roll
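
The sentiment queries in this section all follow one pattern: an entity label, a sentiment label on the sentence, and an optional prefix filter on the entity value. A small helper can generate them. The sketch below is a hypothetical helper written for this illustration. The function name and the allow-lists are assumptions, and it only builds the query text and parameters without connecting to Neo4j.

```python
# Labels that appear in the queries of this article.
ENTITY_LABELS = {"NER_Person", "NER_Organization", "NER_O"}
SENTIMENT_LABELS = {"Positive", "Negative"}


def build_sentiment_query(entity_label, sentiment, prefix=""):
    """Return (cypher, params) for tweets that mention an entity in a
    sentence with the given sentiment.

    Labels cannot be passed as Cypher parameters, so they are checked against
    a fixed allow-list before being placed in the query text.
    """
    if entity_label not in ENTITY_LABELS:
        raise ValueError(f"unknown entity label: {entity_label!r}")
    if sentiment not in SENTIMENT_LABELS:
        raise ValueError(f"unknown sentiment label: {sentiment!r}")
    cypher = (
        f"MATCH (n:{entity_label})-[]-(s:Sentence:{sentiment})"
        "-[]-(:AnnotatedText)-[]-(tweet:Tweet) "
        # An empty prefix matches every non-null value.
        "WHERE n.value STARTS WITH $prefix "
        "RETURN DISTINCT n.value AS entity, tweet.tweets AS tweet"
    )
    return cypher, {"prefix": prefix}


# Rebuild the "tweets against vaccination" query from above.
query, params = build_sentiment_query("NER_O", "Negative", prefix="vacc")
print(query)
print(params)
```

The prefix is passed as a query parameter rather than pasted into the text, so the same query string can be reused for other search terms.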

And the list goes on; there is a lot more you can do. It's really fascinating to uncover information using connected data, and this is just one of many simple use cases.

This is the typical flow for any kind of text analysis, whether on news articles, feedback, or review systems, and doing it with a graph is really fun.

Drop your feedback or a message if you get stuck on setup or configuration.

Cheers!!

Reference :

1: Twitter.json

2: Neo4j: Natural Language Processing (NLP) in Cypher, an article by David Allen; most of the inspiration came from this article

3: Neo4j NLP

4: Neo4j

5: Graphaware

