
Bigram and sentiment analysis applied to READMEs from top GitHub repositories

source link: https://www.tuicool.com/articles/hit/aAzyYvv

Basics of Text Analysis & Visualization

[Figure: Bi-gram tag cloud made from READMEs of the Top 2,000 GitHub repositories]

“Good software is approachable, consistent, explains itself, teaches, and is for humans.” — Mike Bostock

As a UX engineer, I think a lot about the user. When writing software, the wants and needs of the user come first and foremost. However, I often use, contribute to, and write open source software. Open source projects also need to be designed with their users, other developers, in mind.

Previously, I discussed a project aimed at surveying the ways in which software engineers describe their work. The objective is to understand what makes good software by understanding how good software is described. My definition of “good” is highly rated or starred.

I scraped the READMEs of the top 2,000 repositories on GitHub and visualized the words that showed up most frequently. While I am not a data scientist, I frequently analyze and visualize data sets, and I want to share the results and discoveries that have come from this project.

For initial results and context, see Open Source Words Part 3

For more on the extraction and cleanup, see Open Source Words — Part 1

Tag Clouds

The simplest and most common form of text visualization is a tag (or word) cloud. They depict tags arranged in space, varied in size, color, and position based on tag frequency, categorization, or significance.

[Figure: Word cloud from the READMEs of the Top 2,000 GitHub repositories]

In this simple example, color and position are arbitrary, but font size varies with word frequency. Even counting is complicated: more specifically, these words are sized by total word frequency (every occurrence counts) rather than unique word frequency (each word counts once per document).

import nltk
from collections import Counter

unique_frequencies = Counter()  # each word counts once per README
total_frequencies = Counter()   # every occurrence counts
for readme in readmes:
    words = nltk.word_tokenize(readme)
    fdist = nltk.FreqDist(words)
    for word, freq in fdist.most_common(50):
        total_frequencies[word] += freq  # total count
        unique_frequencies[word] += 1    # unique count

This is an example of counting the total and unique frequencies of words within a dataset. Tim Strehle has a much more comprehensive example that includes case normalization, tokenization, Part-of-Speech (POS) tagging, and the removal of stop words, punctuation, etc. This type of cleanup is often necessary prior to analysis.
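The cleanup steps mentioned above can be sketched with nothing but the standard library. This is a deliberately minimal version: the stop-word list here is a tiny illustrative set (a real pipeline would use something like nltk.corpus.stopwords), and POS tagging is omitted.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller set
# such as nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "this", "with"}

def clean_tokens(text):
    """Lowercase, tokenize on letter runs, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

readme = "This library is a fast parser for JSON. Install the library with pip."
print(Counter(clean_tokens(readme)).most_common(3))
# → [('library', 2), ('fast', 1), ('parser', 1)]
```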

Collocate clouds

[Figure: Wordle Bi-gram cloud from Nik’s QuickShout]

Collocate clouds are another variation of tag clouds, visualizing words that frequently collocate (are found next to one another). They fall under the general class of N-gram problems, the most common examples being bi-grams (two words) and tri-grams (three).

N-grams have many applications in genomics and are used in algorithms for grammar correction and text compression. The hero image above is a bi-gram cloud of the most frequent word pairs found in the README dataset.

import nltk
from collections import Counter
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

bi_dict = Counter()  # how many READMEs each collocation appears in
bg_measures = BigramAssocMeasures()
for readme in readmes:
    words = nltk.word_tokenize(readme)
    bi_finder = BigramCollocationFinder.from_words(words)
    bi_collocs = bi_finder.nbest(bg_measures.likelihood_ratio, 10)
    for colloc in bi_collocs:  # count each top collocation once per README
        bi_dict[colloc] += 1

Above is an example using nltk (the Natural Language Toolkit) to extract bi-grams from a text dataset.
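The collocation finder ranks pairs by a statistical association measure (here, likelihood ratio); the underlying n-grams themselves are just sliding windows over the token sequence, which can be sketched without nltk at all. The helper below is my own illustration, not part of the article's pipeline:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every window of n consecutive tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

words = "the quick brown fox jumps over the quick brown dog".split()
bigrams = Counter(ngrams(words, 2))
print(bigrams.most_common(2))
# → [(('the', 'quick'), 2), (('quick', 'brown'), 2)]
```

Raw frequency like this over-counts common-word pairs ("of the", "in a"), which is exactly why collocation measures such as the likelihood ratio are used instead.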

Sentiment Analysis

[Figure: VADER sentiment analysis, compound score distribution of sentences from 2,000 READMEs; 1 is positive, -1 is negative]

Sentiment analysis is the process of computationally categorizing text based on the writer’s attitude toward a topic. It can be especially useful on social media feeds like comment threads to get a general sense of whether users are talking positively, negatively, or neutrally about a product. It fits broadly under the group of machine learning classification algorithms and works best when trained on relevant datasets.

For the purposes of learning, I used VADER sentiment analysis since it comes packaged with nltk. VADER “is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.” Below is an example using VADER in Python:

import nltk
from collections import Counter
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download("vader_lexicon")  # required once before first use
sid = SentimentIntensityAnalyzer()
sentiment_summary = Counter()
for readme in readmes:
    sentences = nltk.tokenize.sent_tokenize(readme)
    for sentence in sentences:
        sentiment_score = sid.polarity_scores(sentence)
        if sentiment_score["compound"] == 0.0:
            sentiment_summary["neutral"] += 1
        elif sentiment_score["compound"] > 0.0:
            sentiment_summary["positive"] += 1
        else:
            sentiment_summary["negative"] += 1

In the chart above, nearly 2,000 READMEs were tokenized into more than 120,000 sentences with an overall compound score of 0.143. One interpretation might be that, on average, developers speak positively about their libraries and projects. Yet nearly half of compound scores were neutral, suggesting that developers are writing without ascribing sentiment.

It’s important to note that these results are far from conclusive, especially given the lack of a relevant training set and the reliance on social media corpora; they are intended for illustrative purposes only. Though I do not consider this analysis valid, I do find it interesting. Here is an example of a strongly negative (compound < -0.95) sentence:

If the issue is still not solved, see the guides for common problems: A cask fails to install: curl error Permission denied error Checksum does not match error source is not there error wrong number of arguments error Unlisted reason uninstall wrongly reports cask as not installed Error: Unknown command: cask error My problem isnt listed Requests Cask requests will be automatically closed.

It’s my assumption that words like fails, wrong, and error are responsible for classifying this example as strongly negative.
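That assumption can be illustrated with a toy lexicon-based scorer. This is a drastic simplification of what VADER actually does (real VADER also handles negation, intensifiers, punctuation, and capitalization), and the valence values below are made up for the example, but it shows how a few negatively-weighted words dominate a sentence's score:

```python
# Toy sentiment lexicon with invented valence values -- NOT VADER's lexicon.
LEXICON = {"fails": -1.5, "wrong": -1.2, "error": -1.4, "good": 1.9, "love": 2.0}

def toy_score(sentence):
    """Average the valence of the words found in the lexicon; 0.0 if none."""
    hits = [LEXICON[w] for w in sentence.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(toy_score("the install fails with a checksum error"))  # negative
print(toy_score("please read the documentation first"))      # 0.0, no lexicon hits
```

A README sentence packed with "fails", "wrong", and "error" scores negative even though, to a developer, it is neutral troubleshooting language, which is the core mismatch between a social-media lexicon and technical prose.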

For more information, see Sentiment analysis with NLTK/VADER

Other analyses

This article didn’t cover topic modeling, summarization, subject identification, stemming, entity recognition, and so many other topics.

Whether using this README dataset or another, I intend to keep exploring other areas of data science and visualization. As I learn, I will share my process and results on Medium, as well as data and source code on GitHub.

For a comprehensive overview of natural language processing methods, visualizations, and examples, I’d recommend these resources:

