
Bigram and sentiment analysis applied to READMEs from top GitHub repositories

source link: https://www.tuicool.com/articles/hit/aAzyYvv

Basics of Text Analysis & Visualization

[Figure: Bi-gram tag cloud made from READMEs of the Top 2,000 GitHub repositories]

“Good software is approachable, consistent, explains itself, teaches, and is for humans.” — Mike Bostock

As a UX engineer, I think a lot about the user. When writing software, the wants and needs of the user come first and foremost. However, I often use, contribute to, and write open source software. Open source projects also need to be designed with their users, other developers, in mind.

Previously, I discussed a project aimed at surveying the ways in which software engineers describe their work. The objective is to understand what makes good software by understanding how good software is described. My definition of “good” is highly rated or starred.

I scraped the READMEs of the top 2,000 repositories on GitHub and visualized the words that showed up most frequently. While I am not a data scientist, I frequently analyze and visualize data sets, and I want to share the results and discoveries that have come from this project.

For initial results and context, see Open Source Words Part 3

For more on the extraction and cleanup, see Open Source Words — Part 1

Tag Clouds

The simplest and most common form of text visualization is a tag (or word) cloud. They depict tags arranged in space, varied in size, color, and position based on tag frequency, categorization, or significance.

[Figure: Word cloud from the READMEs of the Top 2,000 GitHub repositories]

In this simple example, color and position are arbitrary, but font size varies with word frequency. Even counting is complicated: more specifically, these words are sized by total word frequency (every occurrence counts) rather than unique word frequency (each word counts once per document).

import nltk
from collections import Counter

unique_frequencies = Counter()  # each word counts once per README
total_frequencies = Counter()   # every occurrence counts
for readme in readmes:
    words = nltk.word_tokenize(readme)
    fdist = nltk.FreqDist(words)
    for word, freq in fdist.most_common(50):
        total_frequencies[word] += freq  # total count
        unique_frequencies[word] += 1    # unique count

This is an example of counting the total and unique frequencies of words within a dataset. Tim Strehle has a much more comprehensive example that includes case normalization, tokenization, Part-of-Speech (POS) tagging, and the removal of stop words, punctuation, etc. This type of cleanup is often necessary prior to analysis.
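The cleanup steps mentioned above can be sketched with nothing but the standard library. This is a deliberately minimal version: the stop-word list here is a tiny illustrative set (a real pipeline would use something like nltk.corpus.stopwords), and POS tagging is omitted.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller set
# such as nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "this", "with"}

def clean_tokens(text):
    """Lowercase, tokenize on letter runs, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

readme = "This library is a fast parser for JSON. Install the library with pip."
print(Counter(clean_tokens(readme)).most_common(3))
# → [('library', 2), ('fast', 1), ('parser', 1)]
```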

Collocate clouds

[Figure: Wordle Bi-gram cloud from Nik’s QuickShout]

Collocate clouds are another variation of tag clouds, visualizing words that frequently collocate (are found next to one another). They fall under the general class of N-gram problems, the most common examples being bi-grams (two words) and tri-grams (three).

N-grams have many applications in genomics and are used in algorithms for grammar correction and text compression. The hero image above is a bi-gram cloud of the most frequent word pairs found in the README dataset.

import nltk
from collections import Counter
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

bi_dict = Counter()  # how many READMEs each collocation appears in
bg_measures = BigramAssocMeasures()
for readme in readmes:
    words = nltk.word_tokenize(readme)
    bi_finder = BigramCollocationFinder.from_words(words)
    bi_collocs = bi_finder.nbest(bg_measures.likelihood_ratio, 10)
    for colloc in bi_collocs:  # count each top collocation once per README
        bi_dict[colloc] += 1

Above is an example using nltk (the Natural Language Toolkit) to extract bi-grams from a text dataset.
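The collocation finder ranks pairs by a statistical association measure (here, likelihood ratio); the underlying n-grams themselves are just sliding windows over the token sequence, which can be sketched without nltk at all. The helper below is my own illustration, not part of the article's pipeline:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every window of n consecutive tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

words = "the quick brown fox jumps over the quick brown dog".split()
bigrams = Counter(ngrams(words, 2))
print(bigrams.most_common(2))
# → [(('the', 'quick'), 2), (('quick', 'brown'), 2)]
```

Raw frequency like this over-counts common-word pairs ("of the", "in a"), which is exactly why collocation measures such as the likelihood ratio are used instead.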

Sentiment Analysis

[Figure: VADER sentiment analysis, compound score distribution of sentences from 2,000 READMEs; 1 is positive, -1 is negative]

Sentiment analysis is the process of computationally categorizing text based on the writer’s attitude toward a topic. It can be especially useful on social media feeds like comment threads to get a general sense of whether users are talking positively, negatively, or neutrally about a product. It fits broadly under the group of machine learning classification algorithms and works best when trained on relevant datasets.

For the purposes of learning, I used VADER sentiment analysis since it comes packaged with nltk. VADER “is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.” Below is an example using VADER in Python:

import nltk
from collections import Counter
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download("vader_lexicon")  # required once before first use
sid = SentimentIntensityAnalyzer()
sentiment_summary = Counter()
for readme in readmes:
    sentences = nltk.tokenize.sent_tokenize(readme)
    for sentence in sentences:
        sentiment_score = sid.polarity_scores(sentence)
        if sentiment_score["compound"] == 0.0:
            sentiment_summary["neutral"] += 1
        elif sentiment_score["compound"] > 0.0:
            sentiment_summary["positive"] += 1
        else:
            sentiment_summary["negative"] += 1

In the chart above, nearly 2,000 READMEs were tokenized into more than 120,000 sentences with an overall compound score of 0.143. One interpretation might be that, on average, developers speak positively about their libraries and projects. Yet nearly half of compound scores were neutral, suggesting that developers are writing without ascribing sentiment.

It’s important to note that these results are far from conclusive, especially given the lack of a relevant training set and the reliance on social media corpora; they are intended for illustrative purposes only. Though I do not consider this analysis valid, I do find it interesting. Here is an example of a strongly negative (compound < -0.95) sentence:

If the issue is still not solved, see the guides for common problems: A cask fails to install: curl error Permission denied error Checksum does not match error source is not there error wrong number of arguments error Unlisted reason uninstall wrongly reports cask as not installed Error: Unknown command: cask error My problem isnt listed Requests Cask requests will be automatically closed.

It’s my assumption that words like fails, wrong, and error are responsible for classifying this example as strongly negative.
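That assumption can be illustrated with a toy lexicon-based scorer. This is a drastic simplification of what VADER actually does (real VADER also handles negation, intensifiers, punctuation, and capitalization), and the valence values below are made up for the example, but it shows how a few negatively-weighted words dominate a sentence's score:

```python
# Toy sentiment lexicon with invented valence values -- NOT VADER's lexicon.
LEXICON = {"fails": -1.5, "wrong": -1.2, "error": -1.4, "good": 1.9, "love": 2.0}

def toy_score(sentence):
    """Average the valence of the words found in the lexicon; 0.0 if none."""
    hits = [LEXICON[w] for w in sentence.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(toy_score("the install fails with a checksum error"))  # negative
print(toy_score("please read the documentation first"))      # 0.0, no lexicon hits
```

A README sentence packed with "fails", "wrong", and "error" scores negative even though, to a developer, it is neutral troubleshooting language, which is the core mismatch between a social-media lexicon and technical prose.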

For more information, see Sentiment analysis with NLTK/VADER

Other analyses

This article didn’t cover topic modeling, summarization, subject identification, stemming, entity recognition, and so many other topics.

Whether using this README dataset or another, I intend to keep exploring other areas of data science and visualization. As I learn, I will share my process and results on Medium, as well as data and source code on GitHub.

For a comprehensive overview of natural language processing methods, visualizations, and examples, I’d recommend these resources:

