
Uhuru Kenyatta’s 2019 State of the Nation Address, Most Positive



President Uhuru Kenyatta gives his State of the Nation address in Parliament, Nairobi, on March 15, 2017 https://bit.ly/2YJT28T

As the President of Kenya, Uhuru Kenyatta https://en.wikipedia.org/wiki/Uhuru_Kenyatta is charged with the responsibility of informing Kenyans about the progress the government has made each year through the State of the Nation (SoN) addresses. He came to power in 2013, so his first SoN address was in 2014. I looked at the speeches and one thing stood out: they were just too long and boring to read for a person like me. Listening to them would have made sense, but I missed that altogether. Therefore, I deployed a few Natural Language Processing (NLP) techniques to extract some knowledge from the speeches. This is where machines do better than humans.

Data

I made use of the publicly available State of the Nation addresses given from 2014, a year after Uhuru’s ascent to power, to 2019. I used R in my analysis, where I covered a few data analysis and modelling aspects.

Data Cleanup and A little bit of Analysis

I used some of the libraries below in manipulating the data. Caution! Some may not be useful at all. Hopefully you can sort out what’s useful for you in this analysis.
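A minimal sketch of the package setup, assuming the usual R text-mining stack; the exact list I used may differ slightly:

```r
# Packages assumed for this analysis (the exact set may vary)
library(tm)            # corpus handling and term-document matrices
library(SnowballC)     # stemming support for tm
library(wordcloud)     # word clouds
library(RColorBrewer)  # colour palettes for the word cloud
library(topicmodels)   # LDA topic modelling
library(qdap)          # quantitative discourse analysis
library(ggplot2)       # plotting
```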

The documents were converted to a Term Document Matrix for easier statistical computation of the words in them. The process followed is as below:-
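A sketch of that process, assuming the speeches are stored as plain-text files in a speeches/ folder (the path and the exact cleaning steps are assumptions):

```r
# Build a corpus from the speech files and clean it
speeches <- VCorpus(DirSource("speeches/", pattern = "\\.txt$"))

speeches <- tm_map(speeches, content_transformer(tolower))
speeches <- tm_map(speeches, removePunctuation)
speeches <- tm_map(speeches, removeNumbers)
speeches <- tm_map(speeches, removeWords, stopwords("english"))
speeches <- tm_map(speeches, stripWhitespace)

# Term Document Matrix: rows are terms, columns are speeches
tdm <- TermDocumentMatrix(speeches)
```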

The frequency for each word in the matrix was computed as below.
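A sketch of the frequency computation, assuming the tdm object from the previous step:

```r
# Row sums of the term-document matrix give per-word frequencies
m <- as.matrix(tdm)
word_freq <- sort(rowSums(m), decreasing = TRUE)
freq_df <- data.frame(word = names(word_freq), freq = word_freq)

head(freq_df, 10)  # most frequent words
tail(freq_df, 10)  # least frequent words
```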

The outputs for the two, head (most frequent) and tail (least frequent) words, are as below:-

Overall, the president repeated some words more than others. On the higher side, 580 words were mentioned at least twice in all his speeches as per the below table.

The president emphasized the words below, as they were the most frequent in the dataset.

“country”, “government”, “kenya”, “kenyans”, “members”, “nation”, “national”, “people”, “year”

A look at words associated with the most frequent words above gives us an idea of the context within which the words were used. An example for the word “country” is illustrated in the code and output below: -
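A sketch using tm’s findAssocs(); the correlation threshold is an assumption:

```r
# Terms correlated with "country" across the speeches
# (the 0.85 correlation limit is an assumption)
findAssocs(tdm, terms = "country", corlimit = 0.85)
```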

Summary of Dominant Terms

A summary of the most dominant terms is in the word cloud below.


Summary of all words in Uhuru’s speeches minus stop-words. Bigger words were emphasized by the president more than the ones in smaller fonts.


The word cloud and bar chart are generated via the below code:-
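A sketch of how that code might look, assuming the freq_df frequency table built earlier; colours and cut-offs are assumptions:

```r
# Word cloud of the most frequent terms
set.seed(1234)
wordcloud(words = freq_df$word, freq = freq_df$freq,
          min.freq = 5, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

# Bar chart of the top 10 terms
barplot(head(freq_df$freq, 10),
        names.arg = head(freq_df$word, 10),
        las = 2, main = "Most frequent words")
```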

Topic Modelling

In NLP, topic models are defined as statistical models that aid in the discovery of abstract topics in documents. Documents in this instance can be a collection of speeches, books, reviews or even tweets. Topic modelling, in a nutshell, helps in the discovery of semantically close words/terms describing a certain entity. It’s natural, for example, to expect “iPhone” and “lightning cable” to appear often in reviews about Apple products.

Latent Dirichlet Allocation (LDA) is one technique for modelling topics. An in-depth description of LDA is here.
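Below is a sketch of fitting the model with the topicmodels package; the document-term matrix and control settings are assumptions:

```r
# LDA expects documents as rows, so use a DocumentTermMatrix
dtm <- DocumentTermMatrix(speeches)

# Fit an LDA model with k = 3 topics
lda_model <- LDA(dtm, k = 3, control = list(seed = 1234))

# Top 10 terms per topic
terms(lda_model, 10)
```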

From the above code, I chose k = 3, meaning the dataset was subdivided into 3 topically representative sets. k can be changed to any number, but I set it to 3 topics. A few methods are out there for identifying optimal cluster numbers.

A summary of some words that I picked up in the topics over the years are below:-

Topic 3 (2014, 2018)— constitution, Kenyans, values, devolution, peace, unity

Topic 2 (2015, 2016, 2017)— administration, county, economy, people, service.

Topic 1 (2019)— national, security, corruption, development

We can comfortably infer from this analysis that the president emphasized matters of administration and provision of services at the county level for the economic prosperity of Kenyans in his 2015–2017 speeches. His 2019 speech centered on security, development as well as corruption. Devolution, upholding the constitution, peace and unity, among others, were emphasized in 2014 and 2018.

Further Quantitative Analysis

Further to the experiments above, I’ll be performing a side-by-side comparison of all speeches from 2014 to 2019 over a wide array of measures. I’ll make use of an R package called the Quantitative Discourse Analysis Package (qdap). The package provides several measures for comparing text documents. Speeches will be turned into data frames, split into sentences with a sentence tokenizer, then combined into one data frame with a variable that specifies the year of each speech. For the text input, the processes below are required: -

1) Apply qprep() function from Qdap

— bracketX() # apply bracket removal

— replace_abbreviation() # replaces abbreviations

— replace_number() # numbers to words, for example ‘100’ becomes ‘one hundred’

— replace_symbol() # symbols become words, for example @ becomes ‘at’

2) Replace contractions (don’t to do not)

3) strip() — remove unwanted characters, with the exception of periods and question marks.

4) sentSplit() — splits text into sentences and adds them to the grouping variable, the year of the speech. This also creates the Turn of Talk (tot) variable, an indicator of sentence order. This bit is vital in dialogue analysis, e.g. in a question and answer session.

Enough talk and straight to the juicy part. The code to perform the above is as below: -
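A sketch of that pipeline, assuming the speeches have been combined into a data frame speeches_df with year and speech columns:

```r
# 1) qprep(): brackets, abbreviations, numbers and symbols
speeches_df$speech <- qprep(speeches_df$speech)

# 2) Replace contractions (don't -> do not)
speeches_df$speech <- replace_contraction(speeches_df$speech)

# 3) strip(): drop unwanted characters, keeping periods and question marks
speeches_df$speech <- strip(speeches_df$speech, char.keep = c("?", "."))

# 4) sentSplit(): split into sentences, grouped by year; adds the tot variable
sentences_df <- sentSplit(speeches_df, "speech")
```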

From the output, a plot of the most frequent terms is below: -
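A sketch of how such a plot can be produced with qdap’s freq_terms(), assuming the sentences_df object from the preprocessing step:

```r
# Top terms across all speeches, excluding common stop words
frequent_terms <- freq_terms(sentences_df$speech, top = 20,
                             stopwords = Top200Words)
plot(frequent_terms)
```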


Plot showing the most frequent words in Uhuru’s SON speeches between 2014–2019

A tabular visualization of the usage frequency of these words over the years is also shown below. The Wordmat method simplifies this process, as below.
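I take the “Wordmat” method to be qdap’s wfm() (word frequency matrix); a sketch under that assumption:

```r
# Word frequency matrix: one column per year, one row per word
word_matrix <- wfm(sentences_df$speech, sentences_df$year)

# Inspect a few words of interest
word_matrix[c("security", "corruption", "economy"), ]
```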

We can now get complete word statistics about the dataset as below: -
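A sketch using qdap’s word_stats(), grouping by the year of each speech:

```r
# Word statistics (n.sent, word counts, etc.) per year
speech_stats <- word_stats(sentences_df$speech, sentences_df$year,
                           tot = sentences_df$tot)
speech_stats
plot(speech_stats)
```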


Word statistics summary

From the above plot, the 2016 speech is the longest (n.sent). The shortest of his speeches is the 2014 one. Probably the president feared chatter around his 2014 speech, bearing in mind it was his first SoN address after the 2013 elections. A description of the statistical variables in the above chart is found here. Please check them out.

Polarity

Polarity is defined as the state of having two opposite or contradictory tendencies, opinions, or aspects. On the other hand, textual sentiments are normally either positive, negative or neutral. From the table below, the president’s speeches have largely been positive.

However, his 2019 speech, which centered on national, security, corruption and development, was the most positive, with a standardized mean polarity of 0.482. Standardized mean polarity in this case is the average polarity divided by the standard deviation. Polarity over the sentences is summarized in the plot below.


The 2014 speech was deemed most negative, with the 2019 one most positive.

The most negative sentence was “suffered severe setbacks result terrorist attacks”, while the most positive was “represents thirty percent growth total generation capacity.” Makes sense, huh! The code for getting this inference is below:-
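A sketch of that code using qdap’s polarity(), assuming the sentences_df object built earlier:

```r
# Polarity of the speeches, grouped by year
speech_pol <- polarity(sentences_df$speech, sentences_df$year)

scores(speech_pol)   # per-year summary, including stan.mean.polarity
counts(speech_pol)   # sentence-level scores: the extremes are found here
plot(speech_pol)     # polarity over the sentences
```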

Readability

Readability of the speeches follows automated readability indexes. It’s simply an estimation of how difficult a text is to read. This is made possible by measuring the text’s complexity in terms of word and sentence lengths, syllable counts, etc. Each readability index gives an estimated grade level. This ideally represents the number of years of formal education conducted in English.
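A sketch of computing the index with qdap, again grouping by year:

```r
# Automated readability index per speech year
speech_ari <- automated_readability_index(sentences_df$speech,
                                          sentences_df$year)
speech_ari
```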

The 2018 State of the Nation address was the most readable of the speeches, even though it was slightly longer than 2014’s. The rest of the speeches showed stable readability, especially the 2014 and 2017 ones. The 2015 speech needed someone at postgraduate/advanced undergraduate level (year 18 of formal education) to understand it well.

Automated Readability Index table

Diversity in the speeches

Diversity, as applied to dialogue, is a measure of language richness. Specifically, it measures how expansive the vocabulary is while taking into account the number and diversity of words. So how different were the president’s speeches over time? From the below table and graphs, the speeches were not overly diverse.
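A sketch of the diversity measures using qdap’s diversity():

```r
# Lexical diversity (Shannon, Simpson, etc.) per speech year
speech_div <- diversity(sentences_df$speech, sentences_df$year)
speech_div
plot(speech_div)
```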


Plot showing how diverse the speeches were over time.

This means that the president used almost the same language in his speeches. Probably the choice of words had to just be the same over the years for clarity in articulation of his agenda. Would it just be the political language?

One final interesting plot relates to the dispersion of words/terms over time. Some words will fade or increase over time. The choice of words to look at is user-driven, as long as they exist in the corpus. I looked at the dispersion of the words “security”, “jobs”, “economy”, “youth”, “women” and “corruption” in his speeches, as I personally feel they stand a higher chance of emphasis from the head of state.
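A sketch of the dispersion plot with qdap’s dispersion_plot(), using the terms listed above:

```r
# Where the chosen terms appear across each year's speech
dispersion_plot(sentences_df$speech,
                c("security", "jobs", "economy", "youth",
                  "women", "corruption"),
                grouping.var = sentences_df$year)
```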

From the plot below, “corruption”, “security” and to some extent “economy” characterized the president’s speeches over the six-year period. Interestingly, the president didn’t talk of “jobs” in 2014 but emphasized security. In short, lots of inferences can be made from this plot.


Dispersion of terms over time in the speeches.

Conclusions

The above analyses resulted in interesting insights into Uhuru Kenyatta’s speeches, from topics to diversity. The qdap package was instrumental in the side-by-side speech comparisons. Next, I’ll look at getting the reactions of Kenyans after the speeches were given, as there was a lot of Twitter activity at the time. Diversity plots will probably make sense between the speeches and what Kenyans thought at the time.

