Exploratory Data Analysis in R for beginners (Part 2)

A more advanced method of doing EDA with ‘ggplot2’ and ‘tidyverse’

Oct 19 · 9 min read

[Figure: ggplot2]

In my previous article, ‘Exploratory Data Analysis in R for beginners (Part 1)’, I introduced a basic step-by-step approach, from importing data to cleaning and visualization. Here is a quick summary of Part 1:

Importing data appropriately with the fileEncoding and na.strings arguments. I showed how this differs from the usual way of importing a csv file with a bare read.csv() call.
Basic data cleaning with the ‘tidyverse’ package
Visualization with boxplots, bar plots and correlation plots in the ‘ggplot2’ package
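
As a quick reminder of the first point, here is a minimal sketch of what those two arguments change. The file contents, the ".." missing-value marker and the "UTF-8" encoding are assumptions made for this illustration; use whatever your own file needs.

```r
# Write a tiny, made-up csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Country,Maths", "A,409.1", "B,.."), csv_path)

# Bare import: ".." is not recognised as missing, so Maths is not numeric
plain <- read.csv(csv_path)

# Import with explicit arguments: ".." becomes NA and Maths stays numeric
clean <- read.csv(csv_path, fileEncoding = "UTF-8", na.strings = "..")

str(plain$Maths)
str(clean$Maths)
```

Comparing the two str() outputs shows why the extra arguments matter before any plotting starts.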

Those are the basic steps of a simple EDA. However, to make our plots, charts and graphs more informative and, of course, more visually appealing, we need to go one step further. What do I mean by that? Let’s find out!

What can you expect to find in this article?

Doing EDA is not merely about plotting graphs. It is about making informative graphs. In this article, you can expect to find the following tricks:

How to reshape the dataset to get the best version for each type of analysis. There is no one-size-fits-all dataset: each analysis and visualization has a different purpose, and therefore calls for a different data structure.
How to change the order of the legend entries in a plot
How to let R identify the outliers and label them on the plot
How to combine graphs with the ‘gridExtra’ package

By the end of this article, you will be able to generate the following plots:

[Figures: the finished boxplot and the combined bar plots built in the sections below]

Let’s get started, folks!!!!

To refresh your memory

Let’s quickly look at the datasets that we created in Part 1 of this series.

View(df)

[Figure: preview of df]

View(df2)

[Figure: preview of df2]

View(df4) ## df3 combines with df2 to get df4

[Figure: preview of df4]

Please refer back to my previous article for a detailed explanation of how to manipulate the original dataset to get these various versions.

Great! Let’s start on our new plots.

Boxplot

Simple boxplot

First of all, we want to achieve this kind of boxplot:

[Figure: simple boxplot of the % difference in Maths, Reading and Science]

To get this, we need a data frame in which each row is a country and there are 4 columns: the country name and the % difference in Maths, Reading and Science. Looking back at the data frames we created earlier, df meets all of these requirements. Hence we select the relevant columns from df and name the result df5:

df5 = df[,c(1,11,12,13)]
boxplot(df5$Maths.Diff, df5$Reading.Diff, df5$Science.Diff,
        main = 'Are Males better than Females?',
        names = c('Maths','Reading','Science'),
        col = 'green'
        )

Done! With this code, you should get the plot shown above.
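
Selecting columns by position, as in df[,c(1,11,12,13)], breaks silently if the column order ever changes. A minimal sketch of selecting by name instead, using a toy stand-in for df: the values and the Maths.F column are made up, and only the Country.Name and *.Diff names come from the code in this series.

```r
# Toy stand-in for df; values are invented for illustration
toy <- data.frame(
  Country.Name = c("A", "B", "C"),
  Maths.F      = c(480, 500, 510),   # hypothetical extra column we do not need
  Maths.Diff   = c(1.2, -0.5, 2.0),
  Reading.Diff = c(-4.1, -6.3, -5.0),
  Science.Diff = c(0.3, -1.1, 0.9)
)

# Pick the columns by name rather than by position
wanted <- c("Country.Name", "Maths.Diff", "Reading.Diff", "Science.Diff")
toy5 <- toy[, wanted]

names(toy5)
```

The same idea should carry over to the real df as long as it has columns with these names.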

Higher-level boxplot

Now we want to move on to the next level. This is the plot that we want to get:

[Figure: boxplot with a subtitle, axis titles, coloured outliers and a caption]

Notice the following differences:

The subtitle
The titles of the axes
The layout and color of outliers.
The caption

Thankfully, the ‘ggplot2’ package has everything we need; it is just a matter of finding the right arguments and functions. But before we get to those, we need to determine what structure the data frame should have. Looking at this plot, we can see that the data frame must have 1 column holding the 3 categories (Maths, Reading and Science), 1 column of numeric results for the % difference in performance in each subject and, of course, 1 column for the country name. Hence 3 columns in total.

Now we have df5, which looks like this:

[Figure: preview of df5]

How do we transform this into the data frame described above? Remember the trick I introduced in Part 1 of this series: the magical function is pivot_longer(). Now, let’s do it:

df6 = df5 %>% pivot_longer(c(2,3,4))
names(df6) = c('Country','Test Diff','Result')
View(df6)

[Figure: preview of df6 after pivot_longer()]

new = rep(c('Maths', 'Reading', 'Science'), 68)   # new column indicating 'Maths', 'Reading' and 'Science', repeated once per country
df6 = cbind(df6, new)
names(df6) = c('Country','Test Diff','Result', 'Test')   #change column names
View(df6)

[Figure: preview of df6 with the new Test column]

The column ‘Test Diff’ is now redundant, so you can delete it with the code below. (This step is optional.)

df6$'Test Diff' = NULL

Here, I will not delete it.
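
The rep() step works because the rows come out of pivot_longer() in a fixed order. As an alternative sketch, pivot_longer() can name the new columns itself through its names_to and values_to arguments, so the subject label is derived from the column names instead of being typed in. The toy data below is made up for illustration.

```r
library(tidyr)

# Toy stand-in for df5; values are invented
toy5 <- data.frame(
  Country.Name = c("A", "B"),
  Maths.Diff   = c(1.2, -0.5),
  Reading.Diff = c(-4.1, -6.3),
  Science.Diff = c(0.3, -1.1)
)

# Reshape to long form and name the new columns in one call
toy6 <- pivot_longer(
  toy5,
  cols      = c(Maths.Diff, Reading.Diff, Science.Diff),
  names_to  = "Test",
  values_to = "Result"
)

# Strip the ".Diff" suffix so Test holds 'Maths', 'Reading', 'Science'
toy6$Test <- sub("\\.Diff$", "", toy6$Test)
names(toy6)[1] <- "Country"

toy6
```

This version does not depend on knowing the number of countries in advance.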

Great! Now that the dataset is in the right structure for visualization, let’s use ggplot.

To set the title, subtitle, caption and axis titles, use labs(title = '…', y = '…', x = '…', caption = '…', subtitle = '…')
If the title or caption is long, use '… \n …' to break it into 2 lines.
To style the outliers, set the outlier arguments inside geom_boxplot():

geom_boxplot(alpha = 0.7,
               outlier.colour='blue',   #color of outlier
               outlier.shape=19,        #shape of outlier
               outlier.size=3,          #size of outlier
               width = 0.6, color = "#1F3552", fill = "#4271AE"
               )

Combine everything together:

ggplot(data = df6, aes(x=Test,y=Result, fill=Test)) + 
  geom_boxplot(alpha = 0.7,
               outlier.colour='blue', 
               outlier.shape=19, 
               outlier.size=3, 
               width = 0.6, color = "#1F3552", fill = "#4271AE"
               )+
  theme_grey() +
  labs(title = 'Did males perform better than females?',
       y='% Difference from FEMALES',x='',
       caption  = 'Positive % Difference means Males performed \n better than Females and vice versa',
       subtitle = 'Based on PISA Score 2015')

[Figure: boxplot with title, subtitle and caption at the default text sizes]

Is this good enough? The answer is NO: the caption, title and subtitle are too small, and the proportions and size of the plot are not as good as in the plot we introduced at the start of this section.

We can do a better job.

To adjust the sizes and so on, use the theme() function:

theme(axis.text=element_text(size=20),
        plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 10),
        plot.caption = element_text(color = "Red", face = "italic", size = 13)
      
        )

Here, don’t ask me how I picked those numbers. It is trial and error: simply try out different values until you get the size and position you want.

Let’s now combine everything together:

ggplot(data = df6, aes(x=Test,y=Result, fill=Test)) + 
  geom_boxplot(alpha = 0.7,
               outlier.colour='blue', 
               outlier.shape=19, 
               outlier.size=3, 
               width = 0.6, color = "#1F3552", fill = "#4271AE"
               )+
  theme_grey() +
  labs(title = 'Did males perform better than females?',
       y='% Difference from FEMALES',x='',
       caption  = 'Positive % Difference means Males performed \n better than Females and vice versa',
       subtitle = 'Based on PISA Score 2015') + 
theme(axis.text=element_text(size=20),
        plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 10),
        plot.caption = element_text(color = "Red", face = "italic", size = 13)
)

[Figure: boxplot with larger title, axis text and caption]
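
Since the same theme() settings are reused in every version of the plot, one option is to store them in a variable and add that variable to each plot. This is only a sketch: big_text_theme is a name I made up, and the toy data is invented.

```r
library(ggplot2)

# Store the text-size settings once (big_text_theme is a hypothetical name)
big_text_theme <- theme(
  axis.text     = element_text(size = 20),
  plot.title    = element_text(size = 20, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption  = element_text(color = "red", face = "italic", size = 13)
)

# Toy data, made up for illustration
toy <- data.frame(
  Test   = rep(c("Maths", "Reading"), each = 5),
  Result = c(1, 2, 3, 4, 5, -5, -4, -3, -2, -1)
)

# Add the stored theme like any other layer
p <- ggplot(toy, aes(x = Test, y = Result)) +
  geom_boxplot() +
  labs(title = "Toy example") +
  big_text_theme
```

With this approach the trial and error on sizes happens in one place only.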

The plot looks much better now. You can stop here. However, I would like to introduce a way to rearrange the order of the categories along the x-axis to get a decreasing trend. Simply use

scale_x_discrete(limits=c("Maths","Science","Reading"))

Hence, combining everything

ggplot(data = df6, aes(x=Test,y=Result, fill=Test)) + 
  geom_boxplot(alpha = 0.7,
               outlier.colour='blue', 
               outlier.shape=19, 
               outlier.size=3, 
               width = 0.6, color = "#1F3552", fill = "#4271AE"
               )+
  scale_x_discrete(limits=c("Maths","Science","Reading"))+
  theme_grey() +
  labs(title = 'Did males perform better than females?',
       y='% Difference from FEMALES',x='',
       caption  = 'Positive % Difference means Males performed \n better than Females and vice versa',
       subtitle = 'Based on PISA Score 2015') + 
theme(axis.text=element_text(size=20),
        plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 10),
        plot.caption = element_text(color = "Red", face = "italic", size = 13)
)

[Figure: boxplot with the subjects ordered Maths, Science, Reading]
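
The order passed to limits above was typed by hand. A small sketch of how the same decreasing order could be derived from the data instead, by sorting the group medians. The toy values are made up, and ordered_levels is a name chosen for this example.

```r
# Toy long-format data; values are invented for illustration
scores <- data.frame(
  Test   = rep(c("Maths", "Reading", "Science"), each = 4),
  Result = c(3, 4, 5, 6, -5, -4, -3, -2, 0, 1, 2, 3)
)

# Median of Result within each subject
med <- tapply(scores$Result, scores$Test, median)

# Subject names sorted from the highest median to the lowest
ordered_levels <- names(sort(med, decreasing = TRUE))
ordered_levels   # "Maths" "Science" "Reading"

# ordered_levels can then be passed on, e.g. scale_x_discrete(limits = ordered_levels)
```

The design choice here is to use the median because it is the line drawn inside each box, so the boxes appear in visually decreasing order.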

Awesome!! Now the plot looks really informative. However, a better data analyst would produce something even more informative, like the following:

[Figure: boxplot with the outliers labelled by country]

How can we let R identify the outliers and label them for us?

df6$Test = factor(df6$Test, levels = c("Maths","Science","Reading"))  # To change order of legend

Let’s define the outlier. This part requires a bit of Statistics knowledge. I recommend reading Michael Galarnyk’s article here. He explains the concepts pretty well.

is_outlier <- function(x) {
   return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}   # define a function to detect outliers
str(df6)
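
To see what the function does, here is a small worked example on a made-up vector. A value is flagged when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile.

```r
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Made-up values: five ordinary points and one extreme one
x <- c(1, 2, 3, 4, 5, 100)

quantile(x, c(0.25, 0.75))   # 2.25 and 4.75
IQR(x)                       # 2.5

# Lower fence: 2.25 - 1.5 * 2.5 = -1.5; upper fence: 4.75 + 1.5 * 2.5 = 8.5
flags <- is_outlier(x)
x[flags]                     # 100
```

Only 100 falls outside the range from -1.5 to 8.5, so it is the only value flagged.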

Notice that the columns ‘Country’ and ‘Test’ are factors. First let’s change them to characters.

df6$Country = as.character(df6$Country)
# Create a new data frame whose last column, 'is_outlier', holds the country
# name for outliers and NA for every other data point
df7 <- df6  %>% group_by(as.character(Test)) %>% 
  mutate(is_outlier=ifelse(is_outlier(Result), Country, NA_character_))
View(df7)

[Figure: preview of df7 with the is_outlier column]

As you can see, the last column of the data frame shows that Jordan is an outlier. Now, in the ‘Country’ column, we want only the outliers to keep their labels; the rest should be set to NA. This will be helpful later when we code the plot.

df7$Country[which(is.na(df7$is_outlier))] <- NA_character_
View(df7)

[Figure: preview of df7 with Country set to NA for non-outliers]
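
The key point of the steps above is that outliers are detected within each subject, not across the whole column. A compact sketch of that group-wise labelling on toy data: the values and the label column name are made up, and if_else() from ‘dplyr’ is swapped in for ifelse() here.

```r
library(dplyr)

is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Toy long-format data: only country F in Maths is extreme
toy <- data.frame(
  Country = rep(c("A", "B", "C", "D", "E", "F"), 2),
  Test    = rep(c("Maths", "Reading"), each = 6),
  Result  = c(1, 2, 3, 4, 5, 100, 1, 2, 3, 4, 5, 6)
)

# Flag outliers separately for each Test and keep the country name only for them
labelled <- toy %>%
  group_by(Test) %>%
  mutate(label = if_else(is_outlier(Result), Country, NA_character_)) %>%
  ungroup()

labelled$label
```

A column like label could then be mapped with aes(label = label) in geom_text(), leaving the original Country column untouched.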

Now, let’s plot the graph. The code is the same as above, except that we add geom_text() to label the outliers:

ggplot(data = df7, aes(x=Test,y=Result, fill=Test)) + 
  geom_boxplot(alpha = 0.7,
               outlier.colour='red', 
               outlier.shape=19, 
               outlier.size=3, 
               width = 0.6)+
geom_text(aes(label = Country), na.rm = TRUE, hjust = -0.2)+         
  theme_grey() +
  labs(title = 'Did males perform better than females?',
       y='% Difference from FEMALES',x='',
       caption  = 'Positive % Difference means Males performed \n better than Females and vice versa',
       subtitle = 'Based on PISA Score 2015') + 
  theme(axis.text=element_text(size=20),
        legend.text = element_text(size = 16), 
        legend.title = element_text(size = 16),

        legend.position = 'right', aspect.ratio = 1.4,
        plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 10),
        plot.caption = element_text(color = "Red", face = "italic", size = 13)
  )

[Figure: boxplot with the outliers labelled by country]

Combine multiple plots

[Figure: the three bar plots combined in one row]

We went through how to create each of the plots in the combined figure above in Part 1 of this series. Here is the recap:

plot1 = ggplot(data=df, aes(x=reorder(Country.Name, Maths.Diff), y=Maths.Diff)) +
  geom_bar(stat = "identity", aes(fill=Maths.Diff)) +
  coord_flip() +
  theme_light() +
  geom_hline(yintercept = mean(df$Maths.Diff), size=1, color="black") +
  labs(x="", y="Maths")+
  scale_fill_gradient(name="% Difference Level", low = "red", high = "green")+
  theme(legend.position = "none")
plot2 = ggplot(data=df, aes(x=reorder(Country.Name, Reading.Diff), y=Reading.Diff)) +
  geom_bar(stat = "identity", aes(fill=Reading.Diff)) +
  coord_flip() +
  theme_light() +
  geom_hline(yintercept = mean(df$Reading.Diff), size=1, color="black") +
  labs(x="", y="Reading")+
  scale_fill_gradient(name="% Difference Level", low = "red", high = "green") +
  theme(legend.position = "none")
plot3 = ggplot(data=df, aes(x=reorder(Country.Name, Science.Diff), y=Science.Diff)) +
  geom_bar(stat = "identity", aes(fill=Science.Diff)) +
  coord_flip() +
  theme_light() +
  geom_hline(yintercept = mean(df$Science.Diff), size=1, color="black") +
  labs(x="", y="Science")+
  scale_fill_gradient(name="% Difference", low = "red", high = "green") +
  theme(legend.position = "none")
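
The three plots differ only in the column they use, so the repeated code could be wrapped in a small helper. This is a sketch under assumptions: diff_barplot is a name I made up, the toy data is invented, and the size argument of geom_hline() is left out to keep the example short.

```r
library(ggplot2)

# Hypothetical helper: builds the ordered bar plot for one '.Diff' column
diff_barplot <- function(data, column, axis_label) {
  ggplot(data, aes(x = reorder(Country.Name, .data[[column]]),
                   y = .data[[column]])) +
    geom_bar(stat = "identity", aes(fill = .data[[column]])) +
    coord_flip() +
    theme_light() +
    geom_hline(yintercept = mean(data[[column]]), color = "black") +
    labs(x = "", y = axis_label) +
    scale_fill_gradient(low = "red", high = "green") +
    theme(legend.position = "none")
}

# Toy stand-in for df; values are made up
toy <- data.frame(
  Country.Name = c("A", "B", "C"),
  Maths.Diff   = c(1.2, -0.5, 2.0),
  Reading.Diff = c(-4.1, -6.3, -5.0)
)

toy_plot1 <- diff_barplot(toy, "Maths.Diff", "Maths")
toy_plot2 <- diff_barplot(toy, "Reading.Diff", "Reading")
```

The column name is passed as a string and looked up with .data[[column]], which is the ‘ggplot2’ way to refer to a column whose name is stored in a variable.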

To combine them all together, use the ‘gridExtra’ package.

install.packages('gridExtra')
library(gridExtra)
grid.arrange(plot1, plot2, plot3, nrow = 1,   # nrow = 1 places all the plots in one row
             top = 'Are Males better than Females?',
             bottom = '% Difference from Females'
             )
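
If the plots should not all have the same width, the layout can be adjusted. A minimal sketch, assuming toy plots made up for this example: arrangeGrob() from ‘gridExtra’ builds the same layout as grid.arrange() without drawing it, and widths sets the relative column widths.

```r
library(ggplot2)
library(gridExtra)

# Two toy plots, made up for illustration
toy <- data.frame(x = c("A", "B", "C"), y = c(1, 3, 2))
p1 <- ggplot(toy, aes(x, y)) + geom_col()
p2 <- ggplot(toy, aes(x, y)) + geom_point()

# One row, with the first plot twice as wide as the second
combined <- arrangeGrob(p1, p2, nrow = 1, widths = c(2, 1), top = "Toy layout")

# Draw the stored layout when needed
# grid::grid.draw(combined)
```

Keeping the layout in a variable can be handy when the same combined figure is drawn more than once.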

That is it! Again, I hope you enjoyed this article and picked up something from it. Of course, this guide is not exhaustive, and there are a lot of other techniques we can use to do EDA. However, I believe it will give you some ideas of how to go from a simple plot to a more complicated and informative plot with R.

If you have any questions, feel free to put them down in the comment section below. Once again, thank you for reading. Have a great day and happy programming!!!
