Exploratory Data Analysis in R for beginners (Part 2)
source link: https://www.tuicool.com/articles/mUZbUnM
A more advanced method of doing EDA with ‘ggplot2’ and ‘tidyverse’
In my previous article, ‘Exploratory Data Analysis in R for beginners (Part 1)’, I introduced a basic step-by-step approach, from importing data to cleaning and visualization. Here is a quick summary of Part 1:
- Importing data appropriately with the fileEncoding and na.strings arguments. I showed how this differs from a plain read.csv() call.
- Some basic data cleaning with the ‘tidyverse’ package
- Visualization with boxplots, bar plots and correlation plots in the ‘ggplot2’ package
Those are the basic steps in performing simple EDA. However, to make our plots, charts and graphs more informative and, of course, visually appealing, we need to go one step further. What do I mean by that? Let’s find out!
What would you expect to find in this article?
Doing EDA is not merely about plotting graphs. It is about making informative graphs. In this article, you can expect to find the following tricks:
- How to reshape the dataset to get the best version for each type of analysis. There is no one-size-fits-all dataset. Each analysis and visualization has a different purpose, and hence needs a different data structure.
- Change the order of the legends in the plot
- Let R identify the outliers and label them on the plot
- Combine graphs with the ‘gridExtra’ package
By the end of this article, you will be able to generate the following plots:
Let’s get started, folks!!!!
To refresh your memories
Let’s quickly look at the types of datasets that we created in Part 1 of this series.
View(df)
View(df2)
View(df4) ## df3 combines with df2 to get df4
Please refer back to my previous article for a detailed explanation of how to manipulate the original dataset to get these various versions.
Great! Let’s start our new visual plots now
Boxplot
Simple boxplot
First of all, we want to achieve this kind of boxplot
To get this, we need a data frame where the rows are the countries and there are 4 columns: the country name and the % difference in Maths, Reading and Science. Looking back at the data frames we created earlier, we can see that df meets all of these requirements. Hence we will select the relevant columns from df and name the result df5.
df5 = df[, c(1, 11, 12, 13)]  # country name plus the three % difference columns
boxplot(df5$Maths.Diff, df5$Reading.Diff, df5$Science.Diff,
        main = 'Are Males better than Females?',
        names = c('Maths', 'Reading', 'Science'),
        col = 'green')
Done! Following this code, you should be able to get the plot as above.
Higher-level boxplot
Now we want to move on to the next level. This is the plot that we want to get
Notice the following differences:
- The subtitle
- The titles of the axes
- The layout and color of outliers.
- The caption
Thankfully, the ‘ggplot2’ package has everything we need. It is just a matter of finding the right arguments and functions in the package. But before we move on to which arguments and functions to use, we need to decide what structure the data frame should have. Looking at this plot, we can see that the data frame must have 1 column holding the 3 categories (Maths, Reading and Science), 1 column of numeric results for the % difference in performance in each subject and, of course, 1 column for the country name. Hence 3 columns in total.
Now we have df5 that looks like this
How do we transform this into the data frame described above? Remember the trick I introduced in Part 1 of this series: the magical function is pivot_longer(). Now, let’s do it:
df6 = df5 %>% pivot_longer(c(2, 3, 4))
names(df6) = c('Country', 'Test Diff', 'Result')
View(df6)
# Create a new column indicating 'Maths', 'Reading' and 'Science', repeated 68 times to match the rows of df6
new = rep(c('Maths', 'Reading', 'Science'), 68)
df6 = cbind(df6, new)
names(df6) = c('Country', 'Test Diff', 'Result', 'Test')  # change column names
View(df6)
The column ‘Test Diff’ is now redundant, so you can choose to delete it with the code below. (This step is not compulsory.)
df6$'Test Diff' = NULL
Here, I will not delete it.
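The reshaping, renaming and cbind() steps above can also be folded into a single pivot_longer() call. Here is a minimal sketch of that idea, assuming a toy data frame whose column names mirror df5 (the name toy, the two countries and all the numbers are made up for illustration; they are not PISA values):

```r
library(tidyr)

# Toy stand-in for df5; names and values are invented for illustration only
toy <- data.frame(
  Country.Name = c("A", "B"),
  Maths.Diff   = c(1.5, -0.5),
  Reading.Diff = c(-4.0, -6.0),
  Science.Diff = c(0.5, -1.0)
)

# names_to / values_to name the two new columns directly,
# so no separate names() call is needed afterwards
toy_long <- pivot_longer(
  toy,
  cols      = c(Maths.Diff, Reading.Diff, Science.Diff),
  names_to  = "Test",
  values_to = "Result"
)

# Strip the ".Diff" suffix so Test holds 'Maths', 'Reading', 'Science'
toy_long$Test <- sub("\\.Diff$", "", toy_long$Test)
toy_long
```

Deriving the subject label from the column name avoids hard-coding the number of countries, so the same lines keep working if rows are added or removed.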
Great! Now that we have the dataset in the right structure for visualization, let’s use ggplot.
- To set the title, subtitle, caption and axis titles, use labs(title = '…', y = '…', x = '…', caption = '…', subtitle = '…')
- If the title or caption is long, use '… \n …' to break it into 2 lines.
- To style the outliers, set the outlier arguments inside geom_boxplot():
geom_boxplot(alpha = 0.7,
             outlier.colour = 'blue',  # color of outlier
             outlier.shape = 19,       # shape of outlier
             outlier.size = 3,         # size of outlier
             width = 0.6, color = "#1F3552", fill = "#4271AE")
Combine everything together:
ggplot(data = df6, aes(x = Test, y = Result, fill = Test)) +
  geom_boxplot(alpha = 0.7,
               outlier.colour = 'blue',
               outlier.shape = 19,
               outlier.size = 3,
               width = 0.6, color = "#1F3552", fill = "#4271AE") +
  theme_grey() +
  labs(title = 'Did males perform better than females?',
       y = '% Difference from FEMALES', x = '',
       caption = 'Positive % Difference means Males performed \n better than Females and vice versa',
       subtitle = 'Based on PISA Score 2015')
Is this good enough? The answer is NO: the caption, title and subtitle are too small, and the proportions and size of the plot are not as good as in the plot we introduced at the start of this section.
We can do a better job.
To adjust the text sizes and so on, use the theme() function
theme(axis.text = element_text(size = 20),
      plot.title = element_text(size = 20, face = "bold"),
      plot.subtitle = element_text(size = 10),
      plot.caption = element_text(color = "Red", face = "italic", size = 13))
Don’t ask me how I arrived at those numbers: it was trial and error. Simply try out different values until you get the size and position you want.
Let’s now combine everything together
ggplot(data = df6, aes(x = Test, y = Result, fill = Test)) +
  geom_boxplot(alpha = 0.7,
               outlier.colour = 'blue',
               outlier.shape = 19,
               outlier.size = 3,
               width = 0.6, color = "#1F3552", fill = "#4271AE") +
  theme_grey() +
  labs(title = 'Did males perform better than females?',
       y = '% Difference from FEMALES', x = '',
       caption = 'Positive % Difference means Males performed \n better than Females and vice versa',
       subtitle = 'Based on PISA Score 2015') +
  theme(axis.text = element_text(size = 20),
        plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 10),
        plot.caption = element_text(color = "Red", face = "italic", size = 13))
The plot looks much better now. You can stop here. However, I would like to introduce a way to rearrange the order of variables to get a decreasing trend. Simply use
scale_x_discrete(limits=c("Maths","Science","Reading"))
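Typing the order by hand works well for three categories. As an alternative sketch (not the approach used in this article), base R's reorder() can derive the order from the data itself. The toy data frame and its values below are made up for illustration:

```r
# Made-up results: Maths has the highest median, Reading the lowest
toy <- data.frame(
  Test   = rep(c("Maths", "Reading", "Science"), each = 3),
  Result = c(3, 4, 5,  -6, -5, -4,  0, 1, 2)
)

# reorder() sorts the factor levels by a summary of another variable;
# negating Result turns the default ascending order into a decreasing one
toy$Test <- reorder(toy$Test, -toy$Result, FUN = median)

levels(toy$Test)
```

Since ggplot2 draws a discrete axis in factor-level order, a column prepared this way should plot in decreasing order of medians without a hard-coded limits vector.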
Hence, combining everything
ggplot(data = df6, aes(x=Test,y=Result, fill=Test)) +
geom_boxplot(alpha = 0.7,
outlier.colour='blue',
outlier.shape=19,
outlier.size=3,
width = 0.6, color = "#1F3552", fill = "#4271AE"
)+
scale_x_discrete(limits=c("Maths","Science","Reading"))+ theme_grey() +
labs(title = 'Did males perform better than females?',
y='% Difference from FEMALES',x='',
caption = 'Positive % Difference means Males performed \n better than Females and vice versa',
subtitle = 'Based on PISA Score 2015') +
theme(axis.text=element_text(size=20),
plot.title = element_text(size = 20, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(color = "Red", face = "italic", size = 13)
)
Awesome!! Now the plot looks really informative. However, better data analysts would produce something even more informative like the following:
How can we let R identify the outliers and label them for us?
df6$Test = factor(df6$Test, levels = c("Maths","Science","Reading")) # To change order of legend
Let’s define what an outlier is. This part requires a bit of statistics knowledge. I recommend reading Michael Galarnyk’s article here. He explains the concepts pretty well.
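To make the rule concrete before wrapping it in a function, here is a small sketch of the 1.5 × IQR fences on a made-up vector. The names scores, lower_fence and upper_fence are illustrative only and do not appear elsewhere in this article:

```r
# Made-up scores; 100 is deliberately extreme
scores <- c(1, 2, 3, 4, 5, 100)

q1  <- quantile(scores, 0.25)   # first quartile
q3  <- quantile(scores, 0.75)   # third quartile
iqr <- IQR(scores)              # interquartile range, q3 - q1

# Anything beyond these two fences counts as an outlier under the rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

scores[scores < lower_fence | scores > upper_fence]
```

The function defined next applies exactly this comparison and returns TRUE or FALSE for each element.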
# Define a function to detect outliers
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
str(df6)
In the str(df6) output, notice that the columns ‘Country’ and ‘Test’ are factors. First, let’s convert them to characters.
df6$Country = as.character(df6$Country)
# Create a new data frame whose last column, 'is_outlier', holds the
# country name for outliers and NA for every other data point
df7 <- df6 %>%
  group_by(as.character(Test)) %>%
  mutate(is_outlier = ifelse(is_outlier(Result), Country, NA_character_))
View(df7)
As you can see, the last column of the data frame shows that Jordan is an outlier. Now, in the ‘Country’ column, we want to make sure that only the outliers keep their names; the rest should be set to NA. This will be helpful later when we code the plot.
df7$Country[which(is.na(df7$is_outlier))] <- NA_character_
View(df7)
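The two steps above can also be sketched as a single grouped mutate() that writes the labels into a separate column and leaves ‘Country’ untouched. This is an alternative under assumptions, not the code used for the plots in this article: the toy data frame, its values and the column name label are all made up for illustration.

```r
library(dplyr)

is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Made-up data: country F is deliberately extreme in Maths only
toy <- data.frame(
  Country = rep(c("A", "B", "C", "D", "E", "F"), times = 2),
  Test    = rep(c("Maths", "Reading"), each = 6),
  Result  = c(1, 2, 3, 4, 5, 100,   -3, -2, -1, 0, 1, 2),
  stringsAsFactors = FALSE
)

# Outliers are judged within each Test group, not across the whole column
labelled <- toy %>%
  group_by(Test) %>%
  mutate(label = if_else(is_outlier(Result), Country, NA_character_)) %>%
  ungroup()

# Only the rows that would receive a text label on the plot
filter(labelled, !is.na(label))
```

With a column like this, geom_text(aes(label = label), na.rm = TRUE) would label only the outliers while the full country names remain available for other uses.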
Now, let’s plot the graph. The code is the same as above, except that we add geom_text() to label the outliers
ggplot(data = df7, aes(x=Test,y=Result, fill=Test)) +
geom_boxplot(alpha = 0.7,
outlier.colour='red',
outlier.shape=19,
outlier.size=3,
width = 0.6)+
geom_text(aes(label = Country), na.rm = TRUE, hjust = -0.2)+
theme_grey() +
labs(title = 'Did males perform better than females?',
y='% Difference from FEMALES',x='',
caption = 'Positive % Difference means Males performed \n better than Females and vice versa',
subtitle = 'Based on PISA Score 2015') +
theme(axis.text=element_text(size=20),
legend.text = element_text(size = 16),
legend.title = element_text(size = 16),
legend.position = 'right', aspect.ratio = 1.4,
plot.title = element_text(size = 20, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(color = "Red", face = "italic", size = 13)
)
Combine multiple plots
We went through how to create each of the plots in the combined plot above in Part 1 of this series. Here is the recap:
plot1 = ggplot(data = df, aes(x = reorder(Country.Name, Maths.Diff), y = Maths.Diff)) +
  geom_bar(stat = "identity", aes(fill = Maths.Diff)) +
  coord_flip() +
  theme_light() +
  geom_hline(yintercept = mean(df$Maths.Diff), size = 1, color = "black") +
  labs(x = "", y = "Maths") +
  scale_fill_gradient(name = "% Difference Level", low = "red", high = "green") +
  theme(legend.position = "none")
plot2 = ggplot(data = df, aes(x = reorder(Country.Name, Reading.Diff), y = Reading.Diff)) +
  geom_bar(stat = "identity", aes(fill = Reading.Diff)) +
  coord_flip() +
  theme_light() +
  geom_hline(yintercept = mean(df$Reading.Diff), size = 1, color = "black") +
  labs(x = "", y = "Reading") +
  scale_fill_gradient(name = "% Difference Level", low = "red", high = "green") +
  theme(legend.position = "none")
plot3 = ggplot(data = df, aes(x = reorder(Country.Name, Science.Diff), y = Science.Diff)) +
  geom_bar(stat = "identity", aes(fill = Science.Diff)) +
  coord_flip() +
  theme_light() +
  geom_hline(yintercept = mean(df$Science.Diff), size = 1, color = "black") +
  labs(x = "", y = "Science") +
  scale_fill_gradient(name = "% Difference", low = "red", high = "green") +
  theme(legend.position = "none")
To combine them all together, use the ‘gridExtra’ package.
install.packages('gridExtra')
library(gridExtra)
# nrow = 1 means all the plots are placed in one row
grid.arrange(plot1, plot2, plot3, nrow = 1,
             top = 'Are Males better than Females?',
             bottom = '% Difference from Females')
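grid.arrange() draws the combined figure straight away. If you would rather keep the layout as an object, for example to save it to a file, gridExtra also provides arrangeGrob(), which takes the same layout arguments. A minimal sketch with made-up data (toy, p_maths, p_reading and the title text are illustrative names and values only):

```r
library(ggplot2)
library(gridExtra)

# Made-up values for three hypothetical countries
toy <- data.frame(
  Country = c("A", "B", "C"),
  Maths   = c(2, -1, 3),
  Reading = c(-5, -7, -4)
)

p_maths   <- ggplot(toy, aes(x = Country, y = Maths)) + geom_col() + coord_flip()
p_reading <- ggplot(toy, aes(x = Country, y = Reading)) + geom_col() + coord_flip()

# arrangeGrob() builds the side-by-side layout without drawing it
combined <- arrangeGrob(p_maths, p_reading, nrow = 1, top = "Toy combined plot")

# Draw the stored layout on a fresh page
grid::grid.newpage()
grid::grid.draw(combined)

# The stored object can also be passed to ggsave(), for example:
# ggsave("combined.png", combined, width = 10, height = 4)
```

Changing nrow = 1 to ncol = 1 would stack the panels vertically instead, which may suit long country lists better.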
That is it! Again, I hope you enjoyed this article and picked up something from it. Of course, this guide is not exhaustive, and there are a lot of other techniques we can use to do EDA. However, I believe it gives you some ideas of how to go from a simple plot to a more complex and informative plot with R.
If you have any questions, feel free to put them in the comment section below. Once again, thank you for reading. Have a great day and happy programming!!!