Visualizing Brooklyn Nine-Nine with R!
source link: https://www.tuicool.com/articles/hit/MVzqeiu
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
(This article was first published on R by R(yo) , and kindly contributed toR-bloggers)
“Hey there criminal. It’s me, Johnny Law!” – Jake Peralta, NYPD.
Brooklyn Nine-Nine has become one of my favorite sitcoms in recent
years, probably taking over from Parks & Recreation and Community. So in
this blog post, I’m going to web scrape some very simple TV statistics,
clean it up with the tidyverse
, and visualize it with ggplot2
…
“Terry loves ggplot2!”
Let’s get started!
Packages
pacman::p_load(tidyverse, rvest, glue, cowplot, ggbeeswarm, polite, extrafont, knitr, kableExtra) loadfonts() # "Univers" and "Univers LT 93 ExtraBlackEx" fonts
B99 Custom Theme
First, I created a custom Brooklyn Nine-Nine theme that I can put
on every plot. This will save me time from typing in
the same options over and over again! I googled the font type that the
official Brooklyn Nine-Nine media uses, downloaded them, and got it
installed for R using the extrafont
package. For some of the different
colors you’ll see in the plots I sourced them from pasting in the
Brooklyn Nine-Nine logo and other official media images into
imagecolorpicker.com and saving the
hex codes that it gave me. Otherwise, I experimented with different
using color!).
theme_b99 <- function(){ base_size <- 11 half_line <- base_size / 2 theme_minimal() %+replace% theme(text = element_text(family = "Univers", color = "#F9FEFF", face = "plain", size = 14, hjust = 0.5, vjust = 0.5, angle = 0, lineheight = 0.9, margin = margin(half_line, half_line, half_line, half_line), debug = FALSE), plot.title = element_text(family = "Univers LT 93 ExtraBlackEx", size = 20, color = "#F9FEFF"), plot.background = element_rect(color = NA, fill = "#0053CD"), panel.background = element_rect(color = NA, fill = "#0053CD"), # axis options axis.text = element_text(family = "Univers", color = "#F9FEFF", size = 12), axis.title = element_text(size = 14), # legend options (for ratings plot) legend.title = element_text(family = "Univers", color = "#F9FEFF"), legend.text = element_text(family = "Univers", color = "#F9FEFF", size = 9), legend.position = "bottom", legend.key = element_rect(colour = "black", linetype = "solid", size = 1.5), legend.background = element_rect(color = "black", fill = "#0053CD", linetype = "solid")) }
Now the plots will all look very similar, just like a good police squad:
With that done, I can start making plots!
Episode Ratings
As in my more
recent blog
posts I used the polite
package to web scrape responsibly (note
the bow()
and scrape()
functions).
I map_2()
over the IMDB page for the episodes for each season and I
append the season number to each episode row. After I’m done with the
web scraping I mutate()
in the episode number for each season.
url_df <- tibble( urls = c("https://www.imdb.com/title/tt2467372/episodes?season=1", "https://www.imdb.com/title/tt2467372/episodes?season=2", "https://www.imdb.com/title/tt2467372/episodes?season=3", "https://www.imdb.com/title/tt2467372/episodes?season=4", "https://www.imdb.com/title/tt2467372/episodes?season=5"), season_num = c(1, 2, 3, 4, 5)) # scraping function: brooklyn99_ep_rating <- function(url) { session <- bow(url) url2 <- scrape(session) # Grab episode names sX_ep_name <- url2 %>% html_nodes(".info a") %>% html_text() %>% as_tibble() %>% mutate(value = gsub("\\n", "", x = value)) # Grab episode rating sX_rate <- url2 %>% html_nodes(".ipl-rating-widget > .ipl-rating-star .ipl-rating-star__rating") %>% html_text() %>% as_tibble() %>% mutate(rating = gsub("\\n", "", x = value) %>% as.numeric) %>% select(-value) # Clean episode name df sX_ep_name <- sX_ep_name %>% mutate(title = trimws(value)) %>% filter(!str_detect(title, "Rate"), title != "") %>% select(-value) # combine name + rating ep_rating <- sX_ep_name %>% bind_cols(sX_rate) } ep_rating_df <- map2(.x = url_df$urls, .y = url_df$season_num, ~ brooklyn99_ep_rating(url = .x) %>% mutate(season = .y)) %>% reduce(bind_rows) ep_rating_df <- ep_rating_df %>% group_by(season) %>% mutate(ep_num = row_number()) %>% ungroup() glimpse(ep_rating_df)
## Observations: 112 ## Variables: 4 ## $ title <chr> "Pilot", "The Tagger", "The Slump", "M.E. Time", "The V... ## $ rating <dbl> 7.9, 7.7, 7.7, 7.8, 8.1, 8.5, 8.2, 7.9, 7.8, 8.3, 8.3, ... ## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1... ## $ ep_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ... </int> </dbl> </dbl> </chr>
OK, looks good.
Episode Ratings Plot: Heatmap and Boxplot
I used geom_tile()
to create a heat map of episode ratings with the
season number as the rows and the episode number for that season as the
columns. I also used the dichromat
package for the color scheme
“LightBluetoDarkBlue.10”, it meshes pretty well with the Brooklyn Nine-Nine
color theme!
rating_plot <- ep_rating_df %>% ggplot(aes(x = ep_num, y = season)) + geom_tile(aes(fill = rating), size = 1.5, color = "black") + scale_fill_gradientn("Rating", colors = dichromat::colorschemes$LightBluetoDarkBlue.10, breaks = c(7.7, 8, 8.5, 9, 9.5)) + guides(fill = guide_colourbar(frame.colour = "black", barwidth = unit(2, "in"))) + scale_x_continuous(expand = c(0.01, 0.01), breaks = c(1, seq(from = 5, to = 20, by = 5), 23), labels = c(1, seq(from = 5, to = 20, by = 5), 23)) + scale_y_reverse(expand = c(0.01, 0.01), breaks = c(1:7), labels = c(1:7)) + labs(x = "Episode Number", y = "Season", title = "Episode Ratings") + theme_b99() + theme(panel.grid = element_blank()) rating_plot
From what we can see on the graph there doesn’t seem to be any trends
along either axis besides the fact that the finales and the subsequent
opening episode of the following season are generally well-received. Two
stand out episodes from Season 5 can be clearly seen in the plot.
These episodes are “HalloVeen” (the fifth Halloween episode of the
Sterling K.
Brown
as the perp).Box Plot
Another way to visualize this would be with the box plot. This way
allows you to make a comparison between seasons, whereas the heat map
could be used to pick out any trends over time or sequence.
The debate between violin plot vs. box plot vs. sina plot vs. etc. rages
on to this day (some examples I’ve read over the years:
1 ,
2 ,
3 ) and at the end of the day, it may come down to personal preference. Since the data
I’m using is quite small (~20 episode ratings for each season), in my
case it might be better to use box plots (instead of violin plots) and
sprinkle bee swarm points on top with the
ggbeeswarm package. I commented
out some of the other methods if you wanted to copy-paste this code
chunk into your R console to try them out!
I highlighted the outliers for each season in red with the unintentional
result being that the color scheme mimics that of the flag of Colorado…
cols <- c("1" = "#6CA9C3", "2" = "#3A3533", "3" = "#000E33", "4" = "#CBCFD2", "5" = "#175E78") box_plot <- ep_rating_df %>% ggplot(aes(x = season, y = rating, group = season)) + #geom_violin(color = "#F9FEFF", fill = "#000E33") + #ggthemes::geom_tufteboxplot(color = "#F9FEFF", fill = "#000E33") + #ggforce::geom_sina(color = "#F9FEFF", fill = "#000E33") + geom_boxplot(color = "#F9FEFF", fill = "#000E33", outlier.color = "red", outlier.size = 5) + geom_beeswarm(color = "#FCF40E", cex = 2, size = 2.25) + scale_y_continuous(limits = c(7.5, 9.6), labels = c(7.5, 8, 8.5, 9, 9.5)) + labs(title = "Episode Ratings by Season", x = "Season", y = "Rating") + theme_b99() + theme(panel.grid.minor = element_blank()) box_plot
Noice, Smort! We get the best of both worlds by combining two different
types of visualizing distributions. Let’s check out which episodes those
red dots are…
ep_rating_df %>% group_by(season) %>% mutate(n = n(), third_quantile = quantile(rating)[4]) %>% filter(rating > third_quantile + 1.58 * (IQR(rating))) %>% select(-n, -third_quantile)
## # A tibble: 3 x 4 ## # Groups: season [2] ## title rating season ep_num ## <chr> <dbl> <dbl> <int> ## 1 Johnny and Dora 9.1 2 23 ## 2 HalloVeen 9.5 5 4 ## 3 The Box 9.4 5 14 </int> </dbl> </dbl> </chr>
Besides the (re)appearances of Doug Judy, Brooklyn Nine-Nine is known
for its Halloween Episodes . I wanted to use gt
somewhere in this blog
post but that’s little bit overkill so a kable
table will suffice for
now.
ep_rating_df %>% group_by(season) %>% top_n(n = 5, wt = rating) %>% arrange(desc(rating)) %>% mutate(rank = row_number()) %>% arrange(season, desc(rating)) %>% filter(str_detect(title, "Hallo")) %>% rename(Title = title, Rating = rating, Season = season, `Episode Number` = ep_num, Rank = rank) %>% kable(format = "html", caption = "The Halloween Episodes and Rank in Season") %>% kable_styling(full_width = FALSE)The Halloween Episodes and Rank in Season Title Rating Season Episode Number Rank Halloween 8.5 1 6 4 Halloween II 8.7 2 4 2 Halloween III 8.9 3 5 2 Halloween IV 8.7 4 5 2 HalloVeen 9.5 5 4 1
It is very clear that the Halloween episodes are highly regarded to
those that rate Brooklyn Nine-Nine on IMDB (although I’m sure most
fans including myself will agree).
A more recent variant for showing a distribution is the ridge-line plot,
courtesy of the ggridges
package:
ep_rating_df2 <- ep_rating_df %>% mutate(season = as_factor(as.character(season))) ep_rating_df2 %>% ggplot(aes(x = rating, y = season, height = ..density..)) + ggridges::geom_density_ridges(color = "#F9FEFF", fill = "#000E33") + labs(title = "Episode Ratings by Season", x = "Rating", y = "Season") + scale_x_continuous(limits = c(7.25, 9.8), breaks = c(7.5, 8, 8.5, 9, 9.5), labels = c(7.5, 8, 8.5, 9, 9.5)) + scale_y_discrete(limits = rev(levels(ep_rating_df2$season))) + theme_b99() + theme(panel.grid.minor = element_blank())
## Picking joint bandwidth of 0.155
Well OK, we’re just looking at distributions… but I wanted to use this
GIF in some capacity!
Cast Appearances
The main cast of Brooklyn Nine-Nine are pretty tightly knit and as
members of the same precinct it makes sense that they’ll generally
appear together. So instead, I wanted to look at which non-main cast
Special
: Hitchcock and Scully weren’t officially “main cast” untilSeason 2but I left them out of the non-main cast list.
After scraping for the full cast list and main cast list, I anti_join()
them
so I am left with the non-main cast and the number of episodes that they
appeared in.
# Entire cast: cast_url <- bow("https://www.imdb.com/title/tt2467372/fullcredits?ref_=tt_cl_sm#cast") cast_info_raw <- scrape(cast_url) %>% html_nodes(".character") %>% html_text() %>% as_tibble() cast_info_clean <- cast_info_raw %>% separate(value, into = c("blank", "name", "episode_num", "dots"), sep = "\n") %>% mutate(episode_num = case_when( episode_num == " / ... " ~ dots, TRUE ~ episode_num), episode_num = episode_num %>% word(., 1, sep = "e") %>% as.numeric, name = str_trim(name, side = "both")) %>% select(-blank, -dots) %>% mutate(name = case_when( name == "Captain Ray Holt" ~ "Ray Holt", name == "Scully" ~ "Norm Scully", name == "Hitchcock" ~ "Michael Hitchcock", name == "Deputy Chief Madeline Wuntch" ~ "Madeline Wuntch", TRUE ~ name)) # Main cast: cast_main_url <- bow("https://en.wikipedia.org/wiki/List_of_Brooklyn_Nine-Nine_characters") cast_main_raw <- scrape(cast_main_url) %>% html_nodes(".wikitable") %>% html_table(fill = TRUE) %>% flatten_df() %>% as_tibble() cast_main_clean <- cast_main_raw %>% slice(-1) %>% select(Character) # anti-join non_main_cast <- anti_join(cast_info_clean, cast_main_clean, by = c("name" = "Character"))
Non-Main Cast Appearances Plot
To shorten the list I’ll just look at the top five cast members. I
created a halfway
variable so that the number labels will appear right
in the middle of each bar. Using the axis_canvas()
, draw_image()
,
and insert_axis_grob()
from the cowplot
package I can insert images
of the characters along the bottom of the plot.
non_main_plot <- non_main_cast %>% arrange(desc(episode_num)) %>% head(5) %>% mutate(halfway = episode_num / 2) %>% ggplot(aes(x = reorder(name, desc(episode_num)), y = episode_num)) + geom_col(fill = "#000E33") + geom_text(aes(y = halfway, label = episode_num, family = "Univers"), color = "#F9FEFF", size = 8) + scale_y_continuous(expand = c(0, 0), limits = c(0, 12.5), breaks = c(2, 4, 6, 8, 10, 12), labels = c(2, 4, 6, 8, 10, 12)) + labs(title = "Most Appearances by Non-Main Cast", x = NULL, y = "Number of Episodes") + theme_b99() + theme(panel.grid.major.x = element_blank()) # images pimage <- axis_canvas(non_main_plot, axis = 'x') + draw_image( "https://vignette.wikia.nocookie.net/tvdatabase/images/d/d5/Adrian_Pimento.jpg", x = 0.5, scale = 1.3, clip = "on") + draw_image( "https://vignette.wikia.nocookie.net/brooklynnine-nine/images/a/ab/Kevin.jpg", x = 1.5, scale = 1.3, clip = "on") + draw_image( "https://vignette.wikia.nocookie.net/brooklynnine-nine/images/b/b3/Wuntch.png", x = 2.5, scale = 1.3, clip = "on") + draw_image( "https://vignette.wikia.nocookie.net/brooklynnine-nine/images/0/0a/Doug_Judy.png", x = 3.5, scale = 1.3, clip = "on") + draw_image( "https://vignette.wikia.nocookie.net/brooklynnine-nine/images/2/23/Vulture.jpg", x = 4.5, scale = 1.3, clip = "on") # insert the image strip into the bar plot and draw ncast_plot <- ggdraw(insert_xaxis_grob(non_main_plot, pimage, position = "bottom")) ncast_plot
Adrian has appeared in the most episodes just beating out Kevin. This
really shows how involved Adrian was in the story in Season 4 and 5 especially compared to Kevin who has been popping in and out since
the first season. Everybody’s favorite DOUG JUDY rounds off this bar
chart next to The Vulture.
Viewer Numbers
Pretty much the same M.O. as what I did to get the episode ratings here.
One thing of note was using regex to get rid of the footnotes. I had to
be careful to double escape the square brackets there.
url_wiki_df <- tibble( urls = c("https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_1)", "https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_2)", "https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_3)", "https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_4)", "https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_5)"), season_num = c(1, 2, 3, 4, 5)) brooklyn99_ep_info <- function(url) { session <- bow(url) episode_raw <- scrape(session) %>% html_nodes(".wikiepisodetable") %>% html_table(fill = TRUE) %>% flatten_df() %>% as_tibble() %>% filter(row_number() %% 2 != 0) episode_table <- episode_raw %>% set_names(c("num_overall", "num_season", "title", "director", "writer", "air_date", "prod_code", "viewers")) %>% mutate(viewers = str_remove_all(viewers, "\\[[0-9]+\\]") %>% as.numeric) } ep_info_df <- map2(.x = url_wiki_df$urls, .y = url_wiki_df$season_num, ~ brooklyn99_ep_info(url = .x) %>% mutate(season = .y)) %>% reduce(rbind) ep_info_df %>% slice(77:83)
## # A tibble: 7 x 9 ## num_overall num_season title director writer air_date prod_code viewers ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> ## 1 77 9 "\"T~ Dean Ho~ Luke ~ Decembe~ 409 2.31 ## 2 78 10 "\"C~ Jaffar ~ Matt ~ Decembe~ 410 2.15 ## 3 7980 1112 "\"T~ Rebecca~ Carol~ January~ 411412 3.49 ## 4 81 13 "\"T~ Beth Mc~ Carly~ April 1~ 413 1.91 ## 5 82 14 "\"S~ Michael~ Andre~ April 1~ 414 1.91 ## 6 83 15 "\"T~ Linda M~ David~ April 2~ 415 1.88 ## 7 84 16 "\"M~ Maggie ~ Phil ~ May 2, ~ 418 1.72 ## # ... with 1 more variable: season <dbl> </dbl> </dbl> </chr> </chr> </chr> </chr> </chr> </chr> </chr>
tidylog Demonstration
A little intermission here as I wanted to talk about this cool package I
recently found on Twitter called
tidylog . To show you an example, I
Season
5
:library(tidylog)
## ## Attaching package: 'tidylog' ## The following objects are masked from 'package:dplyr': ## ## anti_join, distinct, filter, filter_all, filter_at, filter_if, ## full_join, group_by, group_by_all, group_by_at, group_by_if, ## inner_join, left_join, mutate, mutate_all, mutate_at, ## mutate_if, right_join, select, select_all, select_at, ## select_if, semi_join, transmute, transmute_all, transmute_at, ## transmute_if ## The following object is masked from 'package:stats': ## ## filter
ep_info_df %>% select(season, num_season, title, num_overall, writer, viewers) %>% filter(season == 5) %>% separate_rows(writer, sep = "&") %>% mutate(writer = writer %>% trimws) %>% group_by(writer) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(5)
## select: dropped 3 variables (director, air_date, prod_code) ## filter: removed 89 out of 111 rows (80%) ## mutate: changed 6 values (24%) of 'writer' (0 new NA) ## group_by: 16 groups (writer) ## # A tibble: 5 x 2 ## writer n ## <chr> <int> ## 1 Luke Del Tredici 3 ## 2 Carly Hallam Tosh 2 ## 3 Carol Kolb 2 ## 4 Dan Goor 2 ## 5 David Phillips 2 </int> </chr>
You can see that it gives you details on what each dplyr
function
changed in your data frame in the order that the operations were
performed. It’s pretty cool and it gets more useful as yourdplyr
pipeline grows longer! However, let’s turn tidylog
off for now:
options("tidylog.display" = list())
The episode info data frame we got from Wikipedia had a lot of
information so let’s chop it down a bit to get data on the number of
viewers per episode.
From what we saw of ep_info_df
earlier, the data looked pretty clean
already except for Episode 79 and 80 (“The Fugitive” episodes)…
There are other multi-part episodes throughout the show but this pair is
the only one that aired on the same day, on New Years Day 2017, so they
got smushed together in the Wikipedia table when we scraped it. There
wasn’t a quick and easy way to regex
them into separate rows so I just
filtered them out and added them back in. Then at the end of the pipe, I
created a first
and last
variable for each season taking note of what
overall episode number the first and last episodes of each season were.
You’ll see why I did this soon.
viewers_df <- ep_info_df %>% select(season, num_season, title, num_overall, viewers) %>% # manually fix "The Fugitive" episodes filter(num_overall != 7980) %>% add_row(season = 4, num_season = 11, title = "The Fugitive: Part 1", num_overall = 79, viewers = 3.49) %>% add_row(season = 4, num_season = 12, title = "The Fugitive: Part 2", num_overall = 80, viewers = 3.49) %>% mutate(num_overall = as.numeric(num_overall), num_season = as.numeric(num_season), season
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK