37

Visualizing Brooklyn Nine-Nine with R!

 5 years ago
source link: https://www.tuicool.com/articles/hit/MVzqeiu
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

(This article was first published on R by R(yo) , and kindly contributed toR-bloggers)

“Hey there criminal. It’s me, Johnny Law!” – Jake Peralta, NYPD.

Brooklyn Nine-Nine has become one of my favorite sitcoms in recent

years, probably taking over from Parks & Recreation and Community. So in

this blog post, I’m going to web scrape some very simple TV statistics,

clean it up with the tidyverse , and visualize it with ggplot2

“Terry loves ggplot2!”

Let’s get started!

Packages

pacman::p_load(tidyverse, rvest, glue, cowplot, ggbeeswarm, 
               polite, extrafont, knitr, kableExtra)
loadfonts() # "Univers" and "Univers LT 93 ExtraBlackEx" fonts

B99 Custom Theme

First, I created a custom Brooklyn Nine-Nine theme that I can put

on every plot. This will save me time from typing in

the same options over and over again! I googled the font type that the

official Brooklyn Nine-Nine media uses, downloaded them, and got it

installed for R using the extrafont package. For some of the different

colors you’ll see in the plots I sourced them from pasting in the

Brooklyn Nine-Nine logo and other official media images into

imagecolorpicker.com and saving the

hex codes that it gave me. Otherwise, I experimented with different

palettes from perusing

Emil

awesome

r-color-palettes

Github repository (Also special thanks to

David

for giving me some advice on

using color!).

theme_b99 <- function(){
  
  base_size <- 11
  half_line <- base_size / 2
  
  theme_minimal() %+replace%
    theme(text = element_text(family = "Univers", color = "#F9FEFF",
                              face = "plain", size = 14, 
                              hjust = 0.5, vjust = 0.5, angle = 0,
                              lineheight = 0.9, 
                              margin = margin(half_line, half_line,
                                              half_line, half_line),
                              debug = FALSE),
          plot.title = element_text(family = "Univers LT 93 ExtraBlackEx",
                                    size = 20, color = "#F9FEFF"),
          plot.background = element_rect(color = NA, fill = "#0053CD"), 
          panel.background = element_rect(color = NA, fill = "#0053CD"),
          # axis options
          axis.text = element_text(family = "Univers", color = "#F9FEFF", size = 12),
          axis.title = element_text(size = 14),
          # legend options (for ratings plot)
          legend.title = element_text(family = "Univers", color = "#F9FEFF"),
          legend.text = element_text(family = "Univers", color = "#F9FEFF",
                                     size = 9),
          legend.position = "bottom",
          legend.key = element_rect(colour = "black", linetype = "solid", size = 1.5),
          legend.background = element_rect(color = "black", fill = "#0053CD",
                                           linetype = "solid"))
}

Now the plots will all look very similar, just like a good police squad:

MrUjuaj.gif

With that done, I can start making plots!

Episode Ratings

As in my more

recent blog

posts I used the polite package to web scrape responsibly (note

the bow() and scrape() functions).

I map_2() over the IMDB page for the episodes for each season and I

append the season number to each episode row. After I’m done with the

web scraping I mutate() in the episode number for each season.

url_df <- tibble(
  urls = c("https://www.imdb.com/title/tt2467372/episodes?season=1",
           "https://www.imdb.com/title/tt2467372/episodes?season=2",
           "https://www.imdb.com/title/tt2467372/episodes?season=3",
           "https://www.imdb.com/title/tt2467372/episodes?season=4",
           "https://www.imdb.com/title/tt2467372/episodes?season=5"),
  season_num = c(1, 2, 3, 4, 5)) 

# scraping function:
brooklyn99_ep_rating <- function(url) {
  
  session <- bow(url)
  url2 <- scrape(session)
  
  # Grab episode names
  sX_ep_name <- url2 %>% 
    html_nodes(".info a") %>% 
    html_text() %>% 
    as_tibble() %>% 
    mutate(value = gsub("\\n", "", x = value))
  
  # Grab episode rating
  sX_rate <- url2 %>% 
    html_nodes(".ipl-rating-widget > .ipl-rating-star .ipl-rating-star__rating") %>% 
    html_text() %>% 
    as_tibble() %>%  
    mutate(rating = gsub("\\n", "", x = value) %>% as.numeric) %>% 
    select(-value)
  
  # Clean episode name df
  sX_ep_name <- sX_ep_name %>% 
    mutate(title = trimws(value)) %>% 
    filter(!str_detect(title, "Rate"), title != "") %>% 
    select(-value)
  
  # combine name + rating
  ep_rating <- sX_ep_name %>% 
    bind_cols(sX_rate)
}

ep_rating_df <- map2(.x = url_df$urls, .y = url_df$season_num,
                          ~ brooklyn99_ep_rating(url = .x) %>% 
         mutate(season = .y)) %>% 
  reduce(bind_rows)

ep_rating_df <- ep_rating_df %>% 
  group_by(season) %>% 
  mutate(ep_num = row_number()) %>% 
  ungroup()

glimpse(ep_rating_df)
## Observations: 112
## Variables: 4
## $ title  
<chr>
  "Pilot", "The Tagger", "The Slump", "M.E. Time", "The V...
## $ rating 
 <dbl>
   7.9, 7.7, 7.7, 7.8, 8.1, 8.5, 8.2, 7.9, 7.8, 8.3, 8.3, ...
## $ season 
  <dbl>
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ ep_num 
   <int>
     1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...

   </int>
  </dbl>
 </dbl>
</chr>

OK, looks good.

Episode Ratings Plot: Heatmap and Boxplot

I used geom_tile() to create a heat map of episode ratings with the

season number as the rows and the episode number for that season as the

columns. I also used the dichromat package for the color scheme

“LightBluetoDarkBlue.10”, it meshes pretty well with the Brooklyn Nine-Nine

color theme!

rating_plot <- ep_rating_df %>% 
  ggplot(aes(x = ep_num, y = season)) +
  geom_tile(aes(fill = rating), size = 1.5,
            color = "black") +
  scale_fill_gradientn("Rating", colors = dichromat::colorschemes$LightBluetoDarkBlue.10,
                       breaks = c(7.7, 8, 8.5, 9, 9.5)) +
  guides(fill = guide_colourbar(frame.colour = "black",
                                barwidth = unit(2, "in"))) +
  scale_x_continuous(expand = c(0.01, 0.01), 
                     breaks = c(1, seq(from = 5, to = 20, by = 5), 23),
                     labels = c(1, seq(from = 5, to = 20, by = 5), 23)) +
  scale_y_reverse(expand = c(0.01, 0.01),
                  breaks = c(1:7), 
                  labels = c(1:7)) +
  labs(x = "Episode Number", y = "Season",
       title = "Episode Ratings") +
  theme_b99() +
  theme(panel.grid = element_blank())

rating_plot

aeIbMj6.png!web

From what we can see on the graph there doesn’t seem to be any trends

along either axis besides the fact that the finales and the subsequent

opening episode of the following season are generally well-received. Two

stand out episodes from Season 5 can be clearly seen in the plot.

These episodes are “HalloVeen” (the fifth Halloween episode of the

series) and “The Box” (an excellent episode featuring

Sterling K.

Brown

as the perp).

Box Plot

Another way to visualize this would be with the box plot. This way

allows you to make a comparison between seasons, whereas the heat map

could be used to pick out any trends over time or sequence.

The debate between violin plot vs. box plot vs. sina plot vs. etc. rages

on to this day (some examples I’ve read over the years:

1 ,

2 ,

3 ) and at the end of the day, it may come down to personal preference. Since the data

I’m using is quite small (~20 episode ratings for each season), in my

case it might be better to use box plots (instead of violin plots) and

sprinkle bee swarm points on top with the

ggbeeswarm package. I commented

out some of the other methods if you wanted to copy-paste this code

chunk into your R console to try them out!

I highlighted the outliers for each season in red with the unintentional

result being that the color scheme mimics that of the flag of Colorado…

cols <- c("1" = "#6CA9C3", "2" = "#3A3533", 
          "3" = "#000E33", "4" = "#CBCFD2", "5" = "#175E78")

box_plot <- ep_rating_df %>% 
  ggplot(aes(x = season, y = rating, group = season)) +
  #geom_violin(color = "#F9FEFF", fill = "#000E33") +
  #ggthemes::geom_tufteboxplot(color = "#F9FEFF", fill = "#000E33") +
  #ggforce::geom_sina(color = "#F9FEFF", fill = "#000E33") +
  geom_boxplot(color = "#F9FEFF", fill = "#000E33",
              outlier.color = "red", outlier.size = 5) +
  geom_beeswarm(color = "#FCF40E", cex = 2, size = 2.25) +
  scale_y_continuous(limits = c(7.5, 9.6),
                     labels = c(7.5, 8, 8.5, 9, 9.5)) +
  labs(title = "Episode Ratings by Season",
       x = "Season", y = "Rating") +
  theme_b99() +
  theme(panel.grid.minor = element_blank())

box_plot

Jni6ziY.png!web

Noice, Smort! We get the best of both worlds by combining two different

types of visualizing distributions. Let’s check out which episodes those

red dots are…

ep_rating_df %>% 
  group_by(season) %>% 
  mutate(n = n(),
         third_quantile = quantile(rating)[4]) %>% 
  filter(rating > third_quantile + 1.58 * (IQR(rating))) %>% 
  select(-n, -third_quantile)
## # A tibble: 3 x 4
## # Groups:   season [2]
##   title           rating season ep_num
##   
<chr>            
 <dbl>  
  <dbl>  
   <int>
    
## 1 Johnny and Dora    9.1      2     23
## 2 HalloVeen          9.5      5      4
## 3 The Box            9.4      5     14

   </int>
  </dbl>
 </dbl>
</chr>

Besides the (re)appearances of Doug Judy, Brooklyn Nine-Nine is known

for its Halloween Episodes . I wanted to use gt somewhere in this blog

post but that’s little bit overkill so a kable table will suffice for

now.

ep_rating_df %>% 
  group_by(season) %>% 
  top_n(n = 5, wt = rating) %>% 
  arrange(desc(rating)) %>% 
  mutate(rank = row_number()) %>% 
  arrange(season, desc(rating)) %>% 
  filter(str_detect(title, "Hallo")) %>% 
  rename(Title = title, Rating = rating, Season = season,
         `Episode Number` = ep_num, Rank = rank) %>% 
  kable(format = "html",
        caption = "The Halloween Episodes and Rank in Season") %>% 
  kable_styling(full_width = FALSE)
The Halloween Episodes and Rank in Season Title Rating Season Episode Number Rank Halloween 8.5 1 6 4 Halloween II 8.7 2 4 2 Halloween III 8.9 3 5 2 Halloween IV 8.7 4 5 2 HalloVeen 9.5 5 4 1

It is very clear that the Halloween episodes are highly regarded to

those that rate Brooklyn Nine-Nine on IMDB (although I’m sure most

fans including myself will agree).

MFNfAfB.gif

A more recent variant for showing a distribution is the ridge-line plot,

courtesy of the ggridges

package:

ep_rating_df2 <- ep_rating_df %>% 
  mutate(season = as_factor(as.character(season)))

ep_rating_df2 %>% 
  ggplot(aes(x = rating, y = season, height = ..density..)) +
  ggridges::geom_density_ridges(color = "#F9FEFF", fill = "#000E33") +
  labs(title = "Episode Ratings by Season",
       x = "Rating", y = "Season") +
  scale_x_continuous(limits = c(7.25, 9.8),
                     breaks = c(7.5, 8, 8.5, 9, 9.5),
                     labels = c(7.5, 8, 8.5, 9, 9.5)) +
  scale_y_discrete(limits = rev(levels(ep_rating_df2$season))) +
  theme_b99() +
  theme(panel.grid.minor = element_blank())
## Picking joint bandwidth of 0.155

zy63uae.png!web

IJNZFfa.gif

Well OK, we’re just looking at distributions… but I wanted to use this

GIF in some capacity!

Cast Appearances

The main cast of Brooklyn Nine-Nine are pretty tightly knit and as

members of the same precinct it makes sense that they’ll generally

appear together. So instead, I wanted to look at which non-main cast

members and guests made the most appearances on the show.

Special

: Hitchcock and Scully weren’t officially “main cast” until

Season 2but I left them out of the non-main cast list.

After scraping for the full cast list and main cast list, I anti_join() them

so I am left with the non-main cast and the number of episodes that they

appeared in.

# Entire cast:
cast_url <- bow("https://www.imdb.com/title/tt2467372/fullcredits?ref_=tt_cl_sm#cast")

cast_info_raw <- scrape(cast_url) %>% 
  html_nodes(".character") %>% 
  html_text() %>% 
  as_tibble()

cast_info_clean <- cast_info_raw %>% 
  separate(value, into = c("blank", "name", "episode_num", "dots"), sep = "\n") %>% 
  mutate(episode_num = case_when(
    episode_num == "         / ...  " ~ dots,
    TRUE ~ episode_num),
    episode_num = episode_num %>% word(., 1, sep = "e") %>% as.numeric,
    name = str_trim(name, side = "both")) %>% 
  select(-blank, -dots) %>% 
  mutate(name = case_when(
    name == "Captain Ray Holt" ~ "Ray Holt",
    name == "Scully" ~ "Norm Scully",
    name == "Hitchcock" ~ "Michael Hitchcock",
    name == "Deputy Chief Madeline Wuntch" ~ "Madeline Wuntch",
    TRUE ~ name))
  
# Main cast:
cast_main_url <- bow("https://en.wikipedia.org/wiki/List_of_Brooklyn_Nine-Nine_characters")

cast_main_raw <- scrape(cast_main_url) %>% 
  html_nodes(".wikitable") %>% 
  html_table(fill = TRUE) %>% 
  flatten_df() %>% 
  as_tibble()

cast_main_clean <- cast_main_raw %>% 
  slice(-1) %>% 
  select(Character)

# anti-join
non_main_cast <- anti_join(cast_info_clean, cast_main_clean, 
                           by = c("name" = "Character"))

Non-Main Cast Appearances Plot

To shorten the list I’ll just look at the top five cast members. I

created a halfway variable so that the number labels will appear right

in the middle of each bar. Using the axis_canvas() , draw_image() ,

and insert_axis_grob() from the cowplot package I can insert images

of the characters along the bottom of the plot.

non_main_plot <- non_main_cast %>% 
  arrange(desc(episode_num)) %>% 
  head(5) %>% 
  mutate(halfway = episode_num / 2) %>% 
  ggplot(aes(x = reorder(name, desc(episode_num)), y = episode_num)) +
  geom_col(fill = "#000E33") +
  geom_text(aes(y = halfway, label = episode_num,
                family = "Univers"),
            color = "#F9FEFF",
            size = 8) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 12.5),
                     breaks = c(2, 4, 6, 8, 10, 12),
                     labels = c(2, 4, 6, 8, 10, 12)) +
  labs(title = "Most Appearances by Non-Main Cast", 
       x = NULL, y = "Number of Episodes") +
  theme_b99() +
  theme(panel.grid.major.x = element_blank())

# images
pimage <- axis_canvas(non_main_plot, axis = 'x') + 
  draw_image(
    "https://vignette.wikia.nocookie.net/tvdatabase/images/d/d5/Adrian_Pimento.jpg", 
    x = 0.5, scale = 1.3, clip = "on") +
  draw_image(
    "https://vignette.wikia.nocookie.net/brooklynnine-nine/images/a/ab/Kevin.jpg", 
    x = 1.5, scale = 1.3, clip = "on") +
  draw_image(
    "https://vignette.wikia.nocookie.net/brooklynnine-nine/images/b/b3/Wuntch.png", 
    x = 2.5, scale = 1.3, clip = "on") +
  draw_image(
    "https://vignette.wikia.nocookie.net/brooklynnine-nine/images/0/0a/Doug_Judy.png",
    x = 3.5, scale = 1.3, clip = "on") +
  draw_image(
    "https://vignette.wikia.nocookie.net/brooklynnine-nine/images/2/23/Vulture.jpg",
    x = 4.5, scale = 1.3, clip = "on")

# insert the image strip into the bar plot and draw  
ncast_plot <- ggdraw(insert_xaxis_grob(non_main_plot, pimage, position = "bottom"))

ncast_plot

JJFNVza.png!web

Adrian has appeared in the most episodes just beating out Kevin. This

really shows how involved Adrian was in the story in Season 4 and 5 especially compared to Kevin who has been popping in and out since

the first season. Everybody’s favorite DOUG JUDY rounds off this bar

chart next to The Vulture.

JrmMRrM.gif

Viewer Numbers

Pretty much the same M.O. as what I did to get the episode ratings here.

One thing of note was using regex to get rid of the footnotes. I had to

be careful to double escape the square brackets there.

url_wiki_df <- tibble(
  urls = c("https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_1)",
           "https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_2)",
           "https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_3)",
           "https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_4)",
           "https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_5)"),
  season_num = c(1, 2, 3, 4, 5))

brooklyn99_ep_info <- function(url) {
  
  session <- bow(url)
  
  episode_raw <- scrape(session) %>% 
    html_nodes(".wikiepisodetable") %>% 
    html_table(fill = TRUE) %>% 
    flatten_df() %>% 
    as_tibble() %>% 
    filter(row_number() %% 2 != 0)
  
  episode_table <- episode_raw %>% 
  set_names(c("num_overall", "num_season", "title", "director", "writer",
              "air_date", "prod_code", "viewers")) %>% 
  mutate(viewers = str_remove_all(viewers, "\\[[0-9]+\\]") %>% as.numeric)
  
}

ep_info_df <- map2(.x = url_wiki_df$urls, .y = url_wiki_df$season_num,
                          ~ brooklyn99_ep_info(url = .x) %>% 
         mutate(season = .y)) %>% 
  reduce(rbind)

ep_info_df %>% 
  slice(77:83)
## # A tibble: 7 x 9
##   num_overall num_season title director writer air_date prod_code viewers
##   
<chr>       
 <chr>      
  <chr> 
   <chr>    
    <chr>  
     <chr>    
      <chr>       
       <dbl>
        
## 1 77          9          "\"T~ Dean Ho~ Luke ~ Decembe~ 409          2.31
## 2 78          10         "\"C~ Jaffar ~ Matt ~ Decembe~ 410          2.15
## 3 7980        1112       "\"T~ Rebecca~ Carol~ January~ 411412       3.49
## 4 81          13         "\"T~ Beth Mc~ Carly~ April 1~ 413          1.91
## 5 82          14         "\"S~ Michael~ Andre~ April 1~ 414          1.91
## 6 83          15         "\"T~ Linda M~ David~ April 2~ 415          1.88
## 7 84          16         "\"M~ Maggie ~ Phil ~ May 2, ~ 418          1.72
## # ... with 1 more variable: season 
        <dbl>

        </dbl>
       </dbl>
      </chr>
     </chr>
    </chr>
   </chr>
  </chr>
 </chr>
</chr>

tidylog Demonstration

A little intermission here as I wanted to talk about this cool package I

recently found on Twitter called

tidylog . To show you an example, I

will take a look at which writers wrote the most episodes in

Season

5

:
library(tidylog)
## 
## Attaching package: 'tidylog'

## The following objects are masked from 'package:dplyr':
## 
##     anti_join, distinct, filter, filter_all, filter_at, filter_if,
##     full_join, group_by, group_by_all, group_by_at, group_by_if,
##     inner_join, left_join, mutate, mutate_all, mutate_at,
##     mutate_if, right_join, select, select_all, select_at,
##     select_if, semi_join, transmute, transmute_all, transmute_at,
##     transmute_if

## The following object is masked from 'package:stats':
## 
##     filter
ep_info_df %>% 
  select(season, num_season, title, num_overall, writer, viewers) %>% 
  filter(season == 5) %>% 
  separate_rows(writer, sep = "&") %>% 
  mutate(writer = writer %>% trimws) %>% 
  group_by(writer) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  head(5)
## select: dropped 3 variables (director, air_date, prod_code)

## filter: removed 89 out of 111 rows (80%)

## mutate: changed 6 values (24%) of 'writer' (0 new NA)

## group_by: 16 groups (writer)

## # A tibble: 5 x 2
##   writer                n
##   
<chr>             
 <int>
  
## 1 Luke Del Tredici      3
## 2 Carly Hallam Tosh     2
## 3 Carol Kolb            2
## 4 Dan Goor              2
## 5 David Phillips        2

 </int>
</chr>

You can see that it gives you details on what each dplyr function

changed in your data frame in the order that the operations were

performed. It’s pretty cool and it gets more useful as your dplyr

pipeline grows longer! However, let’s turn tidylog off for now:

options("tidylog.display" = list())

The episode info data frame we got from Wikipedia had a lot of

information so let’s chop it down a bit to get data on the number of

viewers per episode.

From what we saw of ep_info_df earlier, the data looked pretty clean

already except for Episode 79 and 80 (“The Fugitive” episodes)…

Zb6b2a3.gif

There are other multi-part episodes throughout the show but this pair is

the only one that aired on the same day, on New Years Day 2017, so they

got smushed together in the Wikipedia table when we scraped it. There

wasn’t a quick and easy way to regex them into separate rows so I just

filtered them out and added them back in. Then at the end of the pipe, I

created a first and last variable for each season taking note of what

overall episode number the first and last episodes of each season were.

You’ll see why I did this soon.

viewers_df <- ep_info_df %>% 
  select(season, num_season, title, num_overall, viewers) %>% 
  # manually fix "The Fugitive" episodes
  filter(num_overall != 7980) %>% 
  add_row(season = 4, num_season = 11, title = "The Fugitive: Part 1", 
          num_overall = 79, viewers = 3.49) %>% 
  add_row(season = 4, num_season = 12, title = "The Fugitive: Part 2", 
          num_overall = 80, viewers = 3.49) %>% 
  mutate(num_overall = as.numeric(num_overall),
         num_season = as.numeric(num_season),
         season

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK