Get any US music chart listing from history in your R console
We are lucky enough to live in an age where we can get pretty much any factoid we want. If we want to find out the Top Billboard 200 albums from 1980, we just need to go to the official Billboard 200 website, enter the date, and up the list comes in a nice display with album art and all that nice stuff.
But often we don’t care about the nice stuff, and we don’t want to visit a website and go through several clicks to get the info we need. Wouldn’t it be great if we could just get it in our console with a simple function or command?
Well, if the website is well structured and is pulling its data from a structured dataset, then in all likelihood you can scrape it, meaning that you can extract the precise information you want into a vector or table, ready for whatever analysis you want to do.
In this article we are going to review elementary web scraping in R using the packages rvest and xml2. These packages are remarkably easy to use. By the end of the article we will have created a function called get_chart() which takes a date, a chart version and a vector of ranked positions as its arguments and instantly returns the chart entries in those positions on that date. I hope this will encourage you to try it on countless other sources of web data.
For this tutorial you will need to have installed the dplyr, xml2 and rvest packages. You will also need to be using the Google Chrome browser.
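If any of these are missing, they can be installed from CRAN first:
# install the required packages (only needed once)
install.packages(c("dplyr", "rvest", "xml2"))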
Getting started — Scraping this week’s Billboard Hot 100
We will start by working out how to scrape this week’s Billboard Hot 100 to get a ranked list of artists and titles. If you take a look at the Hot 100 page at https://www.billboard.com/charts/hot-100 you can observe its overall structure. It has various banners and advertising and there’s quite a lot going on, but you can immediately see that there is a Hot 100 list on the page, so we know that the information we want is on this page and we will need to navigate the underlying code to find it.
The packages rvest and xml2 in R are designed to make it easy to extract and analyse the deeply nested HTML and XML code that sits behind most websites today. HTML and XML are different (I won’t go into the details of that here), but you’ll usually need rvest to dig down and find the specific HTML nodes that you need, and xml2 to pull out the XML attributes that contain the specific data you want.
After we load our packages, the first thing we will want to do is read the html from the web page, so we have a starting point for digging to find the nodes and attributes we want.
# required libraries
library(rvest)
library(xml2)
library(dplyr)
# get url from input
input <- "https://www.billboard.com/charts/hot-100"
# read html code from url
chart_page <- xml2::read_html(input)
Now we have a list object chart_page which contains two elements, one for the head of the webpage and the other for the body of the webpage.
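You can check this quickly from the console. This is just a sanity check; the exact output depends on the page, but for a standard HTML document it should list a head and a body:
# list the names of the top-level children of the parsed page
xml2::xml_name(xml2::xml_children(chart_page))
# should return something like: "head" "body"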
We now need to inspect the website using Chrome. Right click on the page and choose ‘Inspect’. This will bring up a panel showing you all the nested HTML and XML code. As you roll your mouse over this code you will see that the part of the page it refers to is highlighted. For example, you can see that the section we are interested in highlights when I mouse over <div class = "container chart-container ...">, which makes sense.
So we need to find this in the body of the page. We use the rvest function html_nodes() to get the nodes of the body of the page, and the xml2 function xml_children() to find the parts of those nodes that we want to dig into.
# browse nodes in body of article
chart_nest_1 <- chart_page %>% rvest::html_nodes('body') %>% xml2::xml_children()
View(chart_nest_1)
This gives us a nested numbered list which we can click and browse through, like so:
List element 3 (child 3) contains the main page content according to its XML attributes, so we continue diving to find its children:
chart_nest_2 <- chart_nest_1 %>% xml2::xml_child(3) %>% xml2::xml_children()
View(chart_nest_2)
and if we again browse the resulting list, we can see under child 3 the precise XML attribute we are looking for:
By going in this fashion we can get to the precise segments of the code that we need to get the contents of the Hot 100 list, which turns out to be a few children down from the original body node:
# drill down XML children
chart <- chart_page %>% rvest::html_nodes('body') %>% xml2::xml_child(3) %>% xml2::xml_child(3) %>% xml2::xml_child(1)
We can now extract the attributes of the children to see what we are interested in:
# get contents and attributes of children
attrs <- chart %>% xml2::xml_children() %>% xml2::xml_contents() %>% xml2::xml_attrs()
View(attrs)
We can now see that the chart entries are inside this list (along with other things like advertising banners and videos):
We can now extract the data we want. To get rank, artist and title we just grab the data-rank, data-artist and data-title attributes as separate vectors and combine them into a dataframe. Some of the entries we download will not refer to chart entries but to other XML classes. These will appear as NA in our dataframe and can be easily removed.
# get ranks, artists, and titles as vectors
rank <- chart %>% xml2::xml_children() %>% xml2::xml_contents() %>% xml2::xml_attr('data-rank')
artist <- chart %>% xml2::xml_children() %>% xml2::xml_contents() %>% xml2::xml_attr('data-artist')
title <- chart %>% xml2::xml_children() %>% xml2::xml_contents() %>% xml2::xml_attr('data-title')
# combine into a dataframe and remove NAs
chart_df <- data.frame(rank, artist, title)
chart_df <- chart_df %>% dplyr::filter(!is.na(rank))
View(chart_df)
And there we have it, a nice list of exactly what we want, and it's good to know that there are 100 rows as expected.
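If you want to confirm that from the console rather than the viewer:
# sanity check: the Hot 100 should give exactly 100 rows
nrow(chart_df)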
Generalizing to pull any chart from any date
So that was a lot of investigative work, and digging into HTML and XML can be annoying. There are Chrome plugins like SelectorGadget that can help with this, but I find them unpredictable and prefer to just investigate the underlying code as I did above.
Now that we know where the data sits, we can make this a lot more powerful. If you play with the billboard.com website, you’ll notice that you can get to a specific chart on any historic date by simply editing the URL. For example, if you wanted to see the Billboard 200 as of 22nd March 1983 you would just go to https://www.billboard.com/charts/billboard-200/1983-03-22.
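That URL pattern is easy to build up in code, for example by pasting together the chart type and an ISO-formatted (YYYY-MM-DD) date:
# construct the chart URL for the Billboard 200 on 22nd March 1983
paste0("https://www.billboard.com/charts/", "billboard-200", "/", "1983-03-22")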
So this allows us to take the scraping code above and easily generalize it by creating a function that accepts the date, chart type and positions we are interested in. Let's write that function with some default values for date (today), chart type (the Hot 100) and positions (the top 10).
get_chart <- function(date = Sys.Date(), positions = c(1:10), type = "hot-100") {
# get url from input and pull html
input <- paste0("https://www.billboard.com/charts/", type, "/", date)
chart_page <- xml2::read_html(input)
# scrape data
chart <- chart_page %>% rvest::html_nodes('body') %>% xml2::xml_child(3) %>% xml2::xml_child(3) %>% xml2::xml_child(1)
rank <- chart %>% xml2::xml_children() %>% xml2::xml_contents() %>% xml2::xml_attr('data-rank')
artist <- chart %>% xml2::xml_children() %>% xml2::xml_contents() %>% xml2::xml_attr('data-artist')
title <- chart %>% xml2::xml_children() %>% xml2::xml_contents() %>% xml2::xml_attr('data-title')
# generate a display dataframe
chart_df <- data.frame(rank, artist, title)
chart_df <- chart_df %>% dplyr::filter(!is.na(rank), rank %in% positions)
chart_df
}
OK, let’s test our function. What were the Top 20 singles on 22nd March 1983?
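With our function, that's a one-liner:
# top 20 of the Hot 100 for 22nd March 1983
get_chart(date = "1983-03-22", positions = 1:20, type = "hot-100")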
What were the Top 10 albums on 1st April 1970?
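And similarly for the Billboard 200:
# top 10 of the Billboard 200 for 1st April 1970
get_chart(date = "1970-04-01", positions = 1:10, type = "billboard-200")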
What I love about rvest and xml2 is how simple and powerful they are. Look how lean the content of the function is; it didn't take much to create something quite powerful. Give it a try with some other sources of web data and feel free to add to the GitHub repo here if you create any other cool scraping functions.
Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter.
You can find out more about rvest here and xml2 here.