Created by Keith McNulty on 2nd September 2022
- Web Page Structure and Format
- Basic harvesting: The Billboard Hot 100 page
- Making scraping easy by automating tasks
Any webpage you visit has a particular, expected general structure. It usually consists of two types of code:
- HTML code, which focuses on the appearance and format of a web page.
- XML code, which doesn’t look a lot different from HTML but focuses more on managing data in a web page.
HTML code has an expected format and structure, to make it easy for people to develop web pages. Here is an example of a simple HTML page:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>
As you can see, the content is wrapped in tags like <head></head>, <body></body> and <p></p>. These tags are pre-defined by the language (you can only use the tags that HTML allows). Because HTML has a more predictable structure, it is often easier to work with it and mine it.
XML format and structure is less predictable. Although it looks very similar to HTML, users can create their own named tags. Here is an example:
<note>
<to>Keith</to>
<from>Steve</from>
<heading>Kudos</heading>
<body>Awesome work, dude!</body>
</note>
Tags like <to></to> and <from></from> are completely made up by me. The fact that tags are not pre-defined makes XML a little harder to mine and analyze. But it’s hard to get at some of the data on the web without using XML.
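Even with made-up tags, the xml2 package can read this kind of document once you know the tag names. The following is a minimal sketch that parses the note above from a string (so nothing is downloaded); the variable names are illustrative only.

```r
library(xml2)

# The note from the example above, supplied as literal XML
note <- read_xml(
  "<note>
     <to>Keith</to>
     <from>Steve</from>
     <heading>Kudos</heading>
     <body>Awesome work, dude!</body>
   </note>"
)

# The tag names are whatever the author of the XML chose
tag_names <- xml_name(xml_children(note))

# Pull the text of one specific tag using an XPath expression
sender <- xml_text(xml_find_first(note, "//from"))

# Or collect every child tag into a named character vector
note_fields <- setNames(xml_text(xml_children(note)), tag_names)

tag_names
sender
note_fields[["body"]]
```

Because the tag names are not fixed by the language, the XPath expression has to be adapted to each XML source you work with.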
To mine web data, it’s important that you can see the underlying code and understand how it relates to what you are seeing on the page. The best way to do this (in my opinion) is to use the Developer Tools that come with Google Chrome.
When you are viewing a web page in Chrome, simply use Ctrl+Shift+C in Windows or Cmd+Option+C on a Mac to open up the Elements console where you can see all the code underlying the page. This can look really complex, but don’t worry. Here’s a screenshot of Google Chrome Developer Tools open on the Billboard Hot 100 page:
If you play around with the code in the Developer you will see that it has an embedded structure.
- At the highest level there is a <html> tag.
- At the second level there are <head> and <body> tags.
- Inside the <body> of the page, different elements are often separated by <div> tags.
- Many different types of tags continue to be embedded down to many nested levels.
This is important because it means we can mine elements of a web page and treat them like lists in R. We often call a specific element of the page a node. So if we want to mine a specific node, we can capture its sub-nodes in a list. This gives us the opportunity to apply the tidyverse when mining web pages. The process of mining data from the web is called scraping or harvesting.
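To make the nesting concrete, here is a small sketch that parses the simple HTML page shown earlier from a string and walks down two levels. It assumes the rvest package is installed; the object names are illustrative.

```r
library(rvest)

# The simple page from earlier, supplied as literal HTML
page <- read_html(
  "<!DOCTYPE html>
   <html>
     <head><title>Page Title</title></head>
     <body>
       <h1>This is a Heading</h1>
       <p>This is a paragraph.</p>
     </body>
   </html>"
)

# Second level: the children of the <html> node
top_level <- page |>
  html_children() |>
  html_name()

# Third level: the children of the <body> node
body_children <- page |>
  html_node("body") |>
  html_children()

top_level
html_name(body_children)
html_text(body_children)

# A node set can be indexed like a list
html_text(body_children[[1]])
```

Each call to html_children() moves one level deeper, and the resulting node set behaves much like an R list of nodes.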
The rvest and xml2 packages were designed to make it easier for people working in R to harvest web data. Since xml2 is a required package for rvest and the idea is that both packages work together, you only need to install rvest. First, let’s ensure the packages we need are installed and loaded:
# install rvest and dplyr only if they are not already available
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(rvest)
library(dplyr)
rvest and xml2 contain functions that allow us to read the code of a web page, break it into a neat structure, and work with the pipe command to efficiently find and extract specific pieces of information. Think of it a bit like performing keyhole surgery on a webpage. Once you understand what functions are available and what they do, it makes basic web scraping very easy and can produce really powerful functionality.
We are going to use the example of mining the Billboard Hot 100 page at https://www.billboard.com/charts/hot-100. If you view this page, it’s pretty bling. There are videos popping up, images all over the place. But the basic point of the page is to show the current Hot 100 chart.
So let’s set ourselves the task of just harvesting the basic info from this page: Rank, Artist, Song Title for the Hot 100.
With our packages loaded, we use the function read_html() to capture the HTML code of the Billboard Hot 100 page.
hot100page <- "https://www.billboard.com/charts/hot-100"
hot100 <- read_html(hot100page)
hot100
## {html_document}
## <html lang="en-US">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="pmc-chart-template-default single single-pmc-chart postid-14 ...
str(hot100)
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
The function has captured the entire content of the page in the form of a special list-type document with two nodes: <head> and <body>. Almost always we are interested in the body of a web page. You can select a node using html_node() and then see its child nodes using html_children().
body_nodes <- hot100 |>
  html_node("body") |>
  html_children()
body_nodes
## {xml_nodeset (7)}
## [1] <noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-MD ...
## [2] <div class="floating-preroll-ad floating-preroll-ad-v2">\n\t<div class="f ...
## [3] <script type="text/javascript">\n\twindow.pmc_jwplayer.load("xFzS1DCh");\ ...
## [4] <noscript>\n\t\t<img src="https://sb.scorecardresearch.com/p?c1=2&c2= ...
## [5] <script type="text/plain" class="optanon-category-C0002">\n\t\tvar _qeven ...
## [6] <div id="skin-ad-section" data-content-container="main-wrapper">\n\n\t<di ...
## [7] <div id="main-wrapper" class="u-overflow-hidden"> \n\t<div id="modal-back ...
If we want, we can go one level deeper, to see the nodes inside the nodes. In this way, we can just continue to pipe deeper into the code:
body_nodes |>
  html_children()
## {xml_nodeset (75)}
## [1] <iframe src="https://www.googletagmanager.com/ns.html?id=GTM-MD3ZLS" hei ...
## [2] <div class="floating-preroll-ad-container">\n\t\t\t\t\t<div class="float ...
## [3] <img src="https://sb.scorecardresearch.com/p?c1=2&c2=6035310&c3= ...
## [4] <div id="skin-ad-left-rail-container"></div>
## [5] <div id="skin-ad-right-rail-container"></div>
## [6] <div id="skin-ad-inject-container">\n\t\t\t<div class="admz" id="adm-res ...
## [7] <div id="modal-background" class="modal-background"></div>
## [8] <a href="#pagetop" class="a-screen-reader-shortcut">\n\t\t\tSkip to main ...
## [9] <div class="ad_block above-header-ad lrv-u-background-color-black">\n\t\ ...
## [10] <header class="header // js-Header lrv-u-margin-lr-auto"><div class="js- ...
## [11] <div class="sticky-leaderboard-ad"></div>
## [12] <main><a name="pagetop"></a>\n\t\t<div id="eventListHolder" class="nodis ...
## [13] <div class="admz" id="adm-footer">\n\t\t\t\t\t<div class="adma boomerang ...
## [14] <footer class="footer // site-footer"><div class="lrv-a-wrapper lrv-u-pa ...
## [15] <div class="pmc-footer // lrv-u-background-color-white">\n\t<div class=" ...
## [16] <div role="dialog" aria-modal="true" class="mega-menu // js-MegaMenu a-m ...
## [17] <div id="pmc-ad-bait" class="pub_300x250 pub_300x250m pub_728x90 text-ad ...
## [18] <script>\n\t\t\tif ( 'undefined' !== typeof jQuery ) {\n\t\t\t\tvar $pmc ...
## [19] <script type="text/plain" class="optanon-category-C0004" async="async" s ...
## [20] <script type="text/liquid" id="ac_article">\n\t<div class="ac_title ac_a ...
## ...
So we could mess around with the functions above for a long time, but might find it hard to work out where exactly this chart data is. We can use Chrome Developer Tools to tell us where we can find the data in the code, and then we can use rvest to harvest out the data.
If you run your mouse over the code in the Developer Console, you will see that the elements of the page that the code refers to are highlighted in the browser. You can click to expand embedded nodes to get to more specific parts of the page. As elements are highlighted, you will see that their full HTML node identifiers appear, which are strings starting with the HTML tag followed by the various class names, all separated by periods. These help you zoom in on the specific elements of a web page which you hope to harvest data from. If the code is very complex, another option is to right-click the element you are interested in in the browser and then choose ‘Inspect’, which will take you straight to the element’s code in the Developer Console. Here is another image showing how to find the full HTML identifier:
If we look carefully, we will see that the chart position for each entry in the Billboard Hot 100 list is contained in a ul tag with the class o-chart-results-list-row, and is an XML attribute called data-detail-target.
Therefore, we can use the following code to pull a vector of chart positions, which we expect to have length 100.
# get rank vector
rank <- hot100 |>
  rvest::html_nodes('ul.o-chart-results-list-row') |>
  xml2::xml_attr('data-detail-target')
# check it has length 100
length(rank)
## [1] 100
Perfect! Now for the vector of song titles. We can find these inside h3 nodes whose class starts with c-title.a-no-trucate.a-font-primary-bold-s.u-letter-spacing-0021. So we just need to get whatever text is contained in these tags. Because text in web pages can often be surrounded by spaces and tabs, we can use the trimws() function to remove any surrounding whitespace.
# get title vector
title <- hot100 |>
  rvest::html_nodes('h3.c-title.a-no-trucate.a-font-primary-bold-s.u-letter-spacing-0021') |>
  rvest::html_text() |>
  trimws()
# check it is of length 100
length(title)
## [1] 100
Similarly we can get the vector of artists:
# get artist vector
artist <- hot100 |>
  rvest::html_nodes('span.c-label.a-no-trucate.a-font-primary-s') |>
  rvest::html_text() |>
  trimws()
# check it is of length 100
length(artist)
## [1] 100
That’s the Billboard Hot 100! Nice! Now we can combine them all into a neat dataframe.
chart_df <- data.frame(rank, artist, title)
knitr::kable(
  chart_df |> head(10)
)
rank | artist | title |
---|---|---|
1 | Harry Styles | As It Was |
2 | Lizzo | About Damn Time |
3 | Steve Lacy | Bad Habit |
4 | Kate Bush | Running Up That Hill (A Deal With God) |
5 | Beyonce | Break My Soul |
6 | Nicky Youre & dazy | Sunroof |
7 | Nicki Minaj | Super Freaky Girl |
8 | Future Featuring Drake & Tems | Wait For U |
9 | Bad Bunny & Chencho Corleone | Me Porto Bonito |
10 | Post Malone Featuring Doja Cat | I Like You (A Happier Song) |
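Since the live page can change at any time, it can help to rehearse the same three extraction steps offline. The sketch below uses a hand-written HTML snippet; its class names, the data-position attribute and the song and artist names are all invented for illustration and are not Billboard’s.

```r
library(rvest)

# Hand-written stand-in for a chart page (all names are hypothetical)
mock_chart <- read_html('
  <ul class="chart-row" data-position="1">
    <li>
      <h3 class="song-title bold"> \n\t First Song \n </h3>
      <span class="artist-name small"> \t Artist A \n </span>
    </li>
  </ul>
  <ul class="chart-row" data-position="2">
    <li>
      <h3 class="song-title bold"> \n\t Second Song \n </h3>
      <span class="artist-name small"> \t Artist B \n </span>
    </li>
  </ul>
')

# Selector = tag name followed by class names, separated by periods
mock_rank <- mock_chart |>
  html_nodes("ul.chart-row") |>
  xml2::xml_attr("data-position")   # read an attribute value

mock_title <- mock_chart |>
  html_nodes("h3.song-title.bold") |>
  html_text() |>                     # read the text inside the tag
  trimws()                           # strip surrounding spaces, tabs, newlines

mock_artist <- mock_chart |>
  html_nodes("span.artist-name.small") |>
  html_text() |>
  trimws()

mock_df <- data.frame(rank = mock_rank, artist = mock_artist, title = mock_title)
mock_df
```

Note that xml_attr() returns character values, so the ranks are strings until you convert them, for example with as.integer().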
Generally we don’t just scrape a single webpage for fun. We are usually scraping because there is information that we need on a large scale or on a regular basis. Therefore, once you have worked out how to scrape this information, you’ll need to set things up in a way that it is easy to obtain it in the future. Writing functions is often a good way of doing this.
If you take a look around the Billboard site, you’ll see that you can basically look up any chart at any date in history by simply inserting the chart name and date at the appropriate point in the URL. For example, to see the Billboard 200 on 21st July 1972 you would navigate to https://www.billboard.com/charts/billboard-200/1972-07-21.
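The URL pattern described above can be captured in a small helper. This is a sketch only: build_chart_url() is a hypothetical name, and it assumes the site keeps the /charts/<chart-name>/<YYYY-MM-DD> pattern shown in the example.

```r
# Hypothetical helper: build a chart URL from a chart name and a date
build_chart_url <- function(type, date) {
  # as.Date() raises an error for strings that are not valid dates,
  # and format() guarantees the YYYY-MM-DD form used in the URL
  date <- format(as.Date(date), "%Y-%m-%d")
  paste0("https://www.billboard.com/charts/", type, "/", date)
}

single_url <- build_chart_url("billboard-200", "1972-07-21")
single_url

# The helper is vectorised, so several weekly dates can be built at once
weekly_urls <- build_chart_url("hot-100", as.Date("1975-01-20") + 7 * 0:2)
weekly_urls
```

Validating the date before building the URL means a typo fails early in R rather than as a confusing error from the web request.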
Since this will always produce a webpage in exactly the same structure as the one we just scraped, we can now create quite a powerful function that accepts a chart name, date and set of ranks, and returns the entries for that chart on that date in those ranks.
#' Get billboard chart entries from history
#'
#' @param date date in the form YYYY-MM-DD
#' @param positions numeric vector
#' @param type character string of chart type (as per billboard.com URLs)
#' @return a dataframe of rank, artist, title
#' @examples get_chart(date = "1972-11-02", positions = c(1:100), type = "billboard-200")
get_chart <- function(date = Sys.Date(), positions = 1:10, type = "hot-100") {
  # get url from input and read html
  input <- paste0("https://www.billboard.com/charts/", type, "/", date)
  chart_page <- xml2::read_html(input)

  # scrape data
  rank <- chart_page |>
    rvest::html_nodes('ul.o-chart-results-list-row') |>
    xml2::xml_attr('data-detail-target')

  title <- chart_page |>
    rvest::html_nodes('h3.c-title.a-no-trucate.a-font-primary-bold-s.u-letter-spacing-0021') |>
    rvest::html_text() |>
    trimws()

  artist <- chart_page |>
    rvest::html_nodes('span.c-label.a-no-trucate.a-font-primary-s') |>
    rvest::html_text() |>
    trimws()

  # create dataframe, remove nas and return result
  chart_df <- data.frame(rank, artist, title)
  chart_df <- chart_df |>
    dplyr::filter(!is.na(rank), rank %in% positions)
  chart_df
}
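The final filtering step of the function can be tried on its own without touching the web. The sketch below uses a made-up data frame (mock_scraped and its contents are hypothetical) to show how missing ranks are dropped and how character ranks are matched against numeric positions.

```r
library(dplyr)

# Made-up scraped values: ranks arrive as character strings, and one is missing
mock_scraped <- data.frame(
  rank   = c("1", "2", "3", NA),
  artist = c("Artist A", "Artist B", "Artist C", "Artist D"),
  title  = c("Song A", "Song B", "Song C", "Song D")
)

positions <- c(1, 3)

# %in% coerces both sides to a common type, so "1" matches 1
filtered <- mock_scraped |>
  dplyr::filter(!is.na(rank), rank %in% positions)
filtered

# Optional: convert ranks to integers if numeric sorting is needed later
filtered_int <- filtered |>
  dplyr::mutate(rank = as.integer(rank))
filtered_int
```

Keeping the ranks as strings is fine for display, but converting them avoids "10" sorting before "2" if the result is ordered later.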
Now let’s test our function by looking up the Top 10 singles from 20th January 1975:
test1 <- get_chart(date = "1975-01-20", positions = 1:10, type = "hot-100")
knitr::kable(test1)
rank | artist | title |
---|---|---|
1 | Carpenters | Please Mr. Postman |
2 | Neil Sedaka | Laughter In The Rain |
3 | Barry Manilow | Mandy |
4 | Ohio Players | Fire |
5 | Stevie Wonder | Boogie On Reggae Woman |
6 | Linda Ronstadt | You’re No Good |
7 | Paul Anka with Odia Coates | One Man Woman/One Woman Man |
8 | Donny & Marie Osmond | Morning Side Of The Mountain |
9 | Gloria Gaynor | Never Can Say Goodbye |
10 | AWB | Pick Up The Pieces |
Similarly, we can create a function get_eurovision() to scrape the results of any Eurovision Song Contest since 1957. I will source this function from inside this repo and then grab the 1974 contest results:
source("eurovision_scraping.R")
eurovision_1974 <- get_eurovision(1974)
knitr::kable(eurovision_1974)
R/O | Country | Artist | Song | Language[6][7] | Points | Place[8] |
---|---|---|---|---|---|---|
8 | Sweden | ABBA | “Waterloo” | English | 24 | 1 |
17 | Italy | Gigliola Cinquetti | “Sì” | Italian | 18 | 2 |
12 | Netherlands | Mouth and MacNeal | “I See a Star” | English | 15 | 3 |
2 | United Kingdom | Olivia Newton-John | “Long Live Love” | English | 14 | 4 |
9 | Luxembourg | Ireen Sheer | “Bye Bye I Love You” | French[b] | 14 | 4 |
10 | Monaco | Romuald | “Celui qui reste et celui qui s’en va” | French | 14 | 4 |
6 | Israel | Poogy | “Natati La Khayay” (נתתי לה חיי) | Hebrew | 11 | 7 |
13 | Ireland | Tina Reynolds | “Cross Your Heart” | English | 11 | 7 |
3 | Spain | Peret | “Canta y sé feliz” | Spanish | 10 | 9 |
11 | Belgium | Jacques Hustin | “Fleur de liberté” | French | 10 | 9 |
5 | Greece | Marinella | “Krasi, thalassa ke t’ agori mou”(Κρασί, θάλασσα και τ’ αγόρι μου) | Greek | 7 | 11 |
7 | Yugoslavia | Korni Grupa | “Generacija ’42” (Генерација ’42) | Serbo-Croatian | 6 | 12 |
1 | Finland | Carita | “Keep Me Warm” | English | 4 | 13 |
4 | Norway | Anne-Karine Strøm and the Bendik Singers | “The First Day of Love” | English | 3 | 14 |
14 | Germany | Cindy and Bert | “Die Sommermelodie” | German | 3 | 14 |
15 | Switzerland | Piera Martell | “Mein Ruf nach dir” | German | 3 | 14 |
16 | Portugal | Paulo de Carvalho | “E depois do adeus” | Portuguese | 3 | 14 |
Recently I thought it might be useful to have a package that generated random facts for people. This could be helpful for scripts or apps that take a long time to execute, where you could occasionally display random facts to keep people interested.
The Wikipedia Main Page has three predictable sections which can be reliably scraped. So I used them to create three functions:
- wiki_didyouknow() which takes random facts from the ‘Did you know…’ section
- wiki_onthisday() which takes random facts from the ‘On this day…’ section
- wiki_inthenews() which takes random facts from the ‘In the news…’ section
A fourth function wiki_randomfact() executes one of the above three functions at random.
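The idea of running one of several functions at random can be sketched in a few lines. This is not the wikifacts implementation: the placeholder functions below return fixed strings instead of scraping Wikipedia, and pick_random_fact() is a hypothetical name.

```r
# Placeholder fact sources (hypothetical stand-ins that do no scraping)
fact_sources <- list(
  didyouknow = function() "A placeholder 'Did you know' fact",
  onthisday  = function() "A placeholder 'On this day' fact",
  inthenews  = function() "A placeholder 'In the news' fact"
)

# Choose one source name at random and call the matching function
pick_random_fact <- function(sources) {
  chosen <- sample(names(sources), size = 1)
  sources[[chosen]]()
}

set.seed(42)  # only to make the random choice repeatable
fact <- pick_random_fact(fact_sources)
fact
```

Storing the functions in a named list keeps the dispatcher unchanged if further fact sources are added later.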
I bundled these functions into a package called wikifacts which can be installed from GitHub. Here are some examples of the functions at work:
library(devtools)
devtools::install_github("keithmcnulty/wikifacts")
library(wikifacts)
wiki_didyouknow()
wiki_onthisday()
## [1] "Did you know that Pulitzer Prize-winning sportswriter Arthur Daley wrote more than 10,000 columns for The New York Times? (Courtesy of Wikipedia)"
## [1] "Did you know that on February 2 in 1438 – Nine leaders of the Transylvanian peasant revolt were executed at Torda. (Courtesy of Wikipedia)"
git2r::repository()
## Local: master /home/rstudio/rstudio_projects/scraping
## Remote: master @ origin (https://github.com/keithmcnulty/scraping.git)
## Head: [ea28818] 2022-09-02: Update
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] wikifacts_0.4.2 xml2_1.3.2 dplyr_1.0.8 rvest_1.0.2
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.13 knitr_1.39 magrittr_2.0.1 tidyselect_1.1.1
## [5] R6_2.5.1 rlang_1.0.3 fastmap_1.1.0 fansi_0.4.2
## [9] highr_0.8 stringr_1.4.0 httr_1.4.3 tools_4.2.0
## [13] xfun_0.31 utf8_1.2.1 DBI_1.1.0 cli_3.3.0
## [17] git2r_0.27.1 selectr_0.4-2 htmltools_0.5.2 ellipsis_0.3.2
## [21] assertthat_0.2.1 yaml_2.2.1 digest_0.6.29 tibble_3.1.1
## [25] lifecycle_1.0.1 crayon_1.4.1 purrr_0.3.4 vctrs_0.3.8
## [29] curl_4.3.2 glue_1.6.2 evaluate_0.15 rmarkdown_2.14
## [33] stringi_1.7.6 compiler_4.2.0 pillar_1.6.0 generics_0.1.3
## [37] pkgconfig_2.0.3