```{r, include = FALSE}
current_file <- knitr::current_input()
basename <- gsub(".Rmd$", "", current_file)
knitr::opts_chunk$set(
fig.path = sprintf("images/%s/", basename),
fig.width = 6,
fig.height = 4,
out.width = "100%",
fig.align = "center",
dev.args = list(bg = 'transparent'),
fig.retina = 3,
echo = FALSE,
warning = FALSE,
message = FALSE,
cache = FALSE,
cache.path = "cache/"
)
```
```{r setup, include = FALSE}
library(rvest)
library(tidyverse)
library(plotly)
```
```{r wrap-hook, echo = FALSE}
library(knitr)
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
# this hook is used only when the linewidth option is not NULL
if (!is.null(n <- options$linewidth)) {
x = knitr:::split_lines(x)
# any lines wider than n should be wrapped
if (any(nchar(x) > n)) x = strwrap(x, width = n)
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```
```{r titleslide, child="assets/titleslide.Rmd"}
```
---
## Motivation
---
## After my PhD - I wanted to read more!
---
## Question
.idea-box.tl.w-70[
How should I choose what to read?
Am I reading diversely?
]
---
## Picking from lists
---
## Questions for our data
.info-box.tl.w-70[
Are lists like this biased?
]
---
class: center middle bg-gray
.aim-box.tl.w-70[
Today you will:
- Get to know the basics of webpages
- Look at some examples of webscraping
- Get some data to answer my questions
]
--
.aim-box.tl.w-70[
Coding Perspective:
- Learn how to read data from a webpage into R
- Do more string manipulation
- Learn about automating a scraper
]
---
## Packages we need
`rvest` is the [R package](https://rvest.tidyverse.org/articles/rvest.html) we'll need to get started with the 101 of web scraping.
```{r eval = FALSE, echo = TRUE}
library(rvest)
library(tidyverse)
```
* Good for static webpages
---
class: transition
# Webpage basics
---
## Go to a webpage
https://en.wikipedia.org/wiki/Tim_Winton
---
## View html code in Chrome
* Right click the part of the page you want
* Select Inspect
---
## Html code
* Brings up the html code
* Highlights the piece of html code related to your click
* Hover over html code to see other features of the web page
---
## Inspect button
* Similarly, click the top left button in the side panel
* Explore related features of the webpage and html code
---
## Basic html types
By browsing you observe the basic **structure** of html webpages:
opening and closing [tags](https://www.w3schools.com/tags/) wrapped around content to define its purpose and appearance on the webpage,
e.g. `<tag> lorem ipsum text </tag>`
Some basic **tag types** are:
* `div` - division or section
* `table` - table
* `p` - paragraph
* `h1`-`h6` - headings
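To see how rvest reads these tags, here's a minimal sketch using `rvest::minimal_html()` on a made-up snippet:
```{r eval = FALSE, echo = TRUE}
mini_page <- rvest::minimal_html('
  <div>
    <h1>My heading</h1>
    <p>A paragraph of text.</p>
  </div>') # made-up html snippet
mini_page |> rvest::html_node("p") |> rvest::html_text()
#> [1] "A paragraph of text."
```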
---
## Read a webpage
```{r echo = TRUE}
library(rvest)
author_url <- "https://en.wikipedia.org/wiki/Tim_Winton"
wiki_data <- read_html(author_url) # Read the webpage into R
str(wiki_data)
wiki_data
```
---
## How to scrape a table - html_table()
So we can read data from the website into R, but we need the data in a form we can use.
```{r echo = TRUE, eval = TRUE}
table_data <- wiki_data |>
rvest::html_table(header = FALSE) #Get all tables on the webpage
length(table_data)
table_data[[1]]
```
---
## Other approaches - html_nodes()
```{r, echo = TRUE, eval = TRUE}
table_data_eg1 <- wiki_data |>
rvest::html_nodes("table") |> # get all the nodes of type table
purrr::pluck(1) |> #pull out the first one
rvest::html_table(header = FALSE) #convert it to table type
table_data_eg1
```
---
## Other approaches - html_node()
Lots of functions in `rvest` give you the option to return the first match or to return all matches.
```{r echo = TRUE, eval = TRUE}
table_data_eg2 <- wiki_data |>
rvest::html_node("table") |> # just get the first table match
rvest::html_table(header = FALSE) #convert it to table type
table_data_eg2
```
---
## Get the Nationality
One more step - we need to get the nationality from the table
```{r echo = TRUE, eval = TRUE}
author_nationality = table_data_eg2 |>
dplyr::rename(Category = X1, Response = X2) |>
dplyr::filter(Category == "Nationality") |>
dplyr::select(Response) |>
as.character()
author_nationality
```
--
.idea-box.tl.w-50[
Now the real challenge: **Can we generalise?**
]
---
# Breakout Session
.aim-box.tl.w-70[
Try it yourself time:
- Pick an author and find their Wikipedia page
- Explore the structure of the webpage
- Download their infocard into R
- Can you get their nationality from the infocard? (starter sketch below if you get stuck)
]
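If you get stuck, here's a minimal starter sketch (the author URL is just an example - swap in your own):
```{r eval = FALSE, echo = TRUE}
my_url <- "https://en.wikipedia.org/wiki/Ursula_K._Le_Guin" # example author
my_wiki <- read_html(my_url)
my_wiki |>
  rvest::html_node("table") |> # the infocard is often the first table
  rvest::html_table(header = FALSE)
```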
---
## Let's try a different author
"https://en.wikipedia.org/wiki/Jane_Austen"
---
## Generalise the web page
```{r eval = TRUE, echo = TRUE}
author_first_name = "Jane"
author_last_name = "Austen"
author_url <- paste("https://en.wikipedia.org/wiki/",
author_first_name, "_", author_last_name, sep = "")
wiki_data <- read_html(author_url)
```
---
## Let's get that table
```{r eval = TRUE, echo = TRUE}
table_data <- wiki_data |>
rvest::html_nodes(".infobox.vcard") |> #search for a class
rvest::html_table(header = FALSE) |>
purrr::pluck(1)
head(table_data)
```
---
## Web scraping is tricky
```{r, echo = TRUE, eval = TRUE}
table_data |> dplyr::select(1) |> unlist() |> as.vector()
```
.aim-box.tl.w-80[
**No nationality** category in Jane Austen's infocard
]
--
.idea-box.tl.w-80[
Her nationality is in the webpage text.
Let's scrape the nationality from there.
]
---
## Try another way
```{r echo = TRUE, eval = TRUE}
para_data <- wiki_data |>
rvest::html_nodes("p") # get all the paragraphs
head(para_data)
```
.aim-box.tl.w-80[
But where exactly do we find her nationality in all this text?
Let's go back to exploring the webpage.
]
---
## Get the text - html_text()
```{r echo = TRUE, eval = TRUE}
text_data <- para_data |>
purrr::pluck(2) |> # get the second paragraph
rvest::html_text() # convert the paragraph to text
head(text_data)
```
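An aside: rvest (>= 1.0) also has `html_text2()`, which handles whitespace more like a browser does - a quick sketch:
```{r eval = FALSE, echo = TRUE}
para_data |>
  purrr::pluck(2) |>
  rvest::html_text2() # like html_text(), but tidies whitespace
```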
--
.aim-box.tl.w-80[
Let's look at two other ways we could do this.
]
---
## Xpath Example
* Right click the html code, then Copy, then Copy XPath
---
## Using an Xpath
```{r, eval = TRUE, echo = TRUE}
para_xpath = '//*[@id="mw-content-text"]/div/p[2]'
text_data <- wiki_data |>
rvest::html_nodes(xpath = para_xpath) |>
rvest::html_text()
text_data
```
---
## JSpath Example
* Right click the html code, then Copy, then Copy JS path
---
## Using a CSS selector
```{r eval = TRUE, echo = TRUE}
para_css = "#mw-content-text > div > p:nth-child(5)"
text_data <- wiki_data |>
rvest::html_nodes(css = para_css) |>
rvest::html_text()
text_data
```
---
## Text Analysis
We still need to get her nationality; let's use `str_count`
```{r echo = TRUE, eval = TRUE}
possible_nationalities <- c("Australian", "Chinese", "Mexican", "English", "Ethiopian")
# Do any of these nationalities appear in the text?
count_values = str_count(text_data, possible_nationalities)
count_values > 0 # Which ones were matched?
possible_nationalities[count_values > 0] # Get the matching nationalities
```
.aim-box.tl.w-80[
- What do you think of my solution?
- Any guesses why I didn't use `str_match`?
]
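For comparison, a sketch with `str_detect`, which returns TRUE/FALSE directly:
```{r eval = FALSE, echo = TRUE}
# str_detect gives logicals directly, no counts needed
matched <- str_detect(text_data, possible_nationalities)
possible_nationalities[matched]
# str_match(text_data, possible_nationalities) would instead return the
# matched substring (or NA) for each pattern - more than we need here
```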
---
.aim-box.tl.w-90[
## Learnt so far
- Know how to explore a web page with Inspect
- Know some basics about how to get data

Also know:
- It can be hard to generalise
- Formats aren't always standard
]
---
class: transition
# Back to the original question
---
## Need to get our list
---
## Read the book list from a website
```{r echo = TRUE, eval = TRUE}
book_list_url <- "https://mizparker.wordpress.com/the-lists/1001-books-to-read-before-you-die/"
paragraph_data <- read_html(book_list_url) |> # read the web page
rvest::html_nodes("p") # get the paragraphs
paragraph_data[1:12]
```
---
## Get the book list from the paragraphs
This list is in pieces, but the format seems mostly consistent
```{r echo = TRUE, eval = TRUE}
book_string <- paragraph_data |>
purrr::pluck(4) |> # get the first part of the book list
html_text(trim = TRUE) |> # convert it to text, remove excess white space
str_replace_all("\n", "")
head(book_string)
```
---
## More string manipulations
Web scraping often means string handling.
We want to split the string at any number followed by a full stop.
Careful:
* we don't want to split book titles with numbers, like Catch 22
* we don't want to split author names with full stops, like J.R.R. Tolkien
It's actually a bit tricky!
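Here's a made-up sketch of the pitfall:
```{r eval = FALSE, echo = TRUE}
tricky <- "21. Catch 22 - Joseph Heller 22. Emma - Jane Austen" # made-up string
str_split(tricky, "[[:digit:]]+")      # splits inside "Catch 22" too
str_split(tricky, "\\.")               # would also split "J.R.R. Tolkien"
str_split(tricky, "[[:digit:]]+?\\.")  # digits followed by a full stop - safer
```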
--
.aim-box.tl.w-90[
### Resources:
* Lecture 4
* stringr cheatsheet from RStudio
* generative AI is also a great resource
]
---
## Do some string handling
```{r, echo = TRUE, eval = TRUE}
eg_string = "9. book - author 10. book - author"
str_view(eg_string, "[:digit:]") #Match by any digit
str_view(eg_string, "[:digit:]+") #Match by one or more digits
str_view(eg_string, "\\.") #Match by fullstop
str_view(eg_string, "[[:digit:]]+?\\.") #Match digits followed by a fullstop
```
---
## Get the list as a dataframe
```{r echo = TRUE, eval = TRUE}
books_df <- book_string |>
str_split("[[:digit:]]+?\\.") |>
as.data.frame(stringsAsFactors = FALSE)
names(books_df) = "Book_Author"
books_df = books_df |>
dplyr::filter(Book_Author != "") |> # remove any empty rows
tidyr::separate(Book_Author, sep = "\\–", into = c("book", "author")) # splits into two columns
```
---
## Result
```{r}
head(books_df)
```
.aim-box.tl.w-70[
#### We are very lucky!
- Easily split our author and book into columns
- Thanks to whoever coded this webpage using a long dash (–)!
]
---
## Need to repeat it: so wrap the code in a function
```{r echo = TRUE, eval = TRUE}
Convert_book_string_to_df <- function(para_ind, paragraph_data){
  book_string <- paragraph_data |>
    purrr::pluck(para_ind) |> # get this part of the book list
    html_text(trim = TRUE) |> # convert to text, remove excess white space
    str_replace_all("\n", "")
  books_df <- book_string |>
    str_split("[[:digit:]]+?\\.") |>
    as.data.frame(stringsAsFactors = FALSE)
  names(books_df) = "Book_Author"
  books_df = books_df |>
    dplyr::filter(Book_Author != "") |> # remove any empty rows
    tidyr::separate(Book_Author, sep = "\\–", into = c("book", "author"))
  return(books_df)
}
```
---
## Get the final list
We need to run this function for every second paragraph, from paragraph 4 through to paragraph 12.
```{r echo = TRUE, eval = TRUE}
book_data <- lapply(seq(4,12,2) %>% as.list(),
Convert_book_string_to_df, paragraph_data) %>%
do.call(rbind, .) %>%
dplyr::mutate(author = str_trim(author))
nrow(book_data) # Has 1001 rows
```
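As an aside, the same pipeline can be sketched in purrr style (same output, assuming the function above):
```{r eval = FALSE, echo = TRUE}
# purrr equivalent of the lapply + do.call(rbind, .) pattern
book_data <- purrr::map_dfr(seq(4, 12, 2), Convert_book_string_to_df,
                            paragraph_data) |>
  dplyr::mutate(author = str_trim(author))
```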
---
## Check what it looks like
Randomly pick 10 rows
```{r}
book_data |> sample_n(size = 10)
```
--
.aim-box.tl.w-50[
# Still not done
Need the nationalities of all the authors!
]
---
# Pseudo code
Want a function to:
1. Read the wiki webpage using the author name.
2. Read the infocard.
3. Get the nationality from the infocard.
If there is no nationality or no infocard, want a function to:
4. Find which html paragraphs have text in them.
5. Guess the nationality from the text.
Finally:
6. One function that brings all of the above together.
---
## More wrapping of code chunks
Want a function to read the wiki webpage using the author name
```{r, eval = TRUE, echo = TRUE}
Read_wiki_page <- function(author_name){
author_name_no_space = str_replace_all(author_name, "\\s+", "_")
wiki_url <- paste("https://en.wikipedia.org/wiki/",
author_name_no_space, sep = "")
wiki_data <- read_html(wiki_url)
return(wiki_data)
}
```
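A quick sanity check of the helper (not run; the author names are just examples):
```{r eval = FALSE, echo = TRUE}
Read_wiki_page("Tim Winton")        # fetches .../wiki/Tim_Winton
Read_wiki_page("Ursula K. Le Guin") # spaces become underscores in the URL
```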
---
## More wrapping of code chunks
Want a function to read the info card
```{r, eval = TRUE, echo = TRUE}
Get_wiki_infocard <- function(wiki_data){
infocard <- wiki_data |>
rvest::html_nodes(".infobox.vcard") |>
rvest::html_table(header = FALSE) |>
purrr::pluck(1)
return(infocard)
}
```
---
## More wrapping of code chunks
Want a function to get the nationality from the infocard
```{r, eval = TRUE, echo = TRUE}
Get_nationality_from_infocard <- function(infocard){
nationality <- infocard %>%
dplyr::rename(Category = X1, Response = X2) %>%
dplyr::filter(Category == "Nationality") %>%
dplyr::select(Response) %>%
as.character()
return(nationality)
}
```
---
## More wrapping of code chunks
Need a function to find which html paragraphs have text in them
```{r, eval = TRUE, echo = TRUE}
Get_first_text <- function(wiki_data){
  paragraph_data <- wiki_data %>%
    rvest::html_nodes("p") # get all the paragraphs
  i = 1
  no_text = TRUE
  # step through the paragraphs until one contains text
  # (assumes at least one paragraph is non-empty)
  while(no_text){
    text_data <- paragraph_data %>%
      purrr::pluck(i) %>%
      rvest::html_text()
    check_text = gsub("\\s+", "", text_data) # strip all whitespace
    if(check_text == ""){
      i = i + 1
    }else{
      no_text = FALSE
    }
  }
  return(text_data)
}
```
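A quick check of the text helper (not run), reusing `Read_wiki_page` from before:
```{r eval = FALSE, echo = TRUE}
# First non-empty paragraph of an author page, truncated for display
Get_first_text(Read_wiki_page("Jane Austen")) |> str_sub(1, 80)
```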
---
## More wrapping of code chunks
Need another function to get the nationality from the text
```{r, eval = TRUE, echo = TRUE}
Guess_nationality_from_text <- function(text_data, possible_nationalities){
num_matches <- str_count(text_data, possible_nationalities)
prob_matches <- num_matches/sum(num_matches)
i = which(prob_matches > 0)
if(length(i) == 1){
prob_nationality = possible_nationalities[i]
}else if(length(i) > 0){
warning(paste(c("More than one match for the nationality:",
possible_nationalities[i], "\n"), collapse = " "))
match_locations = str_locate(text_data, possible_nationalities[i]) #gives locations of matches
j = i[which.min(match_locations[,1])]
prob_nationality = possible_nationalities[j]
}else{
return(NA)
}
return(prob_nationality)
}
```
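And a made-up example of the ambiguous case this function guards against:
```{r eval = FALSE, echo = TRUE}
# Hypothetical sentence with two candidate matches:
# the warning fires and the earlier mention wins
Guess_nationality_from_text(
  "An English novelist, widely read by Australian audiences.",
  c("English", "British", "Australian"))
#> [1] "English"
```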
---
## More wrapping of code chunks
One function that brings that all together
```{r, eval = TRUE, echo = TRUE}
Query_nationality_from_wiki <- function(author_name, possible_nationalities){
wiki_data <- Read_wiki_page(author_name)
infocard <- Get_wiki_infocard(wiki_data)
  if(is.null(infocard)){
    # no infocard on this page - fall back to the text
    first_paragraph <- Get_first_text(wiki_data)
nationality <- Guess_nationality_from_text(first_paragraph,
possible_nationalities)
return(nationality)
}
if(any(infocard[,1] == "Nationality")){
# info card exists and has nationality
nationality <- Get_nationality_from_infocard(infocard)
}else{
# find nationality in text
first_paragraph <- Get_first_text(wiki_data)
nationality <- Guess_nationality_from_text(first_paragraph,
possible_nationalities)
}
return(nationality)
}
```
---
## Examples
```{r, eval = TRUE, echo = TRUE}
Query_nationality_from_wiki("Tim Winton", c("English", "British", "Australian"))
Query_nationality_from_wiki("Jane Austen" , c("English", "British", "Australian"))
Query_nationality_from_wiki("Zadie Smith", c("English", "British", "Australian"))
```
--
.aim-box.tl.w-90[
## Still not done
We need a list of nationalities to search for in our text
]
---
## What nationalities to search for?
```{r echo = TRUE, eval = TRUE}
# Get table of nationalities
url <- "http://www.vocabulary.cl/Basic/Nationalities.htm"
xpath <- "/html/body/div[1]/article/table[2]"
nationalities_df <- url %>%
read_html() %>%
html_nodes(xpath = xpath) %>%
html_table() %>%
as.data.frame()
possible_nationalities = nationalities_df[,2]
head(possible_nationalities)
```
---
## Manual fixing
```{r echo = TRUE, eval = TRUE}
# Two nationalities were fused into one table cell on the website
fix_entry = "ArgentineArgentinian"
i0 = which(nationalities_df == fix_entry, arr.ind = TRUE)
new_row = nationalities_df[i0[1], ]
nationalities_df[i0] = "Argentine"
new_row[,2] = "Argentinian"
nationalities_df = rbind(nationalities_df, new_row)
# Strip the footnote markers that came along with the table
fix_footnote1 = "Colombia *"
i1 = which(nationalities_df == fix_footnote1, arr.ind = TRUE)
nationalities_df[i1] = strsplit(fix_footnote1, split = ' ')[[1]][1]
fix_footnote2 = "American **"
i2 = which(nationalities_df == fix_footnote2, arr.ind = TRUE)
nationalities_df[i2] = strsplit(fix_footnote2, split = ' ')[[1]][1]
possible_nationalities = nationalities_df[,2]
saveRDS(possible_nationalities, "data/possible_nationalities.rds")
```
---
## Get Nationalities
```{r, echo = TRUE, eval = TRUE}
nationality_from_author = sapply(book_data$author[1:20],
  function(author_name){
    nationality = tryCatch( # Just in case!
      Query_nationality_from_wiki(author_name,
        possible_nationalities),
      error = function(e) NA)
  }) %>% unlist()
nationality_from_author
```
---
class: transition
# How diversely do we read?
---
## Run it!
```{r, eval = FALSE, echo = TRUE}
nationality_from_author = sapply(book_data$author |> unique(),
function(author_name){
print(author_name)
nationality = tryCatch( # Just in case!
Query_nationality_from_wiki(author_name,
possible_nationalities),
error = function(e) NA)
})
author_nationality_df <- data.frame(
author = book_data$author |> unique(),
nationality = nationality_from_author)
book_data_with_nationality <- book_data |>
dplyr::left_join(author_nationality_df)
head(book_data_with_nationality)
saveRDS(book_data_with_nationality, "data/book_data.rds")
```
---
## Result
```{r}
book_data_with_nationality = readRDS("data/book_data.rds")
table_nationalities <- book_data_with_nationality |>
dplyr::select(author, nationality) |>
dplyr::distinct() |>
group_by(nationality) |>
summarise(count = n()) |>
ungroup() |>
arrange(desc(count))
head(table_nationalities)
```
---
## Let's take a look
```{r echo = TRUE, eval = TRUE, message = FALSE, warning = FALSE}
library(plotly)
pie_plot <- table_nationalities %>%
plot_ly(labels = ~nationality, values = ~count) %>%
add_pie(hole = 0.6) %>%
layout(title = "Nationalities", showlegend = F,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
```
---
## Plotting result
```{r pie, echo = TRUE, eval = TRUE}
pie_plot
```
---
## A few thoughts
.aim-box.tl.w-70[
- There are many ways to approach this problem
- The approach here was to use the standard rvest toolbox
- Not perfect - the nationality strings needed a lot of cleaning
- And a bit of guessing of nationalities
]
---
## What else could we have done?
- We could scrape more data from the Goodreads website
- Goodreads has an API
- Check out the famguy/rgoodreads repository to get started
- Using this API makes querying things like year or gender more straightforward
- But Goodreads has no nationality field, so this solution is still useful!
---
## What else for web scraping?
* There are easier ways to answer this same question
* Namely, RSelenium for pages with javascript
* Learning the hard way can be good sometimes though!
--
Before we start scraping, we should stop and ask whether we should.
* Check the terms and conditions and terms of use
* Look for a data licence
* Consider ethics: am I violating data privacy?
* Be considerate of the volume of queries and any rate limits
* Look at the website's robots.txt file - a quick sketch below
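A minimal way to peek at a robots.txt file from R (a sketch; the `polite` package, if installed, wraps this kind of etiquette via `polite::bow()`):
```{r eval = FALSE, echo = TRUE}
# First few lines of Wikipedia's robots.txt
readLines("https://en.wikipedia.org/robots.txt", n = 10)
```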
---
class: center middle bg-gray
.aim-box.tl.w-70[
# Summary
- We've learnt the basics of web scraping
- Know how to scrape from a static website in R
- Worked towards automating our web scraping
- And we found out there are many challenges!
]
---
class: transition
## Slides developed by Dr Kate Saunders
---
```{r endslide, child="assets/endslide.Rmd"}
```