class: middle center hide-slide-number monash-bg-gray80 .info-box.w-50.bg-white[ These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-10.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>. ] <br> .white[Press the **right arrow** to progress to the next slide!] --- class: title-slide count: false background-image: url("images/bg-12.png") # .monash-blue[ETC5512: Wild Caught Data] <h1 class="monash-blue" style="font-size: 30pt!important;"></h1> <br> <h2 style="font-weight:900!important;">Introduction to web scraping</h2> .bottom_abs.width100[ Lecturer: *Kate Saunders* Department of Econometrics and Business Statistics <i class="fas fa-envelope"></i> ETC5512.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 12 <br> ] --- ## Motivation <center> <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The most important thing I've ever done for learning R is to paradoxically stop "learning" (e.g. classes and problem sets) and start doing. Take a problem you have at work or school or a dataset you find interesting and get to work. Then write it up and post on githib or a blog. <a href="https://t.co/9aQZJRrlFK">https://t.co/9aQZJRrlFK</a></p>— We are R-Ladies (@WeAreRLadies) <a href="https://twitter.com/WeAreRLadies/status/1110229736956067840?ref_src=twsrc%5Etfw">March 25, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> --- ## After PhD - I wanted to read more! <br><br><br><center> <iframe src="https://giphy.com/embed/WoWm8YzFQJg5i" width="480" height="351" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/cartoons-comics-sea-reading-WoWm8YzFQJg5i"></a></p> </center> --- ## Question <br><br><br><center> .idea-box.tl.w-70[ How should I choose what to read? Am I reading diversely? 
] </center> --- ## Picking from lists <center> <img src="images/lecture-10/Books1001.jpg" width = "50%"> </center> --- ## Questions for our data <br><br><br><center> .info-box.tl.w-70[ Are lists like this biased? ] </center> --- class: center middle bg-gray .aim-box.tl.w-70[ Today you will: - Get to know the basics of webpages - Look at some examples of webscraping - Get some data to answer my questions ] -- .aim-box.tl.w-70[ Coding Perspective: - Learn how to read data from a webpage into R - Do more string manipulation - Learn about automating a scraper ] --- ## Packages we need `rvest` is the [R package](https://rvest.tidyverse.org/articles/rvest.html) that we'll need to get started with learning the 101 of web scraping ```r library(rvest) library(tidyverse) ``` * Good for static webpages --- class: transition # Webpage basics --- ## Go to a webpage https://en.wikipedia.org/wiki/Tim_Winton <center> <img src="images/lecture-10/wiki_TimWinton.png" width = "100%"> </center> --- ## View html code in Chrome * Right click the part of the page you want * Select inspect <center> <img src="images/lecture-10/inspect.png" width = "100%"> </center> --- ## Html code * Brings up the html code * Highlights the piece of html code related to your click * Hover over html code to see other features of the web page <center> <img src="images/lecture-10/inspect_panel.png" width = "90%"> </center> --- ## Inspect button * Similarly, click the top left button in the side panel * Explore related features of the webpage and html code <center> <img src="images/lecture-10/inspect_button.png" width = "50%"> </center> --- ## Basic html types By browsing you observe the basic **structure** of html webpages <br> Opening and closing [tags](https://www.w3schools.com/tags/) wrapped around content to define its purpose and appearance on a webpage. e.g.
< tag \> lorem ipsum text < /tag \> <br> Some basic **tag types** are: div - Division or section table - Table p - Paragraph elements h - Heading --- ## Read a webpage ```r library(rvest) author_url <- "https://en.wikipedia.org/wiki/Tim_Winton" wiki_data <- read_html(author_url) # Read the webpage into R str(wiki_data) ``` ``` ## List of 2 ## $ node:<externalptr> ## $ doc :<externalptr> ## - attr(*, "class")= chr [1:2] "xml_document" "xml_node" ``` ```r wiki_data ``` ``` ## {html_document} ## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ... ## [2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ... ``` --- ## How to scrape a table - html_table() So we can read data from the website into R, but we need the data in a form we can use.
```r table_data <- wiki_data |> rvest::html_table(header = FALSE) #Get all tables on the webpage length(table_data) ``` ``` ## [1] 5 ``` ```r table_data[[1]] ``` ``` ## # A tibble: 9 × 2 ## X1 X2 ## <chr> <chr> ## 1 Tim WintonAO Tim WintonAO ## 2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath… ## 3 Born Timothy John Winton4 August 19… ## 4 Occupation Novelist ## 5 Nationality Australian ## 6 Period 1982–present ## 7 Genre Literature, children's, non-fi… ## 8 Notable works Cloudstreet Dirt Music Breath … ## 9 Notable awards Miles Franklin 1984, 1992, 20… ``` --- ## Other approaches - html_nodes() ```r table_data_eg1 <- wiki_data |> rvest::html_nodes("table") |> # get all the nodes of type table purrr::pluck(1) |> #pull out the first one rvest::html_table(header = FALSE) #convert it to table type table_data_eg1 ``` ``` ## # A tibble: 9 × 2 ## X1 X2 ## <chr> <chr> ## 1 Tim WintonAO Tim WintonAO ## 2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath… ## 3 Born Timothy John Winton4 August 19… ## 4 Occupation Novelist ## 5 Nationality Australian ## 6 Period 1982–present ## 7 Genre Literature, children's, non-fi… ## 8 Notable works Cloudstreet Dirt Music Breath … ## 9 Notable awards Miles Franklin 1984, 1992, 20… ``` --- ## Other approaches - html_node() Lots of functions in `rvest` give you the option to return the first match or to return all matches. 
```r table_data_eg2 <- wiki_data |> rvest::html_node("table") |> # just get the first table match rvest::html_table(header = FALSE) #convert it to table type table_data_eg2 ``` ``` ## # A tibble: 9 × 2 ## X1 X2 ## <chr> <chr> ## 1 Tim WintonAO Tim WintonAO ## 2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath… ## 3 Born Timothy John Winton4 August 19… ## 4 Occupation Novelist ## 5 Nationality Australian ## 6 Period 1982–present ## 7 Genre Literature, children's, non-fi… ## 8 Notable works Cloudstreet Dirt Music Breath … ## 9 Notable awards Miles Franklin 1984, 1992, 20… ``` --- ## Get the Nationality One more step - Need to get the nationality from the table ```r author_nationality = table_data_eg2 |> dplyr::rename(Category = X1, Response = X2) |> dplyr::filter(Category == "Nationality") |> dplyr::select(Response) |> as.character() author_nationality ``` ``` ## [1] "Australian" ``` -- <center> .idea-box.tl.w-50[ Now the real challenge: **Can we generalise?** ] --- # Breakout Session <center> .aim-box.tl.w-70[ Try it yourself time: - Pick an author and find their Wikipedia page - Explore the structure of the webpage - Download their infocard into R - Can you get their nationality from the infocard? ] </center> --- ## Let's try a different author "https://en.wikipedia.org/wiki/Jane_Austen" <center> <img src="images/lecture-10/wiki_JaneAusten.png" width = "100%"> </center> --- ## Generalise the web page ```r author_first_name = "Jane" author_last_name = "Austen" author_url <- paste("https://en.wikipedia.org/wiki/", author_first_name, "_", author_last_name, sep = "") wiki_data <- read_html(author_url) ``` --- ## Let's get that table ```r table_data <- wiki_data |> rvest::html_nodes(".infobox.vcard") |> #search for a class rvest::html_table(header = FALSE) |> purrr::pluck(1) head(table_data) ``` ``` ## # A tibble: 6 × 2 ## X1 X2 ## <chr> <chr> ## 1 Jane Austen Jane Austen ## 2 Portrait, c. 1810[a] Portrait, c.
1810[a] ## 3 Born (1775-12-16)16 December 1775Steventon Rectory, Hampshire… ## 4 Died 18 July 1817(1817-07-18) (aged 41)Winchester, Hampshire,… ## 5 Resting place Winchester Cathedral, Hampshire ## 6 Period 1787–1817 ``` --- ## Web scraping is tricky ```r table_data |> dplyr::select(1) |> unlist() |> as.vector() ``` ``` ## [1] "Jane Austen" "Portrait, c. 1810[a]" "Born" ## [4] "Died" "Resting place" "Period" ## [7] "Relatives" "Signature" "" ``` <center> .aim-box.tl.w-80[ **No nationality** category in Jane Austen's infocard ] -- .idea-box.tl.w-80[ Her nationality is in the webpage text. Let's scrape the nationality from there. ] --- ## Try another way ```r para_data <- wiki_data |> rvest::html_nodes("p") # get all the paragraphs head(para_data) ``` ``` ## {xml_nodeset (6)} ## [1] <p class="mw-empty-elt">\n\n\n\n</p> ## [2] <p><b>Jane Austen</b> (<span class="rt-commentedText nowrap"><span class= ... ## [3] <p>The anonymously published <i><a href="/wiki/Sense_and_Sensibility" tit ... ## [4] <p>Since her death Austen's novels have rarely been out of print. A signi ... ## [5] <p>The scant biographical information about Austen comes from her few sur ... ## [6] <p>The first Austen biography was <a href="/wiki/Henry_Thomas_Austen" tit ... ``` <center> .aim-box.tl.w-80[ But where exactly do we find her nationality in all this text? <br> Let's go back to exploring the webpage. ] </center> --- ## Get the text - html_text() ```r text_data <- para_data |> purrr::pluck(2) |> # get the second paragraph rvest::html_text() # convert the paragraph to text head(text_data) ``` ``` ## [1] "Jane Austen (/ˈɒstɪn, ˈɔːstɪn/ OST-in, AW-stin; 16 December 1775 – 18 July 1817) was an English novelist known primarily for her six novels, which implicitly interpret, critique, and comment upon the British landed gentry at the end of the 18th century. Austen's plots often explore the dependence of women on marriage for the pursuit of favourable social standing and economic security. 
Her works are an implicit critique of the novels of sensibility of the second half of the 18th century and are part of the transition to 19th-century literary realism.[2][b] Her deft use of social commentary, realism and biting irony have earned her acclaim among critics and scholars.\n" ``` -- <center> .aim-box.tl.w-80[ Let's look at two other ways we could do this. ]</center> --- ## Xpath Example * Right click html code, copy, copy Xpath <center> <img src="images/lecture-10/xpath.png" width = "100%"> </center> --- ## Using an Xpath ```r para_xpath = '//*[@id="mw-content-text"]/div/p[2]' text_data <- wiki_data |> rvest::html_nodes(xpath = para_xpath) |> rvest::html_text() text_data ``` ``` ## [1] "Jane Austen (/ˈɒstɪn, ˈɔːstɪn/ OST-in, AW-stin; 16 December 1775 – 18 July 1817) was an English novelist known primarily for her six novels, which implicitly interpret, critique, and comment upon the British landed gentry at the end of the 18th century. Austen's plots often explore the dependence of women on marriage for the pursuit of favourable social standing and economic security. Her works are an implicit critique of the novels of sensibility of the second half of the 18th century and are part of the transition to 19th-century literary realism.[2][b] Her deft use of social commentary, realism and biting irony have earned her acclaim among critics and scholars.\n" ``` --- ## JSpath Example * Right click html code, copy, copy JS path <center> <img src="images/lecture-10/jspath.png" width = "100%"> </center> --- ## Using CSS ID ```r para_css = "#mw-content-text > div > p:nth-child(5)" text_data <- wiki_data |> rvest::html_nodes(css = para_css) |> rvest::html_text() text_data ``` ``` ## [1] "Jane Austen (/ˈɒstɪn, ˈɔːstɪn/ OST-in, AW-stin; 16 December 1775 – 18 July 1817) was an English novelist known primarily for her six novels, which implicitly interpret, critique, and comment upon the British landed gentry at the end of the 18th century. 
Austen's plots often explore the dependence of women on marriage for the pursuit of favourable social standing and economic security. Her works are an implicit critique of the novels of sensibility of the second half of the 18th century and are part of the transition to 19th-century literary realism.[2][b] Her deft use of social commentary, realism and biting irony have earned her acclaim among critics and scholars.\n" ``` --- ## Text Analysis Still need to get her nationality, use `str_count` ```r possible_nationalities <- c("Australian", "Chinese", "Mexican", "English", "Ethiopian") # Do any of these nationalities appear in the text? count_values = str_count(text_data, possible_nationalities) count_values == TRUE # Which ones were matched ``` ``` ## [1] FALSE FALSE FALSE TRUE FALSE ``` ```r possible_nationalities[count_values == TRUE] #Get the matching nationalities ``` ``` ## [1] "English" ``` <center> .aim-box.tl.w-80[ - What do you think of my solution? - Any guesses why I didn't use `str_match`? ] </center> --- <br><br><center> .aim-box.tl.w-90[ ## Learnt so far - Know how to explore a web page with inspect - Know some basics about how to get data Also know: - Can be hard to generalise - Formats aren't always standard ]</center> --- class: transition # Back to the original question --- ## Need to get our list <center> <img src="images/lecture-10/book_list_url.png" width = "100%"> </center> --- ## Read the book list from a website ```r book_list_url <- "https://mizparker.wordpress.com/the-lists/1001-books-to-read-before-you-die/" paragraph_data <- read_html(book_list_url) |> # read the web page rvest::html_nodes("p") # get the paragraphs paragraph_data[1:12] ``` ``` ## {xml_nodeset (12)} ## [1] <p>This list has appeared in several places around the internet, and is ... ## [2] <p>If you would like to download a spreadsheet of the list and keep trac ... ## [3] <p><strong>21st Century:</strong></p> ## [4] <p>1. Never Let Me Go – Kazuo Ishiguro<br>\n2. 
Saturday – Ian McEwan<b ... ## [5] <p><strong>20th Century:</strong></p> ## [6] <p>70. Timbuktu – Paul Auster<br>\n71. The Romantics – Pankaj Mishra<br> ... ## [7] <p><strong>19th Century</strong></p> ## [8] <p>786. Some Experiences of an Irish R.M. – Somerville and Ross<br>\n787 ... ## [9] <p><strong>18th Century</strong></p> ## [10] <p>943. Hyperion – Friedrich Hölderlin<br>\n944. The Nun – Denis Diderot ... ## [11] <p><strong>Pre-1700</strong></p> ## [12] <p>989. Oroonoko – Aphra Behn<br>\n990. The Princess of Clèves – Marie-M ... ``` --- ## Get the book list from the paragraphs This list is in pieces, but the format seems mostly consistent ```r book_string <- paragraph_data |> purrr::pluck(4) |> # get the first part of the book list html_text(trim = TRUE) |> # convert it to text, remove excess white space str_replace_all("\n", "") head(book_string) ``` ``` ## [1] "1. Never Let Me Go – Kazuo Ishiguro2. Saturday – Ian McEwan3. On Beauty – Zadie Smith4. Slow Man – J.M. Coetzee5. Adjunct: An Undigest – Peter Manson6. The Sea – John Banville7. The Red Queen – Margaret Drabble8. The Plot Against America – Philip Roth9. The Master – Colm Toibin10. Vanishing Point – David Markson11. The Lambs of London – Peter Ackroyd12. Dining on Stones – Iain Sinclair13. Cloud Atlas – David Mitchell14. Drop City – T. Coraghessan Boyle15. The Colour – Rose Tremain16. Thursbitch – Alan Garner17. The Light of Day – Graham Swift18. What I Loved – Siri Hustvedt19. The Curious Incident of the Dog in the Night-Time – Mark Haddon20. Islands – Dan Sleigh21. Elizabeth Costello – J.M. Coetzee22. London Orbital – Iain Sinclair23. Family Matters – Rohinton Mistry24. Fingersmith – Sarah Waters25. The Double – Jose Saramago26. Everything is Illuminated – Jonathan Safran Foer27. Unless – Carol Shields28. Kafka on the Shore – Haruki Murakami29. The Story of Lucy Gault – William Trevor30. That They May Face the Rising Sun – John McGahern31. In the Forest – Edna O’Brien32. Shroud – John Banville33. 
Middlesex – Jeffrey Eugenides34. Youth – J.M. Coetzee35. Dead Air – Iain Banks36. Nowhere Man – Aleksandar Hemon37. The Book of Illusions – Paul Auster38. Gabriel’s Gift – Hanif Kureishi39. Austerlitz – W.G. Sebald40. Platform – Michael Houellebecq41. Schooling – Heather McGowan42. Atonement – Ian McEwan43. The Corrections – Jonathan Franzen44. Don’t Move – Margaret Mazzantini45. The Body Artist – Don DeLillo46. Fury – Salman Rushdie47. At Swim, Two Boys – Jamie O’Neill48. Choke – Chuck Palahniuk49. Life of Pi – Yann Martel50. The Feast of the Goat – Mario Vargos Llosa51. An Obedient Father – Akhil Sharma52. The Devil and Miss Prym – Paulo Coelho53. Spring Flowers, Spring Frost – Ismail Kadare54. White Teeth – Zadie Smith55. The Heart of Redness – Zakes Mda56. Under the Skin – Michel Faber57. Ignorance – Milan Kundera58. Nineteen Seventy Seven – David Peace59. Celestial Harmonies – Peter Esterhazy60. City of God – E.L. Doctorow61. How the Dead Live – Will Self62. The Human Stain – Philip Roth63. The Blind Assassin – Margaret Atwood64. After the Quake – Haruki Murakami65. Small Remedies – Shashi Deshpande66. Super-Cannes – J.G. Ballard67. House of Leaves – Mark Z. Danielewski68. Blonde – Joyce Carol Oates69. Pastoralia – George Saunders" ``` --- ## More string manipulations Web scraping often means string handling. We want to split the string by any numbers followed by a full stop. Careful: * don't want to split book titles with numbers, like Catch 22, * don't want to split authors with full stops, like J.R.R. Tolkien It's actually a bit tricky! -- <center> .aim-box.tl.w-90[ ### Resources: * Lecture 4 * stringr cheatsheet from RStudio * generative AI is also a great resource ] </center> --- ## Do some string handling ```r eg_string = "9. book - author 10. book - author" str_view(eg_string, "[:digit:]") #Match by any digit ``` ``` ## [1] │ <9>. book - author <1><0>. book - author ``` ```r str_view(eg_string, "[:digit:]+") #Match by one or more digits ``` ``` ## [1] │ <9>.
book - author <10>. book - author ``` ```r str_view(eg_string, "\\.") #Match by fullstop ``` ``` ## [1] │ 9<.> book - author 10<.> book - author ``` ```r str_view(eg_string, "[[:digit:]]+?\\.") #Match digits followed by a fullstop ``` ``` ## [1] │ <9.> book - author <10.> book - author ``` --- ## Get the list as a dataframe ```r books_df <- book_string |> str_split("[[:digit:]]+?\\.") |> as.data.frame(stringsAsFactors = FALSE) names(books_df) = "Book_Author" books_df = books_df |> dplyr::filter(Book_Author != "") |> # remove any empty rows tidyr::separate(Book_Author, sep = "\\–", into = c("book", "author")) # splits into two columns ``` --- ## Result ``` ## book author ## 1 Never Let Me Go Kazuo Ishiguro ## 2 Saturday Ian McEwan ## 3 On Beauty Zadie Smith ## 4 Slow Man J.M. Coetzee ## 5 Adjunct: An Undigest Peter Manson ## 6 The Sea John Banville ``` <center> .aim-box.tl.w-70[ #### We are very lucky! - Easily split our author and book into columns - Thanks to whoever coded this webpage using a long hash! ] </center> --- ## Need to repeat it: So wrap the code in a function ```r Convert_book_string_to_df <- function(para_ind, paragraph_data){ book_string <- paragraph_data |> purrr::pluck(para_ind) |> html_text(trim = TRUE) |> str_replace_all("\n", "") books_df <- book_string |> str_split("[[:digit:]]+?\\.") |> as.data.frame(stringsAsFactors = FALSE) names(books_df) = "Book_Author" books_df = books_df |> dplyr::filter(Book_Author != "") |> tidyr::separate(Book_Author, sep = "\\–", into = c("book", "author")) return(books_df)} ``` --- ## Get the final list Need to run this function for every second paragraph starting from number 4 until paragraph 12. ```r book_data <- lapply(seq(4,12,2) %>% as.list(), Convert_book_string_to_df, paragraph_data) %>% do.call(rbind, .)
%>% dplyr::mutate(author = str_trim(author)) nrow(book_data) # Has 1001 rows ``` ``` ## [1] 1001 ``` --- ## Check what it looks like Randomly pick 10 rows ``` ## book author ## 1 Le Père Goriot Honoré de Balzac ## 2 The Year of the Death of Ricardo Reis José Saramago ## 3 Aithiopika Heliodorus ## 4 Shroud John Banville ## 5 The Case of Comrade Tulayev Victor Serge ## 6 The 120 Days of Sodom Marquis de Sade ## 7 The Godfather Mario Puzo ## 8 The Drowned World J.G. Ballard ## 9 Go Down, Moses William Faulkner ## 10 Drop City T. Coraghessan Boyle ``` -- <center> .aim-box.tl.w-50[ # Still not done Need the nationalities of all the authors! ]</center> --- # Pseudo code Want a function to: 1. Read the wiki webpage using the author name. 2. Read the info card. 3. Get the nationality from the infocard. If no nationality or infocard, want a function to: 4. Find which html paragraphs have text in them. 5. Guess the nationality from the text. 6. A function that brings the above all together. --- ## More wrapping of code chunks Want a function to read the wiki webpage using the author name ```r Read_wiki_page <- function(author_name){ author_name_no_space = str_replace_all(author_name, "\\s+", "_") wiki_url <- paste("https://en.wikipedia.org/wiki/", author_name_no_space, sep = "") wiki_data <- read_html(wiki_url) return(wiki_data) } ``` --- ## More wrapping of code chunks Want a function to read the info card ```r Get_wiki_infocard <- function(wiki_data){ infocard <- wiki_data |> rvest::html_nodes(".infobox.vcard") |> rvest::html_table(header = FALSE) |> purrr::pluck(1) return(infocard) } ``` --- ## More wrapping of code chunks Want a function to get the nationality from the infocard ```r Get_nationality_from_infocard <- function(infocard){ nationality <- infocard %>% dplyr::rename(Category = X1, Response = X2) %>% dplyr::filter(Category == "Nationality") %>% dplyr::select(Response) %>% as.character() return(nationality) } ``` --- ## More wrapping of code chunks Need a function
to find which html paragraphs have text in them ```r Get_first_text <- function(wiki_data){ paragraph_data <- wiki_data %>% rvest::html_nodes("p") i = 1 no_text = TRUE while(no_text){ text_data <- paragraph_data %>% purrr::pluck(i) %>% rvest::html_text() check_text = gsub("\\s+", "", text_data) if(check_text == ""){ i = i + 1 }else{ no_text = FALSE } } return(text_data) } ``` --- ## More wrapping of code chunks Need another function to get the nationality from the text ```r Guess_nationality_from_text <- function(text_data, possible_nationalities){ num_matches <- str_count(text_data, possible_nationalities) prob_matches <- num_matches/sum(num_matches) i = which(prob_matches > 0) if(length(i) == 1){ prob_nationality = possible_nationalities[i] }else if(length(i) > 0){ warning(paste(c("More than one match for the nationality:", possible_nationalities[i], "\n"), collapse = " ")) match_locations = str_locate(text_data, possible_nationalities[i]) #gives locations of matches j = i[which.min(match_locations[,1])] prob_nationality = possible_nationalities[j] }else{ return(NA) } return(prob_nationality) } ``` --- ## More wrapping of code chunks One function that brings that all together ```r Query_nationality_from_wiki <- function(author_name, possible_nationalities){ wiki_data <- Read_wiki_page(author_name) infocard <- Get_wiki_infocard(wiki_data) if(is.null(infocard)){ # nationality <- "Missing infocard" first_paragraph <- Get_first_text(wiki_data) nationality <- Guess_nationality_from_text(first_paragraph, possible_nationalities) return(nationality) } if(any(infocard[,1] == "Nationality")){ # info card exists and has nationality nationality <- Get_nationality_from_infocard(infocard) }else{ # find nationality in text first_paragraph <- Get_first_text(wiki_data) nationality <- Guess_nationality_from_text(first_paragraph, possible_nationalities) } return(nationality) } ``` --- ## Examples ```r Query_nationality_from_wiki("Tim Winton", c("English", "British", "Australian")) 
``` ``` ## [1] "Australian" ``` ```r Query_nationality_from_wiki("Jane Austen", c("English", "British", "Australian")) ``` ``` ## [1] "English" ``` ```r Query_nationality_from_wiki("Zadie Smith", c("English", "British", "Australian")) ``` ``` ## [1] "English" ``` -- <center> .aim-box.tl.w-90[ ## Still not done We need a list of nationalities to search for in our text ]</center> --- ## What nationalities to search for? ```r # Get table of nationalities url <- "http://www.vocabulary.cl/Basic/Nationalities.htm" xpath <- "/html/body/div[1]/article/table[2]" nationalities_df <- url %>% read_html() %>% html_nodes(xpath = xpath) %>% html_table() %>% as.data.frame() possible_nationalities = nationalities_df[,2] head(possible_nationalities) ``` ``` ## [1] "Afghan" "Albanian" "Algerian" ## [4] "ArgentineArgentinian" "Australian" "Austrian" ``` --- ## Manual fixing ```r fix_entry = "ArgentineArgentinian" i0 = which(nationalities_df == fix_entry, arr.ind = TRUE) new_row = nationalities_df[i0[1], ] nationalities_df[i0] = "Argentine" new_row[,2] = "Argentinian" nationalities_df = rbind(nationalities_df, new_row) fix_footnote1 = "Colombia *" i1 = which(nationalities_df == fix_footnote1, arr.ind = TRUE) nationalities_df[i1] = strsplit(fix_footnote1, split = ' ')[[1]][1] fix_footnote2 = "American **" i2 = which(nationalities_df == fix_footnote2, arr.ind = TRUE) nationalities_df[i2] = strsplit(fix_footnote2, split = ' ')[[1]][1] possible_nationalities = nationalities_df[,2] saveRDS(possible_nationalities, "data/possible_nationalities.rds") ``` --- ## Get Nationalities ```r nationality_from_author = sapply(book_data$author[1:20], function(author_name){ nationality = tryCatch( # Just in case! Query_nationality_from_wiki(author_name, possible_nationalities), error = function(e) NA) }) %>% unlist() nationality_from_author ``` ``` ## Kazuo Ishiguro Ian McEwan ## "Japanese" "British" ## Zadie Smith J.M.
Coetzee ## "English" "South AfricanAustralian (since 2006)" ## Peter Manson John Banville ## "Scottish" "Irish" ## Margaret Drabble Philip Roth ## "English" "American" ## Colm Toibin David Markson ## "Irish" "American" ## Peter Ackroyd Iain Sinclair ## "British" "British" ## David Mitchell T. Coraghessan Boyle ## NA "American" ## Rose Tremain Alan Garner ## "English" "English" ## Graham Swift Siri Hustvedt ## "English" "American" ## Mark Haddon Dan Sleigh ## "English" "South African" ``` --- class: transition # How diversely do we read --- ## Run it! ```r nationality_from_author = sapply(book_data$author |> unique(), function(author_name){ print(author_name) nationality = tryCatch( # Just in case! Query_nationality_from_wiki(author_name, possible_nationalities), error = function(e) NA) }) author_nationality_df <- data.frame( author = book_data$author |> unique(), nationality = nationality_from_author) book_data_with_nationality <- book_data |> dplyr::left_join(author_nationality_df) head(book_data_with_nationality) saveRDS(book_data_with_nationality, "data/book_data.rds") ``` --- ## Result ``` ## # A tibble: 6 × 2 ## nationality count ## <chr> <int> ## 1 <NA> 106 ## 2 American 93 ## 3 English 87 ## 4 French 35 ## 5 British 31 ## 6 Irish 20 ``` --- ## Let's take a look ```r library(plotly) pie_plot <- table_nationalities %>% plot_ly(labels = ~nationality, values = ~count) %>% add_pie(hole = 0.6) %>% layout(title = "Nationalities", showlegend = F, xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)) ``` --- ## Plotting result ```r pie_plot ```
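One detail skipped above is how `table_nationalities` (the object passed to `plot_ly`) was built from `book_data_with_nationality`. A minimal sketch, using a toy stand-in data frame in place of the real 1001-row one, could be:

```r
library(dplyr)

# Toy stand-in for book_data_with_nationality (the real one comes
# from the scraping steps above and has one row per book)
book_data_with_nationality <- tibble::tibble(
  book        = c("A", "B", "C", "D"),
  nationality = c("English", "American", "English", NA)
)

# Tally books per nationality, most common first
table_nationalities <- book_data_with_nationality |>
  count(nationality, name = "count", sort = TRUE)

table_nationalities
```

On the real data, a tally like this produces the nationality/count table shown on the Result slide.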
--- ## A few thoughts <br><br> .aim-box.tl.w-70[ - Many ways to approach this problem - Approach here was to use the standard rvest toolbox - Not perfect - much-needed cleaning of nationality strings - A bit of guessing of nationalities ] --- ## What else could we have done - Can scrape more data from the Goodreads website - Goodreads has an API - Check out the repository by famguy/rgoodreads to get started - Using this API makes querying things like year or gender more straightforward - But Goodreads has no nationality field, so this solution is still useful! --- ## What else for webscraping * There are easier ways to answer this same question * Namely, RSelenium for pages with JavaScript * Learning the hard way can be good sometimes though! -- We should stop before we start scraping and think about whether we should. * Check the terms and conditions and terms of use * Look for a data licence * Consider ethics. Am I violating data privacy? * Be considerate of the volume of queries and the query rate limit * Can look at the robots.txt file for the website --- class: center middle bg-gray .aim-box.tl.w-70[ #Summary - We've learnt the basics of web scraping - Know how to scrape from a static website in R - Worked towards automating our web scraping - And we found out there are many challenges! ] --- class: transition ## Slides developed by Dr Kate Saunders --- background-size: cover class: title-slide background-image: url("images/bg-12.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
.bottom_abs.width100[ Lecturer: *Kate Saunders* Department of Econometrics and Business Statistics <i class="fas fa-envelope"></i> ETC5512.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 12 <br> ]