ETC5512

Introduction to web scraping

Lecturer: Kate Saunders

Department of Econometrics and Business Statistics


  • ETC5512.Clayton-x@monash.edu
  • Wild Caught Data
  • wcd.numbat.space


Journey so far

Journey

  • Weeks 1 - 2: Introducing you to open data and data collection

  • Weeks 4 - 7: Different data case studies (cakes!!!)

  • Weeks 8 - 9: Data ethics and data privacy

  • Week 10: Today two goals: Show you web-scraping & have some fun!

Motivation

After my PhD - I wanted to read more!

Question

How should I choose what to read?

I want to read different perspectives and broaden my world view.

Picking from lists

Is this a good approach?

Am I actually reading diversely?

I suspect this might be a very biased sample.

My approach

Got a question - Am I reading diversely?

  • Now I’m going to try to answer it!

  • One option is to find an open data set, another is I create my own data set for analysis using web scraping

  • I’m going to take a book list of best sellers and analyse the nationalities of different authors

  • Seems simple right - Let’s see how this works out!

Today’s lecture

What we’ll cover

  • Cover the basics of navigating around webpages
  • Learn about when it's appropriate to scrape data
  • Also discover the challenges of working with data from the web

Coding Perspective:

  • Learn how to read data from a webpage into R
  • See some examples of wrapping code in functions
  • Learn about automating a scraper

html basics

Go to a webpage

Link to the wikipedia page for a famous Australian author

View html code in Chrome

  • Right click the part of the page you want
  • Select Inspect

Html code

  • Brings up the html code
  • Highlights the piece of html code related to your click
  • Hover over html code to see other features of the web page

Inspect button

  • Similarly, click the top left button in the side panel
  • Explore related features of the webpage and html code

Basic html types

By browsing you observe the basic structure of html webpages

Opening and closing tags are used to wrap around content and define its purpose and appearance on a webpage.

e.g. <tag> some random text </tag>

Basic tag types

  • div - division or section
  • p - paragraph elements
  • h1 to h6 - headings
  • table - Table
  • td - table data
  • a - anchor for a hyperlink
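
To see these tags in action without touching a live website, here is a minimal sketch that parses a small made-up HTML snippet with rvest's minimal_html(). The snippet is invented for illustration and is not real Wikipedia markup.

```r
library(rvest)

# A toy web page written by hand (assumption: not taken from any real site)
page <- minimal_html('
  <h1>Tim Winton</h1>
  <div>
    <p>Tim Winton is an <a href="/wiki/Australia">Australian</a> writer.</p>
    <p>He has written many novels.</p>
  </div>')

# h1 - the heading text
heading <- page |> html_element("h1") |> html_text()

# p - all paragraphs, returned as a character vector
paragraphs <- page |> html_elements("p") |> html_text()

# a - the link target is stored in the href attribute
link_target <- page |> html_element("a") |> html_attr("href")

heading
paragraphs
link_target
```

The tag name in each call is the same name you see when you inspect the page in the browser.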

Breakout Session

Try it yourself time

  • Pick an author and find their wikipedia page

  • Explore the structure of the webpage

  • Look at the different html tags

Web scraping in R

R Packages

There are lots of packages in R that can be used for web scraping.

Two main R packages

rvest (focus today)

  • Great for those new to web scraping

  • Works best for static web pages

  • Similar coding syntax to tidyverse packages

RSelenium

  • Better for more complex tasks

  • Use it when you have web pages that load dynamically (pop ups, list menus, drop downs etc)

Read a webpage

library(tidyverse) 
library(rvest)
author_url <- "https://en.wikipedia.org/wiki/Tim_Winton"
wiki_data <- read_html(author_url) # Read the webpage into R
wiki_data
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

How to scrape a table - html_table()

So we can read data from the website into R, but we need the data in a form we can use.

all_tables <- wiki_data |>
  html_table(header = FALSE) #Get all tables on the webpage

length(all_tables)
[1] 5
infocard = all_tables |> 
  pluck(1) #pluck out the first item in the list 
infocard
# A tibble: 9 × 2
  X1                                             X2                             
  <chr>                                          <chr>                          
1 Tim WintonAO                                   Tim WintonAO                   
2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath…
3 Born                                           Timothy John Winton4 August 19…
4 Occupation                                     Novelist                       
5 Nationality                                    Australian                     
6 Period                                         1982–present                   
7 Genre                                          Literature, children's, non-fi…
8 Notable works                                  Cloudstreet Dirt Music Breath …
9 Notable awards                                 Miles Franklin  1984, 1992, 20…
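
If you want to experiment with html_table() offline, the sketch below builds a tiny hand-written table and reads it the same way. The table contents are made up to mimic an infocard.

```r
library(rvest)

# A toy page with one two-column table (assumption: invented for illustration)
page <- minimal_html("
  <table>
    <tr><td>Occupation</td><td>Novelist</td></tr>
    <tr><td>Nationality</td><td>Australian</td></tr>
  </table>")

# Called on a whole page, html_table() returns a list with one tibble per table
toy_tables <- page |> html_table(header = FALSE)

length(toy_tables)
toy_tables[[1]]
```

With header = FALSE the columns get the default names X1 and X2, which is why those names appear in the output above.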

Other approaches - html_elements()

infocard <- wiki_data |>
 html_elements("table") |> # get all tables
 html_table(header = FALSE) |>
 pluck(1)

infocard
# A tibble: 9 × 2
  X1                                             X2                             
  <chr>                                          <chr>                          
1 Tim WintonAO                                   Tim WintonAO                   
2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath…
3 Born                                           Timothy John Winton4 August 19…
4 Occupation                                     Novelist                       
5 Nationality                                    Australian                     
6 Period                                         1982–present                   
7 Genre                                          Literature, children's, non-fi…
8 Notable works                                  Cloudstreet Dirt Music Breath …
9 Notable awards                                 Miles Franklin  1984, 1992, 20…

Other approaches - html_element()

As with other tidyverse functions, there are variants that return all matches (html_elements()) or just the first match (html_element()).

infocard <- wiki_data |>
 html_element("table") |> # get the first table
 html_table(header = FALSE) 

infocard
# A tibble: 9 × 2
  X1                                             X2                             
  <chr>                                          <chr>                          
1 Tim WintonAO                                   Tim WintonAO                   
2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath…
3 Born                                           Timothy John Winton4 August 19…
4 Occupation                                     Novelist                       
5 Nationality                                    Australian                     
6 Period                                         1982–present                   
7 Genre                                          Literature, children's, non-fi…
8 Notable works                                  Cloudstreet Dirt Music Breath …
9 Notable awards                                 Miles Franklin  1984, 1992, 20…

Using classes

We can also match specific elements on a web page using classes.

<table class="infobox vcard">

Notice that in the code each class name is prefixed with a dot, and the space between the class names is dropped.

infocard <- wiki_data |>
    html_element(".infobox.vcard") |> # matches the specific table class
    html_table(header = FALSE) 
infocard
# A tibble: 9 × 2
  X1                                             X2                             
  <chr>                                          <chr>                          
1 Tim WintonAO                                   Tim WintonAO                   
2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath…
3 Born                                           Timothy John Winton4 August 19…
4 Occupation                                     Novelist                       
5 Nationality                                    Australian                     
6 Period                                         1982–present                   
7 Genre                                          Literature, children's, non-fi…
8 Notable works                                  Cloudstreet Dirt Music Breath …
9 Notable awards                                 Miles Franklin  1984, 1992, 20…
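
A small sketch of how class selectors narrow a match, using two hand-written tables (the markup is made up for illustration). A selector with several dotted names only matches elements that carry all of those classes.

```r
library(rvest)

# Two toy tables: only the first has both classes (assumption: invented markup)
page <- minimal_html('
  <table class="infobox vcard"><tr><td>Nationality</td><td>Australian</td></tr></table>
  <table class="infobox"><tr><td>Other</td><td>Table</td></tr></table>')

# Matches any element with the class "infobox"
n_infobox <- page |> html_elements(".infobox") |> length()

# Matches only elements with both "infobox" and "vcard"
n_infobox_vcard <- page |> html_elements(".infobox.vcard") |> length()

n_infobox
n_infobox_vcard
```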

Get the Nationality

One more step: we need to get the nationality from the table.

This just requires some basic data wrangling.

author_nationality = infocard |>
  rename(Category = X1, Response = X2) |>
  filter(Category == "Nationality") |>
  pull(Response) 
author_nationality
[1] "Australian"

Real challenge

Can we generalise?

Breakout Session

Try it yourself time

  • For the author you picked, download their infocard into R

  • Can you get their nationality from the infocard?

  • At your table discuss any challenges you face

Let’s try a different author

Wikipedia page for author Jane Austen

Let’s get that infocard

author_url <- "https://en.wikipedia.org/wiki/Jane_Austen"
wiki_data <- read_html(author_url)

infocard <- wiki_data |>
  html_element(".infobox.vcard") |> 
  html_table(header = FALSE) 

author_nationality = infocard |>
  rename(Category = X1, Response = X2) |>
  filter(Category == "Nationality") |>
  pull(Response) 

author_nationality # :( oh oh
character(0)

Blocker!!!

Warning

No nationality category in Jane Austen’s infocard

The infocards on wikipedia do not have a standard format!

This is a real barrier to automation

Tip

But her nationality is in the webpage text.
I could just scrape her nationality from there.

Try another way

para_data <- wiki_data |>
  html_elements("p") # get all the paragraphs
head(para_data)
{xml_nodeset (6)}
[1] <p class="mw-empty-elt">\n\n\n\n</p>
[2] <p><b>Jane Austen</b> (<span class="rt-commentedText nowrap"><span class= ...
[3] <p>The anonymously published <i><a href="/wiki/Sense_and_Sensibility" tit ...
[4] <p>Since her death Austen's novels have rarely been out of print. A signi ...
[5] <p>The scant biographical information about Austen comes from her few sur ...
[6] <p>The first Austen biography was <a href="/wiki/Henry_Thomas_Austen" tit ...

Caution

But where exactly do we find her nationality in all this text?

Let’s go back to exploring the webpage.

Get the text - html_text()

text_data <- para_data |>
  pluck(2) |> # get the second paragraph
  html_text() # convert the html paragraph to text
head(text_data)
[1] "Jane Austen (/ˈɒstɪn, ˈɔːstɪn/ OST-in, AW-stin; 16 December 1775 – 18 July 1817) was an English novelist known primarily for her six novels, which implicitly interpret, critique, and comment on the English landed gentry at the end of the 18th century. Austen's plots often explore the dependence of women on marriage for the pursuit of favourable social standing and economic security. Her works are implicit critiques of the novels of sensibility of the second half of the 18th century and are part of the transition to 19th-century literary realism.[2][b] Her use of social commentary, realism, wit, and irony have earned her acclaim amongst critics and scholars.\n"

There are more ways we can navigate around a web page, some are very useful in combination with string handling.

Xpath Example

  • Right click html code, copy, copy Xpath

Using an Xpath

para_xpath = '//*[@id="mw-content-text"]/div/p[2]'

text_data <- wiki_data |>
  html_element(xpath = para_xpath) |>
  html_text()
text_data
[1] "Jane Austen (/ˈɒstɪn, ˈɔːstɪn/ OST-in, AW-stin; 16 December 1775 – 18 July 1817) was an English novelist known primarily for her six novels, which implicitly interpret, critique, and comment on the English landed gentry at the end of the 18th century. Austen's plots often explore the dependence of women on marriage for the pursuit of favourable social standing and economic security. Her works are implicit critiques of the novels of sensibility of the second half of the 18th century and are part of the transition to 19th-century literary realism.[2][b] Her use of social commentary, realism, wit, and irony have earned her acclaim amongst critics and scholars.\n"

CSS Selector Example

  • Right click html code, copy, copy selector

Using a CSS selector

para_css = "#mw-content-text > div.mw-content-ltr.mw-parser-output > p:nth-child(5)"

text_data <- wiki_data |>
  html_element(css = para_css) |>
  html_text()
text_data
[1] "Jane Austen (/ˈɒstɪn, ˈɔːstɪn/ OST-in, AW-stin; 16 December 1775 – 18 July 1817) was an English novelist known primarily for her six novels, which implicitly interpret, critique, and comment on the English landed gentry at the end of the 18th century. Austen's plots often explore the dependence of women on marriage for the pursuit of favourable social standing and economic security. Her works are implicit critiques of the novels of sensibility of the second half of the 18th century and are part of the transition to 19th-century literary realism.[2][b] Her use of social commentary, realism, wit, and irony have earned her acclaim amongst critics and scholars.\n"
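
To compare the two approaches side by side, here is a minimal sketch on a hand-written page (the id and text are made up). Both selectors point at the second paragraph inside the same div.

```r
library(rvest)

# A toy page with an id we can target (assumption: invented markup)
page <- minimal_html('
  <div id="content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>')

# XPath: the second p directly under the element with id "content"
via_xpath <- page |>
  html_element(xpath = '//*[@id="content"]/p[2]') |>
  html_text()

# CSS: the p that is the second child of #content
via_css <- page |>
  html_element(css = "#content > p:nth-child(2)") |>
  html_text()

identical(via_xpath, via_css)
```

Note that both kinds of selector depend on the position of the paragraph, so they can break if the page layout changes.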

Back to getting the nationality

Once we have our text, we need to do some text analysis. We could use str_count().

possible_nationalities <- c("Australian", "Chinese", "Mexican", "English", "Ethiopian")

# Count how many times these appear in the text
count_values = str_count(text_data, possible_nationalities) 

# Get the matching nationalities
possible_nationalities[count_values > 0] 
[1] "English"

Caution

  • What do you think of my solution?
  • Any guesses why I didn’t use str_match?
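
One weakness worth checking for yourself: str_count() matches substrings, so a nationality can be counted inside a longer word. The sketch below uses a made-up sentence and shows one possible guard, wrapping each pattern in word boundaries. This is an illustration, not the approach used in the lecture code.

```r
library(stringr)

# Made-up sentence for illustration
text_data <- "She was an Indiana-born American writer."
possible_nationalities <- c("Indian", "American")

# Plain substring counts: "Indian" is found inside "Indiana"
plain_counts <- str_count(text_data, possible_nationalities)

# Word boundaries (\\b) require the match to be a whole word
bounded_patterns <- paste0("\\b", possible_nationalities, "\\b")
bounded_counts <- str_count(text_data, bounded_patterns)

plain_counts    # 1 1
bounded_counts  # 0 1
```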

Automation

Web scraping Applications

Examples

In business analytics there are many web scraping applications. Some include:

  • Competitive Intelligence: Monitoring competitors’ pricing, product offerings, and promotions in real-time

  • Market Research: Gathering customer reviews, sentiment analysis, and identifying trends

  • Price Optimisation: Analysing price points across markets to inform dynamic pricing strategies

  • Financial Analysis: Extracting economic indicators, stock performance data, and financial news for investment decisions

Why automate?

Web scrapers are used to collect data regularly, with little to no manual effort and typically at scale.

Tip

  • Regular Timing: Setting up scrapers to run daily, weekly, or hourly

  • Simple Notifications: Getting emails or messages when important data changes

  • Saving Data Automatically: Storing collected data in spreadsheets or databases

  • Error Handling: Making sure scrapers continue working when websites change
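
A minimal sketch of what an automated run might look like, combining error handling, a pause between requests, and saving the results. Scrape_one() is a hypothetical stand-in for a real scraper so the code runs offline, and the URLs, delay and file name are arbitrary choices.

```r
# Hypothetical stand-in for a real scraper (assumption: not from the lecture)
Scrape_one <- function(url){
  if(!grepl("^https://", url)) stop("not a valid url")
  data.frame(url = url, scraped_at = format(Sys.time()))
}

urls <- c("https://www.example.com/a", "bad url", "https://www.example.com/b")
results <- list()

for(u in urls){
  # Error handling: one bad page should not stop the whole run
  # (assigning NULL means the failed url is simply left out of the list)
  results[[u]] <- tryCatch(Scrape_one(u), error = function(e) NULL)

  # Be considerate: pause between requests (0.5 seconds is an arbitrary choice)
  Sys.sleep(0.5)
}

# Saving data automatically: combine the successful results and write to file
scraped <- do.call(rbind, results)
out_file <- file.path(tempdir(), "scraped.csv")
write.csv(scraped, out_file, row.names = FALSE)

nrow(scraped)
```

Running a script like this on a regular schedule is done outside R, for example with a task scheduler on your computer or server.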

Functions for automation

Let’s automate.

Any code that gets repeated can be written as a function.

Here is a simple example of a function that adds two numbers.

add_two_numbers <-function(num1, num2){
  number_sum = num1 + num2
  return(number_sum)
} 

Error handling

Key to automation is carefully handling any errors. A common function for this in R is tryCatch()

ans1 = tryCatch(add_two_numbers(1, 1), 
         error = function(e){return(NA)})
ans1
[1] 2
ans2 = tryCatch(add_two_numbers("a", 1), 
         error = function(e){ print(e); return(NA)})
<simpleError in num1 + num2: non-numeric argument to binary operator>
ans2
[1] NA
ans3 = tryCatch(add_two_numbers(x, 1), 
         error = function(e){ print(e); return("Your code didn't work")})
<simpleError in eval(expr, envir): object 'x' not found>
ans3
[1] "Your code didn't work"
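
An alternative you may come across is possibly() from the purrr package, which wraps a function so that it returns a default value instead of throwing an error. This is a sketch of the same idea as the tryCatch() examples, not a replacement for them. The function is redefined here so the block runs on its own.

```r
library(purrr)

add_two_numbers <- function(num1, num2){
  number_sum = num1 + num2
  return(number_sum)
}

# Wrap the function: on error, return NA instead of stopping
safe_add <- possibly(add_two_numbers, otherwise = NA)

safe_add(1, 1)    # 2
safe_add("a", 1)  # NA
```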

Ethical scraping

Important

Just because you can scrape - doesn’t mean you should!

As with any data analytics you must think about ethics!

Think about

  • Check the terms and conditions and terms of use

  • Look for a data licence

  • Review the robots.txt file for the website (e.g. https://www.example.com/robots.txt).

  • Be considerate of the volume of queries and query rate limit - see polite package in R.

  • Am I respecting people’s data privacy?
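
To get a feel for what a robots.txt file says, here is a deliberately simplified sketch that reads Disallow rules from a made-up file and checks a path against them. Real robots.txt files have more rules (multiple user agents, Allow lines, wildcards), so treat this as an illustration only. The helper Is_path_allowed() is hypothetical.

```r
library(stringr)

# A made-up robots.txt, for illustration only
robots_txt <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /search",
  "Crawl-delay: 5"
)

# Pull out the paths listed after "Disallow:"
disallowed <- robots_txt |>
  str_subset("^Disallow:") |>
  str_remove("^Disallow:\\s*")

# Hypothetical helper: TRUE if the path does not start with a disallowed prefix
Is_path_allowed <- function(path, disallowed){
  !any(startsWith(path, disallowed))
}

Is_path_allowed("/wiki/Tim_Winton", disallowed)   # TRUE
Is_path_allowed("/private/data.csv", disallowed)  # FALSE
```

Passing this kind of check does not settle the question on its own. You still need to read the terms of use and think about privacy.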

Breakout Session

Try it yourself time

Is it okay to automate data scraping from:

Back to my author example

Pseudo code

Planning ahead

  • Before you start a project it can be a good idea to plan it out

  • Pseudo code is a simplified, informal description of an algorithm

  • Pseudo code can contain words and coding syntax

  • Use pseudo code to identify coding building blocks / individual steps, and to identify potential problems or blockers.

For my project

Pseudo code

  1. Read the wiki webpage about an author.

  2. Get the info card.

  3. Try to get the nationality from the infocard.

  4. If no nationality or infocard, I want to find the first paragraph containing text.

  5. Guess the nationality from the text.
    Going to need nationalities for matching.

  6. Bring the above all together in a single function so I can automatically iterate over a list of authors.
    Going to need my list of author web pages.

Also need to think about how I handle any errors!

Wrap my code chunks

  1. Want a function to read the wiki webpage about an author
Read_wiki_page <- function(author_url){
  wiki_data <- read_html(author_url)
  return(wiki_data)
}

More wrapping of code chunks

  2. Want a function to read the info card.
Get_wiki_infocard <- function(wiki_data){
  infocard <- wiki_data |>
    html_element(".infobox.vcard") |>
    html_table(header = FALSE) 
  return(infocard)
}

More wrapping of code chunks

  3. Want a function to get the nationality from the infocard
Get_nationality_from_infocard <- function(infocard){
  nationality <- infocard |>
    rename(Category = X1, Response = X2) |>
    filter(Category == "Nationality") |>
    pull(Response)
  return(nationality)
}

More wrapping of code chunks

  4. Need a function to find the first html paragraph containing text (Remember I didn’t say this was a good idea - we are learning!)
Get_first_text <- function(wiki_data){
  
  paragraph_data <- wiki_data |>
    html_elements("p")
  
  i = 1
  no_text = TRUE
  while(no_text){
    
    text_data <- paragraph_data |>
      purrr::pluck(i) |>
      html_text() 
    
    check_text = gsub("\\s+", "", text_data)
    
    if(check_text == ""){ 
      # keep searching 
      i = i + 1 
    }else{ 
      # end the while loop
      no_text = FALSE
    }
  }
  return(text_data)
}
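
One thing to be aware of with the while loop above: if no paragraph contains text, the index runs past the end of the list. A possible alternative is sketched below. It converts all paragraphs to text at once and returns NA when none has content. The function name and the toy page are made up for illustration.

```r
library(rvest)

# Hypothetical alternative to Get_first_text() without a while loop
Get_first_text_vectorised <- function(wiki_data){

  # Convert every paragraph to text in one go
  all_text <- wiki_data |>
    html_elements("p") |>
    html_text()

  # TRUE for paragraphs with at least one non-whitespace character
  has_text <- grepl("\\S", all_text)

  if(!any(has_text)) return(NA_character_)

  return(all_text[which(has_text)[1]])
}

# Toy page: an empty paragraph followed by one with text
toy_page <- minimal_html(
  "<p class='mw-empty-elt'></p><p>Jane Austen was an English novelist.</p>")
first_text <- Get_first_text_vectorised(toy_page)
first_text

# Toy page with no paragraphs at all
no_text <- Get_first_text_vectorised(minimal_html("<div></div>"))
no_text
```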

More wrapping of code chunks

  5. Need another function to “guess” the nationality from the text
Guess_nationality_from_text <- function(text_data, possible_nationalities){

  num_matches <- str_count(text_data, possible_nationalities)

  prob_matches <- num_matches/sum(num_matches)

  i = which(prob_matches > 0)
  if(length(i) == 1){

    prob_nationality = possible_nationalities[i]

    return(prob_nationality)
    
  }else if(length(i) > 0){

    warning(paste(c("More than one match for the nationality:",
                    possible_nationalities[i], "\n"), collapse = " "))

    likely_nationality = which.max(prob_matches)

    prob_nationality = possible_nationalities[likely_nationality]

    return(prob_nationality)

  }else{

    return("No nationality matched")

  }

}

More wrapping of code chunks

  6. One function that brings that all together
Query_nationality_from_wiki <- function(author_url, possible_nationalities){

  wiki_data <- Read_wiki_page(author_url)

  infocard <- Get_wiki_infocard(wiki_data)

  if(is.null(infocard)){

    # missing infocard - get nationality from text 
    first_paragraph <- Get_first_text(wiki_data)

    nationality <- Guess_nationality_from_text(first_paragraph,
                                               possible_nationalities)

    return(nationality)

  }

  if(any(infocard[,1] == "Nationality")){

    # info card exists and has nationality
    nationality <- Get_nationality_from_infocard(infocard)

  }else{

    # no nationality in infocard - find nationality in text
    text_data <- Get_first_text(wiki_data)
    nationality <- Guess_nationality_from_text(text_data,
                                               possible_nationalities)

  }

  return(nationality)

}

Test our code

author_url = "https://en.wikipedia.org/wiki/Tim_Winton"
Query_nationality_from_wiki(author_url, c("English", "British", "Australian"))
[1] "Australian"
author_url <- "https://en.wikipedia.org/wiki/Jane_Austen"
Query_nationality_from_wiki(author_url , c("English", "British", "Australian"))
[1] "English"

Do some error handling

Wrapper_nationality_query <- function(author_url, possible_nationalities){

  author_nationality = tryCatch(
    Query_nationality_from_wiki(author_url, possible_nationalities),
    error = function(e){print("Encountered an error; returning an error code 99999"); return(99999)})

  return(author_nationality)
  
}

Test our code - try to break it

author_url <- "not a real web address"
Wrapper_nationality_query(author_url , c("English", "British", "Australian"))
[1] "Encountered an error; returning an error code 99999"
[1] 99999
author_url <- "https://en.wikipedia.org/wiki/Jane_Austen"
Wrapper_nationality_query(author_url , c("Chinese", "Indian", "Thai"))
[1] "No nationality matched"

Results

Data

Data Dictionary

  • author_name (character string): Author name.
  • author_links (character string): URL for the author's page, scraped from the Wikipedia bestseller list. Some of these links are invalid and may contain NA or character(0).
  • nationality (character string): The nationality is either taken directly from the author’s infocard on the author_links page, or is guessed from the corresponding text. If guessed, this is the nationality string that appears most often, where nationalities are matched to the list provided by the UK government webpage.
author_df = read_csv("data/author_df.csv") 
head(author_df)
# A tibble: 6 × 3
  author_name              author_links                              nationality
  <chr>                    <chr>                                     <chr>      
1 Charles Dickens          https://en.wikipedia.org/wiki/Charles_Di… English    
2 Antoine de Saint-Exupéry https://en.wikipedia.org/wiki/Antoine_de… French     
3 Paulo Coelho             https://en.wikipedia.org/wiki/Paulo_Coel… Brazilian  
4 J. K. Rowling            https://en.wikipedia.org/wiki/J._K._Rowl… British    
5 Agatha Christie          https://en.wikipedia.org/wiki/Agatha_Chr… English    
6 Cao Xueqin               https://en.wikipedia.org/wiki/Cao_Xueqin  Chinese    

Code to pull it all together

author_df$nationality = NA # initialise the column that the loop fills in
for(i in seq_len(nrow(author_df))){

  author_url = author_df$author_links[i]
  author_df$nationality[i] = Wrapper_nationality_query(author_url = author_url,
                            possible_nationalities = nationalities_data)
  print(paste(i, author_df$author_name[i], author_df$nationality[i]))

}

Here nationalities_data is from the UK government website.
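
The same iteration can also be written with purrr's map_chr() instead of a for loop. The sketch below uses a hypothetical stand-in for Wrapper_nationality_query() so that it runs offline. The as.character() call is there because the wrapper can return either a string or the numeric error code 99999.

```r
library(purrr)

# Hypothetical stand-in for Wrapper_nationality_query(), so no web access is needed
Fake_nationality_query <- function(author_url, possible_nationalities){
  if(is.na(author_url)) return(99999)
  return("Australian")
}

# One valid link and one missing link
toy_links <- c("https://en.wikipedia.org/wiki/Tim_Winton", NA)

# map_chr() calls the function once per link and returns a character vector
toy_nationality <- map_chr(
  toy_links,
  \(u) as.character(Fake_nationality_query(u, c("Australian")))
)

toy_nationality
```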

Not very diverse

Imperfect analysis, but the result is clear. There is a heavy bias towards American and British authors.

author_df = read_csv("data/author_df.csv")
print(paste("Total bestsellers:", nrow(author_df)))
[1] "Total bestsellers: 293"
author_df |>
  count(nationality) |>
  arrange(desc(n)) |>
  mutate(nationality = if_else(nationality == "99999", "Webpage error", nationality))
# A tibble: 41 × 2
   nationality                n
   <chr>                  <int>
 1 American                 112
 2 Webpage error             32
 3 British                   27
 4 English                   24
 5 Japanese                  18
 6 No nationality matched    11
 7 French                     6
 8 Russian                    6
 9 German                     5
10 Swedish                    5
# ℹ 31 more rows

Code to get nationality data

Scrape a list of nationalities from the UK Government

nationalities_url = "https://www.gov.uk/government/publications/nationalities/list-of-nationalities"

nationalities_data = read_html(nationalities_url) |>
  html_elements("td") |>
  html_text(trim = TRUE)

not_empty = (nationalities_data != "")
nationalities_data = nationalities_data[not_empty]

nationalities_data
  [1] "Afghan"                            "Albanian"                         
  [3] "Algerian"                          "American"                         
  [5] "Andorran"                          "Angolan"                          
  [7] "Anguillan"                         "Citizen of Antigua and Barbuda"   
  [9] "Argentine"                         "Armenian"                         
 [11] "Australian"                        "Austrian"                         
 [13] "Azerbaijani"                       "Bahamian"                         
 [15] "Bahraini"                          "Bangladeshi"                      
 [17] "Barbadian"                         "Belarusian"                       
 [19] "Belgian"                           "Belizean"                         
 [21] "Beninese"                          "Bermudian"                        
 [23] "Bhutanese"                         "Bolivian"                         
 [25] "Citizen of Bosnia and Herzegovina" "Botswanan"                        
 [27] "Brazilian"                         "British"                          
 [29] "British Virgin Islander"           "Bruneian"                         
 [31] "Bulgarian"                         "Burkinan"                         
 [33] "Burmese"                           "Burundian"                        
 [35] "Cambodian"                         "Cameroonian"                      
 [37] "Canadian"                          "Cape Verdean"                     
 [39] "Cayman Islander"                   "Central African"                  
 [41] "Chadian"                           "Chilean"                          
 [43] "Chinese"                           "Colombian"                        
 [45] "Comoran"                           "Congolese (Congo)"                
 [47] "Congolese (DRC)"                   "Cook Islander"                    
 [49] "Costa Rican"                       "Croatian"                         
 [51] "Cuban"                             "Cymraes"                          
 [53] "Cymro"                             "Cypriot"                          
 [55] "Czech"                             "Danish"                           
 [57] "Djiboutian"                        "Dominican"                        
 [59] "Citizen of the Dominican Republic" "Dutch"                            
 [61] "East Timorese"                     "Ecuadorean"                       
 [63] "Egyptian"                          "Emirati"                          
 [65] "English"                           "Equatorial Guinean"               
 [67] "Eritrean"                          "Estonian"                         
 [69] "Ethiopian"                         "Faroese"                          
 [71] "Fijian"                            "Filipino"                         
 [73] "Finnish"                           "French"                           
 [75] "Gabonese"                          "Gambian"                          
 [77] "Georgian"                          "German"                           
 [79] "Ghanaian"                          "Gibraltarian"                     
 [81] "Greek"                             "Greenlandic"                      
 [83] "Grenadian"                         "Guamanian"                        
 [85] "Guatemalan"                        "Citizen of Guinea-Bissau"         
 [87] "Guinean"                           "Guyanese"                         
 [89] "Haitian"                           "Honduran"                         
 [91] "Hong Konger"                       "Hungarian"                        
 [93] "Icelandic"                         "Indian"                           
 [95] "Indonesian"                        "Iranian"                          
 [97] "Iraqi"                             "Irish"                            
 [99] "Israeli"                           "Italian"                          
[101] "Ivorian"                           "Jamaican"                         
[103] "Japanese"                          "Jordanian"                        
[105] "Kazakh"                            "Kenyan"                           
[107] "Kittitian"                         "Citizen of Kiribati"              
[109] "Kosovan"                           "Kuwaiti"                          
[111] "Kyrgyz"                            "Lao"                              
[113] "Latvian"                           "Lebanese"                         
[115] "Liberian"                          "Libyan"                           
[117] "Liechtenstein citizen"             "Lithuanian"                       
[119] "Luxembourger"                      "Macanese"                         
[121] "Macedonian"                        "Malagasy"                         
[123] "Malawian"                          "Malaysian"                        
[125] "Maldivian"                         "Malian"                           
[127] "Maltese"                           "Marshallese"                      
[129] "Martiniquais"                      "Mauritanian"                      
[131] "Mauritian"                         "Mexican"                          
[133] "Micronesian"                       "Moldovan"                         
[135] "Monegasque"                        "Mongolian"                        
[137] "Montenegrin"                       "Montserratian"                    
[139] "Moroccan"                          "Mosotho"                          
[141] "Mozambican"                        "Namibian"                         
[143] "Nauruan"                           "Nepalese"                         
[145] "New Zealander"                     "Nicaraguan"                       
[147] "Nigerian"                          "Nigerien"                         
[149] "Niuean"                            "North Korean"                     
[151] "Northern Irish"                    "Norwegian"                        
[153] "Omani"                             "Pakistani"                        
[155] "Palauan"                           "Palestinian"                      
[157] "Panamanian"                        "Papua New Guinean"                
[159] "Paraguayan"                        "Peruvian"                         
[161] "Pitcairn Islander"                 "Polish"                           
[163] "Portuguese"                        "Prydeinig"                        
[165] "Puerto Rican"                      "Qatari"                           
[167] "Romanian"                          "Russian"                          
[169] "Rwandan"                           "Salvadorean"                      
[171] "Sammarinese"                       "Samoan"                           
[173] "Sao Tomean"                        "Saudi Arabian"                    
[175] "Scottish"                          "Senegalese"                       
[177] "Serbian"                           "Citizen of Seychelles"            
[179] "Sierra Leonean"                    "Singaporean"                      
[181] "Slovak"                            "Slovenian"                        
[183] "Solomon Islander"                  "Somali"                           
[185] "South African"                     "South Korean"                     
[187] "South Sudanese"                    "Spanish"                          
[189] "Sri Lankan"                        "St Helenian"                      
[191] "St Lucian"                         "Stateless"                        
[193] "Sudanese"                          "Surinamese"                       
[195] "Swazi"                             "Swedish"                          
[197] "Swiss"                             "Syrian"                           
[199] "Taiwanese"                         "Tajik"                            
[201] "Tanzanian"                         "Thai"                             
[203] "Togolese"                          "Tongan"                           
[205] "Trinidadian"                       "Tristanian"                       
[207] "Tunisian"                          "Turkish"                          
[209] "Turkmen"                           "Turks and Caicos Islander"        
[211] "Tuvaluan"                          "Ugandan"                          
[213] "Ukrainian"                         "Uruguayan"                        
[215] "Uzbek"                             "Vatican citizen"                  
[217] "Citizen of Vanuatu"                "Venezuelan"                       
[219] "Vietnamese"                        "Vincentian"                       
[221] "Wallisian"                         "Welsh"                            
[223] "Yemeni"                            "Zambian"                          
[225] "Zimbabwean"                       

Code to get author webpages from bestsellers list

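# Look up an author's link by matching the title attribute of <a> tags.
# Returns the matching href(s); character(0) if none match, NA if the lookup errors.
Get_url_for_author <- function(author_name, scraped_table){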

  search_str = paste0("a[title = '", author_name, "']")

  matching_link = tryCatch(
    scraped_table |>
      html_elements(search_str) |>
      html_attr("href"),
    error = function(e) { NA })

  return(matching_link)

}

Get_url_from_html_table <- function(scraped_table){

  if(is.list(scraped_table)) scraped_table = scraped_table |> pluck(1)

  # get the html table entries
  table_entries = scraped_table |>
    html_elements("td")

  # make the table a data frame in R
  table_df = scraped_table |>
    html_table()
  author_vec = table_df$`Author(s)`

  # get the table dimensions
  table_dim = table_df |>
    dim()
  nrows = table_dim[1]
  ncols = table_dim[2]

  # the authors are in the second column, so get the cell indexes
  cell_indexes = seq(2, nrows*ncols, by = ncols)
  if(nrows*ncols == length(table_entries)){

    author_links = table_entries[cell_indexes] |>
      html_element("a") |>
      html_attr("href")

  }else{

    # however, some tables don't have a standard layout,
    # so for these we look up each author's link by name instead
    author_links = sapply(author_vec, Get_url_for_author, scraped_table)

  }

  author_links = paste0("https://en.wikipedia.org", author_links)
  author_df = data.frame(author_name = author_vec, author_links)

  return(author_df)

}
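
The code above pastes the Wikipedia prefix onto whatever was found, so a missing link becomes text such as "orgNA" or "orgcharacter(0)". A minimal sketch of one way to keep missing links as NA is given below. It assumes only base R; the helper name Make_full_url and the example inputs are hypothetical and not part of the lecture code.

```r
# Hypothetical helper (an assumption, not the lecture's code):
# build a full URL but keep NA when no link was found.
Make_full_url <- function(href, base_url = "https://en.wikipedia.org") {

  # sapply() returns a list when some lookups give character(0),
  # so first reduce every entry to a single string or NA
  href = vapply(href, function(x) {
    if (length(x) == 0 || is.na(x[1])) NA_character_ else as.character(x[1])
  }, character(1), USE.NAMES = FALSE)

  # only paste the prefix onto links that exist
  ifelse(is.na(href), NA_character_, paste0(base_url, href))
}

# one found link, one empty lookup, one missing link
example_hrefs = list("/wiki/Charles_Dickens", character(0), NA)
Make_full_url(example_hrefs)
```

With this kind of helper the missing entries stay as NA, so they can be counted or dropped with is.na() rather than being mistaken for real web addresses.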

Code to scrape the best sellers webpage

bestsellers_url <- "https://en.wikipedia.org/wiki/List_of_best-selling_books"
wiki_bestseller_tables = read_html(bestsellers_url) |>
  html_elements("table.wikitable")

warning("Hard code to avoid unwanted tables - don't want the dictionary!")
wiki_bestseller_tables_relevant = wiki_bestseller_tables[1:9]

author_df = NULL
for(i in seq_along(wiki_bestseller_tables_relevant)){
  result = Get_url_from_html_table(wiki_bestseller_tables_relevant[i])
  author_df = bind_rows(author_df, result)
}

author_df$author_links
  [1] "https://en.wikipedia.org/wiki/Charles_Dickens"                                  
  [2] "https://en.wikipedia.org/wiki/Antoine_de_Saint-Exup%C3%A9ry"                    
  [3] "https://en.wikipedia.org/wiki/Paulo_Coelho"                                     
  [4] "https://en.wikipedia.org/wiki/J._K._Rowling"                                    
  [5] "https://en.wikipedia.org/wiki/Agatha_Christie"                                  
  [6] "https://en.wikipedia.org/wiki/Cao_Xueqin"                                       
  [7] "https://en.wikipedia.org/wiki/J._R._R._Tolkien"                                 
  [8] "https://en.wikipedia.org/wiki/Lewis_Carroll"                                    
  [9] "https://en.wikipedia.org/wiki/H._Rider_Haggard"                                 
 [10] "https://en.wikipedia.org/wiki/Dan_Brown"                                        
 [11] "https://en.wikipedia.org/wiki/J._K._Rowling"                                    
 [12] "https://en.wikipedia.org/wiki/J._D._Salinger"                                   
 [13] "https://en.wikipedia.org/wiki/Robert_James_Waller"                              
 [14] "https://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez"                 
 [15] "https://en.wikipedia.org/wiki/Vladimir_Nabokov"                                 
 [16] "https://en.wikipedia.org/wiki/Johanna_Spyri"                                    
 [17] "https://en.wikipedia.org/wiki/Benjamin_Spock"                                   
 [18] "https://en.wikipedia.org/wiki/Lucy_Maud_Montgomery"                             
 [19] "https://en.wikipedia.org/wiki/Anna_Sewell"                                      
 [20] "https://en.wikipedia.org/wiki/Umberto_Eco"                                      
 [21] "https://en.wikipedia.org/wiki/Jack_Higgins"                                     
 [22] "https://en.wikipedia.org/wiki/Richard_Adams"                                    
 [23] "https://en.wikipedia.org/wiki/Shere_Hite"                                       
 [24] "https://en.wikipedia.org/wiki/E._B._White"                                      
 [25] "https://en.wikipedia.org/wiki/J._P._Donleavy"                                   
 [26] "https://en.wikipedia.org/wiki/Rick_Warren"                                      
 [27] "https://en.wikipedia.org/wiki/Beatrix_Potter"                                   
 [28] "https://en.wikipedia.org/wiki/Richard_Bach"                                     
 [29] "https://en.wikipedia.org/wiki/Eric_Carle"                                       
 [30] "https://en.wikipedia.org/wiki/Elbert_Hubbard"                                   
 [31] "https://en.wikipedia.org/wiki/Harper_Lee"                                       
 [32] "https://en.wikipedia.org/wiki/V._C._Andrews"                                    
 [33] "https://en.wikipedia.org/wiki/Carl_Sagan"                                       
 [34] "https://en.wikipedia.org/wiki/Jostein_Gaarder"                                  
 [35] "https://en.wikipedia.org/wiki/Dan_Brown"                                        
 [36] "https://en.wikipedia.org/wiki/Bill_W."                                          
 [37] "https://en.wikipedia.org/wiki/Jeffrey_Archer"                                   
 [38] "https://en.wikipedia.org/wiki/Nikolai_Ostrovsky"                                
 [39] "https://en.wikipedia.org/wiki/Leo_Tolstoy"                                      
 [40] "https://en.wikipedia.org/wiki/Carlo_Collodi"                                    
 [41] "https://en.wikipedia.org/wiki/Anne_Frank"                                       
 [42] "https://en.wikipedia.org/wiki/Wayne_Dyer"                                       
 [43] "https://en.wikipedia.org/wiki/Colleen_McCullough"                               
 [44] "https://en.wikipedia.org/wiki/Khaled_Hosseini"                                  
 [45] "https://en.wikipedia.org/wiki/Jacqueline_Susann"                                
 [46] "https://en.wikipedia.org/wiki/Dale_Carnegie"                                    
 [47] "https://en.wikipedia.org/wiki/F._Scott_Fitzgerald"                              
 [48] "https://en.wikipedia.org/wiki/Margaret_Mitchell"                                
 [49] "https://en.wikipedia.org/wiki/Daphne_du_Maurier"                                
 [50] "https://en.wikipedia.org/wiki/William_Bradford_Huie"                            
 [51] "https://en.wikipedia.org/wiki/Stieg_Larsson"                                    
 [52] "https://en.wikipedia.org/wiki/Dan_Brown"                                        
 [53] "https://en.wikipedia.org/wiki/Suzanne_Collins"                                  
 [54] "https://en.wikipedia.org/wiki/Roald_Dahl"                                       
 [55] "https://en.wikipedia.org/wiki/Alexander_Alexandrovich_Fadeyev"                  
 [56] "https://en.wikipedia.org/wiki/Spencer_Johnson_(writer)"                         
 [57] "https://en.wikipedia.org/wiki/Stephen_Hawking"                                  
 [58] "https://en.wikipedia.org/wiki/Jacques-Henri_Bernardin_de_Saint-Pierre"          
 [59] "https://en.wikipedia.org/wiki/Irving_Stone"                                     
 [60] "https://en.wikipedia.org/wiki/Kenneth_Grahame"                                  
 [61] "https://en.wikipedia.org/wiki/Stephen_R._Covey"                                 
 [62] "https://en.wikipedia.org/wiki/Tetsuko_Kuroyanagi"                               
 [63] "https://en.wikipedia.org/wiki/Mikhail_Sholokhov"                                
 [64] "https://en.wikipedia.org/wiki/James_Redfield"                                   
 [65] "https://en.wikipedia.org/wiki/John_Green"                                       
 [66] "https://en.wikipedia.org/wiki/Paula_Hawkins_(author)"                           
 [67] "https://en.wikipedia.org/wiki/William_P._Young"                                 
 [68] "https://en.wikipedia.org/wiki/Sergey_Mikhalkov"                                 
 [69] "https://en.wikipedia.org/wiki/Mario_Puzo"                                       
 [70] "https://en.wikipedia.org/wiki/Erich_Segal"                                      
 [71] "https://en.wikipedia.org/wiki/Suzanne_Collins"                                  
 [72] "https://en.wikipedia.org/wiki/Suzanne_Collins"                                  
 [73] "https://en.wikipedia.org/wiki/Banana_Yoshimoto"                                 
 [74] "https://en.wikipedia.org/wiki/Ivan_Yefremov"                                    
 [75] "https://en.wikipedia.org/wiki/Gillian_Flynn"                                    
 [76] "https://en.wikipedia.org/wiki/Charles_Berlitz"                                  
 [77] "https://en.wikipedia.org/wiki/Chinua_Achebe"                                    
 [78] "https://en.wikipedia.org/wiki/Jiang_Rong"                                       
 [79] "https://en.wikipedia.org/wiki/Xaviera_Hollander"                                
 [80] "https://en.wikipedia.org/wiki/Peter_Benchley"                                   
 [81] "https://en.wikipedia.org/wiki/Robert_Munsch"                                    
 [82] "https://en.wikipedia.org/wiki/Marilyn_French"                                   
 [83] "https://en.wikipedia.org/wiki/Arlene_Eisenberg"                                 
 [84] "https://en.wikipedia.org/wiki/Mark_Twain"                                       
 [85] "https://en.wikipedia.org/wiki/Sue_Townsend"                                     
 [86] "https://en.wikipedia.org/wiki/Jane_Austen"                                      
 [87] "https://en.wikipedia.org/wiki/Thor_Heyerdahl"                                   
 [88] "https://en.wikipedia.org/wiki/Jaroslav_Ha%C5%A1ek"                              
 [89] "https://en.wikipedia.org/wiki/Maurice_Sendak"                                   
 [90] "https://en.wikipedia.org/wiki/Norman_Vincent_Peale"                             
 [91] "https://en.wikipedia.org/wiki/Rhonda_Byrne"                                     
 [92] "https://en.wikipedia.org/wiki/Erica_Jong"                                       
 [93] "https://en.wikipedia.org/wiki/Frank_Herbert"                                    
 [94] "https://en.wikipedia.org/wiki/Roald_Dahl"                                       
 [95] "https://en.wikipedia.org/wiki/Desmond_Morris"                                   
 [96] "https://en.wikipedia.org/wiki/Natsume_S%C5%8Dseki"                              
 [97] "https://en.wikipedia.org/wiki/Delia_Owens"                                      
 [98] "https://en.wikipedia.org/wiki/Susanna_Tamaro"                                   
 [99] "https://en.wikipedia.org/wiki/Roald_Dahl"                                       
[100] "https://en.wikipedia.org/wiki/Markus_Zusak"                                     
[101] "https://en.wikipedia.org/wiki/Nicholas_Evans"                                   
[102] "https://en.wikipedia.org/wiki/Margaret_Wise_Brown"                              
[103] "https://en.wikipedia.org/wiki/Michael_Ende"                                     
[104] "https://en.wikipedia.org/wiki/Anthony_Doerr"                                    
[105] "https://en.wikipedia.org/wiki/E._L._James"                                      
[106] "https://en.wikipedia.org/wiki/S._E._Hinton"                                     
[107] "https://en.wikipedia.org/wiki/Sam_McBratney"                                    
[108] "https://en.wikipedia.org/wiki/James_Clavell"                                    
[109] "https://en.wikipedia.org/wiki/Janette_Sebring_Lowrey"                           
[110] "https://en.wikipedia.org/wiki/Ken_Follett"                                      
[111] "https://en.wikipedia.org/wiki/Patrick_S%C3%BCskind"                             
[112] "https://en.wikipedia.org/wiki/John_Steinbeck"                                   
[113] "https://en.wikipedia.org/wiki/Carlos_Ruiz_Zaf%C3%B3n"                           
[114] "https://en.wikipedia.org/wiki/Jhumpa_Lahiri"                                    
[115] "https://en.wikipedia.org/wiki/Michelle_Obama"                                   
[116] "https://en.wikipedia.org/wiki/Douglas_Adams"                                    
[117] "https://en.wikipedia.org/wiki/Mitch_Albom"                                      
[118] "https://en.wikipedia.org/wiki/Erskine_Caldwell"                                 
[119] "https://en.wikipedia.org/wiki/Madeleine_L%27Engle"                              
[120] "https://en.wikipedia.org/wiki/Nelson_Mandela"                                   
[121] "https://en.wikipedia.org/wiki/Ernest_Hemingway"                                 
[122] "https://en.wikipedia.org/wiki/Raymond_Moody"                                    
[123] "https://en.wikipedia.org/wiki/Michael_Ende"                                     
[124] "https://en.wikipedia.org/wiki/Grace_Metalious"                                  
[125] "https://en.wikipedia.org/wiki/Lois_Lowry"                                       
[126] "https://en.wikipedia.org/wiki/Jojo_Moyes"                                       
[127] "https://en.wikipedia.org/wiki/Haruki_Murakami"                                  
[128] "https://en.wikipedia.org/wiki/Albert_Camus"                                     
[129] "https://en.wikipedia.org/wiki/Osamu_Dazai"                                      
[130] "https://en.wikipedia.org/wiki/Viktor_Frankl"                                    
[131] "https://en.wikipedia.org/wiki/Mark_Manson"                                      
[132] "https://en.wikipedia.org/wiki/Dante_Alighieri"                                  
[133] "https://en.wikipedia.org/wiki/Kahlil_Gibran"                                    
[134] "https://en.wikipedia.org/wiki/John_Boyne"                                       
[135] "https://en.wikipedia.org/wiki/William_Peter_Blatty"                             
[136] "https://en.wikipedia.org/wiki/Julia_Donaldson"                                  
[137] "https://en.wikipedia.org/wiki/E._L._James"                                      
[138] "https://en.wikipedia.org/wiki/Erskine_Caldwell"                                 
[139] "https://en.wikipedia.org/wiki/Astrid_Lindgren"                                  
[140] "https://en.wikipedia.org/wiki/Dr._Seuss"                                        
[141] "https://en.wikipedia.org/wiki/Andrew_Morton_(writer)"                           
[142] "https://en.wikipedia.org/wiki/Kathryn_Stockett"                                 
[143] "https://en.wikipedia.org/wiki/Joseph_Heller"                                    
[144] "https://en.wikipedia.org/wiki/Albert_Camus"                                     
[145] "https://en.wikipedia.org/wiki/Ken_Follett"                                      
[146] "https://en.wikipedia.org/wiki/Alice_Sebold"                                     
[147] "https://en.wikipedia.org/wiki/Jung_Chang"                                       
[148] "https://en.wikipedia.org/wiki/Tom%C3%A1s_Eloy_Mart%C3%ADnez"                    
[149] "https://en.wikipedia.org/wiki/Elie_Wiesel"                                      
[150] "https://en.wikipedia.org/wiki/Yu_Dan_(academic)"                                
[151] "https://en.wikipedia.org/wiki/Marabel_Morgan"                                   
[152] "https://en.wikipedia.org/w/index.php?title=Taichi_Sakaiya&action=edit&redlink=1"
[153] "https://en.wikipedia.org/wiki/Xue_Muqiao"                                       
[154] "https://en.wikipedia.org/wiki/Richard_Nelson_Bolles"                            
[155] "https://en.wikipedia.org/wiki/Pierre_Dukan"                                     
[156] "https://en.wikipedia.org/wiki/Alex_Comfort"                                     
[157] "https://en.wikipedia.org/wiki/Robert_L._Short"                                  
[158] "https://en.wikipedia.org/wiki/Yann_Martel"                                      
[159] "https://en.wikipedia.org/wiki/Patricia_Nell_Warren"                             
[160] "https://en.wikipedia.org/wiki/Eliyahu_M._Goldratt"                              
[161] "https://en.wikipedia.org/wiki/Ray_Bradbury"                                     
[162] "https://en.wikipedia.org/wiki/Frank_McCourt"                                    
[163] "https://en.wikipedia.org/wiki/Mohandas_Karamchand_Gandhi"                       
[164] "https://en.wikipedia.org/wiki/Helen_Fielding"                                   
[165] "https://en.wikipedia.org/wiki/Colleen_Hoover"                                   
[166] "https://en.wikipedia.org/wiki/J._K._Rowling"                                    
[167] "https://en.wikipedia.org/wiki/R._L._Stine"                                      
[168] "https://en.wikipedia.org/wiki/Erle_Stanley_Gardner"                             
[169] "https://en.wikipedia.org/wiki/Jeff_Kinney_(writer)"                             
[170] "https://en.wikipedia.org/wiki/Stan_and_Jan_Berenstain"                          
[171] "https://en.wikipedia.orgNA"                                                     
[172] "https://en.wikipedia.org/wiki/Francine_Pascal"                                  
[173] "https://en.wikipedia.org/wiki/Wilbert_Awdry"                                    
[174] "https://en.wikipedia.org/wiki/Enid_Blyton"                                      
[175] "https://en.wikipedia.org/wiki/Carolyn_Keene"                                    
[176] "https://en.wikipedia.org/wiki/Fr%C3%A9d%C3%A9ric_Dard"                          
[177] "https://en.wikipedia.org/wiki/Dan_Brown"                                        
[178] "https://en.wikipedia.org/wiki/Elisabetta_Dami"                                  
[179] "https://en.wikipedia.org/wiki/Rick_Riordan"                                     
[180] "https://en.wikipedia.org/wiki/Ann_M._Martin"                                    
[181] "https://en.wikipedia.orgNA"                                                     
[182] "https://en.wikipedia.org/wiki/Stephenie_Meyer"                                  
[183] "https://en.wikipedia.orgNA"                                                     
[184] "https://en.wikipedia.org/wiki/Mercer_Mayer"                                     
[185] "https://en.wikipedia.org/wiki/Beatrix_Potter"                                   
[186] "https://en.wikipedia.org/wiki/E._L._James"                                      
[187] "https://en.wikipedia.org/wiki/Jack_Canfield"                                    
[188] "https://en.wikipedia.org/wiki/Norman_Bridwell"                                  
[189] "https://en.wikipedia.org/wiki/Gilbert_Patten"                                   
[190] "https://en.wikipedia.org/wiki/Clive_Cussler"                                    
[191] "https://en.wikipedia.org/wiki/Eiji_Yoshikawa"                                   
[192] "https://en.wikipedia.org/wiki/C._S._Lewis"                                      
[193] "https://en.wikipedia.org/wiki/Roger_Hargreaves"                                 
[194] "https://en.wikipedia.org/wiki/G%C3%A9rard_de_Villiers"                          
[195] "https://en.wikipedia.org/wiki/Suzanne_Collins"                                  
[196] "https://en.wikipedia.org/wiki/Ian_Fleming"                                      
[197] "https://en.wikipedia.org/wiki/Gilbert_Delahaye"                                 
[198] "https://en.wikipedia.org/wiki/Stieg_Larsson"                                    
[199] "https://en.wikipedia.org/wiki/George_R._R._Martin"                              
[200] "https://en.wikipedia.org/wiki/Robert_Jordan"                                    
[201] "https://en.wikipedia.org/wiki/Terry_Pratchett"                                  
[202] "https://en.wikipedia.org/wiki/Dick_Bruna"                                       
[203] "https://en.wikipedia.org/wiki/James_Patterson"                                  
[204] "https://en.wikipedia.org/wiki/Takashi_Yanase"                                   
[205] "https://en.wikipedia.org/wiki/Dav_Pilkey"                                       
[206] "https://en.wikipedia.org/wiki/R._L._Stine"                                      
[207] "https://en.wikipedia.org/wiki/Astrid_Lindgren"                                  
[208] "https://en.wikipedia.org/wiki/Anne_Rice"                                        
[209] "https://en.wikipedia.org/wiki/Jean_Bruce"                                       
[210] "https://en.wikipedia.org/wiki/A._A._Milne"                                      
[211] "https://en.wikipedia.org/wiki/Mary_Pope_Osborne"                                
[212] "https://en.wikipedia.org/wiki/Tim_LaHaye"                                       
[213] "https://en.wikipedia.org/wiki/Lemony_Snicket"                                   
[214] "https://en.wikipedia.org/wiki/Laura_Ingalls_Wilder"                             
[215] "https://en.wikipedia.org/wiki/James_Herriot"                                    
[216] "https://en.wikipedia.org/wiki/Lee_Child"                                        
[217] "https://en.wikipedia.org/wiki/Joanna_Cole_(author)"                             
[218] "https://en.wikipedia.org/wiki/Martin_Handford"                                  
[219] "https://en.wikipedia.org/wiki/John_Gray_(U.S._author)"                          
[220] "https://en.wikipedia.org/wiki/Franklin_W._Dixon"                                
[221] "https://en.wikipedia.org/wiki/Laura_Lee_Hope"                                   
[222] "https://en.wikipedia.org/wiki/Edgar_Rice_Burroughs"                             
[223] "https://en.wikipedia.org/wiki/Cassandra_Clare"                                  
[224] "https://en.wikipedia.org/wiki/Jean_M._Auel"                                     
[225] "https://en.wikipedia.orgNA"                                                     
[226] "https://en.wikipedia.org/wiki/Barbara_Park"                                     
[227] "https://en.wikipedia.org/wiki/Michael_Connelly"                                 
[228] "https://en.wikipedia.org/wiki/Jo_Nesb%C3%B8"                                    
[229] "https://en.wikipedia.org/wiki/Erin_Hunter"                                      
[230] "https://en.wikipedia.org/w/index.php?title=Liu_Zhixia&action=edit&redlink=1"    
[231] "https://en.wikipedia.org/wiki/Yutaka_Hara"                                      
[232] "https://en.wikipedia.org/wiki/Michael_Bond"                                     
[233] "https://en.wikipedia.org/wiki/K_A_Applegate"                                    
[234] "https://en.wikipedia.org/wiki/Veronica_Roth"                                    
[235] "https://en.wikipedia.org/wiki/Sachiko_Kiyono"                                   
[236] "https://en.wikipedia.org/wiki/Kaoru_Kurimoto"                                   
[237] "https://en.wikipedia.org/wiki/Christopher_Paolini"                              
[238] "https://en.wikipedia.org/wiki/Robert_Kiyosaki"                                  
[239] "https://en.wikipedia.org/wiki/Kazuma_Kamachi"                                   
[240] "https://en.wikipedia.org/wiki/S%C5%8Dhachi_Yamaoka"                             
[241] "https://en.wikipedia.org/wiki/Beverly_Cleary"                                   
[242] "https://en.wikipedia.org/wiki/Stephen_King"                                     
[243] "https://en.wikipedia.org/wiki/Rachel_Renee_Russell"                             
[244] "https://en.wikipedia.org/wiki/Warren_Murphy"                                    
[245] "https://en.wikipedia.org/wiki/Liu_Cixin"                                        
[246] "https://en.wikipedia.org/wiki/Jir%C5%8D_Akagawa"                                
[247] "https://en.wikipedia.orgcharacter(0)"                                           
[248] "https://en.wikipedia.org/wiki/Terry_Brooks"                                     
[249] "https://en.wikipedia.org/wiki/Henning_Mankell"                                  
[250] "https://en.wikipedia.org/wiki/Margit_Sandemo"                                   
[251] "https://en.wikipedia.org/wiki/Terry_Goodkind"                                   
[252] "https://en.wikipedia.org/wiki/Diana_Gabaldon"                                   
[253] "https://en.wikipedia.org/wiki/Masamoto_Nasu"                                    
[254] "https://en.wikipedia.org/wiki/Sh%C5%8Dtar%C5%8D_Ikenami"                        
[255] "https://en.wikipedia.orgcharacter(0)"                                           
[256] "https://en.wikipedia.org/wiki/Arthur_Agatston"                                  
[257] "https://en.wikipedia.org/wiki/Reki_Kawahara"                                    
[258] "https://en.wikipedia.org/wiki/Ry%C5%8Dtar%C5%8D_Shiba"                          
[259] "https://en.wikipedia.org/wiki/Eoin_Colfer"                                      
[260] "https://en.wikipedia.org/wiki/Brandon_Sanderson"                                
[261] "https://en.wikipedia.org/wiki/Frank_Herbert"                                    
[262] "https://en.wikipedia.org/wiki/Lauren_Tarshis"                                   
[263] "https://en.wikipedia.orgcharacter(0)"                                           
[264] "https://en.wikipedia.org/wiki/Brian_Jacques"                                    
[265] "https://en.wikipedia.org/wiki/Lucy_Cousins"                                     
[266] "https://en.wikipedia.orgcharacter(0)"                                           
[267] "https://en.wikipedia.orgcharacter(0)"                                           
[268] "https://en.wikipedia.org/wiki/Hiroyuki_Itsuki"                                  
[269] "https://en.wikipedia.org/wiki/Hajime_Kanzaka"                                   
[270] "https://en.wikipedia.org/wiki/Isaac_Asimov"                                     
[271] "https://en.wikipedia.org/wiki/Terry_Deary"                                      
[272] "https://en.wikipedia.org/wiki/Daisy_Meadows"                                    
[273] "https://en.wikipedia.org/wiki/Louis_Masterson"                                  
[274] "https://en.wikipedia.org/wiki/Charlaine_Harris"                                 
[275] "https://en.wikipedia.org/wiki/Lester_Dent"                                      
[276] "https://en.wikipedia.org/wiki/Nagaru_Tanigawa"                                  
[277] "https://en.wikipedia.orgNA"                                                     
[278] "https://en.wikipedia.orgNA"                                                     
[279] "https://en.wikipedia.org/wiki/Shotaro_Ikenami"                                  
[280] "https://en.wikipedia.org/wiki/Boris_Akunin"                                     
[281] "https://en.wikipedia.org/wiki/Anne_McCaffrey"                                   
[282] "https://en.wikipedia.org/wiki/Hideyuki_Kikuchi"                                 
[283] "https://en.wikipedia.org/wiki/Douglas_Adams"                                    
[284] "https://en.wikipedia.org/w/index.php?title=Osamu_Soda&action=edit&redlink=1"    
[285] "https://en.wikipedia.org/wiki/Helen_Fielding"                                   
[286] "https://en.wikipedia.org/wiki/Philip_Pullman"                                   
[287] "https://en.wikipedia.org/wiki/Yoshiki_Tanaka"                                   
[288] "https://en.wikipedia.org/wiki/Alexander_McCall_Smith"                           
[289] "https://en.wikipedia.org/wiki/Marcus_Pfister"                                   
[290] "https://en.wikipedia.org/wiki/Raymond_E._Feist"                                 
[291] "https://en.wikipedia.org/wiki/Timothy_Zahn"                                     
[292] "https://en.wikipedia.org/wiki/Andrzej_Sapkowski"                                
[293] "https://en.wikipedia.org/w/index.php?title=Kazuo_Iwamura&action=edit&redlink=1" 
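
Not every entry in the output above is a usable author page: some end in "orgNA" or "orgcharacter(0)", some are red links to pages that do not exist yet, and many authors appear more than once. A minimal sketch of cleaning the vector before scraping each author page, assuming base R only, is given below. The vector example_links is a small hand-copied sample of the output and the other variable names are hypothetical.

```r
# A small sample of the links printed above (copied by hand for illustration)
example_links = c(
  "https://en.wikipedia.org/wiki/Charles_Dickens",
  "https://en.wikipedia.org/wiki/Dan_Brown",
  "https://en.wikipedia.org/wiki/Dan_Brown",
  "https://en.wikipedia.orgNA",
  "https://en.wikipedia.orgcharacter(0)",
  "https://en.wikipedia.org/w/index.php?title=Taichi_Sakaiya&action=edit&redlink=1"
)

# entries where no link was found in the table
is_missing = grepl("org(NA|character\\(0\\))$", example_links)

# red links point to Wikipedia pages that have not been written
is_redlink = grepl("redlink=1", example_links, fixed = TRUE)

# keep each usable author page once
usable_links = unique(example_links[!is_missing & !is_redlink])
usable_links
```

It is worth counting how many entries are dropped at this step, since authors without a page are missing from any later analysis of nationalities.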

What have we learnt

A few thoughts

Reflection

  • Many ways to approach this problem
    e.g. RSelenium, LLMs, APIs
  • Here we used the standard rvest toolbox
  • Analysis was not perfect - the inconsistent webpage format was a real challenge!
  • Also not ideal to be guessing nationalities from text
  • But we don’t need to approach a problem perfectly to learn - break stuff, create errors, try and fail!

Motivation

We often learn best by picking a question we are passionate about and having some fun!

A few thoughts ctd.

For assignment 4 I will ask you to answer a question of your choosing using open data. You can web scrape or you can use existing open data sets.

Caution

I showed you this example, because I wanted you to know:

  1. Sometimes simple questions can be difficult to answer

  2. It’s important to evaluate whether your data is fit for purpose early on

  3. It’s good to check the logic of your approach by writing pseudo code

Summary

What we’ve learnt

  • Learnt how to navigate around a web page
  • Learnt different ways to scrape data using the rvest package
  • Learnt about important ethical considerations when web scraping
  • Also wrote our first pseudo code
  • Covered the basics of automation using functions
  • Practiced some basic error handling
  • Hopefully also had fun!

Questions