class: middle center hide-slide-number monash-bg-gray80 .info-box.w-50.bg-white[ These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-10.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>. ] <br> .white[Press the **right arrow** to progress to the next slide!] --- class: title-slide count: false background-image: url("images/bg-12.png") # .monash-blue[ETC5512: Wild Caught Data] <h1 class="monash-blue" style="font-size: 30pt!important;"></h1> <br> <h2 style="font-weight:900!important;">Introduction to web scraping</h2> .bottom_abs.width100[ Lecturer: *Kate Saunders* Department of Econometrics and Business Statistics <i class="fas fa-envelope"></i> ETC5512.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 12 <br> ] --- ## Motivation <center> <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The most important thing I've ever done for learning R is to paradoxically stop "learning" (e.g. classes and problem sets) and start doing. Take a problem you have at work or school or a dataset you find interesting and get to work. Then write it up and post on githib or a blog. <a href="https://t.co/9aQZJRrlFK">https://t.co/9aQZJRrlFK</a></p>— We are R-Ladies (@WeAreRLadies) <a href="https://twitter.com/WeAreRLadies/status/1110229736956067840?ref_src=twsrc%5Etfw">March 25, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> --- ## After PhD - I wanted to read more! <br><br><br><center> <iframe src="https://giphy.com/embed/WoWm8YzFQJg5i" width="480" height="351" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/cartoons-comics-sea-reading-WoWm8YzFQJg5i"></a></p> </center> --- ## Question <br><br><br><center> .idea-box.tl.w-70[ How should I choose what to read? Am I reading diversely? 
] </center> --- ## Picking from lists <center> <img src="images/lecture-10/Books1001.jpg" width = "50%"> </center> --- ## Questions for our data <br><br><br><center> .info-box.tl.w-70[ Are lists like this biased? ] </center> --- class: center middle bg-gray .aim-box.tl.w-70[ Today you will: - Get to know the basics of webpages - Look at some examples of webscraping - Get some data to answer my questions ] -- .aim-box.tl.w-70[ Coding Perspective: - Learn how to read data from a webpage into R - Do more string manipulation - Learn about automating a scraper ] --- ## Packages we need `rvest` is the [R package](https://rvest.tidyverse.org/articles/rvest.html) that we'll need to get started with learning the 101 of web scraping ```r library(rvest) library(tidyverse) ``` * Good for static webpages --- class: transition # Webpage basics --- ## Go to a webpage https://en.wikipedia.org/wiki/Tim_Winton <center> <img src="images/lecture-10/wiki_TimWinton.png" width = "100%"> </center> --- ## View html code in Chrome * Right click the part of the page you want * Select inspect <center> <img src="images/lecture-10/inspect.png" width = "100%"> </center> --- ## Html code * Brings up the html code * Highlights the piece of html code related to your click * Hover over html code to see other features of the web page <center> <img src="images/lecture-10/inspect_panel.png" width = "90%"> </center> --- ## Inspect button * Similarly, click the top left button in the side panel * Explore related features of the webpage and html code <center> <img src="images/lecture-10/inspect_button.png" width = "50%"> </center> --- ## Basic html types By browsing you observe the basic **structure** of html webpages <br> Opening and closing [tags](https://www.w3schools.com/tags/) wrapped around content to define its purpose and appearance on a webpage. e.g.
< tag \> lorem ipsum text < /tag \> <br> Some basic **tag types** are: div - Division or section table - Table p - Paragraph elements h - Heading --- ## Read a webpage ```r library(rvest) author_url <- "https://en.wikipedia.org/wiki/Tim_Winton" wiki_data <- read_html(author_url) # Read the webpage into R str(wiki_data) ``` ``` ## List of 2 ## $ node:<externalptr> ## $ doc :<externalptr> ## - attr(*, "class")= chr [1:2] "xml_document" "xml_node" ``` ```r wiki_data ``` ``` ## {html_document} ## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ... ## [2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ... ``` --- ## How to scrape a table - html_table() So we can read data from the website into R, but we need the data in a form we can use.
```r table_data <- wiki_data |> rvest::html_table(header = FALSE) #Get all tables on the webpage length(table_data) ``` ``` ## [1] 5 ``` ```r table_data[[1]] ``` ``` ## # A tibble: 9 × 2 ## X1 X2 ## <chr> <chr> ## 1 Tim WintonAO Tim WintonAO ## 2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath… ## 3 Born Timothy John Winton4 August 19… ## 4 Occupation Novelist ## 5 Nationality Australian ## 6 Period 1982–present ## 7 Genre Literature, children's, non-fi… ## 8 Notable works Cloudstreet Dirt Music Breath … ## 9 Notable awards Miles Franklin 1984, 1992, 20… ``` --- ## Other approaches - html_nodes() ```r table_data_eg1 <- wiki_data |> rvest::html_nodes("table") |> # get all the nodes of type table purrr::pluck(1) |> #pull out the first one rvest::html_table(header = FALSE) #convert it to table type table_data_eg1 ``` ``` ## # A tibble: 9 × 2 ## X1 X2 ## <chr> <chr> ## 1 Tim WintonAO Tim WintonAO ## 2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath… ## 3 Born Timothy John Winton4 August 19… ## 4 Occupation Novelist ## 5 Nationality Australian ## 6 Period 1982–present ## 7 Genre Literature, children's, non-fi… ## 8 Notable works Cloudstreet Dirt Music Breath … ## 9 Notable awards Miles Franklin 1984, 1992, 20… ``` --- ## Other approaches - html_node() Lots of functions in `rvest` give you the option to return the first match or to return all matches. 
```r table_data_eg2 <- wiki_data |> rvest::html_node("table") |> # just get the first table match rvest::html_table(header = FALSE) #convert it to table type table_data_eg2 ``` ``` ## # A tibble: 9 × 2 ## X1 X2 ## <chr> <chr> ## 1 Tim WintonAO Tim WintonAO ## 2 Winton at the launch of Breath in London, 2008 Winton at the launch of Breath… ## 3 Born Timothy John Winton4 August 19… ## 4 Occupation Novelist ## 5 Nationality Australian ## 6 Period 1982–present ## 7 Genre Literature, children's, non-fi… ## 8 Notable works Cloudstreet Dirt Music Breath … ## 9 Notable awards Miles Franklin 1984, 1992, 20… ``` --- ## Get the Nationality One more step - Need to get the nationality from the table ```r author_nationality = table_data_eg2 |> dplyr::rename(Category = X1, Response = X2) |> dplyr::filter(Category == "Nationality") |> dplyr::select(Response) |> as.character() author_nationality ``` ``` ## [1] "Australian" ``` -- <center> .idea-box.tl.w-50[ Now the real challenge: **Can we generalise?** ] --- # Breakout Session <center> .aim-box.tl.w-70[ Try it yourself time: - Pick an author and find their Wikipedia page - Explore the structure of the webpage - Download their infocard into R - Can you get their nationality from the infocard? ] </center> --- ## Let's try a different author "https://en.wikipedia.org/wiki/Jane_Austen" <center> <img src="images/lecture-10/wiki_JaneAusten.png" width = "100%"> </center> --- ## Generalise the web page ```r author_first_name = "Jane" author_last_name = "Austen" author_url <- paste("https://en.wikipedia.org/wiki/", author_first_name, "_", author_last_name, sep = "") wiki_data <- read_html(author_url) ``` --- ## Let's get that table ```r table_data <- wiki_data |> rvest::html_nodes(".infobox.vcard") |> #search for a class rvest::html_table(header = FALSE) |> purrr::pluck(1) head(table_data) ``` ``` ## # A tibble: 6 × 2 ## X1 X2 ## <chr> <chr> ## 1 Jane Austen Jane Austen ## 2 Portrait, c. 1810[a] Portrait, c.
1810[a] ## 3 Born (1775-12-16)16 December 1775Steventon Rectory, Hampshire… ## 4 Died 18 July 1817(1817-07-18) (aged 41)Winchester, Hampshire,… ## 5 Resting place Winchester Cathedral, Hampshire ## 6 Period 1787–1817 ``` --- ## Web scraping is tricky ```r table_data |> dplyr::select(1) |> unlist() |> as.vector() ``` ``` ## [1] "Jane Austen" "Portrait, c. 1810[a]" "Born" ## [4] "Died" "Resting place" "Period" ## [7] "Relatives" "Signature" "" ``` <center> .aim-box.tl.w-80[ **No nationality** category in Jane Austen's infocard ] -- .idea-box.tl.w-80[ Her nationality is in the webpage text. Let's scrape the nationality from there. ] --- ## Try another way ```r para_data <- wiki_data |> rvest::html_nodes("p") # get all the paragraphs head(para_data) ``` ``` ## {xml_nodeset (6)} ## [1] <p class="mw-empty-elt">\n\n\n\n</p> ## [2] <p><b>Jane Austen</b> (<span class="rt-commentedText nowrap"><span class= ... ## [3] <p>The anonymously published <i><a href="/wiki/Sense_and_Sensibility" tit ... ## [4] <p>Since her death Austen's novels have rarely been out of print. A signi ... ## [5] <p>The scant biographical information about Austen comes from her few sur ... ## [6] <p>The first Austen biography was <a href="/wiki/Henry_Thomas_Austen" tit ... ``` <center> .aim-box.tl.w-80[ But where exactly do we find her nationality in all this text? <br> Let's go back to exploring the webpage. ] </center> --- ## Get the text - html_text() ```r text_data <- para_data |> purrr::pluck(2) |> # get the second paragraph rvest::html_text() # convert the paragraph to text head(text_data) ``` ``` ## [1] "Jane Austen (/ˈɒstɪn, ˈɔːstɪn/ OST-in, AW-stin; 16 December 1775 – 18 July 1817) was an English novelist known primarily for her six novels, which implicitly interpret, critique, and comment upon the British landed gentry at the end of the 18th century. Austen's plots often explore the dependence of women on marriage for the pursuit of favourable social standing and economic security. 
Her works are an implicit critique of the novels of sensibility of the second half of the 18th century and are part of the transition to 19th-century literary realism.[2][b] Her deft use of social commentary, realism and biting irony have earned her acclaim among critics and scholars.\n" ``` -- <center> .aim-box.tl.w-80[ Let's look at two other ways we could do this. ]</center> --- ## Xpath Example * Right click html code, copy, copy Xpath <center> <img src="images/lecture-10/xpath.png" width = "100%"> </center> --- ## Using an Xpath ```r para_xpath = '//*[@id="mw-content-text"]/div/p[2]' text_data <- wiki_data |> rvest::html_nodes(xpath = para_xpath) |> rvest::html_text() text_data ``` ``` ## [1] "Jane Austen (/ˈɒstɪn, ˈɔːstɪn/ OST-in, AW-stin; 16 December 1775 – 18 July 1817) was an English novelist known primarily for her six novels, which implicitly interpret, critique, and comment upon the British landed gentry at the end of the 18th century. Austen's plots often explore the dependence of women on marriage for the pursuit of favourable social standing and economic security. Her works are an implicit critique of the novels of sensibility of the second half of the 18th century and are part of the transition to 19th-century literary realism.[2][b] Her deft use of social commentary, realism and biting irony have earned her acclaim among critics and scholars.\n" ``` --- ## JSpath Example * Right click html code, copy, copy JS path <center> <img src="images/lecture-10/jspath.png" width = "100%"> </center> --- ## Using CSS ID ```r para_css = "#mw-content-text > div > p:nth-child(5)" text_data <- wiki_data |> rvest::html_nodes(css = para_css) |> rvest::html_text() text_data ``` ``` ## [1] "Jane Austen (/ˈɒstɪn, ˈɔːstɪn/ OST-in, AW-stin; 16 December 1775 – 18 July 1817) was an English novelist known primarily for her six novels, which implicitly interpret, critique, and comment upon the British landed gentry at the end of the 18th century. 
Austen's plots often explore the dependence of women on marriage for the pursuit of favourable social standing and economic security. Her works are an implicit critique of the novels of sensibility of the second half of the 18th century and are part of the transition to 19th-century literary realism.[2][b] Her deft use of social commentary, realism and biting irony have earned her acclaim among critics and scholars.\n" ``` --- ## Text Analysis Still need to get her nationality, use `str_count` ```r possible_nationalities <- c("Australian", "Chinese", "Mexican", "English", "Ethiopian") # Do any of these nationalities appear in the text? count_values = str_count(text_data, possible_nationalities) count_values == TRUE # Which ones were matched ``` ``` ## [1] FALSE FALSE FALSE TRUE FALSE ``` ```r possible_nationalities[count_values == TRUE] #Get the matching nationalities ``` ``` ## [1] "English" ``` <center> .aim-box.tl.w-80[ - What do you think of my solution? - Any guesses why I didn't use `str_match`? ] </center> --- <br><br><center> .aim-box.tl.w-90[ ## Learnt so far - Know how to explore a web page with inspect - Know some basics about how to get data Also know: - Can be hard to generalise - Formats aren't always standard ]</center> --- class: transition # Back to the original question --- ## Need to get our list <center> <img src="images/lecture-10/book_list_url.png" width = "100%"> </center> --- ## Read the book list from a website ```r book_list_url <- "https://mizparker.wordpress.com/the-lists/1001-books-to-read-before-you-die/" paragraph_data <- read_html(book_list_url) |> # read the web page rvest::html_nodes("p") # get the paragraphs paragraph_data[1:12] ``` ``` ## {xml_nodeset (12)} ## [1] <p>This list has appeared in several places around the internet, and is ... ## [2] <p>If you would like to download a spreadsheet of the list and keep trac ... ## [3] <p><strong>21st Century:</strong></p> ## [4] <p>1. Never Let Me Go – Kazuo Ishiguro<br>\n2. 
Saturday – Ian McEwan<b ... ## [5] <p><strong>20th Century:</strong></p> ## [6] <p>70. Timbuktu – Paul Auster<br>\n71. The Romantics – Pankaj Mishra<br> ... ## [7] <p><strong>19th Century</strong></p> ## [8] <p>786. Some Experiences of an Irish R.M. – Somerville and Ross<br>\n787 ... ## [9] <p><strong>18th Century</strong></p> ## [10] <p>943. Hyperion – Friedrich Hölderlin<br>\n944. The Nun – Denis Diderot ... ## [11] <p><strong>Pre-1700</strong></p> ## [12] <p>989. Oroonoko – Aphra Behn<br>\n990. The Princess of Clèves – Marie-M ... ``` --- ## Get the book list from the paragraphs This list is in pieces, but the format seems mostly consistent ```r book_string <- paragraph_data |> purrr::pluck(4) |> # get the first part of the book list html_text(trim = TRUE) |> # convert it to text, remove excess white space str_replace_all("\n", "") head(book_string) ``` ``` ## [1] "1. Never Let Me Go – Kazuo Ishiguro2. Saturday – Ian McEwan3. On Beauty – Zadie Smith4. Slow Man – J.M. Coetzee5. Adjunct: An Undigest – Peter Manson6. The Sea – John Banville7. The Red Queen – Margaret Drabble8. The Plot Against America – Philip Roth9. The Master – Colm Toibin10. Vanishing Point – David Markson11. The Lambs of London – Peter Ackroyd12. Dining on Stones – Iain Sinclair13. Cloud Atlas – David Mitchell14. Drop City – T. Coraghessan Boyle15. The Colour – Rose Tremain16. Thursbitch – Alan Garner17. The Light of Day – Graham Swift18. What I Loved – Siri Hustvedt19. The Curious Incident of the Dog in the Night-Time – Mark Haddon20. Islands – Dan Sleigh21. Elizabeth Costello – J.M. Coetzee22. London Orbital – Iain Sinclair23. Family Matters – Rohinton Mistry24. Fingersmith – Sarah Waters25. The Double – Jose Saramago26. Everything is Illuminated – Jonathan Safran Foer27. Unless – Carol Shields28. Kafka on the Shore – Haruki Murakami29. The Story of Lucy Gault – William Trevor30. That They May Face the Rising Sun – John McGahern31. In the Forest – Edna O’Brien32. Shroud – John Banville33. 
Middlesex – Jeffrey Eugenides34. Youth – J.M. Coetzee35. Dead Air – Iain Banks36. Nowhere Man – Aleksandar Hemon37. The Book of Illusions – Paul Auster38. Gabriel’s Gift – Hanif Kureishi39. Austerlitz – W.G. Sebald40. Platform – Michael Houellebecq41. Schooling – Heather McGowan42. Atonement – Ian McEwan43. The Corrections – Jonathan Franzen44. Don’t Move – Margaret Mazzantini45. The Body Artist – Don DeLillo46. Fury – Salman Rushdie47. At Swim, Two Boys – Jamie O’Neill48. Choke – Chuck Palahniuk49. Life of Pi – Yann Martel50. The Feast of the Goat – Mario Vargos Llosa51. An Obedient Father – Akhil Sharma52. The Devil and Miss Prym – Paulo Coelho53. Spring Flowers, Spring Frost – Ismail Kadare54. White Teeth – Zadie Smith55. The Heart of Redness – Zakes Mda56. Under the Skin – Michel Faber57. Ignorance – Milan Kundera58. Nineteen Seventy Seven – David Peace59. Celestial Harmonies – Peter Esterhazy60. City of God – E.L. Doctorow61. How the Dead Live – Will Self62. The Human Stain – Philip Roth63. The Blind Assassin – Margaret Atwood64. After the Quake – Haruki Murakami65. Small Remedies – Shashi Deshpande66. Super-Cannes – J.G. Ballard67. House of Leaves – Mark Z. Danielewski68. Blonde – Joyce Carol Oates69. Pastoralia – George Saunders" ``` --- ## More string manipulations Web scraping often means string handling. We want to split the string by any numbers followed by a full stop. Careful: * don't want to split book titles with numbers, like Catch 22, * don't want to split authors with full stops, like J.R.R. Tolkien It's actually a bit tricky! -- <center> .aim-box.tl.w-90[ ### Resources: * Lecture 4 * stringr cheatsheet from RStudio * generative AI is also a great resource ] </center> --- ## Do some string handling ```r eg_string = "9. book - author 10. book - author" str_view(eg_string, "[:digit:]") #Match by any digit ``` ``` ## [1] │ <9>. book - author <1><0>. book - author ``` ```r str_view(eg_string, "[:digit:]+") #Match by one or more digits ``` ``` ## [1] │ <9>.
book - author <10>. book - author ``` ```r str_view(eg_string, "\\.") #Match by fullstop ``` ``` ## [1] │ 9<.> book - author 10<.> book - author ``` ```r str_view(eg_string, "[[:digit:]]+?\\.") #Match digits followed by a fullstop ``` ``` ## [1] │ <9.> book - author <10.> book - author ``` --- ## Get the list as a dataframe ```r books_df <- book_string |> str_split("[[:digit:]]+?\\.") |> as.data.frame(stringsAsFactors = FALSE) names(books_df) = "Book_Author" books_df = books_df |> dplyr::filter(Book_Author != "") |> # remove any empty rows tidyr::separate(Book_Author, sep = "\\–", into = c("book", "author")) # splits into two columns ``` --- ## Result ``` ## book author ## 1 Never Let Me Go Kazuo Ishiguro ## 2 Saturday Ian McEwan ## 3 On Beauty Zadie Smith ## 4 Slow Man J.M. Coetzee ## 5 Adjunct: An Undigest Peter Manson ## 6 The Sea John Banville ``` <center> .aim-box.tl.w-70[ #### We are very lucky! - Easily split our author and book into columns - Thanks to whoever coded this webpage using a long hash! ] </center> --- ## Need to repeat it: So wrap the code in a function ```r Convert_book_string_to_df <- function(para_ind, paragraph_data){ book_string <- paragraph_data |> purrr::pluck(para_ind) |> html_text(trim = TRUE) |> str_replace_all("\n", "") books_df <- book_string |> str_split("[[:digit:]]+?\\.") |> as.data.frame(stringsAsFactors = FALSE) names(books_df) = "Book_Author" books_df = books_df |> dplyr::filter(Book_Author != "") |> tidyr::separate(Book_Author, sep = "\\–", into = c("book", "author")) return(books_df)} ``` --- ## Get the final list Need to run this function for every second paragraph starting from number 4 until paragraph 12. ```r book_data <- lapply(seq(4,12,2) %>% as.list(), Convert_book_string_to_df, paragraph_data) %>% do.call(rbind, .)
%>% dplyr::mutate(author = str_trim(author)) nrow(book_data) # Has 1001 rows ``` ``` ## [1] 1001 ``` --- ## Check what it looks like Randomly pick 10 rows ``` ## book author ## 1 Le Père Goriot Honoré de Balzac ## 2 The Year of the Death of Ricardo Reis José Saramago ## 3 Aithiopika Heliodorus ## 4 Shroud John Banville ## 5 The Case of Comrade Tulayev Victor Serge ## 6 The 120 Days of Sodom Marquis de Sade ## 7 The Godfather Mario Puzo ## 8 The Drowned World J.G. Ballard ## 9 Go Down, Moses William Faulkner ## 10 Drop City T. Coraghessan Boyle ``` -- <center> .aim-box.tl.w-50[ # Still not done Need the nationalities of all the authors! ]</center> --- # Pseudo code Want a function to: 1. Read the wiki webpage using the author name. 2. Read the info card. 3. Get the nationality from the infocard. If no nationality or infocard, want a function to: 4. Find which html paragraphs have text in them. 5. Guess the nationality from the text. 6. A function that brings the above all together. --- ## More wrapping of code chunks Want a function to read the wiki webpage using the author name ```r Read_wiki_page <- function(author_name){ author_name_no_space = str_replace_all(author_name, "\\s+", "_") wiki_url <- paste("https://en.wikipedia.org/wiki/", author_name_no_space, sep = "") wiki_data <- read_html(wiki_url) return(wiki_data) } ``` --- ## More wrapping of code chunks Want a function to read the info card ```r Get_wiki_infocard <- function(wiki_data){ infocard <- wiki_data |> rvest::html_nodes(".infobox.vcard") |> rvest::html_table(header = FALSE) |> purrr::pluck(1) return(infocard) } ``` --- ## More wrapping of code chunks Want a function to get the nationality from the infocard ```r Get_nationality_from_infocard <- function(infocard){ nationality <- infocard %>% dplyr::rename(Category = X1, Response = X2) %>% dplyr::filter(Category == "Nationality") %>% dplyr::select(Response) %>% as.character() return(nationality) } ``` --- ## More wrapping of code chunks Need a function
to find which html paragraphs have text in them ```r Get_first_text <- function(wiki_data){ paragraph_data <- wiki_data %>% rvest::html_nodes("p") i = 1 no_text = TRUE while(no_text){ text_data <- paragraph_data %>% purrr::pluck(i) %>% rvest::html_text() check_text = gsub("\\s+", "", text_data) if(check_text == ""){ i = i + 1 }else{ no_text = FALSE } } return(text_data) } ``` --- ## More wrapping of code chunks Need another function to get the nationality from the text ```r Guess_nationality_from_text <- function(text_data, possible_nationalities){ num_matches <- str_count(text_data, possible_nationalities) prob_matches <- num_matches/sum(num_matches) i = which(prob_matches > 0) if(length(i) == 1){ prob_nationality = possible_nationalities[i] }else if(length(i) > 0){ warning(paste(c("More than one match for the nationality:", possible_nationalities[i], "\n"), collapse = " ")) match_locations = str_locate(text_data, possible_nationalities[i]) #gives locations of matches j = i[which.min(match_locations[,1])] prob_nationality = possible_nationalities[j] }else{ return(NA) } return(prob_nationality) } ``` --- ## More wrapping of code chunks One function that brings that all together ```r Query_nationality_from_wiki <- function(author_name, possible_nationalities){ wiki_data <- Read_wiki_page(author_name) infocard <- Get_wiki_infocard(wiki_data) if(is.null(infocard)){ # nationality <- "Missing infocard" first_paragraph <- Get_first_text(wiki_data) nationality <- Guess_nationality_from_text(first_paragraph, possible_nationalities) return(nationality) } if(any(infocard[,1] == "Nationality")){ # info card exists and has nationality nationality <- Get_nationality_from_infocard(infocard) }else{ # find nationality in text first_paragraph <- Get_first_text(wiki_data) nationality <- Guess_nationality_from_text(first_paragraph, possible_nationalities) } return(nationality) } ``` --- ## Examples ```r Query_nationality_from_wiki("Tim Winton", c("English", "British", "Australian")) 
``` ``` ## [1] "Australian" ``` ```r Query_nationality_from_wiki("Jane Austen", c("English", "British", "Australian")) ``` ``` ## [1] "English" ``` ```r Query_nationality_from_wiki("Zadie Smith", c("English", "British", "Australian")) ``` ``` ## [1] "English" ``` -- <center> .aim-box.tl.w-90[ ## Still not done We need a list of nationalities to search for in our text ]</center> --- ## What nationalities to search for? ```r # Get table of nationalities url <- "http://www.vocabulary.cl/Basic/Nationalities.htm" xpath <- "/html/body/div[1]/article/table[2]" nationalities_df <- url %>% read_html() %>% html_nodes(xpath = xpath) %>% html_table() %>% as.data.frame() possible_nationalities = nationalities_df[,2] head(possible_nationalities) ``` ``` ## [1] "Afghan" "Albanian" "Algerian" ## [4] "ArgentineArgentinian" "Australian" "Austrian" ``` --- ## Manual fixing ```r fix_entry = "ArgentineArgentinian" i0 = which(nationalities_df == fix_entry, arr.ind = TRUE) new_row = nationalities_df[i0[1], ] nationalities_df[i0] = "Argentine" new_row[,2] = "Argentinian" nationalities_df = rbind(nationalities_df, new_row) fix_footnote1 = "Colombia *" i1 = which(nationalities_df == fix_footnote1, arr.ind = TRUE) nationalities_df[i1] = strsplit(fix_footnote1, split = ' ')[[1]][1] fix_footnote2 = "American **" i2 = which(nationalities_df == fix_footnote2, arr.ind = TRUE) nationalities_df[i2] = strsplit(fix_footnote2, split = ' ')[[1]][1] possible_nationalities = nationalities_df[,2] saveRDS(possible_nationalities, "data/possible_nationalities.rds") ``` --- ## Get Nationalities ```r nationality_from_author = sapply(book_data$author[1:20], function(author_name){ nationality = tryCatch( # Just in case! Query_nationality_from_wiki(author_name, possible_nationalities), error = function(e) NA) }) %>% unlist() nationality_from_author ``` ``` ## Kazuo Ishiguro Ian McEwan ## "Japanese" "British" ## Zadie Smith J.M.
Coetzee ## "English" "South AfricanAustralian (since 2006)" ## Peter Manson John Banville ## "Scottish" "Irish" ## Margaret Drabble Philip Roth ## "English" "American" ## Colm Toibin David Markson ## "Irish" "American" ## Peter Ackroyd Iain Sinclair ## "British" "British" ## David Mitchell T. Coraghessan Boyle ## NA "American" ## Rose Tremain Alan Garner ## "English" "English" ## Graham Swift Siri Hustvedt ## "English" "American" ## Mark Haddon Dan Sleigh ## "English" "South African" ``` --- class: transition # How diversely do we read --- ## Run it! ```r nationality_from_author = sapply(book_data$author |> unique(), function(author_name){ print(author_name) nationality = tryCatch( # Just in case! Query_nationality_from_wiki(author_name, possible_nationalities), error = function(e) NA) }) author_nationality_df <- data.frame( author = book_data$author |> unique(), nationality = nationality_from_author) book_data_with_nationality <- book_data |> dplyr::left_join(author_nationality_df) head(book_data_with_nationality) saveRDS(book_data_with_nationality, "data/book_data.rds") ``` --- ## Result ``` ## # A tibble: 6 × 2 ## nationality count ## <chr> <int> ## 1 <NA> 106 ## 2 American 93 ## 3 English 87 ## 4 French 35 ## 5 British 31 ## 6 Irish 20 ``` --- ## Let's take a look ```r library(plotly) pie_plot <- table_nationalities %>% plot_ly(labels = ~nationality, values = ~count) %>% add_pie(hole = 0.6) %>% layout(title = "Nationalities", showlegend = F, xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)) ``` --- ## Plotting result ```r pie_plot ```
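One detail skipped above is how `table_nationalities` (the object passed to `plot_ly`) was built from `book_data_with_nationality`. A minimal sketch, using a toy stand-in data frame in place of the real 1001-row one, could be:

```r
library(dplyr)

# Toy stand-in for book_data_with_nationality (the real one comes
# from the scraping steps above and has one row per book)
book_data_with_nationality <- tibble::tibble(
  book        = c("A", "B", "C", "D"),
  nationality = c("English", "American", "English", NA)
)

# Tally books per nationality, most common first
table_nationalities <- book_data_with_nationality |>
  count(nationality, name = "count", sort = TRUE)

table_nationalities
```

On the real data, a tally like this produces the nationality/count table shown on the Result slide.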
--- ## A few thoughts <br><br> .aim-box.tl.w-70[ - Many ways to approach this problem - Approach here was to use the standard rvest toolbox - Not perfect - much-needed cleaning of nationality strings - A bit of guessing of nationalities ] --- ## What else could we have done - Can scrape more data from the Goodreads website - Goodreads has an API - Check out the repository by famguy/rgoodreads to get started - Using this API makes querying things like year or gender more straightforward - But Goodreads has no nationality field, so this solution is still useful! --- ## What else for webscraping * There are easier ways to answer this same question * Namely, RSelenium for pages with JavaScript * Learning the hard way can be good sometimes though! -- We should stop before we start scraping and think about whether we should. * Check the terms and conditions and terms of use * Look for a data licence * Consider ethics. Am I violating data privacy? * Be considerate of the volume of queries and the query rate limit * Can look at the robots.txt file for the website --- class: center middle bg-gray .aim-box.tl.w-70[ #Summary - We've learnt the basics of web scraping - Know how to scrape from a static website in R - Worked towards automating our web scraping - And we found out there are many challenges! ] --- class: transition ## Slides developed by Dr Kate Saunders --- background-size: cover class: title-slide background-image: url("images/bg-12.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
.bottom_abs.width100[ Lecturer: *Kate Saunders* Department of Econometrics and Business Statistics <i class="fas fa-envelope"></i> ETC5512.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 12 <br> ]