In this tutorial, you will learn how to scrape data from web pages into R using the rvest package.

Make sure to download the R package rvest to be ready for today's class.
Alone is an American survival competition series on History. It follows the self-documented daily struggles of 10 individuals (seven paired teams in season 4) as they survive alone in the wilderness for as long as possible using a limited amount of survival equipment. With the exception of medical check-ins, the participants are isolated from each other and all other humans. They may “tap out” at any time, or be removed due to failing a medical check-in. The contestant who remains the longest wins a grand prize of $500,000 (USD) (increasing to $1 million in season 7).
From the Alone Wikipedia article.
Visit the Wikipedia page for the Alone TV series and identify the basic elements that make up the page.
# First read the web page into R
library(rvest)

alone_url <- "https://en.wikipedia.org/wiki/Alone_(TV_series)"
alone_web_data <- read_html(alone_url)

# Collect all the table nodes on the page and count them
alone_tables <- alone_web_data |>
  html_nodes("table")
num_tables = length(alone_tables)
On the webpage there are 5 tables.
# Collect all the paragraph nodes on the page and count them
alone_paragraphs <- alone_web_data |>
  html_nodes("p")
num_paras = length(alone_paragraphs)
Counting the paragraph nodes in the same way gives `num_paras`, the number of paragraphs on the page.
# Pull out the second table (the series overview) and parse it into a data frame
table_series_winners = alone_tables |>
  purrr::pluck(2) |>
  html_table(header = TRUE)

library(kableExtra)
table_series_winners |>
  kbl() |>
  kable_styling()
Season | Subtitle | Location | Episodes | Episodes | Originally aired | Originally aired | Days Lasted | Winner(s) |
---|---|---|---|---|---|---|---|---|
Season | Subtitle | Location | Episodes | Episodes | First aired | Last aired | Days Lasted | Winner(s) |
1 | — | Port Hardy, British Columbia | 11 | 11 | June 18, 2015 (2015-06-18) | August 20, 2015 (2015-08-20) | 56 | Alan Kay |
2 | — | Port Hardy, British Columbia | 15 | 15 | April 21, 2016 (2016-04-21) | July 14, 2016 (2016-07-14) | 66 | David McIntyre |
3 | — | Patagonia, Argentina | 12 | 12 | December 8, 2016 (2016-12-08) | February 9, 2017 (2017-02-09) | 87 | Zachary Fowler |
4 | Lost & Found | Quatsino, British Columbia | 12 | 12 | June 15, 2017 (2017-06-15) | August 17, 2017 (2017-08-17) | 75 | Jim and Ted Baird |
5 | Redemption | Selenge Province, Mongolia | 12 | 12 | June 14, 2018 (2018-06-14) | August 16, 2018 (2018-08-16) | 60 | Sam Larson |
6 | The Arctic | Great Slave Lake, Northwest Territories | 11 | 11 | June 6, 2019 (2019-06-06) | August 22, 2019 (2019-08-22) | 77 | Jordan Jonas |
7 | Million Dollar Challenge | Great Slave Lake, Northwest Territories | 11 | 11 | June 11, 2020 (2020-06-11) | August 20, 2020 (2020-08-20) | 100 | Roland Welker |
8 | Grizzly Mountain | Chilko Lake, British Columbia | 11 | 11 | June 3, 2021 (2021-06-03) | August 19, 2021 (2021-08-19) | 74 | Clay Hayes |
9 | — | Nunatsiavut, Labrador | 11 | 11 | May 26, 2022 (2022-05-26) | August 4, 2022 (2022-08-04) | 78 | Juan Pablo Quiñonez |
10 | — | Reindeer Lake, Saskatchewan | 11 | 11 | June 8, 2023 (2023-06-08) | August 17, 2023 (2023-08-17) | 66 | Alan Tenta |
Specials | — | — | 8 | 8 | April 14, 2016 (2016-04-14) | June 8, 2023 (2023-06-08) | — | — |
library(tidyverse)

# Extract the text of the second paragraph
para_2 = alone_paragraphs |>
  purrr::pluck(2) |>
  html_text()
If you want to extract just the background text from this paragraph, you'll need some string handling.
# Find where the background passage ends, then keep everything up to that point
bkgd_str_end = str_locate(para_2, "season 7\\)\\.")[2]
background_str = str_sub(para_2, 1, bkgd_str_end)
Explore the data you’ve pulled down from the webpage.
# Remove the duplicate header row
table_series_winners = table_series_winners[-1, ]

time_in_wild = table_series_winners |>
  select(Season, `Days Lasted`) |>
  filter(Season != "Specials") |>  # drop the Specials row before coercing
  mutate(`Days Lasted` = as.numeric(`Days Lasted`),
         Season = as.numeric(Season))

Note that filtering out the Specials row before coercing to numeric avoids the "NAs introduced by coercion" warnings that `as.numeric()` would otherwise throw.
ggplot(time_in_wild, aes(x = Season, y = `Days Lasted`)) +
  geom_point() +
  theme_bw()
There is no clear trend that the number of days in the wild increases with the season number.
This result is interesting, as one might expect (i) participants to learn from earlier seasons as the show goes on, and (ii) the pool of skilled candidates to grow with the show's popularity, both of which would push survival times up. Perhaps these factors matter little relative to the difficulty of surviving in the wild.
Note there is also a confounding variable here: location. Different locations vary in how difficult they are to survive in.
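If you want to put a number on this, one quick check is a simple linear fit of days lasted on season number. This is a rough sketch only, given just ten seasons and the location confounder noted above:

# Fit a simple linear trend of days lasted against season number
trend_fit = lm(`Days Lasted` ~ Season, data = time_in_wild)
summary(trend_fit)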
Start by looking at the different season websites, their web addresses, and how each page is structured. Look for common elements and any edge-case exceptions that may create challenges when writing code.
It appears that there is a table under a header called Results that stores all the data about the contestants. In this table there is a column containing the number of days in the wild. However, the entries of this column are not numeric, e.g. "56 days". To get the number from this string we will need string handling. We will also need to be careful about whether the duration is given in days or hours.
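For instance, with stringr (the status strings below are made-up examples of the format):

library(stringr)

# Pull the leading number out of duration strings like "56 days" or "6 hours"
status_examples = c("56 days", "6 hours")
str_extract(status_examples, "[:digit:]+")

## [1] "56" "6"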
Step 1: Generalise the URL string
season_number = 1
season_url_start = "https://en.wikipedia.org/wiki/Alone_season_"
season_url = paste(season_url_start, season_number, sep = "")
season_data = read_html(season_url)
Step 2: Get code to pull the table from the website
table_ref = ".wikitable"
table_data_raw = season_data |>
  html_nodes(table_ref) |>
  html_table(header = TRUE) |>
  purrr::pluck(2)

# This is not robust coding - it could easily break if the page layout changes.
# Ideally one would look for the Results header and take the next table,
# as sketched below.
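A sketch of that more robust approach, assuming the Results section heading carries the HTML id "Results" (worth verifying in the page source):

# Find the element with id "Results" and take the first table that follows it
results_table = season_data |>
  html_element(xpath = "//*[@id='Results']/following::table[1]") |>
  html_table(header = TRUE)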
Step 3: Create a new column for days lasted
# Get season table and parse the number of days from the Status column
table_data = table_data_raw |>
  mutate(Days = as.numeric(str_extract(Status, "[:digit:]+"))) |>
  # convert durations reported in hours (or "hour") to days
  mutate(Days = if_else(str_detect(Status, "[Hh]our"), Days / 24, Days))
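It is worth sanity-checking the parsed values against the raw strings before going further:

# Compare raw Status strings with the parsed Days column
table_data |>
  select(Status, Days) |>
  head()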
Step 4a: Combine the tables together
# Combine the above code into one function we can use to run the code for the
# different seasons
scrape_season_table <- function(season_number){
  # Step 1: build the URL and read the page
  season_url_start = "https://en.wikipedia.org/wiki/Alone_season_"
  season_url = paste(season_url_start, season_number, sep = "")
  season_data = read_html(season_url)

  # Step 2: pull the results table
  table_ref = ".wikitable"
  table_data_raw = season_data |>
    html_nodes(table_ref) |>
    html_table(header = TRUE) |>
    purrr::pluck(2)

  # Step 3: parse the number of days from the Status column
  table_data <- table_data_raw |>
    mutate(Days = as.numeric(str_extract(Status, "[:digit:]+"))) |>
    mutate(Days = if_else(str_detect(Status, "[Hh]our"), Days / 24, Days))
  return(table_data)
}
# Run the code for each season
num_seasons = 10
season_table_list = vector("list", num_seasons)
for(i in 1:num_seasons){
  print(paste("Season", i))
  season_table_list[[i]] = scrape_season_table(season_number = i) |>
    mutate(Season = i) |>         # add a season reference
    mutate(Order = row_number())  # add a row reference
}
## [1] "Season 1"
## [1] "Season 2"
## [1] "Season 3"
## [1] "Season 4"
## [1] "Season 5"
## [1] "Season 6"
## [1] "Season 7"
## [1] "Season 8"
## [1] "Season 9"
## [1] "Season 10"
Step 5: Wrangle and extract the data from the tables for plotting
# Combine the data together into a single tidy data frame
season_table_combined = bind_rows(season_table_list) |>
  mutate(Season = as.factor(Season))
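As an aside, the loop plus bind_rows steps could equivalently be written functionally with purrr (a sketch producing the same data frame as above):

# map() over the seasons, tag each table, then row-bind into one data frame
season_table_combined = purrr::map(
    1:num_seasons,
    \(i) scrape_season_table(season_number = i) |>
      mutate(Season = i, Order = row_number())
  ) |>
  bind_rows() |>
  mutate(Season = as.factor(Season))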
# Example plot - spaghetti plot of days lasted by elimination order
ggplot(season_table_combined,
       aes(x = Order, y = Days, col = Season, group = Season)) +
  geom_line() +
  geom_point() +
  theme_bw()
We see there is one season with a different number of entries from the others. Closer inspection reveals that in this season (season 4) the contestants competed in pairs, so we would need to handle it as a special case.
From the plot we observe that the Season 1 contestants went home earlier than contestants in other seasons. We also observe that in the later seasons the first contestants to leave last longer than in the earlier seasons.
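One simple way to compare seasons with different numbers of entries on a common scale (a sketch) is to rescale Order to a fraction of each season's total:

# Rescale elimination order to [0, 1] within each season
season_table_scaled = season_table_combined |>
  group_by(Season) |>
  mutate(OrderFrac = Order / max(Order)) |>
  ungroup()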
No, there is no easy way to do this. One approach might be to perform a text analysis on the reasons people leave, looking for common words or categories; you could also define these categories manually. Then you could perform "fuzzy" (partial string) matching to see which stated reasons best match the main categories, grouping any niche reasons into an "other" category.
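One way to sketch the fuzzy-matching idea in base R is with agrepl(), which performs approximate string matching. The reasons and category keywords below are hypothetical examples:

# Hypothetical tap-out reasons and candidate categories
reasons = c("Lost too much weight", "Missed his family",
            "Medical - severe weight loss", "Missing family")
categories = c("medical", "weight", "family")

match_category = function(reason){
  # agrepl() does approximate matching, tolerating small spelling differences
  hits = sapply(categories, function(cat)
    agrepl(cat, reason, max.distance = 0.2, ignore.case = TRUE))
  # return the first matching category, or "other" if none match
  if (any(hits)) categories[which(hits)[1]] else "other"
}

sapply(reasons, match_category)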