In this tutorial, you will learn how to scrape tables and text from a web page with the R package rvest.
Make sure to download the R package rvest to be ready for today’s class.
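A minimal setup sketch, assuming a standard R installation with access to CRAN. It installs rvest only if it is missing and then loads it:

```r
# Install rvest only if it is not already available, then load it.
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest")
}
library(rvest)
```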
Alone is an American survival competition series on History. It follows the self-documented daily struggles of 10 individuals (seven paired teams in season 4) as they survive alone in the wilderness for as long as possible using a limited amount of survival equipment. With the exception of medical check-ins, the participants are isolated from each other and all other humans. They may “tap out” at any time, or be removed due to failing a medical check-in. The contestant who remains the longest wins a grand prize of $500,000 (USD) (increasing to $1 million in season 7).
Source: the Alone article on Wikipedia.
Visit the Wikipedia page for the Alone TV series and identify the basic elements that make up the page.
How many tables are there on the page?
How many paragraphs are there?
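One way to check your counts from R is to select every table and paragraph element and take the length of each result. The sketch below uses a small toy page built with `minimal_html()` so that it runs offline. The toy content is an assumption for illustration, and the article URL in the comment is the assumed address of the real page.

```r
library(rvest)

# Toy page standing in for the real article (illustrative content only).
# For the real page, replace this with something like:
#   page <- read_html("https://en.wikipedia.org/wiki/Alone_(TV_series)")
page <- minimal_html('
  <h1>Demo page</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <table><tr><th>Season</th></tr><tr><td>1</td></tr></table>
')

# Select every <table> and <p> element, then count them.
n_tables     <- length(html_elements(page, "table"))
n_paragraphs <- length(html_elements(page, "p"))

n_tables
n_paragraphs
```

On a real Wikipedia page the counts will usually be higher than what you see by eye, because infoboxes and navigation boxes are also built from tables.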
Identify and scrape the table containing the past series winners.
Identify and scrape the text that was used to create the Background text for this tutorial.
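A possible starting point for both tasks, again on a toy page. The table contents, the contestant names and the position of the table are placeholders, so on the real page you will need to inspect the list of tables and paragraphs to find the right index.

```r
library(rvest)

# Toy page (placeholder content, not the real article).
page <- minimal_html('
  <p>Background paragraph used as the intro text.</p>
  <table class="wikitable">
    <tr><th>Season</th><th>Winner</th><th>Days</th></tr>
    <tr><td>1</td><td>Contestant A</td><td>10 days</td></tr>
    <tr><td>2</td><td>Contestant B</td><td>12 days</td></tr>
  </table>
')

# html_table() on a set of <table> nodes returns a list of tibbles, one per table.
tables  <- html_table(html_elements(page, "table"))
winners <- tables[[1]]  # on the real page, check which index holds the winners

# html_text2() extracts text roughly as a browser would render it.
paragraphs <- html_text2(html_elements(page, "p"))
background <- paragraphs[1]

winners
background
```

Selecting by position is fragile if the page is edited. A CSS selector such as `"table.wikitable"` narrows the candidates, but you should still confirm the result by looking at the column names.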
Explore the data you’ve pulled down from the webpage.
Process the table and extract how long each winner spent in the wild.
Plot your result. Is the time spent in the wild increasing as the seasons go on?
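The processing and plotting steps could look roughly like this. The data frame uses made-up placeholder values, not real results from the show, and the column names `Season` and `Days` are assumptions about what your scraped table contains.

```r
# Placeholder values only, NOT real results from the show.
winners <- data.frame(
  Season = 1:4,
  Days   = c("10 days", "12 days[a]", "9 days", "15 days")
)

# Keep the first run of digits in each string and convert it to a number.
# This drops units and footnote markers such as "[a]".
winners$days_num <- as.numeric(sub("^[^0-9]*([0-9]+).*$", "\\1", winners$Days))

# Points joined by lines, plus a dashed linear trend line.
plot(winners$Season, winners$days_num, type = "b",
     xlab = "Season", ylab = "Days in the wild (winner)")
abline(lm(days_num ~ Season, data = winners), lty = 2)
```

The slope of the dashed line gives a rough first answer to whether the time is increasing, although with only a handful of seasons it should be read with caution.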
Discuss in groups how you would automate getting the time all contestants spent in the wild for Seasons 1-10.
Write some pseudocode and identify potential edge cases that would need to be handled when scraping the time contestants spent in the wild for Seasons 1-10.
Pull this data into R and plot how long people were in the wild in the different seasons. Start with one season. ADVANCED: generalise your approach to all seasons.
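One way to structure the automation is a helper that handles a single season, applied in a loop that tolerates failures. In the sketch below, `scrape_season()` is a hypothetical helper, the `Name` and `Status` columns are assumed, and the toy pages stand in for pages you would fetch with `read_html()`. The real tables may be laid out differently from season to season.

```r
library(rvest)

# Hypothetical helper: given a parsed season page, return one row per contestant.
# Assumes the first table has "Name" and "Status" columns (an assumption).
scrape_season <- function(page, season) {
  tab <- html_table(html_element(page, "table"))
  data.frame(
    season = season,
    name   = tab$Name,
    # Edge case: a status without any number becomes NA instead of an error.
    days   = suppressWarnings(
      as.numeric(sub("^[^0-9]*([0-9]+).*$", "\\1", tab$Status))
    )
  )
}

# Toy pages with placeholder contestants, one per season.
season_pages <- list(
  minimal_html('<table>
    <tr><th>Name</th><th>Status</th></tr>
    <tr><td>Contestant A</td><td>Winner - 10 days</td></tr>
    <tr><td>Contestant B</td><td>3 days (medically evacuated)</td></tr>
  </table>'),
  minimal_html('<table>
    <tr><th>Name</th><th>Status</th></tr>
    <tr><td>Contestant C</td><td>12 days[a]</td></tr>
    <tr><td>Contestant D</td><td>Unknown</td></tr>
  </table>')
)

# Scrape each season; a page that fails yields NULL instead of stopping the loop.
results <- lapply(seq_along(season_pages), function(i) {
  tryCatch(scrape_season(season_pages[[i]], i), error = function(e) NULL)
})
all_seasons <- do.call(rbind, Filter(Negate(is.null), results))

# One box per season; rows with NA days are dropped by boxplot().
boxplot(days ~ season, data = all_seasons,
        xlab = "Season", ylab = "Days in the wild")
```

When looping over real pages, it is polite to pause between requests, for example with `Sys.sleep()`, and to check which seasons returned `NULL` so that failures are not silently ignored.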
The reasons people leave the show are quite varied: medical issues, fear, accidents, and missing family. Is there an easy way to scrape and analyse the common reasons people leave? Discuss the challenges.
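As a starting point for the discussion, a simple keyword tally is sketched below. The reason strings are made up for illustration and the keyword patterns are hypothetical. Free-text reasons are worded inconsistently, so a real analysis would need manual review of whatever the patterns miss.

```r
# Made-up reason strings for illustration only, not scraped data.
reasons <- c(
  "Medically evacuated - injury",
  "Missed family",
  "Fear of bears",
  "Medically evacuated - starvation",
  "Missed his family"
)

# Hypothetical keyword patterns, one per category.
patterns <- c(medical = "medical", family = "family", fear = "fear")

# Count how many reasons match each pattern, ignoring case.
counts <- sapply(patterns, function(p) sum(grepl(p, reasons, ignore.case = TRUE)))
counts
```

A single reason can match several patterns or none at all, which is one of the challenges worth raising: the categories are neither exclusive nor exhaustive.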