Week 10 Tutorial

Learning Objectives

In this tutorial, you will practice:

  • identifying html structures on a web page
  • reading data from a web page into R

Background

Alone is an American survival competition series on History. It follows the self-documented daily struggles of 10 individuals (seven paired teams in season 4) as they survive alone in the wilderness for as long as possible using a limited amount of survival equipment. With the exception of medical check-ins, the participants are isolated from each other and all other humans. They may “tap out” at any time, or be removed due to failing a medical check-in. The contestant who remains the longest wins a grand prize of $500,000 (USD) (increasing to $1 million in season 7).

From Alone wikipedia

Exercise 1

Visit the wikipedia page for the Alone TV Series and identify the basic elements that make up the page.

  1. How many tables are on there on the page?

  2. How many paragraphs are there?

  3. Identify and scrape the table containing the past Series winners.

  4. Identify and scrape the text that was used to create the Background text for this tutorial.

Exercise 2

Explore the data you’ve pulled down from the webpage.

  1. Process the table and extract how long the winners spent in the Wild on each season.

  2. Plot your result. Is the time winners spent in the wild increasing as the seasons go on?

In your own time: Exercise 3

Think about how you would you automate getting the time all contestants spent in the wild from Seasons 1 - 10.

  1. Write some pseudo code.

  2. Identify potential edge cases that would need to be handled to web scrape the time contestants spent in the wild from Seasons 1 - 10.

  3. Pull this data into R and plot how long contestants were in the wild on season one. (This will require some more advanced string handling.)

  4. ADVANCED Generalise your approach to all seasons.

  5. ADVANCED The reasons people leave the show can be quite varied, from medical reasons, to fear, accidents and to missing family. Is there any easy way to scrape and analyse the common reasons people leave? Think about the challenges.