🎯 Objectives

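In this tutorial, you will learn how to download open data from three providers, find out how each data set was collected and licensed, and run basic quality checks in R.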

🔧 Preparation

install.packages(c("tidyverse", "here", "galah"))

Note that tidyverse is a collection of packages, and installing it will install quite a lot of packages.

If you are new to R, you need to know that once you install a package it's there for you to use in the future. You only need to install a package once, or again whenever the package has been updated.

You do need to tell the R session to load the package each time you start R, though, using the library() function.

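You can think about this with a lightbulb analogy: installing a package is like fitting the bulb, which you do once, and library() is like flipping the switch, which you do every time you want light.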
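
If you are unsure whether a package is already installed, a small check can save you from reinstalling it. The sketch below is one possible approach and not part of the required setup. The helper name missing_packages() is hypothetical; it relies on base R's requireNamespace() to test whether each package can be found.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

# Packages used in this tutorial
to_install <- missing_packages(c("tidyverse", "here", "galah"))
to_install

# Uncomment to install only what is missing
# install.packages(to_install)
```

Installing is still a one-off step; you will need library() in every new session regardless of what this check reports.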

⚙️ Tutorial

Exercise 1: US air traffic

a. Download

  • Navigate to the airline on-time performance database by going to https://www.transtats.bts.gov/
  • Select “Aviation” from the left box, and then “Airline On-Time Performance Data”.
  • In the table for “Reporting Carrier On-Time Performance (1987-present)” click “Download”

This will bring you to an interface for choosing a subset.

⚠️ THE DATA IS VERY BIG SO FOLLOW THE INSTRUCTIONS BELOW TO DOWNLOAD A SMALL SUBSET

  • Choose 2020 and January (before the pandemic hit the USA)
  • Select these variables: Year, Month, DayofMonth, DayOfWeek, FlightDate, Reporting_Airline, Tail_Number, Origin, Dest, CRSDepTime, DepTime, DepDelay, CRSArrTime, ArrTime, ArrDelay.
  • Click the “Download” button to get it onto your laptop. (No need to check pre-zipped.)
  • The resulting file is about 50 MB, and the column names are slightly different from the form names, but recognisable as the requested variables: YEAR, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, FL_DATE, OP_UNIQUE_CARRIER, TAIL_NUM, ORIGIN, DEST, CRS_DEP_TIME, DEP_TIME, DEP_DELAY, CRS_ARR_TIME, ARR_TIME, ARR_DELAY
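library(tidyverse)
library(here)
# Replace the file name with the name of the file you downloaded
flights <- read_csv(here::here("data/504717774_T_ONTIME_REPORTING.csv")) %>%
  select(YEAR:ARR_DELAY)  # keep only the requested variables
# Confirm that only 2020 and January are present
flights %>% count(YEAR)
flights %>% count(MONTH)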
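
Before going further, it can help to confirm that the file has the columns you asked for. The following is a minimal sketch of such a check in base R. The helper check_columns() and the toy data frame are made up for illustration; only the column names come from the list above.

```r
# Column names listed in the download step above
expected_cols <- c("YEAR", "MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "FL_DATE",
                   "OP_UNIQUE_CARRIER", "TAIL_NUM", "ORIGIN", "DEST",
                   "CRS_DEP_TIME", "DEP_TIME", "DEP_DELAY",
                   "CRS_ARR_TIME", "ARR_TIME", "ARR_DELAY")

# Hypothetical helper: report columns that are missing or unexpected
check_columns <- function(data, expected) {
  list(missing = setdiff(expected, names(data)),
       extra   = setdiff(names(data), expected))
}

# Made-up stand-in for the real file: one column dropped, one stray column added
toy <- as.data.frame(matrix(NA, nrow = 1, ncol = length(expected_cols)))
names(toy) <- expected_cols
toy$ARR_DELAY <- NULL
toy$X16 <- NA

result <- check_columns(toy, expected_cols)
result$missing  # columns you expected but did not get
result$extra    # columns you got but did not expect
```

With the real data you would call check_columns(flights, expected_cols). Two empty vectors would suggest the download matches the requested variables.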

b. How was the data collected?

You can check the “Data profile” to help answer these questions.

  1. Who has the oversight for the data provision?
  2. Who reports the data to the data provider?
  3. How is the data collected?
  4. Is this open data? What type of license is provided? What are you allowed to do with the data?
  5. What information is in each row of the data set?

c. Data quality checks

  1. Read in the data
  2. Check that the dates cover the month of January 2020.
  3. Count the number of flights by carrier. Who has the most flights?
  4. Which airport has the most traffic? Does every airport have the same number of incoming and outgoing flights?
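# Range of flight dates: should span January 2020
flights %>% select(FL_DATE) %>% summary()
# Flights per carrier, largest first
flights %>% count(OP_UNIQUE_CARRIER, sort = TRUE)
# Departures per airport, largest first
flights %>% count(ORIGIN, sort = TRUE)
# Departures and arrivals per airport, joined on the airport code
outgoing <- flights %>% count(ORIGIN) %>% rename(outbound = n)
incoming <- flights %>% count(DEST) %>% rename(inbound = n)
traffic <- full_join(outgoing, incoming, by = c("ORIGIN" = "DEST"))
# Points off the diagonal are airports with unequal inbound and outbound counts
ggplot(traffic, aes(x = outbound, y = inbound)) + geom_point() +
  coord_equal()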
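
One thing to watch for with the join above: full_join() leaves NA where an airport appears only as an origin or only as a destination, and those airports are easy to miss in a plot. The sketch below shows one way to list the imbalance per airport. It uses made-up airport codes rather than the real flights data, and the column name net_outbound is an illustrative choice.

```r
library(dplyr)

# Made-up flights: "DDD" receives a flight but never sends one
toy_flights <- tibble(
  ORIGIN = c("AAA", "AAA", "BBB", "CCC", "AAA"),
  DEST   = c("BBB", "CCC", "AAA", "BBB", "DDD")
)

outgoing <- toy_flights %>% count(ORIGIN, name = "outbound")
incoming <- toy_flights %>% count(DEST, name = "inbound")

toy_traffic <- full_join(outgoing, incoming, by = c("ORIGIN" = "DEST")) %>%
  # An airport missing from one side has zero flights on that side
  mutate(across(c(outbound, inbound), ~ coalesce(.x, 0L)),
         net_outbound = outbound - inbound) %>%
  # Largest imbalance first
  arrange(desc(abs(net_outbound)))

toy_traffic
```

Applied to the real data, airports with a non-zero net_outbound are the ones that do not have the same number of incoming and outgoing flights in the month.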

Exercise 2: National Longitudinal Survey of Youth

❗ The following data presents sex as a binary variable and race as categorical. We realise that sex is not binary, and that race is an “arbitrary system of visual classification that does not demarcate distinct subspecies of the human population” (Mindy Thompson Fullilove). Please skip this exercise if it disturbs you.

a. Data download

  1. Point your browser to https://www.nlsinfo.org/content/cohorts/nlsy79/get-data and log in.
  2. Choose the NLSY79 study. The first step is to “Choose TagSets”. We’ll use the default selection of variables, which has three demographic variables. (For fun, you could browse the many, many variables available to download to see what information is available.)
  3. Go to “Save/download” and choose “Advanced Download”. Change the name from “default” to “NLSY”. This makes a file available to download: check its box and click “download” to get it onto your computer. The download creates a folder called “NLSY” containing the data in several formats, one of which is a “csv” file. Make sure this folder is in the data directory of your ETC5512 R project.
  4. The data arrives with four variables R0000100, R0173600, R0214700, R0214800. Read the codebook to find out what these are.

b. License and usage

  1. At the bottom of the website are links that can help you determine the allowed uses. What information does the data provider keep about you?
  2. Is there a license provided with the data? What sort of open data is this? The documentation says that this is public use data. What do you think “public use” means?

c. About the data

  1. How was this data collected?
  2. Check the levels in the downloaded data. Do they match the codebook? How many individuals are in the sample?
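nlsy <- read_csv(here::here("data/NLSY/NLSY.csv"))
# Tabulate each coded variable to compare its levels with the codebook
nlsy %>% count(R0214700)
nlsy %>% count(R0214800)
nlsy %>% count(R0173600)
# Number of rows in the data
nlsy %>% tally()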
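
Comparing levels with a codebook by eye is error-prone when a variable has many codes. The sketch below shows one way to do the comparison programmatically in base R. The codes and values are invented for illustration and are not taken from the NLSY79 codebook; the helper name compare_codes() is hypothetical.

```r
# Hypothetical helper: compare observed values against the codes a codebook lists
compare_codes <- function(observed, codebook_codes) {
  list(
    unexpected = setdiff(unique(observed), codebook_codes),  # in data, not in codebook
    unused     = setdiff(codebook_codes, unique(observed))   # in codebook, not in data
  )
}

# Invented codes for one variable, as you might copy them from a codebook
codebook_codes <- c(1, 2, 3)

# Invented column values standing in for a downloaded variable
observed <- c(1, 3, 2, 2, 1, 9)

check <- compare_codes(observed, codebook_codes)
check$unexpected
check$unused

# A separate check: are the identifiers unique, so each row is one individual?
ids <- c(101, 102, 103, 104, 105, 106)
anyDuplicated(ids) == 0
```

With the real data you would pass a column such as nlsy$R0214800 as observed, along with the codes you find in the codebook for that variable.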

Exercise 3: Atlas of Living Australia

The Atlas of Living Australia is a major resource for occurrence data on animals, plants, insects, fish.

a. Download

  1. Point your browser to https://www.ala.org.au. Check the terms of use. Does it have a license?
  2. Using the galah package and its atlas_occurrences() function, extract the records for platypus. To download data from this API you will first need to register your email address with the Atlas of Living Australia.
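library(galah)

# Use the email address you registered with; download_reason_id is a code
# telling the provider why you are downloading
galah_config(email = "YOUREMAILADDRESS",
             download_reason_id = 10,
             verbose = TRUE)

# Build a query for platypus and download the occurrence records
platypus <- galah_call() %>%
  galah_identify("Ornithorhynchus anatinus") %>%
  atlas_occurrences()

# Tidy the coordinate names, convert dates, and drop incomplete records
platypus <- platypus %>%
  rename(Longitude = decimalLongitude,
         Latitude = decimalLatitude) %>%
  mutate(eventDate = as.Date(eventDate)) %>%
  filter(!is.na(eventDate)) %>%
  filter(!is.na(Longitude)) %>%
  filter(!is.na(Latitude))
# Save a local copy so the download does not need to be repeated
save(platypus, file = here::here("data/platypus.rda"))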
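
To see what the cleaning steps above do without waiting for a download, you can try them on a few invented rows. This is only a sketch: the values are made up, the dates are stored as text rather than in whatever form the API returns, and the three filter() calls are combined into one, which keeps the same rows.

```r
library(dplyr)

# Invented records: only the first row has a date and both coordinates
toy_records <- tibble(
  decimalLongitude = c(145.1, NA, 147.3, 150.2),
  decimalLatitude  = c(-37.8, -41.0, NA, -33.9),
  eventDate        = c("2019-03-01", "2020-05-17", "2021-01-09", NA)
)

toy_clean <- toy_records %>%
  rename(Longitude = decimalLongitude,
         Latitude = decimalLatitude) %>%
  mutate(eventDate = as.Date(eventDate)) %>%
  # Conditions separated by commas must all hold, like chained filter() calls
  filter(!is.na(eventDate), !is.na(Longitude), !is.na(Latitude))

# How many records were dropped as incomplete?
nrow(toy_records) - nrow(toy_clean)
```

Counting the dropped rows on the real data is a useful habit, because records without a date or location are removed silently.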

b. Data quality checks

  1. Plot the locations of sightings. Where in Australia are platypus found?
  2. What dates of sightings are downloaded?
load(here::here("data/platypus.rda"))
ggplot(platypus, aes(x=Longitude, y=Latitude)) +
  geom_point()
platypus %>% select(eventDate) %>% summary()     
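
A plot and a summary show the overall picture, but it can also be useful to flag individual records that look implausible. The sketch below is one possible check on invented sightings. The bounding box is a rough assumption about where Australia lies, not an official boundary, and download_date is a placeholder you would replace with the date of your own download.

```r
library(dplyr)

# Invented sightings: row 3 sits at (0, 0) and row 4 has a date in the future
toy_sightings <- tibble(
  Longitude = c(145.1, 151.2, 0.0, 147.3),
  Latitude  = c(-37.8, -33.9, 0.0, -42.9),
  eventDate = as.Date(c("1995-06-01", "2018-11-23", "2020-02-02", "2031-01-01"))
)

# Assumed rough bounding box for Australia, in decimal degrees
lon_range <- c(112, 155)
lat_range <- c(-45, -9)

# Placeholder for the date the data was downloaded
download_date <- as.Date("2024-01-01")

flagged <- toy_sightings %>%
  mutate(
    outside_box = !(between(Longitude, lon_range[1], lon_range[2]) &
                      between(Latitude, lat_range[1], lat_range[2])),
    after_download = eventDate > download_date
  ) %>%
  filter(outside_box | after_download)

flagged
```

A flagged record is not necessarily wrong, but it is worth looking at before you draw conclusions from the map.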

c. Data collection methods

How is this data collected? Explain the ways that a platypus sighting would be added to the database. Also think about what might be missing from the data.

Materials developed by Professor Di Cook