Week 3 Tutorial

Learning Objectives

Today's goal is to generate your own insights from the US Airline Traffic data.

You will practice:

Processing variables and data wrangling
Visualising your results
Using an R Project and a Quarto file to organise your code and results in a reproducible way

Before your tutorial

Download the project template from Moodle. Go to Week 3 > Real Time > Download the .zip file.

Before this week's and next week's tutorials, make sure you have worked through the startR modules on Tidy Data and Quarto Basics.

Together these should take you about 2 hours.

Package Installation

We will need the packages below for this week’s tutorial. If you haven’t installed them already, do so before loading them.

# Packages for reading in your data and for wrangling
library(tidyverse)
library(here)

# Packages for time zone matching
library(lutz)
library(lubridate) # also part of tidyverse

# Packages for maps
library(maps)
library(ggthemes)

# Other packages for plotting
library(patchwork)

Cheat Sheets

You can find a link to a PDF cheat sheet on data wrangling here.

You may also like to look at the other cheat sheets on the same website, specifically those for tidyr and lubridate.

These are great and give you an overview of functions you may find useful.

Exercise 1

Explore your data! Complete these tasks using the dplyr and tidyr wrangling verbs. (These are similar to SQL functions for wrangling.)

To get you started, there is already data in the data folder for you to use. Refer to the US website for the data dictionary.

Which carrier had the most flights during the month?

Which airport had the most departing traffic?
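Both counting questions above follow the same pattern: count rows per group, then keep the largest group. The sketch below shows that pattern on a small made-up table. The column names `carrier` and `origin` are assumptions for illustration, so check the data dictionary for the names used in the tutorial data.

```r
library(dplyr)

# Toy data: `carrier` and `origin` are hypothetical column names
flights <- tibble(
  carrier = c("AA", "AA", "DL", "UA", "AA", "DL"),
  origin  = c("ORD", "DFW", "ATL", "ORD", "ORD", "ATL")
)

# Number of flights per carrier, largest count first
carrier_counts <- flights |>
  count(carrier, sort = TRUE)

# Number of departures per origin airport, keeping only the busiest one
busiest <- flights |>
  count(origin, name = "n_departures") |>
  slice_max(n_departures, n = 1)

carrier_counts
busiest
```

Note that `slice_max()` keeps ties by default, so it can return more than one row if two airports share the top count.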

Compute the smallest, largest and median departure delay for the busiest airport. What would it mean if the median departure delay was negative?
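One possible way to compute these summaries is `filter()` followed by `summarise()`. The sketch below uses made-up values and the hypothetical column names `origin` and `dep_delay`. It assumes missing delays are stored as `NA`, which is why `na.rm = TRUE` is set.

```r
library(dplyr)

# Toy data: `origin` and `dep_delay` (in minutes) are hypothetical column names
delays <- tibble(
  origin    = c("ORD", "ORD", "ORD", "ORD", "ATL"),
  dep_delay = c(-5, -2, 30, NA, 10)
)

delay_summary <- delays |>
  filter(origin == "ORD") |>                          # keep one airport
  summarise(
    min_delay    = min(dep_delay, na.rm = TRUE),      # smallest delay
    max_delay    = max(dep_delay, na.rm = TRUE),      # largest delay
    median_delay = median(dep_delay, na.rm = TRUE)    # middle value
  )

delay_summary
```

Without `na.rm = TRUE`, a single missing value would make all three summaries `NA`.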

Make a side-by-side boxplot of the delays for each carrier, at the busiest airport.
1. Think about transforming delay because it has a skewed distribution. (If you use a transformation on the axis, check the number of missing values. It may be that a lot of data is excluded and you need to do the transformation with mutate instead.)
2. Sort the carrier axis by the median delay (this is tricky! Hint: use the forcats package).
3. Make nice labels on the axes.
4. Write a paragraph on what is learned about the delays by carrier.
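The first three steps can be sketched as follows, using simulated delays rather than the tutorial data. The column names `carrier` and `dep_delay` are assumptions, and the shifted log is only one of several possible transformations; it is used here because a plain log is undefined for zero and negative delays.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Simulated, right-skewed delays for three hypothetical carriers
set.seed(1)
toy <- tibble(
  carrier   = rep(c("AA", "DL", "UA"), each = 50),
  dep_delay = c(rexp(50, 1 / 20), rexp(50, 1 / 5), rexp(50, 1 / 10)) - 5
)

toy_plot <- toy |>
  filter(!is.na(dep_delay)) |>
  mutate(
    # Shift so the smallest value maps to log10(1) = 0, then take logs
    log_delay = log10(dep_delay - min(dep_delay) + 1),
    # Reorder carrier levels by their median delay (ascending)
    carrier = fct_reorder(carrier, dep_delay, .fun = median)
  )

p <- ggplot(toy_plot, aes(x = carrier, y = log_delay)) +
  geom_boxplot() +
  labs(x = "Carrier", y = "Departure delay (log10 of shifted minutes)")

p
```

Doing the transformation inside `mutate()` keeps every non-missing row, whereas a log scale applied on the axis would drop the zero and negative delays.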

How many records for the busiest airport have missing values for departure delay?
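Counting missing values usually comes down to summing `is.na()`. A minimal sketch on made-up data, again assuming hypothetical columns `origin` and `dep_delay`:

```r
library(dplyr)

# Toy data: `origin` and `dep_delay` are hypothetical column names
delays <- tibble(
  origin    = c("ORD", "ORD", "ORD", "ATL", "ATL"),
  dep_delay = c(12, NA, NA, 3, NA)
)

missing_summary <- delays |>
  filter(origin == "ORD") |>
  summarise(
    n_records    = n(),                      # all records for the airport
    n_missing    = sum(is.na(dep_delay)),    # records with no delay recorded
    prop_missing = mean(is.na(dep_delay))    # share of records missing
  )

missing_summary
```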

Exercise 2

Here we are going to make a map of flights.

Plot the airport locations on a map. Filter the airports to keep only the latest location, because airports sometimes move.
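A minimal sketch of this step is shown below. The airports table, its column names (`airport`, `latitude`, `longitude`, `is_latest`) and its rough coordinates are all made up for illustration; the real table will have its own indicator for the latest location. The base map comes from `map_data("state")`, which relies on the maps package.

```r
library(dplyr)
library(ggplot2)
library(maps)      # supplies the outlines used by map_data()
library(ggthemes)  # theme_map()

# Toy airports table with one airport listed at two locations
airports <- tibble(
  airport   = c("ORD", "ATL", "DEN", "DEN"),
  latitude  = c(41.98, 33.64, 39.77, 39.86),
  longitude = c(-87.90, -84.43, -104.88, -104.67),
  is_latest = c(1, 1, 0, 1)   # hypothetical flag: 1 = current location
)

# Keep a single, current location per airport
airports_latest <- airports |>
  filter(is_latest == 1)

# Outlines of the contiguous US states
usa <- map_data("state")

airport_map <- ggplot() +
  geom_polygon(
    data = usa,
    aes(x = long, y = lat, group = group),
    fill = "grey90", colour = "white"
  ) +
  geom_point(
    data = airports_latest,
    aes(x = longitude, y = latitude),
    colour = "steelblue"
  ) +
  coord_quickmap() +
  theme_map()

airport_map
```

Because `map_data("state")` covers only the contiguous states, airports elsewhere would need a different base map or would have to be filtered out.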

Now the fun part: let’s take a day’s worth of flights and plot all of them. You will need to join the day of flights data with the airport locations, using both the origin and destination.
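The double join can be sketched as two `left_join()` calls, renaming the coordinate columns each time so that origin and destination do not clash. All table and column names here (`flights_day`, `airports_latest`, `origin`, `dest`, `airport`) and the coordinates are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy lookup table of airport locations (rough, made-up coordinates)
airports_latest <- tibble(
  airport   = c("ORD", "ATL", "DEN"),
  latitude  = c(41.98, 33.64, 39.86),
  longitude = c(-87.90, -84.43, -104.67)
)

# Toy day of flights
flights_day <- tibble(
  carrier = c("AA", "DL", "AA"),
  origin  = c("ORD", "ATL", "DEN"),
  dest    = c("ATL", "DEN", "ORD")
)

flights_geo <- flights_day |>
  # First join: coordinates of the origin airport
  left_join(
    airports_latest |> rename(origin_lat = latitude, origin_lon = longitude),
    by = c("origin" = "airport")
  ) |>
  # Second join: coordinates of the destination airport
  left_join(
    airports_latest |> rename(dest_lat = latitude, dest_lon = longitude),
    by = c("dest" = "airport")
  )

# One line segment per flight, from origin to destination
flight_map <- ggplot(flights_geo) +
  geom_segment(
    aes(x = origin_lon, y = origin_lat, xend = dest_lon, yend = dest_lat),
    alpha = 0.3
  ) +
  coord_quickmap()

flight_map
```

If the airports table still held several locations per airport, each join would duplicate flights, which is one reason to filter to the latest location first.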

Choose the two major carriers for your day of data, and make two separate maps of flights, one for each carrier. Compare and contrast the carrier flight patterns.
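One way to pick the two carriers is by flight count, as in the sketch below. Treating "major" as "most flights on the day" is an assumption, as are the column names.

```r
library(dplyr)

# Toy day of flights: `carrier`, `origin` and `dest` are hypothetical names
flights_day <- tibble(
  carrier = c("AA", "AA", "AA", "DL", "DL", "UA"),
  origin  = c("ORD", "DEN", "ATL", "ATL", "DEN", "ORD"),
  dest    = c("ATL", "ORD", "DEN", "DEN", "ATL", "DEN")
)

# The two carriers with the most flights on the day
top_two <- flights_day |>
  count(carrier, sort = TRUE) |>
  slice_head(n = 2) |>
  pull(carrier)

# Keep only the flights operated by those two carriers
flights_top <- flights_day |>
  filter(carrier %in% top_two)

top_two
```

From there, the two maps can be drawn either by adding `facet_wrap(~ carrier)` to a single plot, or by building one plot per carrier and placing them side by side with patchwork (`p1 + p2`).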

ADVANCED: Now we are going to examine change in patterns over the course of a day. You will need to convert departure time into a standard time. Then break it into one of four categories: midnight-6am, 6am-noon, noon-6pm, 6pm-midnight. Using all the carriers again, make separate maps for each quarter of the day. Compare the traffic over these four time blocks.
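A sketch of the time standardisation is given below, under several assumptions: departure time is stored as a local `hhmm` integer in a hypothetical `dep_time` column, the flight date is in `fl_date`, the origin coordinates have already been joined on, and UTC is used as the standard time. The time zone of each airport is looked up from its coordinates with `lutz::tz_lookup_coords()`, and `lubridate::force_tzs()` then interprets each clock time in its own zone.

```r
library(dplyr)
library(lubridate)
library(lutz)

# Toy flights with made-up dates, times and rough coordinates
flights_day <- tibble(
  origin     = c("ORD", "ATL", "DEN"),
  origin_lat = c(41.98, 33.64, 39.86),
  origin_lon = c(-87.90, -84.43, -104.67),
  fl_date    = as.Date("2020-01-15"),
  dep_time   = c(545, 1210, 2330)   # local clock time as hhmm
)

flights_std <- flights_day |>
  mutate(
    # Time zone name for each origin airport, from its coordinates
    origin_tz = tz_lookup_coords(origin_lat, origin_lon,
                                 method = "fast", warn = FALSE),
    # Clock time as a date-time with no real zone attached yet
    dep_local = make_datetime(
      year(fl_date), month(fl_date), day(fl_date),
      hour = dep_time %/% 100, min = dep_time %% 100
    ),
    # Interpret each clock time in its own zone, expressed in UTC
    dep_utc = force_tzs(dep_local, tzones = origin_tz, tzone_out = "UTC"),
    # Four six-hour blocks based on the standardised hour
    day_block = cut(
      hour(dep_utc),
      breaks = c(0, 6, 12, 18, 24),
      right  = FALSE,
      labels = c("midnight-6am", "6am-noon", "noon-6pm", "6pm-midnight")
    )
  )

flights_std |>
  select(origin, origin_tz, dep_utc, day_block)
```

The `"fast"` lookup method can be inaccurate near time zone boundaries, so it is worth spot-checking a few airports. To express the result in a single US zone rather than UTC, `with_tz()` can be applied to `dep_utc` before cutting into blocks.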

ADVANCED: Use the standardised times to follow the path of one plane during the day.
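Following a single plane amounts to filtering on an aircraft identifier and ordering its flights by standardised departure time. The sketch below assumes a hypothetical `tail_num` column and a `dep_utc` column of standardised times; the tail numbers and times are made up.

```r
library(dplyr)

# Toy flights, deliberately out of time order
flights_std <- tibble(
  tail_num = c("N101", "N202", "N101", "N101"),
  origin   = c("ORD", "ATL", "DEN", "ATL"),
  dest     = c("ATL", "DEN", "ORD", "DEN"),
  dep_utc  = as.POSIXct(
    c("2020-01-15 17:30:00", "2020-01-15 14:00:00",
      "2020-01-15 13:00:00", "2020-01-15 21:00:00"),
    tz = "UTC"
  )
)

plane_path <- flights_std |>
  filter(tail_num == "N101") |>   # one aircraft
  arrange(dep_utc) |>             # order its legs in time
  mutate(leg = row_number())      # label legs 1, 2, 3, ...

plane_path
```

Once the legs are in order and joined to airport coordinates, `geom_segment()` or `geom_path()` can draw the route, with `leg` mapped to colour or used as a label.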