# Packages for reading in your data and for wrangling
library(tidyverse)
library(here)
# Packages for time zone matching
library(lutz)
library(lubridate) # also part of tidyverse
# Packages for maps
library(maps)
library(ggthemes)
# Other packages for plotting
library(patchwork)
Week 3 Tutorial
Learning Objectives
Today's goal is to draw your own insights from the US Airline Traffic data.
You will practice:
- Processing variables and data wrangling
- Visualising your results
- Using an R Project and a Quarto file to organise your code and results in a reproducible way.
Before your tutorial
Download the project template from Moodle. Go to Week 3 > Real Time > Download the .zip file.
For this week and next week, make sure you have worked through the startR modules on Tidy Data and Quarto Basics.
These should take you about 2 hours.
Package Installation
We will need the below packages for this week’s tutorial. If you haven’t already, install these packages.
Cheat Sheets
You can find a link to a PDF cheat sheet on data wrangling here.
You may also like to look at the other cheat sheets on the same website, specifically those for tidyr and lubridate.
These give you a good overview of functions you may find useful.
Exercise 1
Explore your data! These tasks should be done using the dplyr and tidyr wrangling verbs. (These are similar to SQL functions for wrangling.)
To get you started, there is already data in the data folder for you to use. Refer to the US website for the data dictionary.
Everything needed to complete the tutorial is in the tutorial-03.zip on Moodle. Here are some of the things you will find when running the code.
- Find the carrier that had the most flights during the month. Which carrier is this?
WN (Southwest) had the most flights.
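A minimal sketch of how such a count could be computed with dplyr. It uses a small made-up tibble in place of the tutorial data, and the column names carrier and origin are assumptions; check the data dictionary for the real names.

```r
library(dplyr)

# Toy stand-in for the flights data; column names are assumptions
flights_toy <- tibble(
  carrier = c("WN", "WN", "WN", "DL", "DL", "AA"),
  origin  = c("LAX", "DEN", "LAX", "ATL", "LAX", "DFW")
)

# Count the rows (flights) per carrier, largest count first
carrier_counts <- flights_toy |>
  count(carrier, sort = TRUE)

# The first row is the carrier with the most flights
top_carrier <- carrier_counts |>
  slice_head(n = 1) |>
  pull(carrier)

top_carrier
```

The same pattern, counting by the origin column instead of the carrier column, would give the airport with the most departing traffic.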
- Which airport had the most departing traffic?
LAX (Los Angeles) had the most departing flights.
- Compute the smallest, largest and median departure delay for the busiest airport. What would it mean if the median departure delay was negative?
LAX is the busiest airport. The smallest departure delay for the month was well below 0, which means that flight left very early. The longest delay was about a day. (This looks like a time zone calculation error.) The median delay was less than 0, which means that at least half of the flights left ahead of schedule.
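One way these summaries might be computed is sketched below on made-up data. The column names origin and dep_delay (in minutes) are assumptions, and the numbers are invented for illustration only.

```r
library(dplyr)

# Toy data; origin and dep_delay (minutes) are assumed column names
flights_toy <- tibble(
  origin    = c("LAX", "LAX", "LAX", "LAX", "LAX", "DEN", "DEN"),
  dep_delay = c(-12, -3, -1, 45, NA, 10, 0)
)

# Identify the busiest origin airport by number of departing flights
busiest <- flights_toy |>
  count(origin, sort = TRUE) |>
  slice_head(n = 1) |>
  pull(origin)

# Summarise the delays there, dropping missing values explicitly
delay_summary <- flights_toy |>
  filter(origin == busiest) |>
  summarise(
    min_delay    = min(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE),
    max_delay    = max(dep_delay, na.rm = TRUE)
  )

delay_summary
```

Without na.rm = TRUE, a single missing delay would make all three summaries NA, so it is worth being explicit about how missing values are handled.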
- Make a side-by-side boxplot of the delays for each carrier, at the busiest airport.
  - Think about transforming delay because it has a skewed distribution. (If you use a transformation on the axis, check the number of missing values. It may be that a lot of data is excluded and you need to do the transformation with mutate.)
  - Sort the carrier axis by the median delay (this is tricky! Hint: use the forcats package).
  - Make nice labels on the axes.
  - Write a paragraph on what is learned about the delays by carrier.
Overall, there is not much difference between carriers in either the median delay or the variation in delays. American Airlines (AA) and Southwest (WN) have the highest median delay and Frontier Airlines (F9) has the lowest median delay. Skywest (OO) has the flight with the longest delay.
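The boxplot steps could look something like the sketch below. The data are made up, the column names are assumptions, and the signed square root is only one possible transformation (chosen here because it keeps negative delays, which a log axis would drop); the tutorial solution may use something different.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy data for one airport; carrier and dep_delay are assumed column names
lax_toy <- tibble(
  carrier   = rep(c("AA", "F9", "WN"), each = 4),
  dep_delay = c(-2, 5, 20, 120,   # AA
                -10, -6, -1, 30,  # F9
                -3, 4, 15, 60)    # WN
)

lax_plot_data <- lax_toy |>
  filter(!is.na(dep_delay)) |>
  mutate(
    # Transform inside mutate() so early (negative) departures are kept
    delay_trans = sign(dep_delay) * sqrt(abs(dep_delay)),
    # Reorder the carrier levels by their median delay
    carrier = fct_reorder(carrier, dep_delay, .fun = median)
  )

p <- ggplot(lax_plot_data, aes(x = carrier, y = delay_trans)) +
  geom_boxplot() +
  labs(
    x = "Carrier",
    y = "Departure delay (signed square root of minutes)"
  )

levels(lax_plot_data$carrier)
```

Because the reordering happens on the factor levels, the carrier axis is sorted from the lowest to the highest median delay without any extra work in the plotting code.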
- How many records, of the busiest airport, have missing values for departure delay?
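A small sketch of counting missing values, again on invented data with assumed column names (origin, dep_delay). The airport code is hard-coded here for brevity.

```r
library(dplyr)

# Toy data; column names are assumptions
flights_toy <- tibble(
  origin    = c("LAX", "LAX", "LAX", "LAX", "DEN"),
  dep_delay = c(5, NA, -2, NA, NA)
)

missing_summary <- flights_toy |>
  filter(origin == "LAX") |>
  summarise(
    n_records    = n(),
    n_missing    = sum(is.na(dep_delay)),   # TRUE counts as 1
    prop_missing = mean(is.na(dep_delay))   # share of records missing
  )

missing_summary
```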
Exercise 2
Here we are going to make a map of flights.
- Plot the airport locations on a map. You should filter the airports to only the latest location. Airports sometimes move.
- Now the fun part: let's take a day's worth of flights and plot all of them. You will need to join the day of flights data with the airport locations, using both the origin and destination.
- Choose the two major carriers for your day of data, and make two separate maps of flights, one for each carrier. Compare and contrast the carrier flight patterns.
I chose Delta and Southwest. Both are high-volume carriers, but their maps look different: Delta appears to run more of a hub system, with flights operating between fewer airports, while Southwest is more distributed and serves many more airports. Spatially, this makes it look as though Southwest has more flights than Delta.
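The mapping steps in this exercise (keeping the latest airport location, joining on both origin and destination, and drawing one map per carrier) can be sketched as follows. Everything here is illustrative: the tibbles are made up, the coordinates are approximate, and all column names, including the is_latest flag, are assumptions rather than the names in the tutorial data.

```r
library(dplyr)
library(ggplot2)
library(maps)      # state outlines used by map_data()
library(ggthemes)  # theme_map()

# Toy airport table with two records for one airport code
airports_toy <- tibble(
  airport   = c("LAX", "DEN", "DEN", "ATL"),
  lat       = c(33.94, 39.77, 39.86, 33.64),
  lon       = c(-118.41, -104.88, -104.67, -84.43),
  is_latest = c(1, 0, 1, 1)
)

# Keep only the latest recorded location of each airport
airports_latest <- airports_toy |>
  filter(is_latest == 1) |>
  select(airport, lat, lon)

# Toy day of flights for two carriers
flights_toy <- tibble(
  carrier = c("DL", "DL", "WN", "WN"),
  origin  = c("ATL", "LAX", "DEN", "LAX"),
  dest    = c("LAX", "ATL", "LAX", "DEN")
)

# Join the airport table twice: once for origin, once for destination
flights_geo <- flights_toy |>
  left_join(airports_latest, by = c("origin" = "airport")) |>
  rename(origin_lat = lat, origin_lon = lon) |>
  left_join(airports_latest, by = c("dest" = "airport")) |>
  rename(dest_lat = lat, dest_lon = lon)

# One panel per carrier; flights drawn as segments over state outlines
p <- ggplot() +
  geom_polygon(
    data = map_data("state"),
    aes(x = long, y = lat, group = group),
    fill = "grey90", colour = "white"
  ) +
  geom_segment(
    data = flights_geo,
    aes(x = origin_lon, y = origin_lat, xend = dest_lon, yend = dest_lat),
    alpha = 0.5
  ) +
  facet_wrap(~carrier) +
  theme_map()
```

Filtering the airports before joining matters: if both records for the same airport code were kept, each flight through that airport would be duplicated by the join.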
- ADVANCED: Now we are going to examine change in patterns over the course of a day. You will need to convert departure time into a standard time. Then break it into one of four categories: midnight-6am, 6am-noon, noon-6pm, 6pm-midnight. Using all the carriers again, make separate maps for each quarter of the day. Compare the traffic over these four time blocks.
There’s not a lot to see in four big groups like this. It’s an exercise in working with time, and also in ordering the four groups appropriately.
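A sketch of one way the time standardisation might be done with lutz and lubridate. It assumes departure times are stored as local clock times in hhmm form, that airport coordinates are already joined on, and that UTC is the chosen standard time; the data, the date, and all column names are made up.

```r
library(dplyr)
library(lubridate)
library(lutz)

# Toy flights with approximate origin coordinates; names are assumptions
flights_toy <- tibble(
  origin   = c("LAX", "DEN", "JFK", "LAX"),
  lat      = c(33.94, 39.86, 40.64, 33.94),
  lon      = c(-118.41, -104.67, -73.78, -118.41),
  fl_date  = as.Date("2020-01-15"),
  dep_time = c(630, 1245, 2359, 1700)  # local clock time as hhmm
)

flights_std <- flights_toy |>
  mutate(
    # Look up the time zone name from the airport coordinates
    tz = tz_lookup_coords(lat, lon, method = "fast", warn = FALSE),
    # Build the local clock time (temporarily labelled as UTC)
    local_clock = make_datetime(
      year(fl_date), month(fl_date), day(fl_date),
      hour = dep_time %/% 100, min = dep_time %% 100
    ),
    # Reinterpret each clock time in its own zone, then express it in UTC
    dep_utc = force_tzs(local_clock, tzones = tz, tzone_out = "UTC"),
    # Four ordered blocks of the day, based on the standardised hour
    day_block = cut(
      hour(dep_utc),
      breaks = c(0, 6, 12, 18, 24), right = FALSE,
      labels = c("midnight-6am", "6am-noon", "noon-6pm", "6pm-midnight")
    )
  )

flights_std |> select(origin, tz, dep_utc, day_block)
```

Because cut() returns a factor whose levels follow the order of the breaks, facetting a map by day_block would show the four panels in chronological order rather than alphabetical order.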
- ADVANCED: Use the standardised times to follow the path of one plane during the day.
Plane N243WN has 8 flights during the day, going back and forth between just three airports.
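Following a single plane could be sketched like this. Only the tail number comes from the text above; the routes, times, and column names (tail_num, dep_utc) are invented for illustration and are not the plane's real itinerary.

```r
library(dplyr)

# Toy flights with standardised departure times; all values are made up
flights_toy <- tibble(
  tail_num = c("N243WN", "N100AA", "N243WN", "N243WN"),
  origin   = c("OAK", "DFW", "LAX", "LAX"),
  dest     = c("LAX", "ORD", "OAK", "SJC"),
  dep_utc  = as.POSIXct(
    c("2020-01-15 18:10", "2020-01-15 15:00",
      "2020-01-15 14:05", "2020-01-15 20:30"),
    tz = "UTC"
  )
)

plane_path <- flights_toy |>
  filter(tail_num == "N243WN") |>
  arrange(dep_utc) |>                # order the legs in time
  mutate(
    leg = row_number(),
    # Sanity check: each leg should start where the previous one ended
    connects = is.na(lag(dest)) | origin == lag(dest)
  )

plane_path
```

Sorting on the standardised time rather than the local clock time is what makes the legs line up when a plane crosses time zones. Joining airport coordinates onto plane_path and drawing the legs in order would then show the path on a map.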