🎯 Objectives

In this tutorial, you will learn

🔧 Preparation

  1. The zip file provided with this tutorial has flights and airports data taken from https://transtats.bts.gov. You don’t need to do this yourself, for this tutorial, but it would be useful for you to refresh your memory on downloading a month of data, as you did in tutorial 1. Also to download the airport data you need to Click on “Aviation” on the left list, and then “Aviation Support Tables” followed by “Download” at bottom right. This has geographic coordinates for all the airports.
  2. Ensure that you have these packages (suites of packages) installed on your computer: DBI, RSQLite, tidyverse, patchwork, ggmap, ggthemes, forcats, lubridate, lutz.
  3. Unzip the tutorial-03.zip file provided, and notice the file structure for this week looks like:
etc5512-week03
├── exercise.Rmd
├── data
│   ├── flights.csv
│   ├── airports.csv
└── etc5512-week03.Rproj

NOTE: All the code to complete these exercises is in the exercise.Rmd in the zip file.

💽 Exercise 1

Build an sqlite database from the month of data.

  1. Read the data into R. It’s possible to do this, because this is a small amount of data.

  2. Create an sqlite database called “sqlflights” from the data.

  3. Save the data we read in from the .csv file into the database as “flights”.

  4. Practice reading the flights data back from the database using the functions tbl() and collect().

  5. Do a simple query, collecting all the records for the first day of the month.

🧮 Exercise 2

Explore your data! These tasks can be done using the dplyr interface, so that the tidy wrangling verbs can be used instead of raw SQL functions.

  1. Find the carrier that had the most flights during the month. Determine which carrier this is?

  2. Which airport had the most departing traffic?

  3. Compute the smallest, largest and median departure delay for the busiest airport. What would it mean if the median departure delay was negative?

  4. Make a side-by-side boxplot of the delays for each carrier, at the busiest airport.

    1. Think about transforming delay because it has a skewed distribution. (If you use a transformation on the axis, check the number of missings. It may be that a lof ot data is excluded and you need to do the transformation with mutate.)
    2. Sort the carrier axis by the median delay (this is tricky! Hint: use the forcats package).
    3. Make nice labels on the axis
    4. Write a paragraph on what is learned about the delays by carrier.
  5. How many records, of the busiest airport, have missing values for departure delay?

📍 Exercise 3

Here we are going to add a new table with airport information, and use this to make a map of flights.

  1. Read the airport location data into R, and add a table to your database.

  2. Plot the locations on a map. You should filter the airports to only the latest location. Airports sometimes move 🤭. An Open Street Map can be downloaded using the get_map() function in the ggmap package.

  3. Now the fun part, lets take a day’s worth of flights, and plot all the flights. You will need to join the day of flights data with the airport locations, using both the origin and destination.

  4. Choose the two major carriers for your day of data, and make two separate maps of flights, one for each carrier. Compare and contrast the carrier flight patterns.

  5. ADVANCED: Now we are going to examine change in patterns over the course of a day. You will need to convert departure time into a standard time. Then break it into one of four categories: midnight-6am, 6am-noon, noon-6pm, 6pm-midnight. Using all the carriers again, make separate maps for each quarter of the day. Compare the traffic over these four time blocks.

  6. ADVANCED: Use the standardised times to follow the path of one plane during the day.

Material developed originally by Prof Di Cook and maintained by Dr Kate Saunders