ETC5512

Case study: US air traffic

Lecturer: Kate Saunders

Department of Econometrics and Business Statistics


  • ETC5512.Clayton-x@monash.edu
  • Wild Caught Data
  • wcd.numbat.space


Today’s lecture

What we’ll cover:

  • We are going to get curious about US airline traffic

  • Guide you to think about important variables in the data

  • Show you how others have gotten curious before you

  • Answer our curious questions using data visualisation

Case Study: US airline data

US Airline Traffic Data

You can find the data here

Download the data

Steps

  • Navigate to the Airline On-Time Performance database by going to https://www.transtats.bts.gov/

  • Select “Aviation” from the left box

  • Then select “Airline On-Time Performance Data”

  • In the table, find “Reporting Carrier On-Time Performance (1987-present)” and click “Download”

  • This will bring you to an interface for choosing a subset.

Download the Data

THE DATA IS VERY BIG!!!

  • Not all of it might be relevant to the question you want to answer

  • To start, you might like to download a sample to understand what the data looks like

Download the Data

If you were to download the data from the website:

Example

  • Choose 2020 and January (before the pandemic hit the USA)

  • Select these variables: Year, Month, DayofMonth, DayOfWeek, FlightDate, Reporting_Airline, Tail_Number, Origin, Dest, CRSDepTime, DepTime, DepDelay, CRSArrTime, ArrTime, ArrDelay.

  • Click the “Download” button to get it onto your laptop. (No need to check pre-zipped.)

  • The resulting file is about 50 MB. The column names differ slightly from the names on the form, but are recognisable as the requested variables: YEAR, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, FL_DATE, OP_UNIQUE_CARRIER, TAIL_NUM, ORIGIN, DEST, CRS_DEP_TIME, DEP_TIME, DEP_DELAY, CRS_ARR_TIME, ARR_TIME, ARR_DELAY
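A first check after downloading is that the file contains the columns you asked for. The sketch below is a minimal illustration in base R, under stated assumptions: `csv_text` is a made-up two-row stand-in for the real file (the tail numbers and values are invented), and `expected`, `missing_cols` and `extra_cols` are hypothetical names chosen for this example.

```r
# Toy stand-in for the downloaded file: two made-up rows using the
# column names listed above. For the real file, use
# read.csv(file = "<path to your download>") instead of text = csv_text.
csv_text <- "YEAR,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,TAIL_NUM,ORIGIN,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,CRS_ARR_TIME,ARR_TIME,ARR_DELAY
2020,1,1,3,2020-01-01,AA,N000AA,JFK,LAX,0900,0905,5,1215,1210,-5
2020,1,2,4,2020-01-02,DL,N000DL,ATL,ORD,1430,,,1545,,
"

# The variables we expect to see in the file
expected <- c(
  "YEAR", "MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "FL_DATE",
  "OP_UNIQUE_CARRIER", "TAIL_NUM", "ORIGIN", "DEST",
  "CRS_DEP_TIME", "DEP_TIME", "DEP_DELAY",
  "CRS_ARR_TIME", "ARR_TIME", "ARR_DELAY"
)

flights_raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Columns we asked for but did not get, and columns we did not ask for
missing_cols <- setdiff(expected, names(flights_raw))
extra_cols   <- setdiff(names(flights_raw), expected)

# Column types and the first few values of each variable
str(flights_raw)
```

Note that empty fields in numeric columns are read as `NA`, so it is worth counting missing values before computing any summaries.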

Sneak Peek

Lucky for us, there is an R package, nycflights13, with a sample of this data that we can use to take a quick look.

library(nycflights13)
data(airlines)
data(airports)
data(flights)
data(planes)
data(weather)

What’s in a row?

airlines[1, ]
# A tibble: 1 × 2
  carrier name             
  <chr>   <chr>            
1 9E      Endeavor Air Inc.
flights[1, ]
# A tibble: 1 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
airports[1, ]
# A tibble: 1 × 8
  faa   name                lat   lon   alt    tz dst   tzone           
  <chr> <chr>             <dbl> <dbl> <dbl> <dbl> <chr> <chr>           
1 04G   Lansdowne Airport  41.1 -80.6  1044    -5 A     America/New_York
planes[1, ]
# A tibble: 1 × 9
  tailnum  year type               manufacturer model engines seats speed engine
  <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
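The time columns in `flights` are clock times stored as integers in hhmm form, so 517 means 05:17. Subtracting them directly gives the wrong answer whenever the hour changes. The sketch below shows one way to convert them to minutes since midnight; `hhmm_to_minutes` is a hypothetical helper written for this example, and it does not handle flights that cross midnight.

```r
# Convert a clock time stored as an hhmm integer (e.g. 517 means 05:17)
# into minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# Values copied from the first row of `flights` shown above
dep_time       <- 517
sched_dep_time <- 515
arr_time       <- 830
sched_arr_time <- 819

# Difference between actual and scheduled departure, in minutes
dep_diff <- hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)
dep_diff   # 2, which matches dep_delay in that row

# Difference between actual and scheduled arrival, in minutes
arr_diff <- hhmm_to_minutes(arr_time) - hhmm_to_minutes(sched_arr_time)
arr_diff   # 11

# Plain subtraction goes wrong once the hour changes:
# scheduled 0955 and actual 1005 is a 10 minute delay
1005 - 955                                     # 50
hhmm_to_minutes(1005) - hhmm_to_minutes(955)   # 10
```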

Important to share

Documenting and communicating

  • There is a lag before records appear on the site, which can be several months

  • There is a data dictionary that explains all the variables

  • Links at the bottom of the site tell you what the website collects about you when you visit (Privacy Policy)

  • There is no clear license or policy on usage - Hunt around and check!

What would you be curious about?

Breakout discussion

Remember: Start with a question

  • Decide what you are curious about.

  • Are the variables you need available in the data?

  • Look at the metadata and the data dictionary.

  • Also take a quick look for a licence

  • Discuss what type of data collection this is (e.g. experimental or observational? Census, survey sampling or occurrence?)

  • Discuss what the population is, and what a representative sample would look like.

Motivation

The American Statistical Association's Statistical Graphics and Computing Sections ran the 2009 Data Expo, a competition that provided all of the commercial flight records for air travel in the USA from October 1987 to April 2008.

About the competition data

  • nearly 120 million records
  • 12 GB of space uncompressed
  • 1.6 GB compressed

The data for the competition is still available because it was given a DOI: https://doi.org/10.7910/DVN/HG7NV7. 🤸‍♂️

The organisers provided instructions on how to set up an SQLite database and access it from R.



This RStudio site on accessing databases from R is a good starting place to read about working with an SQLite database.
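To give a feel for what working with an SQLite database from R looks like, here is a minimal sketch using the DBI and RSQLite packages. It is not the organisers' set-up: the table `ontime` is a made-up three-row stand-in held in memory, and the column names `Year`, `UniqueCarrier` and `ArrDelay` are assumptions chosen for illustration.

```r
library(DBI)

# Toy table standing in for the flight records (made-up values)
ontime <- data.frame(
  Year          = c(2007L, 2007L, 2008L),
  UniqueCarrier = c("AA", "DL", "AA"),
  ArrDelay      = c(12, -3, 45)
)

# ":memory:" creates a temporary in-memory database; giving a file
# path instead would create or open a database file on disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "ontime", ontime)

# Let the database do the aggregation, so that only the small
# summary table is brought back into R
delay_by_year <- dbGetQuery(con, "
  SELECT Year,
         COUNT(*)      AS n_flights,
         AVG(ArrDelay) AS mean_arr_delay
  FROM ontime
  GROUP BY Year
  ORDER BY Year
")

dbDisconnect(con)

delay_by_year
```

The design point is that with 120 million records you rarely want every row in R's memory. Summarising inside the database and returning only the result keeps the analysis feasible on a laptop.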

Questions provided

What one could get curious about

  • When is the best time of day/day of week/time of year to fly to minimise delays?
  • Do older planes suffer more delays?
  • How does the number of people flying between different locations change over time?
  • How well does weather predict plane delays?
  • Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
  • Participants could also decide for themselves what to analyse.
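Some of these questions need more than one table. For "Do older planes suffer more delays?", the flight records have to be joined to plane information by tail number. The sketch below shows one possible approach in base R, using made-up toy tables that borrow the nycflights13 column names; `flights_toy`, `planes_toy`, `year_built` and `plane_age` are hypothetical names for this example.

```r
# Toy versions of the `flights` and `planes` tables (made-up values),
# using the nycflights13 column names
flights_toy <- data.frame(
  year      = 2013L,                             # year of the flight
  tailnum   = c("N1", "N1", "N2", "N3", "N9"),
  dep_delay = c(5, 15, -2, 30, 0)
)
planes_toy <- data.frame(
  tailnum = c("N1", "N2", "N3"),
  year    = c(2004L, 1998L, 1985L)               # year of manufacture
)

# Both tables have a `year` column with a different meaning, so
# rename the one in planes before joining
names(planes_toy)[names(planes_toy) == "year"] <- "year_built"

# Left join: keep every flight, even if its plane is not in `planes`
joined <- merge(flights_toy, planes_toy, by = "tailnum", all.x = TRUE)
joined$plane_age <- joined$year - joined$year_built

# Mean departure delay for each plane age; the formula interface
# drops flights whose plane age is unknown (NA)
delay_by_age <- aggregate(dep_delay ~ plane_age, data = joined, FUN = mean)
delay_by_age
```

It is worth counting how many flights fail to match a plane (here, tail number "N9"), because a large unmatched share would limit what the result can say.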

Processing

Breakout-discussion

How would you start to process the data to answer …

  • When is the best time of day/day of week/time of year to fly to minimise delays?

  • Are some carriers operating more efficiently?

  • Do some carriers operate more broadly than others?

  • Do older planes suffer more delays?
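As one possible starting point for the first question, the sketch below derives an hour of day and a day of week for each flight and then averages the departure delay within each group. The data frame `flights_toy` contains made-up values and borrows column names from the downloaded file; `dep_hour` and `wday` are hypothetical names for this example.

```r
# Toy flight records (made-up values). CRS_DEP_TIME is the scheduled
# departure time as an hhmm integer, DEP_DELAY is in minutes.
flights_toy <- data.frame(
  FL_DATE      = as.Date(c("2020-01-06", "2020-01-06", "2020-01-07",
                           "2020-01-11", "2020-01-11")),
  CRS_DEP_TIME = c(615, 1730, 1745, 620, 1710),
  DEP_DELAY    = c(-2, 25, 35, 0, NA)
)

# Step 1: derive the grouping variables
flights_toy$dep_hour <- flights_toy$CRS_DEP_TIME %/% 100      # hour of day, 0 to 23
flights_toy$wday     <- as.POSIXlt(flights_toy$FL_DATE)$wday  # 0 = Sunday, ..., 6 = Saturday

# Step 2: summarise the delay within each group; the formula
# interface drops rows where DEP_DELAY is missing
delay_by_hour <- aggregate(DEP_DELAY ~ dep_hour, data = flights_toy, FUN = mean)
delay_by_wday <- aggregate(DEP_DELAY ~ wday, data = flights_toy, FUN = mean)

delay_by_hour
delay_by_wday
```

The same two steps, derive a grouping variable and then summarise, apply to the carrier questions, with the carrier code as the grouping variable. The mean is only one choice of summary; the median or the proportion of flights delayed beyond some threshold may be less sensitive to a few extreme delays.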

What did the prize winners do?

First prize 🏆

What is in your data?

It's good practice to show a useful view of the entire data set, to get a rough sense of the major patterns.

Think about

  • Temporal trend: A major component of this data is traffic patterns over time.

  • Spatial pattern: Airports are distributed across the country; explore how the traffic operates relative to this geography.

  • Carriers: Are some carriers operating more widely, or more efficiently?

High Level Overview

What you can expect

Overview Figure: Delays

Think about it 🤔



Choices

Delay was used in providing an overview.

  • What other aggregates could have been used?

  • Why was delay chosen?

Temporal trend

Temporal trend

Spatial

Carrier

Take-home messages

Same data: Another approach

Second prize 🏅

Analysis overview

  • Overview: flight paths over country
  • Analysis:
    • Traffic patterns over time, including 9/11, strikes and bankruptcies
    • Delays over time, and by day, hour
    • Airport efficiency
    • Carrier efficiency
    • Ghost flights: what’s a ghost flight?
    • Mapping traffic spatially, and animating
  • Curious findings

Processing

Think about the steps

As we work through the summary plots, think about:

  • what needs to be done to the data to get to this summary

  • what do you learn from each display, what’s expected, what’s surprising

  • what other ways might the same information be presented, or other calculations made

Traffic patterns over time

Number of flights in millions per year: volume increases steadily until 2001, with a big drop in 2002. Volume recovers in 2003 and flattens over 2004-2007, with another drop in 2008. What happened in 2001? What was happening in 2008?
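What needs to be done to the data to get to this summary? Each record is one flight, so the summary is a count of rows per year. The sketch below shows the idea on a made-up vector of nine flight years; `flight_years` and `flights_per_year` are hypothetical names, and the counts are invented, not taken from the figure.

```r
# Toy records: one entry per flight, holding only the year (made-up)
flight_years <- c(2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002)

# Count the number of flights in each year
counts <- table(flight_years)
flights_per_year <- data.frame(
  year      = as.integer(names(counts)),
  n_flights = as.vector(counts)
)

# Year-on-year change makes drops and recoveries easy to spot
flights_per_year$change <- c(NA, diff(flights_per_year$n_flights))

flights_per_year
```

One thing to check before reading too much into a drop is whether every year in the data covers all twelve months.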

Traffic patterns at selected airports

Delays

Delays, by year

Delays, by carrier

Delays, by airport

Delays, by day

Fuel use by carrier

Fuel efficiency

Ghost flights

Ghost flights, wasted fuel

Deeper look:

If you want to see what tools were used and why, you can review the site for the paper.

It contains a subset of the analysis materials, including data and code, available for download.

Summary

Working with wild data

Working with wild data can be daunting!

  1. Start with questions that might be answered using the data.

  2. Map out a pipeline to process the data, to address the question.

  3. Think about what might be expected, so results can be “externally validated”.

Summary

First Case Study

  • Generated insights from wild data!

  • Reviewed two approaches to the same data set

  • Saw the beauty of working with wild data - Same data, different approaches, different insights!

For your Assignment:

  • Saw an example of documenting your download process

  • Discussed what a representative sample would be in this example

  • Discussed how we would need to process our data to answer our questions

  • Importantly, considered data collection and data limitations. In this example, watch out for ghost flights!

Drop In

What we cover:

  • Explore the NSW Live Traffic Transport Data

  • Discuss the assignment template

  • Review R Projects and Quarto documents

  • Look at file paths, and how to avoid hard-coded direct references