Lecturer: Kate Saunders
Department of Econometrics and Business Statistics
What we’ll cover:
We are going to get curious about US airline traffic
Guide you to think about important variables in the data
Show you how others have gotten curious before you
Answer our curious questions using data visualisation
You can find the data here
Steps
Navigate to the airline on-time performance database by going to https://www.transtats.bts.gov/
Select “Aviation” from the left box
Then select “Airline On-Time Performance Data”
In the table, find “Reporting Carrier On-Time Performance (1987-present)” and click “Download”
This will bring you to an interface for choosing a subset.
THE DATA IS VERY BIG!!!
Not all of it might be relevant to the question you want to answer
To start you might like to download a sample to understand what the data looks like
If you were to download the data from the website: an example
Choose 2020 and January (before the pandemic hit the USA)
Select these variables: Year, Month, DayofMonth, DayOfWeek, FlightDate, Reporting_Airline, Tail_Number, Origin, Dest, CRSDepTime, DepTime, DepDelay, CRSArrTime, ArrTime, ArrDelay.
Click the “Download” button to get it onto your laptop. (No need to check pre-zipped.)
The resulting file is about 50 MB. The column names differ slightly from the form names, but are recognisable as the requested variables: YEAR, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, FL_DATE, OP_UNIQUE_CARRIER, TAIL_NUM, ORIGIN, DEST, CRS_DEP_TIME, DEP_TIME, DEP_DELAY, CRS_ARR_TIME, ARR_TIME, ARR_DELAY
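One way to confirm that a download contains the variables you asked for is to read only its header and compare it with the expected names. The sketch below is illustrative: `missing_columns()` is a hypothetical helper, and the temporary file with one made-up row stands in for the real download, whose file name is not given here.

```r
# Column names expected in the download (from the list of requested variables).
expected_cols <- c(
  "YEAR", "MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "FL_DATE",
  "OP_UNIQUE_CARRIER", "TAIL_NUM", "ORIGIN", "DEST",
  "CRS_DEP_TIME", "DEP_TIME", "DEP_DELAY",
  "CRS_ARR_TIME", "ARR_TIME", "ARR_DELAY"
)

# Hypothetical helper: report which expected columns are absent from a CSV file.
missing_columns <- function(path, expected = expected_cols) {
  # Read the header plus one row only, so a large file is not loaded in full.
  header <- names(read.csv(path, nrows = 1, check.names = FALSE))
  setdiff(expected, header)
}

# Stand-in for the real download: a temporary CSV with the header and one made-up row.
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c(paste(expected_cols, collapse = ","),
    "2020,1,1,3,2020-01-01,AA,N000AA,JFK,LAX,900,905,5,1200,1210,10"),
  demo_path
)

missing_cols <- missing_columns(demo_path)
missing_cols  # an empty character vector means every expected column is present
```

With a real download, replace `demo_path` with the path to your own file.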
Lucky for us there is an R package, nycflights13, with a sample of this data for us to take a quick look.
# A tibble: 1 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
# A tibble: 1 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 1 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_York
# A tibble: 1 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
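The one-row previews above look like the first row of the airlines, flights, airports and planes tables in nycflights13. The code that produced them is not shown, so the following is only a sketch of how similar output could be obtained, assuming the package is installed; `peek()` is a hypothetical helper name.

```r
# Return the first row of a table, as in the one-row previews.
peek <- function(tbl) head(tbl, 1)

# Run only if nycflights13 is available; otherwise install it first with
# install.packages("nycflights13").
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(peek(nycflights13::airlines))  # carrier codes and names
  print(peek(nycflights13::flights))   # one row per flight
  print(peek(nycflights13::airports))  # airport locations and time zones
  print(peek(nycflights13::planes))    # aircraft details by tail number
}
```

Looking at one row of each table is a quick way to see which variables could link the tables together, for example `carrier` and `tailnum`.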
Documenting and communicating
Records appear on the site with a lag, which can be several months
There is a data dictionary that explains all the variables
Links at the bottom of the site tell you what the web site collects about you when you visit (Privacy Policy)
There is no clear licence or policy on usage - Hunt around and check!
Breakout discussion
Remember: Start with a question
Decide what you are curious about
Are the variables you need available in the data?
Look at the metadata and the data dictionary.
Also take a quick look for a licence
Discuss: what type of data collection is this? (e.g. experimental or observational? Census, survey sampling or occurrence?)
Discuss: what is the population? And what does a representative sample look like?
The American Statistical Association's Statistical Graphics and Computing Sections ran the 2009 Data Expo, a competition that provided all of the commercial flight records for air travel in the USA from October 1987 to April 2008.
The data for the competition is still available because it was given a DOI: https://doi.org/10.7910/DVN/HG7NV7.
Organisers provided instructions on how to set up an sqlite database, and access from R.
This RStudio site on accessing databases from R is a good starting place to read about working with an SQLite database.
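As a rough illustration of database access from R, the sketch below uses the DBI and RSQLite packages (assumed to be installed) with an in-memory database. The table name `ontime`, the column names and the three rows are assumptions made up for illustration; they are not the organisers' actual setup.

```r
library(DBI)

# In-memory database as a stand-in for a database file on disk.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up rows, for illustration only.
toy <- data.frame(
  Year = c(2007L, 2007L, 2008L),
  UniqueCarrier = c("AA", "UA", "AA"),
  ArrDelay = c(10, -3, 25)
)
dbWriteTable(con, "ontime", toy)

# Let the database do the aggregation, so only the small summary comes into R.
per_year <- dbGetQuery(
  con,
  "SELECT Year, COUNT(*) AS n_flights, AVG(ArrDelay) AS mean_arr_delay
   FROM ontime
   GROUP BY Year
   ORDER BY Year"
)

dbDisconnect(con)
per_year
```

For data too large to fit in memory, a design like this keeps the full records in the database and only pulls summaries into R.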
What one could get curious about
Breakout discussion
How would you start to process the data to answer …
When is the best time of day/day of week/time of year to fly to minimise delays?
Are some carriers operating more efficiently?
Do some carriers operate more broadly than others?
Do older planes suffer more delays?
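As one possible starting point for questions like these, most of the processing comes down to grouping and summarising, sometimes after joining a second table. The sketch below uses base R on small made-up data frames; the column names and values are assumptions for illustration only.

```r
# Made-up flights, for illustration only.
flights <- data.frame(
  day_of_week = c("Mon", "Mon", "Fri", "Fri"),
  tailnum     = c("N1", "N2", "N1", "N2"),
  year        = 2013,
  arr_delay   = c(5, NA, 30, 10)
)

# Made-up plane details, keyed by tail number.
planes <- data.frame(tailnum = c("N1", "N2"), year_built = c(1990, 2010))

# Best day to fly: mean arrival delay per day of week.
# The formula interface drops rows with a missing delay.
delay_by_day <- aggregate(arr_delay ~ day_of_week, data = flights, FUN = mean)

# Older planes: join on tail number, compute age, then summarise delay by age.
with_age <- merge(flights, planes, by = "tailnum")
with_age$age <- with_age$year - with_age$year_built
delay_by_age <- aggregate(arr_delay ~ age, data = with_age, FUN = mean)

delay_by_day
delay_by_age
```

Deciding how to treat flights with no recorded delay is itself part of the processing, and worth discussing before trusting any summary.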
It's good practice to show a useful view of the entire data, to get a rough sense of major patterns.
Think about
Temporal trend: A major component of this data is traffic patterns over time.
Spatial pattern: Airports are distributed across the country, explore how the traffic operates relative to this geography.
Carriers: Are some carriers operating more widely, or more efficiently?
Choices
Delay was used in providing an overview.
What other aggregates could have been used?
Why was delay chosen?
Think about the steps
As we work through the summary plots, think about:
what needs to be done to the data to get to this summary
what do you learn from each display, what’s expected, what’s surprising
what other ways might the same information be presented, or other calculations made
Number of flights in millions per year: steadily increasing volume until 2001, with a big drop in 2002. Volume recovers in 2003 and flattens over 2004-2007, with another drop in 2008. What happened in 2001? What was happening in 2008?
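To get to a summary like flights per year, the records need to be counted by year and rescaled to millions. A minimal base R sketch, using a made-up data frame with one row per flight (the real data would have millions of rows per year):

```r
# Made-up records, one row per flight, for illustration only.
flights <- data.frame(Year = c(2000, 2000, 2001, 2001, 2001, 2002))

# Count the rows in each year.
per_year <- as.data.frame(table(Year = flights$Year), responseName = "n_flights")

# table() returns the years as a factor; convert back to integers for plotting.
per_year$Year <- as.integer(as.character(per_year$Year))

# Express the counts in millions, as in the summary plot.
per_year$millions <- per_year$n_flights / 1e6

per_year
```

A line plot such as `plot(per_year$Year, per_year$millions, type = "b")` would then show the trend over time.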
If you want to look at what tools were used and why, you can review the following paper's site.
It contains a subset of the analysis materials including data and code for downloading.
Working with wild data can be daunting!
Start with questions that might be answered using the data.
Map out a pipeline to process the data, to address the question.
Think about what might be expected, so results can be “externally validated”.
First Case Study
Generated insights from wild data!
Reviewed two approaches to the same data set
Saw the beauty of working with wild data - Same data, different approaches, different insights!
For your Assignment:
Saw an example of documenting your download process
Discussed what a representative sample would be in this example
Discussed how we would need to process our data to answer our questions
Importantly, considered data collection and data limitations. In this example, watch out for Ghost Flights!
What we cover:
Explore the NSW Live Traffic Transport Data
Discuss the assignment template
Review R Projects and Quarto documents
Look at file paths, to avoid direct (hard-coded) referencing
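As a small preview of the file path idea, the sketch below builds a path relative to the working directory instead of hard-coding a location on one particular computer. The folder name `data` and the file name `flights_2020_01.csv` are hypothetical.

```r
# Build the path from its parts, relative to the working directory.
# "data" and "flights_2020_01.csv" are hypothetical names.
data_path <- file.path("data", "flights_2020_01.csv")

# Fragile alternative to avoid: an absolute path that only exists on one
# computer, such as "C:/Users/me/Documents/...".

if (file.exists(data_path)) {
  flights <- read.csv(data_path)
} else {
  message("File not found at ", data_path,
          "; check the working directory with getwd().")
}
```

Because the path does not mention any one computer's folders, the same code can run for anyone who has the project folder.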
ETC5512