Tutorial Solution

Exercise 1: US air traffic

a. Download

b. How was the data collected?

  1. Who has the oversight for the data provision?

Office of Airline Information, Bureau of Transportation Statistics

  1. Who reports the data to the data provider?

Reporting carriers are required to report on-time data for the flights they operate; some carriers also report voluntarily.

  1. How is the data collected?

This is like a census for all ‘US certified air carriers that account for at least one percent of domestic scheduled passenger revenues’, containing information on each commercial flight that carried passengers.

  1. Is this open data? What type of license is provided? What are you allowed to do with the data?

Yes, this is open data. There is no obvious license, but it falls under the US open data policy guidelines. Check the "Web policies" link at the bottom of the page.

  1. What information is in each row of the data set?

One row is one operated flight. You can think of this as event data. Usually you will want to aggregate it in different ways for analysis.

c. Data quality checks

  1. Read in the data
  2. Check that the dates cover the month of January 2020.
  3. Count number of flights by carrier. Who has the most flights?
  4. Which airport has the most traffic? Does every airport have the same number of incoming and outgoing flights?
##    FL_DATE         
##  Length:538837     
##  Class :character  
##  Mode  :character
## # A tibble: 15 × 2
##    OP_UNIQUE_CARRIER      n
##    <chr>              <int>
##  1 WN                112430
##  2 DL                 75174
##  3 AA                 74999
##  4 UA                 56657
##  5 OO                 50347
##  6 YX                 24476
##  7 B6                 23249
##  8 NK                 21876
##  9 AS                 19801
## 10 MQ                 18849
## 11 9E                 16926
## 12 OH                 15456
## 13 F9                 13285
## 14 G4                  8615
## 15 HA                  6697

Helpful hint: If you are having difficulty running this code, check that the file name matches, check your file path, and check your project directory.

WN has the largest number of flights. This is the low-cost carrier Southwest Airlines.
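The date and carrier checks above could be sketched roughly as follows. This is a minimal sketch, not the code that produced the output: the toy data frame stands in for the downloaded file (whose name is not given here), and the ISO date format is an assumption. The column names `FL_DATE` and `OP_UNIQUE_CARRIER` are taken from the output above.

```r
library(dplyr)

# Toy stand-in for the downloaded data (hypothetical values).
# In practice this would come from reading your own downloaded CSV file.
flights <- data.frame(
  FL_DATE = c("2020-01-01", "2020-01-15", "2020-01-31", "2020-01-31"),
  OP_UNIQUE_CARRIER = c("WN", "DL", "WN", "AA"),
  stringsAsFactors = FALSE
)

# FL_DATE is read as character, so convert it before checking the range.
# The "%Y-%m-%d" format is an assumption; adjust it to match your file.
fl_dates <- as.Date(flights$FL_DATE, format = "%Y-%m-%d")
date_range <- range(fl_dates)
all_in_jan_2020 <- all(
  fl_dates >= as.Date("2020-01-01") & fl_dates <= as.Date("2020-01-31")
)

# Count flights per carrier, largest first.
carrier_counts <- flights %>% count(OP_UNIQUE_CARRIER, sort = TRUE)
```

If `all_in_jan_2020` is `FALSE`, or `date_range` does not span the whole month, the download probably used the wrong filter.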

## # A tibble: 351 × 2
##    ORIGIN     n
##    <chr>  <int>
##  1 ATL    32190
##  2 ORD    25661
##  3 DFW    24339
##  4 DEN    20398
##  5 CLT    19995
##  6 LAX    17799
##  7 PHX    15325
##  8 IAH    14792
##  9 LAS    14186
## 10 LGA    13836
## # … with 341 more rows

For each airport, the number of outgoing flights is the same as (or very close to) the number of incoming flights, as we would expect. ATL (Atlanta, Georgia) is the busiest airport.
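One way to compare incoming and outgoing traffic per airport is to count each direction separately and join the two tables. The sketch below uses a toy data frame with hypothetical values. `ORIGIN` appears in the output above; `DEST` as the name of the destination column is an assumption, so check it against your file.

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values).
flights <- data.frame(
  ORIGIN = c("ATL", "ATL", "ORD", "DFW", "ORD"),
  DEST   = c("ORD", "DFW", "ATL", "ATL", "DFW"),
  stringsAsFactors = FALSE
)

# Count departures and arrivals per airport.
outgoing <- flights %>% count(ORIGIN, name = "n_out") %>% rename(airport = ORIGIN)
incoming <- flights %>% count(DEST, name = "n_in") %>% rename(airport = DEST)

# A full join keeps airports that appear in only one direction;
# their missing count is set to zero.
traffic <- full_join(outgoing, incoming, by = "airport") %>%
  mutate(
    n_out = coalesce(n_out, 0L),
    n_in = coalesce(n_in, 0L),
    diff = n_out - n_in
  ) %>%
  arrange(desc(n_out + n_in))
```

Airports with a large `diff` relative to their total traffic would be worth a closer look.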

Exercise 2: National Longitudinal Survey of Youth

a. Data download

  1. The data arrives with four variables R0000100, R0173600, R0214700, R0214800. Read the codebook to find out what these are.

R0000100 is a unique ID for each individual, R0173600 is a sex and race survey question with values from 1 to 20, R0214700 is race with three categories, and R0214800 is sex with only two categories.

b. License and usage

  1. At the bottom of the web site are links that can help to determine what the allowed uses are. What information does the data provider keep about you?

Only your email address.

  1. Is there a license provided with the data? What sort of open data is this? The documentation says that this is public use data. What do you think “public use” means?

https://data.gov/privacy-policy.html#license says “U.S. Federal data available through Data.gov is offered free and without restriction.”

c. About the data

  1. How was this data collected?

Observational data collected by survey sampling: “a cross-sectional sample of 6,111 respondents designed to represent the noninstitutionalized civilian segment of people living in the United States in 1979 and born between January 1, 1957, and December 31, 1964 (ages 14-21 as of December 31, 1978)”

  1. Check the levels in the downloaded data. Do they match the codebook? How many individuals are in the sample?
## # A tibble: 3 × 2
##   R0214700     n
##      <dbl> <int>
## 1        1  2002
## 2        2  3174
## 3        3  7510
## # A tibble: 2 × 2
##   R0214800     n
##      <dbl> <int>
## 1        1  6403
## 2        2  6283
## # A tibble: 20 × 2
##    R0173600     n
##       <dbl> <int>
##  1        1  2236
##  2        2   203
##  3        3   346
##  4        4   218
##  5        5  2279
##  6        6   198
##  7        7   405
##  8        8   226
##  9        9   742
## 10       10  1105
## 11       11   729
## 12       12   901
## 13       13  1067
## 14       14   751
## 15       15   609
## 16       16   162
## 17       17    53
## 18       18   342
## 19       19    89
## 20       20    25
## # A tibble: 1 × 1
##       n
##   <int>
## 1 12686
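In the output above, the counts for each variable sum to 12,686, which matches the total number of rows. A minimal sketch of these checks is given below, using a toy data frame with hypothetical values in place of the downloaded file. The expected level ranges (1 to 3, 1 to 2, 1 to 20) are those described in the codebook summary earlier in this exercise.

```r
library(dplyr)

# Toy stand-in for the downloaded NLSY data (hypothetical values).
nlsy <- data.frame(
  R0000100 = 1:6,
  R0173600 = c(1, 5, 9, 13, 2, 20),
  R0214700 = c(3, 3, 1, 2, 3, 2),
  R0214800 = c(1, 2, 1, 2, 2, 1)
)

# Tabulate the levels of each categorical variable.
race_counts <- nlsy %>% count(R0214700)
sex_counts <- nlsy %>% count(R0214800)
sex_race_counts <- nlsy %>% count(R0173600)

# Do the observed levels fall within the ranges given in the codebook?
levels_ok <- all(nlsy$R0214700 %in% 1:3) &&
  all(nlsy$R0214800 %in% 1:2) &&
  all(nlsy$R0173600 %in% 1:20)

# Number of individuals, and a check that each ID appears only once.
n_individuals <- n_distinct(nlsy$R0000100)
ids_unique <- !any(duplicated(nlsy$R0000100))
```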

Exercise 3: Atlas of Living Australia

  1. Point your browser to https://www.ala.org.au. Check the terms of use. Does it have a license?

Yes, Creative Commons.

  1. Using the galah library and its occurrences function, extract the records for platypus.

Helpful tip: To run the code to download the data, you must first register your email address on the ALA web page.
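A download along these lines might look like the sketch below. It assumes a recent version of galah, where the relevant functions are `galah_config()`, `galah_call()`, `galah_identify()` and `atlas_occurrences()`; function names have changed between galah versions, so check the documentation for the version you have installed. The wrapper `download_platypus()` and the email address are hypothetical, and calling the function needs the galah package, an internet connection and a registered email.

```r
# Hypothetical helper wrapping the galah download steps.
# Nothing is downloaded until the function is called.
download_platypus <- function(email) {
  # Tell galah which registered email address to use for downloads.
  galah::galah_config(email = email)

  # Build a query for platypus (Ornithorhynchus anatinus) and fetch records.
  query <- galah::galah_call()
  query <- galah::galah_identify(query, "Ornithorhynchus anatinus")
  galah::atlas_occurrences(query)
}

# Usage (not run here): replace with the email you registered with ALA.
# platypus <- download_platypus("your-email@example.com")
```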

b. Data quality checks

  1. Plot the locations of sightings. Where in Australia are platypus found?

Platypus are mostly found along the east coast of Australia, and also in Tasmania.

  1. What dates of sightings are downloaded?

1770 through to 2022
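The date range can be checked with a few lines of base R. The sketch below uses a toy data frame with hypothetical values. The column names `eventDate`, `decimalLatitude` and `decimalLongitude`, and the ISO date-time text format, are assumptions about what the download returns, so adjust them to match your data.

```r
# Toy stand-in for the downloaded occurrence records (hypothetical values).
platypus <- data.frame(
  decimalLatitude = c(-37.8, -42.9, -33.9, -27.5),
  decimalLongitude = c(145.0, 147.3, 151.2, 153.0),
  eventDate = c("1770-01-01T00:00:00Z", "1998-06-12T00:00:00Z",
                NA, "2022-03-05T00:00:00Z"),
  stringsAsFactors = FALSE
)

# Keep only the date part (first 10 characters) and convert to Date.
sighting_dates <- as.Date(substr(platypus$eventDate, 1, 10))

# Range of sighting dates, ignoring records with no date.
date_range <- range(sighting_dates, na.rm = TRUE)
year_range <- as.integer(format(date_range, "%Y"))

# How many records have no usable date?
n_missing_dates <- sum(is.na(sighting_dates))

# Rough location check: the span of the coordinates.
lat_range <- range(platypus$decimalLatitude, na.rm = TRUE)
lon_range <- range(platypus$decimalLongitude, na.rm = TRUE)
```

Very old dates and records with missing dates are worth inspecting individually before using them in an analysis.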

c. Data collection methods

How is this data collected? Explain the ways that a platypus sighting would be added to the database. Also think about what might be missing from the data.

This data is mostly provided on a voluntary basis, some by researchers and some by citizen scientists. This means that it is not collected systematically, so there may be locations where platypus live but where people do not go. These places would then not be represented in the database.

Material developed by Prof Di Cook and maintained by Dr. Kate Saunders