In this tutorial, you will learn to:
- access and use open data sources
- assess the collection methods and the quality of the data
- write computer code to wrangle and analyse data
Exercise 2A: download data
Download education data from the U.S. Census Bureau using the link https://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=. Complete the following tasks:
- Select Geographies; select a geographic type: County 050
- Select all the following states: Alabama, Kentucky, Missouri, South Carolina, Texas, West Virginia
- Add all counties within these states to your selection.
- Select Topics; People, Education, Educational Attainment
- Select 2010 and download data
- Put the downloaded files in your working directory and use read.csv to load the data into your workspace:
edu <- read.csv(???, header = TRUE)
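As a rough sketch of how read.csv behaves, the snippet below parses a small made-up table passed through the `text` argument instead of a file. The column names and values are invented for illustration and are not taken from the Census download; with a real file, the first argument would be the name of the file in your working directory.

```r
# Toy CSV content (hypothetical values, not Census data)
csv_text <- "County,Percent
Alpha County,12.5
Beta County,30.1"

# header = TRUE tells read.csv that the first line holds the column names
toy <- read.csv(text = csv_text, header = TRUE)

str(toy)   # inspect the column types
head(toy)  # look at the first rows
```

Checking the result with str() and head() right after loading is a cheap way to spot columns that were read as text instead of numbers.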
Exercise 2B: data collection
Answer the following questions for the downloaded data:
- Which sampling method is used in the survey data?
- How many interviews are used to construct the data for Alabama?
- What is the coverage rate for Texas, and what does this number mean?
- Which four primary sources of nonsampling error are addressed in constructing the data set?
Exercise 2C: metadata
Provide the following metadata for the downloaded data:
- Temporal coverage
- Geographic coverage
Exercise 2D: data analysis
Complete the following tasks:
- Select the variable “Total; Estimate; Percentage bachelor’s degree or higher” and the variable with county names.
- Delete the row with the column descriptions
- Delete observations with missing values
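library(dplyr)  # provides %>% and select()
edu_clean <- edu %>%
  select(County = GEO.display.label, PercentBachelorOrHigher = ???) %>%
  # continue the pipeline here with the remaining cleaning steps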
- Find the county with the largest percentage of people with a bachelor’s degree or higher
- Find the county with the smallest percentage of people with a bachelor’s degree or higher
- Report the mean and standard deviation of “Percentage bachelor’s degree or higher” across all counties
- Why did we look at the 2010 survey instead of the most recent data?
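The cleaning and summary steps of this exercise can be sketched on a toy table. This is only an illustration under stated assumptions: the column name `PCT`, the county names, the values, and the "(X)" marker for a missing value are all hypothetical, so check the real download for its actual column names and missing-value codes.

```r
library(dplyr)

# Hypothetical stand-in for the downloaded table: the first data row holds
# the column descriptions, and "(X)" is assumed to mark a missing value.
toy <- data.frame(
  GEO.display.label = c("Geography", "Alpha County", "Beta County", "Gamma County"),
  PCT = c("Percent bachelor's degree or higher", "12.5", "(X)", "30.1"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  select(County = GEO.display.label, PercentBachelorOrHigher = PCT) %>%
  slice(-1) %>%  # drop the row with the column descriptions
  mutate(PercentBachelorOrHigher =
           suppressWarnings(as.numeric(PercentBachelorOrHigher))) %>%  # non-numeric text becomes NA
  filter(!is.na(PercentBachelorOrHigher))  # drop observations with missing values

# Counties with the largest and smallest percentage
top <- toy_clean %>%
  filter(PercentBachelorOrHigher == max(PercentBachelorOrHigher))
bottom <- toy_clean %>%
  filter(PercentBachelorOrHigher == min(PercentBachelorOrHigher))

# Mean and standard deviation over all remaining counties
stats <- toy_clean %>%
  summarise(Mean = mean(PercentBachelorOrHigher),
            SD = sd(PercentBachelorOrHigher))
```

The conversion to numeric matters because a description row in the data forces the whole column to be read as text, so max(), mean() and sd() would otherwise fail or compare strings.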
Exercise 2E: merge data
- Follow the same steps as in Exercise 2A, but now replace (Education and Educational Attainment) from the topics list with (Age & Sex and Age) to download the QT-P1 2010 SF1 100% Data.
- Follow the same steps as in Exercise 2D, but now replace “Total; Estimate; Percentage bachelor’s degree or higher” with “Percent - Both sexes; Total population - Under 18 years”.
- Merge the education data with the age data.
age <- read.csv(???, header = TRUE)
age_clean <- age %>%
  select(County = GEO.display.label, Percentunder18 = ???) %>%
  # continue the pipeline here with the same cleaning steps as in Exercise 2D
edu_age <- inner_join(???, ???, ???)
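To see what an inner join does before applying it to the real tables, here is a minimal sketch on two made-up data frames. The county names and values are hypothetical, and joining on a shared `County` column is an assumption about how the cleaned tables are keyed.

```r
library(dplyr)

# Two toy tables that share a County column (hypothetical values)
edu_toy <- data.frame(
  County = c("Alpha County", "Beta County", "Gamma County"),
  PercentBachelorOrHigher = c(12.5, 18.0, 30.1),
  stringsAsFactors = FALSE
)
age_toy <- data.frame(
  County = c("Alpha County", "Gamma County", "Delta County"),
  Percentunder18 = c(24.0, 21.5, 26.3),
  stringsAsFactors = FALSE
)

# inner_join keeps only the counties that appear in both tables
toy_joined <- inner_join(edu_toy, age_toy, by = "County")
toy_joined
```

Comparing nrow() of the joined table with the row counts of the inputs shows how many counties were dropped because they had no match in the other table.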