🎯 Objectives

In this tutorial, you will learn

πŸ”§ Preparation

  1. Set up your project folder

For the last few weeks, we’ve been sharing a folder with a project. It’s now time for you to start setting up your own to work through the exercise.

Below is an example of how to get started.

  1. Getting the census data

    • Go to https://www.abs.gov.au/census/find-census-data/datapacks selecting the following options:

      • 2021 Census Datapacks
      • General Community Profile
      • All geographies
      • Vic

      The downloaded file would be called 2021_GCP_ALL_for_Vic_short-header.zip.

    • Unzip this file and save the folder in the data folder you set up.

  2. Installing the here R-package

We use this for referencing files relative to your current working directory.

install.packages("here")
  1. Check your project structure

    Your project structure should look like below.

tutorial-04
β”œβ”€β”€ analysis
β”‚Β Β  └── exercise-04.Rmd
β”œβ”€β”€ data
β”‚Β Β  └── 2021_GCP_ALL_for_Vic_short-header
β”‚Β Β      └── 2021\ Census\ GCP\ All\ Geographies\ for\ VIC
β”‚Β Β          β”œβ”€β”€ SA1
β”‚Β Β          β”‚Β Β  └── VIC
β”‚Β Β          β”‚Β Β      β”œβ”€β”€ 2021Census_G17A_VIC_SA1.csv
β”‚Β Β          β”‚Β Β      β”œβ”€β”€ 2021Census_G17B_VIC_SA1.csv
β”‚Β Β          β”‚Β Β      └── 2021Census_G17C_VIC_SA1.csv
β”‚Β Β          └── STE
β”‚Β Β              └── VIC
β”‚Β Β                  β”œβ”€β”€ 2021Census_G17A_VIC_STE.csv
β”‚Β Β                  β”œβ”€β”€ 2021Census_G17B_VIC_STE.csv
β”‚Β Β                  └── 2021Census_G17C_VIC_STE.csv
└── tutorial-04.Rproj

🚢 Exercise 4A

Practice regular expression

Consider the following vector A and B. The first letter has information on the sex, followed by minimum income, maximum income, minimum age, and maximum age separated by _, with some exceptions for certain age and income categories (e.g.Β Neg_Nil is negative or nil income, 85ov is 85 years or over). Wrangle this vector to extract the information on its own vector or a column in a data frame. Try using the functions str_split(), str_replace() and or str_remove() to do this.

Hint
# some examples to help you
library(tidyverse)
str_split("A_B_C_D", "_")
## [[1]]
## [1] "A" "B" "C" "D"
str_replace("Neg_nil_income", "Neg", "-Inf")
## [1] "-Inf_nil_income"
str_replace("Neg_nil_income", "nil", "0")
## [1] "Neg_0_income"
str_replace("we_want_fourthings", "rt", "r_t")
## [1] "we_want_four_things"
str_remove("we_want_to_remove_the_extra_stuff", "to_remove_the_extra_")
## [1] "we_want_stuff"
A <- c("F_300_399_55_64_yrs", 
       "F_1_149_35_44_yrs", 
       "F_150_299_25_34_yrs", 
       "M_800_999_75_84_yrs", 
       "M_400_499_65_74_yrs")

B <- c(A, 
       "F_1_149_15_19_yrs", 
       "F_150_299_85ov", 
       "M_1500_1749_85ov", 
       "M_Neg_Nil_income_20_24_yrs", 
       "M_150_299_25_34_yrs")

πŸ‹ Exercise 4B

Tidy your census data

Start with loading the tidyverse package and setting the path ways to the data.

library(tidyverse)
census_path <- here::here("data/2021_GCP_all_for_VIC_short-header/2021 Census GCP All Geographies for VIC")

SA1_paths <- glue::glue(census_path, "/{geo}/VIC/2021Census_G17{alpha}_VIC_SA1.csv",
                        geo = "SA1", alpha = c("A", "B", "C"))

STE_paths <- glue::glue(census_path, "/{geo}/VIC/2021Census_G17{alpha}_VIC_STE.csv",
                        geo = "STE", alpha = c("A", "B", "C"))

To load the data we will use the here function from the here package. The path argument parsed to here::here will be referenced with respect to the .Rproj file. This is particularly helpful if your project structure becomes deeply nested.

tutorial-04
β”œβ”€β”€ analysis
β”‚Β Β  └── exercise-04.Rmd      # << analysis are here
β”œβ”€β”€ data                  # << data are here 
β”‚Β Β              
β”‚Β Β 
└── tutorial-04.Rproj  # << make reference point `here`

First focus on 2021Census_G17A_VIC_STE.csv, and develop an understanding of what R reads in using the function str. This will help you develop an idea of the data structure.

STE_G17A <- read_csv(STE_paths[1])
str(STE_G17A)

Part of our mission today is to make this data into a tidy form. The first step to this is pivoting the data from it’s wide format into a long format. Check out the tidyverse cheat sheet for some extra help understanding how the data is being combined. Use View() to compare STE_G17A before and STE_G17A_long after.

STE_G17A_long <- STE_G17A |>
  pivot_longer(cols = -1, names_to = "category",
               values_to = "count")
data_paths <- STE_paths
  
# Read in each of the three tables
tbl_G17A_long <- read_csv(data_paths[1]) |>
  pivot_longer(cols = -1, names_to = "category",
               values_to = "count")

tbl_G17B_long <- read_csv(data_paths[2]) |>
  pivot_longer(cols = -1, names_to = "category",
               values_to = "count")

tbl_G17C_long <- read_csv(data_paths[3]) |>
  pivot_longer(cols = -1, names_to = "category",
               values_to = "count")

# Combine all the data together
tbl_G17_long <- bind_rows(tbl_G17A_long, tbl_G17B_long, tbl_G17C_long)

We can sanity check our operations by looking at the changes in the dimension of the data.

nrow(tbl_G17A_long) + nrow(tbl_G17B_long) + nrow(tbl_G17C_long)
nrow(tbl_G17_long) 

The next step is to separate the category column out into multiple different variables for sex, age_min, age_max income_min, and income_max. There is a handy function that can help us do that called separate_wider_delim(), which is similar to str_split() but works for splitting columns in data frames.

The catch with using separate_wider_delim() is that we want all entries in the column to be of a similar format. However, there are lot of weird cases that we would need to change to make look more standard.

  1. Neg_Nil_income \(\rightarrow\) change to -Inf_0.
    1*) Negtve_Nil_income \(\rightarrow\) change to -Inf_0.
  2. more \(\rightarrow\) Inf.
  3. PI_NS \(\rightarrow\) don’t want these columns for today (like NA_NA).
  4. 85ov \(\rightarrow\) change to 85_110. 4*) 85_yrs_ov \(\rightarrow\) change to 85_110.
  5. Tot \(\rightarrow\) don’t want these columns (can reproduce them from the data).

One way to find these edge cases is by looking through the meta data. Another is to count the underscores of each entry in the column category.

underscore_count_per_category = str_count(string = tbl_G17_long$category, pattern = "_")
table(underscore_count_per_category)
## underscore_count_per_category
##   2   3   4   5   6 
##   6  84  30 339  51
tbl_G17_long_formatted <- tbl_G17_long |>
  filter(!str_detect(string = category, pattern = "Tot"),
         !str_detect(category, "PI_NS"))
tbl_G17_long_formatted <- tbl_G17_long_formatted |> 
  mutate(
    category = str_replace(category, "Neg_Nil_income", "-Inf_0"),
    category = str_replace(category, "Neg_Nil_incme", "-Inf_0"),
    category = str_replace(category, "Negtve_Nil_incme", "-Inf_0"),
    category = str_replace(category, "more", "Inf"),
    category = str_replace(category, "85ov", "85_110_yrs"),
    category = str_replace(category, "85_yrs_ovr", "85_110_yrs"))

We can do a quick check to see if our changes make the entries in columns category have the same number of β€œ_” separators.

underscore_count_per_category = str_count(tbl_G17_long_formatted$category, "_")
table(underscore_count_per_category)
## underscore_count_per_category
##   5 
## 405

Looks good, so let’s separate out those different entries into columns. Remember to check your result using View().

tbl_G17_tidy <- tbl_G17_long_formatted |>
  mutate(category = str_remove(category, "_yrs")) |>
  separate_wider_delim(cols = category, delim = "_",
                       names = c("sex", "income_min", "income_max", "age_min", "age_max")) 
View(tbl_G17_tidy)

Later on we are going to do some graphing using different categorical ranges. So let’s also create two new variables for age bracket and income bracket.

tbl_G17_tidy = tbl_G17_tidy |>
  unite("income", c(income_min, income_max), remove = FALSE) |>
  unite("age", c(age_min, age_max), remove = FALSE)
View(tbl_G17_tidy)

Take time to note here just how many different steps we needed to do to wrangle our data into a useful format for our filtering and plotting needs. Figuring out all these steps yourself would take a lot of time! To get quicker at understanding what is in your data and how to format it for an analysis will come with time and practice.

πŸƒ Exercise 4C

Quality check your census data

Note to make your labels appear in the correct order on your axes you will need to run the following code.

tbl_G17_tidy$income <- fct_reorder(tbl_G17_tidy$income,
                                   as.numeric(tbl_G17_tidy$income_min))
Hint: click here for a code for the 7th barplot
tbl_G17_tidy %>%
  filter(sex != "P") |>
  ggplot(aes(x = age, y = count, fill = sex)) +
  geom_col(position = "dodge") +
  facet_wrap(~income) +
  theme_bw(base_size = 12) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.3)) +
  ggtitle("7. Income, sex & age distribution in Victoria from 2021 Census")

πŸ’ƒ Exercise 4D

Extract statistics from your tidy census data

According to the 2021 Census data:

Material originally developed by Dr.Β Emi Tanaka and maintained by Dr.Β Kate Saunders