ETC5512: Wild Caught Data

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox and occasionally need to be refreshed if elements do not load properly. See <a href="lecture-04.pdf">here for the PDF <i class="fas fa-file-pdf"></i></a>. 
]

<br>

---

# .monash-blue[ETC5512: Wild Caught Data]

<br>

<h2 style="font-weight:900!important;">Australian census</h2>

.bottom_abs.width100[

Lecturer: *Kate Saunders*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i>  ETC5512.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 4

<br>

]

---

# Recall from lecture 2:

Therefore, we often can only collect data about a subset of the population.
]

<br><br>
<center>
If we can collect data about the entire population, that is called a .monash-blue[**census**]
</center>
---

<div class="w-10 absolute right-0 rotate" style="top:30%;">
<b>Advantages</b>
</div>

<div class=" w-10 absolute right-0 rotate" style="top:70%;">
<b>Disadvantages</b>
</div>

.flex.h-40[
.w-50.monash-bg-blue.pa2.white[
]
.w-45.bg-white[
]]

.flex.h-45[
.w-50.bg-white[
]
.w-45.monash-bg-blue.white[

]]

---

<div class="w-10 absolute right-0 rotate" style="top:30%;">
<b>Advantages</b>
</div>

<div class=" w-10 absolute right-0 rotate" style="top:70%;">
<b>Disadvantages</b>
</div>

.flex.h-40[
.w-50.monash-bg-blue.pa2.white[

* Reduces cost
* Timely collection of data

]
.w-45.bg-white[

* Data available, even for small geographical areas or subpopulations
* Statistics are not subject to sampling error
* Better accuracy and details

]]

.flex.h-45[
.w-50.bg-white[

* Lack of data on sub-population (particularly minorities) or small geographical areas
* Requires careful construction of sampling design
* Estimates are subject to sampling error
* The estimates may not be accurate or reliable 
* Estimating and communicating precision of estimates is difficult

]
.w-45.monash-bg-blue.white[

* Expensive or infeasible
* Time consuming to collect all data

]]

---

.aim-box.w-70.tl[
Today is all about the Australian Census:

* Learn about what the census is and how it is collected 
* Learn what data on population demographics is collected
* Learn how the census data is stored and how to access it

]

.aim-box.w-70.tl[
From a coding perspective:

* Learn about organising your data the **tidy data** way. 
* Learn to **manipulate strings** and a bit about regular expressions.

]

---

## What is the Australian census?

.flex[
.w-50[
* The first Australian census was held in 1911. 
{{content}}
]
.w-50.center[
<img src="images/lecture-07/census-form.png" width = "70%"/>

]]

* Since 1961, the census has been held **every 5 years** in Australia  
{{content}}

<br>

* The next census is in 2026.
{{content}}
<br>

* Counts **every person and household** in Australia.  
*(well, almost everyone: the 2021 census had a 96% participation rate, which is very high.)*
{{content}}

<br>

* Comprehensive snapshot of the country and **tells the story of how we are changing**. 
{{content}}

<br>

* The Australian Bureau of Statistics (ABS) is legislated to collect and disseminate census data under the *ABS Act 1975* and *Census and Statistics Act 1905*. 
{{content}}

* For more details refer to the [ABS website](https://www.abs.gov.au/census/about-census/australian-census).
{{content}}

---

# What is the Australian Bureau of Statistics (ABS)?

* ABS is the independent statistical agency of the Government of Australia. 
{{content}}

]
.w-30.center[

]]

* If you are from outside Australia, find the statistical government agency in your country <i class="fas fa-wrench gray"></i>, e.g. 
  * in 🇯🇵 Japan, this is the [Statistics Bureau of Japan](https://www.stat.go.jp/english/),
  * in 🇨🇳 China, the [National Bureau of Statistics of China](http://www.stats.gov.cn/english/),
  * in 🇮🇳 India, the [Ministry of Statistics and Programme Implementation](https://www.mospi.gov.in/), and
  * in 🇳🇿 New Zealand, [Statistics New Zealand](https://www.stats.govt.nz/).

{{content}}
--

* ABS provides key statistics on a wide range of economic, population, environmental and social issues, to assist and encourage informed decision making, research and discussion within governments and the community.

---

## Why do we do a census?

* The census is not cheap to run. The 2021 census **cost $565 million**. That's roughly $22 per person.

<br>

* However the **census provides value for money and it is important**.

<br>

* An [independent report](https://www.abs.gov.au/census/about-census/value-census) found that for every `$`1 invested in the Census, `$`6 of value is generated to the Australian economy.

<br>

* The census **data tells us about the economic, social and cultural** make-up of the country.

<br>

* We need census **data to make decisions and plan for the future**.

<br>

* It informs planning for schools, health care, transport and infrastructure. It is also used to help plan local services for individuals, families and communities.

---

## How is the census conducted?

The ABS contacts households in a few different ways:

* **Letters and paper forms are delivered** in some areas
* In other areas, **visits are made to households**.

<br>

Then households complete the Census form, either submitting it online or sending it back in the mail.

<br>

**ABS provides a range of supports and resources to help everyone to fill in the census**.

<br>

.question-box.w-85[
Take a moment to think about .monash-blue[**what challenges might arise if you try to survey everyone**].

*Hint: Think about smaller communities, their sub-groups and their different needs.*
 
]

---

## How can we survey everyone?

It is no small task!

* Resources for people in the deaf/hard of hearing and blind/low vision communities  
*e.g. audio guides and braille information packs*

<br>
--

* To support Aboriginal and Torres Strait Islander peoples to fill in the census, there are urban and regional pop-up hubs. 
*This includes extra face-to-face support*

<br>
--

* For migrants, refugees, and international visitors there are language supports available.

<br>
--

* Additional efforts are made to reach people without a fixed address  
*e.g. FIFO workers (Fly in Fly Out), Grey Nomads, People experiencing homelessness.*

<br>
--

More details on the ABS website: [here](https://www.abs.gov.au/census/about-census/2021-census-overview/participation-2021-census)

---

## What is in the census?

There are questions about:

* age
* country of birth
* religion
* ancestry
* language used at home
* work 
* education

]
.w-60.center[

.idea-box.tl.w-100[
## Breakout Session
Investigate what data is collected in the census.

Use the quick stats summary for Clayton  [here](https://www.abs.gov.au/census/find-census-data/quickstats/2021/SAL20569).

*Are there any weird variables, or variables that surprise you? What do you learn about where you live?*

]

]]

---

# Getting the ABS Census Data

## .animated.flash[<i class="fas fa-database"></i> https://www.abs.gov.au/census/find-census-data]

There are two main types of data that you can download:

* **DataPacks** <i class="fas fa-download"></i> https://www.abs.gov.au/census/find-census-data/datapacks
* **GeoPackages** <i class="fas fa-download"></i> https://www.abs.gov.au/census/find-census-data/geopackages

---

# Navigating ABS Census data

* DataPacks are only available for the 2011, 2016 and 2021 census.

* ABS aims for **census data to be comparable and compatible with previous censuses**.

* Questions and classifications are reviewed to reflect changes in the Australian society.  
*e.g. In 2021, ABS did not ask about home internet connection, as people now have other options like mobile devices and that data was no longer considered relevant to society.*

--
* There are small differences in the available data between years.  
*Variables can be added, updated and removed.*

--
* There are also sometimes [data corrections](https://www.abs.gov.au/census/guide-census-data/2021-census-data-corrections) at a later date.
{{content}}

--
* Here are links to:   
[(i) what's new in 2021](https://www.abs.gov.au/census/guide-census-data/census-dictionary/2021/whats-new-2021) - there were 56 new additions!     
[(ii) consultation for changes in 2026](https://www.abs.gov.au/census/2026-census-topic-review/overview-2026-census-topic-review) and   
[(iii) an example of a 2026 proposed change](https://www.sbs.com.au/news/article/why-more-robust-information-on-australias-sexuality-and-gender-identity-could-be-coming/5iqjqgryj)

---

## Reality of any data analysis

<br>

<div class="idea-box">
Navigating data and deducing what it is often requires you to do some <b>"detective work"</b> 🕵️‍♀️
</div>

<br>

* Much like real detective work, **just locating the data and understanding the data variables can take a long time**

<br>

* **Cleaning and wrangling of the data is not glamorous**;   
Far more attention goes to "catching criminals", i.e. to praise for the cool discoveries from statistical analysis.

<br><br><br>
<center>
.monash-blue[**Let's delve into the 'grunt work' of an analysis with the census data!**]
</center>

---

# Data Structure and what's in it?

---

## Datapack data structure

].w-35[

* The data is nested within folders.  
*Click on the folder name to see folders and files nested within.*
<br><br>

* Preserve the data in the original structure as much as you can!  
*It is good practice not to modify the raw data or its structure.*
<br><br>

<center>
.monash-blue[Where do we get started??]
</center>

<br><br>
<center>
.monash-blue[What is stored in each of these folders/files??]
</center>

]]

---

## Read Me and Meta Data

Download the [2021 Census data](https://www.abs.gov.au/census/find-census-data/datapacks) containing the General Community Profile for all geographies in Victoria. 
<br>

We need some description or understanding of the variables.  
*It will be near impossible to extract meaningful information from the data without it.*

<br><center>
.idea-box.tl.w-80[
## Breakout Session

Then take some time to review the read me and the meta data folders.

* Which folder contains demographic information about each suburb?

* What is LGA short for?

* Where can I find information about how much rent people pay?

* What is contained in variable G17?

]
</center>

---

# Table G17

There are a few things to note:

* There are 201 columns in G17A and G17B and 81 columns in G17C.

* Perhaps there is an export limit for data with more than 200 columns, so the table is split across several csv files.

* This means that you have to join the tables G17A, G17B and G17C into one  
*(you'll do this in the tutorial <i class="fas fa-wrench gray"></i>)*.

---

# Tables G17A-G17C

```
##   STE_CODE_2021 M_Neg_Nil_income_15_19_yrs M_Neg_Nil_income_20_24_yrs
## 1             2                      88386                      21186
```

<br>

```
##   STE_CODE_2021 F_300_399_15_19_yrs F_300_399_20_24_yrs
## 1             2                8810               19537
```

<br>

```
##   STE_CODE_2021 P_650_799_15_19_yrs P_650_799_20_24_yrs
## 1             2                7670               45029
```
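
The three files share the key `STE_CODE_2021`, so one possible way to combine them is a sequence of joins. The sketch below is an assumption about how this could be done, not the tutorial solution: it builds tiny stand-in tables (hypothetical names `g17a`, `g17b`, `g17c`) from the values printed above instead of reading the real csv files.

```r
library(tidyverse)

# Stand-in tables built from the printed rows above
# (assumption: the real files have many more columns and are read from csv)
g17a <- tibble(STE_CODE_2021 = 2,
               M_Neg_Nil_income_15_19_yrs = 88386,
               M_Neg_Nil_income_20_24_yrs = 21186)
g17b <- tibble(STE_CODE_2021 = 2,
               F_300_399_15_19_yrs = 8810,
               F_300_399_20_24_yrs = 19537)
g17c <- tibble(STE_CODE_2021 = 2,
               P_650_799_15_19_yrs = 7670,
               P_650_799_20_24_yrs = 45029)

# Join on the shared geography code so the key column appears only once
g17 <- g17a %>%
  left_join(g17b, by = "STE_CODE_2021") %>%
  left_join(g17c, by = "STE_CODE_2021")

names(g17)
```

Joining on the geography code, as opposed to binding columns side by side, guards against the rows being in a different order in each file.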

---

# Tidy Data

---

# What is Tidy Data?

<br>

## Tidy Data Principles

1. Each variable must have its own column
2. Each observation must have its own row
3. Each value must have its own cell
]

So what about the ABS Census Data?

* The table header in fact contains information!
* E.g. <span class="monash-blue2">`F_400_499_15_19_yrs`</span> refers to females aged 15-19 years who earn $400-499 per week (in Victoria).
* The number in the cells are the **counts**.
* Is the data tidy?

---

# Tidying the ABS 2021 Census Data

* Ideally we want the data to look like:

<br>

```
##   age_min age_max gender income_min income_max count
## 1      15      19 female        400        499  4020
```

* Putting data into a tidy format makes the data analysis easier.

* You can include other information, e.g. geography code (useful if combining with other geographical areas) or average age/income.

* Note some categories do not have upper bounds, e.g. .monash-blue[`M_3000_more_85ov`]. In R, `-Inf` and `Inf` are used to represent `$-\infty$` and `$\infty$`, respectively.

* You'll wrangle the data into the tidy form in the tutorial  <i class="fas fa-wrench gray"></i>

* This will require getting the pieces of information from the column names and organising them using string manipulation.
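
As a rough illustration of where this is heading, the sketch below reshapes a toy table into the tidy layout shown above. It is only one possible approach under stated assumptions: the table `g17` is a hypothetical stand-in holding two counts copied from these slides, and the splitting step only works for column names with exactly five underscore-separated pieces. Names such as `M_Neg_Nil_income_15_19_yrs` or `M_3000_more_85ov` would need extra handling.

```r
library(tidyverse)

# Toy table: two G17-style columns with counts taken from the slides
g17 <- tibble(
  STE_CODE_2021 = 2,
  F_300_399_15_19_yrs = 8810,
  F_400_499_15_19_yrs = 4020
)

tidy_g17 <- g17 %>%
  # one row per original column; the old column name becomes a value
  pivot_longer(-STE_CODE_2021, names_to = "category", values_to = "count") %>%
  # drop the trailing unit label so only five pieces remain
  mutate(category = str_remove(category, "_yrs$")) %>%
  # split e.g. "F_300_399_15_19" at each underscore
  separate(category,
           into = c("gender", "income_min", "income_max", "age_min", "age_max"),
           sep = "_") %>%
  # readable gender labels and numeric bounds
  mutate(gender = recode(gender, "F" = "female", "M" = "male"),
         across(c(income_min, income_max, age_min, age_max), as.numeric))

tidy_g17
```

The bounds are converted with `as.numeric` explicitly, because automatic type conversion could read a column containing only `"F"` as the logical value `FALSE`.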

---

# Manipulating strings

---

# Manipulating strings

* The `stringr` package provides a set of functions designed to help with string manipulation.

```r
library(tidyverse) # includes `stringr`
```

.footnote.f5[
Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for
  Common String Operations. R package version 1.4.0.
  
Gagolewski M. and others (2020). R package stringi: Character
  string processing facilities.
]

* The main functions in `stringr` begin with the **prefix `str_`**, and their first argument is a string (or a vector of strings)

* What do you think `str_trim` and `str_squish` do?

```r
str_trim(c("    Apple ", "  Goji    Berry     "))
```

```
## [1] "Apple"         "Goji    Berry"
```

```r
str_squish(c("    Apple ", "  Goji    Berry     "))
```

```
## [1] "Apple"      "Goji Berry"
```

* [Click here](https://github.com/rstudio/cheatsheets/blob/main/strings.pdf) for a cheat sheet for `stringr` functions.

---

## Some other examples

These are `stringr` functions we'll need for our census application.

Splitting strings by a pattern:

```r
str_split(string = "Hi_everyone_in_ETC5512", pattern = "_")
```

```
## [[1]]
## [1] "Hi"       "everyone" "in"       "ETC5512"
```
Replacing parts of strings with a different pattern:

```r
str_replace(string = "we_want_fourwords", pattern = "rw", replacement = "r_w")
```

```
## [1] "we_want_four_words"
```

Deleting parts of strings that aren't important:

```r
str_remove(string = "we_want_to_remove_the_extra_stuff", pattern = "to_remove_the_extra_")
```

```
## [1] "we_want_stuff"
```

To get more control over the kinds of patterns we can match, we need regular expressions.
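
Before moving on, here is a small sketch of how the three functions above might be combined on a census-style column name. The variable names `col_name`, `pieces` and `label` are hypothetical and used only for illustration.

```r
library(stringr)

col_name <- "F_400_499_15_19_yrs"

# Remove the trailing unit label, then split on underscores.
# str_split() returns a list, so [[1]] pulls out the character vector.
pieces <- str_split(str_remove(col_name, "_yrs"), pattern = "_")[[1]]
pieces

# Swap the single-letter code for a readable label
label <- str_replace(pieces[1], pattern = "F", replacement = "female")
label
```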

---

# Regular expressions .font_small[.font_small[Part] 1]

* A **regular expression**, or **regex**, is a string of characters that defines a search pattern for text
--

* Regular expressions are... 
--
hard
--
, but they come up often enough that they're worth learning
--

```r
ozanimals <- c("koala", "kangaroo", "kookaburra", "numbat")
```
--
**.circle.bg-orange.white[=] Basic match**
.flex[
.w-50.pr3[

```r
str_detect(ozanimals, "oo")
```

```
## [1] FALSE  TRUE  TRUE FALSE
```

```r
str_extract(ozanimals, "oo")
```

```
## [1] NA   "oo" "oo" NA
```

]
.w-50[

```r
str_match(ozanimals, "oo")
```

```
##      [,1]
## [1,] NA  
## [2,] "oo"
## [3,] "oo"
## [4,] NA
```

]]

---

# Regular expressions .font_small[.font_small[Part] 2]

**.circle.bg-orange.white[=] Meta-characters**

* `"."` a wildcard to match any character except a new line

```r
str_starts(c("color", "colouur", "colour", "red-column"), "col...")
```

```
## [1] FALSE  TRUE  TRUE FALSE
```
--

* `"(.|.)"` a marked subexpression with alternate possibilities marked with `|`

```r
str_replace(c("lovelove", "move", "stove", "drove"), "(l|dr|st)o", "ha")
```

```
## [1] "havelove" "move"     "have"     "have"
```
--

* `"[...]"` matches a single character contained in the bracket

```r
str_replace_all(c("cake", "cookie", "lamington"), "[aeiou]", "_")
```

```
## [1] "c_k_"      "c__k__"    "l_m_ngt_n"
```
---

# Regular expressions .font_small[.font_small[Part] 3]

**.circle.bg-orange.white[=] Meta-character quantifiers**

* `"?"` zero or one occurrence of preceding element

```r
str_extract(c("color", "colouur", "colour", "red"), "colou?r")
```

```
## [1] "color"  NA       "colour" NA
```
--

* `"*"` zero or more occurrences of preceding element

```r
str_extract(c("color", "colouur", "colour", "red"), "colou*r")
```

```
## [1] "color"   "colouur" "colour"  NA
```
--

* `"+"` one or more occurrences of preceding element

```r
str_extract(c("color", "colouur", "colour", "red"), "colou+r")
```

```
## [1] NA        "colouur" "colour"  NA
```

---

# Regular expressions .font_small[.font_small[Part] 4]

* `"{n}"` preceding element is matched exactly `n` times

```r
str_replace(c("banana", "bananana", "bana", "banananana"), "ba(na){2}", "-")
```

```
## [1] "-"     "-na"   "bana"  "-nana"
```
--

* `"{min,}"` preceding element is matched `min` times or more

```r
str_replace(c("banana", "bananana", "bana", "banananana"), "ba(na){2,}", "-")
```

```
## [1] "-"    "-"    "bana" "-"
```
--

* `"{min,max}"` preceding element is matched at least `min` times but no more than `max` times

```r
str_replace(c("banana", "bananana", "bana", "banananana"), "ba(na){1,2}", "-")
```

```
## [1] "-"     "-na"   "-"     "-nana"
```

---

# Regular expressions .font_small[.font_small[Part] 5]

**.circle.bg-orange.white[=] Character classes**

* `[:alpha:]` or `[A-Za-z]` to match alphabetic characters
* `[:alnum:]` or `[A-Za-z0-9]` to match alphanumeric characters
* `[:digit:]` or `[0-9]` or `\\d` to match a digit
* `[^0-9]` to match non-digits  
* `[a-c]` to match a, b or c
* `[A-Z]` to match uppercase letters
* `[a-z]` to match lowercase letters
* `[:space:]` or `[ \t\r\n\v\f]` to match whitespace characters
* and more...
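
To connect these classes back to the census column names, the sketch below pulls out digits and letters from two names that appear in these slides. The variable names are hypothetical, and this is just one way the classes could be used.

```r
library(stringr)

col_names <- c("F_400_499_15_19_yrs", "M_3000_more_85ov")

# Every run of one or more digits in each name (returns a list)
numbers <- str_extract_all(col_names, "[0-9]+")
numbers

# The leading uppercase letter, i.e. the gender code
first_letter <- str_extract(col_names, "^[A-Z]")
first_letter

# Flag open-ended categories, which have no upper income bound
open_ended <- str_detect(col_names, "more")
open_ended
```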

---

# View matches with regular expressions

```r
str_view(c("banana", "bananana", "bana", "banabanana"), "ba(na){1,2}")
```

```
## [1] │ <banana>
## [2] │ <banana>na
## [3] │ <bana>
## [4] │ <bana><banana>
```

]
.item[

```r
str_view_all(c("banana", "bananana", "bana", "banabanana"), "ba(na){1,2}")
```

```
## [1] │ <banana>
## [2] │ <banana>na
## [3] │ <bana>
## [4] │ <bana><banana>
```
]
]

---

# View matches with regular expressions

```r
str_view(c("banana", "bananana", "bana", "banabanana"), "ba(na){1,2}")
```

```
## [1] │ <banana>
## [2] │ <banana>na
## [3] │ <bana>
## [4] │ <bana><banana>
```

]
.item[

```r
str_view_all(c("banana", "bananana", "bana", "banabanana"), "ba(na){1,2}")
```

```
## [1] │ <banana>
## [2] │ <banana>na
## [3] │ <bana>
## [4] │ <bana><banana>
```
]
]

<div class="info-box" style="position:absolute;right:20px;margin-right:0px!important;top:140px;margin-left:0;width:900px;font-size: 20pt;">
<ul>
<li>When a function in <code>stringr</code> ends with <code>_all</code>, all matches of the pattern are considered</li>
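<li>The one <i>without</i> <code>_all</code> only considers the first match</li>
<li><code>str_view()</code> is an exception in stringr 1.5.0 and later: it highlights every match and <code>str_view_all()</code> is deprecated, which is why the two outputs here are identical</li>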
</ul>
</div>

---

## Weird characters

Characters that have a special meaning in a regex, e.g. `*`, `.`, `?`, `(`, `)` and `]`, need to be escaped when we want to match them literally.

This doesn't work:

```r
str_extract("Let's get the character and the brackets (A)", "([:alpha:])")
```

```
## [1] "L"
```

```r
str_view("Let's get the character and the brackets (A)", "([:alpha:])")
```

```
## [1] │ <L><e><t>'<s> <g><e><t> <t><h><e> <c><h><a><r><a><c><t><e><r> <a><n><d> <t><h><e> <b><r><a><c><k><e><t><s> (<A>)
```
But this does.

```r
str_extract("Let's get the character and the brackets (A)", "\\([:alpha:]\\)")
```

```
## [1] "(A)"
```
To match a bracket `(` we need to use `\\(` in stringr. 
The first backslash escapes the second one in the R string, so the regex engine receives `\(` and looks for a literal bracket instead of starting a group. The same goes for other special characters.
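
A few more escaping cases, as a hedged sketch with made-up example strings (the file names and variable names are hypothetical):

```r
library(stringr)

# "$" normally anchors the end of a string, so escape it to match a dollar sign
price <- str_extract("The 2021 census cost $565 million", "\\$[0-9]+")
price

files <- c("lecture-04.pdf", "lecture-04_pdf")

# Escaped dot: only a literal "." before "pdf" matches
escaped <- str_detect(files, "\\.pdf$")
escaped

# Unescaped dot: a wildcard, so "_pdf" matches as well
unescaped <- str_detect(files, ".pdf$")
unescaped

# fixed() treats the whole pattern literally, so no escaping is needed
literal_q <- str_detect("Is the data tidy?", fixed("?"))
literal_q
```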

---

# Back to Census

---

# Raw Data vs. Aggregated Data

* Although the data collected was from individual households, with each person in the household surveyed (see sample form [here](https://www.abs.gov.au/system/files/documents/12486ae64f0f0ea2d056ee6aa54adc34/Sample%202021%20Census%20Household%20Form%20%5B1.1MB%5D.pdf)), the downloaded data are .monash-blue[**aggregated**].
* Aggregate data presents summary statistics from the .monash-blue[**raw data**]. 
*(e.g. a common summary statistic is the mean)*. 
* When the summary statistics are counts then it is often called .monash-blue[**frequency data**].
* The raw data collected would look something like this:

<div class="datatables html-widget html-fill-item-overflow-hidden html-fill-item" id="htmlwidget-0494992d3a04bee8ec55" style="width:100%;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-0494992d3a04bee8ec55">{"x":{"filter":"none","vertical":false,"data":[[1,1,1,1,2,2],["John Smith","Jane Smith","David Smith","Mary Smith","John Citizen","Jane Citizen"],["F","M","M","F","M","F"],[40,39,10,8,32,33],["Married","Married","Never married","Never married","Never married","Never married"],["400-499","300-399","Nil","Nil","400-499","1750-1999"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>household_id<\/th>\n      <th>person<\/th>\n      <th>gender<\/th>\n      <th>age<\/th>\n      <th>marital_status<\/th>\n      <th>income_per_week<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"lengthChange":false,"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,3]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---

# What you lose in aggregate data

* For aggregate data, there is less scope for you to draw insights conditioned on other variables.   
* *e.g. Based on frequency data alone, you cannot answer questions like: How many middle income families have 2 children?*
* Raw data are desirable if you can get hold of it!
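
To make the loss concrete, the sketch below aggregates three columns of the dummy data from the previous slide into frequency data. The object names `raw`, `freq` and `adult_freq` are hypothetical. It assumes nothing about how the ABS actually aggregates; it only shows that counts by one variable cannot be re-split by another afterwards.

```r
library(tidyverse)

# Three columns of the dummy raw data from the previous slide
raw <- tibble(
  household_id    = c(1, 1, 1, 1, 2, 2),
  age             = c(40, 39, 10, 8, 32, 33),
  income_per_week = c("400-499", "300-399", "Nil", "Nil", "400-499", "1750-1999")
)

# Frequency data: one row per income bracket with the number of people in it
freq <- raw %>% count(income_per_week)
freq

# With the raw data we can still condition on another variable,
# e.g. income brackets among adults only. This cannot be recovered from `freq`.
adult_freq <- raw %>%
  filter(age >= 18) %>%
  count(income_per_week)
adult_freq
```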
--

## Trust and skepticism

* By the way, did you notice anything odd about the dummy data presented in the last slide?
--

* John Smith was recorded as female and Jane Smith as male. Data may have been incorrectly recorded. 
--

* How much do you trust the aggregate data?

* Remember to have a healthy dose of skepticism about your data.

---

# Data Confidentiality

* The data is not just aggregated, but it is also .monash-blue[anonymised]
* E.g. in .monash-blue[`2021_GCP_Sequential_Template_R2.xlsx`], Sheet "G17", footnote says "*Please note that there are **small random adjustments** made to all cell values to protect the confidentiality of data. These adjustments may cause the sum of rows or columns to differ by small amounts from table totals.*"

.question-box.w-60[
Do you think that you'll get the same numbers if you use the data from a different geographical level, e.g. `SA1` versus `STE`? 
]

* You can check this in the tutorial 🔧

---

.idea-box.tl.w-70[
## Summary

* We went through how to locate and understand the data available in the 2021 Australian census.
* We know some limitations with this data. 
* We learnt about what tidy data is.
* We learnt a little about how to manipulate strings.

]

---

## Answers to breakout questions

* Which folder contains demographic information about each suburb?  
*In the file `2021AboutDataPacks_readme.txt` you find out that the folders represent different geographical sub-regions. SAL represents suburbs and localities, which were called SSC in the previous census.*

* What is LGA short for?    
*Local Government Areas*

* Where can I find information about how much rent people pay?  
*In the file `2021_GCP_Sequential_Template_R2` there is a list of variables and what is contained in each table. G40 contains the rental information (organised by landlord type).*

* What is contained in variable G17?  
*G17 contains information about the total personal income organised by age and sex.*

---

#### Slides developed by Dr. Emi Tanaka and updated by Dr. Kate Saunders

---

background-size: cover
class: title-slide
background-image: url("images/bg-03.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.


---

# .orange[Case study <i class="fas fa-search-plus"></i>] Aussie Local Government Area

```r
LGA <- ozmaps::abs_lga %>% pull(NAME)
LGA[1:7]
```

```
## [1] "Broken Hill (C)" "Waroona (S)"     "Toowoomba (R)"   "West Arthur (S)"
## [5] "Moreton Bay (R)" "Etheridge (S)"   "Cleve (DC)"
```

<center>
<table style="width:90%">
  <tr>
    <td>C = Cities</td>
    <td>A = Areas</td>
    <td>RC = Rural Cities</td>
    
  </tr>
  <tr>
    <td>B = Boroughs</td>
    <td>S = Shires</td>
    <td>DC = District Councils</td>
  </tr>
  <tr>
    <td>M = Municipalities</td>
    <td>T = Towns</td>
    <td>AC = Aboriginal Councils </td>
  </tr>
  <tr>
  <td>RegC = Regional Councils</td>
  </tr>
</table>
</center>

<br>
.center[
🎯 **Extract the LGA status from the LGA names**

{{content}}
]

How?

---

# Extracting the string

```r
str_extract(LGA, "\\(.+\\)")
```

```
##   [1] "(C)"        "(S)"        "(R)"        "(S)"        "(R)"       
##   [6] "(S)"        "(DC)"       "(R)"        "(DC)"       "(C)"       
##  [11] "(DC)"       "(S)"        "(S)"        "(S)"        "(DC)"      
##  [16] "(A)"        "(C)"        "(A)"        "(T)"        "(RC)"      
##  [21] "(A)"        "(S)"        "(S)"        "(S)"        "(C)"       
##  [26] "(DC)"       "(R)"        "(A)"        "(C)"        "(DC)"      
##  [31] "(S)"        "(S)"        "(A)"        "(S)"        "(S)"       
##  [36] "(R)"        "(M)"        "(A)"        "(C)"        "(S)"       
##  [41] "(S)"        "(C)"        "(A)"        "(S)"        "(C)"       
##  [46] "(AC)"       "(A)"        "(S)"        "(A)"        "(C)"       
##  [51] "(A)"        "(R)"        "(S)"        "(T)"        "(C)"       
##  [56] "(S)"        "(S)"        "(R)"        "(C)"        "(T)"       
##  [61] "(C)"        "(S)"        "(C)"        "(C)"        "(C)"       
##  [66] "(C)"        "(S)"        "(DC)"       "(DC)"       "(S)"       
##  [71] "(R)"        "(R)"        "(S)"        "(B)"        "(DC)"      
##  [76] "(M)"        "(A)"        "(C)"        "(S)"        "(S)"       
##  [81] "(S)"        "(S)"        "(S)"        "(S)"        "(S)"       
##  [86] "(C)"        "(A)"        "(C)"        "(A)"        "(S)"       
##  [91] "(C)"        "(A)"        "(S)"        "(S)"        "(S)"       
##  [96] "(S)"        "(DC)"       "(S)"        "(S)"        "(S)"       
## [101] "(C)"        "(C)"        "(DC)"       "(S)"        "(S)"       
## [106] "(C)"        "(S)"        "(DC)"       "(C)"        "(C)"       
## [111] "(S)"        "(S)"        "(S)"        "(S)"        "(S)"       
## [116] "(S)"        "(A)"        "(DC)"       "(S)"        "(A)"       
## [121] "(C)"        "(A)"        "(S)"        "(A)"        "(DC)"      
## [126] "(S)"        "(C)"        "(S)"        "(A)"        "(S)"       
## [131] "(M)"        "(S)"        "(DC)"       "(R)"        "(C)"       
## [136] "(C)"        "(S)"        "(C)"        "(S)"        "(T)"       
## [141] "(S)"        "(S)"        "(DC)"       "(S)"        "(T)"       
## [146] "(C)"        "(S)"        "(M)"        "(S)"        "(DC)"      
## [151] "(C)"        "(S)"        "(M)"        "(C)"        "(S)"       
## [156] "(C)"        "(C)"        "(R)"        "(S)"        "(C)"       
## [161] "(C)"        "(R)"        "(S)"        "(C)"        "(A)"       
## [166] "(T)"        "(S)"        "(RC)"       "(C)"        "(A)"       
## [171] "(A)"        "(A)"        "(S)"        "(A)"        "(S)"       
## [176] "(S)"        "(T)"        "(S)"        "(S)"        "(S)"       
## [181] "(A)"        "(DC)"       "(M)"        "(C)"        "(S)"       
## [186] "(A)"        "(T)"        "(A)"        "(C)"        "(S)"       
## [191] "(C)"        "(R)"        "(C)"        "(S)"        "(S)"       
## [196] "(S)"        "(S)"        "(R)"        "(C)"        "(DC)"      
## [201] "(A)"        "(DC)"       "(R)"        "(C)"        "(S)"       
## [206] "(S)"        "(C)"        "(C)"        "(R)"        "(S)"       
## [211] "(S)"        "(C)"        "(A)"        "(S)"        "(S)"       
## [216] "(C)"        "(DC)"       "(S)"        "(M) (Tas.)" "(M) (Tas.)"
## [221] "(C) (Vic.)" "(C) (Vic.)" "(S)"        "(DC)"       "(S)"       
## [226] "(RC)"       "(S)"        "(DC)"       "(S)"        "(S)"       
## [231] "(R)"        "(S)"        "(A)"        "(C)"        "(C)"       
## [236] "(A)"        "(A)"        "(RC)"       "(S)"        "(C)"       
## [241] "(S)"        "(S)"        "(S)"        "(C)"        "(C)"       
## [246] "(S)"        "(C)"        "(C)"        "(C)"        "(A)"       
## [251] "(C)"        "(S)"        "(S)"        "(S)"        "(S)"       
## [256] "(S)"        "(A)"        "(A)"        "(A)"        "(S)"       
## [261] "(A)"        "(A)"        "(S)"        "(S)"        "(C)"       
## [266] "(A)"        "(M)"        "(S)"        "(S)"        "(C)"       
## [271] "(R)"        "(S)"        "(R)"        "(DC)"       "(R)"       
## [276] "(C)"        "(S)"        "(S)"        "(C)"        "(S)"       
## [281] "(A)"        "(R)"        "(DC)"       "(A)"        "(C)"       
## [286] "(A)"        "(S)"        "(S)"        "(A)"        "(C)"       
## [291] "(C)"        "(A)"        "(T)"        "(S)"        "(C)"       
## [296] "(A)"        "(A)"        "(S)"        "(S)"        "(T)"       
## [301] "(C)"        "(A)"        "(A)"        "(DC)"       "(A)"       
## [306] "(C)"        "(M)"        "(M)"        "(S)"        "(A)"       
## [311] "(A)"        "(C)"        "(C)"        "(S)"        "(DC)"      
## [316] "(S)"        "(C)"        "(S)"        "(S)"        "(DC)"      
## [321] "(RegC)"     "(C)"        "(S)"        "(S)"        NA          
## [326] "(A)"        "(S)"        "(A)"        "(S)"        "(A)"       
## [331] "(S)"        "(C)"        "(R)"        "(C)"        "(S)"       
## [336] "(A)"        "(DC)"       "(S)"        "(A)"        "(R)"       
## [341] "(S)"        "(S)"        "(RC)"       "(T)"        "(A)"       
## [346] "(M)"        "(A)"        "(S)"        "(S)"        "(S)"       
## [351] "(S)"        "(A)"        "(RC)"       "(S)"        "(A)"       
## [356] "(R)"        "(S)"        "(S)"        "(C)"        "(S)"       
## [361] "(DC)"       "(M)"        "(M)"        "(AC)"       "(DC)"      
## [366] "(A)"        "(A)"        "(S)"        "(S)"        "(A)"       
## [371] "(C)"        "(S)"        "(S)"        "(C)"        "(R)"       
## [376] "(S)"        "(S)"        NA           "(A)"        "(T)"       
## [381] "(S)"        "(A)"        "(C)"        "(C)"        "(A)"       
## [386] "(C)"        "(DC)"       "(C)"        "(A)"        "(A)"       
## [391] "(A)"        "(S)"        "(DC)"       "(DC)"       "(S)"       
## [396] "(M)"        "(R)"        "(DC)"       "(C)"        "(S)"       
## [401] "(S)"        "(C)"        "(C)"        "(C)"        "(C)"       
## [406] "(C)"        "(S)"        "(A)"        NA           "(S)"       
## [411] "(C)"        "(S)"        "(M)"        "(C)"        "(S)"       
## [416] "(S)"        NA           "(C)"        "(S)"        "(C)"       
## [421] "(DC)"       "(S)"        "(C)"        "(S)"        "(C)"       
## [426] "(M)"        "(A)"        "(A)"        "(A)"        "(S)"       
## [431] "(C)"        "(S)"        "(S)"        "(S)"        "(A)"       
## [436] "(A)"        "(A)"        "(S)"        "(S)"        "(S)"       
## [441] "(C)"        "(S)"        "(C)"        "(C)"        "(C)"       
## [446] "(C) (NSW)"  "(S) (Qld)"  "(R) (Qld)"  "(DC) (SA)"  "(C) (SA)"  
## [451] "(M) (Tas.)" "(M) (Tas.)" "(C)"        "(R)"        "(M)"       
## [456] "(C)"        "(R)"        "(S)"        "(RC)"       "(S)"       
## [461] "(M)"        "(C)"        "(R)"        "(C)"        "(DC)"      
## [466] "(C)"        "(C)"        "(M)"        "(C)"        "(S)"       
## [471] "(C)"        "(DC)"       "(M)"        "(S)"        "(C)"       
## [476] "(C)"        "(A)"        "(DC)"       "(R)"        "(C)"       
## [481] "(C)"        "(A)"        "(M)"        "(C)"        "(C)"       
## [486] "(S)"        "(S)"        "(S)"        "(A)"        "(R)"       
## [491] "(M)"        "(A)"        "(R)"        "(A)"        "(A)"       
## [496] "(R)"        "(R)"        "(R)"        "(S)"        "(C)"       
## [501] "(C)"        "(S)"        "(A)"        "(S)"        "(M)"       
## [506] "(M)"        "(S)"        "(A)"        "(A)"        "(S)"       
## [511] "(A)"        "(C)"        "(DC)"       "(S)"        "(S)"       
## [516] NA           "(A)"        NA           "(R)"        "(C)"       
## [521] "(S)"        "(C)"        "(S)"        "(A)"        "(A)"       
## [526] "(A)"        "(A)"        "(C)"        "(A)"        "(A)"       
## [531] "(A)"        "(A)"        "(C) (NSW)"  "(A)"        "(C)"       
## [536] "(R)"        "(S)"        "(A)"        "(R)"        "(C)"       
## [541] "(A)"        "(S)"        "(A)"        "(A)"
```

<div class="info-box" style="position:absolute;right:20px;margin-right:0px!important;bottom:50px;margin-left:0;width:900px;font-size: 20pt;">
<ul>
<li>What is <code>"\\(.+\\)"</code>???</li>
<li>This is a pattern expressed as a <b>regular expression</b>, or <b>regex</b> for short.</li>
<li>Note that in R you have to add an extra <code>\</code> whenever <code>\</code> appears in the pattern <span class="font_small">(yes, this means you can end up with a lot of backslashes... just keep adding <code>\</code> until it works! Enjoy <a href="https://xkcd.com/1638/">this xkcd comic</a>.)</span></li>
<li>From R v4.0.0 onwards, you can use a raw string to eliminate all the extra <code>\</code>, e.g. <code>r"(<span class="monash-blue">\(.+\)</span>)"</code> is the same as <code class="monash-blue">"\\(.+\\)"</code></li>
</ul>
</div>
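
To see the backslash doubling in action, here is a minimal sketch in base R. The variable names `escaped` and `raw` are made up for illustration, and the raw string line assumes R v4.0.0 or later.

```r
# Two spellings of the same pattern (raw strings need R >= 4.0.0)
escaped <- "\\(.+\\)"   # ordinary string: every backslash is doubled
raw     <- r"(\(.+\))"  # raw string: backslashes are written once

# Both are the same six-character string
identical(escaped, raw)

# writeLines() shows the characters the regex engine receives: \(.+\)
writeLines(escaped)
```

Printing the string with `print()` shows the doubled backslashes again, which is why `writeLines()` is the handier check here.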

---

# .font_small[Back to] Extracting the string

```r
str_extract(LGA, "\\(.+\\)")
```

---

# .font_small[Back to] Extracting the string

```r
str_extract(LGA, "\\(.+\\)") %>% 
  table()
```

```
## .
##        (A)       (AC)        (B)        (C)  (C) (NSW)   (C) (SA) (C) (Vic.) 
##        100          2          1        120          2          1          2 
##       (DC)  (DC) (SA)        (M) (M) (Tas.)        (R)  (R) (Qld)       (RC) 
##         40          1         23          4         38          1          7 
##     (RegC)        (S)  (S) (Qld)        (T) 
##          1        182          1         12
```

<blockquote>
Where the same Local Government Area name appears in different States or Territories, the State or Territory abbreviation appears in parenthesis after the name. Local Government Area names are therefore unique.<br>
<a href="https://www.abs.gov.au/ausstats/abs@.nsf/Lookup/by%20Subject/1270.0.55.003~June%202020~Main%20Features~Local%20Government%20Areas%20(LGAs)~3" style="float:right">-Australian Bureau of Statistics</a>
</blockquote>

---

# .font_small[Retry] Extracting the string

```r
str_extract(LGA, "\\([^)]+\\)") %>% 
  table()
```

```
## .
##    (A)   (AC)    (B)    (C)   (DC)    (M)    (R)   (RC) (RegC)    (S)    (T) 
##    100      2      1    125     41     27     39      7      1    183     12
```

---

# .font_small[Retry] Extracting the string

```r
str_extract(LGA, "\\([^)]+\\)") %>% 
  # remove the brackets
  str_replace_all("[\\(\\)]", "") %>% 
  table()
```

```
## .
##    A   AC    B    C   DC    M    R   RC RegC    S    T 
##  100    2    1  125   41   27   39    7    1  183   12
```

* `[]` matches a single character from the set listed inside the square brackets
* We want to match `(` and `)`, but these are meta-characters
* So we need to escape them to get the literal characters: `\(` and `\)`
* But in an R string we must also escape the escape character... so it's actually `\\(` and `\\)`
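
As a small aside, and as a sketch rather than the lecture's own method, the bracket removal can be tried on a made-up vector `codes`. It assumes stringr's regex engine treats `(` and `)` as literals inside `[]`, so the unescaped class `[()]` should behave the same way.

```r
library(stringr)

# Hypothetical extracted codes, for illustration only
codes <- c("(A)", "(DC)", NA)

# Escaped version, as on the slide
with_escapes <- str_replace_all(codes, "[\\(\\)]", "")

# Inside [] the parentheses are ordinary characters, so this also works
without_escapes <- str_replace_all(codes, "[()]", "")

# str_remove_all() is shorthand for replacing every match with ""
removed <- str_remove_all(codes, "[()]")

with_escapes
```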

---

# .font_small[R v4.0.0] Extracting the string

<pre>
<code class="r hljs remark-code">
<div class="remark-code-line">str_extract(LGA, <span style="background-color:yellow">r<span class="hljs-string">"(</span></span><span class="hljs-string">\([^)]+\)<span style="background-color:yellow">)"</span></span>) %&gt;% </div>
<div class="remark-code-line">  <span class="hljs-comment"># remove the brackets</span></div>
<div class="remark-code-line">  str_replace_all(<span style="background-color:yellow">r<span class="hljs-string">"(</span></span><span class="hljs-string">[\(\)]<span style="background-color:yellow">)"</span></span>, <span class="hljs-string">""</span>) %&gt;% </div>
<div class="remark-code-line">  table()</div>
</code>
</pre>
<pre>
<code class="r hljs remark-code"><div class="remark-code-line"><span class="hljs-comment">## .</span></div>
<div class="remark-code-line"><span class="hljs-comment">##    A   AC    B    C   DC    M    R   RC RegC    S    T </span></div>
<div class="remark-code-line"><span class="hljs-comment">##  100    2    1  125   41   27   39    7    1  183   12</span></div>
</code>
</pre>

* If using R v4.0.0 onwards, you can use the raw string version instead