class: middle center hide-slide-number monash-bg-gray80 .info-box.w-50.bg-white[ These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-11.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>. ] <br> .white[Press the **right arrow** to progress to the next slide!] --- class: title-slide count: false background-image: url("images/bg-03.png") # .monash-blue[ETC5512: Wild Caught Data] <h1 class="monash-blue" style="font-size: 30pt!important;"></h1> <br> <h2 style="font-weight:900!important;">Case Study: COVID-19 Decision Making</h2> .bottom_abs.width100[ Lecturer: *Michael Lydeamore* Department of Econometrics and Business Statistics <i class="fas fa-envelope"></i> ETC5512.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 10 <br> ] --- class: center middle bg-gray .aim-box.tl.w-70[ Today you will: - Learn about exploratory data techniques and analysis - Consider ethical standpoints when making recommendations - Combine multiple sources of data to increase information ] --- # Bring your minds back It's August 2020. You're in Victoria, working as a public health advisor to the state government. Movement restrictions are the tighest they've ever been, and for a long time too. The number of COVID-19 cases is staying stubbornly high. You've been tasked with working out what should be done, and when. .question-box.w-100[ What should you do? ] --- # Let's get some data COVID-19 was an unusual succses of open data, **but** * No standard format * No standard reporting window * No standard distribution method These days, a lot of the COVID data sources are no longer being (regularly) updated. The Victorian government has a dataset of cases available [on their website](https://discover.data.vic.gov.au/dataset/victorian-coronavirus-data). --- ```r library(tidyverse) cases <- read_csv("./data/ncov_cases_by_postcode_lga.csv") glimpse(cases) ``` ``` ## Rows: 295,148 ## Columns: 7 ## $ diagnosis_date <date> 2020-01-25, 2020-01-28, 2020-01-30, 2020-01-31, 2020-02-22, 2020-02-22, 2020-02-25, 2020-03-01, 2020-03-01, 2020-03-04, 2020-03-06, 2020-03-07, 2020-03-08, 2020-03-08,… ## $ postcode <dbl> 3149, 3150, 3006, 3008, 3058, 3931, 3109, 3104, 3802, 3006, 3143, 3006, 3809, 3942, 3124, 3143, 3206, 3027, 3124, 3125, 3130, 3183, 3207, 3217, 3032, 3056, 3072, 3122, … ## $ lga_name <chr> "Monash (C)", "Monash (C)", "Melbourne (C)", "Melbourne (C)", "Merri-bek (C)", "Mornington Peninsula (S)", "Manningham (C)", "Boroondara (C)", "Casey (C)", "Melbourne (… ## $ lga_code <dbl> 24970, 24970, 24600, 24600, 25250, 25340, 24210, 21110, 21610, 24600, 26350, 24600, 21450, 25340, 21110, 26350, 25900, 27260, 21110, 24970, 26980, 25900, 25900, 22750, … ## $ RAT_case_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ PCR_case_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ Total_case_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ``` -- .question-box[ What is the structure of this dataset? What do the columns mean? ] --- # Some definitions * Local Government Area (LGA): A geographic area mostly represented by a single council. * PCR: Polymerase-chain reaction. A method for replicating virus to detect it's presence/absence. * RAT: Rapid antigen test. A less specific and sensitive test that provides presence/absence of an organism. --- # Data exploration Today is August 1, 2020. Let's have a look back at the last 2 months: ```r date_interval <- interval(ymd("2020-09-30") - months(2), ymd("2020-09-30")) cases %>% filter(diagnosis_date %within% date_interval) %>% group_by(diagnosis_date) %>% ggplot(aes(x = diagnosis_date, y = Total_case_count)) + geom_line() + labs(x="Date", y="Cases") ``` <img src="images/lecture-11/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" /> --- # Data exploration Cases are listed separately by geography, so we need to group them up: ```r cases %>% filter(diagnosis_date %within% date_interval) %>% group_by(diagnosis_date) %>% summarise(n = sum(Total_case_count)) %>% ggplot(aes(x = diagnosis_date, y = n)) + geom_line() + labs(x = "Date", y = "Cases") ``` <img src="images/lecture-11/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" /> --- class: transition We've generated evidence of the problem. What are some ideas for solutions? --- # Spatial variation There is much talk about differences in geographical response to COVID. It started as differences between the states, now people are talking about within the state itself. Senior people are talking about restrictions "not working": what does that mean? -- Let's have a look at spatially disaggregated data. --- # Spatial variation ```r cases %>% filter(diagnosis_date %within% date_interval) %>% group_by(diagnosis_date, lga_name) %>% summarise(n = sum(Total_case_count)) %>% ggplot(aes(x = diagnosis_date, y = n, colour = lga_name)) + geom_line() ``` <img src="images/lecture-11/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" /> -- A little hard to see... --- # Spatial variation .f4[ ```r cases %>% filter(diagnosis_date %within% date_interval) %>% group_by(lga_name) %>% arrange(diagnosis_date) %>% mutate(n = cumsum(Total_case_count)) %>% ggplot(aes(x = diagnosis_date, y = n, colour = lga_name)) + geom_line() + guides(colour="none") ``` <img src="images/lecture-11/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" /> ] What do you see? --- class: transition ### The order of the lines doesn't change much Perhaps restrictions aren't as effective in certain places --- # Mobility data Google released a series of "mobility reports" that attempted to quantify changes in mobility patterns using mobile phone data. The reports are no longer updated but are available online: [https://www.google.com/covid19/mobility/](https://www.google.com/covid19/mobility/) --- # Let's take a peak .f4[ ```r mobility <- read_csv("data/2020_AU_Region_Mobility_Report.csv") head(mobility) ``` ``` ## # A tibble: 6 × 15 ## country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreatio…¹ grocery_and_pharmacy…² parks_percent_change…³ ## <chr> <chr> <chr> <chr> <lgl> <chr> <lgl> <chr> <date> <dbl> <dbl> <dbl> ## 1 AU Australia <NA> <NA> NA <NA> NA ChIJ38W… 2020-02-15 4 3 -2 ## 2 AU Australia <NA> <NA> NA <NA> NA ChIJ38W… 2020-02-16 3 5 9 ## 3 AU Australia <NA> <NA> NA <NA> NA ChIJ38W… 2020-02-17 -1 0 -6 ## 4 AU Australia <NA> <NA> NA <NA> NA ChIJ38W… 2020-02-18 -3 -2 -13 ## 5 AU Australia <NA> <NA> NA <NA> NA ChIJ38W… 2020-02-19 -1 -1 -6 ## 6 AU Australia <NA> <NA> NA <NA> NA ChIJ38W… 2020-02-20 0 1 5 ## # ℹ abbreviated names: ¹retail_and_recreation_percent_change_from_baseline, ²grocery_and_pharmacy_percent_change_from_baseline, ³parks_percent_change_from_baseline ## # ℹ 3 more variables: transit_stations_percent_change_from_baseline <dbl>, workplaces_percent_change_from_baseline <dbl>, residential_percent_change_from_baseline <dbl> ``` Quite big: 85330 rows and 15 columns. Let's try to find Victoria... ] --- # Let's take a peak .f4[ ```r vic_mobility <- mobility %>% filter(sub_region_1 == "Victoria", !is.na(sub_region_2)) head(vic_mobility) ``` ``` ## # A tibble: 6 × 15 ## country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreatio…¹ grocery_and_pharmacy…² parks_percent_change…³ ## <chr> <chr> <chr> <chr> <lgl> <chr> <lgl> <chr> <date> <dbl> <dbl> <dbl> ## 1 AU Australia Victoria Alpine Shire NA <NA> NA ChIJC8X… 2020-06-08 NA NA NA ## 2 AU Australia Victoria Alpine Shire NA <NA> NA ChIJC8X… 2020-06-09 NA NA NA ## 3 AU Australia Victoria Alpine Shire NA <NA> NA ChIJC8X… 2020-06-10 NA NA NA ## 4 AU Australia Victoria Alpine Shire NA <NA> NA ChIJC8X… 2020-06-11 NA NA NA ## 5 AU Australia Victoria Alpine Shire NA <NA> NA ChIJC8X… 2020-06-12 NA NA NA ## 6 AU Australia Victoria Alpine Shire NA <NA> NA ChIJC8X… 2020-06-15 NA NA NA ## # ℹ abbreviated names: ¹retail_and_recreation_percent_change_from_baseline, ²grocery_and_pharmacy_percent_change_from_baseline, ³parks_percent_change_from_baseline ## # ℹ 3 more variables: transit_stations_percent_change_from_baseline <dbl>, workplaces_percent_change_from_baseline <dbl>, residential_percent_change_from_baseline <dbl> ``` ] Closer... -- Incidentally, this is the classic quest of the analyst. A lot of time manipulating data to answer a seemingly straightforward question. --- # Let's take a peak ```r vic_mobility %>% filter(date %within% date_interval) %>% ggplot(aes(x = date, y = workplaces_percent_change_from_baseline, colour = sub_region_2)) + geom_line() + guides(colour = "none") ``` <img src="images/lecture-11/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" /> -- Unreadable. -- Suggestions for fixes? --- # So back to the question After playing around with our data, I find it useful to return to the question. For us that was: What should be done to try to reduce COVID cases, and why? -- You've heard talks of movement restrictions being further tightened and are expecting to be asked about that. Let's see if we can get a handle on how explanatory mobility is on cases... --- # Mobility and cases In theory, mobility should be tied closely to cases. For an infectious disease to spread between two people: * There has to be contact between two people * One has to be _susceptible_ * One has to be infectious * Virus has to somehow move from the susceptible person to the infectious person It is hard to control points 2, 3 and 4, so a lot of effort goes to point 1. Mobility restrictions _decrease_ the amount of time an individual spends in community, so while they may still infect those close to them, they should be less likely to see transmission outside their household. --- # Mobility and cases Let's join together our two datasets: .f4[ ```r cases %>% left_join(mobility, by = join_by(diagnosis_date == date, lga_name == sub_region_1)) ``` ``` ## # A tibble: 295,148 × 20 ## diagnosis_date postcode lga_name lga_code RAT_case_count PCR_case_count Total_case_count country_region_code country_region sub_region_2 metro_area iso_3166_2_code census_fips_code place_id ## <date> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <lgl> <chr> <lgl> <chr> ## 1 2020-01-25 3149 Monash (C) 24970 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## 2 2020-01-28 3150 Monash (C) 24970 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## 3 2020-01-30 3006 Melbourne … 24600 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## 4 2020-01-31 3008 Melbourne … 24600 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## 5 2020-02-22 3058 Merri-bek … 25250 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## 6 2020-02-22 3931 Mornington… 25340 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## 7 2020-02-25 3109 Manningham… 24210 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## 8 2020-03-01 3104 Boroondara… 21110 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## 9 2020-03-01 3802 Casey (C) 21610 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## 10 2020-03-04 3006 Melbourne … 24600 0 1 1 <NA> <NA> <NA> NA <NA> NA <NA> ## # ℹ 295,138 more rows ## # ℹ 6 more variables: retail_and_recreation_percent_change_from_baseline <dbl>, grocery_and_pharmacy_percent_change_from_baseline <dbl>, parks_percent_change_from_baseline <dbl>, ## # transit_stations_percent_change_from_baseline <dbl>, workplaces_percent_change_from_baseline <dbl>, residential_percent_change_from_baseline <dbl> ``` ] --- # Mobility and cases We know the focus is on metropolitan Melbourne, so let's focus down on just those at least: ```r metro_lgas <- c( "Melbourne", "Port Phillip", "Yarra", "Stonnington", "Bayside", "Boroondara", "Glen Eira", "Melton", "Brimbank", "Hobsons Bay", "Wyndham", "Moonee Valley", "Maribyrnong", "Mitchell (part)", "Banyule", "Whittlesea", "Nillumbik", "Hume", "Moreland", "Darebin", "Manningham", "Whitehorse", "Knox", "Yarra Ranges", "Maroondah", "Monash", "Kingston", "Frankston", "Cardinia", "Casey", "Greater Dandenong", "Mornington Peninsula" ) ``` -- What's coming next is the classic plight of the data analyst... -- Inconsistent naming conventions. --- # Challenge: Making the LGA names match Download the mobility data and the case data from these two websites: * Case data: [https://discover.data.vic.gov.au/dataset/victorian-coronavirus-data](https://discover.data.vic.gov.au/dataset/victorian-coronavirus-data) * Mobility data: [https://www.google.com/covid19/mobility/](https://www.google.com/covid19/mobility/) Have a go at getting `sub_region_2` to match `lga_name`. When you make some progress, shout it out - let's try to solve it together. --- # Challenge: Making the LGA names match Let's look at the total cases over the previous month to now, and compare that to mobility. ```r cases_vs_mobility <- cases %>% group_by(diagnosis_date, lga_name) %>% summarise(total_cases = sum(Total_case_count)) %>% left_join(vic_mobility, by = join_by(lga_name == new_lga_name, diagnosis_date == date)) %>% group_by(lga_name) %>% summarise( n = sum(total_cases), mean_mobility_change = mean(workplaces_percent_change_from_baseline, na.rm = TRUE) ) ``` --- # Challenge: Making the LGA names match .f4[ ```r cases_vs_mobility ``` ``` ## # A tibble: 30 × 3 ## lga_name n mean_mobility_change ## <chr> <dbl> <dbl> ## 1 Banyule 90 -51.6 ## 2 Bayside 160 -57.0 ## 3 Boroondara 71 -59.5 ## 4 Brimbank 761 -44.5 ## 5 Cardinia 109 -37.7 ## 6 Casey 470 -38.1 ## 7 Darebin 335 -52.2 ## 8 Frankston 131 -40.1 ## 9 Glen Eira 81 -56.8 ## 10 Greater Dandenong 242 -40.3 ## # ℹ 20 more rows ``` ] -- Let's try and plot --- # Plotting our exploratory analysis ```r cases_vs_mobility %>% ggplot(aes(x = n, y = mean_mobility_change)) + geom_point() ``` <img src="images/lecture-11/unnamed-chunk-15-1.png" width="432" style="display: block; margin: auto;" /> -- Thoughts? --- # Plotting our exploratory analysis An aside: Let's think about how to format this for a report. Non-technical readers will struggle with the previous plot, and we can make the point much more clearly. ```r cases_vs_mobility %>% ggplot(aes(x = lga_name, y = mean_mobility_change)) + geom_col() ``` <img src="images/lecture-11/unnamed-chunk-16-1.png" width="432" style="display: block; margin: auto;" /> -- Column graphs like this are very easy to read and understand. This one could do with some work. --- # Challenge: Create this plot <img src="images/lecture-11/unnamed-chunk-17-1.png" width="936" style="display: block; margin: auto;" /> --- # Interpreting our results Before we go to writing our report, it is good practice to think about the ethical dilemma of this situation. ✅ We have a deidentified, aggregated dataset, which we have further aggregated ✅ We have filtered the dataset to only "high" numbers of cases But our conclusions here have real-world impacts: People in a position of power are relying on this evidence to decide whether to increase lockdowns. -- [One doesn't have to look far to see when this can go wrong](https://www.ombudsman.vic.gov.au/our-impact/news/public-housing-tower-lockdown/) -- This is a different type of ethics to what we've discussed previously, but is an important part of generating new knowledge. An ethics committee will only approve research if **the potential benefits of the project justify any risks**. -- Do you think that is the case here? --- # Interpreting our results What is the conclusion here? We have _some_ (weak) evidence that mobility is associated with lower cases. -- **But** there is little to no evidence that further decreasing mobility will decrease cases more. -- #### Why? --- class: transition # Conclusion Challenge for 5 minutes: Write a two sentence executive summary outlining what you think the findings are. Then, swap with someone else in the room and see if you agree. --- # Conclusion and communication The reality is, most stakeholders won't read past your executive summary and a key picture. So, **spend time on the things that matter**! Here, we set out to think about what COVID measures may or may not further impact transmission. We have _answered what probably won't work_ but not provided any solutions. * In most cases this will get sent back with a question: "So what should we do?" How do you deal with that? -- #### Keep looking! -- More exploration will hopefully lead to more possibilities. --- class: bg-gray middle center .idea-box.tl.w-70[ ## Summary * We looked at using open data to answer a specific question * We generated evidence _against_ a specific policy * We thought about how to communicate this, both written and visually * We learnt about the plight of no standards being adhered to ] --- class: transition #### Slides created by Dr Michael Lydeamore --- background-size: cover class: title-slide background-image: url("images/bg-03.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. .bottom_abs.width100[ Lecturer: *Michael Lydeamore* Department of Econometrics and Business Statistics <i class="fas fa-envelope"></i> ETC5512.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 10 <br> ]