🎯 Objectives

In this tutorial, you will learn

🔧 Preparation

  1. Install the following R packages (you can copy and paste the code below into your RStudio Console window to do this):
install.packages(c("lubridate", "ggthemes", "forcats"))

If you missed tutorial 1, you will also need to install the packages listed in its instructions.

  2. Complete the assigned reading and the associated quiz (in flux).

👋 Getting started

Say hello to your tutor and to your neighbours!

Tutorial

⚙️ Exercise 1

This question relates to the Tidy Tuesday Data on locations of alternative fuel recharging stations. Have a read through this site, and also visit the link to the data providers, DOT.

a. Read the details about the data at DOT. How is this data collected, do you think?

b. What type of data is this? (observational, experimental, survey, census)

c. Describe the population and the sample.

d. Download the data and plot the fueling locations on a map, coloured by fuel type.

library(tidyverse)
library(ggthemes)
library(lubridate)
stations <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-01/stations.csv')
# Filter to continental USA, but this cannot 
# be done using states because some records have 
# erroneous lat/long
# stations <- stations %>%
#  filter(!(STATE %in% c("AK", "HI", "PR", "ON")))

# Map the sites
usa <- map_data("state")
# Filter to continental USA using map boundary
stations <- stations %>%
  filter(between(LONGITUDE, min(usa$long)-1, max(usa$long)+1),
         between(LATITUDE, min(usa$lat)-1, max(usa$lat)+1))
ggplot() + 
  geom_path(data=usa, aes(x=long, y=lat, group=group), colour="grey80") + 
  geom_point(data=stations, aes(x=LONGITUDE, 
                                y=LATITUDE,
                                colour=FUEL_TYPE_CODE),
             alpha=0.8) +
  facet_wrap(~FUEL_TYPE_CODE, ncol=4) +
  coord_map() +
  theme_map() +
  theme(legend.position = "none")

e. Count the number of new stations by month, and make a time series plot by fuel type.

# Time line of opening
stations %>% 
  mutate(OPEN_DATE = as.Date(OPEN_DATE)) %>%
  filter(!is.na(OPEN_DATE)) %>%
  mutate(m = month(OPEN_DATE),
         yr = year(OPEN_DATE)) %>%
  # filter(yr < 2022) %>%
  # Build a first-of-month date; note %m is month (%M would be minutes)
  mutate(open_yrmth = as.Date(paste(yr, m, "01", sep="-"), "%Y-%m-%d")) %>%
  group_by(open_yrmth, FUEL_TYPE_CODE) %>%
  summarise(nopen = n(), .groups = "drop") %>%
  ggplot(aes(x=open_yrmth,
             y=nopen,
             colour=FUEL_TYPE_CODE)) +
    geom_line() +
    facet_wrap(~FUEL_TYPE_CODE, ncol=4, scales="free_y") +
    ylab("# opening") +
    scale_x_date("", date_labels="%y") +
    theme(legend.position = "none")
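
The pipeline above builds the year-month date by pasting strings together and parsing the result. As an alternative sketch (not part of the original solution), lubridate's `floor_date()` rounds each date down to the first day of its month directly. The dates below are made up purely for illustration.

```r
library(lubridate)

# Made-up opening dates, for illustration only
open_date <- as.Date(c("2021-03-15", "2021-03-28", "2021-04-02"))

# Round each date down to the first day of its month
open_yrmth <- floor_date(open_date, unit = "month")
open_yrmth

# Count how many openings fall in each month
table(open_yrmth)
```

In the stations pipeline, this would replace the separate month, year and paste steps with a single `mutate(open_yrmth = floor_date(OPEN_DATE, unit = "month"))`, assuming `OPEN_DATE` has already been converted to a date.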

f. If the question to answer is “which alternative fuel vehicle is the fastest growing?” what is the explanatory (independent, predictor) variable and what is the response variable?

⚙️ Exercise 2

Here we will look at the chocolate bar ratings data. Brief details of how the data was collected are provided here, and more about the data itself is here.

chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

a. What type of data is this? (observational, experimental, survey, census)

b. How is the data collected?

c. Describe the population.

d. For the question “Which country of origin of the bean obtains the best rating?” state the response and predictor variables.

e. Make a plot to answer the previous question. (Only use countries with more than 10 records.)

library(forcats)
keep <- chocolate %>% 
  count(country_of_bean_origin, sort = TRUE) %>%
  filter(n>10) %>%
  pull(country_of_bean_origin)
chocolate %>%
  filter(country_of_bean_origin %in% keep) %>%
  ggplot(aes(x=fct_reorder(country_of_bean_origin, rating, mean), 
             y=rating)) +
    geom_jitter(width=0.1) + 
    # Overlay the mean rating for each country as an orange point
    stat_summary(fun = mean, geom = "point", colour = "orange") +
    xlab("") + 
    coord_flip()

⚙️ Exercise 3

Read the description of the study titled “Clearing the Fog: Is Hydroxychloroquine Effective in Reducing COVID-19 Progression (COVID-19)”.

a. What type of data is this? (observational, experimental, survey, census)

b. How many subjects participated in the study at the start, and to completion?

c. What are the:

  • experimental units?
  • factor?
  • blocking factors?
  • response variable (outcome measures)?

d. How are subjects assigned to treatment groups?

e. What were the results of the study?

f. Construct the data from the results reported. Compute the proportion of subjects with progression of COVID after 5 days, for the two treatments. Include the standard error of the estimate.

hcq <- tibble(trt = c("standard", "standard", "hcq", "hcq"),
              progression = c("all", "yes", "all", "yes"), 
              count = c(151, 5, 349, 11))
hcq %>% 
  pivot_wider(names_from = "progression", values_from = "count") %>%
  mutate(p = yes/all) %>%
  mutate(se = sqrt(p*(1-p)/all))

g. Based on the proportions and their standard errors, why would the result of the study be that HCQ does NOT improve the outcomes of COVID patients?
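
One way to approach this question is to look at the difference between the two proportions relative to its standard error. The sketch below uses the same counts as the tibble above. It treats the two groups as independent and relies on a normal approximation, which is rough when there are so few events, so read it as a guide rather than a formal test. The variable names are chosen here for illustration.

```r
# Counts as used in the tibble above: progressions and group sizes
n_std <- 151; x_std <- 5
n_hcq <- 349; x_hcq <- 11

p_std <- x_std / n_std
p_hcq <- x_hcq / n_hcq

# Standard error of each proportion, same formula as above
se_std <- sqrt(p_std * (1 - p_std) / n_std)
se_hcq <- sqrt(p_hcq * (1 - p_hcq) / n_hcq)

# Difference in proportions and its standard error (independent groups)
p_diff <- p_hcq - p_std
se_diff <- sqrt(se_std^2 + se_hcq^2)

# Rough interval: estimate plus or minus two standard errors
c(lower = p_diff - 2 * se_diff,
  estimate = p_diff,
  upper = p_diff + 2 * se_diff)
```

If the interval includes zero, the data do not distinguish the progression rates of the two treatment groups.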

h. What is the population for this experiment?

Optional extra learning exercise

This exercise will prepare you for next week's material. Please complete the following steps:

  1. Enter the address https://olc.worldbank.org/ in your browser.

  2. Click on “Register” and create a profile.

  3. Once you log in, use the following link to enroll in the course Open Data for Data Users (Self-Paced): https://olc.worldbank.org/content/open-data-data-users-self-paced

  4. Allow pop-ups for the site in your browser settings so that the course runs correctly.

  5. Follow the whole course and answer the questions throughout.

🛑 Wrapping up

Talk to your tutor about what you think you learned today, what was easy, what was fun, what you found hard.

Material originally developed by Prof. Di Cook and maintained by Dr. Kate Saunders