```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, 
                      eval = TRUE,
                      cache = FALSE, 
                      warning = FALSE, 
                      message = FALSE,
                      fig.retina = 3)
options(width=80,digits=3)
```  

# Tutorial Solution

## `r emo::ji("gear")` Exercise 1 

<!-- observational or experiment, and according to my taxonomy, describe types of variables, brainstorm questions, given a question define population-->

This question relates to the [Tidy Tuesday Data on locations of alternative fuel recharging stations](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-03-01/readme.md). Have a read through this site, and also visit the link to the data providers, DOT.

### a. Read the details about the data at [DOT](https://data-usdot.opendata.arcgis.com/datasets/usdot::alternative-fueling-stations/about). How is this data collected, do you think? 

*"The U.S. Department of Energy collects this data in partnership with Clean Cities coalitions and their stakeholders to help fleets and consumers find alternative fueling stations." The implication is that alternative fueling stations provide the details of their service station to the database.*

### b. What type of data is this? (observational, experimental, survey, census)

*This is closer to a census than any other type. It is beneficial for a fuel provider to be listed in the database, so we would expect that data is available for (almost) all fuel providers.*

### c. Describe the population, and what is the sample. 

*This data could be considered to be the population, not a sample, for a particular time point. It is collected over time, so the data will continue to expand to include new records.*

### d. Download the data and plot the fueling locations on a map, coloured by fuel type.

```{r fig.width=8, fig.height=3, out.width="100%"}
library(tidyverse)
library(ggthemes)
library(lubridate)
stations <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-01/stations.csv')
# Filter to continental USA, but this cannot 
# be done using states because some records have 
# erroneous lat/long
# stations <- stations %>%
#  filter(!(STATE %in% c("AK", "HI", "PR", "ON")))

# Map the sites
usa <- map_data("state")
# Filter to continental USA using map boundary
stations <- stations %>%
  filter(between(LONGITUDE, min(usa$long)-1, max(usa$long)+1),
         between(LATITUDE, min(usa$lat)-1, max(usa$lat)+1))
ggplot() + 
  geom_path(data=usa, aes(x=long, y=lat, group=group), colour="grey80") + 
  geom_point(data=stations, aes(x=LONGITUDE, 
                                y=LATITUDE,
                                colour=FUEL_TYPE_CODE),
             alpha=0.8) +
  facet_wrap(~FUEL_TYPE_CODE, ncol=4) +
  coord_map() +
  theme_map() +
  theme(legend.position = "none")
```

### e. Count the number of new stations by month, and make a time series plot by fuel type. 

```{r fig.width=8, fig.height=4, out.width="100%"}
# Time line of opening
stations %>% 
  mutate(OPEN_DATE = as.Date(OPEN_DATE)) %>%
  filter(!is.na(OPEN_DATE)) %>%
  mutate(m = month(OPEN_DATE),
         yr = year(OPEN_DATE)) %>%
  #filter(y < 2022) %>%
  mutate(open_yrmth = as.Date(paste(yr, m, "01", sep="-"), "%Y-%M-%d")) %>% 
  group_by(open_yrmth, FUEL_TYPE_CODE) %>%
  summarise(nopen = n(), .groups = "drop") %>%
ggplot(aes(x=open_yrmth, 
           y=nopen, 
           colour=FUEL_TYPE_CODE)) +
  geom_line() +
  facet_wrap(~FUEL_TYPE_CODE, ncol=4, scales="free_y") +
  ylab("# opening") +
  scale_x_date("", date_labels="%y") +
  theme(legend.position = "none")
```

### g. If the question to answer is "which alternative fuel vehicle is the fastest growing?" what is the explanatory (independent, predictor) variable and what is the response variable?

*The response variable will be the count of new stations (or cumulative count or rate of growth) relative to time, and the explanatory variable is fuel type. The data suggests that electric is the fastest growing, and it is rapidly expanding.*


## `r emo::ji("gear")` Exercise 2

Here we will look at the [Chocolate bar ratings](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-01-18/readme.md). Details (brief) of how the data was collected are provided [here](https://github.com/rfordatascience/tidytuesday/issues/408) and more about the data itself is [here](http://flavorsofcacao.com/chocolate_database.html).

```{r}
chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
```

### a. What type of data is this? (observational, experimental, survey, census)

*Observational*

### b. How is the data collected?

*"The Manhattan Chocolate Society’s Brady Brelinski has reviewed 2,500+ bars of craft chocolate since 2006, and compiles his findings into a copy-paste-able table that lists each bar’s manufacturer, bean origin, percent cocoa, ingredients, review notes, and numerical rating. Related: Craft chocolate makers in the US and Canada, also compiled by Brelinski."*

*These are ratings given for chocolates, probably sourced and chosen by one person, over a period of time.*

### c. Describe the population.

*This is a tough one. We'd like to think that these ratings might be useful for our own tastes. However, only one person did the tasting. This likely yields more comparable ratings than if many people tasted them all. Technically the population is that one person, and their perception of the chocolates. Thus the ratings may not infer how other people perceive the chocolates.*

### d. For the question "Which country of origin of the bean obtains the best rating?" state the response and predictor variables.

*The response (dependent) variable is rating, and country of bean origin is the predictor (or explanatory or independent variable).*

### e. Make a plot to answer the previous question. (Only use countries with more than 10 records.)

```{r}
library(forcats)
keep <- chocolate %>% 
  count(country_of_bean_origin, sort = TRUE) %>%
  filter(n>10) %>%
  pull(country_of_bean_origin)
chocolate %>%
  filter(country_of_bean_origin %in% keep) %>%
  ggplot(aes(x=fct_reorder(country_of_bean_origin, rating, mean), 
             y=rating)) +
    geom_jitter(width=0.1) + 
    stat_summary(fun = mean, fun.min = median, fun.max = median,
                 geom = "point", colour = "orange") +
    xlab("") + 
    coord_flip()
```

*There is a lot of variability from country to country, and some chocolates have received very low ratings. Overall Congo has the highest average rating by Brelinski, followed by Cuba, Vietnam, and our near neighbour Papua New Guinea.*

## `r emo::ji("gear")` Exercise 3

Read the description of the study titled ["Clearing the Fog: Is Hydroxychloroquine Effective in Reducing COVID-19 Progression (COVID-19)"](https://clinicaltrials.gov/ct2/show/NCT04491994). 

### a. What type of data is this? (observational, experimental, survey, census)

*This is clearly data from a designed experiment.*

### b. How many subjects participated in the study at the start, and to completion?

*180 started in the control group and 360 in the HCQ group. These dropped to 151 and 349, respectively. Thus, 540 subjects started, and 500 completed the experiment.*

### c. What are the:

- experimental units? 
- factor? 
- blocking factors?
- response variable (outcome measures)?

*The experimental units are the subjects. The factor is the type of treatment received (control or HCQ).*

*"Randomization rules were designed by Dr. Wasim Alamgir together with principal investigators and implemented by an independent statistician who was not involved in data analysis. Stratified random sampling was applied to stratify all eligible patients according to age, gender and comorbidities." Participants were assigned to treatment groups by stratified sampling using age, gender and comorbidities as the strata, so these would be the blocks.* 

*"After start of treatment, development of fever > 101 F for > 72 hours, shortness of breath by minimal exertion (10-Step walk test), derangement of basic lab parameters (ALC < 1000 or raised CRP) or appearance of infiltrates on CXR during course of treatment was labeled as progression irrespective of PCR status." The response variable was whether these symptoms persisted at 5 days.*

### d. How are subjects assigned to treatment groups?

*"Randomization rules were designed by Dr. Wasim Alamgir together with principal investigators and implemented by an independent statistician who was not involved in data analysis. Stratified random sampling was applied to stratify all eligible patients according to age, gender and comorbidities." Participants were assigned to treatment groups by stratified sampling using age, gender and comorbidities as the strata.*

### e. What were the results of the study?

*There was no difference between the subjects treated with HCQ relative to the control. Interestingly, and something of an aside to the main results, is that most patients did not have a progression (see next question, and the numbers reported).*

### f. Construct the data from the results reported. Compute the proportion of subjects with progression of COVID after 5 days, for the two treatments. Include the standard error of the estimate.

```{r}
hcq <- tibble(trt = c("standard", "standard", "hcq", "hcq"),
              progression = c("all", "yes", "all", "yes"), 
              count = c(151, 5, 349, 11))
hcq %>% 
  pivot_wider(names_from = "progression", values_from = "count") %>%
  mutate(p = yes/all) %>%
  mutate(se = sqrt(p*(1-p)/all))
```

### g. Based on the proportions and their standard errors, why would the result of the study be that HCQ does NOT improve the outcomes of COVID patients?

*The proportions for each group are really similar, and the standard error would place both estimates within one standard error of the other. This is strong evidence that both groups of subjects have similar outcomes.*

### h. What is the population for this experiment?

*This is tough. It is a designed experiment, where age, gender and comobidities have been controlled to be equal between the two groups. The experiment was conducted in Pakistan, with the local population and local conditions. The results may be more applicable for these conditions. However, human DNA is very much the same across all sorts of demographics. That experiments are conducted in various communities is usually considered to be irrelevant, and the results are expected to be relevant broadly to other communities. So the population for this study would be considered to be adult humans, quite generally.*

##### Material originally developed by Prof. Di Cook
##### © Copyright 2023 Monash University