In this tutorial, you will practise de-identifying survey data. Start by setting up a project folder with the following structure:
etc5512-week09
├── README.md
├── analysis
│   └── exercise.Rmd
├── data
│   ├── blood_donors.csv
│   └── survey_data.csv
└── etc5512-week09.Rproj
where README.md contains a short summary of what is going to be done and where to look for files, for example "analysis contains the code to analyse the data" and "data contains the two files downloaded from GitHub".
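If you would rather set up this structure from R than by hand, a minimal sketch using base R is shown below (RStudio's New Project dialog will create the .Rproj file for you):
# Create the project folders and an empty README (run from the parent directory)
dir.create("etc5512-week09")
dir.create("etc5512-week09/analysis")
dir.create("etc5512-week09/data")
file.create("etc5512-week09/README.md")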
Read the survey_data.csv file into R. This is simulated data designed to represent a medical survey. The hypothetical premise is that participants were invited to attend a center to take part in a short study investigating blood sugar levels and eating habits. The data was openly released with variables including the date and time of testing (morning 9:00-11:59am, afternoon 12:00-4:59pm, evening 5:00-9:00pm), whether the respondent had high blood sugar (binary high/normal), their age (in years), and their sex (M/F).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
survey_data <- read.csv(here::here("data/survey_data.csv")) # read the survey data from the project's data folder
# Make sure everything is in the correct type
survey_data <- survey_data %>%
mutate(time = as.factor(time),
blood_sugar = as.factor(blood_sugar),
sex = as.factor(sex),
date = as.Date(date))
summary(survey_data)
## time blood_sugar sex age date
## afternoon:17 high :23 F:21 Min. :24.00 Min. :2022-02-11
## evening :13 normal:17 M:19 1st Qu.:39.75 1st Qu.:2022-02-11
## morning :10 Median :48.50 Median :2022-02-11
## Mean :49.73 Mean :2022-02-11
## 3rd Qu.:60.25 3rd Qu.:2022-02-11
## Max. :78.00 Max. :2022-02-11
There is only one day covered in the data: every observation was recorded on 2022-02-11.
Given the data environment, this is probably not enough for identification by a member of the public, as we do not know much about the population or how it was sampled. However, a family member (or the participant themselves) could identify a record using private knowledge: all they would need to know is the respondent's age and that they took part on the 11th of February.
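As a quick sketch of that risk (67 is a hypothetical known age, not a value taken from the data), a single filter is enough to isolate a record when the age is unique:
# A relative who knows the participant is 67 and attended on 11 February
# can filter down to one (or a few) rows and read off the blood sugar result
survey_data %>%
  filter(age == 67, date == as.Date("2022-02-11"))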
blood_donation <- read.csv(here::here("data/blood_donation.csv")) # read the blood donation data from the project's data folder
library(lubridate)
# Make sure everything is in the correct type
blood_donation <- blood_donation %>%
  mutate(sex = as.factor(sex),
         date_time = ymd_hms(date_time)) # parse the timestamp with lubridate
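Before joining, it is worth checking how distinctive age is in each table (a quick sketch using dplyr's n_distinct()): if the number of distinct ages is close to the number of rows, age behaves almost like an identifier.
# Compare the number of distinct ages to the number of rows in each dataset
n_distinct(survey_data$age)
nrow(survey_data)
n_distinct(blood_donation$age)
nrow(blood_donation)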
This is relatively simple: the ages are almost all unique in each dataset, so we can left_join by age and link the records. Now individuals are identifiable from the publicly released information.
identified_data <- left_join(blood_donation, survey_data, by = "age")
## Warning in left_join(blood_donation, survey_data, by = "age"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 2 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
Using the tools and the example code from the lectures, attempt to de-identify the main survey data. There is no single correct solution, so you will need to weigh the re-identification risk against how much information (and therefore utility) you are willing to lose.
Age made it possible to link all cases in the two datasets. However, there are other variables (time of day and sex) that also correspond to relatively rare cells.
Age was the variable that caused the most immediate difficulty. We could remove it from the dataset entirely, but this would result in considerable loss of information. One of the challenges with age is that it can take many distinct values (increasing the chance of unique records), so a natural first step is aggregation into age groups.
survey_data_di <- survey_data %>%
mutate(age_group = cut(age,breaks = c(18,30,45,65,100))) %>% # break age along these lines
select(-age) #remove age
There is still potential risk as the combination of time, sex and age_group will result in some small cells in the sample.
survey_data_di %>%
group_by(time, sex, age_group) %>%
mutate(Freq = n()) %>%
ungroup() %>%
filter(Freq == 1)
## # A tibble: 8 ร 6
## time blood_sugar sex date age_group Freq
## <fct> <fct> <fct> <date> <fct> <int>
## 1 morning normal M 2022-02-11 (18,30] 1
## 2 morning normal F 2022-02-11 (30,45] 1
## 3 afternoon normal F 2022-02-11 (65,100] 1
## 4 morning high F 2022-02-11 (65,100] 1
## 5 morning normal M 2022-02-11 (30,45] 1
## 6 evening high M 2022-02-11 (65,100] 1
## 7 afternoon high M 2022-02-11 (30,45] 1
## 8 morning normal M 2022-02-11 (65,100] 1
There are 8 individuals that are potentially identifiable based on these cross-tabulations. Removing time and sex would decrease the utility of the dataset. We could instead use fewer, larger age groups to reduce the problem. Currently there are four age groups; what happens if we move to three (losing some information and utility)?
survey_data_di2 <- survey_data %>%
mutate(age_group = cut(age,breaks = c(18,40,60,100))) %>% # break age along these lines
select(-age) %>%
group_by(time, sex, age_group) %>%
mutate(Freq = n()) %>%
ungroup() %>%
filter(Freq == 1)
survey_data_di2
## # A tibble: 5 ร 6
## time blood_sugar sex date age_group Freq
## <fct> <fct> <fct> <date> <fct> <int>
## 1 morning normal M 2022-02-11 (18,40] 1
## 2 evening high M 2022-02-11 (40,60] 1
## 3 afternoon normal F 2022-02-11 (60,100] 1
## 4 evening normal F 2022-02-11 (60,100] 1
## 5 morning normal M 2022-02-11 (60,100] 1
Lastly, we will see whether another data user could identify individuals in your data. Email, share on GitHub, or otherwise exchange your new de-identified data with a friend in the class.
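Note that survey_data_di2 above kept only the risky rows for inspection, so to produce a file to share you need to rebuild the de-identified data without the final filter. A sketch (the output file name is just an example) is:
# Rebuild the de-identified dataset: three age groups, exact age removed
survey_data_share <- survey_data %>%
  mutate(age_group = cut(age, breaks = c(18, 40, 60, 100))) %>%
  select(-age)
# Write it out so it can be emailed or pushed to GitHub
write_csv(survey_data_share, here::here("data/survey_data_deidentified.csv"))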
Assuming your friend has made the same decisions as I have, you shouldn't be able to identify any individuals from the data.
The approach that I took meant that I lost quite a lot of age data in order to reduce risk. Other suitable approaches would have been to create a synthetic dataset or to add a small amount of noise to age in order to make it more difficult to identify individuals.
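For instance, a minimal sketch of the noise idea (assuming uniform noise of up to ±3 years; how much noise to add is a judgement call) could look like:
set.seed(2022) # so the perturbation is reproducible
survey_data_noisy <- survey_data %>%
  mutate(age = age + sample(-3:3, n(), replace = TRUE)) # jitter each age by up to 3 years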
Peer review can be used within institutes to identify potential data risks and solutions.