🎯 Objectives

In this tutorial, you will explore re-identification risk in openly released data and practise basic de-identification techniques.

🔧 Preparation

  1. This tutorial uses two synthetic datasets, available on GitHub (https://github.com/joannytan/WCD_datarepo). You will need to download them before the tutorial; a download sketch is given after the project structure below. As we are considering privacy this week, the datasets are simulated, hypothetical data and do not represent real people.
  2. Create the project structure for this week's tutorial.
etc5512-week09
├── README.md
├── analysis
│   └── exercise.Rmd
├── data
│   ├── blood_donors.csv
│   └── survey_data.csv
└── etc5512-week09.Rproj

where README.md can contain a short summary of what is going to be done and where to look for files, for example "analysis contains the code to analyse the data" and "data contains the two files downloaded from GitHub".
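If you prefer to script the download rather than do it by hand, the files can be fetched directly into the data folder. This is a minimal sketch only: the raw-file URLs below assume the CSVs sit at the root of the repository on the main branch, so check the GitHub repository and adjust the paths if they differ.

# Fetch the two CSV files into data/ (the paths within the GitHub repo are assumed; adjust as needed)
base_url <- "https://raw.githubusercontent.com/joannytan/WCD_datarepo/main"

dir.create("data", showWarnings = FALSE)

download.file(paste(base_url, "survey_data.csv", sep = "/"),
              destfile = "data/survey_data.csv", mode = "wb")
download.file(paste(base_url, "blood_donors.csv", sep = "/"),
              destfile = "data/blood_donors.csv", mode = "wb")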

💽 Exercise 9A

Read the survey_data.csv file into R. This is simulated data designed to represent a medical survey. The hypothetical premise is that participants were invited to attend a center to take part in a short study investigating blood sugar levels and eating habits. The data was openly released with variables including the date and time of testing (morning 9:00-11:59am, afternoon 12:00-4:59pm, evening 5:00-9:00pm), whether the respondent had high blood sugar (binary: high/normal), their age (in years), and their sex (M/F).

  1. Read the data into R and use your exploratory data tools to explore the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
survey_data <- read.csv(here::here("data", "survey_data.csv"))
# Make sure everything is in the correct type
survey_data <- survey_data %>%
  mutate(time = as.factor(time),
         blood_sugar = as.factor(blood_sugar),
         sex = as.factor(sex),
         date = as.Date(date))
summary(survey_data)
##         time    blood_sugar sex         age             date           
##  afternoon:17   high  :23   F:21   Min.   :24.00   Min.   :2022-02-11  
##  evening  :13   normal:17   M:19   1st Qu.:39.75   1st Qu.:2022-02-11  
##  morning  :10                      Median :48.50   Median :2022-02-11  
##                                    Mean   :49.73   Mean   :2022-02-11  
##                                    3rd Qu.:60.25   3rd Qu.:2022-02-11  
##                                    Max.   :78.00   Max.   :2022-02-11
  2. What do you notice about the date of the study?

There is only one day covered: every participant was tested on 11 February 2022.
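A quick way to confirm this, beyond what summary() already shows, is to count the observations per date:

# How many survey responses were recorded on each date?
survey_data %>%
  count(date)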

  3. To de-identify the data, the researchers removed directly identifying information, aggregated blood sugar into a binary variable, aggregated testing time into clinically meaningful bins, and released only a sample of their data: one day's worth, 40 participants. Do you think this is sufficient given what you know of the data environment?

Given the data environment, this might be sufficient protection against identification using public information, since we do not know much about the population or the sampling. However, someone with private knowledge, such as a family member (or the participant themselves), could still identify a respondent: all they would need to know is the respondent's age and that they took part on 11 February.
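For example, a relative who knows only a participant's age and that they attended on 11 February could narrow the release down to that person. A minimal sketch, where the age of 62 is purely hypothetical:

# Suppose an acquaintance knows a participant is 62 years old and
# took part on 11 February (the age is chosen here for illustration only)
survey_data %>%
  filter(age == 62, date == as.Date("2022-02-11"))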

  4. Now consider the second dataset. As participants were leaving the trial, they were asked if they would consider giving blood to save lives, and all said yes. As part of a promotion run by the center, blood donors were asked if they would consider giving their age and name to be published on the website to encourage others to give blood. All consented, and the data were uploaded in real time (represented by the date and time in the dataset). Read in the second dataset and explore it.
library(lubridate)
blood_donation <- read.csv(here::here("data", "blood_donors.csv"))
# Make sure variables have the correct types
blood_donation <- blood_donation %>%
  mutate(sex = as.factor(sex),
         date_time = ymd_hms(date_time))
summary(blood_donation)
  5. Given this more complex data context, can you use your tidyverse skills to identify the participants in the data?

This is relatively straightforward. Ages are almost all unique within each dataset, so we can left_join by age and link the blood donors back to their survey records. Individuals are now identifiable using publicly available information.

identified_data <- left_join(blood_donation, survey_data, by = "age")
## Warning in left_join(blood_donation, survey_data, by = "age"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 2 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

🧮 Exercise 9B

Using the tools and the example code from the lectures, attempt to de-identify the main survey data. There is no single correct solution, so you will need to consider:

  1. Which variable(s) made it possible to identify individuals in the data?

Age made it possible to link cases across the two datasets. However, other variables (time of day and sex) also correspond to relatively rare cells.
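One way to see how sparse these combinations are, using only variables already in the data, is a simple cross-tabulation of counts:

# How many respondents fall into each time-of-day by sex combination?
survey_data %>%
  count(time, sex)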

  2. What de-identification tools would be most appropriate to solve the challenges caused by these variables?

Age caused the most immediate difficulty. We could remove it from the dataset entirely, but that would lose considerable information. Because age can take many different values (increasing the chance of unique records), a sensible first step is aggregation into broader age groups.

survey_data_di <- survey_data %>%
  mutate(age_group = cut(age,breaks = c(18,30,45,65,100))) %>% # break age along these lines
  select(-age) #remove age
  3. Is it possible to remove all possible risk? Consider the balance between reducing risk and reducing utility.

There is still potential risk as the combination of time, sex and age_group will result in some small cells in the sample.

survey_data_di %>%
  group_by(time, sex, age_group) %>%
  mutate(Freq = n()) %>%
  ungroup() %>%
  filter(Freq == 1)
## # A tibble: 8 × 6
##   time      blood_sugar sex   date       age_group  Freq
##   <fct>     <fct>       <fct> <date>     <fct>     <int>
## 1 morning   normal      M     2022-02-11 (18,30]       1
## 2 morning   normal      F     2022-02-11 (30,45]       1
## 3 afternoon normal      F     2022-02-11 (65,100]      1
## 4 morning   high        F     2022-02-11 (65,100]      1
## 5 morning   normal      M     2022-02-11 (30,45]       1
## 6 evening   high        M     2022-02-11 (65,100]      1
## 7 afternoon high        M     2022-02-11 (30,45]       1
## 8 morning   normal      M     2022-02-11 (65,100]      1

There are 8 individuals who are potentially identifiable from these cross-tabulations. Removing time and sex would decrease the utility of the dataset, so we could instead use fewer, wider age groups to reduce the problem. Currently there are four age groups; what happens if we move to three (losing some information and utility)?

survey_data_di2 <- survey_data %>%
  mutate(age_group = cut(age,breaks = c(18,40,60,100))) %>% # break age along these lines
  select(-age)  %>%
  group_by(time, sex, age_group) %>%
  mutate(Freq = n()) %>%
  ungroup() %>%
  filter(Freq == 1)

survey_data_di2
## # A tibble: 5 × 6
##   time      blood_sugar sex   date       age_group  Freq
##   <fct>     <fct>       <fct> <date>     <fct>     <int>
## 1 morning   normal      M     2022-02-11 (18,40]       1
## 2 evening   high        M     2022-02-11 (40,60]       1
## 3 afternoon normal      F     2022-02-11 (60,100]      1
## 4 evening   normal      F     2022-02-11 (60,100]      1
## 5 morning   normal      M     2022-02-11 (60,100]      1

๐Ÿ“ Exercise 9C

Lastly, we will see whether another data user could identify individuals in your data. Exchange your new de-identified data with a friend in the class, by email, GitHub, or otherwise.

  1. Can you identify individuals using your friend's de-identified data?

Assuming your friend has made the same decisions as I have, you shouldn't be able to identify any individuals from the data.
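To make the attempt concrete, you could repeat the linkage attack from Exercise 9A against the de-identified release. A minimal sketch, assuming your friend released something like survey_data_di above (age replaced by age_group); the breaks used to rebuild age_group in the donor data are an assumption and would need to match whatever your friend actually used:

# Rebuild comparable age groups in the auxiliary (blood donation) data,
# then attempt to link on the only remaining common quasi-identifier
attack <- blood_donation %>%
  mutate(age_group = cut(age, breaks = c(18, 30, 45, 65, 100))) %>%  # assumed breaks
  left_join(survey_data_di, by = "age_group", relationship = "many-to-many")

# Each donor now matches several survey records rather than exactly one
attack %>%
  count(age, age_group)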

  2. Did you and your friend make the same decisions to de-identify the data? If not, what are the pros and cons of each?

The approach I took meant losing quite a lot of age information in order to reduce risk. Other suitable approaches would have been to create a synthetic dataset or to add a small amount of noise to age, making it harder to link individuals, as sketched below.
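As an illustration of the noise idea, here is a minimal sketch; the up-to-three-year perturbation and the seed are arbitrary choices for this example, not a recommendation:

set.seed(2022)  # seed chosen arbitrarily so this illustration is reproducible

# Perturb each age by a small random amount (here up to +/- 3 years) before release
survey_data_noisy <- survey_data %>%
  mutate(age = age + sample(-3:3, size = n(), replace = TRUE))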

  3. How do you think peer review can assist data de-identification procedures?

Peer review can be used within institutions to identify potential disclosure risks, and possible solutions, before data are released.

Material developed by Dr. Joan Tan. Material maintained by Dr. Kate Saunders.