🎯 Objectives

In this tutorial, you will

🔧 Preparation

  1. This tutorial uses two synthetic datasets, available on GitHub (https://github.com/joannytan/WCD_datarepo). You will need to download them before the tutorial. As we are considering privacy this week, the datasets are simulated and hypothetical. They do not represent real data.
  2. Create the project structure for this week’s tutorial.
etc5512-week09
├── README.md
├── analysis
│   └── exercise.Rmd
├── data
│   ├── blood_donors.csv
│   └── survey_data.csv
└── etc5512-week09.Rproj

where README.md can contain a short summary of what is going to be done and where to look for files, like “analysis contains the code to analyse the data” and “data contains two files downloaded from GitHub”.
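If you prefer to build this skeleton from the R console rather than by hand, a minimal sketch in base R might look like the following. The `root` location under `tempdir()` is an assumption made so the example is safe to run anywhere; you would normally point it at your own working folder. The `.Rproj` file is usually created by RStudio when you make a new project, so it is not created here.

```r
# Sketch: create the folder skeleton for this week's project.
# ASSUMPTION: `root` lives under tempdir() purely for illustration;
# replace it with the folder where you keep your unit material.
root <- file.path(tempdir(), "etc5512-week09")

# Create the two sub-folders (recursive = TRUE also creates `root` itself).
dir.create(file.path(root, "analysis"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)

# Create empty placeholder files. Note that file.create() truncates a file
# that already exists, so only run this on a fresh project.
created <- file.create(
  file.path(root, "README.md"),
  file.path(root, "analysis", "exercise.Rmd")
)

# List what now exists, relative to the project root.
# (Empty folders such as `data` are not shown until they contain files.)
list.files(root, recursive = TRUE)
```

The two CSV files still need to be downloaded from the repository and saved into `data`.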

💽 Exercise 9A

Read the survey_data.csv file into R. This is simulated data designed to represent a medical survey. The hypothetical premise is that participants were invited to attend a center to take part in a short study investigating blood sugar levels and eating habits. The data was openly released with variables including the date and time of testing (morning (9:00-11:59), afternoon (12:00-4:59), evening (5-9pm)), whether the respondent had high blood sugar (coded as binary high/normal), their age (measured in years), and their sex (M/F).

  1. Read the data into R and use your exploratory data tools to explore the data.

  2. What do you notice about the date of study?

  3. To de-identify the data, the researchers removed directly identifying information, aggregated blood sugar into a binary variable, aggregated testing time into clinically meaningful bins, and released only a sample of their data: one day’s worth, covering 40 participants. Do you think this is sufficient given what you know of the data environment?

  4. Now consider the second dataset. As participants were leaving the trial, they were asked if they would consider giving blood to save lives. All said yes. As part of a promotion run by the center, blood donors were asked if they would consider giving their age, sex and name to be published on the website to encourage others to give blood. All consented, and the data was uploaded in real time (which is represented by date and time in the dataset). Read in the second dataset and explore it.

  5. Given this more complex data context, can you use your skills in tidyverse to identify the participants in the data?
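One way to think about the linkage question above is as a join on the variables the two releases share. The sketch below uses two small invented tables, not the tutorial files. Every column name (`test_date`, `time_of_day`, `high_blood_sugar`, `uploaded`, and so on) and every value is a hypothetical stand-in, so check the real column names after reading the CSV files (for example with `readr::read_csv("data/survey_data.csv")`). It also assumes the upload timestamp falls in the same time-of-day bin as the test, which may not hold for every participant.

```r
library(dplyr)

# Toy stand-in for the survey data (all values invented for illustration).
survey_toy <- tibble(
  test_date = as.Date("2021-05-03"),
  time_of_day = c("morning", "morning", "afternoon", "evening"),
  high_blood_sugar = c("high", "normal", "normal", "high"),
  age = c(34, 52, 34, 61),
  sex = c("F", "M", "F", "M")
)

# Toy stand-in for the public donor list, with a real-time upload timestamp.
donors_toy <- tibble(
  name = c("Person A", "Person B", "Person C", "Person D"),
  age = c(34, 52, 34, 61),
  sex = c("F", "M", "F", "M"),
  uploaded = as.POSIXct(
    c("2021-05-03 09:40:00", "2021-05-03 10:15:00",
      "2021-05-03 13:05:00", "2021-05-03 18:30:00"),
    tz = "UTC"
  )
)

# Recode the upload timestamp into the same date and time bins as the survey.
donors_binned <- donors_toy %>%
  mutate(
    test_date = as.Date(uploaded),
    hour = as.integer(format(uploaded, "%H")),
    time_of_day = case_when(
      hour < 12 ~ "morning",
      hour < 17 ~ "afternoon",
      TRUE ~ "evening"
    )
  )

# Link the two tables on the shared quasi-identifiers.
linked <- survey_toy %>%
  inner_join(donors_binned, by = c("test_date", "time_of_day", "age", "sex"))

# Each linked row now pairs a name with a blood sugar result.
linked %>% select(name, high_blood_sugar, age, sex, time_of_day)
```

In this toy example, the two 34-year-old women cannot be told apart by age and sex alone, but the time-of-day bin separates them. That is the kind of interaction between variables worth looking for in the real data.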

🧮 Exercise 9B

Using the tools and the example code from the lectures, attempt to de-identify the main survey data. There is not one solution to this, so you will need to consider:

  1. Which variable(s) made it possible to identify individuals in the data?

  2. What de-identification tools would be most appropriate to solve the challenges caused by these variables?

  3. Is it possible to remove all possible risk? Consider the balance between reducing risk and reducing utility.
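As a starting point for the questions above, the sketch below shows one possible combination of generalisation (grouping age into bands) and suppression (dropping variables), followed by a check of how many records share each remaining combination of values. It is not the lecture code and not the only reasonable approach. The toy table, its column names, the 20-year band width, and the choice of variables to drop are all assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the survey data (all values invented for illustration).
survey_toy <- tibble(
  test_date = as.Date("2021-05-03"),
  time_of_day = c("morning", "morning", "afternoon", "evening"),
  high_blood_sugar = c("high", "normal", "normal", "high"),
  age = c(34, 52, 38, 57),
  sex = c("F", "M", "F", "M")
)

deidentified <- survey_toy %>%
  # Generalisation: replace exact age with a 20-year band, e.g. [20,40).
  mutate(age_band = cut(age, breaks = seq(0, 100, by = 20), right = FALSE)) %>%
  # Suppression: drop exact age and the timing variables entirely.
  select(-age, -test_date, -time_of_day)

# How many records share each combination of the remaining quasi-identifiers?
group_sizes <- deidentified %>%
  count(age_band, sex, name = "n_records")

# The smallest group size is a rough indicator of re-identification risk:
# a value of 1 means at least one record is still unique on these variables.
min(group_sizes$n_records)
```

The cost is visible in what was removed: any analysis of blood sugar by time of day or by exact age is no longer possible with `deidentified`, which is the risk versus utility trade-off raised in question 3.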

📍 Exercise 9C

Lastly, we will see if another data user could identify individuals in your data. Exchange your new de-identified data with a friend in the class by email, GitHub, or another method.

  1. Can you identify individuals using your friend’s de-identified data?

  2. Did you and your friend make the same decisions to de-identify the data? If not, what are the pros and cons of each?

  3. How do you think peer review can assist data de-identification procedures?

Material maintained by Dr. Kate Saunders. Original material developed by Dr. Joan Tan.