ETC5512: Wild Caught Data

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox, and occasionally need to be refreshed if elements do not load properly. See <a href=lecture-12.pdf>here for the PDF</a>.
]

---

# .monash-blue[ETC5512: Wild Caught Data]

<h2 style="font-weight:900!important;">The proper care and feeding of wild data</h2>

.bottom_abs.width100[

Lecturer: *Kate Saunders*

Department of Econometrics and Business Statistics

ETC5512.Clayton-x@monash.edu

Week 12

]

---
# The time has come to wrap up this unit

Suppose you are the data curator. What should you know?

.aim-box.tl.w-70[
Today you will learn:

- About organising data into spreadsheets for analysis
- Rules for the care and feeding of your data
- A realistic guide to making data available

]


We will also discuss SETUs and Assignment 4 today.

---

# Back in week 1 ...

We learnt **OPEN DATA** is a raw material for the digital age but,

unlike coal, timber or diamonds,

it can be used by anyone and everyone at the same time.

https://www.europeandataportal.eu/elearning/en/module1/#/id/co-01

Let's remind ourselves, with an example, why open data is important!

---
class: motivator

<div align="center">
<blockquote class="twitter-tweet">Today, three of the authors have retracted &quot;Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis&quot; Read the Retraction notice and statement from The Lancet <a href="https://t.co/pPNCJ3nO8n">https://t.co/pPNCJ3nO8n</a> <a href="https://t.co/pB0FBj6EXr">pic.twitter.com/pB0FBj6EXr</a>&mdash; The Lancet (@TheLancet) <a href="https://twitter.com/TheLancet/status/1268613313702891523?ref_src=twsrc%5Etfw">June 4, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</div>

---

background-image: \url(https://i.guim.co.uk/img/media/68b2cc3ee316f8a61e3df97b36954c1c6a20638c/0_0_4166_2500/master/4166.jpg?width=620&quality=45&auto=format&fit=max&dpr=2&s=e3358e1170461eb2904f7534dad4de8f)
background-size: cover

---

# Results published in The Lancet

- An article in The Lancet "found Covid-19 patients who received the malaria drug, hydroxychloroquine, were dying at higher rates and experiencing more heart-related complications than other virus patients".

- The Lancet is *one of the oldest and best known journals* that publishes general medical articles.

- Within days, the World Health Organization had halted its support for trials of hydroxychloroquine.

- Australian infectious disease researchers began questioning the published results very quickly.

---

# Something fishy

- *The data the researchers used to draw their conclusions in the Lancet article is not readily available in Australian clinical databases.*

- This led many to ask where the data came from.

- The numbers for the Australian cases did not match the figures known to researchers here.

- Most journals require the data and software to be made available so that others can verify the results. This is increasingly the standard.

.idea-box.tl.w-70[

So the Australian infectious disease specialists made some phone calls ...

]

---

# Hello, can I ask you about your data?

The first call was to the National Notifiable Diseases Surveillance System, which confirmed that it was not the source of the data.

Next, calls went to the health departments in NSW and Victoria, which also confirmed that they did not provide the data.

And then to the hospitals themselves.

Which prompted this response:

*Dr Allen Cheng, an epidemiologist and infectious disease doctor with Alfred Health in Melbourne, said the Australian hospitals involved in the study should be named. He said he had never heard of Surgisphere, and no one from his hospital, The Alfred, had provided Surgisphere with data. "Usually to submit to a database like Surgisphere you need ethics approval, and someone from the hospital will be involved in that process to get it to a database," he said. He said the dataset should be made public, or at least open to an independent statistical reviewer. "If they got this wrong, what else could be wrong?" Cheng said.*

---
class: split-20

.column[
<img src="https://i.guim.co.uk/img/media/68b2cc3ee316f8a61e3df97b36954c1c6a20638c/0_0_4166_2500/master/4166.jpg?width=620&quality=45&auto=format&fit=max&dpr=2&s=e3358e1170461eb2904f7534dad4de8f" width="100%">

]

.column[
<blockquote class="twitter-tweet">Once I realised the data in That <a href="https://twitter.com/hashtag/LancetGate?src=hash&amp;ref_src=twsrc%5Etfw">#LancetGate</a> study was probably fabricated I couldn&#39;t do anything else and had to write a blog post about it. Not only is Surgisphere far too small to have software in 671 hospitals, their claimed awards are dodgy: <a href="https://t.co/Ro8vEvpZqc">https://t.co/Ro8vEvpZqc</a>&mdash; Peter Ellis (@ellis2013nz) <a href="https://twitter.com/ellis2013nz/status/1266739627701854208?ref_src=twsrc%5Etfw">May 30, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">Investigation from me in Melbourne and Stephanie Kirchgaessner in the US: Governments and WHO changed Covid-19 policy based on suspect data from tiny US company named Surgisphere: <a href="https://t.co/LtyG5UnldX">https://t.co/LtyG5UnldX</a>&mdash; Melissa Davey (@MelissaLDavey) <a href="https://twitter.com/MelissaLDavey/status/1268135649615310849?ref_src=twsrc%5Etfw">June 3, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

]

---


<blockquote class="twitter-tweet">New piece on the <a href="https://twitter.com/hashtag/Surgisphere?src=hash&amp;ref_src=twsrc%5Etfw">#Surgisphere</a> saga from me: Unreliable data: how doubt snowballed over Covid-19 drug research that swept the world <a href="https://twitter.com/hashtag/opendata?src=hash&amp;ref_src=twsrc%5Etfw">#opendata</a> <a href="https://twitter.com/hashtag/openscience?src=hash&amp;ref_src=twsrc%5Etfw">#openscience</a> <a href="https://twitter.com/hashtag/hydroxychloroquine?src=hash&amp;ref_src=twsrc%5Etfw">#hydroxychloroquine</a> <a href="https://t.co/cI4VfcXeZy">https://t.co/cI4VfcXeZy</a>&mdash; Melissa Davey (@MelissaLDavey) <a href="https://twitter.com/MelissaLDavey/status/1268515172563341313?ref_src=twsrc%5Etfw">June 4, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">Retracted studies may have damaged public trust in science, top researchers fear <a href="https://t.co/hNsEM1hYnx">https://t.co/hNsEM1hYnx</a>&mdash; Melissa Davey (@MelissaLDavey) <a href="https://twitter.com/MelissaLDavey/status/1269058847039090688?ref_src=twsrc%5Etfw">June 6, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---
class: split-50
# Success story of open data

- Data related to the COVID-19 pandemic has been collated by many organisations across the globe and made freely available.

<center> .font_large[👩🏽‍💻 👨🏽‍💻 👩🏼‍💻 👨🏾‍💻] </center>

- These numbers led to suspicions about the article's claims.


---
class: split-50

# Johns Hopkins COVID19

- [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19)
- Jan 23 (?) start of data collection 
- [COVID Live](https://covidlive.com.au/)
- [NYTimes](https://github.com/nytimes/covid-19-data) 
- [Monash team](https://github.com/covid-19-au/covid-19-au.github.io)

.column[

A vast number of people and organisations are collating data, and others often cross-check the numbers between sites.

]

---
class: split-50

# Difficulties

- Changing formats!

> *... collated by Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE) ... we will nevertheless scrape data from the relevant wikipedia pages, because it tends to be more detailed and better referenced than the equivalent JHU data ...* [Tim Churches blog](https://timchurches.github.io/blog/) Mar 1

- Changing links!
- So many links on the website - which data to use?

---
# Spreadsheets


---
class: split-50
# Spreadsheets for computer consumption

.footnote[Broman and Woo (2018) Data Organization in Spreadsheets https://doi.org/10.1080/00031305.2017.1375989]

- write dates like YYYY-MM-DD,
- do not leave any cells empty, 
- put just one thing in a cell, 
- organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), 
- create a data dictionary,


- do not include calculations in the raw data files, 
- do not use font color or highlighting as data, 
- choose good names for things, 
- make backups, 
- use data validation to avoid data entry errors, and 
- save the data in plain text files.
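
---

# Checking the spreadsheet rules: a sketch

A few of the rules on the previous slide can be checked mechanically. The sketch below is a minimal illustration in Python and is not taken from Broman and Woo's paper: the function name `check_rows`, the sample columns and the sample values are all made up for this slide. It flags empty cells and dates that are not written as YYYY-MM-DD.

```python
import csv
import io
import re

# Shape of an ISO 8601 calendar date, e.g. 2020-06-04.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_rows(text, date_columns):
    """Return a list of (line_number, column, problem) tuples."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        line_number = reader.line_num  # line 1 is the header row
        for column, value in row.items():
            if column is None:
                # DictReader stores surplus cells under the key None.
                problems.append((line_number, "(extra)", "more cells than headers"))
            elif value is None or value.strip() == "":
                problems.append((line_number, column, "empty cell"))
            elif column in date_columns and not ISO_DATE.fullmatch(value.strip()):
                problems.append((line_number, column, "date not YYYY-MM-DD"))
    return problems


# Hypothetical data: one badly formatted date and one empty cell.
SAMPLE = """subject,visit_date,weight_kg
A01,2020-06-04,71.2
A02,04/06/2020,68.9
A03,2020-06-05,
"""

for problem in check_rows(SAMPLE, date_columns={"visit_date"}):
    print(problem)
```

This only checks the shape of a date, not whether it is a real calendar date.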

---
class: split-50

.pull-left[
<a href="https://imgs.xkcd.com/comics/iso_8601_2x.png"> <img src="https://imgs.xkcd.com/comics/iso_8601_2x.png" width="100%"> </a>
]
.pull-right[

- [Microsoft Excel’s treatment of dates can cause problems in data](https://storify.com/kara_woo/excel-date-system-fiasco)
- Excel stores dates internally as numbers, with different conventions on Windows and Macs.
- Excel also has a tendency to turn other things into dates.

]
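
---

# Excel serial dates: a sketch

To see why the two conventions matter, the sketch below converts the same stored number under both. It assumes the conventions in question are the ones usually called the 1900 and 1904 date systems. The function name `serial_to_date` is hypothetical, and which system a given workbook uses has to be checked in the workbook itself.

```python
from datetime import date, timedelta

# Day zero for each date system. The 1900 system is anchored at 1899-12-30
# because Excel counts a 29 February 1900 that never existed, so this offset
# is only correct for serial numbers from 61 (1 March 1900) onwards.
EPOCH_1900 = date(1899, 12, 30)
EPOCH_1904 = date(1904, 1, 1)


def serial_to_date(serial, system="1900"):
    """Convert a whole-day Excel serial number to a calendar date."""
    if system == "1900":
        if serial < 61:
            raise ValueError("serials below 61 are affected by the 1900 leap-year quirk")
        return EPOCH_1900 + timedelta(days=serial)
    if system == "1904":
        return EPOCH_1904 + timedelta(days=serial)
    raise ValueError("system must be '1900' or '1904'")


stored = 43986  # the same number, read under each convention
print(serial_to_date(stored, system="1900"))
print(serial_to_date(stored, system="1904"))
```

The two results are 1462 days apart, which is why writing dates as YYYY-MM-DD text is the safer choice.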

---
class: split-two

.row[
.pull-left[
 
 
<center>
.monash-blue[**The cells in your spreadsheet should each contain one piece of data. Do not put more than one thing in a cell.**]
</center>
]

<center>
For example, a column recording "plate position" as "plate-well" would be better separated into "plate" and "well" columns.
</center>
]
]
.row[

- Remember the airlines data: time zone in one column, departure time in another. This is partly a technical constraint, because a single date-time column can't hold multiple time zones.
- Also, the data is distributed as Year, Month, Day columns, which is safer across systems.
]
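
---

# One thing per cell: a sketch

Splitting a combined cell is usually a one-line operation once the separator is known. The sketch below assumes the "plate-well" values use a single hyphen as separator; the function name `split_plate_position` and the sample rows are made up for this slide.

```python
def split_plate_position(position):
    """Split 'plate-well' text such as '13-A01' into (plate, well)."""
    plate, separator, well = position.partition("-")
    if not separator or not plate or not well:
        raise ValueError(f"expected 'plate-well', got {position!r}")
    return plate, well


# Hypothetical rows with two things in one cell.
rows = [
    {"sample": "s1", "plate_position": "13-A01"},
    {"sample": "s2", "plate_position": "13-A02"},
]

# Rebuild each row with separate "plate" and "well" columns.
tidy = []
for row in rows:
    plate, well = split_plate_position(row["plate_position"])
    tidy.append({"sample": row["sample"], "plate": plate, "well": well})

for row in tidy:
    print(row)
```

Raising an error on unexpected values is a deliberate choice here: a silent guess would hide data entry problems.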

---

The census has an extensive data dictionary for each census year released, giving variable names and an explanation of the levels of categorical variables.

However, these dictionaries are stored completely separately from where you access the census data, which makes them more of a hassle to use than they need to be.

<img src="images/lecture-12/census_dictionary.png" width="100%"> 
 
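
---

# Checking data against its dictionary: a sketch

When the data dictionary is kept as a plain text file next to the data, it can be checked automatically. The sketch below is a minimal illustration: the column names are invented and are not real census variable names, and the function name `undocumented_columns` is hypothetical.

```python
import csv
import io

# Hypothetical data file and data dictionary, both kept as plain text.
DATA = """region_id,median_age,tenure
R001,34,owned
R002,41,rented
"""

DICTIONARY = """variable,description
region_id,Identifier for the region
median_age,Median age of residents in years
"""


def undocumented_columns(data_text, dictionary_text):
    """List data columns that have no entry in the data dictionary."""
    header = next(csv.reader(io.StringIO(data_text)))
    entries = csv.DictReader(io.StringIO(dictionary_text))
    documented = {entry["variable"] for entry in entries}
    return [column for column in header if column not in documented]


print(undocumented_columns(DATA, DICTIONARY))
```

Here the `tenure` column is reported, because the dictionary has no row describing it.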

---
class: refresher

background-image: \url(https://rmitconservationscience.files.wordpress.com/2016/08/feral-cat-and-phascogale-credit-fredy-mercay.jpg)
background-size: 70%

---
class: motivator middle

# You can validate the integrity of your csv file with

http://csvlint.io
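
---

# A local check on CSV shape: a sketch

As a rough local stand-in, and not a description of what csvlint itself checks, the sketch below tests one property of a well-formed file: every row has as many cells as the header. The function name `ragged_rows` and the sample text are made up for this slide.

```python
import csv
import io


def ragged_rows(text):
    """Return (row_number, cell_count) for rows whose width differs from the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    width = len(rows[0])
    return [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if len(row) != width
    ]


# Hypothetical file: row 3 is one cell short, row 4 has one cell too many.
SAMPLE = "id,name,score\n1,Ann,9\n2,Bob\n3,Cy,7,extra\n"

print(ragged_rows(SAMPLE))
```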

---
class: motivator middle

# Goodman et al. (2014) Ten Simple Rules for the Care and Feeding of Scientific Data

---

🤔 As we look at these rules, think about what this implies for business and government data.

---

# Care and feeding

1. Love Your Data, and Help Others Love It, Too
--

2. Share Your Data Online, with a Permanent Identifier
--

3. Conduct Science with a Particular Level of Reuse in Mind
--

4. Publish Workflow as Context
--

5. Link Your Data to Your Publications as Often as Possible
--

6. Publish Your Code (Even the Small Bits)
--

7. State How You Want to Get Credit
--

8. Foster and Use Data Repositories
--

9. Reward Colleagues Who Share Their Data Properly
--

10. Be a Booster for Data Science
--

---
# Love Your Data, and Help Others Love It, Too

**What are some ways to show your love?**

**What data have we seen that isn't loved?**


.w-45[
- Nurture:
    - feed
    - hug, check on it
    - dress it nicely
    - give it a name

- Show it off:
    - tell someone about it
    - demonstrate how it can be used
 
]

---
# Share Your Data Online, with a Permanent Identifier

- Give it a name: a digital object identifier (DOI)
- Provide adequate documentation and metadata
- Employ good curation practices


Common resources:

- [Zenodo](http://zenodo.org/)
- [FigShare](http://figshare.com/)
- [Dataverse](http://thedata.org/)
- [Dryad](http://datadryad.org/)

---
# Conduct Science with a Particular Level of Reuse in Mind

Replace "science" with "data science", "data analysis", "analytics", "business intelligence".

- keep careful track of versions of data and code
- to be fully reproducible, *provenance* information is a must:
    - working pipeline analysis code, 
    - a platform to run it on, and
    - verifiable versions of the data. 
- what types of re-use do you think others might make of your work?
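
---

# Verifiable versions of the data: a sketch

One common way to make a data version verifiable is to publish a checksum alongside the file, so that anyone can confirm they hold exactly the same bytes. The sketch below is a minimal illustration: choosing SHA-256 is an assumption of this slide rather than something the rules prescribe, and the function name `sha256_of_file` and the file `cases.csv` are made up.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Write a tiny hypothetical data file, then fingerprint it.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "cases.csv")
    with open(path, "w", newline="") as handle:
        handle.write("date,cases\n2020-06-04,10\n")
    checksum = sha256_of_file(path)

print(checksum)
```

Recording this value with each release lets a reader detect any later change to the file, however small.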

---
# Reward Colleagues Who Share Their Data Properly

- Build promotion and award systems that count data and code-sharing activities.
- Consider this activity an important part of your own data science work. 
- Clear guidelines for credit

---
class: split-50

# Johns Hopkins COVID19

What's really nice 😄

- [Github page](https://github.com/CSSEGISandData/COVID-19)
- Compiled data from various sources, sources listed
- Update time stamp
- Versioning 
- Issues for two-way conversations with users

---
# BTS air traffic

[Bureau of Transportation Statistics](https://www.transtats.bts.gov/DataIndex.asp) (Assignment 1)

- Many, many different tables. The extent and value of the on-time performance database may not be immediately obvious. You need to know what you are looking for: there are many links, and the data is several clicks deep ❌ 
- Sporadically missing chunks ❌  
- No API for other software, laborious to download large chunks ❌
- Airlines are required to provide data, and regular reporting is incentivised. Regularly updated, with a time stamp ✅ 
- Small chunk `csv` file is nicely rectangular ✅

---
# Atlas of Living Australia

[CSIRO](https://www.ala.org.au/) (Assignment 1)

- Vast amount of data ✅ 
- Many different ways to access, including API ✅ 
- Hard to navigate the ways to access and what information is provided ❌ 
- Data is contributed sporadically, on a volunteer basis ❌ 
- Data identifier (DOI) is provided with each download ✅

---
# ABS Census Data

- Updated regularly, for each census ✅ 
- Data packs, easy to find ✅ 
- Download has regular file structure ✅
- Finding a variable of interest is hard, though ❌ 
- Spreadsheet with a gazillion tables, and variables are coded into column headers ❌

---

## Summary

.aim-box.tl.w-100[
Wrapping up this unit you should:

1. Understand the definitions, allowed usage, digital identification and licensing of open data

2. Know about common open data sources, how they are used, and how to search effectively for new sources

3. Explain the differences between data collection methods and the limitations for data analysis

4. Work with the range of different data formats of open data

5. Understand ethical constraints and privacy limits when working with open data

6. Recognise the components of effective curation needed for open data

]


---
background-image: \url(https://upload.wikimedia.org/wikipedia/commons/3/35/Grandpa_feeding_little_Beverley_Purd%27s_pet_kangaroo_%284461715862%29.jpg)
background-size: cover

.footnote[[Prof. Di Cook's grandpa feeding little Beverley Purd's pet kangaroo, 1930, State Library of Queensland](https://upload.wikimedia.org/wikipedia/commons/3/35/Grandpa_feeding_little_Beverley_Purd%27s_pet_kangaroo_%284461715862%29.jpg)]

---

## Slides developed by Prof. Di Cook. Maintained and updated by Dr. Kate Saunders

---

background-size: cover
class: title-slide
background-image: url("images/bg-12.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[

Lecturer: *Kate Saunders*

Department of Econometrics and Business Statistics

ETC5512.Clayton-x@monash.edu

Week 12

]