class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-01.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---
class: title-slide
count: false
background-image: url("images/bg-12.png")

# .monash-blue[ETC5512: Wild Caught Data]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>

<h2 style="font-weight:900!important;">Open data: definitions, sources and examples</h2>

.bottom_abs.width100[

Lecturer: *Kate Saunders*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i> ETC5512.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 1

<br>

]

---
class: informative middle

# Open data is...

.footnote[https://www.europeandataportal.eu/elearning/en/module1/#/id/co-01]

> a raw material for the digital age but,

--

> unlike coal, timber or diamonds,

--

> it can be used by anyone and everyone at the same time.

--

---
class: middle center

<iframe src="https://player.vimeo.com/video/129196637?h=075ff25ed2&title=0&byline=0&portrait=0" width="640" height="360" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen></iframe>

<p><a href="https://vimeo.com/129196637">Open Data Institute - Dave Tarrant - EDP Module 1</a> from <a href="https://vimeo.com/theodiuk">Open Data Institute</a> on <a href="https://vimeo.com">Vimeo</a>.</p>

<!-- OLD LINK: https://player.vimeo.com/video/266308637 -->

---

# What makes data open?

<!-- Open data is measured by what it can be used for, not by how it is made available. -->

* Limitations
  * No limitations that prevent particular uses.
  * Anyone is free to use, modify, combine and share, even commercially.
* Cost
  * Free to use does not mean that it must be free to access.
  * Cost of creating, maintaining and publishing usable data.
  * Live data and big data can incur ongoing costs.
* Reuse
  * Free to use, reuse and redistribute, even commercially.

---

# Definition of open data

Open data can be freely used, modified, and shared by anyone for any purpose

<br><br>

There are two dimensions of data openness:

* The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
* The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions.

<br><br>

http://opendefinition.org/

<br><br><br>

Try the quizzes [here](https://www.europeandataportal.eu/elearning/en/module1/#/id/co-01)

---

# Why do we need open data?

* Helping make governments more transparent
  * Open data allowed citizens in Canada to save the government billions in fraudulent charitable donations
* Building new business opportunities
  * Transport for London has released open data that developers have used to build over 800 transport apps.
* Protecting the planet
  * Open data about weather can provide an early warning system for environmental disasters
  * Open data is also helping consumers to understand their personal impacts on the environment

<br><br><br><br>

https://opendatahandbook.org/guide/en/why-open-data/

---
class: middle

# Open data from large organisations

.flex[

.border-box[

* http://dataportals.org/search
* http://data.un.org/
* https://datacatalog.worldbank.org/
* https://data.gov/

]

.border-box[

# Open data Australia:

* https://opendataimpactmap.org/eap

# By government

* http://www.data.gov.au/
* https://www.data.vic.gov.au/
* https://data.melbourne.vic.gov.au/

]

]

<!-- Metadata with good example -> develop case in lecture with australian data (for instance https://data.gov.au/data/dataset/australia-s-merchandise-trade-by-country-and-sitc-to-fy2017) -->
<!-- and go over metadata and licenses. -->

---

# Why license open data?

<br><br>

* Tells anyone that they can access, use and share data.

--

* Without a licence, users may find themselves in a legal grey area. Data may be 'publicly available', but users may not have permission to access, use and share it under general copyright or database laws.

--

* An open data licence is an explicit permission to use the data for both commercial and non-commercial purposes.

--

* Open data publishers should provide easy access to the licence for all datasets that are available to access, use and share.

--

<!-- Organizations and governments use Open Data licenses to clearly explain the conditions under which their data may be used. -->

---

# Open data licenses

<!-- Many licenses include both a summary version, intended to convey the most important concepts to all users, and a detailed version that provides the complete legal foundation.
Examples include: -->

* Standard re-usable licenses: consistent and broadly recognized terms of use
  * Creative Commons, particularly CC BY and CC0 https://creativecommons.org/
  * Open Database License https://opendatacommons.org/licenses/odbl/
* Bespoke licenses: developed by governments and international organizations
  * UK Open Government License http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
  * The World Bank Terms of Use https://data.worldbank.org/summary-terms-of-use

<!-- Standard licenses can offer several advantages over bespoke licenses, including greater recognition among users, increased interoperability, and greater ease of compliance. -->

<br><br><br><br><br>

Try the quizzes [here](https://www.europeandataportal.eu/elearning/en/module4/#/id/co-01)

---

# Metadata: data about data

.flex[

.border-box[

Information necessary to use the data appropriately:

* Source
* Structure
* Underlying methodology
* Topical, geographic and/or temporal coverage
* License
* When it was last updated
* How it is maintained

]

.border-box.w-70[

* Dublin Core Metadata Initiative (DCMI) provides a framework and core vocabulary of metadata terms.
  * https://www.dublincore.org/
* Governments develop metadata models to provide further uniformity to government-wide Open Data initiatives.
  * https://project-open-data.cio.gov/v1.1/schema/
* Australian government metadata standards
  * [National Archives of Australia](https://www.naa.gov.au/information-management/information-management-standards/australian-government-recordkeeping-metadata-standard), [Australian Institute of Health and Welfare](https://www.aihw.gov.au/about-our-data/metadata-standards)

]

]

---

# Examples from Canadian government

.flex[

- [Resettled refugees](https://open.canada.ca/data/en/dataset/4a1b260a-7ac4-4985-80a0-603bfe4aec11)
- [Canada emergency wage subsidy (CEWS)](https://open.canada.ca/data/en/dataset/f713389f-ab1c-4be4-bade-05f71ed110fe)
- Title: what the data contains and where it comes from.
- Description: details to quickly understand whether the data is relevant to you
- publisher: where the dataset originated, who is responsible for maintaining it, credibility
- license:
- contact information: for questions or incomplete metadata
- frequency: interval at which the data is updated. check for updates? data out of date?
- date modified: relevant for your work?
- spatial coverage: geographic area the data is relevant to
- temporal coverage:
- open data formats

]

---

# Machine Readable

.footnote[[A Primer on Machine Readability](https://www.data.gov/developers/blog/primer-machine-readability-online-documents-and-data)]

<center>

.info-box['machine readable' is not synonymous with 'digitally accessible']

</center>

* Historical efforts have focused on
  - pushing static information about government programs and services to the web,
  - where the intended user is a human who can read, print, and take actions based on reading.
  - It's a narrow vision of the expected users and uses of the information.
* Machine readable formats expand the field of vision to new users and new uses, and require technologies like XML and JSON
  - 😿 PDF is not suitably machine readable
  - 😀 CSV (or XLSX, XLS) is common, and universally accessible, but should be structured for analysis not for reading
  - 😸 XML and JSON are verbose, can contain metadata, but need special readers
  - 🤩 API provides an interface that other software can utilise to automatically extract and process the data

---

# Five star open data scheme

The web site 5 ⭐ Open Data at https://5stardata.info/en/ describes a rating system for deploying open data.
* ⭐ - .monash-blue2[An open license]: make your stuff available on the Web (whatever format) under an open license
* ⭐⭐ - .monash-blue2[Re-usable format]: make it available as structured data (e.g., proprietary Excel instead of an image scan of a table)
* ⭐⭐⭐ - .monash-blue2[Open format]: make it available in a non-proprietary open format (e.g., CSV instead of Excel)
* ⭐⭐⭐⭐ - .monash-blue2[Use Uniform Resource Identifiers (URIs)] to denote things, so that others can link to them, and also give context to the values
* ⭐⭐⭐⭐⭐ - .monash-blue2[Link data] to definitions and context for various aspects

---

# FAIR principles for scientific data

.footnote[Learn more at https://www.go-fair.org/fair-principles/]

.flex[

.border-box[

## Findable

Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services.

]

.border-box[

## Accessible

Once the user finds the required data, she/he needs to know how they can be accessed, possibly including authentication and authorisation.

]

.border-box[

## Interoperable

The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.

]

.border-box[

## Reusable

The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.

]

]

---

# Publishing data

Research data is increasingly seen as part of the corpus of scholarly publications. Publishers, funders and governments support researchers to publish their data outputs through various policies, guidelines and mandates.

- Obtaining a Digital Object Identifier (DOI) provides a persistent identifier, and can be used for data. Two services in Australia:
  - [Australian Research Data Commons (ARDC)](https://ardc.edu.au/services/identifier/) can generate a DOI for you.
  - [Australian National Data Service](https://www.ands.org.au/online-services/doi-service)
- Many open data sets provide information on how to cite them, when used in other forms of publication.

<br><br>

See more guidelines at [ANDS](https://www.ands.org.au/working-with-data/publishing-and-reusing-data/publishing) and [ARDC](https://ardc.edu.au/services/research-data-australia/).

---

# Open data quality

.flex[

.border-box[**Legal requirements:**

* Protect sensitive information like personal data
* Preserve the rights of data owners
* Promote correct use of the data

]

.border-box[**Practical requirements:**

* Link to the data from their website
* Update the data regularly if it changes
* Commit to continue to make the data available

]

.border-box[**Technical requirements:**

* The format in which the data is published
* The structure of the data
* The channels through which the data is available

]

]

---

# Common pitfalls with open data

* Mixed date formats (American/European)
* Multiple representations: differences in abbreviations, capitalisation, spacing
* Duplicate records
* Redundant data
* Mixed numerical scales
* Spelling errors
* Inconsistent naming
* Missing values

---

# What is hidden data?

<br>

<center>

<iframe src="https://player.vimeo.com/video/129197208?h=8f24e0dabc&title=0&byline=0&portrait=0" width="640" height="360" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen></iframe>

<p><a href="https://vimeo.com/129197208">Open Data Institute - Dave Tarrant - EDP Module 12</a> from <a href="https://vimeo.com/theodiuk">Open Data Institute</a> on <a href="https://vimeo.com">Vimeo</a>.</p>

</center>

<!-- OLD LINK: https://player.vimeo.com/video/266650606 -->

Let's look at https://www.realestate.com.au/buy.
And do the quizzes [here](https://www.europeandataportal.eu/elearning/en/module12/#/id/co-01)

---
class: middle

# Some of my favourite examples of open data

- Airline traffic in the USA https://www.bts.gov <!-- info about data policy https://www.transportation.gov/mission/digital-government-strategy-4 -->
- Australian Bureau of Statistics http://stat.data.abs.gov.au
- Australian Electoral Commission https://www.aec.gov.au
- National Longitudinal Survey of Youth (NLSY) https://www.nlsinfo.org/investigator/pages/search?s=NLSY79
- Atlas of Living Australia https://www.ala.org.au
- Australian bushfires from satellite hotspot remote sensing https://www.eorc.jaxa.jp/ptree/registration_top.html (also see resulting analysis at https://ebsmonash.shinyapps.io/VICfire/)
- Johns Hopkins Coronavirus tracking https://coronavirus.jhu.edu/data
- OECD Programme for International Student Assessment http://www.oecd.org/pisa/data/
- Melbourne pedestrian counting system http://www.pedestrian.melbourne.vic.gov.au/

---
class: transition

## We'll spend some time here taking a look at these open data examples

Consider the interface

Look for licensing

Explanations of what's in the data

Metadata

---
class: transition middle center

# Flavours of open data

How to tell if the open data is not so good to consume?
.footnote[This is Prof Di Cook's taxonomy]

---
class: middle
background-image: url(https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Food_on_shelf.jpg/1600px-Food_on_shelf.jpg)
background-size: cover

.fill-box[

# Long shelf life, highly processed

- Convenient, but contains unhealthy ingredients, and is a bad habit
- eg iris, mtcars, titanic, handwritten digits
- Found at eg [UCI Machine learning archive](https://archive.ics.uci.edu/ml/datasets.php)

]

---
class: middle
background-image: url(https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Abandoned_Car_%282654024518%29.jpg/1600px-Abandoned_Car_%282654024518%29.jpg)
background-size: cover

.fill-box[

# Orphans

- File dumped on an archive
- Stale, could date your results
- Found in places like [https://data.gov.au](https://data.gov.au)

]

---
class: middle
background-image: url(https://upload.wikimedia.org/wikipedia/commons/thumb/0/03/Artificial_Putting_Green.JPG/1134px-Artificial_Putting_Green.JPG)
background-size: cover

.fill-box[

# Synthetic

- Used primarily these days for privacy protection
- Correct up to the model used to simulate the data
  - misses interesting structure in data not captured by model
- Very pretty, very consistent, but it can burn you
- eg [OECD Programme for International Student Assessment](https://www.oecd.org/pisa/data/) A generalised linear model is fitted to the scores, with predictors such as school, gender, ... Model is used to simulate a score for each student.
- Also be aware of fraud, eg [Article in the Lancet (2020)](https://bit.ly/3IzOx4u)

]

---
class: middle
background-image: url(https://m.media-amazon.com/images/M/MV5BZmVkNTAwZjctZDI4Yy00YWMyLWEwZmUtNGFlNDY2NGJiNDAyXkEyXkFqcGdeQXVyMTc0NzI3MDQ@._V1_.jpg)
background-size: cover

.fill-box[

# Wild

- Fresh, interesting, exciting, challenging
- eg [US Bureau of Transportation Statistics air traffic database](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236)

]

.footnote[Image: Reese Witherspoon, Wild (2014) IMDb]

---
class: middle
background-image: url(https://upload.wikimedia.org/wikipedia/commons/thumb/a/a1/South_Melbourne_market_outside_1a.jpg/1056px-South_Melbourne_market_outside_1a.jpg)
background-size: 70%

.fill-box[

# Fresh and local

- Wild data, collected locally, and impacting our own lives
- eg [Melbourne pedestrian counts](https://cran.r-project.org/web/packages/rwalkr/index.html)

]

---
class: refresher middle

Our working definition of wild-caught data will be:

<br><br>

.info-box[

# Wild-caught data

The data can be freely used, modified, and shared by anyone for any purpose

<br><br>

AND

<br><br>

The data source is traceable, the data collection is transparent, and the data is updated as new measurements arrive. If the data is processed, the process is clearly described and reproducible.

]

---

## What about your favourite datasets - are they Wild?
<br><br>

✅ Freely available to be used and modified

<br>

✅ Can be shared

<br>

✅ Data provenance is clear

<br>

✅ How the data was collected is transparent

<br>

✅ Data is updated as new measurements become available

<br>

✅ Any processing of this data is clear

---
class: transition

## Slides originally developed by Professor Di Cook

---
background-size: cover
class: title-slide
background-image: url("images/bg-12.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[

Lecturer: *Kate Saunders*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i> ETC5512.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 1

<br>

]