Lecturer: Kate Saunders
Department of Econometrics and Business Statistics
What we’ll cover:
Unit overview and details
An Introduction to Open Data
Societal importance of open data
What makes high quality open data
Learn about the different types of open data
Getting you set up in R (Drop In Session)
Kate Saunders
Lecturer at Monash University
👨🎓 PhD in Statistics
🌍 Home State is Queensland
👨💻 Research is in statistics of climate extremes
👩💻 Passionate about open data, data visualisation and data ethics
👨💻 Started R coding in 2012 (before tidyverse!)
❤️ Hobby is playing basketball.
Krisanat Anukarnsakulchularp
PhD Candidate at Monash University
👨🎓 Graduated with a Master of Business Analytics
🎓 Monash local since undergrad
🌎 Home Country is Thailand
👩💻 Just started his PhD, he’s researching network visualisation
👩💻 Recently published an R package called animbook!
❤️ Hobby is playing music.
Maliny Po
Graduated with a Master of Business Analytics
🎓 Undergraduate degree in International Trade and Business Logistics.
👨💻 Former Student in Wild Caught Data.
👩💻 Recently published a data viz package called Sugarglider, as part of the Google Summer of Code program.
❤️ Enjoys reading books, lego and a good cup of coffee.
Breakout Session: Your turn 📝
Quick tip
Finding connections with your classmates now can lead to great project collaborations later!
Unit Design
2 hours of seminars each week
1 hour tutorial each week
Expect ~12 hours of contact and study each week
We are learning to code. You will need to keep up with the material.
Coming to classes and consultations will help you!
There is also a drop in and practice session!
At the end of this unit you will be able to:
Understand the definitions, allowed usage, digital identification and licensing of open data
Know about common open data sources, how they are used and effectively search for new sources
Explain the differences between data collection methods and the limitations for data analysis
Work with the range of different data formats of open data, including APIs
Understand ethical constraints and privacy limits when working with open data
Recognise the components of effective curation needed for open data.
NEW: Drop in and Practice Session
We’ll use these to flesh out ideas from lectures
These will be recorded
The format will change week to week depending on unit needs.
Sometimes I’ll run through examples and live code.
Sometimes I’ll answer your questions.
Sometimes I’ll run through topics to supplement your learning.
Each week, email me the questions you’d like me to work through!
Assignments
4 assignments worth 25% each
100% of your total grade
Assignment 1 will cover content from weeks 1 to 3 (Due week 4)
Assignment 2 will cover content from weeks 4 to 6 (Due week 7)
Assignment 3 will cover content from weeks 7 to 9 (Due week 10)
Assignment 4 will cover content from weeks 1 to 12 (Due in exam block)
You’ll get at least 2 weeks to complete each assignment.
Warning
Failure to submit and notify the CE accordingly will result in a zero score for the assignment.
If you miss more than one assignment you will need to re-take the unit at a later date.
Special Consideration
Apply for special consideration centrally. (This includes short extensions of 48 hours)
If you need special consideration, apply ASAP and no later than 11.55 pm on the day your assessment is due.
You can miss an assignment through illness, or personal difficulty. In that instance, provided you’ve applied for special consideration, I can grant permission to complete a replacement assignment instead at a later date.
Two Locations
Unit Website: Everything is displayed on the same page and is easy to access
Moodle: Where you submit your assignments, find the discussion forum and see unit announcements.
Unit GitHub: Contains all the code, data etc. to produce the unit content and website.
1. Ask your peers using the discussion forum
Suitable for:
General questions about course materials, tutorials, R or assignment clarifications.
Any emails regarding general matters will be redirected to the discussion forums.
Sharing helps you learn from each other!
Also I don’t want to answer the same question twice (three times, four times etc.)
Do not post code from your assignments or any assignment hints to the discussion forum. I may deduct marks.
2. Attend Consultation
Suitable for:
When you need more specific one-on-one help
Working through problems with your tutor
Getting a head start on your assignments
Support debugging your code
Asking detailed questions about your assignments
Getting additional feedback on your assessments
Being really nerdy about the unit!
3. Unit Email
Suitable for:
For personal questions or issues email ETC5512.Clayton-x@monash.edu.
Response times are within 1 - 2 days but may vary during busy periods.
Also email if you notice issues with assessments or Moodle.
For remarking or to get feedback on your assessment from your tutor, you must email within 10 days of receiving your marks.
Do not direct emails to my staff account. I receive a high volume of emails and they risk going into a black hole and never being seen again!
Think about this unit in the way Dr Mine Çetinkaya-Rundel describes in this talk: “Let them eat cake (first)”
Imagine you’re new to baking, and you’re in a baking class.
There are two options: which gives you a better sense of the final product?
The Textbook Learner
The Example Learner
Breakout Discussion
Which approach feels more natural to you?
What are the advantages of each?
When might one approach work better than the other?
How can understanding your learning style help you succeed?
Cakes and Case Studies
The case studies you will see in this unit are the cakes.
By showing you what these case studies look like (cakes), we are helping you learn how to perform your own data analysis studies by example!
This may be different to how you’ve learnt in the past.
Please approach this unit philosophy with openness.
And for textbook learners: Check out the textbook R for Data Science
Open data is …
a raw material for the digital age but,
unlike coal, timber or diamonds,
it can be used by anyone and everyone at the same time.
Open Data Institute - Dave Tarrant - EDP Module 1 from Open Data Institute on Vimeo.
Important
Open data is measured by what it can be used for, not by how it is made available.
Open Data Considerations
Use limitations: No limitations that prevent particular uses.
Use limitations: Anyone free to use, modify, combine and share, even commercially.
Data cost: Free to use does not mean that it must be free to access.
Data cost: Cost to creating, maintaining and publishing usable data.
Data cost: Live data and big data can incur ongoing costs.
Reuse: Free to use, reuse and redistribute it - even commercially.
Important
Open data can be freely used, modified, and shared by anyone for any purpose
Two types of data openness:
The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions.
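To make "technically open" concrete, here is a minimal sketch in base R. It writes one of R's built-in data sets (`cars`) to a CSV file and reads it back. The temporary file is only a stand-in for a file you might download from an open data portal, and is an assumption made for illustration.

```r
# Write a built-in data set to CSV: a plain-text, non-proprietary format.
# tempfile() stands in for a file downloaded from an open data portal.
csv_path <- tempfile(fileext = ".csv")
write.csv(cars, csv_path, row.names = FALSE)

# Freely available software can read it back; here we use base R.
cars_from_csv <- read.csv(csv_path)

# Because the file is plain text, we can also inspect the raw lines directly.
readLines(csv_path, n = 3)
head(cars_from_csv)
```

The same steps would not work on a scanned image of a table, which is why the file format matters as much as the licence.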
Pop Quiz! ❤️
Try the quizzes here
Important
Help make governments more transparent.
Building new business opportunities
Protecting the planet
Globally
Australian government examples:
Important
Licences
Without a licence, users may find themselves in a legal grey area. Data may be ‘publicly available’, but users may not have permission to access, use and share it under general copyright or database laws.
An open data licence is an explicit permission to use the data for both commercial and non-commercial purposes.
Open data publishers should provide easy access to the licence for all datasets that are available to access, use and share.
Organisations and governments use open data licences to clearly explain the conditions under which their data may be used.
TLDR
Many licences include both a summary version, intended to convey the most important concepts to all users, and a detailed version that provides the complete legal foundation.
Examples include
Licence type
Standard licences can offer several advantages over bespoke licences.
Standard licences have greater recognition among users, increased interoperability, and greater ease of compliance.
Pop Quiz! ❤️
Try the quizzes here
Information components
Standards frameworks
Example datasets
Key metadata elements
Warning
‘Machine readable’ is not synonymous with ‘digitally accessible’
Historical efforts have focused on:
pushing static information about government programs and services to the web,
where the intended user is a human who can read, print, and take actions based on reading.
It’s a narrow vision of the expected users and uses of the information.
Machine Readable
5 ⭐ ratings:
The website 5 ⭐ Open Data provides a rating system for deploying open data.
⭐ An open license.
Make your stuff available on the Web (whatever format) under an open license
⭐, ⭐ Re-usable format.
Make it available as structured data (e.g., a proprietary format like Excel instead of an image scan of a table).
⭐, ⭐, ⭐ Open format.
Make it available in a non-proprietary open format (e.g., CSV instead of Excel)
⭐, ⭐, ⭐, ⭐ Use Uniform Resource Identifiers (URIs).
URIs help you reference your data, like a unique address, and give context to the values.
⭐, ⭐, ⭐, ⭐, ⭐ Linked data
Your data doesn’t exist in isolation. Your data links to other relevant data sets.
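The star ratings are cumulative: each star assumes the ones before it. A minimal sketch of that logic in base R is below. The function `star_rating()` and its argument names are hypothetical, written for this example only, and are not part of any official rating tool.

```r
# Hypothetical helper: count how many stars a data set earns.
# Each argument is TRUE or FALSE for one criterion, in star order.
star_rating <- function(open_licence, structured, open_format,
                        uses_uris, linked) {
  criteria <- c(open_licence, structured, open_format, uses_uris, linked)
  # Stars are cumulative, so stop counting at the first criterion not met.
  first_fail <- match(FALSE, criteria)
  if (is.na(first_fail)) length(criteria) else first_fail - 1
}

# An Excel file published under an open licence: structured, but proprietary.
star_rating(open_licence = TRUE, structured = TRUE, open_format = FALSE,
            uses_uris = FALSE, linked = FALSE)

# The same data as CSV, still without URIs or links to other data.
star_rating(open_licence = TRUE, structured = TRUE, open_format = TRUE,
            uses_uris = FALSE, linked = FALSE)
```

Note that a data set with no open licence earns zero stars under this sketch, no matter how good its format is.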
Learn more at fair.org
FAIR Principles
Findable: Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services.
Accessible: Once the user finds the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.
Interoperable: The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
Reusable: The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
Publishing Your Data
Research data is increasingly seen as part of the corpus of scholarly publications.
Publishers, funders and governments support researchers to publish their data outputs by various policies, guidelines and mandates.
A Digital Object Identifier (DOI) provides a persistent identifier and can be used for data. Two services in Australia:
Many open data sets provide information on how to cite them, when used in other forms of publication.
Legal requirements:
Practical requirements:
If you provide a link to the data on a website:
Update the data regularly if it changes
Commit to continue to make the data available
Technical requirements:
Watch out for:
Open Data Institute - Dave Tarrant - EDP Module 12 from Open Data Institute on Vimeo.
Let’s look at https://www.realestate.com.au/buy
Pop Quiz! ❤️
Try the quizzes here
Note
Exercise
Look at these open data examples:
Consider the interface
Look for licensing
Explanations of what’s in the data
Metadata
This is Prof Di Cook’s taxonomy!
Long shelf life, highly processed
Orphans
Synthetic
Wild
Fresh, interesting, exciting
But also challenging!
Real world data sets, with real world messiness
eg US Bureau of Transportation Statistics air traffic database
Fresh and local
Best kind of wild data
Collected locally, and about our own lives
Our working definition
Wild caught data is:
data that can be freely used, modified, and shared by anyone for any purpose, AND
The data source is traceable, the data collection is transparent, and the data is updated as new measurements arrive, AND
In case of data processing, the process is clearly described and reproducible.
Are they Wild?
✅ Freely available to be used and modified
✅ Can be shared
✅ Data provenance is clear
✅ How the data was collected is transparent
✅ Data is updated as new measurements become available
✅ Any processing of this data is clear
Open data: definitions, sources and examples
Introduced you to the Open Data Fundamentals
eg. power for societal good, where to access, limitations, licences
Data Quality Elements
eg. metadata, machine readable, FAIR, five star ratings
Wild Caught Data Meaning
and other flavours of open data
Teaching philosophy of cake first!
Getting set up
By the end of this unit you’ll be performing your own analytics case studies
Before you can do that we need to get you set up with the software you’ll use in this unit
We’ll now go through the steps to install R and RStudio
Think about R as the engine and RStudio as the dashboard and controls.
We need both to drive a car.
We’ll open RStudio to create data visualisations using R.
Step 1: Install R
Step 2: Install RStudio
Why do we need a Programming Language?
It allows us to have reproducible steps, which can be applied for many different data sets
The analysis is not just point and click, so you can work as a team on the same code
It also means we can more easily perform our own case studies in analytics
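As a small illustration of reproducible steps, the sketch below defines one step and applies it to several data sets. The function name `column_means()` is hypothetical, and the data sets (`cars`, `women`, `iris`) are ones built into R, chosen only for convenience.

```r
# One reusable step: the mean of every numeric column in a data frame.
column_means <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]  # keep only numeric columns
  sapply(numeric_cols, mean)
}

# Apply exactly the same step to several different data sets.
datasets <- list(cars = cars, women = women, iris = iris)
results <- lapply(datasets, column_means)
results
```

Because the step is written down as code, a teammate can rerun it, check it, or apply it to a new data set without repeating any clicks.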
Why do we use R?
It has been around for a while.
It is regularly maintained and is open source.
It is beginner friendly
Even if you use other languages, you might still use R for your data visualisations
Learning a new language is hard!
You need to think about grammar and structure, and how to communicate well in it!
You will make mistakes, lots of them.
Below you see code that plots points showing the GDP per capita against life expectancy. The points are coloured by country and the size of the points shows the population.
It might look impossible now, but by the end of this semester you will be able to write this yourself!
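The code for that plot is not reproduced in this text. A minimal sketch of what such a plot could look like is below. It assumes the data come from the `gapminder` package and that the plot is drawn with `ggplot2`; neither package is named above, so treat both (and the choice of axes) as assumptions rather than the original code.

```r
library(ggplot2)
library(gapminder)  # assumed data source: provides the gapminder data frame

# GDP per capita against life expectancy, one point per country and year.
# Colour shows the country and point size shows the population.
p <- ggplot(gapminder,
            aes(x = gdpPercap, y = lifeExp, colour = country, size = pop)) +
  geom_point(alpha = 0.5, show.legend = FALSE) +  # too many countries for a legend
  labs(x = "GDP per capita", y = "Life expectancy")

p
```

Both packages need to be installed first, for example with `install.packages(c("ggplot2", "gapminder"))`.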
ETC5512