Lecturer: Kate Saunders
Department of Econometrics and Business Statistics
What we’ll cover:
Unit overview and details
An Introduction to Open Data
Societal importance of open data
What makes high quality open data
Learn about the different types of open data
Getting you set up in R (Drop In Session)
Kate Saunders
Lecturer at Monash University
🎓 PhD in Statistics
🌍 Home State is Queensland
👩💻 Research is in statistics of climate extremes
👨💻 Passionate about open data, data visualisation and data ethics
👩💻 Started R coding in 2012 (before tidyverse!)
❤️ Hobby is playing basketball.
Krisanat Anukarnsakulchularp
PhD Student at Monash University
👨🎓 Graduated with a Master of Business Analytics
🎓 Monash local since undergrad
🌏 Home Country is Thailand
👩💻 PhD research is on network visualisation
👩💻 Published an R package called animbook!
❤️ Hobby is playing music.
Maliny Po
Graduated with a Master of Business Analytics
👨🎓 Undergraduate degree in International Trade and Business Logistics.
👩💻 Former Student in Wild Caught Data.
👩💻 Published a data viz package called Sugarglider, as part of the Google Summer of Code program.
❤️ Enjoys reading books, lego and a good cup of coffee.
Breakout Session: Your turn
Quick tip
Finding connections with your classmates now can lead to great project collaborations later!
Unit Design
2 hours of seminars each week
1 hour tutorial each week
1 hour workshop each week
Caution
Expect ~12 hours of contact and study each week
We are learning to code. You will need to keep up with the material.
Coming to classes and consultations will help you!
At the end of this unit you will be able to:
Understand the definitions, allowed usage, digital identification and licensing of open data
Know about common open data sources, how they are used and how to search effectively for new sources
Explain the differences between data collection methods and the limitations for data analysis
Work with the range of different data formats of open data, including APIs
Understand ethical constraints and privacy limits when working with open data
Recognise the components of effective curation needed for open data.
Online workshops
We’ll use these to flesh out ideas from seminars
These will be recorded (tutorials will not be)
The workshop format will change week to week depending on unit needs.
Sometimes I’ll run through examples and live code.
Sometimes I’ll answer your questions.
Sometimes I’ll run through topics to supplement your learning.
Assignments
4 assignments worth 25% each (100% of your total grade)
Assignment 1 will cover content from weeks 1 to 3 (Due week 4)
Assignment 2 will cover content from weeks 4 to 6 (Due week 7)
Assignment 3 will cover content from weeks 7 to 9 (Due week 10)
Assignment 4 will cover content from weeks 1 to 12 (Due in exam block)
You’ll get at least 2 weeks to complete each assignment.
Warning
If you fail to submit and do not notify the Chief Examiner (CE), you will receive a zero score for the assignment.
If you miss two assignments you will need to re-take the unit at a later date.
Special Consideration
Apply for special consideration centrally. This includes short extensions of 48 hours.
If you need special consideration, apply ASAP and no later than 11.55 pm on the day your assessment is due.
If you miss an assignment through illness or personal difficulty, provided you’ve applied for special consideration there will be options for scaling or alternative assessment.
Locations
Unit Website: Everything is displayed on the same page and is easy to access
Moodle: Where you submit your assignments, find the discussion forum and see unit announcements.
Unit GitHub: Contains all the code, data etc. to produce the unit content and website.
Struggle for a while!
Coding is a cycle:
Progress comes from iteration!


You can use Generative AI in this unit.
In fact I encourage it!
It’s a great tool for those learning to code

But
You must never copy and paste output from AI that you don’t understand or cannot explain
You must always provide appropriate acknowledgement of your AI use
You need to be careful not to short cut your learning
What does academic integrity mean to you?
Still not sure - Monash Resources
What is academic integrity?
What does maintaining academic integrity mean?
What happens if I breach academic integrity?
Ask your peers
Suitable for:
There is a discussion forum for general questions and clarifications
Emails on general matters will be redirected to the discussion forums
Sharing helps you learn from each other!
Prevents me answering the same question twice (three times, four times etc.)
Be careful about posting code from your assignments or any assignment hints to the discussion forum. I may deduct marks.

Attend Consultation
Suitable for:
Get one-on-one help
Working through problems with your tutor
Ask questions about your assignments
Get help debugging your code
Get feedback on your assessments
Being really nerdy about the unit!

Unit Email
Suitable for:
For personal questions or issues email ETC5512.Clayton-x@monash.edu.
Response times are within 1-2 days but may vary during busy periods.
Also email if you notice issues with assessments or Moodle.
For a re-mark or feedback on your assessment, ask your tutor and email within 10 days of receiving your marks.
Do not direct emails to my staff account. I receive a high volume of emails, and they risk going into a black hole and never being seen again!
Think about this unit in the way Dr Mine Çetinkaya-Rundel describes in this talk: “Let them eat cake (first)”
Imagine you’re new to baking, and you’re in a baking class.
There are two options: which gives you a better sense of the final product?


The Textbook Learner

The Example Learner

Breakout Discussion
Which learning approach feels more natural to you?
Discuss the advantages of each
Cakes and Case Studies
The case studies you will see in this unit are the cakes.
By showing you what these case studies look like (cakes), we are helping you learn how to perform your own data analysis studies by example!
This may be different to how you’ve learnt in the past.
Please approach this unit philosophy with openness.
And for textbook learners: Check out the textbook R for Data Science
Open data is …
a raw material for the digital age but,
unlike coal, timber or diamonds,
it can be used by anyone and everyone at the same time.
Open Data Institute - Dave Tarrant - EDP Module 1 from Open Data Institute on Vimeo.
Open data is measured by what it can be used for, not by how it is made available.
Open Data Considerations
No limitations that prevent particular uses.
Anyone free to use, modify, combine and share, even commercially.
Free to use does not mean that it must be free to access.
There is a cost to creating, maintaining and publishing usable data.
Live data, big data and data from generative AI can incur ongoing costs.
Free to use, reuse and redistribute it - even commercially.
Open data can be freely used, modified, and shared by anyone for any purpose
Two types of data openness:
The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions.
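To make "technically open" concrete, here is a minimal sketch in Python that uses only the standard library. The rainfall table is made up for illustration and does not come from a real source. The point is only that a CSV file can be read with common, freely available tools and no special software.

```python
import csv
import io

# Hypothetical data, made up for illustration: a tiny rainfall table
# published as CSV (machine readable and non-proprietary).
csv_text = """station,date,rainfall_mm
Brisbane,2024-01-01,12.4
Brisbane,2024-01-02,0.0
Melbourne,2024-01-01,3.2
"""

# Any common, freely available tool can parse it; here, the standard library.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Values arrive as text, so convert them before doing arithmetic.
total_rainfall = sum(float(row["rainfall_mm"]) for row in rows)

print(len(rows), "rows read")
print("Columns:", list(rows[0].keys()))
print("Total rainfall (mm):", round(total_rainfall, 1))
```

The same numbers locked inside a scanned image, or behind a password, would not pass this kind of check, even if the licence allowed reuse.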
Pop Quiz! ❤️
Try the quizzes here
Help make governments more transparent!
Building new business opportunities
Protecting people and our environment
Globally
Australian government examples:
And so many more places …
How do I tell if data I find is open?
Licences!
Licences tell people how they can access, use and share data.
Licences
Without a licence, users may find themselves in a legal grey area.
Data may be ‘publicly available’, but users may not have permission to access, use and share it under general copyright or database laws.
An open data licence is an explicit permission to use the data for both commercial and non-commercial purposes.
Open data publishers should provide easy access to the licence for all datasets that are available to access, use and share.
Organisations and governments use open data licences to clearly explain the conditions under which their data may be used.
Examples include:
Standard re-usable licence: consistent and broadly recognised terms of use
Creative Commons, particularly CC BY and CC0 https://creativecommons.org/
Open Database License https://opendatacommons.org/licenses/odbl/
Bespoke licences: e.g. for governments, international organisations
TLDR
Many licences have a summary version that conveys the most important information to users and a detailed version that provides the complete legal foundation.
Licence type
Standard licences can offer several advantages over bespoke licences.
Standard licences have greater recognition among users, increased interoperability, and greater ease of compliance.
Pop Quiz! ❤️
Try the quizzes here
Information components
Standards frameworks
Example datasets
Key metadata elements
‘machine readable’ is not the same as ‘digitally accessible’
Historical efforts have focused on:
pushing static information about government programs and services to the web,
where the intended user is a human who can read, print, and take actions based on reading.
It’s a narrow vision of the expected users and uses of the data.
Machine Readable
5 ⭐ ratings:
The 5 ⭐ Open Data website provides a rating scheme for deploying open data.
⭐ An open licence.
Make your stuff available on the Web (whatever format) under an open licence.
⭐, ⭐ Re-usable format.
Make it available as structured data (e.g., an Excel file, even though the format is proprietary, instead of an image scan of a table).
⭐, ⭐, ⭐ Open format.
Make it available in a non-proprietary open format (e.g., CSV instead of Excel).
⭐, ⭐, ⭐, ⭐ Uniform Resource Identifiers (URIs).
URIs let you reference your data, like a unique address, and give context to the values.
⭐, ⭐, ⭐, ⭐, ⭐ Linked data.
Your data doesn’t exist in isolation. Link your data to other relevant data sets.
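One way to read the star ratings is as a ladder, where each star builds on the ones before it. The sketch below, in Python, assumes that cumulative reading. The function name `star_rating` and the three example datasets are hypothetical and only for illustration.

```python
def star_rating(open_licence, structured, open_format, uses_uris, linked):
    """Count stars in order, stopping at the first criterion that is not met.

    Assumption: the stars are cumulative, so a dataset cannot earn a
    later star without meeting every earlier criterion.
    """
    criteria = [open_licence, structured, open_format, uses_uris, linked]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# Hypothetical examples, all published under an open licence.
scanned_table = star_rating(True, False, False, False, False)
excel_file = star_rating(True, True, False, False, False)
csv_file = star_rating(True, True, True, False, False)

print("Scanned table:", scanned_table)
print("Excel file:", excel_file)
print("CSV file:", csv_file)
```

Under this reading, data without an open licence earns no stars, however well it is formatted.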
Learn more at go-fair.org
FAIR Principles
Findable: Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services.
Accessible: Once users find the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.
Interoperable: The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
Reusable: The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
Publishing Your Data
Research data is increasingly seen as part of the corpus of scholarly publications.
Publishers, funders and governments support researchers to publish their data outputs by various policies, guidelines and mandates.
A Digital Object Identifier (DOI) provides a persistent identifier and can be assigned to data. Two services in Australia:
Many open data sets provide information on how to cite them, when used in other forms of publication.
Legal requirements:
Practical requirements:
If you provide a link to the data on a website:
Technical requirements:
Watch out for:
Open Data Institute - Dave Tarrant - EDP Module 12 from Open Data Institute on Vimeo.
Let’s look at https://www.realestate.com.au/buy
Pop Quiz! ❤️
Try the quizzes here
Exercise
Look at these open data examples:
Consider the interface
Look for licensing
Find explanations of what’s in the data
Review the metadata
This is Prof Di Cook’s taxonomy!
Long shelf life, highly processed
Orphans
Synthetic
Wild
Fresh, interesting, exciting
But also challenging!
Real world data sets, with real world messiness
e.g. US Bureau of Transportation Statistics air traffic database
Fresh and local
Best kind of wild data
Collected locally, and about our own lives
Our working definition of wild caught data is:
Data that can be freely used, modified, and shared by anyone for any purpose, AND
The data source is traceable, the data collection is transparent, and the data is updated as new measurements arrive, AND
In case of data processing, the process is clearly described and reproducible.
Are they Wild?
✅ Freely available to be used and modified
✅ Can be shared
✅ Data provenance is clear
✅ How the data was collected is transparent
✅ Data is updated as new measurements become available
✅ Any processing of this data is clear
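The checklist above can be treated as an all-or-nothing test. Here is a minimal sketch in Python, assuming each criterion is recorded as a simple yes/no answer. The criterion names, the `is_wild` function and the two example datasets are hypothetical and only for illustration.

```python
# Hypothetical names for the six checklist items above.
WILD_CRITERIA = [
    "freely_usable_and_modifiable",
    "shareable",
    "provenance_clear",
    "collection_transparent",
    "updated_with_new_measurements",
    "processing_described",
]


def is_wild(dataset):
    """Return (verdict, missing): wild only if every criterion is met."""
    missing = [c for c in WILD_CRITERIA if not dataset.get(c, False)]
    return len(missing) == 0, missing


# Hypothetical datasets: one meets every criterion, one is never updated.
weather_feed = {c: True for c in WILD_CRITERIA}
static_snapshot = dict(weather_feed, updated_with_new_measurements=False)

print(is_wild(weather_feed))
print(is_wild(static_snapshot))
```

Returning the list of unmet criteria, not just a yes or no, shows what would need to change for a dataset to count as wild.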
Open data: definitions, sources and examples
Introduced you to the Open Data Fundamentals
e.g. power for societal good, where to access, limitations, licences
Data Quality Elements
e.g. metadata, machine readable, FAIR, five star ratings
Wild Caught Data Meaning
and other flavours of open data
Teaching philosophy of cake first!

ETC5512