Lecturer: Kate Saunders
Department of Econometrics and Business Statistics
What we’ll cover:
Unit overview and details
An Introduction to Open Data
Societal importance of open data
What makes high quality open data
Learn about the different types of open data
Getting you set up in R (Drop In Session)
Acknowledgement of Country
I wish to acknowledge the people of the Kulin Nations, on whose land we are gathered today. I pay my respects to their Elders, past and present.
In this unit, we will learn about how data can be transformed into information, and then into knowledge. We will also learn about the many different types of data that can be used to understand our world.
Using data to understand our world isn’t new.
First Nations peoples have been using data to understand country for generations. Through observing the same environment for thousands of years, they identified cause-and-effect relationships, such as subtle changes in ecosystems, and they have developed a deep understanding of native flora and fauna.
See for example: Indigenous Seasonal Calendar
Rome wasn’t built in a day
This unit has been running since 2020.
There have been many members of staff who have contributed to the development of this unit, including:
Kate Saunders
Lecturer at Monash University
🎓 PhD in Statistics
🌎 Home State is Queensland
👨💻 Research is in statistics of climate extremes
👩💻 Passionate about open data, data visualisation and data ethics
👩💻 Started R coding in 2012 (before tidyverse!)
❤️ Hobby is playing basketball.
Krisanat Anukarnsakulchularp
PhD Student at Monash University
🎓 Graduated in Masters in Business Analytics
🎓 Monash local since undergrad
🌎 Home Country is Thailand
👩💻 PhD research is on network visualisation
👨💻 Published an R package called animbook!
❤️ Hobby is playing music.
Maliny Po
Graduated in Masters in Business Analytics
👨🎓 Undergraduate degree in International Trade and Business Logistics.
👩💻 Former Student in Wild Caught Data.
👨💻 Published a data viz package called Sugarglider, as part of the Google Summer of Code program.
❤️ Enjoys reading books, lego and a good cup of coffee.
Breakout Session: Your turn
Quick tip
Finding connections with your classmates now can lead to great project collaborations later!
Unit Design
2 hours of seminars each week
1 hour tutorial each week
1 hour workshop each week
Caution
Expect ~12 hours of contact and study each week
We are learning to code. You will need to keep up with the material.
Coming to classes and consultations will help you!
At the end of this unit you will be able to:
Understand the definitions, allowed usage, digital identification and licensing of open data
Know about common open data sources, how they are used and effectively search for new sources
Explain the differences between data collection methods and the limitations for data analysis
Work with the range of different data formats of open data, including APIs
Understand ethical constraints and privacy limits when working with open data
Recognise the components of effective curation needed for open data.
Online workshops
We’ll use these to flesh out ideas from seminars
These will be recorded (tutorials will not be)
The workshop format will change week to week depending on unit needs.
Sometimes I’ll run through examples and live code.
Sometimes I’ll answer your questions.
Sometimes I’ll run through topics to supplement your learning.
Assignments
4 assignments worth 25% each (100% of your total grade)
Assignment 1 will cover content from weeks 1 to 3 (Due week 4)
Assignment 2 will cover content from weeks 4 to 6 (Due week 7)
Assignment 3 will cover content from weeks 7 to 9 (Due week 10)
Assignment 4 will cover content from weeks 1 to 12 (Due in exam block)
You’ll get at least 2 weeks to complete each assignment.
Warning
Failure to submit and notify the CE accordingly will result in a zero score for the assignment.
If you miss two assignments you will need to re-take the unit at a later date.
Special Consideration
Apply for special consideration centrally. This includes short extensions of 48 hours.
If you need special consideration, apply ASAP and no later than 11.55 pm on the day your assessment is due.
If you miss an assignment through illness or personal difficulty, provided you’ve applied for special consideration there will be options for scaling or alternative assessment.
Locations
Unit Website: Everything is displayed on the same page and is easy to access
Moodle: Where you submit your assignments, where the discussion forum is located, and where I’ll make unit announcements.
Unit Github: Contains all the code, data etc to produce the unit content and website.
Struggle for a while!
Coding is a cycle:
Progress comes from iteration!


You can use Generative AI in this unit.
In fact I encourage it!
It’s a great tool for those learning to code

But
You must never copy and paste output from AI that you don’t understand or cannot explain
You must always provide appropriate acknowledgement of your AI use
You need to be careful not to shortcut your learning
What does academic integrity mean to you?
Still not sure - Monash Resources
What is academic integrity? Click here
What does maintaining academic integrity mean? Click here
What happens if I breach academic integrity? Click here
Ask your peers
Suitable for:
There is a discussion forum for general questions and clarifications
Emails on general matters will be redirected to the discussion forums
Sharing helps you learn from each other!
Prevents me answering the same question twice (three times, four times etc.)
Be careful about posting code from your assignments or any assignment hints to the discussion portal. I may deduct marks.

Attend Consultation
Suitable for:
Get one-on-one help
Working through problems with your tutor
Ask questions about your assignments
Get help debugging your code
Get feedback on your assessments
Being really nerdy about the unit!

Unit Email
Suitable for:
For personal questions or issues email ETC5512.Clayton-x@monash.edu.
Response times are within 1 - 2 days but may vary during busy periods.
Also email if you notice issues with assessments or Moodle.
For remarking or to get feedback on your assessment ask your tutor and email within 10 days of receiving your marks.
Do not direct emails to my staff account. I receive a high volume of emails, and they risk going into a black hole and never being seen again!
Think about this unit in the way Dr Mine Çetinkaya-Rundel describes in this talk: “Let them eat cake (first)”
Imagine you’re new to baking, and you’re in a baking class.
There are two options: which gives you a better sense of the final product?


The Textbook Learner

The Example Learner

Breakout Discussion
Which learning approach feels more natural to you?
Discuss the advantages of each
Cakes and Case Studies
The case studies you will see in this unit are the cakes.
By showing you what these case studies look like (cakes), we are helping you learn how to perform your own data analysis studies by example!
This may be different to how you’ve learnt in the past.
Please approach this unit philosophy with openness.
And for textbook learners: Check out the textbook R for Data Science
Open data is …
a raw material for the digital age but,
unlike coal, timber or diamonds,
it can be used by anyone and everyone at the same time.
Open Data Institute - Dave Tarrant - EDP Module 1 from Open Data Institute on Vimeo.
Open data is measured by what it can be used for, not by how it is made available.
Open Data Considerations
No limitations that prevent particular uses.
Anyone free to use, modify, combine and share, even commercially.
Free to use does not mean that it must be free to access.
There is a cost to creating, maintaining and publishing usable data.
Live data, big data and data from generative AI can incur ongoing costs.
Free to use, reuse and redistribute it - even commercially.
Open data can be freely used, modified, and shared by anyone for any purpose
Two types of data openness:
The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions.
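To make "technically open" a little more concrete, here is a minimal sketch in base R. It assumes a small made-up table (the station names and values are illustrative only, not real measurements) and shows that a CSV file, being plain text and non-proprietary, can be written and read back with freely available tools.

```r
# Illustrative values only; these are not real measurements
rainfall <- data.frame(
  station = c("A", "B", "C"),
  rain_mm = c(12.5, 0, 3.2)
)

# CSV is a plain-text, non-proprietary format
csv_file <- tempfile(fileext = ".csv")
write.csv(rainfall, csv_file, row.names = FALSE)

# Freely available software, such as base R, can read it back
rainfall_again <- read.csv(csv_file)
print(rainfall_again)

# The file is ordinary text, so a human can inspect it too
readLines(csv_file)
```

The same table saved as a scanned image or in a format that needs paid software would hold the same information, but would not be technically open in the sense described above.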
Pop Quiz! ❤️
Try the quizzes here
Help make governments more transparent!
Building new business opportunities
Protecting people and our environment
Globally
Australian government examples:
And so many more places …
How do I tell if data I find is open?
Licences!
Licences tell people how they can access, use and share data.
Licences
Without a licence, users may find themselves in a legal grey area.
Data may be ‘publicly available’, but users may not have permission to access, use and share it under general copyright or database laws.
An open data licence is an explicit permission to use the data for both commercial and non-commercial purposes.
Open data publishers should provide easy access to the licence for all datasets that are available to access, use and share.
Organisations and governments use open data licences to clearly explain the conditions under which their data may be used.
Examples include:
Standard re-usable licences: consistent and broadly recognised terms of use
Creative Commons, particularly CC BY and CC0 https://creativecommons.org/
Open Database License https://opendatacommons.org/licenses/odbl/
Bespoke licences: e.g. for governments, international organisations
TLDR
Many licences have a summary version that helps convey the most important information to users and a detailed version that provides the complete legal foundation.
Licence type
Standard licences can offer several advantages over bespoke licences.
Standard licences have greater recognition among users, increased interoperability, and greater ease of compliance.
Pop Quiz! ❤️
Try the quizzes here
Information components
Standards frameworks
Example datasets
Key metadata elements
‘machine readable’ is not the same as ‘digitally accessible’
Historical efforts have focused on:
pushing static information about government programs and services to the web,
where the intended user is a human who can read, print, and take actions based on reading.
It’s a narrow vision of the expected users and uses of the data.
Machine Readable
5 ⭐ ratings:
The website 5 ⭐ Open Data provides a rating system for deploying open data.
⭐ An open licence.
Make your stuff available on the Web (whatever format) under an open licence
⭐⭐ Re-usable format.
Make it available as structured data (e.g., Excel, even though it is a proprietary format, instead of an image scan of a table)
⭐⭐⭐ Open format.
Make it available in a non-proprietary open format (e.g., CSV instead of Excel)
⭐⭐⭐⭐ Use Uniform Resource Identifiers (URIs).
URIs let you reference your data, like a unique address, and give context to the values.
⭐⭐⭐⭐⭐ Linked data
Your data doesn’t exist in isolation. Link your data to other relevant data sets.
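As a rough sketch of how the star levels build on each other, the hypothetical helper below counts stars from five yes/no answers. The function name and arguments are made up for illustration and are not part of any package. It assumes the scheme is cumulative, so a level only counts if every earlier level is also met.

```r
# Hypothetical helper: count stars under a cumulative 5-star scheme.
# Each argument is TRUE or FALSE.
star_rating <- function(open_licence, structured, open_format,
                        uses_uris, linked) {
  levels_met <- c(open_licence, structured, open_format, uses_uris, linked)
  # cumprod() turns every level after the first FALSE into 0
  sum(cumprod(levels_met))
}

star_rating(TRUE, FALSE, FALSE, FALSE, FALSE)  # scanned table with an open licence
star_rating(TRUE, TRUE, FALSE, FALSE, FALSE)   # Excel file with an open licence
star_rating(TRUE, TRUE, TRUE, FALSE, FALSE)    # CSV file with an open licence
```

Note that under this assumption a CSV file without an open licence scores zero, because the licence is the first step.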
Learn more at go-fair.org
FAIR Principles
Findable Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services.
Accessible Once the user finds the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.
Interoperable The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
Reusable The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
Publishing Your Data
Research data is increasingly seen as part of the corpus of scholarly publications.
Publishers, funders and governments support researchers to publish their data outputs by various policies, guidelines and mandates.
A Digital Object Identifier (DOI) is a persistent identifier and can be assigned to data. Two services in Australia:
Many open data sets provide information on how to cite them, when used in other forms of publication.
Legal requirements:
Practical requirements:
If you provide a link to the data on a website:
Technical requirements:
Watch out for:
Open Data Institute - Dave Tarrant - EDP Module 12 from Open Data Institute on Vimeo.
Let’s look at https://www.realestate.com.au/buy
Pop Quiz! ❤️
Try the quizzes here
Exercise
Look at these open data examples:
Consider the interface
Look for licensing
Find explanations of what’s in the data
Review the metadata
This is Prof Di Cook’s taxonomy!
Long shelf life, highly processed
Orphans
Synthetic
Wild
Fresh, interesting, exciting
But also challenging!
Real world data sets, with real world messiness
eg US Bureau of Transportation Statistics air traffic database
Fresh and local
Best kind of wild data
Collected locally, and about our own lives
Our working definition of wild caught data is:
Data that can be freely used, modified, and shared by anyone for any purpose, AND
The data source is traceable, the data collection is transparent, and the data is updated as new measurements arrive, AND
In case of data processing, the process is clearly described and reproducible.
Are they Wild?
✅ Freely available to be used and modified
✅ Can be shared
✅ Data provenance is clear
✅ How the data was collected is transparent
✅ Data is updated as new measurements become available
✅ Any processing of this data is clear
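One way to work through this checklist for a data source is to record your answers and test them together. The sketch below is hypothetical: the criterion names are made up for illustration, and the TRUE/FALSE answers are placeholders you would fill in yourself after inspecting a data source.

```r
# Hypothetical checklist; replace the answers after inspecting a data source
checklist <- c(
  free_to_use_and_modify = TRUE,
  can_be_shared          = TRUE,
  provenance_clear       = TRUE,
  collection_transparent = TRUE,
  updated_with_new_data  = FALSE,
  processing_described   = TRUE
)

# Under the working definition, every criterion must hold
is_wild <- all(checklist)
is_wild

# Which criteria were not met?
names(checklist)[!checklist]
```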
Open data: definitions, sources and examples
Introduced you to the Open Data Fundamentals
eg. power for societal good, where to access, limitations, licences
Data Quality Elements
eg. metadata, machine readable, FAIR, five star ratings
Wild Caught Data Meaning
and other flavours of open data
Teaching philosophy of cake first!
Getting set up
By the end of this unit you’ll be performing your own analytics case studies
Before you can do that we need to get you set up with the software you’ll use in this unit
We’ll now go through the steps to install R and RStudio
Think about R as the engine and RStudio as the dashboard and controls.
We need both to drive a car.
We’ll open RStudio to perform data analytics using R.
Step 1: Install R
Go to https://www.r-project.org/
Click “download R”
Select a mirror (I use the Melbourne one)
Install for your operating system
Step 2: Install RStudio
Go to https://www.rstudio.com/products/rstudio/download/ (you only need the free version)
Select download for your system
Follow the prompts to install
Why do we need a Programming Language?
It allows us to have reproducible steps, which can be applied for many different data sets
It ensures the analysis is not just point and click, so a team can work together on the same code
It also means we can more easily perform our own case studies in analytics
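A small sketch of what "reproducible steps applied to many data sets" can look like in practice. The function name `summarise_column` is hypothetical; `mtcars` and `iris` are example data sets that ship with R.

```r
# Hypothetical summary step, written once as a function
summarise_column <- function(data, column) {
  values <- data[[column]]
  c(n = length(values), mean = mean(values), max = max(values))
}

# The same step applied to two different built-in data sets
summarise_column(mtcars, "mpg")
summarise_column(iris, "Sepal.Length")
```

Because the step lives in code, a teammate can rerun it, read it and change it, which is hard to do with a sequence of mouse clicks.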
Why do we use R?
It has been around for a while.
It is regularly maintained and is open source.
It is beginner friendly.
Even if you use other languages, you might still use R for your data wrangling and visualisations
Learning a new language is hard!
You need to think about grammar and structure, and how to communicate well in it!
You will make mistakes, lots of them.
Below you see code that plots points showing the GDP per capita against life expectancy. The points are coloured by country and the size of the points shows the population.
It might look impossible now, but by the end of this semester you will be able to write this yourself!
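The slide's original code is not reproduced in this text, so the following is only a sketch of a plot like the one described, not the lecturer's code. It assumes the ggplot2 and gapminder packages are installed, and it assumes a single year (2007) so that each country appears once.

```r
# Assumes these packages are installed:
# install.packages(c("ggplot2", "gapminder"))
library(ggplot2)
library(gapminder)

# Keep one year so each country appears once (2007 is an arbitrary choice)
gapminder_2007 <- subset(gapminder, year == 2007)

p <- ggplot(gapminder_2007,
            aes(x = gdpPercap, y = lifeExp,
                colour = country, size = pop)) +
  # There are many countries, so the legend is hidden
  geom_point(alpha = 0.7, show.legend = FALSE) +
  # GDP per capita spans orders of magnitude, so use a log scale
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy (years)")

p
```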
Note
# create a file that I can save and use again
# assign a variable
x = 2
y <- 3
# what should I not call my variable
kates_coolest_variable <- 4 # snake_case
katescoolestvariable <- 4 # ok but not readable
KatesCoolestVariable <- 4 # ok but could be easier to read
# 123kate <- 5 # bad: names can't start with a number (syntax error)
# @kate <- 5 # also bad: names can't start with a special character
kates_awesome_variable <- 5
kate_loves_R <- 6
# what we learn:
# give variables and files intuitive names
# don't start variable names with numbers
# or special characters
# bunch of inbuilt functions
sqrt(2)
1:10
mean(1:10)
# install the package (first time only)
# install.packages("cowsay")
# load in the functions
library(cowsay)
# check what they do
?cowsay
?say
# look at examples
say(what = "hot diggity", by = "frog")
say(what = "Happy Lunar New Year",
by = "endlesshorse")
# Summary:
# Install the cowsay package (first time use)
# Load the library
# print out a say - with an animal and a message
# paste into our meeting chat
# Reading in files
# Can use the import dataset button
# but not advisable for reproducible research
# Can also use a direct file path
# But if you change the working directory
# the data read fails
library(tidyverse)
file_path <- "Documents/Git/dvac-SSA/assignments/data/tourism_data.csv"
tourism_data <- read_csv(file_path)
View(tourism_data)
# Instead set up the project
# Then the working directory will be that project
# No need for long path names
getwd()
tourism_data <- read_csv("data/tourism_data.csv")
View(tourism_data)
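The idea of project-relative paths can be tried without any real data. The sketch below is an illustration in base R: the folder name `my_project` and the file `example.csv` are hypothetical, the values are made up, and `setwd()` is used only to imitate what opening an RStudio project does for you.

```r
# Stand-in for a project folder (a temporary directory, for illustration)
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"),
           recursive = TRUE, showWarnings = FALSE)

# Write a tiny made-up file so there is something to read
write.csv(data.frame(region = c("A", "B"), trips = c(10, 20)),
          file.path(project_dir, "data", "example.csv"),
          row.names = FALSE)

# Opening an RStudio project sets the working directory for you;
# setwd() here only imitates that
old_wd <- setwd(project_dir)

# A short path relative to the project folder
relative_path <- file.path("data", "example.csv")
file.exists(relative_path)  # check the file is there before reading
example <- read.csv(relative_path)
example

# Restore the previous working directory
setwd(old_wd)
```

The short relative path keeps working if the whole project folder is moved or shared, whereas a long path tied to one computer does not.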
ETC5512