Lecturer: Kate Saunders
Department of Econometrics and Business Statistics
Suppose you are the data curator. What should you know?
Learning Objectives
We will learn:
We will also discuss SETUs and Assignment 4 today.
Open data is …
a raw material for the digital age but,
it’s unlike coal, timber or diamonds,
it can be used by anyone and everyone at the same time.
Let’s remind ourselves of some examples of why open data is important!
Important
Help make governments more transparent.
Building new business opportunities
Protecting the planet
Our working definition
Wild caught data is:
data that can be freely used, modified, and shared by anyone for any purpose, and
The data source is traceable, the data collection is transparent, and the data is updated as new measurements arrive, and
In case of data processing, the process is clearly described and reproducible.
It should be fresh, interesting, exciting!
And ideally collected locally, and about our own lives
It’s important to stop and reflect on what we’ve learnt.
Break out session
Discuss at your tables open data examples you’ve seen that:
You may also like to reflect on our case studies or where you can source open data from.
How has your understanding of open data changed since you started this unit?
Caution
But there are some common pitfalls we can easily avoid!
Also:
A vast number of people and organisations collate data, and others often cross-check numbers between sites.
Human consumption
Computer consumption
Source: Murrell (2013) Data Intended for Human Consumption
Broman and Woo (2018) Data Organization in Spreadsheets https://doi.org/10.1080/00031305.2017.1375989
It is good practice to store dates as separate Year, Month, Day columns. This is much safer across systems.
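A minimal sketch of this idea in Python, assuming the dates arrive as ISO 8601 strings (YYYY-MM-DD). The station names and dates are made up for illustration, and the helper names `split_date` and `join_date` are hypothetical.

```python
from datetime import date


def split_date(iso_date: str) -> dict:
    """Split a YYYY-MM-DD string into separate integer year/month/day fields."""
    parsed = date.fromisoformat(iso_date)  # raises ValueError on an invalid date
    return {"year": parsed.year, "month": parsed.month, "day": parsed.day}


def join_date(year: int, month: int, day: int) -> str:
    """Rebuild the YYYY-MM-DD string from the three columns."""
    return date(year, month, day).isoformat()


# Hypothetical rows, for illustration only.
rows = [
    {"station": "A", "date": "2024-03-09"},
    {"station": "B", "date": "2024-11-30"},
]

# Replace the single date column with three unambiguous integer columns.
tidy = [{"station": r["station"], **split_date(r["date"])} for r in rows]

for row in tidy:
    print(row)
```

Because the three columns are plain integers, a spreadsheet program has nothing to reinterpret as a date in its own local format.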
Tip
The cells in your spreadsheet should each contain one piece of data. Do not put more than one thing in a cell.
In assignment 1 we saw a column with “HAZARD flooding”. It would be better to separate this into “HAZARD” and “flooding” columns.
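As a rough sketch of how such a cell could be separated, assuming the first word is always one piece of data and the remainder is the other. The second example value and the column names `category` and `type` are hypothetical, not taken from the assignment data.

```python
def split_cell(cell: str) -> tuple[str, str]:
    """Split a cell such as 'HAZARD flooding' at the first space."""
    first, _, rest = cell.strip().partition(" ")
    return first, rest


# Hypothetical cell values, for illustration only.
cells = ["HAZARD flooding", "HAZARD bushfire"]

# One piece of data per column (column names are assumptions).
tidy = [dict(zip(("category", "type"), split_cell(c))) for c in cells]

for row in tidy:
    print(row)
```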
Remember the airlines data: the time zone is in one column and the departure time in another. This is partly technical, because a single column can’t store times from multiple time zones.
The census has an extensive data dictionary for each year distributed, giving variable names, and also explanation of levels in categorical variables.
But, it can still be hard to find what you are looking for.
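One way to make a data dictionary easier to use is to keep it machine-readable, so that data can be checked against it. The sketch below is an illustration only: the column names, levels and the `check_against_dictionary` helper are hypothetical and are not taken from the census.

```python
# Hypothetical data dictionary: one entry per variable, with the allowed
# levels spelled out for categorical variables.
data_dictionary = {
    "hazard": {
        "description": "Type of natural hazard recorded",
        "levels": {"flood": "Flooding event", "fire": "Bushfire event"},
    },
    "year": {"description": "Calendar year of the event"},
}


def check_against_dictionary(rows, dictionary):
    """Return (row_number, column, problem) for anything the dictionary does not explain."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for column, value in row.items():
            entry = dictionary.get(column)
            if entry is None:
                problems.append((number, column, "column not in dictionary"))
            elif "levels" in entry and value not in entry["levels"]:
                problems.append((number, column, f"unknown level {value!r}"))
    return problems


# Hypothetical rows: the second uses an undocumented level,
# the third an undocumented column name.
rows = [
    {"hazard": "flood", "year": "2019"},
    {"hazard": "storm", "year": "2020"},
    {"hazard": "fire", "yr": "2021"},
]

for problem in check_against_dictionary(rows, data_dictionary):
    print(problem)
```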
In our wild caught data analogy
Beware your spreadsheets don’t “bite” your data!
Useful link
You can validate the integrity of your csv file with
http://csvlint.io
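For a quick local check of one common problem, ragged rows, a few lines of Python can be enough. This is only a sketch of a single check, assuming the first record is the header; it is not a substitute for a full validator, and the function name `check_rectangular` and the sample text are hypothetical.

```python
import csv
import io


def check_rectangular(csv_text: str) -> list:
    """Return (record_number, field_count) for each record whose width differs from the header.

    Records are numbered from 1, with the header as record 1.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:  # empty input: nothing to check
        return []
    expected = len(header)
    return [
        (number, len(record))
        for number, record in enumerate(reader, start=2)
        if len(record) != expected
    ]


# Hypothetical file contents, for illustration only.
good = "year,month,day\n2024,3,9\n2024,3,10\n"
bad = "year,month,day\n2024,3,9\n2024,3\n"

print(check_rectangular(good))
print(check_rectangular(bad))
```

An empty list means every record has the same number of fields as the header, which is what "nicely rectangular" means here.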
We want to look after our wild data so it doesn’t bite us!
We want this
Not this
Goodman et al. (2014) created Ten Simple Rules for the Care and Feeding of Scientific Data. These apply to us.
We can replace “science” with “data science”, “data analysis”, “analytics”, “business intelligence”.
🤔 So think about what these rules imply for business and government data.
The Rules
Love Your Data, and Help Others Love It, Too
Share Your Data Online, with a Permanent Identifier
Conduct Science with a Particular Level of Reuse in Mind
Publish Workflow as Context
Link Your Data to Your Publications as Often as Possible
Publish Your Code (Even the Small Bits)
State How You Want to Get Credit
Foster and Use Data Repositories
Reward Colleagues Who Share Their Data Properly
Be a Booster for Data Science
Love Your Data, and Help Others Love It, Too
Note
What are some ways to show your love?
What data have we seen that isn’t loved?
Share Your Data Online, with a Permanent Identifier
Conduct Science with a Particular Level of Reuse in Mind
Note
Reward Colleagues Who Share Their Data Properly
Note
Let’s review some examples
What’s really nice 😄:
- Github page
- Compiled data from various sources, sources listed
- Update time stamp
- Versioning
- Issues for two way conversations with users
Bureau of Transportation Statistics (Assignment 1)
csv file is nicely rectangular ✅
CSIRO (Assignment 1)
You’ve seen lots of examples of different case studies now (cakes).
Remember our teaching philosophy of “Let them eat cake (first)”
It’s time to show us that you know how to put the raw ingredients together to make your own cake.
Show me what you can do!
Show off what you’ve learnt and how far you’ve come!
Everyone will be on a slightly different journey here.
You can either use open data to analyse a question you:
Use open data to consider a question that:
What I’m examining
In Task 1 I want to see you can:
source open data appropriate for a task
clearly document the download and processing steps
curate your data to share with others
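One possible way to document a download, sketched under assumptions: record where the file came from, when it was retrieved, a checksum of the exact bytes, and the processing steps applied. The URL, the step descriptions and the `provenance_record` helper are hypothetical, and this is not a required format for the assignment.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data: bytes, source_url: str, steps: list) -> dict:
    """Describe a downloaded file so others can trace and verify it."""
    return {
        "source_url": source_url,
        "retrieved_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the raw bytes
        "bytes": len(data),
        "processing_steps": steps,
    }


# Stand-in for the bytes of a downloaded file (hypothetical content and URL).
raw = b"a,b\n1,2\n"
record = provenance_record(
    raw,
    "https://example.org/data.csv",
    ["dropped rows with missing values", "renamed columns to snake_case"],
)

# Saved alongside the data, this could act as a small README for the file.
print(json.dumps(record, indent=2))
```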
What I’m examining
In Task 2 I want you to share your own case study in a blog:
show you know how to be curious about data
show you can use open data to answer a question
show you can think critically about a problem
show you can write about your analytics
show off the coding skills you’ve learnt
What I’m examining
In Task 3 I also want you to reflect on your analysis:
share challenges you faced working with your data
share specific insights into your approach
share your ambitions for future work
share limitations of your data or analysis (may also apply for task 2)
clearly communicate any assumptions (may also apply for task 2)
Success
To be successful on Assignment 4 I suggest going back through the lectures so far.
Take a look at the variety and depth of topics we’ve covered!
Ensure that in Assignment 4 you demonstrate knowledge gained from a few of the different lectures and assignments.
Wrapping up this unit you should:
Understand the definitions, allowed usage, digital identification and licensing of open data
Know about common open data sources, how they are used and effectively search for new sources
Explain the differences between data collection methods and the limitations for data analysis
Work with the range of different data formats of open data
Understand ethical constraints and privacy limits when working with open data
Recognise the components of effective curation needed for open data
I love teaching this unit
I truly believe the skills you encounter in this unit, particularly learning to deal with messy real-world data and problem solving, will stand you in good stead for whatever career you have after this.
I also like to think that you will take with you a healthy respect for data ethics and data privacy, and I am grateful I got to share that with you.
I wish everyone the best of luck in their future studies and careers.
ETC5512