Lecturer: Kate Saunders
Department of Econometrics and Business Statistics
Starts by asking questions
but needs to be
What we’ll cover:
Asking questions about our data
Thinking about whether our data is suitable to answer those questions
That means learning about different types of sampling
Learn about how data is collected
Questions
How many yellow, green and red alien creatures?
What is the distribution of the height of the alien creatures?
Are yellow creatures more likely to have hair?
Does the hair growth formula work on these creatures?
What is the population?
It is rare to have the resources to measure ALL of the population, so we take a sample.
Population parameters:
Typically don’t know the values.
Sample statistics:
Are estimates of true parameters.
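To make the distinction concrete, here is a minimal sketch in Python. It assumes a simulated population of 1,000 creature heights. The numbers are invented for illustration, and the parameter can be computed only because the whole population was simulated.

```python
import random
import statistics

# Hypothetical population: heights (in cm) of 1,000 alien creatures.
# The values are simulated purely for illustration.
rng = random.Random(5512)  # fixed seed so the example is reproducible
population = [rng.gauss(150, 20) for _ in range(1000)]

# Population parameter: normally unknown, computable here only because
# we simulated the whole population.
mu = statistics.mean(population)

# Sample statistic: the mean of a simple random sample of 50 creatures.
sample = rng.sample(population, 50)
x_bar = statistics.mean(sample)

print(f"population mean (parameter): {mu:.1f}")
print(f"sample mean (statistic):     {x_bar:.1f}")
```

Re-running with a different seed changes the sample mean but not the idea: the statistic varies from sample to sample, while the parameter is fixed.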
Important
Collecting data on the entire population is normally too expensive or infeasible!
Note
If we can collect data on the entire population it is called a census
Often we collect data on only a subset of the population.
So how should we sample the population?
There are many sampling schemes!
Goal of a sampling scheme
We want to get accurate information from the sample in order to answer our question about the population.
Think!!!
Accurate sampling involves identifying:
The population of interest (e.g. if studying male pattern baldness, your population of interest is the biologically male population),
what responses (dependent variables) or covariates (explanatory or independent or predictor variables) to capture and how to measure them (e.g. do you collect their age? Which age range they are in? Their hair count? The thickness of the hair?),
the sample size (how many samples do we need?),
any structure that will be in the data (e.g. population structures, repeated cross-sectional data, panel or longitudinal data), and
any restrictions (e.g. ethical concerns, limitation on collecting data).
Simple random sampling
Every unit in the population has the same probability of being drawn.
Stratified random sampling
Units are drawn from non-overlapping sub-populations.
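The two schemes can be sketched side by side. This is an illustration only: the population of 100 creatures, the colour counts, the 10% sampling fraction and the function names are all assumptions, not part of the lecture material.

```python
import random
from collections import Counter

rng = random.Random(1)

# Hypothetical population of 100 creatures, identified by (id, colour).
population = (
    [(i, "yellow") for i in range(0, 60)]
    + [(i, "green") for i in range(60, 90)]
    + [(i, "red") for i in range(90, 100)]
)

def simple_random_sample(units, n, rng):
    """Every unit has the same chance of being drawn (without replacement)."""
    return rng.sample(units, n)

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw the same fraction from each non-overlapping sub-population."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        n_stratum = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n_stratum))
    return sample

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda u: u[1], 0.10, rng)

print("SRS colours:       ", Counter(colour for _, colour in srs))
print("Stratified colours:", Counter(colour for _, colour in strat))
```

With proportional allocation the stratified sample always contains 6 yellow, 3 green and 1 red creature, whereas the colour mix of the simple random sample changes from draw to draw.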
Sampling strategies combine knowledge about the population with statistical methods.
For example:
Warning
What might go wrong with a simple random sampling of 10 creatures from this population?
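One possible problem is that a small simple random sample can miss a rare group entirely. The sketch below assumes a hypothetical population of 100 creatures of which 10 are red (the population shown on the slide is not reproduced here) and uses the hypergeometric probability of drawing none of them.

```python
from math import comb

def prob_group_missing(population_size, group_size, sample_size):
    """P(an SRS without replacement contains no member of the group)."""
    return (
        comb(population_size - group_size, sample_size)
        / comb(population_size, sample_size)
    )

# Hypothetical population: 100 creatures, of which 10 are red.
p_no_red = prob_group_missing(100, 10, 10)
print(f"P(no red creatures in an SRS of 10) = {p_no_red:.2f}")
```

Under these assumed numbers, roughly one sample in three would contain no red creatures at all, which is one motivation for stratifying.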
Units (population members) ideally are sampled randomly, but often selections are made in a non-random manner.
Example
If I survey every 10th household in a street, is that a random selection?
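A small sketch of this kind of systematic selection may help when thinking about the question. The street of 100 households and the function name are assumptions made for illustration.

```python
import random

def systematic_sample(units, step, start):
    """Take every `step`-th unit, beginning at index `start` (0-based)."""
    return units[start::step]

# Hypothetical street with 100 households, numbered 1 to 100.
households = list(range(1, 101))

# Fixed start: completely deterministic, no randomness involved.
fixed = systematic_sample(households, step=10, start=9)

# Random start: each household has a 1-in-10 chance of selection,
# but only 10 distinct samples are possible, and neighbours never
# appear together.
rng = random.Random(7)
random_start = systematic_sample(households, step=10, start=rng.randrange(10))

print(fixed)
print(random_start)
```

Even with a random start, this is not a simple random sample: most subsets of 10 households can never be selected.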
Important
What do you think can go wrong if we don’t sample randomly?
Breakout Session: Discussion
You want to know the attitude of the creatures about working at home.
So you call phone numbers in the order they are listed in the telephone directory and stop when you have 20 observations.
You want to get the hair count distribution of the Planet Cute Creatures population.
So you sample creatures from the Society of Bald Extraterrestrials.
Items 1 and 2 are toy examples. Do you know of any examples of bad sample design in the real world?
Designing a data collection is hard.
There may be unknown or hidden structures in the population.
It may add complex structural elements, e.g.
You may introduce unintended or unknown structures in the data, e.g. confounded variables.
It’s further complicated by:
Observational Studies
Sampling from a population typically yields data from an observational study.
Almost all open data are from observational studies.
An observational study aims to draw inferences about a population from a sample where independent variables are not intentionally allocated to units within the sample for the purpose of a study.
Data considered in observational studies are observational data.
Tip
Tip
Tip
Experimental studies
A scientific claim generally needs to be validated by an experimental study.
In an experimental study, a causal variable of interest (referred to as treatment) is administered to recipients while holding other covariates at controlled settings to observe responses.
Data from an experiment are referred to as experimental data.
Tip
Tip
Tip
Note
Experimental units are the smallest recipients of an allocated treatment: no subdivision of a unit can receive a different treatment independently.
For the fertiliser example: The experimental units are the plots of land (fields) receiving different fertilisers.
For the vaccine example: The experimental units are the individual patients who receive either vaccine or placebo.
In the A/B testing example: The experimental units are the individual website visitors who see version A or B.
Note
Observational units are units that you measure the response on.
For the fertiliser example: The observational units are the same plots of land where wheat yield is measured.
For the vaccine example: The observational units are the same individual patients whose flu status is observed.
In the A/B testing example: The observational units are the same website visitors whose purchasing behavior is tracked.
Note
An observational unit is not the observation (the response)! The wheat yield, flu status, and purchase decisions are the observations, not the observational units.
Sometimes the experimental units are the same as the observational units, as in these examples.
Warning
Prof Android delivers their lecture by reciting the text word for word in a monotone.
Prof Alien delivers their lecture by transmitting the information directly to the students' minds.
You want to see if one of the methods is more effective.
Students in class 1, 3, 4, 7 and 10 have Prof Android.
Students in class 2, 5, 6, 8 and 9 have Prof Alien.
What are the experimental units?
Warning
Carrying on from the previous example…
What are the observational units?
A 10-week sensory experiment: 12 individuals assessed the taste of french fries on several scales. The fries were fried in one of 3 different oils, replicated twice.
Study details
The treatment is oil, and there are 3 of them.
The experimental units are batches of chips.
The observational units are the tasters.
Replication is the two batches of each oil for each week.
Weeks could be considered to be blocks, because the taste might change as the oil ages.
The outcome or measured variable is the rating factor. There are five taste factors recorded.
Randomisation is (probably) applied to the order of tasting, and tasters should be blind to the type of oil.
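The structure described in the study details can be enumerated to see how large the data set would be. This sketch assumes a complete design in which every taster rates every batch in every week. The oil labels are placeholders, and real data may contain missing ratings.

```python
from itertools import product

weeks = range(1, 11)        # 10 weeks
tasters = range(1, 13)      # 12 individuals
oils = ["oil_1", "oil_2", "oil_3"]  # 3 oil treatments (labels are placeholders)
replicates = [1, 2]         # two batches of each oil per week

# One row per (week, oil, replicate, taster) if every taster rated every batch.
design = list(product(weeks, oils, replicates, tasters))

n_batches = len(set((w, o, r) for w, o, r, _ in design))
print("batches of chips (experimental units):", n_batches)
print("ratings per taste scale if nothing is missing:", len(design))
```

Comparing the row count of the real data with this complete-design count is a quick way to spot missing observations.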
Caution
Why don’t we order the treatments in a systematic order?
Wouldn’t that make the experiment easier to manage?
Systematic designs are prone to bias and confounding.
Treatments
Allocating treatments to experimental units at random avoids:
systematic bias - e.g. all flu vaccine A tested in January (summer) and all flu vaccine B tested in July (winter).
selection bias - e.g. giving the treatment that you are testing to the sick patients and placebo to those that are healthy.
other bias - e.g. the lab technician giving the treatment to the first rat that is taken out of the cage.
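As an alternative to a systematic order, treatments can be allocated by a random shuffle. The following is a minimal sketch of a completely randomised allocation. The patient labels, group sizes and function name are hypothetical.

```python
import random

def randomise_treatments(units, treatments, rng):
    """Completely randomised design: equal replication, random allocation."""
    if len(units) % len(treatments) != 0:
        raise ValueError(
            "number of units must be a multiple of the number of treatments"
        )
    reps = len(units) // len(treatments)
    labels = [t for t in treatments for _ in range(reps)]
    rng.shuffle(labels)
    return dict(zip(units, labels))

# Hypothetical experiment: 12 patients, vaccine versus placebo.
rng = random.Random(2024)
patients = [f"patient_{i:02d}" for i in range(1, 13)]
allocation = randomise_treatments(patients, ["vaccine", "placebo"], rng)

for patient, treatment in allocation.items():
    print(patient, treatment)
```

Because the allocation comes from the shuffle rather than from the experimenter, it cannot depend on how sick a patient looks or on which rat is easiest to catch.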
Blocks are used to group the experimental units into alike units.
Blocking
If well done, blocking can lower the variance of treatment contrasts, which increases power.
A non-homogeneous block (i.e. units within block are not alike) can decrease the power of the experiment.
You can form blocks from:
Natural discrete divisions between experimental units,
e.g. in experiments with people, gender and age groups make obvious blocks.
Groups of experimental units with similar values along a continuous gradient,
e.g. if the experiment is spread out in time or space and there are no obvious natural boundaries, an arbitrary boundary may be chosen to group experimental units that are contiguous in time or space.
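Once blocks are formed, randomisation is usually carried out separately inside each block. The sketch below illustrates a randomised complete block design. The plots, block names and treatment labels are assumptions for illustration.

```python
import random

def randomise_within_blocks(blocks, treatments, rng):
    """Randomised complete block design: each treatment once per block."""
    design = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError(
                f"block {block!r} needs exactly {len(treatments)} units"
            )
        order = list(treatments)
        rng.shuffle(order)  # a fresh random order inside every block
        for unit, treatment in zip(units, order):
            design[unit] = (block, treatment)
    return design

# Hypothetical blocks: plots grouped by position along a field gradient.
blocks = {
    "north": ["plot_1", "plot_2", "plot_3"],
    "middle": ["plot_4", "plot_5", "plot_6"],
    "south": ["plot_7", "plot_8", "plot_9"],
}
rng = random.Random(11)
design = randomise_within_blocks(blocks, ["A", "B", "C"], rng)

for unit, (block, treatment) in design.items():
    print(block, unit, treatment)
```

Every treatment appears exactly once in every block, so treatment comparisons are made among alike units.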
Salk Vaccine Field Trial
The first polio epidemic hit the United States in 1916, and over the following decades polio claimed hundreds of thousands of victims, especially children.
The National Foundation for Infantile Paralysis (NFIP) was ready to test the vaccine developed by Jonas Salk in the real world.
A controlled experiment was proposed to test the effectiveness of the vaccine on grade 1, 2 and 3 children in selected school districts throughout the country where the risk of polio was high.
In total two million children were involved, although not all parents consented to their children being vaccinated.
Source: Freedman, Pisani & Purves (2010) Statistics. 4th edition
Design for the NFIP Study
Vaccinate all grade 2 children whose parents would consent, leaving children in grades 1 and 3 as controls.
Can grade 2 children whose parents did not consent be included as control?
What are the potential issues with such a design?
Polio is a contact disease. Would incidences of disease be higher in grade 2?
Randomised controlled trial
An alternate vaccine trial randomly assigned the vaccine and placebo to children.
Details
The rate is the number of polio cases per 100,000 in each group.
RCT and NFIP trial sampled from school districts with similar exposures to the polio virus.
The groups labelled Not Vaccination (no consent), Control and Placebo did not receive the vaccine.
Results
Let’s take a look at the results now - Why is the rate of polio cases different?
NFIP study
Group | Participants | Rate (per 100,000) |
---|---|---|
Vaccinated (Grade 2) | 221,998 | 25 |
Control (Grade 1 & 3) | 725,173 | 54 |
Not Vaccination (Grade 2, no consent) | 123,605 | 44 |
Incomplete Vaccination (Grade 2, incomplete) | 9,904 | 40 |
Randomised controlled trial
Group | Participants | Rate (per 100,000) |
---|---|---|
Vaccinated | 200,745 | 28 |
Placebo | 201,229 | 71 |
Not Vaccination (no consent) | 338,778 | 46 |
Incomplete Vaccination | 8,484 | 24 |
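The rate per 100,000 can be turned back into an approximate number of cases, which helps when comparing groups of different sizes. This sketch uses the participant counts and rates from the randomised controlled trial table. The function names are illustrative, and the implied case counts are approximate because the published rates are rounded.

```python
def rate_per_100k(cases, participants):
    """Cases per 100,000 participants."""
    return cases / participants * 100_000

def implied_cases(rate, participants):
    """Approximate case count implied by a rounded rate per 100,000."""
    return rate * participants / 100_000

# (participants, rate per 100,000) from the randomised controlled trial table.
rct = {
    "Vaccinated": (200_745, 28),
    "Placebo": (201_229, 71),
}

for group, (participants, rate) in rct.items():
    print(f"{group}: about {implied_cases(rate, participants):.0f} cases")

rate_ratio = rct["Placebo"][1] / rct["Vaccinated"][1]
print(f"placebo rate / vaccinated rate = {rate_ratio:.1f}")
```

Both groups consist of children whose parents consented and who were then assigned at random, so the ratio of about 2.5 compares like with like.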
Tip
Higher-income parents were more likely to consent to treatment than lower-income parents.
Children of higher income parents are more vulnerable to polio.
Many forms of polio are hard to diagnose, and in borderline cases the diagnosis could be influenced by knowing whether the child had been vaccinated.
Think about
Warning
Basically, designing and running experiments is hard.
Types
Experimental data: the gold standard of data collection, but very difficult
Observational data:
Sample and Population
Knowing how the sample of data relates to the population is an essential ingredient for making inferential statements and making decisions with data.
What type of data is this?
Data: Airline traffic (on-time performance database) in the USA as available from https://www.bts.gov.
What is in it: Records of every commercial flight operated in the USA since the 1980s, that has carried passengers.
OBSERVATIONAL, CENSUS
Always ask yourself “What is missing?”
What type of data is this?
Data: National Longitudinal Survey of Youth 1979 from the US Bureau of Labor Statistics at https://www.nlsinfo.org/content/cohorts/NLSY79
What is in it: Records data about people born between 1957 and 1964. At the time of first interview, respondents’ ages ranged from 14 to 22.
OBSERVATIONAL, SURVEY SAMPLE
Always ask yourself “What is the population?”
What type of data is this?
Data: Atlas of Living Australia at https://www.ala.org.au.
What is in it: The Atlas of Living Australia (ALA) is a collaborative, digital, open infrastructure that pulls together Australian biodiversity data from multiple sources, making it accessible and reusable.
The ALA helps to create a more detailed picture of Australia’s biodiversity for scientists, policy makers, environmental planners and land managers, industry and the general public, and enables them to work more efficiently.
OBSERVATIONAL, OCCURRENCE
Always ask yourself “What is missing?” and “What is the population?”
Data
Data / what is it: The US National Institutes of Health provides a catalog of medical studies, including many COVID studies. Here is one that studies the “Safety and Efficacy of C21 in Subjects With COVID-19”.
EXPERIMENTAL
Think about: What are the treatments? The experimental units? The outcome measure? The randomisation?
Introduction to Data Collection Methods
Encouraged you to be curious about your data!
Now understand the difference between sample vs population
Covered important aspects of sampling design
Exposed you to different types of data collection methods
Going forward: Think critically about whether the data collected is suitable for what you are curious about!
Your Job:
Find data sets that are useful along the warning value chain
Check the licences for those data sets: are they open?
Review the data sets for FAIR Principles and 5-star quality
These data sets are important for
In addition to the warning value chain consider:
Try to find some of your own!
ETC5512