```r
alldata <- readRDS(here::here("final data", "2016_2022_data.rds"))
```
Week 7 Tutorial
Learning Objectives
In this tutorial, you will be looking at the Fannie Mae performance data on single-family loans.
You will be learning how to:
- import large data
- compare the default risk for different periods
- practice data wrangling skills, and
- as always, use critical thinking in your analysis
Before your tutorial
- Download and unzip the ETC5512-week07.zip file provided, and notice the file structure for this week looks like:
etc5512-week07
├── tutorial-07.Rmd
├── final data
│ ├── 2016_2022_data.rds
├── raw data
├── prep
│ ├── 00_read_data.R
│ ├── 01_import_data.R
│ ├── 02_combine.R
└── etc5512-week07.Rproj
- Import the data by loading the R data file (file type `.rds`); see `?readRDS`.
Note: This file has been pre-processed for you to work on the exercises in this tutorial - but it will still take time to read in.
This data contains the Fannie Mae single-family loan data from 2016Q1 to 2022Q3.
- .rds files are more efficient for storing large datasets
- .rds files compress the data and preserve R data types
- .csv files take up more space and require conversion when imported
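As a quick illustration of why `.rds` preserves types while `.csv` does not, here is a small, self-contained sketch using toy data (not the loan file):

```r
# Toy data frame with a Date column (illustration only, not the loan data)
df <- data.frame(id = 1:3, date = as.Date("2016-01-01") + 0:2)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")
saveRDS(df, rds_path)                       # binary, compressed, keeps R classes
write.csv(df, csv_path, row.names = FALSE)  # plain text, classes lost

back_rds <- readRDS(rds_path)
back_csv <- read.csv(csv_path)
class(back_rds$date)  # "Date"      -- preserved
class(back_csv$date)  # "character" -- needs re-conversion after import
```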
Exercises
Create a table of default rates (based on 60, 90, 120 and 180 days of delinquency) by origination year and quarter. Hint: you might want to convert `ORIG_DTE` to a date format first.
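One way such a table could be built is sketched below on toy data. The delinquency flag name (`ever_d90`) is hypothetical, and the real `ORIG_DTE` values may be formatted differently (here assumed to be month/year strings):

```r
library(dplyr)
library(lubridate)

# Toy stand-in for alldata; ever_d90 is a hypothetical 90-day delinquency flag
toy <- tibble(
  ORIG_DTE = c("01/2016", "04/2016", "01/2017", "07/2017"),
  ever_d90 = c(0, 1, 0, 1)
)

rate_tbl <- toy %>%
  mutate(orig      = my(ORIG_DTE),     # lubridate::my() parses "month/year"
         orig_year = year(orig),
         orig_qtr  = quarter(orig)) %>%
  group_by(orig_year, orig_qtr) %>%
  summarise(d90_rate = mean(ever_d90), .groups = "drop")

rate_tbl
```

For the real data, compute one such rate column per delinquency threshold (60, 90, 120, 180 days).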
Do you observe something peculiar in the table? Explain why.
Draw a U.S. map showing the default rate based on 90 days of delinquency for the year 2016 only. Which three states have the highest default rates?
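A map sketch with the `usmap` package follows; the state-level summary data frame here is hypothetical, so compute the real per-state rates from the loan data first:

```r
library(usmap)
library(ggplot2)

# Hypothetical state-level summary -- replace with rates computed from the data
state_rates <- data.frame(state    = c("CA", "TX", "FL"),
                          d90_rate = c(0.010, 0.014, 0.018))

p <- plot_usmap(data = state_rates, values = "d90_rate") +
  scale_fill_continuous(name = "90-day default rate", na.value = "grey90") +
  theme(legend.position = "right")
p
```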
Draw a histogram to show the distribution of FICO Scores in different years. What do you notice?
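A faceted histogram is one way to compare the distributions across years. The sketch below uses simulated scores, and the real score column name (here `fico`) will differ in the loan data:

```r
library(ggplot2)

set.seed(1)
# Simulated FICO scores by origination year (illustration only)
toy <- data.frame(year = rep(2016:2017, each = 200),
                  fico = c(rnorm(200, 740, 40), rnorm(200, 750, 35)))

p <- ggplot(toy, aes(fico)) +
  geom_histogram(binwidth = 20, colour = "white") +
  facet_wrap(~ year) +
  labs(x = "FICO score", y = "Count")
p
```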
The interpretation of FICO scores is tricky: a small increase at a higher score does not mean the same thing as a small increase at a lower score, so the variable needs to be regrouped. Regroup the FICO scores as below (use the regrouped variable as your independent variable to complete the exercise in your own time).
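The band definitions to use are the ones specified in the exercise; purely as an illustration of the mechanics, `cut()` can regroup a numeric score into ordered bands (the breakpoints below are hypothetical):

```r
# Hypothetical breakpoints -- substitute the bands given in the exercise
fico_group <- function(score) {
  cut(score,
      breaks = c(-Inf, 620, 660, 700, 740, Inf),
      labels = c("<620", "620-659", "660-699", "700-739", "740+"),
      right  = FALSE)   # intervals are [low, high)
}

fico_group(c(598, 660, 745))  # <620, 660-699, 740+
```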
Optional: In your own time
The data that you have used so far is pre-processed from the raw data that you can get from the Fannie Mae Single-Family Home Loan Data. Try the steps below to pre-process the data yourself, using just the data from 2016Q1 to 2022Q3 as an exercise.
Before you get started, install relevant R-packages.
```r
### Required packages
list.of.packages <- c("MASS", "data.table", "tidyverse", "here", "stringr",
                      "lubridate", "ggplot2", "usmap", "gganimate")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
invisible(lapply(list.of.packages, require, character.only = TRUE))
```
- Download the data
- Go to this page. On the right panel, click *Access the Data*. Register for an account if you have not (it is free). Click *HP (historical loan credit performance data)* / *Download data*, then choose 2016Q1 to 2022Q3. The downloaded data should be placed under the `raw data` folder.
- Convert each file to a smaller size
- Open `01_import_data.R` (under the `prep` folder) and run the script.
- Your output file should end in `_stat.csv` (located in the `raw data` folder). The code has been pre-written for you in the `00_read_data.R` script.
- Try to understand the code in `00_read_data.R` to see how the data is converted from transaction-level data to single-account-level data.
- Combining data
- Run `02_combine.R` (under the `prep` folder; it may take a while to run). You should get about 20 million records. Yes, 20 million rows!
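The combining step likely amounts to reading every `*_stat.csv` file and stacking the rows. Here is a self-contained sketch with toy files; the real script may differ:

```r
library(data.table)

# Toy example: create two small *_stat.csv files in a temp folder,
# then combine them the way 02_combine.R plausibly does
dir <- file.path(tempdir(), "stat_demo")
dir.create(dir, showWarnings = FALSE)
fwrite(data.table(id = 1:2, dflt = c(0, 1)), file.path(dir, "2016Q1_stat.csv"))
fwrite(data.table(id = 3:4, dflt = c(0, 0)), file.path(dir, "2016Q2_stat.csv"))

files <- list.files(dir, pattern = "_stat\\.csv$", full.names = TRUE)
combined <- rbindlist(lapply(files, fread))  # read each file, stack the rows
nrow(combined)  # 4
```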
- Analyze data
- Now the data is ready!
- Use the data and replicate the plots in the lecture.
- Answer the below questions:
- Which state has the highest default rate?
- What type of customer is riskier: those with a higher LMI or a lower LMI?
- Do customers with a higher interest rate at origination have a higher chance of default?