🎯 Objectives

In this tutorial, you will learn to

πŸ”§ Preparation

Note: you should do preparation before the tutorial.

  1. Installing relevant R-packages
### Required packages
list.of.packages <- c("MASS", "data.table", "tidyverse", "here", "stringr",
                      "lubridate", "ggplot2", "usmap", "gganimate")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
invisible(lapply(list.of.packages, require, character.only = TRUE))
  1. Unzip the ETC5512-week07.zip file provided, and notice the file structure for this week looks like:
etc5512-week07
β”œβ”€β”€ tutorial-07.Rmd
β”œβ”€β”€ final data
β”‚   β”œβ”€β”€  2016_2022_data.rds
β”œβ”€β”€ raw data
β”œβ”€β”€ prep
β”‚   β”œβ”€β”€  00_read_data.R
β”‚   β”œβ”€β”€  01_import_data.R
β”‚   β”œβ”€β”€  02_combine.R
└── etc5512-week07.Rproj
  1. Import the data by loading the R datafile (.rds).

Note: This file has been pre-processed for you to work on the exercises in this tutorial.

This data contains the Fannie Mae Single Home Loan data from year 2016Q1 to 2022Q3.

alldata <- readRDS(here::here("final data", "2016_22_data.rds"))

πŸ•°οΈ Exercise 7A

Analyzing Data

  1. Create a table of default rate (based on 60, 90, 120 and 180 days of delinquency) by origination year and quarter. You might want to convert the ORIG_DTE to a date format first. Do you observe something peculiar from the table? Explain why.

  2. Draw a U.S. map to show the default rate based on 90 days delinquency just for the year of 2016. Which states are having the top 3 highest default rate?

  3. Draw a histogram to show the distribution of FICO Scores in different years. What do you notice?

  4. The way FICO score is being interpreted is very tricky. A small increase at higher score does not have the same meaning as a small increase at the lower score. Therefore, regrouping the variable is needed. Regroup the variable of FICO scores as below (you need to use the variable as your independent variable in Exercise 7B).

Credit.Score Rating
300-579 Poor
580-669 Fair
670-739 Good
740-799 Very Good
800-850 Exceptional

πŸ—οΈ Exercise 7B

The data that you have used so far is pre-processed from the raw data that you can get from Fannie Mae Single-Family Homeloan Data. Try the steps below to pre-process the data by yourself by just using the data from 2016Q1 to 2022Q3 as an exercise.

  1. Download the data
    • Go to this page. On the right panel, click Access the Data. Register for an account if you have not (it is free). Click Hp (historical loan credit performance data)/Download data/Choose the year 2021Q1 to 2021Q3. The downloaded data should be placed under the data -> raw folder.
  2. Convert each file to a smaller size
    • Open 01_import_data.R (under prep folder) and run the script.
    • Your output file should have _stat.csv at the end (located in raw folder). The code has been pre-written for you in the 00_read_data.R script.
    • Try to understand the code in the 00_read_data.R to see how the data is being converted from transactions data to single account level data.
  3. Combining data
    • Run 02_combine.R (Under prep folder. It may takes a while to load. You should get a 20 million records. Yes, 20 millions row!).
  4. Analyze data
    • Now the data is ready!
    • Use the data and replicate the plots in the lecture.
    • Answer the below questions:
      • Which state has the highest default rate?
      • What type of customers are more risky? Those have higher LMI or lower LMI?
      • Are those customers with higher interest rate at origination has higher chances of default?

Material maintained by Dr.Β Kate Saunders. Previous maintainer Dr.Β Joan Tan