I would like to come up with an investing strategy for Prosper, a peer-to-peer lending platform. Unlike Lending Club, Prosper has no gross income or net worth requirements for the state of Florida.
The data was offered by Udacity for a project in one of their courses. You can find it here or here. Although I did not take the course myself, I find the data very useful, indeed, and am very grateful to Udacity for making it available!
The Prosper dataset and the R notebook are here.
Going through the dataset dictionary here or here, I want to better acquaint myself with the data, since I have never borrowed or lent on a peer-to-peer lending platform.
Start by loading the data:
loans <- read.csv("prosperLoanData.csv")
How large is our dataset?
cat("We have", dim(loans)[1], "observations of", dim(loans)[2], "variables.")
We have 113937 observations of 81 variables.
We can use histograms and barplots to get an idea of how some of the key variables are distributed. The following code generates the histogram for the loan amounts.
library(ggplot2)
theme_set(theme_gray(base_size = 10))
loan_amount_hist <- ggplot(loans, aes(LoanOriginalAmount)) + geom_histogram(binwidth=750, fill="#E69F00", colour="black") + xlim(2500,25000)
loan_amount_hist <- loan_amount_hist + xlab("Amount borrowed (USD)") + ylab("Number of loans") + ggtitle("Histogram of loan amounts")
loans_median_amount <- median(loans$LoanOriginalAmount)
loan_amount_hist <- loan_amount_hist + geom_vline(xintercept=loans_median_amount)
loan_amount_hist <- loan_amount_hist + annotate("text", x = loans_median_amount + 4800, y = 17000, label = paste("Median =",loans_median_amount, "USD"), size=3)
We can do similarly for other variables. I will not show the code here as it’s repetitive.
We can plot multiple graphs in an array using Winston Chang’s multiplot().
multiplot(loan_amount_hist, loan_rate_hist, loan_income_hist, loan_credit_scores_hist, cols=2)
The median income of potential borrowers is pretty healthy at $56,000/year. The median credit score is pretty close to 700, and in fact almost half (~47%) of the records have a credit score of 700 or higher. Mind, this is using the lowest credit score reported, as given by the variable CreditScoreRangeLower. The median interest rate, at 18.4%, seems really high, but since the loans are typically for relatively short terms (see below) it might turn out to be not too onerous on borrowers. On a $6,500 loan (the median amount requested) at 18.4% paid over 3 years, the borrower will pay about $2,000 in interest in addition to the principal.
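That back-of-the-envelope figure can be checked with the standard amortization formula; the numbers below are the medians quoted above.

```r
# Monthly payment on principal P at monthly rate i over n months,
# using the standard amortization formula P*i / (1 - (1+i)^-n)
P <- 6500    # median amount requested (USD)
r <- 0.184   # median annual interest rate
n <- 36      # 3-year term, in months
i <- r / 12  # monthly rate

payment <- P * i / (1 - (1 + i)^-n)
total_interest <- payment * n - P
round(payment, 2)        # roughly $236/month
round(total_interest, 2) # roughly $2,000 in interest over the life of the loan
```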
Variables related to credit utilization:
Other variables of interest:
About 75% of loans are 3-year loans. Similarly, about 75% of applicants have no delinquencies on their current accounts, which we would expect, and about 75% of them have had their credit pulled two or fewer times in the last 6 months. Home ownership is evenly split among applicants.
The response variable, LoanStatus, has several outcomes:
table(loans$LoanStatus)
Cancelled Chargedoff Completed Current
5 11992 38074 56576
Defaulted FinalPaymentInProgress Past Due (>120 days) Past Due (1-15 days)
5018 205 16 806
Past Due (16-30 days) Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
265 363 313 304
We would like to use a binomial model, so we will label all the loans that are either “Completed”, “Current”, or in their final payment with a ‘1’; loans that were cancelled, charged off, defaulted on, or that are past due will be labelled with a ‘0’.
#Convert to character and back to factor
#http://programming-r-pro-bro.blogspot.com/2013/04/editingadding-factor-levels-in-r.html
loans$LoanStatus <- as.character(loans$LoanStatus)
#Usage of grepl to test if sub-string in string
#http://stackoverflow.com/questions/10128617/test-if-characters-in-string-in-r
loans$LoanStatus[loans$LoanStatus == "Cancelled" | loans$LoanStatus == "Chargedoff" | loans$LoanStatus == "Defaulted" | grepl("Past Due", loans$LoanStatus)] <- "0"
loans$LoanStatus[loans$LoanStatus == "Completed" | loans$LoanStatus == "Current" | loans$LoanStatus == "FinalPaymentInProgress" ] <- "1"
loans$LoanStatus <- as.factor(loans$LoanStatus)
table(loans$LoanStatus)
0 1
19082 94855
About 83% of the loans are in good status.
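That percentage can be read off the table directly (or computed with prop.table() on the table object):

```r
# Share of loans in good status, using the counts from the table above
good <- 94855
bad <- 19082
round(good / (good + bad), 3)  # 0.833
```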
We will base our choice of predictors on several factors. One is our familiarity with the data, e.g., we expect income and credit score to be good predictors of a borrower’s ability to repay a loan. We will leave out predictors that are likely to be heavily correlated with one we do include, e.g., keep CreditScoreRangeLower and leave out CreditScoreRangeUpper. We will also leave out predictors we are not familiar with, e.g., trades.
We will use the following predictors:
Term, the length of the loan: 12, 36, or 60 months
BorrowerRate, the interest rate of the loan
CreditScoreRangeLower, the borrower’s lowest reported credit score
DebtToIncomeRatio, self-explanatory
EmploymentStatusDuration, how long the borrower has been at their current job, in months
IsBorrowerHomeowner, self-explanatory, true or false
CurrentCreditLines, number of current credit lines at the time the credit profile was pulled
TotalCreditLinespast7years, number of credit lines in the past seven years at the time the credit profile was pulled
OpenRevolvingAccounts, number of borrower’s open accounts
OpenRevolvingMonthlyPayment, total monthly payment on open accounts
InquiriesLast6Months, number of credit inquiries in the last 6 months
TotalInquiries, total number of credit inquiries (all-time?)
CurrentDelinquencies, number of accounts delinquent at the time the credit profile was pulled
AmountDelinquent, total current dollar amount of delinquencies
DelinquenciesLast7Years, number of deliquencies in the last 7 years
PublicRecordsLast10Years, number of public records in the last 10 years
PublicRecordsLast12Months, number of public records in the last year
RevolvingCreditBalance, total dollar amount carried from one month to the next
BankcardUtilization, the percentage of available revolving credit that is utilized
AvailableBankcardCredit, total available credit via credit cards
IncomeVerifiable, whether borrower stated he/she had evidence to support the stated income
StatedMonthlyIncome, the borrower’s monthly income
Some of these predictors have missing data, several thousand of them in some cases. Rather than ignoring them, we try to impute them simply by sampling. I also tried the mice library, but it takes several minutes to run on a dataset as large as this one, even when I used the method ‘norm’.
#Replacing NA's in CreditScoreRangeLower:
#Getting number of NA's
num_NA <- length(loans$CreditScoreRangeLower[is.na(loans$CreditScoreRangeLower)])
#Sampling CreditScoreRangeLower
set.seed(123)#reproducibility
sample_vector <- sample(x=loans$CreditScoreRangeLower[!is.na(loans$CreditScoreRangeLower)], size=num_NA)
#Replace
loans$CreditScoreRangeLower[is.na(loans$CreditScoreRangeLower)] <- sample_vector
We can do similarly for the other predictors that have missing data. It’s pretty repetitive, so it will not be shown here.
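The repeated steps can be wrapped in a small helper. Note that impute_by_sampling is a name introduced here for illustration, not from the original code; it samples with replacement, which also covers the (unlikely) case of more NA’s than observed values.

```r
# Fill NA's in a vector by sampling from its observed (non-NA) values
impute_by_sampling <- function(x) {
  missing <- is.na(x)
  x[missing] <- sample(x[!missing], size = sum(missing), replace = TRUE)
  x
}

set.seed(123)  # reproducibility
out <- impute_by_sampling(c(680, NA, 720, 700, NA))
out  # the two NA's are replaced by values drawn from 680, 720, 700
```

On the real data this would be applied column by column, e.g. loans$DebtToIncomeRatio <- impute_by_sampling(loans$DebtToIncomeRatio).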
We can try to see what happens if we ask R to generate the logistic regression model using glm().
predictors_vector <- c("Term", "BorrowerRate", "CreditScoreRangeLower", "DebtToIncomeRatio", "EmploymentStatusDuration", "IsBorrowerHomeowner", "CurrentCreditLines", "TotalCreditLinespast7years", "OpenRevolvingAccounts", "OpenRevolvingMonthlyPayment", "InquiriesLast6Months", "TotalInquiries", "CurrentDelinquencies", "AmountDelinquent", "DelinquenciesLast7Years", "PublicRecordsLast10Years", "PublicRecordsLast12Months", "RevolvingCreditBalance", "BankcardUtilization", "AvailableBankcardCredit", "IncomeVerifiable", "StatedMonthlyIncome")
predictors_vector_sum <- paste(predictors_vector, collapse="+")
logReg_model_fmla <- as.formula(paste("LoanStatus ~ ", predictors_vector_sum))
logReg_model <- glm(logReg_model_fmla, data=loans, family=binomial)
glm.fit: fitted probabilities numerically 0 or 1 occurred
R returns a warning: glm.fit: fitted probabilities numerically 0 or 1 occurred. This has to do with being able to perfectly separate the response based on the value of the predictors. In fact, if we try to generate a model using just the predictor StatedMonthlyIncome, we get the same warning.
temp_model <- summary(glm(LoanStatus ~ StatedMonthlyIncome, data=loans, family=binomial))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Among the possible solutions to this issue, binning the predictor values seems like the easiest thing to do.
Let’s start with StatedMonthlyIncome. We can try to split StatedMonthlyIncome into $1,000 bins using cut().
Monthly_Income_Ranges <- c("$0 - $1,999", "$2,000 - $2,999", "$3,000 - $3,999", "$4,000 - $4,999", "$5,000 - $5,999", "$6,000 - $6,999", "$7,000 - $7,999", "$8,000 - $8,999", "$9,000 - $9,999", "$10,000 - $10,999", "$11,000 - $11,999", "$12,000 - $12,999", "$13,000 - $13,999", "$14,000 - $14,999", "$15,000 or more")
loans$StatedMonthlyIncome <- cut(loans$StatedMonthlyIncome, c(-Inf, seq(2000,15000,1000), Inf), labels=Monthly_Income_Ranges, right=F)
We can bin other variables as well. I will skip the actual code since it’s very similar to what we just did for StatedMonthlyIncome. The following variables will be binned: InquiriesLast6Months, TotalInquiries, CurrentDelinquencies, AmountDelinquent, DelinquenciesLast7Years, PublicRecordsLast10Years, and PublicRecordsLast12Months.
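As an illustration on a toy vector, here is what the binning could look like for InquiriesLast6Months; the cut points and labels are my own choices, since the original bins are not shown.

```r
# Bin a count variable into a few ranges. With right = FALSE the intervals
# close on the left: (-Inf,1), [1,3), [3,6), [6,Inf), matching the labels.
inquiries <- c(0, 1, 2, 5, 9, 14)
binned <- cut(inquiries, breaks = c(-Inf, 1, 3, 6, Inf),
              labels = c("0", "1 - 2", "3 - 5", "6 or more"), right = FALSE)
binned
```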
We can try again:
logReg_model <- glm(logReg_model_fmla, data=loans, family=binomial)
We no longer get the glm.fit warning.
Using summary(logReg_model), we note that some of the predictors were found to be not statistically significant in determining the probability of a loan being repaid. These were PublicRecordsLast12Months, AmountDelinquent, TotalCreditLinespast7years, and DelinquenciesLast7Years. This might be because they are related to other predictors in the model and thus add little value to it. Let’s remove them from our model formula:
remove_predictors <- c("PublicRecordsLast12Months", "AmountDelinquent", "TotalCreditLinespast7years", "DelinquenciesLast7Years")
predictors_vector <- setdiff(predictors_vector, remove_predictors)
predictors_vector_sum <- paste(predictors_vector, collapse="+")
logReg_model_fmla <- as.formula(paste("LoanStatus ~ ", predictors_vector_sum))
Let’s split the data into training and test sets using the caTools library:
library(caTools)
set.seed(144)
spl = sample.split(loans$LoanStatus, 0.7)
train_set = subset(loans, spl == TRUE)
test_set = subset(loans, spl == FALSE)
Training the model on the split dataset, we can tabulate the confusion matrix by evaluating the model performance on the test set, using a threshold of 0.5 for labelling the loans:
logReg_model <- glm(logReg_model_fmla, data=train_set, family=binomial)
# Evaluate the performance on test dataset
# test_set$predicted.risk is a vector of probabilities
# According to our model, the higher the probability the more likely the loan will be good
test_set$predicted.risk = predict(logReg_model, newdata=test_set, type="response")
#Confusion Matrix using a threshold of 0.5
threshold = 0.5
confusion_matrix <- table(test_set$LoanStatus, as.numeric(test_set$predicted.risk >= threshold))
confusion_matrix
0 1
0 1036 4689
1 692 27765
The rows in the confusion matrix represent the “truth”. We have 1036 + 4689 = 5725 total bad loans, of which 4689 were labelled, incorrectly, as good loans. That’s over 80% of the bad loans labelled as good: this is our False Positive Rate, or FPR. Similarly, there are 692 + 27765 = 28457 total good loans, of which 27765 were labelled, correctly, as good loans. Over 97% of the good loans were labelled correctly: this is our True Positive Rate, or TPR. We will see shortly how the choice of threshold allows us to trade FPR against TPR.
Using the confusion matrix, we can compute the accuracy of our model on the test set as the ratio of loans that were labelled correctly divided by the total number of loans. Here we have 1036 + 27765 = 28801 loans that were labelled correctly, out of a total of 34182 loans in the test set, so the accuracy is 0.842578.
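In code, with the counts read off the confusion matrix above:

```r
# Rebuild the confusion matrix and compute accuracy from its diagonal
cm <- matrix(c(1036, 692, 4689, 27765), nrow = 2,
             dimnames = list(truth = c("0", "1"), predicted = c("0", "1")))
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 6)  # 0.842578
```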
The ROC (Receiver Operator Characteristic) curve plots the True Positive Rate (TPR) versus the False Positive Rate (FPR). The TPR is the fraction of good loans in the test set that were labelled correctly (1), whereas the FPR is the fraction of bad loans that were labelled, incorrectly, as good. It allows us to visualize how the choice of threshold allows us to trade off TPR with FPR.
library(ROCR)
ROCRpred = prediction(test_set$predicted.risk, test_set$LoanStatus)
# Performance function
ROCRperf = performance(ROCRpred, "tpr", "fpr")
# Plot ROC curve
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
grid()
We can compute the area under the ROC curve, or AUC:
#AUC
AUC <- as.numeric(performance(ROCRpred, "auc")@y.values)
AUC
[1] 0.7698326
We want the AUC to be significantly larger than 0.5, which is what random guessing would give.
We can see that using a threshold of 0.5 to label the loans would just about “catch” all the good loans, since the TPR is almost 1.0, but it also means we have to eat more than 80% of the bad loans. A threshold of 0.8 or even 0.7 would lead to catching 80% or more of the good loans while discarding between 40% and 60% of the bad ones. Let’s try using a threshold of 0.8:
threshold2 = 0.8
confusion_matrix2 <- table(test_set$LoanStatus, as.numeric(test_set$predicted.risk >= threshold2))
confusion_matrix2
0 1
0 3385 2340
1 5960 22497
We can see that choosing a threshold of 0.8 means we would pass on, i.e., not invest in, 5960 good loans. Our TPR is lower: 79%, versus more than 97% with a threshold of 0.5. However, the FPR is much better now, as we are only mislabelling ~40% of the bad loans, versus more than 80% with the 0.5 threshold. Although our accuracy has dropped from 84% to about 75%, I think using a threshold of 0.8 or 0.7 is preferable to using 0.5.
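The rates quoted above can be reproduced from the two confusion matrices (rows are truth, with 0 = bad and 1 = good); the helper rates() is introduced here for convenience, not from the original code.

```r
# TPR = fraction of good loans labelled good;
# FPR = fraction of bad loans labelled good
rates <- function(cm) c(TPR = cm["1", "1"] / sum(cm["1", ]),
                        FPR = cm["0", "1"] / sum(cm["0", ]))

cm_05 <- matrix(c(1036, 692, 4689, 27765), nrow = 2,
                dimnames = list(c("0", "1"), c("0", "1")))
cm_08 <- matrix(c(3385, 5960, 2340, 22497), nrow = 2,
                dimnames = list(c("0", "1"), c("0", "1")))
round(rates(cm_05), 3)  # threshold 0.5: TPR 0.976, FPR 0.819
round(rates(cm_08), 3)  # threshold 0.8: TPR 0.791, FPR 0.409
```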
We can use the caret package to perform k-fold cross-validation. The data is split into k folds, and the model is trained on k-1 folds of the data and tested on the remaining fold. This is done k times, once for each fold. caret computes and reports the accuracy of the model each time. We can compare the accuracies with the 70-30 split of the data we did earlier.
For logistic regression, caret uses a hardwired threshold of 0.5 to compute the accuracy. Although we cannot change it, we can cross-check the accuracy numbers it reports with what we got earlier using a threshold of 0.5.
Experience has shown that 10-fold cross-validation gives the best estimate of error, so we will use k=10 folds.
library(caret)
fitControl = trainControl( method = "cv", number = 10, p = 0.70 )
logReg_model_cv <- train(logReg_model_fmla, data=train_set, method = "bayesglm", trControl = fitControl)
logReg_model_cv$resample$Accuracy
[1] 0.8450351 0.8427586 0.8420063 0.8454112 0.8429897 0.8440321 0.8416499 0.8410231 0.8437813
[10] 0.8466458
We see that the model accuracy is about 84% each time, which matches what we did earlier using a 0.5 threshold.
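Averaging the fold accuracies confirms this (the numbers are copied from the resample output above):

```r
# Mean cross-validated accuracy over the ten folds
acc <- c(0.8450351, 0.8427586, 0.8420063, 0.8454112, 0.8429897,
         0.8440321, 0.8416499, 0.8410231, 0.8437813, 0.8466458)
mean(acc)  # ~0.8435, in line with the test-set accuracy at a 0.5 threshold
```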
R Notebooks. Retrieved from http://rmarkdown.rstudio.com
Prosper loan data. Retrieved from https://docs.google.com
Prosper loan data - variable definitions. Retrieved from https://docs.google.com
Lending Club State and Financial Suitability conditions. Retrieved from https://help.lendingclub.com
Prosper State Financial Suitability Requirements. Retrieved from https://www.prosper.com
Chang, Winston. Cookbook for R. Sebastopol: O’Reilly Media, 2013. Retrieved from http://www.cookbook-r.com/
MaikelS. Set NA to 0 in R. Retrieved from http://stackoverflow.com
mike and smu. Test if characters in string in R. Retrieved from http://stackoverflow.com
Shreyes. Editing/Adding factor levels in R. Retrieved from http://programming-r-pro-bro.blogspot.com
Harding, Ted. glm.fit: fitted probabilities numerically 0 or 1 occurred. Retrieved from https://stat.ethz.ch
user333 and Scortchi. How to deal with perfect separation in logistic regression? Retrieved from http://stats.stackexchange.com
grautur and juba. How to succinctly write a formula with many variables from a data frame? Retrieved from http://stackoverflow.com
Bertsimas, D., O’Hair, A. The Analytics Edge. Spring 2014. edX.org.
jeannot and Luciano Selzer. How to change the default font size in ggplot2. Retrieved from http://stackoverflow.com
Dail and Brian Diggs. How to delete multiple values from a vector? Retrieved from http://stackoverflow.com
Witten, I., Frank, E., Hall, M. Data Mining: Practical Machine Learning Tools and Techniques, Third Edition. Burlington: Elsevier, 2011. PDF.
Çetinkaya-Rundel, M. Data Analysis and Statistical Inference. Spring 2014. www.coursera.org.