Introduction

Peer-to-peer(P2P) lending (and borrowing) is a new method of financing that has become immensely popular with in the small amount of time it has been around for. Prosper is the first P2P company in the US and now manages more than $7 billion dollars in loan amounts and is one of the leading P2P lending companies.

I’ve used a subset of Prosper’s loan data to analyze different variables, and provide my observations and reflection.

Univariate Analysis

Structure of the dataset

There are 113,937 loans in the dataset with 81 features (LoanOriginalAmount, LoanOriginationDates, BorrowerRate, LenderYield, etc., being some of them).

## [1] 113937     81
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

By looking at the structure of the data, I observed that the total number of records in the data set and the number of listing keys don’t match which could mean there are duplicates in the data set.

dup_pl<- pl[duplicated(pl$ListingKey) | duplicated(pl$ListingKey, fromLast=TRUE),]

length(dup_pl)
## [1] 81

The length of the duplicates’ data frame seems to make sense as it adds up to the total number of records in the original data set. I still want to take a closer look at the duplicates.

What I found interesting is that the data of these duplicate records are the same except for the ProsperScore field, which are different.

group_by(dup_pl, ListingKey) %>%
  summarise(.,duplicates = n()) %>%
  ggplot(aes(duplicates)) +
    geom_bar() +
    #scale_y_sqrt() +
    geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
    theme(text = element_text(size=8.5))

It looks like the ListingKey was not only repeated twice but upto 6 times for a particular loan. 3 times for about 32 of them and 4 times for 4 loans.

Let’s look at the prosper scores or analyze how they could be assigned a little later.

Univariate Plots

#Creating a function to plot some basic invariate plots
basicuniplots <- function(df, v, xlab, ylab = "Number of Listings") {
  
  ggplot(aes(v), data = df) + 
    geom_bar(stat = 'Count') +
    labs (x = xlab, y = ylab) +
    theme(text = element_text(size=8.5))
  
}  

Since I’ll end up plotting a lot of bar plots in this section, I created a function to plot them.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The minimum loan amount taken out is $1000 and the maximum is $35,000. Although the loan range is wide, most of the loans are below $10,000 with the mean and median sitting at $8337 and $6500 respectively.

The data set contains loan details right from the year 2006 and the company seems to have had a decent growth till late 2008 when the loans given out stalled for close to an year and resumed mid to late 2009. I wonder if the loan original amount will have any insights for the gaps in time.

I have created a new variable here called CumulativeLoanAmount that cumulatively adds the LoanOriginalAmount based on the LoanOriginationDate. The total loan amount given out through Prosper is about $950 Million. It is interesting to see that the company has had a nice growth right from the beginning with a linear growth till late 2008 and an exponential growth since 2010. It is also interesting to see a flat patch during 2009 and is in sync with the previous plot.

Some online research around this period threw some light on the fact that Prosper Marketplace had changed its business model during that period from a variable rate model, with rates set by lenders and borrowers to a pre-set rate model with Propser using its own formula in determining a borrower’s rate based on their creditworthiness.

The very next detail I wanted to look at is how much Prosper is getting back on thse loans. The plot above is a cumulative customer payments plot and seems to be in sync with the total loan amounts.

##           A    AA     B     C     D     E    HR 
## 29084 14551  5372 15581 18345 14274  9795  6935

There are quite a few loans with no prosper rating and I think these could be from the time when Prosper was on their old business model.

ggplot(data = pl[pl$ProsperRating..Alpha. == "",], aes(LoanOriginationDate)) + 
  geom_histogram(stat = "bin", binwidth = 20) +
  labs (x = "Loan Origination Date",y = "Number of Listings") +
  theme(text = element_text(size=8.5))

Above is a plot to confirm the same. I wonder if they still had the credit scores on file for these borrowers and it looks like they have most of their credit scores recorded and just did not calculate their Prosper Rating.

unique(pl[pl$ProsperRating..Alpha. == "","CreditScoreRangeLower"])
##  [1] 640 480 760 620 680 520 660 600 580 540 720 820 560 740 700 780 500
## [18] 840  NA 460 800   0 860 440 420 880 360
##    12    36    60 
##  1614 87778 24545

While Prosper offers loans at 3 different loan terms, the most opted for loan term is the 36-month loan term. Although this feature is a discrete feature in the current business model, the data set stores it as a continuous feature. To make the plot look better and for future use in other plots, this feature is converted to a factor with 3 levels.

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

It looks like most of the loans are in good standing as a huge portion of the loans are either current or completed. I have created another feature that simply notes if a loan is in good standing or not based on its loan status (assuming current and completed loans to be good).

Let us look at some borrower details

It is very clear that California, Texas and New York have the most number of listings created for loans.

This plot is very useful for both borrowers and investors alike. It is important to note that borrowers with any kind of job have a higher probability of getting a loan with Prosper.

## False  True 
## 56459 57478

This doesn’t really seem to be a major factor in getting a loan as both home owners and renters have almost equal number of loans. Although, it would be interesting to see if this variable plays any role in deciding the interest rate/APR.

## 
##               Auto    Baby & Adoption               Boat 
##               2572                199                 85 
##           Business Cosmetic Procedure Debt Consolidation 
##               7189                 91              58308 
##    Engagement Ring        Green Loans   Home Improvement 
##                217                 59               7433 
## Household Expenses    Large Purchases     Medical/Dental 
##               1996                876               1522 
##         Motorcycle      Not Available              Other 
##                304              16965              10494 
##      Personal Loan                 RV        Student Use 
##               2395                 52                756 
##              Taxes           Vacation      Wedding Loans 
##                885                768                771

I added a new variable for the ListingCategory descriptions as the data set only has their numeric counterparts. I found the mapping available in the Variable Description document for Prosper.

Debt consolidation seems to be the #1 reason for loans on prosper.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.653  15.630  20.980  21.880  28.380  51.230      25

The data set contains both, borrower APR and interest rate and since APR includes additional fees on top of the preset interest rate, I wanted to look at the Borrower APR instead of interest rate as this is what the borrower would end up paying.

The plot seems to sit between an uniform curve and a bimodal curve with most of the loans having an APR between 12% and 25%. It would be intersting to investigate more on the two big peaks at 30% and 36%. The peak could be due to the less creditworthy people at those rates.

Borrowers with a decent income have the most number of loans. No surprises here.

What’s in it for the investor?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.00   12.42   17.30   18.27   24.00   49.25

This looks like a great news for investors with mean lender yield at 17% and the max in this data set at 49.25%. This doesn’t necessarily come without any risks.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00

On average, a funded loan seems to have about 80-81 investors depending on the loan amount.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.04803  0.00000 39.00000

Very few borrowers seem to have recommendations.

I wonder if borrowers had friends who lended them money.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.02346  0.00000 33.00000

Similar results here. Very few borrowers seems have benefitted from a loan by their friends.

Main observations from Univariate Analysis:

  • There are only about 113066 listings in the data set.
  • For some, Prosper has multiple prosper scores for the same listings.
  • Number of loans has been constantly growing, linearly till mid 2009 followed by a sshort stall and exponentially since then.
  • Loan amounts vary from $1000 all the way up to #35,000.
  • There are 3 loan terms avaialble, 12, 36 and 60 month terms with 36-month being the most popular.
  • Till mid-2009 Prosper wasn’t calculating borrower’s credit score or rating, although it noted their credit scores for most of them.
  • Most of the loans are in good standing.
  • Most number of loans were taken out for debt consolidation.
  • California, Texas and New York have the most number of listings
  • Most of the borrowers have a job with a decent income

Bivariate Analysis

For ease of use, I created a variable called MeanCreditScore which gets the mean between the upper and lower ranges of the borrower’s credit score.

As it was observed earlier that for the years prior to 2010, Prosper was not calculating borrower’s prosper rating, we have quite a few missing ratings. Lets consider the current business model and filter out the records with missing prosper rating.

This makes perfect sense. Higher the credit score, better the rating except for HR rating.

Lets add Prosper Score to this plot and analyze a little later.

On the whole, it could be said that the borrower APR increases as the borrower’s credit score decreases, but there are a lot of outliers in the plot. While the credit score seems to be an important aspect in deciding a borrower’s APR, it is definitely not the only factor involved as it could also be gathered from the rest of this analysis.

## pl$ProsperRating..Alpha.: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.990   6.790   6.911   7.450  20.000 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.98    8.90   10.19   10.29   11.39   20.50 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.93   13.14   14.09   14.44   15.39   34.00 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.95   16.65   18.14   18.44   19.99   34.00 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.57   21.87   23.92   23.64   25.25   34.00 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.79   26.12   28.25   28.33   30.49   34.00 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.79   30.34   30.77   30.73   30.77   34.00

The relationship between lender yield and prosper rating looks as expected. Better the prosper rating, lower the yield and vice versa. High lender yield might also mean higher risk. More about this later.

There seems to be no direct relation between debt to income ratio and borrower rate.

Lets take a closer look at the plot.

To look at the bulk of the points, I limited debt to income ration to less than or equal to 1. There are numerous points spread out and there doesn’t seem to be a correlation between them. A linear model with formula y~x seems to linearly increase with debt albeit with a minimal slope and a lot of points that are not anywhere close to the line.

## pl$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.090   0.160   0.297   0.260  10.010     124 
## -------------------------------------------------------- 
## pl$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.010   0.160   0.295   3.328  10.010  10.010     728 
## -------------------------------------------------------- 
## pl$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     621 
## -------------------------------------------------------- 
## pl$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.020   0.190   0.320   0.737   0.500  10.010     913 
## -------------------------------------------------------- 
## pl$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1700  0.2600  0.2789  0.3600  7.9000    2311 
## -------------------------------------------------------- 
## pl$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1600  0.2300  0.2457  0.3200 10.0100    1690 
## -------------------------------------------------------- 
## pl$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1400  0.2000  0.2137  0.2800  2.5500     901 
## -------------------------------------------------------- 
## pl$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1200  0.1700  0.1806  0.2300 10.0100    1266

The income ranges with the least debt to income ratio is $100,000+ and $75,000-99,999. Although $100,000+ has a lower mean and median debt to income ratio, there seems to be very few outliers at 10.01.

## pl$ProsperRating..Alpha.: AA
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05249  0.00000 10.00000 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1539  0.0000 21.0000 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2337  0.0000 51.0000 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2875  0.0000 21.0000 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4203  0.0000 21.0000 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5987  1.0000 32.0000 
## -------------------------------------------------------- 
## pl$ProsperRating..Alpha.: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5857  1.0000 22.0000

As expected, there seems to be a pattern here. Borrowers with more delinquencies have poorer credit rating assigned by Prosper. And as the saying goes, “Correlation doesn’t mean causation”.

I’ve created a variable called main loan status that classifies whether a loan is performing well or not based on the variable Loan Status in the dataset. It looks like more and more loans recently are current or completed or not late in payments yet.

## pl$MainLoanStatus: Paid/Current
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   6.000   6.017   8.000  11.000   18410 
## -------------------------------------------------------- 
## pl$MainLoanStatus: Unpaid/Delayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   5.000   5.338   7.000  11.000   10674

Prosper Score seems to be a decent measure and predictor of a loan turning good or bad. Borrowers with better prosper scores have loans in good status and vice versa.

Main observations from Bivariate Analysis:

  • Borrower’s credit scores plays a major role in deciding prosper’s ratings/score.
  • Borrower APR, Lender yield are closely dependant on prosper’s credit rating.
  • Borrower APR spikes at around 36% because of the borrowers with poor prosper credit ratings.
  • Compared to the initial stages of Prosper, bad loans have gone down in the recent times.
  • Debt to income ratio is the highest for unemployed borrowers and least for borrowers with income above $75,000.
  • Prosper score is a decent measure of good and bad loans.

Multivariate Analysis

The correlation matrix revealed a few surprising details. There doesn’t seem to be a strong relationship between Borrower APR or Lender Yield and other variables including mean credit score. Interest rate of a borrower seems to be decided by Prosper based on a lot of borrower variables.

The estimated return seems to be more dependant on the prosper rating than the original loan amount. Although, it is also clear that only borrowers with good credit rating get higher loan amounts.

I used a subset of dataset that has ‘completed’ loans for this plot and as I hoped to see, the payments curve is well above the loan amount curve. But, this might not mean that all the investors have made a profit on these loans. I wanted to look at those loans and to my surprise, I found quite a few investors that have not received any payments from customers on completed loans. Or, there could be bad or missing data in the data set.

##   LoanOriginationDate LoanOriginalAmount LP_CustomerPayments
## 1          2009-05-14               2000                   0
## 2          2009-05-07               1500                   0
## 3          2009-05-07               3000                   0
## 4          2009-05-13               1000                   0
## 5          2009-05-13               1000                   0
## 6          2009-05-12               1000                   0

Here are a few listings that made 0 payments towards the loan and are still considered ‘Completed’. One thing they have in common is that they all seemt to be in 2009. Either the borrowers had no intention of paying the money back or this is missing data as this is from Prosper’s old business model.

I wanted to see if the original loan amount value made a difference to the estimated return. I created a variable called Returns to visually represent the estimated return below 0. Estimated return has a very poor correlation with the original loan amount.

By looking at the two plots above, it is clear that the estimated return is not completely in sync with the actual returns lenders earn.

So, I wanted to plot both estimated and actual returns against each other.

## 
##  Pearson's product-moment correlation
## 
## data:  pl$EstimatedReturn and pl$ActualReturn
## t = 114.72, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3605978 0.3722479
## sample estimates:
##       cor 
## 0.3664372

The way actual returns are calculated, it only makes sense to look at the loans that have been completed. So, in this plot, I’m only looking at the loans that have been completed and I still see some bad returns. I’m guessing these are from the old business model (pre-2010).

In the Paid/Current facet of the plot, Prosper was never right in estimating the bad loans. I’m looking at the negative x and y axis for this observation.

Looking at the plot above, it is clear that the actual return is not even close to the estimated return. Prosper clearly is a little more conservative with their estimated return. This is definitely a good news for investors.

## 
##  Pearson's product-moment correlation
## 
## data:  pl$CurrentDelinquencies and pl$BorrowerAPR
## t = 50.846, df = 113240, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1437046 0.1550933
## sample estimates:
##       cor 
## 0.1494039

Since current delinquencies has a good linear relation with Prosper Rating and Prosper Rating has a good relation with borrower APR, I was expecting to see a stronger linear correlation between current delinquencies and borrower AP but, that doesn’t seem to be the case with a mere 0.15 correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  pl$AmountDelinquent and pl$BorrowerAPR
## t = 21.461, df = 106310, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05969169 0.07166197
## sample estimates:
##        cor 
## 0.06567919

Borrower’s amount delinquent seems to have a very poor correlation with APR. Current delinquency matters but not the amount apprently in deciding a borrower’s APR.

Main observations from Multivariate Analysis:

  • Original loan amount almost has not effect on how the loan would turn out.
  • For some, Prosper has multiple prosper scores for the same listings.
  • There seem to be a few listings with 0 customer payments.
  • The returns at least on the completed loans have been growing along with the loan amount.
  • Estimated return is not a very accurate estimate. It is a more generic estimate.
  • Debt to Income ratio is higher for unemployed and lower income borrowers.
  • A combination of borrower’s prosper rating, mean score, income range and debt to income ratio have a strong effect on the returns, rate of returns and risk of loss.

Final Plots and Summary

Plot 1

Description 1

The plot shows the total amount of loans issued distributed geographically throughout US. Since this is an interactive map, I have also added the number of listings, the average debt to income ratio, monthly income and borrower APR for each state to the hover window.

Plot 2

Description 2

This plot shows the relationship between the borrower’s mean credit score and APR for different loan terms. I have also added a linear model for each term with formula, y = x^2. It shows that borrowers with higher credit scores have lower APR and there by interest rate and borrowers with lower credit scores have higher APR.

Plot 3

Description 3

This plot shows the relationship between the borrower’s debt to income ratio, prosper rating and lender’s yield on those listings. Lender’s yield is clearly dependant on both, debt to income ratio and prosper rating. The higher the debt to income ratio, worse the prosper rating and higher the lender yield and vice versa.

Reflection

The Prosper loan data set has 113,937 loan listings and 81 variables from Nov 2005 - March 2014. I was able to analyze around 20 variables over these 9 areas and also created a few additional variables to help in the analysis. The data is rich with information on borrower, credit, debt, investor, loan and all of their geographic data.

One main problem that I faced with a lot of variables is overplotting. In a lot of plots, the amount of data and variation in the data overshadowed a trend that may or may not exist. Some of the aternatives to overcome this situation is to sample the data or cluster the data and find patterns or trends within the clusters. Machine learning might be able to solve or some of these problems. The amount and types of variables in the data set have introduced me to various styles and best practices of plotting certain variables.

There is still a lot of room for analysis and hopefully build models to predict things like:

  1. Accurate estimated return based on the type of borrower than a generic estimated return
  2. Accurate estimated loss
  3. Which borrowers are more likely to default
  4. Will the borrowers be able to pay back with a more conservative payment plan

Although, there were too many variables and more than 100k records available, considering this to be 9 years worth of data for US wide loans, it is a very small data set and having additional data could help strength some patterns, trends or relations that are not strongly visible now.

References

  1. Prosper Marketplace
  2. Prosper Wikipedia
  3. R Markdown Cheatsheet
  4. Choropleth Maps in R - Plotly
  5. ggplot2 Reference
  6. List of U.S. states