# Conclusion
This report was an exploration of the LendingClub dataset. The result should be seen as unsatisfactory: the final model clearly extracts relevant information and classifies loan applications in a way that broadly mirrors the ratings proposed by LendingClub, but it is too imprecise to be relied on in any way.

One question, which is ultimately the only relevant question, has not been answered: is the interest rate proposed by LendingClub high enough?
This report also has clear limitations:
+ Faced with such a large dataset, we took a very long time to settle on a tractable question; exploring the data led down many blind alleys of little interest. See the next section for further explanation.^[But there is no Goldilocks dataset: they are always too small or too big.]
+ As a practical tool, our approach would not work in real life: we have no data on rejected loans. Using this model to accept or reject loans would require more work (for example, using reject inference).
+ We only considered probabilities of default.
There are numerous further avenues to explore:
+ Address the unbalanced dataset by resampling the training data, for example by down-sampling the majority class of good loans (see the resampling sketch after this list).
+ Use PCA (albeit on an extract of the dataset) to reduce the number of variables (see the PCA sketch after this list).
+ Model the time-to-default to also provide guidance on the maximum term of accepted loans, perhaps with some sort of multivariate Poisson process (if such distributions exist); a survival-model sketch follows this list.
+ Extend to loss given default (LGD) by counting good dollars and bad dollars instead of good loans and bad loans, an approach suggested in a report by SAS [@Miner_sasglobal].
+ Improve the regularisation of the model parameters.
+ Explore economic cycles/conditions as additional inputs.
+ We used the entire dataset to estimate the impact of time as a polynomial curve. This would not be acceptable for assessing future loans; only an online algorithm should be used, for example ARMA/ARIMA models or Kalman filtering of the time-trend trajectory (see the sketch after this list).
+ The size of the dataset makes it difficult to apply other techniques, but with stochastic methods, other possible models include:
    - Tree models, which come in numerous variations (CART generally; single or aggregated/boosted decision trees specifically); a CART sketch follows this list.
    - Neural networks.
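
The resampling idea above could start from something as simple as down-sampling the majority class. A minimal sketch, assuming a hypothetical data frame `loans` with a factor column `default` taking values `"good"` and `"bad"`:

```{r imbalance-sketch, eval=FALSE}
set.seed(42)

# Split by the (hypothetical) class label.
bad  <- loans[loans$default == "bad", ]
good <- loans[loans$default == "good", ]

# Down-sample the majority class ("good") to the size of the minority class.
good_sample <- good[sample(nrow(good), nrow(bad)), ]

# Balanced training set: as many good loans as bad loans.
balanced <- rbind(bad, good_sample)
```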
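Similarly, the PCA suggestion could be tried on a random extract of the numeric predictors; `loans_num` below is an assumed numeric-only version of the dataset, and the extract size is arbitrary:

```{r pca-sketch, eval=FALSE}
set.seed(42)

# PCA on a manageable random extract of the (hypothetical) numeric predictors.
idx     <- sample(nrow(loans_num), 50000)
pca_fit <- prcomp(loans_num[idx, ], center = TRUE, scale. = TRUE)

# Number of components needed to explain 90% of the variance.
var_explained <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)
n_comp <- which(var_explained >= 0.90)[1]
```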
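For time-to-default, the multivariate Poisson process mentioned above is speculative; a standard survival model such as Cox proportional hazards would be a simpler starting point. A minimal sketch, assuming hypothetical columns `months_to_event` (loan age at default or censoring) and `defaulted` (1 = default, 0 = censored), plus a few LendingClub predictors:

```{r survival-sketch, eval=FALSE}
library(survival)

# Cox proportional hazards model of time-to-default (column names assumed).
fit <- coxph(Surv(months_to_event, defaulted) ~ int_rate + grade + dti,
             data = loans)
summary(fit)
```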
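For the online treatment of the time trend, one possibility among those listed is an ARIMA model refitted as new months of data arrive. A sketch, assuming a hypothetical `ts` object `monthly_default_rate` and an arbitrary ARIMA(1,1,1) specification:

```{r arima-sketch, eval=FALSE}
# ARIMA(1,1,1) fit on the (hypothetical) monthly default-rate series.
fit <- arima(monthly_default_rate, order = c(1, 1, 1))

# One-step-ahead forecast of the time trend for newly issued loans.
predict(fit, n.ahead = 1)
```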
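Finally, the tree-model avenue could start from a plain CART fit, for example with `rpart`, reusing the hypothetical `balanced` training set from the resampling sketch above:

```{r cart-sketch, eval=FALSE}
library(rpart)

# Single classification tree (CART) on the balanced training set.
tree_fit <- rpart(default ~ ., data = balanced, method = "class")
printcp(tree_fit)  # complexity-parameter table, useful for pruning
```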