Technologies used: Python, pandas, scikit-learn (sklearn), Jupyter.
This project, ‘Mortgage Risk Analysis’, consists of a thorough analysis of the dataset and the prediction of mortgage defaults. One-dimensional analysis revealed missing values, which were replaced by the mean of the respective feature. Its main finding is that the dataset is skewed (biased toward the dominant label). Two-dimensional analysis showed the correlations between the features in the dataset. The interest rate is related to default: the higher the interest rate, the more likely the borrower is to default. Macroeconomic factors such as GDP rate and HPI do not significantly affect mortgage risk, whereas borrower- and loan-level factors such as FICO score, LTV ratio, and time to maturity have a considerable effect. This suggests that the dataset consists of records of ordinary residential borrowers, since commercial mortgages would be expected to depend significantly on macroeconomic factors.
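The analysis steps described above can be sketched roughly as follows. This is a minimal illustration on a made-up toy table, not the project's actual notebook: the column names (`interest_rate`, `fico_score`, `default`) and all values are assumptions chosen only to show mean imputation, a label-balance check, and correlation with the default label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "interest_rate": [3.5, 4.0, np.nan, 6.5, 7.0, 5.0],
    "fico_score": [760, 720, 700, np.nan, 600, 680],
    "default": [0, 0, 0, 1, 1, 0],
})

# One-dimensional analysis: count missing values per feature.
missing_counts = df.isna().sum()

# Replace missing values with the mean of the respective feature.
features = df.columns.drop("default")
df[features] = df[features].fillna(df[features].mean())

# Label balance: the share of each label shows how skewed the dataset is.
class_share = df["default"].value_counts(normalize=True)

# Two-dimensional analysis: Pearson correlation of each feature with the label.
corr_with_default = df.corr()["default"].drop("default")

print(missing_counts)
print(class_share)
print(corr_with_default)
```

On real data the same calls apply unchanged; only the column names would differ.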
The findings of the analysis guide the pre-processing. Features that the analysis shows to be insignificant are removed using techniques such as backward elimination and forward selection. Highly correlated features are merged using principal component analysis (PCA). The skewness of the dataset is its most important property. It is addressed by upsampling, which creates additional (dummy) records for the less dominant labels.
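A minimal sketch of the upsampling and PCA steps, assuming scikit-learn's `resample` and `PCA` are used (the text does not name the exact calls). The data is synthetic, and the column names and the derived column `rate_ltv_pc1` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, illustrative data: two strongly correlated features,
# one independent feature, and a skewed label (20 defaults out of 200).
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
df = pd.DataFrame({
    "interest_rate": base + rng.normal(scale=0.1, size=n),
    "ltv_ratio": base + rng.normal(scale=0.1, size=n),
    "fico_score": rng.normal(size=n),
    "default": np.r_[np.ones(20), np.zeros(180)].astype(int),
})

# Upsampling: draw minority-label rows with replacement until both
# labels have the same number of records.
majority = df[df["default"] == 0]
minority = df[df["default"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_up], ignore_index=True)

# PCA: merge the highly correlated features into a single component.
correlated = ["interest_rate", "ltv_ratio"]
scaled = StandardScaler().fit_transform(balanced[correlated])
pca = PCA(n_components=1)
balanced["rate_ltv_pc1"] = pca.fit_transform(scaled)[:, 0]
balanced = balanced.drop(columns=correlated)

print(balanced["default"].value_counts())
print(pca.explained_variance_ratio_)
```

In practice, upsampling is usually applied to the training split only, so that duplicated records do not leak into the evaluation data.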
The final part is training a model that predicts mortgage defaults as well as possible. Several classifiers were trained on the data, and LGBMClassifier was found to perform best among those tried. The accuracy of this LGBMClassifier is 81.77%.
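The classifier comparison might look roughly like the sketch below. It is an assumption-laden illustration, not the project's training code: the data comes from scikit-learn's `make_classification`, the two scikit-learn classifiers are stand-ins for whichever models were actually compared, and `LGBMClassifier` is included only if the `lightgbm` package is installed. The scores it prints are unrelated to the 81.77% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-processed mortgage data.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate classifiers (illustrative choices, not the project's list).
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
}
try:
    # Optional dependency: only added when lightgbm can be imported.
    from lightgbm import LGBMClassifier
    candidates["LGBMClassifier"] = LGBMClassifier(random_state=0, verbose=-1)
except (ImportError, OSError):
    pass

# Fit each candidate and record its accuracy on the held-out split.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)
print(scores)
print("Best classifier:", best_name)
```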
Note: To obtain the dataset, visit the site.