
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 7, July 2017

Credit scoring in the age of Big Data – A State-of-the-Art


Youssef TOUNSI, Larbi HASSOUNI, Houda ANOUN
RITM Laboratory, CED Engineering Sciences
Ecole Superieure de Technologie
Hassan II University of Casablanca, Morocco
tounsi@gmail.com, lhassouni@hotmail.com, houda.anoun@gmail.com

Abstract - The banking sector has become very competitive and increasingly sensitive to political and economic circumstances in each country and around the world. In addition to the traditional strategy, which aims to reduce expenses and increase profits, many banks are looking for new methods to reduce credit risk in order to improve their performance. In this sense, to resolve the most common problem, which is the lack of data, several research works are interested in using new data sources such as social networks, which are used by all kinds of users and more particularly by the young population. These new sources store data in large quantities and in a non-traditional format, hence the need to look for new processing methods. Big Data techniques allow storing voluminous amounts of structured, semi-structured and unstructured data, and also provide many solutions to mine these data in order to extract relevant information.
This paper has multiple goals. The first, which is the main one, is to examine the role of big data in predicting the creditworthiness of consumers. The second is to explore the main machine learning methods used in credit scoring. The third is to investigate what type of data is relevant in determining consumer creditworthiness. And the last is to determine how various inherent characteristics of big data – volume, velocity, variety, variability and complexity – are related to the assessment of credit risk.
To address these topics, we have conducted a detailed and careful study of several research works in the field. We envisage that our work can be of great usefulness to academics and professionals in the field of finance, and especially to microfinance organizations whose main activity is to grant microcredits to people with limited incomes. These people are numerous in Morocco and often do not even have bank accounts.

Index Terms - Credit Scoring, Big Data, Machine Learning Techniques, Social data.

I. INTRODUCTION

It was in the banking sector that credit risk assessment was first established, in the mid-twentieth century; now some banks use more than one type of score [1] [2]: application (credit) scoring, propensity scoring, behavioral scoring, collection scoring, recovery scoring, attrition scoring, fraud detection, etc.
Usual credit scores are typically constructed using data extracted from traditional transactional systems such as OLTP (online transaction processing), Core Banking, ERP (enterprise resource planning) and CRM (customer relationship management) applications, and credit bureaus. The basic data about consumers on which these systems operate is very conventional, such as birthdate, gender, income, employment status, etc. All this data is precisely stored in relational databases or data warehouses. Nevertheless, this history is not rich enough in most countries, even in developed regions, and especially for young customers, given that this category tends to be riskier and does not have sufficient credit history [3]. Therefore, other sources of data such as "Big Data" are necessary to bring significant value to the performance of consumer scoring systems.
One of the most important lessons learned from the recent global financial crises is that the information technology and data architectures of financial institutions are inadequate to support the overall management of financial risks. In this context, the Basel Committee on Banking Supervision published in 2013 a set of principles under the name BCBS 239, whose objective is to enable banks to improve their production capacities and the reliability of regulatory reporting. BCBS 239 mandates that banks adhere to a set of core principles for effective Risk Data Aggregation and Risk Reporting (RDARR) practices. As a result, systems integrators, big data firms and business consultants actively help banks prepare their processes and IT infrastructures for compliance. Improving risk assessment involves using more information to construct a complete and relevant client profile. Furthermore, this paper explores the possibilities of overcoming the hurdles encountered by credit scoring systems through the use of Big Data.
The remainder of this article is organized as follows. We present some related work in Section II. In Section III, we present the main classification methods in classical credit scoring. In Section IV, we summarize principles and insights of big data for credit scoring. In Section V, we present credit scoring using non-traditional data. Finally, in Section VI, we conclude and discuss possible future research directions.

II. RELATED WORK

There are many research works made to predict credit risk using wide-ranging computing [4], but we have to mention that there is not much academic literature covering big data for credit scoring [5] [6]. However, there are a number of papers analyzing big data analytics in the banking sector in general terms [7] [8], and other works studying social networking data from the sociological point of view. In fact, only a few of the available studies have mentioned the impact of offline and online customer information on credit scoring. In addition, no paper presents the state of the art of credit scoring based on massive-volume data.
A. Masyutin [6] clarifies that credit history is not rich enough for young clients, but social data handled with big data techniques can bring value to the performance of scoring systems. Moreover, Y. Wei, P. Yildirim, C. Van and C. Dellarocas [10] highlight that credit
scoring using social network data can reduce lenders' misgivings about engaging applicants with limited personal financial history. Further, G. Guo, F. Zhu, E. Chen, Q. Liu, L. Wu and C. Guan [3] illuminate that quantitative investigation of the extracted features shows that online social media data does have good potential in discriminating good credit users from bad ones.
Likewise, Y. Yang, J. Gu and Z. Zhou [11] analyze the opinions about some enterprises transmitted through social media to predict their future credit risk. Additionally, D. Ntwiga and P. Weke [12] present the limitations of traditional consumer lending models due to the use of historical data, and examine the benefits that could arise by incorporating social media data in the credit scoring process for consumer lending. In another empirical analysis, Y. Zhang, H. Jia, Y. Diao, M. Hai and H. Li [13] construct a credit scoring model for the case of online Peer-to-Peer (P2P) lending by fusing social media information based on a decision tree; their results show that their model has good classification accuracy.
Moreover, D. Björkegren and D. Grissen [14] establish a method to predict default among borrowers without formal financial histories, using behavioral patterns revealed by mobile phone usage. Additionally, N. Kshetri [4] indicates that the main reason why low-income families and micro-enterprises in emerging economies such as China lack access to financial services is not because creditworthiness does not exist, but merely because banks and financial institutions lack data. Reading through another case, M. Hurley and J. Adebayo [9] explore the problems posed by big data credit scoring tools and analyze the gaps in the existing laws of the United States of America.
The amount of data is exploding at a remarkable rate because of developments in web technologies, social media and other data-generating activities (mobile devices, log files, IoT, etc.). The conclusion is that credit risk assessment can greatly benefit from using non-traditional information with big data. Nevertheless, traditional approaches struggle when faced with these massive data [15].
Many studies have tried to address the challenges and opportunities of big data in general terms. E. Fortuny, D. Martens and F. Provost [16] provide a clear illustration that larger data can indeed be a more valuable asset for predictive analytics. This implies that institutions with larger data assets (plus the skill to take advantage of them) can potentially obtain a substantial competitive advantage over institutions without such access or skill. Furthermore, L. Wang and C. Alexander [17] establish a comparison of several machine learning algorithms and an evaluation of big data technologies. Additionally, A. L'Heureux, K. Grolinger, H. ElYamany and M. Capretz [18] address machine learning challenges in the era of big data with the ultimate objective of helping practitioners select appropriate solutions for their use cases. Few existing studies have mentioned the effect of using big data for credit scoring. Our contribution consists in reviewing these hurdles and their solutions in the case of credit scoring (Section IV).
In view of these elements, which credit scoring models will benefit most from the advantages of big data in terms of performance? In addition, what non-traditional data are most relevant for improving credit rating? These are the questions we attempt to answer in this paper.

III. THE MAIN MACHINE LEARNING METHODS USED IN CREDIT SCORING

Machine learning (ML) is continuously unleashing its power in a wide range of applications, and credit scoring, using predictive analytics techniques, is one of them. ML has been pushed to the front line in recent years, partly due to the rise of big data. It can be broadly categorized based on two factors (Figure 1):
• Learning types: this has to do with the type of response variable (supervised, unsupervised, semi-supervised, and reinforcement learning).
• Subjective grouping: this grouping is driven by "what" the model is trying to achieve.

Fig. 1 Machine Learning Categories.

The supervised learning algorithms are a subcategory of the family of machine learning algorithms which are mainly used in predictive modeling. These algorithms try to model the relationships between the target prediction output and the input features, based on the dependencies learned from historical data sets. The main types of supervised learning algorithms are:
• Regression: the output variable takes continuous values.
• Classification: the output variable takes class labels.

The aim of the credit scoring model is to perform a classification: to distinguish the "good" applicants from the "bad" ones. In practice this means the statistical model is required to find the line separating the two categories in the space of the explanatory variables (age, salary, education, etc.). For the banking industry, the scoring methodology has played an important role in developing internal rating systems. There are several reasons that can explain the widespread use of scoring models. First, since credit scoring models are established upon statistical models and not on opinions, they offer an objective way to measure and manage risk. Second, the statistical model used to produce credit scores can be validated; data have to be carefully used to check the accuracy and performance of the predictions. Third, the statistical models used by credit scores can be improved over time as additional data are collected.
The scoring system is made up of three major parts:

• Problem Definition: this initial phase focuses on understanding the project objectives and requirements.
• Data Gathering and Preparation: data understanding involves internal and external data collection and storage in traditional systems (relational databases, data warehouses, etc.) or a big data platform, variable selection, data cleansing and data transformation if required, as well as data exploration.
• Model Building and Evaluation: splitting the data into training and test sets, selecting an algorithm, tuning the algorithm, building the model using the training data and evaluating the model using the test data (Figure 2).

Fig. 2 Model building steps (logistic regression, e.g.).

There are many measurements for evaluating the predictive performance of models in the area of credit scoring. These measurements are: ROC curves, average accuracy, and Type I and Type II errors.
The ROC (Receiver Operating Characteristics) curve was first applied to assess how well radar equipment in WWII distinguished random interference, or "noise", from signals that were truly indicative of enemy planes (Swets et al., 2000). The ROC curve plots the sensitivity or "hits" (e.g., true positives) of a model on the vertical axis against 1-specificity or "false alarms" (e.g., false positives) on the horizontal axis. The area under the ROC curve is a convenient way to compare different binary predictive models.

TABLE I
CONFUSION MATRIX FOR CREDIT SCORING

                                  Predicted class (%)
Actual class (%)      Good loans               Bad loans
Good loans            TP                       FN (Type II error)
Bad loans             FP (Type I error)        TN

From the confusion matrix the following measures are defined:

Average Accuracy (ACC) = (TP+TN) / (TP+FN+TN+FP)   (1)

Type I error = FP / (TN+FP)   (2)

Type II error = FN / (TP+FN)   (3)

TP stands for a good applicant correctly classified as good, TN stands for a bad applicant correctly classified as bad, FN (Type II) stands for a good applicant incorrectly classified as a bad customer, and FP (Type I) stands for a bad customer incorrectly classified as a good customer (high risk).
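To make measures (1)-(3) concrete, the following minimal Python sketch computes them from the four cells of Table I; the counts are hypothetical and serve only to illustrate the formulas.

    def scoring_metrics(tp, fn, fp, tn):
        """Compute measures (1)-(3) from the confusion matrix of Table I."""
        accuracy = (tp + tn) / (tp + fn + tn + fp)  # (1) average accuracy
        type_i_error = fp / (tn + fp)               # (2) bad applicants classified as good
        type_ii_error = fn / (tp + fn)              # (3) good applicants classified as bad
        return accuracy, type_i_error, type_ii_error

    # Hypothetical counts for a test set of 1,000 loan applications
    acc, t1, t2 = scoring_metrics(tp=700, fn=50, fp=80, tn=170)
    print(f"ACC={acc:.3f}, Type I={t1:.3f}, Type II={t2:.3f}")

Note that the Type I error is typically the costlier one for a lender, since it corresponds to granting credit to applicants who default.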
There are several statistical methods for building and estimating scoring models, including linear regression models, logit models, probit models, and neural networks. We introduce them in the remainder of this section.

Linear regression (LR)

Linear regression is a supervised machine learning technique used to identify the linear relationship between target variables and explanatory variables. A general linear regression problem can be described by assuming some dependent or response variable yi which is influenced by inputs or independent variables xi1, xi2, ..., xiq. A regression model can express this relation:

yi = β1 xi1 + β2 xi2 + ... + βq xiq + ε   (4)

where β1, β2, ..., βq are fixed regression parameters and ε is a random error or noise term. To get a more accurate prediction, we need to make this error term as small as possible. Perhaps the most primitive classification method was formally introduced by Ronald Fisher in 1936 [19]. Orgler (1970) used regression analysis in credit scoring for commercial loans. The use of regression analysis was then extended to further applications (Lucas, 1992; Henley, 1995; Hand & Henley, 1997; Hand & Jacka, 1998). Furthermore, other authors have studied linear regression models or their generalizations in credit scoring (Hand and Kelly, 2002; Banasik et al., 2003; Karlis and Rahmouni, 2007; Efromovich, 2010) [20]. Besides, in the age of big data, S. Jun, S. Lee and J. Ryu (2015) propose a new analytical methodology for big data analysis in linear regression problems, aimed at reducing the computing burden [21].
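As an illustration of model (4), the sketch below estimates the parameters β by ordinary least squares with NumPy; the applicant features and target values are invented for the example, and the intercept column is one common practical choice.

    import numpy as np

    # Toy design matrix: each row is an applicant (age, income, years employed)
    X = np.array([[25, 6.0, 1], [40, 12.0, 10], [33, 9.5, 5], [52, 15.0, 20]], dtype=float)
    y = np.array([0.2, 0.9, 0.6, 1.0])  # illustrative creditworthiness targets

    # Append a column of ones so an intercept is estimated alongside the betas
    X1 = np.column_stack([X, np.ones(len(X))])

    # Least squares chooses the betas that make the error term of (4) smallest
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    y_hat = X1 @ beta  # fitted scores for the training applicants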
Discriminant analysis (DA)

Discriminant analysis is a credit scoring technique established to discriminate between two groups. It is commonly agreed that the discriminant method is still one of the most widely industrialized techniques for classifying customers as good credit or bad credit. Since the crisis of 1929-1933, more and more academics and practitioners have studied the phenomenon of bankruptcy prediction, developing or using DA models (Fisher, 1936; 1986; Altman, 1968; 1977; 2000; Edmister, 1972; Deakin, 1977; Conan and Holder, 1979; Beaver, 1996; West, 2000; Anghel, 2002; Gestel et al., 2006; Yang, 2007; Falangis and Glen, 2010; Dinca and Gidinceanu, 2011; Armeanu et al., 2012; Akkoc, 2012) [19] [22].

Logistic regression (LG)

Logistic regression was proposed by Berkson (1944). LG is a classification model that, in the specific context of binary
classification, estimates the posterior probability of the positive class as the logistic sigmoid of a linear function of the feature vector [22]. It can be implemented using logistic functions. To predict the log odds ratio, the following formula is used:

logit(p) = β0 + β1 × x1 + β2 × x2 + ... + βn × xn   (5)

The probability formula is as follows:

p = e^logit(p) / (1 + e^logit(p))   (6)

logit(p) is a linear function of the explanatory variables X (x1, x2, x3, ..., xn), which is similar to linear regression, so the output of this function lies in the range 0 to 1 and can be read as a probability score. In the majority of cases, if the score is greater than a threshold (e.g. 0.5), the outcome is classified as 1, otherwise as 0; in other words, the function provides a classification boundary for the outcome variable. Logistic regression has been largely used in credit scoring applications [23]. The major benefit of this method is that it can generate a simple probabilistic formula for classification. LG has been compared with other credit scoring techniques (Li and Hand, 2002; Hand, 2005; Lee and Chen, 2005; Abdou et al., 2008; Yap et al., 2011; Pavlidis et al., 2012) [19].
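The sketch below illustrates formulas (5) and (6) with scikit-learn's LogisticRegression on a toy applicant table; the features, labels and acceptance threshold are assumptions made only for the example, not a production scorecard.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Illustrative applicant features: (age, income, number of open credit lines)
    X = np.array([[23, 2.1, 1], [45, 8.0, 4], [31, 4.5, 2],
                  [52, 9.5, 6], [27, 1.8, 3], [39, 6.2, 2]], dtype=float)
    y = np.array([0, 1, 1, 1, 0, 1])  # 1 = good payer, 0 = defaulted

    model = LogisticRegression().fit(X, y)

    applicant = np.array([[35, 5.0, 2]], dtype=float)
    p_good = model.predict_proba(applicant)[0, 1]  # sigmoid of logit(p), as in (6)
    decision = "accept" if p_good > 0.5 else "reject"  # 0.5 threshold from the text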
Decision trees (DT)

The decision tree is one of the most broadly used classification and prediction methods of machine learning. DT can deal with both numerical data and categorical data (such as gender), so it is very applicable to personal credit rating [13]. A decision tree can be defined as a tree in which each branch node specifies a choice between a number of alternatives, and each leaf node designates a classification or decision. Most algorithms for decision tree induction, such as ID3, C4.5, and CART, follow a greedy, top-down recursive divide-and-conquer approach, which starts with a training set of tuples and their associated class labels [24]. Kao et al. (2012) propose a combination of a Bayesian behavior scoring model and a CART-based credit scoring model. Other possible and particular decision tree methods are the C4.5 and J4.8 decision tree algorithms.
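As a sketch of the CART-style induction discussed above, the example below trains scikit-learn's DecisionTreeClassifier (a CART implementation) on a small invented applicant table; ordinal encoding of the categorical columns is one possible preprocessing choice among others.

    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Mixed categorical/numerical applicant data: (gender, housing, age, income)
    raw = [["F", "owner", 34, 5.2], ["M", "tenant", 22, 1.9],
           ["M", "owner", 51, 8.7], ["F", "tenant", 28, 3.1]]
    labels = [1, 0, 1, 0]  # 1 = good credit, 0 = bad credit

    enc = OrdinalEncoder()
    X_cat = enc.fit_transform([row[:2] for row in raw])  # encode the categorical columns
    X = [list(cat) + row[2:] for cat, row in zip(X_cat.tolist(), raw)]

    tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, labels)
    print(export_text(tree, feature_names=["gender", "housing", "age", "income"]))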
Artificial Neural Networks (ANNs)

ANNs are inspired by the functionality of the nerve cells in the brain. Just like humans, ANNs can learn to recognize patterns by repeated exposure to many different examples. They are non-linear models that can classify based on pattern recognition capabilities [25]. A neural network is a highly interconnected network with abundant neurons and mutual links between them. An NN has several layers consisting of neurons with similar characteristics; the neurons in one layer are connected with those in contiguous layers, and the value of the connection between two neurons in different layers is called a 'weight' (Figure 3).

Fig. 3 Activation functions in use with neural networks.

Figure 3 depicts a neuron computing u = w0 + Σ wi xi from inputs x1, ..., xn; its output y is obtained by applying an activation function f:

y = f( Σ(j=1..d) wj xj + w0 ) ≡ f( Σ(j=0..d) wj xj )   (7)

More recently, different artificial neural networks have been suggested to tackle the credit scoring problem. On some datasets, neural networks have the highest average correct classification rate when compared with other traditional techniques, such as discriminant analysis and logistic regression, although the results were very close (Abdou et al., 2008).
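A minimal NumPy sketch of the single neuron of Figure 3 and equation (7) follows; the sigmoid activation f is an assumption chosen for the illustration, and the weights are arbitrary.

    import numpy as np

    def neuron(x, w, f=lambda u: 1.0 / (1.0 + np.exp(-u))):
        """Single neuron of Figure 3: u = w0 + sum(wi * xi), output y = f(u)."""
        u = w[0] + np.dot(w[1:], x)  # w[0] plays the role of the bias weight w0
        return f(u)

    x = np.array([0.3, 0.8, 0.5])         # illustrative normalized applicant features
    w = np.array([-0.2, 0.7, -1.1, 0.4])  # bias plus one weight per input
    print(neuron(x, w))                   # activation value in (0, 1)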
Deep Learning (DL)

Since 2006, deep structured learning, more commonly called deep learning or hierarchical learning, has emerged as a new area of machine learning research [26]. Deep learning is a generic term for multilayer neural networks. Multilayer neural networks decrease the overall calculation time by performing calculations on hidden layers; however, they were prone to excessive overtraining, as an intermediate layer was often used for approximately every single layer. Deep learning provides training stability, generalization, and scalability with big data, and it is quickly becoming the algorithm of choice for the highest predictive accuracy [27]. Niimi (2015) conducted a credit card data analysis using deep learning and confirmed that deep learning has the same accuracy as the Gaussian kernel SVM on the German credit data set (the most used public data set in credit scoring) [28].

Support vector machine (SVM)

The support vector machine is a classification technique first introduced by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. The key benefits of this model come from its nonparametric nature. The main idea of the Support Vector Machines algorithm [29] is that, given a set of points which belong to one of two classes, an optimal way is needed to separate the two classes by a hyperplane, as seen in Figure 4. This is done by:
• maximizing the distance (from the closest points) of either class to the separating hyperplane;
• minimizing the risk of misclassifying the training samples and the unseen test samples.

This leads to the optimization problem:

min(w,b) ½ ‖w‖²   (8)


subject to the constraints:

yi = +1 ⇒ w · xi + b ≥ +1
yi = −1 ⇒ w · xi + b ≤ −1   (9)

i.e. yi (w · xi + b) ≥ 1, ∀i

Fig. 4 Support Vector Machines.

The label of each data point xi is yi, which can take the value +1 or -1 (representing positive or negative, respectively). Depending on the way the given points are separated into the two available classes, SVMs can be linear or non-linear. Lately, the support vector machine has been used in credit scoring (Chen et al., 2009; Li et al., 2006; Gestel et al., 2006; Xiao and Fei, 2006; Yang, 2007; Chuang and Lin, 2009; Zhou et al., 2009; 2010; Feng et al., 2010; Hens and Tiwari, 2012; Ling et al., 2012).
Ling et al., 2012). P(a)
Fuzzy logic (FUZZY) The Bayesian network is a probabilistic model that represents a
Zadeh (1965) introduced the Fuzzy Logic as a set of random variables and their conditional dependencies via
mathematical system, which there are not just two alternatives a direct acyclic graph. The most remarkable characteristic of
but a whole continuum of truth values for logical propositions. BN is the capability to encode both quantitative and qualitative
Unlike the binary logic, fuzzy logic uses the notion of knowledge. This means, we can rely in this model on statistical
membership to handle the imprecise information. Several data analysis and experience of domain experts as well.
fuzzy-based approaches have been suggested to evaluate credit Sometimes referred to as causal networks, Bayesian models are
worthiness. Zimmermann and Zysno (1980) used fuzzy constructed as graph models, where nodes are used to encode
operators to aggregate evaluation results from a four-level parameter characteristics (descriptive variables), and
hierarchy of criteria. Using a fuzzy connectionist model, directional links between them encode often complex
Romaniuk and Hall established an expert system, FUZZNET correlations, usually of causal nature. Several academics have
system, to acquire a knowledge base for classifying the credit studied Bayesian Credit Scoring Models (Whittaker, 1990;
worthiness of loan applicants [30]. Furthermore, Hoffman et al. Sewart and Whittaker, 1998; Hand et al., 1997; Chang et al.,
(2002) proposed a genetic fuzzy for credit scoring and 2000; Baesens et al., 2001; Gemela, 2001; Zhu et al., 2002;
compared it with neuro-fuzzy algorithm NefClass [31]. Other Abramowicz et al., 2003; Thomas et al., 2005; Antonakis and
authors have been studying Fuzzy logic models or its Sfakianakis, 2009; Bier et al., 2010; Wu, 2011; Hand and
generalizations in credit scoring (Lahsasna et al., 2010; Adams; 2014). Possible methods in Bayesian networks are
Nosratabadi, Nadali and Pourdarab, 2012; Bazmara and naive Bayes, tree augmented naive Bayes and Gaussian naive
Donighi, 2013; Duc and Thien, 2013) [32] [33]. Possible Bayes [19] [36].
methods in fuzzy logic are regularized adaptive network based Hybrid methods
Fuzzy inference systems and fuzzy Adaptive Resonance [19]. Hybrid methods combine different techniques to improve
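As a sketch of the Gaussian naive Bayes variant mentioned above, the example below applies Bayes' rule (11) through scikit-learn's GaussianNB on an invented two-feature applicant set.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[30, 4.0], [45, 7.5], [22, 1.5], [50, 9.0], [28, 2.2]], dtype=float)
    y = np.array([1, 1, 0, 1, 0])  # 1 = good, 0 = bad

    nb = GaussianNB().fit(X, y)  # estimates P(x | class) as a Gaussian per feature

    # predict_proba applies rule (11): P(class | x) ∝ P(x | class) × P(class)
    print(nb.predict_proba([[35.0, 5.0]]))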
Hybrid methods

Hybrid methods combine different techniques to improve the performance capability. There are four different hybrid schemes: Classification+Clustering, Clustering+Classification, Classification+Classification and Clustering+Clustering (Figure 5):

Fig. 5 Hybrid methods.


The first classifier/cluster can be trained in order to classify good and bad clients, detect and filter outliers, or reduce the data. The second uses the output of the first in order to perform better and provide better results [37]. Several authors have studied hybrid approaches (Lee et al., 2002; Huang et al., 2007; Lee and Chen, 2005; Hsieh, 2005; Huysmans et al., 2006; Shi, 2009; Chen et al., 2009; Liu et al., 2010; Ping and Yongheng, 2011; Capotorti and Barbanera, 2012; Vukovic et al., 2012; Akkoc, 2012; Pavlidis et al., 2012) [1] [19] [36].

Ensemble methods

An ensemble method is a set of classifiers that learn a target function, and their individual predictions are combined to classify new examples. Ensembles generally improve the generalization performance of a set of classifiers on a domain [17]. Some of the most frequently used methods are bagging (Breiman, 1996), boosting (Schapire, 1990), stacking (Wolpert, 1992) and random forest (Breiman and Cutler, 2001). Many authors have used these combined approaches in credit scoring problems (Homann et al., 2007; Hsieh and Hung, 2010; Paleologo et al., 2010; Zhang et al., 2010; Finlay, 2011; Louzada et al., 2011; Xiao et al., 2012; Marques et al., 2012; Wang, Ma, Huang and Xu, 2012; Malekipirbazari and Aksakalli, 2015) [17] [19] [38] [39] [40] [41].
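As an illustration of one popular ensemble, the sketch below bags decision trees with scikit-learn's RandomForestClassifier on synthetic data; the data-generating rule is an assumption made only for the example.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))  # synthetic applicant features
    y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # Bagging of trees: each tree sees a bootstrap sample and random feature
    # subsets, and the forest classifies by majority vote
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("test accuracy:", forest.score(X_te, y_te))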
IV. UNDERSTANDING BIG DATA AND ITS CHALLENGES FOR CREDIT SCORING

A. What Is Big Data?

The volume of data which needs to be stored every day is increasing exponentially. It is now possible to handle these vast amounts of information on low-cost platforms such as Hadoop. The "big data" phenomenon brings the challenge of empowering predictive methods for credit scoring. Indeed, Big Data has to deal with large and complex datasets that can be structured, semi-structured, or unstructured, and that will typically not fit into memory to be processed.
Doug Laney [42] was the first to talk about the 3 V's in Big Data management (Figure 6):

Fig. 6 Three V's of Big Data.

First, the volume at which new data is being generated is striking; for this reason, this "V" is the one most associated with Big Data. We live in an age when the amount of data we expect to be generated in the world is measured in exabytes and zettabytes. As databases grow, the applications and architecture designed to support the data need to be re-evaluated quite often. Furthermore, the amount of data stored by financial institutions is rapidly increasing and provides the opportunity for them to conduct predictive analytics and enhance their businesses. However, data scientists face large challenges in handling this massive amount of data efficiently and generating insights with real business value.
Secondly, data velocity measures the speed of data creation, streaming, and aggregation. eCommerce has rapidly increased the speed and richness of data used for different business transactions (for example, web-site clicks). Data velocity management is much more than a bandwidth issue; it is also an ingest issue (extract-transform-load) [43].
Then, data variety is the challenge of having disparate data sets, from different sources in different formats, all in silos – text, images, video, audio, etc. By negating the benefits that a unified view of data brings from an analytic perspective, it is probably the biggest obstacle to effectively using large volumes of data. Incompatible data formats, non-aligned data structures, and inconsistent data semantics represent significant challenges that can lead to analytic sprawl [44].
Nowadays, 15 characteristics are defined by professionals and academics (Doug Laney, SAS, Oracle, Oguntimilehin A. and Data Science Central) [42]: Volume, Velocity, Variety, Value, Veracity, Validity, Volatility, Visualization, Virality, Viscosity, Variability, Venue, Vocabulary, Vagueness and Complexity.
The methodologies and frameworks behind the Big Data concept have become very common in a wide number of research and industrial areas. The remainder of this section introduces Hadoop and Spark.

B. Apache Hadoop

A Hadoop distribution is made of a number of separate frameworks that are designed to work together, and both the frameworks and the Hadoop framework platform are extensible. Hadoop has evolved to support fast data as well as big data. Big data was initially about large batch processing of data; now banks also need to make lending decisions in real time or near real time, as the data arrives. Fast data involves the capability to act on the data as it arrives, and Hadoop's flexible framework architecture supports the processing of data with different run-time characteristics. The Hadoop framework platform includes these principal modules (Figure 7):
• Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
• Hadoop YARN: a framework for job scheduling and cluster resource management.
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
• Hadoop Common: the common utilities that support the other Hadoop modules.


Fig. 7 Hadoop framework.

The Hadoop Distributed File System (HDFS) is a file system that provides reliable data storage and access across all the nodes in a Hadoop cluster (Figure 8). It links together the file systems on many local nodes to create a single file system. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the various nodes in the cluster. This way, the map and reduce functions can be executed on smaller subsets of the larger data sets, which provides the scalability that is needed for big data processing. This powerful feature is made possible through HDFS.

Fig. 8 Hadoop Distributed File System.

MapReduce is a programming framework of Hadoop suitable for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner. MapReduce is the heart of Hadoop. This programming paradigm allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The idea behind this programming model is to design map functions (or mappers) that are used to generate a set of intermediate key/value pairs, after which the reduce functions merge all of the intermediate values that are associated with the same intermediate key (reduce can also be used as a shuffling or combining function). The key aspect of the MapReduce algorithm is that if every map and reduce is independent of all other ongoing maps and reduces, then the operations can be run in parallel on different keys and lists of data. Three functions, Map(), Shuffle(), and Reduce(), are the basic processes in any MapReduce approach (Figure 9):
• Map step: each worker node applies the Map() function to its local data and writes the output to a temporary storage space. The Map() code is run exactly once for each key/value pair, generating output that is organized by key; a master node arranges it so that for redundant copies of input data only one is processed.
• Shuffle step: data belonging to one key is redistributed to one worker, such that each worker node contains the data related to one key only.
• Reduce step: each worker node now processes the data belonging to its keys in parallel.

Fig. 9 MapReduce framework.
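The three steps can be simulated in a few lines of plain Python; the word-count task and the in-memory groups below are stand-ins for the blocks and worker nodes that Hadoop would actually distribute.

    from collections import defaultdict

    records = ["good good bad", "bad good", "good"]  # stand-in for distributed input blocks

    # Map step: each "worker" emits intermediate (key, value) pairs from its block
    mapped = [(word, 1) for block in records for word in block.split()]

    # Shuffle step: pairs are regrouped so that each key ends up with one worker
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce step: each worker aggregates the values of its keys in parallel
    result = {key: sum(values) for key, values in groups.items()}
    print(result)  # {'good': 4, 'bad': 2}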

YARN is a cluster management technology and one of the key features of second-generation Hadoop. It is the next-generation MapReduce, which assigns CPU, memory and storage to applications running on a Hadoop cluster. It enables application frameworks other than MapReduce to run on Hadoop, opening up a wealth of possibilities. Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines, such as interactive SQL, real-time streaming, data science and batch processing, to handle data stored in a single platform.

C. Apache Spark

Spark is designed to run on top of Hadoop; it is an alternative to the traditional batch map-and-reduce model that can be used for real-time stream data processing and for fast queries that finish within seconds. In addition to Map and Reduce functions, Spark supports SQL queries, streaming data, machine learning and graph data processing for big data analysis. The Spark platform is implemented in Scala and hence runs within the Java Virtual Machine (JVM); in addition to the Scala API, interfaces in Java and Python are available. Spark provides two options for running applications. Firstly, an interpreter in the Scala language distribution allows users to run queries on large data sets through the Spark engine. Secondly, applications can be written as Scala programs called driver programs and can be submitted to the cluster's master node after compilation [45].
Apache Spark consists of a driver program (SparkContext), workers (also called executors), a cluster manager, and HDFS. The driver program is the main program of Spark. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object called the driver program. Each application gets its own processes, runs tasks in multiple threads, and must be network-addressable from the worker nodes. Once connected, Spark acquires executors
on the nodes in the cluster, which are worker processes that run computations and store data for the application. Next, it sends the application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run (Figure 10).

Fig. 10 Driver program (SparkContext).

Apache Spark is based on two key concepts: Resilient Distributed Datasets (RDD) and a directed acyclic graph (DAG) execution engine. With regard to datasets, Spark supports two types of RDDs: parallelized collections, which are based on Scala collections, and Hadoop datasets, which are created from files stored on HDFS.
Spark jobs perform work on Resilient Distributed Datasets (RDD), an abstraction for a collection of elements that can be operated on in parallel (Figure 11). When Spark is running on a Hadoop cluster, RDDs are created from files in the distributed file system in any format, such as text and sequence files, or anything supported by a Hadoop InputFormat.

Fig. 11 Resilient Distributed Datasets.

A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. The elements of an RDD need not exist in physical storage. RDDs support two kinds of operations: transformations and actions [45].
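A minimal PySpark sketch of the two kinds of RDD operations follows, assuming a local Spark installation; filter and map are transformations (lazy), while collect and count are actions that trigger the distributed computation.

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-sketch")

    # A parallelized collection: one of the two RDD types described above
    amounts = sc.parallelize([120.0, 45.5, 980.0, 15.0, 300.0])

    # Transformations are lazy: they only describe new RDDs
    large = amounts.filter(lambda a: a > 100.0)
    doubled = large.map(lambda a: a * 2)

    # Actions trigger the actual computation
    print(doubled.collect())  # [240.0, 1960.0, 600.0]
    print(doubled.count())    # 3

    sc.stop()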
TABLE II
HADOOP VS SPARK

• Difficulty: Hadoop MapReduce is difficult to program and needs abstractions, whereas Spark is easy to program and does not require any abstractions.
• Interactive mode: Hadoop has no in-built interactive mode, while Spark does.
• Streaming: Hadoop MapReduce only processes batches of stored data, while Spark can process data in real time through Spark Streaming.
• Performance: MapReduce does not leverage the memory of the Hadoop cluster to the maximum, while Spark has been said to execute batch processing jobs about 10 to 100 times faster than Hadoop MapReduce.
• Latency: MapReduce is completely disk-oriented, while Spark ensures lower-latency computations by caching partial results across the memory of its distributed workers.
• Ease of coding: writing Hadoop MapReduce pipelines is a complex and lengthy process, while Spark code is always more compact.
• Supported languages: Hadoop supports Java, while Spark supports Java, Python, R and Scala.

D. Big data challenges in credit scoring

Supported by the rise of "Big Data" platforms such as Apache Hadoop and Spark, banks are collecting and analyzing ever larger datasets. Big Data thus presents both opportunities and challenges for credit scoring as an application of predictive modeling [44]. First, concerning volume, many studies indeed continue to see increasing predictive performance from more data as the datasets become massive [44] [45]. Another benefit of big data to machine learning lies in the fact that, with more samples available for learning, the risk of overfitting becomes smaller. The variety dimension is equally interesting: a fast flood of unstructured data such as social media, image, audio and video, in addition to the structured data, is providing novel openings for credit scoring, especially for populations whose data are otherwise missing. Velocity implies that data are streaming in at rates faster than traditional systems can handle. Veracity suggests that, despite the data being available, the quality of the data is still a major concern. On the other hand, there are several bottlenecks in designing a credit scoring system based on big data, such as [46]:
• Time complexity: finishing the computation within acceptable time on a single computer is very hard.
• Memory restrictions: it is difficult to keep the complete training data set, or most of it, in memory on one computer.
Indeed, SVM, for example, is an extremely powerful and widely accepted classifier in the field of risk assessment due to its better generalization capability; however, it is not suitable for large-scale datasets due to its high computational complexity [47].
The following table presents a summary of the challenges and some proposed solutions for credit scoring with big data [18]:

TABLE III
BIG DATA CHALLENGES AND SOME PROPOSED SOLUTIONS

Volume

• Processing performance. Examples/explanation: the SVM algorithm has a training time complexity of O(m³) and a space complexity of O(m²); logistic regression, linear regression and Gaussian discriminative analysis are O(mn²+n³), where m is the number of training samples and n the number of features. Proposed solution: resilient distributed datasets (RDDs).


• Curse of modularity. Examples/explanation: many learning algorithms rely on the assumption that the data being processed can be held entirely in memory or in a single file on a disk.
• Class imbalance. Examples/explanation: this problem is especially prominent when some classes are represented by a large number of samples and some by very few; the performance of a machine learning algorithm can be negatively affected when large datasets contain data from classes with various probabilities of occurrence. Proposed solution: Japkowicz and Stephen showed that decision trees, neural networks, and support vector machine algorithms are all very sensitive to class imbalance.
• Curse of dimensionality. Examples/explanation: difficulties encountered when working in high-dimensional space; specifically, the dimensionality describes the number of features or attributes present in the dataset. The time complexity of principal component analysis is O(mn²+n³), and that of logistic regression is O(mn²+n³). Proposed solution: feature selection and feature engineering.
• Feature selection and feature engineering. Examples/explanation: feature selection (dimensionality reduction) aims to select the most relevant features; the selection of the most appropriate features is one of the most time-consuming pre-processing tasks in machine learning. As the dataset grows, both vertically and horizontally, it becomes more difficult to create new, highly relevant features.
• Non-linearity. Examples/explanation: the correlation coefficient is often cited as a good indicator of the strength of the relationship between two or more variables. This problem is not exclusive to Big Data, but non-linearity can be expected to be more prominent in large datasets: the large number of points often creates a large cloud, making it difficult to observe relationships and assess linearity.
• Generalization error. Examples/explanation: generalization error can be broken down into two components, variance and bias. Variance describes the consistency of a learner's ability to predict random things, and bias describes the ability of a learner to learn the wrong thing. As the volume of data increases, the learner may become too closely biased to the training set and may be unable to generalize adequately for new data.

Variety

• Data locality. Examples/explanation: it is difficult to keep the whole training data set, or most of it, in memory on one computer. Proposed solution and caveat: MapReduce-based approaches encounter difficulties when working with highly iterative algorithms.
• Data heterogeneity. Examples/explanation: machine learning approaches were not developed to handle semantically diverse data types, file formats, data encodings, data models, and the like. The business value of data analytics typically involves correlating diverse datasets, and integration is crucial for carrying out machine learning over such datasets.

Velocity

• Data availability. Examples/explanation: incremental learning is a relatively old concept, yet it is still an active research area due to the difficulty of adapting some algorithms to continuously arriving data.
• Real-time processing/streaming. Examples/explanation: traditional machine learning approaches are not designed to handle constant streams of data, whereas the business value of real-time processing systems lies in their ability to provide an instantaneous reaction.
• Concept drift. Examples/explanation: Big Data are non-stationary; new data arrive continuously, so machine learning models are built using older data that no longer reflect the distribution of new data accurately. The challenges typically lie in quickly detecting when concept drift is occurring and in effectively handling the model transition during these changes.
Veracity

• Data provenance. Examples/explanation: data provenance is the process of tracing and recording the origin of data and their movements between locations. The provenance dataset itself becomes very large; therefore, while these data provide excellent context for machine learning, the volume of this metadata creates its own set of challenges: not only is the dataset too large, but the computational cost of carrying this overhead becomes overwhelming. Proposed solution: Reduce and Map Provenance (RAMP), developed for MapReduce.
• Data uncertainty. Examples/explanation: for example, sentiment data are being collected through social media; although these data are highly important because they contain precious insights into subjective information, the data themselves are imprecise. Machine learning algorithms are not designed to handle this kind of imprecise data, resulting in another set of unique challenges for machine learning with Big Data.
• Dirty and noisy data. Examples/explanation: noisy data contain various types of measurement errors, outliers, and missing values. From the machine learning perspective this is different from imprecise data: having an unclear picture is different from having the wrong picture. Noisy data is one of the three main challenges of Big Data analysis, in addition to multiple sources and the dependent-data challenge (samples that are dependent, with relatively weak signals).

In conclusion, banks with bigger data assets may have an important strategic advantage over their smaller competitors if they can overcome the above bottlenecks.

V. CREDIT SCORING USING NON-TRADITIONAL DATA

The use of social networks like Facebook, YouTube, Instagram, Twitter and Weibo has grown exponentially, as we have witnessed during the past few years. Generally, the total number of people using social media continues to rise, and therefore online user footprints are accumulating rapidly on the social web. However, compared with traditional financial data, diverse social data presents both opportunities and challenges for credit scoring [3].
Facebook was the first social network to surpass 1 billion registered accounts and currently sits at 1.97 billion monthly active users (see Figure 12, according to www.statista.com).

Fig. 12 Social network sites ranked by number of active users (in millions, April 2017).

With developments in internet connectivity alongside the rise of smartphone adoption, social media platforms have been able to enrich the data they collect. This has particular relevance to the variety and frequency aspects of data value. Geographic data logging across 4G, for example, allows platforms to identify a user's routine movements in real time. Social data can be retrieved from any point where a user has a traceable interaction with an accessible technology. Therefore, social data analysis could enable banks to establish a credit score for individuals with no credit history and to identify relevant products for specific individuals.
Indeed, we are considering various non-traditional data for credit scoring: technical information, public databases, web searches, location, session tracking, social networks, mobile data, browser data, telecom data, e-commerce, financial transactions, etc. The table below presents a comparison of the analyzed papers which address credit scoring using non-traditional data.


TABLE IV
NON-TRADITIONAL DATA

• A. Masyutin [6]. Type of loan / ML method: personal credit / logistic regression. Social network / region: Vkontakte / Russia. Data: age, gender, marital status, number of days since last visit, number of subscriptions, number of days since the first post, number of user's posts with photos, number of user's posts with video, number of children, major things in life and major qualities in people.
• Y. Wei, P. Yildirim, C. Van and C. Dellarocas [10]. Type of loan: personal credit. Data: ties among customers.
• G. Guo, F. Zhu, E. Chen, Q. Liu, L. Wu and C. Guan [3]. Type of loan / ML method: personal credit / decision tree, naive Bayes, logistic regression and SVM. Social network / region: Weibo / China. Data: demographic features, tweet features, network features, high-level features.
• Y. Yang, J. Gu and Z. Zhou [11]. Type of loan / ML method: enterprise credit / logistic and probit regression. Social network / region: Hexun.com and Finance.sina.com.cn / China. Data: financial status (worries, predictions, explanations, etc.); operation (news, technologies, strategy changes, etc.); executive (risk attitude, experience, etc.); marketing (prospects, competitors, upstream and downstream, etc.); major issues (mergers, restructuring, major investments, etc.).
• D. Ntwiga and P. Weke [12]. Type of loan / ML method: personal credit / linear regression, logistic regression, neural nets and genetic algorithms. Region: Kenya. Data: Ntwiga (2016): age, gender, trust, interactions, risk factor, sociability, relationship strength, private data and return on private data.
• Y. Zhang, H. Jia, Y. Diao, M. Hai and H. Li [13]. Type of loan / ML method: online Peer-to-Peer (P2P) lending / decision tree. Social network / region: PPDai / China. Data: membership score, prestige, forum currency, contribution and group.
• D. Björkegren and D. Grissen [14]. Type of loan: personal credit. Region: a Caribbean country. Data: mobile phone use preceding the loan, weekly: calls out (number), calls out (minutes), SMS sent, data use, top-ups, spend, balance, and days of mobile phone data preceding the loan; features derived from phone usage: variation in usage, periodicity of usage and mobility.
• M. Hurley and J. Adebayo [9]. Type of loan: personal credit. Region: USA. Data: media sites, browser activity, blogs, retail information, Internet Service Protocol (ISP) address and other data points obtained from public platforms.

Information posted or broadcast on social networks may be inaccurate or misleading. For example, a post by an unhappy customer may be inaccurate and may not be indicative of a company's success or its solvency. As a result, not only must social media data be predictive of the solvency of the applicant, but the reliability of the social media data also needs to be confirmed. Whether the use of such information in a credit decision is appropriate must be established by the bank. Failure to use the correct information to make a credit decision can lead to problems of security and reliability which the bank needs to take into consideration [3] [8] [9].

VI. CONCLUSIONS AND OPEN PROBLEMS

We have presented a survey of credit scoring using non-traditional data in the area of big data. When the credit history (which usually serves as input data in classical credit scoring) does not exist or is not rich enough, as for young clients, many academics and professionals find that social data can improve the performance of scoring systems. Nowadays, credit scoring using big data is emerging as a way to ensure greater efficiency in underwriting while expanding access to the underbanked and to historically neglected groups. We have to mention that there is not much academic work covering the use of non-traditional data in credit scoring with big data. However, there are some firms and start-ups whose domain is online/offline data retrieval, aggregation and customer analytics for credit organizations: Wonga, Kreditech, Big Data Scoring, Lenddo, SOCSCOR, Crediograph, etc. [6].
Thereafter, we presented the summarized principles and insights of big data for credit scoring. Machine learning is at the core of data analysis in credit scoring, and the accuracy of these algorithms depends on the size and the relevance of the data. One of the challenges of "big data credit scoring" is the need for scalable machine learning implementations on very large data sets; executing MapReduce jobs using Hadoop or Spark together with machine learning gives the best results for optimal time efficiency. Finally, the open discussion at the end can help researchers gain a more general understanding of the use of big data in credit scoring and motivate them to get involved and contribute. We hope that this survey will pave the way for subsequent research in big data for credit scoring. There is every reason to believe that adapting traditional credit scoring to Big Data could benefit banks substantially.


REFERENCES

[1] S. Sadatrasoul, M. Gholamian, M. Siami, Z. Hajimohammadi. Credit scoring in banks and financial institutions via data mining techniques, Journal of AI and Data Mining, Volume 1, Issue 2, pp. 119-129 (2013).
[2] S. Tuffery. Data mining and statistics for decision making, John Wiley & Sons, Ltd (2011).
[3] G. Guo, F. Zhu, E. Chen, Q. Liu, L. Wu, C. Guan. From Footprint to Evidence: An Exploratory Study of Mining Social Data for Credit Scoring, ACM Transactions on the Web, Vol. 10, No. 4, Article 22 (2016).
[4] N. Kshetri. Big data's role in expanding access to financial services in China, International Journal of Information Management, pp. 297-308 (2016).
[5] L. Devasena. Proficiency comparison of LADTree and REPTree classifiers for credit risk forecast, International Journal on Computational Sciences & Applications (IJCSA), Vol. 5, No. 1 (2015).
[6] A. Masyutin. Credit scoring based on social network data, Business Informatics, No. 3 (33), pp. 15-23 (2015).
[7] D. Andrea, A. Alessia. An overview of methods for virtual social network analysis, Computational Social Network Analysis: Trends, Tools and Research Advances, Springer, pp. 3-25 (2009).
[8] N. Sun, J. G. Morris, J. Xu, X. Zhu, M. Xie. iCARE: A framework for big data-based banking customer analytics, IBM Journal of Research and Development, Volume 58, Issue 5/6 (2014).
[9] M. Hurley and J. Adebayo. Credit scoring in the era of big data, Yale Journal of Law and Technology, Vol. 18, Iss. 1, Article 5 (2016).
[10] Y. Wei, P. Yildirim, C. Van and C. Dellarocas. Credit Scoring with Social Network Data, Marketing Science, Vol. 35, No. 2, pp. 234-258 (2016).
[11] Y. Yang, J. Gu and Z. Zhou. Credit risk evaluation based on social media, Procedia Computer Science, Elsevier, Volume 55, pp. 725-731 (2015).
[12] D. Ntwiga and P. Weke. Consumer lending using social media data, International Journal of Scientific Research and Innovative Technology, Vol. 3, No. 2 (2016).
[13] Y. Zhang, H. Jia, Y. Diao, M. Hai and H. Li. Research on Credit Scoring by fusing social media information in Online Peer-to-Peer Lending, Procedia Computer Science 91, pp. 168-174 (2016).
[14] D. Björkegren, D. Grissen. Behavior Revealed in Mobile Phone Usage Predicts Loan Repayment, Entrepreneurial Finance Lab, Brown University (2015).
[15] T. Asha, U. Shravanthi, N. Nagashree, M. Monika. Building Machine Learning Algorithms on Hadoop for Bigdata, International Journal of Engineering and Technology, Volume 3, No. 2 (2013).
[16] E. Fortuny, D. Martens, F. Provost. Predictive Modeling With Big Data: Is Bigger Really Better?, Big Data, Mary Ann Liebert, Inc., Vol. 1, No. 4, doi:10.1089/big.2013.0037 (2013).
[17] L. Wang, C. Alexander. Machine Learning in Big Data, International Journal of Mathematical, Engineering and Management Sciences, Vol. 1, No. 2, pp. 52-61 (2016).
[18] A. L'Heureux, K. Grolinger, H. ElYamany, M. Capretz. Machine Learning with Big Data: Challenges and Approaches, IEEE Access, DOI 10.1109/ACCESS.2017.2696365 (2017).
[19] F. Louzada, A. Ara, G. Fernandes. Classification methods applied to credit scoring: A systematic review and overall comparison, Surveys in Operations Research and Management Science, Elsevier (2016).
[20] H. Abdou, J. Pointon. Credit scoring, statistical techniques and evaluation criteria: A review of the literature, Intelligent Systems in Accounting, Finance and Management, 18(2-3), pp. 59-88 (2011).
[21] S. Jun, S. Lee and J. Ryu. A Divided Regression Analysis for Big Data, International Journal of Software Engineering and Its Applications, Vol. 9, No. 5, pp. 21-32 (2015).
[22] A. Bahnsen, D. Aouada and B. Ottersten. Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring, 13th International Conference on Machine Learning and Applications, IEEE (2014).
[23] H. Nguyen. Default Predictors in Credit Scoring: Evidence from France's Retail Banking Institution, Journal of Credit Risk, Vol. 11, No. 2, pp. 41-66 (2015).
[24] Y. Jiang. Credit Scoring Model Based on the Decision Tree and the Simulated Annealing Algorithm, Second International Conference on Computer Modeling and Simulation (2009).
[25] K. Leung, F. Cheong, C. Cheong. Consumer Credit Scoring using an Artificial Immune System Algorithm, World Congress on Computer Science and Information Engineering, 1-4244-1340-0/07, IEEE (2008).
[26] Y. Bengio. Learning deep architectures for AI, Foundations and Trends in Machine Learning, 2(1), pp. 1-127 (2009).
[27] V. Ha, H. Nguyen. Credit scoring with a feature selection approach based on deep learning, MATEC Web of Conferences 54, 05004, MIMT (2016).
[28] A. Niimi. Deep Learning for Credit Card Data Analysis, World Congress on Internet Security (WorldCIS-2015).
[29] W. Chen, C. Ma, L. Ma. Mining the customer credit using hybrid support vector machine technique, Expert Systems with Applications, 36(4), pp. 7611-7616 (2009).
[30] L. Chen, T. Chiou. A fuzzy credit-rating approach for commercial loans: a Taiwan case, OMEGA - International Journal of Management Science, 27, pp. 407-419 (1999).
[31] U. Farouk, J. Panford, J. Hayfron-Acquah. Fuzzy Logic Approach to Credit Scoring for Micro Finances in Ghana, International Journal of Computer Applications, Volume 94, No. 8 (2014).
[32] A. Lahsasna, R. Ainon, T. Wah. Credit Scoring Models Using Soft Computing Methods: A Survey, The International Arab Journal of Information Technology, Vol. 7, No. 2 (2010).
[33] S. Mammadli. Fuzzy logic based loan evaluation system, 12th International Conference on Application of Fuzzy Systems and Soft Computing, ICAFS (2016).
[34] S. Finlay. Are we modelling the right thing? The impact of incorrect problem specification in credit scoring, Expert Systems with Applications, Volume 36, Issue 5, pp. 9065-9071 (2008).
[35] B. Chen, W. Zeng, Y. Lin. Applications of Artificial Intelligence Technologies in Credit Scoring: a Survey of Literature, 10th International Conference on Natural Computation, IEEE (2014).
[36] C. Leong. Credit Risk Scoring with Bayesian Network Models, Computational Economics, Springer, Volume 47, Issue 3, pp. 423-446 (2015).
[37] C. Tsai, M. Chen. Credit rating by hybrid machine learning techniques, Applied Soft Computing, 10, pp. 374-380 (2010).
[38] A. Fensterstock, J. Salters, R. Willging. On the Use of Ensemble Models for Credit Evaluation, The Credit and Financial Management Review (2013).
[39] L. Breiman. Random forests, Machine Learning, 45, pp. 5-32 (2001).
[40] G. Wang, J. Ma, L. Huang, K. Xu. Two credit scoring models based on dual strategy ensemble trees, Knowledge-Based Systems, 26, pp. 61-68 (2012).
[41] M. Malekipirbazari, V. Aksakalli. Risk assessment in social lending via random forests, Expert Systems with Applications, 42, pp. 4621-4631 (2015).
[42] G. Kapil, A. Agrawal and R. Khan. A Study of Big Data Characteristics, International Conference on Communication and Electronics Systems (ICCES) (2016).
[43] G. Bello-Orgaz, J. Jung, D. Camacho. Social big data: Recent achievements and new challenges, Information Fusion, pp. 1-15 (2015).
[44] Z. Zhou, N. Chawla, Y. Jin, G. Williams. Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives, IEEE Computational Intelligence Magazine (2013).
[45] A. Verma, A. Mansuri, N. Jain. Big Data Management Processing with Hadoop MapReduce and Spark Technology: A Comparison, Symposium on Colossal Data Analysis and Networking (CDAN) (2016).
[46] J. Wang, D. Crawl, S. Purawat, M. Nguyen, I. Altintas. Big Data Provenance: Challenges, State of the Art and Opportunities, IEEE International Conference on Big Data (2015).
[47] A. Priyadarshini, S. Agarwal. A MapReduce based Support Vector Machine for Big Data Classification (2015).
