Credit Scoring in The Age of Big Data A
3 authors:
Houda Anoun
Instituto Politécnico de Lisboa
All content following this page was uploaded by Youssef Tounsi on 10 April 2020.
Abstract - The banking sector has become very competitive and increasingly sensitive to political and economic circumstances in each country and around the world. In addition to the traditional strategy of reducing expenses and increasing profits, many banks are looking for new methods to reduce credit risk in order to improve their performance. To resolve the most common problem, which is the lack of data, several studies have turned to new data sources such as social networks, which are used by all kinds of users and particularly by the young population. These new sources store data in large quantities and in non-traditional formats, hence the need for new processing methods. Big Data techniques make it possible to store large volumes of structured, semi-structured and unstructured data, and provide many solutions for mining these data to extract relevant information.
This paper has multiple goals. The first and main goal is to examine the role of big data in predicting the creditworthiness of consumers. The second is to explore the main machine learning methods used in credit scoring. The third is to investigate what type of data is relevant in determining consumer creditworthiness. The last is to determine how various inherent characteristics of big data (volume, velocity, variety, variability and complexity) are related to the assessment of credit risk.
To address these topics, we have conducted a detailed and careful study of several research works in the field. We expect our work to be useful to academics and professionals in the field of finance, and especially to microfinance organizations whose main activity is to grant microcredits to people with limited incomes. These people are numerous in Morocco and often do not even have bank accounts.

Index Terms - Credit Scoring, Big Data, Machine Learning Techniques, Social data.

I. INTRODUCTION
Credit risk assessment was first established in the banking sector in the mid-twentieth century. Some banks now use more than one type of score [1] [2]: application (credit) scoring, propensity scoring, behavioral scoring, collection scoring, recovery scoring, attrition scoring, fraud detection, etc.
Usual credit scores are typically constructed using data extracted from traditional transactional systems such as OLTP (online transaction processing), Core Banking, ERP (enterprise resource planning) and CRM (customer relationship management) applications, and from credit bureaus. The basic consumer data on which these systems operate is very conventional, such as birthdate, gender, income, employment status, etc. All this data is stored in relational databases or data warehouses. Nevertheless, this history is not rich enough in most countries, even in developed regions, and especially for young customers, given that this category tends to be riskier and does not have sufficient credit history [3]. Therefore, other sources of data such as “Big Data” are necessary to bring value to the performance of consumer scoring systems.
One of the most important lessons learned from the recent global financial crises is that the information technology and data architectures of financial institutions are inadequate to support the overall management of financial risks. In this context, the Basel Committee on Banking Supervision published in 2013 a set of principles under the name BCBS 239, whose objective is to enable banks to improve their production capacities and the reliability of regulatory reporting. BCBS 239 mandates that banks adhere to a set of core principles for effective Risk Data Aggregation and Risk Reporting (RDARR) practices. As a result, systems integrators, big data firms and business consultants actively help banks prepare their processes and IT infrastructures for compliance. Improving risk assessment involves using more information to construct a complete and relevant client profile. Furthermore, this paper explores the possibilities of overcoming the hurdles encountered by credit scoring systems through the use of Big Data.
The remainder of this article is organized as follows. We present some related work in Section II. In Section III, we present the main classification methods in classical credit scoring. In Section IV, we summarize principles and insights of big data for credit scoring. Then we present credit scoring using non-traditional data. Finally, we end with a conclusion and discuss possible future working directions in Section VI.

II. RELATED WORK
Many research works have tried to predict credit risk using wide-ranging computing techniques [4], but there is not much academic literature covering big data for credit scoring [5] [6]. However, a number of papers analyze big data analytics in the banking sector in general terms [7] [8], and other works study social networking data from the sociological point of view. In fact, only a few of the available studies have mentioned the impact of offline and online customer information on credit scoring. In addition, no paper presents the state of the art of credit scoring based on massive volumes of data.
A. Masyutin [6] clarifies that credit history is not rich enough for young clients, but that social data exploited with big data techniques can bring value to the performance of scoring systems. Moreover, Y. Wei, P. Yildirim, C. Van and C. Dellarocas [10] highlight that credit
134 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 7, July 2017
scoring using social network data can reduce lenders’ misgivings about engaging applicants with limited personal financial history. Further, G. Guo, F. Zhu, E. Chen, Q. Liu, L. Wu and C. Guan [3] show, through a quantitative investigation of extracted features, that online social media data does have good potential in discriminating good credit users from bad ones.
Likewise, Y. Yang, J. Gu and Z. Zhou [11] analyze the opinions about some enterprises transmitted through social media to predict their future credit risk. Additionally, D. Ntwiga and P. Weke [12] present the limitations of traditional consumer lending models due to their use of historical data, and examine the benefits that could arise from incorporating social media data in the credit scoring process for consumer lending. In another empirical analysis, Y. Zhang, H. Jia, Y. Diao, M. Hai and H. Li [13] construct a credit scoring model for online Peer-to-Peer (P2P) lending by fusing social media information using a decision tree; their results show that the model has good classification accuracy.
Moreover, D. Björkegren and D. Grissen [14] establish a method to predict default among borrowers without formal financial histories, using behavioral patterns revealed by mobile phone usage. Additionally, N. Kshetri [4] indicates that the main reason why low-income families and micro-enterprises in emerging economies such as China lack access to financial services is not that creditworthiness does not exist, but merely that banks and financial institutions lack data. In another study, M. Hurley and J. Adebayo [9] explore the problems posed by big data credit scoring tools and analyze the gaps in existing laws of the United States of America.
The amount of data is exploding at a remarkable rate because of developments in web technologies, social media and other data-generating activities (mobile devices, log files, IoT, etc.). The conclusion is that credit risk assessment can greatly benefit from using non-traditional information with big data. Nevertheless, traditional approaches struggle when faced with these massive data [15].
Many studies have tried to address the challenges and opportunities of big data in general terms. E. Fortuny, D. Martens and F. Provost [16] provide a clear illustration that larger data can indeed be a more valuable asset for predictive analytics. This implies that institutions with larger data assets, plus the skill to take advantage of them, can potentially obtain a substantial competitive advantage over institutions without such access or skill. Furthermore, L. Wang and C. Alexander [17] compare several machine learning algorithms and evaluate big data technologies. Additionally, A. L’Heureux, K. Grolinger, H. ElYamany and M. Capretz [18] address machine learning challenges in the era of big data with the ultimate objective of helping practitioners select appropriate solutions for their use cases. Few existing studies have mentioned the effect of using big data for credit scoring. Our contribution consists in reviewing these hurdles and their solutions in the case of credit scoring (Section IV).
In view of these elements, which credit scoring models will benefit most from the advantages of big data in terms of performance? In addition, which non-traditional data are most relevant for improving credit rating? These are the questions we attempt to answer in this paper.

III. THE MAIN MACHINE LEARNING METHODS USED IN CREDIT SCORING
Machine learning (ML) is continuously unleashing its power in a wide range of applications. Credit scoring using predictive analytics techniques is one of these applications. ML has been pushed to the front line in recent years, partly due to the rise of big data. It can be broadly categorized based on two factors (Figure 1):
• Learning types: This relates to the type of response variable (supervised, unsupervised, semi-supervised, and reinforcement learning).
• Subjective grouping: This grouping is driven by “what” the model is trying to achieve.

Fig. 1 Machine Learning Categories.

Supervised learning algorithms are a subcategory of the family of machine learning algorithms and are mainly used in predictive modeling. These algorithms try to model the relationships between the target prediction output and the input features, based on the dependencies learned from historical data sets. The main types of supervised learning algorithms are:
• Regression: the output variable takes continuous values.
• Classification: the output variable takes class labels.

The aim of a credit scoring model is to perform a classification: to distinguish the “good” applicants from the “bad” ones. In practice, this means the statistical model is required to find the separating line distinguishing the two categories in the space of the explanatory variables (age, salary, education, etc.). For the banking industry, the scoring methodology has played an important role in developing internal rating systems. Several reasons explain the widespread use of scoring models. First, since credit scoring models are built upon statistical models and not on opinions, they offer an objective way to measure and manage risk. Second, the statistical model used to produce credit scores can be validated: data have to be carefully used to check the accuracy and performance of the predictions. Third, the statistical models used for credit scores can be improved over time as additional data are collected.
The scoring system is made up of three major parts:
• Problem Definition: This initial phase of the project focuses on understanding the project objectives and requirements.
• Data Gathering and Preparation: This phase involves internal and external data collection and storage in traditional systems (relational databases, data warehouses, etc.) or on a big data platform, variable selection, data cleansing, data transformation if required, and data exploration.
• Model Building and Evaluation: Splitting the data into training and test sets, selecting and tuning the algorithm, building the model using the training data and evaluating the model using the test data (Figure 2).

Type I error = FP / (TN + FP)    (2)

Type II error = FN / (TP + FN)    (3)

TP stands for a good applicant correctly classified as good, TN stands for a bad applicant correctly classified as bad, FN (Type II) stands for a good applicant incorrectly classified as bad, and FP (Type I) stands for a bad applicant incorrectly classified as good (high risk).
There are several statistical methods for building and estimating scoring models, including linear regression models, logit models, probit models, and neural networks. We introduce them in the remainder of this section.
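As an illustration of equations (2) and (3), the following minimal Python sketch counts TP, TN, FP and FN from two lists of labels and derives both error rates. The names `ConfusionCounts`, `count_outcomes`, `type_i_error` and `type_ii_error`, as well as the toy labels, are hypothetical and chosen only for this example; they are not taken from the surveyed works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # good applicants classified as good
    tn: int  # bad applicants classified as bad
    fp: int  # bad applicants classified as good (Type I)
    fn: int  # good applicants classified as bad (Type II)


def count_outcomes(actual, predicted, good="good"):
    """Count the four outcomes, treating `good` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == good and p == good:
            tp += 1
        elif a != good and p != good:
            tn += 1
        elif a != good and p == good:
            fp += 1
        else:
            fn += 1
    return ConfusionCounts(tp, tn, fp, fn)


def type_i_error(c):
    """Equation (2): share of bad applicants accepted as good.

    Assumes the sample contains at least one bad applicant.
    """
    return c.fp / (c.tn + c.fp)


def type_ii_error(c):
    """Equation (3): share of good applicants rejected as bad.

    Assumes the sample contains at least one good applicant.
    """
    return c.fn / (c.tp + c.fn)


# Toy data, invented for illustration only.
actual = ["good", "good", "good", "good", "bad", "bad", "bad", "good"]
predicted = ["good", "good", "bad", "good", "good", "bad", "bad", "good"]

counts = count_outcomes(actual, predicted)
print(counts)
print("Type I error:", type_i_error(counts))
print("Type II error:", type_ii_error(counts))
```

In a lending context the two errors usually carry different costs, since a Type I error admits a high-risk applicant while a Type II error turns away a good one, so it can be useful to report them separately rather than only an overall accuracy.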
logit(p) = β0 + β1 × x1 + β2 × x2 + . . . + βn × xn (5)
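Equation (5) expresses the log-odds of the modelled outcome as a linear function of the explanatory variables. The short Python sketch below shows, under stated assumptions, how a probability p can be recovered from that linear score by inverting the logit. The coefficient values and the two input variables are invented for illustration, and `predicted_probability` is a hypothetical helper name.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Inverse of the logit (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_probability(betas, x):
    """Evaluate equation (5) and map the linear score back to p.

    betas[0] is the intercept β0; betas[1:] pair up with x1..xn.
    """
    z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return inverse_logit(z)


# Hypothetical coefficients and applicant features (assumed values).
betas = [-1.0, 0.5, -0.25]
x = [2.0, 4.0]

p = predicted_probability(betas, x)
print(round(p, 4))
print(round(logit(p), 4))  # recovers the linear score β0 + β1·x1 + β2·x2
```

In practice the coefficients would be estimated from historical data; the sketch only shows how the fitted equation is evaluated for one applicant.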
on nodes in the cluster, which are worker processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run (Figure 10).

Fig. 10 Driver program (SparkContext).

Apache Spark is based on two key concepts: Resilient Distributed Datasets (RDD) and a directed acyclic graph (DAG) execution engine. With regard to datasets, Spark supports two types of RDDs: parallelized collections, which are based on Scala collections, and Hadoop datasets, which are created from files stored on HDFS.
Spark jobs perform work on Resilient Distributed Datasets (RDD), an abstraction for a collection of elements that can be operated on in parallel (Figure 11). When Spark is running on a Hadoop cluster, RDDs are created from files in the distributed file system in any format, such as text and sequence files or anything supported by a Hadoop format.

Fig. 11 Resilient Distributed Datasets.

A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. The elements of an RDD need not exist in physical storage. RDDs support two kinds of operations: transformations and actions [45].

TABLE II
HADOOP VS SPARK
Aspects | Hadoop | Spark
Difficulty | MapReduce is difficult to program and needs abstractions. | Spark is easy to program and does not require any abstractions.
Interactive mode | There is no built-in interactive mode. | It has an interactive mode.
Streaming | Hadoop MapReduce can only process a batch of large stored data. | Spark can be used to process data in real time through Spark Streaming.
Performance | MapReduce does not leverage the memory of the Hadoop cluster to the maximum. | Spark has been said to execute batch processing jobs about 10 to 100 times faster than Hadoop MapReduce.
Latency | MapReduce is completely disk oriented. | Spark ensures lower-latency computations by caching partial results across the memory of its distributed workers.
Ease of coding | Writing Hadoop MapReduce pipelines is a complex and lengthy process. | Writing Spark code is always more compact.
Supported languages | Java | Java, Python, R, Scala

Supported by the rise of “Big Data” platforms such as Apache Hadoop and Spark, banks are collecting and analyzing ever larger datasets. Thus, Big Data presents opportunities and challenges for credit scoring as an application of predictive modeling [44]. First, regarding volume, many studies continue to see increasing predictive performance from more data as the datasets become massive [44] [45]. Another benefit of big data to machine learning lies in the fact that, with more samples available for learning, the risk of overfitting becomes smaller. Variety is equally interesting: a fast flood of unstructured data, such as social media, image, audio and video, in addition to structured data, is providing novel openings for credit scoring, especially for populations whose data are missing. Velocity implies that data are streaming at rates faster than can be handled by traditional systems. Veracity suggests that, despite the data being available, the quality of data is still a major concern. On the other hand, there are several bottlenecks in designing a credit scoring system based on big data, such as [46]:
• Time complexity: Finishing the computation within an acceptable time on a single computer is very hard.
• Memory restrictions: It is difficult to keep the complete training data set, or most of it, in memory on one computer.
Indeed, SVM for example is an extremely powerful and widely accepted classifier in the field of risk assessment due to its better generalization capability. However, SVM is not suitable for large-scale datasets due to its high computational complexity [47].
The following table presents a summary of the challenges and some proposed solutions for credit scoring with big data [18]:

TABLE III
BIG DATA CHALLENGE AND SOME PROPOSED SOLUTIONS
Challenges | Examples / explanation | Some proposed solutions
Volume: Processing Performance | The SVM algorithm has a training time complexity of O(m^3) and a space complexity of O(m^2). | Resilient distributed datasets (RDDs)
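To give a feel for the time and memory bottlenecks quoted above for SVM, the following back-of-the-envelope Python sketch turns the O(m^2) and O(m^3) orders of growth into rough numbers. It rests on assumptions that are not in the source: a dense m × m matrix of 8-byte floating-point entries stands in for the O(m^2) space requirement, and training cost is taken to scale exactly with the cube of m. The function names are hypothetical.

```python
def kernel_matrix_bytes(m, bytes_per_entry=8):
    """Memory for a dense m x m matrix (assumed 8-byte entries)."""
    return m * m * bytes_per_entry


def relative_training_cost(m, baseline_m):
    """Cost ratio versus a baseline, assuming exact cubic scaling."""
    return (m / baseline_m) ** 3


# Compare a 10,000-sample and a 100,000-sample training set.
small, large = 10_000, 100_000

for m in (small, large):
    gib = kernel_matrix_bytes(m) / 2**30
    print(f"m = {m:>7}: about {gib:.2f} GiB for the dense matrix")

print("training cost ratio:", relative_training_cost(large, small))
```

Under these assumptions, a tenfold increase in training samples multiplies the memory requirement by one hundred and the training cost by one thousand, which is consistent with the remark that a single computer quickly becomes insufficient.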
TABLE IV
NON-TRADITIONAL DATA
Authors | Type of loan / ML method | Social network / Region | Data
A. Masyutin [6] | Personal Credit / Logistic Regression | Vkontakte / Russia | Age, Gender, Marital status, Number of days since last visit, Number of subscriptions, Number of days since the first post, Number of user’s posts with photos, Number of user’s posts with video, Number of children, Major things in life and Major qualities in people
Y. Wei, P. Yildirim, C. Van and C. Dellarocas [10] | Personal Credit | | Ties among customers
G. Guo, F. Zhu, E. Chen, Q. Liu, L. Wu and C. Guan [3] | Personal Credit / Decision Tree, Naive Bayes, Logistic Regression and SVM | Weibo / China | Demographic features, Tweet features, Network features, High-Level features
Y. Yang, J. Gu and Z. Zhou [11] | Enterprise Credit / Logistic and Probit Regression | Hexun.com and Finance.sina.com.cn / China | Financial status: Worries, predictions, explanations, etc. Operation: News, technologies, strategy changes, etc. Executive: Risk attitude, experience, etc. Marketing: Prospect, competitors, upstream and downstream, etc. Major Issues: Mergers, restructuring, major investment, etc.
D. Ntwiga and P. Weke [12] | Personal Credit / Linear regression, Logistic regression, Neural networks and Genetic Algorithm | Kenya | Ntwiga (2016): Age, Gender, Trust, Interactions, Risk factor, Sociability, Relationship strength, Private data and Return on private data
Y. Zhang, H. Jia, Y. Diao, M. Hai and H. Li [13] | Online Peer-to-Peer (P2P) lending / Decision Tree | PPDai / China | Membership score, prestige, forum currency, contribution and group
D. Björkegren and D. Grissen [14] | Personal Credit | Caribbean country | Mobile phone use preceding loan. Weekly: Calls out (number), Calls out (minutes), SMS sent, Data use, Top Ups, Spend, Balance and Days of mobile phone data preceding loan. Features derived from phone usage: Variation in usage, Periodicity of usage and Mobility
M. Hurley and J. Adebayo [9] | Personal Credit | USA | Media sites, browser activity, blogs, retail information, Internet Service Provider (ISP) address and other data points obtained from public platforms

Information posted or broadcast on social networks may be inaccurate or misleading. For example, a post from an unhappy customer may be inaccurate and may not be indicative of a company's success or its solvency. As a result, social media data must not only be predictive of the solvency of the applicant; its reliability also needs to be confirmed. Whether the use of such information in a credit decision is appropriate must be established by the bank. Failure to use the correct information to make a credit decision can lead to problems of security and reliability, which the bank needs to take into consideration [3] [8] [9].

VI. CONCLUSIONS AND OPEN PROBLEMS

We have presented a survey of credit scoring using non-traditional data in the era of big data. When credit history (which usually serves as input data in classical credit scoring) does not exist or is not rich enough, as is the case for young clients, many academics and professionals find that social data can improve the performance of scoring systems. Nowadays, credit scoring using big data is emerging as a way to ensure greater efficiency in underwriting while expanding access to the underbanked and to historically neglected groups. There is not much academic work covering the use of non-traditional data in credit scoring with big data. However, there are some firms and start-ups whose domain is online/offline data retrieval, aggregation and customer analytics for credit organizations: Wonga, Kreditech, Big Data Scoring, Lenddo, SOCSCOR, Crediograph, etc. [6].
We then summarized the principles and insights of big data for credit scoring. Machine Learning is at the core of data analysis in credit scoring. The accuracy of these algorithms depends on the size and the relevance of the data. One of the challenges of “big data credit scoring” is the need for scalable implementations of Machine Learning algorithms on very large data sets. Combining Machine Learning with distributed execution on Hadoop MapReduce or Spark gives the best results in terms of time efficiency. Finally, the open discussion at the end can help researchers gain a more general understanding of the use of big data in credit scoring and motivate them to get involved and contribute. We hope that this survey will pave the way for subsequent research in big data for credit scoring. There is every reason to believe that adapting traditional credit scoring to Big Data could benefit banks substantially.
REFERENCES

[1] S. Sadatrasoul, M. Gholamian, M. Siami, Z. Hajimohammadi. Credit scoring in banks and financial institutions via data mining techniques, Journal of AI and Data Mining, Article 12, Volume 1, Issue 2, Summer and Autumn 2013, Page 119-129 (2013).
[2] S. Tuffery. Data mining and statistics for decision making, John Wiley & Sons, Ltd (2011).
[3] G. Guo, F. Zhu, E. Chen, Q. Liu, L. Wu, C. Guan. From Footprint to Evidence: An Exploratory Study of Mining Social Data for Credit Scoring, ACM Transactions on the Web, Vol. 10, No. 4, Article 22 (2016).
[4] N. Kshetri. Big data’s role in expanding access to financial services in China. International Journal of Information Management, 297-308 (2016).
[5] L. Devasena. Proficiency comparison of ladtree and reptree classifiers for credit risk forecast, International Journal on Computational Sciences & Applications (IJCSA) Vol. 5, No. 1 (2015).
[6] A. Masyutin. Credit scoring based on social network data. Business Informatics, No. 3 (33), pp. 15-23 (2015).
[7] D. Andrea, A. Alessia. An overview of methods for virtual social network analysis, Computational Social Network Analysis: Trends, Tools and Research Advances. Springer. P. 3–25 (2009).
[8] N. Sun, J. G. Morris, J. Xu, X. Zhu, M. Xie. iCARE: A framework for big data-based banking customer analytics, IBM Journal of Research and Development (Volume: 58, Issue: 5/6, Sept.-Nov) (2014).
[9] M. Hurley and J. Adebayo. Credit scoring in the era of big data, Yale Journal of Law and Technology: Vol. 18, Iss. 1, Article 5 (2016).
[10] Y. Wei, P. Yildirim, C. Van and C. Dellarocas. Credit Scoring with Social Network Data, Marketing Science Vol. 35, No. 2, pp. 234–258 ISSN 0732-2399 ISSN 1526-548X (2016).
[11] Y. Yang, J. Gu and Z. Zhou. Credit risk evaluation based on social media, Procedia Computer Science, Elsevier Volume 55, 2015, Pages 725-731 (2015).
[12] D. Ntwiga and P. Weke. Consumer lending using social media data, International Journal of Scientific Research and Innovative Technology ISSN: 2313-3759 Vol. 3 No. 2 (2016).
[13] Y. Zhang, H. Jia, Y. Diao, M. Hai and H. Li. Research on Credit Scoring by fusing social media information in Online Peer-to-Peer Lending, Procedia Computer Science 91, 168–174 (2016).
[14] D. Björkegren, D. Grissen. Behavior Revealed in Mobile Phone Usage Predicts Loan Repayment, Entrepreneurial Finance Lab, Brown University (2015).
[15] T. Asha, U. Shravanthi, N. Nagashree, M. Monika. Building Machine Learning Algorithms on Hadoop for Bigdata, International Journal of Engineering and Technology Volume 3 No. 2 (2013).
[16] E. Fortuny, D. Martens, F. Provost. Predictive Modeling With Big Data: Is Bigger Really Better?, Big Data Mary Ann Liebert, Inc doi:10.1089/big.2013.0037, VOL. 1 NO. 4 (2013).
[17] L. Wang, C. Alexander. Machine Learning in Big Data, International Journal of Mathematical, Engineering and Management Sciences, Vol. 1, No. 2, 52-61 (2016).
[18] A. L’Heureux, K. Grolinger, H. ElYamany, M. Capretz. Machine Learning with Big Data: Challenges and Approaches. NSERC CRD at Western University (CRD 477530-14), DOI 10.1109/ACCESS.2017.2696365, IEEE (2017).
[19] F. Louzada, A. Ara, G. Fernandes. Classification methods applied to credit scoring: A systematic review and overall comparison, Surveys in Operations Research and Management Science, Elsevier (2016).
[20] H. Abdou, J. Pointon. Credit scoring, statistical techniques and evaluation criteria: A review of the literature, Intelligent Systems in Accounting, Finance and Management, 18(2-3), 59-88 (2011).
[21] S. Jun, S. Lee and J. Ryu. A Divided Regression Analysis for Big Data, International Journal of Software Engineering and Its Applications Vol. 9, No. 5, pp. 21-32 (2015).
[22] A. Bahnsen, D. Aouada and B. Ottersten. Example Dependent Cost-Sensitive Logistic Regression for Credit Scoring, 13th International Conference on Machine Learning and Applications, 978-1-4799-7415-3/14 IEEE (2014).
[23] H. Nguyen. Default Predictors in Credit Scoring: Evidence from France's Retail Banking Institution, Journal of Credit Risk, Vol. 11, No. 2, Pages 41–66 (2015).
[24] Y. Jiang. Credit Scoring Model Based on the Decision Tree and the Simulated Annealing Algorithm, Second International Conference on Computer Modeling and Simulation (2009).
[25] K. Leung, F. Cheong, C. Cheong. Consumer Credit Scoring using an Artificial Immune System Algorithm, 1-4244-1340-0/07 IEEE, World Congress on Computer Science and Information Engineering (2008).
[26] Y. Bengio. Learning deep architectures for AI, in Foundations and Trends in Machine Learning, 2(1):1–127 (2009).
[27] V. Ha, H. Nguyen. Credit scoring with a feature selection approach based deep learning. MATEC Web of Conferences 54, 05004, MIMT (2016).
[28] A. Niimi. Deep Learning for Credit Card Data Analysis, World Congress on Internet Security (WorldCIS-2015).
[29] W. Chen, C. Ma, L. Ma. Mining the customer credit using hybrid support vector machine technique. Expert Systems with Applications 36 (4), 7611–7616 (2009).
[30] L. Chen, T. Chiou. A fuzzy credit-rating approach for commercial loans: a Taiwan case, OMEGA - International Journal of Management Science, 27, 407-419 (1999).
[31] U. Farouk, J. Panford, J. Hayfron-Acquah. Fuzzy Logic Approach to Credit Scoring for MicroFinances in Ghana, International Journal of Computer Applications (0975 – 8887) Volume 94 – No. 8 (2014).
[32] A. Lahsasna, R. Ainon, T. Wah. Credit Scoring Models Using Soft Computing Methods: A Survey, The International Arab Journal of Information Technology, Vol. 7, No. 2 (2010).
[33] S. Mammadli. Fuzzy logic based loan evaluation system, 12th International Conference on Application of Fuzzy Systems and Soft Computing, ICAFS (2016).
[34] S. Finlay. Are we modelling the right thing? The impact of incorrect problem specification in credit scoring, Expert Systems with Applications Volume 36, Issue 5, Pages 9065–9071 (2008).
[35] B. Chen, W. Zeng, Y. Lin. Applications of Artificial Intelligence Technologies in Credit Scoring: a Survey of Literature, 2014 10th International Conference on Natural Computation. IEEE 978-1-4799-5151-2/14 (2014).
[36] C. Leong. Credit Risk Scoring with Bayesian Network Models, Springer Science Business Media New York, Comput Econ, Volume 47, Issue 3, pp 423–446 (2015).
[37] C. Tsai, M. Chen. Credit rating by hybrid machine learning techniques, Applied Soft Computing 10, 374–380 (2010).
[38] A. Fensterstock, J. Salters, R. Willging. On the Use of Ensemble Models for Credit Evaluation, The Credit and Financial Management Review (2013).
[39] L. Breiman. Random forests. Machine Learning, Statistics Department University of California Berkeley, CA 94720, 45, 5–32 (2001).
[40] G. Wang, J. Ma, L. Huang, K. Xu. Two credit scoring models based on dual strategy ensemble trees, Knowledge-Based Systems 26, 61–68 (2012).
[41] M. Malekipirbazari, V. Aksakalli. Risk assessment in social lending via random forests, Expert Systems with Applications 42, 4621–4631 (2015).
[42] G. Kapil, A. Agrawal and R. Khan. A Study of Big Data Characteristics, Communication and Electronics Systems (ICCES) International Conference (2016).
[43] G. Bello-Orgaz, J. Jung, D. Camacho. Social big data: Recent achievements and new challenges, Information Fusion 000 (2015) 1–15 (2015).
[44] Z. Zhou, N. Chawla, Y. Jin, G. Williams. Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives, IEEE Computational Intelligence Magazine (2013).
[45] A. Verma, A. Mansuri, N. Jain. Big Data Management Processing with Hadoop MapReduce and Spark Technology: A Comparison, Symposium on Colossal Data Analysis and Networking (CDAN) (2016).
[46] J. Wang, D. Crawl, S. Purawat, M. Nguyen, I. Altintas. Big Data Provenance: Challenges, State of the Art and Opportunities, IEEE International Conference on Big Data (2015).
[47] A. Priyadarshini, S. Agarwal. A Map Reduce based Support Vector Machine for Big Data Classification (2015).