CN111738819A - Method, device and equipment for screening characterization data - Google Patents
Method, device and equipment for screening characterization data
- Publication number
- CN111738819A CN202010540728.9A
- Authority
- CN
- China
- Prior art keywords
- variable
- variables
- initial
- user
- derived
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The application provides a method, a device and equipment for screening characterization data, wherein the method comprises the following steps: acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; performing data processing on initial variables in the initial variable set to obtain a derivative variable set; performing univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether a user is a positive sample user or not in the derived variables. In the embodiment of the application, the variable which can effectively represent the positive sample user can be efficiently and conveniently screened from the high-dimensional data by utilizing the univariate analysis and the random forest algorithm, and the accuracy of user evaluation can be further improved.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, and a device for screening characterization data.
Background
Existing (stock) users are an important user resource for a bank. It is particularly important to manage them scientifically and to provide targeted, personalized services to high-quality existing users while keeping risk under control.
In the prior art, a bank system provides only functional, process-level services for screening high-quality users. Business personnel must rely on their own subjective experience to judge whether a strong causal relationship exists between a user variable and whether the user is a high-quality user, and in this way select the variables used to evaluate users. They then evaluate each user comprehensively according to the selected variables to determine whether targeted personalized services should be provided. Because this variable screening depends on the subjective experience of business personnel, it is highly arbitrary and easily affected by the uncertainty of human subjective factors, so the accuracy of the screening result is low. Furthermore, the prior-art approach is inefficient: only a few variables can be analyzed and selected, and the selected variables may include redundant or irrelevant ones, which lowers evaluation accuracy. Therefore, the prior art cannot efficiently and accurately screen, from high-dimensional data, the variables that effectively characterize a user.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the application provides a method, a device and equipment for screening characterization data, and aims to solve the problem that effective variables cannot be efficiently and accurately screened from high-dimensional data in the prior art.
The embodiment of the application provides a method for screening characterization data, which comprises the following steps: acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; performing data processing on the initial variables in the initial variable set to obtain a derivative variable set; wherein the derivative variable set is a set of new variables derived from the initial variables; performing univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; wherein the univariate analysis is used to determine the characterization capability of a single variable; and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether a user is a positive sample user or not in the derived variables.
In one embodiment, obtaining an initial set of variables comprises: acquiring a first sample data set in a preset time period; wherein the first sample data set comprises data of a plurality of positive sample users and a plurality of negative sample users within the preset time period; performing data cleaning on the first sample data set to obtain a second sample data set; extracting values of a plurality of initial variables corresponding to each positive sample user and each negative sample user in the second sample data set; and generating the initial variable set according to the values of a plurality of initial variables corresponding to each positive sample user and each negative sample user.
In one embodiment, the data processing of the initial variables in the initial variable set comprises: counting, summing, averaging and date compression performed on the values of the initial variables of the users in the initial variable set.
In one embodiment, the screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable includes: determining the importance value of each derived variable in the screened derived variable set by using the random forest algorithm; performing descending order arrangement on the importance values of the derivative variables to obtain descending order arrangement results; according to the descending order arrangement result, carrying out descending order accumulation on the importance values of the derivative variables; and taking the derivative variable corresponding to the importance value obtained by accumulation in a preset range as a characteristic variable.
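The descending sort and accumulation described above can be illustrated with a minimal Python sketch. The importance values here are hypothetical; in practice they might come from a random forest implementation such as the `feature_importances_` attribute of scikit-learn's `RandomForestClassifier`. The function name, the variable names, and the reading of the "preset range" as a cumulative share of total importance are all assumptions made for illustration.

```python
def select_by_cumulative_importance(importances, cumulative_cutoff=0.9):
    """Keep top-ranked variables until their cumulative importance share
    first reaches the cutoff.

    importances: dict mapping variable name -> importance value.
    cumulative_cutoff: hypothetical upper bound of the "preset range",
    expressed as a share of the total importance.
    """
    total = sum(importances.values())
    if total <= 0:
        return []
    # Arrange the importance values in descending order.
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    running = 0.0
    for name, value in ranked:
        running += value  # descending-order accumulation
        selected.append(name)
        if running / total >= cumulative_cutoff:
            break
    return selected


# Hypothetical importance values for four derived variables.
example_importances = {
    "overdue_count": 50,
    "limit_usage": 30,
    "tenure_years": 15,
    "transfer_freq": 5,
}
characteristic_variables = select_by_cumulative_importance(example_importances, 0.75)
```

With these example values, the first two variables together account for 80% of the total importance, so they are the ones retained at a 0.75 cutoff.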
In one embodiment, the performing univariate analysis on each derived variable in the derived variable set to obtain a filtered derived variable set includes: analyzing and determining the stability of each derivative variable in the derivative variable set according to a preset time frequency by utilizing a population stability index; removing the derived variables with the stability smaller than a first preset threshold value from the derived variable set to obtain a first variable set; calculating the information value of each variable in the first variable set; removing variables with information values smaller than a second preset threshold value from the first variable set to obtain a second variable set; determining the correlation between the variables in the second variable set by using the correlation coefficient; removing the variable with a lower information value from the second variable set in the two variables with the correlation greater than or equal to a third preset threshold value to obtain a third variable set; and taking the third variable set as the derived variable set after screening.
In one embodiment, after obtaining the at least one characteristic variable, the method further includes: performing a binning (box separation) operation on the at least one characteristic variable to obtain a binning structure of each characteristic variable, wherein the binning structure is used for representing the evaluation standard of the characteristic variable; and according to the binning structure of each characteristic variable, carrying out score distribution on each characteristic variable to obtain a target scoring model, wherein the target scoring model is used for scoring the target user according to the input data of the target user.
In one embodiment, the initial variables in the initial set of variables include at least one of: user basic information, user address, user score, user rank, loan contracts, loan accounts, loan account transaction flow, loan payout flow, debit card contracts, debit card accounts, debit card account flow, asset management scale (AUM); the derived variables in the set of derived variables include at least one of: user seniority data, investment and financing product holding type, amount, quantity, holding duration, historical loan behavior, credit granting change trend, loan use condition, loan overdue data and economic development degree of province and city where the historical loan signing institution is located.
The embodiment of the present application further provides a device for screening characterization data, including: the acquisition module is used for acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; the data processing module is used for carrying out data processing on the initial variables in the initial variable set to obtain a derivative variable set; wherein the derivative variable set is a set of new variables derived from the initial variables; the univariate analysis module is used for performing univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; wherein the univariate analysis is used to determine the characterization capability of a single variable; and the variable screening module is used for screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether a user is a positive sample user in the derived variables.
The embodiment of the application also provides characterization data screening equipment, which comprises a processor and a memory for storing processor executable instructions, wherein the processor executes the instructions to realize the steps of the characterization data screening method.
The embodiment of the application also provides a computer readable storage medium, which stores computer instructions, and the instructions realize the steps of the characterization data screening method when executed.
The embodiment of the application provides a method for screening characterization data, which can be used for screening the characterization data by acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users. The initial variables in the initial variable set can be subjected to data processing to obtain a derivative variable set. Furthermore, univariate analysis can be performed on each derivative variable in the derivative variable set to obtain the screened derivative variable set. And screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether the user is a positive sample user in the derived variables. By utilizing univariate analysis and a random forest algorithm, effective characteristic variables for representing whether the user is a positive sample user can be extracted from high-dimensional data efficiently and conveniently, and the accuracy of user evaluation can be further improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, are incorporated in and constitute a part of this application, and are not intended to limit the application. In the drawings:
FIG. 1 is a schematic illustration of steps of a method for screening characterization data provided in accordance with an embodiment of the present application;
FIG. 2 is a schematic diagram of an approval process for a K loan application according to an embodiment of the application;
FIG. 3 is a schematic structural diagram of a device for screening characterization data provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of equipment for screening characterization data according to an embodiment of the present application.
Detailed Description
The principles and spirit of the present application will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present application, and are not intended to limit the scope of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present application may be embodied as a system, apparatus, device, method or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
Although the flow described below includes operations that occur in a particular order, it should be appreciated that the processes may include more or fewer operations, which may be performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).
Referring to fig. 1, the present embodiment can provide a method for screening characterization data. The characterization data screening method can be used for screening effective variables which can be used for characterizing users from high-dimensional data. The above-described characterization data screening method may include the following steps.
S101: acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users.
In the present embodiment, an initial variable set may be acquired in advance. The initial variable set may include values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users, where the sample data may be labeled as positive or negative according to the type of user to be mined, as actual needs require. For example, if the goal is to screen out variables for judging whether a user has a high reputation, users in the bank database with a high reputation and no default behavior may be labeled as positive sample users, and users in the bank database with historical default behavior may be labeled as negative sample users. It is to be understood that, in other scenarios, positive and negative sample users may be labeled following the same idea, as determined by the actual situation; this application does not limit this.
In this embodiment, the number of positive sample users and the number of negative sample users may be in a certain ratio, for example: the ratio of the number of positive sample users to the number of negative sample users is 1:5, but other ratio values may also be adopted, for example, 1:5.5 or 1:3, which may be determined according to actual situations, and this is not limited in this application.
In this embodiment, the initial variables may be user-related data that can be obtained directly from a bank database, and may include basic user information, behavior information, and the like, for example: user identification number, user gender, user age, user deposit amount, user's surcharge data, etc. In some embodiments, the initial variables may include: user basic information (gender, identification number, age, reserved telephone number, work unit, bank account information, bank account opening date, house information, etc.), user address, user rating, loan contract, loan account transaction flow, loan payout flow, debit card contract, debit card account flow, AUM (Assets Under Management, i.e., asset management scale), etc. Of course, in other scenarios the initial variables may also include other variables, such as the user's years of work, school of graduation, major, graduation time, and the like, as determined by the actual situation; this application does not limit this.
AUM generally covers, in addition to deposits, the user's wealth-management products, funds, precious metals, loans, and the like held at the bank. User value can be measured by AUM: the higher the AUM of a user (an individual user or a corporate user), the higher that user's contribution to the bank generally is. AUM is thus an index for identifying high-net-worth users.
In this embodiment, the manner of obtaining the initial variable set may include: receiving an initial variable set entered into the screening system by the relevant bank business personnel, or obtaining the initial variable set by querying a preset path. It is understood that the initial variable set may also be obtained in other possible manners, for example by searching a web page or a database according to a certain search condition, as determined by the actual situation; this application does not limit this.
S102: carrying out data processing on initial variables in the initial variable set to obtain a derivative variable set; and the derivative variable set is a set of new variables derived according to the initial variables.
Because the variables that can be directly obtained from the database or the business system are designed according to business needs, the initial variables alone often fail to yield good data mining results. Therefore, in this embodiment, the initial variables in the initial variable set may be subjected to data processing to obtain a derived variable set, that is, new variables are constructed from the initial variables, so as to expand the dimensions of the analysis.
In this embodiment, the derived variable set may be a set of new variables derived from the initial variables, and a derived variable may be a variable that changes as the initial variable changes. The derived variable set may include the values of the derived variables corresponding to the plurality of positive sample users and the plurality of negative sample users.
In this embodiment, the manner of performing data processing on the initial variables in the initial variable set may include, but is not limited to, at least one of the following: variable expansion, in which a plurality of label-type variables are derived from one variable by flattening (expanding) its values, which can also be understood as discretization; variable combination, in which two or more variables are combined by a mathematical operation, which can be regarded as a logical connection of the variables; and synthetic variables, which are formed by combining individual variables (by multiplication or Cartesian product) and are a way for a linear model to learn nonlinear relationships.
In one embodiment, the data processing of the plurality of initial variables in the initial variable set comprises: counting, summing, averaging, date compression, and the like, performed on the initial variable values of the users in the initial variable set. It will be appreciated that other data processing approaches may also be employed in some embodiments, such as cross-combining a plurality of variables, or performing operations such as intersection, union, complement, and Cartesian product.
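As an illustration of the three derivation styles listed above, the following Python sketch applies each of them to one hypothetical user record. The field names (`rating`, `loan_balance`, `credit_limit`, `age`) and the specific operations chosen (indicator flags, a ratio, a product) are assumptions made for the example; the application does not prescribe them.

```python
def expand_variable(value, categories):
    """Variable expansion: flatten one categorical value into 0/1 label variables."""
    return {f"is_{c}": int(value == c) for c in categories}


def combine_variables(a, b):
    """Variable combination: merge two numeric variables arithmetically (here, a ratio)."""
    return a / b if b else 0.0


def synthesize_variable(a, b):
    """Synthetic variable: the product of two variables (an interaction term)."""
    return a * b


# Hypothetical initial variables for one user.
user = {"rating": "gold", "loan_balance": 20000.0, "credit_limit": 50000.0, "age": 40}

derived = expand_variable(user["rating"], ["silver", "gold", "platinum"])
derived["limit_usage"] = combine_variables(user["loan_balance"], user["credit_limit"])
derived["age_x_balance"] = synthesize_variable(user["age"], user["loan_balance"])
```

Each helper returns plain values, so the derived variables can be collected per user into one dictionary and later assembled into the derived variable set.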
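The counting, summing, averaging, and date compression operations might look like the following Python sketch for one user's transaction records. The record layout and output names are hypothetical, and "date compression" is read here, as an assumption, as collapsing a series of dated records into a single time span; the application does not define the term further.

```python
from datetime import date


def derive_from_transactions(transactions):
    """Aggregate one user's (date, amount) records into derived variables."""
    if not transactions:
        return {"txn_count": 0, "txn_sum": 0.0, "txn_mean": 0.0, "active_days": 0}
    dates = [d for d, _ in transactions]
    amounts = [a for _, a in transactions]
    return {
        "txn_count": len(amounts),                 # counting
        "txn_sum": sum(amounts),                   # summing
        "txn_mean": sum(amounts) / len(amounts),   # averaging
        # Assumed reading of "date compression": collapse the dated records
        # into one span, in days, between the first and the last record.
        "active_days": (max(dates) - min(dates)).days,
    }


# Hypothetical loan account flow for one user.
records = [
    (date(2020, 1, 5), 100.0),
    (date(2020, 1, 20), 300.0),
    (date(2020, 3, 5), 200.0),
]
features = derive_from_transactions(records)
```

The same aggregation could equally be written in SQL with `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX` grouped by user.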
In one embodiment, two languages, Python (a programming language) and SQL (Structured Query Language), may be used to compute the derived variable set from the initial variable set. In one embodiment, variable cleaning may be performed before variable derivation, so as to resolve data quality problems in the initial variable set and make the cleaned data more suitable for mining.
In one embodiment, the derived variable set obtained from the initial variable set may include: user seniority data, such as the length of the user's relationship with the bank and the duration of electronic channel signing; user preference data, such as investment and financing product holding type, amount, quantity, and holding duration; historical loan behavior data, such as the number and frequency of historical loan applications, the number and frequency of signings, the number of payouts, and the number of repayments; credit change trends, such as changes in historical credit amount, changes in credit mode, and changes in user identity; loan use conditions, such as historical loan interest income, quota usage rate, and quota usage duration; loan overdue data, such as historical overdue amount, number of overdue occurrences, and overdue duration; the economic development degree of the provinces and cities where the historical loan signing institutions are located; and the like. It will be appreciated that other data may also be included in other scenarios, such as the user's transfer frequency, as determined by the actual situation; this application does not limit this.
S103: carrying out univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; in which univariate analysis is used to determine the characterization capabilities of individual variables.
In this embodiment, in order to ensure the prediction capability of each derived variable on a target event (a screening scenario that needs to be applied, for example, to determine whether a target user is a positive sample user) and the uniqueness of the derived variables in the derived variable set, univariate analysis may be performed on each derived variable in the derived variable set, so that the derived variables in the derived variable set may be screened to obtain the screened derived variables.
In the present embodiment, univariate analysis mainly focuses on the description and statistical inference of a single variable; it reflects the basic information contained in a large amount of sample data in the simplest summarized form, and describes the central tendency or dispersion of the sample data. The univariate analysis is used for determining the characterization capability of a single variable. Variables with low predictive capability and duplicated variables (those with high correlation) can be screened out by univariate analysis, so that the derived variables in the screened derived variable set are more suitable for data mining.
In one embodiment, the univariate analysis may comprise: calculating the stability of each variable using the PSI (Population Stability Index), calculating the IV (Information Value) of each single variable, and calculating the correlation (correlation coefficient) between every two variables. Of course, other indexes can also be used for univariate analysis, as determined by the actual situation; this application does not limit this.
The PSI (Population Stability Index) is an index for measuring the deviation between the predicted value and the actual value of a model, and is a model stability evaluation index. After samples are graded by score, the PSI indicates whether the population distribution changes between samples taken at different times, that is, whether the proportion of the population in each score interval relative to the total population changes significantly. When the stability evaluation is performed, the samples may be analyzed according to a preset time frequency; for example, each variable in the derived variable set may be analyzed monthly, weekly, or yearly, as determined by the actual situation, and this application does not limit this.
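The application does not give a formula for the PSI. A minimal Python sketch using the commonly used formulation, the sum over intervals of (actual share - expected share) * ln(actual share / expected share), is shown below. The function name, the shared interval edges, and the small `eps` floor for empty intervals are assumptions made for illustration.

```python
import math


def population_stability_index(expected, actual, edges, eps=1e-6):
    """PSI between two samples of one variable, bucketed on shared edges.

    expected, actual: lists of values from two time periods.
    edges: ascending interior cut points, giving len(edges) + 1 intervals.
    eps: floor applied to empty intervals so the logarithm is defined.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v > e)  # index of the interval holding v
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e = shares(expected)
    a = shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Hypothetical values of one derived variable in two different months.
month_1 = [1, 2, 3, 4]
month_2 = [3, 4, 5, 6]
psi_same = population_stability_index(month_1, month_1, edges=[2.5, 4.5])
psi_shifted = population_stability_index(month_1, month_2, edges=[2.5, 4.5])
```

Identical distributions give a PSI of zero, and the value grows as the proportions in each interval drift apart, so under this formulation a larger PSI corresponds to a less stable variable.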
The above IV (Information Value) can be used to measure the predictive ability of the argument. The variable screening process takes into account a number of factors, such as: the predictive power of variables, the correlation between variables, the simplicity of variables (easy to generate and use), the robustness of variables (not easily bypassed), the interpretability of variables in business (which can be explained when challenged), etc., but the most important measure is the predictive power of variables. Thus, in this embodiment, the predictive power of a variable can be measured by IV: assume that in a classification problem, the categories of target variables are of two types: y is1,Y2. For one to be predictedMeasuring individual A, judging whether A belongs to Y1Or Y2Certain information is required, assuming that the total amount of information is I, and the required information is contained in all the independent variables C1、C2、C3、……、CnThen for one of the variables CiIn other words, the more information it contains, it belongs to Y for judgment A1Or Y2The greater the contribution of CiThe greater the information value of (C)iThe larger the IV, the more it should go into the final set of variables.
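The application does not state how the IV is computed. The sketch below uses the widely used weight-of-evidence formulation: for each interval of the variable, the difference between the share of positive samples and the share of negative samples is multiplied by the logarithm of their ratio, and the products are summed. The function name, the interval edges, and the `eps` floor are assumptions for illustration.

```python
import math


def information_value(values, labels, edges, eps=1e-6):
    """IV of one variable for a binary target (1 = positive, 0 = negative)."""
    n_bins = len(edges) + 1
    pos = [0] * n_bins
    neg = [0] * n_bins
    for v, y in zip(values, labels):
        idx = sum(1 for e in edges if v > e)  # index of the interval holding v
        if y == 1:
            pos[idx] += 1
        else:
            neg[idx] += 1
    total_pos, total_neg = sum(pos), sum(neg)
    if total_pos == 0 or total_neg == 0:
        raise ValueError("both positive and negative samples are required")
    iv = 0.0
    for p, q in zip(pos, neg):
        p_share = max(p / total_pos, eps)
        q_share = max(q / total_neg, eps)
        woe = math.log(p_share / q_share)  # weight of evidence of the interval
        iv += (p_share - q_share) * woe
    return iv


# A variable that does not separate the classes, and one that separates them fully.
iv_uninformative = information_value([1, 1, 2, 2], [1, 0, 1, 0], edges=[1.5])
iv_informative = information_value([1, 1, 2, 2], [1, 1, 0, 0], edges=[1.5])
```

A variable whose intervals hold positive and negative samples in equal proportions scores zero, while a variable whose intervals separate the two classes scores high, which matches the intuition in the paragraph above.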
The correlation (Correlation Coefficient) can be used to characterize whether a dependency relationship exists between two variables and, where one exists, to study its direction and degree; it is a statistical method for studying the correlation between variables. If the correlation between two variables is high, the two variables have approximately the same effect on predicting the target event, and only one of them needs to be retained.
In an embodiment, performing univariate analysis on each derived variable in the derived variable set to obtain the screened derived variable set may include the following. First, the stability of each derived variable in the derived variable set is determined with the population stability index, analyzed at a preset time frequency, and the derived variables whose stability does not meet a first preset threshold are removed from the derived variable set to obtain a first variable set. Further, the information value of each variable in the first variable set may be calculated, and the variables whose information value is smaller than a second preset threshold are removed from the first variable set to obtain a second variable set. Further, the correlation coefficient may be used to determine the correlation between the variables in the second variable set; for any two variables whose correlation is greater than or equal to a third preset threshold, the variable with the lower information value is removed from the second variable set to obtain a third variable set, and the third variable set is used as the screened derived variable set.
In this embodiment, the first preset threshold may be a value greater than 0, preferably 0.1, 0.12, etc. The second preset threshold may be a value greater than 0, preferably 0.05, 0.06, etc. The third preset threshold may be a value greater than 0, preferably 0.7, 0.72, etc. Each threshold can be set according to actual requirements, and the present application does not limit them.
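The three-stage univariate screening described above can be sketched as follows. This is only an illustrative sketch: the function names, the toy data, and the precomputed PSI and IV values are hypothetical. It also makes three labeled assumptions: a variable is treated as unstable when its PSI exceeds the first threshold, the absolute value of the Pearson coefficient is used as the correlation, and the correlation step is done greedily in descending order of IV so that the higher-IV variable of a correlated pair is the one retained.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (sx * sy)

def univariate_screen(columns, psi_values, iv_values,
                      psi_max=0.1, iv_min=0.05, corr_max=0.7):
    """Return the variable names that survive the three screening stages."""
    # Stage 1 (assumption: PSI above the threshold means "not stable enough").
    first = [v for v in columns if psi_values[v] <= psi_max]
    # Stage 2: drop variables with too little information value.
    second = [v for v in first if iv_values[v] >= iv_min]
    # Stage 3: visit variables from highest to lowest IV and keep one only
    # if it is not highly correlated with a variable that is already kept.
    kept = []
    for v in sorted(second, key=lambda name: iv_values[name], reverse=True):
        if all(abs(pearson(columns[v], columns[k])) < corr_max for k in kept):
            kept.append(v)
    return kept

# Hypothetical derived variables with precomputed PSI and IV values.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],   # almost a multiple of "a"
    "c": [5, 1, 4, 2, 3],
    "d": [1, 1, 2, 2, 3],
}
psi_values = {"a": 0.02, "b": 0.03, "c": 0.05, "d": 0.30}
iv_values = {"a": 0.40, "b": 0.20, "c": 0.10, "d": 0.50}

print(univariate_screen(columns, psi_values, iv_values))
```

In this toy data, "d" is dropped for instability despite its high IV, and "b" is dropped because it is nearly collinear with "a", which has the higher IV.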
S104: screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable, among the derived variables, that is used for characterizing whether a user is a positive sample user.
In order to remove invalid variables on the premise of ensuring accuracy and reflect main characteristics required by predicting a target event with fewer characteristic variables as far as possible, in the embodiment, a random forest algorithm can be used for screening the derivative variables in the screened derivative variable set to obtain at least one characteristic variable. The characteristic variables are variables obtained by final screening, and may be the simplest variables used for representing whether the user is a positive sample user.
In this embodiment, a Random Forest, in machine learning, is a classifier comprising a plurality of decision trees. It constructs the decision trees by using random resampling and random node splitting, and obtains the final classification result by voting. The random forest can analyze classification features with complex interactions, is robust to noisy data and data with missing values, learns quickly, and provides a variable importance measure that can serve as a feature selection tool for high-dimensional data. The random forest algorithm can be used to determine how much each variable contributes on each tree in the forest, average those contributions, and finally compare the contributions of the different variables.
In one embodiment, the screened derived variable set may be used as a training set, and a random forest algorithm is used for training. A Bootstrap sample set can be randomly drawn with replacement from the training set by applying the Bootstrap method (a simulation-based sampling method of statistical inference that relies on a large amount of computation), and an unpruned tree can then be constructed from the generated Bootstrap sample. In the process of growing a tree, several candidate split variables are randomly selected from the screened derived variable set; the information content of each candidate is calculated (it can be measured by the Gini coefficient, the gain ratio, or the information gain), and the candidate with the greatest classification ability is used to split the node. A plurality of decision trees can be constructed in this way, and the importance value of each variable in the screened derived variable set can be obtained by voting across the trees.
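The two sources of randomness described above (Bootstrap sampling of rows and random selection of candidate split variables) and the choice of the best split can be sketched in plain Python. This is a simplified illustration of a single node split, not a full random forest; the function names and the four-row toy data set are hypothetical, and the Gini impurity decrease is used as the measure of classification ability.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one Bootstrap sample)."""
    return rng.choices(rows, k=len(rows))

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def split_gain(rows, var, threshold):
    """Decrease in Gini impurity when splitting on features[var] <= threshold."""
    left = [label for features, label in rows if features[var] <= threshold]
    right = [label for features, label in rows if features[var] > threshold]
    n = len(rows)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini([label for _, label in rows]) - weighted

def best_variable(rows, candidates):
    """Return (variable, threshold, gain) with the largest impurity decrease."""
    best = (None, None, 0.0)
    for var in candidates:
        for threshold in sorted({features[var] for features, _ in rows}):
            gain = split_gain(rows, var, threshold)
            if gain > best[2]:
                best = (var, threshold, gain)
    return best

# Toy rows: (feature values, label), where label 1 marks a positive sample user.
rows = [({"x": 1, "y": 5}, 0), ({"x": 2, "y": 1}, 0),
        ({"x": 3, "y": 4}, 1), ({"x": 4, "y": 2}, 1)]

rng = random.Random(0)
sample = bootstrap_sample(rows, rng)        # rows one tree would be grown on
candidates = rng.sample(["x", "y"], 2)      # random subset of split variables
print(len(sample), best_variable(rows, candidates))
```

In a full forest this node split would be repeated recursively for every tree, and the impurity decreases attributed to each variable would be averaged over the trees to obtain its importance value.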
In this embodiment, the importance values of the derived variables may be sorted in descending order. The importance values are then accumulated in that order, and the derived variables whose accumulated importance value first falls within a preset range are used as the characteristic variables. For example, suppose the screened derived variable set contains 20 derived variables, output in descending order of importance value. If the sum of the importance values of the first 12 derived variables is greater than 0.96, the first 12 derived variables can be used as the characteristic variables, and the remaining 8 are removed. Although removing some of the features may cost some final accuracy, a loss of less than 4% may be acceptable in exchange for a simpler prediction of the target event.
In the present embodiment, since the importance values of the derived variables in the screened derived variable set sum to 1, both end values of the preset range are positive numbers greater than 0 and less than or equal to 1. The preset range may preferably be (0.96, 1], [0.97, 1], and the like; it may be set according to the actual situation, and the present application is not limited thereto.
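The descending accumulation of importance values can be sketched as follows. The function name, the five variable names, and their importance values are hypothetical; the cut-off of 0.96 follows the preset range (0.96, 1] mentioned above.

```python
def select_by_cumulative_importance(importances, cutoff=0.96):
    """Keep the top-ranked variables until their summed importance exceeds cutoff."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for name, value in ranked:
        selected.append(name)
        running += value
        if running > cutoff:
            break  # the accumulated importance has entered the preset range
    return selected

# Hypothetical importance values that sum to 1.
importances = {"v1": 0.50, "v2": 0.30, "v3": 0.17, "v4": 0.02, "v5": 0.01}
print(select_by_cumulative_importance(importances))
```

With these values the first three variables already account for about 0.97 of the total importance, so the last two are removed.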
From the above description, it can be seen that the embodiments of the present application achieve the following technical effects: the initial variable set can be obtained; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users. The initial variables in the initial variable set can be subjected to data processing to obtain a derivative variable set. Furthermore, univariate analysis can be performed on each derivative variable in the derivative variable set to obtain the screened derivative variable set. And screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether the user is a positive sample user in the derived variables. The variables which can effectively represent the positive sample user can be efficiently and conveniently screened from the high-dimensional data by utilizing univariate analysis and a random forest algorithm, and the accuracy of user evaluation can be further improved.
In one embodiment, the initial set of variables may be obtained as follows: a first sample data set may be obtained for a preset time period, where the first sample data set may include data of a plurality of positive sample users and a plurality of negative sample users for the preset time period. Since the sample data set may contain abnormal data, repeated data, and the like, in order to ensure the quality of data, data cleaning may be performed on the sample data set first, so as to obtain a second sample data set. Further, values of a plurality of initial variables corresponding to each positive sample user and each negative sample user in the second sample data set may be extracted, and an initial variable set may be generated according to the values of the plurality of initial variables corresponding to each positive sample user and each negative sample user.
In this embodiment, in order to make the data in the first sample data set representative and accurate, the first sample data set may be acquired for a preset time period. The preset time period may be a certain period in historical time, for example the most recent year (1/2019-12/21), or it may be selected in other ways, for example the data from June 1, 2018 to November 30, 2018. The period can be determined according to actual conditions and is not limited in the present application.
In this embodiment, the test data set may be selected according to the preset time period, for example, an expiration date of the preset time period may be used as a starting time point of the test data set, so as to obtain the test data set. The test data set obtained by adopting the method can better verify the training result of the training set, and the corresponding test data set also comprises the positive sample user data and the negative sample user data.
In this embodiment, since the first sample data set is obtained from historical data, the data in it can be labeled accurately, that is, labeled as positive sample user data or negative sample user data. The positive sample users are the users that the prediction is expected to identify, and the negative sample users are the opposite. For example, if the goal is to predict whether a target user is a user with high credibility, the positive sample users are users with high credibility and the negative sample users are users without high credibility.
In one embodiment, after obtaining the at least one characteristic variable, a binning operation may be further performed on the at least one characteristic variable to obtain a binning structure of each characteristic variable, where the binning structure is used to characterize an evaluation criterion of the characteristic variable. Furthermore, score distribution can be performed on each characteristic variable according to the box-dividing structure of each characteristic variable to obtain a target scoring model, wherein the target scoring model is used for scoring a target user according to input data of the target user, and then whether the target user is a positive sample user or not can be determined according to a scoring result.
The binning operation discretizes continuous data. For example, age is a continuous variable and can be divided into 0-18, 18-30, 30-45 and 45-60. The main effects of binning may include the following: binned variables are more robust to abnormal data (for example, if an abnormal value of 300 appears in the age, it can simply be placed in a bin of > 80); nonlinear relationships in the data can be captured effectively, which improves the expressive power of the model and its fit; non-monotonic relationships in the data can be captured; the variable values can be standardized; categorical variables can be incorporated into the model effectively; the resistance of the model to fluctuation can be improved, because binning removes various kinds of noise, eliminates or greatly weakens the influence of extreme values, and ensures a sufficient sample size in each interval, so that the model is not disturbed by small fluctuations in the data; and the interpretability of the model can be improved.
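A minimal sketch of such a binning step is shown below, using the age intervals from the example plus the "> 80" bin mentioned for abnormal values. The constant names and the choice to treat each upper edge as inclusive are assumptions of this sketch, since the description does not say which bin a boundary value such as 18 belongs to; the "60-80" bin is likewise added only to connect the listed intervals with the "> 80" bin.

```python
from bisect import bisect_left

# Upper edges of the bins; each edge is treated as inclusive (an assumption).
AGE_EDGES = [18, 30, 45, 60, 80]
AGE_LABELS = ["0-18", "18-30", "30-45", "45-60", "60-80", ">80"]

def bin_label(value, edges=AGE_EDGES, labels=AGE_LABELS):
    """Map a continuous value to the label of the bin it falls into."""
    return labels[bisect_left(edges, value)]

for age in (17, 27, 45, 300):
    print(age, bin_label(age))
```

An abnormal age such as 300 lands in the last bin instead of distorting the model as an extreme numeric value.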
In this embodiment, before binning is performed, the type of each variable may be determined. The types may include: continuous variables, such as income and age; ordered categorical variables, such as education level and job title; and unordered categorical variables, such as province. Unordered categorical variables can be sorted and converted based on the bad-sample rate, and ordered categorical variables can be converted according to their natural order. Once all variable data have been processed into continuous data, the binning operation can be performed.
In this embodiment, WOE (Weight of Evidence) can be used to analyze each group of data after the initial binning, and the binning structure of each variable is then adjusted according to its IV (Information Value) until the best binning effect is achieved (the bad-user ratio is monotonic across the bins).
In this embodiment, WOE is an encoded form of the original independent variable. The WOE of a group describes the direction and magnitude of the influence that the variable, when it takes a value in that group, has on judging whether an individual will respond (or which class the individual belongs to). When the WOE is positive, the current value of the variable has a positive influence on that judgment; when the WOE is negative, it has a negative influence. The magnitude of the WOE value represents the size of the influence. The calculation of IV depends on WOE, and IV is an index that measures the degree of influence of the independent variable on the target variable.
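The relationship between WOE and IV can be illustrated with a short sketch. The application does not spell out the formulas, so the commonly used definitions are assumed here: for each bin, WOE = ln(positive% / negative%), where positive% is the share of all positive samples that fall in the bin, and IV is the sum over bins of (positive% - negative%) * WOE. The sign convention, the function name, and the counts are assumptions, and every bin is assumed to contain at least one positive and one negative sample.

```python
import math

def woe_iv(bins):
    """Compute per-bin WOE values and the total IV.

    bins is a list of (positive_count, negative_count) pairs, one per bin.
    """
    total_pos = sum(pos for pos, _ in bins)
    total_neg = sum(neg for _, neg in bins)
    woes, iv = [], 0.0
    for pos, neg in bins:
        pos_pct = pos / total_pos
        neg_pct = neg / total_neg
        woe = math.log(pos_pct / neg_pct)   # direction and size of the bin's effect
        woes.append(woe)
        iv += (pos_pct - neg_pct) * woe     # every term is non-negative
    return woes, iv

# Hypothetical counts for three bins of one variable.
informative = [(10, 40), (30, 40), (60, 20)]
uninformative = [(10, 10), (20, 20)]

print(woe_iv(informative))
print(woe_iv(uninformative))
```

In the first example the WOE rises from bin to bin, which is the monotonic pattern sought when adjusting the binning structure; in the second, positives and negatives are distributed identically, so every WOE and the IV are zero.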
In this embodiment, the binning structure of each characteristic variable may be used to characterize the evaluation criterion of that variable. Scores are assigned according to the binning structure of each characteristic variable to obtain the target scoring model. Taking age as an example, the assigned scores may be 20 for ages 0-18, 40 for ages 19-35, 30 for ages 36-50, and 10 for ages above 50.
In this embodiment, the target scoring model may be used to score the target user according to the input data of the target user. For example, if the data of the target user are age 27, male, married, undergraduate education, and monthly income 10000, the score given by the target scoring model may be 264. Further, whether the target user is a positive sample user may be determined according to the scoring result.
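How a scoring model built from binning structures turns user data into a score can be sketched as below. Only the age points (20, 40, 30, 10) come from the example above; the income bins and points, the function names, and the use of just two variables are hypothetical, so the total does not reproduce the score of 264 mentioned in the text.

```python
from bisect import bisect_left

# (inclusive upper edges, points per bin); the last bin is open-ended.
AGE_SCORES = ([18, 35, 50], [20, 40, 30, 10])       # points from the age example
INCOME_SCORES = ([5000, 15000], [10, 30, 50])       # hypothetical bins and points

def score_numeric(value, edges, points):
    """Look up the points of the bin that a numeric value falls into."""
    return points[bisect_left(edges, value)]

def score_user(user):
    """Sum the points of each characteristic variable for one user."""
    return (score_numeric(user["age"], *AGE_SCORES)
            + score_numeric(user["monthly_income"], *INCOME_SCORES))

target_user = {"age": 27, "monthly_income": 10000}
print(score_user(target_user))
```

The resulting total would then be compared with a cut-off chosen for the business scenario to decide whether the target user is treated as a positive sample user.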
In this embodiment, the ratio (good%) of the cumulative number of positive samples to the total number of positive samples and the ratio (bad%) of the cumulative number of negative samples to the total number of negative samples may be calculated for each scoring interval according to the binning structure of the characteristic variables, so that a score can be assigned to each scoring interval according to its calculation result.
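The cumulative good% and bad% per scoring interval can be computed as in the following sketch. The function name and the counts are hypothetical, and the intervals are assumed to be listed in order.

```python
from itertools import accumulate

def cumulative_rates(intervals):
    """Return (good%, bad%) accumulated up to and including each interval.

    intervals is an ordered list of (positive_count, negative_count) pairs.
    """
    pos_total = sum(pos for pos, _ in intervals)
    neg_total = sum(neg for _, neg in intervals)
    cum_pos = accumulate(pos for pos, _ in intervals)
    cum_neg = accumulate(neg for _, neg in intervals)
    return [(cp / pos_total, cn / neg_total) for cp, cn in zip(cum_pos, cum_neg)]

# Hypothetical counts for three ordered scoring intervals.
intervals = [(10, 40), (30, 40), (60, 20)]
for good_pct, bad_pct in cumulative_rates(intervals):
    print(f"good%={good_pct:.2f}  bad%={bad_pct:.2f}")
```

The gap between the two cumulative ratios in each interval shows how well that interval separates positive from negative sample users, which can inform the score assigned to it.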
In one scenario example, the approval process for personal K loan (online consumer credit) applications currently used within the bank is basically divided into two stages, a pre-credit stage and an application stage. In the pre-credit stage, the user clicks to check the K loan limit through channels such as mobile banking, online banking, or offline self-service terminals, and then enters the credit admission step; if admission is passed, a pre-credit limit is output and the application stage begins. In the application stage, the user clicks to apply for the K loan, and a series of rules is applied; if the application passes the screening, the application amount is output and the user enters the contract signing and account opening step, which completes the application approval process. The whole application approval process treats all users alike, whether they are stock users of the bank or new users, so high-quality users cannot be screened out of the stock users and their limits cannot be raised in a principled way to provide them with personalized services.
In this scenario example, the existing K loan application approval process can be optimized by adding a stock high-quality user quota module between the pre-credit stage and the application stage. For stock users, this adds a step that judges whether the stock user is a high-quality user, so that users are screened and identified. If a stock user is judged to be a high-quality user and already has a limit from the pre-credit stage, that limit is raised appropriately; if the user has no limit from the pre-credit stage, the high-quality user is granted a certain limit. This improves user satisfaction and user stickiness.
In one specific embodiment, the K loan application approval process may be as shown in FIG. 2, and the quota module may include basic admission of stock users, a stock high-quality user scoring model, and a stock high-quality user quota model. The high-quality user quota module can perform admission, scoring, and quota increase separately for two types of users, according to whether the user has an initial pre-credit limit.
In this embodiment, basic admission of the stock user is a necessary but not sufficient condition for the whole stock high-quality user quota service: the user whose quota is to be raised must be checked against the basic credit risk admission conditions, and the subsequent quota process can be performed only if the user satisfies the basic admission rules. The basic admission rules may include, but are not limited to, at least one of the following: whether the current credit condition of the user belongs to the user range of inventory settlement and credit extension, whether the user's current internal user score meets the minimum admission standard, whether the user was ever overdue during any of the user's historical settled K loans, and the like.
In this embodiment, the establishment of the high-quality user scoring model may take into account, for example, whether the user has an initial pre-credit line. Users who, between 20180601 and 20181130, performed a K loan settlement operation and went through pre-credit calculation without a credit source are screened to form an original sample user set. From this set, the business department selects, based on past business experience, a batch of users (5000 users) as high-quality users of the K loan product to form a target sample white list. At a ratio of target samples to non-target samples of 1:5, 25000 stock users not identified by the business department as high-quality are randomly extracted from the original sample user set, and together they form a user sample data set (30000 users). For these 30000 users, 20181130 is taken as the modeling benchmark day and the users' data up to 20181130 are used as training sample data for the scoring model; 20191130 is taken as the test benchmark day and the users' data up to 20191130 are used as test sample data for the scoring model.
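The 1:5 construction of the user sample data set can be sketched as follows. The function name, the integer user identifiers, the fixed random seed, and the small population size are hypothetical and serve only to illustrate the sampling step; label 1 marks a white-list (target) user and label 0 a randomly drawn non-target user.

```python
import random

def build_sample_set(population_ids, whitelist_ids, ratio=5, seed=0):
    """Combine white-list users with ratio-times as many random other users."""
    whitelist = set(whitelist_ids)
    candidates = [uid for uid in population_ids if uid not in whitelist]
    rng = random.Random(seed)
    # Sampling without replacement so that no user appears twice.
    negatives = rng.sample(candidates, ratio * len(whitelist))
    return ([(uid, 1) for uid in sorted(whitelist)]
            + [(uid, 0) for uid in negatives])

# Toy population of 100 users, 10 of them on the white list.
population = list(range(100))
whitelist = list(range(10))
samples = build_sample_set(population, whitelist)
print(len(samples), sum(label for _, label in samples))
```

Scaled up to the numbers in the text, 5000 white-list users and a ratio of 5 would give 25000 non-target users and 30000 samples in total.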
In this embodiment, for each user in the user sample data set, various types of original data held within the bank are acquired, including multi-dimensional data such as user basic information, address information, in-bank user rating information, user grade information, loan contract information, loan account transaction records, loan payment records, K loan transaction operation information, debit card contract information, debit card account transaction records, and AUM information. After data cleaning, an initial variable set is generated.
In this embodiment, the derived variables can be calculated by performing data analysis on the initial variables using two programming languages, Python and SQL. The generated derived variables can include: user qualification data, such as the user's tenure with the bank and the time of signing up for electronic channels; preference data, such as the types, amounts, quantities, and holding times of investment and wealth management products; K loan behavior data, such as the historical number and frequency of loan applications, the number and frequency of contract signings, and the numbers of drawdowns and repayments; credit change trend data, such as historical changes in the loan credit, changes in the credit limit, and changes in the user's identity; K loan usage data, such as historical K loan interest income, limit utilization rate, and duration of use; loan performance data, such as historical loan amount, number of overdue events, and overdue duration; and the province and degree of economic development of the institution where the historical K loan was signed.
In this embodiment, characteristic variable screening combines univariate analysis with the random forest machine learning algorithm to achieve a better feature screening effect. First, univariate analysis is performed on the variables, which specifically includes: using the population stability index to analyze and observe the stability of the variables monthly, and removing the variables with low stability (PSI > 0.1); using the information value to analyze the individual predictive ability of each characteristic variable for the target event, and removing the variables with low predictive ability (IV < 0.05); and using the correlation coefficient to analyze the correlation between variables, where for two variables with high correlation (greater than 0.7) the variable with the higher IV value is retained.
In this embodiment, the finally screened characteristic variables may be binned, WOE is used to analyze each group of binned data, and the binning structure of each variable is then adjusted according to IV until the best binning effect is achieved (the bad-user ratio is monotonic across the bins). Finally, scores are assigned to each variable to obtain an admission scoring model for the stock high-quality user quota service.
In this embodiment, since the loan term of the loan business to which the stock high-quality user scorecard model applies is 1 year, the model may be used to score the users monthly on the test sample data from 20181101 to 20191031, and it is observed whether the proportion of users in each score interval relative to the total number of users changes significantly, so as to calculate the stability index of the model. If the calculated stability index is less than 10%, the stock high-quality user scoring model is considered relatively stable.
In this embodiment, for a stock user who has settled a loan and has no pre-credit limit, a quota decision can be given based on credit data from a recent period (the amount of the user's most recent successful K loan within the last n months); for a stock user who has settled a loan and does have a pre-credit limit, the quota increase proportion is determined by jointly considering a credit scale factor and the scoring result. In this way, precise marketing and differentiated pricing can be realized.
In this embodiment, the predicted value of the model may be recorded, the daily updated data of the credit system is used for monitoring and analyzing, the predicted value obtained by the model is compared with the actual performance of the user, the variables and the relevant parameters of the model can be continuously changed, and the model is continuously iteratively optimized.
In this embodiment, the high-quality user quota model based on the random forest algorithm can, on the one hand, reduce the cost of decision communication within the bank, gradually reduce reliance on the past review mode based on business experience, reduce errors of subjective human judgment, and make decisions more systematic. On the other hand, the random forest algorithm can mine the existing data in depth, accurately screen out high-quality users while keeping risk under control, raise the limits of high-quality users as far as possible, and improve users' satisfaction with the bank's K loan business, thereby improving user stickiness and the competitiveness of the K loan business. It should be noted, however, that the foregoing specific examples are provided for the purpose of better illustrating the present application and are not to be construed as limiting the present application.
Based on the same inventive concept, the embodiment of the present application further provides a device for screening characterization data, such as the following embodiments. Because the principle of the characterization data screening device for solving the problems is similar to that of the characterization data screening method, the characterization data screening device can be implemented by the characterization data screening method, and repeated details are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated. Fig. 3 is a block diagram of a structure of a device for screening characterization data according to an embodiment of the present application, as shown in fig. 3, which may include: an obtaining module 301, a data processing module 302, a univariate analysis module 303, and a variable screening module 304, which are described below.
An obtaining module 301, configured to obtain an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users;
the data processing module 302 may be configured to perform data processing on initial variables in the initial variable set to obtain a derivative variable set; the derived variable set is a set of new variables derived according to the initial variables;
the univariate analysis module 303 may be configured to perform univariate analysis on each derived variable in the derived variable set to obtain a filtered derived variable set; wherein univariate analysis is used to determine the characterization capability of a single variable;
the variable screening module 304 may be configured to screen derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, where the characteristic variable is a variable used to characterize whether a user is a positive sample user in the derived variables.
The embodiment of the present application further provides an electronic device, which may specifically refer to a schematic structural diagram of the electronic device based on the characterization data screening method provided in the embodiment of the present application shown in fig. 4, and the electronic device may specifically include an input device 41, a processor 42, and a memory 43. The input device 41 may specifically be used to input an initial set of variables. The processor 42 may be specifically configured to obtain an initial set of variables; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; carrying out data processing on initial variables in the initial variable set to obtain a derivative variable set; the derived variable set is a set of new variables derived according to the initial variables; carrying out univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; wherein univariate analysis is used to determine the characterization capability of a single variable; and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether the user is a positive sample user in the derived variables. The memory 43 may be specifically used for storing parameters such as a set of derived variables, characteristic variables, and the like.
In this embodiment, the input device may be one of the main apparatuses for information exchange between a user and a computer system. The input devices may include a keyboard, mouse, camera, scanner, light pen, handwriting input panel, voice input device, etc.; the input device is used to input raw data and a program for processing the data into the computer. The input device can also acquire and receive data transmitted by other modules, units and devices. The processor may be implemented in any suitable way. For example, a processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The memory may in particular be a memory device used in modern information technology for storing information. The memory may include multiple levels, and in a digital system, memory may be used as long as binary data can be stored; in an integrated circuit, a circuit without a physical form and with a storage function is also called a memory, such as a RAM, a FIFO and the like; in the system, the storage device in physical form is also called a memory, such as a memory bank, a TF card and the like.
In this embodiment, the functions and effects specifically realized by the electronic device can be explained by comparing with other embodiments, and are not described herein again.
Based on the characterization data screening method, an embodiment of the present application further provides a computer storage medium storing computer program instructions which, when executed, implement: acquiring an initial variable set, where the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; performing data processing on the initial variables in the initial variable set to obtain a derived variable set, where the derived variable set is a set of new variables derived from the initial variables; performing univariate analysis on each derived variable in the derived variable set to obtain a screened derived variable set, where the univariate analysis is used to determine the characterization capability of a single variable; and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, where a characteristic variable is a derived variable that is used to characterize whether a user is a positive sample user.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a cache, a Hard Disk Drive (HDD), or a memory card. The memory may be used to store computer program instructions. The network communication unit may be an interface that is configured in accordance with a standard prescribed by a communication protocol and is used for network connection and communication.
In this embodiment, the functions and effects realized by the program instructions stored in the computer storage medium can be understood with reference to the other embodiments and are not repeated here.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present application described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Although the present application provides method steps as described in the above embodiments or flowcharts, the method may include more or fewer steps on the basis of conventional or non-inventive effort. Where steps have no logically necessary causal relationship, their order of execution is not limited to that provided by the embodiments of the present application. When the method is executed in an actual device or end product, it may be executed sequentially or in parallel (for example, in a parallel-processor or multi-threaded environment) according to the embodiments or the methods shown in the figures.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the application should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the pending claims along with the full scope of equivalents to which such claims are entitled.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and it will be apparent to those skilled in the art that various modifications and variations can be made in the embodiment of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A method for screening characterization data, comprising:
acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users;
performing data processing on the initial variables in the initial variable set to obtain a derived variable set; wherein the derived variable set is a set of new variables derived from the initial variables;
performing univariate analysis on each derived variable in the derived variable set to obtain a screened derived variable set; wherein the univariate analysis is used to determine the characterization capability of a single variable;
and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a derived variable that is used to characterize whether a user is a positive sample user.
2. The method of claim 1, wherein obtaining an initial set of variables comprises:
acquiring a first sample data set in a preset time period; wherein the first sample data set comprises data of a plurality of positive sample users and a plurality of negative sample users within the preset time period;
performing data cleaning on the first sample data set to obtain a second sample data set;
extracting values of a plurality of initial variables corresponding to each positive sample user and each negative sample user in the second sample data set;
and generating the initial variable set according to the values of a plurality of initial variables corresponding to each positive sample user and each negative sample user.
3. The method of claim 1, wherein performing data processing on the initial variables in the initial variable set comprises: counting, summing, averaging, and date-compressing the values of the initial variables of the users in the initial variable set.
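As an illustration only, the counting, summing, and averaging of claim 3 can be sketched in Python using the standard library. The record layout `(user_id, amount, date)`, the function name `derive_variables`, and the output field names are hypothetical. The source does not define date compression; the sketch assumes it means reducing a user's dates to a single span in days, which is only one possible reading.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def derive_variables(records):
    """Aggregate per-user records into derived variables.

    records: iterable of (user_id, amount, date) tuples (hypothetical layout).
    """
    by_user = defaultdict(list)
    for user_id, amount, day in records:
        by_user[user_id].append((amount, day))

    derived = {}
    for user_id, rows in by_user.items():
        amounts = [amount for amount, _ in rows]
        days = [day for _, day in rows]
        derived[user_id] = {
            "txn_count": len(amounts),   # counting
            "txn_sum": sum(amounts),     # summing
            "txn_mean": mean(amounts),   # averaging
            # Assumed reading of "date compression": collapse all dates
            # of a user into the number of days between first and last.
            "active_span_days": (max(days) - min(days)).days,
        }
    return derived


records = [
    ("u1", 100.0, date(2020, 1, 5)),
    ("u1", 300.0, date(2020, 3, 5)),
    ("u2", 50.0, date(2020, 2, 1)),
]
derived = derive_variables(records)
```

Each key of `derived` is a user, and each inner dictionary holds the new variables derived from that user's initial variable values.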
4. The method of claim 1, wherein screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable comprises:
determining the importance value of each derived variable in the screened derived variable set by using the random forest algorithm;
sorting the importance values of the derived variables in descending order to obtain a descending-order result;
accumulating the importance values of the derived variables in descending order according to the descending-order result;
and taking, as characteristic variables, the derived variables whose accumulated importance value falls within a preset range.
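The descending-order accumulation of claim 4 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the importance values are assumed to have been computed already by a random forest (for example, the `feature_importances_` attribute of a fitted scikit-learn `RandomForestClassifier`), and the "preset range" is assumed to mean "keep variables until the running total reaches a cutoff share of the total importance". The function name and the default cutoff of 0.9 are hypothetical.

```python
def select_by_cumulative_importance(importances, cutoff=0.9):
    """Keep the most important variables until their cumulative
    importance reaches `cutoff` times the total importance.

    importances: dict mapping variable name -> importance value.
    """
    # Sort variables by importance, largest first.
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(value for _, value in ranked)

    selected, running = [], 0.0
    for name, value in ranked:
        if running >= cutoff * total:
            break  # the accumulated importance already covers the range
        selected.append(name)
        running += value
    return selected


importances = {"d": 0.05, "a": 0.5, "c": 0.15, "b": 0.3}
features = select_by_cumulative_importance(importances, cutoff=0.9)
```

With these example values, the running totals after `a`, `b`, and `c` are 0.5, 0.8, and 0.95, so `d` is dropped. A stricter reading that excludes the variable crossing the cutoff would drop `c` as well; the claim text does not settle which is intended.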
5. The method of claim 1, wherein performing univariate analysis on each derived variable in the derived variable set to obtain a screened derived variable set comprises:
determining, by using a population stability index, the stability of each derived variable in the derived variable set at a preset time frequency;
removing the derived variables whose stability is smaller than a first preset threshold from the derived variable set to obtain a first variable set;
calculating the information value of each variable in the first variable set;
removing the variables whose information value is smaller than a second preset threshold from the first variable set to obtain a second variable set;
determining the correlation between the variables in the second variable set by using a correlation coefficient;
for any two variables whose correlation is greater than or equal to a third preset threshold, removing the variable with the lower information value from the second variable set, to obtain a third variable set;
and taking the third variable set as the screened derived variable set.
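The three filters of claim 5 can be sketched in plain Python. The sketch rests on several assumptions that the source does not state: each variable is already binned, so the population stability index (PSI) and information value (IV) are computed from per-bin counts; a high PSI is treated as low stability, so variables with PSI above `psi_max` are removed; the Pearson coefficient is used as the correlation coefficient; and the thresholds 0.25, 0.02, and 0.8 are illustrative values only. All function and key names are hypothetical.

```python
import math

EPS = 1e-6  # floor for bin shares, avoids log(0) on empty bins


def _shares(counts):
    total = sum(counts)
    return [max(c / total, EPS) for c in counts]


def psi(base_counts, recent_counts):
    """Population stability index between two binned distributions."""
    base, recent = _shares(base_counts), _shares(recent_counts)
    return sum((r - b) * math.log(r / b) for b, r in zip(base, recent))


def information_value(pos_counts, neg_counts):
    """Information value from per-bin positive and negative counts."""
    pos, neg = _shares(pos_counts), _shares(neg_counts)
    return sum((p - n) * math.log(p / n) for p, n in zip(pos, neg))


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def univariate_screen(variables, psi_max=0.25, iv_min=0.02, corr_max=0.8):
    # Step 1: drop unstable variables (assumed: high PSI = low stability).
    stable = [n for n, v in variables.items()
              if psi(v["base"], v["recent"]) <= psi_max]
    # Step 2: drop variables with a low information value.
    iv = {n: information_value(variables[n]["pos"], variables[n]["neg"])
          for n in stable}
    informative = [n for n in stable if iv[n] >= iv_min]
    # Step 3: visit variables from highest to lowest IV and keep one only
    # if it is not strongly correlated with any variable already kept, so
    # the lower-IV member of each correlated pair is the one removed.
    kept = []
    for name in sorted(informative, key=lambda n: iv[n], reverse=True):
        if all(abs(pearson(variables[name]["values"],
                           variables[k]["values"])) < corr_max for k in kept):
            kept.append(name)
    return kept


variables = {
    "a": {"base": [50, 30, 20], "recent": [48, 32, 20],
          "pos": [10, 30, 60], "neg": [50, 30, 20], "values": [1, 2, 3, 4, 5]},
    "b": {"base": [50, 30, 20], "recent": [48, 32, 20],
          "pos": [20, 30, 50], "neg": [40, 30, 30], "values": [2, 4, 6, 8, 10.5]},
    "c": {"base": [50, 30, 20], "recent": [10, 30, 60],
          "pos": [10, 30, 60], "neg": [50, 30, 20], "values": [3, 1, 2, 5, 4]},
    "d": {"base": [50, 30, 20], "recent": [48, 32, 20],
          "pos": [33, 33, 34], "neg": [34, 33, 33], "values": [5, 1, 4, 2, 3]},
}
screened = univariate_screen(variables)
```

In this example `c` fails the stability filter, `d` fails the information value filter, and `b` is removed because it is almost perfectly correlated with `a` and has the lower information value, leaving only `a`.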
6. The method of claim 1, further comprising, after obtaining at least one characteristic variable:
performing a binning operation on the at least one characteristic variable to obtain a binning structure of each characteristic variable, wherein the binning structure is used to represent the evaluation standard of the characteristic variable;
and assigning scores to each characteristic variable according to its binning structure to obtain a target scoring model, wherein the target scoring model is used to score a target user according to the input data of the target user.
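A minimal sketch of the binning (box separation) and score assignment in claim 6 follows. The source does not say how bins are formed or how scores are assigned, so both choices here are assumptions: bins are equal-frequency, and each bin receives points proportional to the share of positive sample users observed in it. The names `fit_variable`, `score_user`, and `max_points` are hypothetical.

```python
from bisect import bisect_right


def equal_frequency_cuts(values, n_bins):
    """Cut points that split the sorted values into n_bins groups."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def fit_variable(values, labels, n_bins=2, max_points=100):
    """Build the binning structure and per-bin points for one variable.

    labels: 1 for a positive sample user, 0 for a negative sample user.
    """
    cuts = equal_frequency_cuts(values, n_bins)
    positives = [0] * n_bins
    totals = [0] * n_bins
    for value, label in zip(values, labels):
        index = bisect_right(cuts, value)  # bin that the value falls into
        totals[index] += 1
        positives[index] += label
    # Assumed scoring rule: points scale with the positive share of the bin.
    points = [round(max_points * p / t) if t else 0
              for p, t in zip(positives, totals)]
    return {"cuts": cuts, "points": points}


def score_user(user, model):
    """Sum the points of the bins that the user's values fall into."""
    return sum(card["points"][bisect_right(card["cuts"], user[name])]
               for name, card in model.items())


labels = [0, 0, 1, 1]
model = {
    "balance": fit_variable([100, 200, 300, 400], labels),
    "tenure": fit_variable([1, 5, 2, 8], labels),
}
score = score_user({"balance": 350, "tenure": 3}, model)
```

Here `model` plays the role of the target scoring model: for each characteristic variable it stores the cut points (the binning structure) and the points per bin, and `score_user` applies it to the input data of a target user.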
7. The method of claim 1, wherein the initial variables in the initial variable set comprise at least one of: user basic information, user address, user score, user rank, loan contracts, loan accounts, loan account transaction records, loan payoff records, debit card contracts, debit card accounts, debit card account transaction records, and asset management scale;
the derived variables in the derived variable set include at least one of: user seniority data, investment and financing product holding type, amount, quantity, holding duration, historical loan behavior, credit granting change trend, loan use condition, loan overdue data, and economic development degree of the province and city where the historical loan signing institution is located.
8. A device for screening characterization data, comprising:
an acquisition module configured to acquire an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users;
a data processing module configured to perform data processing on the initial variables in the initial variable set to obtain a derived variable set; wherein the derived variable set is a set of new variables derived from the initial variables;
a univariate analysis module configured to perform univariate analysis on each derived variable in the derived variable set to obtain a screened derived variable set; wherein the univariate analysis is used to determine the characterization capability of a single variable;
and a variable screening module configured to screen the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a derived variable that is used to characterize whether a user is a positive sample user.
9. A device for screening characterization data, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010540728.9A CN111738819A (en) | 2020-06-15 | 2020-06-15 | Method, device and equipment for screening characterization data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010540728.9A CN111738819A (en) | 2020-06-15 | 2020-06-15 | Method, device and equipment for screening characterization data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111738819A (en) | 2020-10-02 |
Family ID=72649173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010540728.9A Pending CN111738819A (en) | 2020-06-15 | 2020-06-15 | Method, device and equipment for screening characterization data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111738819A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102074A (en) * | 2020-10-14 | 2020-12-18 | 深圳前海弘犀智能科技有限公司 | Grading card modeling method |
CN112328657A (en) * | 2020-11-03 | 2021-02-05 | 中国平安人寿保险股份有限公司 | Feature derivation method, feature derivation device, computer equipment and medium |
CN113409139A (en) * | 2021-07-27 | 2021-09-17 | 深圳前海微众银行股份有限公司 | Credit risk identification method, apparatus, device, and program |
CN113610175A (en) * | 2021-08-16 | 2021-11-05 | 上海冰鉴信息科技有限公司 | Service strategy generation method and device and computer readable storage medium |
CN115329280A (en) * | 2022-08-18 | 2022-11-11 | 中国建设银行股份有限公司 | Data screening method, device, equipment and medium |
CN115880053A (en) * | 2022-12-05 | 2023-03-31 | 中电金信软件有限公司 | Training method and device for grading card model |
WO2023071529A1 (en) * | 2021-10-29 | 2023-05-04 | 新智我来网络科技有限公司 | Device data cleaning method and apparatus, computer device and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150379428A1 (en) * | 2014-06-30 | 2015-12-31 | Amazon Technologies, Inc. | Concurrent binning of machine learning data |
CN106548350A (en) * | 2016-11-17 | 2017-03-29 | 腾讯科技(深圳)有限公司 | A kind of data processing method and server |
CN109635953A (en) * | 2018-11-06 | 2019-04-16 | 阿里巴巴集团控股有限公司 | A kind of feature deriving method, device and electronic equipment |
CN109784373A (en) * | 2018-12-17 | 2019-05-21 | 深圳魔数智擎科技有限公司 | Screening technique, computer readable storage medium and the computer equipment of characteristic variable |
CN110009479A (en) * | 2019-03-01 | 2019-07-12 | 百融金融信息服务股份有限公司 | Credit assessment method and device, storage medium, computer equipment |
CN110222087A (en) * | 2019-05-15 | 2019-09-10 | 平安科技(深圳)有限公司 | Feature extracting method, device and computer readable storage medium |
2020-06-15: CN202010540728.9A filed, published as CN111738819A; status: Pending
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102074A (en) * | 2020-10-14 | 2020-12-18 | 深圳前海弘犀智能科技有限公司 | Grading card modeling method |
CN112102074B (en) * | 2020-10-14 | 2024-01-30 | 深圳前海弘犀智能科技有限公司 | Score card modeling method |
CN112328657A (en) * | 2020-11-03 | 2021-02-05 | 中国平安人寿保险股份有限公司 | Feature derivation method, feature derivation device, computer equipment and medium |
CN113409139A (en) * | 2021-07-27 | 2021-09-17 | 深圳前海微众银行股份有限公司 | Credit risk identification method, apparatus, device, and program |
CN113409139B (en) * | 2021-07-27 | 2024-05-28 | 深圳前海微众银行股份有限公司 | Credit risk identification method, apparatus, device and program |
CN113610175A (en) * | 2021-08-16 | 2021-11-05 | 上海冰鉴信息科技有限公司 | Service strategy generation method and device and computer readable storage medium |
CN113610175B (en) * | 2021-08-16 | 2024-06-14 | 上海冰鉴信息科技有限公司 | Service policy generation method and device and computer readable storage medium |
WO2023071529A1 (en) * | 2021-10-29 | 2023-05-04 | 新智我来网络科技有限公司 | Device data cleaning method and apparatus, computer device and medium |
CN115329280A (en) * | 2022-08-18 | 2022-11-11 | 中国建设银行股份有限公司 | Data screening method, device, equipment and medium |
CN115880053A (en) * | 2022-12-05 | 2023-03-31 | 中电金信软件有限公司 | Training method and device for grading card model |
CN115880053B (en) * | 2022-12-05 | 2024-05-31 | 中电金信软件有限公司 | Training method and device for scoring card model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111738819A (en) | Method, device and equipment for screening characterization data | |
Aggarwal et al. | A complete empirical ensemble mode decomposition and support vector machine-based approach to predict Bitcoin prices | |
EP1361526A1 (en) | Electronic data processing system and method of using an electronic processing system for automatically determining a risk indicator value | |
CN111709826A (en) | Target information determination method and device | |
CN111583012B (en) | Method for evaluating default risk of credit, debt and debt main body by fusing text information | |
CN114078050A (en) | Loan overdue prediction method and device, electronic equipment and computer readable medium | |
EP4044094A1 (en) | System and method for determining and managing reputation of entities and industries through use of media data | |
CN115545886A (en) | Overdue risk identification method, overdue risk identification device, overdue risk identification equipment and storage medium | |
CN114612239A (en) | Stock public opinion monitoring and wind control system based on algorithm, big data and artificial intelligence | |
CN114092230A (en) | Data processing method and device, electronic equipment and computer readable medium | |
CN114219611A (en) | Loan amount calculation method and device, computer equipment and storage medium | |
CN113129125A (en) | Risk monitoring method and system based on pledge system and storage medium | |
CN112819341A (en) | Scientific and technological type small and micro enterprise credit risk assessment method | |
CN117114812A (en) | Financial product recommendation method and device for enterprises | |
Huang et al. | Time series analysis and prediction on bitcoin | |
CN113421154B (en) | Credit risk assessment method and system based on control chart | |
Niknya et al. | Financial distress prediction of Tehran Stock Exchange companies using support vector machine | |
CN114626940A (en) | Data analysis method and device and electronic equipment | |
CN115099933A (en) | Service budget method, device and equipment | |
CN115237970A (en) | Data prediction method, device, equipment, storage medium and program product | |
CN118333738A (en) | Method for constructing retail credit risk prediction model and credit card service Scorealpha model | |
CN117788133A (en) | Method for constructing retail credit risk prediction model and retail credit score model | |
CN117994017A (en) | Method for constructing retail credit risk prediction model and online credit service Scoredelta model | |
Chen et al. | A Model Based on Survival-based Credit Risk Assessment System of SMEs | |
CN118333739A (en) | Method for constructing retail credit risk prediction model and retail credit business Scoremult model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20220907; Address after: 25 Financial Street, Xicheng District, Beijing 100033; Applicant after: CHINA CONSTRUCTION BANK Corp.; Address before: 25 Financial Street, Xicheng District, Beijing 100033; Applicant before: CHINA CONSTRUCTION BANK Corp.; Applicant before: Jianxin Financial Science and Technology Co.,Ltd. |