CN111738819A

CN111738819A - Method, device and equipment for screening characterization data

Info

Publication number: CN111738819A
Application number: CN202010540728.9A
Authority: CN
Inventors: 加鸣; 郑玉函; 陈芷君; 袁韵; 程琬芸
Original assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Current assignee: China Construction Bank Corp
Priority date: 2020-06-15
Filing date: 2020-06-15
Publication date: 2020-10-02

Abstract

The application provides a method, a device and equipment for screening characterization data, wherein the method comprises the following steps: acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; performing data processing on initial variables in the initial variable set to obtain a derivative variable set; performing univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether a user is a positive sample user or not in the derived variables. In the embodiment of the application, the variable which can effectively represent the positive sample user can be efficiently and conveniently screened from the high-dimensional data by utilizing the univariate analysis and the random forest algorithm, and the accuracy of user evaluation can be further improved.

Description

Method, device and equipment for screening characterization data

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, and a device for screening characterization data.

Background

Existing (stock) users are an important user resource for a bank. Managing these users scientifically and, on the premise of controlling risk, providing targeted personalized services to high-quality existing users is particularly important.

In the prior art, a bank system provides only process-level functional support for screening high-quality users. Business personnel must rely on their subjective experience to judge whether a strong causal relationship exists between a user variable and whether the user is a high-quality user, and thereby select the variables used to evaluate users. They then comprehensively evaluate each user according to the selected variables to determine whether targeted personalized services should be provided for that user. Because this variable screening depends on the subjective experience of business personnel, it is highly arbitrary and easily affected by the uncertainty of human subjective factors, so the accuracy of the screening result is low. Furthermore, the prior-art method of screening characterization data is inefficient: only a few variables can be analyzed and selected, and the selected variables may include redundant or irrelevant ones, so the evaluation accuracy is low. Therefore, the prior-art technical scheme cannot efficiently and accurately screen, from high-dimensional data, the variables that effectively characterize a user.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the application provides a method, a device and equipment for screening characterization data, and aims to solve the problem that effective variables cannot be efficiently and accurately screened from high-dimensional data in the prior art.

The embodiment of the application provides a method for screening characterization data, which comprises the following steps: acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; performing data processing on the initial variables in the initial variable set to obtain a derivative variable set; wherein the derivative variable set is a set of new variables derived from the initial variables; performing univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; wherein the univariate analysis is used to determine the characterization capability of a single variable; and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether a user is a positive sample user or not in the derived variables.

In one embodiment, obtaining an initial set of variables comprises: acquiring a first sample data set in a preset time period; wherein the first sample data set comprises data of a plurality of positive sample users and a plurality of negative sample users within the preset time period; performing data cleaning on the first sample data set to obtain a second sample data set; extracting values of a plurality of initial variables corresponding to each positive sample user and each negative sample user in the second sample data set; and generating the initial variable set according to the values of a plurality of initial variables corresponding to each positive sample user and each negative sample user.

In one embodiment, the data processing of the initial variables in the initial variable set comprises: counting, summing, averaging and date-compressing the values of the initial variables of the users in the initial variable set.

In one embodiment, the screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable includes: determining the importance value of each derived variable in the screened derived variable set by using the random forest algorithm; performing descending order arrangement on the importance values of the derivative variables to obtain descending order arrangement results; according to the descending order arrangement result, carrying out descending order accumulation on the importance values of the derivative variables; and taking the derivative variable corresponding to the importance value obtained by accumulation in a preset range as a characteristic variable.

In one embodiment, the performing univariate analysis on each derived variable in the derived variable set to obtain a filtered derived variable set includes: analyzing and determining the stability of each derivative variable in the derivative variable set according to a preset time frequency by utilizing a population stability index; removing the derived variables with the stability smaller than a first preset threshold value from the derived variable set to obtain a first variable set; calculating the information value of each variable in the first variable set; removing variables with information values smaller than a second preset threshold value from the first variable set to obtain a second variable set; determining the correlation between the variables in the second variable set by using the correlation coefficient; removing the variable with a lower information value from the second variable set in the two variables with the correlation greater than or equal to a third preset threshold value to obtain a third variable set; and taking the third variable set as the derived variable set after screening.

In one embodiment, after obtaining the at least one characteristic variable, the method further includes: performing box separation operation on the at least one characteristic variable to obtain a box separation structure of each characteristic variable, wherein the box separation structure is used for representing the evaluation standard of the characteristic variable; and according to the box-separating structure of each characteristic variable, carrying out score distribution on each characteristic variable to obtain a target scoring model, wherein the target scoring model is used for scoring the target user according to the input data of the target user.

In one embodiment, the initial variables in the initial set of variables include at least one of: user basic information, user address, user score, user rank, loan contracts, loan accounts, loan account transaction flow, loan payoff flow, debit card contracts, debit card accounts, debit card account flow, asset management scale; the derived variables in the set of derived variables include at least one of: user seniority data, investment and financing product holding type, amount, quantity, holding duration, historical loan behavior, credit granting change trend, loan use condition, loan overdue data and economic development degree of province and city where the historical loan signing institution is located.

The embodiment of the present application further provides a device for screening characterization data, including: the acquisition module is used for acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; the data processing module is used for carrying out data processing on the initial variables in the initial variable set to obtain a derivative variable set; wherein the derivative variable set is a set of new variables derived from the initial variables; the univariate analysis module is used for performing univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; wherein the univariate analysis is used to determine the characterization capability of a single variable; and the variable screening module is used for screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether a user is a positive sample user in the derived variables.

The embodiment of the application also provides characterization data screening equipment, which comprises a processor and a memory for storing processor executable instructions, wherein the processor executes the instructions to realize the steps of the characterization data screening method.

The embodiment of the application also provides a computer readable storage medium, which stores computer instructions, and the instructions realize the steps of the characterization data screening method when executed.

The embodiment of the application provides a method for screening characterization data, which can be used for screening the characterization data by acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users. The initial variables in the initial variable set can be subjected to data processing to obtain a derivative variable set. Furthermore, univariate analysis can be performed on each derivative variable in the derivative variable set to obtain the screened derivative variable set. And screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether the user is a positive sample user in the derived variables. By utilizing univariate analysis and a random forest algorithm, effective characteristic variables for representing whether the user is a positive sample user can be extracted from high-dimensional data efficiently and conveniently, and the accuracy of user evaluation can be further improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application, are incorporated in and constitute a part of this application, and are not intended to limit the application. In the drawings:

FIG. 1 is a schematic illustration of steps of a method for screening characterization data provided in accordance with an embodiment of the present application;

FIG. 2 is a schematic diagram of an approval process for a K loan application according to an embodiment of the application;

FIG. 3 is a schematic structural diagram of a device for screening characterization data provided in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of characterization data screening equipment according to an embodiment of the present application.

Detailed Description

The principles and spirit of the present application will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present application, and are not intended to limit the scope of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, embodiments of the present application may be embodied as a system, apparatus, device, method or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

Although the flow described below includes operations that occur in a particular order, it should be appreciated that the processes may include more or less operations that are performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).

Referring to fig. 1, the present embodiment can provide a method for screening characterization data. The characterization data screening method can be used for screening effective variables which can be used for characterizing users from high-dimensional data. The above-described characterization data screening method may include the following steps.

S101: acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users.

In the present embodiment, an initial variable set may be acquired in advance. The initial variable set may include values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users, where the positive and negative sample users are sample data labeled according to the type of user to be mined, as determined by actual needs. For example, if the goal is to screen out variables for judging whether a user has a high reputation, users in the bank database with no default behavior may be marked as positive sample users, and users in the bank database with historical default behavior may be marked as negative sample users. It is to be understood that, in other scenarios, positive and negative sample users may be marked along the same lines, as determined by the actual situation; this application does not limit this.

In this embodiment, the number of positive sample users and the number of negative sample users may be in a certain ratio, for example: the ratio of the number of positive sample users to the number of negative sample users is 1:5, but other ratio values may also be adopted, for example, 1:5.5 or 1:3, which may be determined according to actual situations, and this is not limited in this application.

In this embodiment, the initial variable may be data related to the user that can be directly acquired from a bank database, and may include user basic information, behavior information, and the like, for example: user identification number, user gender, user age, user deposit amount, user's surcharge data, etc. In some embodiments, the initial variables may include: user basic information (gender, identification number, age, reserved telephone number, work unit, bank account information, bank account opening date, house information, etc.), user address, user rating, loan contract, loan account transaction flow, loan payout flow, debit card contract, debit card account flow, AUM (Assets Under Management, that is, asset management scale), etc. It will of course be appreciated that the initial variables may also include other variables in other scenarios, such as the user's years of work, school of graduation, major, and graduation time, which may be determined according to actual conditions; this application does not limit this.

The AUM generally covers, in addition to deposits, the user's wealth management products, funds, precious metals, loans, and the like held at the bank. User value can be measured by AUM: the higher the AUM of a user (an individual user or a corporate user), the higher that user's contribution to the bank generally is, so AUM serves as an index for identifying high-net-worth users.

In this embodiment, the manner of obtaining the initial variable set may include: and receiving an initial variable set input into the screening system by the bank related business personnel, or obtaining the initial variable set by querying according to a preset path. It is understood that, the above initial variable set may also be obtained in other possible manners, for example, the initial variable set is searched in a web page or a database according to a certain search condition, which may be determined according to actual situations, and the present application does not limit this.

S102: carrying out data processing on initial variables in the initial variable set to obtain a derivative variable set; and the derivative variable set is a set of new variables derived according to the initial variables.

Because the variables that can be directly obtained from the database or the business system are designed according to the needs of the business, the initial variables alone often fail to yield good data mining results. Therefore, in this embodiment, the initial variables in the initial variable set may be subjected to data processing to obtain a derivative variable set, that is, new variables are constructed from the initial variables, so as to expand the dimensions of the analysis.

In this embodiment, the derivative variable set may be a set of new variables derived from an initial variable, and the derivative variable may be a variable that changes according to a change in the initial variable. The set of derived variables may include: and the values of the derivative variables corresponding to the plurality of positive sample users and the plurality of negative sample users.

In this embodiment, the manner of performing data processing on the initial variables in the initial variable set may include, but is not limited to, at least one of the following: variable expansion, which derives a plurality of label-type variables from one variable by flattening (expanding) its values and can also be understood as discretization; variable combination, which combines two or more variables by mathematical operation and can be regarded as a logical connection of the variables; and synthetic variables, which are formed by crossing individual variables (multiplication or Cartesian product) and are a way for linear models to learn nonlinear features.
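The three processing styles named above can be illustrated with a minimal Python sketch. The function names, the field names (`card_type`, `loan_used`, `credit_limit`, `gender`, `age_band`) and the sample values are hypothetical and chosen only for illustration; the application does not prescribe any particular implementation.

```python
def expand_variable(value, categories):
    """Variable expansion: flatten one categorical value into 0/1 label variables."""
    return {f"is_{c}": int(value == c) for c in categories}


def combine_variables(loan_used, credit_limit):
    """Variable combination: join two variables by arithmetic (here, a usage rate)."""
    return loan_used / credit_limit if credit_limit else 0.0


def synthesize_variables(gender, age_band):
    """Synthetic variable: cross two discrete variables (Cartesian product of values)."""
    return f"{gender}_x_{age_band}"


# Hypothetical initial variables for a single user.
user = {
    "card_type": "debit",
    "loan_used": 30_000.0,
    "credit_limit": 120_000.0,
    "gender": "F",
    "age_band": "30-39",
}

derived = {}
derived.update(expand_variable(user["card_type"], ["debit", "credit"]))
derived["limit_usage_rate"] = combine_variables(user["loan_used"], user["credit_limit"])
derived["gender_x_age"] = synthesize_variables(user["gender"], user["age_band"])
```

The crossed variable would normally be expanded again into label variables before being fed to a linear model.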

In one embodiment, the data processing of the plurality of initial variables in the initial variable set comprises: counting, summing, averaging, date compression and the like, performed on the initial variable values of the users in the initial variable set. It will be appreciated that other data processing approaches may also be employed in some embodiments, such as cross-combining a plurality of variables, or performing operations such as intersection, union, complement and Cartesian product.
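As a rough illustration of counting, summing, averaging and date compression, the following Python sketch aggregates per-user transaction rows. The record layout and variable names are hypothetical, and "date compression" is interpreted here, as an assumption, as collapsing daily dates to month granularity; the application does not define the operation further.

```python
from collections import defaultdict
from datetime import date

# Hypothetical loan repayment records: (user_id, repayment_date, amount).
records = [
    ("u1", date(2020, 1, 15), 1000.0),
    ("u1", date(2020, 1, 28), 500.0),
    ("u1", date(2020, 3, 2), 1500.0),
    ("u2", date(2020, 2, 10), 800.0),
]


def derive_variables(rows):
    """Build per-user derived variables from raw transaction rows."""
    grouped = defaultdict(list)
    for user_id, day, amount in rows:
        grouped[user_id].append((day, amount))

    derived = {}
    for user_id, items in grouped.items():
        amounts = [amount for _, amount in items]
        # Assumed meaning of date compression: keep only (year, month).
        months = {(day.year, day.month) for day, _ in items}
        derived[user_id] = {
            "repay_count": len(amounts),                # counting
            "repay_sum": sum(amounts),                  # summing
            "repay_mean": sum(amounts) / len(amounts),  # averaging
            "active_months": len(months),               # date compression
        }
    return derived


result = derive_variables(records)
```

In a bank setting the same aggregation would more likely be expressed in SQL with `GROUP BY`, which matches the Python-plus-SQL tooling mentioned in the description.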

In one embodiment, two programming languages, Python (computer programming Language) and SQL (Structured Query Language) may be used to compute the set of derived variables from the initial set of variables. In one embodiment, variable cleaning may be performed before variable derivation, so as to solve the data quality problem in the initial variable set and make the cleaned data more suitable for mining.

In one embodiment, the derived variable set obtained from the initial variable set may include: user seniority data such as user line age, electronic channel signing time length and the like; user preference data such as investment financing product holding type, amount, quantity, holding duration and the like; historical loan application times, application frequency, signing times, signing frequency, expenditure times, repayment times and other historical loan behavior data; credit change trends such as historical credit amount change situation, credit mode change situation, user identity change situation and the like; the loan use conditions such as historical loan interest income, quota use rate, quota use duration and the like; historical loan overdue amount, overdue times, overdue duration and other loan overdue data; the economic development degree of provinces and cities where the historical loan signing institution is located, and the like. It will be appreciated that other data may also be included in some other scenarios, such as: the transfer frequency of the user and the like can be determined according to the actual situation, and the application does not limit the transfer frequency.

S103: carrying out univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; in which univariate analysis is used to determine the characterization capabilities of individual variables.

In this embodiment, in order to ensure the prediction capability of each derived variable on a target event (a screening scenario that needs to be applied, for example, to determine whether a target user is a positive sample user) and the uniqueness of the derived variables in the derived variable set, univariate analysis may be performed on each derived variable in the derived variable set, so that the derived variables in the derived variable set may be screened to obtain the screened derived variables.

In the present embodiment, the univariate analysis mainly focuses on the description and statistical inference of a single variable. It reflects the basic information contained in a large amount of sample data in the simplest summarized form, and describes the central tendency or dispersion of the sample data. The univariate analysis is used for determining the characterization capability of a single variable. Variables with low predictive ability and repeated (highly correlated) variables can be screened out by the univariate analysis, so that the derived variables in the screened derived variable set are more suitable for data mining.

In one embodiment, the univariate analysis may comprise: calculating the stability of each variable using the PSI (Population Stability Index), calculating the IV (Information Value) of each single variable, and calculating the correlation between every two variables (Correlation Coefficient). Of course, other indexes can also be used for univariate analysis, which may be determined according to the actual situation and is not limited in the present application.

The PSI (Population Stability Index) is an index for measuring the deviation between an expected distribution and the actual distribution, and is commonly used as a stability evaluation index. After samples are divided into intervals according to their scores, the PSI indicates whether the population distribution changes across samples from different times, that is, whether the proportion of the population in each score interval relative to the total population changes significantly. When the stability evaluation is performed, the samples may be analyzed according to a preset time frequency; for example, each variable in the derivative variable set may be analyzed monthly, weekly or yearly, which may be determined according to actual conditions and is not limited by the present application.
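For orientation, the commonly used form of the index is PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the shares of the baseline and the current population that fall into interval i. The application itself does not spell out the formula, so the following Python sketch is an assumption based on that standard definition; the interval counts and month labels are made up.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both arguments are lists of counts over the same intervals. `eps`
    guards against log(0) when an interval is empty.
    """
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / expected_total, eps)
        a_pct = max(a / actual_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


# Hypothetical monthly check of one derived variable against a baseline month.
baseline = [120, 300, 380, 200]
monthly = {
    "2020-02": [118, 305, 377, 200],
    "2020-03": [60, 200, 400, 340],
}
monthly_psi = {month: psi(baseline, counts) for month, counts in monthly.items()}
```

Under this definition a value of 0 means the two distributions are identical, and larger values mean a larger shift.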

The above IV (Information Value) can be used to measure the predictive ability of an independent variable. The variable screening process takes a number of factors into account, such as the predictive power of variables, the correlation between variables, the simplicity of variables (easy to generate and use), the robustness of variables (not easily circumvented), and the interpretability of variables in business terms (they can be explained when challenged), but the most important measure is the predictive power of variables. Thus, in this embodiment, the predictive power of a variable can be measured by IV. Assume that, in a classification problem, the target variable has two categories, Y₁ and Y₂. For an individual A to be predicted, judging whether A belongs to Y₁ or Y₂ requires a certain amount of information. Assume that the total amount of information is I and that the required information is contained in all the independent variables C₁, C₂, C₃, ..., Cₙ. Then, for one of the variables Cᵢ, the more information it contains, the greater its contribution to judging whether A belongs to Y₁ or Y₂, and the greater the information value of Cᵢ; the larger the IV of Cᵢ, the more the variable deserves to enter the final set of variables.
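The application does not give a formula for IV. A widely used definition, assumed here, bins the variable and sums (pᵢ − nᵢ) · ln(pᵢ / nᵢ) over the bins, where pᵢ and nᵢ are the shares of positive and negative sample users in bin i and the logarithm term is the weight of evidence. A minimal Python sketch under that assumption, with made-up counts:

```python
import math


def information_value(pos_counts, neg_counts, eps=1e-6):
    """Information Value of a binned variable.

    pos_counts[i] and neg_counts[i] are the numbers of positive and negative
    sample users in bin i. `eps` avoids log(0) for empty bins.
    """
    pos_total = sum(pos_counts)
    neg_total = sum(neg_counts)
    iv = 0.0
    for p, n in zip(pos_counts, neg_counts):
        p_pct = max(p / pos_total, eps)
        n_pct = max(n / neg_total, eps)
        woe = math.log(p_pct / n_pct)  # weight of evidence of this bin
        iv += (p_pct - n_pct) * woe
    return iv


# A variable whose bins separate the two groups carries more information
# than one whose bins look the same for both groups.
iv_separating = information_value([80, 20], [20, 80])
iv_uninformative = information_value([50, 50], [50, 50])
```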

The correlation (Correlation Coefficient) can be used to characterize whether a dependency relationship exists between two variables and, where it does, to study the direction and degree of that relationship; it is a statistical method for studying the relationship between variables. If the correlation between two variables is high, the two variables have similar effects in predicting the target event, and only one of them may be retained.
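The application does not say which correlation coefficient is used. As one possible choice, the following Python sketch computes the Pearson coefficient for two variables; the variable names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical derived variables for five users.
overdue_count = [0, 1, 2, 3, 5]
overdue_amount = [0.0, 900.0, 2100.0, 2900.0, 5200.0]
r = pearson(overdue_count, overdue_amount)
```

A rank-based coefficient such as Spearman's could be substituted when the relationship is monotonic but not linear.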

In an embodiment, performing univariate analysis on each derived variable in the derived variable set to obtain the screened derived variable set may include: analyzing, at a preset time frequency and by using the population stability index, the stability of each derived variable in the derived variable set, and removing the derived variables whose stability is smaller than a first preset threshold from the derived variable set to obtain a first variable set. Further, the information value of each variable in the first variable set may be calculated, and the variables whose information values are smaller than a second preset threshold are removed from the first variable set to obtain a second variable set. Further, the correlation coefficient may be used to determine the correlation between the variables in the second variable set; of any two variables whose correlation is greater than or equal to a third preset threshold, the one with the lower information value is removed from the second variable set to obtain a third variable set, and the third variable set is used as the screened derived variable set.

In this embodiment, the first preset threshold may be a value greater than 0, and may preferably be 0.1, 0.12, or the like. The second preset threshold may be a value greater than 0, and may preferably be 0.05, 0.06, or the like. The third preset threshold may be a value greater than 0, and may preferably be 0.7, 0.72, or the like. Each threshold can be set according to actual requirements, and the present application does not limit them.
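The three-stage screening can be sketched in Python as follows, assuming the PSI, IV and pairwise correlation values have already been computed. Two points are assumptions rather than statements of the application: the text does not say how a stability score is derived from PSI, so this sketch takes the conventional reading that a lower PSI means a more stable variable and drops variables whose PSI exceeds the first threshold; and the correlated-pair removal is done greedily, from highest IV downward. All variable names and numbers are hypothetical.

```python
def univariate_screen(stats, corr, psi_max=0.1, iv_min=0.05, corr_max=0.7):
    """Three-stage screen over precomputed statistics.

    stats: {name: {"psi": float, "iv": float}}
    corr:  {(name_a, name_b): correlation coefficient}, either key order
    """
    # Stage 1 (assumed reading): keep variables whose PSI is within the threshold.
    first = [v for v in stats if stats[v]["psi"] <= psi_max]
    # Stage 2: drop variables whose information value is below the threshold.
    second = [v for v in first if stats[v]["iv"] >= iv_min]

    def pair_corr(a, b):
        return corr.get((a, b), corr.get((b, a), 0.0))

    # Stage 3: visit variables from highest IV down, so that the lower-IV
    # member of a highly correlated pair is the one that gets dropped.
    third = []
    for v in sorted(second, key=lambda name: stats[name]["iv"], reverse=True):
        if all(pair_corr(v, kept) < corr_max for kept in third):
            third.append(v)
    return third


stats = {
    "line_age": {"psi": 0.02, "iv": 0.30},
    "overdue_count": {"psi": 0.05, "iv": 0.45},
    "overdue_amount": {"psi": 0.04, "iv": 0.40},
    "transfer_freq": {"psi": 0.25, "iv": 0.20},  # unstable under the assumed reading
    "gender_flag": {"psi": 0.01, "iv": 0.01},    # too little information
}
corr = {
    ("overdue_count", "overdue_amount"): 0.85,
    ("line_age", "overdue_count"): 0.10,
    ("line_age", "overdue_amount"): 0.12,
}
screened = univariate_screen(stats, corr)
```

With these inputs, `overdue_amount` is dropped in favor of the more informative `overdue_count`, and `line_age` survives because it is only weakly correlated with the retained variable.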

S104: and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether the user is a positive sample user in the derived variables.

In order to remove invalid variables on the premise of ensuring accuracy and reflect main characteristics required by predicting a target event with fewer characteristic variables as far as possible, in the embodiment, a random forest algorithm can be used for screening the derivative variables in the screened derivative variable set to obtain at least one characteristic variable. The characteristic variables are variables obtained by final screening, and may be the simplest variables used for representing whether the user is a positive sample user.

In machine learning, a random forest (Random Forest) is a classifier comprising a plurality of decision trees. It constructs the decision trees by using random resampling and random node splitting, and obtains the final classification result by voting. A random forest can analyze classification features with complex interactions, is robust to noisy data and data with missing values, and learns quickly; its variable importance measure can be used as a feature selection tool for high-dimensional data. The random forest algorithm determines how much each variable contributes on each tree in the forest, averages these contributions, and finally compares the contributions of the different variables.

In one embodiment, the screened derived variable set may be used as a training set for training with a random forest algorithm. The bootstrap method (a simulation-based sampling method of statistical inference that relies on a large amount of computation) may be applied to randomly draw a sample set from the training set with replacement, generating a bootstrap data sample, and an unpruned tree may then be constructed from the generated bootstrap data. While a tree is being generated, several variables are randomly selected from the screened derived variable set to take part in the split at each node; the information content of each of these variables is calculated (which may be measured by the Gini coefficient, the gain ratio, or the information gain), and the variable with the strongest classification capability is selected for node splitting. In this way a plurality of decision trees can be constructed, and the importance value of each variable in the screened derived variable set can be obtained by voting.
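The two random ingredients of tree construction, sampling with replacement and choosing the split variable among a few candidates, can be sketched in Python as follows. This is not a full random forest: it shows a single split decision using Gini impurity, one of the criteria mentioned above. The data, variable names and helper functions are hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def best_split_variable(rows, labels, candidate_vars):
    """Return (variable, threshold, weighted_gini) of the purest binary split."""
    best = (None, None, float("inf"))
    n = len(rows)
    for var in candidate_vars:
        for threshold in sorted({row[var] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[var] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[var] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (var, threshold, score)
    return best


rows = [
    {"overdue_count": 0, "line_age": 3},
    {"overdue_count": 0, "line_age": 8},
    {"overdue_count": 2, "line_age": 5},
    {"overdue_count": 3, "line_age": 7},
]
labels = [1, 1, 0, 0]  # 1 = positive sample user, 0 = negative sample user

rng = random.Random(0)
sample = bootstrap_sample(list(zip(rows, labels)), rng)   # one tree's training data
candidates = rng.sample(["overdue_count", "line_age"], 1)  # random variable subset
split = best_split_variable(rows, labels, ["line_age", "overdue_count"])
```

A complete forest would repeat the sampling and splitting recursively for many trees and average each variable's impurity reduction to obtain its importance value.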

In this embodiment, the importance values of the derived variables may be sorted in descending order to obtain a descending-order result. Then, following the descending-order result, the importance values of the derived variables are accumulated from the largest downward, and the derived variables whose accumulated importance value falls within the preset range are used as the characteristic variables. For example, suppose the screened derived variable set contains 20 derived variables, output in descending order of importance value. When the sum of the importance values of the first 12 derived variables is greater than 0.96, the first 12 derived variables can be used as characteristic variables and the remaining 8 removed. Although removing some of the features may cost some final accuracy, a loss of less than 4% may be acceptable in exchange for a simpler prediction of the target event.

In this embodiment, the importance values of all derivative variables in the screened derivative variable set sum to 1, so both endpoints of the preset range are greater than 0 and less than or equal to 1. The preset range may preferably be (0.96, 1] or [0.97, 1], and may be set according to the actual situation; the present application is not limited in this respect.
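The descending accumulation can be sketched as below. The function name, the variable names `v0` to `v19` and the importance values are hypothetical; they are chosen only so that, as in the example above, the first 12 of 20 variables are the first to exceed a cutoff of 0.96.

```python
def select_by_cumulative_importance(importances, cutoff=0.96):
    """Return variable names, most important first, up to and including the
    first one that pushes the running importance total above `cutoff`."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, total = [], 0.0
    for name, value in ranked:
        selected.append(name)
        total += value
        if total > cutoff:
            break
    return selected

# Hypothetical importance values for 20 variables that sum to 1:
# 12 variables at 0.0805 each (0.966 in total), 8 at 0.00425 each (0.034).
importances = {f"v{i}": 0.0805 for i in range(12)}
importances.update({f"v{i}": 0.00425 for i in range(12, 20)})

selected = select_by_cumulative_importance(importances, cutoff=0.96)
```

With these values the first 11 variables accumulate to about 0.8855, so the twelfth is the one that crosses the cutoff and the remaining 8 are dropped.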

From the above description, it can be seen that the embodiments of the present application achieve the following technical effects: the initial variable set can be obtained; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users. The initial variables in the initial variable set can be subjected to data processing to obtain a derivative variable set. Furthermore, univariate analysis can be performed on each derivative variable in the derivative variable set to obtain the screened derivative variable set. And screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether the user is a positive sample user in the derived variables. The variables which can effectively represent the positive sample user can be efficiently and conveniently screened from the high-dimensional data by utilizing univariate analysis and a random forest algorithm, and the accuracy of user evaluation can be further improved.

In one embodiment, the initial variable set may be obtained as follows. A first sample data set for a preset time period may be obtained, where the first sample data set may include data of a plurality of positive sample users and a plurality of negative sample users within the preset time period. Since the first sample data set may contain abnormal data, duplicate data and the like, data cleaning may first be performed on it to ensure data quality, yielding a second sample data set. Further, the values of a plurality of initial variables corresponding to each positive sample user and each negative sample user in the second sample data set may be extracted, and the initial variable set may be generated from these values.

In this embodiment, so that the data in the first sample data set is representative and accurate, the first sample data set may be acquired for a preset time period. The preset time period may be a period in the past, for example the most recent year (such as January 2019 to December 2019). It may also be chosen in other ways, for example data from June 1, 2018 to November 30, 2018. The period can be determined according to the actual situation and is not limited in the present application.

In this embodiment, the test data set may be selected according to the preset time period; for example, the end date of the preset time period may be used as the starting time point of the test data set. A test data set obtained in this way can better verify the result of training on the training set. The test data set likewise includes positive sample user data and negative sample user data.

In this embodiment, since the first sample data set is obtained from historical data, its data can be labeled accurately, that is, labeled as positive sample user data or negative sample user data. The positive sample users are the users that the prediction is intended to identify, and the negative sample users are the opposite. For example, if the goal is to predict whether a target user is a user with high credibility, the positive sample users are users with high credibility and the negative sample users are users without high credibility.

In one embodiment, after the at least one characteristic variable is obtained, a binning operation may further be performed on the at least one characteristic variable to obtain a binning structure for each characteristic variable, where the binning structure characterizes the evaluation criterion of that characteristic variable. Scores can then be assigned to each characteristic variable according to its binning structure to obtain a target scoring model. The target scoring model scores a target user according to the target user's input data, and whether the target user is a positive sample user can be determined from the scoring result.

The binning operation discretizes continuous data. For example, age is a continuous variable and can be divided into the bins 0-18, 18-30, 30-45 and 45-60. The main effects of binning may include the following: binned variables are more robust to abnormal data (for example, if an abnormal value of 300 appears in the age variable, it can be assigned to a bin of > 80); nonlinear relationships in the data can be captured effectively, which improves the expressive capacity of the model and its fit; non-monotonic relationships in the data can be captured; variable values can be standardized; categorical variables can be incorporated into the model effectively; the resistance of the model to fluctuation can be improved, because binning removes various kinds of noise, eliminates or greatly weakens the influence of extreme values, and ensures a sufficient sample size in each interval, so that the model is not disturbed by small fluctuations in the data; and the interpretability of the model can be improved.
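A minimal sketch of the age binning described above, using right-closed intervals. The bins 0-18, 18-30, 30-45, 45-60 and > 80 come from the text; the intermediate 60-80 bin and the choice of which side of each interval is closed are assumptions made so that the bins cover the whole range.

```python
from bisect import bisect_left

# Upper edges of the right-closed bins (lo, hi]; anything above 80 falls
# into the overflow bin. The 60-80 bin is an assumption, not from the text.
AGE_EDGES = [18, 30, 45, 60, 80]
AGE_LABELS = ["0-18", "18-30", "30-45", "45-60", "60-80", ">80"]

def age_bin(age):
    """Map an age to its bin label; abnormal values such as 300 land in '>80'."""
    return AGE_LABELS[bisect_left(AGE_EDGES, age)]

bins = [age_bin(a) for a in (18, 27, 45, 61, 300)]
```

Because an outlier such as 300 is mapped to the same bin as any other value above 80, it cannot distort a model built on the bin labels.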

In this embodiment, the type of each variable may be determined before binning. The types may include: continuous variables, such as income and age; ordered categorical variables, such as education level and job title; and unordered categorical variables, such as province. Unordered categorical variables can be ordered and converted according to the bad sample rate, and ordered categorical variables can be converted according to their natural order. Once all variable data has been processed into continuous data, the binning operation can be performed.
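One way to order an unordered categorical variable by bad sample rate is sketched below. The category names, the sample data and the tie-breaking rule are hypothetical; the text does not specify how the conversion is carried out.

```python
from collections import defaultdict

def rank_by_bad_rate(categories, is_bad):
    """Return ({category: rank}, {category: bad rate}), where rank 0 is the
    category with the lowest share of bad samples. Ties break by name."""
    totals, bads = defaultdict(int), defaultdict(int)
    for cat, bad in zip(categories, is_bad):
        totals[cat] += 1
        bads[cat] += bad
    rates = {cat: bads[cat] / totals[cat] for cat in totals}
    ordered = sorted(rates, key=lambda cat: (rates[cat], cat))
    return {cat: rank for rank, cat in enumerate(ordered)}, rates

# Hypothetical provinces A, B, C with bad flags (1 = bad sample).
provinces = ["A", "A", "B", "B", "C", "C", "C", "C"]
bad_flags = [1, 1, 0, 1, 0, 0, 0, 1]

ranks, rates = rank_by_bad_rate(provinces, bad_flags)
```

The resulting rank can then be treated like an ordered numeric value and binned together with the other variables.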

In this embodiment, WOE (Weight of Evidence) can be used to analyze each group of data after the initial binning, and the binning structure of each variable can then be adjusted according to its IV (Information Value) until the best binning effect is achieved (that is, the bad-user ratio is monotonic across the bins).

In this embodiment, WOE is an encoding of the original independent variable. The WOE of a bin describes the direction and magnitude of the influence that the variable taking a value in that bin has on whether an individual will respond (or on which class the individual belongs to). When the WOE is positive, the current value of the variable has a positive influence on whether the individual will respond; when the WOE is negative, it has a negative influence. The magnitude of the WOE represents the size of this influence. The calculation of IV depends on WOE, and IV is an index that measures how strongly the independent variable influences the target variable.
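The text does not give formulas, so the sketch below uses one common definition as an assumption: for each bin, WOE is the natural logarithm of the bin's share of positive samples divided by its share of negative samples, and IV is the sum over bins of the share difference multiplied by WOE. Under this sign convention a positive WOE means the bin is richer in positive samples, which matches the description above. The bin counts are hypothetical.

```python
import math

def woe_iv(bins):
    """bins: list of (n_pos, n_neg) counts per bin.
    Returns (list of WOE values per bin, total IV).
    Assumes every bin has at least one positive and one negative sample;
    empty cells would need smoothing or merging with a neighbouring bin."""
    total_pos = sum(p for p, _ in bins)
    total_neg = sum(n for _, n in bins)
    woes, iv = [], 0.0
    for n_pos, n_neg in bins:
        pos_share = n_pos / total_pos
        neg_share = n_neg / total_neg
        woe = math.log(pos_share / neg_share)
        woes.append(woe)
        iv += (pos_share - neg_share) * woe
    return woes, iv

# Hypothetical counts for three bins of one variable.
woes, iv = woe_iv([(10, 40), (30, 40), (60, 20)])
is_monotonic = all(a < b for a, b in zip(woes, woes[1:]))
```

A monotonic WOE sequence across bins, as in this example, corresponds to the monotonic bad-user ratio that the binning adjustment aims for.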

In this embodiment, the binning structure of each characteristic variable may be used to characterize the evaluation criterion of that variable. Scores are assigned according to the binning structure of each characteristic variable to obtain the target scoring model. Taking age as an example, the score assignment may be 20 points for ages 0-18, 40 points for ages 19-35, 30 points for ages 36-50, and 10 points for ages over 50.

In this embodiment, the target scoring model may be used to score a target user according to the target user's input data. For example, if the data of the target user is: age 27, sex male, marital status married, education bachelor's degree, monthly income 10000, the target scoring model may give a score of 264. Whether the target user is a positive sample user may then be determined from the scoring result.
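A scorecard lookup of this kind can be sketched as follows. The age points are the ones given in the age example above; the monthly income scorecard is entirely hypothetical and is included only to show how per-variable points are summed. The sketch does not reproduce the score of 264, which depends on score assignments the text does not list.

```python
INF = float("inf")

# Each scorecard is a list of (inclusive upper bound, points), in ascending order.
SCORECARDS = {
    "age": [(18, 20), (35, 40), (50, 30), (INF, 10)],            # from the age example
    "monthly_income": [(5000, 10), (20000, 30), (INF, 50)],      # hypothetical
}

def bin_score(value, scorecard):
    """Return the points of the first bin whose upper bound covers `value`."""
    for upper, points in scorecard:
        if value <= upper:
            return points
    raise ValueError(f"no bin covers {value!r}")

def total_score(user, scorecards):
    """Sum the bin points over every variable that has a scorecard."""
    return sum(bin_score(user[name], card) for name, card in scorecards.items())

user = {"age": 27, "monthly_income": 10000}
score = total_score(user, SCORECARDS)
```

Categorical variables such as sex or marital status would use a direct category-to-points mapping instead of upper bounds.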

In this embodiment, for each scoring interval, the ratio of the cumulative number of positive samples to the total number of positive samples (good%) and the ratio of the cumulative number of negative samples to the total number of negative samples (bad%) may be calculated according to the binning structure of the characteristic variables, and a score may be assigned to each scoring interval according to the result calculated for that interval.
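The cumulative good% and bad% per scoring interval can be computed as in the sketch below. The interval labels and the counts are hypothetical, and how the resulting ratios are turned into scores is not specified by the text, so that step is left out.

```python
def cumulative_shares(intervals):
    """intervals: list of (label, n_pos, n_neg), ordered from the lowest to
    the highest scoring interval. Returns (label, good%, bad%) per interval,
    where both percentages are cumulative shares of their class totals."""
    total_pos = sum(p for _, p, _ in intervals)
    total_neg = sum(n for _, _, n in intervals)
    rows, cum_pos, cum_neg = [], 0, 0
    for label, n_pos, n_neg in intervals:
        cum_pos += n_pos
        cum_neg += n_neg
        rows.append((label, cum_pos / total_pos, cum_neg / total_neg))
    return rows

# Hypothetical counts of positive and negative samples per scoring interval.
rows = cumulative_shares([("0-100", 5, 50), ("100-200", 25, 30), ("200-300", 70, 20)])
```

In this toy data the low scoring intervals accumulate negative samples much faster than positive ones, which is the pattern a well-separated scoring model would show.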

In one scenario example, the approval process currently used within the bank for personal K loan (online consumer credit) applications is divided into two stages: a pre-credit stage and an application stage. In the pre-credit stage, a user checks the K loan limit through a channel such as mobile banking, online banking or an offline counter machine, and then enters the credit admission step; if admission is passed, a pre-credit limit is output and the application stage begins. In the application stage, the user applies for the K loan and a series of rules is applied; if the application passes this screening, an application amount is output, the user proceeds to signing and account opening, and the application approval process ends. This process treats all users identically, whether they are stock users or users new to the bank, so high-quality users cannot be screened out from the stock users and their limits cannot be raised in a principled way to provide them with personalized services.

In this scenario example, the existing K loan application approval process can be optimized by adding a stock high-quality user quota-increase module between the pre-credit stage and the application stage. For stock users, this adds a step that determines whether the stock user is a high-quality user, so that users are screened and identified. If a stock user is determined to be a high-quality user and already has a limit from the pre-credit stage, that limit is increased appropriately; if the user has no limit from the pre-credit stage, the high-quality user is granted a certain limit. This improves user satisfaction and user stickiness.

In one specific embodiment, the K loan application approval process may be as shown in FIG. 2, and the quota-increase module may include basic admission of stock users, a stock high-quality user scoring model, and a stock high-quality user quota-increase model. The module can perform admission, scoring and quota increase separately for two types of users, according to whether the user has an initial pre-credit limit.

In this embodiment, basic admission of the stock user is a necessary but not sufficient condition for the stock high-quality user quota-increase service. The user whose quota is to be increased must be checked against the basic credit-risk admission conditions, and the subsequent quota-increase process can be performed only if the user satisfies the basic admission rules. The basic admission rules may include, but are not limited to, at least one of the following: whether the user's current credit status falls within the range of settled stock users with credit extension, whether the user's current internal user score meets the minimum admission standard, whether the user was overdue during any of the K loans that the user has settled in the past, and the like.
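The three example rules can be combined into a single admission check, as in the sketch below. The field names and the minimum score of 60 are hypothetical placeholders; the text gives neither a data layout nor a threshold value.

```python
MIN_INTERNAL_SCORE = 60.0  # hypothetical minimum admission standard

def passes_basic_admission(user, min_score=MIN_INTERNAL_SCORE):
    """Return True only if every basic admission rule is satisfied.
    Passing is necessary for the quota-increase process but does not by
    itself make the user a high-quality user."""
    return (
        user["in_eligible_credit_range"]
        and user["internal_score"] >= min_score
        and user["overdue_count_in_settled_loans"] == 0
    )

eligible_user = {
    "in_eligible_credit_range": True,
    "internal_score": 75.0,
    "overdue_count_in_settled_loans": 0,
}
overdue_user = dict(eligible_user, overdue_count_in_settled_loans=1)
```

Only users for whom the check returns True would go on to the scoring model.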

In this embodiment, the stock high-quality user scoring model may be established as follows, taking users without an initial pre-credit limit as an example. Users who performed a K loan settlement operation during the period 20180601 to 20181130 and for whom the pre-credit calculation yielded no limit form an original sample user set. From this set, the business department selects a batch of users (5000 users) as high-quality users of the K loan product on the basis of past business experience, forming a target sample white list. At a target to non-target sample ratio of 1:5, 25000 stock users not identified by the business department as high-quality users are randomly drawn from the original sample user set, giving a user sample data set of 30000 users. For these 30000 users, 20181130 is taken as the modeling benchmark day, and their data up to 20181130 is used as the training sample data of the scoring model; 20191130 is taken as the test benchmark day, and their data up to 20191130 is used as the test sample data of the scoring model.
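The 1:5 sample construction can be sketched as below, scaled down from 5000 and 25000 users to 5 and 25 so that it runs instantly. The user identifiers, the function name and the random seed are hypothetical.

```python
import random

def build_sample(user_ids, whitelist, ratio=5, seed=0):
    """Label white-listed users 1 and draw `ratio` times as many other users
    at random, without replacement, labelled 0."""
    white = set(whitelist)
    targets = [u for u in user_ids if u in white]
    others = [u for u in user_ids if u not in white]
    rng = random.Random(seed)
    negatives = rng.sample(others, ratio * len(targets))
    return [(u, 1) for u in targets] + [(u, 0) for u in negatives]

# Scaled-down stand-in for the original sample user set and the white list.
original_users = [f"user{i:03d}" for i in range(60)]
white_list = original_users[:5]

sample = build_sample(original_users, white_list, ratio=5)
```

Fixing the seed keeps the drawn non-target users reproducible between the training and test extractions.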

In this embodiment, for each user in the user sample data set, various types of original data held within the bank are acquired, including multi-dimensional data such as basic user information, address information, internal user rating information, user grade information, loan contract information, loan account transaction records, loan repayment records, K loan transaction operation information, debit card contract information, debit card account transaction records, AUM (assets under management) information and the like. After data cleaning, the initial variable set is generated.

In this embodiment, the derivative variables can be calculated by analyzing the initial variables using two programming languages, Python and SQL. The generated derivative variables can include: user seniority data, such as how long the user has been with the bank and when the user signed up for electronic channels; preference data, such as the types, amounts, quantities and holding time of investment and wealth-management products; K loan behavior data, such as the historical number and frequency of loan applications, the number and frequency of signings, and the number of drawdowns and repayments; credit trend data, such as historical changes in loan credit, changes in the credit limit and changes in user identity; K loan usage data, such as historical K loan interest income, limit utilization rate and duration of use; loan overdue data, such as historical loan amounts, number of overdue events and overdue duration; and the province of the institution where historical K loans were signed and its degree of economic development.
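A minimal Python sketch of deriving per-user variables from transaction-level records by counting, summing and averaging. The record layout, the variable names and the reading of a date-based variable as "days from the first loan to the benchmark day" are assumptions made for illustration; the benchmark day 20181130 is the modeling benchmark day mentioned earlier.

```python
from collections import defaultdict
from datetime import date

def derive_variables(records, benchmark):
    """records: iterable of (user_id, loan_date, amount).
    Returns {user_id: {derived variable name: value}}."""
    grouped = defaultdict(list)
    for user_id, loan_date, amount in records:
        grouped[user_id].append((loan_date, amount))
    derived = {}
    for user_id, loans in grouped.items():
        amounts = [amount for _, amount in loans]
        first_loan = min(loan_date for loan_date, _ in loans)
        derived[user_id] = {
            "loan_count": len(loans),                                # counting
            "loan_amount_sum": sum(amounts),                         # summing
            "loan_amount_avg": sum(amounts) / len(amounts),          # averaging
            "days_since_first_loan": (benchmark - first_loan).days,  # date to number
        }
    return derived

# Hypothetical loan records for two users.
records = [
    ("u1", date(2018, 6, 1), 1000.0),
    ("u1", date(2018, 9, 1), 3000.0),
    ("u2", date(2018, 11, 1), 500.0),
]
derived = derive_variables(records, benchmark=date(2018, 11, 30))
```

The same aggregations could equally be written as a SQL `GROUP BY` over the loan table.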

In this embodiment, characteristic variable screening combines univariate analysis with the random forest machine learning algorithm to achieve a better screening effect. First, univariate analysis is performed on the variables, which specifically includes: using the population stability index (PSI) to analyze the stability of each variable month by month, and removing variables with low stability (PSI > 0.1); using the information value (IV) to analyze the individual predictive capability of each variable for the target event, and removing variables with low predictive capability (IV < 0.05); and using the correlation coefficient to analyze the correlation between variables, where, for two variables with high correlation (greater than 0.7), the variable with the higher IV is retained.
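The three screening stages can be chained as in the sketch below, using the thresholds stated above. It assumes the PSI, IV and pairwise correlation values have already been computed; the variable names and values are hypothetical, and resolving correlated pairs greedily in descending order of IV is one possible way to keep the higher-IV variable of each pair.

```python
def univariate_filter(psi, iv, corr, psi_max=0.1, iv_min=0.05, corr_max=0.7):
    """psi, iv: {variable: value}; corr: {frozenset({a, b}): correlation}.
    Returns the retained variables in descending order of IV."""
    stable = [v for v in psi if psi[v] <= psi_max]        # drop PSI > 0.1
    predictive = [v for v in stable if iv[v] >= iv_min]   # drop IV < 0.05
    kept = []
    for var in sorted(predictive, key=lambda v: iv[v], reverse=True):
        # Keep `var` only if it is not highly correlated with a variable
        # that was already kept (which, by ordering, has a higher IV).
        if all(abs(corr.get(frozenset((var, k)), 0.0)) <= corr_max for k in kept):
            kept.append(var)
    return kept

# Hypothetical precomputed statistics for five derivative variables.
psi_values = {"a": 0.02, "b": 0.30, "c": 0.05, "d": 0.04, "e": 0.01}
iv_values = {"a": 0.40, "b": 0.50, "c": 0.01, "d": 0.20, "e": 0.30}
correlations = {frozenset(("a", "e")): 0.9, frozenset(("a", "d")): 0.1}

kept = univariate_filter(psi_values, iv_values, correlations)
```

Here `b` is removed for instability, `c` for low predictive capability, and `e` because it is highly correlated with `a`, which has the higher IV.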

In this embodiment, a binning operation may be performed on the finally screened characteristic variables. WOE is used to analyze each group of binned data, and the binning structure of each variable is then adjusted according to IV until the best binning effect is achieved (the bad-user ratio is monotonic across the bins). Finally, scores are assigned to each variable to obtain an admission scoring model for the stock high-quality user quota increase.

In this embodiment, since the loan term of the loan business to which the stock high-quality user scorecard model applies is 1 year, the model may be used to score users monthly on the test sample data from 20181101 to 20191031. Whether the proportion of users in each score interval relative to the total number of users changes significantly is observed in order to calculate the stability index of the model. If the calculated stability index is less than 10%, the stock high-quality user scoring model is considered relatively stable.
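The text does not state the formula for the stability index, so the sketch below assumes the usual population stability index: the sum, over score intervals, of the difference between the current and baseline shares multiplied by the natural logarithm of their ratio. The per-interval head counts are hypothetical.

```python
import math

def population_stability_index(baseline_counts, current_counts):
    """PSI between two distributions given as head counts per score interval.
    Assumes every interval is non-empty in both distributions."""
    base_total, cur_total = sum(baseline_counts), sum(current_counts)
    value = 0.0
    for base, cur in zip(baseline_counts, current_counts):
        base_share, cur_share = base / base_total, cur / cur_total
        value += (cur_share - base_share) * math.log(cur_share / base_share)
    return value

# Hypothetical numbers of users per score interval.
baseline = [100, 200, 400, 300]
similar_month = [110, 190, 390, 310]
shifted_month = [400, 300, 200, 100]

psi_similar = population_stability_index(baseline, similar_month)
psi_shifted = population_stability_index(baseline, shifted_month)
is_stable = psi_similar < 0.10  # the 10% threshold from the text
```

Computing this value for each month against the modeling baseline gives the monthly stability check described above.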

In this embodiment, for a settled stock user who has no pre-credit limit, the quota-increase decision can be based on credit data from a recent period (the amount of the user's last successful K loan within the last n months). For a settled stock user who has a pre-credit limit, the quota-increase proportion can be determined by jointly considering a credit scale factor and the scoring result. In this way, accurate marketing and differential pricing can be realized.

In this embodiment, the predicted values of the model may be recorded and monitored against the daily updated data of the credit system. By comparing the predicted values with the actual performance of users, the variables and related parameters of the model can be adjusted continually, so that the model is iteratively optimized.

In this embodiment, the high-quality user quota-increase model based on the random forest algorithm can, on the one hand, reduce the cost of decision communication within the bank, gradually reduce reliance on the previous review approach based on business experience, reduce errors of subjective human judgment, and increase the proportion of decisions made on a scientific basis. On the other hand, the random forest algorithm can mine the existing data in depth, accurately screen out high-quality users while keeping risk under control, raise the limits of high-quality users as far as possible, and improve users' satisfaction with the bank's own K loan business, thereby improving user stickiness and the competitiveness of the K loan business. It should be noted, however, that the foregoing specific examples are provided for the purpose of better illustrating the present application and are not to be construed as limiting the present application.

Based on the same inventive concept, the embodiments of the present application further provide a device for screening characterization data, as described in the following embodiments. Because the device solves the problem on a principle similar to that of the characterization data screening method, its implementation may refer to the implementation of the method, and the details are not repeated here. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated. Fig. 3 is a structural block diagram of a device for screening characterization data according to an embodiment of the present application. As shown in Fig. 3, the device may include an obtaining module 301, a data processing module 302, a univariate analysis module 303 and a variable screening module 304, which are described below.

An obtaining module 301, configured to obtain an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users;

the data processing module 302 may be configured to perform data processing on initial variables in the initial variable set to obtain a derivative variable set; the derived variable set is a set of new variables derived according to the initial variables;

the univariate analysis module 303 may be configured to perform univariate analysis on each derived variable in the derived variable set to obtain a filtered derived variable set; wherein univariate analysis is used to determine the characterization capability of a single variable;

the variable screening module 304 may be configured to screen derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, where the characteristic variable is a variable used to characterize whether a user is a positive sample user in the derived variables.

The embodiment of the present application further provides an electronic device, which may specifically refer to a schematic structural diagram of the electronic device based on the characterization data screening method provided in the embodiment of the present application shown in fig. 4, and the electronic device may specifically include an input device 41, a processor 42, and a memory 43. The input device 41 may specifically be used to input an initial set of variables. The processor 42 may be specifically configured to obtain an initial set of variables; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; carrying out data processing on initial variables in the initial variable set to obtain a derivative variable set; the derived variable set is a set of new variables derived according to the initial variables; carrying out univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; wherein univariate analysis is used to determine the characterization capability of a single variable; and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether the user is a positive sample user in the derived variables. The memory 43 may be specifically used for storing parameters such as a set of derived variables, characteristic variables, and the like.

In this embodiment, the input device may be one of the main apparatuses for information exchange between a user and a computer system. The input device may include a keyboard, mouse, camera, scanner, light pen, handwriting input panel, voice input device and the like; it is used to input raw data and the programs that process the data into the computer. The input device can also acquire and receive data transmitted by other modules, units and devices. The processor may be implemented in any suitable way. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The memory may in particular be a memory device used in modern information technology for storing information. The memory may include multiple levels: in a digital system, anything capable of storing binary data may serve as memory; in an integrated circuit, a circuit with a storage function but without a physical form is also called a memory, such as a RAM or a FIFO; in a system, a storage device in physical form is also called a memory, such as a memory module or a TF card.

In this embodiment, the functions and effects specifically realized by the electronic device can be explained by comparing with other embodiments, and are not described herein again.

The embodiment of the present application further provides a computer storage medium based on the characterization data screening method, where the computer storage medium stores computer program instructions, and when the computer program instructions are executed, the computer storage medium may implement: acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users; carrying out data processing on initial variables in the initial variable set to obtain a derivative variable set; the derived variable set is a set of new variables derived according to the initial variables; carrying out univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; wherein univariate analysis is used to determine the characterization capability of a single variable; and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether the user is a positive sample user in the derived variables.

In the present embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard disk (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.

In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer storage medium can be explained by comparing with other embodiments, and are not described herein again.

It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present application described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different from that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.

Although the present application provides method steps as in the above-described embodiments or flowcharts, additional or fewer steps may be included in the method, based on conventional or non-inventive efforts. In the case of steps where no necessary causal relationship exists logically, the order of execution of the steps is not limited to that provided by the embodiments of the present application. When the method is executed in an actual device or end product, the method can be executed sequentially or in parallel according to the embodiment or the method shown in the figure (for example, in the environment of a parallel processor or a multi-thread processing).

It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the application should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the pending claims along with the full scope of equivalents to which such claims are entitled.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and it will be apparent to those skilled in the art that various modifications and variations can be made in the embodiment of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for screening characterization data, comprising:

acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users;

performing data processing on the initial variables in the initial variable set to obtain a derivative variable set; wherein the derivative variable set is a set of new variables derived from the initial variables;

performing univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; wherein the univariate analysis is used to determine the characterization capability of a single variable;

and screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether a user is a positive sample user or not in the derived variables.

2. The method of claim 1, wherein obtaining an initial set of variables comprises:

acquiring a first sample data set in a preset time period; wherein the first sample data set comprises data of a plurality of positive sample users and a plurality of negative sample users within the preset time period;

performing data cleaning on the sample data set to obtain a second sample data set;

extracting values of a plurality of initial variables corresponding to each positive sample user and each negative sample user in the second sample data set;

and generating the initial variable set according to the values of a plurality of initial variables corresponding to each positive sample user and each negative sample user.

3. The method of claim 1, wherein performing data processing on the initial variables in the initial variable set comprises: counting, summing, averaging and date compressing the values of the initial variables of the users in the initial variable set.

4. The method of claim 1, wherein screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable comprises:

determining the importance value of each derived variable in the screened derived variable set by using the random forest algorithm;

performing descending order arrangement on the importance values of the derivative variables to obtain descending order arrangement results;

according to the descending order arrangement result, carrying out descending order accumulation on the importance values of the derivative variables;

and taking the derivative variable corresponding to the importance value obtained by accumulation in a preset range as a characteristic variable.

5. The method of claim 1, wherein performing univariate analysis on each of the derived variables in the set of derived variables to obtain a set of filtered derived variables comprises:

analyzing and determining the stability of each derivative variable in the derivative variable set according to a preset time frequency by utilizing a population stability index;

removing the derived variables with the stability smaller than a first preset threshold value from the derived variable set to obtain a first variable set;

calculating the information value of each variable in the first variable set;

removing variables with information values smaller than a second preset threshold value from the first variable set to obtain a second variable set;

determining the correlation between the variables in the second variable set by using the correlation coefficient;

for each pair of variables whose correlation is greater than or equal to a third preset threshold value, removing the variable with the lower information value from the second variable set, to obtain a third variable set;

and taking the third variable set as the derived variable set after screening.
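The three measures used in claim 5 can be sketched as below. The sketch rests on several assumptions that the claim does not state: each variable is already binned into counts, a small floor `eps` replaces empty-bin shares, the correlation coefficient is Pearson's and is compared by absolute value, and correlated pairs are pruned greedily in descending order of information value. All function names are hypothetical. Note that a larger population stability index means a less stable variable, so "stability smaller than the first threshold" would correspond to a PSI above some cutoff.

```python
import math


def _shares(counts, eps=1e-6):
    """Convert bin counts to proportions, flooring empty bins at eps."""
    total = sum(counts)
    return [max(c / total, eps) for c in counts]


def psi(expected_counts, actual_counts):
    """Population stability index between two binned distributions."""
    e, a = _shares(expected_counts), _shares(actual_counts)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def information_value(pos_counts, neg_counts):
    """Information value from per-bin positive and negative counts."""
    p, n = _shares(pos_counts), _shares(neg_counts)
    return sum((pi - ni) * math.log(pi / ni) for pi, ni in zip(p, n))


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def drop_correlated(values, iv, threshold):
    """Visit variables from highest to lowest IV and keep one only if its
    absolute correlation with every kept variable is below `threshold`."""
    kept = []
    for name in sorted(values, key=lambda v: iv[v], reverse=True):
        if all(abs(pearson(values[name], values[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Illustrative data only.
shift = psi([50, 50], [10, 90])                  # large shift between periods
strength = information_value([80, 20], [20, 80])
values = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
iv = {"a": 0.3, "b": 0.5, "c": 0.2}
third_set = drop_correlated(values, iv, threshold=0.8)
```

Here `a` and `b` are perfectly correlated, so `a` (the lower information value of the pair) is removed, while `c` is only weakly correlated with `b` and is retained.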

6. The method of claim 1, further comprising, after obtaining the at least one characteristic variable:

performing a binning operation on the at least one characteristic variable to obtain a binning structure of each characteristic variable, wherein the binning structure is used for representing the evaluation standard of the characteristic variable;

and assigning scores to each characteristic variable according to the binning structure of each characteristic variable to obtain a target scoring model, wherein the target scoring model is used for scoring a target user according to input data of the target user.
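One way the binning and score assignment of claim 6 might look is sketched below. The claim does not specify how scores are assigned; the sketch assumes the common scorecard convention of points proportional to each bin's weight of evidence (WOE). The cut points, the `scale` factor, the smoothing constant `eps`, and all function names are hypothetical.

```python
import bisect
import math


def bin_index(value, edges):
    """Index of the bin containing `value`; each cut point in the sorted
    list `edges` is the inclusive lower bound of the next bin."""
    return bisect.bisect_right(edges, value)


def woe_points(values, labels, edges, scale=20.0, eps=0.5):
    """Assign integer points to each bin, proportional to its smoothed WOE.

    `labels` uses 1 for positive sample users and 0 for negative ones.
    """
    n_bins = len(edges) + 1
    pos, neg = [0] * n_bins, [0] * n_bins
    for v, y in zip(values, labels):
        i = bin_index(v, edges)
        if y == 1:
            pos[i] += 1
        else:
            neg[i] += 1

    total_pos, total_neg = sum(pos), sum(neg)
    points = []
    for p, n in zip(pos, neg):
        # eps smoothing keeps the logarithm finite for empty bins.
        pos_share = (p + eps) / (total_pos + eps * n_bins)
        neg_share = (n + eps) / (total_neg + eps * n_bins)
        points.append(round(scale * math.log(pos_share / neg_share)))
    return points


def score_user(user, scorecard):
    """Sum the points of the bins that the user's values fall into."""
    return sum(
        card["points"][bin_index(user[name], card["edges"])]
        for name, card in scorecard.items()
    )


# Illustrative data: one characteristic variable with a single cut point.
edges = [5]
points = woe_points([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1], edges)
scorecard = {"holding_months": {"edges": edges, "points": points}}
score = score_user({"holding_months": 11}, scorecard)
```

In this sketch the dictionary `scorecard` plays the role of the target scoring model: it stores the binning structure and the points per bin, and `score_user` applies it to a target user's input data.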

7. The method of claim 1, wherein the initial variables in the initial variable set comprise at least one of: user basic information, user address, user score, user rank, loan contracts, loan accounts, loan account transaction records, loan repayment records, debit card contracts, debit card accounts, debit card account transaction records, and asset management scale;

the derived variables in the set of derived variables include at least one of: user seniority data, investment and financing product holding type, amount, quantity, holding duration, historical loan behavior, credit granting change trend, loan use condition, loan overdue data and economic development degree of province and city where the historical loan signing institution is located.

8. A device for screening characterization data, comprising:

the acquisition module is used for acquiring an initial variable set; the initial variable set comprises values of initial variables corresponding to a plurality of positive sample users and values of initial variables corresponding to a plurality of negative sample users;

the data processing module is used for carrying out data processing on the initial variables in the initial variable set to obtain a derivative variable set; wherein the derivative variable set is a set of new variables derived from the initial variables;

the univariate analysis module is used for performing univariate analysis on each derivative variable in the derivative variable set to obtain a screened derivative variable set; wherein the univariate analysis is used to determine the characterization capability of a single variable;

and the variable screening module is used for screening the derived variables in the screened derived variable set by using a random forest algorithm to obtain at least one characteristic variable, wherein the characteristic variable is a variable which is used for representing whether a user is a positive sample user in the derived variables.

9. A device for screening characterization data, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements the steps of the method of any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 7.