CN106485528A - Method and apparatus for detecting data

Info

Publication number: CN106485528A
Application number: CN201510552641.2A
Authority: CN
Inventors: 谢世鹏
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2015-09-01
Filing date: 2015-09-01
Publication date: 2017-03-08

Abstract

This application discloses a method and apparatus for detecting data. The method includes: reading a confidence interval determined based on the sample weights of labeled samples, where the sample weights are obtained by training on pre-acquired labeled samples with a weight model; judging whether data to be verified is within the confidence interval, to obtain a first judgment result; and judging, according to the first judgment result, whether the data to be verified is valid data, to obtain a second judgment result. The application addresses the technical problem of inaccurate check results when the legitimacy of data is verified, and achieves accurate verification of data legitimacy.

Description

Method and apparatus for detecting data

Technical field

The present application relates to the field of data processing, and in particular to a method and apparatus for detecting data.

Background art

On an Internet virtual trading platform, the virtual resource data of all kinds of virtual objects is mixed together. To better manage this data and distinguish its legitimacy, whether the virtual resource data (such as a mobile phone price) published by an account (such as a merchant) is legitimate can be judged based on a confidence interval of the data. On current e-commerce websites, the prices of the various categories of goods published by merchants vary widely and are very numerous. Relying on manual judgment to classify the data incurs a high labor cost, and manual judgment is highly subjective, so the results are inaccurate.

The prior art provides a method for estimating a data confidence interval. After preprocessing the data, the method directly calculates the mean and variance of the data and, according to a preset multiple of the variance, determines the estimated data confidence interval (that is, an upper limit and a lower limit). Data is then classified by judging whether it falls within the confidence interval of its category. Because this existing method determines the confidence interval from the mean and variance of the data values alone, the resulting interval is inaccurate, which makes the judgment of new data to be verified less accurate and less stable.

No effective solution has yet been proposed for the above problem of inaccurate check results when the legitimacy of data is verified.

Summary of the invention

The embodiments of the present application provide a method and apparatus for detecting data, to solve at least the technical problem of inaccurate check results when the legitimacy of data is verified.

According to one aspect of the embodiments of the present application, a method for detecting data is provided. The method includes: reading a confidence interval determined based on the sample weights of labeled samples, where the sample weights are obtained by training on pre-acquired labeled samples with a weight model; judging whether data to be verified is within the confidence interval, to obtain a first judgment result; and judging, according to the first judgment result, whether the data to be verified is valid data, to obtain a second judgment result.

According to another aspect of the embodiments of the present application, an apparatus for detecting data is also provided. The apparatus includes: a reading module, configured to read a confidence interval determined based on the sample weights of labeled samples, where the sample weights are obtained by training on pre-acquired labeled samples with a weight model; a first judging module, configured to judge whether data to be verified is within the confidence interval, to obtain a first judgment result; and a second judging module, configured to judge, according to the first judgment result, whether the data to be verified is valid data, to obtain a second judgment result.

In the embodiments of the present application, sample weights are obtained by training, and the confidence interval is determined based on the sample weights. When the confidence interval is determined, the sample weights distinguish the importance of the labeled samples. That is, the positive influence of the more credible data in the labeled samples on the confidence interval is increased, and the interfering influence of unreliable data in the labeled samples is reduced, so that the confidence interval approaches the true confidence interval of the data. Using this confidence interval to classify all kinds of data to be verified automatically ensures the accuracy, stability, and reliability of the estimate. The application thereby achieves accurate verification of data legitimacy and solves the technical problem of inaccurate check results when the legitimacy of data is verified.

Brief description of the drawings

The drawings described here are provided for further understanding of the present application and form a part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a hardware block diagram of a computer terminal according to an embodiment of the present application;

Fig. 2 is a flow chart of a method for detecting data according to an embodiment of the present application;

Fig. 3 is a flow chart of training on labeled samples with a weight model according to an embodiment of the present application;

Fig. 4 is a flow chart of an optional method for detecting data according to an embodiment of the present application;

Fig. 5 is a schematic diagram of an apparatus for detecting data according to an embodiment of the present application;

Fig. 6 is a schematic diagram of an optional apparatus for detecting data according to an embodiment of the present application;

Fig. 7 is a schematic diagram of another optional apparatus for detecting data according to an embodiment of the present application; and

Fig. 8 is a schematic diagram of a network environment of a computer terminal according to an embodiment of the present application.

Detailed description of the embodiments

To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the application without creative effort shall fall within the scope of protection of the application.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used is interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

Embodiment 1

According to an embodiment of the present application, an embodiment of a method for detecting data is provided. It should be noted that the steps shown in the flow charts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flow charts, in some cases the steps shown or described may be executed in a different order.

The method embodiment provided here may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, Fig. 1 is a hardware block diagram of a computer terminal according to an embodiment of the present application. As shown in Fig. 1, the computer terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.

The memory 104 may be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the method for detecting data in the embodiments of the present application. The processor 102 runs the software programs and modules stored in the memory 104 to execute various functional applications and data processing, that is, to implement the above method for detecting data. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and this remote memory may be connected to the computer terminal 10 through a network. Examples of such a network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The transmission device 106 is used to receive or send data via a network. A specific example of the above network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station and thereby communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (Radio Frequency, RF) module, which communicates with the Internet wirelessly.

In the above operating environment, the present application provides a method for detecting data as shown in Fig. 2. Fig. 2 is a flow chart of a method for detecting data according to an embodiment of the present application.

As shown in Fig. 2, the method for detecting data includes the following steps:

Step S202: Read a confidence interval determined based on the sample weights of labeled samples, where the sample weights are obtained by training on pre-acquired labeled samples with a weight model.

Step S204: Judge whether the data to be verified is within the confidence interval, to obtain a first judgment result.

Step S206: According to the first judgment result, judge whether the data to be verified is valid data, to obtain a second judgment result.

Optionally, if the first judgment result indicates that the data to be verified is within the confidence interval, the data to be verified is judged to be valid data; if the first judgment result indicates that the data to be verified is not within the confidence interval, the data to be verified is judged to be invalid data.

In the above solution, sample weights can be obtained by training on the labeled samples with the constructed model, and the confidence interval is determined based on these sample weights. When the confidence interval is determined, the sample weights distinguish the importance (for example, the data quality) of the labeled samples. That is, the positive influence of the more credible data in the labeled samples on the confidence interval is increased, and the interfering influence of unreliable data in the labeled samples is reduced, so that the confidence interval approaches the true confidence interval of the data. Using this confidence interval to classify all kinds of data to be verified automatically ensures the accuracy, stability, and reliability of the estimate. In the above embodiment, the constructed weight model distinguishes good data from bad, the confidence interval of the data is redetermined, and the quality of the data to be verified is distinguished automatically based on this confidence interval, which ensures that the final estimation result (that is, the second judgment result) is accurate and reliable. The application thus solves the problem of inaccurate check results when the legitimacy of data is verified and achieves accurate verification of data legitimacy.

Specifically, after the terminal collects virtual resource information, it extracts the data to be verified from the virtual resource information, reads the confidence interval corresponding to the object described by the virtual resource information, and judges whether the data to be verified is within that confidence interval. If the data to be verified is within the confidence interval, it is judged to be valid data, that is, the virtual resource information is determined to be valid information; if the data to be verified is not within the confidence interval, it is judged to be invalid data, that is, the virtual resource information is invalid information. After it is determined whether each piece of data to be verified is valid data (and/or whether the virtual resource information is valid information), the judgment result is recorded and saved in the memory.

Here, virtual resource information is information that describes a virtual resource. Virtual resources are so named in contrast to real-world resources; a virtual resource may be any of the various elements circulating on the Internet. Specifically, information resources compiled using databases and programs are virtual resources, for example the information about trading objects in an online library or an online shopping mall.

The virtual resource information may be used to describe an attribute value of the virtual resource, such as the value of a trading object.

Optionally, after it is determined that the data to be verified is invalid data and/or that the virtual resource information is invalid information, the virtual resource information may be marked to indicate that it is not credible.

Optionally, when virtual resource information is marked, invalid data (or information) and valid data (or information) may be marked in different colors, or marked with different identifiers, to indicate that the virtual resource information is not credible.

Optionally, after it is determined that the data to be verified is invalid data and/or that the virtual resource information is invalid information, the invalid data/information may be deleted from the corresponding page.

Optionally, the method of the present application may be used to judge the legitimacy of virtual resource parameters on a virtual trading platform. For example, a virtual trading platform may hold virtual resource transaction information for all kinds of trading objects (such as mobile phones and household goods) published by a large number of accounts (such as virtual merchants).

The above embodiment is described in detail below, taking the virtual transaction information of mobile phone merchants on a virtual trading platform as an example:

The terminal (such as a server) collects virtual resource transaction information from the virtual trading platform, for example: the virtual resource parameter of trading object B (such as a mobile phone) of account A (such as a mobile phone merchant) is 5000. The virtual resource parameter (that is, the mobile phone price of 5000) is extracted from the virtual resource transaction information to obtain the data to be verified. The confidence interval corresponding to trading object B is read; for example, the confidence interval for mobile phone transactions is (1000, 6000). It is then judged whether each piece of data to be verified is within the confidence interval, that is, whether the mobile phone price of 5000 is within (1000, 6000). In this embodiment, the price of 5000 is within the confidence interval (1000, 6000), so the data to be verified (the price of 5000) is judged to be valid data, and this result (for example, "5000 is valid data") is recorded. If the data to be verified (for example, a mobile phone price of 50) is not within the confidence interval (1000, 6000), it is judged to be invalid data, and this result (for example, "50 is invalid data") is recorded.
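The mobile phone example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `check_data` and the tuple representation of the interval are hypothetical, and the interval endpoints are treated as inclusive, which the text does not specify.

```python
def check_data(value, interval):
    """Return (first_result, second_result) for one piece of data to be verified.

    first_result:  whether the value lies within the confidence interval.
    second_result: whether the value is judged to be valid data.
    Assumption: endpoints are inclusive; the source text does not say.
    """
    low, high = interval
    first_result = low <= value <= high
    # Inside the interval means valid data; outside means invalid data.
    second_result = first_result
    return first_result, second_result


# Values taken from the mobile phone example in the text.
phone_interval = (1000, 6000)
records = {price: check_data(price, phone_interval)[1] for price in (5000, 50)}
print(records)
```

Here `records` plays the role of the recorded judgment results: the price 5000 maps to valid and the price 50 maps to invalid.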

In the above embodiment, accurate estimation of the confidence interval of the labeled samples makes it possible to judge the legitimacy of the data to be verified (such as mobile phone price data) effectively. The above embodiment thus solves the technical problem of inaccurate judgment results when the legitimacy of data to be verified is judged.

Optionally, before the confidence interval determined based on the sample weights of the labeled samples is read, the method may include: obtaining multiple labeled samples, each of which has a sample value; extracting the attribute data of each labeled sample and building a weight model based on the attribute data of each labeled sample; training on the labeled samples with the weight model to obtain the sample weight of each labeled sample; extracting the sample value of each labeled sample, where the sample value characterizes the virtual resource parameter corresponding to the object described by the labeled sample; and determining the confidence interval of the multiple labeled samples based on the sample value and the sample weight of each labeled sample.

In the solution disclosed in the above embodiment, before the confidence interval determined based on the sample weights of the labeled samples is read, a weight model can be built based on the attribute data of each acquired labeled sample. Because the attribute data of a labeled sample (such as the quality of the sample itself) can serve as a measure of its credibility in the data distribution, building the weight model uses the characteristics of the sample's own attribute data. In other words, the credibility of the labeled samples is taken into account when the sample weights are determined, which improves the accuracy and reliability of the weight model. Once an accurate weight model has been built, it can be trained on the labeled samples to obtain accurate sample weights, and an accurate confidence interval is determined from these sample weights and the extracted sample value of each sample, which improves the accuracy and reliability of the confidence interval of the labeled samples. The confidence interval is then read, and it is judged whether each piece of data to be verified is within it: if so, the data is judged to be valid data; if not, it is judged to be invalid data. The judgment result for each piece of data to be verified (valid or invalid) is recorded. The above embodiment thereby improves the accuracy and reliability of the weight model.

Specifically, after the terminal obtains multiple labeled samples with sample values, it extracts the attribute data from the labeled samples and builds the weight model of the labeled samples based on the attribute data; trains on the labeled samples with the weight model to obtain the sample weight of each labeled sample; then extracts the sample value of each labeled sample, that is, the virtual resource parameter corresponding to the object described by the labeled sample; and then determines the confidence interval of the multiple labeled samples based on the sample values and sample weights. The terminal extracts the data to be verified from the virtual resource information, reads the confidence interval corresponding to the object described by the virtual resource information, and judges whether the data to be verified is within that confidence interval. If the data to be verified is within the confidence interval, it is judged to be valid data, that is, the virtual resource information is determined to be valid information; if the data to be verified is not within the confidence interval, it is judged to be invalid data, that is, the virtual resource information is invalid information. After it is determined whether each piece of data to be verified is valid data (and/or whether the virtual resource information is valid information), the judgment result is recorded and saved in the memory.

Optionally, the method of the present application may be used to extract virtual resource parameters on a virtual trading platform. For example, a virtual trading platform may hold virtual resource transaction information for all kinds of trading objects (such as mobile phones and household goods) published by a large number of accounts (such as virtual merchants). The above embodiment is described in detail below, taking the virtual transaction information of mobile phone merchants on a virtual trading platform as an example:

The terminal (such as a server) obtains multiple labeled samples from the virtual trading platform, each of which has a sample value. For example, an obtained labeled sample is: virtual resource parameter C of trading object B of account A (such as the price 5000 of commodity B of merchant A), where the labeled sample has a sample value (for example, the mobile phone price of 5000). The attribute data of each labeled sample (such as the brand, price, and performance parameters of the mobile phone) is extracted, and the weight model is built based on the attribute data of each labeled sample. The weight model is trained on the labeled samples to obtain the sample weight of each labeled sample. The sample value of each labeled sample is extracted, where the sample value is the virtual parameter (such as the mobile phone price of 5000) of the object described by the labeled sample (trading object B, such as a mobile phone). The terminal (such as a server) collects virtual resource transaction information from the virtual trading platform and extracts the virtual resource parameter (that is, the mobile phone price of 5000) from it to obtain the data to be verified. The confidence interval corresponding to trading object B is read; for example, the confidence interval for mobile phone transactions is (1000, 6000). It is then judged whether each piece of data to be verified is within the confidence interval, that is, whether the mobile phone price of 5000 is within (1000, 6000). In this embodiment, the price of 5000 is within the confidence interval (1000, 6000), so the data to be verified (the price of 5000) is judged to be valid data, and this result (for example, "5000 is valid data") is recorded. If the data to be verified (for example, a mobile phone price of 50) is not within the confidence interval (1000, 6000), it is judged to be invalid data, and this result (for example, "50 is invalid data") is recorded.

Optionally, extracting the attribute data of each labeled sample and building the weight model based on the attribute data of each labeled sample includes: extracting the weight parameters of each labeled sample, where the weight parameters include a bonus parameter, a scoring weight, a deduction parameter, and a deduction weight. The bonus parameter describes the positive-review score obtained by the account of the labeled sample; the scoring weight describes the ratio of the account's positive-review score to the account's credit; the deduction parameter describes the penalty score obtained by the account; and the deduction weight describes the ratio of the number of penalized accounts on the terminal where the account is located to the number of all accounts on that terminal. The weight model is obtained based on the bonus parameter, the scoring weight, the deduction parameter, and the deduction weight, where the weight model is:

$f_w(g, p) = w_1 \cdot \{g - a \cdot w_2 \cdot p\}$, where

g is the bonus parameter; p is the deduction parameter; w₁ is the scoring weight; w₂ is the deduction weight; and β₁ and β₂ are the learning parameters of the weight model.
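As a rough illustration, the weight model formula can be written as a small Python function. In this sketch `w1`, `w2`, and `a` are passed in directly as numbers; how w₁ and w₂ depend on the learning parameters β₁ and β₂ is not given in the text, so that step is deliberately left out. The function name `weight_model`, the default `a = 1.0`, and the sample inputs are assumptions made for illustration only.

```python
def weight_model(g, p, w1, w2, a=1.0):
    """Evaluate f_w(g, p) = w1 * (g - a * w2 * p).

    g:  bonus parameter (positive-review score of the account)
    p:  deduction parameter (penalty score of the account)
    w1: scoring weight
    w2: deduction weight
    a:  coefficient from the formula; its value is an assumption here
    """
    return w1 * (g - a * w2 * p)


# Hypothetical inputs: a larger penalty lowers the raw weight of the sample.
score = weight_model(g=10.0, p=2.0, w1=0.5, w2=0.25)
score_no_penalty = weight_model(g=10.0, p=0.0, w1=0.5, w2=0.25)
print(score, score_no_penalty)
```

With these inputs the penalized account receives a lower raw weight than the same account without a penalty, which is the qualitative behavior the formula encodes.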

In the above embodiment, the weight model is built from the attribute data extracted from each labeled sample, and the model can be trained on the labeled samples to obtain the sample weights. The automatic learning of the weight model is also called training the model. As with other machine learning models, an iterative optimization method based on gradient descent can be used. The above embodiment uses a supervised machine learning algorithm, the back-propagation (BP) algorithm, which learns from the labeled samples and adjusts each parameter to fit the sample results. The whole training process comprises two passes: a forward pass and a backward pass of the error. To minimize the residual between the output of the weight model and the label values of the labeled samples, the gradient of the residual with respect to the parameters of the whole weight model can be computed, and the parameters of the weight model are then optimized iteratively by stochastic gradient descent. This optimization method improves the efficiency of training while preserving the accuracy with which the virtual resource parameters are learned. The training steps are shown in Fig. 3:

Step S301: Initialize the weight parameters of the weight model.

Specifically, the weight parameters of each labeled sample are extracted, where the weight parameters include a bonus parameter, a scoring weight, a deduction parameter, and a deduction weight. The bonus parameter describes the positive-review score obtained by the account of the labeled sample; the scoring weight describes the ratio of the account's positive-review score to the account's credit; the deduction parameter describes the penalty score obtained by the account; and the deduction weight describes the ratio of the number of penalized accounts on the terminal where the account is located to the number of all accounts on that terminal.

Step S302: Update the parameters of the weight model.

Step S303: Calculate the sample weights of the labeled samples.

The weight model is obtained based on the bonus parameter, the scoring weight, the deduction parameter, and the deduction weight, where the weight model may be:

$f_w(g, p) = w_1 \cdot \{g - a \cdot w_2 \cdot p\}$, with g, p, w₁, and w₂ as defined above.

Step S304: Determine the data confidence interval.

Specifically, the weight model is trained on the labeled samples to obtain the sample weight of each labeled sample, and the confidence interval of the multiple labeled samples is determined based on the sample value and the sample weight of each labeled sample.

Step S305: Calculate the residual.

Specifically, the difference (that is, the residual) between the weight parameters in the weight model and the sample values is calculated, so that the gradient of the residual with respect to the parameters of the whole weight model can be computed.

Step S306: Judge whether the weight model has converged. If the weight model has converged, the training process ends. If the weight model has not converged, execute step S307 and then return to step S302.

Step S307: Calculate the gradient.

The iterative optimization method of gradient descent is used, and the process returns to the parameter update of the weight model.

In the training process, when the weight parameters are initialized, the gradient descent learning rate may be 0.01, and the value range of the training parameters β₁ and β₂ may be [0,100].
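The loop formed by steps S301, S302, S305, S306, and S307 can be sketched as plain gradient descent. The following Python sketch is illustrative only and makes several assumptions that are not in the text: the learning parameters β₁ and β₂ are used directly in place of w₁ and w₂, a is fixed at 1.0, the loss is the mean squared residual against a hypothetical label value per sample, full-batch gradient descent stands in for stochastic gradient descent, and convergence is declared when the loss changes by less than a tolerance. The learning rate 0.01 and the range [0,100] come from the text; all names and the toy data are hypothetical. Steps S303 and S304 are omitted.

```python
def train_weight_model(samples, lr=0.01, tol=1e-9, max_iter=5000, a=1.0):
    """Fit (b1, b2) by gradient descent; samples are (g, p, label) triples."""
    b1, b2 = 0.5, 0.5                  # S301: initialize the parameters (assumed values)
    n = len(samples)

    def loss_and_grad(b1, b2):
        loss = grad1 = grad2 = 0.0
        for g, p, y in samples:
            inner = g - a * b2 * p
            r = b1 * inner - y         # S305: residual of model output vs. label
            loss += r * r / n
            grad1 += 2.0 * r * inner / n            # S307: d(loss)/d(b1)
            grad2 += 2.0 * r * b1 * (-a * p) / n    # S307: d(loss)/d(b2)
        return loss, grad1, grad2

    history = []
    previous = None
    for _ in range(max_iter):
        loss, grad1, grad2 = loss_and_grad(b1, b2)
        history.append(loss)
        # S306: stop when the loss no longer changes appreciably.
        if previous is not None and abs(previous - loss) < tol:
            break
        previous = loss
        # S302: gradient step, with parameters clipped to [0, 100].
        b1 = min(100.0, max(0.0, b1 - lr * grad1))
        b2 = min(100.0, max(0.0, b2 - lr * grad2))
    return b1, b2, history


# Hypothetical toy data: (bonus parameter, deduction parameter, label value).
toy_samples = [(0.9, 0.0, 0.9), (0.8, 0.1, 0.7), (0.2, 0.6, 0.0), (0.5, 0.3, 0.2)]
b1, b2, history = train_weight_model(toy_samples)
print(b1, b2, history[0], history[-1])
```

The returned `history` lets a caller confirm that the loss falls over the iterations; a production system would presumably use a richer model and real labels.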

In the training process, the initialization of the virtual resource parameters is important. Before the sample weights output by the sample weight model are used to train the data confidence interval estimation model, the sample weights need to be normalized, that is, scaled to a uniform interval.

Optionally, training on the labeled samples with the weight model to obtain the sample weight of each labeled sample includes: training on the labeled samples with the weight model f_w(g, p) to determine the learning parameters β₁ and β₂ of the weight model; calculating the weight of each labeled sample with the weight model f_w(g, p) whose learning parameters have been determined; and normalizing the weight of each labeled sample to obtain the sample weight of the labeled sample.
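The normalization step can be illustrated with a short Python sketch. The text does not say which normalization is used; min-max rescaling to [0, 1] is one common choice and is assumed here. The function name `normalize_weights` and the handling of the all-equal case are likewise assumptions.

```python
def normalize_weights(raw_weights):
    """Rescale raw model outputs to the interval [0, 1] (min-max scaling).

    Assumption: min-max scaling is used; the source only says that the
    weights are normalized to a uniform interval.
    """
    lowest, highest = min(raw_weights), max(raw_weights)
    if highest == lowest:
        # All samples equally credible: give every sample the same weight.
        return [1.0 for _ in raw_weights]
    span = highest - lowest
    return [(w - lowest) / span for w in raw_weights]


# Hypothetical raw weights produced by the trained weight model.
sample_weights = normalize_weights([2.0, 4.0, 6.0])
print(sample_weights)
```

After rescaling, the least credible sample has weight 0 and the most credible has weight 1, so weights from differently scaled models become comparable.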

Optionally, determining the confidence interval of the multiple labeled samples based on the sample value and the sample weight of each labeled sample includes: building an interval determination model for the labeled samples using a Gaussian distribution; and obtaining the confidence interval corresponding to the interval determination model.

Specifically, a Gaussian distribution may be used to build the interval determination model f(x) for the labeled samples, $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, where x is the sample value of a labeled sample; μ is the mean of f(x); w_i is the sample weight of the i-th labeled sample; x_i is the sample value of the i-th labeled sample; n is the total number of labeled samples; i is a natural number; and σ is the standard deviation of f(x). The confidence interval region corresponding to the interval determination model f(x) is then obtained as region = [μ − k·σ, μ + k·σ], where k is a constant.
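A minimal Python sketch of the interval computation follows. It assumes that μ and σ are the weighted mean and weighted standard deviation of the sample values, with the sample weights rescaled to sum to 1; the exact formulas for μ and σ are not reproduced in the text, so this weighting scheme is an assumption, as are the function name and the sample numbers.

```python
import math


def weighted_confidence_interval(values, weights, k=3.0):
    """Return [mu - k*sigma, mu + k*sigma] as a (low, high) tuple.

    Assumption: mu and sigma are the weighted mean and weighted standard
    deviation, with weights rescaled so that they sum to 1.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one sample weight must be positive")
    normalized = [w / total for w in weights]
    mu = sum(w * x for w, x in zip(normalized, values))
    variance = sum(w * (x - mu) ** 2 for w, x in zip(normalized, values))
    sigma = math.sqrt(variance)
    return mu - k * sigma, mu + k * sigma


# Hypothetical prices; the implausible price 50 carries zero weight,
# so it does not distort the interval.
prices = [2000.0, 2100.0, 1900.0, 50.0]
weights = [1.0, 1.0, 1.0, 0.0]
low, high = weighted_confidence_interval(prices, weights)
print(low, high)
```

The design point is that a low-credibility sample shifts μ and σ only in proportion to its weight, which is how the sample weights make the interval less sensitive to unreliable data.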

In the above embodiment, data analysis shows that the distribution of the labeled samples follows a Gaussian distribution, so a confidence interval estimation model can be built for the labeled samples with a Gaussian distribution, and the confidence interval corresponding to that model can be obtained. According to statistical theory, different kinds of data in nature follow different distributions and call for different data models, and only a suitable data model fits the regularities of the data well. The above embodiment therefore adopts a weight model that matches the distribution of the labeled samples, so that a more accurate confidence interval can be obtained from it.

Specifically, the parameter k in the above embodiment specifies the range of the confidence interval: data to be verified inside the confidence interval is valid (normal) data of the category, and data to be verified outside the confidence interval is invalid (abnormal) data. The constant k is generally about 3.0.

Alternatively, obtain multiple mark samples to include：Obtain sample data, if there is no positive counter-example mark in sample data, Then sample data is carried out with positive counter-example mark, the sample data after being marked；Line number is entered to the sample data after mark According to filtration and normalized, obtain marking sample.

In the above embodiment, sample data is obtained and positive/negative example labels are added only to data that lacks them, which avoids relabeling data that is already labeled and so speeds up labeling and improves data processing efficiency. The collected sample data contains a good deal of noise (abnormal data), which can prevent the parameters from converging during training and thereby harm the accuracy and robustness of the weight model. The data therefore needs to be cleaned according to its own distribution: data filtering is applied to the labeled sample data to reject abnormal data points and reduce noise, so that abnormal data does not disturb the weight model, the sample data satisfies the conditions for convergence and conforms to its own distribution, and accuracy and generalization improve. The filtered sample data is then normalized to obtain the labeled samples, so that the weight model converges faster and achieves higher accuracy.

Specifically, the quality of sample collection and labeling directly determines the accuracy and reliability of the sample weights obtained by training, so it is a very important foundational step. The sample data has two parts: real-time data and historical audit data. Historical audit data is data that has already been judged manually and managed by category; it includes normal category data (positive example sample data) and non-category data (negative example sample data). Real-time data (data extracted in real time) is labeled as needed: abnormal category data is labeled as non-category data (negative example sample data), and normal data is labeled as normal data (positive example sample data). During training, the ratio of negative example sample data to positive example sample data is kept at 1:7.

During data filtering and preprocessing, the collected sample data contains a good deal of noise, and such abnormal sample data can prevent the parameters from converging during training and thereby harm the accuracy and robustness of the model. Abnormal data therefore needs to be cleaned out according to the distribution of the sample data itself. Whether a sample is noise is judged against the value of the rejection region RejectionRegion, calculated by the following formula:

$\mathrm{RejectionRegion} = \frac{t_{\alpha/2}\,(n-1)}{\sqrt{n}\,\sqrt{n-2+t_{\alpha/2}^{2}}}$

where t_{α/2} is a critical value of the Student's t-distribution of the sample data and n is the sample size.

When the judgment value of a sample exceeds the value of this rejection region, the sample is regarded as noise and the sample data is cleaned. The formulas and steps for cleaning the sample data are as follows:

First, the sample data is filtered. The sample data (e.g. commodity prices) is sorted by value, and the mean and standard deviation of the data are calculated; each data point is then checked to decide whether it is abnormal. If its judgment value is greater than the rejection region, the data point is rejected, the procedure returns to the sorting step, and the next data point is judged; if the judgment value is not greater than the rejection region, filtering stops.

Specifically, before the confidence-interval estimation model that learns the weights (sample weights) automatically is trained, the data needs to be normalized so that the model converges faster and reaches higher accuracy. Normalization mainly comprises two steps, mean removal and standard-deviation normalization, applied to the filtered labeled sample data. Mean removal uses the following formula:

$x_{new} = x - \frac{1}{n}\sum_{i=1}^{n} x_i$

where x_new is the sample data obtained after normalization, n is the total number of data values, x_i is the i-th sample data, and x is the sample value of the labeled sample data.

Standard-deviation normalization uses the following formula:

$x_{new} = \frac{x}{s}$

where s is the standard deviation.
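The two normalization steps can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether s is the sample or the population standard deviation, so the sample standard deviation is used here, and the function name `normalize` is hypothetical.

```python
import statistics

def normalize(values):
    """Remove the mean, then divide by the sample standard deviation."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # assumption: sample standard deviation
    if s == 0:
        raise ValueError("standard deviation is zero; the data is constant")
    centered = [x - mean for x in values]   # step 1: mean removal
    return [c / s for c in centered]        # step 2: standard-deviation normalization

normalized = normalize([2.0, 4.0, 6.0])
```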

Optionally, performing data filtering on the labeled sample data may include: obtaining a sequence by sorting the labeled sample data by sample value, and calculating the mean and standard deviation of the labeled sample data in the sequence; obtaining, in order, the judgment value δ of the labeled sample data in the sequence, where the judgment value is determined from the standard deviation of the labeled sample data, the sample value of the labeled sample data, and the mean of the sample values of the plurality of labeled sample data; and, if the judgment value is greater than a rejection region obtained in advance, rejecting the labeled sample data corresponding to the judgment value and returning to the step of obtaining the sorted sequence and calculating the mean and standard deviation, until the judgment value is no longer greater than the rejection region.

Specifically, performing data filtering on the labeled sample data may include: obtaining a sequence by sorting the labeled sample data by sample value, and calculating the mean and standard deviation of the labeled sample data in the sequence; obtaining, in order, the judgment value δ of the labeled sample data in the sequence, $\delta = \frac{|x - \mathrm{mean}(X)|}{s}$, where s is the standard deviation of the plurality of labeled sample data, x is the sample value of the labeled sample data, and mean(X) is the mean of the sample values of the plurality of labeled sample data; and, if the judgment value is greater than a rejection region obtained in advance, rejecting the labeled sample data corresponding to the judgment value and returning to the step of obtaining the sorted sequence and calculating the mean and standard deviation, until the judgment value is no longer greater than the rejection region.
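The filtering loop can be sketched in Python as below. Several points are assumptions rather than statements of the application: the judgment value is taken as δ = |x − mean(X)| / s, the rejection region is taken in the form of the modified Thompson τ test with n − 2 degrees of freedom, the point judged in each round is the end of the sorted sequence farthest from the mean, and the critical values are a small hard-coded table of standard two-sided 95% values. All function names are hypothetical.

```python
import math
import statistics

# Standard two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def rejection_region(n):
    """Assumed threshold: t*(n-1) / (sqrt(n) * sqrt(n - 2 + t^2)), with df = n - 2."""
    if n - 2 not in T_CRIT_95:
        raise ValueError("no tabulated critical value for this sample size")
    t = T_CRIT_95[n - 2]
    return t * (n - 1) / (math.sqrt(n) * math.sqrt(n - 2 + t * t))

def filter_outliers(values):
    """Repeatedly sort, recompute mean and std, and reject the most extreme point."""
    data = sorted(values)
    while len(data) > 2:
        mean = statistics.fmean(data)
        s = statistics.stdev(data)
        if s == 0:
            break  # all remaining values are identical
        # Assumption: judge the endpoint of the sorted sequence farthest from the mean.
        candidate = max((data[0], data[-1]), key=lambda x: abs(x - mean))
        delta = abs(candidate - mean) / s
        if delta <= rejection_region(len(data)):
            break  # judgment value not greater than the rejection region: stop
        data.remove(candidate)
    return data

cleaned = filter_outliers([10, 11, 9, 10, 12, 10, 11, 100])
```

A fuller implementation would take the critical value from a statistics library instead of a fixed table, which here limits the sketch to at most 12 values.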

Optionally, after the second judgment result is obtained, the method includes: checking the accuracy of the second judgment result to obtain its accuracy rate; if the accuracy rate is below a preset threshold, adjusting the sample weights of the labeled samples; and redetermining the confidence interval based on the adjusted sample weights.
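The accuracy check can be illustrated with a short sketch. It assumes that the second judgment results are compared against manually audited labels; the threshold of 0.95 and the names `accuracy` and `needs_weight_adjustment` are hypothetical. How the sample weights are adjusted is not specified in the text and is not shown.

```python
def accuracy(predicted, audited):
    """Fraction of second judgment results that agree with the audited labels."""
    if len(predicted) != len(audited) or not predicted:
        raise ValueError("inputs must be non-empty and equally long")
    hits = sum(p == a for p, a in zip(predicted, audited))
    return hits / len(predicted)

def needs_weight_adjustment(predicted, audited, threshold=0.95):
    """True when the accuracy rate is below the preset threshold."""
    return accuracy(predicted, audited) < threshold

# True means "valid data"; values are made up for illustration.
predicted = [True, True, False, True]
audited = [True, False, False, True]
rate = accuracy(predicted, audited)
adjust = needs_weight_adjustment(predicted, audited)
```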

Specifically, normal data (positive example sample data) and abnormal data (negative example sample data) are mixed together in the data. If normal data must not be misjudged as abnormal data, the misjudgment rate has to be kept low. The algorithm proposed in this application cannot by itself fully meet such a strict accuracy requirement, so manual evaluation is introduced to fine-tune the model parameters so that the model operates in its optimal state. This tuning can be guided by the receiver operating characteristic (ROC) curve output by the model: the false positive (FP) rate is kept below 1%, so that the accuracy of the model approaches 100%. Manual evaluation and fine-tuning are then needed only for the data of a small number of categories.

In a real data environment, the confidence interval of the data fluctuates with the environment because of external factors such as the season, so a model update mechanism is needed. The model is updated every seven days, which keeps the estimated confidence interval realistic and timely for the current environment; the online discrimination model is then updated.
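A seven-day update rule of this kind could be checked as in the sketch below. The function name `model_is_stale` and the way the last training time is stored are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(days=7)  # update period stated in the text

def model_is_stale(last_trained, now, interval=UPDATE_INTERVAL):
    """True when the confidence-interval model is due for retraining."""
    return now - last_trained >= interval

due = model_is_stale(datetime(2015, 9, 1), datetime(2015, 9, 8))
not_due = model_is_stale(datetime(2015, 9, 1), datetime(2015, 9, 5))
```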

During model training, the quality of the labeled data and of the manual evaluation is critical to the accuracy and recall of the final discrimination model. To show the advantage of this invention in identifying normal data, the method of this embodiment is compared with the traditional method; the results are shown in Table 1 below.

Table 1

Algorithm	Accuracy	Recall	Manual review volume
Traditional method	79%	7.6%	100%
Method of this embodiment	96.5%	7.5%	1.1%

The data in Table 1 were computed on commodities in multiple Taobao categories. Table 1 shows that this application clearly outperforms the traditional method: with almost no loss of recall, accuracy improves markedly and the manual review volume drops sharply.

In the embodiment shown in Figure 4, data can be detected using the confidence-interval estimation method based on the weight model. The flow includes the following steps:

Step S401: Collect sample data for each category and label it as positive or negative examples. Positive/negative example labeling divides the data to be verified in each category into two classes: data of the category (positive example sample data) and data not of the category (negative example sample data).

Step S402: Filter and preprocess the labeled samples, rejecting abnormal data points.

Specifically, this step reduces data noise, prevents abnormal data from disturbing the learned model, and improves accuracy and generalization. It includes rejecting abnormal data and normalizing the data.

Step S403: Extract the weight parameters of each sample data and build a weight model based on the sample data. The weight parameters (i.e. credibility features) include: a bonus-point parameter (e.g. the bonus-point score and the number of good ratings), a scoring weight (e.g. the ratio of the good-rating score to the credit score), a deduction parameter (e.g. the deduction score, the number of bad ratings, the number of penalty points and the warning score), and a deduction weight (e.g. the ratio of the number of penalized accounts to the number of all accounts).

Step S404: Train on the labeled samples with the weight model to obtain the sample weights, and read the confidence interval determined from the sample weights of the labeled samples. That is, the data output by the weight model is fed into the data confidence-interval model, which outputs the confidence interval of the sample data.

Step S405: Check the accuracy of the judgment result to obtain the accuracy rate; if the accuracy rate is below a preset threshold, adjust the sample weights.

Specifically, when data of all categories is managed, data not belonging to a category (negative example sample data) has to be managed, but normal data (valid data) must not be treated as abnormal data (invalid data). The parameters are therefore fine-tuned through manual evaluation to bring the model to its optimum. An ROC curve can be obtained from the weight model, and the misjudgment rate (false positive, FP) can be kept below 1%, so that the accuracy of the model approaches 100% and less than 1% of the data is left for manual review. This improves the accuracy of the model and reduces the review workload.

Step S406: Redetermine the confidence interval based on the adjusted sample weights.

Specifically, a scheduled model update mechanism can be set up to update the confidence-interval estimation model periodically.

In the scheme of the above embodiment, massive data (the data to be verified) is processed with big-data mining techniques, and a model (the weight model) is learned from the data with machine learning techniques and used to build the data confidence-interval estimation model (the confidence interval). The weight-model-based learning mechanism from the machine learning field is introduced into the confidence-interval estimation model so that sample weights are learned automatically from the data (the labeled samples), and the confidence interval based on the weight model (i.e. the confidence-interval estimation model) is obtained through training. Big-data learning mechanisms have been applied to data warehousing, speech processing and natural language processing in the Internet big-data setting with striking results. Compared with the traditional confidence-interval estimation method, the above embodiment introduces credibility features (the weight parameters of the sample data) and lets the weight-model-based mechanism learn automatically what would otherwise be manual experience in screening high-quality training data. It mines the credibility of the data in depth, reduces the adverse effect of unreliable data (negative example sample data) on the confidence-interval estimation model, and preserves the positive effect of more credible data (positive example sample data), so that the model is not affected by changes in low-credibility sample data. This ensures the accuracy, reliability and stability of the confidence-interval estimation model. In addition, because a learning mechanism is used, the method proposed in this application can provide personalized service and avoids a large amount of repetitive parameter tuning.

In the data detection method proposed in this application, manual data screening may be combined with the traditional method to screen and adjust the sample data. Specifically, the data is screened manually to keep high-quality data, the variance and mean of the data are calculated, and the fluctuation of the data is observed against the set confidence-interval range; if deviations are found, the data set is adjusted repeatedly to ensure its validity.

Optionally, to select higher-quality data, strict data screening conditions are set in the data preprocessing stage and the accuracy is evaluated; the screening conditions are then adjusted and the evaluation is repeated. This process is repeated until the evaluation result falls below the set threshold. Strict data preprocessing is thus combined with the traditional method to improve the quality of the sample data.

The parameters of the data model are initialized using the data detection method provided by this application. When new data is added, its quality is evaluated; if it qualifies, it is added to the data set and the estimated confidence interval is output. The result is then evaluated, and the data is rejected if it does not qualify. This process is repeated until the evaluation result falls below the set threshold. The traditional method is then used to estimate the confidence interval of the data.

It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of actions, but those skilled in the art will understand that this application is not limited by the order of actions described, since according to this application some steps may be performed in another order or simultaneously. Those skilled in the art will also understand that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily required by this application.

From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is the better implementation. On this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the embodiments of this application.

Embodiment 2

According to an embodiment of this application, an apparatus for implementing the above data detection method is also provided. As shown in Figure 5, the apparatus includes a reading module 30, a first judgment module 40 and a second judgment module 50.

The reading module 30 is configured to read the confidence interval determined based on the sample weights of the labeled samples, where the sample weights are obtained by training on previously obtained labeled samples with the weight model.

The first judgment module 40 is configured to judge whether the data to be verified is within the confidence interval, to obtain a first judgment result.

The second judgment module 50 is configured to judge, according to the first judgment result, whether the data to be verified is valid data, to obtain a second judgment result.

The second judgment module 50 may be configured to judge the data to be verified to be valid data if it is within the confidence interval, and invalid data if it is not.

Optionally, the apparatus may further include a recording module configured to record, after the second judgment module 50 obtains the second judgment result, the second judgment result of whether each data to be verified is valid data or invalid data.
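The two judgments and the recording step amount to a range check followed by storing the outcome, as in the sketch below. Treating the interval endpoints as inclusive is an assumption, and the interval and prices are made-up values; `is_valid` is a hypothetical name.

```python
def is_valid(value, interval):
    """First judgment: is the value inside the interval? Second: valid if so."""
    low, high = interval
    return low <= value <= high  # assumption: endpoints count as inside

# Hypothetical confidence interval for one commodity category.
region = (95.0, 105.0)

# Record the second judgment result for each data item to be verified.
records = {price: is_valid(price, region) for price in (99.0, 250.0)}
```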

In the apparatus disclosed in the above embodiment, the reading module reads a confidence interval determined from sample weights obtained by training. When the confidence interval is determined, the sample weights distinguish the labeled samples by importance: the positive influence of more credible data in the labeled samples on the confidence interval is increased, and the disturbance from unreliable data is reduced, so that the confidence interval approaches the true confidence interval of the data. The first and second judgment modules then use the confidence interval to classify the various data to be verified automatically, ensuring accurate, stable and reliable estimates. This application thereby solves the problem of inaccurate results when verifying the validity of data and achieves accurate verification of data validity.

Specifically, after the terminal hosting the reading module collects virtual resource information, the data to be verified is extracted from that information, and the reading module reads the confidence interval corresponding to the object that the virtual resource information describes. The first judgment module judges whether the data to be verified is within the confidence interval. If the second judgment module finds the data to be verified within the confidence interval, it judges the data to be valid data, i.e. the virtual resource information is legitimate; if not, it judges the data to be invalid data, i.e. the virtual resource information is illegitimate. Once it has been determined whether each data to be verified is valid (and/or whether the virtual resource information is legitimate), the recording module records the judgment result and saves it in the memory.

Optionally, as shown in Figure 6, the apparatus may include a sample acquisition module 21, a first extraction module 22, a training module 23, a second extraction module 24 and a first determination module 25.

The sample acquisition module 21 is configured to obtain a plurality of labeled samples before the confidence interval determined based on the sample weights of the labeled samples is read, where each labeled sample has a sample value.

The first extraction module 22 is configured to extract the attribute data of each labeled sample and build the weight model based on the attribute data of each labeled sample.

The training module 23 is configured to train on the labeled samples with the weight model to obtain the sample weight of each labeled sample.

The second extraction module 24 is configured to extract the sample value of each labeled sample, where the sample value characterizes the virtual resource parameter of the object described by the labeled sample.

The first determination module 25 is configured to determine the confidence interval of the plurality of labeled samples based on the sample value and the sample weight of each labeled sample.

In the apparatus of the above embodiment, before the reading module reads the confidence interval determined based on the sample weights of the labeled samples, the first extraction module can build the weight model from the attribute data of each labeled sample obtained. Because the attribute data of a labeled sample (e.g. the quality of the sample itself) can serve as a measure of its credibility within the data distribution, building the weight model on it exploits this property of the labeled samples directly and so improves the accuracy and reliability of the weight model. Once the first extraction module has built an accurate weight model, the training module trains on the labeled samples with it to obtain accurate sample weights, and an accurate confidence interval is determined from these sample weights and the sample values extracted by the second extraction module, which improves the accuracy and reliability of the confidence interval of the labeled samples. The confidence interval is then read, and the first and second judgment modules judge whether each data to be verified is within it: if so, the data is judged to be valid data; if not, invalid data. The recording module records the judgment result for each data to be verified. The above embodiment thus improves the accuracy and reliability of the weight model.

Optionally, as shown in Figure 7, the first extraction module 22 includes a first extraction submodule 221 and a model acquisition module 223.

The first extraction submodule 221 is configured to extract the weight parameters of each labeled sample. The weight parameters include a bonus-point parameter, a scoring weight, a deduction parameter and a deduction weight: the bonus-point parameter describes the good-rating score obtained by the account of the labeled sample; the scoring weight describes the ratio of the account's good-rating score to its credit score; the deduction parameter describes the deduction score the account has received; and the deduction weight describes the ratio of the number of penalized accounts on the account's terminal to the number of all accounts on that terminal.

The model acquisition module 223 is configured to obtain the weight model based on the bonus-point parameter, the scoring weight, the deduction parameter and the deduction weight. The weight model is f_w(g, p) = w₁ * (g − a * w₂ * p), where g is the bonus-point parameter; p is the deduction parameter; w₁ is the scoring weight; w₂ is the deduction weight; and β₁ and β₂ are the learning parameters of the weight model.

Optionally, the training module includes a second determination module, a third determination module and a processing module.

The second determination module is configured to train on the labeled samples with the weight model f_w(g, p) to determine the learning parameters β₁ and β₂ of the weight model; the third determination module is configured to calculate the weight of each labeled sample with the weight model f_w(g, p) whose learning parameters have been determined; and the processing module is configured to normalize the weight of each labeled sample to obtain the sample weights of the labeled samples.
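The last two of these steps, computing a weight per sample and normalizing it, can be sketched as follows. This is an illustration under several assumptions: w₁, w₂ and a are passed in as already-determined numbers (the training that fixes β₁ and β₂ is not shown), negative raw weights are clipped to zero, and normalization divides by the sum so that the sample weights add up to 1. None of these choices is stated in the text, and the function names and sample values are hypothetical.

```python
def raw_weight(g, p, w1, w2, a=1.0):
    """Weight model f_w(g, p) = w1 * (g - a * w2 * p)."""
    return w1 * (g - a * w2 * p)

def normalized_weights(samples, w1, w2, a=1.0):
    """Compute f_w for each (g, p) pair and rescale so the weights sum to 1."""
    # Assumption: a sample whose raw weight is negative gets weight 0.
    raw = [max(raw_weight(g, p, w1, w2, a), 0.0) for g, p in samples]
    total = sum(raw)
    if total == 0:
        raise ValueError("all sample weights are zero")
    return [r / total for r in raw]

# (bonus-point parameter g, deduction parameter p) per labeled sample.
samples = [(10.0, 0.0), (6.0, 4.0), (1.0, 8.0)]
sample_weights = normalized_weights(samples, w1=0.5, w2=0.5)
```

With these made-up inputs the heavily penalized third sample ends up with zero weight, so it would not influence a confidence interval computed from the weighted samples.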

Optionally, the first determination module includes an establishing module and an interval acquisition module.

The establishing module is configured to establish an interval determination model for the labeled samples using a Gaussian distribution.

The interval acquisition module is configured to obtain the confidence interval corresponding to the interval determination model.

Specifically, the establishing module may be configured to establish the interval determination model f(x) for the labeled samples using a Gaussian distribution, $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, where μ is the mean of f(x), $\mu = \sum_{i=1}^{n} w_i x_i$; x is the sample value of a labeled sample; w_i is the (normalized) sample weight of the i-th labeled sample; x_i is the sample value of the i-th labeled sample; n is the total number of labeled samples; i is a natural number; and σ is the standard deviation of f(x), $\sigma = \sqrt{\sum_{i=1}^{n} w_i (x_i - \mu)^2}$.

The interval acquisition module may be configured to obtain the confidence interval region corresponding to the interval determination model f(x), region = [μ − k·σ, μ + k·σ], where k is a constant.

The above embodiment uses a weight-based model that matches the distribution of the labeled samples, so that a more accurate confidence interval can be obtained from it.

Optionally, the sample acquisition module, which obtains the plurality of labeled samples, includes a data acquisition module and a preprocessing module.

The data acquisition module is configured to obtain sample data and, if the sample data has no positive/negative example labels, to label the sample data as positive or negative examples to obtain labeled sample data. The preprocessing module is configured to perform data filtering and normalization on the labeled sample data to obtain the labeled samples.

In the above embodiment, sample data is obtained and positive/negative example labels are added only to data that lacks them, which avoids relabeling data that is already labeled and so speeds up labeling and improves data processing efficiency. After the data acquisition module obtains the labeled sample data, the collected sample data still contains a good deal of noise (abnormal data), which can prevent the parameters from converging during training and thereby harm the accuracy and robustness of the weight model. The preprocessing module therefore cleans the data according to its own distribution: it filters the labeled sample data to reject abnormal data points and reduce noise, so that abnormal data does not disturb the weight model, the sample data satisfies the conditions for convergence and conforms to its own distribution, and accuracy and generalization improve. The preprocessing module then normalizes the filtered sample data to obtain the labeled samples, so that the weight model converges faster and achieves higher accuracy.

Optionally, the preprocessing module includes a first processing module, a judgment value acquisition module and a second processing module.

The first processing module is configured to obtain a sequence by sorting the labeled sample data by sample value, and to calculate the mean and standard deviation of the labeled sample data in the sequence.

The judgment value acquisition module is configured to obtain, in order, the judgment value δ of the labeled sample data in the sequence, where the judgment value is determined from the standard deviation of the labeled sample data, the sample value of the labeled sample data, and the mean of the sample values of the plurality of labeled sample data.

The second processing module is configured to, if the judgment value is greater than a rejection region obtained in advance, reject the labeled sample data corresponding to the judgment value and return to the step of obtaining the sorted sequence and calculating the mean and standard deviation, until the judgment value is no longer greater than the rejection region.

Specifically, the first processing module in the above embodiment may be configured to obtain a sequence by sorting the labeled sample data by sample value, and to calculate the mean and standard deviation of the labeled sample data in the sequence. The judgment value acquisition module is configured to obtain, in order, the judgment value δ of the labeled sample data in the sequence, $\delta = \frac{|x - \mathrm{mean}(X)|}{s}$, where s is the standard deviation of the plurality of labeled sample data, x is the sample value of the labeled sample data, and mean(X) is the mean of the sample values of the plurality of labeled sample data. The second processing module is configured to, if the judgment value is greater than a rejection region obtained in advance, reject the labeled sample data corresponding to the judgment value and return to the step of obtaining the sorted sequence and calculating the mean and standard deviation, until the judgment value is no longer greater than the rejection region.

Optionally, the apparatus includes a verification acquisition module, an adjustment module and a redetermination module.

The verification acquisition module is configured to check the accuracy of the second judgment result after it is obtained, to obtain the accuracy rate of the second judgment result.

The adjustment module is configured to adjust the sample weights of the labeled samples if the accuracy rate is below a preset threshold.

The redetermination module is configured to redetermine the confidence interval based on the adjusted sample weights.

Specifically, the verification acquisition module checks the accuracy of the judgment result to obtain the accuracy rate of the judgment result.

The apparatus of the above embodiment processes massive data (the data to be verified) with big-data mining techniques and learns a model (the weight model) from the data with machine learning techniques, which is used to build the data confidence-interval estimation model (the confidence interval). The weight-model-based learning mechanism from the machine learning field is introduced into the confidence-interval estimation model so that sample weights are learned automatically from the data (the labeled samples), and the confidence interval based on the weight model (i.e. the confidence-interval estimation model) is obtained through training. Big-data learning mechanisms have been applied to data warehousing, speech processing and natural language processing in the Internet big-data setting with striking results. Compared with the traditional confidence-interval estimation method, the above embodiment introduces credibility features (the weight parameters of the sample data) and lets the weight-model-based mechanism learn automatically what would otherwise be manual experience in screening high-quality training data. It mines the credibility of the data in depth, reduces the adverse effect of unreliable data (negative example sample data) on the confidence-interval estimation model, and preserves the positive effect of more credible data (positive example sample data), so that the model is not affected by changes in low-credibility sample data. This ensures the accuracy, reliability and stability of the confidence-interval estimation model. In addition, because a learning mechanism is used, the method proposed in this application can provide personalized service and avoids a large amount of repetitive parameter tuning.

Embodiment 3

An embodiment of the present application may provide a computer terminal, which may be any computer terminal in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.

Optionally, in this embodiment, as shown in FIG. 8, the computer terminal may be located in at least one network device 101 among multiple network devices of a computer network, and the network device 101 may be connected to other network devices 103 through the network.

Optionally, the computer terminal A in the embodiment shown in FIG. 8 may include one or more processors (only one is shown in the figure), a memory, and a transmission device.

The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and device for detecting data in the embodiments of the present application. The processor runs the software programs and modules stored in the memory to execute various functional applications and data processing, that is, to implement the above method of detecting data. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to terminal A through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The processor may call, through the transmission device, the information and application programs stored in the memory to perform the following steps: reading a confidence interval determined based on the sample weights of labeled samples, where the sample weights are obtained by training the labeled samples, obtained in advance, with a weight model; judging whether the data to be verified falls within the confidence interval, to obtain a first judgment result; and judging, according to the first judgment result, whether the data to be verified is valid data, to obtain a second judgment result.
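The three steps above can be illustrated with a minimal Python sketch. It assumes that the confidence interval has already been computed and stored as a closed interval, and that data falling inside the interval is treated as valid; the names ConfidenceInterval and check_data are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceInterval:
    """Hypothetical container for a precomputed interval [lower, upper]."""
    lower: float
    upper: float

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper


def check_data(value: float, interval: ConfidenceInterval) -> bool:
    """Return True if the data to be verified is judged valid."""
    # First judgment result: is the value inside the confidence interval?
    in_interval = interval.contains(value)
    # Second judgment result: assumed here to follow directly from the first.
    is_valid = in_interval
    return is_valid
```

For example, with an interval of [1000, 6000] for a class of mobile phone prices, a published price of 3000 would be judged valid and a price of 5 would not.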

Optionally, the processor may further perform the following steps: before reading the confidence interval determined based on the sample weights of the labeled samples, obtaining multiple labeled samples, where each labeled sample has a sample value; extracting the attribute data of each labeled sample and establishing the weight model based on the attribute data of each labeled sample; training the labeled samples with the weight model to obtain the sample weight of each labeled sample; extracting the sample value from each labeled sample, where the sample value characterizes the virtual resource parameter of the object described by the labeled sample; and determining the confidence interval of the multiple labeled samples based on the sample value and the sample weight of each labeled sample.

Optionally, the processor may further perform the following steps: extracting the weight parameters of each labeled sample, where the weight parameters include a bonus parameter, a scoring weight, a deduction parameter, and a deduction weight. The bonus parameter describes the positive-rating score obtained by the account of the labeled sample; the scoring weight describes the ratio of the account's positive-rating score to the account's credit; the deduction parameter describes the deduction score received by the account; and the deduction weight describes the ratio of the number of penalized accounts on the terminal where the account is located to the number of all accounts on that terminal. The weight model is then obtained based on the bonus parameter, the scoring weight, the deduction parameter, and the deduction weight.
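As a rough illustration, the weight model given in the claims, f_w(g, p) = w₁ * {g - a * w₂ * p}, can be evaluated per sample and then normalized. The sketch below is an assumption-laden example, not the application's implementation: the coefficient a is not defined in this section, so it is exposed as a plain parameter with an arbitrary default, and the normalization rule (clip negative weights to zero, then divide by the sum) is one possible choice that the text does not specify.

```python
from typing import List, Sequence


def weight_model(g: float, p: float, w1: float, w2: float, a: float = 1.0) -> float:
    """Raw weight f_w(g, p) = w1 * (g - a * w2 * p).

    g: bonus parameter, p: deduction parameter,
    w1: scoring weight, w2: deduction weight,
    a: coefficient whose definition is not given here (assumed value).
    """
    return w1 * (g - a * w2 * p)


def normalize_weights(raw: Sequence[float]) -> List[float]:
    """Assumed normalization: clip negatives to 0 and scale to sum to 1."""
    if not raw:
        return []
    clipped = [max(r, 0.0) for r in raw]
    total = sum(clipped)
    if total == 0:
        # No sample has positive weight; fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [c / total for c in clipped]
```

Under these assumptions, an account with many positive ratings and few deductions receives a larger normalized weight, so its samples contribute more to the confidence interval.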

Optionally, the processor may further perform the following steps: establishing an interval determination model for the labeled samples using a Gaussian distribution; and obtaining the confidence interval corresponding to the interval determination model.
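One way to picture a Gaussian interval determination model that takes sample weights into account is to fit a weighted mean and weighted standard deviation and take a symmetric band around the mean. The following sketch is only an assumed formulation: the function name weighted_interval and the multiplier k are hypothetical, and the application may define the interval differently.

```python
import math
from typing import Sequence, Tuple


def weighted_interval(values: Sequence[float],
                      weights: Sequence[float],
                      k: float = 2.0) -> Tuple[float, float]:
    """Return (lower, upper) = weighted mean -/+ k * weighted std.

    k is an assumed multiplier; it is not specified in the source text.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    std = math.sqrt(variance)
    return mean - k * std, mean + k * std
```

With this formulation, a sample whose weight is zero has no influence on the interval, which matches the stated goal of reducing the effect of low-credibility data.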

Optionally, the processor may further perform the following steps: obtaining sample data and, if the sample data carries no positive/negative-example labels, labeling the sample data as positive or negative examples to obtain labeled sample data; and performing data filtering and normalization on the labeled sample data to obtain the labeled samples.

Optionally, the processor may further perform the following steps: obtaining a sequence by sorting the labeled sample data by sample value, and calculating the mean and standard deviation of the labeled sample data in the sequence; sequentially obtaining a judgment value for each item of labeled sample data in the sequence, where the judgment value is determined from the standard deviation of the labeled sample data, the sample value of the labeled sample data, and the mean of the sample values of the multiple items of labeled sample data; and, if the judgment value is greater than a rejection region obtained in advance, removing the labeled sample data corresponding to that judgment value and returning to the step of obtaining the sorted sequence and calculating the mean and standard deviation, until the judgment value is no longer greater than the rejection region.
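The iterative filtering described above can be sketched as follows. The exact formula for the judgment value is not given in this passage, so the sketch assumes δ = |x - mean| / standard deviation, in the style of a Grubbs-type outlier test; the function name filter_outliers and the threshold argument are hypothetical.

```python
import statistics
from typing import List, Sequence


def filter_outliers(values: Sequence[float], rejection_threshold: float) -> List[float]:
    """Iteratively remove samples whose judgment value exceeds the threshold.

    Assumed judgment value: delta = |x - mean| / sample standard deviation.
    """
    kept = sorted(values)
    while len(kept) >= 3:
        mean = statistics.fmean(kept)
        std = statistics.stdev(kept)
        if std == 0:
            break  # all remaining values are identical
        for i, x in enumerate(kept):
            delta = abs(x - mean) / std
            if delta > rejection_threshold:
                # Reject this sample, then recompute mean and std from scratch.
                del kept[i]
                break
        else:
            # No judgment value exceeded the threshold; filtering is done.
            break
    return kept
```

Recomputing the mean and standard deviation after every rejection matters because a single extreme value inflates both statistics and can mask other outliers.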

Optionally, the processor may further perform the following steps: after the second judgment result is obtained, performing accuracy verification on the second judgment result to obtain the accuracy rate of the second judgment result; if the accuracy rate is below a preset threshold, adjusting the sample weights of the labeled samples; and redetermining the confidence interval based on the adjusted sample weights.
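The feedback loop in these steps can be outlined as below. This is only a control-flow sketch under stated assumptions: the text does not say how the weights are adjusted or how accuracy is measured, so build_interval and adjust_weights are hypothetical callbacks supplied by the caller, accuracy is taken as the fraction of correct validity decisions on a checked set, and max_rounds is an added safeguard against endless iteration.

```python
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of validity decisions that match the known labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def refine_interval(sample_values: Sequence[float],
                    sample_weights: Sequence[float],
                    build_interval: Callable[[Sequence[float], Sequence[float]], Interval],
                    adjust_weights: Callable[[List[float]], List[float]],
                    check_values: Sequence[float],
                    check_labels: Sequence[bool],
                    min_accuracy: float,
                    max_rounds: int = 10) -> Tuple[Interval, List[float]]:
    """Rebuild the interval with adjusted weights until accuracy is acceptable."""
    weights = list(sample_weights)
    interval = build_interval(sample_values, weights)
    for _ in range(max_rounds):
        lower, upper = interval
        predictions = [lower <= v <= upper for v in check_values]
        if accuracy(predictions, check_labels) >= min_accuracy:
            break
        weights = adjust_weights(weights)                    # hypothetical rule
        interval = build_interval(sample_values, weights)    # redetermine
    return interval, weights
```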

The embodiments of the present application provide a scheme for detecting data in which sample weights are obtained through training and the confidence interval is determined based on those sample weights. In this scheme, the sample weights distinguish the importance of the labeled samples when the confidence interval is determined. That is, the positive effect of higher-credibility data among the labeled samples on the confidence interval is increased and the interference of low-credibility data among the labeled samples is reduced, so that the confidence interval approaches the true confidence interval of the data. Using this confidence interval to classify all kinds of data to be verified automatically ensures the accuracy, stability, and reliability of the estimate. The present application thus solves the problem of low accuracy of check results when verifying the legitimacy of data and achieves the effect of verifying data legitimacy accurately.

A person of ordinary skill in the art will understand that the structure shown in FIG. 8 is only illustrative. The computer terminal shown in FIG. 8 may also be a terminal device such as a smartphone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD.

A person of ordinary skill in the art will understand that all or part of the steps in the methods of the above embodiments may be completed by a program instructing hardware related to the terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.

Embodiment 4

An embodiment of the present application further provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code executed by the method of detecting data provided in Embodiment 1.

Optionally, in this embodiment, the storage medium may be located in any computer terminal in a computer terminal group in a computer network, or in any mobile terminal in a mobile terminal group.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: reading a confidence interval determined based on the sample weights of labeled samples, where the sample weights are obtained by training the labeled samples, obtained in advance, with a weight model; judging whether the data to be verified falls within the confidence interval, to obtain a first judgment result; and judging, according to the first judgment result, whether the data to be verified is valid data, to obtain a second judgment result.

Optionally, in this embodiment, the storage medium is configured to store program code that is further used to perform the following steps: before reading the confidence interval determined based on the sample weights of the labeled samples, obtaining multiple labeled samples, where each labeled sample has a sample value; extracting the attribute data of each labeled sample and establishing the weight model based on the attribute data of each labeled sample; training the labeled samples with the weight model to obtain the sample weight of each labeled sample; extracting the sample value from each labeled sample, where the sample value characterizes the virtual resource parameter of the object described by the labeled sample; and determining the confidence interval of the multiple labeled samples based on the sample value and the sample weight of each labeled sample.

Optionally, in this embodiment, the storage medium is configured to store program code that is further used to perform the following steps: establishing an interval determination model for the labeled samples using a Gaussian distribution; and obtaining the confidence interval corresponding to the interval determination model.

Optionally, in this embodiment, the storage medium is configured to store program code that is further used to perform the following steps: obtaining sample data and, if the sample data carries no positive/negative-example labels, labeling the sample data as positive or negative examples to obtain labeled sample data; and performing data filtering and normalization on the labeled sample data to obtain the labeled samples.

Optionally, in this embodiment, the storage medium is configured to store program code that is further used to perform the following steps: obtaining a sequence by sorting the labeled sample data by sample value, and calculating the mean and standard deviation of the labeled sample data in the sequence; sequentially obtaining a judgment value for each item of labeled sample data in the sequence, where the judgment value is determined from the standard deviation of the labeled sample data, the sample value of the labeled sample data, and the mean of the sample values of the multiple items of labeled sample data; and, if the judgment value is greater than a rejection region obtained in advance, removing the labeled sample data corresponding to that judgment value and returning to the step of obtaining the sorted sequence and calculating the mean and standard deviation, until the judgment value is no longer greater than the rejection region.

Optionally, in this embodiment, the storage medium is configured to store program code that is further used to perform the following steps: after the second judgment result is obtained, performing accuracy verification on the second judgment result to obtain the accuracy rate of the second judgment result; if the accuracy rate is below a preset threshold, adjusting the sample weights of the labeled samples; and redetermining the confidence interval based on the adjusted sample weights.

The serial numbers of the above embodiments of the present application are for description only and do not represent the merits of the embodiments.

In the above embodiments of the present application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

It should be understood that the technical content disclosed in the several embodiments provided in the present application may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred implementations of the present application. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the protection scope of the present application.

Claims

1. A method of detecting data, characterized by comprising:

reading a confidence interval determined based on the sample weights of labeled samples, wherein the sample weights are obtained by training the labeled samples, obtained in advance, with a weight model;

judging whether data to be verified falls within the confidence interval, to obtain a first judgment result;

judging, according to the first judgment result, whether the data to be verified is valid data, to obtain a second judgment result.

2. The method according to claim 1, characterized in that, before reading the confidence interval determined based on the sample weights of the labeled samples, the method comprises:

obtaining multiple labeled samples, wherein each labeled sample has a sample value;

extracting the attribute data of each labeled sample, and establishing the weight model based on the attribute data of each labeled sample;

training the labeled samples with the weight model to obtain the sample weight of each labeled sample;

extracting the sample value from each labeled sample, wherein the sample value characterizes the virtual resource parameter of the object described by the labeled sample;

determining the confidence interval of the multiple labeled samples based on the sample value of each labeled sample and the sample weight of each labeled sample.

3. The method according to claim 2, characterized in that extracting the attribute data of each labeled sample and establishing the weight model based on the attribute data of each labeled sample comprises:

extracting the weight parameters of each labeled sample, wherein the weight parameters comprise a bonus parameter, a scoring weight, a deduction parameter, and a deduction weight; the bonus parameter describes the positive-rating score obtained by the account of the labeled sample; the scoring weight describes the ratio of the positive-rating score of the account to the credit of the account; the deduction parameter describes the deduction score received by the account; and the deduction weight describes the ratio of the number of penalized accounts on the terminal where the account is located to the number of all accounts on the terminal;

obtaining the weight model based on the bonus parameter, the scoring weight, the deduction parameter, and the deduction weight, wherein the weight model is:

f_w(g, p) = w₁ * {g - a * w₂ * p}, wherein

g is the bonus parameter; p is the deduction parameter; w₁ is the scoring weight; w₂ is the deduction weight; and β₁ and β₂ are the learning parameters of the weight model.

4. The method according to claim 2 or 3, characterized in that training the labeled samples with the weight model to obtain the sample weight of each labeled sample comprises:

training the labeled samples with the weight model f_w(g, p) to determine the learning parameter β₁ and the learning parameter β₂ of the weight model;

calculating the weight of each labeled sample with the weight model f_w(g, p) for which the learning parameters have been determined;

normalizing the weight of each labeled sample to obtain the sample weights of the labeled samples.

5. The method according to claim 2, characterized in that determining the confidence interval of the multiple labeled samples based on the sample value of each labeled sample and the sample weight of each labeled sample comprises:

establishing an interval determination model for the labeled samples using a Gaussian distribution;

obtaining the confidence interval corresponding to the interval determination model.

6. The method according to claim 2, characterized in that obtaining multiple labeled samples comprises:

obtaining sample data and, if the sample data carries no positive/negative-example labels, labeling the sample data as positive or negative examples to obtain labeled sample data;

performing data filtering and normalization on the labeled sample data to obtain the labeled samples.

7. The method according to claim 6, characterized in that performing data filtering on the labeled sample data comprises:

obtaining a sequence by sorting the labeled sample data by sample value, and calculating the mean and standard deviation of the labeled sample data in the sequence;

sequentially obtaining a judgment value δ for each item of labeled sample data in the sequence, wherein the judgment value is determined from the standard deviation of the labeled sample data, the sample value of the labeled sample data, and the mean of the sample values of the multiple items of labeled sample data;

if the judgment value is greater than a rejection region obtained in advance, removing the labeled sample data corresponding to the judgment value, and returning to the step of obtaining the sequence by sorting the labeled sample data by sample value and calculating the mean and standard deviation of the labeled sample data in the sequence, until the judgment value is not greater than the rejection region.

8. The method according to claim 1, characterized in that, after the second judgment result is obtained, the method comprises:

performing accuracy verification on the second judgment result to obtain the accuracy rate of the second judgment result;

if the accuracy rate is below a preset threshold, adjusting the sample weights of the labeled samples;

redetermining the confidence interval based on the adjusted sample weights.

9. A device for detecting data, characterized by comprising:

a reading module, configured to read a confidence interval determined based on the sample weights of labeled samples, wherein the sample weights are obtained by training the labeled samples, obtained in advance, with a weight model;

a first judging module, configured to judge whether data to be verified falls within the confidence interval, to obtain a first judgment result;

a second judging module, configured to judge, according to the first judgment result, whether the data to be verified is valid data, to obtain a second judgment result.

10. The device according to claim 9, characterized in that the device comprises:

a sample acquisition module, configured to obtain multiple labeled samples before the confidence interval determined based on the sample weights of the labeled samples is read, wherein each labeled sample has a sample value;

a first extraction module, configured to extract the attribute data of each labeled sample and establish the weight model based on the attribute data of each labeled sample;

a training module, configured to train the labeled samples with the weight model to obtain the sample weight of each labeled sample;

a second extraction module, configured to extract the sample value from each labeled sample, wherein the sample value characterizes the virtual resource parameter of the object described by the labeled sample;

a first determining module, configured to determine the confidence interval of the multiple labeled samples based on the sample value of each labeled sample and the sample weight of each labeled sample.

11. The device according to claim 10, characterized in that the first extraction module comprises:

a first extraction submodule, configured to extract the weight parameters of each labeled sample, wherein the weight parameters comprise a bonus parameter, a scoring weight, a deduction parameter, and a deduction weight; the bonus parameter describes the positive-rating score obtained by the account of the labeled sample; the scoring weight describes the ratio of the positive-rating score of the account to the credit of the account; the deduction parameter describes the deduction score received by the account; and the deduction weight describes the ratio of the number of penalized accounts on the terminal where the account is located to the number of all accounts on the terminal;

a model acquisition module, configured to obtain the weight model based on the bonus parameter, the scoring weight, the deduction parameter, and the deduction weight, wherein the weight model is:

f_w(g, p) = w₁ * {g - a * w₂ * p}.

12. The device according to claim 10 or 11, characterized in that the training module comprises:

a second determining module, configured to train the labeled samples with the weight model f_w(g, p) to determine the learning parameter β₁ and the learning parameter β₂ of the weight model;

a third determining module, configured to calculate the weight of each labeled sample with the weight model f_w(g, p) for which the learning parameters have been determined;

a processing module, configured to normalize the weight of each labeled sample to obtain the sample weights of the labeled samples.

13. The device according to claim 10, characterized in that the first determining module comprises:

an establishing module, configured to establish an interval determination model for the labeled samples using a Gaussian distribution;

an interval acquisition module, configured to obtain the confidence interval corresponding to the interval determination model.

14. The device according to claim 10, characterized in that the sample acquisition module comprises:

a data acquisition module, configured to obtain sample data and, if the sample data carries no positive/negative-example labels, label the sample data as positive or negative examples to obtain labeled sample data;

a preprocessing module, configured to perform data filtering and normalization on the labeled sample data to obtain the labeled samples.

15. The device according to claim 14, characterized in that the preprocessing module comprises:

a first processing module, configured to obtain a sequence by sorting the labeled sample data by sample value, and calculate the mean and standard deviation of the labeled sample data in the sequence;

a judgment value acquisition module, configured to sequentially obtain a judgment value δ for each item of labeled sample data in the sequence, wherein the judgment value is determined from the standard deviation of the labeled sample data, the sample value of the labeled sample data, and the mean of the sample values of the multiple items of labeled sample data;

a second processing module, configured to, if the judgment value is greater than a rejection region obtained in advance, remove the labeled sample data corresponding to the judgment value, and return to the step of obtaining the sequence by sorting the labeled sample data by sample value and calculating the mean and standard deviation of the labeled sample data in the sequence, until the judgment value is not greater than the rejection region.

16. The device according to claim 9, characterized in that the device further comprises:

a verification acquisition module, configured to perform accuracy verification on the second judgment result after the second judgment result is obtained, to obtain the accuracy rate of the second judgment result;

an adjusting module, configured to adjust the sample weights of the labeled samples if the accuracy rate is below a preset threshold;

a redetermining module, configured to redetermine the confidence interval based on the adjusted sample weights.