
CN109740750A - Data collection method and device - Google Patents

Data collection method and device

Info

Publication number
CN109740750A
Authority
CN
China
Prior art keywords
sample
data
collection
probability
sample data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811542893.7A
Other languages
Chinese (zh)
Other versions
CN109740750B (en)
Inventor
李超然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing Deephi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deephi Intelligent Technology Co Ltd filed Critical Beijing Deephi Intelligent Technology Co Ltd
Priority to CN201811542893.7A priority Critical patent/CN109740750B/en
Publication of CN109740750A publication Critical patent/CN109740750A/en
Application granted Critical
Publication of CN109740750B publication Critical patent/CN109740750B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data collection method and device. The method comprises: receiving sample data to be collected; obtaining the current proportion, in a sample collection data set, of sample data belonging to the category of the sample data to be collected, the sample collection data set being a fixed-size data set; determining a collection probability for sample data of that category according to the current proportion and a target proportion for sample data of that category; and adding the sample data to be collected to the sample collection data set according to the collection probability, for use in training a neural network model. With this scheme, a sample data set that meets the category-distribution requirements of machine learning can be obtained even while new samples are continuously being generated.

Description

Data collection method and device
Technical field
The present invention relates to the field of deep learning technology, and in particular to a data collection method and device.
Background technique
In deep learning, a typical neural network needs to be trained with a large amount of sample data. If the category distribution of the sample data in the data set is unbalanced, training of the neural network model will fail. For classification problems, unbalanced sample data means that the numbers of samples of the different categories in the data set differ greatly. For example, in a binary classification problem with 100 samples in total (100 rows of data, each row characterizing one sample), if 80 samples belong to category 1 and the remaining 20 belong to category 2, the ratio category 1 : category 2 = 80 : 20 = 4 : 1, which is a case of category imbalance. In reinforcement learning, the interaction between an AI (artificial intelligence) agent and its environment generates a large amount of sample data; if this sample data is classified, different categories of samples are generated with different probabilities.
An unbalanced category distribution in a sample data set is a common problem in machine learning. For a fixed sample data set, common solutions are to undersample the categories with many samples or to oversample the categories with few samples, thereby obtaining a category-balanced data set through resampling. Another approach is to use the available samples to artificially generate new samples. Yet another class of methods does not start from the data set at all, but instead improves model training by penalizing the classifier algorithm. All of these methods address model training on a fixed data set.
When the number of samples is very large, or the number of samples is unknown and new samples are continuously being generated, resampling is difficult to carry out. Therefore, for the situation in reinforcement learning where sample data is generated continuously, the category-imbalance problem still has no good solution.
Summary of the invention
In view of this, the present invention provides a data collection method and device, so as to obtain a sample data set that meets the category-distribution requirements of machine learning while new samples are continuously being generated.
To achieve the above object, the present invention adopts the following scheme:
In an embodiment of the present invention, a data collection method comprises:
receiving sample data to be collected;
obtaining the current proportion, in a sample collection data set, of sample data belonging to the category of the sample data to be collected, the sample collection data set being a fixed-size data set;
determining, according to the current proportion and a target proportion for sample data of that category, a collection probability for sample data of the category of the sample data to be collected; and
adding the sample data to be collected to the sample collection data set according to the collection probability, for use in training a neural network model.
In an embodiment of the present invention, a computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of the above embodiments.
In an embodiment of the present invention, a computer-readable storage medium has a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method of the above embodiments.
With the data collection method, computer device and computer-readable storage medium of the present invention, sample data is collected using a fixed-size data set, so the current proportion of each category of sample data can be known. A reasonable collection probability can be determined from the current proportion and the desired proportion of a given category of sample data. Whether a new sample is added to the data set is then decided according to the collection probability, so that the sample data in the data set increasingly matches the category distribution required for neural network model training. A sample data set that meets the category-distribution requirements of neural network model training can therefore be collected even while new sample data is continuously being generated.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort. In the drawings:
Fig. 1 is a schematic flowchart of a data collection method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for determining a collection probability in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a method for determining a collection probability in another embodiment of the present invention;
Fig. 4 is a schematic flowchart of a method for adding sample data to be collected to a sample collection data set according to a collection probability in an embodiment of the present invention;
Fig. 5 is a schematic flowchart of a data collection method according to another embodiment of the present invention;
Fig. 6 is a schematic flowchart of a data collection method according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a data collection device according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a collection probability determination module in an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a collection probability determination module in another embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a data collection unit in an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a data collection device according to another embodiment of the present invention.
Specific embodiment
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention, but are not intended to limit it.
Fig. 1 is a schematic flowchart of the data collection method of an embodiment of the present invention. As shown in Fig. 1, the data collection method of some embodiments may include:
Step S110: receiving sample data to be collected;
Step S120: obtaining the current proportion, in a sample collection data set, of sample data belonging to the category of the sample data to be collected, the sample collection data set being a fixed-size data set;
Step S130: determining, according to the current proportion and a target proportion for sample data of that category, a collection probability for sample data of the category of the sample data to be collected;
Step S140: adding the sample data to be collected to the sample collection data set according to the collection probability, for use in training a neural network model.
In step S110, the sample data to be collected may be sample data in a data stream; data can be extracted from the stream in real time without knowing the total number of samples. It can be continuously generated by a signal source, for example the new sample data constantly produced in reinforcement learning. Each sample may include metadata (the data itself) and a category label, and different samples may belong to different categories. The received sample data to be collected can be temporarily stored in a data set, to be read out for subsequent processing.
In step S120, the data in the sample collection data set can be used directly for neural network model training; as needed, it may contain only metadata, or data pairs consisting of metadata and a category label. The size of the sample collection data set refers to the maximum number of data items the data set can hold, and it can be implemented in various ways, for example as a queue or a linked list. The size of the sample collection data set should normally be larger than the total number of categories; the specific value depends on the needs of neural network model training and can, for example, be 100 times the total number of categories. The sample collection data set may already hold a large amount of sample data generated by the signal source. The current proportion of a given category in the sample collection data set can be obtained by counting the number of samples of that category and dividing by the total number of samples in the sample collection data set, or by the size of the sample collection data set. The category can be obtained from the category labels of the data pairs stored in the sample collection data set, or by counting the corresponding category labels in a dedicated category label data set.
In step S130, the target proportion of a category of sample data is the desired proportion; it can be set according to the requirements of neural network model training, for example according to the total number of categories. If fully balanced categories are required and the total number of categories is n_c, the target proportion of each category of sample data can be 1/n_c. If the current proportion of a category is smaller than its desired proportion, that category is under-represented; otherwise it is over-represented. If the current proportion is smaller than the target proportion, a larger collection probability can be set; if the current proportion is larger than the target proportion, a smaller collection probability can be set.
In step S140, the collection probability can be realized by means of a random number. If the sample collection data set is full, the newly added sample can replace an old sample, for example the sample that was added to the sample collection data set earliest. If the sample collection data set is not full, the new sample can be added to it directly.
In this embodiment, because sample data is collected with a fixed-size data set, the current proportion of each category of sample data can be known; a reasonable collection probability can be determined from the current proportion and the desired proportion of a given category; and whether a new sample is added to the data set is decided according to the collection probability, so that the sample data in the data set increasingly matches the category distribution required for neural network model training. This scheme can therefore collect a sample data set that meets the category-distribution requirements of machine learning while new sample data is continuously being generated. When different categories of samples are generated with different probabilities, filtering and collecting the generated samples makes the categories in the collected data set approach the desired category distribution, for example tend towards balance, and the data set can keep collecting new data produced by a continuous data stream, so that the samples used for model training are kept up to date.
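As a non-authoritative illustration of steps S110–S140, the following Python sketch shows one possible shape of the overall collection loop. The function and parameter names (collect_sample, target_proportions, collect_probability and so on) are assumptions introduced here for clarity and do not appear in the patent.

```python
import random

def collect_sample(sample, dataset, labels, target_proportions, max_size, collect_probability):
    """Minimal sketch of steps S110-S140: decide whether to keep one incoming sample.

    collect_probability is a callable (current, target) -> probability; one concrete
    choice is sketched after the discussion of Fig. 2 and Fig. 3 below.
    """
    metadata, category = sample                       # S110: receive {d, l}

    # S120: current proportion of this category among the collected labels
    current = labels.count(category) / max(len(labels), 1)

    # S130: collection probability from current and target proportion
    p_collect = collect_probability(current, target_proportions[category])

    # S140: add the sample with that probability
    if random.random() <= p_collect:
        if len(dataset) >= max_size:                  # data set full: replace the oldest entry
            dataset.pop(0)
            labels.pop(0)
        dataset.append(metadata)
        labels.append(category)
        return True
    return False
```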
In some embodiments, step S120, i.e., obtaining the current proportion, in the sample collection data set, of sample data belonging to the category of the sample data to be collected, may include:
counting the proportion of labels of the category of the sample data to be collected in a category label data set, to obtain the current proportion, in the sample collection data set, of sample data belonging to that category; the category label data set is used to store the category label of each sample in the sample collection data set, and the size of the category label data set is the same as the size of the sample collection data set.
A category label can be added to the category label data set at the moment the corresponding sample is collected into the sample collection data set, or the labels can be added to the category label data set one by one after all current samples of the sample collection data set have been obtained. When the category label of a sample needs to be added to the category label data set, the label can be separated from the sample data and, after any required conversion, added to the category label data set. The category label data set can be dedicated to storing the category labels corresponding to the samples in the sample collection data set.
In this embodiment, using a dedicated category label data set to store the category label of each sample in the sample collection data set makes it convenient to quickly count the current category situation of the sample data in the sample collection data set.
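A minimal sketch of this counting step, assuming the category label data set is kept as a plain Python list whose length never exceeds the size of the sample collection data set, and that categories are numbered 0 to num_categories - 1:

```python
from collections import Counter

def current_proportions(labels, num_categories):
    """Per-category proportions computed from the category label data set."""
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {c: 0.0 for c in range(num_categories)}
    return {c: counts.get(c, 0) / total for c in range(num_categories)}
```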
In some embodiments, step S130, i.e., determining the collection probability of sample data of the category of the sample data to be collected according to the current proportion and the target proportion of that category, may include:
Step S131: in the case where the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, determining a first probability as the collection probability of sample data of that category; and in the case where the current proportion is greater than the target proportion, determining a second probability as the collection probability; the first probability is greater than the second probability.
The specific values of the first probability and the second probability can be determined according to the deviation between the current proportion and the target proportion (for example, a difference or a mean square deviation).
In this embodiment, when the current proportion of a category is less than or equal to its target proportion, the data of that category is scarce, and a larger first probability allows more samples of that category to be obtained; when the current proportion of the category is greater than its target proportion, the data of that category is abundant, and a smaller second probability allows fewer samples of that category to be obtained. In this way, as new samples keep arriving, the proportion of each category can be brought closer and closer to its desired value.
Fig. 2 is a schematic flowchart of a method for determining the collection probability in an embodiment of the present invention. As shown in Fig. 2, step S131, i.e., determining the first probability as the collection probability when the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, and determining the second probability as the collection probability when the current proportion is greater than the target proportion, may include:
Step S1311: obtaining the current category distribution of the sample data in the sample collection data set;
Step S1312: calculating the mean square deviation between the current category distribution and the target category distribution of the sample data in the sample collection data set;
Step S1313: in the case where the mean square deviation is less than or equal to an error threshold set according to the total number of samples of the sample collection data set, when the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, determining the first probability, obtained as 0.5 plus the mean square deviation, as the collection probability of sample data of that category; and when the current proportion is greater than the target proportion, determining the second probability, obtained as 0.5 minus the mean square deviation, as the collection probability of sample data of that category.
In step S1311, the current category distribution can be the ratio or proportion of each category of sample data in the sample collection data set. In step S1312, assuming the total number of categories is n_c, the current proportion of the i-th category is p̂_i and its target proportion is p_i, the mean square deviation can be expressed as mse = (1/n_c) Σ_{i=1}^{n_c} (p_i − p̂_i)². In step S1313, the error threshold can be determined according to the size n_tar of the sample collection data set, for example as 5/n_tar. When the mean square deviation is less than or equal to the error threshold set according to the total number of samples of the sample collection data set, the current category distribution can be considered close to the target category distribution; the first probability, 0.5 plus the mean square deviation, then collects samples of an under-represented category with a slightly larger probability. When the current proportion is greater than the target proportion, that category is over-represented relative to the target, and the second probability, 0.5 minus the mean square deviation, collects samples of that category with a slightly smaller probability.
In this embodiment, the mean square deviation is calculated from the current category distribution and the target category distribution, and the first or second probability is obtained by letting the probability fluctuate around 0.5 by the amount of the mean square deviation. The collection probability thus meets the needs of category adjustment without causing excessive oscillation of the sample category distribution.
In other embodiments, the mean of the per-category proportions in the current category distribution and the mean of the per-category target proportions in the sample collection data set can be calculated separately, and the difference between the two means can be computed. When the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, the first probability, obtained as 0.5 plus the difference, is determined as the collection probability of sample data of that category; when the current proportion is greater than the target proportion, the second probability, obtained as 0.5 minus the difference, is determined as the collection probability of sample data of that category.
Fig. 3 is a schematic flowchart of a method for determining the collection probability in another embodiment of the present invention. As shown in Fig. 3, the method for determining the collection probability shown in Fig. 2 may further include:
Step S1314: in the case where the mean square deviation is greater than the error threshold set according to the total number of samples of the sample collection data set, when the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, determining a first probability taken from the end of the (0.5, 1) range close to 1 as the collection probability of sample data of that category; and when the current proportion is greater than the target proportion, determining a second probability taken from the end of the (0, 0.5) range close to 0 as the collection probability of sample data of that category.
In step S1314, when the mean square deviation is greater than the error threshold set according to the total number of samples of the sample collection data set, the current category distribution differs considerably from the target distribution. "Close to 1" can mean a value in the (0.75, 1) range, for example 0.9 or 0.99; "close to 0" can mean a value in the (0, 0.25) range, for example 0.1 or 0.15. In that case, taking a value from the end of the (0.5, 1) range close to 1 as the collection probability lets the required category quickly reach its target proportion, while taking a value from the end of the (0, 0.5) range close to 0 as the collection probability slows, as much as possible, the growth of the amount of data of the over-represented categories.
In this embodiment, when the current category distribution differs greatly from the target distribution, setting a very large collection probability quickly increases the collection speed of the required category, and setting a very small collection probability suppresses, as much as possible, the collection of unneeded categories, so that a given category can reach its target proportion as early as possible.
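The probability rule of steps S1313 and S1314 can be put together in a short sketch. The threshold values below (error threshold 5/n_tar, p_xthr = 0.99, p_ythr = 0.01) follow the example values given in the specific embodiment later in the description; the function name determine_collect_probability is an assumed name.

```python
def determine_collect_probability(current, target, current_dist, target_dist, n_tar,
                                  p_xthr=0.99, p_ythr=0.01):
    """Collection probability for one category (steps S1311-S1314).

    current, target           -- current and target proportion of this category
    current_dist, target_dist -- per-category proportion lists (same ordering)
    n_tar                     -- size of the fixed sample collection data set
    """
    n_c = len(target_dist)
    # S1312: mean square deviation between the current and target distributions
    mse = sum((c - t) ** 2 for c, t in zip(current_dist, target_dist)) / n_c

    if mse <= 5.0 / n_tar:                 # S1313: distributions already close
        return 0.5 + mse if current <= target else 0.5 - mse
    # S1314: distributions still far apart
    return p_xthr if current <= target else p_ythr
```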
Fig. 4 is a schematic flowchart of a method for adding sample data to be collected to the sample collection data set according to the collection probability in an embodiment of the present invention. As shown in Fig. 4, step S140, i.e., adding the sample data to be collected to the sample collection data set according to the collection probability, may include:
Step S141: generating a random number;
Step S142: in the case where the random number is less than or equal to the collection probability, adding the sample data to be collected to the sample collection data set; and in the case where the random number is greater than the collection probability, not adding the sample data to be collected to the sample collection data set.
In step S141, the random number can be generated by various random number generation means. In step S142, when the random number is less than or equal to the collection probability, it can be determined that the sample data to be collected needs to be added to the sample collection data set; at this point an addition flag can be returned, and the sample data is added to the sample collection data set according to that flag. When the random number is greater than the collection probability, it can be determined that the sample data does not need to be added to the sample collection data set; the sample data to be collected can then be discarded directly, or processed in some other way.
In this embodiment, using a random number to collect samples with the previously determined collection probability allows the sample data to be collected automatically according to the target category distribution.
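A minimal sketch of steps S141 and S142, assuming the collection probability has already been computed; the returned boolean plays the role of the addition flag mentioned above:

```python
import random

def should_collect(p_collect):
    """Steps S141-S142: keep the sample iff a uniform random number is <= the probability."""
    return random.random() <= p_collect
```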
In some embodiments, in step S142, adding the sample data to be collected to the sample collection data set when the random number is less than or equal to the collection probability may include:
in the case where the random number is less than or equal to the collection probability, if the sample collection data set is full, replacing the sample data that was added earliest to the sample collection data set with the sample data to be collected.
In this embodiment, when the sample collection data set is full, replacing old samples with new samples allows the required sample data to be collected while keeping the data set size fixed.
If the sample collection data set is not full, the sample data to be collected can be added to it directly, so as to increase the speed of sample collection.
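A minimal sketch of the fixed-size, replace-oldest behaviour, using Python's collections.deque as an assumed implementation choice; the patent only requires that the oldest sample be replaced when the set is full, and a queue or linked list would work equally well:

```python
from collections import deque

class SampleCollection:
    """Fixed-size sample collection data set plus its category label data set."""

    def __init__(self, max_size):
        # deque(maxlen=...) silently discards the oldest element when a new one
        # is appended to a full deque, which matches the replace-oldest rule.
        self.data = deque(maxlen=max_size)
        self.labels = deque(maxlen=max_size)

    def add(self, metadata, label):
        self.data.append(metadata)
        self.labels.append(label)
```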
In other embodiments, if the sample collection data set is full, samples of categories whose current proportion exceeds the target proportion by the largest margin can be located and removed. This can increase the speed at which the target category distribution is reached.
Fig. 5 is a schematic flowchart of the data collection method of another embodiment of the present invention. As shown in Fig. 5, the data collection method shown in Fig. 1 may, after step S140, i.e., after the sample data to be collected has been added to the sample collection data set according to the collection probability, further include:
Step S150: adding the category label corresponding to the sample data to be collected to the category label data set.
The category label data set is used to store the category label of each sample in the sample collection data set, and its size is the same as the size of the sample collection data set.
In this embodiment, once it has been determined that the sample data to be collected is added to the sample collection data set according to the collection probability, the corresponding category label is added to the category label data set. The category label data set can thus be updated synchronously, so that the labels in the category label data set correspond to the samples in the sample collection data set, which makes it convenient to count the current proportion of each category of sample data.
To help those skilled in the art better understand the present invention, an implementation process of the present invention is described below with a specific embodiment.
Fig. 6 is a schematic flowchart of the data collection method of an embodiment of the present invention. As shown in Fig. 6, assume a signal source continuously generates sample data of different categories. Each generated sample includes the data itself (metadata) and a category label. Using the data collection method of the embodiment of the present invention, the samples in the data stream can be analyzed and counted, and it can then be decided whether to add each sample to a sample collection data set of fixed size.
Define the signal source as S; it continuously generates sample data D_sm. Each generated sample D_sm consists of metadata d and a category label l, i.e., D_sm = {d, l}. Assume the number of categories in the labels is n_c. In this embodiment, a fixed-size sample collection data set is used to collect the samples generated by signal source S. A fixed-size data set is one with a maximum number of data items it can hold; for example, if the total size of the sample collection data set is n, it can hold at most n samples. When the sample collection data set is full but a new sample still needs to be added, an old sample in the sample collection data set can be replaced according to a set rule, for example replacing the currently oldest sample in the sample collection data set with the new sample.
First, the data sets used for statistics and the data set used for storing the real samples can be initialized. The data sets used for statistics can store only the category labels of the samples, and may include a data stream statistics data set D_st1 and a category label data set D_st2. The data set used for storing the real samples can store only the metadata of the samples, or store data pairs consisting of metadata and the corresponding category label (the sample data), and may include the sample collection data set D_tar.
The data stream statistics data set D_st1 can be used to estimate the probability with which samples of each category appear in the data stream; its size can be set, for example, to n_st1 = n_c * 100. The larger the total size of D_st1, the higher the precision of the estimated occurrence probability of each category. At the beginning, D_st1 collects every sample in the data stream; once D_st1 is full, a new sample can replace the currently oldest sample in D_st1.
The category label data set D_st2 can be used to count the proportion of each category of sample data currently stored in the sample collection data set D_tar. Its total size n_st2 is the same as the total size n_tar of D_tar, i.e., n_st2 = n_tar. The difference between D_st2 and D_tar is that D_st2 stores only the category labels of the samples, not their metadata (the data itself). When a new sample arrives, the statistics of D_st2 are evaluated against a set of rules to decide whether the new sample is added to D_tar.
Assume the desired distribution of the sample data in D_tar is that each category occupies an equal proportion of D_tar. The target category distribution is the desired distribution of sample categories, i.e., d_dst = {p_i | i = 1, 2, 3, ..., n_c}, where p_i is the desired proportion of the i-th category in D_tar. The current category distribution is the distribution of sample categories currently in D_tar, i.e., {p̂_i | i = 1, 2, 3, ..., n_c}, where p̂_i is the current proportion of the i-th category in D_tar. While D_tar is not yet full, every new sample can be added; once it is full, a set judgment rule is executed on the statistics of D_st2, and the judgment result decides whether the new sample is added to D_tar.
The data collection process may include the following steps:
(1) Update the data stream statistics data set D_st1. This gives the distribution of the current data stream and can be used to count the category distribution of the samples in the stream, as a reference for subsequent probability values. D_st1 can also be used to buffer new samples temporarily, so that they can be taken out when needed and added to the sample collection data set D_tar. In other embodiments, D_st1 can be omitted, and new samples can be received directly from the signal source for the decision of whether to add them to D_tar.
(2) According to the desired category distribution (the target category distribution), the current categories can be divided into two different sets. If the desired proportion (target proportion) of a category is greater than the current proportion of that category in D_tar, i.e., D_tar has too few samples of that category, the new sample is assigned to set S_x; otherwise it is assigned to set S_y.
(3) Calculate the mean square error mse between the current category distribution and the target category distribution, i.e., mse = (1/n_c) Σ_{i=1}^{n_c} (p_i − p̂_i)². If mse < 5/n_st2, execute step (4); otherwise execute step (5).
(4) If the category of the new sample is in set S_x, the new sample is added to the sample collection data set D_tar with probability p_acc = 0.5 + mse; otherwise it is added with probability p_acc = 0.5 − mse. Then execute step (6).
(5) If the category of the new sample is in set S_x, the new sample is added to the data set with probability p_xthr; otherwise it is added to D_tar with probability p_ythr. The probabilities p_xthr and p_ythr are set thresholds. The value range of p_xthr can be (0.5, 1); to reach the target proportion faster, p_xthr can be taken as 0.99. The value range of p_ythr can be (0, 0.5); to reach the target proportion faster, p_ythr can be taken as 0.01. Then execute step (6).
(6) Generate a random number. If the random number is less than the probability determined above, it is decided to add the new sample to the sample collection data set D_tar, and True is returned; otherwise False is returned. The judgment result is then returned to D_tar: if True is returned, the new sample is added to D_tar and its category label is added to the category label data set D_st2; otherwise the data is discarded. While D_tar is not yet full, every new sample is added to D_tar and its category label is added to D_st2; once D_tar is full, whether a new sample is added is decided by executing the above judgment rule.
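Steps (1)–(6) above can be combined into a single sketch. This is a minimal, non-authoritative reading of the embodiment: the data stream statistics set D_st1 is omitted (as the description allows), D_st2 and D_tar are kept as parallel deques, category labels are assumed to be integers 0 to n_c − 1, and the target distribution is taken as uniform, i.e., p_i = 1/n_c. The class and method names are placeholders.

```python
import random
from collections import Counter, deque

class BalancedCollector:
    """Sketch of the Fig. 6 embodiment: collect a fixed-size, category-balanced data set."""

    def __init__(self, n_categories, n_tar, p_xthr=0.99, p_ythr=0.01):
        self.n_c = n_categories
        self.n_tar = n_tar
        self.d_tar = deque(maxlen=n_tar)      # sample collection data set D_tar
        self.d_st2 = deque(maxlen=n_tar)      # category label data set D_st2
        self.p_xthr, self.p_ythr = p_xthr, p_ythr
        self.target = [1.0 / n_categories] * n_categories   # uniform target distribution

    def _current(self):
        counts = Counter(self.d_st2)
        total = max(len(self.d_st2), 1)
        return [counts.get(i, 0) / total for i in range(self.n_c)]

    def offer(self, metadata, label):
        """Return True if the sample {d, l} is added to D_tar, else False."""
        if len(self.d_tar) < self.n_tar:                     # not full: keep everything
            self.d_tar.append(metadata)
            self.d_st2.append(label)
            return True

        current = self._current()
        under_represented = self.target[label] > current[label]      # step (2): set S_x
        mse = sum((p - q) ** 2 for p, q in zip(self.target, current)) / self.n_c  # step (3)

        if mse < 5.0 / self.n_tar:                           # step (4)
            p_acc = 0.5 + mse if under_represented else 0.5 - mse
        else:                                                # step (5)
            p_acc = self.p_xthr if under_represented else self.p_ythr

        if random.random() < p_acc:                          # step (6)
            self.d_tar.append(metadata)                      # deque full: oldest is dropped
            self.d_st2.append(label)
            return True
        return False
```

As a usage example, one could create `collector = BalancedCollector(n_categories=3, n_tar=300)` and call `collector.offer(d, l)` for each sample {d, l} taken from the stream.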
This embodiment addresses the case where new data is constantly being generated and solves the category-imbalance problem of the data produced by the data stream in this case, where the total amount of data is uncertain. The prior art, which resamples a complete data set of fixed size, cannot solve the imbalance problem here. The total number of samples does not need to be known; data is extracted from the data stream in real time. New data can replace old data in the data set, so the contents of the data set are kept up to date. The embodiment thus solves the problems of how to collect a data set of fixed size while new samples are continuously generated, and how to decide, when the data set is full, whether existing data in the data set should be replaced with new data.
Based on the same inventive concept as the data collection method shown in Fig. 1, an embodiment of the present invention also provides a data collection device, as described in the following embodiments. Since the principle by which the data collection device solves the problem is similar to that of the data collection method, the implementation of the device can refer to the implementation of the method, and repeated description is omitted.
Fig. 7 is a schematic structural diagram of the data collection device of an embodiment of the present invention. As shown in Fig. 7, the data collection device of some embodiments may include a data receiving unit 210, a current proportion obtaining unit 220, a collection probability determining unit 230 and a data collection unit 240, connected in sequence.
The data receiving unit 210 is configured to receive sample data to be collected;
the current proportion obtaining unit 220 is configured to obtain the current proportion, in a sample collection data set, of sample data belonging to the category of the sample data to be collected, the sample collection data set being a fixed-size data set;
the collection probability determining unit 230 is configured to determine, according to the current proportion and a target proportion for sample data of that category, a collection probability for sample data of the category of the sample data to be collected;
the data collection unit 240 is configured to add the sample data to be collected to the sample collection data set according to the collection probability, for use in training a neural network model.
In some embodiments, the current proportion obtaining unit 220 may include a current proportion obtaining module.
The current proportion obtaining module is configured to count the proportion of labels of the category of the sample data to be collected in a category label data set, so as to obtain the current proportion, in the sample collection data set, of sample data belonging to that category; the category label data set is used to store the category label of each sample in the sample collection data set, and the size of the category label data set is the same as the size of the sample collection data set.
In some embodiments, the collection probability determining unit 230 may include a collection probability determination module.
The collection probability determination module is configured to determine a first probability as the collection probability of sample data of the category of the sample data to be collected when the current proportion is less than or equal to the target proportion of that category, and to determine a second probability as the collection probability when the current proportion is greater than the target proportion; the first probability is greater than the second probability.
Fig. 8 is a schematic structural diagram of the collection probability determination module in an embodiment of the present invention. As shown in Fig. 8, the collection probability determination module may include a current category distribution obtaining module 2311, a mean square deviation calculation module 2312 and a first collection probability generation module 2313, connected in sequence.
The current category distribution obtaining module 2311 is configured to obtain the current category distribution of the sample data in the sample collection data set;
the mean square deviation calculation module 2312 is configured to calculate the mean square deviation between the current category distribution and the target category distribution of the sample data in the sample collection data set;
the first collection probability generation module 2313 is configured, in the case where the mean square deviation is less than or equal to an error threshold set according to the total number of samples of the sample collection data set, to determine the first probability, obtained as 0.5 plus the mean square deviation, as the collection probability of sample data of the category of the sample data to be collected when the current proportion is less than or equal to the target proportion of that category, and to determine the second probability, obtained as 0.5 minus the mean square deviation, as the collection probability of sample data of that category when the current proportion is greater than the target proportion.
Fig. 9 is a schematic structural diagram of the collection probability determination module in another embodiment of the present invention. As shown in Fig. 9, the collection probability determination module shown in Fig. 8 may further include a second collection probability generation module 2314, connected to the mean square deviation calculation module 2312.
The second collection probability generation module 2314 is configured, in the case where the mean square deviation is greater than the error threshold set according to the total number of samples of the sample collection data set, to determine a first probability taken from the end of the (0.5, 1) range close to 1 as the collection probability of sample data of the category of the sample data to be collected when the current proportion is less than or equal to the target proportion of that category, and to determine a second probability taken from the end of the (0, 0.5) range close to 0 as the collection probability of sample data of that category when the current proportion is greater than the target proportion.
Fig. 10 is a schematic structural diagram of the data collection unit in an embodiment of the present invention. As shown in Fig. 10, the data collection unit 240 may include a random number generation module 241 and a data collection module 242, connected to each other.
The random number generation module 241 is configured to generate a random number;
the data collection module 242 is configured to add the sample data to be collected to the sample collection data set when the random number is less than or equal to the collection probability, and not to add it to the sample collection data set when the random number is greater than the collection probability.
In some embodiments, the data collection module 242 may include a sample collection data set update module.
The sample collection data set update module is configured, when the random number is less than or equal to the collection probability and the sample collection data set is full, to replace the sample data that was added earliest to the sample collection data set with the sample data to be collected.
Fig. 11 is a schematic structural diagram of the data collection device of another embodiment of the present invention. As shown in Fig. 11, the data collection device shown in Fig. 7 may further include a category label data set update module 250, connected to the data collection unit 240.
The category label data set update module 250 is configured to add the category label corresponding to the sample data to be collected to the category label data set.
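As a hedged illustration of how units 210–250 might map onto code, the following skeleton wires the earlier sketches together; the class and method names are assumptions, not part of the patent.

```python
class DataCollectionDevice:
    """Skeleton mirroring units 210-250 of Figs. 7 and 11 (names are illustrative only)."""

    def __init__(self, collector):
        self.collector = collector              # e.g. the BalancedCollector sketched above

    def receive(self, sample):                  # data receiving unit 210
        metadata, label = sample
        # The current proportion obtaining unit 220, collection probability determining
        # unit 230, data collection unit 240 and category label data set update
        # module 250 are all folded into collector.offer() in this sketch.
        return self.collector.offer(metadata, label)
```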
An embodiment of the present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of the above embodiments.
An embodiment of the present invention also provides a computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method of the above embodiments.
In summary, with the data collection method, data collection device, computer device and computer-readable storage medium of the embodiments of the present invention, sample data is collected using a fixed-size data set, so the current proportion of each category of sample data can be known; a reasonable collection probability can be determined from the current proportion and the desired proportion of a given category; and whether a new sample is added to the data set is decided according to the collection probability, so that the sample data in the data set increasingly matches the category distribution required for neural network model training. This scheme can therefore collect a sample data set that meets the category-distribution requirements of machine learning while new sample data is continuously being generated.
In the description of this specification, reference terms such as "one embodiment", "a specific embodiment", "some embodiments", "for example", "an example", "a specific example" or "some examples" mean that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The order of the steps involved in each embodiment is used to schematically illustrate the implementation of the present invention; the order of the steps is not limiting and can be adjusted as needed.
Those skilled in the art should understand that the embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device which realizes the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. a kind of method of data capture characterized by comprising
Receive sample data to be collected;
Obtain the current accounting for belonging to the sample data of the sample data generic to be collected in sample collection data set, institute Stating sample collection data set is fixed-size data set;
According to the determination of the target accounting of the current accounting and the sample data of the sample data generic to be collected The collection probability of the sample data of sample data generic to be collected;
The sample data to be collected is added in the sample collection data set according to the collection probability, for training Neural network model.
2. method of data capture as described in claim 1, which is characterized in that obtain belong in sample collection data set it is described to Collect the current accounting of the sample data of sample data generic, comprising:
Statistics calculates the accounting of the label of sample data generic to be collected described in class label data set, obtains sample receipts Belong to the current accounting of the sample data of the sample data generic to be collected in collection data set;The class label data Collect the class label for storing each sample data in the sample collection data set, the size of the class label data set with The size of the sample collection data set is identical.
3. method of data capture as described in claim 1, which is characterized in that according to the current accounting and the sample to be collected The target accounting of the sample data of notebook data generic determines the sample data of the sample data generic to be collected Collect probability, comprising:
In the case where the current accounting is less than or equal to the target accounting of the sample data generic to be collected, by the One probability is determined as the collection probability of the sample data of the sample data generic to be collected, and is greater than in the current accounting It is the collection probability by the second determine the probability in the case where the target accounting;It is general that first probability is greater than described second Rate.
4. method of data capture as claimed in claim 3, which is characterized in that the current accounting be less than or equal to it is described to It is described wait collect belonging to sample data by the first determine the probability in the case where the target accounting for collecting sample data generic The collection probability of the sample data of classification, it is in the case where the current accounting is greater than the target accounting, the second probability is true It is set to the collection probability, comprising:
Obtain the current class distribution of sample data in the sample collection data set;
It calculates equal between the target category distribution of sample data in the current class distribution and the sample collection data set Variance;
It is less than or equal to the feelings of the error threshold set according to the total sample number of the sample collection data set in the mean square deviation Under condition, when the current accounting be less than or equal to it is described when collecting the target accounting of sample data generic, will by 0.5 plus The first determine the probability that the upper mean square deviation obtains is that the collection of the sample data of the sample data generic to be collected is general Rate, when the current accounting be greater than the target accounting when, be by the second determine the probability that the mean square deviation obtains is subtracted by 0.5 The collection probability of the sample data of the sample data generic to be collected.
5. method of data capture as claimed in claim 4, which is characterized in that the current accounting be less than or equal to it is described to It is described wait collect belonging to sample data by the first determine the probability in the case where the target accounting for collecting sample data generic The collection probability of the sample data of classification, it is in the case where the current accounting is greater than the target accounting, the second probability is true It is set to the collection probability, further includes:
In the case where the mean square deviation is greater than the error threshold set according to the total sample number of the sample collection data set, when The current accounting be less than or equal to it is described when collecting the target accounting of sample data generic, will be from (0.5,1) range The first determine the probability that interior one end close to 1 is taken is that the collection of the sample data of the sample data generic to be collected is general Rate is taken one end out of (0,0.5) range close to 0 when the current accounting is less than or equal to the target accounting Second determine the probability is the collection probability of the sample data of the sample data generic to be collected.
6. method of data capture as described in claim 1, which is characterized in that according to the collection probability by the sample to be collected Notebook data is added in the sample collection data set, comprising:
Generate a random number;
In the case where the random number is less than or equal to the collection probability, the sample data to be collected is added to described In sample collection data set;In the case where the random number is greater than the collection probability, not by the sample data to be collected It is added in the sample collection data set.
7. The data collection method as claimed in claim 6, characterized in that adding the sample data to be collected into the sample collection data set in the case where the random number is less than or equal to the collection probability comprises:
in the case where the random number is less than or equal to the collection probability and the sample collection data set is already full, replacing the earliest-added sample data in the sample collection data set with the sample data to be collected.
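Claims 6 and 7 describe the acceptance step: draw a random number, keep the sample if the draw does not exceed the collection probability, and, when the fixed-size set is already full, overwrite the earliest-added sample. A bounded deque is one convenient way to get that oldest-first replacement; the class and method names below are illustrative assumptions, not taken from the patent.

```python
import random
from collections import deque

class SampleCollector:
    """Fixed-size sample collection data set filled by probabilistic acceptance (sketch of claims 6-7)."""

    def __init__(self, capacity):
        # A bounded deque silently discards the earliest-added element once it is full.
        self.samples = deque(maxlen=capacity)

    def offer(self, sample, collection_probability):
        # Claim 6: accept only if a uniform random draw is at most the collection probability.
        if random.random() <= collection_probability:
            # Claim 7: when the set is full, appending evicts the earliest-added sample.
            self.samples.append(sample)
            return True
        return False
```

Using `deque(maxlen=...)` is just one way to realise "replace the sample added earliest"; a ring buffer over a preallocated list would behave the same way.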
8. The data collection method as claimed in claim 2, characterized in that, after adding the sample data to be collected into the sample collection data set according to the collection probability, the method further comprises:
adding the class label corresponding to the sample data to be collected into the class label data set.
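Claim 8 adds a parallel class label data set. Building on the hypothetical SampleCollector sketch above (same imports), one way to keep labels aligned with their samples is a second bounded deque that is appended to, and evicted from, in lockstep; again the names are assumptions for illustration.

```python
class LabelledSampleCollector(SampleCollector):
    """Sample collector that also records the class label of every accepted sample (sketch of claim 8)."""

    def __init__(self, capacity):
        super().__init__(capacity)
        self.labels = deque(maxlen=capacity)  # evicted in lockstep with self.samples

    def offer(self, sample, label, collection_probability):
        if random.random() <= collection_probability:
            self.samples.append(sample)
            self.labels.append(label)  # claim 8: store the corresponding class label alongside the sample
            return True
        return False
```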
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
CN201811542893.7A 2018-12-17 2018-12-17 Data collection method and device Active CN109740750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811542893.7A CN109740750B (en) 2018-12-17 2018-12-17 Data collection method and device

Publications (2)

Publication Number Publication Date
CN109740750A true CN109740750A (en) 2019-05-10
CN109740750B CN109740750B (en) 2021-06-15

Family

ID=66360404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811542893.7A Active CN109740750B (en) 2018-12-17 2018-12-17 Data collection method and device

Country Status (1)

Country Link
CN (1) CN109740750B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909981A (en) * 2015-12-23 2017-06-30 阿里巴巴集团控股有限公司 Model training, sample balance method and device and personal credit points-scoring system
CN105975992A (en) * 2016-05-18 2016-09-28 天津大学 Unbalanced data classification method based on adaptive upsampling
CN105975993A (en) * 2016-05-18 2016-09-28 天津大学 Unbalanced data classification method based on boundary upsampling
CN106897918A (en) * 2017-02-24 2017-06-27 上海易贷网金融信息服务有限公司 A kind of hybrid machine learning credit scoring model construction method
CN108920477A (en) * 2018-04-11 2018-11-30 华南理工大学 A kind of unbalanced data processing method based on binary tree structure
CN108960561A (en) * 2018-05-04 2018-12-07 阿里巴巴集团控股有限公司 A kind of air control model treatment method, device and equipment based on unbalanced data
CN108647727A (en) * 2018-05-10 2018-10-12 广州大学 Unbalanced data classification lack sampling method, apparatus, equipment and medium
CN108694413A (en) * 2018-05-10 2018-10-23 广州大学 Adaptively sampled unbalanced data classification processing method, device, equipment and medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529172A (en) * 2019-09-18 2021-03-19 华为技术有限公司 Data processing method and data processing apparatus
US20220147668A1 (en) * 2020-11-10 2022-05-12 Advanced Micro Devices, Inc. Reducing burn-in for monte-carlo simulations via machine learning

Also Published As

Publication number Publication date
CN109740750B (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN111967910A (en) User passenger group classification method and device
CN106909654B (en) Multi-level classification system and method based on news text information
CN104156734B (en) A kind of complete autonomous on-line study method based on random fern grader
CN105426905B (en) Robot barrier object recognition methods based on histogram of gradients and support vector machines
CN110084165A (en) The intelligent recognition and method for early warning of anomalous event under the open scene of power domain based on edge calculations
CN107832725A (en) Video front cover extracting method and device based on evaluation index
CN109871954B (en) Training sample generation method, abnormality detection method and apparatus
CN110221965A (en) Test cases technology, test method, device, equipment and system
CN107918656A (en) Video front cover extracting method and device based on video title
CN103957116B (en) A kind of decision-making technique and system of cloud fault data
CN110288007A (en) The method, apparatus and electronic equipment of data mark
CN113037783B (en) Abnormal behavior detection method and system
CN114169248B (en) Product defect data analysis method and system, electronic device and readable storage medium
CN104881685A (en) Video classification method based on shortcut depth nerve network
CN110046297A (en) Operation and maintenance violation identification method and device and storage medium
CN109740750A (en) Method of data capture and device
Lyu et al. Probabilistic object detection via deep ensembles
CN109086737B (en) Convolutional neural network-based shipping cargo monitoring video identification method and system
Kafshgari et al. Smart split-federated learning over noisy channels for embryo image segmentation
CN107633058A (en) A kind of data dynamic filtration system and method based on deep learning
CN108009152A (en) A kind of data processing method and device of the text similarity analysis based on Spark-Streaming
CN111797935B (en) Semi-supervised depth network picture classification method based on group intelligence
CN108376140A (en) Government data carding method based on fuzzy matching and device
CN111445025A (en) Method and device for determining hyper-parameters of business model
CN108932459A (en) Face recognition model training method and device and recognition algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200201

Address after: Room 2, Building 3, Building 30, Xing Xing Street, Shijingshan District, Beijing 100041

Applicant after: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Address before: First floor of the western small building, Jia No. 18, Xueqing Road, Beijing 100083

Applicant before: Beijing Shenji Intelligent Technology Co., Ltd.

GR01 Patent grant