CN109740750A - Data collection method and device - Google Patents
Data collection method and device
- Publication number
- CN109740750A (application CN201811542893.7A)
- Authority
- CN
- China
- Prior art keywords
- sample
- data
- collection
- probability
- sample data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a data collection method and device. The method includes: receiving sample data to be collected; obtaining the current proportion, in a sample collection data set, of sample data belonging to the same category as the sample data to be collected, the sample collection data set being a fixed-size data set; determining a collection probability for sample data of that category according to the current proportion and a target proportion for sample data of that category; and adding the sample data to be collected to the sample collection data set according to the collection probability, for use in training a neural network model. With this scheme, a sample data set whose category distribution meets the requirements of machine learning can be obtained even while new samples are continuously generated.
Description
Technical field
The present invention relates to the field of deep learning, and in particular to a data collection method and device.
Background art
In deep learning, a typical neural network needs to be trained with a large amount of sample data. If the category distribution of the sample data in the data set is unbalanced, training of the neural network model tends to fail. For a classification problem, unbalanced sample data means that the numbers of samples of the different categories in the data set differ greatly. For example, in a binary classification problem with 100 samples in total (100 rows of data, each row being the representation of one sample), if 80 samples belong to category 1 and the remaining 20 samples belong to category 2, then category 1 : category 2 = 80 : 20 = 4 : 1, which is a case of category imbalance. In reinforcement learning, the interaction of an AI (artificial intelligence) agent with its environment generates a large amount of sample data, and if that sample data is classified, different categories of samples are generated with different probabilities.
An unbalanced category distribution in a sample data set is a common problem in machine learning. For a given, fixed sample data set, the usual solutions are to undersample the categories with many samples or to oversample the categories with few samples, obtaining a category-balanced data set by resampling; another approach is to generate new samples artificially from the available samples; there are also methods that do not operate on the data set at all, but improve model training by penalizing the classifier algorithm. All of these methods address model training on a fixed data set.
When the number of samples is very large, or is unknown and new samples are generated continuously, resampling is difficult to apply. Therefore, for the case in reinforcement learning where sample data is generated continuously, the category imbalance problem still lacks a good solution.
Summary of the invention
In view of this, the present invention provides a data collection method and device, so as to obtain a sample data set whose category distribution meets the requirements of machine learning while new samples are continuously generated.
To achieve the above object, the present invention adopts the following solutions:
In an embodiment of the present invention, a data collection method comprises:
receiving sample data to be collected;
obtaining the current proportion, in a sample collection data set, of sample data belonging to the same category as the sample data to be collected, the sample collection data set being a fixed-size data set;
determining a collection probability for sample data of the category of the sample data to be collected, according to the current proportion and a target proportion for sample data of that category;
adding the sample data to be collected to the sample collection data set according to the collection probability, for use in training a neural network model.
In an embodiment of the present invention, a computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of the above embodiment.
In an embodiment of the present invention, a computer-readable storage medium has a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method of the above embodiment.
With the data collection method, computer device, and computer-readable storage medium of the present invention, sample data is collected into a fixed-size data set, so the current proportion of each category of sample data is known; a reasonable collection probability can be determined from the current proportion and the desired proportion of a given category; and new sample data is added to the data set, or not, according to the collection probability, so that the sample data in the data set moves closer to the category distribution required for neural network model training. Therefore, a sample data set whose category distribution meets the requirements of neural network model training can be collected while new sample data is continuously generated.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a schematic flowchart of a data collection method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for determining a collection probability in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a method for determining a collection probability in another embodiment of the present invention;
Fig. 4 is a schematic flowchart of a method for adding sample data to be collected to a sample collection data set according to a collection probability in an embodiment of the present invention;
Fig. 5 is a schematic flowchart of a data collection method according to another embodiment of the present invention;
Fig. 6 is a schematic flowchart of a data collection method according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a data collection device according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a collection probability determination module in an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a collection probability determination module in another embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a data collection unit in an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a data collection device according to another embodiment of the present invention.
Detailed description of the embodiments
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and are not intended to limit it.
Fig. 1 is a schematic flowchart of a data collection method according to an embodiment of the present invention. As shown in Fig. 1, the data collection method of some embodiments may include:
Step S110: receiving sample data to be collected;
Step S120: obtaining the current proportion, in a sample collection data set, of sample data belonging to the same category as the sample data to be collected, the sample collection data set being a fixed-size data set;
Step S130: determining a collection probability for sample data of the category of the sample data to be collected, according to the current proportion and a target proportion for sample data of that category;
Step S140: adding the sample data to be collected to the sample collection data set according to the collection probability, for use in training a neural network model.
In step S110 above, the sample data to be collected may be sample data in a data stream; the total number of samples need not be known, and data is extracted from the stream in real time. The data may be generated continuously by a signal source, for example the sample data continuously generated in reinforcement learning. Each sample may include metadata (the data itself) and a category label, and different samples may belong to different categories. The received sample data to be collected may be temporarily stored in a data set, to be read for subsequent processing.
In step S120 above, the data in the sample collection data set can be used directly for neural network model training and may, as needed, contain only metadata, or data pairs consisting of metadata and a category label. The size of the sample collection data set refers to the maximum number of items it can hold, and it can be implemented in various ways, for example as a queue or a linked list. The size of the sample collection data set should usually be larger than the total number of categories; the specific value depends on the needs of neural network model training and may, for example, be 100 times the total number of categories. The sample collection data set may already hold a large amount of sample data generated by the signal source. The current proportion of a given category of sample data in the sample collection data set can be obtained by counting the samples of that category and dividing by the total number of samples in the sample collection data set, or by the size of the sample collection data set. The category can be read from the category labels of the data pairs stored in the sample collection data set, or obtained by counting the corresponding category labels in a dedicated category label data set.
In step S130 above, the target proportion of a category of sample data is its desired proportion, which can be set according to the requirements of neural network model training and determined from, for example, the total number of categories. For instance, when the categories are required to be fully balanced and the total number of categories is n_c, the target proportion of each category of sample data may be 1/n_c. If the current proportion of a category is smaller than its desired proportion, samples of that category are under-represented; otherwise they are over-represented. If the current proportion is smaller than the target proportion, a larger collection probability can be set; if the current proportion is larger than the target proportion, a smaller collection probability can be set.
In step S140 above, collection with the given probability can be realized using a random number. When the sample collection data set is full, the newly added sample data can replace old sample data, for example the sample data that was added to the sample collection data set earliest. When the sample collection data set is not full, the sample data can be added to it directly.
In this embodiment, by collecting sample data into a fixed-size data set, the current proportion of each category of sample data is known; a reasonable collection probability can be determined from the current proportion and the desired proportion of a given category; and new sample data is added to the data set, or not, according to the collection probability, so that the sample data in the data set moves closer to the category distribution required for neural network model training. Therefore, this scheme can collect a sample data set whose category distribution meets the requirements of machine learning while new sample data is continuously generated. When different categories of samples are generated with different probabilities, filtering and collecting the generated samples in this way lets the categories of the collected sample data approach the desired category distribution, for example become balanced, while the data set continuously collects the new data produced by the data stream, thereby keeping the samples used for model training up to date.
In some embodiments, step S120 above, namely obtaining the current proportion, in the sample collection data set, of sample data belonging to the same category as the sample data to be collected, may include:
counting the proportion, in a category label data set, of labels of the category of the sample data to be collected, to obtain the current proportion, in the sample collection data set, of sample data of that category; the category label data set is used to store the category label of each sample data item in the sample collection data set, and the size of the category label data set is the same as the size of the sample collection data set.
A category label can be added to the category label data set at the moment the corresponding sample data is collected into the sample collection data set, or the labels can be added to the category label data set in a separate pass after all the sample data currently in the sample collection data set has been obtained. When the category label of a sample data item needs to be added to the category label data set, the category label can be separated from the sample data and, after any required conversion, added to the category label data set. The category label data set may be used exclusively for storing the category labels corresponding to the sample data in the sample collection data set.
In this embodiment, by using a dedicated category label data set to store the category label of each sample data item in the sample collection data set, the current category composition of the sample data in the sample collection data set can be counted quickly.
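To illustrate this bookkeeping, the following is a minimal Python sketch, not taken from the patent: the names (make_label_set, current_proportion) and the deque-based fixed-size structure are illustrative assumptions.

```python
from collections import Counter, deque

def make_label_set(max_size: int) -> deque:
    """Fixed-size category label data set; the oldest label falls out when full."""
    return deque(maxlen=max_size)

def current_proportion(label_set: deque, category) -> float:
    """Proportion of `category` among the labels currently stored."""
    if not label_set:
        return 0.0
    return Counter(label_set)[category] / len(label_set)

# Usage: a label is appended whenever a sample is accepted into the sample
# collection data set, so the two data sets stay the same size.
labels = make_label_set(max_size=500)
for lab in (1, 1, 2, 1, 2):
    labels.append(lab)
print(current_proportion(labels, 1))  # 0.6
```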
In some embodiments, step S130 above, namely determining the collection probability for sample data of the category of the sample data to be collected according to the current proportion and the target proportion of sample data of that category, may include:
Step S131: when the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, determining a first probability as the collection probability for sample data of that category; when the current proportion is greater than the target proportion, determining a second probability as the collection probability; the first probability is greater than the second probability.
The specific values of the first probability and the second probability can be determined from the deviation between the current proportion and the target proportion (for example, a difference or a mean squared deviation).
In this embodiment, when the current proportion of a category is less than or equal to its target proportion, data of that category is under-represented, and the larger first probability allows more sample data of that category to be collected; when the current proportion of the category is greater than its target proportion, data of that category is over-represented, and the smaller second probability causes less sample data of that category to be collected. In this way, as new sample data keeps arriving, the proportion of each category of sample data can move closer and closer to its desired value.
Fig. 2 is a schematic flowchart of a method for determining a collection probability in an embodiment of the present invention. As shown in Fig. 2, step S131 above, namely determining the first probability as the collection probability for sample data of the category of the sample data to be collected when the current proportion is less than or equal to the target proportion of that category, and determining the second probability as the collection probability when the current proportion is greater than the target proportion, may include:
Step S1311: obtaining the current category distribution of the sample data in the sample collection data set;
Step S1312: computing the mean squared deviation between the current category distribution and the target category distribution of the sample data in the sample collection data set;
Step S1313: when the mean squared deviation is less than or equal to an error threshold set according to the total number of samples of the sample collection data set: if the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, determining the first probability, obtained as 0.5 plus the mean squared deviation, as the collection probability for sample data of that category; if the current proportion is greater than the target proportion, determining the second probability, obtained as 0.5 minus the mean squared deviation, as the collection probability for sample data of that category.
In step S1311, the current category distribution may be the ratios or proportions of the various categories of sample data in the sample collection data set. In step S1312, assuming that the total number of categories is n_c, that the target proportion of the i-th category of sample data is p_i, and that its current proportion is p̂_i, the mean squared deviation can be expressed as mse = (1/n_c) Σ_{i=1}^{n_c} (p̂_i − p_i)². In step S1313, the error threshold can be determined from the size n_tar of the sample collection data set, for example as 5/n_tar. When the mean squared deviation is less than or equal to the error threshold set according to the total number of samples of the sample collection data set, the current category distribution can be considered close to the target category distribution, and the first probability, obtained as 0.5 plus the mean squared deviation, collects sample data of an under-represented category with a slightly larger probability; when the current proportion is greater than the target proportion, the category is over-represented, and the second probability, obtained as 0.5 minus the mean squared deviation, collects sample data of that category with a slightly smaller probability.
In this embodiment, the mean squared deviation is computed from the current category distribution and the target category distribution, and the first or second probability is obtained by letting the probability fluctuate above or below 0.5 by the mean squared deviation; the collection probability thus meets the need to adjust the categories without making the category distribution of the samples oscillate too strongly.
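For reference, the rule of this embodiment can be restated compactly as follows, with p_i the target proportion of category i, p̂_i its current proportion, n_c the number of categories, n_tar the size of the sample collection data set, and j the category of the incoming sample:

```latex
\mathrm{mse} = \frac{1}{n_c}\sum_{i=1}^{n_c}\left(\hat{p}_i - p_i\right)^2 ,
\qquad
p_{\mathrm{acc}} =
\begin{cases}
0.5 + \mathrm{mse}, & \hat{p}_j \le p_j \\
0.5 - \mathrm{mse}, & \hat{p}_j > p_j
\end{cases}
\qquad \text{when } \mathrm{mse} \le 5/n_{\mathrm{tar}} .
```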
In other embodiments, the mean of the per-category proportions in the current category distribution and the mean of the per-category proportions in the target category distribution of the sample collection data set can each be computed, and the difference between the two means calculated; when the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, the first probability, obtained as 0.5 plus the difference, is determined as the collection probability for sample data of that category, and when the current proportion is greater than the target proportion, the second probability, obtained as 0.5 minus the difference, is determined as the collection probability for sample data of that category.
Fig. 3 is a schematic flowchart of a method for determining a collection probability in another embodiment of the present invention. As shown in Fig. 3, the method for determining a collection probability shown in Fig. 2 may further include:
Step S1314: when the mean squared deviation is greater than the error threshold set according to the total number of samples of the sample collection data set: if the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, determining the first probability, taken from the end of the range (0.5, 1) close to 1, as the collection probability for sample data of that category; if the current proportion is greater than the target proportion, determining the second probability, taken from the end of the range (0, 0.5) close to 0, as the collection probability for sample data of that category.
In step S1314, when the mean squared deviation is greater than the error threshold set according to the total number of samples of the sample collection data set, the current category distribution differs considerably from the target category distribution. "Close to 1" can mean a value in the range (0.75, 1), for example 0.9 or 0.99; "close to 0" can mean a value in the range (0, 0.25), for example 0.1 or 0.15. In this case, using a value from the end of (0.5, 1) close to 1 as the collection probability allows the target proportion of the needed categories to be reached quickly, while using a value from the end of (0, 0.5) close to 0 as the collection probability slows, as far as possible, the growth of the amount of data of the categories that already have too many samples.
In this embodiment, when the current category distribution differs greatly from the target distribution, setting a very large collection probability quickly increases the collection speed for sample data of the needed categories, and setting a very small collection probability reduces, as far as possible, the collection of sample data of the unneeded categories, so that the sample data of a given category can reach its target proportion as soon as possible.
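Putting the two cases of Figs. 2 and 3 together, the collection probability could be computed roughly as in the following Python sketch. The function and parameter names are illustrative; the default near-1 and near-0 values 0.99 and 0.01 follow the example values given in the description.

```python
def collection_probability(current_dist, target_dist, category, n_tar,
                           p_near_one=0.99, p_near_zero=0.01):
    """Collection probability for a new sample of `category`.

    current_dist / target_dist: dicts mapping category -> proportion.
    n_tar: size of the fixed-size sample collection data set.
    """
    n_c = len(target_dist)
    mse = sum((current_dist.get(c, 0.0) - p) ** 2
              for c, p in target_dist.items()) / n_c
    under_represented = current_dist.get(category, 0.0) <= target_dist[category]

    if mse <= 5.0 / n_tar:
        # Distributions already close: nudge the probability around 0.5.
        return 0.5 + mse if under_represented else 0.5 - mse
    # Distributions far apart: strongly favor under-represented categories.
    return p_near_one if under_represented else p_near_zero
```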
Fig. 4 is a schematic flowchart of a method for adding sample data to be collected to a sample collection data set according to a collection probability in an embodiment of the present invention. As shown in Fig. 4, in step S140 above, adding the sample data to be collected to the sample collection data set according to the collection probability may include:
Step S141: generating a random number;
Step S142: when the random number is less than or equal to the collection probability, adding the sample data to be collected to the sample collection data set; when the random number is greater than the collection probability, not adding the sample data to be collected to the sample collection data set.
In step S141, the random number can be generated by any of various random number generators. In step S142, when the random number is less than or equal to the collection probability, it is determined that the sample data to be collected needs to be added to the sample collection data set; an addition flag can then be returned, and the sample data to be collected is added to the sample collection data set according to the addition flag. When the random number is greater than the collection probability, it is determined that the sample data to be collected does not need to be added to the sample collection data set; the sample data can then simply be discarded, or handled in some other way.
In this embodiment, by using a random number to collect sample data with the previously determined collection probability, the sample data is collected automatically according to the target category distribution.
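This acceptance test is simply a Bernoulli draw; a minimal sketch (the function name `accept` is a placeholder):

```python
import random

def accept(p_collect: float) -> bool:
    """Steps S141-S142: keep the new sample with probability p_collect."""
    return random.random() <= p_collect
```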
In some embodiments, in step S142 above, adding the sample data to be collected to the sample collection data set when the random number is less than or equal to the collection probability may include:
when the random number is less than or equal to the collection probability and the sample collection data set is full, replacing the sample data added to the sample collection data set earliest with the sample data to be collected.
In this embodiment, when the sample collection data set is full, old sample data is replaced with new sample data, so the required sample data can be collected while the size of the data set is kept fixed.
If the sample collection data set is not full, the sample data to be collected can be added to it directly, which increases the collection speed.
In other embodiments, if the sample collection data set is full, a sample can instead be evicted from the category whose current proportion exceeds its target proportion by the largest margin, which increases the speed at which the target category distribution is reached.
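A minimal sketch of the fixed-size data set with oldest-first replacement (an illustration, not the patent's implementation): a Python deque created with maxlen drops its earliest entry automatically when an item is appended while it is full.

```python
from collections import deque

def add_sample(dataset: deque, labels: deque, sample, label) -> None:
    """Add an accepted sample; when full, the earliest entry is evicted.

    `dataset` and `labels` share the same maxlen (n_tar), so the sample
    collection data set and the category label data set stay in step.
    """
    dataset.append(sample)
    labels.append(label)

n_tar = 4
dataset, labels = deque(maxlen=n_tar), deque(maxlen=n_tar)
for i, lab in enumerate((1, 2, 1, 1, 2)):
    add_sample(dataset, labels, f"sample{i}", lab)
print(list(labels))  # [2, 1, 1, 2] - the first entry was replaced
```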
Fig. 5 is a schematic flowchart of a data collection method according to another embodiment of the present invention. As shown in Fig. 5, the data collection method shown in Fig. 1 may, after step S140, i.e. after the sample data to be collected has been added to the sample collection data set according to the collection probability, further include:
Step S150: adding the category label corresponding to the sample data to be collected to the category label data set.
The category label data set is used to store the category label of each sample data item in the sample collection data set, and the size of the category label data set is the same as the size of the sample collection data set.
In this embodiment, the sample data to be collected has been added to the sample collection data set according to the collection probability, that is, it has already been determined that the sample data to be collected is added to the sample collection data set. In this case, adding the corresponding category label to the category label data set keeps the category label data set synchronized, so that the category labels in the category label data set correspond to the collected sample data, which makes it convenient to count the current proportion of each category of sample data.
To help those skilled in the art better understand the present invention, the implementation process of the present invention is illustrated below with a specific embodiment.
Fig. 6 is a schematic flowchart of a data collection method according to an embodiment of the present invention. As shown in Fig. 6, suppose a signal source continuously generates sample data of different categories, each generated sample consisting of the data itself (metadata) and a category label. The data collection method of the embodiment of the present invention analyzes and counts the sample data in the data stream and then decides whether to add each sample to a fixed-size sample collection data set.
Define the signal source as S; it continuously generates sample data D_sm. Each generated sample D_sm contains metadata d and a category label l, that is, D_sm = {d, l}. Suppose the number of categories among the labels is n_c. In this embodiment, the sample data generated by signal source S is collected with a fixed-size sample collection data set. A fixed-size data set is a data set with a maximum number of items it can hold; for example, if the total size of the sample collection data set is n, the sample collection data set can hold at most n sample data items. When the sample collection data set is full but new sample data still needs to be added, old sample data in the sample collection data set is replaced according to a set rule, for example by replacing the currently oldest sample data in the sample collection data set with the new sample data.
First, the data sets used for statistics and the data set used to store the actual sample data can be initialized. The data sets used for statistics may store only the category labels of the sample data, and may include a data stream statistics data set D_st1, a category label data set D_st2, and so on. The data set used to store the actual sample data may store only the metadata of the samples, or may store data pairs consisting of metadata and the corresponding category label (sample data), and may include the sample collection data set D_tar.
The data stream statistics data set D_st1 can be used to estimate the probability with which each category of sample data occurs in the data stream; its size may be set, for example, to n_st1 = n_c * 100. The larger the total size of D_st1, the higher the precision of the estimated occurrence probability of each category. At the beginning, D_st1 collects every sample in the data stream; once D_st1 is full, new sample data replaces the currently oldest sample data in D_st1.
The category label data set D_st2 can be used to count the proportion of each category of sample data currently stored in the sample collection data set D_tar; its total size n_st2 is the same as the total size n_tar of D_tar, that is, n_st2 = n_tar. The difference between D_st2 and D_tar is that D_st2 stores only the category labels of the samples, not their metadata (the data itself). When a new sample arrives, the statistics of D_st2 are evaluated by a set of rules to decide whether the new sample is added to the sample collection data set D_tar.
Suppose the required distribution of the sample data in D_tar is that each category of sample data occupies an equal proportion of D_tar. The target category distribution is the desired distribution of sample categories, i.e. d_dst = {p_i | i = 1, 2, 3, ..., n_c}, where p_i is the desired proportion of the i-th category in D_tar. The current category distribution is the distribution of the sample categories currently in D_tar, i.e. d̂_dst = {p̂_i | i = 1, 2, 3, ..., n_c}, where p̂_i is the current proportion of the i-th category of sample data in D_tar. While D_tar is not full, every new sample is added; once it is full, the set judgment rules are applied to the statistics of D_st2, and the result of the judgment decides whether the new sample is added to D_tar.
The data collection process may include the following steps:
(1) Update the data stream statistics data set D_st1. This gives the distribution of the current data stream and can be used to count the category distribution of the sample data in the stream, as a reference for the subsequent probability values. D_st1 can also be used to buffer new sample data, which is taken out of it when needed and added to the sample collection data set D_tar. In other embodiments there may be no data stream statistics data set D_st1; instead, the new sample data is received directly from the signal source and judged for addition to D_tar.
(2) The categories can be divided into two sets according to the desired category distribution (the target category distribution). When the desired proportion (target proportion) of a category is greater than the proportion of sample data of that category currently in D_tar, i.e. D_tar contains too little sample data of that category, a new sample of that category is assigned to the set S_x; otherwise it is assigned to the set S_y.
(3) Compute the mean squared error mse between the current category distribution and the target category distribution, i.e. mse = (1/n_c) Σ_{i=1}^{n_c} (p̂_i − p_i)². If mse < 5/n_st2, execute step (4); otherwise execute step (5).
(4) If the category of the new sample is in the set S_x, the new sample is added to the sample collection data set D_tar with probability p_acc = 0.5 + mse; otherwise it is added to D_tar with probability p_acc = 0.5 − mse. Then step (6) is executed.
(5) If the category of the new sample is in the set S_x, the new sample is added to D_tar with probability p_xthr; otherwise it is added to D_tar with probability p_ythr. The probabilities p_xthr and p_ythr are set thresholds. The value range of p_xthr can be (0.5, 1); to reach the target proportion faster, p_xthr can be set to 0.99. The value range of p_ythr can be (0, 0.5); to reach the target proportion faster, p_ythr can be set to 0.01. Then step (6) is executed.
(6) Generate a random number. If the random number is less than the probability obtained above, it is determined that the new sample is added to the sample collection data set D_tar and True is returned; otherwise False is returned. The result of the judgment is then returned to the sample collection data set D_tar: if True is returned, the new sample is added to D_tar and its category label is added to the category label data set D_st2; otherwise the data is discarded. While D_tar is not full, every new sample is added to D_tar and the category label of each new sample is added to D_st2; once D_tar is full, the judgment rules above are executed to decide whether to add the new sample.
This embodiment addresses the case where new data is generated continuously, and solves the category imbalance problem of the data produced by the data stream in a situation where the overall amount of data is unknown and the problem cannot be solved with the prior art by resampling a complete data set to a fixed size. The total number of samples need not be known; data is extracted from the data stream in real time. New data can replace old data in the data set, so the contents of the data set are kept up to date. This solves the problem of how to collect a fixed-size data set while new samples are continuously generated, and how to decide, when the data set is full, whether existing data in the data set should be replaced with new data.
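The six steps above can be combined into a single collector. The following Python sketch is one possible reading of this embodiment, not the patent's reference implementation; the class and method names are illustrative, a balanced target distribution is assumed, and the buffering data set D_st1 is omitted for brevity.

```python
import random
from collections import Counter, deque

class SampleCollector:
    """Fixed-size collector steering the stored samples toward a target
    category distribution (one reading of steps (1)-(6) above)."""

    def __init__(self, n_tar, target_dist, p_xthr=0.99, p_ythr=0.01):
        self.d_tar = deque(maxlen=n_tar)   # sample collection data set D_tar
        self.d_st2 = deque(maxlen=n_tar)   # category label data set D_st2
        self.target = target_dist          # {category: desired proportion p_i}
        self.p_xthr, self.p_ythr = p_xthr, p_ythr

    def current_proportion(self, category):
        if not self.d_st2:
            return 0.0
        return Counter(self.d_st2)[category] / len(self.d_st2)

    def _collection_probability(self, category):
        # Same rule as the standalone sketch given earlier.
        n_c = len(self.target)
        mse = sum((self.current_proportion(c) - p) ** 2
                  for c, p in self.target.items()) / n_c
        in_s_x = self.current_proportion(category) <= self.target[category]
        if mse < 5.0 / self.d_st2.maxlen:      # step (4): distributions close
            return 0.5 + mse if in_s_x else 0.5 - mse
        return self.p_xthr if in_s_x else self.p_ythr  # step (5): far apart

    def offer(self, metadata, label):
        """Receive one sample D_sm = {d, l} from the stream; return True if kept."""
        if len(self.d_tar) < self.d_tar.maxlen:      # not full: always add
            self.d_tar.append((metadata, label))
            self.d_st2.append(label)
            return True
        if random.random() < self._collection_probability(label):  # step (6)
            self.d_tar.append((metadata, label))     # deque evicts the oldest
            self.d_st2.append(label)
            return True
        return False

# Usage: two categories with a balanced target, fed by a 4:1 imbalanced stream.
collector = SampleCollector(n_tar=1000, target_dist={1: 0.5, 2: 0.5})
for _ in range(20000):
    label = 1 if random.random() < 0.8 else 2
    collector.offer(metadata=random.random(), label=label)
print(Counter(collector.d_st2))  # far less skewed than the 4:1 input stream
```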
Based on the same inventive concept as the data collection method shown in Fig. 1, an embodiment of the present invention also provides a data collection device, as described in the following embodiments. Since the principle by which the data collection device solves the problem is similar to that of the data collection method, the implementation of the device can refer to the implementation of the method, and repeated descriptions are omitted.
Fig. 7 is a schematic structural diagram of a data collection device according to an embodiment of the present invention. As shown in Fig. 7, the data collection device of some embodiments may include: a data receiving unit 210, a current proportion obtaining unit 220, a collection probability determining unit 230, and a data collection unit 240, connected in sequence.
The data receiving unit 210 is configured to receive sample data to be collected;
the current proportion obtaining unit 220 is configured to obtain the current proportion, in a sample collection data set, of sample data belonging to the same category as the sample data to be collected, the sample collection data set being a fixed-size data set;
the collection probability determining unit 230 is configured to determine a collection probability for sample data of the category of the sample data to be collected, according to the current proportion and a target proportion for sample data of that category;
the data collection unit 240 is configured to add the sample data to be collected to the sample collection data set according to the collection probability, for use in training a neural network model.
In some embodiments, the current proportion obtaining unit 220 may include a current proportion obtaining module.
The current proportion obtaining module is configured to count the proportion, in a category label data set, of labels of the category of the sample data to be collected, to obtain the current proportion, in the sample collection data set, of sample data of that category; the category label data set is used to store the category label of each sample data item in the sample collection data set, and the size of the category label data set is the same as the size of the sample collection data set.
In some embodiments, the collection probability determining unit 230 may include a collection probability determination module.
The collection probability determination module is configured to determine a first probability as the collection probability for sample data of the category of the sample data to be collected when the current proportion is less than or equal to the target proportion of that category, and to determine a second probability as the collection probability when the current proportion is greater than the target proportion; the first probability is greater than the second probability.
Fig. 8 is a schematic structural diagram of a collection probability determination module in an embodiment of the present invention. As shown in Fig. 8, the collection probability determination module may include: a current category distribution obtaining module 2311, a mean squared deviation computing module 2312, and a first collection probability generating module 2313, connected in sequence.
The current category distribution obtaining module 2311 is configured to obtain the current category distribution of the sample data in the sample collection data set;
the mean squared deviation computing module 2312 is configured to compute the mean squared deviation between the current category distribution and the target category distribution of the sample data in the sample collection data set;
the first collection probability generating module 2313 is configured, when the mean squared deviation is less than or equal to an error threshold set according to the total number of samples of the sample collection data set, to determine the first probability, obtained as 0.5 plus the mean squared deviation, as the collection probability for sample data of the category of the sample data to be collected when the current proportion is less than or equal to the target proportion of that category, and to determine the second probability, obtained as 0.5 minus the mean squared deviation, as the collection probability for sample data of that category when the current proportion is greater than the target proportion.
Fig. 9 is a schematic structural diagram of a collection probability determination module in another embodiment of the present invention. As shown in Fig. 9, the collection probability determination module shown in Fig. 8 may further include a second collection probability generating module 2314, connected to the mean squared deviation computing module 2312.
The second collection probability generating module 2314 is configured, when the mean squared deviation is greater than the error threshold set according to the total number of samples of the sample collection data set, to determine the first probability, taken from the end of the range (0.5, 1) close to 1, as the collection probability for sample data of the category of the sample data to be collected when the current proportion is less than or equal to the target proportion of that category, and to determine the second probability, taken from the end of the range (0, 0.5) close to 0, as the collection probability for sample data of that category when the current proportion is greater than the target proportion.
Fig. 10 is a schematic structural diagram of a data collection unit in an embodiment of the present invention. As shown in Fig. 10, the data collection unit 240 may include a random number generating module 241 and a data collection module 242, connected to each other.
The random number generating module 241 is configured to generate a random number;
the data collection module 242 is configured to add the sample data to be collected to the sample collection data set when the random number is less than or equal to the collection probability, and not to add the sample data to be collected to the sample collection data set when the random number is greater than the collection probability.
In some embodiments, the data collection module 242 may include a sample collection data set updating module.
The sample collection data set updating module is configured, when the random number is less than or equal to the collection probability and the sample collection data set is full, to replace the sample data added to the sample collection data set earliest with the sample data to be collected.
Fig. 11 is a schematic structural diagram of a data collection device according to another embodiment of the present invention. As shown in Fig. 11, the data collection device shown in Fig. 7 may further include a category label data set updating module 250, connected to the data collection unit 240.
The category label data set updating module 250 is configured to add the category label corresponding to the sample data to be collected to the category label data set.
An embodiment of the present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor, when executing the program, implements the steps of the method of the above embodiments.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; the program, when executed by a processor, implements the steps of the method of the above embodiments.
In summary, with the data collection method, data collection device, computer device, and computer-readable storage medium of the embodiments of the present invention, sample data is collected into a fixed-size data set, so the current proportion of each category of sample data is known; a reasonable collection probability can be determined from the current proportion and the desired proportion of a given category; and new sample data is added to the data set, or not, according to the collection probability, so that the sample data in the data set moves closer to the category distribution required for neural network model training. Therefore, this scheme can collect a sample data set whose category distribution meets the requirements of machine learning while new sample data is continuously generated.
In the description of this specification, reference to the terms "one embodiment", "a specific embodiment", "some embodiments", "for example", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The order of the steps involved in the embodiments is used to illustrate the implementation of the present invention schematically; the order of the steps is not limited and can be adjusted as needed.
It should be understood by those skilled in the art that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical memory, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The specific embodiments described above further describe the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (10)
1. A data collection method, characterized by comprising:
receiving sample data to be collected;
obtaining a current proportion, in a sample collection data set, of sample data belonging to the same category as the sample data to be collected, the sample collection data set being a fixed-size data set;
determining a collection probability for sample data of the category of the sample data to be collected, according to the current proportion and a target proportion for sample data of that category;
adding the sample data to be collected to the sample collection data set according to the collection probability, for use in training a neural network model.
2. The data collection method of claim 1, characterized in that obtaining the current proportion, in the sample collection data set, of sample data belonging to the same category as the sample data to be collected comprises:
counting the proportion, in a category label data set, of labels of the category of the sample data to be collected, to obtain the current proportion, in the sample collection data set, of sample data of that category; the category label data set is used to store the category label of each sample data item in the sample collection data set, and the size of the category label data set is the same as the size of the sample collection data set.
3. The data collection method of claim 1, characterized in that determining the collection probability for sample data of the category of the sample data to be collected according to the current proportion and the target proportion of sample data of that category comprises:
when the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, determining a first probability as the collection probability for sample data of that category, and when the current proportion is greater than the target proportion, determining a second probability as the collection probability; the first probability is greater than the second probability.
4. The data collection method of claim 3, characterized in that determining the first probability as the collection probability for sample data of the category of the sample data to be collected when the current proportion is less than or equal to the target proportion of that category, and determining the second probability as the collection probability when the current proportion is greater than the target proportion, comprises:
obtaining a current category distribution of the sample data in the sample collection data set;
computing a mean squared deviation between the current category distribution and a target category distribution of the sample data in the sample collection data set;
when the mean squared deviation is less than or equal to an error threshold set according to the total number of samples of the sample collection data set: when the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, determining the first probability, obtained as 0.5 plus the mean squared deviation, as the collection probability for sample data of that category; when the current proportion is greater than the target proportion, determining the second probability, obtained as 0.5 minus the mean squared deviation, as the collection probability for sample data of that category.
5. The data collection method of claim 4, characterized in that determining the first probability as the collection probability for sample data of the category of the sample data to be collected when the current proportion is less than or equal to the target proportion of that category, and determining the second probability as the collection probability when the current proportion is greater than the target proportion, further comprises:
when the mean squared deviation is greater than the error threshold set according to the total number of samples of the sample collection data set: when the current proportion is less than or equal to the target proportion of the category of the sample data to be collected, determining the first probability, taken from the end of the range (0.5, 1) close to 1, as the collection probability for sample data of that category; when the current proportion is greater than the target proportion, determining the second probability, taken from the end of the range (0, 0.5) close to 0, as the collection probability for sample data of that category.
6. method of data capture as described in claim 1, which is characterized in that according to the collection probability by the sample to be collected
Notebook data is added in the sample collection data set, comprising:
Generate a random number;
In the case where the random number is less than or equal to the collection probability, the sample data to be collected is added to described
In sample collection data set;In the case where the random number is greater than the collection probability, not by the sample data to be collected
It is added in the sample collection data set.
7. method of data capture as claimed in claim 6, which is characterized in that be less than or equal to the collection in the random number
In the case where probability, the sample data to be collected is added in the sample collection data set, comprising:
In the case where the random number is less than or equal to the collection probability, if the sample collection data set has been expired, utilize
The sample data to be collected replaces the sample data added earliest in the sample collection data set.
8. method of data capture as claimed in claim 2, which is characterized in that according to the collection probability by the sample to be collected
After notebook data is added in the sample collection data set, further includes:
It is added to described in the class label data set wait collect class label corresponding to sample data.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811542893.7A CN109740750B (en) | 2018-12-17 | 2018-12-17 | Data collection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109740750A true CN109740750A (en) | 2019-05-10 |
CN109740750B CN109740750B (en) | 2021-06-15 |
Family
ID=66360404
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811542893.7A Active CN109740750B (en) | 2018-12-17 | 2018-12-17 | Data collection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109740750B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112529172A (en) * | 2019-09-18 | 2021-03-19 | 华为技术有限公司 | Data processing method and data processing apparatus |
US20220147668A1 (en) * | 2020-11-10 | 2022-05-12 | Advanced Micro Devices, Inc. | Reducing burn-in for monte-carlo simulations via machine learning |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105975992A (en) * | 2016-05-18 | 2016-09-28 | 天津大学 | Unbalanced data classification method based on adaptive upsampling |
CN105975993A (en) * | 2016-05-18 | 2016-09-28 | 天津大学 | Unbalanced data classification method based on boundary upsampling |
CN106897918A (en) * | 2017-02-24 | 2017-06-27 | 上海易贷网金融信息服务有限公司 | A kind of hybrid machine learning credit scoring model construction method |
CN106909981A (en) * | 2015-12-23 | 2017-06-30 | 阿里巴巴集团控股有限公司 | Model training, sample balance method and device and personal credit points-scoring system |
CN108647727A (en) * | 2018-05-10 | 2018-10-12 | 广州大学 | Unbalanced data classification lack sampling method, apparatus, equipment and medium |
CN108694413A (en) * | 2018-05-10 | 2018-10-23 | 广州大学 | Adaptively sampled unbalanced data classification processing method, device, equipment and medium |
CN108920477A (en) * | 2018-04-11 | 2018-11-30 | 华南理工大学 | A kind of unbalanced data processing method based on binary tree structure |
CN108960561A (en) * | 2018-05-04 | 2018-12-07 | 阿里巴巴集团控股有限公司 | A kind of air control model treatment method, device and equipment based on unbalanced data |
Also Published As
Publication number | Publication date |
---|---|
CN109740750B (en) | 2021-06-15 |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
2020-02-01 | TA01 | Transfer of patent application right | Effective date of registration: 20200201. Address after: Room 2, Building 3, Building 30, Xing Xing Street, Shijingshan District, Beijing 100041; Applicant after: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd. Address before: First floor of the western small building, No. 18 Xue Qing Lu Jia, Beijing 100083; Applicant before: Beijing Shenji Intelligent Technology Co., Ltd. |
| GR01 | Patent grant | |