
CN101067930A - Intelligent audio frequency identifying system and identifying method - Google Patents


Info

Publication number
CN101067930A
Authority
CN
China
Prior art keywords
audio data
feature vector
data
identified
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200710075008
Other languages
Chinese (zh)
Other versions
CN101067930B (en)
Inventor
徐扬生
覃剑钊
程俊
吴新宇
李崇国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN 200710075008 priority Critical patent/CN101067930B/en
Publication of CN101067930A publication Critical patent/CN101067930A/en
Priority to PCT/CN2008/000765 priority patent/WO2008148289A1/en
Application granted granted Critical
Publication of CN101067930B publication Critical patent/CN101067930B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention relates to an intelligent audio identification system and method. The system comprises an audio data set for collecting and storing various sample audio data, a training unit, and an identification unit. The training unit extracts feature vectors from the sample audio data and finds and establishes a mapping from the sample feature vectors to their classes. The identification unit stores the data of this mapping, extracts the feature vector of the audio data to be identified, and outputs an identification result based on that feature vector.

Description

Intelligent audio identification system and identification method
Technical field
The present invention relates to a system and method capable of automatically identifying audio data.
Background technology
Hearing is one of the important ways in which humans obtain information about the outside world, and it is also an important channel for judging what is happening around them. For example, on hearing a dog barking, one can infer that a dog is nearby; on hearing a scream, one can infer that someone nearby may be in danger. Analyzing audio can therefore provide a great deal of important information. At present, most audio analysis systems mainly preprocess the collected raw audio, for example by removing noise or by extracting or enhancing audio with specific characteristics, but the final identification of the audio still requires human involvement. In many natural application scenarios, however, different sounds need to be identified automatically. For example, wildlife researchers working in the field may need to spend a great deal of time tracking rare wild animals; an automatic audio identification system that recognizes the call of a particular wild animal and sends a signal when that call is detected could help the researchers track it. Similarly, an automatic audio identification system installed in an elevator or a home could automatically identify abnormal sounds such as screams, the noise of quarreling and fighting, impact sounds, breaking glass, explosions and gunshots, and send an alarm signal to monitoring personnel, thereby shortening the time it takes them to respond to abnormal situations. Automatic audio identification therefore has important and wide-ranging application value.
Summary of the invention
The technical problem to be solved by the present invention is to provide an intelligent audio identification system and an automatic identification method that identify audio data automatically.
The technical solution adopted by the present invention to solve the above technical problem is as follows:
An intelligent audio identification method comprises the following steps (a minimal sketch of this whole pipeline follows the list):
A. collecting various sample audio data and labeling the collected sample audio data;
B. extracting, one by one from the sample audio data, feature vectors that reflect their essential characteristics;
C. dividing class regions according to the feature vectors, so that each class region after the division contains as many feature vectors of that class as possible, and building a classifier that maps feature vectors to their classes;
D. processing the audio data to be identified and extracting its feature vector;
E. inputting the feature vector of the audio data to be identified into the classifier, which discriminates according to this feature vector and produces the identification result for the audio data.
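For illustration only, the following minimal sketch walks through steps A to E under assumed conditions: the labeled feature vectors are synthetic stand-ins for real audio features, and scikit-learn's SVC is a freely chosen classifier (the method itself does not prescribe any particular library).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Steps A/B (stand-in): labeled feature vectors for 3 hypothetical sound classes.
# Real feature vectors would come from the audio itself (see the extraction sketch below).
n_per_class, dim = 50, 2
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
train_x = np.vstack([c + rng.normal(scale=0.8, size=(n_per_class, dim)) for c in centers])
train_y = np.repeat([0, 1, 2], n_per_class)      # 0, 1, 2 = hypothetical sound classes

# Step C: train a classifier mapping feature vectors to classes
# (a linear SVM is one of the classifiers the description mentions).
clf = SVC(kernel="linear", decision_function_shape="ovo")
clf.fit(train_x, train_y)

# Steps D/E: classify the feature vector of an unknown sound.
unknown = np.array([[4.6, 0.3]])                 # pretend this came from new audio
print("predicted class:", clf.predict(unknown)[0])
```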
In the above method, step B comprises the following steps:
B1. preprocessing the sample audio data to obtain training data;
B2. extracting from the training data the characteristic components that reflect its essential characteristics;
B3. combining the characteristic components to obtain the feature vectors.
In the above method, step D comprises the following steps:
D1. preprocessing the audio data to be identified to obtain identification data;
D2. extracting from the identification data the characteristic components that reflect its essential characteristics;
D3. combining the characteristic components to obtain the feature vector.
In the above method, the characteristic components of step B2 or D2 include: the center frequency of the audio, the energy features of the audio in certain characteristic frequency bands, or the energy distribution features of the audio over several time periods.
In the above method, the feature vector of step B3 or D3 is the vector formed by the center frequency of the audio together with the sums of the audio energy spectrum over the characteristic frequency bands.
In the above method, the class regions in step C are divided according to the numerical values of the feature vectors and are bounded by curves or surfaces.
In the above method, step E comprises the following processing:
E1. inputting the feature vector of the audio data to be identified into the classifier, which discriminates according to this feature vector and produces a classification result, namely the class to which the audio data is assigned, together with a rejection index, the rejection index being a parameter that measures the confidence of the classification result;
E2. judging the confidence of the classification result according to the rejection index: when the rejection index is above a preset threshold, the classification result is judged to be credible and the classifier outputs the class of the audio data to be identified; when the rejection index is below the preset threshold, the classifier outputs the class of the audio data to be identified but indicates that this classification result is not credible.
In the above method, step A includes identifying the collected sample audio data, that is, determining and labeling what sound each sample is.
An intelligent audio identification system comprises an audio data set for collecting and storing various sample audio data, a training unit, and an identification unit. The training unit extracts the feature vectors of the sample audio data and finds and establishes a mapping from the sample feature vectors to their classes. The identification unit stores the data of the established mapping between feature vectors and classes, extracts the feature vector of the audio data to be identified, and produces the identification result based on that feature vector.
In the above system, the training unit comprises a first preprocessing module, a first feature extraction module and a training module. The first preprocessing module denoises the sample audio data to obtain training data; the first feature extraction module extracts the feature vectors of the sample audio data from the training data; and the training module finds and establishes the mapping from the sample feature vectors to their classes.
In the above system, the identification unit comprises a second preprocessing module, a second feature extraction module and a classifier. The second preprocessing module denoises the audio data to be identified to obtain identification data; the second feature extraction module extracts the feature vector of the audio data to be identified from the identification data; and the classifier stores the mapping data output by the training module and outputs the identification result according to the input feature vector of the audio data to be identified.
The beneficial effect of the present invention is that the intelligent audio identification system and method can identify audio data automatically, and the system has good real-time performance and good extensibility.
Description of drawings
Fig. 1 is a block diagram of the system of the present invention;
Fig. 2 is a block diagram of the training unit of the present invention;
Fig. 3 is a block diagram of the identification unit of the present invention;
Fig. 4 is a schematic diagram of establishing the mapping from feature vectors to classes when the sample audio data belong to four classes;
Fig. 5 is a schematic diagram of establishing the mapping from feature vectors to classes when the sample audio data belong to two classes.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings and embodiments:
As shown in Fig. 1, an intelligent audio identification system comprises at least an audio data set 1 for collecting and storing various sample audio data, a training unit 2 and an identification unit 3. The training unit 2 extracts the feature vectors of the sample audio data and finds and establishes the mapping from the sample feature vectors to their classes. The identification unit 3 stores the data of the established mapping between feature vectors and classes, extracts the feature vector of the audio data to be identified, and produces the identification result according to that feature vector. As shown in Fig. 2, the training unit comprises a first preprocessing module 21, a first feature extraction module 22 and a training module 23. As shown in Fig. 3, the identification unit comprises a second preprocessing module 31, a second feature extraction module 32 and a classifier 33.
The purpose of building the audio data set 1 is to provide the necessary learning samples for the subsequent training unit 2. Audio data are collected according to the classes of audio that the user wants to identify. The data set can be built by recording audio oneself, collecting audio material from the network, purchasing audio CDs, and so on. In general, multiple samples need to be collected for each class of audio, and during collection the samples must be labeled manually, that is, a person listens to each collected sample and determines what sound it is. To guarantee the identification performance of the system, as many samples as possible should be collected.
In the training unit 2, the collected sample audio data are first preprocessed: the preprocessing module 21 removes noise and other interference from the sample audio data taken from the audio data set 1, separating the sample audio to be identified from the complex audio background and producing the processed training data. Then the feature extraction module 22 extracts from the training data the components that reflect the essential characteristics of the sample audio data, such as the center frequency of the audio, the energy features in certain frequency bands (which can be obtained by applying a Fourier transform to the audio signal), or the energy distribution over several time periods, and combines these characteristics to obtain the corresponding feature vector. For example, if the center frequency of a sample is 33 and the sum of the energy spectrum of an audio segment is 1000, the resulting feature vector is the vector (33, 1000) formed from the center frequency and the energy spectrum sum. Next, the training module 23 uses the extracted features to train the classifier 33 that is used to identify audio. Training the classifier means that the training module 23, based on the feature vectors of the N classes of sample audio data, searches for classification curves or surfaces that divide the feature space into N class regions, with the feature vectors of each sample class falling in its own region; the class regions are divided according to the numerical values of the feature vectors, which establishes a mapping from the feature vector space to the classes. For example, when the sample audio data belong to only four different classes, that is, a four-class classification problem with two-dimensional feature vectors, the training module 23 is equivalent to finding two straight lines such that the feature vectors of the four sample classes fall in the four regions separated by the two lines, as shown in Fig. 4: the triangles are the class-1 feature vectors obtained during training, the circles are the class-2 feature vectors, the pentagrams are the class-3 feature vectors, and the pentagons are the class-4 feature vectors; line 1 and line 2 are the classification lines obtained from these four classes of feature vectors, and the four regions into which the two lines divide the feature space are the four subspaces corresponding to the four classes. The training module then stores the trained data, that is, the data of the established mapping between feature vectors and classes, in the classifier 33.
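For illustration, a minimal sketch of the kind of feature vector described above, under stated assumptions: the "center frequency" is computed here as the spectral centroid (one possible interpretation), the band edges and sampling rate are invented, and NumPy is a freely chosen tool; the patent fixes none of these.

```python
import numpy as np

def feature_vector(samples: np.ndarray, sr: int,
                   bands=((0, 500), (500, 2000), (2000, 8000))) -> np.ndarray:
    """Center frequency of the clip plus energy-spectrum sums over a few
    (assumed) characteristic frequency bands, as in the (33, 1000)-style example."""
    spectrum = np.abs(np.fft.rfft(samples)) ** 2          # energy spectrum
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)     # bin frequencies in Hz
    center_freq = np.sum(freqs * spectrum) / np.sum(spectrum)   # spectral centroid
    band_energies = [spectrum[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands]
    return np.concatenate(([center_freq], band_energies))

# Tiny usage example on a synthetic 1 kHz tone (stand-in for a real audio clip).
sr = 16000
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 1000 * t)
print(feature_vector(clip, sr))
```

Energy distributions over several time periods, also mentioned above, could be appended to the same vector in the same way by summing the energy of successive segments of the clip.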
The principle for dividing the class regions in the method of the present invention is: the feature vector space is divided so that each class region after the division contains only the feature vectors of samples of that class, or contains as many feature vectors of that class as possible and as few feature vectors of other classes as possible.
The function of the identification unit 3 is to obtain the identification result for the audio data to be identified, using the classifier 33 obtained by the training of the training module 23. The second preprocessing module 31 and the second feature extraction module 32 in the identification unit perform the same functions as the first preprocessing module 21 and the first feature extraction module 22 in the training unit, respectively.
After an audio sample to be identified is acquired, it is first preprocessed by the preprocessing module 31 to obtain the processed identification data. Then the same feature extraction method as in the feature extraction module 22 is applied to the audio data to be identified to obtain its feature vector. The extracted feature vector is then fed as input to the classifier 33 (obtained from the training module 23), which outputs the identification result according to the input feature vector. For example, when the feature vector to be classified falls in the region enclosed by the upper part of line 1 and the lower part of line 2 (the hexagon in Fig. 4), the present invention discriminates this feature vector as class 1. If the feature vector to be classified falls in the region enclosed by the upper parts of line 1 and line 2 (the heptagon in Fig. 4), the present invention discriminates it as class 2, the class of the circular feature vectors; likewise, the octagon in the figure is assigned to class 3 and the hexagram to class 4.
It can be seen that the classifier produces the identification result, i.e. the class of the audio to be identified, according to the feature vector of the input audio data. The more sample audio collected in the audio data set, the more class regions are divided, the finer the classification of the audio data to be identified, and the closer the classification result is to the true sound class.
Commonly used classifiers in pattern classification systems include neural networks, support vector machines, AdaBoost, and so on. The process of obtaining linear classification surfaces with a classifier based on a linear support vector machine is described below.
First, take the two-class problem as an example:
Given the feature vectors of the two classes and their class labels, (x_1, y_1), ..., (x_l, y_l) ∈ R^n × {±1}, the separating hyperplane w of the linear support vector machine can be obtained by solving the following optimization problem:
$$\min_{w,\,b,\,\eta}\; C \sum_{i=1}^{l} \eta_i + \frac{1}{2}\|w\|^2$$
$$\text{s.t.}\quad y_i\left[\,w \cdot x_i - b\,\right] + \eta_i \ge 1,\qquad \eta_i \ge 0,\; i = 1, \dots, l$$
where C > 0 is a fixed penalty parameter. When a new feature vector x is obtained, if w·x − b ≥ 0 the feature vector is considered to belong to class 1; if w·x − b < 0 it is considered to belong to class −1. The absolute value |w·x − b| can serve as the rejection index: when |w·x − b| is greater than a certain threshold θ, the classification is considered reliable.
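As a sketch only, the two-class case can be exercised with scikit-learn's LinearSVC (an assumed toolkit choice): decision_function returns a value playing the role of w·x − b, whose sign gives the class and whose absolute value serves as the rejection index; the data and the threshold θ below are invented for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# Two synthetic classes of 2-D feature vectors, labeled +1 and -1.
x_pos = rng.normal(loc=[2.0, 2.0], scale=0.6, size=(60, 2))
x_neg = rng.normal(loc=[-2.0, -2.0], scale=0.6, size=(60, 2))
X = np.vstack([x_pos, x_neg])
y = np.array([1] * 60 + [-1] * 60)

svm = LinearSVC(C=1.0)                    # C is the fixed penalty parameter
svm.fit(X, y)

theta = 0.5                                # assumed rejection threshold
x_new = np.array([[1.8, 2.1]])
score = svm.decision_function(x_new)[0]    # plays the role of w.x - b
label = 1 if score >= 0 else -1            # sign decides the class
reliable = abs(score) > theta              # |w.x - b| as the rejection index
print(label, abs(score), "reliable" if reliable else "rejected")
```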
The one-against-one method is then used to extend this to the multi-class problem. For a k-class classification problem, the one-against-one method constructs k(k−1)/2 classification surfaces: for every pair of classes taken from the k classes, a classification surface is obtained with the two-class construction described above, giving k(k−1)/2 surfaces in total. A voting method is used to determine the class of a feature vector x. Let w_ij be the classification surface between class i and class j: if w_ij·x − b ≥ 0, one vote is cast for class i; if w_ij·x − b < 0, one vote is cast for class j. After voting over all k(k−1)/2 classification surfaces, the class with the most votes is taken as the final classification result. At the same time, each class, when it receives a vote, accumulates the corresponding value |w_ij·x − b|; the final accumulated sum serves as the rejection index, and when this sum is greater than a certain threshold θ, the classification is considered reliable.
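The voting scheme above might be realized as in the following sketch, written from scratch over pairwise LinearSVC models; the three synthetic classes, the data and the threshold are assumptions made only for illustration.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)

# Three synthetic classes of 2-D feature vectors (k = 3 -> k(k-1)/2 = 3 pairwise surfaces).
centers = {0: [0.0, 0.0], 1: [4.0, 0.0], 2: [0.0, 4.0]}
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(40, 2)) for c in centers.values()])
y = np.repeat(list(centers), 40)

# Train one linear surface w_ij per pair of classes (i, j).
pairwise = {}
for i, j in combinations(centers, 2):
    mask = (y == i) | (y == j)
    pairwise[(i, j)] = LinearSVC(C=1.0).fit(X[mask], y[mask])

def classify_with_rejection(x, theta=1.0):
    votes = {c: 0 for c in centers}
    scores = {c: 0.0 for c in centers}
    for (i, j), clf in pairwise.items():
        s = clf.decision_function(x.reshape(1, -1))[0]   # plays the role of w_ij.x - b
        # scikit-learn convention: positive score -> clf.classes_[1], negative -> clf.classes_[0]
        winner = clf.classes_[1] if s >= 0 else clf.classes_[0]
        votes[winner] += 1
        scores[winner] += abs(s)                         # accumulate |w_ij.x - b| for the voted class
    best = max(votes, key=lambda c: (votes[c], scores[c]))
    return best, scores[best], scores[best] > theta      # class, rejection index, reliable?

print(classify_with_rejection(np.array([3.8, 0.2])))
```

scikit-learn's SVC with decision_function_shape='ovo' offers a built-in one-against-one decision function; the explicit loop is kept here only to mirror the per-pair voting and accumulated rejection index described in the text.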
The identification result output by the classifier thus includes the classification result, i.e. the class of the audio to be identified, and the rejection index. Taking the hexagon in Fig. 4 as the feature vector of the audio data to be identified: since the hexagon lies in the class-1 region enclosed by the upper part of line 1 and the lower part of line 2, its class is class 1; the closer the hexagon falls to the middle of the class-1 region, the more similar it is to the feature vectors of the class-1 samples and the higher the confidence of the classification result, while the closer it falls to a classification line, the lower the confidence of the classification result.
The rejection index is a parameter that measures the confidence of the classification result. For a classifier based on probabilities, the output classification result is the probability of belonging to each class, and this probability can serve as the rejection index: if the output probabilities of all classes are below a certain value, the classification of the sample is rejected. For a classifier based on classification surfaces, the distance from the sample's feature vector to the nearest classification surface can serve as the rejection index: if this distance is smaller than a certain value, the classification of the sample is rejected.
The rejection index is used to judge the confidence of the classification result. In practical applications, the threshold can be set experimentally (for example, the present invention can build a small test set and then search for a threshold that rejects most of the unreliable samples in the test set; this value is then taken as the threshold). When the rejection index is greater than the preset threshold, the classification result given by the classifier is credible; when the rejection index is less than the preset threshold, the confidence of the classifier's result is low, and the classifier indicates that the classification result is not credible while still giving the class of the audio data to be identified. For example, as shown in Fig. 5, for a two-class problem with two-dimensional feature vectors, training a linear classifier is equivalent to finding a straight line such that the feature vectors of one sample class lie on one side of the line and the feature vectors of the other class lie on the other side. The triangles in Fig. 5 are the feature vectors of one class obtained during training, the circles are the feature vectors of the other class, and the straight line is the classification line obtained from these two classes of feature vectors. When the feature vector of the audio data to be identified lies on the left of the line (the square in Fig. 5), the present invention discriminates this feature vector as the class of the triangular feature vectors; if it lies on the right of the line (the pentagram in Fig. 5), the present invention discriminates it as the class of the circular feature vectors. Here, the sign (positive or negative) of the dot product between the parameters of the linear classification surface (which can be obtained from the normal vector of the linear classification surface) and the feature vector to be classified is used to distinguish the feature vector's class, and the absolute value of the dot product is the rejection index used to measure the confidence of the classification: the larger the rejection index (the absolute value of the dot product), the higher the confidence of the classification, and when the absolute value of the dot product is greater than the preset threshold, the classification is considered reliable.
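A minimal sketch of choosing the rejection threshold from a small held-out test set, as suggested above; the synthetic data, the train/test split and the 90% target are assumptions made only for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)

# Small synthetic two-class problem: train on one half, hold out the other
# half as the "small test set" used to pick the rejection threshold.
X = np.vstack([rng.normal([2, 2], 0.9, (80, 2)), rng.normal([-2, -2], 0.9, (80, 2))])
y = np.array([1] * 80 + [-1] * 80)
idx = rng.permutation(len(y))
train, test = idx[:100], idx[100:]

svm = LinearSVC(C=1.0).fit(X[train], y[train])

scores = svm.decision_function(X[test])          # w.x - b on the held-out set
correct = (np.sign(scores) == y[test])
rejection_index = np.abs(scores)

# Choose theta so that (say) 90% of the misclassified held-out samples fall
# below it and would therefore be flagged as unreliable.
wrong = rejection_index[~correct]
theta = np.quantile(wrong, 0.90) if wrong.size else 0.0
print("chosen threshold theta:", theta)
```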
In practical applications, this method and system can be used to identify the various sounds that exist in nature. The system can also first be trained to identify a specific set of sounds and, based on the identification results, carry out subsequent functions; training a classifier that is fast and easy to extend ensures that the system has good real-time performance and extensibility.
The intelligent audio identification system of the present invention can be used for intelligent monitoring in a variety of settings. For example, the system can be installed in an elevator to automatically identify abnormal sounds such as screams, the noise of quarreling and fighting, and impact sounds, and to send an alarm signal to the monitoring personnel, thereby shortening the response time to abnormal situations in the elevator and reducing the workload of the elevator monitoring staff. The system can also be used for home monitoring: installed indoors, it can identify abnormal noises that may occur in the home, such as breaking glass, banging at the door, explosions and gunshots, and send an alarm signal immediately after recognizing them, thus helping to prevent crimes such as burglary by breaking doors or windows. The system can also be installed outdoors to automatically identify weather-related sounds such as thunder, wind and rain, and to monitor weather conditions in real time. In addition, the system can help wildlife researchers working in the field. Such researchers often need to spend weeks or even months tracking rare wild animals; by deploying wireless sensors equipped with this system over a designated area, the call of a particular wild animal can be identified, and a signal can be sent once the call is recognized, helping the researcher track the animal. The system can also be used for the diagnosis of mechanical faults. When a machine breaks down, it emits sounds that differ from those of normal operation, and different faults produce different sounds. The system can learn from several different fault audio samples, be installed near the machine to monitor its operating sound in real time, and, after recognizing a fault sound, raise an alarm and report the likely fault class; this helps people discover mechanical faults in time and provides a basis for fault diagnosis. The system can also be applied to Internet-based audio retrieval and audio-based scene analysis.
It should be understood that those of ordinary skill in the art can make improvements or modifications in accordance with the above description, and all such improvements and modifications shall fall within the scope of protection of the appended claims of the present invention.

Claims (11)

1. An intelligent audio identification method, comprising the following steps:
A. collecting various sample audio data and labeling the collected sample audio data;
B. extracting, one by one from the sample audio data, feature vectors that reflect their essential characteristics;
C. dividing class regions according to the feature vectors, so that each class region after the division contains as many feature vectors of that class as possible, and building a classifier that maps feature vectors to their classes;
D. processing the audio data to be identified and extracting its feature vector;
E. inputting the feature vector of the audio data to be identified into the classifier, which discriminates according to this feature vector and produces the identification result for the audio data.
2. The method according to claim 1, wherein step B comprises the steps of:
B1. preprocessing the sample audio data to obtain training data;
B2. extracting from the training data the characteristic components that reflect its essential characteristics;
B3. combining the characteristic components to obtain the feature vectors.
3. The method according to claim 1, wherein step D comprises the steps of:
D1. preprocessing the audio data to be identified to obtain identification data;
D2. extracting from the identification data the characteristic components that reflect its essential characteristics;
D3. combining the characteristic components to obtain the feature vector.
4. The method according to claim 2 or 3, wherein the characteristic components of step B2 or D2 include: the center frequency of the audio, the energy features of the audio in certain characteristic frequency bands, or the energy distribution features of the audio over several time periods.
5. The method according to claim 4, wherein the feature vector of step B3 or D3 is the vector formed by the center frequency of the audio together with the sums of the audio energy spectrum over the characteristic frequency bands.
6. The method according to claim 5, wherein the class regions in step C are divided according to the numerical values of the feature vectors and are bounded by curves or surfaces.
7. The method according to claim 6, wherein step E comprises the following processing:
E1. inputting the feature vector of the audio data to be identified into the classifier, which discriminates according to this feature vector and produces a classification result, namely the class to which the audio data is assigned, together with a rejection index, the rejection index being a parameter that measures the confidence of the classification result;
E2. judging the confidence of the classification result according to the rejection index: when the rejection index is above a preset threshold, the classification result is judged to be credible and the classifier outputs the class of the audio data to be identified; when the rejection index is below the preset threshold, the classifier outputs the class of the audio data to be identified but indicates that this classification result is not credible.
8. The method according to claim 7, wherein step A includes identifying the collected sample audio data, that is, determining and labeling what sound each sample is.
9. An intelligent audio identification system, comprising an audio data set for collecting and storing various sample audio data, a training unit, and an identification unit; the training unit extracts the feature vectors of the sample audio data and finds and establishes a mapping from the sample feature vectors to their classes; the identification unit stores the data of the established mapping between feature vectors and classes, extracts the feature vector of the audio data to be identified, and produces the identification result based on that feature vector.
10. The system according to claim 9, wherein the training unit comprises a first preprocessing module, a first feature extraction module and a training module; the first preprocessing module denoises the sample audio data to obtain training data; the first feature extraction module extracts the feature vectors of the sample audio data from the training data; and the training module finds and establishes the mapping from the sample feature vectors to their classes.
11. The system according to claim 9 or 10, wherein the identification unit comprises a second preprocessing module, a second feature extraction module and a classifier; the second preprocessing module denoises the audio data to be identified to obtain identification data; the second feature extraction module extracts the feature vector of the audio data to be identified from the identification data; and the classifier stores the mapping data output by the training module and outputs the identification result according to the input feature vector of the audio data to be identified.
CN 200710075008 2007-06-07 2007-06-07 Intelligent audio frequency identifying system and identifying method Active CN101067930B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN 200710075008 CN101067930B (en) 2007-06-07 2007-06-07 Intelligent audio frequency identifying system and identifying method
PCT/CN2008/000765 WO2008148289A1 (en) 2007-06-07 2008-04-15 An intelligent audio identifying system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200710075008 CN101067930B (en) 2007-06-07 2007-06-07 Intelligent audio frequency identifying system and identifying method

Publications (2)

Publication Number Publication Date
CN101067930A true CN101067930A (en) 2007-11-07
CN101067930B CN101067930B (en) 2011-06-29

Family

ID=38880462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710075008 Active CN101067930B (en) 2007-06-07 2007-06-07 Intelligent audio frequency identifying system and identifying method

Country Status (2)

Country Link
CN (1) CN101067930B (en)
WO (1) WO2008148289A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184732A (en) * 2011-04-28 2011-09-14 重庆邮电大学 Fractal-feature-based intelligent wheelchair voice identification control method and system
CN104700833A (en) * 2014-12-29 2015-06-10 芜湖乐锐思信息咨询有限公司 Big data speech classification method
CN111370025A (en) * 2020-02-25 2020-07-03 广州酷狗计算机科技有限公司 Audio recognition method and device and computer storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6553342B1 (en) * 2000-02-02 2003-04-22 Motorola, Inc. Tone based speech recognition
JP3537727B2 (en) * 2000-03-01 2004-06-14 日本電信電話株式会社 Signal detection method, signal search method and recognition method, and recording medium
CN1258170C (en) * 2004-09-29 2006-05-31 上海交通大学 Quick refusing method for non-command in inserted speech command identifying system
CN101067930B (en) * 2007-06-07 2011-06-29 深圳先进技术研究院 Intelligent audio frequency identifying system and identifying method

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008148289A1 (en) * 2007-06-07 2008-12-11 Shenzhen Institute Of Advanced Technology An intelligent audio identifying system and method
CN101587710B (en) * 2009-07-02 2011-12-14 北京理工大学 Multiple-codebook coding parameter quantification method based on audio emergent event
CN102623007A (en) * 2011-01-30 2012-08-01 清华大学 Audio characteristic classification method based on variable duration
CN102623007B (en) * 2011-01-30 2014-01-01 清华大学 Audio characteristic classification method based on variable duration
CN102664004A (en) * 2012-03-22 2012-09-12 重庆英卡电子有限公司 Forest theft behavior identification method
CN102664004B (en) * 2012-03-22 2013-10-23 重庆英卡电子有限公司 Forest theft behavior identification method
CN103198838A (en) * 2013-03-29 2013-07-10 苏州皓泰视频技术有限公司 Abnormal sound monitoring method and abnormal sound monitoring device used for embedded system
CN103743477B (en) * 2013-12-27 2016-01-13 柳州职业技术学院 A kind of mechanical fault detection diagnostic method and equipment thereof
CN103743477A (en) * 2013-12-27 2014-04-23 柳州职业技术学院 Mechanical failure detecting and diagnosing method and apparatus
CN104464733B (en) * 2014-10-28 2019-09-20 百度在线网络技术(北京)有限公司 A kind of more scene management method and devices of voice dialogue
CN104464733A (en) * 2014-10-28 2015-03-25 百度在线网络技术(北京)有限公司 Multi-scene managing method and device of voice conversation
CN106531191A (en) * 2015-09-10 2017-03-22 百度在线网络技术(北京)有限公司 Method and device for providing danger report information
CN105138696A (en) * 2015-09-24 2015-12-09 深圳市冠旭电子有限公司 Method and device for pushing music
CN105138696B (en) * 2015-09-24 2019-11-19 深圳市冠旭电子股份有限公司 A kind of music method for pushing and device
CN105679313A (en) * 2016-04-15 2016-06-15 福建新恒通智能科技有限公司 Audio recognition alarm system and method
CN107801090A (en) * 2017-11-03 2018-03-13 北京奇虎科技有限公司 Utilize the method, apparatus and computing device of audio-frequency information detection anomalous video file
CN108764304A (en) * 2018-05-11 2018-11-06 Oppo广东移动通信有限公司 scene recognition method, device, storage medium and electronic equipment
CN108764304B (en) * 2018-05-11 2020-03-06 Oppo广东移动通信有限公司 Scene recognition method and device, storage medium and electronic equipment
CN108764114A (en) * 2018-05-23 2018-11-06 腾讯音乐娱乐科技(深圳)有限公司 A kind of signal recognition method and its equipment, storage medium, terminal
CN108764341B (en) * 2018-05-29 2019-07-19 中国矿业大学 A kind of Fault Diagnosis of Roller Bearings under the conditions of variable working condition
CN108764341A (en) * 2018-05-29 2018-11-06 中国矿业大学 A kind of adaptive deep neural network model of operating mode and variable working condition method for diagnosing faults
CN110658006A (en) * 2018-06-29 2020-01-07 杭州萤石软件有限公司 Sweeping robot fault diagnosis method and sweeping robot

Also Published As

Publication number Publication date
WO2008148289A1 (en) 2008-12-11
CN101067930B (en) 2011-06-29

Similar Documents

Publication Publication Date Title
CN101067930B (en) Intelligent audio frequency identifying system and identifying method
Heinicke et al. Assessing the performance of a semi‐automated acoustic monitoring system for primates
Ektefa et al. Intrusion detection using data mining techniques
CN107527617A (en) Monitoring method, apparatus and system based on voice recognition
Carletti et al. Audio surveillance using a bag of aural words classifier
Dugan et al. North Atlantic right whale acoustic signal processing: Part I. Comparison of machine learning recognition algorithms
CN101546556A (en) Classification system for identifying audio content
CN110017991A (en) Rolling bearing fault classification method and system based on spectrum kurtosis and neural network
CN1302456C (en) Sound veins identifying method
CN101079109A (en) Identity identification method and system based on uniform characteristic
Brabant et al. Comparing the results of four widely used automated bat identification software programs to identify nine bat species in coastal Western Europe
CN111460940A (en) Stranger foot drop point studying and judging method and system
CN114023354A (en) Guidance type acoustic event detection model training method based on focusing loss function
CN112270633A (en) Public welfare litigation clue studying and judging system and method based on big data drive
Wang et al. A novel underground pipeline surveillance system based on hybrid acoustic features
Xie et al. Detecting frog calling activity based on acoustic event detection and multi-label learning
CN118347564A (en) Vehicle overload detection method based on acceleration vibration sensor
García-de-la-Puente et al. Deep Learning Models for Gunshot Detection in the Albufera Natural Park
CN117692588A (en) Intelligent visual noise monitoring and tracing device
Xie et al. Acoustic feature extraction using perceptual wavelet packet decomposition for frog call classification
CN115659056A (en) Accurate matching system of user service based on big data
CN115409114A (en) Two-stage early classification method for data stream
US6243671B1 (en) Device and method for analysis and filtration of sound
Diez Gaspon et al. Deep learning for natural sound classification
Jaafar et al. Effect of natural background noise and man-made noise on automated frog calls identification system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant