WO2016152132A1 - Speech processing device, speech processing method, and recording medium - Google Patents
Speech processing device, speech processing method, and recording medium
- Publication number
- WO2016152132A1 (PCT/JP2016/001593)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cluster
- segments
- voice
- unit
- speech
- Prior art date
Links
- 238000012545 processing Methods 0.000 title claims abstract description 141
- 238000003672 processing method Methods 0.000 title claims description 6
- 238000000605 extraction Methods 0.000 claims abstract description 15
- 238000010606 normalization Methods 0.000 claims description 64
- 238000000034 method Methods 0.000 claims description 43
- 239000000284 extract Substances 0.000 claims description 14
- 230000008569 process Effects 0.000 claims description 10
- 238000011156 evaluation Methods 0.000 abstract description 4
- 238000003860 storage Methods 0.000 description 54
- 238000013500 data storage Methods 0.000 description 40
- 238000010586 diagram Methods 0.000 description 15
- 230000000694 effects Effects 0.000 description 6
- 238000004519 manufacturing process Methods 0.000 description 4
- 238000012360 testing method Methods 0.000 description 3
- 230000000007 visual effect Effects 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000007476 Maximum Likelihood Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000011840 criminal investigation Methods 0.000 description 1
- 238000009826 distribution Methods 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 230000003252 repetitive effect Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
- 230000005477 standard model Effects 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Images
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/54—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
Definitions
- the present invention relates to a voice processing device, a voice processing system, a voice processing method, and a recording medium that extract frequent patterns from voice data.
- In fingerprint identification, which is a typical example, fingerprint images collected at a crime scene are compared one by one against a large number of known fingerprint images to estimate who was involved in the crime.
- A technique analogous to fingerprint examination but dealing with voice is called voiceprint examination or voice examination.
- Patent Document 1 describes a technique for extracting, from speech data, speech segments of unknown words that are candidate keywords to be registered in a speech recognition dictionary.
- The technique described in Patent Document 1 detects, as a speech section, a section in which the power value of the speech data remains above a threshold th1 for a certain time or longer, and then divides each speech section into subsections in which the power value remains above a threshold th2 for a certain time or longer.
- The technique described in Patent Document 1 then acquires a phoneme string from the divided speech data, performs clustering, calculates an evaluation value, detects unknown words, and registers them in the dictionary.
- Patent Document 2 describes a technique for determining a factor causing misrecognition and notifying a user.
- The technique described in Patent Document 2 divides a mel-frequency cepstral coefficient (Mel-Frequency Cepstrum Coefficients; hereinafter "MFCC") vector sequence extracted by a feature extraction unit into per-phoneme segments using a set of standard models.
- The technique described in Patent Document 2 then investigates the cause of the misrecognition, creates a message character string to be presented to the user according to the analysis result, and notifies the user by displaying the message on a display.
- With the technique of Patent Document 1, an unknown word that is a keyword candidate can be selected, but a phrase spanning a sentence (for example, a sentence such as "Please prepare a ransom") cannot be selected.
- With the technique of Patent Document 2, the vector sequence of each misrecognized segment can be analyzed, but a desired phrase cannot be selected.
- Thus, the techniques described in Patent Documents 1 and 2 have the problem that a desired phrase cannot be selected.
- An object of the present invention is to provide an audio processing device, an audio processing system, an audio processing method, and a recording medium that can solve the above-described problems and can select a desired phrase.
- A speech processing apparatus according to one aspect of the present invention includes: first generation means for generating a plurality of segments from speech data such that adjacent segments at least partially overlap; second generation means for classifying the plurality of segments based on phoneme similarity to generate clusters; selection means for selecting a cluster that satisfies a predetermined condition based on cluster size; and extraction means for extracting the segments included in the selected cluster.
- A speech processing method according to one aspect of the present invention generates, from speech data, a plurality of segments in which adjacent segments at least partially overlap, classifies the plurality of segments based on phoneme similarity to generate clusters, selects a cluster that satisfies a predetermined condition based on cluster size, and extracts the segments included in the selected cluster.
- A recording medium according to one aspect of the present invention is a computer-readable recording medium storing a program that causes a computer to execute: a process of generating, from speech data, a plurality of segments in which adjacent segments at least partially overlap; a process of classifying the plurality of segments based on phoneme similarity to generate clusters; a process of selecting a cluster that satisfies a predetermined condition based on cluster size; and a process of extracting the segments included in the selected cluster.
- the present invention has an effect that a desired phrase can be selected in a voice processing device, a voice processing system, a voice processing method, and a program.
- In voice examination, for example, a telephone call containing a kidnapper's ransom demand or a terrorist's crime announcement is recorded, and the recorded voice is compared with known voices to identify the caller.
- Voice changes each time depending on what is spoken. In voice examination, therefore, a part (section) of the recording in which the same content is spoken is cut out and compared. For example, in a kidnapper's ransom demand the phrase "prepare the ransom" is expected to appear frequently, so such a phrase is found, cut out, and compared with a known voice speaking "prepare the ransom".
- FIG. 1 is a block diagram illustrating a configuration example of a voice processing device 10 according to the first embodiment of the present invention.
- the speech processing apparatus 10 includes a generation unit 11, a clustering unit 12, a selection unit 13, and an extraction unit 14.
- the generation unit 11 is also referred to as a first generation unit.
- the clustering unit 12 is also referred to as a second generation unit.
- The generation unit 11 generates, from the audio data stored in the external storage device, a plurality of segments in which adjacent segments at least partially overlap. For example, the generation unit 11 subdivides the audio data stored in the external storage device into short time units and generates a plurality of segments from the subdivided audio data. The time lengths of the plurality of segments generated by the generation unit 11 may be the same or may differ from one another.
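- The following is a minimal, non-authoritative sketch of this segmentation step. It assumes fixed-length windows with a hop shorter than the window so that neighbours overlap; the 1-second length and 0.5-second hop are illustrative values, not values taken from the embodiment.

```python
import numpy as np

def generate_segments(samples, sr, seg_len_s=1.0, hop_s=0.5):
    """Cut a waveform into fixed-length segments whose neighbours overlap.

    seg_len_s and hop_s are illustrative; the text only requires that
    adjacent segments overlap at least partially (hop < segment length).
    """
    seg_len = int(seg_len_s * sr)
    hop = int(hop_s * sr)
    segments = []
    for start in range(0, max(len(samples) - seg_len, 0) + 1, hop):
        segments.append((start / sr, samples[start:start + seg_len]))
    return segments  # list of (start time in seconds, waveform slice)

# usage: one minute of audio sampled at 16 kHz -> overlapping 1 s segments
audio = np.zeros(16000 * 60)
print(len(generate_segments(audio, sr=16000)), "segments")
```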
- the clustering unit 12 classifies a plurality of segments based on a predetermined similarity index to generate a cluster.
- the selection unit 13 selects at least one cluster from the generated clusters based on the size of each cluster.
- the extraction unit 14 extracts segments included in the selected cluster.
- The size of a cluster is, for example, the total time length of the segments it contains, i.e., the result of multiplying the number of appearances of the segment content (also referred to as a phrase) by the segment length.
- FIG. 2 is a schematic block diagram showing a configuration example of the computer 1000 in each embodiment and specific example of the present invention.
- the computer 1000 includes a CPU (Central Processing Unit) 1001, a main storage device 1002, an auxiliary storage device 1003, an interface 1004, an input device 1005, and a display device 1006.
- The voice processing apparatus 10 and the like of each embodiment and specific example are implemented on the computer 1000. The operations of the voice processing device 10 and the like are stored in the auxiliary storage device 1003 in the form of a program.
- the CPU 1001 reads out the program from the auxiliary storage device 1003, develops it in the main storage device 1002, and executes the above processing according to the program. For example, the CPU 1001 reads out the above program from the auxiliary storage device 1003 and develops it in the main storage device 1002, thereby realizing the functions of the generation unit 11, clustering unit 12, selection unit 13, and extraction unit 14.
- The auxiliary storage device 1003 is an example of a non-transitory tangible medium.
- Other examples of non-transitory tangible media include a magnetic disk, a magneto-optical disk, a CD-ROM (Compact Disc Read Only Memory), a DVD-ROM (Digital Versatile Disc Read Only Memory), and a semiconductor memory connected via the interface 1004.
- When this program is distributed to the computer 1000 via a communication line, the computer 1000 that receives the distribution may load the program into the main storage device 1002 and execute the above processing.
- the interface 1004 is connected to the CPU 1001 and is connected to a network or an external storage medium. External data may be taken into the CPU 1001 via the interface 1004.
- the input device 1005 is, for example, a keyboard, a mouse, a touch panel, or a microphone.
- The display device 1006 is a device, such as an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube) display, that displays a screen corresponding to drawing data processed by the CPU 1001 or by a GPU (Graphics Processing Unit, not shown). Note that the hardware configuration illustrated in FIG. 2 is merely an example, and each unit illustrated in FIG. 2 may be configured as an independent logic circuit.
- the program may be for realizing a part of the above-described processing.
- the program may be a differential program that realizes the above-described processing in combination with another program already stored in the auxiliary storage device 1003.
- FIG. 3 is a flowchart showing an operation example of the speech processing apparatus 10 according to the first embodiment of the present invention.
- the generation unit 11 generates a plurality of segments from the audio data stored in the external storage device (step S101). At this time, the generation unit 11 generates a plurality of segments so that adjacent segments have at least temporal overlap.
- the time length of the segment may be a constant value in the range of 1 second to several seconds, for example, depending on the assumed time length of the phrase.
- The generation unit 11 may generate segments of various time lengths by dividing the audio data a plurality of times with different time lengths. Further, the generation unit 11 may divide the audio data at predetermined change points or silent sections using the method described in Non-Patent Document 1, and generate variable-length segments from the plurality of divided pieces of audio data.
- the clustering unit 12 classifies a plurality of segments based on a predetermined similarity index to generate a cluster (step S102). That is, the clustering unit 12 clusters a plurality of segments.
- the clustering unit 12 calculates the similarity between the segments from the plurality of segments generated by the generation unit 11, and generates a cluster in which the segments with high similarity are collected.
- the method described in Non-Patent Document 1 can be used.
- the similarity index is an index for measuring the similarity of phonemes constituting a segment.
- The similarity index is, for example, an index based on statistics of acoustic features, such as the Bhattacharyya distance calculated from the mean and variance of the acoustic feature sequence, the Kullback-Leibler divergence, or the log-likelihood ratio. These indices do not consider the order of the acoustic feature sequence within a segment.
- Alternatively, an index that takes order, that is, temporal order, into account may be used as the similarity index.
- A method using such a similarity index is, for example, DP matching, which calculates the degree of similarity by finding an optimal correspondence between the acoustic features of two segments by dynamic programming (hereinafter "DP").
- the acoustic feature amount is, for example, MFCC.
- MFCC is widely used for voice recognition and the like.
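- As a non-authoritative illustration of the DP matching mentioned above, the following Python sketch computes a length-normalised alignment cost between two MFCC sequences (which could be extracted with a library such as librosa); the Euclidean frame distance and the normalisation are assumptions of this sketch, not details taken from the embodiment.

```python
import numpy as np

def dp_match_distance(feats_a, feats_b):
    """DP-matching (dynamic time warping) distance between two segments.

    feats_a, feats_b: arrays of shape (frames, dims), e.g. MFCC sequences.
    Smaller values mean the segments are more similar; a similarity index
    can be taken as, say, the negative of this normalised cost.
    """
    n, m = len(feats_a), len(feats_b)
    # pairwise Euclidean distances between frames
    cost = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m] / (n + m)  # length-normalised alignment cost
```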
- The selection unit 13 selects a cluster that satisfies a predetermined condition from the clusters generated by the clustering unit 12, based on the size of each cluster (step S104). In this selection, the selection unit 13 compares cluster sizes from the viewpoint of finding frequently occurring phrases and selects at least one cluster in descending order of size. Examples of the predetermined condition include containing a larger number of segments, having a longer total segment time length, and the like. That is, the selection unit 13 selects, as a cluster satisfying the predetermined condition, for example, a cluster containing more segments or a cluster with a longer total segment time length.
- Clustering is considered to have converged when, for example, steps S101 and S102 have been executed a predetermined number of times, or when the change in a predetermined evaluation value related to the clustering has become a certain value or less.
- Clustering may also be considered to have converged when segments no longer move between clusters, which accompanies the situation in which the change in the predetermined evaluation value related to the clustering is a certain value or less.
- The extraction unit 14 extracts segments from the one or more segments included in the cluster selected by the selection unit 13 (step S105). In this way, the extraction unit 14 can extract, from the audio data, the segments of the parts corresponding to a desired phrase.
- If the clustering in the clustering unit 12 has not converged (No in step S103), the process returns to step S101.
- Steps S101 and S102 depend on each other and may be repeated a predetermined number of times or until convergence.
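- The following Python sketch ties steps S102 to S105 together in outline: it clusters the segments with an off-the-shelf hierarchical clustering routine, measures cluster size as the total time length of the member segments, and returns the members of the largest cluster. The clustering method, the distance threshold, and the use of dp_match_distance from the earlier sketch are illustrative assumptions rather than the method claimed in the patent.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_and_extract(segment_feats, seg_lengths_s, distance_fn, threshold=1.0):
    """Cluster segments, pick the largest cluster, and return its member indices.

    segment_feats: list of per-segment feature sequences (e.g. MFCC arrays)
    seg_lengths_s: time length of each segment in seconds
    distance_fn:   e.g. dp_match_distance from the earlier sketch
    threshold:     illustrative clustering cut-off, not a value from the text
    """
    n = len(segment_feats)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = distance_fn(segment_feats[i], segment_feats[j])

    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=threshold, criterion="distance")

    # cluster size = total time length of the member segments
    sizes = {}
    for label, length in zip(labels, seg_lengths_s):
        sizes[label] = sizes.get(label, 0.0) + length
    largest = max(sizes, key=sizes.get)
    return [i for i, label in enumerate(labels) if label == largest]
```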
- FIG. 4 is a diagram illustrating an example of a method in which the speech processing apparatus 10 extracts a phrase using the HMM. That is, the speech processing apparatus 10 learns an HMM as shown in FIG. 4 based on the maximum likelihood estimation method using speech data stored in the external storage device.
- A left-to-right HMM expressing the first phrase (phrase 1 in FIG. 4), the second phrase (phrase 2 in FIG. 4), ..., and the Nth phrase (phrase N in FIG. 4) is formed automatically, and at the same time the segments belonging to each phrase are also obtained.
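- As a non-authoritative sketch of such maximum-likelihood HMM training, the code below fits one left-to-right Gaussian HMM to the feature sequences of one cluster (one phrase) using the hmmlearn library; the number of states, the 0.5 self-transition probability, and the diagonal covariances are assumptions of this sketch, not parameters given in the embodiment.

```python
import numpy as np
from hmmlearn import hmm

def fit_left_to_right_hmm(segment_feats, n_states=5):
    """Fit a left-to-right Gaussian HMM to the feature sequences of one cluster.

    segment_feats: list of (frames, dims) arrays, e.g. MFCCs of the segments
    n_states:      illustrative number of states for the phrase model
    """
    # left-to-right topology: stay in a state or move on to the next one
    transmat = np.zeros((n_states, n_states))
    for i in range(n_states):
        if i + 1 < n_states:
            transmat[i, i] = 0.5
            transmat[i, i + 1] = 0.5
        else:
            transmat[i, i] = 1.0
    startprob = np.zeros(n_states)
    startprob[0] = 1.0

    # keep the topology fixed; learn only the Gaussian means and covariances
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, init_params="mc", params="mc")
    model.startprob_ = startprob
    model.transmat_ = transmat

    X = np.concatenate(segment_feats)           # all frames stacked
    lengths = [len(f) for f in segment_feats]   # per-segment frame counts
    model.fit(X, lengths)                       # maximum-likelihood (EM) training
    return model
```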
- the generation unit 11 generates a plurality of segments from the speech data so that adjacent segments at least partially overlap, and the clustering unit 12 Based on the similarity, a plurality of segments are classified to generate a cluster.
- the selection unit 13 selects at least one cluster from the clusters based on the size of each cluster.
- Since the extraction unit 14 extracts the segments included in the selected cluster, segments corresponding to a desired phrase can be extracted from the speech data. The reason is that, because the generation unit 11 generates a plurality of segments from the audio data such that adjacent segments at least partially overlap, anything from a unit shorter than a word to a phrase longer than a word can be generated as a single segment.
- With the speech processing apparatus 10 of the present embodiment, frequent phrases needed for voice examination can therefore be found and selected at low cost, even by someone who is not an expert examiner.
- The reason is that the speech processing apparatus 10 generates, from the given speech data, segments such that adjacent segments at least partially overlap, clusters the segments, selects a cluster containing many similar segments, and extracts the segments included in the selected cluster.
- Each extracted segment is a segment generated by the generation unit 11 and is a portion of the audio data containing a desired phrase. In this way, the speech processing apparatus 10 can automatically find frequently occurring phrases in the speech data.
- FIG. 5 is a block diagram illustrating a configuration example of the sound processing device 20 according to the second embodiment of the present invention.
- The speech processing apparatus 20 according to the second embodiment of the present invention includes a normalization learning unit 15, a speech data normalization unit 16, a speech data processing unit 17, first to Nth speech data storage units (101-1 to 101-N, where N is a positive integer), an unspecified acoustic model storage unit 102, and first to Nth parameter storage units (103-1 to 103-N).
- the normalization learning unit 15 is also referred to as a third generation unit.
- the first to Nth audio data storage units (101-1 to 101-N) are referred to as the audio data storage unit 101 when they are not distinguished or collectively referred to.
- the first to Nth parameter storage units (103-1 to 103-N) are referred to as the parameter storage unit 103 when they are not distinguished or collectively referred to.
- The audio data storage units 101 store audio data having different properties. That is, the first audio data storage unit 101-1, the second audio data storage unit 101-2, ..., and the Nth audio data storage unit 101-N each store audio data with different properties. The audio data with different properties stored in these units is audio data that has been classified based on its acoustic characteristics.
- the unspecified acoustic model storage unit 102 stores the unspecified acoustic model learned by the normalization learning unit 15.
- the unspecified acoustic model is a model obtained by normalizing a difference between audio data having different properties stored in the audio data storage unit 101.
- The parameter storage units 103 store parameters for normalizing the differences between the audio data. That is, the first parameter storage unit 103-1, the second parameter storage unit 103-2, ..., and the Nth parameter storage unit 103-N each store parameters for normalizing the differences of the audio data.
- the normalization learning unit 15 performs normalization learning using audio data having different properties stored in the audio data storage unit 101.
- normalization learning is an acoustic model learning method described in Non-Patent Document 2, for example.
- each phoneme i is defined by an average vector ⁇ i of acoustic feature values.
- The average vector varies depending on the properties of the speech data. That is, in the present embodiment, the average vector (unspecified acoustic model) μi is expressed via an affine transformation, as shown in formula (1) below.
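- Formula (1) itself is not reproduced in this text. Under the affine normalization assumption described above (following the speaker-adaptive-training convention of Non-Patent Document 2), it would take a form such as the following, where μi(s) denotes the mean vector of phoneme i as observed in the s-th audio data; this reconstruction is an assumption, not a quotation of the patent.

$$
\mu_i^{(s)} = A_s\,\mu_i + b_s \qquad (s = 1, 2, \dots, N) \tag{1}
$$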
- Here, s = 1, 2, ..., N.
- A_s and b_s are the parameters that normalize the differences in the properties of the audio data.
- Equation (1) provides the unspecified acoustic model μi, which is not affected by differences in the properties of the speech data, and the parameters A_s and b_s for normalizing those differences. The normalization learning unit 15 stores the unspecified acoustic model μi in the unspecified acoustic model storage unit 102. In addition, the normalization learning unit 15 stores the parameters A_s and b_s in the parameter storage units 103. Specifically, the normalization learning unit 15 stores the parameters A_1 and b_1 in the first parameter storage unit 103-1, ..., and the parameters A_N and b_N in the Nth parameter storage unit 103-N.
- Non-Patent Document 2 describes a method of normalizing the difference between speakers assuming that the nature of speech data varies depending on the speaker.
- However, the difference in the properties of speech data is not limited to the speaker; various other assumptions are possible, such as background noise, the microphone, and the communication channel.
- the normalization learning unit 15 generates a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101 and stores the normalization parameter in the parameter storage unit 103.
- The normalization learning unit 15 normalizes the differences between the audio data with different properties stored in the audio data storage units 101 to learn the unspecified acoustic model, and stores the learned unspecified acoustic model in the unspecified acoustic model storage unit 102.
- the normalization learning unit 15 generates the normalization parameter by estimating a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101.
- The normalization learning unit 15 stores the unspecified acoustic model in the unspecified acoustic model storage unit 102 at each iteration.
- The audio data normalization unit 16 refers to the parameters stored in the parameter storage units 103, normalizes the audio data stored in each of the audio data storage units 101, and sends the result to the audio data processing unit 17. Specifically, to the time series x_1, x_2, ..., x_t, ... (t is a positive integer) of the acoustic features of the s-th audio data, the s-th parameters are used to apply Equation (2), which is the transformation corresponding to the inverse of Equation (1).
- The parameters that define the normalization may differ depending on the phoneme class (fricative, plosive, etc.), or may differ depending on the preceding and following phonemes to take context dependency into account.
- The audio data normalization unit 16 may normalize not only the average vector of the acoustic features but also the variance. Furthermore, the present invention is not limited to these; various known techniques for normalization learning may be applied.
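- A minimal sketch of this normalization step is given below. Equation (2) is not reproduced in this text; the sketch assumes it is the inverse affine mapping x't = A_s^-1 (x_t - b_s) corresponding to the reconstruction of formula (1) above, which is an assumption of this illustration.

```python
import numpy as np

def normalize_features(frames, A_s, b_s):
    """Apply the assumed inverse affine mapping to the feature frames
    of the s-th audio data (cf. Equation (2)).

    frames: array of shape (T, dims) holding the time series x_1, ..., x_T
    A_s, b_s: normalization parameters estimated for the s-th audio data
    """
    A_inv = np.linalg.inv(A_s)
    return (frames - b_s) @ A_inv.T   # each row becomes A_s^-1 (x_t - b_s)
```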
- The voice data processing unit 17 has the same configuration and effects as the voice processing apparatus 10 in the first embodiment. That is, the voice data processing unit 17 performs the processing of the generation unit 11, the clustering unit 12, the selection unit 13, and the extraction unit 14 illustrated in FIG. 1 in the same manner as in the first embodiment, and outputs segments containing frequently appearing phrases in the normalized voice data.
- FIG. 6 is a flowchart showing an operation example of the speech processing apparatus 20 according to the second embodiment of the present invention.
- The operation of the audio data processing unit 17 in this embodiment, i.e., steps S204 to S208, is the same as the operation of the audio processing device 10 in the first embodiment, i.e., steps S101 to S105, and its description is therefore omitted.
- the normalization learning unit 15 reads out each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S201).
- the normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the speech data in the unspecified acoustic model storage unit 102 (step S202).
- the audio data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103, and normalizes the audio data stored in the audio data storage unit 101, respectively (step S203).
- The voice data processing unit 17 performs the same processing as steps S101 to S105 of the voice processing apparatus 10 in the first embodiment shown in FIG. 3, and outputs segments including phrases that frequently appear in the voice data (steps S204 to S208).
- the normalization learning unit 15 reads each speech data from the speech data storage unit 101, performs normalization learning, and sets the normalization parameter of each speech data. Store in the parameter storage unit 103.
- the normalization learning unit 15 performs normalization to eliminate the difference in acoustic properties of the respective voice data, and stores the unspecified acoustic model generated in the unspecified acoustic model storage unit 102.
- the voice data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the voice data stored in the voice data storage unit 101, respectively.
- the voice data processing unit 17 outputs a segment including a phrase that frequently appears in the normalized voice data. Therefore, the speech processing apparatus 20 in the present embodiment can normalize speech data that has not been normalized and select a desired phrase.
- The normalization learning unit 15 performs learning to normalize the differences in acoustic properties among the first speech data, the second speech data, ..., and the Nth speech data.
- the voice data processing unit 17 extracts segments including phrases that frequently appear in the voice data. Therefore, the voice processing device 20 can more accurately extract phrases that frequently appear in the voice data.
- This is because the situation in which the clustering unit 12 in the speech data processing unit 17 is affected by differences in the properties of the speech data and generates inappropriate clusters (for example, clusters that merely group segments by speaker) can be reduced.
- FIG. 7 is a block diagram illustrating a configuration example of the sound processing device 30 according to the third embodiment of the present invention.
- The speech processing device 30 according to the third embodiment of the present invention includes an unclassified speech data storage unit 104 and a speech data classification unit 18 in addition to the configuration of the speech processing device 20 according to the second embodiment.
- the audio data classification unit 18 is also described as a fourth generation unit.
- the uncategorized voice data storage unit 104 stores voice data.
- The voice data classification unit 18 classifies the voice data stored in the unclassified voice data storage unit 104 based on its acoustic properties, and stores the classified voice data in the voice data storage units 101.
- The voice data classification unit 18 classifies the voice data stored in the unclassified voice data storage unit 104 into N clusters based on differences in acoustic properties, for example differences in speaker, and stores each cluster in the voice data storage units 101. That is, the voice data classification unit 18 generates N clusters by classifying the voice data stored in the unclassified voice data storage unit 104 based on acoustic properties. The voice data classification unit 18 then stores the first cluster in the first voice data storage unit, the second cluster in the second voice data storage unit, ..., and the Nth cluster in the Nth voice data storage unit.
- the audio data stored in the unclassified audio data storage unit 104 may be a mixture of audio data having various acoustic properties.
- N may be a predetermined constant, or may be automatically determined by the audio data classification unit 18 according to the processing target. These can be implemented by applying a known clustering method.
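- The sketch below illustrates one way such a classification could be realized with a known clustering method: each recording is summarized by simple feature statistics and grouped with k-means. The summary representation, the use of k-means, and the fixed number of clusters are assumptions of this illustration, not the classification method specified in the embodiment.

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_recordings(recording_feats, n_clusters=4):
    """Group recordings by coarse acoustic properties (e.g. speaker).

    recording_feats: list of (frames, dims) feature arrays, one per recording
    n_clusters:      N in the text; a fixed illustrative value here
    """
    # crude per-recording summary: mean and standard deviation of the features
    summaries = np.array([np.concatenate([f.mean(axis=0), f.std(axis=0)])
                          for f in recording_feats])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(summaries)
    # labels[i] indicates which of the N voice data storage units recording i goes to
    return labels
```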
- the audio data storage unit 101 stores the audio data classified by the audio data classification unit 18.
- FIG. 8 is a flowchart showing an operation example of the speech processing apparatus 30 according to the third embodiment of the present invention.
- The operation of the audio data processing unit 17 in this embodiment, i.e., steps S306 to S310, is the same as the operation of the audio processing device 10 in the first embodiment, i.e., steps S101 to S105, and its description is therefore omitted.
- The voice data classification unit 18 classifies the voice data stored in the unclassified voice data storage unit 104 based on its acoustic properties, and stores the classified voice data in the voice data storage units 101 (step S301).
- the normalization learning unit 15 reads each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S302).
- the normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the voice data in the unspecified acoustic model storage unit 102 (step S303).
- The speech data normalization unit 16 refers to the normalization parameters stored in the parameter storage units 103 and normalizes the speech data stored in each of the speech data storage units 101 (step S305).
- When the results of the speech data classification unit 18 and the normalization learning unit 15 have not converged (No in step S304), the process returns to step S301. In this way, the speech data classification unit 18 and the normalization learning unit 15 can be executed alternately and repeatedly until the results converge.
- The results output by the speech data classification unit 18 and the normalization learning unit 15 may depend on each other. They may therefore be executed as a repeated, alternating operation.
- Such an operation can be carried out efficiently based on an optimization criterion such as likelihood maximization following the method described in Non-Patent Document 3.
- The voice data processing unit 17 performs the same processing as steps S101 to S105 of the voice processing apparatus 10 in the first embodiment shown in FIG. 3, and outputs segments including phrases that frequently appear in the voice data (steps S306 to S310).
- The audio data classification unit 18 classifies the audio data stored in the unclassified audio data storage unit 104 based on its acoustic properties, and stores the classified audio data in the audio data storage units 101.
- the normalization learning unit 15 reads out each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameters of each voice data in the parameter storage unit 103.
- the normalization learning unit 15 performs normalization to eliminate the difference in acoustic properties of the respective voice data, and stores the unspecified acoustic model generated in the unspecified acoustic model storage unit 102.
- the sound data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the sound data stored in the sound data storage unit 101, respectively.
- the voice data processing unit 17 outputs a segment including a phrase that frequently appears in the normalized voice data. Therefore, the speech processing apparatus 30 in the present embodiment can classify and normalize speech data that has not been classified and normalized, and select a desired phrase.
- In the present embodiment, the speech data classification unit 18 classifies the speech data into N clusters based on differences in acoustic properties, and the normalization learning unit 15 is configured to perform normalization learning using that result. Therefore, the voice processing device 30 in the present embodiment can reduce the cost of preparing the voice data compared with the voice processing device 20 in the second embodiment. The reason is that the speech processing apparatus 30 in the present embodiment does not require the speech data to be divided in advance according to differences in acoustic properties (for example, per speaker), and can instead be given a collection of diverse speech data at once and process it.
- FIG. 9 is a block diagram showing a configuration example of the speech processing system 40 in the fourth embodiment of the present invention.
- the voice processing system 40 in the fourth embodiment includes a voice processing device 41, a voice input device 42, an instruction input device 43, and an output device 44.
- The voice processing device 41 executes, on the input voice, the processing of the voice processing device 10 in the first embodiment of the present invention, the processing of the voice processing device 20 in the second embodiment, or the processing of the voice processing device 30 in the third embodiment (hereinafter referred to as "the phrase extraction processing described in the first to third embodiments of the present invention").
- the voice input device 42 inputs voice.
- The audio input device 42 is any device that functions as an interface for inputting arbitrary audio data to the audio processing device 41, for example a microphone that captures audio signals as data or a memory that stores audio data.
- the voice input device 42 is, for example, the input device 1005 shown in FIG.
- the output device 44 outputs the result of the processing performed by the voice processing device 41.
- The output device 44 is an output device, such as a monitor or a speaker, that outputs the processing result of the voice processing device 41 by visual or auditory means in accordance with instructions input from the instruction input device 43 by the operator.
- When the output device 44 is a monitor, for example, a list of clusters is displayed in order of size, and the contents of a specific cluster are displayed as waveform diagrams, spectrograms, and the like, arranged side by side so that a plurality of segments can be compared.
- When the output device 44 is a speaker, the output device 44 outputs the result by reproducing the sound.
- the output device 44 is realized by a display device 1006, for example.
- the instruction input device 43 receives the instruction information from the operator and controls information displayed on the display device.
- The instruction input device 43 is a user interface that receives instruction information from the operator, such as operations on the information output from the output device 44 or requests to execute processing by the voice processing device 41. Any input device, such as a mouse, a keyboard, or a touch panel, can be used.
- the instruction input device 43 receives the instruction information from the operator and controls the voice processing device 41 to execute the process.
- the voice input device 42 inputs arbitrary voice data to the voice processing device 41.
- The speech processing device 41 executes the phrase extraction processing described in the first to third embodiments of the present invention on the input speech data, selects clusters containing frequently occurring phrases, and extracts the segments included in the selected clusters.
- the output device 44 outputs the processing result of the sound processing device 41 by visual or auditory means according to an instruction input from the instruction input device 43 by the operator. That is, the output device 44 outputs the processing result in a form that the operator desires to view.
- the instruction input device 43 controls the voice processing device 41 to execute processing in accordance with the instruction information input from the operator.
- the voice input device 42 inputs arbitrary voice data to the voice processing device 41.
- The voice processing device 41 executes the phrase extraction described in the first to third embodiments of the present invention, selects clusters containing frequently appearing phrases (segments), and extracts the segments included in the selected clusters.
- the output device 44 outputs the processing result of the sound processing device 41 by visual or auditory means according to the instruction input from the instruction input device 43 by the operator. Therefore, the speech processing system 40 in the present embodiment can output clusters and segments including frequently occurring phrases included in the speech data.
- the voice processing system 40 allows an operator to easily perform analysis work such as identification of a person from voice. This is because the voice processing system 40 in the present embodiment is configured such that the processing result is output to the output device 44 in a form that the operator wants to browse.
- Since the speech processing system 40 according to the present embodiment can analyze frequently appearing phrases, it can also be used to analyze the tendencies of conversations or the topics that a specific person often talks about.
- FIG. 10 is a diagram illustrating an example of audio data stored in the external storage device.
- the external storage device is realized by, for example, the voice input device 42 in the fourth embodiment.
- the external storage device stores voice data and a voice data ID that is an identifier of the voice data.
- For the voice data ID "1", for example, the external storage device stores voice data such as "... We are keeping your child. Please prepare a ransom. ...".
- The audio data stored in the external storage device is not limited to the content shown in FIG. 10.
- FIG. 11 is a diagram illustrating an example of a method in which the generation unit 11 generates a segment from audio data.
- The generation unit 11 generates a plurality of segments from the voice data shown in FIG. 10, that is, the voice data with ID "1": "... We are keeping your child. Please prepare a ransom. ...".
- For example, segment 1 and segment 2 in FIG. 11 are adjacent, partially overlapping fragments cut from this voice data.
- The generation unit 11 subdivides the audio data arbitrarily (for example, into predetermined time units) and generates a plurality of segments from the subdivided data.
- The generation unit 11 generates the plurality of segments such that adjacent segments at least partially overlap.
- FIG. 12 is a diagram illustrating an example of a method in which the clustering unit 12 generates a cluster in which a plurality of segments are collected.
- As illustrated in FIG. 12, a cluster includes a cluster ID, which is an identifier of the segment content (phrase), the segment content itself, and the number of times that segment content (phrase) appears in all of the audio data.
- the cluster ID and the segment number shown in FIG. 11 are assumed to be the same.
- The cluster with cluster ID "1" indicates, for example, that its phrase appears 20 times in all of the audio data. That is, the clustering unit 12 calculates the similarity between the segments generated by the generation unit 11 as illustrated in FIG. 11, and generates clusters of highly similar, effectively identical segments.
- The selection unit 13 compares clusters using the number of segments included in each cluster and their total time length, and selects a cluster that satisfies a predetermined condition. For example, the selection unit 13 first compares, among the plurality of clusters generated by the clustering unit 12, the number of segments included in each cluster, that is, the number of appearances of each phrase. As shown in FIG. 12, the selection unit 13 selects "Ransom", with an appearance count of 35, and "Prepare a ransom", with an appearance count of 30. Next, the selection unit 13 compares these based on the size of each cluster. For example, the selection unit 13 uses, as the size of each cluster, the result of multiplying the number of appearances by the segment length, i.e., the time length, and selects the cluster with the largest size.
- the selection unit 13 compares, for example, a cluster with a cluster ID of 7 and a cluster with a cluster ID of 8.
- The selection unit 13 compares the product of the appearance count 35 and the time length of "Ransom" with the product of the appearance count 30 and the time length of "Prepare a ransom".
- When comparing clusters with the same appearance count, the selection unit 13 may compare and select based only on the segment time length. Note that the selection unit 13 is not limited to the above method; the size may be defined and compared based on various indexes such as the number of appearances, the time length, and the number of phonemes.
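- A minimal sketch of the comparison described above is given below. The appearance counts are taken from the text, but the segment time lengths are illustrative stand-ins for the values that would be read from FIG. 12.

```python
# illustrative clusters; the 1.0 s and 2.0 s segment lengths are assumed values
clusters = [
    {"id": 7, "phrase": "Ransom",           "appearances": 35, "seg_len_s": 1.0},
    {"id": 8, "phrase": "Prepare a ransom", "appearances": 30, "seg_len_s": 2.0},
]

def cluster_size(c):
    # size = number of appearances x segment time length (total time length)
    return c["appearances"] * c["seg_len_s"]

selected = max(clusters, key=cluster_size)
print(selected["phrase"])   # "Prepare a ransom" (30 x 2.0 s > 35 x 1.0 s)
```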
- the extraction unit 14 extracts a segment from the selected cluster.
- For example, the audio data of the segments whose content is "Prepare a ransom" is extracted. From these segments it can be seen that the phrase "Prepare a ransom" appears frequently in the voice data.
- DESCRIPTION OF SYMBOLS: 10 Speech processing apparatus; 11 Generation unit; 12 Clustering unit; 13 Selection unit; 14 Extraction unit; 15 Normalization learning unit; 16 Speech data normalization unit; 17 Speech data processing unit; 18 Speech data classification unit; 20 Speech processing apparatus; 30 Speech processing apparatus; 40 Speech processing system; 41 Voice processing device; 42 Voice input device; 43 Instruction input device; 44 Output device; 101 Voice data storage unit; 102 Unspecified acoustic model storage unit; 103 Parameter storage unit; 1000 Computer; 1001 CPU; 1002 Main storage device; 1003 Auxiliary storage device; 1004 Interface; 1005 Input device; 1006 Display device
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a speech processing device capable of accurately selecting, from speech data, the frequently appearing phrases that are needed for voice examination. The speech processing device comprises: a generation unit for generating a plurality of segments from the speech data such that adjacent segments at least partially overlap; a clustering unit for generating clusters by sorting the plurality of segments according to phonological similarity; a selection unit for selecting a cluster that satisfies a prescribed condition according to cluster size; and an extraction unit for extracting a segment included in the selected cluster.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2017507495A JP6784255B2 (ja) | 2015-03-25 | 2016-03-18 | 音声処理装置、音声処理システム、音声処理方法、およびプログラム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2015-061854 | 2015-03-25 | ||
JP2015061854 | 2015-03-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016152132A1 true WO2016152132A1 (fr) | 2016-09-29 |
Family
ID=56978310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2016/001593 WO2016152132A1 (fr) | 2015-03-25 | 2016-03-18 | Dispositif de traitement vocal, procédé de traitement vocal et support d'enregistrement |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP6784255B2 (fr) |
WO (1) | WO2016152132A1 (fr) |
- 2016-03-18 WO PCT/JP2016/001593 patent/WO2016152132A1/fr active Application Filing
- 2016-03-18 JP JP2017507495A patent/JP6784255B2/ja active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008515012A (ja) * | 2004-09-28 | 2008-05-08 | フラウンホッファー−ゲゼルシャフト ツァ フェルダールング デァ アンゲヴァンテン フォアシュンク エー.ファオ | 楽曲の時間セグメントをグループ化するための装置および方法 |
JP2008533580A (ja) * | 2005-03-10 | 2008-08-21 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | オーディオ及び/又はビジュアルデータの要約 |
JP2007140136A (ja) * | 2005-11-18 | 2007-06-07 | Mitsubishi Electric Corp | 楽曲分析装置及び楽曲検索装置 |
JP2010032792A (ja) * | 2008-07-29 | 2010-02-12 | Nippon Telegr & Teleph Corp <Ntt> | 発話区間話者分類装置とその方法と、その装置を用いた音声認識装置とその方法と、プログラムと記録媒体 |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111613249A (zh) * | 2020-05-22 | 2020-09-01 | 云知声智能科技股份有限公司 | 一种语音分析方法和设备 |
CN113380273A (zh) * | 2020-08-10 | 2021-09-10 | 腾擎科研创设股份有限公司 | 异常声音检测及判断形成原因的系统 |
CN113178196A (zh) * | 2021-04-20 | 2021-07-27 | 平安国际融资租赁有限公司 | 音频数据提取方法、装置、计算机设备和存储介质 |
Also Published As
Publication number | Publication date |
---|---|
JPWO2016152132A1 (ja) | 2018-01-18 |
JP6784255B2 (ja) | 2020-11-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Shahin et al. | Emotion recognition using hybrid Gaussian mixture model and deep neural network | |
Venkataramanan et al. | Emotion recognition from speech | |
US10176811B2 (en) | Neural network-based voiceprint information extraction method and apparatus | |
Deshmukh et al. | Speech based emotion recognition using machine learning | |
CN111524527A (zh) | 话者分离方法、装置、电子设备和存储介质 | |
KR102406512B1 (ko) | 음성인식 방법 및 그 장치 | |
Gupta et al. | Speech emotion recognition using SVM with thresholding fusion | |
JP5704071B2 (ja) | 音声データ解析装置、音声データ解析方法及び音声データ解析用プログラム | |
Al Hindawi et al. | Speaker identification for disguised voices based on modified SVM classifier | |
US10699224B2 (en) | Conversation member optimization apparatus, conversation member optimization method, and program | |
WO2016152132A1 (fr) | Dispositif de traitement vocal, procédé de traitement vocal et support d'enregistrement | |
JP2017134321A (ja) | 信号処理方法、信号処理装置及び信号処理プログラム | |
George et al. | A review on speech emotion recognition: a survey, recent advances, challenges, and the influence of noise | |
JP2015175859A (ja) | パターン認識装置、パターン認識方法及びパターン認識プログラム | |
Kim et al. | Ada-vad: Unpaired adversarial domain adaptation for noise-robust voice activity detection | |
Rahmawati et al. | Java and Sunda dialect recognition from Indonesian speech using GMM and I-Vector | |
Raghib et al. | Emotion analysis and speech signal processing | |
JP5091202B2 (ja) | サンプルを用いずあらゆる言語を識別可能な識別方法 | |
Kamble et al. | Emotion recognition for instantaneous Marathi spoken words | |
KR101023211B1 (ko) | 마이크배열 기반 음성인식 시스템 및 그 시스템에서의 목표음성 추출 방법 | |
JP2011191542A (ja) | 音声分類装置、音声分類方法、及び音声分類用プログラム | |
US12100388B2 (en) | Method and apparatus for training speech recognition model, electronic device and storage medium | |
Gupta et al. | Speech emotion recognition using MFCC and wide residual network | |
Gubka et al. | A comparison of audio features for elementary sound based audio classification | |
US11996086B2 (en) | Estimation device, estimation method, and estimation program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16768039 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2017507495 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 16768039 Country of ref document: EP Kind code of ref document: A1 |