JP6401126B2

JP6401126B2 - Feature amount vector calculation apparatus, feature amount vector calculation method, and feature amount vector calculation program.

Info

Publication number: JP6401126B2
Application number: JP2015158861A
Authority: JP
Inventors: 小川　厚徳; 厚徳小川; マークデルクロア; 拓也吉岡; 中谷　智広; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc
Current assignee: Nippon Telegraph and Telephone Corp; NTT Inc
Priority date: 2015-08-11
Filing date: 2015-08-11
Publication date: 2018-10-03
Anticipated expiration: 2035-08-11
Also published as: JP2017037222A

Description

The present invention relates to a feature vector calculation device, a speech recognition device, a feature vector calculation method, and a feature vector calculation program.

In recent years, HMM (Hidden Markov Model) acoustic models based on DNNs (Deep Neural Networks), known as DNN-HMM acoustic models, have come into use as acoustic models for speech recognition because they achieve higher recognition accuracy than HMM acoustic models based on GMMs (Gaussian Mixture Models), known as GMM-HMM acoustic models (see, for example, Non-Patent Documents 1 and 2). Because the recognition accuracy of a DNN-HMM acoustic model varies when the input speech data is affected by the speaker, noise, channel, and similar factors, adaptation of DNN-HMM acoustic models to these sources of variation has been actively studied (see, for example, Non-Patent Documents 3 and 4). In particular, DNN-HMM acoustic model adaptation, mainly to speaker variation, based on a feature vector called an i-vector, which represents speaker characteristics as a vector of roughly tens to hundreds of dimensions, has attracted attention as a simple and highly accurate technique (see, for example, Non-Patent Documents 4 and 5).

Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition,” [online], Signal Processing Magazine, 2012, Volume 29, Issue 6, p.82-p.97, [retrieved June 29, 2015], Internet <http://www.isip.piconepress.com/courses/temple/ece_8527/lectures/2014_spring/lecture_38_spmag.pdf>
T. Yoshioka, M.J.F. Gales, “Environmentally robust ASR front-end for deep neural network acoustic models,” [online], Computer Speech & Language, Volume 31, Issue 1, May 2015, p.65-p.86, [retrieved June 29, 2015], Internet <http://www.sciencedirect.com/science/article/pii/S0885230814001259>
Hank Liao, “Speaker Adaptation of Context Dependent Deep Neural Networks,” [online], in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, 26-31 May 2013, p.7947-p.7951, [retrieved June 29, 2015], Internet <http://mazsola.iit.uni-miskolc.hu/~czap/letoltes/IS14/IS2014/PDF/AUTHOR/IS140624.PDF>
Michael L. Seltzer, Dong Yu, Yongqiang Wang, “An Investigation of Deep Neural Networks for Noise Robust Speech Recognition,” [online], in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, 26-31 May 2013, p.7398-p.7402, [retrieved June 29, 2015], Internet <http://research.microsoft.com/pubs/194344/0007398.pdf>
George Saon, Hagen Soltau, David Nahamoo and Michel Picheny, “Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors,” [online], in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop, 8-12 Dec 2013, p.55-p.59, [retrieved June 29, 2015], Internet <http://www.researchgate.net/profile/George_Saon/publication/261485126_Speaker_adaptation_of_neural_network_acoustic_models_using_i-vectors/links/558d70f108ae15962d8939c7.pdf>
Mickael Rouvier, Benoit Favre, “Speaker adaptation of DNN-based ASR with i-vectors: Does it actually adapt models to speakers?,” [online], in Proc. of INTERSPEECH, [retrieved June 29, 2015], Internet <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.481.855&rep=rep1&type=pdf>
Tetsuji Ogawa, Sayaka Shiota (小川哲司、塩田さやか), “Speaker recognition using i-vector” (“i-vectorを用いた話者認識”), Journal of the Acoustical Society of Japan, Vol.70, No.6, p.332-p.339, 2014-06-01, The Acoustical Society of Japan
Yi Hu, Philipos C. Loizou, “Subjective comparison and evaluation of speech enhancement algorithms,” [online], in Acoustics, Speech and Signal Processing 2006, ICASSP 2006 Proceedings, 2006 IEEE International Conference (Volume 1), [retrieved June 29, 2015], Internet <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2098693/>
Yun Lei, Nicolas Scheffer, Luciana Ferrer, Mitchel MacLaren, “A Novel Scheme for Speaker Recognition Using a Phonetically-Aware Deep Neural Network,” [online], in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference, 4-9 May 2014, p.1714-p.1718, [retrieved June 29, 2015], Internet <http://www.sri.com/sites/default/files/publications/dnn.pdf>
P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet and J. Alam, “Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition,” [online], in The Speaker and Language Recognition Workshop, 16-19 June 2014, [retrieved June 29, 2015], Internet <http://cs.uef.fi/odyssey2014/program/pdfs/28.pdf>
Daniel Garcia-Romero, Xiaohui Zhang, Alan McCree, Daniel Povey, “Improving Speaker Recognition Performance in the Domain Adaptation Challenge Using Deep Neural Networks,” [online], in Spoken Language Technology Workshop (SLT), 2014 IEEE, 7-10 Dec. 2014, [retrieved June 29, 2015], Internet <http://www.danielpovey.com/files/2014_slt_dnn.pdf>

However, in the above techniques, DNN-HMM acoustic model adaptation based on i-vectors is performed on the assumption that the input speech data is clean, that is, unaffected by noise or channel distortion. Alternatively, even when the input speech data is affected by noise or channel distortion, the adaptation is performed without taking any measures against them.

Because an i-vector is extracted from the features of the input speech data, when noise or channel distortion has been added to the input speech data, the extracted i-vector is also affected by that noise or channel distortion. Consequently, when the input speech data is affected by noise, channel distortion, or the like, the effect of i-vector-based DNN-HMM acoustic model adaptation is reduced.

An object of one example of the embodiments disclosed in the present application is, for example, to suppress the reduction in the effect of DNN-HMM acoustic model adaptation based on a feature vector.

In one example of the embodiments of the present application, a first feature vector is extracted from input speech. A second feature vector is extracted from speech obtained by applying noise or channel distortion reduction processing to the input speech. Then, based on the parameters of a mixture distribution model trained on speech obtained by applying the reduction processing to speech containing noise or distortion, posterior probabilities are calculated, each indicating the probability that the second feature vector belongs to a given distribution of the mixture distribution model. Next, a mean vector of each distribution in the mixture distribution model is calculated from the speech containing noise or distortion and the posterior probabilities. Then, the zeroth-order and first-order Baum-Welch statistics for the input speech are calculated from the first feature vector, the posterior probabilities, and the mean vectors. Finally, a feature vector is calculated from the zeroth-order and first-order Baum-Welch statistics.

According to one example of the embodiments disclosed in the present application, it is possible, for example, to suppress the reduction in the effect of DNN-HMM acoustic model adaptation based on a feature vector.

FIG. 1 is a diagram illustrating an example of an outline of the input of a basic feature vector to a DNN-HMM acoustic model according to the prior art.
FIG. 2 is a diagram illustrating an example of an outline of the input of a basic feature vector and an i-vector to a DNN-HMM acoustic model according to the prior art.
FIG. 3 is a diagram illustrating an example of an outline of an i-vector extraction procedure according to the prior art.
FIG. 4 is a diagram illustrating an example of the i-vector calculation device according to the embodiment.
FIG. 5 is a flowchart illustrating an example of the i-vector extraction process according to the embodiment.
FIG. 6 is a diagram illustrating an example of a computer on which the i-vector calculation device according to the embodiment, and a speech recognition device including the i-vector calculation device, are realized by executing a program.

An example of an embodiment of the technology disclosed in the present application is described below with reference to the drawings. The disclosed technology is not limited by the following embodiments, and the following embodiments may be combined as appropriate. Before the embodiments are described, the prior art on which they are premised is described first.

In the following description, writing “^A” for a symbol A is equivalent to “the symbol A with ^ placed directly above it”, as shown in equation (1-1) below. Likewise, writing “-A” for a symbol A is equivalent to “the symbol A with - placed directly above it”, as shown in equation (1-2) below. Writing “{A}_α^β” for a symbol A is equivalent to “the symbol {A} with α as a subscript and β as a superscript to its right”, as shown in equation (1-3) below. In addition, A is written as “vector A” when A is a vector, “matrix A” when A is a matrix, and “set A” when A is a set.
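Equations (1-1) to (1-3) are not reproduced in this text. Going only by the verbal definitions above, they would presumably read as follows in LaTeX; the exact typesetting of the original is an assumption.

```latex
\begin{align}
\text{\textasciicircum}A &\equiv \hat{A} \tag{1-1}\\
\text{-}A &\equiv \bar{A} \tag{1-2}\\
\{A\}\_\alpha\,\text{\textasciicircum}\beta &\equiv \{A\}_{\alpha}^{\beta} \tag{1-3}
\end{align}
```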

[Input of basic feature vector to DNN-HMM acoustic model according to prior art]
FIG. 1 is a diagram illustrating an example of an outline of the input of a basic feature vector to a DNN-HMM acoustic model according to the prior art. As shown in FIG. 1, in speech recognition the input speech data is generally subjected to acoustic analysis in units of frames with a frame length of about 30 msec and a frame shift of about 10 msec, and a basic feature vector of about 40 dimensions, such as MFCC (Mel-Frequency Cepstral Coefficient) or FBANK (log-mel filter bank) features, is extracted for each frame.
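The framing step described above can be illustrated with a minimal Python sketch. The 16 kHz sample rate, the function name `frame_signal`, and the choice to drop a trailing partial frame are assumptions made for illustration; the description specifies only the approximate frame length and shift.

```python
def frame_signal(samples, sample_rate, frame_ms=30.0, shift_ms=10.0):
    """Split a 1-D sample sequence into overlapping frames (no padding)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    # A trailing partial frame is dropped in this sketch (an assumption).
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

signal = [0.0] * 16000                 # one second of toy audio at 16 kHz
frames = frame_signal(signal, 16000)
print(len(frames), len(frames[0]))     # number of frames, samples per frame
```

Each frame would then be passed to a feature extractor (MFCC or FBANK) that yields one basic feature vector of about 40 dimensions per frame; that extractor is outside the scope of this sketch.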

As shown in FIG. 1, when the basic feature vector of one frame is given, the DNN-HMM acoustic model outputs a posterior probability vector over the HMM states for that frame. More specifically, the DNN-HMM acoustic model is given a basic feature vector with a total of several hundred to over a thousand dimensions, formed for example by concatenating the feature vectors of the frame in question and of the five frames before and after it, and outputs the posterior probability vector over the HMM states for that frame. This basic framework for speech recognition is described in detail in, for example, Non-Patent Documents 1 and 2.
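The context concatenation described above can be sketched as follows. With 40-dimensional frames and five frames of context on each side, the input has 11 × 40 = 440 dimensions. The function name `splice_context` and the edge handling (repeating the first or last frame) are assumptions; the description does not say how utterance boundaries are treated.

```python
def splice_context(frames, context=5):
    """Concatenate each frame with `context` frames on both sides.

    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), last)   # clamp to the valid range
            window.extend(frames[idx])
        spliced.append(window)
    return spliced

feats = [[float(t)] * 40 for t in range(20)]      # 20 frames of 40-dim toy features
spliced = splice_context(feats)
print(len(spliced), len(spliced[0]))
```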

[Input of basic feature vector and i-vector to DNN-HMM acoustic model according to prior art]
FIG. 2 is a diagram illustrating an example of an outline of the input of a basic feature vector and an i-vector to a DNN-HMM acoustic model according to the prior art. As shown in FIG. 2, separately from basic feature vectors such as MFCC and FBANK, a feature vector called an i-vector, which represents the characteristics of the speaker contained in the input speech data as a vector of roughly tens to hundreds of dimensions, is extracted from the input speech data. An extended feature vector obtained by concatenating the basic feature vector and the i-vector is then given to the DNN-HMM acoustic model and used for speech recognition adapted mainly to speaker variation. The effectiveness of this method is described in detail in, for example, Non-Patent Documents 4 and 5.
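A minimal sketch of building the extended feature vector is shown below. It assumes that a single utterance-level i-vector is appended to every frame-level basic feature vector; the function name `extend_with_ivector` and the 100-dimensional i-vector are illustrative choices, not details taken from the description.

```python
def extend_with_ivector(basic_frames, ivector):
    """Append the same utterance-level i-vector to every frame-level vector."""
    return [list(frame) + list(ivector) for frame in basic_frames]

basic = [[0.1] * 40 for _ in range(3)]   # 3 frames of 40-dim toy features
ivec = [0.0] * 100                       # placeholder i-vector (hypothetical size)
extended = extend_with_ivector(basic, ivec)
print(len(extended), len(extended[0]))
```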

Here, when i-vectors are used to recognize input speech data to which noise or channel distortion has been added, the DNN-HMM acoustic model needs to be adapted to noise and channel variation in addition to speaker variation. It is therefore desirable for the i-vector to contain information about noise and channel distortion in addition to speaker characteristics. The i-vector was originally developed in the field of speaker recognition.

[i-vector extraction procedure according to prior art]
FIG. 3 is a diagram illustrating an example of an outline of an i-vector extraction procedure according to the prior art. The i-vector extraction procedure is described below, limited to the parts that relate to the disclosed technology. The background to the i-vector and its extraction procedure are described in detail in, for example, Non-Patent Document 7.

The standard i-vector extraction technique in prior-art speaker recognition is the GMM (Gaussian Mixture Model)-UBM (Universal Background Model) approach (GMM-UBM approach). The UBM is itself a kind of GMM. In the GMM-UBM approach, a model of “speech-likeness” (the UBM) is trained in advance using a large amount of UBM training speech data from many unspecified speakers, and a model (GMM) for a new speaker is obtained by adapting the UBM using a small amount of that speaker's speech data.

Meanwhile, in recent speaker recognition, a framework that uses as the feature vector a GMM supervector, obtained by concatenating the GMM mean vectors of all mixture components, has become mainstream. A GMM supervector represents speech data, which is time-series data, as a single point in a vector space. The i-vector is also built on this GMM supervector.

Here, the sequence X_u of D-dimensional feature vectors over L frames obtained from input speech data u is defined as in equation (2) below. Each feature vector x_t (t = 1, 2, ..., L) is, for example, an MFCC vector, and its dimensionality D is, for example, 40.

Let c = 1, 2, ..., C be the index of the Gaussian distributions of the UBM (GMM) (for example, C = 2048), and let π_c, m_c, and Σ_c be the mixture weight, mean vector, and diagonal covariance matrix of the c-th Gaussian distribution, respectively. The UBM parameter set Ω is then expressed by equation (3) below.

The likelihood p(x_t|Ω) of the UBM for the feature vector x_t is then given by equation (4) below.
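Equations (2) to (4) are not reproduced in this text. Under the definitions given above, and assuming the usual Gaussian mixture form, they can be sketched as follows. The symbol \mathcal{N}(\cdot\,; m, \Sigma) for a Gaussian density is notation introduced here.

```latex
\begin{align}
X_u &= \{x_1, x_2, \ldots, x_L\}, \quad x_t \in \mathbb{R}^{D} \tag{2}\\
\Omega &= \{\pi_c, m_c, \Sigma_c\}_{c=1}^{C} \tag{3}\\
p(x_t \mid \Omega) &= \sum_{c=1}^{C} \pi_c \, \mathcal{N}(x_t;\, m_c, \Sigma_c) \tag{4}
\end{align}
```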

The speaker-independent CD (C × D)-dimensional GMM supervector m obtained from this UBM is given by equation (5) below, where the superscript T denotes the transpose of a matrix or vector.

The CD-dimensional GMM supervector M_u of the input speech data u is then assumed to be obtained as in equation (6) below.

Here, the matrix T in equation (6) is a CD-dimensional × M-dimensional rectangular matrix (M << CD) called the total variability matrix, and the vector w_u is the M-dimensional i-vector for the input speech data u. In other words, the i-vector represents the characteristics of the speaker contained in each piece of input speech data u as a dimensionally compressed “difference” from the average speaker (the UBM means) in the GMM supervector space.
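Equations (5) and (6) are not reproduced in this text. Based on the description above (a supervector formed by concatenating the C mean vectors, and an offset T w_u from it), they can be sketched in the form commonly used in the i-vector literature. This is a reconstruction under that assumption, not a copy of the original equations.

```latex
\begin{align}
m &= \left[\, m_1^{T},\; m_2^{T},\; \ldots,\; m_C^{T} \,\right]^{T} \in \mathbb{R}^{CD} \tag{5}\\
M_u &= m + T\, w_u, \qquad T \in \mathbb{R}^{CD \times M},\; w_u \in \mathbb{R}^{M} \tag{6}
\end{align}
```

Note that T denotes the total variability matrix when used as a factor and the transpose when used as a superscript.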

The concrete sequence of steps for extracting the vector w_u, that is, the i-vector, is described below. First, let γ_t(c) be the posterior probability that x_t was generated by the c-th Gaussian distribution of the UBM. The posterior probability γ_t(c) is obtained as in equation (7) below.
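The posterior computation can be sketched in plain Python as below. The sketch assumes the usual form γ_t(c) = π_c N(x_t; m_c, Σ_c) / Σ_i π_i N(x_t; m_i, Σ_i) with diagonal covariances, and evaluates it in the log domain for numerical stability. The function names and the toy two-component model are illustrative assumptions.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    s = 0.0
    for xi, mi, vi in zip(x, mean, var):
        s += math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi
    return -0.5 * s

def posteriors(x, weights, means, variances):
    """gamma_t(c) for one frame x, normalized with a log-sum-exp shift."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)                       # subtract the max before exponentiating
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy UBM with two components in two dimensions (hypothetical values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
gamma = posteriors([0.1, -0.2], weights, means, variances)
print(gamma)
```

A frame close to the first mean receives almost all of its posterior mass from the first component, while a frame equidistant from both means is split evenly.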

Using the posterior probabilities γ_t(c), the zeroth-order and first-order Baum-Welch statistics N_{u,c} and vector F_{u,c} for the input speech data u under the UBM can be written as in equations (8) and (9) below, respectively. The vector F_{u,c} is D-dimensional.
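Equations (8) and (9) are not reproduced in this text. The sketch below assumes the forms commonly used for i-vector extraction, N_{u,c} = Σ_t γ_t(c) and the mean-centered F_{u,c} = Σ_t γ_t(c)(x_t - m_c). The centering is an assumption, chosen because the summary of this description computes the first-order statistic from the feature vectors, the posteriors, and mean vectors. The function name and toy values are illustrative.

```python
def baum_welch_stats(frames, gammas, means):
    """Zeroth-order N[c] and centered first-order F[c] statistics."""
    C = len(means)
    D = len(means[0])
    N = [0.0] * C
    F = [[0.0] * D for _ in range(C)]
    for x, gamma in zip(frames, gammas):
        for c in range(C):
            N[c] += gamma[c]                              # soft frame count
            for d in range(D):
                F[c][d] += gamma[c] * (x[d] - means[c][d])
    return N, F

frames = [[1.0, 2.0], [3.0, 4.0]]      # L = 2 frames, D = 2
gammas = [[1.0, 0.0], [0.5, 0.5]]      # posteriors per frame (rows sum to 1)
means = [[0.0, 0.0], [1.0, 1.0]]       # component means m_c
N, F = baum_welch_stats(frames, gammas, means)
print(N, F)
```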

Further, using equations (8) and (9), the zeroth-order and first-order Baum-Welch statistics, matrix N_u and vector F_u, are defined as in equations (10) and (11) below. Here, the matrix N_u is a CD-dimensional × CD-dimensional matrix, and the vector F_u is a D-dimensional vector.

Here, the matrix I_D appearing in the diagonal blocks of equation (10) is the D-dimensional × D-dimensional identity matrix. Let the matrix Σ be a D-dimensional × D-dimensional diagonal matrix that models the residual variability not captured by the total variability matrix T. The procedures for calculating T and Σ are omitted here, but using the above, the i-vector w_u can be calculated as in equation (12) below, where I_M is the M-dimensional × M-dimensional identity matrix.
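Equations (10) to (12) are not reproduced in this text. In the standard i-vector formulation they take the form sketched below. This is a reconstruction under that assumption, not a copy of the original equations. In this standard form the stacked vector F_u has C·D components and Σ acts on the full CD-dimensional supervector space, whereas the text above gives both as D-dimensional; the dimensions stated in the text may refer to the per-component blocks.

```latex
\begin{align}
N_u &=
\begin{bmatrix}
N_{u,1} I_D & & 0 \\
 & \ddots & \\
0 & & N_{u,C} I_D
\end{bmatrix} \tag{10}\\
F_u &= \left[\, F_{u,1}^{T},\; F_{u,2}^{T},\; \ldots,\; F_{u,C}^{T} \,\right]^{T} \tag{11}\\
w_u &= \left( I_M + T^{T} \Sigma^{-1} N_u T \right)^{-1} T^{T} \Sigma^{-1} F_u \tag{12}
\end{align}
```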

The concrete sequence of steps for extracting the i-vector w_u given by equations (7) to (12) can be broadly divided into two procedures. The <first procedure> corresponds to equation (7): it calculates the posterior probability γ_t(c) that the feature vector x_t (t = 1, 2, ..., L) of each frame of the L-frame feature vector sequence X_u obtained from the input speech data u was generated by the c-th Gaussian distribution of the UBM. The <second procedure> calculates the i-vector w_u according to equations (8) to (12), using the posterior probabilities γ_t(c) calculated by equation (7).
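The last step of the second procedure can be sketched in plain Python as below. This is a minimal illustration, not the implementation of this description: it assumes the standard closed form w_u = (I_M + T^T Σ^{-1} N_u T)^{-1} T^T Σ^{-1} F_u, treats Σ as diagonal with one entry per supervector row, and takes T and Σ as given. The helper `solve` and all toy values are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):          # back substitution
        s = aug[i][n] - sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x

def ivector(N, F, T, sigma_diag):
    """Compute w from per-component stats.

    N: C zeroth-order counts; F: C centered first-order vectors (D each);
    T: CD x M matrix as a list of rows; sigma_diag: CD diagonal entries
    (assumed layout: row index = c * D + d).
    """
    D = len(F[0])
    M = len(T[0])
    L = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
    b = [0.0] * M
    for row, (t_row, s) in enumerate(zip(T, sigma_diag)):
        c, d = divmod(row, D)
        for i in range(M):
            b[i] += t_row[i] * F[c][d] / s              # T^T Sigma^-1 F
            for j in range(M):
                L[i][j] += N[c] * t_row[i] * t_row[j] / s   # T^T Sigma^-1 N T
    return solve(L, b)

# Toy case: C = 1, D = 2, M = 2.
w = ivector(N=[1.0], F=[[2.0, 4.0]],
            T=[[1.0, 0.0], [0.0, 1.0]], sigma_diag=[1.0, 1.0])
print(w)
```

In the toy case the precision matrix is 2·I and the right-hand side is [2, 4], so the solution is [1, 2]. A practical implementation would use a linear algebra library instead of the hand-written solver.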

Ideally, each Gaussian distribution in the UBM corresponds to a phoneme context, that is, phoneme information including dependencies on several preceding and following phonemes. The posterior probability γ_t(c) calculated in the <first procedure> of i-vector extraction therefore amounts to a probabilistic estimate of the phoneme context of the vector x_t. Calculating γ_t(c) accurately is indispensable for accurately calculating, in the <second procedure>, an i-vector w_u that represents the speaker's characteristics independently of the phoneme context, that is, of the utterance content of the input speech data u.

When speech recognition is performed in a real environment, noise and channel distortion are often added to the input speech data u. In that case, it becomes difficult to calculate the posterior probability γ_t(c) accurately in the <first procedure> of i-vector extraction, and as a result it becomes difficult to calculate the i-vector accurately in the <second procedure>. One conceivable way to solve this problem is to reduce the noise and channel distortion in the input speech data u, for example with some speech enhancement technique, and then carry out the series of i-vector extraction steps given by equations (7) to (12).

With this method, the posterior probability γ_t(c) can be calculated accurately in the <first procedure> of i-vector extraction. However, because the <second procedure> then operates on information from which noise and channel distortion have been reduced, information about noise and channel distortion is also lost from the resulting i-vector. This is a drawback when the i-vector is meant to include noise and channel distortion information in addition to speaker characteristics so that this information can be actively exploited in speech recognition.

[i-vector extraction according to the embodiment]
In view of the above, in the i-vector extraction procedure of the embodiment, (first requirement) in the <first procedure> of i-vector extraction, the noise and channel distortion contained in the input speech data u are reduced so that the posterior probability γ_t(c) is calculated accurately, and (second requirement) in the <second procedure> of i-vector extraction, the i-vector is calculated in a form that contains noise and channel distortion information in addition to speaker characteristics, that is, using the input speech data u that still contains the noise and channel distortion.

FIG. 4 is a diagram illustrating an example of the i-vector calculation device according to the embodiment. The i-vector calculation device 10 includes a first basic feature extraction unit 11A, a second basic feature extraction unit 11B, a ^γ_t(c) calculation unit 12, a -m_c calculation unit 13, a ^N_{u,c}, ^F_{u,c} calculation unit 14, and an i-vector calculation unit 15. These units are processing units that perform processing through the cooperation of a processing device such as a CPU (Central Processing Unit) and a temporary storage device such as a RAM (Random Access Memory), and they may be integrated or distributed as appropriate.

In the embodiment, the time series O of D-dimensional feature vectors over Q frames, extracted from a large amount of UBM training speech data from many unspecified speakers to which noise and channel distortion have been added, is defined as in equation (13) below. The feature vector time series O is stored in the noise-distorted speech feature storage unit 100A.

In addition, the time series ^O of D-dimensional feature vectors over Q frames, extracted from speech data obtained by applying a predetermined speech enhancement technique to reduce the noise and channel distortion in that same UBM training speech data, is defined as in equation (14) below. The feature vector time series ^O is stored in the noise/distortion-reduced speech feature storage unit 100B.

The sequence X_u of D-dimensional feature vectors over L frames extracted from the input speech data u, to which noise and channel distortion have been added, is defined as in equation (15) below. The first basic feature extraction unit 11A extracts the feature vector sequence X_u from the input speech data u according to equation (15).

The D-dimensional, L-frame feature vector time series ^X_u, obtained from input speech data ^u in which the noise and channel distortion of the input speech data u have been reduced with the predetermined speech enhancement technique described above, is defined as in equation (16) below. The second basic feature extraction unit 11B extracts the feature vector sequence ^X_u from the input speech data u according to equation (16). The likelihood p(^x_t | ^Ω) of each ^x_t (t = 1, 2, ..., L) under the UBM trained on the vector sequence ^O (hereinafter written ^UBM) can then be written as in equation (17) below.

Here, let c = 1, 2, ..., C be the index of the Gaussian distributions of the ^UBM (for example, C = 2048), and let the c-th Gaussian have mixture weight ^π_c, mean vector ^m_c, and diagonal covariance matrix ^Σ_c. The parameter set ^Ω of the ^UBM is then given by equation (18) below. The parameter set ^Ω is calculated from the feature vector time series ^O by the UBM learning device 200 and stored in the ^UBM storage unit 300.
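The images for equations (15) to (18) are not reproduced in this text. Under the definitions above, and assuming the usual Gaussian mixture form for the ^UBM, they would plausibly read as follows (the per-frame symbols x_t and ^x_t follow the surrounding text; the exact typesetting is an assumption):

```latex
\begin{align}
X_u &= \{ x_1, x_2, \ldots, x_L \}, \quad x_t \in \mathbb{R}^{D} \tag{15} \\
\hat{X}_u &= \{ \hat{x}_1, \hat{x}_2, \ldots, \hat{x}_L \}, \quad \hat{x}_t \in \mathbb{R}^{D} \tag{16} \\
p(\hat{x}_t \mid \hat{\Omega}) &= \sum_{c=1}^{C} \hat{\pi}_c \,
  \mathcal{N}(\hat{x}_t ; \hat{m}_c, \hat{\Sigma}_c) \tag{17} \\
\hat{\Omega} &= \{ \hat{\pi}_c, \hat{m}_c, \hat{\Sigma}_c \}_{c=1}^{C} \tag{18}
\end{align}
```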

Using the parameter set ^Ω of the ^UBM, the ^γ_t(c) calculation unit 12 calculates, as in equation (19) below, the posterior probability ^γ_t(c) that the feature vector ^x_t (t = 1, 2, ..., L) of each frame of ^X_u was generated by the c-th Gaussian distribution of the ^UBM.
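The posterior computation can be sketched as follows. This is a minimal illustration that assumes the standard Gaussian mixture posterior (component weight times component density, normalized over all components) with diagonal covariances; the function name `gmm_posteriors` and the array layout are hypothetical and not taken from the patent.

```python
import numpy as np

def gmm_posteriors(X_hat, weights, means, variances):
    """Posterior of each ^UBM component for each enhanced frame.

    X_hat:     (L, D) enhanced feature vectors ^x_t
    weights:   (C,)   mixture weights ^pi_c
    means:     (C, D) mean vectors ^m_c
    variances: (C, D) diagonals of the covariance matrices ^Sigma_c
    Returns an (L, C) array whose rows sum to 1.
    """
    X_hat = np.asarray(X_hat, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    diff = X_hat[:, None, :] - means[None, :, :]            # (L, C, D)
    # Log density of a diagonal-covariance Gaussian, per frame and component.
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss                 # (L, C)
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)
```

Working in the log domain avoids underflow when D is large or when a frame lies far from every component.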

Because the posterior probability ^γ_t(c) is calculated from the ^UBM, in which noise and channel distortion are reduced, and the vector sequence ^X_u, in which noise and channel distortion are likewise reduced, this constitutes the <first procedure> of i-vector extraction and satisfies the (first requirement) described above. Next, the ^N_{u,c}, ^F_{u,c} calculation unit 14 uses the posterior probability ^γ_t(c) to calculate the zeroth-order Baum-Welch statistic ^N_{u,c} and the first-order Baum-Welch statistic ^F_{u,c} for the input speech data u, as in equations (20) and (21) below, respectively. The vector ^F_{u,c} is a D-dimensional vector.

The point to note is that, in equation (21), the vector ^F_{u,c} is calculated using the feature vectors x_t (t = 1, 2, ..., L) of the input speech data u, which contains noise and channel distortion. Calculating ^F_{u,c} in this way means that the i-vector finally extracted retains information on the noise and channel distortion in addition to the speaker characteristics, so the (second requirement) described above is satisfied in the <second procedure>.
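The two statistics can be sketched as below. The sketch assumes that equation (20) sums the posteriors over frames and that equation (21) sums the posterior-weighted differences between the noisy frames x_t and the approximate means ¯m_c, which is how the surrounding text describes them; the function name `baum_welch_stats` is hypothetical.

```python
import numpy as np

def baum_welch_stats(X, post_hat, m_bar):
    """Zeroth- and first-order Baum-Welch statistics for one utterance.

    X:        (L, D) noisy feature vectors x_t (not the enhanced ones)
    post_hat: (L, C) posteriors ^gamma_t(c) computed from the enhanced frames
    m_bar:    (C, D) approximate means (m_c with an overbar)
    Returns N of shape (C,) and F of shape (C, D).
    """
    X = np.asarray(X, dtype=float)
    post_hat = np.asarray(post_hat, dtype=float)
    m_bar = np.asarray(m_bar, dtype=float)

    # Assumed eq. (20): N_c = sum_t gamma_t(c)
    N = post_hat.sum(axis=0)
    # Assumed eq. (21): F_c = sum_t gamma_t(c) * (x_t - m_bar_c)
    F = post_hat.T @ X - N[:, None] * m_bar
    return N, F
```

The alignment (the posteriors) comes from the enhanced signal, while the accumulated vectors come from the noisy signal, which is the combination the two requirements call for.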

The ¯m_c calculation unit 13 calculates ¯m_c in equation (21) as in equation (22) below, using the posterior probability ^γ_t(c) and the D-dimensional, Q-frame feature vector time series O obtained from the UBM training speech data and defined by equation (13).

The reason is as follows. Even if a UBM were trained on the feature vector time series O, the Gaussian indices of that UBM could not be put into correspondence with the Gaussian indices of the ^UBM, so the mean vector m_c of the UBM Gaussian with index c cannot simply be used. The UBM and the ^UBM are separate models: the Gaussians that make up one have no relation to the Gaussians that make up the other, and neither do their indices. In other words, even when the Gaussian index in the ^UBM is known, it differs from the index in the UBM and the corresponding UBM index cannot be determined, so the c-th mean vector m_c that should be subtracted from the feature vector x_t (t = 1, 2, ..., L) cannot be obtained. To solve this problem, the Gaussian indices of the ^UBM are used to compute an approximation ¯m_c of the c-th mean vector m_c according to equation (22).
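One way to read equation (22), whose image is not reproduced here, is as a posterior-weighted average of the noisy training frames, with the weights taken from the ^UBM evaluated on the corresponding enhanced training frames. The sketch below follows that reading; the function name `approximate_means`, the small `eps` guard, and the assumption that the posteriors are computed frame by frame on ^O are all illustrative choices rather than details confirmed by the text.

```python
import numpy as np

def approximate_means(O, post_hat_O, eps=1e-10):
    """Approximate noisy-domain means aligned with the ^UBM component indices.

    O:          (Q, D) noisy training feature vectors o_t
    post_hat_O: (Q, C) ^UBM posteriors for the matching enhanced frames ^o_t
    Returns a (C, D) array of approximate means (m_c with an overbar).
    """
    O = np.asarray(O, dtype=float)
    post_hat_O = np.asarray(post_hat_O, dtype=float)

    occupancy = post_hat_O.sum(axis=0)             # (C,) soft frame counts
    weighted_sum = post_hat_O.T @ O                # (C, D)
    # eps avoids division by zero for components that attract no frames.
    return weighted_sum / np.maximum(occupancy, eps)[:, None]
```

Because the weights come from the ^UBM, component c of the result refers to the same Gaussian index c as the posteriors used for the utterance statistics, which is the index correspondence that a separately trained UBM cannot provide.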

Finally, the i-vector calculation unit 15 calculates the i-vector w_u according to equations (23), (24), and (25) below.
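Equations (23) to (25) are not reproduced in this text, so their exact form cannot be confirmed here. As a hedged illustration only, the sketch below uses the widely known i-vector point estimate, w = (I + Σ_c N_c T_cᵀ Σ_c⁻¹ T_c)⁻¹ Σ_c T_cᵀ Σ_c⁻¹ F_c, with a total variability matrix T and diagonal covariances. The names `extract_ivector`, `T`, and `variances`, and the choice of this particular formula, are assumptions.

```python
import numpy as np

def extract_ivector(N, F, T, variances):
    """Standard i-vector point estimate from Baum-Welch statistics.

    N:         (C,)      zeroth-order statistics
    F:         (C, D)    centered first-order statistics
    T:         (C, D, M) total variability matrix, one (D, M) block per component
    variances: (C, D)    diagonal covariances used to whiten the statistics
    Returns the M-dimensional i-vector.
    """
    N = np.asarray(N, dtype=float)
    F = np.asarray(F, dtype=float)
    T = np.asarray(T, dtype=float)
    variances = np.asarray(variances, dtype=float)

    M = T.shape[2]
    T_scaled = T / variances[:, :, None]                   # Sigma_c^{-1} T_c
    # Posterior precision: I + sum_c N_c T_c^T Sigma_c^{-1} T_c
    precision = np.eye(M) + np.einsum('c,cdm,cdn->mn', N, T_scaled, T)
    # Right-hand side: sum_c T_c^T Sigma_c^{-1} F_c
    rhs = np.einsum('cdm,cd->m', T_scaled, F)
    return np.linalg.solve(precision, rhs)
```

Solving the linear system directly is generally preferable to forming the inverse explicitly.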

[I-vector extraction processing according to the embodiment]
FIG. 5 is a flowchart illustrating an example of the i-vector extraction processing according to the embodiment. First, the first basic feature extraction unit 11A of the i-vector calculation apparatus 10 extracts the feature vector sequence X_u (the first basic features) from the input speech data u according to equation (15) above (step S11). Next, the second basic feature extraction unit 11B extracts the feature vector sequence ^X_u (the second basic features) from the input speech data u according to equation (16) above (step S12). Steps S11 and S12 may be executed in either order or simultaneously.

Next, using the parameter set ^Ω of the ^UBM and the feature vector sequence ^X_u, the ^γ_t(c) calculation unit 12 calculates, as in equation (19) above, the posterior probability ^γ_t(c) that the feature vector ^x_t (t = 1, 2, ..., L) of each frame of ^X_u was generated by the c-th Gaussian distribution of the ^UBM (step S13).

Next, the ¯m_c calculation unit 13 calculates ¯m_c in equation (21) above as in equation (22) above, using the posterior probability ^γ_t(c) and the D-dimensional, Q-frame feature vector time series O obtained from the UBM training speech data and defined by equation (13) above (step S14).

Next, from the feature vector sequence X_u, the posterior probability ^γ_t(c), and ¯m_c, the ^N_{u,c}, ^F_{u,c} calculation unit 14 calculates the zeroth-order Baum-Welch statistic ^N_{u,c} and the first-order Baum-Welch statistic ^F_{u,c} for the input speech data u, as in equations (20) and (21) above, respectively (step S15).

Next, the i-vector calculation unit 15 calculates the i-vector w_u according to equations (23), (24), and (25) above (step S16). The i-vector calculation apparatus 10 outputs the i-vector w_u calculated in step S16. The i-vector w_u can be applied, for example as shown in FIG. 2, to a speech recognition apparatus in which an extended feature vector formed by concatenating the basic feature vector and the i-vector w_u is input to, for example, a DNN-HMM acoustic model, and speech recognition is performed using the resulting HMM-state posterior probability vector.
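Forming the extended feature vector amounts to appending the same utterance-level i-vector to every frame of basic features. A minimal sketch, in which the function name `extend_features` and the frame-by-row layout are assumptions:

```python
import numpy as np

def extend_features(X, w):
    """Append the utterance-level i-vector w to every frame of X.

    X: (L, D) basic feature vectors, one row per frame
    w: (M,)   i-vector for the whole utterance
    Returns an (L, D + M) array to be fed to the acoustic model.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    # Repeat w once per frame, then concatenate along the feature axis.
    return np.hstack([X, np.tile(w, (X.shape[0], 1))])
```

Since w is constant within an utterance, the acoustic model receives the same speaker and environment descriptor alongside every frame.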

Any speech enhancement technique can be applied as the method of reducing noise and channel distortion in the i-vector extraction procedure of the embodiment above. Various speech enhancement techniques are described in detail in, for example, Non-Patent Document 8. Alternatively, as a method of reducing the influence of noise and channel distortion, a technique using bottleneck features obtained from a DNN-HMM acoustic model may be used in place of speech enhancement. Bottleneck features are described in detail in, for example, Non-Patent Document 2.

The mixture distribution model trained on the feature vector time series ^O is not limited to a GMM-based UBM and may instead be an HMM.

[Evaluation experiment]
The conventional techniques compared against the embodiment are those described in Non-Patent Documents 4 and 5. (Table 1) and (Table 2) below show the results of evaluation experiments in which the i-vector calculated by the i-vector calculation apparatus 10 of the embodiment was fed to a DNN acoustic model. The percentages in each table are word error rates (WER).
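For reference, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the recognized word sequence and the reference, divided by the number of reference words. The sketch below illustrates that conventional definition; the function name `word_error_rate` and whitespace tokenization are assumptions, and the scoring tool actually used in the experiments is not stated in the text.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate between two whitespace-tokenized strings."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Multiplying the returned value by 100 gives a percentage of the kind reported in the tables.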

In (Table 1), the left side of the "+" symbol indicates the type of feature used in the <first procedure> of i-vector extraction, and the right side indicates the type of feature used in the <second procedure> of i-vector extraction. "noisy MFCC" denotes MFCCs computed from the noisy speech, "Bottleneck" denotes bottleneck features, and "VTS enhanced" denotes features enhanced by vector Taylor series (VTS) expansion.

(Table 1) shows that every combination reduced the WER relative to the baseline DNN.

(Table 2) shows the WER for different sizes of the bottleneck features used to train the ^UBM mixture distribution model during i-vector extraction. (Table 2) shows that the WER was lower than that of the baseline DNN at every size.

All or any part of each process performed in the i-vector calculation apparatus 10, and in a speech recognition apparatus including the i-vector calculation apparatus 10, may be realized by a processing device such as a CPU and a program that is analyzed and executed by that processing device. Each of these processes may also be realized as hardware using wired logic.

Of the processes described in the embodiment, all or part of those described as being performed automatically may be performed manually. Conversely, all or part of those described as being performed manually may be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters described above and shown in the drawings may be changed as appropriate unless otherwise specified.

(About the program)
FIG. 6 is a diagram illustrating an example of a computer that, by executing a program, realizes the i-vector calculation apparatus according to the embodiment and a speech recognition apparatus including the i-vector calculation apparatus. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. In the computer 1000, these components are connected by a bus 1080.

The memory 1010 includes a ROM 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS. The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. The serial port interface 1050 is connected to, for example, a mouse 1051 and a keyboard 1052. The video adapter 1060 is connected to, for example, a display 1061.

The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the i-vector calculation apparatus 10 and of the speech recognition apparatus including the i-vector calculation apparatus 10 is stored, for example in the hard disk drive 1031, as the program module 1093 in which the instructions executed by the computer 1000 are written. For example, a program module 1093 for executing information processing equivalent to the functional configuration of the i-vector calculation apparatus 10 and of the speech recognition apparatus including it is stored in the hard disk drive 1031.

The setting data used in the processing of the embodiment is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as necessary and executes them.

The program module 1093 and the program data 1094 need not be stored in the hard disk drive 1031. They may, for example, be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read by the CPU 1020 via the network interface 1070.

The embodiment is included in the technology disclosed in the present application and, likewise, falls within the invention described in the claims and its equivalents.

10 i-vector calculation apparatus
11A first basic feature extraction unit
11B second basic feature extraction unit
12 ^γ_t(c) calculation unit
13 ¯m_c calculation unit
14 ^N_{u,c}, ^F_{u,c} calculation unit
15 i-vector calculation unit
1000 computer
1010 memory
1020 CPU

Claims

A feature vector calculation apparatus comprising:
a first feature extraction unit that extracts a first feature vector from input speech;
a second feature extraction unit that extracts a second feature vector from speech obtained by applying noise or channel distortion reduction processing to the input speech;
a posterior probability calculation unit that calculates, based on parameters of a mixture distribution model trained on speech obtained by applying the reduction processing to speech containing noise or distortion, a posterior probability indicating the probability that the second feature vector corresponds to each distribution of the mixture distribution model;
a mean vector calculation unit that calculates a mean vector for each distribution of the mixture distribution model from the speech containing noise or distortion and the posterior probability;
a statistic calculation unit that calculates a zeroth-order Baum-Welch statistic and a first-order Baum-Welch statistic for the input speech from the first feature vector, the posterior probability, and the mean vector; and
a feature vector calculation unit that calculates a feature vector from the zeroth-order Baum-Welch statistic and the first-order Baum-Welch statistic.

The feature vector calculation apparatus according to claim 1, wherein the reduction processing is speech enhancement processing.

The feature vector calculation apparatus according to claim 1, wherein the reduction processing is processing that uses bottleneck features.

The feature vector calculation apparatus according to claim 1, further comprising a speech recognition processing unit that performs speech recognition processing on the input speech by inputting, to a predetermined acoustic model, an extended feature vector obtained by concatenating the first feature vector and the feature vector calculated by the feature vector calculation unit.

A feature vector calculation method executed by a feature vector calculation apparatus, the method comprising:
a first feature extraction step of extracting a first feature vector from input speech;
a second feature extraction step of extracting a second feature vector from speech obtained by applying noise or channel distortion reduction processing to the input speech;
a posterior probability calculation step of calculating, based on parameters of a mixture distribution model trained on speech obtained by applying the reduction processing to speech containing noise or distortion, a posterior probability indicating the probability that the second feature vector corresponds to each distribution of the mixture distribution model;
a mean vector calculation step of calculating a mean vector for each distribution of the mixture distribution model from the speech containing noise or distortion and the posterior probability;
a statistic calculation step of calculating a zeroth-order Baum-Welch statistic and a first-order Baum-Welch statistic for the input speech from the first feature vector, the posterior probability, and the mean vector; and
a feature vector calculation step of calculating a feature vector from the zeroth-order Baum-Welch statistic and the first-order Baum-Welch statistic.

A feature vector calculation program for causing a computer to function as the feature vector calculation apparatus according to claim 1, 3, or 4.