GB2526291A - Speech analysis - Google Patents
- Publication number
- GB2526291A (application GB1408878.5A / GB201408878A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- speech signal
- signal
- speech
- spectral envelope
- new components
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/08—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
- G10L19/093—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters using sinusoidal excitation models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/06—Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
Abstract
The spectral envelope of a speech signal is estimated (via e.g. Linear Prediction, Discrete All-Pole, Minimum Variance Distortionless Response, True Envelope, Cubic Spline Interpolation or Mel Frequency Cepstrum Coefficient methods) after modifying the signal by generating new components at non-harmonic frequencies (i.e. frequencies which are not an integer multiple of the fundamental) which comprise the vectorial sum of the input signal's harmonic components. One way of generating interharmonic components is by modulating the speech signal with a cosine function and an exponential or polynomial weighting W, and obtaining a zero-phase version of this signal.
Description
Speech Analysis
FIELD
Embodiments of the present invention as generally described herein relate to a speech signal analysis system and method.
BACKGROUND
Speech analysis systems receive audio speech, or audio speech files, as input and extract parameters of the speech. One example of such a parameter is the spectral envelope of the speech.
Spectral envelope estimation of speech signals is used in a wide variety of applications such as text to speech synthesis and speech modelling/coding.
There is a continuing need to improve the accuracy of spectral envelope estimation.
BRIEF DESCRIPTION OF THE FIGURES
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures, in which:
Figure 1 is a system for estimating the spectral envelope of a speech signal according to an embodiment;
Figure 2 is a flow chart showing a method of estimating the spectral envelope of a speech signal according to an embodiment;
Figure 3 is a flow chart showing another method of estimating the spectral envelope of a speech signal according to an embodiment; and
Figure 4 shows an example of a speech spectrum and a corresponding envelope estimation according to an embodiment.
DETAILED DESCRIPTION
In an embodiment, a method of estimating the spectral envelope of a speech signal is provided, said method comprising: inputting said speech signal; modifying said speech signal; and estimating the spectral envelope of said modified speech signal using an envelope estimation model, wherein modifying said speech signal comprises generating new components of the signal at non-harmonic frequencies, wherein said new components of the signal comprise the vectorial sum of harmonic components of the input speech signal.
The spectral envelope is the smooth function passing through the prominent peaks of the frequency spectrum of the speech signal. In an embodiment, any suitable envelope estimation model may be employed. In an embodiment, the envelope estimation model is selected from one of Linear Prediction (LP), the Discrete All-Pole method (DAP), the Minimum Variance Distortionless Response method (MVDR), the True Envelope method (TE), cubic spline interpolation (CSI) between the harmonics, and Mel Frequency Cepstrum Coefficients (MFCC).
A harmonic component of a speech signal is a component of the signal that has a frequency equal to a harmonic frequency. Equivalently, the harmonic component of a speech signal is the component of the signal that has a frequency which is an integer multiple of the fundamental frequency. Non-harmonic frequencies (or inter-harmonic frequencies) are frequencies not equal to the harmonic frequencies of the original speech signal. Equivalently, non-harmonic frequencies are frequencies which are not integer multiples of the fundamental frequency of the input speech signal. In an embodiment, a non-harmonic frequency may be lower than the fundamental frequency. In an embodiment, new components are generated at frequencies intermediate between the harmonics of the original speech signal. In an embodiment, more than one new component is generated at frequencies between two harmonics of the original speech signal. In an embodiment, the frequency of the lowest inter-harmonic component is lower than the fundamental frequency of the original speech signal.
In an embodiment, the number of new components between each pair of adjacent harmonics can be varied. In an embodiment, the number of new components is selected by a user via a user interface.
In an embodiment, generating new components of the signal at non-harmonic frequencies comprises modulating the speech signal with a periodic function. In an embodiment, generating new components of the signal at non-harmonic frequencies comprises multiplying the speech signal with a periodic function. In a further embodiment, the periodic function is a cosine function. In yet a further embodiment, the cosine function is $\cos\!\left(\frac{\omega_0}{I}n\right)$, where n is the discrete time index, I is an integer greater than or equal to 2, and $\omega_0$ is the fundamental frequency of the input speech signal.
In an embodiment, modifying the speech signal comprises adding the modulated speech signal to the input speech signal. In a further embodiment, modifying the speech signal comprises transforming the input speech signal s(n) into the modified speech signal y(n) using the following equation:

$$y(n) = s(n)\left[1 + \sum_{i=1}^{I-1} 2\,W\!\left(\frac{i}{I}\right)\cos\!\left(\frac{i\,\omega_0}{I}\,n\right)\right]$$

where n is the discrete time index, W(·) is a weighting function, I is an integer greater than or equal to 2, and $\omega_0$ is the fundamental frequency of the input speech signal.
In an embodiment, I and W are selected by the user. In an embodiment, I−1 new components are generated in the speech signal.
In an embodiment, W(·) has the following properties: W(0) = 1; W(1−x) = 1−W(x); and W(x) is monotonically decreasing with x increasing in the interval [0, 1]. In an embodiment, W(·) is a polynomial function. In another embodiment, W(·) is an exponential function. In yet another embodiment, W(x) = 1−x.
In an embodiment, modifying said speech signal further comprises determining the zero-phase version of the speech signal. In an embodiment, the zero-phase version of the speech signal is determined before modification of the signal occurs. In an embodiment, the zero-phase version of the signal itself is modified. In an embodiment, obtaining the zero-phase version of the speech signal comprises obtaining the inverse Fourier transform of the amplitude of the Fourier transform of the speech signal.
In an embodiment, a system for estimating the spectral envelope of a speech signal is provided, said system comprising: an input for receiving input speech; and a processor configured to: modify said speech signal; and estimate the spectral envelope of said modified speech signal using an envelope estimation model, wherein modifying said speech signal comprises generating new components of the signal at non-harmonic frequencies, wherein said new components of the signal comprise the vectorial sum of harmonic components of the input speech signal.
Methods in accordance with embodiments of the present invention can be implemented either in hardware or in software on a general purpose computer. Further, methods in accordance with embodiments of the present invention can be implemented in a combination of hardware and software.
Methods in accordance with embodiments of the present invention can also be implemented by a single processing apparatus or a distributed network of processing apparatuses.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
The source-filter model of speech production assumes that speech waveforms can be modelled as a white excitation signal filtered by a linear transfer function. The excitation signal performs a spectral subsampling of the filter transfer function. In that case, the spectral envelope (SE), defined as a smooth function passing through the prominent peaks of the frequency spectrum, is the transfer function of the filter. Note that such an envelope contains not only the contribution of the vocal tract response, but also of the glottal flow.
SE estimation refers to the task of estimating the filter transfer function from the speech signal.
As a fundamental speech analysis problem, it finds interest in almost all voice technology applications, such as speech synthesis, speech recognition or speaker identification.
In unvoiced sounds (no vibration of the vocal cords), the source is assumed to be a white noise whose amplitude spectrum is flat. In voiced sounds (vibration of the vocal cords), the source is assumed to be a quasi-periodic pulse train whose amplitude spectrum consists of peaks at the harmonics. The convolution of this excitation signal with the filter impulse response can therefore be seen as a sampling of the spectral envelope at integer multiples of the fundamental frequency $F_0$. In low-pitched voices, harmonics are close to each other, and therefore the resulting spectral sampling is sufficient to estimate the SE with a limited loss of information.
However, the more $F_0$ increases, the further apart the harmonics are from each other, and therefore the more the SE will be subsampled and the more difficult its accurate estimation becomes.
Figure 1 shows a spectral envelope estimation system 1. The speech analysis system 1 comprises a processor 3 which executes a program 5. The spectral envelope estimation system 1 may or may not further comprise storage 7. The storage 7 may store data which is used by program 5 to analyse speech. The spectral envelope estimation system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to a speech input 15. Speech input 15 receives speech to be analysed. The speech input 15 may be, for example, a microphone. Alternatively, speech input 15 may be a means for receiving speech data from an external storage medium or network.
Connected to the output module 13 is an output for spectral envelope data 17. The output 17 is used for outputting data relating to the spectral envelope estimated from the speech which is input into speech input 15. The output 17 may be, for example, a spectral envelope data file which may be sent to a storage medium, networked, etc. In use, the spectral envelope estimation system 1 receives speech through speech input 15. The program 5 executes on processor 3 and estimates the spectral envelope of the speech data. The processor may or may not use data stored in storage 7. The spectral envelope data is output via the output module 13 to data output 17.
Figure 2 shows a method of estimating the spectral envelope of a speech signal according to an embodiment.
In step S101 the speech signal s(n) is input, where n is the discrete time index. The speech signal may be input directly by a speaker via microphone or via a file containing pre-recorded speech data.
In step S103, a weight function W is selected. This function will be discussed in detail below.
The weight function may be selected manually by the user of the spectral envelope estimation system via a user interface. Alternatively, pre-determined values of the weight function may be stored in storage 7.
In step S105, a number of interharmonics I−1 is selected. This number will be discussed in detail below. The value of I may be selected manually by the user of the spectral envelope estimation system via a user interface. Alternatively, pre-determined values of I may be stored in storage 7.
In step S109, fast interharmonic reconstruction (FIHR) of the speech signal is performed. FIHR according to an embodiment will now be described.
In this embodiment, the speech signal s(n) is assumed to consist of only harmonics, i.e.

$$s(n) = \sum_{k=1}^{K} a_k \cos(k\omega_0 n + \phi_k) \qquad (1)$$

where $a_k$, $\phi_k$, $\omega_0$ and $K$, respectively, denote the amplitude and phase of harmonic k, the fundamental angular frequency, and the number of harmonics over the full band.
In one embodiment, the speech signal s(n) is modulated by a cosine of frequency $\omega_0/2$. This results in a shift of the harmonic components by $\pm\omega_0/2$ and in a reduction of their amplitude by a factor of 2. As a consequence, new components will appear at $(k + \tfrac{1}{2})\omega_0$ whose amplitude and phase result from the vectorial sum of the original components at $k\omega_0$ and $(k+1)\omega_0$. The modulated speech signal is then added to the original speech signal s(n). Thus, an interharmonic is artificially created between each pair of adjacent harmonics of the original signal.
In a further embodiment, multiple interharmonics are created between the existing harmonics of the speech signal. In its general form, FIHR transforms an original speech signal s(n) into a new signal y(n) using the following equation:

$$y(n) = s(n)\left[1 + \sum_{i=1}^{I-1} 2\,W\!\left(\frac{i}{I}\right)\cos\!\left(\frac{i\,\omega_0}{I}\,n\right)\right] \qquad (2)$$

where I−1 inter-harmonics are created, the number of which was selected in step S105 above.
W(·) is the weighting function selected in step S103 above. The resulting signal y(n) can be regarded as a periodic signal whose fundamental frequency is $F_0/I$. With the transformation described in Equation (2), FIHR can be regarded as a pre-processing step to lower the fundamental frequency before extracting the SE. As discussed above, accurate SE estimation of low-pitched signals is easier than SE estimation of high-pitched signals.
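By way of illustration, the transformation of Equation (2) can be sketched in a few lines of Python/NumPy. This is a minimal sketch rather than the patented implementation: the function name, the per-frame interface and the default linear weighting W(x) = 1 − x are illustrative assumptions.

```python
import numpy as np

def fihr(s, f0, fs, I, W=lambda x: 1.0 - x):
    """Sketch of fast interharmonic reconstruction, Equation (2).

    s  : one analysis frame of speech (1-D array)
    f0 : fundamental frequency of the frame, in Hz (assumed known)
    fs : sampling rate in Hz
    I  : integer >= 2, so that I - 1 interharmonics are created
    W  : weighting function on [0, 1]; the default W(x) = 1 - x is one
         of the choices named in the text
    """
    n = np.arange(len(s))
    w0 = 2.0 * np.pi * f0 / fs               # fundamental in radians/sample
    m = np.ones(len(s))
    for i in range(1, I):                    # the I - 1 modulation terms
        m += 2.0 * W(i / I) * np.cos(i * w0 * n / I)
    return s * m                             # y(n) = s(n) * [1 + sum(...)]
```

The modified frame y(n) can then be passed unchanged to any of the envelope estimators listed below.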
It can be easily shown that Equation (2) is equivalent to applying the following operation in the complex spectrum domain:

$$Y(\omega) = S(\omega) + \sum_{i=1}^{I-1} W\!\left(\frac{i}{I}\right)\left[S\!\left(\omega - \frac{i\,\omega_0}{I}\right) + S\!\left(\omega + \frac{i\,\omega_0}{I}\right)\right] \qquad (3)$$

In step S111, spectral envelope (SE) estimation is performed on the transformed speech signal y(n).
Any technique suitable for estimating the spectral envelope may be used on the transformed speech signal y(n) in step S111. For example, Linear Prediction (LP), the Discrete All-Pole method (DAP), the Minimum Variance Distortionless Response method (MVDR), the True Envelope method (TE), cubic spline interpolation (CSI) between the harmonics, and Mel Frequency Cepstrum Coefficients (MFCC) are all suitable techniques.
In an embodiment, linear prediction is employed. In this embodiment, the predicted value $\hat{y}(n)$ of the modified signal y(n) is given by

$$\hat{y}(n) = -\sum_{i=1}^{p} a_i\, y(n-i) \qquad (4)$$

where p is the order of the linear predictor and $a_i$ are the linear prediction coefficients. The error between the predicted value $\hat{y}(n)$ and the actual value y(n) is given as: $e(n) = y(n) - \hat{y}(n)$.
The root mean square criterion is used to estimate the linear prediction coefficients $a_i$.
Minimisation of the squared error $E[e^2(n)]$ (where $E[x]$ denotes the expectation value of x) yields the equation:

$$\sum_{i=1}^{p} a_i R(j-i) = -R(j) \qquad (5)$$

where $R(j) = E[y(n)\,y(n-j)]$ is the autocorrelation of the signal y(n). This equation is then solved for the $a_i$. In an embodiment it is solved using a Levinson-Durbin recursion.
The spectral envelope of the signal y(n) is then given as:

$$A(e^{j\omega}) = \frac{1}{\left|1 + \sum_{i=1}^{p} a_i e^{-j\omega i}\right|} \qquad (6)$$

where j is the square root of minus 1.
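As a rough illustration of Equations (4)–(6), the following Python/NumPy sketch computes the autocorrelation-method LP envelope of the modified frame. It is a sketch under stated assumptions, not the patented method: the normal equations of Equation (5) are solved here with SciPy's Toeplitz solver (which internally uses a Levinson-type recursion) rather than a hand-written Levinson-Durbin loop, and the biased autocorrelation estimate is an implementation choice.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def lp_envelope(y, p, n_freq=512):
    """LP spectral envelope of a (FIHR-modified) frame y, order p."""
    N = len(y)
    # Biased autocorrelation estimates R(0) .. R(p)
    r = np.array([np.dot(y[:N - j], y[j:]) for j in range(p + 1)]) / N
    # Normal equations (5): sum_i a_i R(j - i) = -R(j), for j = 1..p
    a = solve_toeplitz(r[:p], -r[1:p + 1])
    # Envelope (6): |1 / A(e^{jw})| with A(z) = 1 + sum_i a_i z^{-i}
    w, h = freqz([1.0], np.concatenate(([1.0], a)), worN=n_freq)
    return w, np.abs(h)
```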
In another embodiment, FIHR is applied to the zero-phase version of the speech signal. A flow diagram showing a method of speech envelope estimation according to this embodiment is shown in Figure 3.
In step S101, speech is input.
In step S1101, the speech signal is windowed using a window function. Any windowing function suitable for use with a Fourier transform may be employed.
In step S1103, the Fourier transform of the windowed signal is taken.
In step S1105, an inverse Fourier transform is applied to the amplitude of the signal obtained in step S1103. The output of the transform is the zero-phase version $s_{zp}(n)$ of the speech signal.
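A possible Python/NumPy sketch of steps S1101–S1105 is given below. The Hann window and the fftshift used to centre the (symmetric) zero-phase pulse are implementation choices assumed here, not dictated by the text.

```python
import numpy as np

def zero_phase(s):
    """Zero-phase version of a speech frame (steps S1101-S1105)."""
    frame = s * np.hanning(len(s))        # S1101: windowing (Hann assumed)
    mag = np.abs(np.fft.fft(frame))       # S1103: amplitude of the FFT
    szp = np.fft.ifft(mag).real           # S1105: inverse transform; the
                                          # result is real and even up to
                                          # numerical error
    return np.fft.fftshift(szp)           # centre the symmetric pulse
```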
In steps S103 and S105, the weight function and the number of interharmonics, respectively, are selected as described above in relation to Figure 2.
In step S109, fast interharmonic reconstruction of the zero-phase signal is performed. This process is the same as that described above in relation to Figure 2. However, in the case of the zero-phase signal $s_{zp}(n)$, the multiplication of $s_{zp}(n)$ by $\cos(\frac{\omega_0}{2}n)$ yields replicas at $(k + \tfrac{1}{2})\omega_0$ whose amplitude can be shown to be $\tfrac{1}{2}(a_k + a_{k+1})$. Thus, by employing the zero-phase version of the speech signal, as opposed to the unaltered speech signal, the amplitudes of the new components depend solely on the $a_k$ (and not on the $\phi_k$).
In this embodiment, applying FIHR to the zero-phase speech signal can be seen as a moving-average (MA) process on the amplitude spectrum of the speech signal. The coefficients of this MA process are determined by the weighting function W(·).
In step S111, envelope estimation is performed on the resulting spectrum, as discussed above in relation to Figure 2.
The amplitude spectrum of a soprano speech signal with fundamental frequency $F_0$ = 650 Hz is shown in Figure 4 (solid black line). The result of fast interharmonic reconstruction according to an embodiment applied to the zero-phase speech signal $s_{zp}(n)$ (referred to as the FIHR-ZP technique), using $F_0^*$ = 130 Hz (I = 5), is also shown for comparison (solid blue line).
The subsequent autocorrelation-based LP envelope (order p = 18, sampling frequency $F_s$ = 16 kHz) is also shown. The dotted red line indicates the result of LP on the original, unreconstructed spectrum and the dotted green line indicates the result of LP on the reconstructed zero-phase spectrum. The effect of Fast Interharmonic Reconstruction on the estimated spectral envelope can therefore be observed.
According to the above embodiments, FIHR is characterized by three features: i) the use of either s(n) or $s_{zp}(n)$ in Equation (2); ii) the target fundamental frequency $F_0^* = F_0/I$ (or, equivalently, the number of inter-harmonics I−1); and iii) the weighting function W(·).
The choice between s(n) and $s_{zp}(n)$ results from a trade-off between accuracy and computational load. Obtaining $s_{zp}(n)$ requires the computation of an FFT and an IFFT, but guarantees that, contrary to s(n), Equation (3) is an MA process on the amplitude spectrum and is independent of the phase of the signal. The use of $s_{zp}(n)$ therefore provides improved accuracy (especially if the phase spectrum varies rapidly across successive harmonics).
It can be shown that the number of inter-harmonics I−1 created using methods according to the above embodiments can be calculated as $\lceil F_0/F_0^* \rceil - 1$. In this embodiment, therefore, inter-harmonics will be created only if $F_0 > F_0^*$. In an embodiment, $F_0^*$ is chosen to be below 300 Hz. In a further embodiment, $F_0^*$ = 100 Hz, which corresponds to that of a low-pitched male voice.
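Under this reading of the (partly garbled) formula above, choosing I for a given frame reduces to a one-line ceiling rule. The sketch below is one interpretation only; the 100 Hz default is taken from the embodiment just described.

```python
import math

def choose_I(f0, f0_target=100.0):
    """Number of modulation terms I for a target fundamental F0* (Hz).

    I - 1 interharmonics are created; when F0 <= F0* this returns I = 1,
    i.e. the sum in Equation (2) is empty and the signal is unchanged,
    consistent with the condition F0 > F0* in the text.
    """
    return max(1, math.ceil(f0 / f0_target))
```

For the soprano example above, choose_I(650.0, 130.0) returns I = 5, matching the Figure 4 setting.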
The last feature of FIHR is the weighting function W(·). In an embodiment, W(·) has the following properties: W(0) = 1; W(1−x) = 1−W(x); and W(x) is monotonically decreasing with x increasing in the interval [0, 1]. In an embodiment, W(·) is a polynomial function. In another embodiment, W(·) is an exponential function. In yet another embodiment, W(x) = 1−x.
The methods of estimating the spectral envelope according to the above embodiments can be shown to enable improved accuracy of estimation, particularly for high-pitched voices. Further, this is achieved without significantly increasing computation time. The technique can also be shown to be robust to errors in the determination of the fundamental frequency $F_0$.

Methods and systems according to embodiments described above may be used in the training of text to speech systems. In such training, the spectral envelope is estimated from the speech of training speakers and stored for use in the text to speech system.
In one example of a text to speech system in which stored spectral envelope information may be employed, speech is synthesised by passing a train of Dirac pulses through an all-pole filter. The stored spectral envelope data provides the magnitude spectrum of the all-pole filter. The filtered train of pulses is then output as synthesised speech.
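A toy version of this synthesis path, assuming the stored envelope has already been converted to all-pole coefficients [1, a_1, ..., a_p], might look as follows; the function name and the rounding of the pitch period are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(a, f0, fs, duration):
    """Pass a train of Dirac pulses through the all-pole filter 1/A(z).

    a : filter denominator [1, a_1, ..., a_p] derived from the stored
        spectral envelope (assumed precomputed elsewhere)
    """
    n = int(duration * fs)
    pulses = np.zeros(n)
    period = max(1, int(round(fs / f0)))   # pitch period in samples
    pulses[::period] = 1.0                  # train of Dirac pulses
    return lfilter([1.0], a, pulses)        # all-pole filtering
```

The same sketch covers the pitch-modification example below: re-running it with a higher f0 while keeping the envelope-derived coefficients fixed raises the pitch without altering the envelope.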
The above embodiments may also be employed in speech modification systems. For example, the pitch of the speech of a particular speaker may be increased by estimating the spectral envelope from their unmodified speech and applying the spectral envelope as a filter for an excitation signal with a higher fundamental frequency.
The estimation of the spectral envelope according to the above embodiments may also be employed in speech recognition. For example, different vowels correspond to different spectral envelopes. Thus an estimation of the spectral envelope can aid in determining which vowel is being spoken.
Further, the spectral envelope varies with the vocal tract, and therefore estimation of the spectral envelope according to embodiments described above may be used for speaker identification.
EXPERIMENTAL RESULTS
Table 1 shows the Relative Computation Time (RCT) for a number of spectral envelope estimation techniques: standard LP, and the Mel Frequency Cepstral Coefficient (MFCC), Discrete All-Pole (DAP), True Envelope (TE), and Cubic Spline Interpolation (CSI) between the harmonics techniques. RCT is defined as the ratio of the computation time to the sound duration. In addition, Table 1 shows the RCT for the interharmonic reconstruction of the speech signal by FIHR and FIHR-ZP according to embodiments described above. Note that the RCT for FIHR and FIHR-ZP is given for the pre-processing only and does not include the envelope estimation itself. All implementations were in Matlab and were run on an Intel Core i7 3.0 GHz CPU with 16 GB of RAM. Thanks to their simplicity, FIHR and FIHR-ZP are very fast relative to the time required for envelope estimation.
LPC | MFCC | DAP | TE | CSI | FIHR | FIHR-ZP
---|---|---|---|---|---|---
1.6 | 5.48 | 25.2 | 39.6 | 13.1 | 0.36 | 1.75

Table 1: Relative Computation Time (RCT) for each technique.
Tables 2 and 3 give the spectral distortion (SD) between the spectral envelope estimated from synthetic signals (created by passing a train of Dirac pulses through an all-pole filter) and the magnitude spectrum of the all-pole filter.
The various synthesis parameters were varied as follows: $F_0$ ranged from 100 to 1000 Hz, in steps of 100 Hz. The lowest-frequency formant F1 ranged from 100 to 900 Hz, in steps of 200 Hz; F2 and F3 were varied in the same way as F1, but were located respectively 1 and 2 kHz higher; the radius for these 3 pairs of poles ranged from 0.95 to 0.99 in steps of 0.02. The sampling rate $F_s$ was set to 16 kHz. The whole dataset therefore consisted of 33750 artificial signals. The Spectral Distortion (SD) is defined as the root mean-square error of the log-amplitude spectra. Six techniques of SE estimation were compared: i) traditional LP with order $F_s/1000 + 2$; ii) LP with optimal order $0.25\,F_s/F_0$ (hereafter noted LPC*); iii) MFCC with cepstral order $0.25\,F_s/F_0$; iv) DAP with optimal order $0.4\,F_s/F_0$; v) TE with optimal cepstral order $0.5\,F_s/F_0$; and vi) CSI. For each of these 6 methods, three types of pre-processing are investigated: no pre-processing (the conventional approach), FIHR and FIHR-ZP.
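For concreteness, the SD metric used throughout the tables (the root mean-square error of the log-amplitude spectra) can be sketched as follows; the epsilon guard against a log of zero is an implementation detail assumed here, not part of the definition.

```python
import numpy as np

def spectral_distortion(env_est, env_ref, eps=1e-12):
    """SD in dB: RMS error between two log-amplitude spectra."""
    d = 20.0 * np.log10(np.maximum(env_est, eps) / np.maximum(env_ref, eps))
    return float(np.sqrt(np.mean(d ** 2)))
```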
The results are displayed in Tables 2 and 3, respectively, for lower and higher values of $F_0$. Note that SD increases for high-pitched voices as, for the reasons mentioned above, estimation of the spectral envelope is subject to more distortion at these frequencies.
From the tables, it can be seen that, in most cases, FIHR or FIHR-ZP pre-processing decreases the spectral distortion of the synthetic signal, especially for high-pitched signals. The reduction in spectral distortion is significant for simple, low-computational-load techniques such as LPC.
Pre-processing | LPC | LPC* | MFCC | DAP | TE | CSI
---|---|---|---|---|---|---
None | 1.56 | 2.03 | 2.98 | 0.80 | 1.77 | 1.30
FIHR | 1.81 | 1.78 | 2.17 | 2.47 | 2.21 | 1.66
FIHR-ZP | 1.55 | 1.62 | 1.64 | 1.86 | 1.8 | 1.53

Table 2: SD (in dB) for $F_0 \le 500$ Hz.
Pre-processing | LPC | LPC* | MFCC | DAP | TE | CSI
---|---|---|---|---|---|---
None | 8.03 | 11.48 | 5.81 | 6.83 | 4.40 | 3.73
FIHR | 4.15 | 4.31 | 4.46 | 4.59 | 4.62 | 4.16
FIHR-ZP | 3.97 | 3.98 | 3.97 | 4.01 | 4.01 | 3.96

Table 3: SD (in dB) for $F_0 > 500$ Hz.
A second experiment was carried out on real signals: the first 100 files uttered by the male speaker AWB (with an average pitch of about 140 Hz) of the CMU ARCTIC database were considered. These files were first analysed using a Harmonic Model (HM), and further resynthesized, keeping only the harmonics that are multiples of a factor Q (which implies that $F_0$ is therefore multiplied by Q). The resynthesized signal could therefore be regarded as a spectral subsampling of the original HM. The spectral envelope of the resynthesized signal was estimated using the techniques given below. SD was calculated using the amplitudes of the harmonics of the original HM as a reference.
The results are given in Tables 4 and 5 for lower and higher values of Q. Compared to Tables 2 and 3, SD is seen to be much larger, as the signals are much closer to real conditions.
The conclusions however remain similar: the use of FIHR-based pre-processing allows a substantial reduction of SD for most techniques, particularly for high-pitched voices.
It can be seen that once an FIHR-based pre-processing has been applied, all the SE estimation techniques provide similar results. This can be explained by the fact that FIHR lowers the pitch to a value of $F_0$ for which all SE estimation methods are known to work well (and consequently converge towards the same estimates).
Pre-processing | LPC | LPC* | MFCC | DAP | TE | CSI
---|---|---|---|---|---|---
None | 3.04 | 4.94 | 6.46 | 5.08 | 4.97 | 4.36
FIHR | 4.40 | 4.33 | 4.44 | 4.65 | 5.16 | 4.5
FIHR-ZP | 4.33 | 4.20 | 4.10 | 4.14 | 5.34 | 4.35

Table 4: SD (in dB) for $2 \le Q \le 4$.
Pre-processing | LPC | LPC* | MFCC | DAP | TE | CSI
---|---|---|---|---|---|---
None | 9.15 | 9.80 | 8.19 | 7.33 | 6.31 | 6.22
FIHR | 6.29 | 6.32 | 6.40 | 6.66 | 6.87 | 6.45
FIHR-ZP | 6.22 | 6.19 | 6.16 | 6.21 | 7.01 | 6.26

Table 5: SD (in dB) for $5 \le Q \le 7$.
Thus methods and systems according to embodiments described above may be used to increase the range of standard envelope estimation techniques while keeping the computational load to a minimum.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods, systems and carrier media described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Claims (12)
- CLAIMS: 1. A method of estimating the spectral envelope of a speech signal, said method comprising: inputting said speech signal; modifying said speech signal; and estimating the spectral envelope of said modified speech signal using an envelope estimation model, wherein modifying said speech signal comprises generating new components of the signal at non-harmonic frequencies, wherein said new components of the signal comprise the vectorial sum of harmonic components of the input speech signal.
- 2. The method of claim 1, wherein generating new components of the signal at non-harmonic frequencies comprises modulating the speech signal with a periodic function.
- 3. The method of claim 1, wherein generating new components of the signal at non-harmonic frequencies comprises modulating the speech signal with a cosine function.
- 4. The method of claim 3, wherein said cosine function is $\cos\!\left(\frac{\omega_0}{I}n\right)$, where n is the discrete time index, I is an integer greater than or equal to 2, and $\omega_0$ is the fundamental frequency of the input speech signal.
- 5. The method of claim 3, wherein generating new components of the signal at non-harmonic frequencies further comprises adding the modulated speech signal to the input speech signal.
- 6. The method of claim 1, wherein modifying said speech signal comprises transforming the input speech signal s(n) into the modified speech signal y(n) using the following equation:

$$y(n) = s(n)\left[1 + \sum_{i=1}^{I-1} 2\,W\!\left(\frac{i}{I}\right)\cos\!\left(\frac{i\,\omega_0}{I}\,n\right)\right]$$

where n is the discrete time index, W is a weighting function, I is an integer greater than or equal to 2, and $\omega_0$ is the fundamental frequency of the input speech signal.
- 7. The method of claim 6, wherein W is a polynomial function.
- 8. The method of claim 6, wherein W is an exponential function.
- 9. The method of claim 6, wherein W(x) = (1-x).
- 10. The method of claim 1, wherein modifying said speech signal further comprises determining the zero-phase version of the speech signal.
- 11. A system for estimating the spectral envelope of a speech signal, said system comprising: an input for receiving input speech; and a processor configured to: modify said speech signal; and estimate the spectral envelope of said modified speech signal using an envelope estimation model, wherein modifying said speech signal comprises generating new components of the signal at non-harmonic frequencies, wherein said new components of the signal comprise the vectorial sum of harmonic components of the input speech signal.
- 12. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1408878.5A GB2526291B (en) | 2014-05-19 | 2014-05-19 | Speech analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1408878.5A GB2526291B (en) | 2014-05-19 | 2014-05-19 | Speech analysis |
Publications (3)
Publication Number | Publication Date |
---|---|
GB201408878D0 GB201408878D0 (en) | 2014-07-02 |
GB2526291A true GB2526291A (en) | 2015-11-25 |
GB2526291B GB2526291B (en) | 2018-04-04 |
Family
ID=51135091
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1408878.5A Expired - Fee Related GB2526291B (en) | 2014-05-19 | 2014-05-19 | Speech analysis |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2526291B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030187635A1 (en) * | 2002-03-28 | 2003-10-02 | Ramabadran Tenkasi V. | Method for modeling speech harmonic magnitudes |
US20070288232A1 (en) * | 2006-04-04 | 2007-12-13 | Samsung Electronics Co., Ltd. | Method and apparatus for estimating harmonic information, spectral envelope information, and degree of voicing of speech signal |
Also Published As
Publication number | Publication date |
---|---|
GB2526291B (en) | 2018-04-04 |
GB201408878D0 (en) | 2014-07-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PCNP | Patent ceased through non-payment of renewal fee |
Effective date: 20230519 |