
GB2526291A - Speech analysis - Google Patents


Info

Publication number
GB2526291A
GB2526291A
Authority
GB
United Kingdom
Prior art keywords
speech signal
signal
speech
spectral envelope
new components
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1408878.5A
Other versions
GB2526291B (en)
GB201408878D0 (en)
Inventor
Thomas Drugman
Ioannis Stylianou
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1408878.5A priority Critical patent/GB2526291B/en
Publication of GB201408878D0 publication Critical patent/GB201408878D0/en
Publication of GB2526291A publication Critical patent/GB2526291A/en
Application granted granted Critical
Publication of GB2526291B publication Critical patent/GB2526291B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/093 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters using sinusoidal excitation models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/06 Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The spectral envelope of a speech signal is estimated (via e.g. Linear Prediction, Discrete All-Pole, Minimum Variance Distortionless Response, True Envelope, Cubic Spline Interpolation or Mel Frequency Cepstrum Coefficient methods) after modifying the signal by generating new components at non-harmonic frequencies (i.e. frequencies which are not an integer multiple of the fundamental), which comprise the vectorial sum of the input signal's harmonic components. One way of generating interharmonic components is by modulating the speech signal with a cosine function and an exponential or polynomial weighting W, and obtaining a zero-phase version of this signal.

Description

Speech Analysis
FIELD
Embodiments of the present invention as generally described herein relate to a speech signal analysis system and method.
BACKGROUND
Speech analysis systems are systems into which audio speech or audio speech files are input and from which parameters of the speech are extracted. One example of such a parameter is the spectral envelope of the speech.
Spectral envelope estimation of speech signals is used in a wide variety of applications such as text-to-speech synthesis and speech modelling/coding.
There is a continuing need to improve the accuracy of spectral envelope estimation.
BRIEF DESCRIPTION OF THE FIGURES
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures, in which:
Figure 1 is a system for estimating the spectral envelope of a speech signal according to an embodiment;
Figure 2 is a flow chart showing a method of estimating the spectral envelope of a speech signal according to an embodiment;
Figure 3 is a flow chart showing another method of estimating the spectral envelope of a speech signal according to an embodiment; and
Figure 4 shows an example of a speech spectrum and a corresponding envelope estimation according to an embodiment.
DETAILED DESCRIPTION
In an embodiment, a method of estimating the spectral envelope of a speech signal is provided, said method comprising: inputting said speech signal; modifying said speech signal; and estimating the spectral envelope of said modified speech signal using an envelope estimation model, wherein modifying said speech signal comprises generating new components of the signal at non-harmonic frequencies, wherein said new components of the signal comprise the vectorial sum of harmonic components of the input speech signal.
The spectral envelope is the smooth function passing through the prominent peaks of the frequency spectrum of the speech signal. In an embodiment, any suitable envelope estimation model may be employed. In an embodiment, the envelope estimation model is selected from one of Linear Prediction (LP), the Discrete All-Pole method (DAP), the Minimum Variance Distortionless Response method (MVDR), the True Envelope method (TE), cubic spline interpolation (CSI) between the harmonics, and Mel Frequency Cepstrum Coefficients (MFCC).
A harmonic component of a speech signal is a component of the signal that has a frequency equal to a harmonic frequency. Equivalently, a harmonic component of a speech signal is a component of the signal whose frequency is an integer multiple of the fundamental frequency. Non-harmonic frequencies (or inter-harmonic frequencies) are frequencies not equal to the harmonic frequencies of the original speech signal. Equivalently, non-harmonic frequencies are frequencies which are not integer multiples of the fundamental frequency of the input speech signal. In an embodiment, a non-harmonic frequency may be lower than the fundamental frequency. In an embodiment, new components are generated at frequencies intermediate between the harmonics of the original speech signal. In an embodiment, more than one new component is generated at frequencies between two harmonics of the original speech signal. In an embodiment, the frequency of the lowest inter-harmonic component is lower than the fundamental frequency of the original speech signal.
In an embodiment, the number of new components between each pair of harmonics can be varied. In an embodiment, the number of new components is selected by a user via a user interface.
In an embodiment, generating new components of the signal at non-harmonic frequencies comprises modulating the speech signal with a periodic function. In an embodiment, generating new components of the signal at non-harmonic frequencies comprises multiplying the speech signal with a periodic function. In a further embodiment, the periodic function is a cosine function. In yet a further embodiment, the cosine function is cos((ω0/I)·n), where n is the discrete time index, I is an integer greater than or equal to 2, and ω0 is the fundamental angular frequency of the input speech signal.
In an embodiment, modifying the speech signal comprises adding the modulated speech signal to the input speech signal. In a further embodiment, modifying the speech signal comprises transforming the input speech signal s into the modified speech signal y using the following equation:
y(n) = s(n)·[1 + 2·Σ_{i=1}^{I-1} W(i/I)·cos(i·ω0·n/I)]
where n is the discrete time index, W(·) is a weighting function, I is an integer greater than or equal to 2, and ω0 is the fundamental angular frequency of the input speech signal.
In an embodiment, I and W are selected by the user. In an embodiment, I - 1 new components are generated in the speech signal.
In an embodiment, W(·) has the following properties: W(0) = 1; W(1 - x) = 1 - W(x); and W(x) is monotonically decreasing as x increases in the interval [0, 1]. In an embodiment, W(·) is a polynomial function. In another embodiment, W(·) is an exponential function. In yet another embodiment, W(x) = 1 - x.
In an embodiment, modifying said speech signal further comprises determining the zero-phase version of the speech signal. In an embodiment, the zero-phase version of the speech signal is determined before modification of the signal occurs. In an embodiment, the zero-phase version of the signal itself is modified. In an embodiment, obtaining the zero-phase version of the speech signal comprises obtaining the inverse Fourier transform of the amplitude of the Fourier transform of the speech signal.
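By way of illustration only, such a zero-phase computation might be implemented as follows (a minimal Python sketch; the function name zero_phase is illustrative and the input frame is assumed to have already been windowed):

import numpy as np

def zero_phase(frame):
    # Fourier transform of the (windowed) frame.
    spectrum = np.fft.fft(frame)
    # Inverse Fourier transform of the amplitude of the Fourier transform;
    # discarding the phase leaves a real, even, zero-phase signal, and
    # .real merely removes residual numerical imaginary parts.
    return np.fft.ifft(np.abs(spectrum)).real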
In an embodiment, a system for estimating the spectral envelope of a speech signal is provided, said system comprising: an input for receiving input speech; and a processor configured to: modify said speech signal; and estimate the spectral envelope of said modified speech signal using an envelope estimation model, wherein modifying said speech signal comprises generating new components of the signal at non-harmonic frequencies, wherein said new components of the signal comprise the vectorial sum of harmonic components of the input speech signal.
Methods in accordance with embodiments of the present invention can be implemented either in hardware or in software on a general purpose computer. Further, methods in accordance with embodiments of the present invention can be implemented in a combination of hardware and software.
Methods in accordance with embodiments of the present invention can also be implemented by a single processing apparatus or a distributed network of processing apparatuses.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD-ROM, a magnetic device or a programmable memory device, or any transient medium such as a signal, e.g. an electrical, optical or microwave signal.
The source-filter model of speech production assumes that speech waveforms can be modelled as a white excitation signal filtered by a linear transfer function. The excitation signal performs a spectral subsampling of the filter transfer function. In that case, the spectral envelope (SE), defined as a smooth function passing through the prominent peaks of the frequency spectrum, is the transfer function of the filter. Note that such an envelope contains not only the contribution of the vocal tract response, but also that of the glottal flow.
SE estimation refers to the task of estimating the filter transfer function from the speech signal.
As a fundamental speech analysis problem, it finds interest in almost all voice technology applications, such as speech synthesis, speech recognition or speaker identification.
In unvoiced sounds (no vibration of the vocal cords), the source is assumed to be white noise whose amplitude spectrum is flat. In voiced sounds (vibration of the vocal cords), the source is assumed to be a quasi-periodic pulse train whose amplitude spectrum consists of peaks at the harmonics. The convolution of this excitation signal with the filter impulse response can therefore be seen as a sampling of the spectral envelope at integer multiples of the fundamental frequency F0. In low-pitched voices, harmonics are close to each other, and therefore the resulting spectral sampling is sufficient to estimate the SE with a limited loss of information.
However, the more F0 increases, the further apart the harmonics are from each other, and therefore the more the SE will be subsampled and the more difficult its accurate estimation.
Figure 1 shows a spectral envelope estimation system 1. The spectral envelope estimation system 1 comprises a processor 3 which executes a program 5. The spectral envelope estimation system 1 may optionally further comprise storage 7. The storage 7 may store data which is used by the program 5 to analyse speech. The spectral envelope estimation system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to a speech input 15. Speech input 15 receives speech to be analysed. The speech input 15 may be, for example, a microphone. Alternatively, speech input 15 may be a means for receiving speech data from an external storage medium or a network.
Connected to the output module 13 is an output for spectral envelope data 17. The output 17 is used for outputting data relating to the spectral envelope estimated from the speech which is input into speech input 15. The output 17 may be, for example, a spectral envelope data file which may be sent to a storage medium, networked, etc.
In use, the spectral envelope estimation system 1 receives speech through speech input 15. The program 5 executes on processor 3 and estimates the spectral envelope of the speech data. The processor may or may not use data stored in storage 7. The spectral envelope data is output via the output module 13 to data output 17.
Figure 2 shows a method of estimating the spectral envelope of a speech signal according to an embodiment.
In step S101 the speech signal s(n) is input, where n is the discrete time index. The speech signal may be input directly by a speaker via a microphone or via a file containing pre-recorded speech data.
In step S103, a weight function W is selected. This function will be discussed in detail below.
The weight function may be selected manually by the user of the spectral envelope estimation system via a user interface. Alternatively, pre-determined values of the weight function may be stored in storage 7.
In step S105, a number of interharmonics I - 1 is selected. This number will be discussed in detail below. The value of I may be selected manually by the user of the spectral envelope estimation system via a user interface. Alternatively, pre-determined values of I may be stored in storage 7.
In step S107, fast interharmonic reconstruction (FIHR) of the speech signal is performed. FIHR according to an embodiment will now be described.
In this embodiment, the speech signal s(n) is assumed to consist of only harmonics, i.e.
s(n) = Σ_{k=1}^{K} a_k·cos(k·ω0·n + φ_k)    (1)
where a_k, φ_k, ω0 and K respectively denote the amplitude and phase of harmonic k, the fundamental angular frequency and the number of harmonics over the full band.
In one embodiment, the speech signal s(n) is modulated by a cosine of frequency ω0/2. This results in a shift of the harmonic components by ±ω0/2 and a reduction of their amplitude by a factor of 2. As a consequence, new components will appear at (k + 1/2)·ω0, whose amplitude and phase result from the vectorial sum of the original components at k·ω0 and (k + 1)·ω0. The modulated speech signal is then added to the original speech signal s(n). Thus, an interharmonic is artificially created between each pair of harmonics of the original signal.
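The appearance of these inter-harmonic components follows directly from the product-to-sum identity applied to each harmonic of Equation (1):
a_k·cos(k·ω0·n + φ_k)·cos((ω0/2)·n) = (a_k/2)·cos((k + 1/2)·ω0·n + φ_k) + (a_k/2)·cos((k - 1/2)·ω0·n + φ_k)
so that the component appearing at (k + 1/2)·ω0 collects a contribution of amplitude a_k/2 and phase φ_k from harmonic k, and a contribution of amplitude a_{k+1}/2 and phase φ_{k+1} from harmonic k + 1, i.e. their vectorial sum.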
In a further embodiment, multiple interharmonics are created between the existing harmonics of the speech signal. In its general form, FIHR transforms an original speech signal s(n) into a new signal y(n) using the following equation:
y(n) = s(n)·[1 + 2·Σ_{i=1}^{I-1} W(i/I)·cos(i·ω0·n/I)]    (2)
where I - 1 inter-harmonics are created, the number of which was selected in step S105 above.
W(·) is the weighting function selected in step S103 above. The resulting signal y(n) can be regarded as a periodic signal whose fundamental frequency F0* is F0/I. With the transformation described in Equation (2), FIHR can therefore be regarded as a pre-process which lowers the fundamental frequency before the SE is extracted. As discussed above, accurate SE estimation of low-pitched signals is easier than SE estimation of high-pitched signals.
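A minimal sketch of the transformation of Equation (2) is given below (Python, for illustration only; the function and argument names are illustrative, w0 is the fundamental angular frequency in radians per sample, i.e. w0 = 2·π·F0/Fs for a signal sampled at Fs, and the default weighting is the W(x) = 1 - x embodiment):

import numpy as np

def fihr(s, w0, I, W=lambda x: 1.0 - x):
    # Equation (2): y(n) = s(n) * [1 + 2 * sum_{i=1..I-1} W(i/I) * cos(i*w0*n/I)],
    # creating I - 1 inter-harmonics between consecutive harmonics.
    n = np.arange(len(s))
    mod = np.ones(len(s))
    for i in range(1, I):
        mod += 2.0 * W(i / I) * np.cos(i * w0 * n / I)
    return s * mod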
It can easily be shown that Equation (2) is equivalent to applying the following operation in the complex spectrum domain:
Y(ω) = S(ω) + Σ_{i=1}^{I-1} W(i/I)·[S(ω - i·ω0/I) + S(ω + i·ω0/I)]    (3)
In step S109, spectral envelope (SE) estimation is performed on the transformed speech signal y(n).
Any technique suitable for estimating the spectral envelope may be used on the transformed speech signal y(n) in step S109. For example, Linear Prediction (LP), the Discrete All-Pole method (DAP), the Minimum Variance Distortionless Response method (MVDR), the True Envelope method (TE), cubic spline interpolation (CSI) between the harmonics, and Mel Frequency Cepstrum Coefficients (MFCC) are all suitable techniques.
In an embodiment, linear prediction is employed. In this embodiment, the predicted value ŷ(n) of the modified signal y(n) is given by
ŷ(n) = Σ_{i=1}^{p} a_i·y(n - i)    (4)
where p is the order of the linear predictor and the a_i are the linear prediction coefficients. The error between the predicted value ŷ(n) and the actual value y(n) is given as: e(n) = y(n) - ŷ(n).
The root mean square criterion is used to estimate the linear prediction coefficients a_i.
Minimisation of the squared error E[e²(n)] (where E[x] denotes the expectation value of x) yields the equation:
Σ_{i=1}^{p} a_i·R(j - i) = -R(j)    (5)
where R(j) = E[y(n)·y(n - j)] is the autocorrelation of the signal y(n). This equation is then solved for the a_i. In an embodiment, it is solved using the Levinson-Durbin recursion.
The spectral envelope of the signal y(n) is then given, up to a gain factor, as:
1/|A(e^{jω})|, with A(e^{jω}) = 1 + Σ_{i=1}^{p} a_i·e^{-j·ω·i}    (6)
where j is the square root of minus 1.
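For illustration, the autocorrelation method of Equations (4) to (6) might be sketched as follows (Python; the use of scipy.linalg.solve_toeplitz, which solves the Toeplitz normal equations by a Levinson-type recursion, and the FFT grid size n_fft are illustrative choices, and the gain factor is ignored):

import numpy as np
from scipy.linalg import solve_toeplitz

def lp_envelope(y, p, n_fft=1024):
    # Biased autocorrelation estimates R(0)..R(p) of the modified signal y.
    r = np.correlate(y, y, mode="full")[len(y) - 1:len(y) + p] / len(y)
    # Normal equations (5): sum_{i=1..p} a_i R(j - i) = -R(j) for j = 1..p.
    a = solve_toeplitz(r[:p], -r[1:p + 1])
    # Envelope (6), up to a gain factor: 1 / |1 + sum_i a_i e^{-j w i}|.
    A = np.fft.rfft(np.concatenate(([1.0], a)), n_fft)
    return 1.0 / np.abs(A)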
In another embodiment, FIHR is applied to the zero-phase version of the speech signal. A flow diagram showing a method of speech envelope estimation according to this embodiment is shown in Figure 3.
In step S101, speech is input.
In step S1101, the speech signal is windowed using a window function. Any windowing function suitable for use with a Fourier transform may be employed.
In step S1103, the Fourier transform of the windowed signal is taken.
In step S1105, an inverse Fourier transform is applied to the amplitude of the signal obtained in step S1103. The output of the transform is the zero-phase version s_zp(n) of the speech signal.
In steps S103 and S105, the weight function and the number of interharmonics, respectively, are selected as described above in relation to Figure 2.
In step S109, fast interharmonic reconstruction of the zero-phase signal is performed. This process is the same as that described above in relation to Figure 2. However, in the case of the zero-phase signal s_zp(n), the multiplication of s_zp(n) by cos((ω0/2)·n) yields replicas at (k + 1/2)·ω0 whose amplitude can be shown to be (a_k + a_{k+1})/2. Thus, by employing the zero-phase version of the speech signal, as opposed to the unaltered speech signal, the amplitudes of the new components depend solely on the a_k (and not on the φ_k).
In this embodiment, applying FIHR to the zero-phase speech signal can be seen as a moving-average (MA) process on the amplitude spectrum of the speech signal. The coefficients of this MA process are determined by the weighting function W(·).
In step S111, envelope estimation is performed on the resulting spectrum, as discussed above in relation to Figure 2.
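Combining the steps of Figure 3, an end-to-end FIHR-ZP run might look as follows (an illustrative Python sketch; zero_phase, fihr and lp_envelope are the sketches given above, speech is assumed to hold the input samples, and the frame length, window, pitch and order are example values only):

import numpy as np

fs, f0 = 16000, 650.0                  # sampling rate and pitch (Hz), assumed known
N = int(0.03 * fs)                     # 30 ms analysis frame
frame = speech[:N] * np.hanning(N)     # step S1101: windowing
szp = zero_phase(frame)                # steps S1103-S1105: zero-phase version
I = 5                                  # step S105: target F0* = f0/I = 130 Hz
y = fihr(szp, 2 * np.pi * f0 / fs, I)  # step S109: FIHR on the zero-phase frame
envelope = lp_envelope(y, p=18)        # step S111: LP envelope estimation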
The amplitude spectrum of a soprano speech signal with fundamental frequency F0 = 650 Hz is shown in Figure 4 (solid black line). The result of fast interharmonic reconstruction according to an embodiment applied to the zero-phase speech signal s_zp(n) (referred to as the FIHR-ZP technique), using F0* = 130 Hz (I = 5), is also shown for comparison (solid blue line).
The subsequent autocorrelation-based LP envelope (order p = 18, sampling frequency Fs = 16 kHz) is also shown. The dotted red line indicates the result of LP on the original, unreconstructed spectrum and the dotted green line indicates the result of LP on the reconstructed zero-phase spectrum. The effect of fast interharmonic reconstruction on the estimated spectral envelope can therefore be observed.
According to the above embodiments, FIHR is characterized by three features: i) the use of either s(n) or s_zp(n) in Equation (2); ii) the target fundamental frequency F0* = F0/I (or, equivalently, the number of inter-harmonics I - 1); and iii) the weighting function W(·).
The choice between s(n) and s_zp(n) results from a trade-off between accuracy and computational load. Obtaining s_zp(n) requires the computation of an FFT and an IFFT, but guarantees that, contrary to s(n), Equation (3) is an MA process on the amplitude spectrum and is independent of the phase of the signal. The use of s_zp(n) therefore provides improved accuracy (especially if the phase spectrum varies rapidly across successive harmonics).
It can be shown that the number of inter-harmonics I - 1 created using methods according to the above embodiments can be calculated as F0/F0* - 1. In this embodiment, therefore, inter-harmonics will be created only if F0 > F0*. In an embodiment, F0* is chosen to be below 300 Hz. In a further embodiment, F0* = 100 Hz, which corresponds to the pitch of a low-pitched male voice.
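As a worked example, with F0 = 650 Hz and a target F0* = 130 Hz, I = F0/F0* = 5, so I - 1 = 4 inter-harmonics are created between consecutive harmonics. A one-line helper might read (Python sketch; the rounding convention is an illustrative choice):

def n_interharmonics(f0, f0_target=100.0):
    # I - 1 = F0/F0* - 1; inter-harmonics are created only if F0 > F0*.
    return max(1, round(f0 / f0_target)) - 1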
The last feature of FIHR is the weighting function W(·). In an embodiment, W(·) has the following properties: W(0) = 1; W(1 - x) = 1 - W(x); and W(x) is monotonically decreasing as x increases in the interval [0, 1]. In an embodiment, W(·) is a polynomial function. In another embodiment, W(·) is an exponential function. In yet another embodiment, W(x) = 1 - x.
The methods of estimating the spectral envelope according to the above embodiments can be shown to enable improved accuracy of estimation, particularly for high-pitched voices. Further, this is achieved without significantly increasing computation time. The technique can also be shown to be robust to errors in the determination of the fundamental frequency F0.
Methods and systems according to embodiments described above may be used in the training of text-to-speech systems. In such training, the spectral envelope is estimated from the speech of training speakers and stored for use in the text-to-speech system.
In one example of a text-to-speech system in which stored spectral envelope information may be employed, speech is synthesised by passing a train of Dirac pulses through an all-pole filter. The stored spectral envelope data provides the magnitude spectrum of the all-pole filter. The filtered train of pulses is then output as synthesised speech.
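A sketch of such a synthesis step is given below (Python; the coefficient vector a is assumed to be the denominator of the stored all-pole envelope, as in the LP sketch above, and gain handling is omitted):

import numpy as np
from scipy.signal import lfilter

def synthesize(a, f0, fs, duration):
    # Dirac pulse train with a period of fs/f0 samples ...
    n = int(duration * fs)
    pulses = np.zeros(n)
    pulses[::max(1, int(round(fs / f0)))] = 1.0
    # ... filtered by the all-pole filter 1 / (1 + sum_i a_i z^-i).
    return lfilter([1.0], np.concatenate(([1.0], a)), pulses)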
The above embodiments may also be employed in speech modification systems. For example, the pitch of the speech of a particular speaker may be increased by estimating the spectral envelope from their unmodified speech and applying the spectral envelope as a filter for an excitation signal with a higher fundamental frequency.
The estimation of the spectral envelope according to the above embodiments may also be employed in speech recognition. For example, different vowels correspond to different spectral envelopes. Thus an estimation of the spectral envelope can aid in determining which vowel is being spoken.
Further, the spectral envelope varies with the vocal tract, and therefore estimation of the spectral envelope according to embodiments described above may be used for speaker identification.
EXPERIMENTAL RESULTS
Table 1 shows the Relative Computation Time (RCT) for a number of spectral envelope estimation techniques: standard LP, Mel Frequency Cepstral Coefficients (MFCC), Discrete All-Pole (DAP), True Envelope (TE), and Cubic Spline Interpolation (CSI) between the harmonics. RCT is defined as the ratio of the computation time to the sound duration. In addition, Table 1 shows the RCT for the interharmonic reconstruction of the speech signal by FIHR and FIHR-ZP according to embodiments described above. Note that the RCT for FIHR and FIHR-ZP is given for the pre-processing only and does not include the envelope estimation itself. All implementations were in Matlab and were run on an Intel Core i7 3.0 GHz CPU with 16 GB of RAM. Thanks to their simplicity, FIHR and FIHR-ZP are very fast relative to the time required for envelope estimation.
LPC     MFCC    DAP     TE      CSI     FIHR    FIHR-ZP
1.6     5.48    25.2    39.6    13.1    0.36    1.75
Table 1
Tables 2 and 3 give the spectral distortion (SD) between the spectral envelope estimated from synthetic signals, created by passing a train of Dirac pulses through an all-pole filter, and the magnitude spectrum of the all-pole filter.
The various synthesis parameters were varied as follows: F0 ranged from 100 to 1000 Hz, in steps of 100 Hz; the lowest-frequency formant F1 ranged from 100 to 900 Hz, in steps of 200 Hz; F2 and F3 were varied in the same way as F1, but were located respectively 1 and 2 kHz higher; and the radius for these 3 pairs of poles ranged from 0.95 to 0.99, in steps of 0.02. The sampling rate Fs was set to 16 kHz. The whole dataset therefore consisted of 33750 artificial signals. The Spectral Distortion (SD) is defined as the root mean-square error of the log-amplitude spectra. Six techniques of SE estimation were compared: i) traditional LP with order Fs/1000 + 2; ii) LP with optimal order 0.25·Fs/F0 (hereafter noted LPC*); iii) MFCC with cepstral order 0.25·Fs/F0; iv) DAP with optimal order 0.4·Fs/F0; v) TE with optimal cepstral order 0.5·Fs/F0; and vi) CSI. For each of these six methods, three types of pre-processing were investigated: no pre-processing (the conventional approach), FIHR and FIHR-ZP.
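The SD figures reported below can be computed, for instance, as follows (Python sketch; env_est and env_ref are assumed to be amplitude spectra sampled on the same frequency grid):

import numpy as np

def spectral_distortion(env_est, env_ref):
    # Root mean-square error of the log-amplitude spectra, in dB.
    diff = 20.0 * np.log10(env_est) - 20.0 * np.log10(env_ref)
    return float(np.sqrt(np.mean(diff ** 2)))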
The results are displayed in Tables 2 and 3 for lower and higher values of F0, respectively. Note that SD increases for high-pitched voices as, for the reasons mentioned above, estimation of the spectral envelope is subject to more distortion at these frequencies.
From the tables, it can be seen that, in most cases, FIHR or FIHR-ZP pre-processing decreases the spectral distortion of the synthetic signal, especially for high-pitched signals. The reduction in spectral distortion is significant for simple, low-computational-load techniques such as LPC.
Pre-processing   LPC    LPC*   MFCC   DAP    TE     CSI
None             1.56   2.03   2.98   0.80   1.77   1.30
FIHR             1.61   1.78   2.17   2.47   2.21   1.66
FIHR-ZP          1.55   1.62   1.64   1.86   1.85   1.53

Table 2: SD (in dB) for F0 ≤ 500 Hz.
Pre-processing   LPC    LPC*   MFCC   DAP    TE     CSI
None             8.03   11.48  5.81   6.83   4.40   3.73
FIHR             4.15   4.31   4.46   4.59   4.62   4.16
FIHR-ZP          3.97   3.98   3.97   4.01   4.01   3.96

Table 3: SD (in dB) for F0 > 500 Hz.
A second experiment was carried out on real signals: the first 100 files uttered by the male speaker AWB (with an average pitch of about 140 Hz) of the CMU ARCTIC database were considered. These files were first analysed using a Harmonic Model (HM), and then resynthesized, keeping only the harmonics which are multiples of a factor Q (which implies that F0 is increased by a factor Q). The resynthesized signal can therefore be regarded as a spectral subsampling of the original HM. The spectral envelope of the resynthesized signal was estimated using the techniques given below. SD was calculated using the amplitudes of the harmonics of the original HM as a reference.
The results are given in Tables 4 and 5 for lower and higher values of Q, respectively. Compared to Tables 2 and 3, SD is seen to be much larger, as the signals are much closer to real conditions.
The conclusions, however, remain similar: the use of FIHR-based pre-processing allows a substantial reduction of SD for most techniques, particularly for high-pitched voices.
It can be seen that once a FIHR-based pre-processing has been applied, all the SE estimation techniques provide similar results. This can be explained by the fact that FIHR lowers the pitch to a value of F0* for which all SE estimation methods are known to work well (and consequently converge towards the same estimates).
Pre-processing   LPC    LPC*   MFCC   DAP    TE     CSI
None             3.04   4.94   6.46   5.08   4.97   4.36
FIHR             4.40   4.33   4.44   4.65   5.16   4.50
FIHR-ZP          4.33   4.20   4.10   4.14   5.34   4.35

Table 4: SD (in dB) for 2 ≤ Q ≤ 4.
Pre-processing   LPC    LPC*   MFCC   DAP    TE     CSI
None             9.15   9.80   8.19   7.33   6.31   6.22
FIHR             6.29   6.32   6.40   6.66   6.87   6.45
FIHR-ZP          6.22   6.19   6.16   6.21   7.01   6.26

Table 5: SD (in dB) for 5 ≤ Q ≤ 7.
Thus, methods and systems according to embodiments described above may be used to increase the range of standard envelope estimation techniques while keeping the computational load to a minimum.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods, systems and carrier media described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims (12)

  1. A method of estimating the spectral envelope of a speech signal, said method comprising: inputting said speech signal; modifying said speech signal; and estimating the spectral envelope of said modified speech signal using an envelope estimation model, wherein modifying said speech signal comprises generating new components of the signal at non-harmonic frequencies, wherein said new components of the signal comprise the vectorial sum of harmonic components of the input speech signal.
  2. The method of claim 1, wherein generating new components of the signal at non-harmonic frequencies comprises modulating the speech signal with a periodic function.
  3. The method of claim 1, wherein generating new components of the signal at non-harmonic frequencies comprises modulating the speech signal with a cosine function.
  4. The method of claim 3, wherein said cosine function is cos((ω0/I)·n), where n is the discrete time index, I is an integer greater than or equal to 2, and ω0 is the fundamental frequency of the input speech signal.
  5. The method of claim 3, wherein generating new components of the signal at non-harmonic frequencies further comprises adding the modulated speech signal to the input speech signal.
  6. The method of claim 1, wherein modifying said speech signal comprises transforming the input speech signal s into the modified speech signal y using the following equation: y(n) = s(n)·[1 + 2·Σ_{i=1}^{I-1} W(i/I)·cos(i·ω0·n/I)], where n is the discrete time index, W is a weighting function, I is an integer greater than or equal to 2, and ω0 is the fundamental frequency of the input speech signal.
  7. The method of claim 6, wherein W is a polynomial function.
  8. The method of claim 6, wherein W is an exponential function.
  9. The method of claim 6, wherein W(x) = (1 - x).
  10. The method of claim 1, wherein modifying said speech signal further comprises determining the zero-phase version of the speech signal.
  11. A system for estimating the spectral envelope of a speech signal, said system comprising: an input for receiving input speech; and a processor configured to: modify said speech signal; and estimate the spectral envelope of said modified speech signal using an envelope estimation model, wherein modifying said speech signal comprises generating new components of the signal at non-harmonic frequencies, wherein said new components of the signal comprise the vectorial sum of harmonic components of the input speech signal.
  12. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 1.
GB1408878.5A 2014-05-19 2014-05-19 Speech analysis Expired - Fee Related GB2526291B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1408878.5A GB2526291B (en) 2014-05-19 2014-05-19 Speech analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1408878.5A GB2526291B (en) 2014-05-19 2014-05-19 Speech analysis

Publications (3)

Publication Number Publication Date
GB201408878D0 GB201408878D0 (en) 2014-07-02
GB2526291A true GB2526291A (en) 2015-11-25
GB2526291B GB2526291B (en) 2018-04-04

Family

ID=51135091

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1408878.5A Expired - Fee Related GB2526291B (en) 2014-05-19 2014-05-19 Speech analysis

Country Status (1)

Country Link
GB (1) GB2526291B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187635A1 (en) * 2002-03-28 2003-10-02 Ramabadran Tenkasi V. Method for modeling speech harmonic magnitudes
US20070288232A1 (en) * 2006-04-04 2007-12-13 Samsung Electronics Co., Ltd. Method and apparatus for estimating harmonic information, spectral envelope information, and degree of voicing of speech signal


Also Published As

Publication number Publication date
GB2526291B (en) 2018-04-04
GB201408878D0 (en) 2014-07-02

Similar Documents

Publication Publication Date Title
US9368103B2 (en) Estimation system of spectral envelopes and group delays for sound analysis and synthesis, and audio signal synthesis system
Shrawankar et al. Techniques for feature extraction in speech recognition system: A comparative study
US8594993B2 (en) Frame mapping approach for cross-lingual voice transformation
CN111833843B (en) Speech synthesis method and system
EP0970466A2 (en) Voice conversion system and methodology
Ganapathy et al. Temporal envelope compensation for robust phoneme recognition using modulation spectrum
Choi et al. Korean singing voice synthesis based on auto-regressive boundary equilibrium gan
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
Alku et al. The linear predictive modeling of speech from higher-lag autocorrelation coefficients applied to noise-robust speaker recognition
CN108369803B (en) Method for forming an excitation signal for a parametric speech synthesis system based on a glottal pulse model
US10014007B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
Khonglah et al. Speech enhancement using source information for phoneme recognition of speech with background music
Srivastava Fundamentals of linear prediction
CN116543778A (en) Vocoder training method, audio synthesis method, medium, device and computing equipment
Kafentzis et al. On the Modeling of Voiceless Stop Sounds of Speech using Adaptive Quasi-Harmonic Models.
EP2519944B1 (en) Pitch period segmentation of speech signals
Nasreen et al. Speech analysis for automatic speech recognition
Drugman et al. Fast inter-harmonic reconstruction for spectral envelope estimation in high-pitched voices
JP2006215228A (en) Speech signal analysis method and device for implementing this analysis method, speech recognition device using this device for analyzing speech signal, program for implementing this analysis method, and recording medium thereof
GB2526291A (en) Speech analysis
Demuynck et al. Synthesizing speech from speech recognition parameters
Huh et al. A Comparison of Speech Data Augmentation Methods Using S3PRL Toolkit
Kawahara et al. Beyond bandlimited sampling of speech spectral envelope imposed by the harmonic structure of voiced sounds.
Sarma et al. Formant frequency estimation of phonemes of Assamese speech
Jinachitra Robust structured voice extraction for flexible expressive resynthesis

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230519