
International Journal of Electronics, Communication & Instrumentation Engineering Research and Development (IJECIERD), ISSN 2249-684X, Vol. 3, Issue 3, Aug 2013, 99-106, © TJPRC Pvt. Ltd.

IMPLEMENTATION OF A NOVEL TRANSFORMATION TECHNIQUE TO IMPROVE SPEECH COMPRESSION RATIO

B. V. VIJAYASRI¹ & I. SANTIPRABHA²
¹ Associate Professor, Department of ECE, SSAIST, Surampalem, JNTUK, Kakinada, Andhra Pradesh, India
² Director CEOW&G, Department of ECE, JNTUK, Kakinada, Andhra Pradesh, India

ABSTRACT

A novel transformation technique for speech compression is proposed to improve the compression ratio without loss of intelligibility of the speech. The technique is a combination of the analysis-synthesis and sub-band coding techniques, and it is compared with conventional techniques such as Linear Predictive Coding (LPC) and the Discrete Wavelet Transform (DWT). First, a low-bit-rate speech encoder (LPC) is used to analyze the speech signals; the Levinson-Durbin algorithm is used in the LPC analysis to reproduce the important characteristics of the input speech. The speech signal is then analyzed using the DWT. A comparative analysis of the proposed, conventional, and transformation techniques is carried out in terms of qualitative speech parameters such as Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR).

KEYWORDS: Speech Coding, Mean Squared Error (MSE), Peak Signal to Noise Ratio (PSNR)

INTRODUCTION

Speech coding is defined as transforming speech into a compact form, so that less memory is required for transmitting the speech signal [1]. Speech coding is a fundamental operation in the Public Switched Telephone Network (PSTN), digital and cellular telephony, and various Voice over Internet Protocol (VoIP) applications. The most common approaches to narrow-band speech coding are centered around three paradigms: waveform coders, analysis-synthesis techniques, and sub-band coding [7].
Waveform coders attempt to reproduce the time-domain speech signal as accurately as possible. Analysis-synthesis methods use a perceptual distortion measure, as in LPC [6], to reproduce the speech signal without loss of important characteristics such as speech rate and pitch. Another speech coding approach is sub-band coding [3]: the speech is first segmented into separate frequency bands called sub-bands, and each band is then encoded separately by either waveform coding or analysis-synthesis methods. The most familiar waveform coding methods are log PCM and ADPCM. In PSTN systems, log PCM is used for long-distance communication at a rate of 64 kbps; μ-law and A-law are the companding techniques used with log PCM. ADPCM operates at 32 kbps and achieves performance comparable with log PCM [4]. Analysis-synthesis speech coding is a lossy coding technique, which means the reproduced signal does not sound exactly like the original. The most common lossy coding technique is LPC: for each coefficient prediction, an error signal (the MSE) is calculated and minimized. Transform or sub-band coding is applied to speech signals because of the operational performance it achieves at lower bit rates [7]. In general, the classic tradeoff in speech compression is rate versus distortion: the higher the bit rate, the smaller the distortion in the reproduced signal. With transform coding, however, lower bit rates and good performance can both be achieved [3], because more bits can be allocated to the perceptually important coefficients. The speech signals to be coded are wide-band signals with frequencies in the range 300 Hz to 8 kHz; therefore, the sampling frequency should be 16 kHz [4]. Speech coding is done in two paradigms: analysis-synthesis using LPC, and transform coding using different discrete wavelet transforms such as 'haar', Daubechies, and 'dmey'.
The comparison is made in terms of minimum Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and compression ratio.

SPEECH CODING USING LPC

The continuous input speech signal is discretized by sampling at a rate of 16 kHz, as the voice signal may range up to 8 kHz [2]. Since the properties of short-duration speech frames remain approximately constant [5], the sampled speech signal is segmented into frames of 20 ms. To reduce the effects of radiation at the lips, the high-frequency coefficients are to be preserved; hence, each frame is high-pass filtered. LPC analysis applies to voiced speech, but the signal may consist of both voiced and unvoiced frames; the voiced frames are therefore identified before the Levinson-Durbin algorithm is applied [6]. The voiced/unvoiced classification can be done through estimation of the pitch period, the zero-crossing rate, and the energy of the speech. The energy of the mth speech frame of size N can be obtained by the relation

E(m) = Σ_{n=m-N+1}^{m} y²(n)   (1)

where y(n) is the speech frame. The energy threshold may be calculated using the mean logarithmic value of the frame energies; a frame may be classified as voiced if its energy is greater than the threshold [5]. The frame may also be classified as voiced or unvoiced depending on the zero-crossing rate of the speech frame [5]. The zero-crossing rate of the mth speech frame can be defined as

zc(m) = Σ_{n=m-N+1}^{m} |sgn(y(n)) − sgn(y(n−1))|   (2)

where N is the number of samples in the frame. A threshold is obtained, and if the zero-crossing rate is less than the threshold, the frame is voiced. Similarly, the classification may be done through the autocorrelation. The autocorrelation of a speech frame is defined as

R(k) = Σ_{m=0}^{N−1−k} y(m) y(m+k)   (3)

The autocorrelation function is used not only for voiced/unvoiced classification [5] but also for pitch period and LPC coefficient estimation.
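As an illustration, the three frame measures of eqs. (1)-(3) and a simple voiced/unvoiced decision can be sketched as below. This is a minimal sketch: the function names are illustrative, and the fixed thresholds are hypothetical stand-ins for the data-derived thresholds (e.g., the mean log energy) described above.

```python
import numpy as np

def frame_energy(y):
    """Frame energy, eq. (1): E(m) = sum of y(n)^2 over the frame."""
    return float(np.sum(y ** 2))

def zero_crossings(y):
    """Zero-crossing count of a frame, after eq. (2)."""
    return int(np.sum(np.abs(np.diff(np.sign(y))) > 0))

def autocorrelation(y, k):
    """Frame autocorrelation, eq. (3): R(k) = sum y(m) y(m+k)."""
    N = len(y)
    return float(np.sum(y[:N - k] * y[k:]))

def is_voiced(y, energy_thresh, zcr_thresh):
    """Voiced frames have high energy and a low zero-crossing rate."""
    return frame_energy(y) > energy_thresh and zero_crossings(y) < zcr_thresh
```

For example, a 20 ms frame at 16 kHz holds N = 320 samples; a strong 100 Hz tone passes the voiced test (high energy, few zero crossings), while low-level high-frequency noise does not.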
Therefore, M+1 coefficients can be estimated for each frame, where M is the order of the LPC. The LPC coefficients are estimated from the Levinson-Durbin recursion:

k_i = ( R(i) − Σ_{j=1}^{i−1} a_j^{(i−1)} R(i−j) ) / E^{(i−1)},  for 1 ≤ i ≤ M
a_i^{(i)} = k_i
a_j^{(i)} = a_j^{(i−1)} − k_i a_{i−j}^{(i−1)},  for 1 ≤ j ≤ i−1
E^{(i)} = (1 − k_i²) E^{(i−1)}   (4)

The above equations are solved recursively for i = 1, 2, ..., M, with E^{(0)} = R(0), and the final solution a_j = a_j^{(M)} for 1 ≤ j ≤ M is obtained. The prediction error can then be obtained and is represented by

e(n) = y(n) − Σ_{i=1}^{M} a_i y(n−i)   (5)

SPEECH CODING USING DWT

Wavelets are short-duration waveforms with zero mean value. The wavelet transform can represent non-stationary signals like speech more effectively [2], since it retains both the time- and frequency-related information of the speech signal. The mother wavelet [6] is chosen so that the energy of the wavelet basis function is concentrated in the level-1 approximation coefficients. The speech signal is then decomposed into a set of scaled and translated versions of the mother wavelet. In the DWT, sets of filters with different cut-off frequencies are used to analyze the signal at different scales: the signal is passed through a series of high-pass and low-pass filters to analyze the high-frequency and low-frequency components separately, and the filtered results are downsampled by 2. For many signals most of the energy is concentrated at low frequencies, so the low-pass filtered signal is again passed through a series of high-pass and low-pass filters, and the process repeats until the desired decomposition is reached. This process of decomposing the signal is called sub-band coding. At each level of decomposition the number of samples is reduced by 2; hence the original speech signal is compressed by 2^n : 1, where n is the level of decomposition.

Figure 1: Wavelet Decomposition and Reconstruction Structure

The detail coefficients are treated as zero at the first level of the synthesis process.
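The repeated sub-band splitting described above can be sketched as follows. This is a minimal sketch using orthonormal Haar analysis filters as an illustrative choice of mother wavelet (the paper also uses 'dmey' and Daubechies wavelets); the function names are illustrative.

```python
import numpy as np

# Orthonormal Haar analysis filters (illustrative choice of mother wavelet)
LO = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass -> coarser coefficients
HI = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass -> detail coefficients

def dwt_level(x):
    """One decomposition level: filter, then downsample by 2."""
    lo = np.convolve(x, LO)[1::2]
    hi = np.convolve(x, HI)[1::2]
    return lo, hi

def decompose(x, levels):
    """Repeatedly split the low band, as in Figure 1. Keeping only the
    final approximation gives a 2^n : 1 compression for n levels."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, d = dwt_level(approx)
        details.append(d)
    return approx, details
```

Because the filters are orthonormal, each split preserves signal energy while halving the number of samples in each band, which is what makes discarding the detail bands a controlled loss.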
Both the coarser and detail coefficients are upsampled and then filtered through low-pass and high-pass filters. The transfer functions of these filters should be the inverses of those of the corresponding low-pass and high-pass analysis filters. The synthetic speech then consists of 16000 samples, with coefficients determined from the interpolation and summation process.

SUB-BAND SPEECH CODING USING LPC AND DWT

The sub-band coding technique using LPC and DWT is the proposed technique to improve the quality of speech. The proposed technique is implemented on the input speech signals as follows. First, a one-dimensional, level-1 DWT is applied, and the coarser (approximation) coefficients of the DWT are compressed using LPC and transmitted. The coarser coefficients contain most of the voiced speech content; hence the Levinson-Durbin-based LPC is applied to them. The process is represented by the block diagram in Figure 2.

Figure 2: Analysis of Speech Using DWT and LPC

At the decoder, the coarser coefficients are predicted from the received coefficients using LPC. The speech signal is then reconstructed using the level-1 DWT synthesis process, i.e., the inverse of the analysis: the received signal is decoded using LPC, the coarser coefficients are predicted, and these predicted coefficients are interpolated by a factor of 2 and passed through the filter whose transfer function is the inverse complex conjugate of H0(z). Hence the synthesized speech signal may be reconstructed from the level-1 DWT.

RESULTS

The performance of the algorithms discussed above is measured using MSE and PSNR. For the LPC analysis, the mean square error that is minimized is given in eq. (5). The input speech signal is analyzed using the Levinson-Durbin algorithm with M = 10.
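The Levinson-Durbin recursion of eq. (4), used here with order M = 10, can be sketched as follows. This is a minimal sketch; the function names are illustrative.

```python
import numpy as np

def levinson_durbin(r, M):
    """Solve for the LPC coefficients a_1..a_M from the frame
    autocorrelations r[0..M] via the recursion of eq. (4)."""
    a = np.zeros(M + 1)        # a[0] unused; a[1..M] are the predictors
    err = r[0]                 # E^(0) = R(0)
    for i in range(1, M + 1):
        # reflection coefficient k_i
        k = (r[i] - np.dot(a[1:i], r[i-1:0:-1])) / err
        a[1:i] = a[1:i] - k * a[i-1:0:-1]   # update a_1 .. a_{i-1}
        a[i] = k
        err *= (1.0 - k * k)   # E^(i) = (1 - k_i^2) E^(i-1)
    return a[1:], err

def prediction_error(y, a):
    """Residual of eq. (5): e(n) = y(n) - sum_i a_i y(n - i)."""
    y = np.asarray(y, dtype=float)
    e = y.copy()
    for i, ai in enumerate(a, start=1):
        e[i:] -= ai * y[:-i]
    return e
```

For a first-order autoregressive signal with autocorrelations R = [1, 0.5, 0.25], the recursion yields a_1 = 0.5 and a_2 = 0, i.e., a single predictor coefficient fully captures the signal, and the residual of eq. (5) vanishes for n ≥ 1.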
The synthetic speech reproduced from the estimated coefficients is intelligible. As the order increases, the unvoiced components become stronger, introducing noticeable artifacts in the synthetic speech. However, the synthetic speech reproduced using LPC of order 10 is audible and intelligible to the human ear. The PSNR can be obtained by the relation

PSNR = 10 log₁₀ ( N X² / ‖x − r‖² )   (6)

where X is the maximum absolute value of the speech signal, N is the number of samples, and ‖x − r‖² is the energy of the difference between the original and reconstructed speech. The same input signal is analyzed using the 'haar', 'dmey', and Daubechies wavelet transforms. These wavelets are chosen by considering the energy in the level-1 coarser coefficients. The signal is analyzed to a third-level decomposition, and the synthetic speech is reproduced by applying the IDWT of the same wavelets in the synthesis process. The resulting signal is intelligible and audible. The performance measures for the wavelet analysis are taken as

MSE = Σ (x − r)² / N   (7)

PSNR = 10 log₁₀ ( f_max² / MSE )   (8)

where f_max is the maximum amplitude of the original and reconstructed speech. The LPC and wavelet transform techniques are applied to a set of input speech signals. The energy of each candidate wavelet is calculated for each input speech signal, and 'dmey' is found to have the highest energy for all the speech signals in the chosen set. The following tables show the MSE, PSNR, and compression ratio of the different inputs for the LPC and wavelet techniques.

Table 1: LPC Analysis

Speech Signal        s1        s2        s3        s4        s5
MSE                  0.00061   0.00051   0.00029   0.00074   0.00042
PSNR                 66.8693   69.9084   74.0175   67.2717   75.9210
Compression Ratio    1.3259    1.3260    1.3306    1.3332    1.3273

s1, s2, s3, s4, s5 are the file names of the input speech signals chosen for analysis.
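The performance measures of eqs. (7)-(8) can be sketched as follows; this is a minimal sketch with illustrative function names, taking the peak as the maximum absolute amplitude of the original signal.

```python
import numpy as np

def mse(x, r):
    """Mean squared error, eq. (7): MSE = sum((x - r)^2) / N."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    return float(np.mean((x - r) ** 2))

def psnr(x, r):
    """Peak signal-to-noise ratio in dB, eq. (8), with the peak taken
    as the maximum absolute amplitude of the original signal."""
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x))
    return float(10.0 * np.log10(peak ** 2 / mse(x, r)))
```

For example, a unit-amplitude signal reconstructed with a uniform error of 0.1 has MSE = 0.01 and PSNR = 10 log₁₀(1/0.01) = 20 dB.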
s1 represents "its easy to tell the doubts above all", s2 represents "glue the sheet to dark blue background", s3 represents "this was easy for me", s4 represents "do the need for crazy", and s5 represents "this was and emphasized need for an action".

Table 2: Wavelet Analysis (Choice of Wavelet: dmey)

Speech Signal        s1        s2        s3        s4        s5
MSE                  0.031     0.089     0.021     0.070     0.0322
PSNR                 27.1903   24.8523   26.9811   24.88     28.85
Compression Ratio    7.8739    7.8782    7.7953    7.7136    7.8820

The sub-band coding technique, using level-1 DWT and LPC, is applied for the analysis and synthesis of the set of speech signals. The compression ratio for intelligible speech is found to be 1.67:1. Of all the above techniques, the wavelet analysis technique is found to give the best compression ratio. The following table shows the performance of the proposed technique.

Table 3: Sub-Band Analysis Using LPC and DWT (Choice of Wavelet: dmey)

Speech Signal        s1        s2        s3        s4        s5
MSE                  0.0051    0.0022    0.0036    0.0030    0.0009
PSNR                 46.3432   63.8467   53.0806   59.2554   67.3204
Compression Ratio    1.6681    1.6636    1.6757    1.6726    1.6617

CONCLUSIONS

The average MSE for LPC analysis is 0.0005, but the average compression ratio is only 1.33. The prosody parameters such as voice quality and pitch are estimated in LPC synthesis, in addition to the environmental parameter SNR; in LPC these parameters are found to be good enough to reproduce intelligible speech. The average MSE for wavelet analysis is 0.05 and the average compression ratio is 7.82. In the wavelet transformation technique only the coarser coefficients undergo decomposition at each level, and the high-frequency coefficients are not considered in the reconstruction of the synthesized speech. This increases the MSE (when compared to LPC) and hence reduces the quality of the speech, but a better compression ratio can be achieved with acceptable speech quality.
The average MSE is comparatively low for the sub-band analysis using LPC and wavelet, i.e., 0.003, and the compression ratio is 1.67. Even though the compression ratio is higher for wavelet analysis, its performance is lower according to MSE and PSNR. The compression ratio of the sub-band analysis using LPC and wavelet is improved when compared to LPC alone; this is achieved because two types of coding are used in the proposed technique. The technique can also be used by inverting the order in which LPC and DWT are applied, i.e., the speech signal may first undergo LPC analysis and then the DWT may be applied; in such a process the compression ratio may be improved further, at some loss of speech quality. In these techniques a tradeoff must be made between speech quality and compression ratio. Finally, the proposed technique compares favorably with the conventional techniques. The comparative analysis of these techniques is represented graphically in Figures 3, 4, and 5.

Figure 3: Graphical Comparison of the 3 Techniques in Terms of MSE

Figure 4: Graphical Comparison of the 3 Techniques in Terms of PSNR

Figure 5: Graphical Comparison of the 3 Techniques in Terms of Compression Ratio

REFERENCES

1. Amol R. Madane, Zalak Shah, Raina Shah, Sanket Takur, "Speech Compression Using Linear Predictive Coding", Proceedings of the International Workshop on Machine Intelligence Research, Nagpur, 2009.
2. Ankith Patel, Mark Tonkelowitz, "Lossless Sound Compression Using the Discrete Wavelet Transform", Jan 14, 2002.
3. Deepen Sinha, Ahmed H. Tewfik, "Low Bit Rate Transparent Audio Compression Using Adapted Wavelets", IEEE Transactions on Signal Processing, Vol. 41, No. 12, Dec 1993.
4. K. T. Talale, S. T. Gandhe, "Speech Compression Using ADPCM".
5. L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals".
6. Mahmoud A. Osman, Nasser Al, Hussein M. Magboub and S. A. Alfandi, "Speech Compression Using LPC and Wavelet", 2nd International Conference on Computer Engineering and Technology, Al-Fateh University, Tripoli, Libya, IEEE, 2010.
7. P. S. Sathidevi, Y. Venkataramani, "Speech and Audio Coding Using Wavelet Transforms", Proceedings of the National Seminar on Applied Systems Engineering and Soft Computing (SASESC 2000), Dayalbagh Educational Institute, Agra, 2000, pp. 242-246.
8. Palaniandavar Venkateswaran, Arindam Sanyal, Snehasish Das, Rabindranath Nandi, Salil Kumar Sanyal, "An Efficient Time Domain Speech Compression Algorithm Based on LPC and Sub-Band Coding Techniques", Journal of Communications, Vol. 4, No. 6, Jul 2009, pp. 423-428.
9. Simon D. Boland, Mohammed Deriche, "Hybrid LPC and Discrete Wavelet Transform Audio Coding with a Novel Bit Allocation Algorithm".