International Journal of Electronics, Communication &
Instrumentation Engineering Research and
Development (IJECIERD)
ISSN 2249-684X
Vol. 3, Issue 3, Aug 2013, 99-106
© TJPRC Pvt. Ltd.
IMPLEMENTATION OF A NOVEL TRANSFORMATION TECHNIQUE TO IMPROVE
SPEECH COMPRESSION RATIO
B. V. VIJAYASRI¹ & I. SANTIPRABHA²
¹Associate Professor, Department of ECE, SSAIST, Surampalem, JNTUK, Kakinada, Andhra Pradesh, India
²Director CEOW&G, Department of ECE, JNTUK, Kakinada, Andhra Pradesh, India
ABSTRACT
A novel transformation technique for speech compression is proposed to improve the compression ratio without loss of intelligibility of speech. This technique is a combination of the analysis-synthesis and sub-band coding techniques. The proposed technique is compared with conventional techniques such as Linear Predictive Coding (LPC) and the Discrete Wavelet Transform (DWT). First, a low-bit-rate speech encoder (LPC) is used to analyze the speech signals; the Levinson-Durbin algorithm is used to reproduce the important characteristics of the input speech in the LPC analysis. The DWT is then used to analyze the speech signal. A comparative analysis of the proposed, conventional and transformation techniques was carried out in terms of qualitative speech parameters such as Mean Squared Error (MSE) and Peak Signal to Noise Ratio (PSNR).
KEYWORDS: Speech Coding, Mean Squared Error (MSE), Peak Signal to Noise Ratio (PSNR)
INTRODUCTION
Speech coding is defined as transforming speech into a compact form, so that less memory is required for transmitting the speech signal [1]. Speech coding is a fundamental operation in the Public Switched Telephone Network (PSTN), digital and cellular telephony, and various Voice over Internet Protocol (VoIP) applications. The most common approaches to narrow-band speech coding are centered on three paradigms: waveform coding, analysis-synthesis techniques, and sub-band coding [7].
Waveform coders attempt to reproduce the time-domain speech signal as accurately as possible. Analysis-synthesis methods utilize a perceptual distortion measure, as in LPC [6], to reproduce the speech signal without loss of important characteristics such as speech rate and pitch. Another speech coding approach is sub-band coding [3]: the speech is first segmented into separate frequency bands, called sub-bands, and each frequency band is then separately encoded by either waveform coding or analysis-synthesis methods.
The most familiar methods of waveform coding are log PCM and ADPCM. In PSTN systems, log PCM is used for long-distance communication at a rate of 64 kbps; μ-law and A-law are the companding techniques used with log PCM. ADPCM operates at 32 kbps and achieves performance comparable to log PCM [4]. Analysis-synthesis speech coding is a lossy coding technique, which means the reproduced signal does not sound exactly like the original. The most common lossy coding technique is LPC, in which a prediction error signal, the MSE, is calculated for each coefficient and minimized. Transform or sub-band coding is applied to speech signals because of the performance it achieves at lower bit rates [7]. In general, the classic tradeoff in speech compression is rate versus distortion: the higher the bit rate, the smaller the distortion in the reproduced signal. Using transform coding, however, lower bit rates and good performance can both be achieved [3], because more bits can be allocated to the perceptually important coefficients.
The speech signals that need to be coded are wide-band signals with frequencies in the range 300 Hz to 8 kHz.
Therefore, the sampling frequency should be 16 kHz [4]. Speech coding is done in two paradigms: analysis-synthesis using LPC, and transform coding using different discrete wavelet transforms such as 'haar', Daubechies and 'dmey'. The comparison is made in terms of minimum Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR) and compression ratio.
SPEECH CODING USING LPC
The continuous input speech signal is discretized by sampling at a rate of 16 kHz, as the voice signal may range up to 8 kHz [2]. Since the properties of short-duration speech frames remain approximately constant [5], the sampled speech signal is segmented into frames of size 20 ms. To reduce the effects of radiation at the lips, the high-frequency coefficients have to be preserved; hence, each frame is high-pass filtered. LPC analysis can be applied to the voiced speech signal, but the speech may consist of both voiced and unvoiced frames, so the voiced frames must be identified before applying the Levinson-Durbin algorithm [6]. The voiced/unvoiced classification can be done through estimation of the pitch period, zero-crossing rate, and energy of the speech.
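The framing and high-pass (pre-emphasis) steps described above can be sketched in Python. This is an illustrative sketch only: the paper does not specify an implementation language, and the pre-emphasis coefficient 0.95 is an assumed typical value.

```python
import numpy as np

def preemphasize(x, alpha=0.95):
    """High-pass (pre-emphasis) filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, fs=16000, frame_ms=20):
    """Split a sampled signal into non-overlapping 20 ms frames."""
    n = int(fs * frame_ms / 1000)          # 320 samples per frame at 16 kHz
    n_frames = len(x) // n
    return x[:n_frames * n].reshape(n_frames, n)

# one second of a 200 Hz tone as a stand-in for speech
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 200 * t)
frames = frame_signal(preemphasize(speech), fs)
print(frames.shape)   # (50, 320): 50 frames of 320 samples each
```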
The energy of any mth speech frame of size N can be obtained by the relation,

E(m) = Σ_{n=m−N+1}^{m} y²(n)        (1)
where y(n) is the speech frame. The threshold for energy may be calculated using the mean logarithm value of the
energy of the frame. The frame may be classified as voiced if the energy is greater than the threshold [5].
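The energy-based voiced/unvoiced check of eq. (1) can be sketched as follows, reading the "mean logarithm value of the energy" as a geometric-mean threshold (an interpretation, not stated explicitly in the paper; numpy is used for illustration):

```python
import numpy as np

def frame_energy(y):
    """Eq. (1): sum of squared samples over the frame."""
    return np.sum(y ** 2)

def voiced_by_energy(frames):
    """Flag frames whose energy exceeds the mean-log-energy threshold."""
    energies = np.array([frame_energy(f) for f in frames])
    # "mean logarithm value of the energy", read here as a geometric mean
    threshold = np.exp(np.mean(np.log(energies + 1e-12)))
    return energies > threshold

# a loud frame and a near-silent frame
frames = [np.full(320, 0.5), np.full(320, 0.01)]
flags = voiced_by_energy(frames)
print(flags)   # [ True False]
```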
The frame may be classified as voiced/unvoiced depending on the zero crossing rate of the speech frame [5]. The
zero crossing rate of any mth speech frame can be defined as,
zc(m) = Σ_{n=m−N+1}^{m} |sgn(y(n)) − sgn(y(n−1))|        (2)
where N is the number of samples in the frame. A threshold is obtained, and if the zero-crossing rate is less than the threshold, the frame is classified as voiced. Similarly, the classification may also be done through autocorrelation. The autocorrelation of any speech frame is defined as,
R(k) = Σ_{m=0}^{N−1−k} y(m) y(m+k)        (3)
The autocorrelation function is used not only for voiced/unvoiced classification [5], but also for estimation of the pitch period and the LPC coefficients. Therefore, for each frame, M + 1 coefficients can be estimated, where M is the order of the LPC.
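Eq. (3) can be computed directly; for an order-M LPC, the M + 1 values R(0)…R(M) are needed. A minimal numpy sketch:

```python
import numpy as np

def autocorr(y, k_max):
    """Eq. (3): R(k) = sum_{m=0}^{N-1-k} y(m) y(m+k)."""
    N = len(y)
    return np.array([np.dot(y[:N - k], y[k:]) for k in range(k_max + 1)])

# a 20 ms frame (320 samples at 16 kHz) of a 200 Hz tone
frame = np.sin(2 * np.pi * 200 * np.arange(320) / 16000)
R = autocorr(frame, 10)   # M + 1 = 11 values for an order-10 LPC
print(len(R))   # 11
```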
The LPC coefficients can be estimated from the following equations:
E^(0) = R(0)

k_i = [ R(i) − Σ_{j=1}^{i−1} a_j^(i−1) R(i−j) ] / E^(i−1),   for 1 ≤ i ≤ M

a_i^(i) = k_i

a_j^(i) = a_j^(i−1) − k_i a_{i−j}^(i−1),   for 1 ≤ j ≤ i−1

E^(i) = (1 − k_i²) E^(i−1)        (4)
The above equations are solved recursively for i = 1, 2, …, M, and the final solution

a_j = a_j^(M),   for 1 ≤ j ≤ M

can be obtained.
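The recursion of eq. (4) and the final solution above can be sketched as a small numpy routine. This is an illustrative implementation of the standard Levinson-Durbin recursion, not the authors' exact code:

```python
import numpy as np

def levinson_durbin(R, M):
    """Solve for LPC coefficients a_1..a_M from autocorrelation values
    R(0)..R(M) via the Levinson-Durbin recursion of eq. (4)."""
    a = np.zeros(M + 1)
    E = R[0]                              # E^(0) = R(0)
    for i in range(1, M + 1):
        # reflection coefficient k_i
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k                      # a_i^(i) = k_i
        for j in range(1, i):             # a_j^(i) = a_j^(i-1) - k_i a_{i-j}^(i-1)
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        E *= (1 - k ** 2)                 # E^(i) = (1 - k_i^2) E^(i-1)
    return a[1:], E

# autocorrelation of an AR(1) process with coefficient 0.5: R(k) = 0.5**k
R = np.array([1.0, 0.5, 0.25, 0.125])
a, E = levinson_durbin(R, 3)
print(np.round(a, 6))   # [0.5 0.  0. ]
```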
Then the prediction error can be obtained, and is represented by,

e(n) = y(n) − Σ_{i=1}^{M} a_i y(n−i)        (5)
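The prediction error of eq. (5) can be computed as follows (illustrative sketch; the AR(1) test signal and its single coefficient are assumed purely for demonstration):

```python
import numpy as np

def prediction_error(y, a):
    """Eq. (5): e(n) = y(n) - sum_{i=1}^{M} a_i y(n-i)."""
    M = len(a)
    e = y.copy()
    for n in range(len(y)):
        for i in range(1, M + 1):
            if n - i >= 0:
                e[n] -= a[i - 1] * y[n - i]
    return e

# an AR(1) signal is predicted exactly (after the first sample) by a = [0.5]
y = np.zeros(50)
y[0] = 1.0
for n in range(1, 50):
    y[n] = 0.5 * y[n - 1]
e = prediction_error(y, np.array([0.5]))
print(np.max(np.abs(e[1:])))   # 0.0
```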
SPEECH CODING USING DWT
Wavelets are short-duration waveforms with zero mean value. The wavelet transform can represent non-stationary signals like speech more effectively [2], since it retains both time- and frequency-related information of the speech signal. The mother wavelet [6] can be chosen as the one whose basis functions concentrate the most energy in the level-1 approximation coefficients. The speech signal is then decomposed into a set of scaled and translated versions of the mother wavelet.
In the DWT, sets of filters with different cut-off frequencies are used to analyze the signal at different scales. The signal is passed through a series of high-pass and low-pass filters to analyze the high-frequency and low-frequency components separately, and the filtered results are down-sampled by 2. For many signals most of the energy is concentrated at low frequencies; hence the low-pass filtered signal is again passed through the series of high-pass and low-pass filters, and the process repeats until the desired decomposition is obtained. This process of decomposing the signal is called sub-band coding. At each level of decomposition the number of samples reduces by a factor of 2; hence the original speech signal is compressed by 2^n : 1, where n is the level of decomposition.
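The repeated filter-and-downsample decomposition described above can be sketched with the PyWavelets library (an assumed toolset; the paper does not name one, and 'haar' is used here so that the halving of samples at each level is exact):

```python
import numpy as np
import pywt  # PyWavelets

fs = 16000
speech = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # 1 s stand-in for speech

# level-3 decomposition: approximation cA3 plus details cD3, cD2, cD1
cA3, cD3, cD2, cD1 = pywt.wavedec(speech, 'haar', level=3)
print(len(cA3), len(cD1))   # 2000 8000: the sample count halves at each level
```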
Figure 1: Wavelet Decomposition and Reconstruction Structure
The detail coefficients are treated as zero at the first level of the synthesis process. Both the coarse and detail coefficients are up-sampled and then filtered through low-pass and high-pass filters. The transfer functions of these filters should be the inverses of those of the corresponding low-pass and high-pass analysis filters. Hence the synthetic speech consists of 16000 samples, with coefficients determined by the interpolation and summation process.
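The synthesis step can be sketched with PyWavelets as well; here all detail bands are zeroed for illustration, so the reconstruction is built from the approximation coefficients alone:

```python
import numpy as np
import pywt  # PyWavelets

fs = 16000
speech = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)

coeffs = pywt.wavedec(speech, 'haar', level=3)
# treat the detail coefficients as zero; keep only the coarse approximation
coeffs_zeroed = [coeffs[0]] + [np.zeros_like(d) for d in coeffs[1:]]
synth = pywt.waverec(coeffs_zeroed, 'haar')   # up-sample and filter back up
print(len(synth))   # 16000 samples, as in the synthesis described above
```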
SUB-BAND SPEECH CODING USING LPC AND DWT
The sub-band coding technique using LPC and DWT is the technique proposed to improve the quality of speech. The proposed technique is implemented on the input speech signals. First, a one-dimensional level-1 DWT is applied and
the coarser coefficients of the DWT are compressed using LPC and transmitted. The coarser coefficients contain most of the voiced speech coefficients; hence Levinson-Durbin-based LPC is applied to the coarser coefficients. The process is represented by the block diagram in Figure 2.
Figure 2: Analysis of Speech Using DWT and LPC
At the decoder end, the coarser coefficients are predicted from the received coefficients by using LPC. Then, using the synthesis process of the level-1 DWT, the speech signal can be reconstructed. The inverse of the analysis process gives the speech synthesis: at the decoder, the received signal is decoded using LPC and the coarser coefficients are predicted. These predicted coefficients are interpolated by a factor of 2 and then passed through the filter whose transfer function is the inverse complex conjugate of H0(z). Hence the synthesized speech signal may be reconstructed from the level-1 DWT.
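The proposed level-1 DWT + LPC pipeline can be sketched as follows. This is an illustrative sketch assuming PyWavelets and SciPy: the LPC coefficients are obtained here by solving the Toeplitz normal equations rather than by an explicit Levinson-Durbin loop, and no quantization or bit allocation is modeled.

```python
import numpy as np
import pywt
from scipy.linalg import solve_toeplitz

fs, M = 16000, 10
rng = np.random.default_rng(0)
# a 300 Hz tone plus a little noise as a stand-in for speech
speech = np.sin(2 * np.pi * 300 * np.arange(fs) / fs) + 0.01 * rng.standard_normal(fs)

# Analysis: level-1 DWT; only the coarse (approximation) band is coded
cA, cD = pywt.dwt(speech, 'dmey')

# LPC on the coarse band: autocorrelation, then the normal equations R a = r
R = np.array([np.dot(cA[:len(cA) - k], cA[k:]) for k in range(M + 1)])
a = solve_toeplitz(R[:M], R[1:M + 1])   # predictor coefficients a_1..a_M

# Decoder: predict each coarse sample from its M predecessors,
# then invert the level-1 DWT with the detail band set to zero
cA_hat = cA.copy()
for n in range(M, len(cA)):
    cA_hat[n] = np.dot(a, cA[n - M:n][::-1])
synth = pywt.idwt(cA_hat, np.zeros_like(cD), 'dmey')
print(len(synth) == len(speech))   # True
```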
RESULTS
The performance of the algorithms discussed above is measured by the parameters MSE and PSNR. For LPC analysis, the mean square error and its minimization are given in eq. (5). The input speech signal is analyzed using the Levinson-Durbin algorithm for M = 10. The synthetic speech reproduced from the estimated coefficients is intelligible. If the order increases, the unvoiced components become stronger, introducing unnatural nuances into the synthetic speech. However, the synthetic speech reproduced using LPC of order 10 is audible and intelligible to the human ear. The PSNR
can be obtained by the relation,

PSNR = 10 log10 [ N X² / ‖x − r‖² ]        (6)

where X is the maximum absolute value of the speech signal, and ‖x − r‖² is the energy difference between the original speech and the reconstructed speech.
The same input signal is analyzed using the 'haar', 'dmey' and Daubechies wavelet transforms. These wavelets are chosen by considering the energy in the level-1 coarser coefficients. The signal is analyzed with a third-level decomposition. The synthetic speech is reproduced by applying the IDWT of the same wavelets in the synthesis process. The resulting signal is intelligible and audible. The performance measures for the wavelet analysis are,
MSE = Σ (x − r)² / N        (7)

PSNR = 10 log10 ( f_max² / MSE )        (8)

where f_max is the maximum amplitude of the original and reconstructed speech.
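Eqs. (7) and (8) can be sketched as follows (an illustrative numpy sketch, taking f_max over both the original and reconstructed signals as the text defines it):

```python
import numpy as np

def mse(x, r):
    """Eq. (7): mean squared error between original x and reconstruction r."""
    return np.mean((x - r) ** 2)

def psnr(x, r):
    """Eq. (8): PSNR in dB, with f_max the peak absolute amplitude."""
    f_max = max(np.max(np.abs(x)), np.max(np.abs(r)))
    return 10 * np.log10(f_max ** 2 / mse(x, r))

x = np.sin(np.linspace(0, 2 * np.pi, 1000))
r = x + 0.001 * np.ones_like(x)   # small constant reconstruction error
print(round(psnr(x, r), 1))       # 60.0
```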
The LPC and wavelet transform techniques are applied to a set of input speech signals. For each input speech signal, the energy of each of the different wavelets is calculated, and 'dmey' is found to have the highest energy for all the speech signals in the chosen set. The following tables show the MSE, PSNR and compression ratio of the different inputs for the LPC and wavelet techniques.
Table 1: LPC Analysis

Speech Signal    MSE        PSNR       Compression Ratio
s1               0.00061    66.8693    1.3259
s2               0.00051    69.9084    1.3260
s3               0.00029    74.0175    1.3306
s4               0.00074    67.2717    1.3332
s5               0.00042    75.9210    1.3273
s1, s2, s3, s4 and s5 are the file names of the input speech signals chosen for analysis. s1 represents "its easy to tell the doubts above all", s2 represents "glue the sheet to dark blue background", s3 represents "this was easy for me", s4 represents "do the need for crazy", and s5 represents "this was an emphasized need for an action".
Table 2: Wavelet Analysis (Choice of Wavelet: dmey)

Speech Signal    MSE       PSNR       Compression Ratio
s1               0.031     27.1903    7.8739
s2               0.089     24.8523    7.8782
s3               0.021     26.9811    7.7953
s4               0.070     24.88      7.7136
s5               0.0322    28.85      7.8820
The sub-band coding technique, combining the level-1 DWT and LPC, is applied for the analysis and synthesis of the set of speech signals. The compression ratio for intelligible speech is found to be 1.67:1. Among all the above techniques, the wavelet analysis technique is found to be better in all aspects of synthesis. The following table shows the performance analysis of the above technique.
Table 3: Sub-Band Analysis Using LPC and DWT (Choice of Wavelet: dmey)

Speech Signal    MSE       PSNR       Compression Ratio
s1               0.0051    46.3432    1.6681
s2               0.0022    63.8467    1.6636
s3               0.0036    53.0806    1.6757
s4               0.0030    59.2554    1.6726
s5               0.0009    67.3204    1.6617
CONCLUSIONS
The average MSE for LPC analysis is 0.0005. But the average compression ratio is only 1.33. The prosody
parameters like voice quality, pitch are estimated in LPC synthesis, in addition to environmental parameter, SNR. In LPC
these parameters are found be good enough in reproducing the intelligible speech.
The average MSE for wavelet analysis is 0.05 and the average compression ratio is 7.82. In the wavelet transformation technique, only the coarser coefficients undergo decomposition at each level, and the high-frequency coefficients are not considered in the reconstruction of the synthesized speech. This increases the MSE (compared to LPC) and hence reduces the quality of speech, but a better compression ratio can be achieved with reasonable speech quality.
The average MSE is comparatively low for sub-band analysis using LPC and wavelet, i.e., 0.003, and the compression ratio is 1.67. Even though the compression ratio is higher for wavelet analysis, its performance is poorer in terms of MSE and PSNR. The compression ratio is improved in sub-band analysis using LPC and wavelet when compared to LPC alone. This is achieved because two types of coding are used in the proposed technique. The technique can also be used by inverting the order in which the LPC and DWT are applied, i.e., the speech signal may first undergo LPC analysis and then the DWT can be applied; in that case the compression ratio may improve further, at the cost of speech quality.
In these techniques a tradeoff must be made between the quality of speech and the compression ratio. Finally, the proposed technique is better when compared to the conventional techniques. The comparative analysis of these techniques is represented graphically in Figures 3, 4 and 5.
Figure 3: Graphical Comparison for the 3 Techniques in Terms of MSE
Figure 4: Graphical Comparison for the 3 Techniques in Terms of PSNR
Figure 5: Graphical Comparison for the 3 Techniques in Terms of Compression Ratio
REFERENCES
1. Amol R. Madane, Zalak Shah, Raina Shah, Sanket Takur, "Speech Compression Using Linear Predictive Coding", Proceedings of the International Workshop on Machine Intelligence Research, Nagpur, 2009.

2. Ankith Patel, Mark Tonkelowitz, "Lossless Sound Compression Using the Discrete Wavelet Transform", Jan 14, 2002.

3. Deepen Sinha, Ahmed H. Tewfik, "Low Bit Rate Transparent Audio Compression Using Adapted Wavelets", IEEE Transactions on Signal Processing, Vol. 41, No. 12, Dec 1993.

4. K. T. Talale, S. T. Gandhe, "Speech Compression Using ADPCM".

5. L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals".

6. Mahmoud A. Osman, Nasser Al, Hussein M. Magboub and S. A. Alfandi, "Speech Compression Using LPC and Wavelet", 2nd International Conference on Computer Engineering and Technology, Al-Fateh University, Tripoli, Libya, IEEE, 2010.

7. P. S. Sathidevi, Y. Venkataramani, "Speech and Audio Coding Using Wavelet Transforms", Proceedings of the National Seminar on Applied Systems Engineering and Soft Computing (SASESC 2000), Dayalbagh Educational Institute, Agra, 2000, pp. 242-246.

8. Palaniandavar Venkateswaran, Arindam Sanyal, Snehasish Das, Rabindranath Nandi, Salil Kumar Sanyal, "An Efficient Time Domain Speech Compression Algorithm Based on LPC and Sub-Band Coding Techniques", Journal of Communications, Vol. 4, No. 6, pp. 423-428, Jul 2009.

9. Simon D. Boland, Mohammed Deriche, "Hybrid LPC and Discrete Wavelet Transform Audio Coding with a Novel Bit Allocation Algorithm".