CN112270934B - Voice data processing method of NVOC low-speed narrow-band vocoder - Google Patents
- Publication number
- CN112270934B (application CN202011049193A)
- Authority
- CN
- China
- Prior art keywords
- voice
- signal
- voice data
- value
- parameters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/0212—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using orthogonal transformation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/26—Pre-filtering or post-filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Abstract
The invention relates to a voice data processing method of an NVOC low-speed narrow-band vocoder, comprising the following steps: step 1, the encoding end performs initialization configuration and analysis processing on the original voice digital signal; step 2, on the basis of the pitch period and unvoiced/voiced numerical parameters calculated in step 1, parameters such as line spectrum pairs, pitch values, gain parameters, residual compensation gains and codebook vectors are extracted and quantized; and step 3, the voice quantization parameters of step 2 are extracted, speech is synthesized from them, voice quality is improved through noise suppression, and voice reconstruction is performed when parameter recovery or voice synthesis fails. The invention provides excellent voice quality at low bit rates.
Description
Technical Field
The invention belongs to the technical field of digital voice compression for vocoders, and particularly relates to a voice data processing method of an NVOC low-speed narrow-band vocoder.
Background
With the rapid development of communication technology, spectrum resources are scarce. Compared with analog voice communication systems, digital voice communication systems offer strong interference resistance, high confidentiality and ease of integration, and the low-rate vocoder plays an important role in them.
At present, most speech coding algorithms are built on acoustic models of the human vocal organs. The human vocal organs consist of the glottis, the vocal tract and other auxiliary organs. In actual speech production, the vibration generated at the glottis is modulated by the vocal tract filter and radiated through the mouth, nose, etc.; this can be expressed by the following formula
s(n)=h(n)*e(n)
wherein s(n) represents the voice signal, h(n) is the unit impulse response of the vocal tract filter, and e(n) is the glottal vibration signal.
To represent a speech signal clearly, the glottis and the vocal tract can each be described by their frequency-spectrum characteristics, and efficiently quantizing their characteristic parameters is the objective of parametric coding.
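The source-filter relation s(n) = h(n) * e(n) above can be illustrated with a minimal numerical sketch; the pulse spacing, filter shape and all constants below are hypothetical, for illustration only, not taken from the patent.

```python
import numpy as np

fs = 8000                     # sampling rate (Hz), as used by the vocoder
pitch_period = 80             # assumed 100 Hz pitch -> 80 samples at 8 kHz

# Glottal excitation e(n): a periodic train of unit pulses
e = np.zeros(320)
e[::pitch_period] = 1.0

# Hypothetical vocal-tract impulse response h(n): a damped oscillation
n = np.arange(64)
h = (0.9 ** n) * np.cos(2 * np.pi * 500 * n / fs)

# s(n) = h(n) * e(n): convolution of excitation with the tract response
s = np.convolve(e, h)[:len(e)]
```

Each glottal pulse excites a copy of the tract response, which is the intuition behind analyzing glottis and tract separately.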
Vocoders belong to the class of parametric coders: a low-rate narrow-band vocoder compresses the digital representation of a speech signal into fewer bits and recovers speech as close as possible to the original. With the explosive growth in digital signal processing hardware performance, vocoders have come into heavy use, alongside accelerated research into them.
The existing low-rate narrow-band vocoder supports two code rates, 2.4 kbps and 2.2 kbps (used for encryption); the channel FEC code rate is 1.2 kbps, and the voice codec and FEC encode and decode one frame per 20 milliseconds at an 8 kHz sampling rate.
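The frame sizes implied by the stated rates are simple arithmetic: at 8 kHz sampling with 20 ms frames, each frame holds 160 samples, and each code rate yields a fixed per-frame bit budget (rate × frame length). A small sketch:

```python
FS = 8000          # sampling rate in Hz
FRAME_MS = 20      # frame length in milliseconds

# Samples per analysis frame: 8000 * 0.020 = 160
samples_per_frame = FS * FRAME_MS // 1000

# Bits per frame for each stated code rate (bits/s * seconds per frame)
bits_per_frame = {rate: rate * FRAME_MS // 1000
                  for rate in (2400, 2200, 1200)}
```

So the 2.4 kbps mode must pack all quantized parameters of a frame into 48 bits.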
The following problems remain: (1) the pitch parameters are extracted using time-domain correlation, so the calculation is error-prone; (2) because the sound is not denoised, the extracted sound parameters are inaccurate in the presence of noise; (3) dialect speech is distorted; (4) because the compression ratio of narrow-band low-rate coding is high, voice quality is low when the channel quality is poor and bit errors occur.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a voice data processing method for an NVOC low-speed narrow-band vocoder which features a reasonable design, high voice quality and strong adaptability to dialects.
The invention solves the practical problem by adopting the following technical scheme:
a speech data processing method of NVOC low-speed narrow-band vocoder comprises the following steps:
step 1, an encoding end carries out initialization configuration and analysis processing on an original voice digital signal, firstly, denoising processing is carried out on the original voice digital signal, then whether the current voice signal is voice or not is judged, if the current voice signal is voice, fundamental tone in the voice is extracted, and then fundamental tone period and unvoiced and voiced numerical parameters of each sub-band are calculated;
step 2, extracting and quantizing parameters of a line spectrum pair, a base pitch value, a gain parameter, a residual compensation gain and a codebook vector on the basis of the numerical parameters of the pitch period, the unvoiced sound and the voiced sound calculated in the step 1 to obtain a sound quantization parameter;
and 3, after extracting the voice quantization parameters of step 2, synthesizing them into speech, improving voice quality through noise suppression, and performing voice reconstruction when parameter recovery or voice synthesis fails.
The step 1 specifically includes:
(1) Denoising the original voice digital signal S(n) to obtain denoised voice data S1(n) and the 0 to 4000 Hz sound spectrum characteristics of the original data S(n);
(2) Judging whether the current denoised voice signal is voice by adopting VAD activity detection, obtaining voice data S2(n);
(3) Extracting the fundamental tone of the voice data S2(n);
(4) And calculating the parameters of the pitch period and the unvoiced and voiced values of each sub-band.
Moreover, the specific steps in step (1) of step 1 include:
(1) a high-pass filter is adopted to remove direct current components from voice data, improve high-frequency components and attenuate low frequency;
(2) windowing the signal with a Hamming window of length N and obtaining the energy distribution on the frequency spectrum through overlapped Fourier transforms, yielding the denoised voice data S1(n) and the 0 to 4000 Hz sound spectrum characteristics of the original voice digital signal S(n).
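Steps (1)-(2) can be sketched as a first-order high-pass (pre-emphasis) followed by overlapped Hamming-windowed FFTs; the 0.95 coefficient, 256-sample window and 50% overlap are assumed values for illustration, not taken from the patent.

```python
import numpy as np

def preprocess(x, win_len=256, hop=128):
    """Sketch of step 1(1)-(2): DC removal / high-frequency emphasis with a
    first-order high-pass filter, then overlapped Hamming-windowed FFT frames
    giving the 0-4000 Hz spectral energy distribution (assumed constants)."""
    # y[n] = x[n] - 0.95*x[n-1]: removes DC, boosts high frequencies
    y = np.append(x[0], x[1:] - 0.95 * x[:-1])
    w = np.hamming(win_len)
    frames = [y[i:i + win_len] * w
              for i in range(0, len(y) - win_len + 1, hop)]
    # One-sided magnitude spectra; at fs = 8000 Hz the bins span 0-4000 Hz
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

spec = preprocess(np.random.default_rng(0).standard_normal(8000))
```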
Moreover, the specific method of step (2) of step 1 is as follows:
according to the auditory characteristics of the human ear, the denoised voice data S1(n) is sub-band filtered and the sub-band signal levels are calculated; the signal-to-noise ratio is estimated according to the following formula and compared with a preset threshold to judge whether the current speech signal is voice:
wherein, a is the signal level value of the current frame, and b is the current signal level value estimated from the previous frames;
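Since the patent's SNR formula itself is not reproduced here, the following is only a hedged sketch of the comparison it describes: a current-frame level a against a noise level b estimated from earlier frames, thresholded to decide voice. The dB form and the 6 dB threshold are assumptions.

```python
import numpy as np

def is_voice(frame, noise_level, threshold_db=6.0):
    """Sketch of the VAD in step 1(2): compare the current frame level (a)
    against a noise level estimated from previous frames (b). The dB form
    and threshold are assumed, not the patent's exact formula."""
    a = np.sqrt(np.mean(frame.astype(float) ** 2))   # current frame level
    snr_db = 20.0 * np.log10(a / max(noise_level, 1e-9))
    return bool(snr_db > threshold_db)

rng = np.random.default_rng(1)
noise = 0.01
quiet = rng.standard_normal(160) * 0.01              # noise-like frame
loud = np.sin(2 * np.pi * 200 * np.arange(160) / 8000)  # tone well above noise
```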
Moreover, the specific method of step (3) of step 1 is as follows:
the voice data S2(n) is low-pass filtered with a low-pass filter with cut-off frequency B Hz; after the low-pass-filtered voice data passes through a second-order inverse filter, the autocorrelation function of the output signal of the second-order inverse filtering is calculated according to the following formula, and the fundamental tone is extracted:
wherein N is the window length of the window function mentioned in step (1) of step 1, and Sw(i) is the output signal of the second-order inverse filtering in step (3) of step 1.
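The autocorrelation pitch search of step (3) can be sketched as follows; the low-pass and second-order inverse filtering stages are omitted here, and the 60-400 Hz search range is an assumption, not a value from the patent.

```python
import numpy as np

def estimate_pitch(x, fs=8000, fmin=60, fmax=400):
    """Sketch of pitch extraction (step 1(3)): pick the autocorrelation
    peak in a plausible pitch-lag range; pitch = fs / lag."""
    x = x - np.mean(x)
    # Autocorrelation for non-negative lags
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = fs // fmax, fs // fmin          # lag search range (assumed)
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

t = np.arange(800) / 8000
pitch = estimate_pitch(np.sin(2 * np.pi * 100 * t))  # 100 Hz test tone
```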
Moreover, the specific steps of step (4) of step 1 include:
(1) dividing the 0-4000 Hz frequency domain into 5 bands, [0-500] Hz, [500-1000] Hz, [1000-2000] Hz, [2000-3000] Hz and [3000-4000] Hz, and calculating the autocorrelation function of the band-pass signal in each interval with the following formula:
where t is the continuous time variable, τ is the delay of the input signal, * is the convolution operator, and (·)* denotes complex conjugation;
(2) taking the average of the product of the values of the same time function at times t and t + τ, as a function of the delay τ: this average measures the similarity between the signal and its delayed version. When the delay is zero it becomes the mean square value of the signal, which is where it is largest; the maximum of this function over nonzero delays is taken as the voiced intensity to calculate the unvoiced/voiced value of each sub-band;
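A hedged sketch of the five-band voicing measure: each band is isolated (here by FFT masking rather than the patent's filter bank) and the peak of its normalized autocorrelation at nonzero lag serves as the voiced-strength value. The lag range is an assumption.

```python
import numpy as np

BANDS = [(0, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]

def band_voicing(x, fs=8000):
    """Sketch of the multi-subband voicing decision (step 1(4)):
    band-pass each of the five bands, then take the normalized
    autocorrelation peak at nonzero lag as the band's voiced strength."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    strengths = []
    for lo, hi in BANDS:
        Xb = np.where((freqs >= lo) & (freqs < hi), X, 0)  # FFT-mask the band
        xb = np.fft.irfft(Xb, len(x))
        r = np.correlate(xb, xb, mode="full")[len(x) - 1:]
        # Normalized peak away from lag 0 (lag range assumed)
        strengths.append(float(np.max(r[20:160]) / (r[0] + 1e-12)))
    return strengths

t = np.arange(800) / 8000
v = band_voicing(np.sin(2 * np.pi * 200 * t))   # 200 Hz tone: only band 1 voiced
```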
further, the specific steps of step 2 include:
(1) Filtering the denoised voice data with a high-pass filter with cut-off frequency A Hz to obtain S3(n); windowing, calculating the autocorrelation coefficients, solving the line spectrum pair parameters with the Levinson-Durbin recursive algorithm, and quantizing the obtained line spectrum pair parameters with a three-stage vector quantization scheme;
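The Levinson-Durbin recursion named in step (1) converts autocorrelation coefficients into linear prediction coefficients; a standard implementation is sketched below (the subsequent line-spectrum-pair conversion and three-stage vector quantization are omitted).

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the normal equations for LPC coefficients a[0..order]
    (a[0] = 1) from autocorrelation values r[0..order]; returns the
    coefficients and the final prediction error energy."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = float(r[0])
    for i in range(1, order + 1):
        # Reflection coefficient k_i from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# AR(1) autocorrelation r[k] = 0.9**k has the exact predictor a1 = -0.9
a, err = levinson_durbin(np.array([1.0, 0.9, 0.81]), 2)
```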
(2) Quantizing the pitch value calculated in step (3) of step 1: the integer interval containing the pitch values is linearly mapped onto [0-z], and z is represented with m1 bits;
(3) The voice data S2(n) detected in step (2) of step 1 passes through a second-order inverse filter, whose coefficients are a1, a2 and 1, to obtain a prediction error signal r(n) free of formant influence; the gain parameter is expressed as the RMS of r(n), and quantization is completed in the logarithmic domain;
(4) Quantizing the maximum value obtained from the correlation function of the band-pass signal values after the frequency-domain segmentation of step (4) of step 1 into m2 bits;
(5) Computing the residual compensation gain: the quantized LSF parameters are used to compute linear prediction coefficients forming a prediction error filter, which filters the input speech S2(n) to obtain a residual signal of length 160 points;
(6) Windowing the prediction residual with a Hamming window of length 160 points, zero-padding the windowed signal to 512 points, performing a 512-point complex FFT on it, and finding the Fourier transform values corresponding to the first x harmonics with a spectral peak detection algorithm;
(7) Let P be the quantized pitch. Given that the initial position of the i-th harmonic is 512i/P, peak detection finds the largest peak within 512/P frequency samples centered on the initial position of each harmonic, the width being truncated to an integer; the number of harmonics searched is limited to the smaller of x and P/4. The coefficients corresponding to these harmonics are then normalized, and this x-dimensional vector is quantized with a vector codebook of m3 bits, m3 ∈ [0,48], yielding m3 bits.
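Steps (6)-(7) — a 512-point FFT of the windowed residual and a peak search of width 512/P around each nominal harmonic position 512i/P — can be sketched as follows; the test signal and the harmonic limit x are illustrative assumptions.

```python
import numpy as np

def harmonic_peaks(residual, P, x_max=10):
    """Sketch of steps 2(6)-(7): Hamming-window the 160-sample residual,
    zero-pad to 512, FFT, and for each harmonic i search for the largest
    spectral peak within a 512/P-wide window centred on 512*i/P.
    Normalization and codebook quantization are omitted."""
    w = np.hamming(160)
    spec = np.abs(np.fft.fft(residual * w, 512))
    half = int(512 / P) // 2                 # half of the truncated width
    n_harm = min(x_max, int(P / 4))          # smaller of x and P/4
    peaks = []
    for i in range(1, n_harm + 1):
        centre = int(512 * i / P)
        lo = max(centre - half, 1)
        hi = min(centre + half + 1, 256)
        peaks.append(lo + int(np.argmax(spec[lo:hi])))
    return peaks

# Residual with an 80-sample period: harmonics near multiples of 512/80 = 6.4 bins
res = np.sin(2 * np.pi * np.arange(160) / 80)
bins = harmonic_peaks(res, P=80)
```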
Moreover, the specific method for synthesizing the voice quantization parameter into the voice in the step 3 is as follows:
excitation is formed in each of several frequency bands and summed, then passed through the synthesis filter to obtain the synthesized speech; post-filtering the synthesized speech yields the decoded synthesized voice data. The z-transform transfer functions of the synthesis filter H(z) and the post-filter Hpf(z) are as follows:
H(z)=1/A(z)
wherein A(z) is 1 - az^-1 and a is the filter coefficient; z in all the above formulas is a complex variable with real and imaginary parts, and letting z = e^jw, γ = 0.56, β = 0.75; μ is determined by the reflection coefficient, the value of μ being dependent on
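A first-order sketch of the synthesis filter H(z) = 1/A(z) with A(z) = 1 - a·z^-1, i.e. the difference equation s[n] = e[n] + a·s[n-1]; the actual vocoder uses a higher-order filter derived from the line spectrum pairs plus the post-filter with γ and β, which are omitted here.

```python
import numpy as np

def synthesize(excitation, a):
    """Apply H(z) = 1/(1 - a*z^-1) to an excitation signal:
    s[n] = e[n] + a * s[n-1] (one-pole recursive filter)."""
    s = np.zeros(len(excitation))
    prev = 0.0
    for n, e in enumerate(excitation):
        prev = e + a * prev
        s[n] = prev
    return s

# Impulse response of the one-pole filter is a**n
imp = np.zeros(8)
imp[0] = 1.0
h = synthesize(imp, 0.5)
```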
Furthermore, the method also comprises the following steps before the step 1:
the encoding end is initialized and configured, including rate selection and the parameters and filter coefficients used by the encoding-end algorithm.
Moreover, the method further comprises the following steps before the step 3:
and initializing and configuring a decoding end, wherein the initialization and configuration comprises rate selection, parameters of an algorithm of the decoding end and filter coefficients.
The invention has the advantages and beneficial effects that:
1. By analyzing the continuity of speech in the time domain and its correlation in the frequency domain, the invention provides excellent voice quality at low rates, good voice quality in applications that lose the audio content below 300 Hz, and strong adaptability to speech.
2. The invention extracts the actual parameters in two stages, more accurately extracts the parameters, improves the sound quality and saves the calculation resources for users.
3. The invention has the function of sound reconstruction during error code, and the function is to calculate the current parameter based on the past parameter, thereby improving the sound quality during error code.
4. The invention inhibits the noise through the noise inhibiting function, improves the accuracy of the extracted sound parameters when the noise exists, and ensures the sound quality.
5. The invention adopts the codebook based on various local conversation training, and has strong adaptability to dialects.
6. The invention is developed based on standard codes, is standard and sustainable, and is easy to be transplanted to various hardware platforms.
Drawings
Fig. 1 is a working principle diagram of the present invention.
Detailed Description
The embodiments of the invention will be described in further detail below with reference to the accompanying drawings:
the input parameter of the voice data processing method of the NVOC low-speed narrowband vocoder of the invention is a linear PCM voice digital signal with the sampling rate of 8000Hz (the number of voice signal samples collected per second) and the resolution of 16 bits; in the time domain, every 20 milliseconds of analysis, and in the frequency domain, a plurality of frequency bands are divided into 0-4000 for analysis.
A voice data processing method of NVOC low-speed narrowband vocoder, as shown in fig. 1, comprising the following steps:
step 1, initializing and configuring the encoding end, including rate selection and the parameters and filter coefficients used by the encoding-end algorithm;
step 2, the encoding end performs initialization configuration and analysis processing on the original voice digital signal: firstly, denoising an original voice digital signal, then judging whether the current voice signal is voice, if so, extracting fundamental tone in the voice and then calculating the fundamental tone period and the unvoiced and voiced numerical parameters of each sub-band;
the step 2 specifically comprises the following steps:
(1) Noise suppression: denoising the original voice digital signal S(n) to obtain noise-suppressed voice data S1(n) and the 0 to 4000 Hz sound spectrum characteristics of the original data S(n);
the step 2, the step (1), comprises the following specific steps:
(1) a high-pass filter is adopted to remove direct-current components from voice data, improve high-frequency components and attenuate low frequency;
(2) windowing the signal with a Hamming window of length N and performing overlapped Fourier transforms to obtain the energy distribution on the frequency spectrum, yielding the denoised voice data S1(n) and the 0 to 4000 Hz sound spectrum characteristics of the original voice digital signal S(n).
(2) Voice detection: judging whether the current denoised voice signal is voice by adopting VAD activity detection, obtaining voice data S2(n);
The specific method of the step (2) in the step 2 comprises the following steps:
according to the auditory characteristics of the human ear, the denoised voice data S1(n) is sub-band filtered and the sub-band signal levels are calculated; the signal-to-noise ratio is estimated according to the following formula and compared with a preset threshold to judge whether the current speech signal is voice:
wherein, a is the signal level value of the current frame, and b is the current signal level value estimated from the previous frames;
(3) Pitch estimation, first stage: extracting the fundamental tone of the voice data S2(n);
the specific method of step (3) of step 2 comprises the following steps:
the voice data S2(n) is low-pass filtered with a low-pass filter with cut-off frequency B Hz; after the low-pass-filtered voice data is inverse filtered with a second-order inverse filter, the autocorrelation function of the output signal of the second-order inverse filtering is calculated according to the following formula, and the fundamental tone is extracted:
wherein N is the window length of the window function mentioned in step (1) of step 2, and Sw(i) is the output signal of the second-order inverse filtering in step (3) of step 2.
In this embodiment, in the frequency domain, the speech signal has peaks whose frequencies are multiples of the fundamental tone, and a possible pitch value or pitch range is preliminarily calculated. In the time domain, speech has short-term autocorrelation: if the original signal is periodic, its autocorrelation function is also periodic with the same period, and peaks occur at integer multiples of the period. An unvoiced signal is aperiodic and its autocorrelation function decays as the frame length increases; a voiced signal is periodic and its autocorrelation function peaks at integer multiples of the pitch period. A low-pass filter with cut-off frequency B Hz low-pass filters the voice data S2(n) to remove the influence of high-frequency components on pitch extraction; a second-order inverse filter then inverse-filters the low-pass-filtered voice data to remove the influence of formants, the autocorrelation function of the output of the second-order inverse filtering is calculated, and the fundamental tone is extracted:
In the autocorrelation function of the frame, excluding the maximum at zero lag, the pitch value of the frame is the sampling rate divided by the lag at which the maximum appears.
(4) A first stage of multi-subband voiced and unvoiced decision: calculating the value of unvoiced and voiced sounds of each sub-band
The step 2, the step (4) comprises the following specific steps:
(1) dividing the 0-4000 Hz frequency domain into 5 bands, [0-500] Hz, [500-1000] Hz, [1000-2000] Hz, [2000-3000] Hz and [3000-4000] Hz, and calculating the autocorrelation function of the band-pass signal in each interval with the following formula:
where * is the convolution operator and (·)* denotes complex conjugation;
(2) taking the average of the product of the values of the same time function at times t and t + τ, as a function of the delay τ: this average measures the similarity between the signal and its delayed version. When the delay is zero it becomes the mean square value of the signal, which is where it is largest; the maximum of this function over nonzero delays is taken as the voiced intensity to calculate the unvoiced/voiced value of each sub-band;
step 3, extracting and quantizing parameters of the line spectrum pair, the pitch value, the gain parameter, the residual compensation gain and the codebook vector on the basis of the numerical parameters of the pitch period, the unvoiced sound and the voiced sound obtained by calculation in the step 2 to obtain a sound quantization parameter;
the specific steps of the step 3 comprise:
(1) Filtering the denoised voice data with a high-pass filter with cut-off frequency A Hz to obtain S3(n); applying a Hamming window of length N2, calculating the autocorrelation coefficients, solving the line spectrum pair parameters with the Levinson-Durbin recursive algorithm, and quantizing the obtained line spectrum pair parameters with a three-stage vector quantization scheme to obtain m1 bits;
(2) Quantizing the pitch value calculated in step (3) of step 2: the integer interval containing the pitch values is linearly mapped onto [0-z], and z is represented with m2 bits;
(3) The voice data S2(n) detected in step (2) of step 2 passes through a second-order inverse filter, whose coefficients are a1, a2 and 1, to obtain a prediction error signal r(n) free of formant influence; the excitation gain parameter is expressed as the RMS (root mean square) of r(n), and quantization is completed in the logarithmic domain;
(4) Quantizing the maximum value (i.e. the unvoiced/voiced state value) obtained from the correlation function of the band-pass signal values after the frequency-domain segmentation of step (4) of step 2 into m3 bits;
(5) Calculating the spectral compensation gain: the quantized linear prediction coefficients form a prediction error filter, which filters the input speech S2(n) to obtain a residual signal of length 160 points;
(6) Windowing the prediction residual with a Hamming window of length 160 points, zero-padding the windowed signal to 512 points, performing a 512-point complex FFT on it, and finding the Fourier transform values corresponding to the first x harmonics with a spectral peak detection algorithm;
(7) Let P be the quantized pitch. Given that the initial position of the i-th harmonic is 512i/P, peak detection looks for the largest peak within 512/P frequency samples centered on the initial position of each harmonic, this width being truncated to an integer. The number of harmonics searched is limited to the smaller of x and P/4. The coefficients corresponding to these harmonics are then normalized, and this x-dimensional vector is quantized with a vector codebook of m4 bits, m4 ∈ [0,48], yielding m4 bits.
Step 4, initializing and configuring a decoding end, wherein the initialization and configuration comprises the speed selection (2.2 kbps or 2.4 kbps), and parameters of an algorithm of the decoding end, filter coefficients and the like;
and 5, after extracting the voice quantization parameters of step 3, synthesizing them into speech, improving voice quality through noise suppression, and performing voice reconstruction when parameter recovery or voice synthesis fails.
The specific method of the step 5 comprises the following steps:
the result of encoding each frame signal is a numerical value formed by converting the line spectrum pair, gain, pitch period, unvoiced/voiced values and vector codebook into bits. Among these parameters, the pitch period and the unvoiced/voiced values determine the excitation source for synthesizing the speech signal at the decoding end; according to step 1(4) of the encoding end, the unvoiced/voiced signal covers 5 bands, so the decoded synthesized voice data is obtained by forming the excitation in these bands, summing it, and passing it through the synthesis filter and the post-filter. If the frame is an unvoiced frame, i.e. all unvoiced/voiced bits are 0, a random number is used as the excitation source; if it is a voiced frame, a periodic pulse sequence passed through an all-pass filter generates the excitation source, its amplitude is weighted by the gain parameter, and its length in sampling points depends on the pitch period. The z-transform transfer functions of the all-pass filter H1(z), the synthesis filter H2(z) and the post-filter Hpf(z) are as follows:
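The unvoiced/voiced excitation choice described above can be sketched as follows; the all-pass shaping of the voiced pulse train is omitted, and the function name and arguments are illustrative, not from the patent.

```python
import numpy as np

def make_excitation(voiced, pitch_period, gain, length=160, seed=0):
    """Sketch of decoder excitation selection (step 5): an unvoiced frame
    (all voicing bits 0) uses random noise; a voiced frame uses a periodic
    pulse train spaced by the pitch period; amplitude is weighted by the
    gain parameter."""
    if voiced:
        e = np.zeros(length)
        e[::pitch_period] = 1.0        # periodic pulse train
    else:
        e = np.random.default_rng(seed).standard_normal(length)
    return gain * e

ev = make_excitation(True, 40, 2.0)    # voiced: pulses every 40 samples
eu = make_excitation(False, 40, 1.0)   # unvoiced: noise excitation
```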
where A(z) = 1 - az^(-1) and a is the filter coefficient, obtained by the P-transformation of the line spectrum pair parameters in step 3 of the encoding end (the P-transformation is a higher mathematical transformation). In all the above formulas z is a complex variable with a real part and an imaginary part, z = e^(jw), γ = 0.56, β = 0.75, and μ is determined by the reflection coefficient; the value of μ depends on
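The excitation construction and the first-order synthesis filter described above can be sketched as follows. The transfer-function formulas themselves are not reproduced in this text, so only the stated behaviour is illustrated; the function names, frame length default and use of Gaussian noise for the random source are assumptions.

```python
import numpy as np

def make_excitation(voicing_bits, pitch, gain, frame_len=160, seed=0):
    """Per-frame excitation as described: random numbers for an unvoiced
    frame (all voicing bits 0); a gain-weighted periodic pulse sequence
    spaced by the pitch period for a voiced frame."""
    if not any(voicing_bits):                  # unvoiced frame
        rng = np.random.default_rng(seed)
        return gain * rng.standard_normal(frame_len)
    exc = np.zeros(frame_len)
    exc[::pitch] = gain                        # one pulse per pitch period
    return exc

def synthesis_filter(exc, a):
    """All-pole synthesis filter H(z) = 1/A(z) with A(z) = 1 - a*z^-1,
    i.e. y[n] = e[n] + a*y[n-1]."""
    out = np.empty_like(exc)
    y = 0.0
    for n, e in enumerate(exc):
        y = e + a * y
        out[n] = y
    return out
```

A unit impulse through the filter with a = 0.5 yields the expected decaying impulse response 1, 0.5, 0.25, …; a voiced frame with pitch 40 produces four pulses in a 160-sample frame.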
It will be understood that the encoding and decoding algorithms correspond to each other, as do the input parameter format of the decoding end and the output parameter format of the encoding end; the decoder outputs 160 sample values per decoded frame, and the rate must be matched to that of the encoder when it is called.
It should be emphasized that the examples described herein are illustrative and not restrictive, and thus the present invention includes, but is not limited to, those examples described in this detailed description, as well as other embodiments that can be derived from the teachings of the present invention by those skilled in the art and that are within the scope of the present invention.
Claims (7)
1. A voice data processing method of an NVOC low-speed narrowband vocoder, characterized in that the method comprises the following steps:
step 1, the encoding end performs initialization configuration and analysis processing on the original voice digital signal: first denoise the original voice digital signal, then judge whether the current voice signal is voice; if it is voice, extract the pitch from the voice, and then calculate the pitch period and the unvoiced/voiced numerical parameters of each sub-band;
step 2, on the basis of the pitch period and unvoiced/voiced numerical parameters calculated in step 1, extracting and quantizing the line spectrum pair, pitch value, gain parameter, residual compensation gain and codebook vector parameters to obtain the voice quantization parameters;
step 3, after the voice quantization parameters of step 2 have been extracted, synthesizing them into speech, improving speech quality through noise suppression, and performing speech reconstruction when parameter recovery or speech synthesis fails;
step 1 comprises the following specific steps:
(1) Denoising the original voice digital signal S(n) to obtain the denoised voice data S1(n) and the 0 to 4000 Hz spectral characteristics of the original data S(n);
(2) Judging whether the denoised voice signal is voice by adopting VAD (voice activity detection) technology, obtaining the voice data S2(n);
(3) Extracting the pitch of the voice data S2(n);
(4) Calculating the pitch period and the unvoiced and voiced numerical parameters of each sub-band;
the specific steps of step 2 comprise:
(1) Filtering the denoised voice data with a high-pass filter with a cut-off frequency of A Hz to obtain S3(n), windowing, calculating autocorrelation coefficients, solving the line spectrum pair parameters with the Levinson-Durbin recursive algorithm, and quantizing the obtained line spectrum pair parameters with a three-stage vector quantization scheme;
(2) Quantizing the pitch value calculated in step 1 (3): the integer interval containing the pitch values is linearly mapped onto [0, z], and the z values are represented with m1 bits;
(3) Passing the voice data S2(n) detected as voice in step 1 (2) through a second-order inverse filter to obtain a prediction error signal r(n) free of formant influence, the coefficients of the second-order inverse filter being a1, a2 and 1; the gain parameter is expressed as the RMS of r(n), and its quantization is completed in the logarithmic domain;
(4) Quantizing the maximum value of the correlation function of the band-pass signal values after the frequency-domain segmentation of step 1 (4) into m2 bits;
(5) Computing the residual compensation gain: compute the linear prediction coefficients from the quantized LSF parameters to form a prediction error filter, and filter the input speech S2(n) with it to obtain a residual signal of length 160 points;
(6) Windowing the prediction residual with a Hamming window of length 160 points, zero-padding the windowed signal to 512 points, performing a 512-point complex FFT on it, and finding the Fourier transform values corresponding to the first x harmonics with a spectral peak detection algorithm;
(7) Setting P as the quantized pitch and 512i/P as the initial position of the i-th harmonic, peak detection searches for the maximum peak within a width of 512/P frequency samples centered on the initial position of each harmonic, the width being truncated to an integer; the number of harmonics searched is limited to the smaller of x and P/4; the coefficients corresponding to the harmonics are then normalized, and the resulting x-dimensional vector is quantized with a vector codebook of m3 ∈ [0, 48] bits, yielding m3 ∈ [0, 48] bits;
the specific method for synthesizing the voice quantization parameters into voice in step 3 comprises:
forming an excitation separately in each of several frequency bands of the speech, summing the excitations and passing them through a synthesis filter to obtain synthesized speech, and post-filtering the synthesized speech to obtain the decoded synthesized speech data, wherein the z-transform transfer functions of the synthesis filter H(z) and the post-filter Hpf(z) are as follows:
H(z)=1/A(z)
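Step 2 (1) of claim 1 solves for the linear prediction coefficients with the Levinson-Durbin recursion. A minimal sketch of that recursion, assuming the autocorrelation values r[0..order] have already been computed from the windowed frame (the function name and the returned triple are illustrative):

```python
def levinson_durbin(r, order):
    """Levinson-Durbin recursion: solve the normal equations for the LPC
    coefficients a[0..order] (with a[0] = 1) from the autocorrelation
    values r[0..order]; also returns the reflection coefficients and the
    final prediction error energy."""
    a = [1.0]
    err = r[0]
    refl = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                         # i-th reflection coefficient
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= 1.0 - k * k
        refl.append(k)
    return a, refl, err
```

For a first-order process with autocorrelation r = [1, 0.5, 0.25] the recursion recovers the single predictor -0.5 and a zero second coefficient, as expected.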
2. The voice data processing method of the NVOC low-speed narrowband vocoder as claimed in claim 1, characterized in that step 1 (1) comprises the following specific steps:
(1) adopting a high-pass filter to remove the DC component from the voice data, boost the high-frequency components and attenuate the low frequencies;
(2) windowing the signal with a Hamming window of length N and performing an overlapped Fourier transform to obtain the energy distribution over the spectrum, yielding the denoised voice data S1(n) and the 0 to 4000 Hz spectral characteristics of the original voice digital signal S(n).
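The front end of claim 2 might look like the following sketch. The pre-emphasis coefficient 0.9375, the frame length 160 and the 512-point FFT are illustrative assumptions; the claim fixes none of these values.

```python
import numpy as np

def front_end(s, frame_len=160, nfft=512, alpha=0.9375):
    """First-order high-pass pre-emphasis y[n] = x[n] - alpha*x[n-1]
    (removes DC, boosts high frequencies, attenuates lows), then a
    Hamming window and FFT of the first frame to obtain the energy
    distribution over the spectrum."""
    s1 = np.append(s[0], s[1:] - alpha * s[:-1])
    frame = s1[:frame_len] * np.hamming(frame_len)
    energy = np.abs(np.fft.rfft(frame, nfft)) ** 2
    return s1, energy
```

On a constant (pure DC) input the pre-emphasized signal collapses to a small residual, illustrating the DC removal the claim describes.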
3. The voice data processing method of the NVOC low-speed narrowband vocoder as claimed in claim 1, characterized in that the specific method of step 1 (2) comprises:
according to the auditory characteristics of the human ear, sub-band filtering the denoised voice data S1(n) and calculating the level of the sub-band signals, estimating the signal-to-noise ratio according to the following formula, and comparing it with a preset threshold value to judge whether the current speech signal is voice:
where a is the signal level value of the current frame and b is the signal level value estimated from the previous frames.
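A simplified, full-band sketch of this VAD decision follows. The SNR formula itself is not reproduced in this text, so a plain level ratio in dB is used; the 6 dB threshold and the 0.98 smoothing factor for tracking the background level b are assumptions.

```python
import numpy as np

def vad(frame, noise_level, threshold_db=6.0, beta=0.98):
    """Compare the SNR estimated from the current frame level a and the
    background level tracked over previous frames (b) against a preset
    threshold; update the background level on non-speech frames.
    noise_level must be positive."""
    a = np.sqrt(np.mean(np.asarray(frame, dtype=float) ** 2))
    snr_db = 20.0 * np.log10(a / noise_level)
    is_speech = snr_db > threshold_db
    if not is_speech:                          # adapt b during silence
        noise_level = beta * noise_level + (1.0 - beta) * a
    return is_speech, noise_level
```

A frame well above the tracked background is classified as voice and leaves the background estimate untouched; a quiet frame is rejected and slowly pulls the estimate toward its own level.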
4. The voice data processing method of the NVOC low-speed narrowband vocoder as claimed in claim 1, characterized in that the specific method of step 1 (3) comprises:
low-pass filtering the voice data S2(n) with a low-pass filter with a cut-off frequency of B Hz, then passing the low-pass-filtered voice data through a second-order inverse filter, calculating the autocorrelation function of the output signal of the second-order inverse filter according to the following formula, and extracting the pitch:
where N is the window length of the window function mentioned in step 1 and Sw(i) is the output signal of the second-order inverse filtering in step 1 (3).
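Since the autocorrelation formula itself is not reproduced here, the sketch below shows the standard form of this step: pick the lag that maximises the autocorrelation over a plausible pitch range. The 8 kHz sampling rate and the 50-400 Hz search range are assumptions.

```python
import numpy as np

def extract_pitch(s, fs=8000, f_lo=50, f_hi=400):
    """Estimate the pitch period (in samples) as the lag maximising the
    autocorrelation function, searched over the plausible pitch range."""
    lag_lo, lag_hi = fs // f_hi, fs // f_lo    # 20..160 samples at 8 kHz
    s = np.asarray(s, dtype=float) - np.mean(s)
    ac = np.correlate(s, s, mode="full")[len(s) - 1:]  # lags 0..N-1
    return lag_lo + int(np.argmax(ac[lag_lo:lag_hi + 1]))
```

For a 100 Hz tone sampled at 8 kHz the detected period is 80 samples, i.e. one pitch cycle.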
5. The voice data processing method of the NVOC low-speed narrowband vocoder as claimed in claim 1, characterized in that step 1 (4) comprises the following specific steps:
(1) dividing the 0 to 4000 Hz frequency domain into 5 frequency bands at the intervals [0-500] Hz, [500-1000] Hz, [1000-2000] Hz, [2000-3000] Hz and [3000-4000] Hz, and calculating the autocorrelation function of the band-pass signal in each interval with the following formula:
where t is the continuous time argument, τ is the input signal delay, * is the convolution operator, and f*( ) denotes the complex conjugate;
(2) taking the average value of the product of two values of the same time function at the instants t and t + a as a function of the time t; this is a measure of the similarity between the signal and its delayed version. When the delay is zero, the mean square value of the signal is obtained, which is then the maximum of the function; this maximum is taken as the voiced intensity to calculate the unvoiced/voiced value of each sub-band.
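A discrete sketch of this per-band voicing measure follows. The idealised FFT-masking band-pass bank and the normalised-autocorrelation form of the voicing value are assumptions standing in for the patent's analog formulation; the band edges are those stated in step (1).

```python
import numpy as np

BAND_EDGES = (0, 500, 1000, 2000, 3000, 4000)  # Hz, the 5 fixed bands

def band_split(s, fs=8000, edges=BAND_EDGES):
    """Split the signal into the five bands by zeroing FFT bins outside
    each band and transforming back (an idealised band-pass bank)."""
    spec = np.fft.rfft(s)
    freqs = np.fft.rfftfreq(len(s), d=1.0 / fs)
    return [np.fft.irfft(spec * ((freqs >= lo) & (freqs < hi)), len(s))
            for lo, hi in zip(edges[:-1], edges[1:])]

def voicing_value(band, pitch):
    """Voiced strength of one band: the normalised autocorrelation at the
    pitch lag (the function's maximum, the mean square, is at zero lag)."""
    r0 = float(np.dot(band, band))
    if r0 < 1e-12:                             # silent band
        return 0.0
    return float(np.dot(band[:-pitch], band[pitch:])) / r0
```

A 100 Hz tone lands entirely in the lowest band, where the autocorrelation at its 80-sample pitch lag is strongly positive, marking the band as voiced.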
6. The voice data processing method of the NVOC low-speed narrowband vocoder as claimed in claim 1, characterized in that the method further comprises, before step 1:
initializing and configuring the encoding end, including rate selection and the parameters and filter coefficients used by the encoding-end algorithm.
7. The voice data processing method of the NVOC low-speed narrowband vocoder as claimed in claim 1, characterized in that the method further comprises, before step 3:
initializing and configuring the decoding end, including rate selection, the parameters of the decoding-end algorithm and the filter coefficients.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011049193.1A CN112270934B (en) | 2020-09-29 | 2020-09-29 | Voice data processing method of NVOC low-speed narrow-band vocoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112270934A CN112270934A (en) | 2021-01-26 |
CN112270934B true CN112270934B (en) | 2023-03-28 |
Family
ID=74349393
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011049193.1A Active CN112270934B (en) | 2020-09-29 | 2020-09-29 | Voice data processing method of NVOC low-speed narrow-band vocoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112270934B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113486964A (en) * | 2021-07-13 | 2021-10-08 | 盛景智能科技(嘉兴)有限公司 | Voice activity detection method and device, electronic equipment and storage medium |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6195632B1 (en) * | 1998-11-25 | 2001-02-27 | Matsushita Electric Industrial Co., Ltd. | Extracting formant-based source-filter data for coding and synthesis employing cost function and inverse filtering |
CN1604188A (en) * | 2004-11-12 | 2005-04-06 | 梁华伟 | Voice coding stimulation method based on multimodal extraction |
CN101556799A (en) * | 2009-05-14 | 2009-10-14 | 华为技术有限公司 | Audio decoding method and audio decoder |
CN102044243A (en) * | 2009-10-15 | 2011-05-04 | 华为技术有限公司 | Method and device for voice activity detection (VAD) and encoder |
CN102903365A (en) * | 2012-10-30 | 2013-01-30 | 山东省计算中心 | Method for refining parameter of narrow band vocoder on decoding end |
CN103050121A (en) * | 2012-12-31 | 2013-04-17 | 北京迅光达通信技术有限公司 | Linear prediction speech coding method and speech synthesis method |
CN103247293A (en) * | 2013-05-14 | 2013-08-14 | 中国科学院自动化研究所 | Coding method and decoding method for voice data |
CN103325375A (en) * | 2013-06-05 | 2013-09-25 | 上海交通大学 | Coding and decoding device and method of ultralow-bit-rate speech |
CN104318927A (en) * | 2014-11-04 | 2015-01-28 | 东莞市北斗时空通信科技有限公司 | Anti-noise low-bitrate speech coding method and decoding method |
CN104517614A (en) * | 2013-09-30 | 2015-04-15 | 上海爱聊信息科技有限公司 | Voiced/unvoiced decision device and method based on sub-band characteristic parameter values |
CN105118513A (en) * | 2015-07-22 | 2015-12-02 | 重庆邮电大学 | 1.2kb/s low-rate speech encoding and decoding method based on mixed excitation linear prediction MELP |
CN107564535A (en) * | 2017-08-29 | 2018-01-09 | 中国人民解放军理工大学 | A kind of distributed low rate speech call method |
CN109308894A (en) * | 2018-09-26 | 2019-02-05 | 中国人民解放军陆军工程大学 | Voice modeling method based on Bloomfield's model |
CN109346093A (en) * | 2018-12-17 | 2019-02-15 | 山东省计算中心(国家超级计算济南中心) | A kind of fusion method of low rate vocoder sub-band surd and sonant parameter extraction and quantization |
CN111694027A (en) * | 2020-06-04 | 2020-09-22 | 长沙北斗产业安全技术研究院有限公司 | Method and device for capturing super-large dynamic spread spectrum signal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5450522A (en) | Auditory model for parametrization of speech | |
EP2491558B1 (en) | Determining an upperband signal from a narrowband signal | |
Shrawankar et al. | Techniques for feature extraction in speech recognition system: A comparative study | |
JP4308345B2 (en) | Multi-mode speech encoding apparatus and decoding apparatus | |
KR100348899B1 (en) | The Harmonic-Noise Speech Coding Algorhthm Using Cepstrum Analysis Method | |
JPH05346797A (en) | Voiced sound discriminating method | |
JP2002516420A (en) | Voice coder | |
Kesarkar et al. | Feature extraction for speech recognition | |
CN103854662A (en) | Self-adaptation voice detection method based on multi-domain joint estimation | |
JPH07271394A (en) | Removal of signal bias for sure recognition of telephone voice | |
JP2002508526A (en) | Broadband language synthesis from narrowband language signals | |
CN112270934B (en) | Voice data processing method of NVOC low-speed narrow-band vocoder | |
BRPI0208584B1 (en) | method for forming speech recognition parameters | |
JP2779325B2 (en) | Pitch search time reduction method using pre-processing correlation equation in vocoder | |
Robinson | Speech analysis | |
US5812966A (en) | Pitch searching time reducing method for code excited linear prediction vocoder using line spectral pair | |
WO2015084658A1 (en) | Systems and methods for enhancing an audio signal | |
CN114550741A (en) | Semantic recognition method and system | |
CN112233686B (en) | Voice data processing method of NVOCPLUS high-speed broadband vocoder | |
Demuynck et al. | Synthesizing speech from speech recognition parameters | |
JP4527175B2 (en) | Spectral parameter smoothing apparatus and spectral parameter smoothing method | |
Srivastava | Fundamentals of linear prediction | |
JPH0736484A (en) | Sound signal encoding device | |
CN118230741A (en) | Low-rate voice encoding and decoding method based on sine harmonic model | |
Tan et al. | Speech feature extraction and reconstruction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||