
CN111899724A - Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment - Google Patents

Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment

Info

Publication number
CN111899724A
CN111899724A
Authority
CN
China
Prior art keywords
hilbert
transform
inherent
spectrum
speech feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010781082.3A
Other languages
Chinese (zh)
Inventor
胡乔林
王亨佳
刘剑豪
赵国林
王良斯
石子言
都兴霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Early Warning Academy
Original Assignee
Air Force Early Warning Academy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Early Warning Academy filed Critical Air Force Early Warning Academy
Priority to CN202010781082.3A priority Critical patent/CN111899724A/en
Publication of CN111899724A publication Critical patent/CN111899724A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/147 Discrete orthonormal transforms, e.g. discrete cosine transform, discrete sine transform, and variations therefrom, e.g. modified discrete cosine transform, integer transforms approximating the discrete cosine transform
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using orthogonal transformation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Discrete Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a voice feature coefficient extraction method based on the Hilbert-Huang transform, which comprises: performing empirical mode decomposition on an input voice signal to obtain a plurality of intrinsic mode functions; performing Hilbert spectral analysis on each intrinsic mode function to obtain its Hilbert marginal spectrum; respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components; and extracting a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on it to obtain new Mel cepstrum coefficients. The technical scheme of the invention distinguishes speakers better and reflects a more realistic spectral feature distribution of the voice signal.

Description

Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment
Technical Field
The invention relates to the technical field of voice processing, in particular to a voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment.
Background
Feature coefficient extraction refers to the process of extracting, from a speaker's voice, feature information that can characterize the speaker. It is the first stage in speaker recognition and is of great importance.
At present, the most widely used feature parameter is the MFCC (Mel-Frequency Cepstral Coefficient), which is extracted based on the auditory perception characteristics of the human ear and describes the nonlinear characteristics of human auditory frequency fairly accurately. MFCCs, however, are extracted with the Fourier transform, which carries an a priori assumption when analyzing speech signals: that speech is short-term stationary. Based on this assumption, before analysis the speech signal is divided into many short segments by framing and windowing, and a short-time Fourier transform is performed within each segment to analyze the time-frequency characteristics of the speech signal. Strictly speaking, however, a speech signal is a nonlinear, non-stationary signal, and such frame truncation can cause leakage of the speech spectrum, so the spectral characteristics of speech cannot be truly reflected.
Disclosure of Invention
Against the defects of the prior art, the invention aims to provide a voice feature coefficient extraction method based on the Hilbert-Huang transform, so as to solve the problem that existing voice feature coefficient extraction methods suffer from leakage of the speech spectrum and therefore cannot truly reflect the spectral characteristics of speech.
The invention provides a voice feature coefficient extraction method based on the Hilbert-Huang transform, which comprises the following steps:
carrying out empirical mode decomposition on an input voice signal to obtain a plurality of intrinsic mode functions;
performing Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components;
and extracting a logarithmic energy spectrum from the amplitude responses, and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
Preferably, before the empirical mode decomposition is performed on the input voice signal to obtain a plurality of intrinsic mode functions, the method further includes:
pre-emphasizing the input initial audio signal, so that the voice signal is purer and more prominent and its features are easier to extract, where the pre-emphasis formula is:
y(n)=x(n)-μx(n-1)
where μ is the pre-emphasis coefficient, y(n) is the pre-emphasized speech signal, and x(n) and x(n-1) are samples of the initial audio signal.
Preferably, the empirical mode decomposition of the voice signal to obtain a plurality of intrinsic mode functions specifically includes:
acquiring all local maxima and all local minima of the initial audio signal;
and respectively performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions.
Preferably, the performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions specifically includes:
generating an upper envelope from all the local maxima, and generating a lower envelope from all the local minima;
calculating the average of the upper envelope and the lower envelope;
and generating a plurality of intrinsic mode functions based on the average and a preset rule.
Preferably, the generating a plurality of intrinsic mode functions based on the average and a preset rule specifically includes:
calculating a first mode function component based on the average and a first preset formula;
judging whether the first mode function component meets a preset component condition;
when the first mode function component meets the preset component condition, taking the first mode function component as an intrinsic mode function; and when the first mode function component does not meet the preset component condition, returning to the step of generating the upper envelope from all the local maxima and the lower envelope from all the local minima.
Preferably, the extracting a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients specifically includes:
discretizing the Hilbert marginal spectrum of the amplitude responses and then squaring to obtain an energy spectrum;
filtering the energy spectrum through a Mel filter bank to obtain logarithmic energy spectrums on different frequency components;
and performing a discrete cosine transform on the logarithmic energy spectrum and removing the correlation between parameters to obtain Mel cepstrum coefficients.
In order to achieve the above object, the present invention further provides a Hilbert-Huang transform-based speech feature coefficient extraction device, where the device includes:
the decomposition module is used for performing empirical mode decomposition on the voice signal to obtain a plurality of intrinsic mode functions;
the analysis module is used for performing Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
the filtering module is used for respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components;
and the transformation module is used for extracting a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
In order to achieve the above object, the present invention further provides a Hilbert-Huang transform-based speech feature coefficient extraction device, wherein the device includes: a memory, a processor, and a Hilbert-Huang transform-based speech feature coefficient extraction program stored on the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method described above.
In order to achieve the above object, the present invention further provides a storage medium on which a Hilbert-Huang transform-based speech feature coefficient extraction program is stored, where the program, when executed by a processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method described above.
According to the technical scheme provided by the invention, empirical mode decomposition is performed on an input voice signal to obtain a plurality of intrinsic mode functions; Hilbert spectral analysis is performed on each intrinsic mode function to obtain its Hilbert marginal spectrum; the Hilbert marginal spectrums of the intrinsic mode functions are respectively filtered to obtain amplitude responses of the voice signal on different frequency components; and a logarithmic energy spectrum is extracted from the amplitude responses and a discrete cosine transform is performed on it to obtain new Mel cepstrum coefficients. The invention exploits the superiority of the Hilbert-Huang transform in analyzing nonlinear, non-stationary signals and provides a speaker feature coefficient extraction algorithm oriented to auditory perception, through which new Mel cepstrum coefficients are obtained. The new Mel cepstrum coefficients reflect the auditory perception characteristics of the human ear, distinguish speakers well, and reflect a more realistic spectral feature distribution of the voice signal.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings according to the structures shown in these drawings without creative effort.
FIG. 1 is a flowchart of an embodiment of a Hilbert-Huang transform-based method for extracting speech feature coefficients according to the present invention;
FIG. 2 is a flowchart illustrating another embodiment of a Hilbert-Huang transform-based method for extracting speech feature coefficients according to the present invention;
FIG. 3 is a detailed flowchart of step S20 in FIG. 1;
FIG. 4 is a detailed flowchart of step S220 in FIG. 3;
FIG. 5 is a detailed flowchart of step S50 in FIG. 1;
FIG. 6 is a flowchart of another embodiment of the Hilbert-Huang transform-based speech feature coefficient extraction apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and back … …) are involved in the embodiment of the present invention, the directional indications are only used to explain the relative positional relationship between the components, the movement situation, and the like in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indications are changed accordingly.
In addition, if there is a description of "first", "second", etc. in an embodiment of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the meaning of "and/or" appearing throughout includes three juxtapositions, exemplified by "A and/or B" including either A or B or both A and B. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the speech feature coefficient extraction method based on the Hilbert-Huang transform according to the present invention.
In the first embodiment, the Hilbert-Huang transform-based speech feature coefficient extraction method includes the following steps:
step S20: and carrying out empirical mode decomposition on the input voice signal to obtain a plurality of inherent mode functions.
It should be noted that Empirical Mode Decomposition (EMD) is an adaptive time-frequency signal processing method that is particularly suitable for the analysis and processing of nonlinear, non-stationary signals.
As for the intrinsic mode function (IMF): from the physical meaning of instantaneous frequency it is known that instantaneous frequency is not meaningful for an arbitrary signal, but only when the signal contains a single oscillatory mode without complex superimposed waves. In fact, the necessary conditions for a meaningful instantaneous frequency are that the function be symmetric about its local zero mean and that its numbers of zero crossings and extrema be equal. The concept of the intrinsic mode function was proposed for this reason.
Step S30: performing Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function.
Step S40: respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain the amplitude responses of the voice signal on different frequency components. In this embodiment, a Mel filter bank is used for the filtering.
Step S50: obtaining a logarithmic energy spectrum from the amplitude responses, and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
The Discrete Cosine Transform (DCT) is a Fourier-related transform similar to the Discrete Fourier Transform (DFT), but using only real numbers. A DCT is equivalent to a DFT of roughly twice the length operating on a real even function (since the Fourier transform of a real even function is itself real and even), and in some variants the input or output positions are shifted by half a sample.
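This DFT relationship can be checked numerically. The following sketch (an editorial illustration assuming NumPy and SciPy, not part of the patent) verifies that the DCT-II of a length-N real signal equals the first N bins of the DFT of its even-symmetric extension after a half-sample phase correction:

```python
import numpy as np
from scipy.fft import dct

x = np.array([1.0, 2.0, 0.5, -1.0])
N = len(x)

# Unnormalized DCT-II as computed by scipy:
#   X[k] = 2 * sum_n x[n] * cos(pi*k*(2n+1)/(2N))
X_dct = dct(x, type=2, norm=None)

# Even-symmetric extension of length 2N; its DFT bins reproduce the
# DCT-II once the half-sample phase factor exp(-j*pi*k/(2N)) is applied.
x_ext = np.concatenate([x, x[::-1]])
X_dft = np.fft.fft(x_ext)
k = np.arange(N)
X_from_dft = (X_dft[:N] * np.exp(-1j * np.pi * k / (2 * N))).real

print(np.allclose(X_dct, X_from_dft))  # True
```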
According to the technical scheme provided by the invention, empirical mode decomposition is performed on an input voice signal to obtain a plurality of intrinsic mode functions; Hilbert spectral analysis is performed on each intrinsic mode function to obtain its Hilbert marginal spectrum; the Hilbert marginal spectrums of the intrinsic mode functions are respectively filtered to obtain amplitude responses of the voice signal on different frequency components; and a logarithmic energy spectrum is obtained from the amplitude responses and a discrete cosine transform is performed on it to obtain new Mel cepstrum coefficients. The invention exploits the superiority of the Hilbert-Huang transform in analyzing nonlinear, non-stationary signals and provides a speaker feature coefficient extraction algorithm oriented to auditory perception, through which new Mel cepstrum coefficients are obtained. The new Mel cepstrum coefficients reflect the auditory perception characteristics of the human ear, distinguish speakers well, and reflect a more realistic spectral feature distribution of the voice signal.
Referring to fig. 2, before the empirical mode decomposition is performed on the input voice signal to obtain a plurality of intrinsic mode functions, the method further includes:
Step S10: preprocessing the input initial audio signal to make the voice signal purer and more prominent, and its features easier to extract.
In this embodiment, the input initial audio signal is pre-emphasized so that the voice signal is purer and more prominent and its features are easier to extract, where the pre-emphasis formula is:
y(n)=x(n)-μx(n-1)
where μ is the pre-emphasis coefficient, y(n) is the pre-emphasized speech signal, and x(n) and x(n-1) are samples of the initial audio signal.
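A minimal sketch of this pre-emphasis step (an editorial illustration in Python with NumPy; μ = 0.97 is a typical choice, not a value fixed by the patent):

```python
import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """y(n) = x(n) - mu * x(n-1); the first sample is passed through unchanged."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - mu * x[:-1]
    return y

# Example: pre-emphasis boosts the high-frequency content of a short signal.
x = np.array([0.0, 0.5, 0.9, 0.4, -0.2])
print(pre_emphasis(x))
```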
Referring to fig. 3, further, the performing empirical mode decomposition on the voice signal to obtain a plurality of intrinsic mode functions specifically includes:
step S210: acquiring all local maxima and all local minima of the initial audio signal;
step S220: respectively performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions.
Referring to fig. 4, step S220 specifically includes:
step S221: generating an upper envelope from all the local maxima, and generating a lower envelope from all the local minima;
step S222: calculating the average of the upper envelope and the lower envelope;
step S223: generating a plurality of intrinsic mode functions based on the average and a preset rule.
In this embodiment, step S223 specifically includes:
calculating a first mode function component based on the average and a first preset formula;
judging whether the first mode function component meets a preset component condition;
when the first mode function component meets the preset component condition, taking the first mode function component as an intrinsic mode function; and when the first mode function component does not meet the preset component condition, returning to the step of generating the upper envelope from all the local maxima and the lower envelope from all the local minima.
Step S20 of this embodiment is now further described with reference to a specific example:
step 2.1: finding all local maxima and minima of the noisy speech signal x(t);
step 2.2: respectively performing cubic spline interpolation on all the local maxima and on all the local minima to obtain an upper envelope formed by the local maxima and a lower envelope formed by the local minima, denoted u(t) and l(t) respectively;
step 2.3: calculating the mean of the upper and lower envelopes as:
m(t)=(u(t)+l(t))/2
step 2.4: letting h(t)=x(t)-m(t), verifying whether h(t) satisfies the conditions of an IMF (Intrinsic Mode Function) component; if so, h(t) is the first mode function component, denoted imf_1(t); if not, taking h(t) as the signal to be decomposed and restarting from step 2.1 until the conditions of a mode function component are met;
step 2.5: taking the first residue r_1(t)=x(t)-imf_1(t) as the signal to be decomposed and repeating steps 2.1 to 2.4 to obtain the second IMF component imf_2(t), at which point r_2(t)=r_1(t)-imf_2(t);
step 2.6: repeating step 2.5 until the residue r_n(t) can no longer be decomposed, thereby obtaining the several mode functions imf_1(t), imf_2(t), ..., imf_n(t) of the noisy speech signal x(t).
Referring to fig. 5, the obtaining a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients specifically includes:
step S510: discretizing the Hilbert marginal spectrum of the amplitude responses and then squaring to obtain an energy spectrum;
step S520: filtering the energy spectrum through a Mel filter bank to obtain logarithmic energy spectrums on different frequency components;
step S530: performing a discrete cosine transform on the logarithmic energy spectrum and removing the correlation between parameters to obtain Mel cepstrum coefficients.
Further, step S30 is specifically implemented by the following sub-steps:
step 3.1: taking the intrinsic mode functions obtained from the initial audio signal respectively as the signals to be decomposed, and repeating steps 2.1 to 2.6 to obtain the ith intrinsic mode function component imf_i(t);
step 3.2: computing the Hilbert spectrum H_i(ω, t) of the ith intrinsic mode function component and its Hilbert marginal spectrum H_i(ω), where the marginal spectrum is obtained by integrating the Hilbert spectrum over time.
The Hilbert marginal spectrum of each intrinsic mode function is filtered through a Mel filter bank to obtain the amplitude responses of the voice signal on different frequency components, after which logarithmic energy spectrum features are taken; this specifically comprises the following sub-steps:
step 4.1: discretizing the Hilbert marginal spectrum H_i(ω) of the ith intrinsic mode function component and then squaring to obtain the energy spectrum S_i(k);
step 4.2: filtering the energy spectrum S_i(k) of the ith intrinsic mode function through a Mel filter bank to obtain the energy responses E(m) on different frequency components;
step 5: DCT transform: performing a DCT on the logarithmic energy spectrum obtained in step 4 and removing the correlation between parameters, to obtain the Mel cepstrum coefficients H-MFCC based on the Hilbert marginal spectrum, in the standard DCT form of the log filter-bank energies:
H-MFCC(n) = Σ_{m=1}^{M} log(E(m)) cos(πn(m-0.5)/M), n = 1, 2, ..., M
where n, m and M are positive integers, and E(m) is the energy response on the mth frequency component.
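The following sketch (an editorial illustration assuming NumPy/SciPy) strings steps 4.1 through 5 together: squaring a discretized marginal spectrum, filtering with a triangular Mel filter bank, taking logs, and decorrelating with a DCT. The filter count, sample rate, coefficient count, and filter-bank construction are common choices, not values fixed by the patent:

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_filters, n_bins, fs):
    """Triangular filters spaced evenly on the Mel scale up to fs/2."""
    mel_max = 2595 * np.log10(1 + (fs / 2) / 700)
    mel_pts = np.linspace(0, mel_max, n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bin_pts = np.floor(hz_pts / (fs / 2) * (n_bins - 1)).astype(int)
    fb = np.zeros((n_filters, n_bins))
    for j in range(1, n_filters + 1):
        l, c, r = bin_pts[j - 1], bin_pts[j], bin_pts[j + 1]
        fb[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def h_mfcc(marginal, fs=16000, n_filters=24, n_coeffs=13):
    S = marginal ** 2                         # step 4.1: energy spectrum S_i(k)
    fb = mel_filterbank(n_filters, len(S), fs)
    E = fb @ S                                # step 4.2: energy responses E(m)
    logE = np.log(E + 1e-10)                  # logarithmic energy spectrum
    return dct(logE, type=2, norm='ortho')[:n_coeffs]  # step 5: H-MFCC

# Example with a synthetic marginal spectrum standing in for H_i(omega),
# e.g. the output of the marginal_spectrum() sketch above.
rng = np.random.default_rng(0)
print(h_mfcc(rng.random(256)).shape)  # (13,)
```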
Referring to fig. 6, in order to achieve the above object, the present invention further provides a Hilbert-Huang transform-based speech feature coefficient extraction apparatus, where the apparatus includes:
the decomposition module 200, configured to perform empirical mode decomposition on the voice signal to obtain a plurality of intrinsic mode functions;
the analysis module 300, configured to perform Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
the filtering module 400, configured to respectively filter the Hilbert marginal spectrums of the intrinsic mode functions to obtain the amplitude responses of the voice signal on different frequency components;
and the transformation module 500, configured to obtain a logarithmic energy spectrum from the amplitude responses and perform a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
Further, the Hilbert-Huang transform-based speech feature coefficient extraction apparatus further comprises:
the preprocessing module 100, configured to preprocess the input initial audio signal so that the voice signal is purer and more prominent and its features are easier to extract.
Further, the preprocessing module 100 is further configured to perform pre-emphasis processing on the input initial audio signal to obtain a speech signal, where the pre-emphasis formula is as follows:
y(n)=x(n)-μx(n-1)
where μ is the pre-emphasis coefficient, y(n) is the pre-emphasized speech signal, and x(n) and x(n-1) are samples of the initial audio signal.
Further, the decomposition module 200 is further configured to acquire all local maxima and all local minima of the initial audio signal, and to respectively perform spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions.
Further, the decomposition module 200 is further configured to generate an upper envelope from all the local maxima and a lower envelope from all the local minima; calculate the average of the upper envelope and the lower envelope; and generate a plurality of intrinsic mode functions based on the average and a preset rule.
Further, the decomposition module 200 is further configured to calculate a first mode function component based on the average and a first preset formula; judge whether the first mode function component meets a preset component condition; when the first mode function component meets the preset component condition, take the first mode function component as an intrinsic mode function; and when the first mode function component does not meet the preset component condition, return to the step of generating the upper envelope from all the local maxima and the lower envelope from all the local minima.
Further, the transformation module 500 is further configured to discretize the Hilbert marginal spectrum of the amplitude responses and then square it to obtain an energy spectrum; filter the energy spectrum through a Mel filter bank to obtain logarithmic energy spectrums on different frequency components; and perform a discrete cosine transform on the logarithmic energy spectrum and remove the correlation between parameters to obtain Mel cepstrum coefficients.
In order to achieve the above object, the present invention further provides a Hilbert-Huang transform-based speech feature coefficient extraction device, wherein the device includes: a memory, a processor, and a Hilbert-Huang transform-based speech feature coefficient extraction program stored on the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method described above.
In order to achieve the above object, the present invention further provides a storage medium on which a Hilbert-Huang transform-based speech feature coefficient extraction program is stored, where the program, when executed by a processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. The words first, second, third, etc. do not denote any order; they are to be interpreted merely as names.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A speech feature coefficient extraction method based on the Hilbert-Huang transform, characterized by comprising the following steps:
carrying out empirical mode decomposition on an input voice signal to obtain a plurality of intrinsic mode functions;
performing Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components;
and extracting a logarithmic energy spectrum from the amplitude responses, and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
2. The method according to claim 1, characterized by further comprising, before performing empirical mode decomposition on the input speech signal to obtain a plurality of intrinsic mode functions:
pre-emphasizing the input initial audio signal so that the voice signal is purer and more prominent and its features are easier to extract, where the pre-emphasis formula is:
y(n)=x(n)-μx(n-1)
where μ is the pre-emphasis coefficient, y(n) is the pre-emphasized speech signal, and x(n) and x(n-1) are samples of the initial audio signal.
3. The Hilbert-Huang transform-based speech feature coefficient extraction method according to claim 1, wherein the empirical mode decomposition of the speech signal to obtain a plurality of intrinsic mode functions specifically includes:
acquiring all local maxima and all local minima of the initial audio signal;
and respectively performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions.
4. The Hilbert-Huang transform-based speech feature coefficient extraction method according to claim 3, wherein the performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions specifically includes:
generating an upper envelope from all the local maxima, and generating a lower envelope from all the local minima;
calculating the average of the upper envelope and the lower envelope;
and generating a plurality of intrinsic mode functions based on the average and a preset rule.
5. The method according to claim 4, wherein the generating a plurality of intrinsic mode functions based on the average and a preset rule specifically includes:
calculating a first mode function component based on the average and a first preset formula;
judging whether the first mode function component meets a preset component condition;
when the first mode function component meets the preset component condition, taking the first mode function component as an intrinsic mode function; and when the first mode function component does not meet the preset component condition, returning to the step of generating the upper envelope from all the local maxima and the lower envelope from all the local minima.
6. The Hilbert-Huang transform-based speech feature coefficient extraction method according to claim 1, wherein the obtaining a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients specifically includes:
discretizing the Hilbert marginal spectrum of the amplitude responses and then squaring to obtain an energy spectrum;
filtering the energy spectrum through a Mel filter bank to obtain logarithmic energy spectrums on different frequency components;
and performing a discrete cosine transform on the logarithmic energy spectrum and removing the correlation between parameters to obtain Mel cepstrum coefficients.
7. A speech feature coefficient extraction apparatus based on the Hilbert-Huang transform, characterized in that the apparatus comprises:
a decomposition module, configured to perform empirical mode decomposition on the voice signal to obtain a plurality of intrinsic mode functions;
an analysis module, configured to perform Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
a filtering module, configured to respectively filter the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components;
and a transformation module, configured to extract a logarithmic energy spectrum from the amplitude responses and perform a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
8. A speech feature coefficient extraction device based on the Hilbert-Huang transform, characterized in that the device comprises: a memory, a processor, and a Hilbert-Huang transform-based speech feature coefficient extraction program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method according to any one of claims 1 to 6.
9. A storage medium, wherein a Hilbert-Huang transform-based speech feature coefficient extraction program is stored on the storage medium, and the program, when executed by a processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method according to any one of claims 1 to 6.
CN202010781082.3A 2020-08-06 2020-08-06 Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment Pending CN111899724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010781082.3A CN111899724A (en) 2020-08-06 2020-08-06 Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010781082.3A CN111899724A (en) 2020-08-06 2020-08-06 Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment

Publications (1)

Publication Number Publication Date
CN111899724A true CN111899724A (en) 2020-11-06

Family

ID=73246912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010781082.3A Pending CN111899724A (en) 2020-08-06 2020-08-06 Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment

Country Status (1)

Country Link
CN (1) CN111899724A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727905A (en) * 2009-11-27 2010-06-09 江南大学 Method for acquiring vocal print picture with refined time-frequency structure
CN102169690A (en) * 2011-04-08 2011-08-31 哈尔滨理工大学 Voice signal recognition system and method based on surface myoelectric signal
US20130163839A1 (en) * 2011-12-27 2013-06-27 Industrial Technology Research Institute Signal and image analysis method and ultrasound imaging system
CN105788608A (en) * 2016-03-03 2016-07-20 渤海大学 Chinese initial consonant and compound vowel visualization method based on neural network
CN106024010A (en) * 2016-05-19 2016-10-12 渤海大学 Speech signal dynamic characteristic extraction method based on formant curves
CN108447503A (en) * 2018-01-23 2018-08-24 浙江大学山东工业技术研究院 Motor abnormal sound detection method based on Hilbert-Huang transformation
CN109643554A (en) * 2018-11-28 2019-04-16 深圳市汇顶科技股份有限公司 Adaptive voice Enhancement Method and electronic equipment
CN109887510A (en) * 2019-03-25 2019-06-14 南京工业大学 Voiceprint recognition method and device based on empirical mode decomposition and MFCC
CN110772279A (en) * 2019-11-28 2020-02-11 北方民族大学 Lung sound signal acquisition device and analysis method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882901A (en) * 2022-04-26 2022-08-09 同济大学 Shrimp acoustic signal time-frequency feature extraction method based on frequency domain convolution and marginal spectrum feedback
CN115597901A (en) * 2022-12-13 2023-01-13 江苏中云筑智慧运维研究院有限公司(Cn) Method for monitoring damage of bridge expansion joint
CN115597901B (en) * 2022-12-13 2023-05-05 江苏中云筑智慧运维研究院有限公司 Bridge expansion joint damage monitoring method

Similar Documents

Publication Publication Date Title
CN112562691B (en) Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium
WO2018149077A1 (en) Voiceprint recognition method, device, storage medium, and background server
Hossan et al. A novel approach for MFCC feature extraction
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
US7133826B2 (en) Method and apparatus using spectral addition for speaker recognition
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
CN109147798B (en) Speech recognition method, device, electronic equipment and readable storage medium
CN109036437A (en) Accents recognition method, apparatus, computer installation and computer readable storage medium
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
CN109065043B (en) Command word recognition method and computer storage medium
CN108847253B (en) Vehicle model identification method, device, computer equipment and storage medium
Afrillia et al. Performance measurement of mel frequency ceptral coefficient (MFCC) method in learning system Of Al-Qur’an based in Nagham pattern recognition
CN111899724A (en) Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN112489692B (en) Voice endpoint detection method and device
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
Abd El-Moneim et al. Hybrid speech enhancement with empirical mode decomposition and spectral subtraction for efficient speaker identification
CN113421584A (en) Audio noise reduction method and device, computer equipment and storage medium
CN110875037A (en) Voice data processing method and device and electronic equipment
Mini et al. Feature vector selection of fusion of MFCC and SMRT coefficients for SVM classifier based speech recognition system
CN115359800A (en) Engine model detection method and device, electronic equipment and storage medium
CN112614483B (en) Modeling method, voice recognition method and electronic equipment based on residual convolution network
CN108962249B (en) Voice matching method based on MFCC voice characteristics and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201106