
CN111899724A - Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment - Google Patents

Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment

Info

Publication number
CN111899724A
CN111899724A
Authority
CN
China
Prior art keywords
hilbert
transform
inherent
spectrum
speech feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010781082.3A
Other languages
Chinese (zh)
Inventor
胡乔林
王亨佳
刘剑豪
赵国林
王良斯
石子言
都兴霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Early Warning Academy
Original Assignee
Air Force Early Warning Academy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Early Warning Academy filed Critical Air Force Early Warning Academy
Priority to CN202010781082.3A priority Critical patent/CN111899724A/en
Publication of CN111899724A publication Critical patent/CN111899724A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/147 Discrete orthonormal transforms, e.g. discrete cosine transform, discrete sine transform, and variations therefrom, e.g. modified discrete cosine transform, integer transforms approximating the discrete cosine transform
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using orthogonal transformation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Discrete Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a voice feature coefficient extraction method based on the Hilbert-Huang transform, which comprises: performing empirical mode decomposition on an input voice signal to obtain a plurality of intrinsic mode functions; performing Hilbert spectral analysis on each intrinsic mode function to obtain its Hilbert marginal spectrum; respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components; and extracting a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on it to obtain new Mel cepstrum coefficients. The technical scheme of the invention distinguishes speakers better and reflects a more realistic spectral feature distribution of the voice signal.

Description

Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment
Technical Field
The invention relates to the technical field of voice processing, in particular to a voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment.
Background
Feature coefficient extraction refers to the process of extracting, from a speaker's voice, feature information that can characterize the speaker. It is the first stage in speaker recognition and is of great importance.
At present, the most widely used feature parameter is the MFCC (Mel-Frequency Cepstral Coefficient), which is extracted based on the auditory perception characteristics of the human ear and describes the nonlinear characteristics of human auditory frequency fairly accurately. MFCCs, however, are extracted with the Fourier transform, which carries an a priori assumption when analyzing speech signals: that speech is short-term stationary. Based on this assumption, before analysis the speech signal is divided into many short segments by framing and windowing, and a short-time Fourier transform is performed within each segment to analyze the time-frequency characteristics of the speech signal. Strictly speaking, however, a speech signal is a nonlinear, non-stationary signal, and such frame truncation can cause leakage of the speech spectrum, so the spectral characteristics of speech cannot be truly reflected.
Disclosure of Invention
Against the defects of the prior art, the invention aims to provide a voice feature coefficient extraction method based on the Hilbert-Huang transform, so as to solve the problem that existing voice feature coefficient extraction methods suffer from leakage of the speech spectrum and therefore cannot truly reflect the spectral characteristics of speech.
The invention provides a voice feature coefficient extraction method based on the Hilbert-Huang transform, which comprises the following steps:
carrying out empirical mode decomposition on an input voice signal to obtain a plurality of intrinsic mode functions;
performing Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components;
and extracting a logarithmic energy spectrum from the amplitude responses, and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
Preferably, before the empirical mode decomposition is performed on the input voice signal to obtain a plurality of intrinsic mode functions, the method further includes:
pre-emphasizing the input initial audio signal, so that the voice signal is purer and more prominent and its features are easier to extract, where the pre-emphasis formula is:
y(n)=x(n)-μx(n-1)
where μ is the pre-emphasis coefficient, y(n) is the pre-emphasized speech signal, and x(n) and x(n-1) are samples of the initial audio signal.
Preferably, the empirical mode decomposition of the voice signal to obtain a plurality of intrinsic mode functions specifically includes:
acquiring all local maxima and all local minima of the initial audio signal;
and respectively performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions.
Preferably, the performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions specifically includes:
generating an upper envelope from all the local maxima, and generating a lower envelope from all the local minima;
calculating the average of the upper envelope and the lower envelope;
and generating a plurality of intrinsic mode functions based on the average and a preset rule.
Preferably, the generating a plurality of intrinsic mode functions based on the average and a preset rule specifically includes:
calculating a first mode function component based on the average and a first preset formula;
judging whether the first mode function component meets a preset component condition;
when the first mode function component meets the preset component condition, taking the first mode function component as an intrinsic mode function; and when the first mode function component does not meet the preset component condition, returning to the step of generating the upper envelope from all the local maxima and the lower envelope from all the local minima.
Preferably, the extracting a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients specifically includes:
discretizing the Hilbert marginal spectrum of the amplitude responses and then squaring to obtain an energy spectrum;
filtering the energy spectrum through a Mel filter bank to obtain logarithmic energy spectrums on different frequency components;
and performing a discrete cosine transform on the logarithmic energy spectrum and removing the correlation between parameters to obtain Mel cepstrum coefficients.
In order to achieve the above object, the present invention further provides a Hilbert-Huang transform-based speech feature coefficient extraction device, where the device includes:
the decomposition module is used for performing empirical mode decomposition on the voice signal to obtain a plurality of intrinsic mode functions;
the analysis module is used for performing Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
the filtering module is used for respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components;
and the transformation module is used for extracting a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
In order to achieve the above object, the present invention further provides a Hilbert-Huang transform-based speech feature coefficient extraction device, wherein the device includes: a memory, a processor, and a Hilbert-Huang transform-based speech feature coefficient extraction program stored on the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method described above.
In order to achieve the above object, the present invention further provides a storage medium on which a Hilbert-Huang transform-based speech feature coefficient extraction program is stored, where the program, when executed by a processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method described above.
According to the technical scheme provided by the invention, empirical mode decomposition is performed on an input voice signal to obtain a plurality of intrinsic mode functions; Hilbert spectral analysis is performed on each intrinsic mode function to obtain its Hilbert marginal spectrum; the Hilbert marginal spectrums of the intrinsic mode functions are respectively filtered to obtain amplitude responses of the voice signal on different frequency components; and a logarithmic energy spectrum is extracted from the amplitude responses and a discrete cosine transform is performed on it to obtain new Mel cepstrum coefficients. The invention exploits the superiority of the Hilbert-Huang transform in analyzing nonlinear, non-stationary signals and provides a speaker feature coefficient extraction algorithm oriented to auditory perception, through which new Mel cepstrum coefficients are obtained. The new Mel cepstrum coefficients reflect the auditory perception characteristics of the human ear, distinguish speakers well, and reflect a more realistic spectral feature distribution of the voice signal.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings according to the structures shown in these drawings without creative effort.
FIG. 1 is a flowchart of an embodiment of a Hilbert-Huang transform-based method for extracting speech feature coefficients according to the present invention;
FIG. 2 is a flowchart illustrating another embodiment of a Hilbert-Huang transform-based method for extracting speech feature coefficients according to the present invention;
FIG. 3 is a detailed flowchart of step S20 in FIG. 1;
FIG. 4 is a detailed flowchart of step S220 in FIG. 3;
FIG. 5 is a detailed flowchart of step S50 in FIG. 1;
FIG. 6 is a flowchart of another embodiment of the Hilbert-Huang transform-based speech feature coefficient extraction apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and back … …) are involved in the embodiment of the present invention, the directional indications are only used to explain the relative positional relationship between the components, the movement situation, and the like in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indications are changed accordingly.
In addition, if there is a description of "first", "second", etc. in an embodiment of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the meaning of "and/or" appearing throughout includes three juxtapositions, exemplified by "A and/or B" including either A or B or both A and B. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the speech feature coefficient extraction method based on the Hilbert-Huang transform according to the present invention.
In the first embodiment, the Hilbert-Huang transform-based speech feature coefficient extraction method includes the following steps:
step S20: and carrying out empirical mode decomposition on the input voice signal to obtain a plurality of inherent mode functions.
It should be noted that Empirical Mode Decomposition (EMD) is an adaptive time-frequency signal processing method that is particularly suitable for the analysis and processing of nonlinear, non-stationary signals.
As for the intrinsic mode function (IMF): from the physical meaning of instantaneous frequency it is known that instantaneous frequency is not meaningful for an arbitrary signal, but only when the signal contains a single oscillatory mode without complex superimposed waves. In fact, the necessary conditions for a meaningful instantaneous frequency are that the function be symmetric about its local zero mean and that its numbers of zero crossings and extrema be equal. The concept of the intrinsic mode function was proposed for this reason.
Step S30: performing Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function.
Step S40: respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain the amplitude responses of the voice signal on different frequency components. In this embodiment, a Mel filter bank is used for the filtering.
Step S50: obtaining a logarithmic energy spectrum from the amplitude responses, and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
The Discrete Cosine Transform (DCT) is a Fourier-related transform similar to the Discrete Fourier Transform (DFT), but using only real numbers. A DCT is equivalent to a DFT of roughly twice the length operating on a real even function (since the Fourier transform of a real even function is itself real and even), and in some variants the input or output positions are shifted by half a sample.
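This DFT relationship can be checked numerically. The following sketch (an editorial illustration assuming NumPy and SciPy, not part of the patent) verifies that the DCT-II of a length-N real signal equals the first N bins of the DFT of its even-symmetric extension after a half-sample phase correction:

```python
import numpy as np
from scipy.fft import dct

x = np.array([1.0, 2.0, 0.5, -1.0])
N = len(x)

# Unnormalized DCT-II as computed by scipy:
#   X[k] = 2 * sum_n x[n] * cos(pi*k*(2n+1)/(2N))
X_dct = dct(x, type=2, norm=None)

# Even-symmetric extension of length 2N; its DFT bins reproduce the
# DCT-II once the half-sample phase factor exp(-j*pi*k/(2N)) is applied.
x_ext = np.concatenate([x, x[::-1]])
X_dft = np.fft.fft(x_ext)
k = np.arange(N)
X_from_dft = (X_dft[:N] * np.exp(-1j * np.pi * k / (2 * N))).real

print(np.allclose(X_dct, X_from_dft))  # True
```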
According to the technical scheme provided by the invention, empirical mode decomposition is performed on an input voice signal to obtain a plurality of intrinsic mode functions; Hilbert spectral analysis is performed on each intrinsic mode function to obtain its Hilbert marginal spectrum; the Hilbert marginal spectrums of the intrinsic mode functions are respectively filtered to obtain amplitude responses of the voice signal on different frequency components; and a logarithmic energy spectrum is obtained from the amplitude responses and a discrete cosine transform is performed on it to obtain new Mel cepstrum coefficients. The invention exploits the superiority of the Hilbert-Huang transform in analyzing nonlinear, non-stationary signals and provides a speaker feature coefficient extraction algorithm oriented to auditory perception, through which new Mel cepstrum coefficients are obtained. The new Mel cepstrum coefficients reflect the auditory perception characteristics of the human ear, distinguish speakers well, and reflect a more realistic spectral feature distribution of the voice signal.
Referring to fig. 2, before the empirical mode decomposition is performed on the input voice signal to obtain a plurality of intrinsic mode functions, the method further includes:
Step S10: preprocessing the input initial audio signal to make the voice signal purer and more prominent, and its features easier to extract.
In this embodiment, the input initial audio signal is pre-emphasized so that the voice signal is purer and more prominent and its features are easier to extract, where the pre-emphasis formula is:
y(n)=x(n)-μx(n-1)
where μ is the pre-emphasis coefficient, y(n) is the pre-emphasized speech signal, and x(n) and x(n-1) are samples of the initial audio signal.
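A minimal sketch of this pre-emphasis step (an editorial illustration in Python with NumPy; μ = 0.97 is a typical choice, not a value fixed by the patent):

```python
import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """y(n) = x(n) - mu * x(n-1); the first sample is passed through unchanged."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - mu * x[:-1]
    return y

# Example: pre-emphasis boosts the high-frequency content of a short signal.
x = np.array([0.0, 0.5, 0.9, 0.4, -0.2])
print(pre_emphasis(x))
```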
Referring to fig. 3, further, the performing empirical mode decomposition on the voice signal to obtain a plurality of intrinsic mode functions specifically includes:
step S210: acquiring all local maxima and all local minima of the initial audio signal;
step S220: respectively performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions.
Referring to fig. 4, step S220 specifically includes:
step S221: generating an upper envelope from all the local maxima, and generating a lower envelope from all the local minima;
step S222: calculating the average of the upper envelope and the lower envelope;
step S223: generating a plurality of intrinsic mode functions based on the average and a preset rule.
In this embodiment, step S223 specifically includes:
calculating a first mode function component based on the average and a first preset formula;
judging whether the first mode function component meets a preset component condition;
when the first mode function component meets the preset component condition, taking the first mode function component as an intrinsic mode function; and when the first mode function component does not meet the preset component condition, returning to the step of generating the upper envelope from all the local maxima and the lower envelope from all the local minima.
Step S20 of this embodiment is now further described with reference to a specific example:
step 2.1: finding all local maxima and minima of the noisy speech signal x(t);
step 2.2: respectively performing cubic spline interpolation on all the local maxima and on all the local minima to obtain an upper envelope formed by the local maxima and a lower envelope formed by the local minima, denoted u(t) and l(t) respectively;
step 2.3: calculating the mean of the upper and lower envelopes as:
m(t)=(u(t)+l(t))/2
step 2.4: letting h(t)=x(t)-m(t), verifying whether h(t) satisfies the conditions of an IMF (Intrinsic Mode Function) component; if so, h(t) is the first mode function component, denoted imf_1(t); if not, taking h(t) as the signal to be decomposed and restarting from step 2.1 until the conditions of a mode function component are met;
step 2.5: taking the first residue r_1(t)=x(t)-imf_1(t) as the signal to be decomposed and repeating steps 2.1 to 2.4 to obtain the second IMF component imf_2(t), at which point r_2(t)=r_1(t)-imf_2(t);
step 2.6: repeating step 2.5 until the residue r_n(t) can no longer be decomposed, thereby obtaining the several mode functions imf_1(t), imf_2(t), ..., imf_n(t) of the noisy speech signal x(t).
Referring to fig. 5, the obtaining a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients specifically includes:
step S510: discretizing the Hilbert marginal spectrum of the amplitude responses and then squaring to obtain an energy spectrum;
step S520: filtering the energy spectrum through a Mel filter bank to obtain logarithmic energy spectrums on different frequency components;
step S530: performing a discrete cosine transform on the logarithmic energy spectrum and removing the correlation between parameters to obtain Mel cepstrum coefficients.
Further, step S30 is specifically implemented by the following sub-steps:
step 3.1: taking the intrinsic mode functions obtained from the initial audio signal respectively as the signals to be decomposed, and repeating steps 2.1 to 2.6 to obtain the ith intrinsic mode function component imf_i(t);
step 3.2: computing the Hilbert spectrum H_i(ω, t) of the ith intrinsic mode function component and its Hilbert marginal spectrum H_i(ω), where the marginal spectrum is obtained by integrating the Hilbert spectrum over time.
The Hilbert marginal spectrum of each intrinsic mode function is filtered through a Mel filter bank to obtain the amplitude responses of the voice signal on different frequency components, after which logarithmic energy spectrum features are taken; this specifically comprises the following sub-steps:
step 4.1: discretizing the Hilbert marginal spectrum H_i(ω) of the ith intrinsic mode function component and then squaring to obtain the energy spectrum S_i(k);
step 4.2: filtering the energy spectrum S_i(k) of the ith intrinsic mode function through a Mel filter bank to obtain the energy responses E(m) on different frequency components;
step 5: DCT transform: performing a DCT on the logarithmic energy spectrum obtained in step 4 and removing the correlation between parameters, to obtain the Mel cepstrum coefficients H-MFCC based on the Hilbert marginal spectrum, in the standard DCT form of the log filter-bank energies:
H-MFCC(n) = Σ_{m=1}^{M} log(E(m)) cos(πn(m-0.5)/M), n = 1, 2, ..., M
where n, m and M are positive integers, and E(m) is the energy response on the mth frequency component.
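The following sketch (an editorial illustration assuming NumPy/SciPy) strings steps 4.1 through 5 together: squaring a discretized marginal spectrum, filtering with a triangular Mel filter bank, taking logs, and decorrelating with a DCT. The filter count, sample rate, coefficient count, and filter-bank construction are common choices, not values fixed by the patent:

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_filters, n_bins, fs):
    """Triangular filters spaced evenly on the Mel scale up to fs/2."""
    mel_max = 2595 * np.log10(1 + (fs / 2) / 700)
    mel_pts = np.linspace(0, mel_max, n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bin_pts = np.floor(hz_pts / (fs / 2) * (n_bins - 1)).astype(int)
    fb = np.zeros((n_filters, n_bins))
    for j in range(1, n_filters + 1):
        l, c, r = bin_pts[j - 1], bin_pts[j], bin_pts[j + 1]
        fb[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def h_mfcc(marginal, fs=16000, n_filters=24, n_coeffs=13):
    S = marginal ** 2                         # step 4.1: energy spectrum S_i(k)
    fb = mel_filterbank(n_filters, len(S), fs)
    E = fb @ S                                # step 4.2: energy responses E(m)
    logE = np.log(E + 1e-10)                  # logarithmic energy spectrum
    return dct(logE, type=2, norm='ortho')[:n_coeffs]  # step 5: H-MFCC

# Example with a synthetic marginal spectrum standing in for H_i(omega),
# e.g. the output of the marginal_spectrum() sketch above.
rng = np.random.default_rng(0)
print(h_mfcc(rng.random(256)).shape)  # (13,)
```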
Referring to fig. 6, in order to achieve the above object, the present invention further provides a Hilbert-Huang transform-based speech feature coefficient extraction apparatus, where the apparatus includes:
the decomposition module 200, configured to perform empirical mode decomposition on the voice signal to obtain a plurality of intrinsic mode functions;
the analysis module 300, configured to perform Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
the filtering module 400, configured to respectively filter the Hilbert marginal spectrums of the intrinsic mode functions to obtain the amplitude responses of the voice signal on different frequency components;
and the transformation module 500, configured to obtain a logarithmic energy spectrum from the amplitude responses and perform a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
Further, the Hilbert-Huang transform-based speech feature coefficient extraction apparatus further comprises:
the preprocessing module 100, configured to preprocess the input initial audio signal so that the voice signal is purer and more prominent and its features are easier to extract.
Further, the preprocessing module 100 is further configured to perform pre-emphasis processing on the input initial audio signal to obtain a speech signal, where the pre-emphasis formula is as follows:
y(n)=x(n)-μx(n-1)
where μ is the pre-emphasis coefficient, y(n) is the pre-emphasized speech signal, and x(n) and x(n-1) are samples of the initial audio signal.
Further, the decomposition module 200 is further configured to acquire all local maxima and all local minima of the initial audio signal, and to respectively perform spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions.
Further, the decomposition module 200 is further configured to generate an upper envelope from all the local maxima and a lower envelope from all the local minima; calculate the average of the upper envelope and the lower envelope; and generate a plurality of intrinsic mode functions based on the average and a preset rule.
Further, the decomposition module 200 is further configured to calculate a first mode function component based on the average and a first preset formula; judge whether the first mode function component meets a preset component condition; when the first mode function component meets the preset component condition, take the first mode function component as an intrinsic mode function; and when the first mode function component does not meet the preset component condition, return to the step of generating the upper envelope from all the local maxima and the lower envelope from all the local minima.
Further, the transformation module 500 is further configured to discretize the Hilbert marginal spectrum of the amplitude responses and then square it to obtain an energy spectrum; filter the energy spectrum through a Mel filter bank to obtain logarithmic energy spectrums on different frequency components; and perform a discrete cosine transform on the logarithmic energy spectrum and remove the correlation between parameters to obtain Mel cepstrum coefficients.
In order to achieve the above object, the present invention further provides a Hilbert-Huang transform-based speech feature coefficient extraction device, wherein the device includes: a memory, a processor, and a Hilbert-Huang transform-based speech feature coefficient extraction program stored on the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method described above.
In order to achieve the above object, the present invention further provides a storage medium on which a Hilbert-Huang transform-based speech feature coefficient extraction program is stored, where the program, when executed by a processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. The words first, second, third, etc. do not denote any order; they are to be interpreted merely as names.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A speech feature coefficient extraction method based on the Hilbert-Huang transform, characterized by comprising the following steps:
carrying out empirical mode decomposition on an input voice signal to obtain a plurality of intrinsic mode functions;
performing Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
respectively filtering the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components;
and extracting a logarithmic energy spectrum from the amplitude responses, and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
2. The method according to claim 1, characterized by further comprising, before performing empirical mode decomposition on the input speech signal to obtain a plurality of intrinsic mode functions:
pre-emphasizing the input initial audio signal so that the voice signal is purer and more prominent and its features are easier to extract, where the pre-emphasis formula is:
y(n)=x(n)-μx(n-1)
where μ is the pre-emphasis coefficient, y(n) is the pre-emphasized speech signal, and x(n) and x(n-1) are samples of the initial audio signal.
3. The Hilbert-Huang transform-based speech feature coefficient extraction method according to claim 1, wherein the empirical mode decomposition of the speech signal to obtain a plurality of intrinsic mode functions specifically includes:
acquiring all local maxima and all local minima of the initial audio signal;
and respectively performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions.
4. The Hilbert-Huang transform-based speech feature coefficient extraction method according to claim 3, wherein the performing spline interpolation a preset number of times on all the local maxima and all the local minima to generate a plurality of intrinsic mode functions specifically includes:
generating an upper envelope from all the local maxima, and generating a lower envelope from all the local minima;
calculating the average of the upper envelope and the lower envelope;
and generating a plurality of intrinsic mode functions based on the average and a preset rule.
5. The method according to claim 4, wherein the generating a plurality of intrinsic mode functions based on the average and a preset rule specifically includes:
calculating a first mode function component based on the average and a first preset formula;
judging whether the first mode function component meets a preset component condition;
when the first mode function component meets the preset component condition, taking the first mode function component as an intrinsic mode function; and when the first mode function component does not meet the preset component condition, returning to the step of generating the upper envelope from all the local maxima and the lower envelope from all the local minima.
6. The Hilbert-Huang transform-based speech feature coefficient extraction method according to claim 1, wherein the obtaining a logarithmic energy spectrum from the amplitude responses and performing a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients specifically includes:
discretizing the Hilbert marginal spectrum of the amplitude responses and then squaring to obtain an energy spectrum;
filtering the energy spectrum through a Mel filter bank to obtain logarithmic energy spectrums on different frequency components;
and performing a discrete cosine transform on the logarithmic energy spectrum and removing the correlation between parameters to obtain Mel cepstrum coefficients.
7. A speech feature coefficient extraction apparatus based on the Hilbert-Huang transform, characterized in that the apparatus comprises:
a decomposition module, configured to perform empirical mode decomposition on the voice signal to obtain a plurality of intrinsic mode functions;
an analysis module, configured to perform Hilbert spectral analysis on each intrinsic mode function to obtain the Hilbert marginal spectrum of each intrinsic mode function;
a filtering module, configured to respectively filter the Hilbert marginal spectrums of the intrinsic mode functions to obtain amplitude responses of the voice signal on different frequency components;
and a transformation module, configured to extract a logarithmic energy spectrum from the amplitude responses and perform a discrete cosine transform on the obtained logarithmic energy spectrum to obtain Mel cepstrum coefficients.
8. A speech feature coefficient extraction device based on the Hilbert-Huang transform, characterized in that the device comprises: a memory, a processor, and a Hilbert-Huang transform-based speech feature coefficient extraction program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method according to any one of claims 1 to 6.
9. A storage medium, wherein a Hilbert-Huang transform-based speech feature coefficient extraction program is stored on the storage medium, and the program, when executed by a processor, implements the steps of the Hilbert-Huang transform-based speech feature coefficient extraction method according to any one of claims 1 to 6.
CN202010781082.3A 2020-08-06 2020-08-06 Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment Pending CN111899724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010781082.3A CN111899724A (en) 2020-08-06 2020-08-06 Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010781082.3A CN111899724A (en) 2020-08-06 2020-08-06 Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment

Publications (1)

Publication Number Publication Date
CN111899724A true CN111899724A (en) 2020-11-06

Family

ID=73246912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010781082.3A Pending CN111899724A (en) 2020-08-06 2020-08-06 Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment

Country Status (1)

Country Link
CN (1) CN111899724A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727905A (en) * 2009-11-27 2010-06-09 江南大学 Method for acquiring vocal print picture with refined time-frequency structure
CN102169690A (en) * 2011-04-08 2011-08-31 哈尔滨理工大学 Voice signal recognition system and method based on surface myoelectric signal
US20130163839A1 (en) * 2011-12-27 2013-06-27 Industrial Technology Research Institute Signal and image analysis method and ultrasound imaging system
CN105788608A (en) * 2016-03-03 2016-07-20 渤海大学 Chinese initial consonant and compound vowel visualization method based on neural network
CN106024010A (en) * 2016-05-19 2016-10-12 渤海大学 Speech signal dynamic characteristic extraction method based on formant curves
CN108447503A (en) * 2018-01-23 2018-08-24 浙江大学山东工业技术研究院 Motor abnormal sound detection method based on Hilbert-Huang transformation
CN109643554A (en) * 2018-11-28 2019-04-16 深圳市汇顶科技股份有限公司 Adaptive voice Enhancement Method and electronic equipment
CN109887510A (en) * 2019-03-25 2019-06-14 南京工业大学 Voiceprint recognition method and device based on empirical mode decomposition and MFCC
CN110772279A (en) * 2019-11-28 2020-02-11 北方民族大学 Lung sound signal acquisition device and analysis method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882901A (en) * 2022-04-26 2022-08-09 同济大学 Shrimp acoustic signal time-frequency feature extraction method based on frequency domain convolution and marginal spectrum feedback
CN115597901A (en) * 2022-12-13 2023-01-13 江苏中云筑智慧运维研究院有限公司(Cn) Method for monitoring damage of bridge expansion joint
CN115597901B (en) * 2022-12-13 2023-05-05 江苏中云筑智慧运维研究院有限公司 Bridge expansion joint damage monitoring method

Similar Documents

Publication Publication Date Title
CN112562691B (en) Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium
WO2018149077A1 (en) Voiceprint recognition method, device, storage medium, and background server
Hossan et al. A novel approach for MFCC feature extraction
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
US7133826B2 (en) Method and apparatus using spectral addition for speaker recognition
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
CN109147798B (en) Speech recognition method, device, electronic equipment and readable storage medium
CN109036437A (en) Accents recognition method, apparatus, computer installation and computer readable storage medium
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
CN109065043B (en) Command word recognition method and computer storage medium
CN108847253B (en) Vehicle model identification method, device, computer equipment and storage medium
Afrillia et al. Performance measurement of mel frequency ceptral coefficient (MFCC) method in learning system Of Al-Qur’an based in Nagham pattern recognition
CN111899724A (en) Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN112489692B (en) Voice endpoint detection method and device
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
Abd El-Moneim et al. Hybrid speech enhancement with empirical mode decomposition and spectral subtraction for efficient speaker identification
CN113421584A (en) Audio noise reduction method and device, computer equipment and storage medium
CN110875037A (en) Voice data processing method and device and electronic equipment
Mini et al. Feature vector selection of fusion of MFCC and SMRT coefficients for SVM classifier based speech recognition system
CN115359800A (en) Engine model detection method and device, electronic equipment and storage medium
CN112614483B (en) Modeling method, voice recognition method and electronic equipment based on residual convolution network
CN108962249B (en) Voice matching method based on MFCC voice characteristics and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201106