
CN112687277B - Method and device for determining voice formant, electronic equipment and readable storage medium - Google Patents

Method and device for determining voice formant, electronic equipment and readable storage medium

Info

Publication number
CN112687277B
CN112687277B (application CN202110273503.6A)
Authority
CN
China
Prior art keywords
voice
calculated
segment
emphasis
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110273503.6A
Other languages
Chinese (zh)
Other versions
CN112687277A (en)
Inventor
曹岩岗
王秋明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yuanjian Information Technology Co Ltd filed Critical Beijing Yuanjian Information Technology Co Ltd
Priority to CN202110273503.6A
Publication of CN112687277A
Application granted
Publication of CN112687277B

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides a method and device for determining speech formants, an electronic device, and a readable storage medium. A speech segment to be calculated in the speech to be identified is pre-emphasized according to a determined pre-emphasis mode; the pre-emphasized speech segment to be calculated is windowed to obtain a windowed speech segment to be calculated; a target power spectrum sequence of the windowed segment is determined based on its solved linear prediction coefficient sequence and gain factor; and the attribute parameters of the formants of the speech segment to be calculated are obtained based on a constructed polynomial linear equation and the target power spectrum sequence. In this way, each speech segment to be calculated can be pre-emphasized according to the pre-emphasis mode of the speech to be identified, and the formant attribute parameters are computed with a polynomial linear equation, which greatly reduces the calculation time of the formant attribute parameters and improves the accuracy of the calculation result.

Description

Method and device for determining voice formant, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method and an apparatus for determining a speech formant, an electronic device, and a readable storage medium.
Background
In voiceprint identification, when the acquired speech is long, it is generally desirable to quickly browse the approximate trend of the formants in order to lock onto a region of interest, then enlarge and examine that region, thereby speeding up identification. Formants are regions of the sound spectrum where energy is relatively concentrated; they are not only a determining factor of sound quality but also reflect physical characteristics of the vocal tract (resonance cavity).
At present, three formant calculation methods are common. The band-pass filter bank method filters the speech with a group of band-pass filters of different center frequencies and smoothly connects the peak values of the spectra obtained by the filters to produce the spectral envelope of the speech; a non-uniform center-frequency spacing can give a frequency resolution close to that of the human ear. The cepstrum method treats the speech spectrum as the convolution of the fundamental frequency and the formants and separates them by homomorphic deconvolution via the cepstrum; its results are more accurate, but the calculation amount is larger. The linear prediction root-solving method models the vocal tract with an all-pole model, which deconvolves the speech signal by folding the pulse excitation into the prediction residual; the all-pole model of the vocal tract is obtained by calculating the linear prediction coefficients, a polynomial is constructed from those coefficients, and its complex roots are computed with the Newton-Raphson method to obtain accurate formant information. The drawback of this method is the high complexity of computing the complex roots.
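For comparison with the method proposed below, the prior-art root-solving approach can be sketched in a few lines. The function name and test polynomial are illustrative only, and `np.roots` uses a companion-matrix eigenvalue method rather than Newton-Raphson, but the complex-root computation it performs is the step whose cost the text criticizes:

```python
import numpy as np

def formants_by_root_solving(a, fs):
    # a: prediction-error polynomial coefficients [1, a1, ..., ap] in powers of z^-1;
    # each complex root above the real axis maps to one candidate formant.
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]           # keep one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)  # pole angle -> frequency in Hz
    return np.sort(freqs)

# Illustrative check: a conjugate pole pair at angle pi/4, i.e. 1000 Hz at fs = 8000 Hz
r, theta = 0.95, np.pi / 4
a = [1.0, -2.0 * r * np.cos(theta), r * r]
freqs = formants_by_root_solving(a, fs=8000)
```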
Disclosure of Invention
In view of this, an object of the present application is to provide a method, an apparatus, an electronic device, and a readable storage medium for determining speech formants. The emphasis mode for a speech segment to be calculated is determined from the window length and window shift of the emphasis window used during pre-emphasis; the segment is pre-emphasized according to that mode, and polynomial linear equations of different orders are used to calculate the speech formants. This reduces the low-frequency components of the speech segment to be calculated, increases its high-frequency components, and greatly reduces the calculation time of the speech formants, so that the calculation result is balanced between precision and speed, which is beneficial to improving the accuracy of the result.
The embodiment of the application provides a method for determining a voice formant, which comprises the following steps:
determining a pre-emphasis mode of the speech to be identified based on the window shift and window length of an emphasis window used for pre-emphasizing the speech to be identified;
for each section of the voice segment to be calculated in the voice to be identified, pre-emphasis processing is carried out on the voice segment to be calculated according to the pre-emphasis mode, and the pre-emphasized voice segment to be calculated is obtained;
windowing the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated;
based on a preset linear prediction order, solving a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated;
determining a target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor;
and calculating to obtain the attribute parameters of the formants of the voice segments to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence.
Further, a window shift of the emphasis window is determined by:
acquiring the number of pixels on the horizontal axis of a formant display window of the voice to be identified;
and calculating the window shift of the emphasis window based on the number of the horizontal axis pixels and the number of the voice sampling points in the voice to be identified.
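The two steps above amount to one analysis frame per pixel column of the formant display. A minimal sketch, with a hypothetical function name and assuming integer division with a floor of one sample:

```python
def emphasis_window_shift(num_samples, num_axis_pixels):
    # One frame per horizontal pixel of the formant display window:
    # hop = total speech samples / pixels available to draw them.
    return max(1, num_samples // num_axis_pixels)

# e.g. one second of 16 kHz speech rendered across an 800-pixel axis
shift = emphasis_window_shift(16000, 800)
```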
Further, the determining the pre-emphasis mode of the speech to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the speech to be identified includes:
when the window shift is smaller than the window length, determining that the emphasis mode for performing pre-emphasis processing on the speech to be identified is traversal emphasis;
and when the window shift is larger than or equal to the window length, determining that the emphasis mode for performing pre-emphasis processing on the speech to be identified is interval emphasis.
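The decision rule above can be stated directly in code; a sketch with hypothetical names:

```python
def pre_emphasis_mode(window_shift, window_length):
    # shift < length: windows overlap, so every sample is eventually covered
    # ("traversal emphasis"); shift >= length: windows are disjoint, so only
    # the samples inside each window are emphasized ("interval emphasis").
    return "traversal" if window_shift < window_length else "interval"
```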
Further, when the pre-emphasis mode includes interval emphasis, pre-emphasizing each to-be-calculated speech segment in the speech to be identified according to the pre-emphasis mode to obtain a pre-emphasized to-be-calculated speech segment, including:
determining whether the voice segment to be calculated is a non-silent voice segment;
and if so, pre-emphasizing the voice segment to be calculated according to the pre-emphasizing mode to obtain the pre-emphasized voice segment to be calculated.
Further, when the pre-emphasis mode includes traversal emphasis, pre-emphasis processing is performed on each to-be-calculated speech segment in the speech to be identified according to the pre-emphasis mode, so as to obtain a pre-emphasized to-be-calculated speech segment, including:
pre-emphasis processing is carried out on the voice segment to be calculated according to the emphasis mode, and the processed voice segment to be calculated is obtained;
determining whether the processed voice segment to be calculated is a non-silent voice segment;
and if so, determining the processed voice segment to be calculated as the pre-emphasized voice segment to be calculated.
Further, according to the emphasis mode, pre-emphasizing the to-be-calculated speech segment to obtain an emphasized to-be-calculated speech segment, including:
moving the emphasis window over the voice segment to be calculated according to the emphasis mode and the window shift, and determining voice sample points to be emphasized from the voice segment to be calculated;
and emphasizing the voice sample points to be emphasized through the pre-emphasis coefficient of the voice to be identified to obtain the pre-emphasized voice fragment to be calculated.
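The patent does not give the pre-emphasis coefficient or the filter form; the sketch below assumes the standard first-order pre-emphasis y[n] = x[n] − α·x[n−1] with the common α = 0.97:

```python
import numpy as np

def pre_emphasize(samples, alpha=0.97):
    # First-order high-pass: attenuates low frequencies, boosts high ones.
    x = np.asarray(samples, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                    # first sample has no predecessor
    y[1:] = x[1:] - alpha * x[:-1]
    return y

emphasized = pre_emphasize([1.0, 1.0, 1.0, 1.0])
```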
Further, the windowing the pre-emphasized to-be-calculated voice segment to obtain a windowed to-be-calculated voice segment includes:
determining a frame sequence number corresponding to the pre-emphasized voice fragment to be calculated;
determining the sample point sequence number of each voice sample point in the pre-emphasized voice segment to be calculated based on the frame sequence number and the window length of the emphasis window;
aiming at each voice sampling point, determining a window function corresponding to the voice sampling point based on the sampling point serial number of the voice sampling point;
and windowing each corresponding voice sample point in the pre-emphasized voice segment to be calculated based on the window function corresponding to each voice sample point to obtain the windowed voice segment to be calculated.
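A sketch of the windowing step, assuming (as the detailed description later states) a Hamming window the same length as the frame; the function name is illustrative:

```python
import numpy as np

def window_frame(frame):
    # Multiply each sample by its Hamming window value
    # w[k] = 0.54 - 0.46*cos(2*pi*k/(N-1)), k = 0..N-1.
    return np.asarray(frame, dtype=float) * np.hamming(len(frame))

windowed = window_frame(np.ones(5))
```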
Further, the determining the target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor includes:
converting the windowed voice segment to be calculated in the time domain to a frequency domain based on the linear prediction coefficient sequence;
calculating the initial power spectrum sequence of the windowed speech segment to be calculated in the frequency domain;
converting the gain factor to a logarithmic domain;
and correcting the initial power spectrum sequence based on the gain factor in the logarithmic domain to obtain a target power spectrum sequence.
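These four sub-steps can be sketched with NumPy: evaluate the prediction-error filter A(z) on the unit circle via an FFT, convert to a power spectrum, and apply the gain in the logarithmic (dB) domain. The names and FFT size are illustrative:

```python
import numpy as np

def target_power_spectrum_db(a, gain, n_fft=512):
    # a: prediction-error filter coefficients [1, a1, ..., ap]
    A = np.fft.rfft(a, n_fft)                            # time domain -> frequency domain
    initial = -10.0 * np.log10(np.abs(A) ** 2 + 1e-12)   # initial power spectrum (dB)
    return initial + 20.0 * np.log10(gain + 1e-12)       # gain correction in log domain

# A flat (order-0) filter with unit gain should give a ~0 dB spectrum
spectrum = target_power_spectrum_db(np.array([1.0]), 1.0)
```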
Further, it is determined whether the speech segment to be calculated is a non-silent speech segment by the following steps:
calculating the sample point energy of each voice sample point in the pre-emphasized voice segment to be calculated;
determining the voice energy of the pre-emphasized voice segment to be calculated based on the sample point energy of each voice sample point;
when the voice energy is greater than or equal to a preset energy threshold value, determining that the pre-emphasized voice segment to be calculated is a non-mute voice segment;
and when the voice energy is smaller than a preset energy threshold value, determining that the pre-emphasized voice segment to be calculated is a mute voice segment.
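A sketch of the energy gate, with a hypothetical threshold value (the patent leaves the preset energy threshold unspecified):

```python
import numpy as np

def is_non_silent(frame, energy_threshold=1e-4):
    # Sum the squared sample values; at or above the threshold -> non-silent.
    energy = float(np.sum(np.asarray(frame, dtype=float) ** 2))
    return energy >= energy_threshold
```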
Further, the calculating to obtain the attribute parameters of the formants of the speech segment to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence includes:
determining local maximum sampling points and a preset number of adjacent sampling points of which the difference value with the power spectrum of the local maximum sampling points is within a preset difference value range from the voice segment to be calculated according to the target power spectrum sequence, wherein the preset number is the polynomial order of the polynomial linear equation;
determining a polynomial coefficient of the constructed polynomial linear equation based on the amplitude and frequency of the local maximum sample point and each adjacent sample point;
and determining attribute parameters of the formants of the voice segments to be calculated based on the polynomial coefficients and the constructed polynomial linear equation, wherein the attribute parameters comprise formant intensity, formant frequency and formant-3 dB bandwidth.
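One way to read this claim is a low-order polynomial fitted to the power spectrum around each local maximum. The sketch below assumes a second-order (parabolic) fit through the peak bin and its two neighbours, from which the formant frequency, intensity, and −3 dB bandwidth follow in closed form:

```python
import numpy as np

def formant_from_peak(freqs, power_db, i):
    # Fit p(f) = c2*f^2 + c1*f + c0 through bins i-1, i, i+1 (i = local maximum).
    c2, c1, c0 = np.polyfit(freqs[i - 1:i + 2], power_db[i - 1:i + 2], 2)
    peak_freq = -c1 / (2.0 * c2)                          # vertex -> formant frequency
    peak_db = c0 + c1 * peak_freq + c2 * peak_freq ** 2   # formant intensity
    bandwidth = 2.0 * np.sqrt(3.0 / abs(c2))              # width where p drops by 3 dB
    return peak_freq, peak_db, bandwidth

# Exact on a synthetic parabola peaking at 500 Hz with 10 dB intensity
f = np.array([490.0, 500.0, 510.0])
p = 10.0 - 0.01 * (f - 500.0) ** 2
freq, intensity, bw = formant_from_peak(f, p, 1)
```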
Further, the speech segment to be calculated is segmented by the following steps:
acquiring a voice to be identified;
and dividing the voice to be identified into a plurality of voice segments to be calculated according to the time sequence, allocating a frame sequence number to each voice segment to be calculated according to the time sequence, and allocating a sample sequence number to each voice sample in the voice to be identified.
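The segmentation and numbering scheme can be sketched as follows; the frame length and names are illustrative:

```python
def segment_speech(samples, frame_length):
    # Split into time-ordered frames; frames and samples are numbered from 0.
    frames = []
    for frame_no, start in enumerate(range(0, len(samples), frame_length)):
        chunk = samples[start:start + frame_length]
        sample_nos = list(range(start, start + len(chunk)))
        frames.append((frame_no, sample_nos, chunk))
    return frames

frames = segment_speech(list(range(10)), 4)
```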
The embodiment of the present application further provides a device for determining a speech formant, where the device for determining a speech formant includes:
the mode determining module is used for determining the pre-emphasis mode of the voice to be identified based on the window shift and the window length of an emphasis window for pre-emphasizing the voice to be identified;
the emphasis module is used for carrying out pre-emphasis processing on each section of the voice segment to be calculated in the voice to be identified according to the pre-emphasis mode to obtain the pre-emphasized voice segment to be calculated;
the windowing module is used for windowing the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated;
the first calculation module is used for solving a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated based on a preset linear prediction order;
a sequence determining module, configured to determine a target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor;
and the second calculation module is used for calculating and obtaining the attribute parameters of the formants of the voice segments to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence.
Further, the manner determination module is configured to determine a window shift of the emphasis window by:
acquiring the number of pixels on the horizontal axis of a formant display window of the voice to be identified;
and calculating the window shift of the emphasis window based on the number of the horizontal axis pixels and the number of the voice sampling points in the voice to be identified.
Further, when the mode determining module is configured to determine the pre-emphasis mode of the speech to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the speech to be identified, the mode determining module is configured to:
when the window shift is smaller than the window length, determine that the emphasis mode for performing pre-emphasis processing on the voice segment to be calculated is traversal emphasis;
and when the window shift is greater than or equal to the window length, determine that the emphasis mode for performing pre-emphasis processing on the voice segment to be calculated is interval emphasis.
Further, when the pre-emphasis mode includes interval emphasis, the emphasis module is configured to, for each to-be-calculated speech segment in the speech to be identified, pre-emphasize the to-be-calculated speech segment according to the pre-emphasis mode to obtain a pre-emphasized to-be-calculated speech segment, and the emphasis module is configured to:
determining whether the voice segment to be calculated is a non-silent voice segment;
and if so, pre-emphasizing the voice segment to be calculated according to the pre-emphasizing mode to obtain the pre-emphasized voice segment to be calculated.
Further, when the pre-emphasis mode includes traversal emphasis, the emphasis module is configured to, for each to-be-calculated speech segment in the speech to be identified, perform pre-emphasis processing on the to-be-calculated speech segment according to the pre-emphasis mode, and obtain a pre-emphasized to-be-calculated speech segment, the emphasis module is configured to:
pre-emphasis processing is carried out on the voice segment to be calculated according to the emphasis mode, and the processed voice segment to be calculated is obtained;
determining whether the processed voice segment to be calculated is a non-silent voice segment;
and if so, determining the processed voice segment to be calculated as the pre-emphasized voice segment to be calculated.
Further, when the emphasis module is configured to perform pre-emphasis processing on the to-be-calculated speech segment according to the emphasis mode to obtain an emphasized to-be-calculated speech segment, the emphasis module is configured to:
moving the emphasis window over the voice segment to be calculated according to the emphasis mode and the window shift, and determining voice sample points to be emphasized from the voice segment to be calculated;
and emphasizing the voice sample points to be emphasized through the pre-emphasis coefficient of the voice to be identified to obtain the pre-emphasized voice fragment to be calculated.
Further, the windowing module is configured to, when the pre-emphasized to-be-calculated voice segment is a non-silent voice segment, perform windowing on the pre-emphasized to-be-calculated voice segment, and when the windowed to-be-calculated voice segment is obtained, the windowing module is configured to:
determining a frame sequence number corresponding to the pre-emphasized voice fragment to be calculated;
determining the sample point sequence number of each voice sample point in the pre-emphasized voice segment to be calculated based on the frame sequence number and the window length of the emphasis window;
aiming at each voice sampling point, determining a window function corresponding to the voice sampling point based on the sampling point serial number of the voice sampling point;
and windowing each corresponding voice sample point in the pre-emphasized voice segment to be calculated based on the window function corresponding to each voice sample point to obtain the windowed voice segment to be calculated.
Further, when the sequence determining module is configured to determine the target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor, the sequence determining module is configured to:
converting the windowed voice segment to be calculated in the time domain to a frequency domain based on the linear prediction coefficient sequence;
calculating the initial power spectrum sequence of the windowed speech segment to be calculated in the frequency domain;
converting the gain factor to a logarithmic domain;
and correcting the initial power spectrum sequence based on the gain factor in the logarithmic domain to obtain a target power spectrum sequence.
Further, the windowing module is configured to determine whether the speech segment to be calculated is a non-silent speech segment by:
calculating the sample point energy of each voice sample point in the voice segment to be calculated;
determining the voice energy of the voice segment to be calculated based on the sample point energy of each voice sample point;
when the voice energy is greater than or equal to a preset energy threshold value, determining that the voice segment to be calculated is a non-mute voice segment;
and when the voice energy is smaller than a preset energy threshold value, determining that the voice segment to be calculated is a mute voice segment.
Further, when the second calculation module is configured to calculate the attribute parameter of the formant of the speech segment to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence, the second calculation module is configured to:
determining local maximum sampling points and a preset number of adjacent sampling points of which the difference value with the power spectrum of the local maximum sampling points is within a preset difference value range from the voice segment to be calculated according to the target power spectrum sequence, wherein the preset number is the polynomial order of the polynomial linear equation;
determining a polynomial coefficient of the constructed polynomial linear equation based on the amplitude and frequency of the local maximum sample point and each adjacent sample point;
and determining attribute parameters of the formants of the voice segments to be calculated based on the polynomial coefficients and the constructed polynomial linear equation, wherein the attribute parameters comprise formant intensity, formant frequency and formant-3 dB bandwidth.
Further, the determining device further comprises a segmenting module, and the segmenting module is used for segmenting the speech segment to be calculated by the following steps:
acquiring a voice to be identified;
and dividing the voice to be identified into a plurality of voice segments to be calculated according to the time sequence, allocating a frame sequence number to each voice segment to be calculated according to the time sequence, and allocating a sample sequence number to each voice sample in the voice to be identified.
An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine readable instructions when executed by the processor performing the steps of the method of determining speech formants as described above.
Embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the method for determining a speech formant as described above.
The method, the device, the electronic device and the readable storage medium for determining the voice formants provided by the embodiment of the application determine the pre-emphasis mode of the voice to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the voice to be identified; for each section of the voice segment to be calculated in the voice to be identified, pre-emphasis processing is carried out on the voice segment to be calculated according to the pre-emphasis mode to obtain the pre-emphasized voice segment to be calculated, so that the low-frequency component of the frequency spectrum in the voice segment to be calculated is reduced, and the high-frequency component is added; windowing the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated; based on a preset linear prediction order, solving a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated; determining a target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor; and calculating to obtain the attribute parameters of the formants of the voice segments to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence. 
Therefore, the emphasis mode of a speech segment to be calculated can be determined from the window length and window shift of the emphasis window used during pre-emphasis, the segment can be pre-emphasized according to that mode, and polynomial linear equations of different orders can be used to calculate the formant attribute parameters. This reduces the low-frequency components of the segment, increases its high-frequency components, greatly reduces the calculation time of the formant attribute parameters, balances the calculation result between precision and speed, and improves the accuracy of the result.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a flowchart of a method for determining a speech formant according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of another method for determining a speech formant according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of formant attribute parameter calculation;
fig. 4 is a schematic structural diagram of a device for determining a speech formant according to an embodiment of the present disclosure;
fig. 5 is a second schematic structural diagram of a device for determining a speech formant according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
Research shows that the band-pass filter bank method currently in common use filters the speech with a group of band-pass filters of different center frequencies and smoothly connects the peak values of the resulting spectra to obtain the spectral envelope; non-uniform center-frequency spacing can give a frequency resolution close to that of the human ear. However, high resolution requires a high filter order, so the method performs poorly, its calculation delay is high, and it is difficult to estimate the time-varying characteristics of speech.
Based on this, the embodiment of the application provides a method for determining a speech formant, which can greatly reduce the calculation time of the attribute parameters of the speech formant and is beneficial to improving the accuracy of the calculation result.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for determining a speech formant according to an embodiment of the present disclosure. As shown in fig. 1, a method for determining a speech formant provided by an embodiment of the present application includes:
s101, determining a pre-emphasis mode of the voice to be evaluated based on window shift and window length of an emphasis window for pre-emphasizing the voice to be evaluated;
in the step, the voice to be identified is obtained, and the window shift of the emphasis window and the window length of the emphasis window when the voice to be identified is subjected to pre-emphasis processing are obtained, and the emphasis mode for the pre-emphasis processing of the voice to be identified is determined according to the obtained window shift and window length.
And S102, aiming at each section of voice segment to be calculated in the voice to be identified, pre-emphasis processing is carried out on the voice segment to be calculated according to the pre-emphasis mode, and the pre-emphasized voice segment to be calculated is obtained.
In the step, aiming at each section of voice segment to be calculated in the voice to be identified, pre-emphasis processing is carried out on the section of voice segment to be calculated in a sliding emphasis window mode through a preset emphasis window according to the determined emphasis mode, so that the pre-emphasized voice segment to be calculated is obtained, further, low-frequency components in the section of voice segment to be calculated are reduced, high-frequency components are increased, and the situations of low-frequency energy concentration and insufficient high-frequency energy are avoided.
S103, windowing the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated.
In the step, windowing processing is carried out on the pre-emphasized voice segment to be calculated so as to reduce frequency spectrum leakage and obtain the windowed voice segment to be calculated.
Here, a Hamming window is selected to window the emphasized voice segment to be calculated. The length of the Hamming window is the same as the length of the formant display window, and both lengths may be set according to the length of the voice segment to be calculated obtained after segmentation, which is not limited herein.
And S104, based on a preset linear prediction order, solving a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated.
In the step, according to the length of the voice segment to be calculated, a preset linear prediction order suitable for the voice segment to be calculated is obtained, a linear prediction coefficient corresponding to each voice sample point in the windowed voice segment to be calculated is obtained, and then a linear prediction coefficient sequence and a gain factor of the windowed voice segment to be calculated are obtained.
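The patent does not name a solver for this step; a standard choice, assumed in the sketch below, is the autocorrelation method with the Levinson-Durbin recursion, which yields the coefficient sequence and the gain factor in one pass. Note the returned array stores the prediction-error filter A(z) = 1 − Σ aₖ z⁻ᵏ, i.e. [1, −a₁, …, −a_p]:

```python
import numpy as np

def lpc_and_gain(frame, order):
    """Levinson-Durbin LPC: returns prediction-error filter coeffs and gain G."""
    x = np.asarray(frame, dtype=float)
    # autocorrelation sequence r[0..order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                   # prediction error energy E_p
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                           # reflection coefficient
        prev = a.copy()
        a[1:i + 1] = prev[1:i + 1] + k * prev[i - 1::-1]
        err *= (1.0 - k * k)
    return a, np.sqrt(err)

# Check on the impulse response of a known all-pole filter 1/(1 - 0.9 z^-1 + 0.5 z^-2)
h = np.zeros(200)
h[0] = 1.0
h[1] = 0.9 * h[0]
for n in range(2, 200):
    h[n] = 0.9 * h[n - 1] - 0.5 * h[n - 2]
coeffs, gain = lpc_and_gain(h, 2)
```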
Here, linear prediction refers to modeling the vocal tract with an all-pole model and estimating the spectral envelope of the speech. The idea is to approximate the current speech sample with a linear combination of the past p speech samples, which can also be regarded as passing the speech through a p-th order Finite Impulse Response (FIR) prediction filter. The specific formula is as follows:

$$\hat{x}(n) = \sum_{k=1}^{p} a_k\, x(n-k)$$

wherein n is the sample sequence number of the speech sample x(n) in the windowed speech segment to be calculated, $\hat{x}(n)$ is the predicted sample, p is the linear prediction order, and $a_k$ is the linear prediction coefficient sequence.

Here, the difference e(n) between x(n) and $\hat{x}(n)$ can be represented by the following formula:

$$e(n) = x(n) - \hat{x}(n) = x(n) - \sum_{k=1}^{p} a_k\, x(n-k)$$
wherein e(n) is referred to as the Residual or Prediction Error. Taking the z-transform of the FIR prediction filter, its system function P(z) can be expressed as:

$$P(z) = \sum_{k=1}^{p} a_k\, z^{-k}$$

Then the z-transform E(z) of e(n) can be expressed as:

$$E(z) = X(z) - P(z)\,X(z) = X(z)\bigl(1 - P(z)\bigr)$$

Here, the prediction error filter A(z) is defined as:

$$A(z) = 1 - P(z) = 1 - \sum_{k=1}^{p} a_k\, z^{-k}$$

Then E(z) can be expressed as:

$$E(z) = A(z)\,X(z)$$
here, the system function of an All-pole Filter (All-pole Filter)H(z) Can be expressed as:
Figure 967306DEST_PATH_IMAGE009
when calculatingE(z) AndH(z) Of reconstructed speech signalszTransformation ofY(z) Comprises the following steps:
Figure 319790DEST_PATH_IMAGE010
wherein,E(z) AndH(z) The characteristics of the sound source and the sound channel are described separately,Gis a linear prediction gain factor, and the speech characteristics are obtained through multiplication.
To calculate each element $a_k$ of the linear prediction coefficient sequence, the prediction error energy $E_p$ is defined as:

$$E_p = E\{e^2(n)\}$$

wherein E{·} denotes the expectation. To obtain the minimum value of $E_p$, the partial derivative of $E_p$ with respect to each $a_i$ is set to 0, i.e.:

$$\frac{\partial E_p}{\partial a_i} = 0$$

wherein i = 1, …, p. It can be deduced that:

$$\sum_{k=1}^{p} a_k\, E\{x(n-k)\,x(n-i)\} = E\{x(n)\,x(n-i)\}$$

According to the definition of the autocorrelation sequence, the above equation can be written as:

$$\sum_{k=1}^{p} a_k\, r_{xx}(i-k) = r_{xx}(i), \quad i = 1, \ldots, p$$

The linear prediction coefficient sequence $a_k$ (k = 1, …, p) in this equation can be solved by Levinson-Durbin Recursion. The autocorrelation sequence $r_{xx}$ in the above formula is defined as:

$$r_{xx}(m) = \sum_{n=0}^{N-1-m} u(n)\,u(n+m)$$

wherein N is the length of the sequence u(n).

The gain factor G can be calculated by the following formula:

$$G = \sqrt{r_{xx}(0) - \sum_{k=1}^{p} a_k\, r_{xx}(k)}$$
thus, the linear prediction coefficient sequence and the gain factor of the windowed speech segment to be calculated are obtained.
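The derivation above can be sketched in Python. The following is an illustrative implementation of the autocorrelation method solved by Levinson-Durbin recursion, not the patent's exact code; the gain convention $G^2 = r_{xx}(0) - \sum_k a_k r_{xx}(k)$ is an assumption, since the patent's own gain formula is not fully recoverable from the text.

```python
import numpy as np

def lpc_levinson_durbin(x, p):
    """Solve the LPC normal equations by Levinson-Durbin recursion.

    Returns (a, G): prediction coefficients a_1..a_p for the predictor
    x_hat(n) = sum_k a_k x(n-k), and a gain G with the assumed
    convention G^2 = r(0) - sum_k a_k r(k).
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    # autocorrelation r_xx(m) = sum_n u(n) u(n+m)
    r = np.array([np.dot(x[:N - m], x[m:]) for m in range(p + 1)])
    a = np.zeros(p + 1)          # a[0] unused; a[k] holds a_k
    E = r[0]                     # prediction error energy
    for i in range(1, p + 1):
        # reflection coefficient k_i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]
        E *= (1.0 - k * k)
    G = np.sqrt(max(E, 0.0))     # assumed gain convention
    return a[1:], G
```

For a first-order decaying signal x(n) = 0.9^n, the recursion recovers a_1 ≈ 0.9, as the normal equations predict.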
And S105, determining the target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor.
In the step, a target power spectrum sequence of the windowed speech segment to be calculated is calculated and obtained based on the linear prediction coefficient sequence and the gain factor obtained by calculation, wherein a sample power spectrum of each speech sample in the windowed speech segment to be calculated is integrated in the target power spectrum sequence.
And S106, calculating to obtain the attribute parameters of the formants of the voice segments to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence.
In this step, the constructed polynomial linear equation corresponding to the voice segment to be calculated is obtained, and the attribute parameters of the formants in the voice segment to be calculated are calculated based on the calculated target power spectrum sequence and the polynomial linear equation, wherein the attribute parameters include the formant intensity, the formant frequency and the formant -3 dB bandwidth.
The method for determining the voice formant provided by the embodiment of the application determines the pre-emphasis mode of the voice to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the voice to be identified; for each voice segment to be calculated in the voice to be identified, performs pre-emphasis processing on the voice segment according to the pre-emphasis mode to obtain the pre-emphasized voice segment to be calculated, so that the low-frequency components of the spectrum in the voice segment are reduced and the high-frequency components are increased; performs windowing on the pre-emphasized voice segment to obtain the windowed voice segment to be calculated; solves the linear prediction coefficient sequence and the gain factor of the windowed voice segment based on a preset linear prediction order; determines the target power spectrum sequence of the windowed voice segment based on the linear prediction coefficient sequence and the gain factor; and calculates the attribute parameters of the formants of the voice segment based on the constructed polynomial linear equation and the target power spectrum sequence. In this way, the emphasis mode of the voice segment to be calculated can be determined according to the window length and the window shift of the emphasis window during pre-emphasis processing, and the calculation of the voice formant attribute parameters is realized by using polynomial linear equations of different orders; the low-frequency components in the voice segment to be calculated are reduced and the high-frequency components are increased, the calculation time of the voice formant attribute parameters is greatly reduced, a balance between precision and speed is achieved, and the accuracy of the calculation result is improved.
Referring to fig. 2, fig. 2 is a flowchart illustrating another method for determining a speech formant according to an embodiment of the present disclosure. As shown in fig. 2, a method for determining a speech formant provided by an embodiment of the present application includes:
s201, pre-emphasis coefficients of the voice segments to be calculated and the number of horizontal axis pixels of the formant display window are obtained.
In the step, a pre-emphasis coefficient of the preset voice segment to be calculated and the number of horizontal axis pixels of a formant display window for displaying formants are obtained.
S202, calculating to obtain the window shift of the emphasis window based on the number of the horizontal axis pixels and the number of the voice sampling points in the voice to be identified.
The window shift determines the distance the window moves each time and is typically fixed, e.g., 0.25 or 0.5 times the window length of the formant display window. For larger speech files, the number of frames calculated using a fixed window shift and window length tends to be large, which in turn causes the time consumption of the formant algorithm to increase linearly with the number of frames.
In addition, in voiceprint identification, the formants calculated from a large voice file cannot be displayed completely at once. Assuming that the number of pixels on the horizontal axis of the screen is 1920 and the formant window fills the whole screen, at most 1920 frames of formants can be displayed at one time; if more than 1920 frames of formants are calculated at once, only 1920 frames can be extracted for display, which wastes computing power.
In view of this, in order to make the window shift more suitable for the speech segment to be calculated, the window shift of the emphasis window is calculated based on the number of horizontal-axis pixels of the formant display window and the number of speech samples in the speech segment to be calculated.
Specifically, the window shift of the to-be-calculated speech segment is calculated by the following formula:
$$h = \frac{n - l}{u - 1}$$

wherein h is the window shift, n is the number of speech samples in the speech segment to be calculated, l is the window length of the emphasis window, and u is the number of pixels on the horizontal axis of the formant display window.
Here, during the movement of the emphasis window: when h >= 1, the emphasis window is shifted by at least 1 speech sample at a time, so the samples in 2 adjacent emphasis windows do not completely coincide and no formant is calculated repeatedly; in this case the window shift is set to h and the number of formant frames is set to u. When h < 0, the number of speech samples in the speech segment to be calculated is less than one frame; the segment is zero-padded to one frame, the window shift is set to 0, and the number of formant calculation frames is set to 1. When 0 <= h < 1, the speech samples in the emphasis window after moving may completely coincide with those before moving, which would affect the observation; in this case the window shift is set to 1 and the number of formant calculation frames is set to n - l + 1.
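The three window-shift cases above can be sketched as a small helper. The shift formula h = (n − l) / (u − 1) is an assumption inferred from the frame-count relation in the third case (n − l + 1 frames at a shift of one sample), not a formula stated explicitly in the text.

```python
def emphasis_window_shift(n, l, u):
    """Return (window shift, formant frame count) for a segment of n samples,
    emphasis-window length l, and u horizontal-axis display pixels.

    The shift formula h = (n - l) / (u - 1) is an assumption consistent
    with the three cases described in the text.
    """
    h = (n - l) / (u - 1)
    if h >= 1:
        return int(h), u        # shift >= 1 sample: u frames fill the display
    if h < 0:
        return 0, 1             # segment shorter than one frame: zero-pad, 1 frame
    return 1, n - l + 1         # 0 <= h < 1: shift by one sample instead
```

For example, 100 samples with a window of 10 and 10 display pixels give a shift of 10 and 10 frames, while a 5-sample segment falls into the zero-padding case.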
S203, determining the pre-emphasis mode of the voice to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the voice to be identified.
And S204, aiming at each section of the voice segment to be calculated in the voice to be identified, pre-emphasis processing is carried out on the voice segment to be calculated according to the pre-emphasis mode, and the pre-emphasized voice segment to be calculated is obtained.
S205, windowing the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated.
And S206, based on a preset linear prediction order, solving a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated.
S207, determining a target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor.
And S208, calculating to obtain the attribute parameters of the formants of the voice segments to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence.
The descriptions of S203 to S208 may refer to the descriptions of S101 to S106, and the same technical effects can be achieved, which are not described in detail.
Further, the determining the pre-emphasis mode of the speech to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the speech to be identified includes: when the window shift is smaller than the window length, determining that the emphasis mode for performing pre-emphasis processing on the voice segment to be calculated is traversal emphasis; and when the window shift is greater than or equal to the window length, determining that the emphasis mode for performing pre-emphasis processing on the voice segment to be calculated is interval emphasis.
In this step, when the window shift h of the emphasis window is smaller than the window length l, i.e. h < l, all speech samples in the speech segment to be calculated are traversed in the process of calculating the formants, that is, the whole speech segment to be calculated is emphasized, and the emphasis mode for performing pre-emphasis processing on the speech segment is determined as traversal emphasis; when h >= l, not all speech samples in the speech segment to be calculated are traversed in the process of calculating the formants, that is, the speech segment to be calculated is selectively emphasized, and the emphasis mode for performing pre-emphasis processing on the speech segment is determined as interval emphasis.
Further, when the pre-emphasis mode includes interval emphasis, step S204 includes: determining whether the voice segment to be calculated is a non-silent voice segment; and if so, pre-emphasizing the voice segment to be calculated according to the pre-emphasizing mode to obtain the pre-emphasized voice segment to be calculated.
In the step, when the determined pre-emphasis mode comprises interval emphasis, firstly, determining whether a voice segment to be calculated is a non-silent voice segment; and when the voice segment to be calculated is the non-mute voice segment, pre-emphasizing the voice segment to be calculated in a sliding emphasis window mode according to the determined pre-emphasizing mode to obtain the pre-emphasized voice segment to be calculated.
Further, when the pre-emphasis manner includes a traversal emphasis, step S204 includes: pre-emphasis processing is carried out on the voice segment to be calculated according to the emphasis mode, and the processed voice segment to be calculated is obtained; determining whether the processed voice segment to be calculated is a non-silent voice segment; and if so, determining the processed voice segment to be calculated as the pre-emphasized voice segment to be calculated.
In the step, when the determined pre-emphasis mode comprises traversal emphasis, pre-emphasis processing is carried out on the section of the voice segment to be calculated in a sliding emphasis window mode according to the determined pre-emphasis mode, and the processed voice segment to be calculated is obtained; and judging whether the processed voice segment to be calculated is a non-mute voice segment or not, and if so, determining the processed voice segment to be calculated as a pre-emphasized voice segment to be calculated.
Further, step S204 includes: moving the emphasis window on the voice segment to be calculated according to the emphasis mode and the window movement, and determining a voice sample point to be emphasized from the calculated voice segment; and emphasizing the voice sample points to be emphasized through the pre-emphasis coefficient of the voice to be identified to obtain the emphasized voice fragments to be calculated.
In the step, according to the window shift of the weighted window obtained by calculation, the weighted window is moved on the voice segment to be calculated in the determined weighted mode, and the voice sample point of the voice segment to be calculated, which is positioned in the weighted window after each movement, is determined as the voice sample point to be weighted in the moving process.
Pre-emphasizing all determined voice sample points to be emphasized through a pre-acquired pre-emphasis coefficient of the voice to be identified to obtain a pre-emphasized voice segment to be calculated, wherein the pre-emphasis coefficient is preset according to the voice sample points to be emphasized.
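The emphasis step above reduces to a first-order high-pass filter applied to the selected samples. A minimal sketch follows; the default coefficient 0.97 is a typical placeholder, not the patent's preset value.

```python
import numpy as np

def pre_emphasize(samples, emp=0.97):
    """First-order pre-emphasis: y[i] = x[i] - emp * x[i-1].

    Attenuates low-frequency components and boosts high-frequency ones.
    The coefficient `emp` (commonly 0.95-0.97) is a hypothetical default.
    """
    x = np.asarray(samples, dtype=float)
    y = x.copy()
    y[1:] = x[1:] - emp * x[:-1]   # first sample passes through unchanged
    return y
```

A constant (pure low-frequency) input is almost entirely suppressed after the first sample, which is exactly the intended effect.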
Further, step S205 includes: determining a frame sequence number corresponding to the pre-emphasized voice fragment to be calculated; determining the sample point sequence number of each voice sample point in the pre-emphasized voice segment to be calculated based on the frame sequence number and the window length of the emphasis window; aiming at each voice sampling point, determining a window function corresponding to the voice sampling point based on the sampling point serial number of the voice sampling point; and windowing each corresponding voice sample point in the pre-emphasized voice segment to be calculated based on the window function corresponding to each voice sample point to obtain the windowed voice segment to be calculated.
In the step, when the voice to be identified is segmented, a unique frame serial number is distributed to each segmented voice fragment to be calculated, and a unique sampling point serial number is distributed to the sampling point serial number of each voice sampling point in each segment of voice fragment to be calculated; when the pre-emphasized voice segment to be calculated needs to be windowed, determining a frame sequence number corresponding to the pre-emphasized voice segment to be calculated; determining the sample point sequence number of each voice sample point in the pre-emphasized voice segment to be calculated based on the frame sequence number of the pre-emphasized voice segment to be calculated and the window length of an emphasis window; aiming at each voice sampling point, determining a window function corresponding to the voice sampling point based on the sampling point serial number of the voice sampling point; and windowing the voice sampling points by using the window function corresponding to each voice sampling point to obtain the windowed voice fragments to be calculated.
Here, each speech sample and its corresponding window function value share the same sequence number; for example, when the speech sample has sequence number 1, the window function value with sequence number 1 is used. The Hamming window function is given by the following formula:

$$w(v) = 0.54 - 0.46\cos\left(\frac{2\pi v}{l - 1}\right), \quad v = 0, 1, \ldots, l-1$$

wherein l is the window length of the emphasis window and v is the sequence number of the window function.
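The windowing step can be sketched as follows, using the standard Hamming coefficients 0.54/0.46 (the text names a Hamming window but does not print its coefficients, so these are the textbook values):

```python
import numpy as np

def hamming_window(l):
    """Hamming window w(v) = 0.54 - 0.46*cos(2*pi*v/(l-1)), v = 0..l-1."""
    v = np.arange(l)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * v / (l - 1))

def apply_window(frame):
    """Multiply a pre-emphasized frame pointwise by its Hamming window,
    pairing each sample with the window value of the same index."""
    frame = np.asarray(frame, dtype=float)
    return frame * hamming_window(len(frame))
```

The window is symmetric, tapering from 0.08 at the edges to 1.0 at the centre, which is what reduces spectral leakage.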
Further, step S207 includes: converting the windowed voice segment to be calculated in the time domain to a frequency domain based on the linear prediction coefficient sequence; calculating the initial power spectrum sequence of the windowed speech segment to be calculated in the frequency domain; converting the gain factor to a logarithmic domain; and correcting the initial power spectrum sequence based on the gain factor in the logarithmic domain to obtain a target power spectrum sequence.
In this step, the linear prediction coefficient sequence obtained by calculation is zero-padded to the number of Fourier transform points, Fast Fourier Transform (FFT) is performed, and the windowed speech segment to be calculated is thereby converted from a time-domain signal to a frequency-domain signal, with the following formula:

$$X(k) = \sum_{t=0}^{N-1} x(t)\, e^{-j 2\pi k t / N}, \quad k = 0, 1, \ldots, N-1$$

wherein X(k) is the complex spectrum on the k-th frequency scale, t is the sequence number of the speech sample, N is the number of points in the Fourier transform, and j is the imaginary unit.
And after converting the windowed voice segment to be calculated to a frequency domain, calculating the power spectrum of each voice sample point in the windowed voice segment to be calculated in the frequency domain, and further obtaining an initial power spectrum sequence of the voice segment to be calculated.
After the FFT, each speech sample corresponds to a frequency scale in the frequency domain, and each frequency scale corresponds to a complex number X(k). Since the complex sequence obtained after the FFT is conjugate symmetric, its power spectrum is symmetric about the (N/2 + 1)-th frequency scale, so only the power spectra on the first N/2 + 1 frequency scales are taken as the initial power spectrum sequence of the speech segment to be calculated. The calculation formula of each power spectrum in the initial power spectrum sequence is:

$$E(k) = 10\lg\left(X_R^2(k) + X_I^2(k) + eps\right)$$

wherein E(k) is the power spectrum on the k-th frequency scale, $X_R(k)$ is the real part of X(k), $X_I(k)$ is the imaginary part of X(k), and eps is a minimal value that prevents taking the logarithm of zero.
The calculated gain factor is converted into the logarithmic domain by the following formula:

$$\hat{G} = 10\lg\left(\frac{G^2}{l}\right)$$

wherein $\hat{G}$ is the gain factor in the logarithmic domain, G is the gain factor, and l is the window length of the emphasis window.
Then, the initial power spectrum sequence is corrected based on the gain factor in the logarithmic domain to obtain the target power spectrum sequence of the windowed speech segment to be calculated, the specific correction formula being:

$$\hat{E}_k = E_k + \hat{G}$$

wherein $\hat{E}_k$ is the target power spectrum sequence, $E_k$ is the initial power spectrum sequence, and $\hat{G}$ is the gain factor in the logarithmic domain.
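The spectrum computation described above can be sketched as follows. This is an illustrative version that FFTs the zero-padded prediction-error filter A(z) and adds a log-domain gain; the normalisation G²/l for the gain term is an assumption, since the patent's exact formula was lost in extraction.

```python
import numpy as np

def lpc_target_power_spectrum(a, G, l, nfft=512):
    """Target power spectrum (dB) of one frame from its LPC model.

    Zero-pads the prediction-error filter A(z) = 1 - sum_k a_k z^-k to
    nfft points, FFTs it, keeps the first nfft//2 + 1 bins, converts
    |G/A|^2-style power to dB, and adds the log-domain gain.  The
    G^2 / l normalisation is an assumption.
    """
    eps = 1e-12
    A = np.zeros(nfft)
    A[0] = 1.0
    A[1:len(a) + 1] = -np.asarray(a, dtype=float)   # zero-padded coefficients
    spec = np.fft.fft(A)[:nfft // 2 + 1]            # conjugate symmetry: keep half
    initial = -10.0 * np.log10(spec.real ** 2 + spec.imag ** 2 + eps)
    gain_db = 10.0 * np.log10(G * G / l + eps)      # gain in the log domain
    return initial + gain_db
```

With a single coefficient a₁ = 0.9 the envelope peaks at DC and falls toward Nyquist, the expected shape for a low-frequency pole.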
Further, it is determined whether the speech segment to be calculated is a non-silent speech segment by the following steps: calculating the sample point energy of each voice sample point in the voice segment to be calculated; determining the voice energy of the voice segment to be calculated based on the sample point energy of each voice sample point; when the voice energy is greater than or equal to a preset energy threshold value, determining that the voice segment to be calculated is a non-mute voice segment; and when the voice energy is smaller than a preset energy threshold value, determining that the pre-emphasized voice segment to be calculated is a mute voice segment.
In the step, the sample point energy of each voice sample point in the voice segment to be calculated is calculated, and the voice energy of the voice segment to be calculated is obtained through summation based on the sample point energy of each voice sample point.
Determining whether the voice energy of the voice segment to be calculated is greater than or equal to a preset energy threshold value or not, and determining that the voice segment to be calculated is a non-silent voice segment when the voice energy is greater than or equal to the preset energy threshold value; and when the voice energy is less than the preset energy threshold value, determining the voice segment to be calculated as a mute voice segment.
Specifically, the speech energy is calculated by the following formula:

$$E = \sum_{i=1}^{l} x^2(i)$$

wherein E is the speech energy, x(i) is the amplitude of the i-th speech sample, and l is the number of speech samples in the speech segment to be calculated.
The amplitude of the i-th speech sample is calculated by the following formula:

$$x(i) = \hat{x}(i) - emp \cdot \hat{x}(i-1)$$

wherein x(i) is the amplitude of the i-th pre-emphasized speech sample, $\hat{x}(i)$ and $\hat{x}(i-1)$ are the amplitudes of the i-th and (i-1)-th original speech samples, and emp is the pre-emphasis coefficient.
Similarly, when determining whether the processed to-be-calculated voice segment is a non-mute voice segment, calculating the sample point energy of each voice sample point in the processed to-be-calculated voice segment, and summing to obtain the voice energy of the processed to-be-calculated voice segment based on the sample point energy of each voice sample point.
Determining whether the voice energy of the processed voice segment to be calculated is greater than or equal to a preset energy threshold value or not, and determining that the processed voice segment to be calculated is a non-mute voice segment when the voice energy is greater than or equal to the preset energy threshold value; and when the voice energy is less than the preset energy threshold value, determining that the processed voice segment to be calculated is a mute voice segment.
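The silence check above is a plain energy threshold. A minimal sketch, assuming a placeholder threshold value (the patent only says the threshold is preset):

```python
def is_non_silent(samples, threshold=1e-4):
    """Decide whether a frame is non-silent by comparing its energy
    E = sum_i x(i)^2 against a preset threshold.

    The threshold value here is a hypothetical placeholder.
    """
    energy = sum(x * x for x in samples)
    return energy >= threshold
```

Frames failing the check are assigned zero formants rather than being analysed, which is how the flow in fig. 3 skips silent frames.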
Further, step S208 includes: determining local maximum sampling points and a preset number of adjacent sampling points of which the difference value with the power spectrum of the local maximum sampling points is within a preset difference value range from the voice segment to be calculated according to the target power spectrum sequence, wherein the preset number is the polynomial order of the polynomial linear equation; determining a polynomial coefficient of the constructed polynomial linear equation based on the amplitude and frequency of the local maximum sample point and each adjacent sample point; and determining attribute parameters of the formants of the voice segments to be calculated based on the polynomial coefficients and the constructed polynomial linear equation, wherein the attribute parameters comprise formant intensity, formant frequency and formant-3 dB bandwidth.
In this step, according to the calculated target power spectrum sequence, a local maximum sample point and a preset number of adjacent sample points, whose power spectrum differs from that of the local maximum sample point within a preset difference range, are determined from the speech segment to be calculated; that is, a preset number of adjacent sample points near the local maximum sample point are selected, where the preset number is consistent with the polynomial order of the polynomial linear equation.
The amplitudes and frequencies of the local maximum sample point and of each adjacent sample point, i.e. $(f_0, e_0), (f_1, e_1), \ldots, (f_q, e_q)$, are obtained and substituted into the q-th order polynomial to form a system of q+1 linear equations, which is solved to obtain the polynomial coefficients $c_0, \ldots, c_q$ of the constructed polynomial linear equation:

$$e_i = c_0 + c_1 f_i + c_2 f_i^2 + \cdots + c_q f_i^q, \quad i = 0, 1, \ldots, q$$
Furthermore, after the polynomial coefficients are solved from the above equation system, the maximum value $V_{max}$ of the polynomial in the frequency interval covered by the q+1 interpolation points is taken as the formant intensity, and the frequency scale $f_{max}$ corresponding to that maximum is taken as the formant frequency. The formant -3 dB bandwidth is obtained as the absolute value of the difference between the left and right roots, centered on $f_{max}$, of the following equation:

$$c_0 + c_1 f + c_2 f^2 + \cdots + c_q f^q = V_{max} - 3$$
here, polynomial orderqIt should not be too large, otherwise it will increase the computational complexity.
Further, the speech segment to be calculated is segmented by the following steps: acquiring a voice to be identified; and dividing the voice to be authenticated into a plurality of voice fragments to be calculated according to the time sequence, allocating a frame sequence number to each voice fragment to be calculated according to the time sequence, and allocating a sample sequence number to each voice sample in the voice to be authenticated.
In this step, a to-be-identified voice to be identified is acquired, the to-be-identified voice is divided into a plurality of sections of to-be-calculated voice fragments according to a time sequence, and a frame number is allocated to each section of to-be-calculated voice fragment according to the time sequence, for example, the frame number of a first section of to-be-calculated voice fragment is "1", the frame number of a second section of to-be-calculated voice fragment is "2", and the time of the first section of to-be-calculated voice fragment is before the time of the second section of to-be-calculated voice fragment.
Meanwhile, a sample point sequence number is allocated to each voice sample point in the voice to be authenticated, for example, a first voice segment to be calculated and a second voice segment to be calculated respectively include three voice sample points, then according to the time sequence, the sample point sequence number of a first voice sample point in the first voice segment to be calculated is "1", the sample point sequence number of a second voice sample point in the first voice segment to be calculated is "2", and the sample point sequence number of a third voice sample point in the first voice segment to be calculated is "3"; according to the time sequence, the sample sequence number of the first voice sample in the second voice segment to be calculated is "4", the sample sequence number of the second voice sample in the second voice segment to be calculated is "5", the sample sequence number of the third voice sample in the second voice segment to be calculated is "6", and so on.
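The numbering scheme in the example above can be sketched directly; the 1-based frame and sample numbers follow the worked example in the text, while the frame length is an arbitrary parameter.

```python
def segment_speech(samples, frame_len):
    """Split speech into frames in time order; each frame gets a 1-based
    frame number and each sample a global 1-based sample number, matching
    the numbering example in the text.
    """
    frames = []
    for start in range(0, len(samples), frame_len):
        frame_no = start // frame_len + 1
        frame = [(start + j + 1, s)                  # (sample number, amplitude)
                 for j, s in enumerate(samples[start:start + frame_len])]
        frames.append((frame_no, frame))
    return frames
```

With six samples and three samples per frame, the second frame is numbered 2 and its first sample is numbered 4, as in the example.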
Exemplarily, as shown in fig. 3, fig. 3 is a schematic flow chart of formant attribute parameter calculation. Step 301: determining the window shift of the emphasis window according to the number of voice sampling points in the voice to be identified, the window length of the emphasis window and the number of pixels on the horizontal axis of the formant display window; step 302: judging whether the window shift is smaller than 1, if so, executing step 303, and if not, executing step 305; step 303: judging whether the window shift is less than 0, if so, executing a step 304, and if not, executing a step 306; step 304: setting the window shift as 0, and setting the number of calculation frames of the formants as 1; step 305: determining the sample point sequence number of each voice sample point in each section of voice segment to be calculated of the voice to be identified according to the window length and the window shift of the weighted window; step 306: setting the window shift as 1 and the number of formant calculation frames as n-l+ 1; step 307: determining the emphasis mode of the voice to be identified according to the window shift and the window length of the emphasis window, when the emphasis mode is traversal emphasis, pre-emphasizing each voice segment to be calculated in the voice to be identified, otherwise, directly executing the step 308; step 308: determining whether the current frame to-be-calculated voice segment is a non-mute voice segment, if so, executing step 309, and if not, executing step 317; step 309: determining the emphasis mode of the speech to be identified as interval emphasis, and pre-emphasizing the current frame speech segment to be calculated; step 310: windowing the pre-emphasized voice segment to be calculated; step 311: solving a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated; step 312: carrying out Fourier transform on the windowed speech segment to be calculated; step 
313: calculating an initial power spectrum sequence of the voice segment to be calculated after Fourier transformation; step 314: correcting the initial power spectrum sequence to obtain a target power spectrum sequence; step 315: determining polynomial coefficients of a polynomial linear equation; step 316: calculating attribute parameters of formants of the voice segment to be calculated; step 317: determining the number of formants in the voice segment to be calculated to be 0; step 318: determining whether the voice segment to be calculated is the last frame of the voice to be identified, if so, executing step 319, and if not, executing step 320; step 319: finishing the calculation; step 320: acquiring the voice segment to be calculated of the next frame of the current voice segment to be calculated according to the frame sequence number, and repeatedly executing steps 301 to 317.
The method for determining the voice formant provided by the embodiment of the application obtains the pre-emphasis coefficient of the voice segment to be calculated and the number of pixels on the horizontal axis of a formant display window; calculating to obtain the window shift of the weighted window based on the number of the horizontal axis pixels and the number of the voice sampling points in the voice segment to be calculated; determining a pre-emphasis mode of the voice to be evaluated based on window shift and window length of an emphasis window for pre-emphasizing the voice to be evaluated; for each section of the voice segment to be calculated in the voice to be identified, pre-emphasis processing is carried out on the voice segment to be calculated according to the pre-emphasis mode, and the pre-emphasized voice segment to be calculated is obtained; windowing the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated; based on a preset linear prediction order, solving a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated; determining a target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor; and calculating to obtain the attribute parameters of the formants of the voice segments to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence. 
Therefore, the emphasis mode for the voice segment to be calculated can be determined according to the window length and the window shift of the emphasis window used in pre-emphasis processing, and the voice segment to be calculated is pre-emphasized according to that mode; the attribute parameters of the voice formants are then calculated using polynomial linear equations of different orders. This reduces the low-frequency components in the voice segment to be calculated and increases the high-frequency components, greatly shortens the calculation time of the voice formant attribute parameters, balances the calculation between precision and speed, and improves the accuracy of the calculation result.
Referring to fig. 4 and 5, fig. 4 is a schematic structural diagram of a device for determining a speech formant according to an embodiment of the present application, and fig. 5 is a second schematic structural diagram of the device for determining a speech formant according to the embodiment of the present application. As shown in fig. 4, the determining means 400 includes:
a mode determining module 410, configured to determine a pre-emphasis mode of the voice to be identified based on a window shift and a window length of an emphasis window for pre-emphasizing the voice to be identified;
the emphasis module 420 is configured to perform pre-emphasis processing on each to-be-calculated voice segment in the to-be-identified voice according to the pre-emphasis mode, so as to obtain a pre-emphasized to-be-calculated voice segment;
a windowing module 430, configured to perform windowing on the pre-emphasized to-be-calculated voice segment to obtain a windowed to-be-calculated voice segment;
a first calculating module 440, configured to, based on a preset linear prediction order, obtain a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated;
a sequence determining module 450, configured to determine a target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor;
and a second calculating module 460, configured to calculate an attribute parameter of a formant of the to-be-calculated speech segment based on the constructed polynomial linear equation and the target power spectrum sequence.
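The linear prediction step handled by the first calculating module 440 can be sketched with the standard autocorrelation method and Levinson-Durbin recursion. This is a minimal illustration under assumptions the patent does not fix: the function name is invented, and the gain factor is taken here as the square root of the final prediction error.

```python
def lpc_levinson_durbin(frame, order):
    """Autocorrelation-method LPC: returns the prediction
    coefficient sequence a[0..order] (with a[0] = 1) and a
    gain factor (sqrt of the final prediction error)."""
    n = len(frame)
    # biased autocorrelation r[0..order]
    r = [sum(frame[t] * frame[t + k] for t in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for order i
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a_prev = a[:]
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err ** 0.5
```

For a decaying exponential (the impulse response of a one-pole filter with pole 0.9), an order-1 fit recovers a coefficient close to -0.9, which is the expected all-pole model.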
Further, as shown in fig. 5, the determining apparatus 400 further includes a segmenting module 470, where the segmenting module 470 is configured to obtain the voice segments to be calculated by:
acquiring a voice to be identified;
and dividing the voice to be identified into a plurality of voice segments to be calculated according to the time sequence, allocating a frame sequence number to each voice segment to be calculated according to the time sequence, and allocating a sample sequence number to each voice sample in the voice to be identified.
Further, the mode determination module 410 is configured to determine the window shift of the emphasis window by:
acquiring the number of pixels on the horizontal axis of a formant display window of the voice to be identified;
and calculating the window shift of the emphasis window based on the number of the horizontal axis pixels and the number of the voice sampling points in the voice to be identified.
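A minimal sketch of this window-shift calculation, assuming the intent is one analysis frame per horizontal-axis pixel column of the formant display window; the function name and the clamp to at least one sample are assumptions:

```python
def emphasis_window_shift(num_samples, num_pixels):
    """Window shift derived from the display resolution: the
    number of voice sampling points divided by the number of
    horizontal-axis pixels, floored and clamped to >= 1."""
    return max(1, num_samples // num_pixels)
```

For example, one second of 16 kHz speech rendered into an 800-pixel-wide window yields a shift of 20 samples per frame.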
Further, when the emphasis module 420 is configured to determine the pre-emphasis mode of the voice to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the voice to be identified, the emphasis module 420 is configured to:
when the window shift is smaller than the window length, determining that the emphasis mode for performing pre-emphasis processing on the voice segment to be calculated is traversal emphasis;
and when the window shift is greater than or equal to the window length, determining that the emphasis mode of the pre-emphasis processing on the voice segment to be calculated is interval emphasis.
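The two-branch mode decision above can be sketched as follows; a trivial illustration, with invented string labels:

```python
def select_emphasis_mode(window_shift, window_length):
    """Traversal emphasis when consecutive emphasis windows
    overlap (shift < length); interval emphasis otherwise."""
    return "traversal" if window_shift < window_length else "interval"
```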
Further, when the pre-emphasis mode includes interval emphasis and the emphasis module 420 is configured to pre-emphasize each voice segment to be calculated in the voice to be identified according to the pre-emphasis mode to obtain a pre-emphasized voice segment to be calculated, the emphasis module 420 is configured to:
determining whether the voice segment to be calculated is a non-silent voice segment;
and if so, pre-emphasizing the voice segment to be calculated according to the pre-emphasizing mode to obtain the pre-emphasized voice segment to be calculated.
Further, when the pre-emphasis mode includes traversal emphasis and the emphasis module 420 is configured to pre-emphasize each voice segment to be calculated in the voice to be identified according to the pre-emphasis mode to obtain a pre-emphasized voice segment to be calculated, the emphasis module 420 is configured to:
pre-emphasis processing is carried out on the voice segment to be calculated according to the emphasis mode, and the processed voice segment to be calculated is obtained;
determining whether the processed voice segment to be calculated is a non-silent voice segment;
and if so, determining the processed voice segment to be calculated as the pre-emphasized voice segment to be calculated.
Further, when the emphasis module 420 is configured to perform pre-emphasis processing on the speech segment to be calculated according to the emphasis mode, so as to obtain an emphasized speech segment to be calculated, the emphasis module 420 is configured to:
moving the emphasis window over the voice segment to be calculated according to the emphasis mode and the window shift, and determining the voice sample points to be emphasized from the voice segment to be calculated;
and emphasizing the voice sample points to be emphasized through the pre-emphasis coefficient of the voice to be identified to obtain the pre-emphasized voice segment to be calculated.
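The per-sample emphasis itself is, in common practice, the first-order filter y[n] = x[n] - α·x[n-1], which attenuates low frequencies and boosts high frequencies. The default coefficient value and the handling of the first sample below are assumptions, not taken from the patent:

```python
def pre_emphasize(samples, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].
    The first sample has no predecessor and is kept as-is."""
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out
```

A constant (purely low-frequency) input is strongly attenuated, which is exactly the intended spectral tilt.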
Further, when the pre-emphasized voice segment to be calculated is a non-mute voice segment and the windowing module 430 is configured to perform windowing on the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated, the windowing module 430 is configured to:
determining a frame sequence number corresponding to the pre-emphasized voice fragment to be calculated;
determining the sample point sequence number of each voice sample point in the pre-emphasized voice segment to be calculated based on the frame sequence number and the window length of the emphasis window;
for each voice sampling point, determining a window function corresponding to the voice sampling point based on the sample point sequence number of the voice sampling point;
and windowing each corresponding voice sample point in the pre-emphasized voice segment to be calculated based on the window function corresponding to each voice sample point to obtain the windowed voice segment to be calculated.
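A sketch of the per-sample windowing, assuming a Hamming window as the window function; the patent does not name a specific window, so this choice and the function names are assumptions:

```python
import math

def hamming_value(i, window_length):
    """Window function value for the sample with in-frame index i."""
    return 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (window_length - 1))

def apply_window(frame):
    """Multiply each voice sample by the window function value
    derived from its sample index within the frame."""
    n = len(frame)
    return [s * hamming_value(i, n) for i, s in enumerate(frame)]
```

The window is symmetric, tapering the frame edges toward 0.08 while leaving the center near full amplitude.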
Further, when the sequence determining module 450 is configured to determine the target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor, the sequence determining module 450 is configured to:
converting the windowed voice segment to be calculated in the time domain to a frequency domain based on the linear prediction coefficient sequence;
calculating the initial power spectrum sequence of the windowed speech segment to be calculated in the frequency domain;
converting the gain factor to a logarithmic domain;
and correcting the initial power spectrum sequence based on the gain factor in the logarithmic domain to obtain a target power spectrum sequence.
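A sketch of how the target power spectrum sequence could be formed from the all-pole LPC model H(z) = G / A(z): evaluate A on the frequency axis to get the initial log-power spectrum, then correct it with the gain factor converted to the logarithmic (dB) domain. The dB convention, point count, and function name are assumptions:

```python
import cmath
import math

def lpc_power_spectrum_db(a, gain, n_points=256):
    """Log-power spectrum of H(z) = gain / A(z): the initial
    spectrum -10*log10(|A(e^jw)|^2) is corrected by the
    log-domain gain 20*log10(gain)."""
    gain_db = 20.0 * math.log10(gain)
    spec = []
    for k in range(n_points):
        w = math.pi * k / n_points  # frequencies in [0, pi)
        A = sum(a[m] * cmath.exp(-1j * w * m) for m in range(len(a)))
        spec.append(gain_db - 10.0 * math.log10(abs(A) ** 2))
    return spec
```

With no prediction (A(z) = 1, gain 1) the spectrum is flat at 0 dB; a coefficient sequence [1, -0.9] produces the expected low-frequency emphasis.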
Further, the windowing module 430 is configured to determine whether the voice segment to be calculated is a non-mute voice segment by:
calculating the sample point energy of each voice sample point in the voice segment to be calculated;
determining the voice energy of the voice segment to be calculated based on the sample point energy of each voice sample point;
when the voice energy is greater than or equal to a preset energy threshold value, determining that the voice segment to be calculated is a non-mute voice segment;
and when the voice energy is smaller than a preset energy threshold value, determining that the voice segment to be calculated is a mute voice segment.
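This energy test can be sketched as follows; the function name and the example threshold are assumptions:

```python
def is_non_mute_segment(frame, energy_threshold):
    """Sum the per-sample energies (squared amplitudes); the
    segment is non-mute when the total reaches the threshold."""
    energy = sum(s * s for s in frame)
    return energy >= energy_threshold
```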
Further, when the second calculating module 460 is configured to calculate the attribute parameters of the formants of the speech segment to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence, the second calculating module 460 is configured to:
determining local maximum sampling points and a preset number of adjacent sampling points of which the difference value with the power spectrum of the local maximum sampling points is within a preset difference value range from the voice segment to be calculated according to the target power spectrum sequence, wherein the preset number is the polynomial order of the polynomial linear equation;
determining a polynomial coefficient of the constructed polynomial linear equation based on the amplitude and frequency of the local maximum sample point and each adjacent sample point;
and determining attribute parameters of the formants of the voice segments to be calculated based on the polynomial coefficients and the constructed polynomial linear equation, wherein the attribute parameters comprise formant intensity, formant frequency and formant-3 dB bandwidth.
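With two adjacent sample points and a polynomial order of 2, the fit reduces to parabolic interpolation around the local maximum. The following sketch recovers formant frequency, intensity, and -3 dB bandwidth from three bins of a log-power spectrum; it assumes bin k is a local maximum and a uniform bin spacing `bin_hz`, both of which are illustrative assumptions:

```python
def formant_from_peak(spec_db, k, bin_hz):
    """Order-2 polynomial fit through bins k-1, k, k+1 of a
    log-power spectrum (dB). Returns (frequency_hz,
    intensity_db, bandwidth_3db_hz) of the interpolated peak."""
    y0, y1, y2 = spec_db[k - 1], spec_db[k], spec_db[k + 1]
    a = (y0 + y2) / 2.0 - y1          # curvature; negative at a peak
    b = (y2 - y0) / 2.0               # slope term
    x_star = -b / (2.0 * a)           # vertex offset from bin k, in bins
    intensity = y1 - b * b / (4.0 * a)
    frequency = (k + x_star) * bin_hz
    # width where the parabola falls 3 dB below its vertex
    bandwidth = 2.0 * (-3.0 / a) ** 0.5 * bin_hz
    return frequency, intensity, bandwidth
```

On a spectrum that is exactly parabolic near the peak, the interpolated frequency and intensity are recovered exactly, even when the true peak lies between bins.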
The device for determining the voice formants provided by the embodiment of the application determines the pre-emphasis mode of the voice to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the voice to be identified; for each voice segment to be calculated in the voice to be identified, performs pre-emphasis processing on the voice segment to be calculated according to the pre-emphasis mode to obtain a pre-emphasized voice segment to be calculated, so that the low-frequency components of the spectrum in the voice segment to be calculated are reduced and the high-frequency components are increased; when the pre-emphasized voice segment to be calculated is determined to be a non-mute voice segment, performs windowing on the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated; obtains a linear prediction coefficient sequence and a gain factor of the windowed voice segment to be calculated based on a preset linear prediction order; determines a target power spectrum sequence of the windowed voice segment to be calculated based on the linear prediction coefficient sequence and the gain factor; and calculates the attribute parameters of the formants of the voice segment to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence.
Therefore, the emphasis mode for the voice segment to be calculated can be determined according to the window length and the window shift of the emphasis window used in pre-emphasis processing, and the voice segment to be calculated is pre-emphasized according to that mode; the attribute parameters of the voice formants are then calculated using polynomial linear equations of different orders. This reduces the low-frequency components in the voice segment to be calculated and increases the high-frequency components, greatly shortens the calculation time of the voice formant attribute parameters, balances the calculation between precision and speed, and improves the accuracy of the calculation result.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 6, the electronic device 600 includes a processor 610, a memory 620, and a bus 630.
The memory 620 stores machine-readable instructions executable by the processor 610, when the electronic device 600 runs, the processor 610 communicates with the memory 620 through the bus 630, and when the machine-readable instructions are executed by the processor 610, the steps of the method for determining a speech formant in the method embodiments shown in fig. 1 and fig. 2 may be performed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for determining a speech formant in the method embodiments shown in fig. 1 and fig. 2 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method for determining a speech formant, the method comprising:
determining a pre-emphasis mode of the voice to be identified based on window shift and window length of an emphasis window for pre-emphasizing the voice to be identified;
for each section of the voice segment to be calculated in the voice to be identified, pre-emphasis processing is carried out on the voice segment to be calculated according to the pre-emphasis mode, and the pre-emphasized voice segment to be calculated is obtained;
windowing the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated;
based on a preset linear prediction order, solving a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated;
determining a target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor;
calculating to obtain attribute parameters of formants of the voice segments to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence;
wherein, the determining the pre-emphasis mode of the voice to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the voice to be identified comprises:
when the window shift is less than the window length, determining that the emphasis mode for performing pre-emphasis processing on the voice to be identified is traversal emphasis;
when the window shift is greater than or equal to the window length, determining that the emphasis mode of the pre-emphasis processing on the voice to be identified is interval emphasis;
the calculating to obtain the attribute parameters of the formants of the voice segments to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence comprises the following steps:
determining local maximum sampling points and a preset number of adjacent sampling points of which the difference value with the power spectrum of the local maximum sampling points is within a preset difference value range from the voice segment to be calculated according to the target power spectrum sequence, wherein the preset number is the polynomial order of the polynomial linear equation;
determining a polynomial coefficient of the constructed polynomial linear equation based on the amplitude and frequency of the local maximum sample point and each adjacent sample point;
and determining attribute parameters of the formants of the voice segments to be calculated based on the polynomial coefficients and the constructed polynomial linear equation, wherein the attribute parameters comprise formant intensity, formant frequency and formant-3 dB bandwidth.
2. The determination method according to claim 1, characterized in that the window shift of the emphasis window is determined by:
acquiring the number of pixels on the horizontal axis of a formant display window of the voice to be identified;
and calculating the window shift of the emphasis window based on the number of the horizontal axis pixels and the number of the voice sampling points in the voice to be identified.
3. The method for determining as claimed in claim 1, wherein when the pre-emphasis mode includes interval emphasis, pre-emphasizing each to-be-calculated speech segment in the to-be-identified speech according to the pre-emphasis mode to obtain a pre-emphasized to-be-calculated speech segment, comprising:
determining whether the voice segment to be calculated is a non-silent voice segment;
and if so, pre-emphasizing the voice segment to be calculated according to the pre-emphasizing mode to obtain the pre-emphasized voice segment to be calculated.
4. The method for determining according to claim 1, wherein when the pre-emphasis mode includes traversal emphasis, pre-emphasizing each to-be-calculated speech segment in the to-be-identified speech according to the pre-emphasis mode to obtain a pre-emphasized to-be-calculated speech segment, including:
pre-emphasis processing is carried out on the voice segment to be calculated according to the emphasis mode, and the processed voice segment to be calculated is obtained;
determining whether the processed voice segment to be calculated is a non-silent voice segment;
and if so, determining the processed voice segment to be calculated as the pre-emphasized voice segment to be calculated.
5. The method for determining as claimed in claim 1, wherein the pre-emphasizing the to-be-calculated speech segment according to the emphasizing manner to obtain an emphasized to-be-calculated speech segment includes:
moving the emphasis window on the voice segment to be calculated according to the emphasis mode and the window movement, and determining a voice sample point to be emphasized from the calculated voice segment;
and emphasizing the voice sample points to be emphasized through the pre-emphasis coefficient of the voice to be identified to obtain the pre-emphasized voice fragment to be calculated.
6. The method of claim 1, wherein the windowing the pre-emphasized speech segment to be computed to obtain a windowed speech segment to be computed comprises:
determining a frame sequence number corresponding to the pre-emphasized voice fragment to be calculated;
determining the sample point sequence number of each voice sample point in the pre-emphasized voice segment to be calculated based on the frame sequence number and the window length of the emphasis window;
for each voice sampling point, determining a window function corresponding to the voice sampling point based on the sample point sequence number of the voice sampling point;
and windowing each corresponding voice sample point in the pre-emphasized voice segment to be calculated based on the window function corresponding to each voice sample point to obtain the windowed voice segment to be calculated.
7. The method according to claim 1, wherein the determining the target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor comprises:
converting the windowed voice segment to be calculated in the time domain to a frequency domain based on the linear prediction coefficient sequence;
calculating the initial power spectrum sequence of the windowed speech segment to be calculated in the frequency domain;
converting the gain factor to a logarithmic domain;
and correcting the initial power spectrum sequence based on the gain factor in the logarithmic domain to obtain a target power spectrum sequence.
8. The determination method according to claim 3, wherein it is determined whether the speech segment to be calculated is a non-silent speech segment by:
calculating the sample point energy of each voice sample point in the voice segment to be calculated;
determining the voice energy of the voice segment to be calculated based on the sample point energy of each voice sample point;
when the voice energy is greater than or equal to a preset energy threshold value, determining that the voice segment to be calculated is a non-mute voice segment;
and when the voice energy is smaller than a preset energy threshold value, determining that the voice segment to be calculated is a mute voice segment.
9. An apparatus for determining a speech formant, the apparatus comprising:
the mode determining module is used for determining the pre-emphasis mode of the voice to be identified based on the window shift and the window length of an emphasis window for pre-emphasizing the voice to be identified;
the emphasis module is used for carrying out pre-emphasis processing on each section of the voice segment to be calculated in the voice to be identified according to the pre-emphasis mode to obtain the pre-emphasized voice segment to be calculated;
the windowing module is used for windowing the pre-emphasized voice segment to be calculated to obtain a windowed voice segment to be calculated;
the first calculation module is used for solving a linear prediction coefficient sequence and a gain factor of the windowed speech segment to be calculated based on a preset linear prediction order;
a sequence determining module, configured to determine a target power spectrum sequence of the windowed speech segment to be calculated based on the linear prediction coefficient sequence and the gain factor;
the second calculation module is used for calculating and obtaining the attribute parameters of the formants of the voice segments to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence;
wherein, when the emphasis module is used for determining the pre-emphasis mode of the voice to be identified based on the window shift and the window length of the emphasis window for pre-emphasizing the voice to be identified, the emphasis module is used for:
when the window shift is smaller than the window length, determining that the emphasis mode for performing pre-emphasis processing on the voice segment to be calculated is traversal emphasis;
when the window shift is greater than or equal to the window length, determining that the emphasis mode of the pre-emphasis processing on the voice segment to be calculated is interval emphasis;
when the second calculation module is configured to calculate and obtain the attribute parameters of the formants of the speech segment to be calculated based on the constructed polynomial linear equation and the target power spectrum sequence, the second calculation module is configured to:
determining local maximum sampling points and a preset number of adjacent sampling points of which the difference value with the power spectrum of the local maximum sampling points is within a preset difference value range from the voice segment to be calculated according to the target power spectrum sequence, wherein the preset number is the polynomial order of the polynomial linear equation;
determining a polynomial coefficient of the constructed polynomial linear equation based on the amplitude and frequency of the local maximum sample point and each adjacent sample point;
and determining attribute parameters of the formants of the voice segments to be calculated based on the polynomial coefficients and the constructed polynomial linear equation, wherein the attribute parameters comprise formant intensity, formant frequency and formant-3 dB bandwidth.
10. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the method of determining speech formants according to any one of claims 1 to 8.
11. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for determining a speech formant according to any one of claims 1 to 8.
CN202110273503.6A 2021-03-15 2021-03-15 Method and device for determining voice formant, electronic equipment and readable storage medium Active CN112687277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110273503.6A CN112687277B (en) 2021-03-15 2021-03-15 Method and device for determining voice formant, electronic equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN112687277A CN112687277A (en) 2021-04-20
CN112687277B true CN112687277B (en) 2021-06-18

Family

ID=75455540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110273503.6A Active CN112687277B (en) 2021-03-15 2021-03-15 Method and device for determining voice formant, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112687277B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101067929A (en) * 2007-06-05 2007-11-07 南京大学 Method for enhancing and extracting phonetic resonance hump trace utilizing formant
CN102779527A (en) * 2012-08-07 2012-11-14 无锡成电科大科技发展有限公司 Speech enhancement method on basis of enhancement of formants of window function
CN103714826A (en) * 2013-12-18 2014-04-09 安徽讯飞智元信息科技有限公司 Resonance peak automatic matching method for voiceprint identification
CN106710604A (en) * 2016-12-07 2017-05-24 天津大学 Formant enhancement apparatus and method for improving speech intelligibility
CN112397087A (en) * 2020-11-13 2021-02-23 展讯通信(上海)有限公司 Formant envelope estimation, voice processing method and device, storage medium and terminal
CN112420072A (en) * 2021-01-25 2021-02-26 北京远鉴信息技术有限公司 Method and device for generating spectrogram, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9847093B2 (en) * 2015-06-19 2017-12-19 Samsung Electronics Co., Ltd. Method and apparatus for processing speech signal


Also Published As

Publication number Publication date
CN112687277A (en) 2021-04-20

Similar Documents

Publication Publication Date Title
CA2750037C (en) Apparatus, method and computer program for obtaining a parameter describing a variation of a signal characteristic of a signal
KR101110141B1 (en) Cyclic signal processing method, cyclic signal conversion method, cyclic signal processing device, and cyclic signal analysis method
EP4148732B1 (en) Cross product enhanced subband block based harmonic transposition
JP4100721B2 (en) Excitation parameter evaluation
TWI538393B (en) Controlling the loudness of an audio signal in response to spectral localization
JP2003517624A (en) Noise suppression for low bit rate speech coder
BR112019020515A2 (en) apparatus for post-processing an audio signal using transient location detection
CA2891453C (en) Method of and apparatus for evaluating intelligibility of a degraded speech signal
CN105792072A (en) Sound effect processing method and device and terminal
CN116168729A (en) Voice quality evaluation method and device and electronic equipment
KR100511316B1 (en) Formant frequency detecting method of voice signal
CN112530450A (en) Sample-precision delay identification in the frequency domain
CN112687277B (en) Method and device for determining voice formant, electronic equipment and readable storage medium
CN113393852B (en) Method and system for constructing voice enhancement model and method and system for voice enhancement
EP1216504A1 (en) Spectrum modeling
CN115715413A (en) Method, device and system for detecting and extracting spatial identifiable sub-band audio source
CN111341327A (en) Speaker voice recognition method, device and equipment based on particle swarm optimization
JPH0573093A (en) Extracting method for signal feature point
WO2016007947A1 (en) Fast computation of excitation pattern, auditory pattern and loudness
CN118411999B (en) Directional audio pickup method and system based on microphone
CN113113033A (en) Audio processing method and device and readable storage medium
JP2539351B2 (en) Speech synthesis method
CN114822577B (en) Method and device for estimating fundamental frequency of voice signal
Banno et al. Efficient representation of short‐time phase based on time‐domain smoothed group delay
JPH0636157B2 (en) Band division type vocoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant