
GB2536727B - A speech processing device - Google Patents

A speech processing device

Info

Publication number
GB2536727B
GB2536727B GB1505361.4A GB201505361A
Authority
GB
United Kingdom
Prior art keywords
speech
noise
output
signal
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
GB1505361.4A
Other versions
GB2536727A (en)
GB201505361D0 (en)
Inventor
Griffin Anthony
Stylianou Ioannis
Zorila Tudor-Catalin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1505361.4A priority Critical patent/GB2536727B/en
Publication of GB201505361D0 publication Critical patent/GB201505361D0/en
Publication of GB2536727A publication Critical patent/GB2536727A/en
Application granted granted Critical
Publication of GB2536727B publication Critical patent/GB2536727B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Description

A speech processing device
FIELD
Embodiments described herein relate generally to speech processing and, more particularly, to increasing the intelligibility of noisy speech.
BACKGROUND
Hearing and understanding speech is an extremely important part of a person’s ability to communicate with others. As background noise increases it gets harder to make out the content of the speech of interest. This is true for a person with normal hearing, but even more so for a hearing impaired person. In order to combat this, many signal processing algorithms have been developed to increase the intelligibility of speech in noise.
Most speech intelligibility enhancement algorithms are designed to use clean speech as an input and their performance may suffer once the input speech signal-to-noise ratio decreases, a common case in face-to-face communication environments such as restaurants or cafes.
BRIEF DESCRIPTION OF THE DRAWINGS
Systems, devices and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures in which:
Figure 1 is a schematic of an SSDRC system;
Figure 2 is a further schematic showing an SSDRC system with a spectral shaping filter and a dynamic range compression stage;
Figure 3 is a schematic showing the spectral shaping filter and a dynamic range compression stage of figure 2;
Figure 4 is a schematic of the spectral shaping filter in more detail;
Figure 5 is a schematic showing the dynamic range compression stage in more detail;
Figure 6 is a plot of an input-output envelope characteristic curve;
Figure 7(a) is a plot of a speech signal and figure 7(b) is a plot of the output from the dynamic range compression stage;
Figure 8 is a plot of an input-output envelope characteristic curve adapted in accordance with an environmental signal to noise ratio;
Figure 9 is a schematic of a further SSDRC system with multiple outputs;
Figure 10 is a schematic of a noise reduction system for use with an embodiment;
Figure 11 is a further schematic of a noise reduction system for use with an embodiment;
Figure 12 is a schematic of an embodiment;
Figure 13 is a schematic of an embodiment, with a noisy output environment;
Figure 14 is a graph illustrating the speech intelligibility performance of an SSDRC system with a range of input signal to noise ratio (SNR) values;
Figures 15a, 15b and 15c are graphs illustrating the speech intelligibility performance of an embodiment with a range of input SNR values; and
Figure 16 is a further graph illustrating the speech intelligibility performance of an embodiment with a range of input SNR values.
DETAILED DESCRIPTION
In an embodiment, a noise reduction method is used in combination with spectral shaping (SS) and dynamic range compression (DRC). Spectral shaping and dynamic range compression are referred to as SSDRC when combined. When a noise reduction method is combined with SSDRC, the system is referred to herein as noise-tolerant SSDRC (ntSSDRC). SSDRC will be discussed first, before embodiments according to the present disclosure are discussed.
In an embodiment of a SSDRC component for use with the present disclosure, a speech intelligibility enhancing system may be provided for enhancing speech to be outputted in a noisy environment, the system comprising: a speech input for receiving speech to be enhanced; an environmental noise input for receiving real-time information concerning the noisy environment; an enhanced speech output to output said enhanced speech; and a processor configured to convert speech received from said speech input to enhanced speech to be output by said enhanced speech output, the processor being configured to: apply a spectral shaping filter to the speech received via said speech input; apply dynamic range compression to the output of said spectral shaping filter; and measure the environmental signal to noise ratio at the environmental noise input, wherein the spectral shaping filter comprises an output environment control parameter and the dynamic range compression comprises an output environment control parameter and wherein at least one of the output environment control parameters for the dynamic range compression or the spectral shaping is updated in real time according to the measured environmental signal to noise ratio.
In an alternative embodiment of SSDRC a speech intelligibility enhancing system for enhancing speech to be outputted in a noisy output environment may be provided, the system comprising: a speech input for receiving speech to be enhanced; a speech output for outputting said enhanced speech; and a processor configured to convert speech received via said speech input to enhanced speech to be output by said speech output, the processor being configured to: apply a spectral shaping filter to the speech received via said speech input; and apply dynamic range compression to the output of said spectral shaping filter.
In systems in accordance with the above, but not necessarily all, embodiments of the SSDRC component of the present disclosure, the output is adapted to the noise environment. Further, the output may be continually updated such that it adapts in real time to the changing noise environment. For example, if the above system is built into a mobile telephone and the user is standing outside a noisy room, the system can adapt to enhance the speech dependent on whether the door to the room is open or closed. Similarly, if the system is used in a public address system in a railway station, the system can adapt in real time to the changing noise conditions as trains arrive and depart.
In an embodiment, the environmental signal to noise ratio may be estimated on a frame by frame basis and the environmental signal to noise ratio for a previous frame is used to update the parameters for a current frame. A typical frame length is from 1 to 3 seconds.
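By way of illustration only, the following Python sketch shows one possible frame-by-frame SNR update in which the smoothed estimate from the previous frame is used to parameterise the current frame. The smoothing factor and the way the per-frame speech and noise powers are obtained are illustrative assumptions and are not specified by the embodiment.

import numpy as np

def framewise_snr_db(speech_power, noise_power, alpha=0.7):
    # speech_power, noise_power: one mean power value per analysis frame
    # (frames of roughly 1 to 3 seconds, as described above).
    snr_inst = 10.0 * np.log10(np.maximum(speech_power, 1e-12) /
                               np.maximum(noise_power, 1e-12))
    snr_used = np.empty_like(snr_inst)
    prev = snr_inst[0]
    for t in range(len(snr_inst)):
        snr_used[t] = prev                                 # frame t uses the estimate from frame t-1
        prev = alpha * prev + (1.0 - alpha) * snr_inst[t]  # smooth the running estimate
    return snr_used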
The above system can adapt either the spectral shaping filter and/or the dynamic range compression stage to the noisy environment. In some embodiments, both the spectral shaping filter and the dynamic range compression stage will be adapted to the noisy environment.
When adapting the dynamic range compression in line with the SNR, the output environment control parameter that is updated may be used to control the gain to be applied by said dynamic range compression. Further according to the disclosure, the output environment control parameter may be updated such that it gradually suppresses the boosting of the low energy segments of the input speech with increasing environmental signal to noise ratio. In some embodiments, a linear relationship is assumed between the SNR and the output environment control parameter; in other embodiments a non-linear or logistic relationship is used.
To control the volume of the output, in some embodiments, the system further comprises an energy banking box, said energy banking box being a memory provided in said system and configured to store the total energy of said input speech before enhancement, said processor being further configured to increase the energy of low energy parts of the enhanced signal using energy stored in the energy banking box.
The spectral shaping filter may comprise an adaptive spectral shaping stage and a fixed spectral shaping stage. The adaptive spectral shaping stage may comprise a formant shaping filter and a filter to reduce the spectral tilt. In an embodiment, a first output environment control parameter is provided to control said formant shaping filter and a second output environment control parameter is configured to control said filter configured to reduce the spectral tilt and wherein said first and/or second output environment control parameters are updated in accordance with the environmental signal to noise ratio. The first and/or second output environment control parameters may have a linear dependence on said environmental signal to noise ratio.
The above discussion has concentrated on adapting the signal in response to an environmental SNR. However, an SSDRC system for use with an embodiment may not necessarily be adapted as such. Adapting the signal in response to an environmental SNR at the output may be an optional feature of the present disclosure. The system may be configured to modify the spectral shaping filter in accordance with the input speech independent of noise measurements. For example, the processor may be configured to estimate the maximum probability of voicing when applying the spectral shaping filter. The system may be configured to update the maximum probability of voicing every m seconds, wherein m is a value from 2 to 10.
The system may be configured to modify the dynamic range compression in accordance with the input speech independent of noise measurements. For example, the processor is configured to estimate the maximum value of the signal envelope of the input speech when applying dynamic range compression. The system may be configured to update the maximum value of the signal envelope of the input speech every m seconds, wherein m is a value from 2 to 10.
The system may also be configured to output enhanced speech in a plurality of locations. For example, such a system may comprise a plurality of environmental noise inputs corresponding to the plurality of locations, the processor being configured to apply a plurality of spectral shaping filters and a plurality of corresponding dynamic range compression stages, such that there is a spectral shaping filter and dynamic range compression stage pair for each environmental noise input, the processor being configured to update the output environment control parameters for each spectral shaping filter and dynamic range compression stage pair in accordance with the environmental signal to noise ratio measured from its corresponding environmental noise input. Such a system would be of use for example in a PA system with a plurality of speakers in different environments.
In further SSDRC embodiments for use with the present disclosure, a method for enhancing speech to be outputted in a noisy environment is provided, the method comprising: receiving speech to be enhanced; receiving real-time information concerning the noisy environment at an environmental noise input; converting speech received from said speech input to enhanced speech; and outputting said enhanced speech, wherein converting said speech comprises: measuring the environmental signal to noise ratio at the environmental noise input, applying a spectral shaping filter to the speech received via said speech input; and applying dynamic range compression to the output of said spectral shaping filter; wherein the spectral shaping filter comprises an output environment control parameter and the dynamic range compression comprises an output environment control parameter and wherein at least one of the output environment control parameters for the dynamic range compression or the spectral shaping is updated in real time according to the measured environmental signal to noise ratio.
The above SSDRC embodiments have discussed adaptability of the system in response to SNR. However, in some embodiments, the speech is enhanced independent of the SNR of the environment where it is to be output. Here, a speech intelligibility enhancing system for enhancing speech to be output is provided, the system comprising: a speech input for receiving speech to be enhanced; an enhanced speech output to output said enhanced speech; and a processor configured to convert speech received from said speech input to enhanced speech to be output by said enhanced speech output, the processor being configured to: apply a spectral shaping filter to the speech received via said speech input; and apply dynamic range compression to the output of said spectral shaping filter, wherein the spectral shaping filter comprises an output environment control parameter and the dynamic range compression comprises an output environment control parameter and at least one of the output environment control parameters for the dynamic range compression or the spectral shaping is updated in real time according to the speech received at the speech input.
For example, the processor may be configured to estimate the maximum probability of voicing when applying the spectral shaping filter, and wherein the system is configured to update the maximum probability of voicing every m seconds, wherein m is a value from 2 to 10.
The system may be configured to modify the dynamic range compression in accordance with the input speech independent of noise measurements. For example, the processor is configured to estimate the maximum value of the signal envelope of the input speech when applying dynamic range compression and wherein the system is configured to update the maximum value of the signal envelope of the input speech every m seconds, wherein m is a value from 2 to 10.
In a further embodiment of an SSDRC system for use with the present disclosure, a method for enhancing speech intelligibility is provided, the method comprising: receiving speech to be enhanced; converting speech received from said speech input to enhanced speech; and outputting said enhanced speech, wherein converting said speech comprises: applying a spectral shaping filter to the speech received via said speech input; and applying dynamic range compression to the output of said spectral shaping filter, wherein the spectral shaping filter comprises an output environment control parameter and the dynamic range compression comprises an output environment control parameter and at least one of the output environment control parameters for the dynamic range compression or the spectral shaping is updated in real time according to the speech received at the speech input.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
Embodiments relating to the combination of a noise reduction method and elements of SSDRC will now be discussed. Any discussion made above applies mutatis mutandis to equivalent features of any embodiment discussed below.
According to an embodiment, there is provided a speech intelligibility enhancing device for enhancing speech to be outputted in a noisy output environment, the device comprising: a speech input for receiving speech to be enhanced; a speech output for outputting said enhanced speech; and a processor configured to convert speech received via said speech input to enhanced speech to be output by said speech output; the processor being configured to: apply a noise reduction method to the speech received via said speech input to increase the signal to noise ratio of said speech; determine a measurement of the noise of the speech received via said speech input; apply a spectral shaping filter to the output of said noise reduction method; and apply dynamic range compression to the output of said spectral shaping filter; wherein the dynamic range compression comprises a speech control parameter; said speech control parameter being determined by the measurement of the noise of the speech received via said speech input; and wherein the dynamic range compression is dependent on the speech control parameter.
An embodiment may be a speech intelligibility enhancing method, or a speech intelligibility enhancing system, wherein the method or system comprises the features of the speech intelligibility enhancing device described herein.
Speech to be enhanced may be input to the device via a speech input. The input speech (“speech to be enhanced”) may be a signal comprising a clean speech component and a noise component. The device may comprise an output for outputting speech after it has been enhanced. In some embodiments, the device may not comprise an output. The processor may be configured to act on the input speech to enhance its intelligibility.
The processor may be configured to make a variety of different types of measurement of the noise present in the input speech. The processor may be configured to obtain information regarding noise in the input speech or signal. The processor may be configured to determine, by estimation, a measurement of the noise associated with the speech signal received via said speech input. As such, determining a measurement may be equivalent to estimating, or deriving an estimation. The measurement of the present embodiment may relate to the signal inputted into the device, rather than the environment from which the signal is input, or the environment into which the output speech is output. As such, the measurement may be determined by measuring or analysing (in order to obtain an estimate) the speech received via the speech input, rather than via an input for measuring environmental noise. The measurement may be determined at one or a plurality of different locations within the device.
The processor may be configured to apply a method of reducing the noise, or increasing the tolerance of the signal to noise. Any method for increasing the tolerance of the signal to noise, or reducing the effect of noise on the signal, may be suitable for use with the current device. As such, any noise reduction method may be implemented with embodiments according to the present disclosure. Introducing a noise reduction method before applying SSDRC methods improves the performance of the SSDRC system with noisy speech. A spectral shaping filter and dynamic range compression may then be applied to the output of the noise reduction method. Traditional methods involving simple amplification of the output of the noise reduction method may result in the amplification of distortions. Using SSDRC methods on the output of the noise reduction method may increase intelligibility while minimising the amplification of distortions.
Any of the discussion relating to spectral shaping or dynamic range compression discussed herein, can be used in combination with the features of the present embodiment. The terms “spectral shaping” and “dynamic range compression” when used with reference to an embodiment of the present disclosure are understood to refer to any of the methods, systems and features described in relation to SSDRC, herein. A speech control parameter may be used to modify the DRC (or other step) in response to the input speech. Modifying the post-noise reduction behaviour of the device may reduce the amplification of distortions or noise. A speech control parameter may be directly, or indirectly dependent on the measurement of the noise of the speech. The dynamic range compression may be directly, or indirectly dependent on the speech control parameter. The speech control parameter may be linearly or non-linearly dependent on the measurement of the noise of the speech.
The speech control parameter may comprise a first portion, wherein the speech control parameter is not dependent on the measurement of the noise of the speech; a second portion, wherein the speech control parameter is linearly, or non-linearly, dependent on the measurement of the noise of the speech; and a third portion, wherein the speech control parameter is again not dependent on the measurement of the noise of the speech. The three portions of the speech control parameter may be delineated at certain values of the measurement of the noise of the speech. The first portion may change to the second portion at a first value of the measurement of the noise of the speech. The second portion may change to the third portion at a second value of the measurement of the noise of the speech.
The speech control parameter may comprise a plurality of portions or regions, each with a different dependency between the speech control parameter and the measurement of the noise of the speech (which may be the speech signal to noise ratio). Each portion or region of the speech control parameter may be bounded at specific values of the measurement of the noise of the speech, and may have a different dependency on the measurement of the noise of the speech. A further, separate, process may be dependent on the speech control parameter.
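By way of illustration, a three-portion speech control parameter of the kind described above may be sketched as the following piecewise mapping; the breakpoints lo and hi and the end values p_lo and p_hi are placeholder values, not values taken from the embodiment.

def speech_control_parameter(noise_measure_db, lo=-5.0, hi=15.0, p_lo=1.0, p_hi=0.0):
    # First portion: constant below lo (no dependence on the noise measurement).
    if noise_measure_db <= lo:
        return p_lo
    # Third portion: constant above hi (again no dependence).
    if noise_measure_db >= hi:
        return p_hi
    # Second portion: linear dependence between the two breakpoints.
    frac = (noise_measure_db - lo) / (hi - lo)
    return p_lo + frac * (p_hi - p_lo)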
In an embodiment, the spectral shaping filter may comprise a speech control parameter. A speech control parameter of the spectral shaping filter may be determined by the measurement of the noise of the speech received via said speech input; and wherein the spectral shaping filter is dependent on the speech control parameter. This may be in addition to the speech control parameter of the DRC. A speech control parameter of the spectral shaping feature may comprise any of the features discussed in relation to the speech control parameter of the DRC, mutatis mutandis. The formant shaping filter, spectral tilt or fixed spectral shaping may be dependent on a speech control parameter of the spectral shaping filter.
The device may further comprise an environmental noise input for receiving real-time information concerning the noisy output environment; wherein the processor may be further configured to measure the environmental signal to noise ratio at the environmental noise input; and wherein at least one of said spectral shaping filter and said dynamic range compression may comprise an output environment control parameter; and wherein at least one output environment control parameter is updated in real time according to the measured environmental signal to noise ratio.
In some embodiments, only one of the spectral shaping filter and the dynamic range compression comprises an output environment control parameter, in some embodiments both might. An output environment control parameter may be updated in real time according to the measured environmental signal to noise ratio. Discussion relating to the environmental signal to noise ratio and output environment control parameters made in relation to SSDRC embodiments may apply to any ntSSDRC embodiment of the disclosure, mutatis mutandis.
The measurement of the noise of the speech received via said speech input may be determined at any point before the spectral shaping filter is applied. The measurement of the noise of the speech received via said speech input may be determined immediately before the spectral shaping filter is applied. The measurement of the noise of the speech received via said speech input may be determined at the speech input. The measurement of the noise of the speech received via said speech input may be determined after the noise reduction method is applied to the speech. The measurement of the noise of the speech received via said speech input may be determined after the spectral shaping filter is applied to the speech. The measurement of the noise of the speech may comprise measurements determined at a plurality of the above-mentioned points.
The measurement of the noise, or the obtaining of information from the speech, may be determined at the speech input. The measurement may be determined before the speech is altered by the noise reduction method. Alternatively, the measurement may be made before the speech reaches the speech input, or after the speech has entered the speech input. In other embodiments, the measurement may be made while the speech is being processed by the noise reduction method, or after the speech has been processed by the noise reduction method.
The dynamic range compression may comprise an input/output envelope characteristic to control the gain to be applied by said dynamic range compression; and the input/output envelope characteristic may be dependent on the speech control parameter.
An input/output envelope characteristic (IOEC) may be as described anywhere herein. The IOEC may be modified by, or dependent on, the speech control parameter. The IOEC may be linearly, or non-linearly dependent on the speech control parameter.
The IOEC being dependent on the speech control parameter, and thus dependent on the measurement of noise of the speech, may allow characteristics of the input speech to facilitate modification of the DRC stage. Such modification may reduce the amplification of distortions or noise in the signal. Other parts of the DRC may be made dependent on the speech control parameter, for example either one, or both, of the dynamic and static stages of the DRC may be updated in response to, or dependent on, the speech control parameter. A speech control parameter may be determined by the measurement of the noise of the speech received via said speech input. A speech control parameter may be dependent upon the measurement of the noise of the speech received via said speech input. The dependency may be direct or indirect. The dependency may be linear or non-linear.
The speech control parameter may determine the gradient of the IOEC curve. The gain applied to the input envelope at certain dBs may be determined by, or dependent upon, the speech control parameter.
The speech control parameter may be a threshold, wherein zero gain is applied by the dynamic range compression to parts of the processed signal (i.e. the speech) that are below the threshold. The output and threshold may be determined using the unit of decibels.
The speech control parameter may be a threshold, wherein parts of the signal below the threshold are compressed by the dynamic range compression. Compression may refer to reducing or suppressing the envelope amplitude (i.e. the opposite of amplifying). The output and threshold may be determined using the unit of decibels.
The speech control parameter may be a threshold, wherein zero gain is applied to the output of said spectral shaping filter by the input/output envelope characteristic when the output of said spectral shaping filter is below the threshold. The output and threshold may be determined using the unit of decibels.
The speech control parameter may be a threshold, wherein the output of said spectral shaping filter is compressed when the output of said spectral shaping filter is below the threshold. As such, the gradient of the IOEC curve may be less than 1 below the threshold. The output and threshold may be determined using the unit of decibels.
The speech control parameter may be a threshold, wherein zero gain is applied to a portion of the output of said spectral shaping filter below the threshold by the input/output envelope characteristic. The speech control parameter may be a threshold, wherein zero gain is applied to parts of the output of said spectral shaping filter below the threshold by the dynamic range compression.
The speech control parameter may be a threshold in decibels, wherein zero gain is applied to the output of said spectral shaping filter by the dynamic range compression when the output of said spectral shaping filter is below the threshold.
The speech control parameter may define a threshold. The threshold may be in decibels and may define a level, below which it is assumed the signal comprises only noise. This threshold may be a threshold of silence, below which zero gain or compression is applied by the dynamic range compression, or a sub-step thereof.
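By way of illustration only, the sketch below applies such a threshold of silence within the dynamic range compression; whether "zero gain" is realised as a 0 dB pass-through or as suppression is a design choice, and the 0 dB interpretation and the -30 dB default used here are assumptions for the purpose of the example.

import numpy as np

def apply_silence_threshold(envelope_db, gains_db, threshold_db=-30.0):
    # Leave frames whose envelope falls below the threshold unamplified (0 dB gain);
    # they are assumed to contain only noise.
    gains_db = np.array(gains_db, dtype=float, copy=True)
    gains_db[np.asarray(envelope_db) < threshold_db] = 0.0
    return gains_db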
The measurement of the noise of the speech received via said speech input may be the signal to noise ratio; and the speech control parameter may be determined by the speech signal to noise ratio. The processor may be configured to estimate the noise, or the signal to noise ratio, of the speech. The processor may be configured to determine a measurement of the noise of the speech by estimating the signal to noise ratio of the speech.
Other measurements or characterisations of the noise present in a speech signal may also be suitable for use with embodiments disclosed herein.
The measurement of the noise of the speech received via said speech input and the speech control parameter may be updated in real time.
The speech control parameter may be continually updated such that it adapts in real time, or near to real time, to the changing noise environment. For example, if the above device is built into a mobile telephone and the user is standing outside a noisy room, the device can adapt dependent on whether the door to the room is open or closed. Similarly, if the device is used in a public address system in a railway station, the system can adapt in real time to the changing noise conditions as trains arrive and depart.
The measurement of the noise of the speech received via said speech input may be determined on a frame by frame basis and the measurement of the noise of the speech received via said speech input for a previous frame may be used to update the measurement for a current frame.
The measurement of the noise of the speech received via said speech input may alternatively be determined on a frame by frame basis and the measurement or estimation of the noise of the speech received via said speech input for a previous frame may be used to update the measurement for a current frame. A typical frame length may be from 1 to 3 seconds.
There may be a linear relationship between the speech control parameter and the speech signal to noise ratio.
Alternatively, there may be a non-linear relationship between the speech control parameter and the speech signal to noise ratio.
The speech control parameter may comprise a plurality of portions or regions, each with a different dependency (e.g. linear, non-linear, non-dependent) between the speech control parameter and the measurement of the noise of the speech (which may be the speech signal to noise ratio). Each portion or region of the speech control parameter may be bounded at specific values of the measurement of the noise of the speech, and may have a different dependency on the measurement of the noise of the speech.
The device may further comprise an energy banking box, said energy banking box being a memory provided in said device and configured to store the total energy of said speech received at said speech input before enhancement, said processor being further configured to redistribute energy from high energy parts of the speech to low energy parts using said energy banking box.
Discussion regarding the energy banking box made in relation to the SSDRC, applies mutatis mutandis to the energy banking box of an embodiment.
The processor may be configured to estimate the maximum probability of voicing when applying the spectral shaping filter and the device may be configured to update the maximum probability of voicing every m seconds, wherein m is a value from 2 to 10.
Discussion made in relation to the probability of voicing with reference to a SSDRC system apply, mutatis mutandis, to that of an embodiment.
Any noise reduction method may be used in embodiments of the present disclosure. Any method for increasing the signal to noise ratio of a speech signal may be used in embodiments.
The noise reduction method may operate on the amplitude spectrum of the speech.
The noise reduction method may estimate the phase of the speech signal.
The noise reduction method may produce a phase-aware estimate of the magnitude of the speech signal.
The noise reduction method may comprise a Wiener filter. The noise reduction method may comprise a series of Wiener filters. A, or each, Wiener filter may provide an amplitude estimation. An amplitude estimation may be a minimum mean square error estimate of the amplitude spectrum of the speech signal.
The noise reduction method may comprise a phase estimation of the speech signal. A phase estimation may use geometry and group delay minimization. The phase estimation may use the magnitude of the speech signal and an estimate of the magnitude of the noise power spectral density.
The estimate of the phase of the speech signal may be used to produce a phase-aware estimate of the magnitude of the speech signal. The phase-aware estimate of the magnitude of the speech signal may be used in a Wiener filter to produce a further estimate of the amplitude of the speech signal. The phase-aware estimate of the magnitude of the speech signal may be used to produce an estimated speech signal.
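By way of illustration, a generic single-channel Wiener-type gain of the kind such a noise reduction method may employ is sketched below, using a decision-directed estimate of the a priori SNR. This is a standard textbook formulation, not the specific phase-aware estimator of the embodiment, and the smoothing constant is an assumption.

import numpy as np

def wiener_gain(noisy_psd, noise_psd, prev_clean_psd, beta=0.98):
    # Decision-directed a priori SNR estimate for one STFT frame, then the
    # standard Wiener gain xi / (xi + 1) applied per frequency bin.
    snr_post = np.maximum(noisy_psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
    snr_prio = beta * prev_clean_psd / np.maximum(noise_psd, 1e-12) \
               + (1.0 - beta) * snr_post
    return snr_prio / (snr_prio + 1.0)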
According to an embodiment, there may be a method for enhancing speech intelligibility, the method comprising: receiving speech to be enhanced at a speech input; and converting the speech to be enhanced to enhanced speech; wherein converting said speech comprises: applying a noise reduction method to the speech received via said speech input to increase the signal to noise ratio of the speech; determining a measurement of the noise of the speech received via said speech input; applying a spectral shaping filter to the output of said noise reduction method; and applying dynamic range compression to the output of said spectral shaping filter; wherein the dynamic range compression comprises a speech control parameter; said speech control parameter being determined by the measurement of the noise of the speech received via said speech input; and wherein the dynamic range compression is dependent on the speech control parameter.
The method may further comprise outputting enhanced speech.
The measurement of the noise of the speech received via said speech input may be determined at the speech input.
The dynamic range compression may comprise an input/output envelope characteristic to control the gain to be applied by said dynamic range compression; and the input/output envelope characteristic may be dependent on the speech control parameter. Applying dynamic range compression may comprise applying the IOEC to the output of said spectral shaping filter.
The measurement of the noise of the speech received via said speech input and the speech control parameter may be updated in real time.
The measurement of the noise of the speech received via said speech input may be determined on a frame by frame basis and wherein the measurement of the noise of the speech received via said speech input for a previous frame is used to update the measurement for a current frame.
There may be a linear relationship between the speech control parameter and the speech signal to noise ratio.
The method may further comprise storing the total energy of said speech received at said speech input before enhancement and redistributing energy from high energy parts of the speech to low energy parts using an energy banking box.
The method may further comprise estimating the maximum probability of voicing when applying the spectral shaping filter, and updating the maximum probability of voicing every m seconds, wherein m is a value from 2 to 10.
The method may further comprise: receiving real-time information concerning a noisy environment in which the enhanced speech is to be output, at an environmental noise input; and wherein converting said speech further comprises: measuring the environmental signal to noise ratio at the environmental noise input; and wherein at least one of said spectral shaping filter and the dynamic range compression comprises an output environment control parameter; and wherein at least one output environment control parameter is updated in real time according to the measured environmental signal to noise ratio.
The measurement of the noise of the speech received via said speech input may be the signal to noise ratio; and the speech control parameter may be determined by the speech signal to noise ratio.
The speech control parameter may be a threshold, wherein no gain is applied to the output of said spectral shaping filter by the dynamic range compression (e.g. the input/output envelope characteristic) when the output of said spectral shaping filter is below the threshold. Applying dynamic range compression may comprise applying the threshold to the output of said spectral shaping filter.
The noise reduction method may operate on the amplitude spectrum of the speech.
The noise reduction method may comprise estimating the phase of the speech signal.
The noise reduction method may comprise producing a phase-aware estimate of the magnitude of the speech signal.
The noise reduction method may estimate the phase of the speech signal; and produce a phase-aware estimate of the magnitude of the speech signal.
According to an embodiment there may be a carrier medium comprising computer readable code configured to cause a computer to perform any method described herein.
Any discussion of features of a device according to an embodiment applies, mutatis mutandis, to discussion of equivalent features in relation to a method according to an embodiment.
Embodiments described herein may have a wide range of applications. Any hearing device such as a hearing aid or ear-piece may comprise an embodiment. These devices receive speech comprising noise and are designed to output speech for a user to hear. The environment in which a user uses these devices is often noisy. Additionally, such devices may be implemented in public speaking equipment, where both the input and output are subject to noise - for example public loudspeaker systems. Mobile phones may also comprise embodiments according to the present disclosure.
Figure 1 is a schematic of a speech intelligibility enhancing system comprising SSDRC.
The system 1 comprises a processor 3 which comprises a program 5 which takes input speech and information about the noise conditions where the speech will be output and enhances the speech to increase its intelligibility in the presence of noise. The storage 7 stores data that is used by the program 5. Details of what data is stored will be described later.
The system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an input for data relating to the speech to be enhanced and also an input for collecting data concerning the real time noise conditions in the places where the enhanced speech is to be output. The type of data that is input may take many forms, which will be described in more detail later. The input 15 may be an interface that allows a user to directly input data. Alternatively, the input may be a receiver for receiving data from an external storage medium or a network.
Connected to the output module 13 is audio output 17.
In use, the system 1 receives data through data input 15. The program 5, executed on processor 3, enhances the inputted speech in the manner which will be described with reference to figures 2 to 8.
Figure 2 is a flow diagram showing the processing steps provided by program 5. In an embodiment, to enhance or boost the intelligibility of the speech, the system comprises a spectral shaping step S21 and a dynamic range compression step S23. These steps are shown in figure 3. The output of the spectral shaping step S21 is delivered to the dynamic range compression step S23.
Step S21 operates in the frequency domain and its purpose is to increase the “crisp” and “clean” quality of the speech signal, and therefore improve the intelligibility of speech even in clear (not-noisy) conditions. This is achieved by sharpening the formant information (following observations in clear speech) and by reducing spectral tilt using pre-emphasis filters (following observations in Lombard speech). The specific characteristics of this sub-system are adapted to the degree of speech frame voicing.
The steps S21 and S23 are shown in more detail in figure 3. For this purpose, several spectral operations are applied, all combined into an algorithm which contains two stages: (i) an adaptive stage S31 (adaptive to the voiced nature of speech segments); and (ii) a fixed stage S33, as shown in figure 4.
In this embodiment, the spectral intelligibility improvements are applied inside the adaptive Spectral Shaping stage S31. In this embodiment, the adaptive spectral shaping stage comprises a first transformation which is a formant sharpening transformation and a second transformation which is a spectral tilt flattening transformation. Both the first and second transformations are adapted to the voiced nature of speech, given as a probability of voicing per speech frame. These adaptive filter stages are used to suppress artefacts in the processed signal especially in fricatives, silence or other “quiet” areas of speech.
Given a speech frame, the probability of voicing, which is determined in step S35, is defined as:
(1)
where α = 1/max(Pv(ti)) is a normalisation parameter, and rms(ti) and z(ti) denote the RMS value and the zero-crossing rate of frame i. A speech frame
si(t) = s(t) wr(t - ti) (2)
is extracted from the speech signal s(t) using a rectangular window wr(t) centred at each analysis instant ti. In an embodiment, the window length is 2.5 times the average fundamental period for the speaker's gender (8.3 ms and 4.5 ms for male and female speakers, respectively). In this particular embodiment, analysis frames are extracted every 10 ms. The two above transformations are adaptive (to the local probability of voicing) filters that are used to implement the adaptive spectral shaping.
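By way of illustration only, the sketch below computes a per-frame voicing measure from the RMS value and the zero-crossing rate and normalises it to a maximum of one. The exact combination used in equation (1) is not reproduced here; the ratio used below is merely an illustrative choice with the expected behaviour (voiced frames having a high RMS value and a low zero-crossing rate), and the 8.3 ms pitch period assumes a male speaker.

import numpy as np

def voicing_scores(s, fs, hop_s=0.010, win_s=2.5 * 0.0083):
    # Frame the signal every 10 ms with a window of 2.5 average pitch periods,
    # then score each frame by RMS / zero-crossing rate and normalise so that
    # the maximum score is 1.
    s = np.asarray(s, dtype=float)
    hop, win = int(hop_s * fs), int(win_s * fs)
    scores = []
    for start in range(0, len(s) - win, hop):
        frame = s[start:start + win]
        rms = np.sqrt(np.mean(frame ** 2))
        signs = np.signbit(frame).astype(np.int8)
        zcr = np.mean(np.abs(np.diff(signs))) + 1e-6
        scores.append(rms / zcr)
    scores = np.asarray(scores)
    return scores / (scores.max() + 1e-12)   # alpha = 1 / max(.) normalisation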
First, the formant shaping filter is applied. The input of this filter is obtained by extracting speech frames si(t) using Hanning windows of the same length as those specified for computing the probability of voicing, then applying an N-point discrete Fourier transform (DFT) in step S37
(3)
and estimating the magnitude spectral envelope E(ωk, ti) for every frame i. The magnitude spectral envelope is estimated using the magnitude spectrum in (3) and a spectral envelope estimation vocoder (SEEVOC) algorithm in step S39. Fitting the spectral envelope by cepstral analysis provides a set of cepstral coefficients, c:
(4)
which are used to compute the spectral tilt, T(ω, ti):
(5)
Thus, the adaptive formant shaping filter is defined as:
(6)
The formant enhancement achieved using the filter defined by equation (6) is controlled by the local probability of voicing Pv(ti) and the β parameter, which allows for an extra noise-dependent adaptivity of Hs.
In an embodiment, β is fixed; in other embodiments, it is controlled in accordance with the environmental signal to noise ratio (SNR) of the environment where the voice signal is to be outputted.
For example, β may be set to a fixed value of β0. In an embodiment, β0 is 0.25 or 0.3. If β is adapted with noise, then for example:
if SNR <= 0, β = β0
if 0 < SNR <= 15, β = β0*(1 - SNR/15)
if SNR > 15, β = 0
The above example assumes a linear relationship between β and the SNR, but a nonlinear relationship could also be used.
The second adaptive (to the probability of voicing) filter which is applied in step S31 is used to reduce the spectral tilt. In an embodiment, the pre-emphasis filter is expressed as:
(7) where ω0 = 0.125π for a sampling frequency of 16 kHz.
In some embodiments, g is fixed; in other embodiments, g is dependent on the SNR of the environment where the voice signal is to be outputted.
For example, g may be set to a fixed value of g0. In an embodiment, g0 is 0.3. If g is adapted with noise, then for example:
if SNR <= 0, g = g0
if 0 < SNR <= 15, g = g0*(1 - SNR/15)
if SNR > 15, g = 0
The above example assumes a linear relationship between g and the SNR, but a nonlinear relationship could also be used.
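By way of illustration, the piecewise-linear SNR adaptation given above for β and g may be sketched as follows; the function simply reproduces the three SNR regions described in the text, and the example SNR value of 7.5 dB is arbitrary.

def adapt_parameter(snr_db, base_value):
    # Full strength at or below 0 dB, linear reduction up to 15 dB, off above 15 dB.
    if snr_db <= 0.0:
        return base_value
    if snr_db <= 15.0:
        return base_value * (1.0 - snr_db / 15.0)
    return 0.0

beta = adapt_parameter(snr_db=7.5, base_value=0.3)  # e.g. beta0 = 0.3 gives beta = 0.15
g = adapt_parameter(snr_db=7.5, base_value=0.3)     # e.g. g0 = 0.3 gives g = 0.15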
The fixed Spectral Shaping step (S33) is a filter Hf(ω) used to protect the speech signal from low-pass operations during its reproduction. In frequency, Hf boosts the energy between 1000 Hz and 4000 Hz by 12 dB/octave and reduces by 6 dB/octave the frequencies below 500 Hz. Both voiced and unvoiced speech segments are equally affected by the low-pass operations. In this embodiment, the filter is not related to the probability of voicing.
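By way of illustration only, the stated magnitude behaviour of the fixed shaping filter may be sketched as a per-frequency gain curve as below; the realisation as a per-DFT-bin weighting, and holding the gain constant above 4000 Hz, are assumptions, since the embodiment does not specify the filter structure at this point.

import numpy as np

def fixed_shaping_gain_db(freqs_hz):
    # Gain in dB per frequency: -6 dB/octave below 500 Hz, flat from 500 Hz to
    # 1000 Hz, +12 dB/octave from 1000 Hz to 4000 Hz, and (as an assumption)
    # held constant above 4000 Hz.
    f = np.asarray(freqs_hz, dtype=float)
    gain = np.zeros_like(f)
    low = (f > 0) & (f < 500.0)
    gain[low] = -6.0 * np.log2(500.0 / f[low])
    band = (f >= 1000.0) & (f <= 4000.0)
    gain[band] = 12.0 * np.log2(f[band] / 1000.0)
    gain[f > 4000.0] = 12.0 * np.log2(4000.0 / 1000.0)
    return gain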
Finally, after the magnitude spectra are modified according to:
(8) the modified speech signal is reconstructed by means of inverse DFT (S41) and Overlap-and-Add, using the original phase spectra as shown in figure 4.
In the above described spectral shaping step, the parameters β and g may be controlled in accordance with real time information about the environmental signal to noise ratio in the environment where the speech is to be outputted.
Returning to figure 2, the dynamic range compression step S23 will be described in more detail with reference to figure 5.
The signal’s time envelope is estimated in step S51 using the magnitude of the analytical signal:
e(n) = (s(n)^2 + ŝ(n)^2)^(1/2) (9)
where ŝ(n) denotes the Hilbert transform of the speech signal s(n). Furthermore, because the estimate in (9) has fast fluctuations, a new estimate e(n) is computed based on a moving average operator with order given by the average pitch of the speaker's gender. In an embodiment, the speaker's gender is assumed to be male since the average fundamental period is longer for men. However, in some embodiments as noted above, the system can be adapted specifically for female speakers with a shorter fundamental period.
The signal is then passed to the DRC dynamic step S53. In an embodiment, during the DRC's dynamic stage S53, the envelope of the signal is dynamically compressed with a 2 ms release time constant and an almost instantaneous attack time constant:
(10)
where αr = 0.15 and αa = 0.0001.
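By way of illustration, the dynamic stage may be sketched as below: a Hilbert-transform envelope, a moving average over roughly one average pitch period, and a one-pole attack/release smoother. The exact recursion of equation (10) is not reproduced; the formulation below, in which a smaller coefficient means faster tracking, is an assumption chosen to match the stated fast-attack, slower-release behaviour.

import numpy as np
from scipy.signal import hilbert

def dynamic_envelope(s, fs, avg_period_s=0.0083, a_release=0.15, a_attack=0.0001):
    # Envelope from the magnitude of the analytic signal (equation (9)),
    # smoothed by a moving average of about one average pitch period, then
    # compressed with a fast attack and a slower release.
    env = np.abs(hilbert(np.asarray(s, dtype=float)))
    n_avg = max(1, int(avg_period_s * fs))
    env = np.convolve(env, np.ones(n_avg) / n_avg, mode="same")
    out = np.empty_like(env)
    prev = env[0]
    for n, e in enumerate(env):
        a = a_attack if e > prev else a_release   # smaller coefficient -> faster tracking
        prev = (1.0 - a) * e + a * prev
        out[n] = prev
    return out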
Following the dynamic stage S53, a static amplitude compression step S55 controlled by an Input-Output Envelope Characteristic (IOEC) is applied.
The IOEC curve depicted in Fig. 6 is a plot of the desired output in decibels against the input in decibels. Unity gain is shown as a straight dotted line and the desired gain to implement DRC is shown as a solid line. This curve is used to generate time-varying gains required to reduce the envelope’s variations. To achieve this, first the dynamically compressed e(n) is transposed in dB
(11) setting the reference level e0 to 0.3 times the maximum level of the signal's envelope, a selection that provided good listening results for a broad range of SNRs. Then, applying the IOEC to (11) generates eout(n) and allows the computation of the time-varying gains:
(12)
The time-varying gains then produce the DRC-modified speech signal
(13)
which is shown in figure 7(b). Figure 7(a) shows the speech before modification.
As a final step, the global power of sg(n) is altered to match that of the unmodified speech signal.
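By way of illustration only, the static stage may be sketched as below, with the IOEC supplied as a callable mapping an input level in dB to a desired output level in dB; the exact dB transposition of equation (11) and the choice of reference signal for the final power matching are assumptions made for the example.

import numpy as np

def static_drc(shaped_speech, envelope, ioec_db, reference=None, e0_frac=0.3):
    # Transpose the envelope to dB relative to e0 = 0.3 x its maximum, map it
    # through the IOEC, turn the dB difference into time-varying gains, apply
    # them, then rescale so the global power matches the reference signal.
    shaped_speech = np.asarray(shaped_speech, dtype=float)
    envelope = np.asarray(envelope, dtype=float)
    e0 = e0_frac * np.max(envelope)
    e_in_db = 20.0 * np.log10(np.maximum(envelope, 1e-12) / e0)
    e_out_db = ioec_db(e_in_db)
    gains = 10.0 ** ((e_out_db - e_in_db) / 20.0)
    sg = gains * shaped_speech
    ref = shaped_speech if reference is None else np.asarray(reference, dtype=float)
    return sg * np.sqrt(np.sum(ref ** 2) / max(np.sum(sg ** 2), 1e-12))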
In the IOEC curve of figure 6, a threshold of silence, defined as an input value below which unity gain is applied to the input, is set at -30dB.
In an embodiment, the IOEC curve is controlled in accordance with the SNR where the speech is to be output. Such a curve is shown in figure 8.
In figure 8, as the current SNR λ increases from a specified minimum value λmin towards a maximum value λmax, the IOEC is modified from the curve depicted in Fig. 6 towards the bisector of the first quadrant angle. At λmin, the signal's envelope is compressed by the baseline DRC as shown by the solid line, while at λmax no compression takes place. In between, different morphing strategies may be used for the SNR-adaptive IOEC. The levels λmin and λmax are given as input parameters for each type of noise; e.g., for SSN type noise they may be chosen as -9 dB and 3 dB. A piecewise linear IOEC (such as the one given in Figure 8) is obtained using a discrete set of M points Pi, i = 0, ..., M - 1. Further on, xi and yi will denote respectively the input and output levels of the IOEC at point i. Also, the discrete family of M points denoted as Pi = (xi, yi(λ)) in Figure 8 parameterises the modified IOEC with respect to a given SNR λ. In this context, the noise-adaptive IOEC segment (Pi, Pi+1) has the following analytical expression:
y(x, λ) = a(λ) x + b(λ), for xi <= x <= xi+1 (14)
where a(λ) is the segment's slope
a(λ) = (yi+1(λ) - yi(λ)) / (xi+1 - xi) (15)
and b(λ) is the segment's offset
b(λ) = yi(λ) - a(λ) xi (16)
Two embodiments will now be discussed in which two types of effective morphing method were selected to control the IOEC curve: a linear and a non-linear (logistic) slope variation over λ. For an embodiment where a linear relationship is employed, the following expression may be used for a(λ):
(17)
For the non-linear (logistic) form:
(18) where λ0 is the logistic offset, o0 is the logistic slope, while
(19) and (20)
In an embodiment, λ0 and o0 are constants given as input parameters for each type of noise (e.g., for SSN type of noise they may be chosen -6dB and 2, respectively). In a further embodiment, λ0 and/or o0 may be controlled in accordance with the measured SNR. For example, they may be controlled as described above for β and g with a linear relationship on the SNR.
Finally, imposing P0(λ) = P0, the adaptive IOEC is computed for a given λ, taking the expression (17) or (18) as the slope for each of its segments i = 1, ..., M - 1. Then, using (14), the new piecewise linear IOEC is generated.
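By way of illustration, one simple way to realise the morphing of the IOEC towards the bisector is to interpolate the breakpoint output levels directly, as sketched below; this breakpoint interpolation is an illustrative stand-in for the per-segment slope expressions (17) and (18) and is not the formulation of the embodiment.

import numpy as np

def adaptive_ioec_points(base_points, snr_db, snr_min=-9.0, snr_max=3.0):
    # Interpolate each breakpoint's output level between the baseline curve
    # (at snr_min) and the bisector y = x (at snr_max).
    pts = np.asarray(base_points, dtype=float)            # shape (M, 2): rows are (x_i, y_i)
    t = np.clip((snr_db - snr_min) / (snr_max - snr_min), 0.0, 1.0)
    x, y = pts[:, 0], pts[:, 1]
    return np.column_stack([x, (1.0 - t) * y + t * x])

def apply_ioec(points, e_in_db):
    # Evaluate the piecewise-linear IOEC at the given input levels in dB.
    return np.interp(e_in_db, points[:, 0], points[:, 1])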
Psychometric measurements have indicated that speech intelligibility changes with SNR following a logistic function of the type used in accordance with the above embodiment.
In the above embodiments, the spectral shaping step S21 and the DRC step S23 are very fast processes, which allows real-time execution and produces modified speech of high perceptual quality.
Systems in accordance with the above described embodiments show enhanced performance in terms of speech intelligibility gain, especially for low SNRs. They also provide suppression of audible artefacts inside the modified speech signal at high SNRs. At high SNRs, increasing the amplitude of low energy segments of speech (such as unvoiced speech) can cause perceptual quality and intelligibility degradation.
Systems and methods in accordance with the above embodiments provide a light, simple and fast method to adapt dynamic range compression to the noise conditions, inheriting high speech intelligibility gains at low SNRs from the non-adaptive DRC and improving perceptual quality and intelligibility at high SNRs.
Returning to figure 2, an entire system is shown where stages S21 and S23 have been described in detail with reference to figures 3 to 8.
If speech is not present the system is off. In stage S61 a voice activity detection module is provided to detect the presence of speech. Once speech is detected, the speech signal is passed for enhancement. The voice activity detection module may employ a standard voice activity detection (VAD) algorithm.
The speech will be output at speech output 63. Sensors are provided at speech output 63 to allow the noise and SNR at the output to be measured. The SNR determined at speech output 63 is used to calculate β and g in stage S21. Similarly, the SNR λ is used to control stage S23 as described in relation to figure 5 above.
The current SNR at frame t is predicted from previous frames of noise as they have already been observed in the past (t-1, t-2, t-3, ...). In an embodiment, the SNR is estimated using long windows in order to avoid fast changes in the application of stages S21 and S23. In an example, the window lengths can be from 1 s to 3 s.
The system of figure 2 is adaptive in that it updates the filters applied in stage S21 and the IOEC curve of step S23 in accordance with the measured SNR. However, the system of figure 2 also adapts stages S21 and/or S23 dependent on the input voice signal, independent of the noise at speech output 63. For example, in stage S23, the maximum probability of voicing can be updated every n seconds, where n is a value between 2 and 10; in one embodiment, n is from 3 to 5.
In stage S23, in the above embodiment, e0 was set to 0.3 times the maximum value of the signal envelope. This maximum value can be continually updated dependent on the input signal. Again, it can be updated every n seconds, where n is a value between 2 and 10; in one embodiment, n is from 3 to 5.
The initial values for the maximum probability of voicing and the maximum value of the signal envelope are obtained from database 65 where speech signals have been previously analysed and these parameters have been extracted. These parameters are passed to parameter update stage S67 with the speech signal and stage S67 updates these parameters.
In an embodiment of the dynamic range compression, energy is distributed over time. This modification is constrained by the following condition: the total energy of the signal before and after modification should remain the same (otherwise one could increase intelligibility simply by increasing the energy of the signal, i.e. the volume). Since the signal which is modified is not known a priori, Energy Banking box 69 is provided. In box 69, energy from the most energetic parts of speech is "taken" and saved (as in a bank) and it is then distributed to the less energetic parts of speech. These less energetic parts are very vulnerable to the noise. In this way, the distribution of energy helps the overall modified signal to be above the noise level.
In an embodiment, this can be implemented by modifying equation (13) to be:
(20a)
where α(n) is calculated from the values saved in the energy banking box to allow the overall modified signal to be above the noise level.
(21)
where E(sg(n)) is the energy of the enhanced signal sg(n) for frame n and E(Noise(n)) is the energy of the noise for the same frame.
If E(sg(n)) < E(Noise(n)), the system attempts to further distribute energy to boost low energy parts of the signal so that they are above the level of the noise. However, the system only attempts to further distribute the energy if there is energy Eb stored in the energy banking box.
If the gain g(n)<1, then the energy difference between the input signal and the enhanced signal (E(s(n))-E(sg(n))) is stored in the energy banking box. The energy banking box stores the sum of these energy differences where g(n)<1 to provide the stored energy Eb.
To calculate α(n) when E(sg(n)) < E(Noise(n)), a bound on α(n) is derived as
(22)
A second expression α2(n) for α(n) is derived using Eb:
(23)
where γ is a parameter chosen such that 0 < γ < 1, which expresses the percentage of the energy bank that can be allocated to a single frame. In an embodiment, γ = 0.2, but other values can be used.
(24)
However, (25)
When energy is distributed as above, the energy is removed from the energy banking box Eb such that the new value of Eb is:
(26)
Once α(n) is derived, it is applied to the enhanced speech signal in step S71.
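Purely as an illustration, the following Python sketch shows one way the energy banking logic described above could be organised. Equations (21) to (26) are not reproduced in this text, so the class name EnergyBank, the particular bound used to reach the noise level and the way the banked energy is spent are assumptions for the example rather than the patented formulation; only the deposit rule for frames with g(n) < 1, the condition E(sg(n)) < E(Noise(n)) and the per-frame fraction γ = 0.2 follow the description.

import numpy as np

class EnergyBank:
    # Toy energy banking box: energy removed from louder frames is saved
    # and later redistributed to frames whose energy falls below the noise.
    def __init__(self, gamma=0.2):
        self.Eb = 0.0        # energy currently held in the bank
        self.gamma = gamma   # fraction of the bank usable for a single frame

    def deposit(self, E_in, E_enh):
        # A gain g(n) < 1 removed energy from this frame: bank the difference.
        self.Eb += max(0.0, E_in - E_enh)

    def boost_gain(self, E_enh, E_noise):
        # Return an amplitude gain alpha >= 1 for a frame whose enhanced
        # energy is below the noise energy (illustrative bound only).
        if E_enh >= E_noise or self.Eb <= 0.0:
            return 1.0
        need = E_noise / max(E_enh, 1e-12)                            # reach the noise level
        afford = (E_enh + self.gamma * self.Eb) / max(E_enh, 1e-12)   # what the bank allows
        alpha_sq = min(need, afford)
        self.Eb -= (alpha_sq - 1.0) * E_enh                           # spent energy leaves the bank
        return float(np.sqrt(alpha_sq))

bank = EnergyBank()
bank.deposit(E_in=1.0, E_enh=0.6)                 # a loud frame that was compressed
print(bank.boost_gain(E_enh=0.05, E_noise=0.2))   # boost applied to a quiet frame
print(bank.Eb)                                    # remaining banked energy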
The system of figure 2 can be used with devices producing speech as output (cell phones, TVs, tablets, car navigation etc.) or accepting speech (i.e., hearing aids). The system can also be applied to Public Announcement apparatus. In such a system, there may be a plurality of speech outputs, for example, speakers, located in a number of places, e.g. inside or outside a station, in the main area of an airport and a business lounge. The noise conditions will vary greatly between these environments. The system of figure 2 can therefore be modified to produce one or more speech outputs as shown in figure 9.
The system of figure 9 has been simplified to show a speech input 101, which is then split to provide an input into a first sub-system 103 and a second subsystem 105. Both the first and second subsystems comprise a spectral shaping stage S21 and a dynamic range compression stage S23. The spectral shaping stage S21 and the dynamic range compression stage S23 are the same as those described in relation to figures 2 to 8. Both subsystems comprise a speech output 63 and the SNR at the speech output 63 for the first subsystem is used to calculate β, g and the IOEC curve for stages S21 and S23 of the first subsystem. The SNR at the speech output 63 for the second subsystem 105 is used to calculate β, g and the IOEC curve for stages S21 and S23 of the second subsystem 105. The parameter update stage S67 can be used to supply the same data to both subsystems as it provides parameters calculated from the input speech signal. For clarity the Voice activity detection module and the energy banking box have been omitted from figure 9, but they will both be present in such a system.
Spectral shaping and dynamic range compression (SSDRC) as described above has been shown to be suitable for improving speech intelligibility. The spectral shaping operates in the frequency domain, and the dynamic range compression (DRC) operates primarily in the time domain. The spectral shaping comprises two cascaded subsystems which are adaptive to the probability of voicing: (i) an adaptive sharpening where the formant information is enhanced, and (ii) an adaptive pre-emphasis filter.
Furthermore, a third fixed spectral shaping may be used to prevent attenuation of high frequencies in the speech signal during signal reproduction.
The output of the spectral shaping system may then be input to the DRC, which has a dynamic and a static stage. During the dynamic stage, the envelope of the total time signal may be dynamically compressed with a 2 ms release time constant and an almost instantaneous attack time constant. During the static amplitude compression, the 0 dB reference level is set to 0.3 times the peak of the signal envelope. Thus, DRC enhances the transient components of speech. There is a final stage in SSDRC that ensures the input power and the output power are the same; this guarantees that the gains in intelligibility are not due to signal amplification.
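The following Python sketch illustrates this chain of operations on a single block of samples. The 2 ms release time, the near-instantaneous attack and the 0.3 times peak reference are taken from the description above, whereas the 2:1 static compression curve used here is only a placeholder for the actual IOEC curve, which is not reproduced in this text.

import numpy as np

def drc_sketch(x, fs, release_ms=2.0):
    # Dynamic stage: envelope follower with instantaneous attack and
    # an exponential release of about 2 ms.
    env = np.zeros_like(x)
    rel = np.exp(-1.0 / (fs * release_ms * 1e-3))
    level = 0.0
    for i, v in enumerate(np.abs(x)):
        level = v if v > level else rel * level + (1.0 - rel) * v
        env[i] = level

    # Static stage: compress the envelope about a 0 dB reference placed at
    # 0.3 times the envelope peak (the 2:1 curve is a placeholder).
    ref = 0.3 * env.max()
    env_db = 20.0 * np.log10(np.maximum(env / ref, 1e-6))
    gain = 10.0 ** (-0.5 * env_db / 20.0)
    y = x * gain

    # Final stage: match the output power to the input power so that the
    # intelligibility gain is not due to simple amplification.
    y *= np.sqrt(np.sum(x ** 2) / max(np.sum(y ** 2), 1e-12))
    return y

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) * (0.2 + 0.8 * (t > 0.5))  # quiet then loud
y = drc_sketch(x, fs)
print(round(float(np.sum(x ** 2)), 3), round(float(np.sum(y ** 2)), 3))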
The present disclosure uses a preprocessing input stage to reduce the noise, effectively increasing the input speech’s SNR. Although this type of processing is often referred to in the literature as speech enhancement, in the present disclosure it will be referred to as noise reduction to differentiate it from the speech enhancement of the following SSDRC stage.
Figures 10 and 11 illustrate a noise reduction system or method suitable for use with an embodiment of the present disclosure.
Any noise reduction method, that is, any method or device for increasing the signal to noise ratio of a speech signal, may be used as part of an embodiment. A specific example of a noise reduction method will now be discussed, although it is to be understood that any other noise reduction method may be substituted for that described below.
Most noise reduction methods operate only on the amplitude spectrum of the noisy speech, and ignore the phase. Perhaps the most classic example of this is the Wiener filter, which provides a minimum mean square error (MMSE) estimate of the amplitude spectrum of the speech signal. A noisy speech signal, i.e. speech to be enhanced, is first sampled and converted to the Fourier domain. After the appropriate sampling and conversion to the Fourier domain, the contents of the k-th bin of the l-th frame of the noisy speech signal can be written as
Yk,l = Sk,l + Nk,l    (27)
where Sk,l is a speech contribution and Nk,l is a noise contribution.
The softmask gain of a Wiener filter is then given as
Gk,l = |Ŝk,l|² / (|Ŝk,l|² + |N̂k,l|²)    (28)
where the estimate of the magnitude of the speech signal |Ŝk,l| may be found by recursion, and the estimate of the magnitude of the noise power spectral density (PSD) |N̂k,l| may be found by methods based on optimal smoothing and minimum statistics or unbiased MMSE-based noise power estimation. Once Gk,l has been calculated, the speech estimate is given by
Ŝk,l = Gk,l |Yk,l| e^(jφY,k,l)    (29)
where φY,k,l is the phase of the Fourier domain noisy speech signal, given by
φY,k,l = ∠Yk,l    (30)
The phase of the speech signal can be accurately estimated using a method based on geometry and group delay minimization to estimate the phase φ̂S,k,l.
Using the above method, or any other equivalent method, an improved estimate of the speech signal can then be formed as
(31)
A “phase-aware” estimate of the magnitude of the speech signal, |S′k,l|, can be produced using, for example, MMSE optimal spectral amplitude estimation given the STFT phase. In these methods the initial estimate of the magnitude of the speech signal
|Ŝk,l| and the estimate of the phase of the speech signal φ̂S,k,l are used to produce a phase-aware estimate of the magnitude of the speech signal |S′k,l|. This can then be used in another Wiener filter to produce a phase-aware softmask gain as
G′k,l = |S′k,l|² / (|S′k,l|² + |N̂k,l|²)    (32)
Finally, similar to (28), the estimated speech signal S″k,l is given by
(33)
A full noise reduction system or method is illustrated in figure 10. The noisy speech signal Yk,l is first fed into a Noise PSD Estimation block S100 which produces an estimate of the amplitude of the noise |N̂k,l|, which is used by most of the other blocks. The first Wiener filter S102 then produces an initial estimate of the amplitude of the speech signal |Ŝk,l|, and this is used to generate an estimate of the phase of the speech signal φ̂S,k,l by the Phase Estimation block S104. The Phase-aware Amplitude Estimation block S106 then uses φ̂S,k,l to produce an improved estimate of the amplitude of the speech signal |S′k,l|, which is further refined by the second Wiener filter S108 to produce |S″k,l|. This is then combined with φ̂S,k,l in the Magnitude and Phase Combiner block S110 to produce the final estimate of the speech signal S″k,l.
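A skeleton of this pipeline for a single STFT frame is sketched below in Python. The Wiener gain follows the standard form given in equations (28) and (32); the phase estimator and the phase-aware amplitude estimator are deliberately reduced to placeholders (the noisy phase is reused and the first amplitude estimate is passed through), since the geometry and group delay based phase estimation and the MMSE amplitude estimation given the STFT phase are not reproduced here. The function names and the spectral-subtraction step used to obtain the first speech PSD estimate are assumptions for the example.

import numpy as np

def wiener_gain(speech_psd, noise_psd):
    # Softmask gain of a Wiener filter, as in equations (28) and (32).
    return speech_psd / (speech_psd + noise_psd + 1e-12)

def noise_reduction_sketch(Y, noise_psd):
    # Skeleton of the figure 10 pipeline for one STFT frame Y[k].
    # First Wiener filter (S102): initial amplitude estimate; here the
    # a priori speech PSD comes from simple spectral subtraction.
    speech_psd = np.maximum(np.abs(Y) ** 2 - noise_psd, 0.0)
    G1 = wiener_gain(speech_psd, noise_psd)
    mag1 = G1 * np.abs(Y)

    # Phase estimation (S104) - placeholder: reuse the noisy phase.
    phase_s = np.angle(Y)

    # Phase-aware amplitude estimation (S106) - placeholder: keep mag1.
    mag2 = mag1

    # Second Wiener filter (S108) on the phase-aware amplitude.
    G2 = wiener_gain(mag2 ** 2, noise_psd)
    mag3 = G2 * np.abs(Y)

    # Magnitude and phase combiner (S110): final spectral estimate.
    return mag3 * np.exp(1j * phase_s)

rng = np.random.default_rng(1)
Y = rng.normal(size=257) + 1j * rng.normal(size=257)  # one noisy STFT frame
noise_psd = np.full(257, 0.5)                          # assumed noise PSD (S100)
S_hat = noise_reduction_sketch(Y, noise_psd)
print(S_hat.shape, float(np.abs(S_hat).max()) <= float(np.abs(Y).max()))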
Figure 11 also illustrates a noise reduction system or method for use with an embodiment. The system of figure 11 operates using the same method as that of figure 10, but is represented slightly differently, with the Fourier transform and inverse Fourier transform blocks S111a, S111b explicitly illustrated.
An embodiment of the present disclosure is illustrated in figure 12. A measurement of the noise of the speech to be enhanced, received via said speech input, is determined. This can be done before or after the noise reduction method. The speech to be enhanced is received via a speech input and is fed into a noise reduction subsystem S112 according to that of figure 10. The noise reduction subsystem applies a noise reduction method to increase the signal to noise ratio of the speech. The output of the noise reduction method is processed by the spectral shaper S114, whereby a spectral shaping filter which enhances the frequency characteristics of the speech is applied to the output of the noise reduction method. The output of the spectral shaping filter is converted to the time domain and is modified in the time domain by the dynamic range compression stage S116. The dynamic range compression comprises a speech control parameter which is determined by the measurement of the noise of the speech received via said speech input. In the present embodiment, the measurement of the noise of the speech received via said speech input is the signal to noise ratio and the speech control parameter is a threshold value as described below. Finally, the energy of the output signal is constrained to be the same as that at the input to the spectral shaper by an energy banking box S118.
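Purely to make the order of operations in figure 12 concrete, the following Python sketch chains the stages for one block of samples. The three stages are passed in as callables standing for the noise reduction subsystem S112, the spectral shaper S114 and the dynamic range compression S116; the trivial stand-ins at the end exist only so the sketch runs, and the global power matching is a simplification of the frame-wise energy banking box S118.

import numpy as np

def enhance_block(noisy_speech, input_snr_db, noise_reduce, spectral_shape, drc):
    # Order of operations from figure 12 for one block of samples.
    x = noise_reduce(noisy_speech)        # S112: raise the input SNR
    shaped = spectral_shape(x)            # S114: spectral shaping filter
    y = drc(shaped, input_snr_db)         # S116: SNR-dependent dynamic range compression
    # S118 (simplified): constrain the output energy to that at the input
    # of the spectral shaper.
    y *= np.sqrt(np.sum(x ** 2) / max(np.sum(y ** 2), 1e-12))
    return y

x = np.random.default_rng(3).normal(size=16000)
out = enhance_block(x, input_snr_db=10.0,
                    noise_reduce=lambda s: 0.8 * s,     # stand-in for S112
                    spectral_shape=lambda s: s,         # stand-in for S114
                    drc=lambda s, snr: np.tanh(s))      # stand-in for S116
print(abs(float(np.sum(out ** 2) - np.sum((0.8 * x) ** 2))) < 1e-6)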
Figure 13 schematically illustrates an embodiment implemented in a real world scenario in which speech from a speaker is input along with environmental noise, as the speech to be enhanced, received via a speech input. The noise reduction and SSDRC are applied to the received speech and high intelligibility speech is output to a listener in a noisy output environment.
The part of SSDRC most sensitive to noise on the input was the DRC. This is because the DRC is intended to transfer energy over time from louder speech segments (such as voiced speech) to quieter (often unvoiced) parts of the speech signal, resulting in short passages of noise being amplified and reducing intelligibility. In order to counter this, the input/output envelope characteristic (IOEC) curve is modified. The modification changes the threshold of silence of the IOEC curve based on the input SNR γ. The threshold of silence is denoted ξ. In the clean speech scenario, γ = ∞ and ξ = 30 dB. ξ is varied as:
(34)
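Since equation (34) is not reproduced in this text, the Python sketch below only illustrates how a threshold of silence could be used and adapted: envelope values below ξ receive zero gain (as set out in the description and claims), and ξ is moved from its clean-speech value towards the reference level as the input SNR γ falls. The specific mapping in adapt_xi, the sign convention placing ξ at -30 dB relative to the 0 dB reference, and the 2:1 curve above the threshold are assumptions for the example, not the patented relation.

import numpy as np

def ioec_gain_db(env_db, xi_db):
    # IOEC with a threshold of silence: envelope values (in dB, relative
    # to the 0 dB reference) below xi get zero gain; values above it
    # follow a placeholder 2:1 upward compression curve.
    env_db = np.asarray(env_db, dtype=float)
    return np.where(env_db < xi_db, 0.0, -0.5 * env_db)

def adapt_xi(input_snr_db, xi_clean_db=-30.0):
    # Hypothetical stand-in for equation (34): xi keeps its clean-speech
    # value for very high input SNR and is raised towards the reference
    # level as the input gets noisier, so that noise filling the speech
    # pauses is not boosted.
    if np.isinf(input_snr_db):
        return xi_clean_db
    return float(np.clip(xi_clean_db + (30.0 - input_snr_db), xi_clean_db, 0.0))

env = np.array([-60.0, -35.0, -20.0, -5.0])     # envelope samples in dB
print(ioec_gain_db(env, adapt_xi(np.inf)))      # clean input speech
print(ioec_gain_db(env, adapt_xi(10.0)))        # noisy input speech (gamma = 10 dB)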
The overall effect of each subsystem in an embodiment is hard to predict, so changes have to be carefully considered. Indeed, optimizing each block in the whole system independently will almost certainly not result in the best overall performance.
When choosing parameters for the noise reduction subsystem, a balance must be maintained between eliminating as much of the noise in the pauses between speech as possible and minimizing the distortion of the speech.
The gain of the final Wiener filter may be smoothed in the cepstral domain. This may decrease the so-called “musical noise” in the output of the noise reduction subsystem.
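A minimal Python sketch of such cepstral-domain smoothing of a gain curve is given below. Low-quefrency coefficients of the log gain, which describe its broad envelope, are left untouched, while higher-quefrency coefficients, which carry the rapid frame-to-frame fluctuations responsible for musical noise, are recursively smoothed over time. The smoothing constants and the low/high split are illustrative values; published variants of this idea additionally protect the pitch quefrency, which is omitted here.

import numpy as np

def cepstral_smooth_gain(gain, prev_ceps=None, beta_high=0.8, n_low=4):
    # Smooth a spectral gain curve in the cepstral domain.
    n_fft = 2 * (len(gain) - 1)
    log_gain = np.log(np.maximum(gain, 1e-6))
    ceps = np.fft.irfft(log_gain, n=n_fft)            # real, even-symmetric cepstrum

    if prev_ceps is None:
        prev_ceps = np.zeros_like(ceps)
    beta = np.full(n_fft, beta_high)                   # strong smoothing by default
    beta[:n_low] = 0.0                                 # keep the gain envelope
    if n_low > 1:
        beta[-(n_low - 1):] = 0.0                      # mirrored low quefrencies
    ceps_s = beta * prev_ceps + (1.0 - beta) * ceps    # recursive temporal smoothing

    smoothed = np.exp(np.fft.rfft(ceps_s).real)        # back to linear gains
    return np.clip(smoothed, 0.0, 1.0), ceps_s

rng = np.random.default_rng(2)
g = rng.uniform(0.05, 1.0, 129)                        # a fluctuating gain curve
g_smooth, state = cepstral_smooth_gain(g)              # state feeds the next frame
print(round(float(g.std()), 3), round(float(g_smooth.std()), 3))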
Figures 14, 15 and 16 illustrate the performance of SSDRC systems and embodiments according to the present disclosure. Improved speech intelligibility is a very subjective measurement and the proposed system or device of figure 12 is non-linear. Furthermore, speech intelligibility is to some extent both speech-sample- and speaker-dependent. Thus, all the evaluations were done on 5 different speakers (three males, two females) each saying 5 different sentences, for a total of 25 different samples. The clean speech was recorded at 16 kHz, to which the appropriately scaled speech-shaped noise was added. The extended speech intelligibility index (ESII) was used as the final performance indicator in figures 14 to 16. Nonetheless, the perceptual evaluation of speech quality (PESQ) may be used at intermediate stages to help tune and test the system, as well as informal intelligibility listening tests.
Figure 14 illustrates how the ESII of SSDRC speech deteriorates as the input speech SNR γ decreases.
The performance of an embodiment is shown in figure 15, from which it is clear that the use of the noise reduction subsystem of ntSSDRC allows it to regain some of the performance lost due to noisy input speech. Each figure plots the ESII against the listening SNR, e.g. the SNR at an environmental noise input for receiving information concerning the noisy output environment. In each figure there is a plot of SSDRC with clean input speech, noisy speech without any speech processing, SSDRC with noisy speech and ntSSDRC with noisy speech. The input signal to noise ratio (SNR) for the “noisy speech” is specified below each graph, i.e. in figure 15(a) the SNR = 0 dB; in figure 15(b) the SNR = 10 dB; and in figure 15(c) the SNR = 20 dB.
The gains in intelligibility are greatest at the lower input SNR values, especially at 0 dB. Generally speaking, it could be argued that the use of ntSSDRC provides the same ESII as SSDRC with an input SNR 10 dB higher. Thus, ntSSDRC could be said to provide a 10 dB gain in intelligibility. This is further illustrated in figure 16, in which the ESII of output speech of ntSSDRC and SSDRC provided with input speech with a range of different SNRs is plotted.
The above system was implemented in C++, on a Windows 8 laptop with a Core i7 processor running at 2.4 GHz. With input and output soundcard frame sizes of 424 samples at 44.1 kHz, less than 20% of the available processing time between audio interrupts was needed to perform the processing of the proposed system, confirming its suitability as a real-time system. The majority of the processing time is taken by the phase estimation algorithm of the noise reduction subsystem.
The present disclosure provides a face-to-face communication system and device designed to work in a significantly noisy environment. Furthermore, the present device is suitable for real-time application.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel devices, systems and methods described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.

Claims (20)

CLAIMS:
1. A speech intelligibility enhancing device for enhancing speech to be output in a noisy output environment, the device comprising: a speech input for receiving speech to be enhanced; a speech output for outputting enhanced speech; and a processor configured to convert the speech, received via said speech input, to enhanced speech to be output by said speech output, the processor being configured to: apply a noise reduction method to the speech to increase the signal to noise ratio of said speech; determine a measurement of the noise of the speech; apply a spectral shaping filter to the output of said noise reduction method; and apply dynamic range compression to the output of said spectral shaping filter; wherein the dynamic range compression comprises a speech control parameter; said speech control parameter being determined by the measurement of the noise of the speech and wherein the dynamic range compression is dependent on the speech control parameter.
2. A speech intelligibility enhancing device according to claim 1, further comprising an environmental noise input for receiving real-time information concerning the noisy output environment; wherein the processor is further configured to measure the environmental signal to noise ratio at the environmental noise input; and wherein at least one of said spectral shaping filter and said dynamic range compression comprises an output environment control parameter; and wherein at least one output environment control parameter is updated in real time according to the measured environmental signal to noise ratio.
3. A speech intelligibility enhancing device according to claim 1, wherein the processor is further configured such that the measurement of the noise of the speech is determined after the noise reduction method is applied to the speech.
4. A speech intelligibility enhancing device according to claim 1, wherein the dynamic range compression comprises an input/output envelope characteristic to control the gain to be applied by said dynamic range compression; and the input/output envelope characteristic is dependent on the speech control parameter.
5. A speech intelligibility enhancing device according to claim 4, wherein the speech control parameter is a threshold, wherein zero gain is applied to a portion of the output of said spectral shaping filter below the threshold by the input/output envelope characteristic.
6. A speech intelligibility enhancing device according to claim 1, wherein said measurement of the noise of the speech is the signal to noise ratio; and the speech control parameter is determined by the speech signal to noise ratio.
7. A speech intelligibility enhancing device according to claim 1, wherein the measurement of the noise of the speech and the speech control parameter are updated in real time.
8. A speech intelligibility enhancing device according to claim 1, wherein the measurement of the noise of the speech is determined on a frame by frame basis and wherein the measurement of the noise of the speech for a previous frame is used to update the measurement for a current frame.
9. A speech intelligibility enhancing device according to claim 7, wherein there is a linear relationship between the speech control parameter and the speech signal to noise ratio.
10. A speech intelligibility enhancing device according to claim 1, wherein the device further comprises an energy banking box, said energy banking box being a memory provided in said device and configured to store the total energy of said speech before enhancement, said processor being further configured to redistribute energy from high energy parts of the speech to low energy parts using said energy banking box.
11. A speech intelligibility enhancing device according to claim 1, wherein the processor is configured to estimate the maximum probability of voicing when applying the spectral shaping filter, and wherein the device is configured to update the maximum probability of voicing every m seconds, wherein m is a value from 2 to 10.
12. A speech intelligibility enhancing device according to claim 1, wherein the noise reduction method operates on the amplitude spectrum of the speech.
13. A speech intelligibility enhancing device according to claim 12, wherein the noise reduction method estimates the phase of the speech signal.
14. A speech intelligibility enhancing device according to claim 13, wherein the noise reduction method produces a phase-aware estimate of the magnitude of the speech signal.
15. A method for enhancing speech intelligibility, the method comprising: receiving speech to be enhanced at a speech input; and converting the speech to be enhanced to enhanced speech; wherein converting said speech comprises: applying a noise reduction method to the speech to increase the signal to noise ratio of the speech; determining a measurement of the noise of the speech; applying a spectral shaping filter to the output of said noise reduction method; and applying dynamic range compression to the output of said spectral shaping filter; wherein the dynamic range compression comprises a speech control parameter; said speech control parameter being determined by the measurement of the noise of the speech; and wherein the dynamic range compression is dependent on the speech control parameter.
16. A method for enhancing speech intelligibility according to claim 15, wherein the method further comprises: receiving real-time information concerning a noisy environment in which the enhanced speech is to be output, at an environmental noise input; and wherein converting said speech further comprises: measuring the environmental signal to noise ratio at the environmental noise input; and wherein at least one of said spectral shaping filter and the dynamic range compression comprises an output environment control parameter; and wherein at least one output environment control parameter is updated in real time according to the measured environmental signal to noise ratio.
17. A method for enhancing speech intelligibility according to claim 15, wherein said measurement of the noise of the speech is the signal to noise ratio; and the speech control parameter is determined by the speech signal to noise ratio.
18. A method for enhancing speech intelligibility according to claim 15, wherein the speech control parameter is a threshold, wherein zero gain is applied to the output of said spectral shaping filter by an input/output envelope characteristic when the output of said spectral shaping filter is below the threshold.
19. A method for enhancing speech intelligibility according to claim 15, wherein the noise reduction method: estimates the phase of the speech signal; and produces a phase-aware estimate of the magnitude of the speech signal.
20. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 15.