
CN108520756B - Method and device for separating speaker voice - Google Patents

Method and device for separating speaker voice

Info

Publication number
CN108520756B
CN108520756B (application number CN201810231676.XA)
Authority
CN
China
Prior art keywords
audio signal
audio
processing
carrying
sound source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810231676.XA
Other languages
Chinese (zh)
Other versions
CN108520756A (en)
Inventor
孙学京
刘恩
张晨
张兴涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tuoling Inc
Original Assignee
Beijing Tuoling Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tuoling Inc filed Critical Beijing Tuoling Inc
Priority to CN201810231676.XA priority Critical patent/CN108520756B/en
Publication of CN108520756A publication Critical patent/CN108520756A/en
Application granted granted Critical
Publication of CN108520756B publication Critical patent/CN108520756B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 - Voice signal separating
    • G10L 21/028 - Voice signal separating using properties of sound source
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L 2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166 - Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a method and a device for separating speaker voices. The method comprises the following steps: acquiring an audio signal in a preset format; preprocessing the audio signal to obtain a processed first audio signal; performing audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions; enhancing the second audio signals to obtain enhanced third audio signals of the speakers in different directions; and outputting the third audio signals. With the technical scheme of the invention, the audio signals of multiple speakers in different directions can be separated quickly and accurately.

Description

Method and device for separating speaker voice
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method and a device for separating voices of speakers.
Background
With the development of science and technology, demands on audio quality are rising in many fields, the ways of acquiring audio documents are becoming richer, the volume of data is growing explosively, and managing audio documents is becoming increasingly difficult. In recent years, audio search technology has been studied in order to manage multimedia audio documents such as telephone audio, broadcast audio, and conference audio. Conference speech is the most difficult to retrieve, because a conference audio document contains multiple channels and many speakers.
Existing audio separation methods fall mainly into single-channel (single-microphone) techniques and multi-channel (multi-microphone) techniques. Single-microphone techniques mainly comprise model-based audio separation methods and distance-scale-based separation methods; multi-microphone techniques mainly comprise beam forming separation methods and blind source separation methods.
The model-based audio separation method comprises two steps, training and recognition. During training, features are extracted from the input audio, a model is trained on them, and the trained model is stored. During recognition, features are extracted from the input audio, speaker separation and speaker clustering are performed, matching calculations are then made against the stored model, each speaker is identified, and the separated audio signals are finally obtained. The distance-scale-based separation method computes, at each point, the distance between the two adjacent signals in windows of a certain length to the left and right of that point, and compares this distance with a set threshold to find the change points of the audio signal, thereby obtaining the separated audio signals. The beam forming separation method performs real-time sound source localization on the input audio and then enhances it according to each speaker's direction to obtain each speaker's audio signal. The blind source separation method performs blind source separation processing on the input audio to obtain the audio signal of each speaker.
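The distance-scale approach described above can be sketched in a few lines; the window length, the threshold, and the use of a Euclidean distance between window means are illustrative choices, not values taken from this patent:

```python
import numpy as np

def change_points(features, win=50, threshold=2.0):
    """Mark indices where the distance between the windows to the left
    and right of each point exceeds a set threshold (illustrative only)."""
    points = []
    for t in range(win, len(features) - win):
        left = features[t - win:t]
        right = features[t:t + win]
        # A simple distance scale: Euclidean distance between window means
        d = np.linalg.norm(left.mean(axis=0) - right.mean(axis=0))
        if d > threshold:
            points.append(t)
    return points

# Toy feature sequence whose statistics jump at index 100
feats = np.concatenate([np.zeros((100, 4)), 5 * np.ones((100, 4))])
pts = change_points(feats)
```

As the background notes, such a detector tends to flag many redundant points around each true change, which is exactly the weakness attributed to this family of methods.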
However, the model-based separation method requires each speaker in a conversation to speak continuously for a long time, and its algorithmic complexity is too high; the distance-scale-based separation method suffers from problems such as detecting too many redundant segmentation points. Methods such as beam forming separation and blind source separation are mainly designed for linear and planar microphone arrays, and their performance in complex environments is insufficient.
Therefore, separating the audio signals of multiple speakers in different directions relatively quickly and accurately in a complex environment is a technical problem that currently needs to be solved.
Disclosure of Invention
The invention aims to provide a method and a device for separating speaker voices, which realize rapid and accurate separation of the audio signals of multiple speakers in different directions.
In order to achieve the above object, the present invention provides a method for separating speaker voice, comprising:
acquiring an audio signal with a preset format;
preprocessing the audio signal to obtain a processed first audio signal;
carrying out audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions;
enhancing the second audio signal to obtain enhanced third audio signals of speakers in different directions;
outputting the third audio signal.
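The five claimed steps can be sketched as a simple pipeline; all function names below are hypothetical placeholders for the processing stages, not names from the patent:

```python
def separate_speakers(raw_audio, array_params, env_params,
                      preprocess, separate, enhance):
    """Chain the claimed steps: preprocess -> separate -> enhance -> output.
    The three stage callables are placeholders for the concrete methods."""
    first = preprocess(raw_audio, array_params, env_params)
    second = separate(first)                 # per-direction speaker signals
    third = [enhance(sig) for sig in second]
    return third

# Trivial stand-in stages, just to show the data flow
out = separate_speakers([1, 2, 3], None, None,
                        preprocess=lambda x, a, e: x,
                        separate=lambda x: [x, x],
                        enhance=lambda s: s)
```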
Further, in the method, the preprocessing the audio signal to obtain a processed first audio signal includes:
obtaining a placing mode parameter and a surrounding environment parameter of the microphone array;
according to the placing mode parameters of the microphone array, carrying out conversion processing on the audio signals to obtain converted audio signals located on the same plane;
performing time-frequency transformation on the converted audio signal to obtain a frequency domain signal corresponding to the converted audio signal;
according to the ambient environment parameters, carrying out audio enhancement processing on the frequency domain signal to obtain an enhanced frequency domain signal;
and performing time-frequency inverse transformation on the enhanced frequency domain signal to obtain a time domain signal serving as the first audio signal.
Further, in the method, performing audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions includes:
acquiring a sound source positioning result and a speaker recognition result corresponding to the first audio signal according to the first audio signal;
and carrying out audio separation processing on the first audio signal according to the sound source positioning result and the speaker identification result to obtain the second audio signal.
Further, in the method, obtaining a sound source localization result and a speaker recognition result corresponding to the first audio signal according to the first audio signal includes:
carrying out voice detection processing on the first audio signal to obtain a detection result;
according to the detection result, carrying out sound source positioning processing on the first audio signal to obtain a sound source positioning result;
and carrying out speaker recognition processing on the first audio signal according to a preset recognition model to obtain the speaker recognition result.
Further, in the method, performing audio separation processing on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signal includes:
and carrying out audio separation processing on the first audio signal by utilizing a beam forming method according to the sound source positioning result and the speaker identification result to obtain the second audio signal.
Further, in the method, performing audio separation processing on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signal includes:
selecting an audio separation method corresponding to the sound source positioning result;
and according to the speaker identification result, carrying out audio separation processing on the first audio signal by using the audio separation method to obtain the second audio signal.
Further, in the method, performing enhancement processing on the second audio signal to obtain an enhanced third audio signal includes:
and based on the speaker recognition result, carrying out smoothing processing and audio conversion point position correction processing on the second audio signal to obtain the third audio signal.
The invention also provides a speaker voice separation device, comprising:
the acquisition module is used for acquiring an audio signal in a preset format;
the preprocessing module is used for preprocessing the audio signal to obtain a processed first audio signal;
the audio separation module is used for carrying out audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions;
the enhancement processing module is used for carrying out enhancement processing on the second audio signal to obtain an enhanced third audio signal;
and the output module is used for outputting the third audio signal.
With the method and device for separating speaker voices provided by the invention, an audio signal in a preset format is preprocessed to obtain a processed first audio signal; audio separation processing is performed on the first audio signal to obtain second audio signals of speakers in different directions; the second audio signals are enhanced to obtain enhanced third audio signals of the speakers in different directions; and the third audio signals are output, thereby realizing rapid and accurate separation of the audio signals of multiple speakers in different directions.
Drawings
FIG. 1 is a flow chart of a method of speaker voice separation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a microphone array arrangement for acquiring four audio signals according to the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a speaker voice separation apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions are described below clearly and completely with reference to the embodiments of the present invention and the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms first, second and the like in the description and in the claims, and in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated herein.
The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Example 1
Fig. 1 is a flowchart of an embodiment of a method for separating a speaker voice according to the present invention, and as shown in fig. 1, the method for separating a speaker voice of the present embodiment may specifically include the following steps:
100. and acquiring an audio signal with a preset format.
The audio signal in the preset format in this embodiment may be an audio signal in Ambisonic A format, which consists of four audio signals: Left-Front-Up (LFU), Right-Front-Down (RFD), Left-Back-Down (LBD) and Right-Back-Up (RBU). Fig. 2 is a schematic diagram of a microphone array placement for acquiring the four audio signals according to the present invention.
101. And preprocessing the acquired audio signal to obtain a processed first audio signal.
In a specific implementation, when an audio signal in the preset format is acquired, the placement mode parameters and ambient environment parameters of the microphone array may also be acquired. The acquired audio signal is then converted according to the placement mode parameters of the microphone array to obtain converted audio signals located on the same plane; time-frequency transformation is performed on the converted signals to obtain the corresponding frequency domain signals; audio enhancement processing is performed on the frequency domain signals according to the ambient environment parameters to obtain enhanced frequency domain signals; and finally, inverse time-frequency transformation is performed on the enhanced frequency domain signals to obtain a time domain signal, which serves as the first audio signal.
For example, after the placement mode of the microphone array is obtained, the audio signals may be rotated according to the formula (1) based on the placement mode of the microphone array, so that the obtained audio signals are located on the same plane.
x′ = A · x    (1)

wherein A is a transformation matrix:

[the entries of A are shown only as an image in the original]

wherein θ_h is a first angle, θ_p is a pitch angle, θ_b is a tilt angle, and f(θ_h, θ_p, θ_b) is a function of θ_h, θ_p and θ_b.
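A minimal sketch of this rotation step follows. The patent shows the entries of A only as an image, so the standard yaw/pitch/roll composition used here is an assumption, not the patent's actual f(θ_h, θ_p, θ_b):

```python
import numpy as np

def rotation_matrix(theta_h, theta_p, theta_b):
    """Compose yaw (theta_h), pitch (theta_p) and roll (theta_b) rotations.
    The patent's actual matrix entries are an image in the original, so
    this standard composition is an assumption."""
    ch, sh = np.cos(theta_h), np.sin(theta_h)
    cp, sp = np.cos(theta_p), np.sin(theta_p)
    cb, sb = np.cos(theta_b), np.sin(theta_b)
    Rz = np.array([[ch, -sh, 0], [sh, ch, 0], [0, 0, 1]])   # yaw
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])   # pitch
    Rx = np.array([[1, 0, 0], [0, cb, -sb], [0, sb, cb]])   # roll
    return Rz @ Ry @ Rx

def rotate_soundfield(xyz, theta_h, theta_p, theta_b):
    """Apply x' = A x to the directional components of each sample.
    xyz: (3, num_samples) array of X/Y/Z channel samples."""
    return rotation_matrix(theta_h, theta_p, theta_b) @ xyz
```

With all three angles zero the matrix reduces to the identity, i.e. a level array needs no correction.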
After the converted signal is obtained, time-frequency transform processing can be performed on it channel by channel using methods such as the Discrete Fourier Transform (DFT) or the Fast Fourier Transform (FFT). Taking the DFT as an example, the converted signal may be subjected to time-frequency transform processing according to equation (2):
X(k) = Σ_{n=0}^{L−1} x(n) · e^(−j2πnk/L_f),  k = 0, 1, …, L_f − 1    (2)

wherein n is a time domain index value, k is a frequency domain index value, L is the audio processing frame length, L_f is the time-frequency transform length, j is the imaginary unit, M is the number of sound channels, x(n) is an audio time domain sample value, and X(k) is an audio frequency domain coefficient.
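The per-channel time-frequency transform of equation (2) can be sketched with NumPy's FFT; the frame length and transform length below are illustrative values, not parameters given in the patent:

```python
import numpy as np

def to_frequency_domain(x, frame_len=1024, fft_len=1024):
    """Split one channel into frames and apply the DFT per frame,
    producing X(k) for each frame (a windowless sketch of equation (2))."""
    num_frames = len(x) // frame_len
    frames = x[:num_frames * frame_len].reshape(num_frames, frame_len)
    return np.fft.fft(frames, n=fft_len, axis=1)  # (num_frames, fft_len)

# A pure tone concentrates its energy in one bin (and its mirror bin)
fs, f0 = 16000, 1000
t = np.arange(fs) / fs
spectrum = to_frequency_domain(np.sin(2 * np.pi * f0 * t))
peak_bin = np.argmax(np.abs(spectrum[0]))  # 1000 Hz maps to bin 64 of 1024
```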
After obtaining the frequency domain signal, the noise energy spectrum can be estimated from the 4-channel audio signal, and the reverberation energy spectrum can be estimated using the Reverberation Time (RT60) and Direct-to-Reverberant Energy Ratio (DRR) parameters. Audio enhancement processing is then performed channel by channel based on the estimated noise and reverberation energy spectra, so that the frequency domain signal is denoised and dereverberated, i.e. enhanced.
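The enhancement step based on estimated noise and reverberation energy spectra can be illustrated with a basic spectral-subtraction gain; the floor value and the plain power subtraction are illustrative choices, not details given in the patent:

```python
import numpy as np

def spectral_subtract(noisy_power, noise_power, floor=0.01):
    """Subtract an estimated noise (or reverberation) power spectrum from
    the noisy power spectrum, flooring the result so no bin goes negative
    (classic spectral-subtraction sketch, per frequency bin)."""
    clean = noisy_power - noise_power
    return np.maximum(clean, floor * noisy_power)

noisy = np.array([4.0, 9.0, 1.0])
noise = np.array([1.0, 2.0, 5.0])   # last bin: noise over-estimated
enhanced = spectral_subtract(noisy, noise)
```

The floor matters in practice: over-estimated noise bins would otherwise produce negative energies and audible musical-noise artifacts.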
In the embodiment, the received multi-channel audio signal can be preprocessed according to the placement mode parameters and the surrounding environment parameters of the microphone array, so that the influence of the environment on the subsequent audio separation processing is reduced.
102. And carrying out audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions.
In this embodiment, after obtaining the first audio signal, the sound source positioning result and the speaker recognition result corresponding to the first audio signal may be obtained according to the first audio signal, and the audio separation processing may be performed on the first audio signal according to the sound source positioning result and the speaker recognition result, so as to obtain the second audio signals of speakers in different directions.
In a specific implementation process, the voice detection processing may be performed on the first audio signal to obtain a corresponding detection result, so as to perform sound source localization processing on the first audio signal according to the detection result to obtain a sound source localization result, and perform speaker recognition processing on the first audio signal according to a preset recognition model to obtain a speaker recognition result.
For example, sound source localization may be implemented using the Multiple Signal Classification (MUSIC) algorithm, Generalized Cross Correlation (GCC), and the like. The GCC approach may specifically proceed as follows:
a) calculate the cross-correlation between each pair of audio channels according to formula (3):
[Equation (3): the cross-correlation G(i, j) between channels i and j, computed over frequency bins K_1 to K_2; the formula is shown only as an image in the original]

wherein K_1 is the initial frequency bin and K_2 is the cut-off frequency bin.
b) Smoothing is performed based on the voice detection result according to formula (4):
G_sm(i, j) = G_sm(i, j) · f_sm + (1 − f_sm) · G(i, j)    (4)

wherein f_sm is a smoothing factor:

[the piecewise definition of f_sm, which depends on the voice detection result, is shown only as an image in the original]

and Vad is the voice detection processing result.
c) And further processing the smoothed cross-correlation function to obtain a sound source positioning result.
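Steps a) to c) can be sketched as a cross-correlation delay estimate between two channels. The patent's exact cross-correlation formula (equation (3)) is shown only as an image, so the phase-transform (PHAT) weighting used here is an assumption, and the recursive smoothing of step b) is omitted for brevity:

```python
import numpy as np

def gcc_phat_delay(x_i, x_j, fft_len=1024):
    """Estimate the sample delay between two channels from the peak of
    the phase-transform-weighted cross-correlation (steps a and c)."""
    Xi = np.fft.rfft(x_i, n=fft_len)
    Xj = np.fft.rfft(x_j, n=fft_len)
    cross = Xi * np.conj(Xj)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting (assumed)
    cc = np.fft.irfft(cross, n=fft_len)
    shift = np.argmax(np.abs(cc))
    # Map peaks in the upper half to negative delays
    return shift if shift < fft_len // 2 else shift - fft_len

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
delay = gcc_phat_delay(np.roll(x, 5), x)   # first channel lags by 5 samples
```

In a real array the per-pair delays would then be mapped to an arrival angle via the known microphone geometry.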
In this embodiment, speaker recognition may be performed based on a model, such as a Gaussian Mixture Model (GMM), a Hidden Markov Model (HMM) or a Deep Neural Network (DNN), to obtain the speaker recognition result.
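Model-based speaker recognition can be illustrated with a maximum-likelihood decision. The single diagonal Gaussian per speaker below is a deliberately simplified stand-in for the GMM/HMM/DNN models mentioned, and all names and values are hypothetical:

```python
import numpy as np

def identify_speaker(frames, models):
    """Pick the speaker whose diagonal-Gaussian model gives the highest
    total log-likelihood over the feature frames (single-Gaussian
    stand-in for a full GMM, illustrative only)."""
    def loglik(feats, mean, var):
        return -0.5 * np.sum((feats - mean) ** 2 / var + np.log(2 * np.pi * var))
    scores = {name: loglik(frames, m, v) for name, (m, v) in models.items()}
    return max(scores, key=scores.get)

# Hypothetical stored models: (mean, variance) per speaker
models = {"spk_a": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
          "spk_b": (np.array([5.0, 5.0]), np.array([1.0, 1.0]))}
frames = np.full((10, 2), 4.8)             # features near speaker B's mean
who = identify_speaker(frames, models)
```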
After the sound source positioning result and the speaker recognition result are obtained, beam forming can be used to perform audio separation processing on the first audio signal to obtain the second audio signals of speakers in different directions.
Or selecting an audio separation method corresponding to the sound source positioning result, and performing audio separation processing on the first audio signal by using the audio separation method according to the speaker identification result to obtain second audio signals of speakers in different directions.
For example, audio separation processing can be performed using equation (5) to obtain the second audio signals of speakers in different directions.
[Equation (5): the audio separation filtering built from the weighting factors V_doa and V_spe; the formula is shown only as an image in the original]
wherein V_doa is the weighting factor in the sound source direction:

[the definition of V_doa is shown only as an image in the original]

τ is the time delay, S is the number of sound sources, and V_spe is the weighting factor for a single sound source.
When S > 1,

[the corresponding expression for V_doa is shown only as an image in the original]

and the audio signal in each sound source direction can be obtained using a beam forming method. When S ≤ 1, V_doa = V_spe; for example, a setting of (1, 0, 0, 0) indicates that the 1st audio channel is used as the separated audio signal.
103. And enhancing the second audio signals of the speakers in different directions to obtain enhanced third audio signals of the speakers in different directions.
For example, based on the speaker recognition result, smoothing processing and correction processing of the audio conversion point position may be performed on the second audio signals of speakers in different directions to obtain third audio signals of speakers in different directions, so as to ensure audio continuity.
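The smoothing around audio conversion points can be sketched as a short crossfade between consecutive separated segments; the crossfade length is an illustrative choice, not a value specified by the patent:

```python
import numpy as np

def crossfade(seg_a, seg_b, fade_len=32):
    """Join two separated segments with a linear crossfade so the
    conversion point does not click (illustrative smoothing only)."""
    fade = np.linspace(0.0, 1.0, fade_len)
    overlap = seg_a[-fade_len:] * (1 - fade) + seg_b[:fade_len] * fade
    return np.concatenate([seg_a[:-fade_len], overlap, seg_b[fade_len:]])

a = np.ones(100)      # end of one speaker's segment
b = -np.ones(100)     # start of the next segment
joined = crossfade(a, b)
```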
104. And outputting the third audio signal.
The execution body of the speaker voice separation method in this embodiment may be a speaker voice separation apparatus, which may be implemented in software; for example, the apparatus may be an application. The present invention is not limited in this respect.
The method for separating speaker voices of this embodiment acquires an audio signal in a preset format, preprocesses the audio signal to obtain a processed first audio signal, performs audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions, enhances the second audio signals to obtain enhanced third audio signals of the speakers in different directions, and outputs the third audio signals, thereby realizing rapid and accurate separation of the audio signals of multiple speakers in different directions.
Example 2
Fig. 3 is a schematic structural diagram of an embodiment of a speaker voice separation apparatus according to the present invention, and as shown in fig. 3, the speaker voice separation apparatus of the present embodiment may include an obtaining module 10, a preprocessing module 11, an audio separation module 12, an enhancement processing module 13, and an output module 14.
The obtaining module 10 is configured to obtain an audio signal in a preset format.
The audio signal in the preset format in this embodiment may be an audio signal in Ambisonic A format, which consists of four audio signals: Left-Front-Up (LFU), Right-Front-Down (RFD), Left-Back-Down (LBD) and Right-Back-Up (RBU). FIG. 2 is a schematic diagram of the placement of a microphone array for acquiring four audio signals according to the present invention.
The preprocessing module 11 is configured to preprocess the received audio signal to obtain a processed first audio signal. Specifically, the preprocessing module 11 may obtain the placing mode parameters and ambient environment parameters of the microphone array; convert the multi-channel audio signals according to the placing mode parameters of the microphone array to obtain converted audio signals located on the same plane; perform time-frequency transformation on the converted signals to obtain the corresponding frequency domain signals; perform audio enhancement processing on the frequency domain signals according to the ambient environment parameters to obtain enhanced frequency domain signals; and perform inverse time-frequency transformation on the enhanced signals to obtain an audio time domain signal serving as the first audio signal.
And the audio separation module 12 is configured to perform audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions. Specifically, the audio separation module 12 may obtain a sound source positioning result and a speaker recognition result corresponding to the first audio signal according to the first audio signal, for example, perform voice detection processing on the first audio signal to obtain a detection result; according to the detection result, carrying out sound source positioning processing on the first audio signal to obtain a sound source positioning result; and carrying out speaker recognition processing on the first audio signal according to a preset recognition model to obtain a speaker recognition result.
The audio separation module 12 can also perform audio separation processing on the first audio signal according to the sound source positioning result and the speaker recognition result to obtain second audio signals of speakers in different directions. For example, according to the sound source positioning result and the speaker recognition result, the first audio signal may be subjected to audio separation processing using a beam forming technique to obtain the second audio signals of speakers in different directions. Alternatively, an audio separation method corresponding to the sound source positioning result may be selected, and the first audio signal may be subjected to audio separation processing using that method according to the speaker recognition result to obtain the second audio signals of speakers in different directions.
And the enhancement processing module 13 is configured to perform enhancement processing on the second audio signals of speakers in different directions to obtain enhanced third audio signals of speakers in different directions. Specifically, the enhancement processing module 13 may perform smoothing processing and audio conversion point position correction processing on the second audio signal based on the speaker recognition result, so as to obtain third audio signals of speakers in different directions.
And the output module 14 is used for outputting third audio signals of speakers in different directions.
The mechanism for separating the audio signal by using the modules in the speaker voice separation apparatus of this embodiment is the same as the mechanism for separating the audio signal in the embodiment shown in fig. 1, and reference may be made to the description of the embodiment shown in fig. 1 for details, which is not described herein again.
The speaker voice separation device of this embodiment acquires an audio signal in a preset format, preprocesses the audio signal to obtain a processed first audio signal, performs audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions, enhances the second audio signals to obtain enhanced third audio signals of the speakers in different directions, and outputs the third audio signals, thereby realizing rapid and accurate separation of the audio signals of multiple speakers in different directions.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (8)

1. A method for speaker voice separation, comprising:
acquiring an audio signal with a preset format;
preprocessing the audio signal to obtain a processed first audio signal;
after the placing mode of the microphone array is obtained, carrying out rotation processing on the audio signals according to a formula (1) based on the placing mode of the microphone array, and enabling the obtained audio signals to be located on the same plane;
x′ = A · x    (1)

wherein A is a transformation matrix:

[the entries of A are shown only as an image in the original]

wherein θ_h is a first angle, θ_p is a pitch angle, θ_b is a tilt angle, and f(θ_h, θ_p, θ_b) is a function of θ_h, θ_p and θ_b;
after the converted signal is obtained, performing time-frequency conversion processing on the converted signal according to the formula (2):
X(k) = Σ_{n=0}^{L−1} x(n) · e^(−j2πnk/L_f),  k = 0, 1, …, L_f − 1    (2)

wherein n is a time domain index value, k is a frequency domain index value, L is the audio processing frame length, L_f is the time-frequency transform length, j is the imaginary unit, M is the number of sound channels, x(n) is an audio time domain sample value, and X(k) is an audio frequency domain coefficient;
carrying out audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions;
carrying out voice detection processing on the first audio signal to obtain a corresponding detection result, and carrying out sound source positioning processing on the first audio signal according to the detection result;
and (3) realizing sound source positioning by adopting generalized cross-correlation:
a) calculating the cross-correlation between each pair of audio channels according to formula (3):
[Equation (3): the cross-correlation G(i, j) between channels i and j, computed over frequency bins K_1 to K_2; the formula is shown only as an image in the original]

wherein K_1 is the initial frequency bin and K_2 is the cut-off frequency bin;
b) smoothing is performed based on the voice detection result according to formula (4):
G_sm(i, j) = G_sm(i, j) · f_sm + (1 − f_sm) · G(i, j)    (4)

wherein f_sm is a smoothing factor:

[the piecewise definition of f_sm, which depends on the voice detection result, is shown only as an image in the original]

and Vad is the voice detection processing result;
c) further processing the smoothed cross-correlation function to obtain a sound source positioning result;
carrying out audio separation processing by using a formula (5) to obtain second audio signals of speakers in different directions;
[Equation (5): the audio separation filtering built from the weighting factors V_doa and V_spe; the formula is shown only as an image in the original]
wherein, VdoaWeighting factors in the direction of the sound source:
Figure FDA0002575796730000024
wherein τ is the time delay, S is the number of sound sources, and Vspe is the weighting factor of a single sound source;
when S > 1,
Figure FDA0002575796730000025
the audio signal in the sound source direction is obtained by beamforming; when S ≤ 1, Vdoa = Vspe, and the 1st audio channel is adopted as the separated audio signal;
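The branch logic around formula (5) — beamform when more than one source is active, otherwise pass the first channel through — can be sketched with a delay-and-sum beamformer. The beamformer type and the sampling rate are illustrative assumptions; the patent does not fix either:

```python
import numpy as np

def separate_direction(X, delays, num_sources, fs=16000.0):
    """X: (M, K) frequency-domain coefficients of the M channels.
    delays: per-channel steering delays tau_m (seconds) toward the
    localized direction. Returns the separated signal for that direction."""
    M, K = X.shape
    if num_sources <= 1:
        # S <= 1: Vdoa = Vspe, take the 1st audio channel directly
        return X[0]
    # S > 1: delay-and-sum beamforming toward the sound source direction
    freqs = np.arange(K) * fs / K
    steering = np.exp(-2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    return (np.conj(steering) * X).sum(axis=0) / M
```

With zero delays (broadside source) the beamformer reduces to a plain channel average, which is a useful sanity check.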
enhancing the second audio signal to obtain enhanced third audio signals of speakers in different directions;
outputting the third audio signal.
2. The method of claim 1, wherein pre-processing the audio signal to obtain a processed first audio signal comprises:
obtaining placement parameters and ambient environment parameters of the microphone array;
according to the placement parameters of the microphone array, carrying out conversion processing on the audio signals to obtain converted audio signals located on the same plane;
performing time-frequency transformation on the converted audio signal to obtain a frequency domain signal corresponding to the converted audio signal;
according to the ambient environment parameters, carrying out audio enhancement processing on the frequency domain signal to obtain an enhanced frequency domain signal;
and performing time-frequency inverse transformation on the enhanced frequency domain signal to obtain a time domain signal serving as the first audio signal.
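The preprocessing chain of claim 2 — placement-dependent transform, forward transform, enhancement, inverse transform — can be sketched as below. The transformation matrix and the spectral-floor enhancement are illustrative placeholders standing in for the claim's unspecified conversion and enhancement steps:

```python
import numpy as np

def preprocess(audio, A, noise_floor=1e-3):
    """audio: (M, N) multichannel time-domain signal.
    A: (M, M) placement-dependent transformation matrix mapping the raw
    channels onto a common plane (placeholder; identity is the no-op case).
    noise_floor: crude spectral-floor enhancement knob (placeholder)."""
    planar = A @ audio                        # placement-based conversion
    spec = np.fft.rfft(planar, axis=-1)       # time-frequency transform
    mag = np.abs(spec)
    # toy enhancement: attenuate bins whose magnitude sits below the floor
    gain = np.clip((mag - noise_floor) / np.maximum(mag, 1e-12), 0.0, 1.0)
    enhanced = spec * gain
    # inverse time-frequency transform back to the first audio signal
    return np.fft.irfft(enhanced, n=audio.shape[-1], axis=-1)
```

For a clean, loud input and an identity matrix the round trip is nearly lossless, since only near-silent bins are attenuated.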
3. The method of claim 1 or 2, wherein performing audio separation on the first audio signal to obtain a second audio signal of a speaker in different directions comprises:
acquiring a sound source positioning result and a speaker recognition result corresponding to the first audio signal according to the first audio signal;
and carrying out audio separation processing on the first audio signal according to the sound source positioning result and the speaker identification result to obtain the second audio signal.
4. The method of claim 3, wherein obtaining the sound source localization result and the speaker recognition result corresponding to the first audio signal according to the first audio signal comprises:
carrying out voice detection processing on the first audio signal to obtain a detection result;
according to the detection result, carrying out sound source positioning processing on the first audio signal to obtain a sound source positioning result;
and carrying out speaker recognition processing on the first audio signal according to a preset recognition model to obtain the speaker recognition result.
5. The method of claim 3, wherein performing audio separation processing on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signal comprises:
and carrying out audio separation processing on the first audio signal by utilizing a beam forming method according to the sound source positioning result and the speaker identification result to obtain the second audio signal.
6. The method of claim 3, wherein performing audio separation processing on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signal comprises:
selecting an audio separation method corresponding to the sound source positioning result;
and according to the speaker identification result, carrying out audio separation processing on the first audio signal by using the audio separation method to obtain the second audio signal.
7. The method of claim 3, wherein performing enhancement processing on the second audio signal to obtain an enhanced third audio signal comprises:
and based on the speaker recognition result, carrying out smoothing processing and audio conversion point position correction processing on the second audio signal to obtain the third audio signal.
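One plausible reading of the smoothing and transition-point correction of claim 7 is a short cross-fade around each detected speaker-change point, sketched below. The linear fade and its length are illustrative choices, not specified by the claim:

```python
import numpy as np

def smooth_transition(seg_a, seg_b, fade_len=64):
    """Join two separated audio segments with a linear cross-fade so the
    speaker-change point does not produce an audible click.
    fade_len (samples) is a placeholder parameter."""
    fade_len = min(fade_len, len(seg_a), len(seg_b))
    ramp = np.linspace(0.0, 1.0, fade_len)
    head = seg_a[:-fade_len] if fade_len else seg_a
    cross = seg_a[-fade_len:] * (1.0 - ramp) + seg_b[:fade_len] * ramp
    tail = seg_b[fade_len:]
    return np.concatenate([head, cross, tail])
```

The overlap region replaces a hard cut, so the total length shrinks by fade_len samples while the signal decays smoothly from one speaker into the next.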
8. An apparatus for separating speaker's voice based on the method for separating speaker's voice of claim 1, comprising:
the acquisition module is used for acquiring an audio signal in a preset format;
the preprocessing module is used for preprocessing the audio signal to obtain a processed first audio signal;
the audio separation module is used for carrying out audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions;
the enhancement processing module is used for carrying out enhancement processing on the second audio signal to obtain an enhanced third audio signal;
and the output module is used for outputting the third audio signal.
CN201810231676.XA 2018-03-20 2018-03-20 Method and device for separating speaker voice Active CN108520756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810231676.XA CN108520756B (en) 2018-03-20 2018-03-20 Method and device for separating speaker voice


Publications (2)

Publication Number Publication Date
CN108520756A CN108520756A (en) 2018-09-11
CN108520756B true CN108520756B (en) 2020-09-01

Family

ID=63433795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810231676.XA Active CN108520756B (en) 2018-03-20 2018-03-20 Method and device for separating speaker voice

Country Status (1)

Country Link
CN (1) CN108520756B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110459239A (en) * 2019-03-19 2019-11-15 深圳壹秘科技有限公司 Role analysis method, apparatus and computer readable storage medium based on voice data
CN111899758B (en) * 2020-09-07 2024-01-30 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium
CN112382306B (en) * 2020-12-02 2022-05-10 思必驰科技股份有限公司 Method and device for separating speaker audio
CN112634935B (en) * 2021-03-10 2021-06-11 北京世纪好未来教育科技有限公司 Voice separation method and device, electronic equipment and readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1818909A1 (en) * 2004-12-03 2007-08-15 HONDA MOTOR CO., Ltd. Voice recognition system
CN101720558A (en) * 2007-04-19 2010-06-02 埃波斯开发有限公司 Voice and position localization
CN102831898A (en) * 2012-08-31 2012-12-19 厦门大学 Microphone array voice enhancement device with sound source direction tracking function and method thereof
CN103456312A (en) * 2013-08-29 2013-12-18 太原理工大学 Single channel voice blind separation method based on computational auditory scene analysis
CN103811020A (en) * 2014-03-05 2014-05-21 东北大学 An intelligent voice processing method
CN104049235A (en) * 2014-06-23 2014-09-17 河北工业大学 Microphone array in sound source orienting device
CN104936091A (en) * 2015-05-14 2015-09-23 科大讯飞股份有限公司 Intelligent interaction method and system based on circle microphone array
CN105120421A (en) * 2015-08-21 2015-12-02 北京时代拓灵科技有限公司 Method and apparatus of generating virtual surround sound
CN106098075A (en) * 2016-08-08 2016-11-09 腾讯科技(深圳)有限公司 Audio collection method and apparatus based on microphone array
CN106816156A (en) * 2017-02-04 2017-06-09 北京时代拓灵科技有限公司 A kind of enhanced method and device of audio quality

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105355203B (en) * 2015-11-03 2019-03-12 重庆码头联智科技有限公司 Device and method for voice decision via a gravity-sensor intelligent wearable device
CN105872940B (en) * 2016-06-08 2017-11-17 北京时代拓灵科技有限公司 A kind of virtual reality sound field generation method and system


Also Published As

Publication number Publication date
CN108520756A (en) 2018-09-11

Similar Documents

Publication Publication Date Title
CN108520756B (en) Method and device for separating speaker voice
US8249867B2 (en) Microphone array based speech recognition system and target speech extracting method of the system
CN101136199B (en) Voice data processing method and equipment
Delcroix et al. Compact network for speakerbeam target speaker extraction
US8364483B2 (en) Method for separating source signals and apparatus thereof
CN111179911A (en) Target voice extraction method, device, equipment, medium and joint training method
US9099093B2 (en) Apparatus and method of improving intelligibility of voice signal
Kang et al. Multimodal speaker diarization of real-world meetings using d-vectors with spatial features
Delcroix et al. Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation
CN111261145B (en) Voice processing device, equipment and training method thereof
CN110858476A (en) Sound collection method and device based on microphone array
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
Quan et al. Multi-channel narrow-band deep speech separation with full-band permutation invariant training
Wang et al. Exploring end-to-end multi-channel ASR with bias information for meeting transcription
Venkatesan et al. Binaural classification-based speech segregation and robust speaker recognition system
CN112908340A (en) Global-local windowing-based sound feature rapid extraction method
Marti et al. Automatic speech recognition in cocktail-party situations: A specific training for separated speech
CN115171716B (en) A method, system and electronic device for continuous speech separation based on spatial feature clustering
KR101023211B1 (en) Microphone array based speech recognition system and target speech extraction method in the system
CN118212929A (en) A personalized Ambisonics speech enhancement method
Nakamura et al. Improving separation of overlapped speech for meeting conversations using uncalibrated microphone array
Yoshioka et al. Picknet: Real-time channel selection for ad hoc microphone arrays
Srinivas et al. Speaker-independent Japanese isolated speech word recognition using TDRC features
Venkatesan et al. Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest
Chen et al. Overlapped Speech Detection Based on Spectral and Spatial Feature Fusion.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant