CN118280384A - Multi-sound source signal separation system and method based on microphone array - Google Patents
Multi-sound source signal separation system and method based on microphone array
- Publication number
- CN118280384A (application CN202410594815.0A)
- Authority
- CN
- China
- Prior art keywords
- sound source
- audio
- active
- microphone array
- target sound
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 41
- 238000000926 separation method Methods 0.000 title claims abstract description 40
- 239000013598 vector Substances 0.000 claims abstract description 41
- 238000013135 deep learning Methods 0.000 claims abstract description 15
- 238000012545 processing Methods 0.000 claims abstract description 12
- 230000009467 reduction Effects 0.000 claims abstract description 11
- 238000001228 spectrum Methods 0.000 claims description 50
- 239000011159 matrix material Substances 0.000 claims description 33
- 238000010586 diagram Methods 0.000 claims description 23
- 230000008569 process Effects 0.000 claims description 14
- 238000005457 optimization Methods 0.000 claims description 5
- 230000008859 change Effects 0.000 claims description 4
- 238000013136 deep learning model Methods 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 238000012549 training Methods 0.000 claims description 3
- 230000017105 transposition Effects 0.000 claims description 3
- 239000012634 fragment Substances 0.000 claims 6
- 230000007613 environmental effect Effects 0.000 claims 1
- 238000009499 grossing Methods 0.000 claims 1
- 230000036962 time dependent Effects 0.000 claims 1
- 230000000694 effects Effects 0.000 abstract description 7
- 230000004807 localization Effects 0.000 abstract description 7
- 230000005236 sound signal Effects 0.000 abstract description 4
- 230000008901 benefit Effects 0.000 abstract description 3
- 238000013528 artificial neural network Methods 0.000 abstract 1
- 238000004364 calculation method Methods 0.000 description 6
- 238000010587 phase diagram Methods 0.000 description 4
- 238000003491 array Methods 0.000 description 2
- 238000010276 construction Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 230000002452 interceptive effect Effects 0.000 description 2
- 238000013518 transcription Methods 0.000 description 2
- 230000035897 transcription Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000000354 decomposition reaction Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 238000012880 independent component analysis Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S5/00—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
- G01S5/18—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The present invention discloses a microphone-array-based multi-sound-source signal separation system and method, comprising: a frequency-divided MUSIC sound source localization module, which identifies the number and bearings of the target sound sources; an active target sound source identification module, which identifies the start and end times of active segments in the audio signal and the target sound sources sounding within them; a frequency-divided beamforming module, which performs per-band steering-vector optimization and beamforming for every sounding target source in an active segment to obtain directionally enhanced audio; and a deep learning noise reduction module, which denoises the directionally enhanced audio to obtain the separation result. By optimizing the steering vector per frequency band before beamforming, the invention achieves high-quality directional enhancement, and a neural network performs further noise reduction to improve the separation result. The invention thus combines the strengths of signal processing and deep learning for the multi-sound-source separation task, effectively improving the quality, flexibility, and adaptability of multi-sound-source separation.
Description
Technical Field
The present invention belongs to the field of intelligent perception and edge computing, and specifically relates to a microphone-array-based multi-sound-source signal separation system and method.
Background
With the development of audio processing technology, audio signals have found new uses in natural language processing, human-computer interaction, and other areas, such as automatic transcription from audio to text by speech-transcription programs and voice control of smart devices such as smartphones and smart speakers. These emerging audio applications place certain demands on audio quality: they all need clear, high signal-to-noise-ratio audio to ensure robustness and effectiveness. Real usage scenarios, however, are not always noise-free or low-noise single-source environments; in many cases multiple sound sources are active at once. To guarantee the stability and effectiveness of these applications in real environments, some means is therefore needed to separate the individual sources from a multi-source mixture. Many works have studied the multi-source separation task, but the existing methods all have shortcomings.
At present, the mainstream multi-sound-source separation algorithms include:
1. Signal-processing-based methods, such as independent component analysis, non-negative matrix factorization, and beamforming. Their advantage is flexibility; beamforming, for example, can separate an arbitrary number of sources. Their drawbacks are equally clear: they usually require estimating the statistical properties of certain parameters, which makes them perform poorly on multi-source separation in real scenes, and they are inevitably affected by multipath propagation, so their performance degrades in non-open environments.
2. End-to-end deep learning methods, such as Wave U-Net and Dual-Path RNN. These methods handle simple separation problems, such as two sources in low noise, well, but their limitation is also obvious: they are inflexible, can only separate a fixed number of two or three sources, and struggle when more sources are present.
Therefore, based on the above considerations, it is necessary to propose a new sound source separation system and method that overcomes the virtual-source problem caused by multipath propagation and retains good robustness and separation quality even in confined environments where multipath effects are pronounced, while also being flexible enough to keep that robustness and separation quality as the number of sources in the scene grows.
Summary of the Invention
In view of the above deficiencies of the prior art, the purpose of the present invention is to provide a microphone-array-based multi-sound-source signal separation system and method, so as to solve the problems that existing multi-source separation methods handle multipath effects poorly, cannot cope with changes in the number of sources, and can only process a fixed number of sources. The method of the present invention separates the sound signal of a target source located in a user-specified direction by processing the collected multi-channel audio.
To achieve the above purpose, the technical solution adopted by the present invention is as follows:
A microphone-array-based multi-sound-source signal separation system of the present invention comprises: a frequency-divided MUSIC sound source localization module, an active target sound source identification module, a frequency-divided beamforming module, and a deep learning noise reduction module.
The frequency-divided MUSIC sound source localization module processes the multi-channel audio data collected by the microphone array with the frequency-divided MUSIC algorithm to obtain the azimuth of each target sound source relative to the microphone array, thereby localizing the target sources.
The active target sound source identification module uses the audio collected on the reference channel to identify the start and end times of active segments with strong sound energy, and further identifies the bearings of the target sources sounding within each active segment.
The frequency-divided beamforming module uses the identified active segments to perform per-band steering-vector optimization and beamforming for every target source sounding within them, so as to obtain directionally enhanced audio.
The deep learning noise reduction module applies a deep learning algorithm to denoise the directionally enhanced audio, and the denoised audio serves as the final separation result.
Further, the microphone array consists of more than one microphone arranged in a fixed geometric configuration.
Further, the number of target sound sources to be separated does not exceed the number of microphones in the array.
Further, directionally enhanced audio refers to audio in which the signal-to-noise ratio of the sound arriving from the enhanced direction is higher than in the original audio.
A microphone-array-based multi-sound-source signal separation method of the present invention, based on the above system, comprises the following steps:
1) Identify and localize the target sound sources, obtaining the azimuth of each target source relative to the microphone array.
2) Identify the start and end times of active segments in the collected audio, and further identify the bearings of the target sources sounding within each active segment.
3) For each sounding target source in an active segment, optimize per frequency band the steering vector used for beamforming, and perform per-band beamforming with the optimized steering vectors to obtain directionally enhanced audio.
4) Denoise the directionally enhanced audio; the processed result is the final separation result.
Further, step 1) specifically comprises:
11) Collect an audio clip in which all target sound sources are sounding simultaneously.
12) Process the audio data collected in step 11) with the frequency-divided MUSIC algorithm to compute the number N_s of target sound sources in the scene and the set A_s of their azimuths relative to the microphone array, where the index n over the azimuths is determined by the computed number of sources, 1 ≤ n ≤ N_s.
Further, the specific method of step 12) is:
121) Select the microphone closest to the geometric center of the array layout as the reference microphone.
122) Compute the steering vector matrix Vec_steering, whose size is (FR, AR), where FR is the frequency resolution set by the user and AR is the angle resolution set by the user; Vec_steering[freq, angle] denotes the steering vector on frequency band freq when the source lies at azimuth angle.
Each entry is computed from the band's mean frequency and the array geometry, where j is the imaginary unit, Avg_freq is the mean frequency of band freq, and d_i^angle is the distance difference between the i-th microphone and the reference microphone along the direction perpendicular to the source's azimuth when the source lies at azimuth angle, with 1 ≤ i ≤ N_m.
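For a far-field source, a conventional narrowband steering vector that is consistent with the variables defined above (and assuming c denotes the speed of sound, which this excerpt does not define) takes the form

\mathrm{Vec}_{\mathrm{steering}}[freq, angle]_i = \exp\!\left(-j\,\frac{2\pi \cdot \mathrm{Avg}_{freq} \cdot d_i^{angle}}{c}\right), \qquad 1 \le i \le N_m,

i.e. a unit-magnitude phase term whose delay equals the path-length difference of microphone i divided by the speed of sound. The patent's exact expression may differ; the form above is given only to make the per-band phase structure concrete.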
123) Perform a short-time Fourier transform on the collected audio clip to obtain a multidimensional matrix Arr_fft, which stores, for every channel of the clip, the energy in each frequency band as it varies over time.
124) From the multidimensional matrix Arr_fft, extract per band the matrix describing how the energy of all channels varies over time; the matrix for band f is denoted Arr_f. Compute the eigenvalues and eigenvectors of the covariance matrix of Arr_f and, together with the steering vector matrix, compute the total MUSIC spectrum on band f. Select the two subsets of Arr_f corresponding to two mutually orthogonal groups of channels, compute the eigenvalues and eigenvectors of each subset's covariance matrix, and, together with the steering vector matrix, compute the MUSIC spectrum of each of the two orthogonal channel groups on band f. Accumulate the total MUSIC spectra over all bands, and likewise the spectra of the two orthogonal channel groups, to obtain the final total MUSIC spectrum and the MUSIC spectra of the two orthogonal channel groups.
125) Collect all peaks of the total MUSIC spectrum computed in step 124). For each peak, if peaks also exist at the corresponding positions in the MUSIC spectra of both orthogonal channel groups, the peak corresponds to a target sound source, and its position gives the source's azimuth relative to the microphone array.
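Steps 123) to 125) follow the standard MUSIC recipe applied band by band. The NumPy sketch below shows the computation for a single band under assumed array shapes; the names (band_stft, steering_vectors, num_sources) are illustrative, not identifiers from the patent, and the accumulation over bands and over the two orthogonal channel subsets is indicated only in the trailing comment.

```python
import numpy as np

def music_spectrum(band_stft, steering_vectors, num_sources):
    """MUSIC pseudo-spectrum for one frequency band.

    band_stft:        complex array (num_mics, num_frames), the band-f slice of Arr_fft.
    steering_vectors: complex array (num_angles, num_mics), the band-f row of Vec_steering.
    num_sources:      assumed signal-subspace dimension (number of sources).
    """
    # Spatial covariance of the band, averaged over time frames.
    cov = band_stft @ band_stft.conj().T / band_stft.shape[1]

    # eigh returns eigenvalues in ascending order, so the noise subspace is spanned
    # by the eigenvectors of the smallest (num_mics - num_sources) eigenvalues.
    _, eigvecs = np.linalg.eigh(cov)
    noise_subspace = eigvecs[:, : cov.shape[0] - num_sources]

    spectrum = np.empty(steering_vectors.shape[0])
    for a, sv in enumerate(steering_vectors):
        proj = noise_subspace.conj().T @ sv               # projection onto the noise subspace
        spectrum[a] = 1.0 / np.real(proj.conj() @ proj)   # peaks where the projection vanishes
    return spectrum

# Step 124) then accumulates such spectra over all bands, once for all channels and once
# for each of the two orthogonal channel subsets, before the peak test of step 125).
```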
Further, step 2) specifically comprises:
21) Read a segment of data from the microphone array and use it to identify the start and end times of active segments.
22) After the start and end times of an active segment have been identified, further identify the number N_as of target sources sounding in the segment and the set of their corresponding azimuths.
Further, step 21) specifically comprises:
211) Take an unprocessed data frame of length l bytes from the read audio segment and compute the average amplitude of the samples in the frame. If this average exceeds k times the average ambient-noise level A_n, the frame is an active frame; otherwise it is an inactive frame. Here l and k are user-set parameters, and l divides the size of the read audio segment exactly.
212) An active segment consists of consecutive active and inactive frames whose first and last frames are active. Once an active frame is detected, construction of an active segment begins: the segment consists of the first active frame, the last active frame found within the time range from the start time T_b of that first active frame to T_b + T_s, and all active and inactive frames between them; T_s is a user-set value giving the maximum duration of an active segment.
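A minimal NumPy sketch of the energy-threshold segmentation of steps 211) and 212), with the frame length expressed in samples rather than bytes and all identifiers chosen for illustration:

```python
import numpy as np

def find_active_segments(ref_channel, frame_len, noise_level, k, max_frames):
    """ref_channel: 1-D array of samples from the reference microphone.
    frame_len:   frame length in samples (l in the text), assumed to divide the buffer.
    noise_level: average ambient-noise amplitude A_n.
    k:           user-set activity threshold factor.
    max_frames:  maximum segment length T_s expressed in frames.
    Returns a list of (start_sample, end_sample) pairs."""
    frames = ref_channel.reshape(-1, frame_len)
    active = np.abs(frames).mean(axis=1) > k * noise_level   # active / inactive flags

    segments, i = [], 0
    while i < len(active):
        if not active[i]:
            i += 1
            continue
        # The segment starts at this active frame, runs at most max_frames,
        # and ends at the last active frame inside that window.
        window_end = min(i + max_frames, len(active))
        last = i + int(np.flatnonzero(active[i:window_end])[-1])
        segments.append((i * frame_len, (last + 1) * frame_len))
        i = last + 1
    return segments
```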
Further, step 22) specifically comprises:
221) Select the microphone closest to the geometric center of the array layout as the reference microphone.
222) Use the active segment to compute the total MUSIC spectrum and the MUSIC spectra of the two orthogonal channel groups; the computation is the same as in steps 122) to 124).
223) Collect all peaks of the total MUSIC spectrum from step 222). For each peak, if peaks also exist at the corresponding positions in the MUSIC spectra of both orthogonal channel groups, the peak corresponds to a sounding source, and its position gives that source's azimuth relative to the microphone array. Take the intersection of the azimuths of all sounding sources with the target-source azimuth set A_s to obtain the azimuth set of the target sources sounding in the active segment, where N_as is the size of the intersection, i.e. the number of sounding target sources.
Further, step 3) specifically comprises:
31) For each sounding target source identified in step 2), optimize per frequency band the steering vector used for beamforming, so as to weaken the enhancement that beamforming would otherwise give to the other sounding sources.
32) Use the steering vectors obtained in step 31) to beamform the spectrogram of each band, and regularize the beamforming results for smoothness. Adjust and combine the per-band results according to their weights, then apply an inverse Fourier transform to the combined result to obtain the directionally enhanced audio.
Further, the specific method of step 31) is:
311) Compute the steering vector matrix Vec_steering, in the same way as in step 122).
312) When computing the beamforming for one sounding target source, the remaining sounding sources are treated as interfering sources. If there is no interfering source, the steering vector corresponding to the target source's azimuth is used directly for beamforming; otherwise, the beamforming steering vector is adjusted per frequency band.
When beamforming toward azimuth θ on band f, an enhancement coefficient is computed for an interfering source located at azimuth θ_infer, describing how strongly that interferer would be amplified by the beam.
When there are several interfering sources, the enhancement coefficient of each one is computed and the values are summed to give the total interference coefficient.
313) For each sounding target source θ_target, solve, under certain constraints, an optimization problem that yields the angle adjustment Δθ on band f; the steering vector indexed in Vec_steering by band f and angle θ_target + Δθ is then used as the steering vector for beamforming the target source θ_target on band f.
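The exact expression of the enhancement coefficient and of the optimization problem in steps 312) and 313) are not reproduced here. A common choice consistent with the surrounding description is to measure, on each band, how strongly the beam steered to a candidate angle responds to an interferer by the inner product of the two steering vectors, and to pick the bounded angle offset Δθ that minimizes the summed response over all interferers. The Python sketch below assumes exactly that and a simple grid search; every identifier is illustrative.

```python
import numpy as np

def interference_coeff(steering_row, steer_idx, interferer_idxs):
    """Assumed form of the total interference coefficient on one band: the summed
    beam response (inner-product magnitude) of the steering vector actually used
    toward the steering vectors of all interfering azimuths."""
    w = steering_row[steer_idx]                       # steering vector used as beam weights
    return sum(abs(np.vdot(w, steering_row[j])) for j in interferer_idxs)

def best_steered_angle(steering_row, target_idx, interferer_idxs, max_shift):
    """Grid search over a bounded angle adjustment: return the angle index within
    +/- max_shift grid steps of the target that minimizes the interference
    coefficient on this band (theta_target + delta_theta in the text)."""
    candidates = range(max(0, target_idx - max_shift),
                       min(len(steering_row), target_idx + max_shift + 1))
    return min(candidates,
               key=lambda idx: interference_coeff(steering_row, idx, interferer_idxs))
```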
Further, step 32) specifically comprises:
321) Compute the spectrogram group Spec_raw of the active segment; Spec_raw consists of the spectrograms of the individual channels.
322) Beamform the spectrogram of each band. Specifically: extract from the spectrogram group the part corresponding to band f; using the steering vector obtained in step 313) as weights, form the weighted sum over channels to obtain the beamforming result for the source at azimuth θ_target on band f; then divide the result by the enhancement coefficient in the target direction to normalize it, giving the normalized beamforming result for that band.
323) Concatenate and combine the normalized beamforming results of all bands according to their weights to obtain the final beamforming result Spec_bf(θ_target). The weight of each band depends on the band f itself and on the target angle θ_target. Specifically: using the data of all channels, compute the MUSIC spectrum of band f, with the same computation as in step 12); the value of that spectrum at the target angle θ_target is used as the band's weight.
324) Apply an inverse Fourier transform to Spec_bf(θ_target) to obtain the audio directionally enhanced toward θ_target.
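A condensed NumPy/SciPy sketch of steps 321) to 324): per-band weighted summation over channels with the optimized steering vectors, per-band normalization by the target-direction enhancement coefficient, weighting by the band's MUSIC-spectrum value at the target angle, and an inverse STFT. The regularization of the per-band results mentioned in step 32) is omitted, and all names and STFT parameters are illustrative.

```python
import numpy as np
from scipy.signal import istft

def directional_enhance(spec_raw, steering, band_weights, norm_coeffs, fs, nperseg):
    """spec_raw:     complex array (num_mics, num_bands, num_frames), Spec_raw.
    steering:     complex array (num_bands, num_mics), per-band optimized steering vectors.
    band_weights: real array (num_bands,), MUSIC-spectrum value at theta_target per band.
    norm_coeffs:  real array (num_bands,), enhancement coefficient toward theta_target per band.
    Returns the directionally enhanced waveform."""
    num_mics, num_bands, num_frames = spec_raw.shape
    spec_bf = np.zeros((num_bands, num_frames), dtype=complex)
    for f in range(num_bands):
        beam = steering[f].conj() @ spec_raw[:, f, :]          # weighted sum over channels
        spec_bf[f] = band_weights[f] * beam / norm_coeffs[f]   # normalize, then weight the band
    _, enhanced = istft(spec_bf, fs=fs, nperseg=nperseg)        # back to the time domain
    return enhanced
```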
Further, step 4) specifically comprises:
41) Take the beamforming result of the active segment and the original, unseparated audio collected on any one channel, apply a Fourier transform to each, take the magnitudes, and stack the two magnitude results into a two-channel image.
42) Feed the two-channel image of step 41) to the deep learning model to obtain the magnitude part Amp of the denoised audio's time-frequency representation.
43) Using the beamforming result obtained in step 32), fill in the phase of the magnitude time-frequency representation obtained in step 42) to produce the final separation result.
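Step 41) amounts to stacking two magnitude spectrograms into a two-channel input image for the network; a minimal sketch, with illustrative parameter values:

```python
import numpy as np
from scipy.signal import stft

def build_two_channel_input(beamformed, raw_reference, fs, nperseg=512):
    """Stack the magnitude spectrogram of the beamformed segment and of the raw,
    unseparated reference-channel segment into a (2, num_bands, num_frames) array."""
    _, _, spec_bf = stft(beamformed, fs=fs, nperseg=nperseg)
    _, _, spec_raw = stft(raw_reference, fs=fs, nperseg=nperseg)
    return np.stack([np.abs(spec_bf), np.abs(spec_raw)], axis=0)
```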
Further, the deep model used in step 42) is a Dual-Path GAN, which consists of a generator and a discriminator. In the generator, the two-channel image is first encoded by an Encoder built from dense convolution modules to produce an intermediate result; two successive groups of four conformer modules then process the intermediate result and its transpose respectively, and the two outputs are concatenated as the input to the Decoder. The Decoder decodes its input with dense convolutions and uses a convolution module to reduce the dimensionality of the output, yielding the magnitude part Amp of the denoised audio's time-frequency representation. The discriminator is used during model training: it stacks the magnitude map Amp produced by the generator with the magnitude map Amp_bf of the beamforming result into a two-channel image, and passes it through successive convolution modules to produce a judgment used to optimize the model.
The dense convolution module combines a DenseNet-style connection with a convolution module: like an ordinary convolution module it convolves its input, and it concatenates the convolution result with the inputs of all preceding dense convolution modules in the chain to form its output.
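As a concrete illustration of the dense convolution module only (the full Dual-Path GAN with its conformer branches and discriminator is not reproduced), a PyTorch sketch with assumed layer sizes could look as follows; the DenseNet-style concatenation is what the description above refers to.

```python
import torch
import torch.nn as nn

class DenseConvBlock(nn.Module):
    """Convolve the input and concatenate the result with that input along the
    channel axis, so every later block sees the inputs of all earlier blocks."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, growth, kernel_size=3, padding=1),
            nn.BatchNorm2d(growth),
            nn.PReLU(),
        )

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)

def make_encoder(in_channels=2, growth=16, depth=4):
    """A minimal encoder in the spirit of the generator's Encoder: the channel
    count grows by `growth` at every dense block because of the concatenations."""
    blocks, channels = [], in_channels
    for _ in range(depth):
        blocks.append(DenseConvBlock(channels, growth))
        channels += growth
    return nn.Sequential(*blocks)
```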
Further, the specific method of step 43) is:
431) Apply a Fourier transform to the beamforming result to obtain its spectrogram, then take the phase information of the spectrogram to obtain the phase map Phase; the phase map has the same size as the spectrogram.
432) Multiply Amp and Phase element-wise at corresponding positions to obtain the time-frequency representation of the denoised audio.
433) Apply an inverse Fourier transform to the denoised time-frequency representation to obtain the denoised result.
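Phase filling in steps 431) to 433) is an element-wise combination of the network's magnitude output with the phase of the beamformed signal, followed by an inverse STFT; a sketch that assumes the network output and the STFT of the beamformed audio share the same time-frequency grid, with illustrative parameter values:

```python
import numpy as np
from scipy.signal import stft, istft

def phase_fill(amp, beamformed, fs, nperseg=512):
    """amp:        real array (num_bands, num_frames), the magnitude Amp from the generator.
    beamformed: 1-D array, the directionally enhanced (beamformed) waveform.
    Returns the denoised waveform."""
    _, _, spec_bf = stft(beamformed, fs=fs, nperseg=nperseg)
    phase = np.exp(1j * np.angle(spec_bf))    # unit-magnitude phase map of the beamformed audio
    _, denoised = istft(amp * phase, fs=fs, nperseg=nperseg)   # magnitude x phase, then inverse STFT
    return denoised
```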
Beneficial effects of the present invention:
1. The present invention realizes multi-sound-source separation based on beamforming. It places no special requirement on the geometric layout of the microphone array and imposes only a weak limit on the number of sources to be separated, so an arbitrary microphone array can be used to separate a variable number of sources, which improves the flexibility of multi-sound-source separation.
2. The present invention improves the directional separation achieved by beamforming through frequency-divided beamforming, and enhances the directional separation result with a Dual-Path GAN, effectively raising the quality of the multi-source separation results and improving the adaptability of multi-source separation, so that separation quality is maintained in complex multipath environments.
Brief Description of the Drawings
FIG. 1 is a structural block diagram of the system of the present invention.
FIG. 2 is a flow chart of the method of the present invention.
FIG. 3 is a structural diagram of the Dual-Path GAN deep model used in the present invention.
Detailed Description of the Embodiments
To facilitate the understanding of those skilled in the art, the present invention is further described below in conjunction with embodiments and the accompanying drawings; the content of the embodiments does not limit the present invention.
Referring to FIG. 1, a microphone-array-based multi-sound-source signal separation system of the present invention comprises: a frequency-divided MUSIC sound source localization module, an active target sound source identification module, a frequency-divided beamforming module, and a deep learning noise reduction module.
The frequency-divided MUSIC sound source localization module processes the multi-channel audio data collected by the microphone array with the frequency-divided MUSIC algorithm to obtain the azimuth of each target sound source relative to the microphone array, thereby localizing the target sources.
The active target sound source identification module uses the audio collected on the reference channel to identify the start and end times of active segments with strong sound energy, and further identifies the bearings of the target sources sounding within each active segment.
The frequency-divided beamforming module uses the identified active segments to perform per-band steering-vector optimization and beamforming for every target source sounding within them, so as to obtain directionally enhanced audio.
The deep learning noise reduction module applies a deep learning algorithm to denoise the directionally enhanced audio, and the denoised audio serves as the final separation result.
The microphone array consists of more than one microphone arranged in a fixed geometric configuration.
The number of target sound sources to be separated does not exceed the number of microphones in the array.
Directionally enhanced audio refers to audio in which the signal-to-noise ratio of the sound arriving from the enhanced direction is higher than in the original audio.
Referring to FIG. 2, a microphone-array-based multi-sound-source signal separation method of the present invention, based on the above system, comprises the following steps:
1) Identify and localize the target sound sources, obtaining the azimuth of each target source relative to the microphone array. This specifically includes:
11) Collect an audio clip in which all target sound sources are sounding simultaneously.
12) Process the audio data collected in step 11) with the frequency-divided MUSIC algorithm to compute the number N_s of target sound sources in the scene and the set A_s of their azimuths relative to the microphone array, where the index n over the azimuths is determined by the computed number of sources, 1 ≤ n ≤ N_s.
The specific method of step 12) is:
121) Select the microphone closest to the geometric center of the array layout as the reference microphone.
122) Compute the steering vector matrix Vec_steering, whose size is (FR, AR), where FR is the frequency resolution set by the user and AR is the angle resolution set by the user; Vec_steering[freq, angle] denotes the steering vector on frequency band freq when the source lies at azimuth angle. Each entry is computed from the band's mean frequency and the array geometry, where j is the imaginary unit, Avg_freq is the mean frequency of band freq, and d_i^angle is the distance difference between the i-th microphone and the reference microphone along the direction perpendicular to the source's azimuth when the source lies at azimuth angle, with 1 ≤ i ≤ N_m.
123) Perform a short-time Fourier transform on the collected audio clip to obtain a multidimensional matrix Arr_fft, which stores, for every channel of the clip, the energy in each frequency band as it varies over time.
124) From the multidimensional matrix Arr_fft, extract per band the matrix describing how the energy of all channels varies over time; the matrix for band f is denoted Arr_f. Compute the eigenvalues and eigenvectors of the covariance matrix of Arr_f and, together with the steering vector matrix, compute the total MUSIC spectrum on band f. Select the two subsets of Arr_f corresponding to two mutually orthogonal groups of channels, compute the eigenvalues and eigenvectors of each subset's covariance matrix, and, together with the steering vector matrix, compute the MUSIC spectrum of each of the two orthogonal channel groups on band f. Accumulate the total MUSIC spectra over all bands, and likewise the spectra of the two orthogonal channel groups, to obtain the final total MUSIC spectrum and the MUSIC spectra of the two orthogonal channel groups.
125) Collect all peaks of the total MUSIC spectrum computed in step 124). For each peak, if peaks also exist at the corresponding positions in the MUSIC spectra of both orthogonal channel groups, the peak corresponds to a target sound source, and its position gives the source's azimuth relative to the microphone array.
2) Identify the start and end times of active segments in the collected audio, and further identify the bearings of the target sources sounding within each segment. This specifically includes:
21) Read a segment of data from the microphone array and use it to identify the start and end times of active segments.
22) After the start and end times of an active segment have been identified, further identify the number N_as of target sources sounding in the segment and the set of their corresponding azimuths.
Step 21) specifically comprises:
211) Take an unprocessed data frame of length l bytes from the read audio segment and compute the average amplitude of the samples in the frame. If this average exceeds k times the average ambient-noise level A_n, the frame is an active frame; otherwise it is an inactive frame. Here l and k are user-set parameters, and l divides the size of the read audio segment exactly.
212) An active segment consists of consecutive active and inactive frames whose first and last frames are active. Once an active frame is detected, construction of an active segment begins: the segment consists of the first active frame, the last active frame found within the time range from the start time T_b of that first active frame to T_b + T_s, and all active and inactive frames between them; T_s is a user-set value giving the maximum duration of an active segment.
Step 22) specifically comprises:
221) Select the microphone closest to the geometric center of the array layout as the reference microphone.
222) Use the active segment to compute the total MUSIC spectrum and the MUSIC spectra of the two orthogonal channel groups; the computation is the same as in steps 122) to 124).
223) Collect all peaks of the total MUSIC spectrum from step 222). For each peak, if peaks also exist at the corresponding positions in the MUSIC spectra of both orthogonal channel groups, the peak corresponds to a sounding source, and its position gives that source's azimuth relative to the microphone array. Take the intersection of the azimuths of all sounding sources with the target-source azimuth set A_s to obtain the azimuth set of the target sources sounding in the active segment, where N_as is the size of the intersection, i.e. the number of sounding target sources.
3) For each sounding target source in the active segment, optimize per frequency band the steering vector used for beamforming, and perform per-band beamforming with the optimized steering vectors to obtain directionally enhanced audio. This specifically includes:
31) For each sounding target source identified in step 2), optimize per frequency band the steering vector used for beamforming, so as to weaken the enhancement that beamforming would otherwise give to the other sounding sources.
32) Use the steering vectors obtained in step 31) to beamform the spectrogram of each band, and regularize the beamforming results for smoothness. Adjust and combine the per-band results according to their weights, then apply an inverse Fourier transform to the combined result to obtain the directionally enhanced audio.
The specific method of step 31) is:
311) Compute the steering vector matrix Vec_steering, in the same way as in step 122).
312) When computing the beamforming for one sounding target source, the remaining sounding sources are treated as interfering sources. If there is no interfering source, the steering vector corresponding to the target source's azimuth is used directly for beamforming; otherwise, to prevent beamforming from amplifying the interfering sources, the beamforming steering vector is adjusted per frequency band.
When beamforming toward azimuth θ on band f, an enhancement coefficient is computed for an interfering source located at azimuth θ_infer, describing how strongly that interferer would be amplified by the beam.
When there are several interfering sources, the enhancement coefficient of each one is computed and the values are summed to give the total interference coefficient.
313) For each sounding target source θ_target, solve, under certain constraints, an optimization problem that yields the angle adjustment Δθ on band f; the steering vector indexed in Vec_steering by band f and angle θ_target + Δθ is then used as the steering vector for beamforming the target source θ_target on band f.
Step 32) specifically comprises:
321) Compute the spectrogram group Spec_raw of the active segment; Spec_raw consists of the spectrograms of the individual channels.
322) Beamform the spectrogram of each band. Specifically: extract from the spectrogram group the part corresponding to band f; using the steering vector obtained in step 313) as weights, form the weighted sum over channels to obtain the beamforming result for the source at azimuth θ_target on band f; then divide the result by the enhancement coefficient in the target direction to normalize it, giving the normalized beamforming result for that band.
323) Concatenate and combine the normalized beamforming results of all bands according to their weights to obtain the final beamforming result Spec_bf(θ_target). The weight of each band depends on the band f itself and on the target angle θ_target. Specifically: using the data of all channels, compute the MUSIC spectrum of band f, with the same computation as in step 12); the value of that spectrum at the target angle θ_target is used as the band's weight.
324) Apply an inverse Fourier transform to Spec_bf(θ_target) to obtain the audio directionally enhanced toward θ_target.
4) Denoise the directionally enhanced audio; the processed result is the final separation result. This specifically includes:
41) Take the beamforming result of the active segment and the original, unseparated audio collected on any one channel, apply a Fourier transform to each, take the magnitudes, and stack the two magnitude results into a two-channel image.
42) Feed the two-channel image of step 41) to the deep learning model to obtain the magnitude part Amp of the denoised audio's time-frequency representation.
43) Using the beamforming result obtained in step 32), fill in the phase of the magnitude time-frequency representation obtained in step 42) to produce the final separation result.
Referring to FIG. 3, the deep model used in step 42) is a Dual-Path GAN, which consists of a generator and a discriminator. In the generator, the two-channel image is first encoded by an Encoder built from dense convolution modules to produce an intermediate result; two successive groups of four conformer modules then process the intermediate result and its transpose respectively, and the two outputs are concatenated as the input to the Decoder. The Decoder decodes its input with dense convolutions and uses a convolution module to reduce the dimensionality of the output, yielding the magnitude part Amp of the denoised audio's time-frequency representation. The discriminator is used during model training: it stacks the magnitude map Amp produced by the generator with the magnitude map Amp_bf of the beamforming result into a two-channel image, and passes it through successive convolution modules to produce a judgment used to optimize the model.
The dense convolution module combines a DenseNet-style connection with a convolution module: like an ordinary convolution module it convolves its input, and it concatenates the convolution result with the inputs of all preceding dense convolution modules in the chain to form its output.
The specific method of step 43) is:
431) Apply a Fourier transform to the beamforming result to obtain its spectrogram, then take the phase information of the spectrogram to obtain the phase map Phase; the phase map has the same size as the spectrogram.
432) Multiply Amp and Phase element-wise at corresponding positions to obtain the time-frequency representation of the denoised audio.
433) Apply an inverse Fourier transform to the denoised time-frequency representation to obtain the denoised result.
The present invention can be applied in many specific ways. The above is only a preferred embodiment of the present invention; it should be pointed out that those of ordinary skill in the art can make several further improvements without departing from the principle of the present invention, and such improvements shall also fall within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410594815.0A CN118280384A (en) | 2024-05-14 | 2024-05-14 | Multi-sound source signal separation system and method based on microphone array |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410594815.0A CN118280384A (en) | 2024-05-14 | 2024-05-14 | Multi-sound source signal separation system and method based on microphone array |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118280384A true CN118280384A (en) | 2024-07-02 |
Family
ID=91642009
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410594815.0A Pending CN118280384A (en) | 2024-05-14 | 2024-05-14 | Multi-sound source signal separation system and method based on microphone array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118280384A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119012078A (en) * | 2024-10-24 | 2024-11-22 | 深圳市嘀嘟科技有限公司 | Directional audio transmission method and system based on beam forming |
- 2024-05-14 CN CN202410594815.0A patent/CN118280384A/en active Pending
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119012078A (en) * | 2024-10-24 | 2024-11-22 | 深圳市嘀嘟科技有限公司 | Directional audio transmission method and system based on beam forming |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10901063B2 (en) | Localization algorithm for sound sources with known statistics | |
JP6807029B2 (en) | Sound source separators and methods, and programs | |
US9008329B1 (en) | Noise reduction using multi-feature cluster tracker | |
EP2628316B1 (en) | Apparatus and method for deriving a directional information and computer program product | |
CN110223708B (en) | Speech enhancement method based on speech processing and related equipment | |
CN106504763A (en) | Multi-target Speech Enhancement Method Based on Microphone Array Based on Blind Source Separation and Spectral Subtraction | |
US11817112B2 (en) | Method, device, computer readable storage medium and electronic apparatus for speech signal processing | |
US9426564B2 (en) | Audio processing device, method and program | |
CN102421050A (en) | Apparatus and method for enhancing audio quality using non-uniform configuration of microphones | |
CN108091345B (en) | A Binaural Speech Separation Method Based on Support Vector Machine | |
US10477309B2 (en) | Sound field reproduction device, sound field reproduction method, and program | |
Wang et al. | Deep learning assisted time-frequency processing for speech enhancement on drones | |
US10410641B2 (en) | Audio source separation | |
WO2016056410A1 (en) | Sound processing device, method, and program | |
CN108520756B (en) | Method and device for separating speaker voice | |
CN118280384A (en) | Multi-sound source signal separation system and method based on microphone array | |
US9966081B2 (en) | Method and apparatus for synthesizing separated sound source | |
Hemavathi et al. | Voice conversion spoofing detection by exploring artifacts estimates | |
Gul et al. | Clustering of spatial cues by semantic segmentation for anechoic binaural source separation | |
CN111929638A (en) | Voice direction of arrival estimation method and device | |
Yamaoka et al. | CNN-based virtual microphone signal estimation for MPDR beamforming in underdetermined situations | |
Muñoz-Montoro et al. | Ambisonics domain singing voice separation combining deep neural network and direction aware multichannel NMF | |
Lluís et al. | Direction specific ambisonics source separation with end-to-end deep learning | |
CN117169812A (en) | Sound source positioning method based on deep learning and beam forming | |
Yeow et al. | Real-Time Sound Event Localization and Detection: Deployment Challenges on Edge Devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||