CN118280384A - Multi-sound source signal separation system and method based on microphone array - Google Patents
Multi-sound source signal separation system and method based on microphone array
- Publication number
- CN118280384A (application CN202410594815.0A)
- Authority
- CN
- China
- Prior art keywords
- sound source
- audio
- active
- microphone array
- target sound
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 41
- 238000000926 separation method Methods 0.000 title claims abstract description 40
- 239000013598 vector Substances 0.000 claims abstract description 41
- 238000013135 deep learning Methods 0.000 claims abstract description 15
- 238000012545 processing Methods 0.000 claims abstract description 12
- 230000009467 reduction Effects 0.000 claims abstract description 11
- 238000001228 spectrum Methods 0.000 claims description 50
- 239000011159 matrix material Substances 0.000 claims description 33
- 238000010586 diagram Methods 0.000 claims description 23
- 230000008569 process Effects 0.000 claims description 14
- 238000005457 optimization Methods 0.000 claims description 5
- 230000008859 change Effects 0.000 claims description 4
- 238000013136 deep learning model Methods 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 238000012549 training Methods 0.000 claims description 3
- 230000017105 transposition Effects 0.000 claims description 3
- 239000012634 fragment Substances 0.000 claims 6
- 230000007613 environmental effect Effects 0.000 claims 1
- 238000009499 grossing Methods 0.000 claims 1
- 230000036962 time dependent Effects 0.000 claims 1
- 230000000694 effects Effects 0.000 abstract description 7
- 230000004807 localization Effects 0.000 abstract description 7
- 230000005236 sound signal Effects 0.000 abstract description 4
- 230000008901 benefit Effects 0.000 abstract description 3
- 238000013528 artificial neural network Methods 0.000 abstract 1
- 238000004364 calculation method Methods 0.000 description 6
- 238000010587 phase diagram Methods 0.000 description 4
- 238000003491 array Methods 0.000 description 2
- 238000010276 construction Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 230000002452 interceptive effect Effects 0.000 description 2
- 238000013518 transcription Methods 0.000 description 2
- 230000035897 transcription Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000000354 decomposition reaction Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 238000012880 independent component analysis Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S5/00—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
- G01S5/18—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The present invention discloses a microphone-array-based multi-sound-source signal separation system and method, comprising: a frequency-divided MUSIC sound source localization module, which identifies the number and bearings of the target sound sources; an active target sound source identification module, which identifies the start and end times of active segments in the audio signal and the target sound sources sounding within them; a frequency-divided beamforming module, which performs per-band steering-vector optimization and beamforming for every sounding target source in an active segment to obtain directionally enhanced audio; and a deep learning noise reduction module, which denoises the directionally enhanced audio to obtain the separation result. By optimizing the steering vector per frequency band before beamforming, the invention achieves high-quality directional enhancement, and a neural network performs further noise reduction to improve the separation result. The invention thus combines the strengths of signal processing and deep learning for the multi-sound-source separation task, effectively improving the quality, flexibility, and adaptability of multi-sound-source separation.
Description
Technical Field
The present invention belongs to the field of intelligent perception and edge computing, and specifically relates to a microphone-array-based multi-sound-source signal separation system and method.
Background
With the development of audio processing technology, audio signals have found new uses in natural language processing, human-computer interaction, and other areas, such as automatic transcription from audio to text by speech-transcription programs and voice control of smart devices such as smartphones and smart speakers. These emerging audio applications place certain demands on audio quality: they all need clear, high signal-to-noise-ratio audio to ensure robustness and effectiveness. Real usage scenarios, however, are not always noise-free or low-noise single-source environments; in many cases multiple sound sources are active at once. To guarantee the stability and effectiveness of these applications in real environments, some means is therefore needed to separate the individual sources from a multi-source mixture. Many works have studied the multi-source separation task, but the existing methods all have shortcomings.
At present, the mainstream multi-sound-source separation algorithms include:
1. Signal-processing-based methods, such as independent component analysis, non-negative matrix factorization, and beamforming. Their advantage is flexibility; beamforming, for example, can separate an arbitrary number of sources. Their drawbacks are equally clear: they usually require estimating the statistical properties of certain parameters, which makes them perform poorly on multi-source separation in real scenes, and they are inevitably affected by multipath propagation, so their performance degrades in non-open environments.
2. End-to-end deep learning methods, such as Wave U-Net and Dual-Path RNN. These methods handle simple separation problems, such as two sources in low noise, well, but their limitation is also obvious: they are inflexible, can only separate a fixed number of two or three sources, and struggle when more sources are present.
Therefore, based on the above considerations, it is necessary to propose a new sound source separation system and method that overcomes the virtual-source problem caused by multipath propagation and retains good robustness and separation quality even in confined environments where multipath effects are pronounced, while also being flexible enough to keep that robustness and separation quality as the number of sources in the scene grows.
Summary of the Invention
In view of the above deficiencies of the prior art, the purpose of the present invention is to provide a microphone-array-based multi-sound-source signal separation system and method, so as to solve the problems that existing multi-source separation methods handle multipath effects poorly, cannot cope with changes in the number of sources, and can only process a fixed number of sources. The method of the present invention separates the sound signal of a target source located in a user-specified direction by processing the collected multi-channel audio.
To achieve the above purpose, the technical solution adopted by the present invention is as follows:
A microphone-array-based multi-sound-source signal separation system of the present invention comprises: a frequency-divided MUSIC sound source localization module, an active target sound source identification module, a frequency-divided beamforming module, and a deep learning noise reduction module.
The frequency-divided MUSIC sound source localization module processes the multi-channel audio data collected by the microphone array with the frequency-divided MUSIC algorithm to obtain the azimuth of each target sound source relative to the microphone array, thereby localizing the target sources.
The active target sound source identification module uses the audio collected on the reference channel to identify the start and end times of active segments with strong sound energy, and further identifies the bearings of the target sources sounding within each active segment.
The frequency-divided beamforming module uses the identified active segments to perform per-band steering-vector optimization and beamforming for every target source sounding within them, so as to obtain directionally enhanced audio.
The deep learning noise reduction module applies a deep learning algorithm to denoise the directionally enhanced audio, and the denoised audio serves as the final separation result.
Further, the microphone array consists of more than one microphone arranged in a fixed geometric configuration.
Further, the number of target sound sources to be separated does not exceed the number of microphones in the array.
Further, directionally enhanced audio refers to audio in which the signal-to-noise ratio of the sound arriving from the enhanced direction is higher than in the original audio.
A microphone-array-based multi-sound-source signal separation method of the present invention, based on the above system, comprises the following steps:
1) Identify and localize the target sound sources, obtaining the azimuth of each target source relative to the microphone array.
2) Identify the start and end times of active segments in the collected audio, and further identify the bearings of the target sources sounding within each active segment.
3) For each sounding target source in an active segment, optimize per frequency band the steering vector used for beamforming, and perform per-band beamforming with the optimized steering vectors to obtain directionally enhanced audio.
4) Denoise the directionally enhanced audio; the processed result is the final separation result.
Further, step 1) specifically comprises:
11) Collect an audio clip in which all target sound sources are sounding simultaneously.
12) Process the audio data collected in step 11) with the frequency-divided MUSIC algorithm to compute the number N_s of target sound sources in the scene and the set A_s of their azimuths relative to the microphone array, where the index n over the azimuths is determined by the computed number of sources, 1 ≤ n ≤ N_s.
Further, the specific method of step 12) is:
121) Select the microphone closest to the geometric center of the array layout as the reference microphone.
122) Compute the steering vector matrix Vec_steering, whose size is (FR, AR), where FR is the frequency resolution set by the user and AR is the angle resolution set by the user; Vec_steering[freq, angle] denotes the steering vector on frequency band freq when the source lies at azimuth angle.
Each entry is computed from the band's mean frequency and the array geometry, where j is the imaginary unit, Avg_freq is the mean frequency of band freq, and d_i^angle is the distance difference between the i-th microphone and the reference microphone along the direction perpendicular to the source's azimuth when the source lies at azimuth angle, with 1 ≤ i ≤ N_m.
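For a far-field source, a conventional narrowband steering vector that is consistent with the variables defined above (and assuming c denotes the speed of sound, which this excerpt does not define) takes the form

\mathrm{Vec}_{\mathrm{steering}}[freq, angle]_i = \exp\!\left(-j\,\frac{2\pi \cdot \mathrm{Avg}_{freq} \cdot d_i^{angle}}{c}\right), \qquad 1 \le i \le N_m,

i.e. a unit-magnitude phase term whose delay equals the path-length difference of microphone i divided by the speed of sound. The patent's exact expression may differ; the form above is given only to make the per-band phase structure concrete.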
123) Perform a short-time Fourier transform on the collected audio clip to obtain a multidimensional matrix Arr_fft, which stores, for every channel of the clip, the energy in each frequency band as it varies over time.
124) From the multidimensional matrix Arr_fft, extract per band the matrix describing how the energy of all channels varies over time; the matrix for band f is denoted Arr_f. Compute the eigenvalues and eigenvectors of the covariance matrix of Arr_f and, together with the steering vector matrix, compute the total MUSIC spectrum on band f. Select the two subsets of Arr_f corresponding to two mutually orthogonal groups of channels, compute the eigenvalues and eigenvectors of each subset's covariance matrix, and, together with the steering vector matrix, compute the MUSIC spectrum of each of the two orthogonal channel groups on band f. Accumulate the total MUSIC spectra over all bands, and likewise the spectra of the two orthogonal channel groups, to obtain the final total MUSIC spectrum and the MUSIC spectra of the two orthogonal channel groups.
125) Collect all peaks of the total MUSIC spectrum computed in step 124). For each peak, if peaks also exist at the corresponding positions in the MUSIC spectra of both orthogonal channel groups, the peak corresponds to a target sound source, and its position gives the source's azimuth relative to the microphone array.
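Steps 123) to 125) follow the standard MUSIC recipe applied band by band. The NumPy sketch below shows the computation for a single band under assumed array shapes; the names (band_stft, steering_vectors, num_sources) are illustrative, not identifiers from the patent, and the accumulation over bands and over the two orthogonal channel subsets is indicated only in the trailing comment.

```python
import numpy as np

def music_spectrum(band_stft, steering_vectors, num_sources):
    """MUSIC pseudo-spectrum for one frequency band.

    band_stft:        complex array (num_mics, num_frames), the band-f slice of Arr_fft.
    steering_vectors: complex array (num_angles, num_mics), the band-f row of Vec_steering.
    num_sources:      assumed signal-subspace dimension (number of sources).
    """
    # Spatial covariance of the band, averaged over time frames.
    cov = band_stft @ band_stft.conj().T / band_stft.shape[1]

    # eigh returns eigenvalues in ascending order, so the noise subspace is spanned
    # by the eigenvectors of the smallest (num_mics - num_sources) eigenvalues.
    _, eigvecs = np.linalg.eigh(cov)
    noise_subspace = eigvecs[:, : cov.shape[0] - num_sources]

    spectrum = np.empty(steering_vectors.shape[0])
    for a, sv in enumerate(steering_vectors):
        proj = noise_subspace.conj().T @ sv               # projection onto the noise subspace
        spectrum[a] = 1.0 / np.real(proj.conj() @ proj)   # peaks where the projection vanishes
    return spectrum

# Step 124) then accumulates such spectra over all bands, once for all channels and once
# for each of the two orthogonal channel subsets, before the peak test of step 125).
```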
Further, step 2) specifically comprises:
21) Read a segment of data from the microphone array and use it to identify the start and end times of active segments.
22) After the start and end times of an active segment have been identified, further identify the number N_as of target sources sounding in the segment and the set of their corresponding azimuths.
Further, step 21) specifically comprises:
211) Take an unprocessed data frame of length l bytes from the read audio segment and compute the average amplitude of the samples in the frame. If this average exceeds k times the average ambient-noise level A_n, the frame is an active frame; otherwise it is an inactive frame. Here l and k are user-set parameters, and l divides the size of the read audio segment exactly.
212) An active segment consists of consecutive active and inactive frames whose first and last frames are active. Once an active frame is detected, construction of an active segment begins: the segment consists of the first active frame, the last active frame found within the time range from the start time T_b of that first active frame to T_b + T_s, and all active and inactive frames between them; T_s is a user-set value giving the maximum duration of an active segment.
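A minimal NumPy sketch of the energy-threshold segmentation of steps 211) and 212), with the frame length expressed in samples rather than bytes and all identifiers chosen for illustration:

```python
import numpy as np

def find_active_segments(ref_channel, frame_len, noise_level, k, max_frames):
    """ref_channel: 1-D array of samples from the reference microphone.
    frame_len:   frame length in samples (l in the text), assumed to divide the buffer.
    noise_level: average ambient-noise amplitude A_n.
    k:           user-set activity threshold factor.
    max_frames:  maximum segment length T_s expressed in frames.
    Returns a list of (start_sample, end_sample) pairs."""
    frames = ref_channel.reshape(-1, frame_len)
    active = np.abs(frames).mean(axis=1) > k * noise_level   # active / inactive flags

    segments, i = [], 0
    while i < len(active):
        if not active[i]:
            i += 1
            continue
        # The segment starts at this active frame, runs at most max_frames,
        # and ends at the last active frame inside that window.
        window_end = min(i + max_frames, len(active))
        last = i + int(np.flatnonzero(active[i:window_end])[-1])
        segments.append((i * frame_len, (last + 1) * frame_len))
        i = last + 1
    return segments
```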
Further, step 22) specifically comprises:
221) Select the microphone closest to the geometric center of the array layout as the reference microphone.
222) Use the active segment to compute the total MUSIC spectrum and the MUSIC spectra of the two orthogonal channel groups; the computation is the same as in steps 122) to 124).
223) Collect all peaks of the total MUSIC spectrum from step 222). For each peak, if peaks also exist at the corresponding positions in the MUSIC spectra of both orthogonal channel groups, the peak corresponds to a sounding source, and its position gives that source's azimuth relative to the microphone array. Take the intersection of the azimuths of all sounding sources with the target-source azimuth set A_s to obtain the azimuth set of the target sources sounding in the active segment, where N_as is the size of the intersection, i.e. the number of sounding target sources.
Further, step 3) specifically comprises:
31) For each sounding target source identified in step 2), optimize per frequency band the steering vector used for beamforming, so as to weaken the enhancement that beamforming would otherwise give to the other sounding sources.
32) Use the steering vectors obtained in step 31) to beamform the spectrogram of each band, and regularize the beamforming results for smoothness. Adjust and combine the per-band results according to their weights, then apply an inverse Fourier transform to the combined result to obtain the directionally enhanced audio.
Further, the specific method of step 31) is:
311) Compute the steering vector matrix Vec_steering, in the same way as in step 122).
312) When computing the beamforming for one sounding target source, the remaining sounding sources are treated as interfering sources. If there is no interfering source, the steering vector corresponding to the target source's azimuth is used directly for beamforming; otherwise, the beamforming steering vector is adjusted per frequency band.
When beamforming toward azimuth θ on band f, an enhancement coefficient is computed for an interfering source located at azimuth θ_infer, describing how strongly that interferer would be amplified by the beam.
When there are several interfering sources, the enhancement coefficient of each one is computed and the values are summed to give the total interference coefficient.
313) For each sounding target source θ_target, solve, under certain constraints, an optimization problem that yields the angle adjustment Δθ on band f; the steering vector indexed in Vec_steering by band f and angle θ_target + Δθ is then used as the steering vector for beamforming the target source θ_target on band f.
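The exact expression of the enhancement coefficient and of the optimization problem in steps 312) and 313) are not reproduced here. A common choice consistent with the surrounding description is to measure, on each band, how strongly the beam steered to a candidate angle responds to an interferer by the inner product of the two steering vectors, and to pick the bounded angle offset Δθ that minimizes the summed response over all interferers. The Python sketch below assumes exactly that and a simple grid search; every identifier is illustrative.

```python
import numpy as np

def interference_coeff(steering_row, steer_idx, interferer_idxs):
    """Assumed form of the total interference coefficient on one band: the summed
    beam response (inner-product magnitude) of the steering vector actually used
    toward the steering vectors of all interfering azimuths."""
    w = steering_row[steer_idx]                       # steering vector used as beam weights
    return sum(abs(np.vdot(w, steering_row[j])) for j in interferer_idxs)

def best_steered_angle(steering_row, target_idx, interferer_idxs, max_shift):
    """Grid search over a bounded angle adjustment: return the angle index within
    +/- max_shift grid steps of the target that minimizes the interference
    coefficient on this band (theta_target + delta_theta in the text)."""
    candidates = range(max(0, target_idx - max_shift),
                       min(len(steering_row), target_idx + max_shift + 1))
    return min(candidates,
               key=lambda idx: interference_coeff(steering_row, idx, interferer_idxs))
```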
Further, step 32) specifically comprises:
321) Compute the spectrogram group Spec_raw of the active segment; Spec_raw consists of the spectrograms of the individual channels.
322) Beamform the spectrogram of each band. Specifically: extract from the spectrogram group the part corresponding to band f; using the steering vector obtained in step 313) as weights, form the weighted sum over channels to obtain the beamforming result for the source at azimuth θ_target on band f; then divide the result by the enhancement coefficient in the target direction to normalize it, giving the normalized beamforming result for that band.
323) Concatenate and combine the normalized beamforming results of all bands according to their weights to obtain the final beamforming result Spec_bf(θ_target). The weight of each band depends on the band f itself and on the target angle θ_target. Specifically: using the data of all channels, compute the MUSIC spectrum of band f, with the same computation as in step 12); the value of that spectrum at the target angle θ_target is used as the band's weight.
324) Apply an inverse Fourier transform to Spec_bf(θ_target) to obtain the audio directionally enhanced toward θ_target.
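A condensed NumPy/SciPy sketch of steps 321) to 324): per-band weighted summation over channels with the optimized steering vectors, per-band normalization by the target-direction enhancement coefficient, weighting by the band's MUSIC-spectrum value at the target angle, and an inverse STFT. The regularization of the per-band results mentioned in step 32) is omitted, and all names and STFT parameters are illustrative.

```python
import numpy as np
from scipy.signal import istft

def directional_enhance(spec_raw, steering, band_weights, norm_coeffs, fs, nperseg):
    """spec_raw:     complex array (num_mics, num_bands, num_frames), Spec_raw.
    steering:     complex array (num_bands, num_mics), per-band optimized steering vectors.
    band_weights: real array (num_bands,), MUSIC-spectrum value at theta_target per band.
    norm_coeffs:  real array (num_bands,), enhancement coefficient toward theta_target per band.
    Returns the directionally enhanced waveform."""
    num_mics, num_bands, num_frames = spec_raw.shape
    spec_bf = np.zeros((num_bands, num_frames), dtype=complex)
    for f in range(num_bands):
        beam = steering[f].conj() @ spec_raw[:, f, :]          # weighted sum over channels
        spec_bf[f] = band_weights[f] * beam / norm_coeffs[f]   # normalize, then weight the band
    _, enhanced = istft(spec_bf, fs=fs, nperseg=nperseg)        # back to the time domain
    return enhanced
```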
Further, step 4) specifically comprises:
41) Take the beamforming result of the active segment and the original, unseparated audio collected on any one channel, apply a Fourier transform to each, take the magnitudes, and stack the two magnitude results into a two-channel image.
42) Feed the two-channel image of step 41) to the deep learning model to obtain the magnitude part Amp of the denoised audio's time-frequency representation.
43) Using the beamforming result obtained in step 32), fill in the phase of the magnitude time-frequency representation obtained in step 42) to produce the final separation result.
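Step 41) amounts to stacking two magnitude spectrograms into a two-channel input image for the network; a minimal sketch, with illustrative parameter values:

```python
import numpy as np
from scipy.signal import stft

def build_two_channel_input(beamformed, raw_reference, fs, nperseg=512):
    """Stack the magnitude spectrogram of the beamformed segment and of the raw,
    unseparated reference-channel segment into a (2, num_bands, num_frames) array."""
    _, _, spec_bf = stft(beamformed, fs=fs, nperseg=nperseg)
    _, _, spec_raw = stft(raw_reference, fs=fs, nperseg=nperseg)
    return np.stack([np.abs(spec_bf), np.abs(spec_raw)], axis=0)
```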
Further, the deep model used in step 42) is a Dual-Path GAN, which consists of a generator and a discriminator. In the generator, the two-channel image is first encoded by an Encoder built from dense convolution modules to produce an intermediate result; two successive groups of four conformer modules then process the intermediate result and its transpose respectively, and the two outputs are concatenated as the input to the Decoder. The Decoder decodes its input with dense convolutions and uses a convolution module to reduce the dimensionality of the output, yielding the magnitude part Amp of the denoised audio's time-frequency representation. The discriminator is used during model training: it stacks the magnitude map Amp produced by the generator with the magnitude map Amp_bf of the beamforming result into a two-channel image, and passes it through successive convolution modules to produce a judgment used to optimize the model.
The dense convolution module combines a DenseNet-style connection with a convolution module: like an ordinary convolution module it convolves its input, and it concatenates the convolution result with the inputs of all preceding dense convolution modules in the chain to form its output.
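As a concrete illustration of the dense convolution module only (the full Dual-Path GAN with its conformer branches and discriminator is not reproduced), a PyTorch sketch with assumed layer sizes could look as follows; the DenseNet-style concatenation is what the description above refers to.

```python
import torch
import torch.nn as nn

class DenseConvBlock(nn.Module):
    """Convolve the input and concatenate the result with that input along the
    channel axis, so every later block sees the inputs of all earlier blocks."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, growth, kernel_size=3, padding=1),
            nn.BatchNorm2d(growth),
            nn.PReLU(),
        )

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)

def make_encoder(in_channels=2, growth=16, depth=4):
    """A minimal encoder in the spirit of the generator's Encoder: the channel
    count grows by `growth` at every dense block because of the concatenations."""
    blocks, channels = [], in_channels
    for _ in range(depth):
        blocks.append(DenseConvBlock(channels, growth))
        channels += growth
    return nn.Sequential(*blocks)
```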
Further, the specific method of step 43) is:
431) Apply a Fourier transform to the beamforming result to obtain its spectrogram, then take the phase information of the spectrogram to obtain the phase map Phase; the phase map has the same size as the spectrogram.
432) Multiply Amp and Phase element-wise at corresponding positions to obtain the time-frequency representation of the denoised audio.
433) Apply an inverse Fourier transform to the denoised time-frequency representation to obtain the denoised result.
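Phase filling in steps 431) to 433) is an element-wise combination of the network's magnitude output with the phase of the beamformed signal, followed by an inverse STFT; a sketch that assumes the network output and the STFT of the beamformed audio share the same time-frequency grid, with illustrative parameter values:

```python
import numpy as np
from scipy.signal import stft, istft

def phase_fill(amp, beamformed, fs, nperseg=512):
    """amp:        real array (num_bands, num_frames), the magnitude Amp from the generator.
    beamformed: 1-D array, the directionally enhanced (beamformed) waveform.
    Returns the denoised waveform."""
    _, _, spec_bf = stft(beamformed, fs=fs, nperseg=nperseg)
    phase = np.exp(1j * np.angle(spec_bf))    # unit-magnitude phase map of the beamformed audio
    _, denoised = istft(amp * phase, fs=fs, nperseg=nperseg)   # magnitude x phase, then inverse STFT
    return denoised
```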
Beneficial effects of the present invention:
1. The present invention realizes multi-sound-source separation based on beamforming. It places no special requirement on the geometric layout of the microphone array and imposes only a weak limit on the number of sources to be separated, so an arbitrary microphone array can be used to separate a variable number of sources, which improves the flexibility of multi-sound-source separation.
2. The present invention improves the directional separation achieved by beamforming through frequency-divided beamforming, and enhances the directional separation result with a Dual-Path GAN, effectively raising the quality of the multi-source separation results and improving the adaptability of multi-source separation, so that separation quality is maintained in complex multipath environments.
Brief Description of the Drawings
FIG. 1 is a structural block diagram of the system of the present invention.
FIG. 2 is a flow chart of the method of the present invention.
FIG. 3 is a structural diagram of the Dual-Path GAN deep model used in the present invention.
Detailed Description of the Embodiments
To facilitate the understanding of those skilled in the art, the present invention is further described below in conjunction with embodiments and the accompanying drawings; the content of the embodiments does not limit the present invention.
Referring to FIG. 1, a microphone-array-based multi-sound-source signal separation system of the present invention comprises: a frequency-divided MUSIC sound source localization module, an active target sound source identification module, a frequency-divided beamforming module, and a deep learning noise reduction module.
The frequency-divided MUSIC sound source localization module processes the multi-channel audio data collected by the microphone array with the frequency-divided MUSIC algorithm to obtain the azimuth of each target sound source relative to the microphone array, thereby localizing the target sources.
The active target sound source identification module uses the audio collected on the reference channel to identify the start and end times of active segments with strong sound energy, and further identifies the bearings of the target sources sounding within each active segment.
The frequency-divided beamforming module uses the identified active segments to perform per-band steering-vector optimization and beamforming for every target source sounding within them, so as to obtain directionally enhanced audio.
The deep learning noise reduction module applies a deep learning algorithm to denoise the directionally enhanced audio, and the denoised audio serves as the final separation result.
The microphone array consists of more than one microphone arranged in a fixed geometric configuration.
The number of target sound sources to be separated does not exceed the number of microphones in the array.
Directionally enhanced audio refers to audio in which the signal-to-noise ratio of the sound arriving from the enhanced direction is higher than in the original audio.
Referring to FIG. 2, a microphone-array-based multi-sound-source signal separation method of the present invention, based on the above system, comprises the following steps:
1) Identify and localize the target sound sources, obtaining the azimuth of each target source relative to the microphone array. This specifically includes:
11) Collect an audio clip in which all target sound sources are sounding simultaneously.
12) Process the audio data collected in step 11) with the frequency-divided MUSIC algorithm to compute the number N_s of target sound sources in the scene and the set A_s of their azimuths relative to the microphone array, where the index n over the azimuths is determined by the computed number of sources, 1 ≤ n ≤ N_s.
The specific method of step 12) is:
121) Select the microphone closest to the geometric center of the array layout as the reference microphone.
122) Compute the steering vector matrix Vec_steering, whose size is (FR, AR), where FR is the frequency resolution set by the user and AR is the angle resolution set by the user; Vec_steering[freq, angle] denotes the steering vector on frequency band freq when the source lies at azimuth angle. Each entry is computed from the band's mean frequency and the array geometry, where j is the imaginary unit, Avg_freq is the mean frequency of band freq, and d_i^angle is the distance difference between the i-th microphone and the reference microphone along the direction perpendicular to the source's azimuth when the source lies at azimuth angle, with 1 ≤ i ≤ N_m.
123) Perform a short-time Fourier transform on the collected audio clip to obtain a multidimensional matrix Arr_fft, which stores, for every channel of the clip, the energy in each frequency band as it varies over time.
124) From the multidimensional matrix Arr_fft, extract per band the matrix describing how the energy of all channels varies over time; the matrix for band f is denoted Arr_f. Compute the eigenvalues and eigenvectors of the covariance matrix of Arr_f and, together with the steering vector matrix, compute the total MUSIC spectrum on band f. Select the two subsets of Arr_f corresponding to two mutually orthogonal groups of channels, compute the eigenvalues and eigenvectors of each subset's covariance matrix, and, together with the steering vector matrix, compute the MUSIC spectrum of each of the two orthogonal channel groups on band f. Accumulate the total MUSIC spectra over all bands, and likewise the spectra of the two orthogonal channel groups, to obtain the final total MUSIC spectrum and the MUSIC spectra of the two orthogonal channel groups.
125) Collect all peaks of the total MUSIC spectrum computed in step 124). For each peak, if peaks also exist at the corresponding positions in the MUSIC spectra of both orthogonal channel groups, the peak corresponds to a target sound source, and its position gives the source's azimuth relative to the microphone array.
2) Identify the start and end times of active segments in the collected audio, and further identify the bearings of the target sources sounding within each segment. This specifically includes:
21) Read a segment of data from the microphone array and use it to identify the start and end times of active segments.
22) After the start and end times of an active segment have been identified, further identify the number N_as of target sources sounding in the segment and the set of their corresponding azimuths.
Step 21) specifically comprises:
211) Take an unprocessed data frame of length l bytes from the read audio segment and compute the average amplitude of the samples in the frame. If this average exceeds k times the average ambient-noise level A_n, the frame is an active frame; otherwise it is an inactive frame. Here l and k are user-set parameters, and l divides the size of the read audio segment exactly.
212) An active segment consists of consecutive active and inactive frames whose first and last frames are active. Once an active frame is detected, construction of an active segment begins: the segment consists of the first active frame, the last active frame found within the time range from the start time T_b of that first active frame to T_b + T_s, and all active and inactive frames between them; T_s is a user-set value giving the maximum duration of an active segment.
Step 22) specifically comprises:
221) Select the microphone closest to the geometric center of the array layout as the reference microphone.
222) Use the active segment to compute the total MUSIC spectrum and the MUSIC spectra of the two orthogonal channel groups; the computation is the same as in steps 122) to 124).
223) Collect all peaks of the total MUSIC spectrum from step 222). For each peak, if peaks also exist at the corresponding positions in the MUSIC spectra of both orthogonal channel groups, the peak corresponds to a sounding source, and its position gives that source's azimuth relative to the microphone array. Take the intersection of the azimuths of all sounding sources with the target-source azimuth set A_s to obtain the azimuth set of the target sources sounding in the active segment, where N_as is the size of the intersection, i.e. the number of sounding target sources.
3) For each sounding target source in the active segment, optimize per frequency band the steering vector used for beamforming, and perform per-band beamforming with the optimized steering vectors to obtain directionally enhanced audio. This specifically includes:
31) For each sounding target source identified in step 2), optimize per frequency band the steering vector used for beamforming, so as to weaken the enhancement that beamforming would otherwise give to the other sounding sources.
32) Use the steering vectors obtained in step 31) to beamform the spectrogram of each band, and regularize the beamforming results for smoothness. Adjust and combine the per-band results according to their weights, then apply an inverse Fourier transform to the combined result to obtain the directionally enhanced audio.
The specific method of step 31) is:
311) Compute the steering vector matrix Vec_steering, in the same way as in step 122).
312) When computing the beamforming for one sounding target source, the remaining sounding sources are treated as interfering sources. If there is no interfering source, the steering vector corresponding to the target source's azimuth is used directly for beamforming; otherwise, to prevent beamforming from amplifying the interfering sources, the beamforming steering vector is adjusted per frequency band.
When beamforming toward azimuth θ on band f, an enhancement coefficient is computed for an interfering source located at azimuth θ_infer, describing how strongly that interferer would be amplified by the beam.
When there are several interfering sources, the enhancement coefficient of each one is computed and the values are summed to give the total interference coefficient.
313) For each sounding target source θ_target, solve, under certain constraints, an optimization problem that yields the angle adjustment Δθ on band f; the steering vector indexed in Vec_steering by band f and angle θ_target + Δθ is then used as the steering vector for beamforming the target source θ_target on band f.
Step 32) specifically comprises:
321) Compute the spectrogram group Spec_raw of the active segment; Spec_raw consists of the spectrograms of the individual channels.
322) Beamform the spectrogram of each band. Specifically: extract from the spectrogram group the part corresponding to band f; using the steering vector obtained in step 313) as weights, form the weighted sum over channels to obtain the beamforming result for the source at azimuth θ_target on band f; then divide the result by the enhancement coefficient in the target direction to normalize it, giving the normalized beamforming result for that band.
323) Concatenate and combine the normalized beamforming results of all bands according to their weights to obtain the final beamforming result Spec_bf(θ_target). The weight of each band depends on the band f itself and on the target angle θ_target. Specifically: using the data of all channels, compute the MUSIC spectrum of band f, with the same computation as in step 12); the value of that spectrum at the target angle θ_target is used as the band's weight.
324) Apply an inverse Fourier transform to Spec_bf(θ_target) to obtain the audio directionally enhanced toward θ_target.
4) Denoise the directionally enhanced audio; the processed result is the final separation result. This specifically includes:
41) Take the beamforming result of the active segment and the original, unseparated audio collected on any one channel, apply a Fourier transform to each, take the magnitudes, and stack the two magnitude results into a two-channel image.
42) Feed the two-channel image of step 41) to the deep learning model to obtain the magnitude part Amp of the denoised audio's time-frequency representation.
43) Using the beamforming result obtained in step 32), fill in the phase of the magnitude time-frequency representation obtained in step 42) to produce the final separation result.
Referring to FIG. 3, the deep model used in step 42) is a Dual-Path GAN, which consists of a generator and a discriminator. In the generator, the two-channel image is first encoded by an Encoder built from dense convolution modules to produce an intermediate result; two successive groups of four conformer modules then process the intermediate result and its transpose respectively, and the two outputs are concatenated as the input to the Decoder. The Decoder decodes its input with dense convolutions and uses a convolution module to reduce the dimensionality of the output, yielding the magnitude part Amp of the denoised audio's time-frequency representation. The discriminator is used during model training: it stacks the magnitude map Amp produced by the generator with the magnitude map Amp_bf of the beamforming result into a two-channel image, and passes it through successive convolution modules to produce a judgment used to optimize the model.
The dense convolution module combines a DenseNet-style connection with a convolution module: like an ordinary convolution module it convolves its input, and it concatenates the convolution result with the inputs of all preceding dense convolution modules in the chain to form its output.
The specific method of step 43) is:
431) Apply a Fourier transform to the beamforming result to obtain its spectrogram, then take the phase information of the spectrogram to obtain the phase map Phase; the phase map has the same size as the spectrogram.
432) Multiply Amp and Phase element-wise at corresponding positions to obtain the time-frequency representation of the denoised audio.
433) Apply an inverse Fourier transform to the denoised time-frequency representation to obtain the denoised result.
The present invention can be applied in many specific ways. The above is only a preferred embodiment of the present invention; it should be pointed out that those of ordinary skill in the art can make several further improvements without departing from the principle of the present invention, and such improvements shall also fall within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410594815.0A CN118280384A (en) | 2024-05-14 | 2024-05-14 | Multi-sound source signal separation system and method based on microphone array |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410594815.0A CN118280384A (en) | 2024-05-14 | 2024-05-14 | Multi-sound source signal separation system and method based on microphone array |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118280384A true CN118280384A (en) | 2024-07-02 |
Family
ID=91642009
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410594815.0A Pending CN118280384A (en) | 2024-05-14 | 2024-05-14 | Multi-sound source signal separation system and method based on microphone array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118280384A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119012078A (en) * | 2024-10-24 | 2024-11-22 | 深圳市嘀嘟科技有限公司 | Directional audio transmission method and system based on beam forming |
- 2024-05-14 CN CN202410594815.0A patent/CN118280384A/en active Pending
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119012078A (en) * | 2024-10-24 | 2024-11-22 | 深圳市嘀嘟科技有限公司 | Directional audio transmission method and system based on beam forming |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10901063B2 (en) | Localization algorithm for sound sources with known statistics | |
JP6807029B2 (en) | Sound source separators and methods, and programs | |
US9008329B1 (en) | Noise reduction using multi-feature cluster tracker | |
EP2628316B1 (en) | Apparatus and method for deriving a directional information and computer program product | |
CN110223708B (en) | Speech enhancement method based on speech processing and related equipment | |
CN106504763A (en) | Multi-target Speech Enhancement Method Based on Microphone Array Based on Blind Source Separation and Spectral Subtraction | |
US11817112B2 (en) | Method, device, computer readable storage medium and electronic apparatus for speech signal processing | |
US9426564B2 (en) | Audio processing device, method and program | |
CN102421050A (en) | Apparatus and method for enhancing audio quality using non-uniform configuration of microphones | |
CN108091345B (en) | A Binaural Speech Separation Method Based on Support Vector Machine | |
US10477309B2 (en) | Sound field reproduction device, sound field reproduction method, and program | |
Wang et al. | Deep learning assisted time-frequency processing for speech enhancement on drones | |
US10410641B2 (en) | Audio source separation | |
WO2016056410A1 (en) | Sound processing device, method, and program | |
CN108520756B (en) | Method and device for separating speaker voice | |
CN118280384A (en) | Multi-sound source signal separation system and method based on microphone array | |
US9966081B2 (en) | Method and apparatus for synthesizing separated sound source | |
Hemavathi et al. | Voice conversion spoofing detection by exploring artifacts estimates | |
Gul et al. | Clustering of spatial cues by semantic segmentation for anechoic binaural source separation | |
CN111929638A (en) | Voice direction of arrival estimation method and device | |
Yamaoka et al. | CNN-based virtual microphone signal estimation for MPDR beamforming in underdetermined situations | |
Muñoz-Montoro et al. | Ambisonics domain singing voice separation combining deep neural network and direction aware multichannel NMF | |
Lluís et al. | Direction specific ambisonics source separation with end-to-end deep learning | |
CN117169812A (en) | Sound source positioning method based on deep learning and beam forming | |
Yeow et al. | Real-Time Sound Event Localization and Detection: Deployment Challenges on Edge Devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||