
CN114783450A - Audio processing method, device, computing equipment and medium - Google Patents


Info

Publication number
CN114783450A
CN114783450A
Authority
CN
China
Prior art keywords
audio
channel audio
sample
frequency
component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210351876.5A
Other languages
Chinese (zh)
Inventor
赵翔宇
刘华平
曹偲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Cloud Music Technology Co Ltd
Original Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Netease Cloud Music Technology Co Ltd filed Critical Hangzhou Netease Cloud Music Technology Co Ltd
Priority to CN202210351876.5A
Publication of CN114783450A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S5/00 Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
    • H04S5/005 Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation, of the pseudo five- or more-channel type, e.g. virtual surround

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Stereophonic System (AREA)

Abstract

The embodiments of the disclosure provide an audio processing method, an audio processing apparatus, a computing device, and a medium. Based on the left channel audio and the right channel audio corresponding to the audio to be processed, and on frequency band weight parameters, obtained through training of a first audio network, for the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, the principal component audio and the surround component audio corresponding to the target multi-channel audio are determined. Rendered audio is then obtained based on the surround component audio and target mapping parameters trained through a second audio network. Finally, the target multi-channel audio corresponding to the audio to be processed is obtained based on the principal component audio and the rendered audio. Because the audio processing uses parameters learned by audio processing networks, the loss of separation quality caused by unbalanced separation weights when the audio contains several dominant sound sources is avoided, so the processing effect of the audio processing method is improved.

Description

Audio processing method, device, computing equipment and medium
Technical Field
Embodiments of the present disclosure relate to the field of multimedia technologies, and in particular, to an audio processing method, an audio processing apparatus, a computing device, and a medium.
Background
This section is intended to provide a background or context to the embodiments of the disclosure. The description herein is not admitted to be prior art by inclusion in this section.
In order to render ordinary stereo audio with an immersive surround effect, it is usually necessary to upmix the two-channel audio to multi-channel audio.
In the related art, when two-channel audio is upmixed, principal component analysis (PCA) is often used to determine the principal component of the two-channel audio, a surround component (ambient) orthogonal to that principal component is then determined, and a linear combination of the determined principal component and surround component is used as the multi-channel audio, thereby achieving the upmix.
This implementation assumes that the principal component and the surround component are completely orthogonal. The method is therefore only suitable for audio containing a single dominant sound source; when the audio to be processed contains several dominant sound sources, the processing effect of the method degrades significantly.
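To make the related-art baseline concrete, the following minimal NumPy sketch shows what a PCA-style primary/ambient split of a single stereo frame can look like. It is an illustration only: the function name, the frame-wise covariance formulation, and the variable names are assumptions, not taken from any cited implementation.

```python
import numpy as np

def pca_primary_ambient(x_l, x_r):
    """Split one stereo frame into primary and ambient parts via PCA.

    x_l, x_r: 1-D arrays holding one frame of left/right samples.
    Returns (primary, ambient), each of shape (2, frame_len).
    """
    x = np.stack([x_l, x_r])            # shape (2, N)
    cov = x @ x.T / x.shape[1]          # 2x2 channel covariance
    _, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    v = eigvecs[:, -1:]                 # dominant direction = primary axis
    primary = v @ (v.T @ x)             # projection onto the primary axis
    ambient = x - primary               # orthogonal remainder
    return primary, ambient
```

The ambient part is orthogonal to the primary part by construction, which is exactly the assumption that breaks down when several dominant sources are present.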
Disclosure of Invention
In view of the problem in the related art that audio processing methods perform poorly when processing audio containing multiple dominant sound sources, embodiments of the present disclosure provide at least an audio processing method, an audio processing apparatus, a computing device, and a medium.
In a first aspect of embodiments of the present disclosure, there is provided an audio processing method, the method comprising:
determining a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to the audio to be processed and on frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, wherein the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio comprises a target left channel audio part and a target right channel audio part;
acquiring rendered audio based on the surrounding component audio and the target mapping parameter, wherein the target mapping parameter is obtained through training of a second audio network;
and acquiring target multi-channel audio corresponding to the audio to be processed based on the main component audio and the rendering audio corresponding to the target multi-channel audio.
In an embodiment of the present disclosure, determining a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to an audio to be processed and frequency band weight parameters of frequency bands to which the principal component audio and the surround component audio respectively correspond to the target multi-channel audio includes:
determining a principal component audio frequency based on the left channel audio frequency, the right channel audio frequency and a frequency band weight parameter corresponding to a frequency band to which the principal component audio frequency belongs;
and determining the surround component audio based on the left channel audio, the right channel audio and the frequency band weight parameter corresponding to the frequency band to which the surround component audio belongs.
In one embodiment of the present disclosure, the principal component audio includes a principal component left channel audio and a principal component right channel audio;
determining the principal component audio based on the left channel audio, the right channel audio and the frequency band weight parameter corresponding to the frequency band to which the principal component audio belongs, including:
based on a first frequency band weight parameter which corresponds to the frequency band of the principal component audio and is used for processing the left channel audio and a second frequency band weight parameter which is used for processing the right channel audio, carrying out weighted summation on the left channel audio and the right channel audio to obtain the principal component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a third frequency band weight parameter which corresponds to the frequency band to which the principal component audio belongs and is used for processing the left channel audio and a fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
In one embodiment of the present disclosure, surround component audio includes surround component left channel audio and surround component right channel audio;
determining surround component audio based on the left channel audio, the right channel audio, and a band weight parameter corresponding to a band to which the surround component audio belongs, including:
weighting and summing the left channel audio and the right channel audio based on a fifth frequency band weight parameter for processing the left channel audio and a sixth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a seventh frequency band weight parameter for processing the left channel audio and an eighth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, so as to obtain the surround component right channel audio.
In one embodiment of the present disclosure, surround component audio includes surround component left channel audio and surround component right channel audio;
obtaining rendered audio based on the surround component audio and the target mapping parameters, including:
and performing a weighted summation of the surround component left channel audio and the surround component right channel audio based on a first target mapping parameter, among the target mapping parameters, for processing the surround component left channel audio and a second target mapping parameter for processing the surround component right channel audio, to obtain the rendered audio.
In an embodiment of the present disclosure, acquiring target multi-channel audio corresponding to audio to be processed based on principal component audio and rendering audio includes:
and superposing the main component audio and the rendered audio to obtain target multi-channel audio corresponding to the audio to be processed.
In an embodiment of the present disclosure, the training process of the frequency band weight parameter and the target mapping parameter includes:
obtaining a sample left channel audio and a sample right channel audio based on the sample multi-channel audio;
determining a first sample audio feature, a second sample audio feature, and a third sample audio feature based on the sample left channel audio and the sample right channel audio, the first sample audio feature indicating a power sum of the sample left channel audio and the sample right channel audio, the second sample audio feature indicating a power difference of the sample left channel audio and the sample right channel audio, and the third sample audio feature indicating a real cross-correlation power of the sample left channel audio and the sample right channel audio;
determining a first sample audio and a second sample audio based on a sample left channel audio, a sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing a first sample audio characteristic, a second sample audio characteristic and a third sample audio characteristic through a first audio processing network;
obtaining a predicted rendering audio based on the second sample audio and a predicted mapping parameter obtained by processing the second sample audio through a second audio processing network;
obtaining a predicted multi-channel audio based on the first sample audio and the predicted rendered audio;
the first audio processing network and the second audio processing network are trained based on an objective loss function indicative of a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and an objective mapping parameter.
In one embodiment of the present disclosure, the sample multi-channel audio includes sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio;
obtaining sample left channel audio and sample right channel audio based on sample multi-channel audio, comprising:
weighting the sample center channel audio and the sample left surround channel audio respectively through preset weighting parameters, and determining the sample left channel audio based on a weighted result and the sample left front channel audio;
weighting the sample center channel audio and the sample right surround channel audio respectively through preset weighting parameters, and determining the sample right channel audio based on a weighted result and the sample right front channel audio.
In one embodiment of the disclosure, determining a first sample audio feature, a second sample audio feature, and a third sample audio feature based on a sample left channel audio and a sample right channel audio includes:
determining a first sample audio feature and a second sample audio feature based on the sample left channel audio, the complex conjugate audio corresponding to the sample left channel audio, the sample right channel audio, the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter;
a third sample audio feature is determined based on a real part portion of a multiplication result of the sample left channel audio and the complex conjugate audio corresponding to the sample right channel audio, and a target smoothing factor.
In one embodiment of the disclosure, before determining the first sample audio and the second sample audio based on the sample left channel audio, the sample right channel audio, and a plurality of predicted band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through the first audio processing network, the method further comprises:
inputting the first sample audio feature, the second sample audio feature, and the third sample audio feature into the first audio processing network, and processing them respectively through a first linear transformation layer and a second linear transformation layer included in the first audio processing network, to obtain a plurality of predicted frequency band weight parameters for obtaining the first sample audio and a plurality of predicted frequency band weight parameters for obtaining the second sample audio.
In an embodiment of the present disclosure, before obtaining the predicted rendered audio based on the second sample audio and the prediction mapping parameter obtained by processing the second sample audio through the second audio processing network, the method further includes:
and inputting the second sample audio into a second audio processing network, and processing the second sample audio through a linear transformation layer included in the second audio processing network to obtain a prediction mapping parameter.
In one embodiment of the disclosure, training a first audio processing network and a second audio processing network based on an objective loss function indicative of a difference between predicted multi-channel audio and sample multi-channel audio, resulting in a frequency band weight parameter and an objective mapping parameter, comprises:
determining a first loss function based on the predicted multi-channel audio and the sample multi-channel audio;
determining a plurality of second loss functions based on the predicted magnitude difference of the multi-channel audio between the channels and the magnitude difference of the sample multi-channel audio between the channels;
determining a target loss function based on the first loss function and the plurality of second loss functions;
and training the first audio processing network and the second audio processing network based on the target loss function until a training cutoff condition is met, and obtaining a frequency band weight parameter and a target mapping parameter.
In a second aspect of embodiments of the present disclosure, there is provided an audio processing apparatus comprising:
the determining module is configured to determine a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to the audio to be processed and on frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, the frequency band weight parameters being obtained through training of a first audio network, and the target multi-channel audio comprising a target left channel audio part and a target right channel audio part;
the first acquisition module is used for acquiring rendering audio based on surrounding component audio and target mapping parameters, and the target mapping parameters are obtained through second audio network training;
and the second acquisition module is used for acquiring the target multi-channel audio corresponding to the audio to be processed based on the main component audio and the rendering audio corresponding to the target multi-channel audio.
In an embodiment of the disclosure, the determining module, when configured to determine a main component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to an audio to be processed and frequency band weight parameters of frequency bands to which the main component audio and the surround component audio respectively correspond to the target multi-channel audio, includes:
a first determination unit configured to determine a principal component audio based on a left channel audio, a right channel audio, and a frequency band weight parameter corresponding to a frequency band to which the principal component audio belongs;
a second determining unit configured to determine the surround component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the surround component audio belongs.
In one embodiment of the present disclosure, the principal component audio includes a principal component left channel audio and a principal component right channel audio;
a first determining unit, when configured to determine the principal component audio based on the left channel audio, the right channel audio, and the frequency band weight parameter corresponding to the frequency band to which the principal component audio belongs, configured to:
based on a first frequency band weight parameter which corresponds to the frequency band of the principal component audio and is used for processing the left channel audio and a second frequency band weight parameter which is used for processing the right channel audio, carrying out weighted summation on the left channel audio and the right channel audio to obtain the principal component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a third frequency band weight parameter which corresponds to the frequency band to which the principal component audio belongs and is used for processing the left channel audio and a fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
In one embodiment of the present disclosure, the surround component audio includes surround component left channel audio and surround component right channel audio;
a second determining unit, when configured to determine the surround component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the surround component audio belongs, configured to:
weighting and summing the left channel audio and the right channel audio based on a fifth frequency band weight parameter for processing the left channel audio and a sixth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a seventh frequency band weight parameter for processing the left channel audio and an eighth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component right channel audio.
In one embodiment of the present disclosure, surround component audio includes surround component left channel audio and surround component right channel audio;
a first obtaining module, when configured to obtain rendered audio based on surround component audio and a target mapping parameter, configured to:
and performing a weighted summation of the surround component left channel audio and the surround component right channel audio based on a first target mapping parameter, among the target mapping parameters, for processing the surround component left channel audio and a second target mapping parameter for processing the surround component right channel audio, to obtain the rendered audio.
In an embodiment of the present disclosure, the second obtaining module, when configured to obtain target multi-channel audio corresponding to the audio to be processed based on the principal component audio and the rendered audio, is configured to:
and superposing the principal component audio and the rendering audio to obtain target multi-channel audio corresponding to the audio to be processed.
In one embodiment of the present disclosure, the apparatus further comprises a training module comprising:
a first obtaining unit configured to obtain a sample left channel audio and a sample right channel audio based on a sample multi-channel audio;
a first determining unit configured to determine a first sample audio feature, a second sample audio feature, and a third sample audio feature based on a sample left channel audio and a sample right channel audio, the first sample audio feature indicating a sum of powers of the sample left channel audio and the sample right channel audio, the second sample audio feature indicating a power difference between the sample left channel audio and the sample right channel audio, and the third sample audio feature indicating a real part cross-correlation power of the sample left channel audio and the sample right channel audio;
a second determining unit, configured to determine a first sample audio and a second sample audio based on a sample left channel audio, a sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through a first audio processing network;
a second obtaining unit, configured to obtain a predicted rendered audio based on a second sample audio and a prediction mapping parameter obtained by processing the second sample audio through a second audio processing network;
a third acquisition unit configured to acquire predicted multi-channel audio based on the first sample audio and the predicted rendered audio;
a training unit for training the first audio processing network and the second audio processing network based on a target loss function indicating a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and a target mapping parameter.
In one embodiment of the present disclosure, the sample multi-channel audio includes sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio;
a first obtaining unit, when configured to obtain a sample left channel audio and a sample right channel audio based on a sample multi-channel audio, configured to:
weighting the sample center channel audio and the sample left surround channel audio respectively through preset weighting parameters, and determining the sample left channel audio based on a weighted result and the sample left front channel audio;
and weighting the sample center channel audio and the sample right surround channel audio respectively through preset weighting parameters, and determining the sample right channel audio based on a weighted result and the sample right front channel audio.
In an embodiment of the disclosure, the first determining unit, when being configured to determine the first sample audio feature, the second sample audio feature and the third sample audio feature based on the sample left channel audio and the sample right channel audio, is configured to:
determining a first sample audio feature and a second sample audio feature based on the sample left channel audio, the complex conjugate audio corresponding to the sample left channel audio, the sample right channel audio, the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter;
a third sample audio feature is determined based on a real part portion of a multiplication result of the sample left channel audio and the complex conjugate audio corresponding to the sample right channel audio, and a target smoothing factor.
In one embodiment of the present disclosure, the training module further comprises:
the first processing unit is configured to input the first sample audio feature, the second sample audio feature, and the third sample audio feature into a first audio processing network, and process the first sample audio feature, the second sample audio feature, and the third sample audio feature through a first linear transformation layer and a second linear transformation layer included in the first audio processing network, respectively, to obtain multiple predicted frequency band weight parameters for obtaining the first sample audio and multiple predicted frequency band weight parameters for obtaining the second sample audio.
In one embodiment of the present disclosure, the apparatus further comprises:
and the second processing unit is used for inputting the second sample audio into the second audio processing network, and processing the second sample audio through a linear transformation layer included in the second audio processing network to obtain the prediction mapping parameters.
In an embodiment of the disclosure, the training unit, when being configured to train the first audio processing network and the second audio processing network based on an objective loss function indicative of a difference between predicted multi-channel audio and sample multi-channel audio, to obtain a frequency band weight parameter and an objective mapping parameter, is configured to:
determining a first loss function based on the predicted multi-channel audio and the sample multi-channel audio;
determining a plurality of second loss functions based on the predicted magnitude difference of the multi-channel audio between the respective channels and the magnitude difference of the sample multi-channel audio between the respective channels;
determining a target loss function based on the first loss function and the plurality of second loss functions;
and training the first audio processing network and the second audio processing network based on the target loss function until a training cutoff condition is met, and obtaining a frequency band weight parameter and a target mapping parameter.
In a third aspect of the disclosed embodiments, a computing device is provided, where the computing device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the program to implement the operations performed by the audio processing method provided in the first aspect and any embodiment of the first aspect.
In a fourth aspect of embodiments of the present disclosure, a computer-readable storage medium is provided, on which a program is stored, where the program is executed by a processor to perform the operations performed by the audio processing method provided by the first aspect and any embodiment of the first aspect.
In a fifth aspect of embodiments of the present disclosure, a computer program product is provided, which comprises a computer program that, when executed by a processor, performs the operations performed by the audio processing method according to the first aspect and any embodiment of the first aspect.
According to the audio processing method, apparatus, computing device, and medium of the embodiments of the disclosure, the principal component audio and the surround component audio corresponding to the target multi-channel audio are determined based on the left channel audio and the right channel audio corresponding to the audio to be processed, and on the frequency band weight parameters, obtained through training of the first audio network, for the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong. Rendered audio is then obtained based on the surround component audio and the target mapping parameters obtained through training of the second audio network, and the target multi-channel audio corresponding to the audio to be processed is further obtained based on the principal component audio and the rendered audio. Performing audio processing with parameters learned by the audio processing networks avoids the loss of separation quality caused by unbalanced separation weights when the audio contains several dominant sound sources. The number of dominant sound sources in the audio to be processed therefore does not need to be considered during processing, the accuracy of the resulting target multi-channel audio can be improved, and the processing effect of the audio processing method is improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a flow chart illustrating an audio processing method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a method for obtaining left and right channel audio based on the audio to be processed according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a parameter training process according to an exemplary embodiment of the present disclosure;
FIG. 4 is a network architecture diagram of a first audio processing network, shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 is a network architecture diagram of a second audio processing network shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating a training process according to an exemplary embodiment of the present disclosure;
FIG. 7 is a block diagram of an audio processing device shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a computer-readable storage medium illustrated in accordance with an exemplary embodiment of the present disclosure;
FIG. 9 is a schematic block diagram of a computing device shown in accordance with an exemplary embodiment of the present disclosure;
in the drawings, like or corresponding reference characters designate like or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.
According to an embodiment of the disclosure, an audio processing method, an audio processing device, a computing device and a medium are provided. The method described above may be performed by a computing device for processing audio to be processed to obtain target multi-channel audio. The computing device may be a server, such as one server, multiple servers, a server cluster, a cloud computing platform, or the like, and optionally, the computing device may also be a terminal device, such as a smart phone, a tablet computer, a desktop computer, a portable computer, a smart speaker, or the like, where the disclosure does not limit the device type and the device number of the computing device.
Having described an environment in which the present disclosure may be applied, a specific implementation of the present disclosure is described below.
Referring to fig. 1, fig. 1 is a flowchart illustrating an audio processing method according to an exemplary embodiment of the present disclosure, the method including:
step 101, determining a main component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to an audio to be processed and frequency band weight parameters of frequency bands to which the main component audio and the surround component audio respectively correspond to the target multi-channel audio, wherein the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio comprises a target left channel audio part and a target right channel audio part.
The audio to be processed may be two-channel audio. Optionally, the audio to be processed may also be multi-channel audio, in which case the number of channels in the audio to be processed is smaller than the number of channels in the target multi-channel audio.
And 102, acquiring rendered audio based on the surrounding component audio and the target mapping parameter, wherein the target mapping parameter is obtained through training of a second audio network.
And 103, acquiring target multi-channel audio corresponding to the audio to be processed based on the main component audio and the rendered audio corresponding to the target multi-channel audio.
The method determines the principal component audio and the surround component audio corresponding to the target multi-channel audio based on the left channel audio and the right channel audio corresponding to the audio to be processed, and on the frequency band weight parameters, obtained through training of the first audio network, for the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong. Rendered audio is then obtained based on the surround component audio and the target mapping parameters obtained through training of the second audio network, and the target multi-channel audio corresponding to the audio to be processed is further obtained based on the principal component audio and the rendered audio. Because the processing uses parameters learned by the audio processing networks, the loss of separation quality caused by unbalanced separation weights when the audio contains several dominant sound sources is avoided; the number of dominant sound sources in the audio to be processed need not be considered during processing, the accuracy of the resulting target multi-channel audio can be improved, and the processing effect of the audio processing method is improved.
Having described the basic processes of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
In some embodiments, before step 101, left channel audio and right channel audio may be acquired in advance based on the audio to be processed, so that the acquisition of the target multi-channel audio may be performed based on the left channel audio and the right channel audio. Alternatively, the acquisition of the left channel audio and the right channel audio may be performed by the following procedure.
Under the condition that the audio to be processed is the dual-channel audio, the left channel audio and the right channel audio included in the audio to be processed can be directly acquired.
When the audio to be processed is a multi-channel audio, the audio to be processed may be processed in advance to obtain a left channel audio and a right channel audio corresponding to the audio to be processed.
Taking the audio to be processed as a multi-channel audio including 5 channels as an example, the audio to be processed may include a left front channel audio, a right front channel audio, a center channel audio, a left surround channel audio, and a right surround channel audio, and then the left channel audio and the right channel audio corresponding to the audio to be processed may be obtained through the following processing manners:
weighting the center channel audio and the left surround channel audio respectively through preset weighting parameters, and determining the left channel audio based on a weighted result and the left front channel audio; and weighting the center channel audio and the right surround channel audio respectively through preset weighting parameters, and determining the right channel audio based on the weighted result and the right front channel audio.
For example, the left channel audio and the right channel audio may be determined by the following equations (1) and (2), respectively:
L=FL+a×C+a×RL (1)
R=FR+a×C+a×RR (2)
where L represents a left channel audio, R represents a right channel audio, FL represents a front left channel audio, FR represents a front right channel audio, C represents a center channel audio, RL represents a left surround channel audio, RR represents a right surround channel audio, and a represents a preset weight parameter.
Optionally, the preset weight parameter may take the value 0.71. Weighting audio with 0.71 is equivalent to attenuating it by about 3 decibels (dB). With the preset weight parameter set to 0.71, the principle of acquiring the left channel audio and the right channel audio is shown in FIG. 2, a schematic diagram of obtaining the left and right channel audio from the audio to be processed. When obtaining the left channel audio, the result of attenuating the center channel audio by 3 dB is summed with the front left channel audio, and that sum is added to the result of attenuating the left surround channel audio by 3 dB. When obtaining the right channel audio, the result of attenuating the center channel audio by 3 dB is summed with the front right channel audio, and that sum is added to the result of attenuating the right surround channel audio by 3 dB.
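For illustration, equations (1) and (2) amount to a few lines of NumPy. This is a minimal sketch under stated assumptions: the function and array names are invented here, and each array is taken to hold the time-domain samples of one channel.

```python
import numpy as np

def downmix_to_stereo(fl, fr, c, rl, rr, a=0.71):
    """Fold five channels down to left/right per equations (1) and (2).

    a = 0.71 corresponds to roughly a 3 dB attenuation of the
    center and surround channels before they are mixed in.
    """
    left = fl + a * c + a * rl    # L = FL + a*C + a*RL
    right = fr + a * c + a * rr   # R = FR + a*C + a*RR
    return left, right
```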
The above is only an exemplary way of obtaining left and right channel audio based on audio to be processed, and in more possible implementations, other ways may also be used to obtain left and right channel audio.
After the left and right channel audios corresponding to the audio to be processed are obtained, the target multi-channel audio can be obtained based on the obtained left and right channel audios.
In some embodiments, for step 101, when determining the main component audio and the surround component audio corresponding to the target multi-channel audio based on the left channel audio and the right channel audio corresponding to the audio to be processed and the frequency band weight parameters of the frequency bands to which the main component audio and the surround component audio respectively correspond to the target multi-channel audio, the following steps may be included:
step 1011, determining the principal component audio frequency based on the left channel audio frequency, the right channel audio frequency and the frequency band weight parameter corresponding to the frequency band to which the principal component audio frequency belongs.
The principal component audio may include a principal component left channel audio and a principal component right channel audio. Based on this, for the above step 1011, when determining the principal component audio, the following steps may be included:
and step 1011-1, performing weighted summation on the left channel audio and the right channel audio based on the first frequency band weight parameter for processing the left channel audio corresponding to the frequency band to which the principal component audio belongs and the second frequency band weight parameter for processing the right channel audio to obtain the principal component left channel audio.
For example, the principal component left channel audio may be obtained by the following formula (3):

PL(b,k,n) = w1(b)×XL(k,n) + w2(b)×XR(k,n) (3)

where PL represents the principal component left channel audio, w1 represents the first frequency band weight parameter, XL represents the left channel audio, w2 represents the second frequency band weight parameter, XR represents the right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
And 1011-2, performing weighted summation on the left channel audio and the right channel audio based on the third frequency band weight parameter which corresponds to the frequency band to which the principal component audio belongs and is used for processing the left channel audio and the fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
For example, the principal component right channel audio may be obtained by the following equation (4):

PR(b,k,n) = w3(b)×XL(k,n) + w4(b)×XR(k,n) (4)

where PR represents the principal component right channel audio, w3 represents the third frequency band weight parameter, XL represents the left channel audio, w4 represents the fourth frequency band weight parameter, XR represents the right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
Step 1012, determining the surround component audio based on the left channel audio, the right channel audio, and the frequency band weighting parameter corresponding to the frequency band to which the surround component audio belongs.
Wherein the surround component audio may include surround component left channel audio and surround component right channel audio. Based on this, for the above step 1012, when determining the surround component audio, the following steps may be included:
and 1012-1, performing weighted summation on the left channel audio and the right channel audio based on a fifth frequency band weight parameter corresponding to the frequency band to which the surround component audio belongs and used for processing the left channel audio and a sixth frequency band weight parameter used for processing the right channel audio to obtain a surround component left channel audio.
For example, the surround component left channel audio may be obtained by the following equation (5):

SL(b,k,n) = w5(b)×XL(k,n) + w6(b)×XR(k,n) (5)

where SL represents the surround component left channel audio, w5 represents the fifth frequency band weight parameter, XL represents the left channel audio, w6 represents the sixth frequency band weight parameter, XR represents the right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
And 1012-2, performing a weighted summation of the left channel audio and the right channel audio based on a seventh frequency band weight parameter, corresponding to the frequency band to which the surround component audio belongs, for processing the left channel audio and an eighth frequency band weight parameter for processing the right channel audio, to obtain the surround component right channel audio.

For example, the surround component right channel audio may be obtained by the following equation (6):

SR(b,k,n) = w7(b)×XL(k,n) + w8(b)×XR(k,n) (6)

where SR represents the surround component right channel audio, w7 represents the seventh frequency band weight parameter, XL represents the left channel audio, w8 represents the eighth frequency band weight parameter, XR represents the right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
It should be noted that, the reference numerals of the above steps do not limit the execution order of the above two steps, taking step 1011 and step 1012 as an example, optionally, step 1011 may be executed first and then step 1012 is executed, or step 1012 may be executed first and then step 1011 may be executed, or step 1011 and step 1012 may be executed simultaneously, and the present disclosure does not limit which execution order is specifically adopted.
Through the above process, the principal component audio and the surround component audio can be determined by linear estimation, and each can be expressed as a linear combination of the stereo (left and right channel) signals, so that the target multi-channel audio can be determined based on the determined principal component audio and surround component audio.
By using the frequency band weight parameters obtained by training the first audio processing network to separate the principal component audio and the surround component audio, a very high degree of separation can be achieved without needing pre-split source files. Moreover, because the separation is performed by an audio processing network, the loss of separation quality caused by unbalanced separation weights when the audio contains several dominant sound sources is avoided, so the audio separation effect can be improved.
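Since equations (3) through (6) share the same band-wise weighted-sum form, one helper suffices to illustrate all four. The sketch below is an assumption-laden illustration: the STFT layout, the weight shapes, and the bin-to-band mapping are chosen for clarity rather than taken from the patent.

```python
import numpy as np

def band_weighted_sum(x_l, x_r, w_left, w_right, band_of_bin):
    """Band-wise weighted sum of two channels, as in equations (3)-(6).

    x_l, x_r:         complex STFTs, shape (num_bins, num_frames)
    w_left, w_right:  weights per band, shape (num_bands, num_frames)
    band_of_bin:      int array mapping each frequency bin to its band
    """
    wl = w_left[band_of_bin, :]    # broadcast band weights to bins
    wr = w_right[band_of_bin, :]
    return wl * x_l + wr * x_r

# Principal left channel (eq. 3):  band_weighted_sum(XL, XR, w1, w2, bands)
# Surround right channel (eq. 6):  band_weighted_sum(XL, XR, w7, w8, bands)
```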
In some embodiments, for step 102, when the rendered audio is obtained based on the surround component audio and the target mapping parameters, this may be achieved by:
and performing weighted summation on the encircled divided left channel audio and the encircled divided right channel audio based on a first target mapping parameter used for processing the encircled divided left channel audio in the target mapping parameters and a second target mapping parameter used for processing the encircled divided right channel audio in the target mapping parameters to obtain rendered audio.
The surround component audio improves the diversity of the diffuse signals and creates a surround, immersive sound effect; when rendering it into multi-channel audio, channel-independent diffuse signals can be generated through a decorrelator. For example, the surround component signal may be rendered into multi-channel audio by the following equation (7) to obtain the rendered audio:
Ac(b,k,n) = oc,1(b)×SL(b,k,n) + oc,2(b)×SR(b,k,n) (7)

where Ac represents the rendered audio for output channel c, oc,1 represents the first target mapping parameter, SL represents the surround component left channel audio, oc,2 represents the second target mapping parameter, SR represents the surround component right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
Rendering the surround component audio with the target mapping parameters obtained by training the second audio processing network improves the rendering effect of the immersive audio.
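A sketch of equation (7) follows in the same style. The layout of the mapping-parameter array (one pair of parameters per output channel, band, and frame) is an assumption made for illustration.

```python
import numpy as np

def render_surround(s_l, s_r, o, band_of_bin):
    """Map the surround components to rendered audio per equation (7).

    s_l, s_r: surround component STFTs, shape (num_bins, num_frames)
    o:        target mapping parameters,
              shape (num_out_channels, 2, num_bands, num_frames)
    """
    num_out = o.shape[0]
    out = np.empty((num_out,) + s_l.shape, dtype=s_l.dtype)
    for c in range(num_out):
        oc1 = o[c, 0][band_of_bin, :]   # first target mapping parameter
        oc2 = o[c, 1][band_of_bin, :]   # second target mapping parameter
        out[c] = oc1 * s_l + oc2 * s_r
    return out
```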
In some embodiments, for step 103, when obtaining target multi-channel audio corresponding to the audio to be processed based on the principal component audio and the rendered audio, the following steps may be performed:
and superposing the main component audio and the rendered audio to obtain target multi-channel audio corresponding to the audio to be processed.
For example, the target multi-channel audio may be obtained by the following equation (8):

Uc(b,k,n) = Pc(b,k,n) + Ac(b,k,n) (8)

where Uc represents the target multi-channel audio for channel c, Pc represents the principal component audio (comprising the principal component left channel audio and the principal component right channel audio), Ac represents the rendered audio obtained based on the surround component audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
Since the principal component audio is a directional audio signal, its spatial sound image orientation must be maintained when it is mapped to multi-channel audio. The direction of the principal component audio therefore needs to be estimated and then mapped to the left and right channels according to that angle; channels other than the left and right channels do not need to retain the principal component audio. Hence the principal component audio in equation (8) includes only the principal component left channel audio and the principal component right channel audio, while the principal component audio of the other channels is mapped to 0 and can be ignored. This ensures that the principal component audio used to obtain the target multi-channel audio faithfully restores the spatial distribution of the original sound field.
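Equation (8) then reduces to a superposition. The sketch below assumes the rendered audio is stacked per output channel and, as described above, only the left and right output channels receive a principal component contribution.

```python
import numpy as np

def combine(principal_l, principal_r, rendered, left_idx=0, right_idx=1):
    """Superpose principal and rendered audio per equation (8).

    principal_l, principal_r: principal component STFTs (num_bins, num_frames)
    rendered: rendered audio, shape (num_out_channels, num_bins, num_frames)
    For channels other than left/right the principal part is mapped to 0.
    """
    target = rendered.copy()
    target[left_idx] += principal_l
    target[right_idx] += principal_r
    return target
```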
The above describes how the target multi-channel audio is obtained from the audio to be processed using parameters obtained by pre-training; the following describes the process of obtaining these parameters through training.
In some embodiments, the training process of the frequency band weight parameter and the target mapping parameter may refer to fig. 3, fig. 3 is a flowchart illustrating a parameter training process according to an exemplary embodiment of the present disclosure, and as shown in fig. 3, the parameter training process includes:
step 301, obtaining a sample left channel audio and a sample right channel audio based on the sample multi-channel audio.
Wherein the sample multi-channel audio may be audio in a multi-channel data set, and the sample multi-channel audio may include sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio.
Taking the example that the sample multi-channel audio includes the audio of the above 5 channels, the sample left channel audio and the sample right channel audio can be obtained through the following steps:
step 3011, weighting the sample center channel audio and the sample left surround channel audio respectively by preset weighting parameters, and determining the sample left channel audio based on the weighted result and the sample left front channel audio.
Step 3012, weighting the sample center channel audio and the sample right surround channel audio respectively by preset weighting parameters, and determining a sample right channel audio based on the weighted result and the sample right front channel audio.
The process of obtaining the sample left channel audio and the sample right channel audio through steps 3011 and 3012 may refer to formula (1) and formula (2) and the related descriptions, which are not repeated here.
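By way of illustration, the downmix of steps 3011 and 3012 may be sketched as follows; the weighting value 1/√2 is a common downmix convention assumed here for the preset weighting parameter, since the defining formulas (1) and (2) appear earlier in this disclosure:

```python
import numpy as np

def downmix_5ch_to_stereo(front_left, front_right, center,
                          left_surround, right_surround, w=1 / np.sqrt(2)):
    """Obtain sample left/right channel audio from 5-channel sample
    multi-channel audio (steps 3011 and 3012).

    `w` is the preset weighting parameter applied to the center and
    surround channels before they are added to the front channels.
    """
    sample_left = front_left + w * center + w * left_surround
    sample_right = front_right + w * center + w * right_surround
    return sample_left, sample_right
```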
Step 302, determining a first sample audio feature, a second sample audio feature and a third sample audio feature based on the sample left channel audio and the sample right channel audio, the first sample audio feature being used for indicating a power sum of the sample left channel audio and the sample right channel audio, the second sample audio feature being used for indicating a power difference of the sample left channel audio and the sample right channel audio, and the third sample audio feature being used for indicating a real part cross-correlation power of the sample left channel audio and the sample right channel audio.
In one possible implementation, the first, second and third sample audio features may be determined by:
step 3021, determining a first sample audio feature and a second sample audio feature based on the sample left channel audio, the complex conjugate audio corresponding to the sample left channel audio, the sample right channel audio, the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter.
For example, the first sample audio feature and the second sample audio feature may be determined by the following equation (9) and equation (10), respectively:

Φ_S^b(k,n) = α·Φ_S^b(k,n−1) + (1−α)·[X_L^b(k,n)·X_L^b*(k,n) + X_R^b(k,n)·X_R^b*(k,n)]    (9)

Φ_D^b(k,n) = α·Φ_D^b(k,n−1) + (1−α)·[X_L^b(k,n)·X_L^b*(k,n) − X_R^b(k,n)·X_R^b*(k,n)]    (10)

wherein Φ_S represents the first sample audio feature (i.e., the power sum of the sample left channel audio and the sample right channel audio), Φ_D represents the second sample audio feature (i.e., the power difference between the sample left channel audio and the sample right channel audio), X_L represents the sample left channel audio, X_L* represents the complex conjugate audio of the sample left channel audio, X_R represents the sample right channel audio, X_R* represents the complex conjugate audio of the sample right channel audio, b represents the frequency band number, k represents the frequency point number, n represents the time series number, and α represents the target smoothing parameter, which may optionally be any preset value.
Step 3022, determining a third sample audio feature based on the real part of the multiplication result of the sample left channel audio and the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter.
For example, the third sample audio feature may be determined by the following equation (11):

Φ_XC^b(k,n) = α·Φ_XC^b(k,n−1) + (1−α)·R[X_L^b(k,n)·X_R^b*(k,n)]    (11)

wherein Φ_XC represents the third sample audio feature (i.e., the real part cross-correlation power of the sample left channel audio and the sample right channel audio), R[·] represents taking the real part of the result, X_L represents the sample left channel audio, X_R* represents the complex conjugate audio of the sample right channel audio, b represents the frequency band number, k represents the frequency point number, n represents the time series number, and α represents the target smoothing parameter.
Step 303, determining a first sample audio and a second sample audio based on the sample left channel audio, the sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through the first audio processing network.
Wherein the first sample audio is sample principal component audio and the second sample audio is sample surround component audio.
It should be noted that, before step 303, the first sample audio feature, the second sample audio feature, and the third sample audio feature may be input into the first audio processing network, so that the plurality of predicted frequency band weight parameters are obtained through the first audio processing network.
In one possible implementation, the first audio processing network may be a Deep Neural Network (DNN). The first audio processing network may include a first linear transformation layer and a second linear transformation layer, and each of the first linear transformation layer and the second linear transformation layer may be composed of linear transformation layers using a Rectified Linear Unit (ReLU) function as an activation function and a linear transformation layer using a normalization function as an activation function.
Taking the case where the first audio processing network includes the first linear transformation layer and the second linear transformation layer as an example, after the first sample audio feature, the second sample audio feature, and the third sample audio feature are input into the first audio processing network, they may be processed through the first linear transformation layer and the second linear transformation layer respectively, to obtain a plurality of predicted frequency band weight parameters for obtaining the first sample audio and a plurality of predicted frequency band weight parameters for obtaining the second sample audio. In step 303, the first sample audio and the second sample audio may then be determined based on these predicted frequency band weight parameters.
Optionally, when the first sample audio feature, the second sample audio feature, and the third sample audio feature are input to the first audio processing network, the three features may first be spliced to obtain a target feature, and the target feature obtained by splicing is then input into the first audio processing network.
Referring to fig. 4, fig. 4 is a schematic diagram of a network structure of a first audio processing network shown in the present disclosure according to an exemplary embodiment. As shown in fig. 4, each of the first linear transformation layer and the second linear transformation layer may be composed of three linear transformation layers using a ReLU function as an activation function and a linear transformation layer using a Sigmoid function as an activation function. When a plurality of predicted frequency band weight parameters are obtained through the first audio processing network shown in fig. 4, the first sample audio feature, the second sample audio feature, and the third sample audio feature may be input into the first linear transformation layer and the second linear transformation layer, so as to obtain a plurality of predicted frequency band weight parameters for weighting the sample left channel audio and a plurality of predicted frequency band weight parameters for weighting the sample right channel audio, respectively. The internal processing of the first and second linear transformation layers is described below.
For the first linear transformation layer, the input features may be sequentially subjected to linear rectification processing through the three linear transformation layers using the ReLU function as the activation function included in the first linear transformation layer, to obtain rectified features; the rectified features are then normalized through the linear transformation layer using the Sigmoid function as the activation function, and a plurality of predicted frequency band weight parameters for obtaining the first sample audio (including a first predicted frequency band weight parameter w_1, a second predicted frequency band weight parameter w_2, a third predicted frequency band weight parameter w_3, and a fourth predicted frequency band weight parameter w_4) are determined based on the normalized features.
For the second linear transformation layer, the input features may be sequentially subjected to linear rectification processing through the three linear transformation layers using the ReLU function as the activation function included in the second linear transformation layer, to obtain rectified features; the rectified features are then normalized through the linear transformation layer using the Sigmoid function as the activation function, and a plurality of predicted frequency band weight parameters for obtaining the second sample audio (including a fifth predicted frequency band weight parameter w_5, a sixth predicted frequency band weight parameter w_6, a seventh predicted frequency band weight parameter w_7, and an eighth predicted frequency band weight parameter w_8) are determined based on the normalized features.
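For illustration only, a minimal PyTorch sketch of such a two-branch network follows; the hidden layer width and input dimensionality are assumptions, as the present disclosure only specifies three ReLU linear transformation layers followed by a Sigmoid linear transformation layer per branch:

```python
import torch
import torch.nn as nn

class FirstAudioProcessingNetwork(nn.Module):
    """Two branches of (3 x Linear+ReLU) followed by (Linear+Sigmoid), as in fig. 4.
    Input: the spliced target feature formed from the first, second, and third
    sample audio features. Output: four predicted frequency band weight
    parameters per branch (w_1..w_4 for the first sample audio,
    w_5..w_8 for the second sample audio)."""

    def __init__(self, in_dim, hidden=128, n_weights=4):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_weights), nn.Sigmoid(),
            )
        self.principal_branch = branch()  # first linear transformation layer
        self.surround_branch = branch()   # second linear transformation layer

    def forward(self, phi_s, phi_d, phi_xc):
        # Splice the three sample audio features into the target feature.
        target_feature = torch.cat([phi_s, phi_d, phi_xc], dim=-1)
        return self.principal_branch(target_feature), self.surround_branch(target_feature)
```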
After the plurality of predicted frequency band weight parameters for obtaining the first sample audio and the plurality of predicted frequency band weight parameters for obtaining the second sample audio are determined, the first sample audio and the second sample audio may be determined based on these predicted frequency band weight parameters; the specific determination manner may refer to step 101 and the descriptions of equations (3) to (6), which are not repeated here.
Step 304, obtaining a predicted rendered audio based on the second sample audio and the prediction mapping parameters obtained by processing the second sample audio through the second audio processing network.
It should be noted that, before step 304, a second sample audio may be input into a second audio processing network, so as to obtain the prediction mapping parameters through the second audio processing network.
In one possible implementation, the second audio processing network may be a DNN. Wherein the second audio processing network may comprise a linear transformation layer, which may consist of a linear transformation layer using a ReLU function as an activation function and a linear transformation layer using a Sigmoid function as an activation function.
When the prediction mapping parameters are obtained by the second audio processing network, the second sample audio may be processed by a linear transformation layer included in the second audio processing network to obtain the prediction mapping parameters, so that the obtaining of the rendered audio may be performed based on the prediction mapping parameters in step 304.
Referring to fig. 5, fig. 5 is a schematic network structure diagram of a second audio processing network according to an exemplary embodiment of the present disclosure. As shown in fig. 5, the linear transformation layer of the second audio processing network is composed of two linear transformation layers using a ReLU function as an activation function and a linear transformation layer using a Sigmoid function as an activation function. When the prediction mapping parameters are obtained through the second audio processing network shown in fig. 5, the second sample audio represented in vector form may be input to the linear transformation layer to obtain the prediction mapping parameters. The internal processing of the linear transformation layer is explained below.
For example, the input features may be sequentially subjected to linear rectification processing through the two linear transformation layers using the ReLU function as the activation function, to obtain rectified features; the rectified features are then normalized through the linear transformation layer using the Sigmoid function as the activation function, and the prediction mapping parameters (including a first prediction mapping parameter o_{c,1} and a second prediction mapping parameter o_{c,2}) are determined based on the normalized features.
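A corresponding minimal sketch of the second audio processing network, under the same assumptions about layer widths and input dimensionality, is:

```python
import torch.nn as nn

class SecondAudioProcessingNetwork(nn.Module):
    """Two Linear+ReLU layers followed by Linear+Sigmoid, as in fig. 5.
    Input: the second sample audio (surround component) in vector form.
    Output: prediction mapping parameters o_{c,1} and o_{c,2} per channel c."""

    def __init__(self, in_dim, hidden=128, n_channels=5, n_params=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_channels * n_params), nn.Sigmoid(),
        )
        self.n_channels, self.n_params = n_channels, n_params

    def forward(self, surround_vec):
        out = self.net(surround_vec)
        # Reshape to one (o_{c,1}, o_{c,2}) pair per output channel c.
        return out.view(-1, self.n_channels, self.n_params)
```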
After the prediction mapping parameters are determined, the rendered audio may be obtained based on the determined prediction mapping parameters; the specific obtaining manner may refer to step 102 and the description related to formula (7), which is not repeated here.
Step 305, obtaining a predicted multi-channel audio based on the first sample audio and the predicted rendered audio.
It should be noted that, the implementation process of step 305 may refer to the description related to step 103, and is not described herein again.
Step 306, training the first audio processing network and the second audio processing network based on a target loss function indicating a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and a target mapping parameter.
The above step 306 can be realized by the following steps:
step 3061, a first loss function is determined based on the predicted multi-channel audio and the sample multi-channel audio.
The first loss function may be a minimum Mean-Square Error (MSE) loss function. For example, the first loss function can be seen in equation (12) below:

L_C = (1/K) · Σ_k Σ_n (Y_C(k,n) − U_C(k,n))²    (12)

wherein L_C represents the first loss function, Y_C represents the sample multi-channel audio, U_C represents the predicted multi-channel audio, K represents the number of frequency bands, k represents the frequency bin number, and n represents the time series number.
Step 3062, determining a plurality of second loss functions based on the magnitude differences between the channels of the predicted multi-channel audio and the magnitude differences between the channels of the sample multi-channel audio.
The second loss function may be an Inter-Channel Level Difference (ICLD) loss function. For example, the second loss function can be seen in equation (13) below:

L_ICLD = (1/K) · Σ_k Σ_n (ΔY(k,n) − ΔU(k,n))²    (13)

wherein L_ICLD represents the second loss function, ΔY(k,n) represents the magnitude difference between any two channels of the sample multi-channel audio, ΔU(k,n) represents the magnitude difference between the same two channels of the predicted multi-channel audio, K represents the number of frequency bands, k represents the frequency point number, and n represents the time series number.
Taking the case where the sample multi-channel audio and the predicted multi-channel audio each include 7 channels as an example, there are C(7,2) = 21 channel pairs, and thus 21 second loss functions between the sample multi-channel audio and the predicted multi-channel audio.
Step 3063, a target loss function is determined based on the first loss function and the plurality of second loss functions.
In one possible implementation, the sum of the first loss function and the plurality of second loss functions may be determined as the target loss function. For example, the target loss function may be determined by the following equation (14):

L = L_C + Σ_{(i,j)∈C} L_ICLD^(i,j)    (14)

wherein L represents the target loss function, L_C represents the first loss function, L_ICLD^(i,j) represents the second loss function of the channel pair (i,j), and C represents the set of channel pairs formed by the plurality of channels.
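For illustration, the target loss of equation (14) may be computed as in the following sketch; defining the inter-channel magnitude difference on STFT magnitudes is an assumption consistent with the ICLD description above:

```python
import itertools
import torch

def target_loss(predicted, sample):
    """Combined loss of equation (14): the MSE term of equation (12) plus one
    ICLD term per channel pair per equation (13). `predicted` and `sample`
    are complex STFT tensors of shape (num_channels, num_bins, num_frames)."""
    # First loss function: mean-square error between predicted and sample audio.
    loss = torch.mean(torch.abs(sample - predicted) ** 2)

    # Second loss functions: one per channel pair (21 pairs for 7 channels).
    num_channels = sample.shape[0]
    for i, j in itertools.combinations(range(num_channels), 2):
        delta_sample = torch.abs(sample[i]) - torch.abs(sample[j])
        delta_pred = torch.abs(predicted[i]) - torch.abs(predicted[j])
        loss = loss + torch.mean((delta_sample - delta_pred) ** 2)
    return loss
```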
Step 3064, training the first audio processing network and the second audio processing network based on the target loss function until the training cutoff condition is met, and obtaining a frequency band weight parameter and a target mapping parameter.
It should be noted that, under the condition that the training cutoff condition is satisfied, the trained first audio processing network and the trained second audio processing network may be obtained, the predicted frequency band weight parameter output by the trained first audio processing network is the frequency band weight parameter to be obtained, and the predicted mapping parameter output by the trained second audio processing network is the target mapping parameter to be obtained.
The training cutoff condition may be that the function value of the target loss function satisfies a set condition, or that the number of iterations reaches a preset number. Optionally, the training cutoff condition may also be another condition, which is not limited in this disclosure.
It should be noted that the training process shown in fig. 3 is an iterative process. Each time a sample left channel audio and a sample right channel audio are obtained through step 301, they are processed through steps 302 to 305 to obtain a predicted multi-channel audio, and the first audio processing network and the second audio processing network are trained based on the target loss function through step 306. The next sample left channel audio and sample right channel audio are then processed through steps 302 to 305 to obtain a predicted multi-channel audio, and the networks obtained from the previous round are trained again based on the target loss function through step 306. This continues until the training cutoff condition is met, yielding the trained first audio processing network and second audio processing network, and thereby the frequency band weight parameter and the target mapping parameter.
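A skeletal version of this iterative loop is sketched below; the Adam optimizer, the fixed iteration budget, and the helper functions `downmix`, `separate_components`, and `render` are hypothetical stand-ins for steps 301 and 303 to 305, not the reference implementation of the present disclosure:

```python
import torch

def train(dataset, net1, net2, num_iters=10000, lr=1e-4):
    """Iterative training of the first and second audio processing networks
    (steps 301 to 306). `dataset` yields sample multi-channel STFT tensors;
    `downmix`, `separate_components`, and `render` are hypothetical helpers,
    and `sample_audio_features` / `target_loss` are the sketches above
    (assumed here to operate on torch tensors)."""
    optimizer = torch.optim.Adam(
        list(net1.parameters()) + list(net2.parameters()), lr=lr)

    for step, sample_multichannel in zip(range(num_iters), dataset):
        x_left, x_right = downmix(sample_multichannel)                  # step 301
        phi_s, phi_d, phi_xc = sample_audio_features(x_left, x_right)   # step 302
        w_principal, w_surround = net1(phi_s, phi_d, phi_xc)
        principal, surround = separate_components(
            x_left, x_right, w_principal, w_surround)                   # step 303
        predicted = render(principal, surround, net2(surround))         # steps 304-305
        loss = target_loss(predicted, sample_multichannel)              # step 306
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```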
A deep neural network is introduced into the link of separating the principal component audio and the surround component audio, and this network is trained. The network comprises a plurality of linear transformation layers, which perform linear rectification processing on the input features of each layer; after the iterative linear rectification processing realized by the plurality of linear transformation layers, the outputs for the different channels are obtained through the linear transformation layer using the Sigmoid function as the activation function, so that channel separation through a deep neural network is realized. By exploiting the advantages of a trainable deep neural network, the method and the device improve the degree of separation between the audio of different channels, and at the same time reduce the audio distortion and sound quality damage caused by time-domain and frequency-domain transformation and nonlinear processing of the audio.
In addition, a deep neural network is also introduced in the rendering link of the surround component audio. Its introduction avoids the situation in which the learning factors of some channels are empty when the surround component audio is rendered for audio containing a plurality of principal components. The method can therefore adapt to audio with a plurality of principal sound sources, and the audio quality of the generated multi-channel audio can be improved.
Based on the various optional embodiments involved in the above training process, fig. 6 is a flowchart of a training process shown in the present disclosure according to an exemplary embodiment. As shown in fig. 6, after sample multi-channel audio is obtained from the multi-channel data set, a downmix may be performed through step 301 to obtain the sample left channel audio and sample right channel audio. The principal component and the surround component are then separated through steps 302 to 303, yielding the first sample audio as the principal component audio and the second sample audio as the surround component audio. Mapping is performed based on the first sample audio, and rendering is performed based on the second sample audio through step 304; the predicted multi-channel audio is then obtained through step 305 from the mapping result of the first sample audio and the rendering result of the second sample audio. Finally, in step 306, training is performed through back propagation based on the loss function indicating the difference between the predicted multi-channel audio and the sample multi-channel audio. The specific implementation process may refer to steps 301 to 306 and is not repeated here.
Having described the audio processing method of the exemplary embodiments of the present disclosure, next, the structures of the audio processing apparatus of the exemplary embodiments of the present disclosure and the computing device for implementing the audio processing method will be explained.
Referring to fig. 7, fig. 7 is a block diagram of an audio processing apparatus according to an exemplary embodiment of the present disclosure, the apparatus including:
a determining module 701, configured to determine a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to audio to be processed and frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, where the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio includes a target left channel audio portion and a target right channel audio portion;
a first obtaining module 702, configured to obtain a rendered audio based on the surround component audio and a target mapping parameter, where the target mapping parameter is obtained through a second audio network training;
the second obtaining module 703 is configured to obtain a target multi-channel audio corresponding to the audio to be processed based on the principal component audio and the rendered audio corresponding to the target multi-channel audio.
In an embodiment of the present disclosure, the determining module 701, when configured to determine the principal component audio and the surround component audio corresponding to the target multi-channel audio based on the left channel audio and the right channel audio corresponding to the audio to be processed and the frequency band weight parameters corresponding to the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio belong, includes:
a first determination unit configured to determine a principal component audio based on a left channel audio, a right channel audio, and a frequency band weight parameter corresponding to a frequency band to which the principal component audio belongs;
a second determining unit configured to determine the surround component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the surround component audio belongs.
In one embodiment of the present disclosure, the principal component audio includes a principal component left channel audio and a principal component right channel audio;
a first determining unit, when configured to determine the principal component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the principal component audio belongs, configured to:
based on a first frequency band weight parameter which corresponds to the frequency band of the principal component audio and is used for processing the left channel audio and a second frequency band weight parameter which is used for processing the right channel audio, carrying out weighted summation on the left channel audio and the right channel audio to obtain the principal component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a third frequency band weight parameter which corresponds to the frequency band to which the principal component audio belongs and is used for processing the left channel audio and a fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
In one embodiment of the present disclosure, surround component audio includes surround component left channel audio and surround component right channel audio;
a second determination unit, when configured to determine the surround component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the surround component audio belongs, configured to:
weighting and summing the left channel audio and the right channel audio based on a fifth frequency band weight parameter for processing the left channel audio and a sixth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component left channel audio;

and weighting and summing the left channel audio and the right channel audio based on a seventh frequency band weight parameter for processing the left channel audio and an eighth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component right channel audio.
In one embodiment of the present disclosure, the surround component audio includes surround component left channel audio and surround component right channel audio;
a first obtaining module 702, when configured to obtain rendered audio based on surround component audio and the target mapping parameters, is configured to:
and performing weighted summation on the surround component left channel audio and the surround component right channel audio based on a first target mapping parameter used for processing the surround component left channel audio in the target mapping parameters and a second target mapping parameter used for processing the surround component right channel audio in the target mapping parameters, to obtain the rendered audio.
In an embodiment of the present disclosure, the second obtaining module 703, when configured to obtain target multi-channel audio corresponding to the audio to be processed based on the principal component audio and the rendered audio, is configured to:
superposing the principal component audio and the rendered audio to obtain the target multi-channel audio corresponding to the audio to be processed.
In one embodiment of the present disclosure, the apparatus further comprises a training module comprising:
a first obtaining unit configured to obtain a sample left channel audio and a sample right channel audio based on a sample multi-channel audio;
a first determining unit, configured to determine, based on a sample left channel audio and a sample right channel audio, a first sample audio feature, a second sample audio feature, and a third sample audio feature, the first sample audio feature being used to indicate a power sum of the sample left channel audio and the sample right channel audio, the second sample audio feature being used to indicate a power difference of the sample left channel audio and the sample right channel audio, and the third sample audio feature being used to indicate a real part cross-correlation power of the sample left channel audio and the sample right channel audio;
a second determining unit, configured to determine a first sample audio and a second sample audio based on a sample left channel audio, a sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through a first audio processing network;
a second obtaining unit, configured to obtain a predicted rendered audio based on a second sample audio and a prediction mapping parameter obtained by processing the second sample audio through a second audio processing network;
a third acquisition unit configured to acquire predicted multi-channel audio based on the first sample audio and the predicted rendered audio;
a training unit for training the first audio processing network and the second audio processing network based on a target loss function indicating a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and a target mapping parameter.
In one embodiment of the present disclosure, the sample multi-channel audio includes sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio;
a first obtaining unit, when configured to obtain a sample left channel audio and a sample right channel audio based on a sample multi-channel audio, configured to:
weighting the sample center channel audio and the sample left surround channel audio respectively through preset weighting parameters, and determining the sample left channel audio based on a weighted result and the sample front left channel audio;

weighting the sample center channel audio and the sample right surround channel audio respectively through the preset weighting parameters, and determining the sample right channel audio based on a weighted result and the sample front right channel audio.
In an embodiment of the disclosure, the first determining unit, when being configured to determine the first sample audio feature, the second sample audio feature and the third sample audio feature based on the sample left channel audio and the sample right channel audio, is configured to:
determining a first sample audio feature and a second sample audio feature based on the sample left channel audio, the complex conjugate audio corresponding to the sample left channel audio, the sample right channel audio, the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter;
determining a third sample audio feature based on the real part of the multiplication result of the sample left channel audio and the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter.
In one embodiment of the present disclosure, the training module further comprises:
the first processing unit is configured to input the first sample audio feature, the second sample feature, and the third sample audio feature into a first audio processing network, and process the first sample audio feature, the second sample feature, and the third sample audio feature through a first linear transformation layer and a second linear transformation layer included in the first audio processing network, respectively, to obtain a plurality of predicted frequency band weight parameters for obtaining the first sample audio and a plurality of predicted frequency band weight parameters for obtaining the second sample audio.
In one embodiment of the present disclosure, the apparatus further comprises:
and the second processing unit is used for inputting the second sample audio into a second audio processing network, and processing the second sample audio through a linear transformation layer included in the second audio processing network to obtain the prediction mapping parameter.
In an embodiment of the disclosure, the training unit, when being configured to train the first audio processing network and the second audio processing network based on an objective loss function indicative of a difference between predicted multi-channel audio and sample multi-channel audio, to obtain a frequency band weight parameter and an objective mapping parameter, is configured to:
determining a first loss function based on the predicted multi-channel audio and the sample multi-channel audio;
determining a plurality of second loss functions based on the predicted magnitude difference of the multi-channel audio between the channels and the magnitude difference of the sample multi-channel audio between the channels;
determining a target loss function based on the first loss function and the plurality of second loss functions;
and training the first audio processing network and the second audio processing network based on the target loss function until a training cutoff condition is met, and obtaining a frequency band weight parameter and a target mapping parameter.
It should be noted that although several modules/units of the audio processing apparatus are mentioned in the above detailed description, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules/units described above may be embodied in one module/unit. Conversely, the features and functionality of one module/unit described above may be further divided into a plurality of modules/units.
The embodiment of the disclosure also provides a computer readable storage medium. Fig. 8 is a schematic diagram of a computer-readable storage medium shown in the present disclosure according to an exemplary embodiment, as shown in fig. 8, the storage medium has a computer program 801 stored thereon, and when executed by a processor, the computer program 801 may perform an audio processing method provided by any embodiment of the present disclosure.
Embodiments of the present disclosure also provide a computing device that may include a memory for storing computer instructions executable on a processor, the processor for implementing an audio processing method provided by any of the embodiments of the present disclosure when executing the computer instructions. Referring to fig. 9, fig. 9 is a schematic block diagram illustrating a computing device 900 according to an exemplary embodiment of the present disclosure, where the computing device 900 may include, but is not limited to: a processor 910, a memory 920, and a bus 930 that couples various system components including the memory 920 and the processor 910.
The memory 920 stores computer instructions executable by the processor 910, so that the processor 910 can perform the audio processing method provided by any embodiment of the present disclosure. The memory 920 may include a random access memory unit (RAM) 921, a cache memory unit 922, and/or a read-only memory unit (ROM) 923. The memory 920 may further include a program tool 925 having a set of program modules 924, the program modules 924 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 930 may include, for example, a data bus, an address bus, a control bus, and the like. The computing device 900 may also communicate with external devices 950, such as a keyboard, bluetooth device, etc., through the I/O interface 940. The computing device 900 may also communicate with one or more networks through a network adapter 960, such as a local area network, a wide area network, a public network, and so forth. As shown in fig. 9, the network adapter 960 may also communicate with other modules of the computing device 900 via the bus 930.
Embodiments of the present disclosure also provide a computer program product, which includes a computer program, and when the program is executed by the processor 910 of the computing device 900, the audio processing method provided by any embodiment of the present disclosure may be implemented.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of audio processing, the method comprising:
determining a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to audio to be processed and frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, wherein the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio comprises a target left channel audio part and a target right channel audio part;

acquiring rendered audio based on the surround component audio and a target mapping parameter, wherein the target mapping parameter is obtained through second audio network training;

and acquiring the target multi-channel audio corresponding to the audio to be processed based on the principal component audio corresponding to the target multi-channel audio and the rendered audio.
2. The method of claim 1, wherein determining the principal component audio and the surround component audio corresponding to the target multi-channel audio based on the left channel audio and the right channel audio corresponding to the audio to be processed and the frequency band weight parameters corresponding to the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio belong respectively comprises:
determining a principal component audio based on the left channel audio, the right channel audio and a frequency band weight parameter corresponding to a frequency band to which the principal component audio belongs;
determining the surround component audio based on the left channel audio, the right channel audio, and a band weight parameter corresponding to a band to which the surround component audio belongs.
3. The method of claim 2, wherein the principal component audio comprises a principal component left channel audio and a principal component right channel audio;
the determining the principal component audio based on the left channel audio, the right channel audio, and a frequency band weight parameter corresponding to a frequency band to which the principal component audio belongs includes:
weighting and summing the left channel audio and the right channel audio based on a first frequency band weight parameter which corresponds to a frequency band to which a principal component audio belongs and is used for processing the left channel audio and a second frequency band weight parameter which is used for processing the right channel audio to obtain the principal component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a third frequency band weight parameter which corresponds to the frequency band of the principal component audio and is used for processing the left channel audio and a fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
4. The method of claim 2, wherein the surround component audio comprises surround component left channel audio and surround component right channel audio;
the determining the surround component audio based on the left channel audio, the right channel audio, and a band weight parameter corresponding to a band to which the surround component audio belongs includes:
weighting and summing the left channel audio and the right channel audio based on a fifth frequency band weight parameter corresponding to the frequency band of the surround component audio and used for processing the left channel audio and a sixth frequency band weight parameter used for processing the right channel audio to obtain the surround component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a seventh frequency band weight parameter corresponding to the frequency band of the surround component audio and used for processing the left channel audio and an eighth frequency band weight parameter used for processing the right channel audio to obtain the surround component right channel audio.
5. The method of claim 1, wherein the surround component audio comprises surround component left channel audio and surround component right channel audio;
the obtaining rendered audio based on the surround component audio and the target mapping parameter includes:
and performing weighted summation on the surround component left channel audio and the surround component right channel audio based on a first target mapping parameter used for processing the surround component left channel audio in the target mapping parameters and a second target mapping parameter used for processing the surround component right channel audio in the target mapping parameters to obtain the rendered audio.
6. The method of claim 1, wherein the training procedure of the frequency band weight parameter and the target mapping parameter comprises:
obtaining a sample left channel audio and a sample right channel audio based on the sample multi-channel audio;
determining, based on the sample left channel audio and the sample right channel audio, a first sample audio feature indicative of a sum of powers of the sample left channel audio and the sample right channel audio, a second sample audio feature indicative of a difference in power of the sample left channel audio and the sample right channel audio, and a third sample audio feature indicative of real cross-correlation powers of the sample left channel audio and the sample right channel audio;
determining a first sample audio and a second sample audio based on the sample left channel audio, the sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through a first audio processing network;
obtaining a predicted rendering audio based on the second sample audio and a predicted mapping parameter obtained by processing the second sample audio through a second audio processing network;
obtaining a predicted multi-channel audio based on the first sample audio and the predicted rendered audio;
training the first and second audio processing networks based on an objective loss function indicative of a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and an objective mapping parameter.
7. The method of claim 6, wherein the sample multi-channel audio comprises sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio;
the obtaining of sample left channel audio and sample right channel audio based on sample multi-channel audio comprises:
weighting the sample center channel audio and the sample left surround channel audio respectively through preset weighting parameters, and determining the sample left channel audio based on a weighted result and the sample front left channel audio;

weighting the sample center channel audio and the sample right surround channel audio respectively through the preset weighting parameters, and determining the sample right channel audio based on a weighted result and the sample front right channel audio.
8. An audio processing apparatus, characterized in that the apparatus comprises:
the determining module is used for determining a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to audio to be processed and frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, wherein the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio comprises a target left channel audio part and a target right channel audio part;

the first obtaining module is used for obtaining rendered audio based on the surround component audio and a target mapping parameter, the target mapping parameter being obtained through second audio network training;

and the second obtaining module is used for obtaining the target multi-channel audio corresponding to the audio to be processed based on the principal component audio corresponding to the target multi-channel audio and the rendered audio.
9. A computing device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements operations performed by the audio processing method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has a program stored thereon, which is executed by a processor to perform operations performed by the audio processing method according to any one of claims 1 to 7.