
CN114783450A - Audio processing method, device, computing equipment and medium - Google Patents


Info

Publication number
CN114783450A
CN114783450A
Authority
CN
China
Prior art keywords
audio
channel audio
sample
frequency
component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210351876.5A
Other languages
Chinese (zh)
Inventor
赵翔宇
刘华平
曹偲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Cloud Music Technology Co Ltd
Original Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Netease Cloud Music Technology Co Ltd filed Critical Hangzhou Netease Cloud Music Technology Co Ltd
Priority to CN202210351876.5A
Publication of CN114783450A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S5/00 Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
    • H04S5/005 Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation, of the pseudo five- or more-channel type, e.g. virtual surround

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Stereophonic System (AREA)

Abstract

The embodiments of the disclosure provide an audio processing method, an audio processing apparatus, a computing device, and a medium. Based on the left channel audio and the right channel audio corresponding to the audio to be processed, and on frequency band weight parameters, obtained through training of a first audio network, for the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, the principal component audio and the surround component audio corresponding to the target multi-channel audio are determined. Rendered audio is then obtained based on the surround component audio and target mapping parameters trained through a second audio network. Finally, the target multi-channel audio corresponding to the audio to be processed is obtained based on the principal component audio and the rendered audio. Because the audio processing uses parameters learned by audio processing networks, the loss of separation quality caused by unbalanced separation weights when the audio contains several dominant sound sources is avoided, so the processing effect of the audio processing method is improved.

Description

Audio processing method, device, computing equipment and medium
Technical Field
Embodiments of the present disclosure relate to the field of multimedia technologies, and in particular, to an audio processing method, an audio processing apparatus, a computing device, and a medium.
Background
This section is intended to provide a background or context to the embodiments of the disclosure. The description herein is not admitted to be prior art by inclusion in this section.
In order to render ordinary stereo audio with an immersive surround effect, it is usually necessary to upmix the two-channel audio to multi-channel audio.
In the related art, when two-channel audio is upmixed, principal component analysis (PCA) is often used to determine the principal component of the two-channel audio, a surround component (ambient) orthogonal to that principal component is then determined, and a linear combination of the determined principal component and surround component is used as the multi-channel audio, thereby achieving the upmix.
This implementation assumes that the principal component and the surround component are completely orthogonal. The method is therefore only suitable for audio containing a single dominant sound source; when the audio to be processed contains several dominant sound sources, the processing effect of the method degrades significantly.
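To make the related-art baseline concrete, the following minimal NumPy sketch shows what a PCA-style primary/ambient split of a single stereo frame can look like. It is an illustration only: the function name, the frame-wise covariance formulation, and the variable names are assumptions, not taken from any cited implementation.

```python
import numpy as np

def pca_primary_ambient(x_l, x_r):
    """Split one stereo frame into primary and ambient parts via PCA.

    x_l, x_r: 1-D arrays holding one frame of left/right samples.
    Returns (primary, ambient), each of shape (2, frame_len).
    """
    x = np.stack([x_l, x_r])            # shape (2, N)
    cov = x @ x.T / x.shape[1]          # 2x2 channel covariance
    _, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    v = eigvecs[:, -1:]                 # dominant direction = primary axis
    primary = v @ (v.T @ x)             # projection onto the primary axis
    ambient = x - primary               # orthogonal remainder
    return primary, ambient
```

The ambient part is orthogonal to the primary part by construction, which is exactly the assumption that breaks down when several dominant sources are present.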
Disclosure of Invention
In view of the problem in the related art that audio processing methods perform poorly when processing audio containing multiple dominant sound sources, embodiments of the present disclosure provide at least an audio processing method, an audio processing apparatus, a computing device, and a medium.
In a first aspect of embodiments of the present disclosure, there is provided an audio processing method, the method comprising:
determining a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to the audio to be processed and on frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, wherein the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio comprises a target left channel audio part and a target right channel audio part;
acquiring rendered audio based on the surrounding component audio and the target mapping parameter, wherein the target mapping parameter is obtained through training of a second audio network;
and acquiring target multi-channel audio corresponding to the audio to be processed based on the main component audio and the rendering audio corresponding to the target multi-channel audio.
In an embodiment of the present disclosure, determining a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to an audio to be processed and frequency band weight parameters of frequency bands to which the principal component audio and the surround component audio respectively correspond to the target multi-channel audio includes:
determining a principal component audio frequency based on the left channel audio frequency, the right channel audio frequency and a frequency band weight parameter corresponding to a frequency band to which the principal component audio frequency belongs;
and determining the surround component audio based on the left channel audio, the right channel audio and the frequency band weight parameter corresponding to the frequency band to which the surround component audio belongs.
In one embodiment of the present disclosure, the principal component audio includes a principal component left channel audio and a principal component right channel audio;
determining the principal component audio based on the left channel audio, the right channel audio and the frequency band weight parameter corresponding to the frequency band to which the principal component audio belongs, including:
based on a first frequency band weight parameter which corresponds to the frequency band of the principal component audio and is used for processing the left channel audio and a second frequency band weight parameter which is used for processing the right channel audio, carrying out weighted summation on the left channel audio and the right channel audio to obtain the principal component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a third frequency band weight parameter which corresponds to the frequency band to which the principal component audio belongs and is used for processing the left channel audio and a fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
In one embodiment of the present disclosure, surround component audio includes surround component left channel audio and surround component right channel audio;
determining surround component audio based on the left channel audio, the right channel audio, and a band weight parameter corresponding to a band to which the surround component audio belongs, including:
weighting and summing the left channel audio and the right channel audio based on a fifth frequency band weight parameter for processing the left channel audio and a sixth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a seventh frequency band weight parameter for processing the left channel audio and an eighth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, so as to obtain the surround component right channel audio.
In one embodiment of the present disclosure, surround component audio includes surround component left channel audio and surround component right channel audio;
obtaining rendered audio based on the surround component audio and the target mapping parameters, including:
and performing a weighted summation of the surround component left channel audio and the surround component right channel audio based on a first target mapping parameter, among the target mapping parameters, for processing the surround component left channel audio and a second target mapping parameter for processing the surround component right channel audio, to obtain the rendered audio.
In an embodiment of the present disclosure, acquiring target multi-channel audio corresponding to audio to be processed based on principal component audio and rendering audio includes:
and superposing the main component audio and the rendered audio to obtain target multi-channel audio corresponding to the audio to be processed.
In an embodiment of the present disclosure, the training process of the frequency band weight parameter and the target mapping parameter includes:
obtaining a sample left channel audio and a sample right channel audio based on the sample multi-channel audio;
determining a first sample audio feature, a second sample audio feature, and a third sample audio feature based on the sample left channel audio and the sample right channel audio, the first sample audio feature indicating a power sum of the sample left channel audio and the sample right channel audio, the second sample audio feature indicating a power difference of the sample left channel audio and the sample right channel audio, and the third sample audio feature indicating a real cross-correlation power of the sample left channel audio and the sample right channel audio;
determining a first sample audio and a second sample audio based on a sample left channel audio, a sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing a first sample audio characteristic, a second sample audio characteristic and a third sample audio characteristic through a first audio processing network;
obtaining a predicted rendering audio based on the second sample audio and a predicted mapping parameter obtained by processing the second sample audio through a second audio processing network;
obtaining a predicted multi-channel audio based on the first sample audio and the predicted rendered audio;
the first audio processing network and the second audio processing network are trained based on an objective loss function indicative of a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and an objective mapping parameter.
In one embodiment of the present disclosure, the sample multi-channel audio includes sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio;
obtaining sample left channel audio and sample right channel audio based on sample multi-channel audio, comprising:
weighting the sample center channel audio and the sample left surround channel audio respectively through preset weighting parameters, and determining the sample left channel audio based on a weighted result and the sample left front channel audio;
weighting the sample center channel audio and the sample right surround channel audio respectively through preset weighting parameters, and determining the sample right channel audio based on a weighted result and the sample right front channel audio.
In one embodiment of the disclosure, determining a first sample audio feature, a second sample audio feature, and a third sample audio feature based on a sample left channel audio and a sample right channel audio includes:
determining a first sample audio feature and a second sample audio feature based on the sample left channel audio, the complex conjugate audio corresponding to the sample left channel audio, the sample right channel audio, the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter;
a third sample audio feature is determined based on a real part portion of a multiplication result of the sample left channel audio and the complex conjugate audio corresponding to the sample right channel audio, and a target smoothing factor.
In one embodiment of the disclosure, before determining the first sample audio and the second sample audio based on the sample left channel audio, the sample right channel audio, and a plurality of predicted band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through the first audio processing network, the method further comprises:
inputting the first sample audio feature, the second sample audio feature, and the third sample audio feature into the first audio processing network, and processing them respectively through a first linear transformation layer and a second linear transformation layer included in the first audio processing network, to obtain a plurality of predicted frequency band weight parameters for obtaining the first sample audio and a plurality of predicted frequency band weight parameters for obtaining the second sample audio.
In an embodiment of the present disclosure, before obtaining the predicted rendered audio based on the second sample audio and the prediction mapping parameter obtained by processing the second sample audio through the second audio processing network, the method further includes:
and inputting the second sample audio into a second audio processing network, and processing the second sample audio through a linear transformation layer included in the second audio processing network to obtain a prediction mapping parameter.
In one embodiment of the disclosure, training a first audio processing network and a second audio processing network based on an objective loss function indicative of a difference between predicted multi-channel audio and sample multi-channel audio, resulting in a frequency band weight parameter and an objective mapping parameter, comprises:
determining a first loss function based on the predicted multi-channel audio and the sample multi-channel audio;
determining a plurality of second loss functions based on the predicted magnitude difference of the multi-channel audio between the channels and the magnitude difference of the sample multi-channel audio between the channels;
determining a target loss function based on the first loss function and the plurality of second loss functions;
and training the first audio processing network and the second audio processing network based on the target loss function until a training cutoff condition is met, and obtaining a frequency band weight parameter and a target mapping parameter.
In a second aspect of embodiments of the present disclosure, there is provided an audio processing apparatus comprising:
the determining module is configured to determine a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to the audio to be processed and on frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, the frequency band weight parameters being obtained through training of a first audio network, and the target multi-channel audio comprising a target left channel audio part and a target right channel audio part;
the first acquisition module is used for acquiring rendering audio based on surrounding component audio and target mapping parameters, and the target mapping parameters are obtained through second audio network training;
and the second acquisition module is used for acquiring the target multi-channel audio corresponding to the audio to be processed based on the main component audio and the rendering audio corresponding to the target multi-channel audio.
In an embodiment of the disclosure, the determining module, when configured to determine a main component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to an audio to be processed and frequency band weight parameters of frequency bands to which the main component audio and the surround component audio respectively correspond to the target multi-channel audio, includes:
a first determination unit configured to determine a principal component audio based on a left channel audio, a right channel audio, and a frequency band weight parameter corresponding to a frequency band to which the principal component audio belongs;
a second determining unit configured to determine the surround component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the surround component audio belongs.
In one embodiment of the present disclosure, the principal component audio includes a principal component left channel audio and a principal component right channel audio;
a first determining unit, when configured to determine the principal component audio based on the left channel audio, the right channel audio, and the frequency band weight parameter corresponding to the frequency band to which the principal component audio belongs, configured to:
based on a first frequency band weight parameter which corresponds to the frequency band of the principal component audio and is used for processing the left channel audio and a second frequency band weight parameter which is used for processing the right channel audio, carrying out weighted summation on the left channel audio and the right channel audio to obtain the principal component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a third frequency band weight parameter which corresponds to the frequency band to which the principal component audio belongs and is used for processing the left channel audio and a fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
In one embodiment of the present disclosure, the surround component audio includes surround component left channel audio and surround component right channel audio;
a second determining unit, when configured to determine the surround component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the surround component audio belongs, configured to:
weighting and summing the left channel audio and the right channel audio based on a fifth frequency band weight parameter for processing the left channel audio and a sixth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a seventh frequency band weight parameter for processing the left channel audio and an eighth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component right channel audio.
In one embodiment of the present disclosure, surround component audio includes surround component left channel audio and surround component right channel audio;
a first obtaining module, when configured to obtain rendered audio based on surround component audio and a target mapping parameter, configured to:
and performing a weighted summation of the surround component left channel audio and the surround component right channel audio based on a first target mapping parameter, among the target mapping parameters, for processing the surround component left channel audio and a second target mapping parameter for processing the surround component right channel audio, to obtain the rendered audio.
In an embodiment of the present disclosure, the second obtaining module, when configured to obtain target multi-channel audio corresponding to the audio to be processed based on the principal component audio and the rendered audio, is configured to:
and superposing the principal component audio and the rendering audio to obtain target multi-channel audio corresponding to the audio to be processed.
In one embodiment of the present disclosure, the apparatus further comprises a training module comprising:
a first obtaining unit configured to obtain a sample left channel audio and a sample right channel audio based on a sample multi-channel audio;
a first determining unit configured to determine a first sample audio feature, a second sample audio feature, and a third sample audio feature based on a sample left channel audio and a sample right channel audio, the first sample audio feature indicating a sum of powers of the sample left channel audio and the sample right channel audio, the second sample audio feature indicating a power difference between the sample left channel audio and the sample right channel audio, and the third sample audio feature indicating a real part cross-correlation power of the sample left channel audio and the sample right channel audio;
a second determining unit, configured to determine a first sample audio and a second sample audio based on a sample left channel audio, a sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through a first audio processing network;
a second obtaining unit, configured to obtain a predicted rendered audio based on a second sample audio and a prediction mapping parameter obtained by processing the second sample audio through a second audio processing network;
a third acquisition unit configured to acquire predicted multi-channel audio based on the first sample audio and the predicted rendered audio;
a training unit for training the first audio processing network and the second audio processing network based on a target loss function indicating a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and a target mapping parameter.
In one embodiment of the present disclosure, the sample multi-channel audio includes sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio;
a first obtaining unit, when configured to obtain a sample left channel audio and a sample right channel audio based on a sample multi-channel audio, configured to:
weighting the sample center channel audio and the sample left surround channel audio respectively through preset weighting parameters, and determining the sample left channel audio based on a weighted result and the sample left front channel audio;
and weighting the sample center channel audio and the sample right surround channel audio respectively through preset weighting parameters, and determining the sample right channel audio based on a weighted result and the sample right front channel audio.
In an embodiment of the disclosure, the first determining unit, when being configured to determine the first sample audio feature, the second sample audio feature and the third sample audio feature based on the sample left channel audio and the sample right channel audio, is configured to:
determining a first sample audio feature and a second sample audio feature based on the sample left channel audio, the complex conjugate audio corresponding to the sample left channel audio, the sample right channel audio, the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter;
a third sample audio feature is determined based on a real part portion of a multiplication result of the sample left channel audio and the complex conjugate audio corresponding to the sample right channel audio, and a target smoothing factor.
In one embodiment of the present disclosure, the training module further comprises:
the first processing unit is configured to input the first sample audio feature, the second sample audio feature, and the third sample audio feature into a first audio processing network, and process the first sample audio feature, the second sample audio feature, and the third sample audio feature through a first linear transformation layer and a second linear transformation layer included in the first audio processing network, respectively, to obtain multiple predicted frequency band weight parameters for obtaining the first sample audio and multiple predicted frequency band weight parameters for obtaining the second sample audio.
In one embodiment of the present disclosure, the apparatus further comprises:
and the second processing unit is used for inputting the second sample audio into the second audio processing network, and processing the second sample audio through a linear transformation layer included in the second audio processing network to obtain the prediction mapping parameters.
In an embodiment of the disclosure, the training unit, when being configured to train the first audio processing network and the second audio processing network based on an objective loss function indicative of a difference between predicted multi-channel audio and sample multi-channel audio, to obtain a frequency band weight parameter and an objective mapping parameter, is configured to:
determining a first loss function based on the predicted multi-channel audio and the sample multi-channel audio;
determining a plurality of second loss functions based on the predicted magnitude difference of the multi-channel audio between the respective channels and the magnitude difference of the sample multi-channel audio between the respective channels;
determining a target loss function based on the first loss function and the plurality of second loss functions;
and training the first audio processing network and the second audio processing network based on the target loss function until a training cutoff condition is met, and obtaining a frequency band weight parameter and a target mapping parameter.
In a third aspect of the disclosed embodiments, a computing device is provided, where the computing device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the program to implement the operations performed by the audio processing method provided in the first aspect and any embodiment of the first aspect.
In a fourth aspect of embodiments of the present disclosure, a computer-readable storage medium is provided, on which a program is stored, where the program is executed by a processor to perform the operations performed by the audio processing method provided by the first aspect and any embodiment of the first aspect.
In a fifth aspect of embodiments of the present disclosure, a computer program product is provided, which comprises a computer program that, when executed by a processor, performs the operations performed by the audio processing method according to the first aspect and any embodiment of the first aspect.
According to the audio processing method, apparatus, computing device, and medium of the embodiments of the disclosure, the principal component audio and the surround component audio corresponding to the target multi-channel audio are determined based on the left channel audio and the right channel audio corresponding to the audio to be processed, and on the frequency band weight parameters, obtained through training of the first audio network, for the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong. Rendered audio is then obtained based on the surround component audio and the target mapping parameters obtained through training of the second audio network, and the target multi-channel audio corresponding to the audio to be processed is further obtained based on the principal component audio and the rendered audio. Performing audio processing with parameters learned by the audio processing networks avoids the loss of separation quality caused by unbalanced separation weights when the audio contains several dominant sound sources. The number of dominant sound sources in the audio to be processed therefore does not need to be considered during processing, the accuracy of the resulting target multi-channel audio can be improved, and the processing effect of the audio processing method is improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a flow chart illustrating an audio processing method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a method for obtaining left and right channel audio based on the audio to be processed according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a parameter training process according to an exemplary embodiment of the present disclosure;
FIG. 4 is a network architecture diagram of a first audio processing network, shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 is a network architecture diagram of a second audio processing network shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating a training process according to an exemplary embodiment of the present disclosure;
FIG. 7 is a block diagram of an audio processing device shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a computer-readable storage medium illustrated in accordance with an exemplary embodiment of the present disclosure;
FIG. 9 is a schematic block diagram of a computing device shown in accordance with an exemplary embodiment of the present disclosure;
in the drawings, like or corresponding reference characters designate like or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.
According to an embodiment of the disclosure, an audio processing method, an audio processing device, a computing device and a medium are provided. The method described above may be performed by a computing device for processing audio to be processed to obtain target multi-channel audio. The computing device may be a server, such as one server, multiple servers, a server cluster, a cloud computing platform, or the like, and optionally, the computing device may also be a terminal device, such as a smart phone, a tablet computer, a desktop computer, a portable computer, a smart speaker, or the like, where the disclosure does not limit the device type and the device number of the computing device.
Having described an environment in which the present disclosure may be applied, a specific implementation of the present disclosure is described below.
Referring to fig. 1, fig. 1 is a flowchart illustrating an audio processing method according to an exemplary embodiment of the present disclosure, the method including:
step 101, determining a main component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to an audio to be processed and frequency band weight parameters of frequency bands to which the main component audio and the surround component audio respectively correspond to the target multi-channel audio, wherein the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio comprises a target left channel audio part and a target right channel audio part.
The audio to be processed may be two-channel audio. Optionally, the audio to be processed may also be multi-channel audio, in which case the number of channels in the audio to be processed is smaller than the number of channels in the target multi-channel audio.
And 102, acquiring rendered audio based on the surrounding component audio and the target mapping parameter, wherein the target mapping parameter is obtained through training of a second audio network.
And 103, acquiring target multi-channel audio corresponding to the audio to be processed based on the main component audio and the rendered audio corresponding to the target multi-channel audio.
The method determines the principal component audio and the surround component audio corresponding to the target multi-channel audio based on the left channel audio and the right channel audio corresponding to the audio to be processed, and on the frequency band weight parameters, obtained through training of the first audio network, for the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong. Rendered audio is then obtained based on the surround component audio and the target mapping parameters obtained through training of the second audio network, and the target multi-channel audio corresponding to the audio to be processed is further obtained based on the principal component audio and the rendered audio. Because the processing uses parameters learned by the audio processing networks, the loss of separation quality caused by unbalanced separation weights when the audio contains several dominant sound sources is avoided; the number of dominant sound sources in the audio to be processed need not be considered during processing, the accuracy of the resulting target multi-channel audio can be improved, and the processing effect of the audio processing method is improved.
Having described the basic processes of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
In some embodiments, before step 101, left channel audio and right channel audio may be acquired in advance based on the audio to be processed, so that the acquisition of the target multi-channel audio may be performed based on the left channel audio and the right channel audio. Alternatively, the acquisition of the left channel audio and the right channel audio may be performed by the following procedure.
Under the condition that the audio to be processed is the dual-channel audio, the left channel audio and the right channel audio included in the audio to be processed can be directly acquired.
When the audio to be processed is a multi-channel audio, the audio to be processed may be processed in advance to obtain a left channel audio and a right channel audio corresponding to the audio to be processed.
Taking the audio to be processed as a multi-channel audio including 5 channels as an example, the audio to be processed may include a left front channel audio, a right front channel audio, a center channel audio, a left surround channel audio, and a right surround channel audio, and then the left channel audio and the right channel audio corresponding to the audio to be processed may be obtained through the following processing manners:
weighting the center channel audio and the left surround channel audio respectively through preset weighting parameters, and determining the left channel audio based on a weighted result and the left front channel audio; and weighting the center channel audio and the right surround channel audio respectively through preset weighting parameters, and determining the right channel audio based on the weighted result and the right front channel audio.
For example, the left channel audio and the right channel audio may be determined by the following equations (1) and (2), respectively:
L=FL+a×C+a×RL (1)
R=FR+a×C+a×RR (2)
where L represents a left channel audio, R represents a right channel audio, FL represents a front left channel audio, FR represents a front right channel audio, C represents a center channel audio, RL represents a left surround channel audio, RR represents a right surround channel audio, and a represents a preset weight parameter.
Optionally, the preset weight parameter may take the value 0.71. Weighting audio with 0.71 is equivalent to attenuating it by about 3 decibels (dB). With the preset weight parameter set to 0.71, the principle of acquiring the left channel audio and the right channel audio is shown in FIG. 2, a schematic diagram of obtaining the left and right channel audio from the audio to be processed. When obtaining the left channel audio, the result of attenuating the center channel audio by 3 dB is summed with the front left channel audio, and that sum is added to the result of attenuating the left surround channel audio by 3 dB. When obtaining the right channel audio, the result of attenuating the center channel audio by 3 dB is summed with the front right channel audio, and that sum is added to the result of attenuating the right surround channel audio by 3 dB.
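For illustration, equations (1) and (2) amount to a few lines of NumPy. This is a minimal sketch under stated assumptions: the function and array names are invented here, and each array is taken to hold the time-domain samples of one channel.

```python
import numpy as np

def downmix_to_stereo(fl, fr, c, rl, rr, a=0.71):
    """Fold five channels down to left/right per equations (1) and (2).

    a = 0.71 corresponds to roughly a 3 dB attenuation of the
    center and surround channels before they are mixed in.
    """
    left = fl + a * c + a * rl    # L = FL + a*C + a*RL
    right = fr + a * c + a * rr   # R = FR + a*C + a*RR
    return left, right
```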
The above is only an exemplary way of obtaining left and right channel audio based on audio to be processed, and in more possible implementations, other ways may also be used to obtain left and right channel audio.
After the left and right channel audios corresponding to the audio to be processed are obtained, the target multi-channel audio can be obtained based on the obtained left and right channel audios.
In some embodiments, for step 101, when determining the main component audio and the surround component audio corresponding to the target multi-channel audio based on the left channel audio and the right channel audio corresponding to the audio to be processed and the frequency band weight parameters of the frequency bands to which the main component audio and the surround component audio respectively correspond to the target multi-channel audio, the following steps may be included:
step 1011, determining the principal component audio frequency based on the left channel audio frequency, the right channel audio frequency and the frequency band weight parameter corresponding to the frequency band to which the principal component audio frequency belongs.
The principal component audio may include a principal component left channel audio and a principal component right channel audio. Based on this, for the above step 1011, when determining the principal component audio, the following steps may be included:
and step 1011-1, performing weighted summation on the left channel audio and the right channel audio based on the first frequency band weight parameter for processing the left channel audio corresponding to the frequency band to which the principal component audio belongs and the second frequency band weight parameter for processing the right channel audio to obtain the principal component left channel audio.
For example, the principal component left channel audio may be obtained by the following formula (3):

PL(b,k,n) = w1(b)×XL(k,n) + w2(b)×XR(k,n) (3)

where PL represents the principal component left channel audio, w1 represents the first frequency band weight parameter, XL represents the left channel audio, w2 represents the second frequency band weight parameter, XR represents the right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
And 1011-2, performing weighted summation on the left channel audio and the right channel audio based on the third frequency band weight parameter which corresponds to the frequency band to which the principal component audio belongs and is used for processing the left channel audio and the fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
For example, the principal component right channel audio may be obtained by the following equation (4):

PR(b,k,n) = w3(b)×XL(k,n) + w4(b)×XR(k,n) (4)

where PR represents the principal component right channel audio, w3 represents the third frequency band weight parameter, XL represents the left channel audio, w4 represents the fourth frequency band weight parameter, XR represents the right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
Step 1012, determining the surround component audio based on the left channel audio, the right channel audio, and the frequency band weighting parameter corresponding to the frequency band to which the surround component audio belongs.
Wherein the surround component audio may include surround component left channel audio and surround component right channel audio. Based on this, for the above step 1012, when determining the surround component audio, the following steps may be included:
and 1012-1, performing weighted summation on the left channel audio and the right channel audio based on a fifth frequency band weight parameter corresponding to the frequency band to which the surround component audio belongs and used for processing the left channel audio and a sixth frequency band weight parameter used for processing the right channel audio to obtain a surround component left channel audio.
For example, the surround component left channel audio may be obtained by the following equation (5):

SL(b,k,n) = w5(b)×XL(k,n) + w6(b)×XR(k,n) (5)

where SL represents the surround component left channel audio, w5 represents the fifth frequency band weight parameter, XL represents the left channel audio, w6 represents the sixth frequency band weight parameter, XR represents the right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
And 1012-2, performing a weighted summation of the left channel audio and the right channel audio based on a seventh frequency band weight parameter, corresponding to the frequency band to which the surround component audio belongs, for processing the left channel audio and an eighth frequency band weight parameter for processing the right channel audio, to obtain the surround component right channel audio.

For example, the surround component right channel audio may be obtained by the following equation (6):

SR(b,k,n) = w7(b)×XL(k,n) + w8(b)×XR(k,n) (6)

where SR represents the surround component right channel audio, w7 represents the seventh frequency band weight parameter, XL represents the left channel audio, w8 represents the eighth frequency band weight parameter, XR represents the right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
It should be noted that, the reference numerals of the above steps do not limit the execution order of the above two steps, taking step 1011 and step 1012 as an example, optionally, step 1011 may be executed first and then step 1012 is executed, or step 1012 may be executed first and then step 1011 may be executed, or step 1011 and step 1012 may be executed simultaneously, and the present disclosure does not limit which execution order is specifically adopted.
Through the above process, the principal component audio and the surround component audio can be determined by linear estimation, and each can be expressed as a linear combination of the stereo (left and right channel) signals, so that the target multi-channel audio can be determined based on the determined principal component audio and surround component audio.
By using the frequency band weight parameters obtained by training the first audio processing network to separate the principal component audio and the surround component audio, a very high degree of separation can be achieved without needing pre-split source files. Moreover, because the separation is performed by an audio processing network, the loss of separation quality caused by unbalanced separation weights when the audio contains several dominant sound sources is avoided, so the audio separation effect can be improved.
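Since equations (3) through (6) share the same band-wise weighted-sum form, one helper suffices to illustrate all four. The sketch below is an assumption-laden illustration: the STFT layout, the weight shapes, and the bin-to-band mapping are chosen for clarity rather than taken from the patent.

```python
import numpy as np

def band_weighted_sum(x_l, x_r, w_left, w_right, band_of_bin):
    """Band-wise weighted sum of two channels, as in equations (3)-(6).

    x_l, x_r:         complex STFTs, shape (num_bins, num_frames)
    w_left, w_right:  weights per band, shape (num_bands, num_frames)
    band_of_bin:      int array mapping each frequency bin to its band
    """
    wl = w_left[band_of_bin, :]    # broadcast band weights to bins
    wr = w_right[band_of_bin, :]
    return wl * x_l + wr * x_r

# Principal left channel (eq. 3):  band_weighted_sum(XL, XR, w1, w2, bands)
# Surround right channel (eq. 6):  band_weighted_sum(XL, XR, w7, w8, bands)
```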
In some embodiments, for step 102, when the rendered audio is obtained based on the surround component audio and the target mapping parameters, this may be achieved by:
and performing weighted summation on the encircled divided left channel audio and the encircled divided right channel audio based on a first target mapping parameter used for processing the encircled divided left channel audio in the target mapping parameters and a second target mapping parameter used for processing the encircled divided right channel audio in the target mapping parameters to obtain rendered audio.
The surround component audio improves the diversity of the diffuse signals and creates a surround, immersive sound effect; when rendering it into multi-channel audio, channel-independent diffuse signals can be generated through a decorrelator. For example, the surround component signal may be rendered into multi-channel audio by the following equation (7) to obtain the rendered audio:
Ac(b,k,n) = oc,1(b)×SL(b,k,n) + oc,2(b)×SR(b,k,n) (7)

where Ac represents the rendered audio for output channel c, oc,1 represents the first target mapping parameter, SL represents the surround component left channel audio, oc,2 represents the second target mapping parameter, SR represents the surround component right channel audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
Rendering the surround component audio with the target mapping parameters obtained by training the second audio processing network improves the rendering effect of the immersive audio.
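A sketch of equation (7) follows in the same style. The layout of the mapping-parameter array (one pair of parameters per output channel, band, and frame) is an assumption made for illustration.

```python
import numpy as np

def render_surround(s_l, s_r, o, band_of_bin):
    """Map the surround components to rendered audio per equation (7).

    s_l, s_r: surround component STFTs, shape (num_bins, num_frames)
    o:        target mapping parameters,
              shape (num_out_channels, 2, num_bands, num_frames)
    """
    num_out = o.shape[0]
    out = np.empty((num_out,) + s_l.shape, dtype=s_l.dtype)
    for c in range(num_out):
        oc1 = o[c, 0][band_of_bin, :]   # first target mapping parameter
        oc2 = o[c, 1][band_of_bin, :]   # second target mapping parameter
        out[c] = oc1 * s_l + oc2 * s_r
    return out
```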
In some embodiments, for step 103, when obtaining target multi-channel audio corresponding to the audio to be processed based on the principal component audio and the rendered audio, the following steps may be performed:
and superposing the main component audio and the rendered audio to obtain target multi-channel audio corresponding to the audio to be processed.
For example, the target multi-channel audio may be obtained by the following equation (8):

Uc(b,k,n) = Pc(b,k,n) + Ac(b,k,n) (8)

where Uc represents the target multi-channel audio for channel c, Pc represents the principal component audio (comprising the principal component left channel audio and the principal component right channel audio), Ac represents the rendered audio obtained based on the surround component audio, b represents the frequency band number, k represents the frequency bin number, and n represents the time series number.
Since the principal component audio is a directional audio signal, its spatial sound image orientation must be maintained when it is mapped to multi-channel audio. The direction of the principal component audio therefore needs to be estimated and then mapped to the left and right channels according to that angle; channels other than the left and right channels do not need to retain the principal component audio. Hence the principal component audio in equation (8) includes only the principal component left channel audio and the principal component right channel audio, while the principal component audio of the other channels is mapped to 0 and can be ignored. This ensures that the principal component audio used to obtain the target multi-channel audio faithfully restores the spatial distribution of the original sound field.
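Equation (8) then reduces to a superposition. The sketch below assumes the rendered audio is stacked per output channel and, as described above, only the left and right output channels receive a principal component contribution.

```python
import numpy as np

def combine(principal_l, principal_r, rendered, left_idx=0, right_idx=1):
    """Superpose principal and rendered audio per equation (8).

    principal_l, principal_r: principal component STFTs (num_bins, num_frames)
    rendered: rendered audio, shape (num_out_channels, num_bins, num_frames)
    For channels other than left/right the principal part is mapped to 0.
    """
    target = rendered.copy()
    target[left_idx] += principal_l
    target[right_idx] += principal_r
    return target
```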
The above describes how the target multi-channel audio is obtained from the audio to be processed using parameters obtained by pre-training; the following describes the process of obtaining these parameters through training.
In some embodiments, the training process of the frequency band weight parameter and the target mapping parameter may refer to fig. 3, fig. 3 is a flowchart illustrating a parameter training process according to an exemplary embodiment of the present disclosure, and as shown in fig. 3, the parameter training process includes:
step 301, obtaining a sample left channel audio and a sample right channel audio based on the sample multi-channel audio.
Wherein the sample multi-channel audio may be audio in a multi-channel data set, and the sample multi-channel audio may include sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio.
Taking the example that the sample multi-channel audio includes the audio of the above 5 channels, the sample left channel audio and the sample right channel audio can be obtained through the following steps:
step 3011, weighting the sample center channel audio and the sample left surround channel audio respectively by preset weighting parameters, and determining the sample left channel audio based on the weighted result and the sample left front channel audio.
Step 3012, weighting the sample center channel audio and the sample right surround channel audio respectively by preset weighting parameters, and determining a sample right channel audio based on the weighted result and the sample right front channel audio.
The process of obtaining the sample left channel audio and the sample right channel audio through steps 3011 and 3012 may refer to formula (1) and formula (2) and the related descriptions, which are not repeated here.
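By way of illustration, the downmix of steps 3011 and 3012 may be sketched as follows; the weighting value 1/√2 is a common downmix convention assumed here for the preset weighting parameter, since the defining formulas (1) and (2) appear earlier in this disclosure:

```python
import numpy as np

def downmix_5ch_to_stereo(front_left, front_right, center,
                          left_surround, right_surround, w=1 / np.sqrt(2)):
    """Obtain sample left/right channel audio from 5-channel sample
    multi-channel audio (steps 3011 and 3012).

    `w` is the preset weighting parameter applied to the center and
    surround channels before they are added to the front channels.
    """
    sample_left = front_left + w * center + w * left_surround
    sample_right = front_right + w * center + w * right_surround
    return sample_left, sample_right
```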
Step 302, determining a first sample audio feature, a second sample audio feature and a third sample audio feature based on the sample left channel audio and the sample right channel audio, the first sample audio feature being used for indicating a power sum of the sample left channel audio and the sample right channel audio, the second sample audio feature being used for indicating a power difference of the sample left channel audio and the sample right channel audio, and the third sample audio feature being used for indicating a real part cross-correlation power of the sample left channel audio and the sample right channel audio.
In one possible implementation, the first, second and third sample audio features may be determined by:
step 3021, determining a first sample audio feature and a second sample audio feature based on the sample left channel audio, the complex conjugate audio corresponding to the sample left channel audio, the sample right channel audio, the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter.
For example, the first sample audio feature and the second sample audio feature may be determined by the following equation (9) and equation (10), respectively:

Φ_S^b(k,n) = α·Φ_S^b(k,n−1) + (1−α)·[X_L^b(k,n)·X_L^b*(k,n) + X_R^b(k,n)·X_R^b*(k,n)]    (9)

Φ_D^b(k,n) = α·Φ_D^b(k,n−1) + (1−α)·[X_L^b(k,n)·X_L^b*(k,n) − X_R^b(k,n)·X_R^b*(k,n)]    (10)

wherein Φ_S represents the first sample audio feature (i.e., the power sum of the sample left channel audio and the sample right channel audio), Φ_D represents the second sample audio feature (i.e., the power difference between the sample left channel audio and the sample right channel audio), X_L represents the sample left channel audio, X_L* represents the complex conjugate audio of the sample left channel audio, X_R represents the sample right channel audio, X_R* represents the complex conjugate audio of the sample right channel audio, b represents the frequency band number, k represents the frequency point number, n represents the time series number, and α represents the target smoothing parameter, which may optionally be any preset value.
Step 3022, determining a third sample audio feature based on the real part of the multiplication result of the sample left channel audio and the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter.
For example, the third sample audio feature may be determined by the following equation (11):

Φ_XC^b(k,n) = α·Φ_XC^b(k,n−1) + (1−α)·R[X_L^b(k,n)·X_R^b*(k,n)]    (11)

wherein Φ_XC represents the third sample audio feature (i.e., the real part cross-correlation power of the sample left channel audio and the sample right channel audio), R[·] represents taking the real part of the result, X_L represents the sample left channel audio, X_R* represents the complex conjugate audio of the sample right channel audio, b represents the frequency band number, k represents the frequency point number, n represents the time series number, and α represents the target smoothing parameter.
Step 303, determining a first sample audio and a second sample audio based on the sample left channel audio, the sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through the first audio processing network.
Wherein the first sample audio is sample principal component audio and the second sample audio is sample surround component audio.
It should be noted that, before step 303, the first sample audio feature, the second sample audio feature, and the third sample audio feature may be input into the first audio processing network, so that the plurality of predicted frequency band weight parameters are obtained through the first audio processing network.
In one possible implementation, the first audio processing network may be a Deep Neural Network (DNN). The first audio processing network may include a first linear transformation layer and a second linear transformation layer, and each of the first linear transformation layer and the second linear transformation layer may be composed of linear transformation layers using a Rectified Linear Unit (ReLU) function as an activation function and a linear transformation layer using a normalization function as an activation function.
Taking the case where the first audio processing network includes the first linear transformation layer and the second linear transformation layer as an example, after the first sample audio feature, the second sample audio feature, and the third sample audio feature are input into the first audio processing network, they may be processed through the first linear transformation layer and the second linear transformation layer respectively, to obtain a plurality of predicted frequency band weight parameters for obtaining the first sample audio and a plurality of predicted frequency band weight parameters for obtaining the second sample audio. In step 303, the first sample audio and the second sample audio may then be determined based on these predicted frequency band weight parameters.
Optionally, when the first sample audio feature, the second sample audio feature, and the third sample audio feature are input to the first audio processing network, the three features may first be spliced to obtain a target feature, and the target feature obtained by splicing is then input into the first audio processing network.
Referring to fig. 4, fig. 4 is a schematic diagram of a network structure of a first audio processing network shown in the present disclosure according to an exemplary embodiment. As shown in fig. 4, each of the first linear transformation layer and the second linear transformation layer may be composed of three linear transformation layers using a ReLU function as an activation function and a linear transformation layer using a Sigmoid function as an activation function. When a plurality of predicted frequency band weight parameters are obtained through the first audio processing network shown in fig. 4, the first sample audio feature, the second sample audio feature, and the third sample audio feature may be input into the first linear transformation layer and the second linear transformation layer, so as to obtain a plurality of predicted frequency band weight parameters for weighting the sample left channel audio and a plurality of predicted frequency band weight parameters for weighting the sample right channel audio, respectively. The internal processing of the first and second linear transformation layers is described below.
For the first linear transformation layer, the input features may be sequentially subjected to linear rectification processing through the three linear transformation layers using the ReLU function as the activation function included in the first linear transformation layer, to obtain rectified features; the rectified features are then normalized through the linear transformation layer using the Sigmoid function as the activation function, and a plurality of predicted frequency band weight parameters for obtaining the first sample audio (including a first predicted frequency band weight parameter w_1, a second predicted frequency band weight parameter w_2, a third predicted frequency band weight parameter w_3, and a fourth predicted frequency band weight parameter w_4) are determined based on the normalized features.
For the second linear transformation layer, the input features may be sequentially subjected to linear rectification processing through the three linear transformation layers using the ReLU function as the activation function included in the second linear transformation layer, to obtain rectified features; the rectified features are then normalized through the linear transformation layer using the Sigmoid function as the activation function, and a plurality of predicted frequency band weight parameters for obtaining the second sample audio (including a fifth predicted frequency band weight parameter w_5, a sixth predicted frequency band weight parameter w_6, a seventh predicted frequency band weight parameter w_7, and an eighth predicted frequency band weight parameter w_8) are determined based on the normalized features.
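For illustration only, a minimal PyTorch sketch of such a two-branch network follows; the hidden layer width and input dimensionality are assumptions, as the present disclosure only specifies three ReLU linear transformation layers followed by a Sigmoid linear transformation layer per branch:

```python
import torch
import torch.nn as nn

class FirstAudioProcessingNetwork(nn.Module):
    """Two branches of (3 x Linear+ReLU) followed by (Linear+Sigmoid), as in fig. 4.
    Input: the spliced target feature formed from the first, second, and third
    sample audio features. Output: four predicted frequency band weight
    parameters per branch (w_1..w_4 for the first sample audio,
    w_5..w_8 for the second sample audio)."""

    def __init__(self, in_dim, hidden=128, n_weights=4):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_weights), nn.Sigmoid(),
            )
        self.principal_branch = branch()  # first linear transformation layer
        self.surround_branch = branch()   # second linear transformation layer

    def forward(self, phi_s, phi_d, phi_xc):
        # Splice the three sample audio features into the target feature.
        target_feature = torch.cat([phi_s, phi_d, phi_xc], dim=-1)
        return self.principal_branch(target_feature), self.surround_branch(target_feature)
```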
After the plurality of predicted frequency band weight parameters for obtaining the first sample audio and the plurality of predicted frequency band weight parameters for obtaining the second sample audio are determined, the first sample audio and the second sample audio may be determined based on these predicted frequency band weight parameters; the specific determination manner may refer to step 101 and the descriptions of equations (3) to (6), which are not repeated here.
Step 304, obtaining a predicted rendered audio based on the second sample audio and the prediction mapping parameters obtained by processing the second sample audio through the second audio processing network.
It should be noted that, before step 304, a second sample audio may be input into a second audio processing network, so as to obtain the prediction mapping parameters through the second audio processing network.
In one possible implementation, the second audio processing network may be a DNN. Wherein the second audio processing network may comprise a linear transformation layer, which may consist of a linear transformation layer using a ReLU function as an activation function and a linear transformation layer using a Sigmoid function as an activation function.
When the prediction mapping parameters are obtained by the second audio processing network, the second sample audio may be processed by a linear transformation layer included in the second audio processing network to obtain the prediction mapping parameters, so that the obtaining of the rendered audio may be performed based on the prediction mapping parameters in step 304.
Referring to fig. 5, fig. 5 is a schematic network structure diagram of a second audio processing network according to an exemplary embodiment of the present disclosure. As shown in fig. 5, the linear transformation layer of the second audio processing network is composed of two linear transformation layers using a ReLU function as an activation function and a linear transformation layer using a Sigmoid function as an activation function. When the prediction mapping parameters are obtained through the second audio processing network shown in fig. 5, the second sample audio represented in vector form may be input to the linear transformation layer to obtain the prediction mapping parameters. The internal processing of the linear transformation layer is explained below.
For example, the input features may be sequentially subjected to linear rectification processing through the two linear transformation layers using the ReLU function as the activation function, to obtain rectified features; the rectified features are then normalized through the linear transformation layer using the Sigmoid function as the activation function, and the prediction mapping parameters (including a first prediction mapping parameter o_{c,1} and a second prediction mapping parameter o_{c,2}) are determined based on the normalized features.
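A corresponding minimal sketch of the second audio processing network, under the same assumptions about layer widths and input dimensionality, is:

```python
import torch.nn as nn

class SecondAudioProcessingNetwork(nn.Module):
    """Two Linear+ReLU layers followed by Linear+Sigmoid, as in fig. 5.
    Input: the second sample audio (surround component) in vector form.
    Output: prediction mapping parameters o_{c,1} and o_{c,2} per channel c."""

    def __init__(self, in_dim, hidden=128, n_channels=5, n_params=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_channels * n_params), nn.Sigmoid(),
        )
        self.n_channels, self.n_params = n_channels, n_params

    def forward(self, surround_vec):
        out = self.net(surround_vec)
        # Reshape to one (o_{c,1}, o_{c,2}) pair per output channel c.
        return out.view(-1, self.n_channels, self.n_params)
```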
After the prediction mapping parameters are determined, the rendered audio may be obtained based on the determined prediction mapping parameters; the specific obtaining manner may refer to step 102 and the description related to formula (7), which is not repeated here.
Step 305, obtaining a predicted multi-channel audio based on the first sample audio and the predicted rendered audio.
It should be noted that, the implementation process of step 305 may refer to the description related to step 103, and is not described herein again.
Step 306, training the first audio processing network and the second audio processing network based on a target loss function indicating a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and a target mapping parameter.
The above step 306 can be realized by the following steps:
step 3061, a first loss function is determined based on the predicted multi-channel audio and the sample multi-channel audio.
The first loss function may be a minimum Mean-Square Error (MSE) loss function. For example, the first loss function can be seen in equation (12) below:

L_C = (1/K) · Σ_k Σ_n (Y_C(k,n) − U_C(k,n))²    (12)

wherein L_C represents the first loss function, Y_C represents the sample multi-channel audio, U_C represents the predicted multi-channel audio, K represents the number of frequency bands, k represents the frequency bin number, and n represents the time series number.
Step 3062, determining a plurality of second loss functions based on the magnitude differences between the channels of the predicted multi-channel audio and the magnitude differences between the channels of the sample multi-channel audio.
The second loss function may be an Inter-Channel Level Difference (ICLD) loss function. For example, the second loss function can be seen in equation (13) below:

L_ICLD = (1/K) · Σ_k Σ_n (ΔY(k,n) − ΔU(k,n))²    (13)

wherein L_ICLD represents the second loss function, ΔY(k,n) represents the magnitude difference between any two channels of the sample multi-channel audio, ΔU(k,n) represents the magnitude difference between the same two channels of the predicted multi-channel audio, K represents the number of frequency bands, k represents the frequency point number, and n represents the time series number.
Taking the case where the sample multi-channel audio and the predicted multi-channel audio each include 7 channels as an example, there are C(7,2) = 21 channel pairs, and thus 21 second loss functions between the sample multi-channel audio and the predicted multi-channel audio.
Step 3063, a target loss function is determined based on the first loss function and the plurality of second loss functions.
In one possible implementation, the sum of the first loss function and the plurality of second loss functions may be determined as the target loss function. For example, the target loss function may be determined by the following equation (14):

L = L_C + Σ_{(i,j)∈C} L_ICLD^(i,j)    (14)

wherein L represents the target loss function, L_C represents the first loss function, L_ICLD^(i,j) represents the second loss function of the channel pair (i,j), and C represents the set of channel pairs formed by the plurality of channels.
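For illustration, the target loss of equation (14) may be computed as in the following sketch; defining the inter-channel magnitude difference on STFT magnitudes is an assumption consistent with the ICLD description above:

```python
import itertools
import torch

def target_loss(predicted, sample):
    """Combined loss of equation (14): the MSE term of equation (12) plus one
    ICLD term per channel pair per equation (13). `predicted` and `sample`
    are complex STFT tensors of shape (num_channels, num_bins, num_frames)."""
    # First loss function: mean-square error between predicted and sample audio.
    loss = torch.mean(torch.abs(sample - predicted) ** 2)

    # Second loss functions: one per channel pair (21 pairs for 7 channels).
    num_channels = sample.shape[0]
    for i, j in itertools.combinations(range(num_channels), 2):
        delta_sample = torch.abs(sample[i]) - torch.abs(sample[j])
        delta_pred = torch.abs(predicted[i]) - torch.abs(predicted[j])
        loss = loss + torch.mean((delta_sample - delta_pred) ** 2)
    return loss
```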
Step 3064, training the first audio processing network and the second audio processing network based on the target loss function until the training cutoff condition is met, and obtaining a frequency band weight parameter and a target mapping parameter.
It should be noted that, under the condition that the training cutoff condition is satisfied, the trained first audio processing network and the trained second audio processing network may be obtained, the predicted frequency band weight parameter output by the trained first audio processing network is the frequency band weight parameter to be obtained, and the predicted mapping parameter output by the trained second audio processing network is the target mapping parameter to be obtained.
The training cutoff condition may be that the function value of the target loss function satisfies a set condition, or that the number of iterations reaches a preset number. Optionally, the training cutoff condition may also be another condition, which is not limited in this disclosure.
It should be noted that the training process shown in fig. 3 is an iterative process. Each time a sample left channel audio and a sample right channel audio are obtained through step 301, they are processed through steps 302 to 305 to obtain a predicted multi-channel audio, and the first audio processing network and the second audio processing network are trained based on the target loss function through step 306. The next sample left channel audio and sample right channel audio are then processed through steps 302 to 305 to obtain a predicted multi-channel audio, and the networks obtained from the previous round are trained again based on the target loss function through step 306. This continues until the training cutoff condition is met, yielding the trained first audio processing network and second audio processing network, and thereby the frequency band weight parameter and the target mapping parameter.
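A skeletal version of this iterative loop is sketched below; the Adam optimizer, the fixed iteration budget, and the helper functions `downmix`, `separate_components`, and `render` are hypothetical stand-ins for steps 301 and 303 to 305, not the reference implementation of the present disclosure:

```python
import torch

def train(dataset, net1, net2, num_iters=10000, lr=1e-4):
    """Iterative training of the first and second audio processing networks
    (steps 301 to 306). `dataset` yields sample multi-channel STFT tensors;
    `downmix`, `separate_components`, and `render` are hypothetical helpers,
    and `sample_audio_features` / `target_loss` are the sketches above
    (assumed here to operate on torch tensors)."""
    optimizer = torch.optim.Adam(
        list(net1.parameters()) + list(net2.parameters()), lr=lr)

    for step, sample_multichannel in zip(range(num_iters), dataset):
        x_left, x_right = downmix(sample_multichannel)                  # step 301
        phi_s, phi_d, phi_xc = sample_audio_features(x_left, x_right)   # step 302
        w_principal, w_surround = net1(phi_s, phi_d, phi_xc)
        principal, surround = separate_components(
            x_left, x_right, w_principal, w_surround)                   # step 303
        predicted = render(principal, surround, net2(surround))         # steps 304-305
        loss = target_loss(predicted, sample_multichannel)              # step 306
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```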
A deep neural network is introduced into the link of separating the principal component audio and the surround component audio, and this network is trained. The network comprises a plurality of linear transformation layers, which perform linear rectification processing on the input features of each layer; after the iterative linear rectification processing realized by the plurality of linear transformation layers, the outputs for the different channels are obtained through the linear transformation layer using the Sigmoid function as the activation function, so that channel separation through a deep neural network is realized. By exploiting the advantages of a trainable deep neural network, the method and the device improve the degree of separation between the audio of different channels, and at the same time reduce the audio distortion and sound quality damage caused by time-domain and frequency-domain transformation and nonlinear processing of the audio.
In addition, a deep neural network is also introduced in the rendering link of the surround component audio. Its introduction avoids the situation in which the learning factors of some channels are empty when the surround component audio is rendered for audio containing a plurality of principal components. The method can therefore adapt to audio with a plurality of principal sound sources, and the audio quality of the generated multi-channel audio can be improved.
Based on the various optional embodiments involved in the above training process, fig. 6 is a flowchart of a training process shown in the present disclosure according to an exemplary embodiment. As shown in fig. 6, after sample multi-channel audio is obtained from the multi-channel data set, a downmix may be performed through step 301 to obtain the sample left channel audio and sample right channel audio. The principal component and the surround component are then separated through steps 302 to 303, yielding the first sample audio as the principal component audio and the second sample audio as the surround component audio. Mapping is performed based on the first sample audio, and rendering is performed based on the second sample audio through step 304; the predicted multi-channel audio is then obtained through step 305 from the mapping result of the first sample audio and the rendering result of the second sample audio. Finally, in step 306, training is performed through back propagation based on the loss function indicating the difference between the predicted multi-channel audio and the sample multi-channel audio. The specific implementation process may refer to steps 301 to 306 and is not repeated here.
Having described the audio processing method of the exemplary embodiments of the present disclosure, next, the structures of the audio processing apparatus of the exemplary embodiments of the present disclosure and the computing device for implementing the audio processing method will be explained.
Referring to fig. 7, fig. 7 is a block diagram of an audio processing apparatus according to an exemplary embodiment of the present disclosure, the apparatus including:
a determining module 701, configured to determine a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to audio to be processed and frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, where the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio includes a target left channel audio portion and a target right channel audio portion;
a first obtaining module 702, configured to obtain a rendered audio based on the surround component audio and a target mapping parameter, where the target mapping parameter is obtained through a second audio network training;
the second obtaining module 703 is configured to obtain a target multi-channel audio corresponding to the audio to be processed based on the principal component audio and the rendered audio corresponding to the target multi-channel audio.
In an embodiment of the present disclosure, the determining module 701, when configured to determine the principal component audio and the surround component audio corresponding to the target multi-channel audio based on the left channel audio and the right channel audio corresponding to the audio to be processed and the frequency band weight parameters corresponding to the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio belong, includes:
a first determination unit configured to determine a principal component audio based on a left channel audio, a right channel audio, and a frequency band weight parameter corresponding to a frequency band to which the principal component audio belongs;
a second determining unit configured to determine the surround component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the surround component audio belongs.
In one embodiment of the present disclosure, the principal component audio includes a principal component left channel audio and a principal component right channel audio;
a first determining unit, when configured to determine the principal component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the principal component audio belongs, configured to:
based on a first frequency band weight parameter which corresponds to the frequency band of the principal component audio and is used for processing the left channel audio and a second frequency band weight parameter which is used for processing the right channel audio, carrying out weighted summation on the left channel audio and the right channel audio to obtain the principal component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a third frequency band weight parameter which corresponds to the frequency band to which the principal component audio belongs and is used for processing the left channel audio and a fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
In one embodiment of the present disclosure, surround component audio includes surround component left channel audio and surround component right channel audio;
a second determination unit, when configured to determine the surround component audio based on the left channel audio, the right channel audio, and the band weight parameter corresponding to the band to which the surround component audio belongs, configured to:
weighting and summing the left channel audio and the right channel audio based on a fifth frequency band weight parameter for processing the left channel audio and a sixth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component left channel audio;

and weighting and summing the left channel audio and the right channel audio based on a seventh frequency band weight parameter for processing the left channel audio and an eighth frequency band weight parameter for processing the right channel audio, which correspond to the frequency band to which the surround component audio belongs, to obtain the surround component right channel audio.
In one embodiment of the present disclosure, the surround component audio includes surround component left channel audio and surround component right channel audio;
a first obtaining module 702, when configured to obtain rendered audio based on surround component audio and the target mapping parameters, is configured to:
and performing weighted summation on the surround component left channel audio and the surround component right channel audio based on a first target mapping parameter used for processing the surround component left channel audio in the target mapping parameters and a second target mapping parameter used for processing the surround component right channel audio in the target mapping parameters, to obtain the rendered audio.
In an embodiment of the present disclosure, the second obtaining module 703, when configured to obtain target multi-channel audio corresponding to the audio to be processed based on the principal component audio and the rendered audio, is configured to:
superposing the principal component audio and the rendered audio to obtain the target multi-channel audio corresponding to the audio to be processed.
In one embodiment of the present disclosure, the apparatus further comprises a training module comprising:
a first obtaining unit configured to obtain a sample left channel audio and a sample right channel audio based on a sample multi-channel audio;
a first determining unit, configured to determine, based on a sample left channel audio and a sample right channel audio, a first sample audio feature, a second sample audio feature, and a third sample audio feature, the first sample audio feature being used to indicate a power sum of the sample left channel audio and the sample right channel audio, the second sample audio feature being used to indicate a power difference of the sample left channel audio and the sample right channel audio, and the third sample audio feature being used to indicate a real part cross-correlation power of the sample left channel audio and the sample right channel audio;
a second determining unit, configured to determine a first sample audio and a second sample audio based on a sample left channel audio, a sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through a first audio processing network;
a second obtaining unit, configured to obtain a predicted rendered audio based on a second sample audio and a prediction mapping parameter obtained by processing the second sample audio through a second audio processing network;
a third acquisition unit configured to acquire predicted multi-channel audio based on the first sample audio and the predicted rendered audio;
a training unit for training the first audio processing network and the second audio processing network based on a target loss function indicating a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and a target mapping parameter.
In one embodiment of the present disclosure, the sample multi-channel audio includes sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio;
a first obtaining unit, when configured to obtain a sample left channel audio and a sample right channel audio based on a sample multi-channel audio, configured to:
weighting the sample center channel audio and the sample left surround channel audio respectively through preset weighting parameters, and determining the sample left channel audio based on a weighted result and the sample front left channel audio;

weighting the sample center channel audio and the sample right surround channel audio respectively through the preset weighting parameters, and determining the sample right channel audio based on a weighted result and the sample front right channel audio.
In an embodiment of the disclosure, the first determining unit, when being configured to determine the first sample audio feature, the second sample audio feature and the third sample audio feature based on the sample left channel audio and the sample right channel audio, is configured to:
determining a first sample audio feature and a second sample audio feature based on the sample left channel audio, the complex conjugate audio corresponding to the sample left channel audio, the sample right channel audio, the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter;
determining a third sample audio feature based on the real part of the multiplication result of the sample left channel audio and the complex conjugate audio corresponding to the sample right channel audio, and the target smoothing parameter.
In one embodiment of the present disclosure, the training module further comprises:
the first processing unit is configured to input the first sample audio feature, the second sample feature, and the third sample audio feature into a first audio processing network, and process the first sample audio feature, the second sample feature, and the third sample audio feature through a first linear transformation layer and a second linear transformation layer included in the first audio processing network, respectively, to obtain a plurality of predicted frequency band weight parameters for obtaining the first sample audio and a plurality of predicted frequency band weight parameters for obtaining the second sample audio.
In one embodiment of the present disclosure, the apparatus further comprises:
and the second processing unit is used for inputting the second sample audio into a second audio processing network, and processing the second sample audio through a linear transformation layer included in the second audio processing network to obtain the prediction mapping parameter.
In an embodiment of the disclosure, the training unit, when being configured to train the first audio processing network and the second audio processing network based on an objective loss function indicative of a difference between predicted multi-channel audio and sample multi-channel audio, to obtain a frequency band weight parameter and an objective mapping parameter, is configured to:
determining a first loss function based on the predicted multi-channel audio and the sample multi-channel audio;
determining a plurality of second loss functions based on the predicted magnitude difference of the multi-channel audio between the channels and the magnitude difference of the sample multi-channel audio between the channels;
determining a target loss function based on the first loss function and the plurality of second loss functions;
and training the first audio processing network and the second audio processing network based on the target loss function until a training cutoff condition is met, and obtaining a frequency band weight parameter and a target mapping parameter.
It should be noted that although several modules/units of the audio processing apparatus are mentioned in the above detailed description, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules/units described above may be embodied in one module/unit. Conversely, the features and functionality of one module/unit described above may be further divided into a plurality of modules/units.
The embodiment of the disclosure also provides a computer readable storage medium. Fig. 8 is a schematic diagram of a computer-readable storage medium shown in the present disclosure according to an exemplary embodiment, as shown in fig. 8, the storage medium has a computer program 801 stored thereon, and when executed by a processor, the computer program 801 may perform an audio processing method provided by any embodiment of the present disclosure.
Embodiments of the present disclosure also provide a computing device that may include a memory for storing computer instructions executable on a processor, the processor for implementing an audio processing method provided by any of the embodiments of the present disclosure when executing the computer instructions. Referring to fig. 9, fig. 9 is a schematic block diagram illustrating a computing device 900 according to an exemplary embodiment of the present disclosure, where the computing device 900 may include, but is not limited to: a processor 910, a memory 920, and a bus 930 that couples various system components including the memory 920 and the processor 910.
The memory 920 stores computer instructions executable by the processor 910, so that the processor 910 can perform the audio processing method provided by any embodiment of the present disclosure. The memory 920 may include a random access memory unit (RAM) 921, a cache memory unit 922, and/or a read-only memory unit (ROM) 923. The memory 920 may further include a program tool 925 having a set of program modules 924, the program modules 924 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 930 may include, for example, a data bus, an address bus, a control bus, and the like. The computing device 900 may also communicate with external devices 950, such as a keyboard, bluetooth device, etc., through the I/O interface 940. The computing device 900 may also communicate with one or more networks through a network adapter 960, such as a local area network, a wide area network, a public network, and so forth. As shown in fig. 9, the network adapter 960 may also communicate with other modules of the computing device 900 via the bus 930.
Embodiments of the present disclosure also provide a computer program product, which includes a computer program, and when the program is executed by the processor 910 of the computing device 900, the audio processing method provided by any embodiment of the present disclosure may be implemented.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of audio processing, the method comprising:
determining a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to audio to be processed and frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, wherein the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio comprises a target left channel audio part and a target right channel audio part;

acquiring rendered audio based on the surround component audio and a target mapping parameter, wherein the target mapping parameter is obtained through second audio network training;

and acquiring the target multi-channel audio corresponding to the audio to be processed based on the principal component audio corresponding to the target multi-channel audio and the rendered audio.
2. The method of claim 1, wherein determining the principal component audio and the surround component audio corresponding to the target multi-channel audio based on the left channel audio and the right channel audio corresponding to the audio to be processed and the frequency band weight parameters corresponding to the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio belong respectively comprises:
determining a principal component audio based on the left channel audio, the right channel audio and a frequency band weight parameter corresponding to a frequency band to which the principal component audio belongs;
determining the surround component audio based on the left channel audio, the right channel audio, and a band weight parameter corresponding to a band to which the surround component audio belongs.
3. The method of claim 2, wherein the principal component audio comprises a principal component left channel audio and a principal component right channel audio;
the determining the principal component audio based on the left channel audio, the right channel audio, and a frequency band weight parameter corresponding to a frequency band to which the principal component audio belongs includes:
weighting and summing the left channel audio and the right channel audio based on a first frequency band weight parameter which corresponds to a frequency band to which a principal component audio belongs and is used for processing the left channel audio and a second frequency band weight parameter which is used for processing the right channel audio to obtain the principal component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a third frequency band weight parameter which corresponds to the frequency band of the principal component audio and is used for processing the left channel audio and a fourth frequency band weight parameter which is used for processing the right channel audio to obtain the principal component right channel audio.
4. The method of claim 2, wherein the surround component audio comprises surround component left channel audio and surround component right channel audio;
the determining the surround component audio based on the left channel audio, the right channel audio, and a band weight parameter corresponding to a band to which the surround component audio belongs includes:
weighting and summing the left channel audio and the right channel audio based on a fifth frequency band weight parameter corresponding to the frequency band of the surround component audio and used for processing the left channel audio and a sixth frequency band weight parameter used for processing the right channel audio to obtain the surround component left channel audio;
and weighting and summing the left channel audio and the right channel audio based on a seventh frequency band weight parameter corresponding to the frequency band of the surround component audio and used for processing the left channel audio and an eighth frequency band weight parameter used for processing the right channel audio to obtain the surround component right channel audio.
5. The method of claim 1, wherein the surround component audio comprises surround component left channel audio and surround component right channel audio;
the obtaining rendered audio based on the surround component audio and the target mapping parameter includes:
and performing weighted summation on the surround component left channel audio and the surround component right channel audio based on a first target mapping parameter used for processing the surround component left channel audio in the target mapping parameters and a second target mapping parameter used for processing the surround component right channel audio in the target mapping parameters to obtain the rendered audio.
6. The method of claim 1, wherein the training procedure of the frequency band weight parameter and the target mapping parameter comprises:
obtaining a sample left channel audio and a sample right channel audio based on the sample multi-channel audio;
determining, based on the sample left channel audio and the sample right channel audio, a first sample audio feature indicative of a sum of powers of the sample left channel audio and the sample right channel audio, a second sample audio feature indicative of a difference in power of the sample left channel audio and the sample right channel audio, and a third sample audio feature indicative of real cross-correlation powers of the sample left channel audio and the sample right channel audio;
determining a first sample audio and a second sample audio based on the sample left channel audio, the sample right channel audio, and a plurality of predicted frequency band weight parameters obtained by processing the first sample audio feature, the second sample audio feature, and the third sample audio feature through a first audio processing network;
obtaining a predicted rendering audio based on the second sample audio and a predicted mapping parameter obtained by processing the second sample audio through a second audio processing network;
obtaining a predicted multi-channel audio based on the first sample audio and the predicted rendered audio;
training the first and second audio processing networks based on an objective loss function indicative of a difference between the predicted multi-channel audio and the sample multi-channel audio, resulting in a frequency band weight parameter and an objective mapping parameter.
7. The method of claim 6, wherein the sample multi-channel audio comprises sample front left channel audio, sample front right channel audio, sample center channel audio, sample left surround channel audio, and sample right surround channel audio;
the obtaining of sample left channel audio and sample right channel audio based on sample multi-channel audio comprises:
weighting the sample center channel audio and the sample left surround channel audio respectively through preset weighting parameters, and determining the sample left channel audio based on a weighted result and the sample front left channel audio;

weighting the sample center channel audio and the sample right surround channel audio respectively through the preset weighting parameters, and determining the sample right channel audio based on a weighted result and the sample front right channel audio.
8. An audio processing apparatus, characterized in that the apparatus comprises:
the determining module is used for determining a principal component audio and a surround component audio corresponding to a target multi-channel audio based on a left channel audio and a right channel audio corresponding to audio to be processed and frequency band weight parameters of the frequency bands to which the principal component audio and the surround component audio of the target multi-channel audio respectively belong, wherein the frequency band weight parameters are obtained through training of a first audio network, and the target multi-channel audio comprises a target left channel audio part and a target right channel audio part;

the first obtaining module is used for obtaining rendered audio based on the surround component audio and a target mapping parameter, the target mapping parameter being obtained through second audio network training;

and the second obtaining module is used for obtaining the target multi-channel audio corresponding to the audio to be processed based on the principal component audio corresponding to the target multi-channel audio and the rendered audio.
9. A computing device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements operations performed by the audio processing method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has a program stored thereon, which is executed by a processor to perform operations performed by the audio processing method according to any one of claims 1 to 7.