
CN117711440B - Audio quality evaluation method and related device - Google Patents


Info

Publication number
CN117711440B
Authority
CN
China
Prior art keywords
voice
segment
audio
fragments
evaluation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311765461.3A
Other languages
Chinese (zh)
Other versions
CN117711440A (en)
Inventor
张力恒
李凡
陈靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shuhang Technology Beijing Co ltd
Original Assignee
Shuhang Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shuhang Technology Beijing Co ltd
Priority to CN202311765461.3A
Publication of CN117711440A
Application granted
Publication of CN117711440B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques for measuring the quality of voice signals
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an audio quality evaluation method and a related device, wherein the method relates to the field of audio and video, and comprises the following steps: acquiring audio to be evaluated; classifying the audio to be evaluated, and determining a voice segment and a music segment in the audio to be evaluated; extracting a plurality of voice fragments from the audio to be evaluated according to the position information of the voice fragments and the position information of the music fragments; performing quality evaluation on the plurality of voice fragments to obtain an evaluation result of each voice fragment in the plurality of voice fragments; and obtaining an evaluation result of the audio to be evaluated based on the evaluation result of each voice segment. According to the method provided by the application, the defects of the prior art can be overcome, and the audio quality evaluation result is more accurate.

Description

Audio quality evaluation method and related device
Technical Field
The application relates to the field of audio and video, in particular to an audio quality evaluation method and a related device.
Background
With the widespread use of audio, such as voice communication, music playing, voice recognition, etc., ensuring that the audio quality reaches a high level is critical to the user experience and application performance. Therefore, developing an accurate and reliable audio quality assessment method is critical to providing quality audio services.
At present, no-reference audio quality evaluation methods include the following: subjective scores are collected from a large number of participants, and the subjective audio quality scores are then aggregated and analyzed to obtain an average score; alternatively, non-subjective audio assessment algorithms evaluate audio quality automatically, typically using machine learning to analyze the characteristics of the audio and then output an audio quality score.
However, none of the currently known no-reference audio quality evaluation algorithms can evaluate audio that contains music; if quality evaluation is forcibly performed on such audio, the evaluation result is inaccurate.
Disclosure of Invention
The embodiment of the application provides an audio quality evaluation method and a related device, which can overcome the defects of the prior art, can realize the evaluation of audio quality, and enable the audio quality evaluation result to be more accurate by extracting voice fragments from audio.
In a first aspect, an embodiment of the present application provides an audio quality evaluation method, including:
Acquiring audio to be evaluated;
Classifying the audio to be evaluated, and determining a voice segment and a music segment in the audio to be evaluated;
extracting a plurality of voice fragments from the audio to be evaluated according to the position information of the voice fragments and the position information of the music fragments;
Performing quality evaluation on the voice fragments to obtain an evaluation result of each voice fragment in the voice fragments;
and obtaining an evaluation result of the audio to be evaluated based on the evaluation result of each voice segment.
It can be seen that in the embodiment of the application, according to the position information of the voice fragments and the music fragments, a plurality of voice fragments are extracted from the audio to be evaluated, so that the audio with only voice can be extracted, the influence of the interference of music, background and the like on the evaluation result is avoided, and the audio quality evaluation result is more accurate. And based on the evaluation result of each voice segment, obtaining the overall evaluation result of the audio to be evaluated. Thus, the quality of each voice segment can be comprehensively considered to obtain the evaluation of the whole audio, and a comprehensive evaluation conclusion is provided.
Based on the first aspect, in a possible implementation manner, the classifying processing is performed on the audio to be evaluated, and determining a voice segment and a music segment in the audio to be evaluated includes:
dividing the audio to be evaluated into a plurality of fragments;
extracting features of each of the plurality of segments;
and determining whether each of the plurality of segments is the voice segment or the music segment according to the characteristics of each segment.
It can be seen that in the embodiment of the application, the audio to be evaluated is divided into a plurality of segments, so that the audio can be conveniently and finely analyzed and evaluated; by extracting the characteristics of each segment and analyzing the characteristics, each segment can be accurately classified, and a foundation is laid for subsequent evaluation and analysis.
Based on the first aspect, in a possible implementation manner, the determining whether each of the plurality of segments is the speech segment or the music segment according to the feature of each segment includes:
Inputting the characteristics of each segment into a convolutional neural network to obtain the probability that each segment is the voice segment and the probability that each segment is the music segment;
And determining whether each segment is the voice segment or the music segment according to the probability that each segment is the voice segment and the probability that each segment is the music segment.
It can be seen that in the embodiment of the application, the convolutional neural network is used for classifying the segment types, and the convolutional neural network can learn the local features in the audio segment, so that the classification of the fine segment is realized. Compared with the traditional method based on rule or manual feature extraction, the convolutional neural network can better capture the fine difference in the audio fragments, and the classification precision and accuracy are improved.
Based on the first aspect, in a possible implementation manner, the audio to be evaluated includes time information, the position information of the voice segment refers to time position information of the voice segment in the audio to be evaluated, and the position information of the music segment refers to time position information of the music segment in the audio to be evaluated.
Based on the first aspect, in a possible implementation manner, if adjacent segments in the plurality of segments overlap in time position, the voice segment in the plurality of segments overlaps in time position with the music segment;
before said quality evaluation of said plurality of speech segments, said method further comprises:
and deleting the voice fragments overlapped with the music fragments in time positions.
It can be seen that in the embodiment of the application, before quality evaluation is performed on a plurality of voice fragments, the voice fragments overlapped with the music fragments in time positions are deleted, so that the accuracy of evaluation can be improved. Because the overlapping speech segments may be disturbed by music, resulting in inaccurate evaluation results. By deleting the overlapping part, the object to be evaluated can be ensured to be a pure voice fragment, thereby improving the accuracy and reliability of the evaluation.
Based on the first aspect, in a possible implementation manner, the performing quality evaluation on the plurality of voice segments to obtain an evaluation result of each voice segment in the plurality of voice segments includes:
Inputting a plurality of voice fragments into a voice evaluation model to obtain an evaluation result of each voice fragment in the voice fragments; the speech evaluation model comprises a feature extraction layer, a convolution layer, a self-attention network layer and an attention pooling layer, wherein,
The feature extraction layer is used for extracting the features of each voice segment in the voice segments;
The convolution layer is used for carrying out dimension reduction processing on the characteristics of each voice segment to obtain dimension reduction characteristics of each voice segment;
The self-attention network layer is used for carrying out weighting processing on the dimension reduction characteristics of each voice segment based on a self-attention mechanism to obtain the weighting characteristics of each voice segment;
The attention pooling layer is used for evaluating each voice segment according to the weighted characteristics of each voice segment to obtain the evaluation result of each voice segment.
It can be seen that in the embodiment of the application, the speech evaluation model can extract useful features of a speech segment, capture context information and evaluate according to weighted features through combination of layers such as feature extraction, dimension reduction, self-attention and attention pooling. Thus, the accuracy and the robustness of the evaluation can be improved, and an accurate evaluation result is provided for each voice segment.
Based on the first aspect, in a possible implementation manner, the method is applied to a live or on-demand scene.
In a second aspect, an embodiment of the present application provides a method for training a speech evaluation model, including:
acquiring a plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments, wherein the labels corresponding to each reference voice fragment in the plurality of reference voice fragments comprise the average value of the evaluation results of a plurality of users on the reference voice fragments;
Training based on the plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments to obtain a voice evaluation model, wherein a loss function in the voice evaluation model comprises the number of users evaluating the reference voice fragments.
It can be seen that in the embodiment of the application, the objectivity, reliability and generalization capability of the speech evaluation model can be improved by training through using a plurality of reference speech fragments and corresponding labels and adding the number of the evaluation users to the loss function. Therefore, the model can be better adapted to the evaluation requirements of different users, and accurate and consistent evaluation results are provided.
Based on the second aspect, in a possible implementation manner, the speech evaluation model is obtained through multiple rounds of training, in each round of training, the loss function is used for solving a root mean square error of a product between a prediction error of a current reference speech segment and an alpha power of the number of users evaluating the current reference speech segment, wherein the prediction error of the current reference speech segment is a difference value between a label corresponding to the current reference speech segment and an evaluation result output by the speech evaluation model on the current reference speech segment, and alpha is an adjustable parameter.
It can be seen that in the embodiment of the application, the performance and accuracy of the speech evaluation model can be gradually optimized through multiple rounds of training and using the loss function considering the prediction error and the number of the evaluation users, so that the speech evaluation model can better adapt to the evaluation requirements of different users, an accurate and consistent evaluation result is provided, and flexible control of the evaluation weight is realized through adjusting the parameter alpha.
In a third aspect, an embodiment of the present application provides an audio quality evaluation apparatus, including:
the acquisition module is used for acquiring the audio to be evaluated;
The determining module is used for classifying the audio to be evaluated and determining a voice segment and a music segment in the audio to be evaluated;
the extraction module is used for extracting a plurality of voice fragments from the audio to be evaluated according to the position information of the voice fragments and the position information of the music fragments;
The quality evaluation module is used for performing quality evaluation on the voice fragments to obtain an evaluation result of each voice fragment in the voice fragments;
the quality evaluation module is further used for obtaining an evaluation result of the audio to be evaluated based on the evaluation result of each voice segment.
Based on the third aspect, in a possible implementation manner, the determining module is configured to:
dividing the audio to be evaluated into a plurality of fragments;
extracting features of each of the plurality of segments;
and determining whether each of the plurality of segments is the voice segment or the music segment according to the characteristics of each segment.
Based on the third aspect, in a possible implementation manner, the determining module is further configured to:
Inputting the characteristics of each segment into a convolutional neural network to obtain the probability that each segment is the voice segment and the probability that each segment is the music segment;
And determining whether each segment is the voice segment or the music segment according to the probability that each segment is the voice segment and the probability that each segment is the music segment.
Based on the third aspect, in a possible implementation manner, the audio to be evaluated includes time information, the position information of the voice segment refers to time position information of the voice segment in the audio to be evaluated, and the position information of the music segment refers to time position information of the music segment in the audio to be evaluated.
Based on the third aspect, in a possible implementation manner, when adjacent segments in the plurality of segments overlap in time position, the voice segment in the plurality of segments overlaps in time position with the music segment;
the extraction module is used for deleting the voice fragments overlapped with the music fragments in time positions.
Based on the third aspect, in a possible implementation manner, the quality evaluation module is configured to input a plurality of voice segments into a voice evaluation model to obtain an evaluation result of each voice segment in the plurality of voice segments; the speech evaluation model comprises a feature extraction layer, a convolution layer, a self-attention network layer and an attention pooling layer, wherein,
The feature extraction layer is used for extracting the features of each voice segment in the voice segments;
The convolution layer is used for carrying out dimension reduction processing on the characteristics of each voice segment to obtain dimension reduction characteristics of each voice segment;
The self-attention network layer is used for carrying out weighting processing on the dimension reduction characteristics of each voice segment based on a self-attention mechanism to obtain the weighting characteristics of each voice segment;
The attention pooling layer is used for evaluating each voice segment according to the weighted characteristics of each voice segment to obtain the evaluation result of each voice segment.
Each functional module in the third aspect is configured to implement the method described in any one of the foregoing first aspects and implementations of the first aspect.
In a fourth aspect, an embodiment of the present application provides a training device for a speech evaluation model, including:
The system comprises an acquisition module, a judgment module and a judgment module, wherein the acquisition module is used for acquiring a plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments, and the labels corresponding to each reference voice fragment in the plurality of reference voice fragments comprise the average value of evaluation results of a plurality of users on the reference voice fragments;
The training module is used for training based on the plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments to obtain a voice evaluation model, and the loss function in the voice evaluation model comprises the number of users evaluating the reference voice fragments.
Based on the fourth aspect, in a possible implementation manner, the speech evaluation model is obtained through multiple rounds of training, in each round of training, the loss function is used for solving a root mean square error of a product between a prediction error of a current reference speech segment and an alpha power of the number of users evaluating the current reference speech segment, wherein the prediction error of the current reference speech segment is a difference value between a label corresponding to the current reference speech segment and an evaluation result output by the speech evaluation model on the current reference speech segment, and alpha is an adjustable parameter.
Each functional module in the fourth aspect is configured to implement the method described in any one of the foregoing second aspect and implementation manners of the second aspect.
In a fifth aspect, embodiments of the present application provide a computing device, including a memory for storing instructions and a processor for executing the instructions stored in the memory to implement the method described in the first aspect and any one of the possible implementations of the first aspect, or to implement the method described in the second aspect and any one of the possible implementations of the second aspect.
In a sixth aspect, embodiments of the present application provide a computer storage medium comprising program instructions which, when executed by an apparatus, cause the apparatus to perform the method described in the first aspect and any one of the possible implementations of the first aspect, or cause the apparatus to perform the method described in the second aspect and any one of the possible implementations of the second aspect.
In a seventh aspect, the present application provides a computer program product comprising program instructions for performing the method of any one of the preceding first aspect and any one of the possible implementations of the first aspect, or for performing the method of any one of the preceding second aspect and any one of the possible implementations of the second aspect, when the computer program product is executed by a computing device. The computer program product may be a software installation package which, in case the method provided by any of the possible designs of the first or second aspect described above is required, may be downloaded and executed on a device to implement the method of the first aspect and any of the possible implementations of the first aspect or to implement the method of the second aspect and any of the possible implementations of the second aspect.
Drawings
In order to more clearly describe the embodiments of the present application or the technical solutions in the background art, the following description will describe the drawings that are required to be used in the embodiments of the present application or the background art.
FIG. 1 is a schematic diagram of a system architecture according to the present application;
fig. 2 is a flowchart for evaluating audio quality in a live scene according to the present application;
fig. 3 is a flowchart for evaluating audio quality in an on-demand scene provided by the present application;
FIG. 4 is a schematic flow chart of an audio quality evaluation method provided by the application;
FIG. 5 is a schematic flow chart of determining the type of audio to be evaluated provided by the application;
FIG. 6 is a block diagram of a convolutional neural network provided by the present application;
FIG. 7 is a schematic diagram of a speech evaluation model provided by the application;
FIG. 8 is a flowchart of the present application for obtaining an evaluation result of each of a plurality of speech segments;
FIG. 9 is a schematic flow chart of a training method of a speech evaluation model provided by the application;
FIG. 10 is a schematic diagram of yet another system architecture provided by the present application;
Fig. 11 is a schematic structural diagram of a Convolutional Neural Network (CNN) according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an apparatus for audio quality assessment provided by the present application;
FIG. 13 is a schematic diagram of a training device for a speech evaluation model according to the present application;
fig. 14 is a schematic structural diagram of a computing device provided by the present application.
Detailed Description
Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It is noted that the terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be noted that the term "comprising" and any variations thereof, as used in this specification and the appended claims, is intended to cover a non-exclusive inclusion. For example, a system, article, or apparatus that comprises a list of elements is not limited to only those elements but may include other elements not expressly listed or inherent to such article or apparatus.
It should also be understood that the term "if" may be interpreted as "when..once" or "in response to a determination" or "in response to detection" or "in the case of …" depending on the context.
In a live or on-demand scene, the audio typically contains music. For example, a piece of audio may contain both music and the anchor's speech, where the music serves as background music (BGM) underneath the anchor's speech, i.e., the music and the anchor's speech overlap in time. As another example, a piece of audio may contain both music and the anchor's speech, but the music and the speech do not overlap in time.
For audio containing music in such live or on-demand scenes, it is often not possible to score the audio.
The application provides a method for scoring and evaluating audio in a live or on-demand scene. Before describing the method embodiments of the present application, a system architecture of the present application will be described.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture provided in the present application, where the system includes at least one terminal 110, at least one network device 120, and at least one server 130.
The terminal 110 may be, for example, a desktop computer, a notebook, a mobile phone, a tablet, a server, etc. A user (anchor) may live-stream through the terminal 110. The terminal 110 may be configured to store the audio and video of the user's (anchor's) live broadcast, and the terminal 110 is further configured to send the audio and video of the user (anchor) to the server 130 through a network device.
The network device 120 is used for communication between the terminal 110 and the server 130 via a communication network of any communication mechanism/communication standard. The communication network may be a wide area network, a local area network, a point-to-point connection, or any combination thereof. For example, the network device 120 is configured to send the audio and video on the terminal 110 to the server 130.
The server 130 may be, for example, a computing device located in a cloud, such as a central server, where the cloud may be a private cloud, a public cloud, or a hybrid cloud. The server 130 is configured to receive the audio and video sent by the terminal 110 and process the audio and video, for example, in the present application, the server 130 is configured to receive the audio sent by the terminal 110 and perform a series of processing on the audio, so as to finally obtain the scoring result of the audio.
The system architecture shown in fig. 1 is merely an example and is not meant to limit the present application.
The audio evaluation system provided by the application can be applied to live broadcast or on-demand scenes. In live scenes, the video content is typically real-time, and the viewer can interact with the anchor during the live broadcast, for example, by means of a bullet screen, comment, etc. Live scenes are common for the transmission of a variety of real-time events, such as sporting events, news stories, game live, online education, and the like. Live scenes are characterized by real-time and interactive properties, and viewers can watch the ongoing content in real time and communicate with other viewers and anchor. Referring to fig. 2, fig. 2 is a flowchart of audio quality evaluation in a live scene provided by the present application. The following describes the flow of audio quality evaluation in live scenes in detail.
(1) Entering live room addresses
A specific tool or interface is used by a person associated with the evaluation system, such as an administrator of the live platform, a technician, or an operator of the evaluation system, to enter a live room address for extracting audio from the live stream. The live room address may be a live room link on a live platform or take another form.
(2) Live video clip pulling
Pull streaming is performed according to the entered live room address to obtain the audio. Pull streaming refers to the process of acquiring real-time audio and video data streams from a live broadcast source. Based on the entered live room address, pulling the live stream from the live platform can be realized by establishing a connection with the live platform and using a corresponding streaming media protocol. In this process, a streaming period may be set so that audio and video data are stored at set time intervals, for example, a 2-minute video is stored every 2 minutes, forming short video files. Such segmented saving facilitates subsequent processing and storage management.
(3) Decoding an audio stream
During the streaming process, the acquired audio data is typically compression encoded. Decoding the audio stream is to restore the audio data from a compressed format to the original audio signal for subsequent audio evaluation.
(4) Input to an audio evaluation system
The decoded audio signal is transmitted to an audio evaluation system, and through analysis and evaluation of the audio evaluation system, evaluation of audio quality can be obtained so as to guide subsequent processing and improvement.
In an on-demand scene, video content is produced in advance and stored on a server, and a viewer can choose what to watch according to their own time and interests. On-demand scenes are common on various video platforms and for online courses, movies, television shows, and the like. The on-demand scene is characterized by flexibility and selectivity: the audience can freely select content according to their own preferences and schedule. Referring to fig. 3, fig. 3 is a flowchart of audio quality evaluation in an on-demand scene provided by the present application. The following describes the flow of audio quality evaluation in on-demand scenes in detail.
(1) Input video on demand index
By inputting the index of the video on demand, the required video can be conveniently and quickly found, and the index can contain the information of the title, description, keywords, classification and the like of the video.
(2) Acquiring video-on-demand
Audio is obtained from the video on demand index. This may be accomplished in a number of ways, including downloading from a server of the audio provider, retrieving from cloud storage, streaming, etc.
(3) Decoding an audio stream
After the audio data is obtained, the corresponding decoder is used for decoding the audio data, and the audio data is restored into the original audio signal. The decoding process is similar to the process of decoding an audio stream in a live scene.
(4) Input to an audio evaluation system
And transmitting the decoded audio signal to an audio evaluation system for audio quality evaluation.
In one implementation manner, a threshold value may be preset, the obtained evaluation result of the audio to be evaluated is compared with the threshold value, and if the score is lower than the threshold value, the audio enhancement processing may be performed at the client, that is, after the audio is decoded, the audio quality is enhanced through some processing and then played to the user.
The audio quality evaluation method provided by the application is described below. Referring to fig. 4, fig. 4 is a schematic flow chart of an audio quality evaluation method according to the present application, which includes, but is not limited to, the following description.
S101, acquiring audio to be evaluated.
The audio includes a plurality of speech segments and a plurality of music segments; the speech segments are discontinuous in time, and the music segments are likewise discontinuous in time.
In one implementation, in a live scene, capturing audio data typically involves a pull stream technique. The streaming refers to obtaining data from a remote audio and video source server through network connection so as to play, process or store locally. By entering the live room address, the system can acquire audio data from the live source server using a pull streaming protocol. In this process, a period of streaming may be set, and audio and video data may be stored at set time intervals, for example, a video of 2 minutes is stored every 2 minutes, to form a video file of a short period. Such segmented preservation facilitates subsequent processing and storage management.
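As a rough illustration only (not part of the application), the segmented saving just described could be scripted by driving ffmpeg's segment muxer from Python; the stream URL and output file names below are hypothetical placeholders.

```python
import subprocess

# Illustrative sketch: pull a live stream and save it in 120-second chunks
# using ffmpeg's segment muxer. The URL and file names are placeholders.
live_stream_url = "rtmp://example.com/live/room_id"   # hypothetical address

subprocess.run([
    "ffmpeg",
    "-i", live_stream_url,      # pull the live stream
    "-c", "copy",               # no re-encoding, just remux
    "-f", "segment",            # split the output into fixed-length files
    "-segment_time", "120",     # 2 minutes per file, as in the example above
    "clip_%03d.mp4",
], check=True)
```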
In another implementation, in an on-demand scenario, acquiring audio and video data may involve acquiring an audio and video file from a cloud on-demand service. The video-on-demand index provided by the user may be an identification pointing to a particular video file in the cloud storage. The system can call the on-demand service interface through the index to acquire corresponding audio data. In this process, on-demand pulling can also be implemented, i.e. only specific segments required by the user are acquired, without having to download the entire video file.
Whether live or on demand, the acquisition of audio data requires consideration of the decoding and processing of the audio stream. In a live scene, the live video can be decoded in real time and saved as a local file. In the on-demand scene, the required audio data can be analyzed and acquired as required through a streaming media analysis technology.
S102, classifying the audio to be evaluated, and determining the voice fragments and the music fragments in the audio to be evaluated.
And classifying the audio to be evaluated, and determining whether the audio to be evaluated is a voice segment or a music segment. Referring to fig. 5, fig. 5 is a schematic flow chart of determining the type of audio to be evaluated according to the present application.
S1021, dividing the audio into a plurality of fragments with a first fixed duration.
The audio is divided into a plurality of segments of a first fixed duration, for example, each segment being 8 seconds in length. By cutting the audio into a plurality of segments of a first fixed duration, more refined processing and analysis of each segment is enabled. The length of the segment can be adjusted according to specific requirements, and the application is not particularly limited.
Alternatively, when dividing the audio into a plurality of segments of the first fixed duration, an overlapping region of a certain length of time may be reserved between every two adjacent segments, for example, an overlapping region of 2 seconds. The overlapping region ensures a smooth transition between adjacent segments and avoids abrupt jumps introduced by the cutting process. The length of the overlapping region can also be set according to actual requirements to meet different application scenes, and the application is not specifically limited in this regard.
S1022, dividing each of the plurality of segments of the first fixed duration into a plurality of frames of the second fixed duration.
Dividing each of the plurality of segments of the first fixed duration into a plurality of frames of the second fixed duration, for example, 25 milliseconds per frame duration, can enable more accurate capture of transient characteristics of the audio signal. Here, the present application is not particularly limited as to the duration of each frame.
Alternatively, an overlapping area of a certain time length may be set for each two adjacent frames to ensure a smooth transition between frames, for example, the time length of the overlapping area may be set to 15 ms, or may be set according to actual requirements, and the present application is not limited specifically.
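For illustration, the following is a minimal numpy sketch of the splitting in S1021 and S1022, assuming mono PCM samples and an assumed 16 kHz sampling rate; the 8 s / 2 s and 25 ms / 15 ms figures are the example values quoted above.

```python
import numpy as np

def split_with_overlap(samples: np.ndarray, sr: int,
                       win_s: float, overlap_s: float):
    """Cut a 1-D sample array into windows of win_s seconds,
    with overlap_s seconds shared between adjacent windows."""
    win = int(win_s * sr)
    hop = int((win_s - overlap_s) * sr)
    return [samples[i:i + win]
            for i in range(0, max(len(samples) - win, 0) + 1, hop)]

sr = 16000                                  # assumed sampling rate
audio = np.random.randn(sr * 30)            # stand-in for 30 s of audio

segments = split_with_overlap(audio, sr, win_s=8.0, overlap_s=2.0)    # S1021
frames = [split_with_overlap(seg, sr, win_s=0.025, overlap_s=0.015)   # S1022
          for seg in segments]
print(len(segments), len(frames[0]))        # number of segments, frames per segment
```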
S1023, calculating the Mel spectrum of each frame with the second fixed duration.
Mel-spectrum is a widely used representation in sound signal processing for describing the energy distribution of an audio signal at different frequencies. It better captures the characteristics of the audio signal by simulating the characteristics of the human ear auditory system.
Calculating the mel-spectrum for each frame of the second fixed duration may convert the audio signal of each frame from the time domain to the frequency domain for better analysis of its spectral characteristics. Mel-spectrum emphasizes important frequency regions of the audio signal in human ear perception, helps to reduce dimensionality, reduces computational effort, and improves feature distinguishability.
In one implementation, first, each frame with a second fixed duration is subjected to short-time fourier transform to convert a signal from a time domain to a frequency domain, a set of mel filters are designed, center frequencies of the filters are uniformly distributed on mel scales, and the filters simulate the perception characteristics of human ears, so that important frequency areas of an audio signal can be better captured; then, calculating the response of each frame on each filter, and carrying out convolution operation on each frequency spectrum component and the corresponding filter in the process; then, calculating the energy in each filter band, and obtaining a result more conforming to the perception of human ears by adopting square sum or logarithmic operation; finally, the energy in each filter band is combined into a mel spectrum, thereby completing the computation of the mel spectrum for each frame of a second fixed duration.
Mel-spectra are typically a matrix in which each column represents an audio frame and each row represents a different mel-frequency band. Each element represents energy or amplitude within a corresponding frequency band. Thus, the mel-spectrum can be seen as a matrix representing the characteristics of the audio signal at different times and frequencies. The dimension of the mel-spectrum is generally dependent on the number of filters, with higher dimensions capturing spectral information more comprehensively, but also increasing the computational effort. For example, the dimension of the mel-spectrum may be set to 64, which may cover enough frequency bands to capture important information in the audio, without being computationally expensive. Regarding the dimension of the mel spectrum, the present application is not particularly limited.
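For reference, a 64-band mel spectrogram as in the example above could be computed with the open-source librosa library; the sampling rate, frame length and hop length below are assumptions consistent with the earlier example figures.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr * 8).astype(np.float32)   # stand-in for an 8 s segment

# 25 ms frames with a 10 ms hop (i.e. 15 ms overlap) and 64 mel bands,
# matching the example figures given in the text.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=int(0.025 * sr),
    hop_length=int(0.010 * sr),
    n_mels=64,
)
log_mel = librosa.power_to_db(mel)               # log scale is closer to human hearing
print(log_mel.shape)                             # (64, number_of_frames)
```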
S1024, inputting the Mel spectrums of the fragments with the first fixed time length into a convolutional neural network to obtain the prediction results of the fragments with the first fixed time length.
The mel spectrums of the fragments with the first fixed time length are input into a convolutional neural network, the convolutional neural network can better understand the content of an audio signal by learning the mel spectrum characteristics, and the prediction results of the fragments with the first fixed time length are output, wherein the prediction results can be probability values of existence of voice and music. Comparing the voice probability value with a preset first threshold value, and if the voice probability value is larger than the first threshold value, indicating that the voice exists in the current segment; and comparing the music probability value with a preset first threshold value, and if the music probability value is larger than the first threshold value, indicating that music exists in the current segment.
In one implementation, the mel spectra of the plurality of segments of the first fixed duration are input into a convolutional neural network comprising a two-dimensional convolution layer, a batch normalization layer, a linear rectification (ReLU) layer, a fully connected layer, and the like. Referring to fig. 6, fig. 6 is a block diagram of a convolutional neural network provided by the present application. The two-dimensional convolution layer is used for learning the input mel-spectrum features. The batch normalization layer is used to accelerate the training process by normalizing the mean and variance of each batch of inputs, which alleviates the vanishing-gradient problem, speeds up convergence, and improves the stability of the network. The linear rectification layer sets all negative input values to zero and keeps positive values unchanged; this provides nonlinear modeling capability and sparse activation of features, helping the network learn complex features. The depth of the model can be increased and higher-level features extracted by repeatedly applying the combination of two-dimensional convolution, batch normalization and linear rectification. Specifically, this combination is applied 33 times in a loop, each loop comprising a two-dimensional convolution layer, a batch normalization layer and a linear rectification function. In this cyclical manner, the model progressively builds more complex feature representations to better capture the structure and patterns of the input data. The fully connected layer is the last layer of the convolutional neural network and maps the features extracted by the previous layers to the final output. The convolutional neural network may have an output dimension of 20 x 2: the first dimension, 20, corresponds to the prediction results for different time segments within a fixed-duration segment, and the second dimension, 2, corresponds to the probabilities that speech and music are present. The first threshold is set to 0.5: the speech probability value is compared with 0.5, and if it is greater than 0.5, speech is present in the current segment; the music probability value is compared with 0.5, and if it is greater than 0.5, music is present in the current segment.
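Since the exact network configuration is not disclosed beyond the layer types, the following PyTorch sketch is only an assumed arrangement of the components listed above (repeated two-dimensional convolution, batch normalization and ReLU blocks, a fully connected layer, and a 20 x 2 output of speech/music probabilities); the channel widths, pooling layers and the use of three repeated blocks are illustrative choices, not the patented design.

```python
import torch
import torch.nn as nn

class SpeechMusicClassifier(nn.Module):
    """Illustrative stand-in: conv -> batchnorm -> ReLU blocks followed by a
    fully connected layer producing 20 time steps x 2 class probabilities."""
    def __init__(self, n_mels: int = 64, n_steps: int = 20):
        super().__init__()
        blocks = []
        channels = [1, 16, 32, 64]                 # assumed channel widths
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),                   # shrink time/frequency axes
            ]
        self.features = nn.Sequential(*blocks)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # collapse to one vector per channel
            nn.Flatten(),
            nn.Linear(channels[-1], n_steps * 2),
        )
        self.n_steps = n_steps

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, n_frames) -> (batch, 20, 2) probabilities
        logits = self.head(self.features(mel)).view(-1, self.n_steps, 2)
        return torch.sigmoid(logits)               # independent speech/music probabilities

probs = SpeechMusicClassifier()(torch.randn(1, 1, 64, 801))
speech_present = probs[..., 0] > 0.5               # example first threshold of 0.5
print(probs.shape, speech_present.shape)
```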
S103, extracting a plurality of voice fragments from the audio to be evaluated according to the position information of the voice fragments and the position information of the music fragments.
The audio to be evaluated comprises time information, the position information of the voice segment refers to the time position information of the voice segment in the audio to be evaluated, and the position information of the music segment refers to the time position information of the music segment in the audio to be evaluated.
In one implementation, adjacent ones of the plurality of segments overlap in time position, and the speech segment of the plurality of segments overlaps in time position with the music segment. Thus, a speech segment overlapping with the music segment in time position is deleted.
In one implementation, the obtained prediction results of the plurality of segments with the first fixed duration are corresponding to a time axis, the adjacent segments of speech are fused together to obtain a plurality of speech segments, the adjacent segments of music are fused together to obtain a plurality of music segments, and each segment has a start time and an end time. Comparing the time length of the voice segment with a second threshold value, eliminating the voice segment under the condition that the time length of the voice segment is smaller than the second threshold value, comparing the time length of the music segment with a third threshold value, and eliminating the music segment under the condition that the time length of the music segment is smaller than the third threshold value. And processing the voice fragments according to the time information, and eliminating the voice fragments overlapped with the music in time to finally obtain a plurality of fragments with voice activity only.
For example, the second threshold is set to 0.8 seconds, the time lengths of several speech segments are sequentially compared with the second threshold, if the time length of the speech segment is less than 0.8 seconds, the speech segment is rejected, which helps to filter out very short speech segments to preserve more meaningful speech content; the third threshold is set to 0.3 seconds, and the time lengths of several pieces of music are compared with the third threshold in turn, and if the time length of the piece of music is less than 0.3 seconds, the piece of music is removed, which helps to filter out very short pieces of music to preserve longer music content.
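The following is a simplified Python sketch of the post-processing just described, operating on (start, end) time spans in seconds; the 0.8 s and 0.3 s thresholds are the example values above, and the interval arithmetic is one assumed way to remove speech that overlaps music.

```python
def merge_adjacent(spans):
    """Fuse time spans that touch or overlap into larger spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]

def drop_short(spans, min_len):
    """Reject spans shorter than min_len seconds."""
    return [(s, e) for s, e in spans if e - s >= min_len]

def subtract_music(speech_spans, music_spans):
    """Keep only the parts of speech spans that do not overlap any music span."""
    result = []
    for s, e in speech_spans:
        pieces = [(s, e)]
        for ms, me in music_spans:
            next_pieces = []
            for ps, pe in pieces:
                if ps < ms:                        # part before the music span
                    next_pieces.append((ps, min(pe, ms)))
                if pe > me:                        # part after the music span
                    next_pieces.append((max(ps, me), pe))
            pieces = [(a, b) for a, b in next_pieces if b > a]
        result.extend(pieces)
    return result

speech = merge_adjacent([(0.0, 8.0), (6.0, 14.0), (20.0, 20.5)])
music  = merge_adjacent([(10.0, 12.0), (30.0, 30.2)])
speech = drop_short(speech, 0.8)                   # second threshold
music  = drop_short(music, 0.3)                    # third threshold
pure_speech = subtract_music(speech, music)
print(pure_speech)                                 # [(0.0, 10.0), (12.0, 14.0)]
```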
The plurality of segments with voice activity are spliced together to obtain speech audio of length M seconds.
And continuously dividing the voice audio with the length of M seconds, extracting a plurality of voice fragments with the first fixed duration, and performing zero padding operation if fragments which are less than the first fixed duration but greater than a fourth threshold exist in the dividing process. Wherein M is any positive number.
In one implementation, the plurality of segments with only voice activity are spliced together to obtain speech audio 67 seconds long. The 67-second speech audio is continuously divided, and 7 speech segments of 10 seconds each are extracted; the last speech segment is shorter than 10 seconds, so zeros can be padded at the beginning of the segment to reach a length of 10 seconds, which can be achieved by inserting zero-valued sample points in front of the speech segment. Alternatively, zeros can be padded at the end of the speech segment to reach a length of 10 seconds, by appending zero-valued sample points at the end of the segment. In another implementation, the features of each of the plurality of frames of the second fixed duration are extracted, the features of each segment of the first fixed duration are input into a convolutional neural network, and it is determined whether each segment of the first fixed duration is a speech segment or a music segment; the speech segments among the segments of the first fixed duration are then spliced and divided to obtain a plurality of speech segments of the first fixed duration.
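Below is a small numpy illustration of the splicing and zero-padding step, using the same 67-second / 10-second example figures given above.

```python
import numpy as np

sr = 16000
speech_audio = np.random.randn(sr * 67)      # stand-in for 67 s of spliced speech
segment_len = 10 * sr                        # first fixed duration: 10 s

segments = []
for start in range(0, len(speech_audio), segment_len):
    seg = speech_audio[start:start + segment_len]
    if len(seg) < segment_len:
        # pad the tail (or, equivalently, the head) with zero-valued samples
        seg = np.pad(seg, (0, segment_len - len(seg)))
    segments.append(seg)

print(len(segments), segments[-1].shape)     # 7 segments, each 10 s long
```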
S104, performing quality evaluation on the voice fragments to obtain an evaluation result of each voice fragment in the voice fragments.
The plurality of speech segments of the first fixed duration are input into a trained speech evaluation model. The speech evaluation model is mainly divided into four layers: a feature extraction layer, a convolution layer, a self-attention network layer and an attention pooling layer. Referring to fig. 7, fig. 7 is a schematic structural diagram of the speech evaluation model provided by the application. Each of the plurality of speech segments is input into the feature extraction layer, which extracts the features of each speech segment; the features output by the feature extraction layer are input into the convolution layer, which performs dimension-reduction processing on the features of each speech segment to obtain the dimension-reduced features of each speech segment; the features output by the convolution layer are input into the self-attention network layer, which weights the dimension-reduced features of each speech segment based on a self-attention mechanism to obtain the weighted features of each speech segment; the weighted features output by the self-attention network layer are input into the attention pooling layer, which evaluates each speech segment according to its weighted features to obtain the evaluation result of each speech segment. The evaluation result of each of the plurality of speech segments can thus be obtained through the speech evaluation model. For ease of understanding, the implementation is described below. Referring to fig. 8, fig. 8 is a schematic flow chart of obtaining the evaluation result of each of the plurality of speech segments provided by the application.
S1041, calculating a Mel spectrum of each of the first fixed-duration voice fragments.
The method for calculating the mel-spectrum features of each of the plurality of speech segments of the first fixed duration is similar to S1023 and, for brevity, is not repeated here.
S1042, inputting the Mel spectrum of each voice segment with the first fixed duration into the convolution layer.
The mel spectrum of each speech segment of the first fixed duration is input into the convolution layer to further extract the local features of the speech segment. The convolution layer may be applied over multiple iterations to increase the depth of the model; for example, convolution layers may be stacked and applied 6 times repeatedly to increase the depth and complexity of the model, with the output of each convolution layer serving as the input of the next. The output of the convolution layers is a series of local feature maps that capture the frequency-domain information of the audio segment. These features then undergo a flattening operation that turns the features of each speech segment into a one-dimensional vector for further processing.
S1043, inputting the feature output by the convolution layer into the self-attention network layer.
The self-attention network layer is used for carrying out weighting processing on the dimension reduction characteristics of each voice segment based on a self-attention mechanism to obtain the weighting characteristics of each voice segment. The self-attention network takes the one-dimensional feature vector of the convolution layer as input, and weights different parts in the input data through a self-attention mechanism, so that the interaction information between the global voice fragments is focused, and the feature expression capability is improved.
S1044, inputting the output characteristics of the self-attention network layer into the attention pooling layer to obtain the evaluation result of each voice segment in the plurality of voice segments.
The attention pooling layer is used for evaluating each voice segment according to the weighted characteristics of each voice segment to obtain the evaluation result of each voice segment. The attention pooling layer assigns a different weight to each element in the sequence in order to better capture the important elements. And inputting the output characteristics of the self-attention network layer into an attention pooling layer, and obtaining the evaluation results of the N voice fragments through weighted aggregation. The attention pooling layer can weight the characteristics according to the importance of each voice segment, focus on the characteristics related to tone quality evaluation, and improve the accuracy and the robustness of the model. And finally, outputting the evaluation result of each voice segment through the attention pooling layer operation.
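The application does not give concrete layer sizes, so the PyTorch sketch below only mirrors the four-stage structure described in S1041-S1044 (features, convolution, self-attention, attention pooling); all dimensions, the stride-2 convolutions used for dimension reduction, and the use of nn.MultiheadAttention are assumptions.

```python
import torch
import torch.nn as nn

class SpeechMOSModel(nn.Module):
    """Illustrative four-stage model: mel features are assumed to be computed
    outside the network and fed in as (batch, 1, n_mels, n_frames)."""
    def __init__(self, n_mels: int = 64, d_model: int = 128):
        super().__init__()
        # convolution layers: local features + dimension reduction
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        # self-attention network layer: weight the reduced features
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # attention pooling layer: weighted aggregation into one score
        self.pool_w = nn.Linear(d_model, 1)
        self.score = nn.Linear(d_model, 1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = self.conv(mel)                               # (B, 64, n_mels/4, T/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # one vector per time step
        x = self.proj(x)                                 # (B, T/4, d_model)
        x, _ = self.attn(x, x, x)                        # self-attention weighting
        w = torch.softmax(self.pool_w(x), dim=1)         # attention-pooling weights
        pooled = (w * x).sum(dim=1)                      # (B, d_model)
        return self.score(pooled).squeeze(-1)            # one MOS-like score per clip

mos = SpeechMOSModel()(torch.randn(2, 1, 64, 1001))      # two 10 s mel inputs
print(mos.shape)                                         # torch.Size([2])
```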
S105, based on the evaluation result of each voice segment, an evaluation result of the audio to be evaluated is obtained.
The speech segments of the first fixed duration are input into the trained speech evaluation model to obtain the evaluation result of each speech segment of the first fixed duration, and the evaluation results are weighted and summed according to segment duration to obtain the evaluation result of the audio to be evaluated. The calculation formula is as follows:

$$\mathrm{MOS}_{overall}=\frac{\sum_{i=1}^{N}\mathrm{MOS}_{i}\cdot T_{i}}{\sum_{i=1}^{N}T_{i}}$$

wherein MOS_overall represents the overall evaluation result of the whole audio, MOS_i represents the evaluation result of the i-th speech segment of the first fixed duration, T_i represents the duration of the i-th speech segment, i indexes the speech segments of the whole audio, and N (any positive integer) is the number of speech segments.
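The duration-weighted combination in the formula above reduces to a few lines of Python; the per-segment scores and durations below are made-up numbers for illustration.

```python
# Made-up per-segment results: (MOS_i, T_i in seconds)
segment_results = [(4.2, 10.0), (3.8, 10.0), (4.5, 7.0)]

mos_overall = (sum(m * t for m, t in segment_results)
               / sum(t for _, t in segment_results))
print(round(mos_overall, 3))   # duration-weighted overall score
```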
In one implementation manner, a threshold may be preset, the obtained evaluation of the audio to be evaluated is compared with the threshold, and if the score is lower than the threshold, the audio enhancement processing may be performed at the client, that is, after the audio is decoded, the audio quality is enhanced through some processing and then played to the user.
It can be seen that the method for evaluating audio quality provided by the application is applied to live broadcast and on-demand scenes. In the case that the audio frequently contains two different types of sounds, namely voice and music in live broadcast and on demand scenes, the audio quality evaluation provided by the application firstly separates the voice and the music from the audio, and a plurality of voice fragments are extracted. And then, inputting the extracted voice fragments into a model and evaluating each voice fragment, wherein a convolutional neural network is added into the model, a loss function based on the information of the number of votes is introduced into the convolutional neural network, and the influence of noise existing in a data set is reduced by utilizing the information of the number of votes, so that the accuracy of the model is effectively improved. And finally, carrying out weighted summation on the obtained evaluation results of the plurality of voice fragments to output the overall evaluation result of the final whole audio. Therefore, by implementing the embodiment, the limitation of the prior art can be overcome, and the accuracy of audio quality evaluation in live broadcast and on-demand scenes can be improved.
In the above method embodiment, the speech quality evaluation model is used for speech quality evaluation, and how the speech quality evaluation model is obtained is described below.
Referring to fig. 9, fig. 9 is a schematic flow chart of a method for training a speech evaluation model according to the present application, where the method includes, but is not limited to, the following description.
S201, a plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments are obtained.
A plurality of reference voice segments and labels corresponding to the plurality of reference voice segments are acquired. In one implementation, the reference voice segments used to train the speech evaluation model may be obtained from an open-source data set, where the label corresponding to each of the plurality of reference voice segments includes the mean of the evaluation results given by a plurality of users for that reference voice segment.
For example, the plurality of reference voice segments include a reference voice segment 1, a reference voice segment 2, and a reference voice segment 3, n users evaluate the reference voice segment 1, the reference voice segment 2, and the reference voice segment 3, respectively, calculate an average evaluation score of the reference voice segment 1, that is, obtain a label of the reference voice segment 1, calculate an average evaluation score of the reference voice segment 2, that is, obtain a label of the reference voice segment 2, and similarly, the label of the reference voice segment 3 is obtained according to this method, which is not described herein again.
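For illustration, the labels and vote counts for such reference segments could be derived as follows; the ratings shown are hypothetical.

```python
# Hypothetical raw ratings from n listeners for three reference segments
ratings = {
    "segment_1": [4, 5, 4, 3, 4],
    "segment_2": [2, 3, 3, 2, 3],
    "segment_3": [5, 4, 5, 5, 4],
}
labels = {seg: sum(r) / len(r) for seg, r in ratings.items()}   # mean score = label
votes = {seg: len(r) for seg, r in ratings.items()}             # later used by the loss
print(labels)  # {'segment_1': 4.0, 'segment_2': 2.6, 'segment_3': 4.6}
```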
S202, training is carried out based on a plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments, and a voice evaluation model is obtained.
The voice evaluation model comprises a feature extraction layer, a convolution layer, a self-attention network layer and an attention pooling layer, wherein the feature extraction layer is used for extracting the features of each reference voice segment in a plurality of reference voice segments; the convolution layer is used for carrying out dimension reduction processing on the characteristics of each reference voice segment to obtain dimension reduction characteristics of each reference voice segment; the self-attention network layer is used for carrying out weighting processing on the dimension reduction characteristics of each reference voice segment based on a self-attention mechanism to obtain the weighting characteristics of each reference voice segment; the attention pooling layer is used for evaluating each reference voice segment according to the weighted characteristic of each reference voice segment to obtain the evaluation result of each reference voice segment.
Different users may well give different scores to the same audio, and as more users score it the average value tends to become stable, so the score of each audio can be considered to be noisy. In order to reduce the influence of this noise on the learning of the speech evaluation model, a loss function based on the number of votes is provided to improve the accuracy of the speech evaluation model: the number of votes corresponding to each label and an adjustable parameter are added to the loss function of the voice evaluation model. The loss function is as follows:

$$\mathrm{Loss} = \mathrm{RMSE}\big((y - \hat{y}) \cdot \mathrm{votes}^{\alpha}\big)$$

where Loss is the loss function, RMSE represents the root mean square error, y represents the label corresponding to each reference speech segment, $\hat{y}$ represents the evaluation result output by each iteration of training of the voice evaluation model, votes represents the number of users evaluating the reference voice segment, and α represents an adjustable parameter with a value range of (0, 1/2].
In particular, when α is 1/2, the subjective opinion of each person is considered to be equally reflected in Loss. The reason is as follows. The root mean square error (RMSE) is an indicator of the difference between predicted and actual values and is commonly used to evaluate the performance of regression models. Its calculation steps are: (1) calculate the prediction error of each sample, i.e. the predicted value of the model minus the actual value; (2) square the prediction error of each sample; (3) calculate the mean square error, i.e. average the squared errors over all samples; (4) calculate the root mean square error RMSE, i.e. take the square root of the mean square error. Because of the squaring operation inside the root mean square error, when α is 1/2 the square of votes^(1/2) is exactly votes, so the error of a segment rated by k users is weighted k times, which is equivalent to reflecting the subjective opinion of each person equally in the Loss.
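A minimal sketch of this vote-weighted loss follows, assuming per-segment tensors of predictions, labels, and vote counts; the function name is an illustrative assumption.

```python
import torch

def vote_weighted_rmse(pred: torch.Tensor,
                       label: torch.Tensor,
                       votes: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """RMSE of the prediction error scaled by votes**alpha.
    With alpha = 0.5 the squaring inside the RMSE turns votes**0.5 into votes,
    so every individual rating carries equal weight in the loss."""
    weighted_error = (pred - label) * votes.float().pow(alpha)
    return torch.sqrt(torch.mean(weighted_error ** 2))

# Example: two reference segments, one rated by 100 users and one by 4
pred = torch.tensor([4.1, 3.0])
label = torch.tensor([4.0, 3.5])
votes = torch.tensor([100, 4])
print(vote_weighted_rmse(pred, label, votes).item())
```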
And calculating a predicted value of the voice evaluation model through forward propagation, and comparing the predicted value with the label to obtain a value of the loss function. Then, the gradient of the loss function to the parameters of the speech evaluation model is calculated through a back propagation algorithm, and the parameters of the speech evaluation model are updated through an optimization algorithm (such as a gradient descent method) according to gradient information, so that the value of the loss function is gradually reduced. Through iteration of the process, parameters of the speech evaluation model are continuously adjusted, so that the speech evaluation model can be better fitted with training data. During the training process, performance indexes of the speech evaluation model on the training set and the verification set, such as accuracy, loss value and the like, can be monitored. According to the monitoring result, the voice evaluation model can be optimized, such as adjusting the learning rate, increasing regularization items and the like, so as to improve the performance and generalization capability of the voice evaluation model. When the training stopping condition is met, such as the maximum iteration number is reached or the loss function converges, the speech evaluation model training is terminated.
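The following sketch of the optimisation loop reuses the vote_weighted_rmse function from the previous sketch; the optimizer choice (Adam), learning rate, and stopping criterion are illustrative assumptions, not prescribed by the application.

```python
import torch

def train(model, loader, epochs: int = 50, lr: float = 1e-4, tol: float = 1e-4):
    """Sketch of the loop described above: forward pass, vote-weighted loss,
    backpropagation, parameter update, and termination when the loss converges."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for epoch in range(epochs):
        epoch_loss = 0.0
        for features, label, votes in loader:       # one batch of reference segments
            pred = model(features)                   # forward propagation
            loss = vote_weighted_rmse(pred, label, votes)
            optimizer.zero_grad()
            loss.backward()                          # gradients of loss w.r.t. parameters
            optimizer.step()                         # gradient-descent-style update
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:        # loss has converged
            break
        prev_loss = epoch_loss
    return model
```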
It can be seen that the application provides a training method of a speech evaluation model, wherein the number of votes and adjustable parameters corresponding to each label are introduced into a loss function of the speech evaluation model. Thus, the model adjusts the learning degree of different audios according to the weight of voters in the training process. When the score of an audio comes from more voters, the weight will be higher and the model will pay more attention to the learning of the audio. Conversely, when an audio score comes from a smaller number of votes, its weight will be lower and the model will learn less about the audio. The weight of the voter is introduced, so that the learning degree of different audios can be balanced better by the loss function, and the influence of individual difference and noise on the model is reduced. Therefore, the model can learn the quality characteristics of the audio more accurately, and the accuracy and the robustness of the voice evaluation model are improved.
As shown in FIG. 10, an embodiment of the present application provides a schematic diagram of yet another system architecture 400. Referring to fig. 10, the data acquisition device 460 is used to acquire data, where the data acquisition device 460 may include a camera and microphone, etc., for example, for live scenes, the data acquisition device 460 may be a professional live camera and microphone for acquiring audio and video content of a host.
After the data is collected, the data processing device 470 processes the collected data to obtain training data, such that the final training data is a fixed-length speech segment that contains only speech activity. The training data in the embodiment of the application comprises a voice segment with fixed length and a label, wherein the label comprises a scoring result obtained by listening to a piece of audio by a plurality of listeners and scoring and averaging the tone quality of the audio.
After the training data is processed, the data processing device 470 stores the training data in the database 430, and the training device 420 trains the recognition model 413 based on the training data maintained in the database 430.
The training device 420 is described below. The recognition model 413 is obtained based on training data. The input data of the training device 420 is a voice segment of fixed length together with its label, and the training device 420 processes the input fixed-length voice segment and compares the output scoring result with the subjective score in the label, which is obtained by a plurality of listeners listening to a piece of audio, scoring its sound quality and averaging the scores. When the difference between the scoring result output by the training device 420 and the scoring result in the label is smaller than a preset threshold, the output scoring result is considered able to replace the scoring result in the label, and the training of the recognition model 413 is completed.
The above recognition model 413 can be used to implement the method for evaluating audio quality provided by the embodiment of the present application, and the audio data to be processed is input into the recognition model 413, so as to obtain the scoring result of each audio data. In practical applications, the training data maintained in the database 430 is not necessarily all from the data processing device 470, but may be obtained from other devices. It should be noted that, the training device 420 is not necessarily completely based on the training data maintained by the database 430 to perform training of the identification model 413, and it is also possible to obtain the training data from other devices to perform model training, which should not be taken as limiting the embodiment of the present application. The training device 420 may exist independently of the execution device 410 or may be integrated within the execution device 410.
The recognition model 413 obtained by training with the training device 420 may be applied to different systems or devices. For example, the execution device 410 shown in fig. 10 may be a cloud server, a virtual machine, or the like. In fig. 10, the execution device 410 is configured with an input/output (I/O) interface 412 for data interaction with external devices; a user may input data to the I/O interface 412 through the user device 440, and the I/O interface 412 may also output a scoring result to the user device 440. In embodiments of the present application the input data may include speech segments, and the user device 440 may include various terminal devices for interacting with the execution device 410. For example, the user device 440 may be a smart phone through which the user records and sends audio clips to the execution device 410 for quality assessment. The user device 440 may also be a personal computer, where the user interacts with the execution device 410 through a browser interface for sound quality assessment, or a smart speaker, where the user triggers sound quality evaluation through a voice command.
In an embodiment of the present application, the computing module 411 is configured to process the input/output data, for example, perform weighted summation on scoring results of the plurality of speech segments output by the recognition model 413, so as to obtain an overall scoring result of the audio.
In the processing of the input data by the execution device 410, or in the processing related to the execution of the computation by the computation module 411 of the execution device 410, the execution device 410 may call the data, the code, etc. in the data storage system 450 for the corresponding processing, or may store the data, the instruction, etc. obtained by the corresponding processing in the data storage system 450.
It should be noted that the training device 420 may generate, based on different training data, a corresponding recognition model 413 for different targets or tasks, where the corresponding recognition model 413 may be used to achieve the targets or to perform the tasks, thereby providing the user with the desired results.
The recognition model described in the embodiments of the present application is configured based on a convolutional neural network (convolutional neural networks, CNN), which is described below. In the application, the convolutional neural network model can be used for determining whether each fixed-duration segment is a voice segment or a music segment from the audio and can also be used for scoring each fixed-duration voice segment.
The convolutional neural network is a deep neural network with a convolutional structure, and can be a deep learning (DEEP LEARNING) architecture, wherein the deep learning architecture refers to learning of multiple layers at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in an image input thereto.
The convolutional neural network comprises a feature extractor consisting of a convolutional layer and a sub-sampling layer. The feature extractor can be seen as a filter and the convolution process can be seen as a convolution with an input image or convolution feature plane (feature map) using a trainable filter. The convolution layer refers to a neuron layer in the convolution neural network, which performs convolution processing on an input signal. In the convolutional layer of the convolutional neural network, one neuron may be connected with only a part of adjacent layer neurons. A convolutional layer typically contains a number of feature planes, each of which may be composed of a number of neural elements arranged in a rectangular pattern. Neural elements of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights can be understood as the way image information is extracted is independent of location. The underlying principle in this is: the statistics of a certain part of the image are the same as other parts. I.e. meaning that the image information learned in one part can also be used in another part. So we can use the same learned image information for all locations on the image. In the same convolution layer, a plurality of convolution kernels may be used to extract different image information, and in general, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation. The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network. In addition, the direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while reducing the risk of overfitting.
The convolutional neural network can adopt a Back Propagation (BP) algorithm to correct the parameter in the initial super-resolution model in the training process, so that the reconstruction error loss of the super-resolution model is smaller and smaller. Specifically, the input signal is transmitted forward until the output is generated with error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion that dominates the error loss, and aims to obtain parameters of the optimal super-resolution model, such as a weight matrix.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a Convolutional Neural Network (CNN) 500 according to an embodiment of the present application. As shown in fig. 11, convolutional Neural Network (CNN) 500 may include an input layer 510, a convolutional layer/pooling layer 520, and a neural network layer 530.
Input layer 510 may process multidimensional data; for example, the input layer may acquire and process sample data. Typically, the input layer of a one-dimensional convolutional neural network receives a one- or two-dimensional array, where the one-dimensional array is typically a sample in the time or frequency domain and the two-dimensional array may include a plurality of channels; the input layer of a two-dimensional convolutional neural network receives a two- or three-dimensional array; and the input layer of a three-dimensional convolutional neural network receives a four-dimensional array.
Since gradient descent is used for learning, the input features of the convolutional neural network can be normalized. Specifically, before the learning data is input into the convolutional neural network, the input data needs to be normalized in a channel or time/frequency dimension. The standardization of the input features is beneficial to improving the operation efficiency and learning performance of the algorithm.
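A minimal sketch of such per-channel standardization follows, assuming a (channels, time, frequency) feature array; the shape and dimensions are illustrative assumptions.

```python
import numpy as np

def normalize_features(x: np.ndarray) -> np.ndarray:
    """Standardise a (channels, time, frequency) feature array per channel
    to zero mean and unit variance before it enters the network."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True) + 1e-8   # avoid division by zero
    return (x - mean) / std

spec = np.random.rand(2, 400, 128)   # e.g. 2 channels, 400 frames, 128 frequency bins
print(normalize_features(spec).mean(axis=(1, 2)))    # approximately 0 per channel
```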
The convolutional/pooling layers 520 may include, as in examples 521-526, in one implementation, 521 being the convolutional layer, 522 being the pooling layer, 523 being the convolutional layer, 524 being the pooling layer, 525 being the convolutional layer, 526 being the pooling layer; in another implementation, 521, 522 are convolutional layers, 523 is a pooling layer, 524, 525 are convolutional layers, and 526 is a pooling layer. I.e. the output of the convolution layer may be used as input to a subsequent pooling layer or as input to another convolution layer to continue the convolution operation.
Taking the convolution layer 521 as an example, the convolution layer 521 may include a plurality of convolution operators, also called convolution kernels, whose function in image processing is that of a filter extracting specific information from an input image matrix. A convolution operator may essentially be a weight matrix, which is generally predefined. During convolution on an image, the weight matrix is generally moved over the input image in the horizontal direction one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output of a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same dimensions are applied, and the outputs of these weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise points in the image. The weight matrices have identical dimensions, the feature maps extracted by these equally-sized weight matrices also have identical dimensions, and the extracted feature maps of identical dimensions are combined to form the output of the convolution operation.
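For illustration, the following sketch shows a convolution layer with several kernels applied to a single-channel input, using PyTorch as an assumed framework; the kernel count, stride, and input shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

# 16 trainable 3x3 kernels slide over a single-channel input; each kernel
# produces one feature map, and the 16 maps are stacked along the depth
# dimension of the output, as described above.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=2, padding=1)
spectrogram = torch.randn(8, 1, 400, 128)   # (batch, channels, frames, frequency bins)
features = conv(spectrogram)
print(features.shape)   # torch.Size([8, 16, 200, 64])
```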
The weight values in the weight matrices are required to be obtained through a large amount of training in practical application, and each weight matrix formed by the weight values obtained through training can extract information from the input image, so that the convolutional neural network 500 is helped to perform correct prediction. The application trains the voice fragments with fixed length and the labels, wherein the labels comprise scoring results obtained by listening to a section of audio by a plurality of listeners and scoring and averaging the tone quality of the audio, so that the convolutional neural network model outputs the scoring results of each voice fragment.
It should be noted that the above 521-526 layers are merely examples, and that in practice more convolution layers and/or more pooling layers may be provided. When convolutional neural network 500 has multiple convolutional layers, the initial convolutional layer (e.g., 521) tends to extract more general features, which may also be referred to as low-level features; as the depth of convolutional neural network 500 increases, features extracted by the later convolutional layers (e.g., 526) become more complex, such as features of high level semantics. The embodiment of the application utilizes the characteristics of different scales to assist in solving the related technical problems.
Since it is often desirable to reduce the number of training parameters, the convolutional layers often require periodic introduction of pooling layers, i.e., layers 521-526 as illustrated at 520 in FIG. 11, which may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. During image processing, the pooling layer may be used to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image. The averaging pooling operator may calculate pixel values in the image over a particular range to produce an average value. The max pooling operator may take the pixel with the largest value in a particular range as the result of max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
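A short illustration of average and max pooling reducing the spatial size of a feature map follows; the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 200, 64)          # e.g. the output of the convolution sketch above
avg_pool = nn.AvgPool2d(kernel_size=2)   # averages each 2x2 region
max_pool = nn.MaxPool2d(kernel_size=2)   # keeps the largest value in each 2x2 region
print(avg_pool(x).shape, max_pool(x).shape)   # both torch.Size([8, 16, 100, 32])
```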
Generally speaking, the convolution kernels in a convolution layer contain weight coefficients (the weight matrices), while a pooling layer does not contain weight coefficients, so in some scenarios the pooling layer may not be considered a separate layer.
After processing by the convolutional/pooling layer 520, the convolutional neural network 500 is not yet sufficient to output the desired output information. Because, as previously described, the convolution/pooling layer 520 will only extract features and reduce the parameters imposed by the input image. However, to generate the final output information, convolutional neural network 500 needs to utilize neural network layer 530 to generate the output of one or a set of the required number of classes. Thus, multiple hidden layers (531, 532, and 53n as shown in fig. 11) and an output layer 540 may be included in the neural network layer 530, where parameters included in the multiple hidden layers may be pre-trained according to relevant training data of a specific task type.
Hidden layers in convolutional neural networks include, for example, fully-connected (FC) layers, which typically only pass signals to other fully-connected layers. In the fully connected layer the feature map loses its 3-dimensional structure: it is expanded into a vector and passed through activation functions to the next layer. In some possible convolutional neural networks, the function of the fully connected layer may be partially replaced by global average pooling, which averages all the values of each channel of the feature map.
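A brief sketch contrasting the flattening used before a fully connected layer with global average pooling follows; the shapes are illustrative assumptions.

```python
import torch

features = torch.randn(8, 16, 100, 32)        # (batch, channels, height, width)

# Fully connected route: flatten the 3-D feature map into a vector first
flat = features.flatten(start_dim=1)          # (8, 16*100*32)

# Global-average-pooling route: one value per channel, no extra parameters
gap = features.mean(dim=(2, 3))               # (8, 16)
print(flat.shape, gap.shape)
```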
After the hidden layers in the neural network layer 530, the final layer of the overall convolutional neural network 500 is the output layer 540. The output layer 540 has a loss function similar to categorical cross-entropy, specifically used for calculating the prediction error. Once the forward propagation of the overall convolutional neural network 500 (e.g., propagation from 510 to 540 in fig. 11) is completed, backward propagation (e.g., propagation from 540 to 510 in fig. 11) begins to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 500 and the error between the result output by the convolutional neural network 500 through the output layer and the desired result.
The output layer 540 may output the class labels using a logic function or a normalized exponential function (softmax function). For example, in the present application, features of each voice segment are identified to obtain scoring results for each voice segment, so the output layer may be designed to output the scoring results for each voice segment.
It should be noted that, as the convolutional neural network 500 shown in fig. 11 is merely an example of a convolutional neural network, in a specific application, the convolutional neural network may also exist in the form of other network models, for example, multiple convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the neural network layer 530 for processing.
The present application provides an apparatus for audio quality evaluation, referring to fig. 12, fig. 12 is a schematic structural diagram of an apparatus 600 for audio quality evaluation provided by the present application, where the apparatus 600 includes:
an obtaining module 610, configured to obtain audio to be evaluated;
the determining module 620 is configured to perform classification processing on the audio to be evaluated, and determine a speech segment and a music segment in the audio to be evaluated;
The extracting module 630 is configured to extract a plurality of speech segments from the audio to be evaluated according to the position information of the speech segments and the position information of the music segments;
the quality evaluation module 640 is configured to perform quality evaluation on the plurality of voice segments, and obtain an evaluation result of each voice segment in the plurality of voice segments;
The quality evaluation module 640 is further configured to obtain an evaluation result of the audio to be evaluated based on the evaluation result of each speech segment.
In a possible implementation, the determining module 620 is configured to:
dividing the audio to be evaluated into a plurality of fragments;
extracting characteristics of each of the plurality of segments;
Based on the characteristics of each segment, it is determined whether each segment of the plurality of segments is a speech segment or a music segment.
In a possible implementation, the determining module 620 is further configured to:
Inputting the characteristics of each segment into a convolutional neural network to obtain the probability that each segment is a voice segment and the probability that each segment is a music segment;
Whether each of the segments is a speech segment or a music segment is determined based on the probability that each of the segments is a speech segment and the probability that each of the segments is a music segment.
In a possible implementation manner, the audio to be evaluated includes time information, the position information of the voice segment refers to time position information of the voice segment in the audio to be evaluated, and the position information of the music segment refers to time position information of the music segment in the audio to be evaluated.
In a possible implementation manner, if adjacent segments in the plurality of segments overlap in time position, then the voice segment in the plurality of segments overlaps in time position with the music segment; the extraction module 630 is used for deleting speech segments overlapping with the music segments in time position.
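As an illustration of this step, a speech segment may be dropped when its time span intersects any music segment; the (start, end) representation below is an assumption made for the sketch.

```python
def drop_overlapping_speech(speech_segments, music_segments):
    """Remove speech segments whose time span overlaps any music segment.
    Segments are (start, end) pairs in seconds (illustrative representation)."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    return [s for s in speech_segments
            if not any(overlaps(s, m) for m in music_segments)]

speech = [(0.0, 5.0), (5.0, 10.0), (12.0, 18.0)]
music = [(9.0, 12.0)]
print(drop_overlapping_speech(speech, music))   # [(0.0, 5.0), (12.0, 18.0)]
```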
In a possible implementation manner, the quality evaluation module 640 is configured to input a plurality of voice segments into a voice evaluation model, and obtain an evaluation result of each voice segment in the plurality of voice segments; the speech evaluation model comprises a feature extraction layer, a convolution layer, a self-attention network layer and an attention pooling layer, wherein,
The feature extraction layer is used for extracting the features of each voice segment in the plurality of voice segments;
the convolution layer is used for carrying out dimension reduction processing on the characteristics of each voice segment to obtain dimension reduction characteristics of each voice segment;
The self-attention network layer is used for carrying out weighting processing on the dimension reduction characteristics of each voice segment based on a self-attention mechanism to obtain the weighting characteristics of each voice segment;
The attention pooling layer is used for evaluating each voice segment according to the weighted characteristics of each voice segment to obtain the evaluation result of each voice segment.
For the specific steps performed by the above modules, reference may be made to the descriptions of the relevant content in the method embodiments of fig. 4 to fig. 8, which are not repeated here for brevity of description.
The application provides a training device for a speech evaluation model, referring to fig. 13, fig. 13 is a schematic structural diagram of a training device 700 for a speech evaluation model provided by the application, where the device 700 includes:
The obtaining module 710 is configured to obtain a plurality of reference voice segments and labels corresponding to the plurality of reference voice segments, where the labels corresponding to each reference voice segment in the plurality of reference voice segments include an average value of evaluation results of the plurality of users on the reference voice segments;
the training module 720 is configured to perform training based on the plurality of reference speech segments and labels corresponding to the plurality of reference speech segments, and obtain a speech evaluation model, where a loss function in the speech evaluation model includes the number of users evaluating the reference speech segments.
In a possible implementation manner, the speech evaluation model is obtained through multiple rounds of training, and in each round of training, the loss function is used for solving root mean square error of a product between a prediction error of the current reference speech segment and alpha power of the number of users evaluating the current reference speech segment, wherein the prediction error of the current reference speech segment is a difference value between a label corresponding to the current reference speech segment and an evaluation result output by the speech evaluation model to the current reference speech segment, and alpha is an adjustable parameter.
For the specific steps performed by the above modules, reference may be made to the description of the related content in the method embodiment of fig. 9, which is not repeated here for brevity.
The present application also provides a computing device. Referring to fig. 14, fig. 14 is a schematic structural diagram of a computing device 800 provided by the present application. When configured as the apparatus 600, the computing device is configured to implement the method embodiments described in fig. 4 to fig. 8; when configured as the apparatus 700, the computing device is configured to implement the method embodiment described in fig. 9. The computing device 800 includes: a processor 810, a communication interface 820, and a memory 830. The processor 810, the communication interface 820, and the memory 830 may be connected to each other through an internal bus 840, or may communicate through other means such as wireless transmission.
By way of example, the bus 840 may be a peripheral component interconnect (PERIPHERAL COMPONENT INTERCONNECT, PCI) bus or an industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus 840 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 14, but not only one bus or one type of bus.
The processor 810 may consist of at least one general purpose processor, such as a CPU, or a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The processor 810 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 830, which enable the computing device 800 to provide a wide variety of services.
The memory 830 is configured to store program codes and is controlled by the processor 810 to perform the steps described in the embodiments of fig. 4 to 8, and specific reference may be made to the description of the embodiments described above, which is not repeated herein.
Memory 830 may include volatile memory such as RAM; memory 830 may also include non-volatile memory such as ROM, flash memory (flash memory); memory 830 may also include combinations of the above.
Communication interface 820 may be an internal interface (e.g., a high-speed serial computer expansion bus (PCIe) bus interface), a wired interface (e.g., an Ethernet interface), or a wireless interface (e.g., a cellular network interface or a wireless local area network interface), for communicating with other devices or modules.
The processor 810, the communication interface 820, etc. in the computing device 800 may implement the functions and/or the steps and methods implemented in the above-described method embodiments, and are not described herein for brevity. When the computing device is configured in the apparatus 600, the acquisition module 610, the determination module 620, the extraction module 630, and the quality evaluation module 640 in the apparatus 600 may be located in the processor 810 in the computing device 800. When the computing device is configured in the apparatus 700, the acquisition module 710, the training module 720 in the apparatus 700 may be located in the processor 810 in the computing device 800.
It should be noted that fig. 14 is only one possible implementation of the embodiment of the present application, and the apparatus may further include more or fewer components in practical applications, which is not limited herein. For details not shown or described in the embodiments of the present application, reference may be made to the related descriptions in the embodiments of the foregoing method, which are not described herein.
The present application also provides a readable storage medium comprising program instructions which, when executed by a device, perform some or all of the steps described in the above embodiments of the audio quality assessment method.
The present application also provides a computer program product comprising program instructions which, when executed by a device, cause the device to perform some or all of the steps described in the above embodiments of the audio quality assessment method.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
In the above embodiments, it may be implemented in whole or in part by software, hardware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product may contain code. When the computer program product is read and executed by a computer, some or all of the steps of the method described in the above method embodiments may be implemented. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium, or a semiconductor medium, etc.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined or deleted according to actual needs; the units in the device of the embodiment of the application can be divided, combined or deleted according to actual needs.
The foregoing has outlined rather broadly the more detailed description of embodiments of the application, wherein the principles and embodiments of the application are explained in detail using specific examples, the above examples being provided solely to facilitate the understanding of the method and core concepts of the application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present application, the present description should not be construed as limiting the present application in view of the above.

Claims (13)

1. A method of audio quality assessment, the method comprising:
Acquiring audio to be evaluated;
Classifying the audio to be evaluated, and determining a voice segment and a music segment in the audio to be evaluated;
extracting a plurality of voice fragments from the audio to be evaluated according to the position information of the voice fragments and the position information of the music fragments;
Performing quality evaluation on the voice fragments to obtain an evaluation result of each voice fragment in the voice fragments;
Based on the evaluation result of each voice segment, obtaining an evaluation result of the audio to be evaluated;
the step of performing quality evaluation on the plurality of voice fragments to obtain an evaluation result of each voice fragment in the plurality of voice fragments comprises the following steps:
Inputting a plurality of voice fragments into a voice evaluation model to obtain an evaluation result of each voice fragment in the voice fragments; the voice evaluation model is obtained by training based on a plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments, the labels corresponding to each reference voice fragment in the plurality of reference voice fragments comprise the average value of evaluation results of a plurality of users on the reference voice fragments, and a loss function in the voice evaluation model comprises the number of users evaluating the reference voice fragments.
2. The method of claim 1, wherein classifying the audio to be evaluated to determine a speech segment and a music segment in the audio to be evaluated comprises:
dividing the audio to be evaluated into a plurality of fragments;
extracting features of each of the plurality of segments;
and determining whether each of the plurality of segments is the voice segment or the music segment according to the characteristics of each segment.
3. The method of claim 2, wherein determining whether each of the plurality of segments is the speech segment or the music segment based on the characteristics of each segment comprises:
Inputting the characteristics of each segment into a convolutional neural network to obtain the probability that each segment is the voice segment and the probability that each segment is the music segment;
And determining whether each segment is the voice segment or the music segment according to the probability that each segment is the voice segment and the probability that each segment is the music segment.
4. The method of claim 1, wherein the audio to be evaluated includes time information, wherein the location information of the speech segment refers to time location information of the speech segment in the audio to be evaluated, and wherein the location information of the music segment refers to time location information of the music segment in the audio to be evaluated.
5. The method of claim 2, wherein adjacent ones of the plurality of segments overlap in time position, the speech segment of the plurality of segments overlapping the music segment in time position;
before said quality evaluation of said plurality of speech segments, said method further comprises:
and deleting the voice fragments overlapped with the music fragments in time positions.
6. The method of any one of claims 1 to 5, wherein the speech evaluation model comprises a feature extraction layer, a convolution layer, a self-attention network layer, and an attention pooling layer, wherein,
The feature extraction layer is used for extracting the features of each voice segment in the voice segments;
The convolution layer is used for carrying out dimension reduction processing on the characteristics of each voice segment to obtain dimension reduction characteristics of each voice segment;
The self-attention network layer is used for carrying out weighting processing on the dimension reduction characteristics of each voice segment based on a self-attention mechanism to obtain the weighting characteristics of each voice segment;
The attention pooling layer is used for evaluating each voice segment according to the weighted characteristics of each voice segment to obtain the evaluation result of each voice segment.
7. The method according to any of claims 1 to 5, wherein the method is applied in a live or on-demand scene.
8. A method for training a speech evaluation model, comprising:
acquiring a plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments, wherein the labels corresponding to each reference voice fragment in the plurality of reference voice fragments comprise the average value of the evaluation results of a plurality of users on the reference voice fragments;
Training based on the plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments to obtain a voice evaluation model, wherein a loss function in the voice evaluation model comprises the number of users evaluating the reference voice fragments.
9. The method of claim 8, wherein the speech evaluation model is obtained through multiple rounds of training, and wherein in each round of training, the loss function is used to determine a root mean square error of a product between a prediction error of a current reference speech segment and an alpha power of a number of users evaluating the current reference speech segment, wherein the prediction error of the current reference speech segment is a difference between a label corresponding to the current reference speech segment and an evaluation result output by the speech evaluation model to the current reference speech segment, and alpha is an adjustable parameter.
10. An audio quality evaluation apparatus, the apparatus comprising:
the acquisition module is used for acquiring the audio to be evaluated;
The determining module is used for classifying the audio to be evaluated and determining a voice segment and a music segment in the audio to be evaluated;
the extraction module is used for extracting a plurality of voice fragments from the audio to be evaluated according to the position information of the voice fragments and the position information of the music fragments;
The quality evaluation module is used for performing quality evaluation on the voice fragments to obtain an evaluation result of each voice fragment in the voice fragments;
The quality evaluation module is further used for obtaining an evaluation result of the audio to be evaluated based on the evaluation result of each voice segment;
The quality evaluation module is further used for inputting a plurality of voice fragments into a voice evaluation model to obtain an evaluation result of each voice fragment in the voice fragments; the voice evaluation model is obtained by training based on a plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments, the labels corresponding to each reference voice fragment in the plurality of reference voice fragments comprise the average value of evaluation results of a plurality of users on the reference voice fragments, and a loss function in the voice evaluation model comprises the number of users evaluating the reference voice fragments.
11. A training device for a speech evaluation model, comprising:
The system comprises an acquisition module, a judgment module and a judgment module, wherein the acquisition module is used for acquiring a plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments, and the labels corresponding to each reference voice fragment in the plurality of reference voice fragments comprise the average value of evaluation results of a plurality of users on the reference voice fragments;
The training module is used for training based on the plurality of reference voice fragments and labels corresponding to the plurality of reference voice fragments to obtain a voice evaluation model, and the loss function in the voice evaluation model comprises the number of users evaluating the reference voice fragments.
12. A computing device comprising a memory for storing instructions and a processor for executing the instructions stored in the memory to implement the method of any one of claims 1 to 7 or to implement the method of any one of claims 8 to 9.
13. A computer storage medium comprising program instructions which, when executed by a computing device, cause the computing device to perform the method of any of claims 1 to 7 or cause the computing device to perform the method of any of claims 8 to 9.
CN202311765461.3A 2023-12-20 2023-12-20 Audio quality evaluation method and related device Active CN117711440B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311765461.3A CN117711440B (en) 2023-12-20 2023-12-20 Audio quality evaluation method and related device

Publications (2)

Publication Number Publication Date
CN117711440A CN117711440A (en) 2024-03-15
CN117711440B true CN117711440B (en) 2024-08-20


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716470A (en) * 2012-09-29 2014-04-09 华为技术有限公司 Method and device for speech quality monitoring
CN109087634A (en) * 2018-10-30 2018-12-25 四川长虹电器股份有限公司 A kind of sound quality setting method based on audio classification

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10991379B2 (en) * 2018-06-22 2021-04-27 Babblelabs Llc Data driven audio enhancement
CN114242044B (en) * 2022-02-25 2022-10-11 腾讯科技(深圳)有限公司 Voice quality evaluation method, voice quality evaluation model training method and device
CN115334349B (en) * 2022-07-15 2024-01-02 北京达佳互联信息技术有限公司 Audio processing method, device, electronic equipment and storage medium


GR01 Patent grant