1 Motivation and Related Work
In an era of exponentially increasing numbers of internet users and mobile devices, the use of Voice over Internet Protocol (VoIP) applications, for example, Google Meet, Microsoft Skype, and Apple FaceTime, is growing dramatically. To fulfill users' expectations of better quality of experience (QoE) when using such VoIP applications, it is necessary to measure and monitor speech quality in real time at different nodes of speech communication networks. Traditionally, the
Absolute Category Rating (ACR) [
1] method is used to measure speech quality: speech materials are played to subjects, who provide speech quality ratings. Although this method is highly reliable and provides an accurate assessment of speech quality, one must recruit a number of subjects who are willing to listen to the speech material and rate its quality. Moreover, it is time-consuming, so real-time speech quality monitoring with this method is impractical. Alternatively, objective speech quality assessment metrics, which measure speech quality with computational algorithms, can be used conveniently. Measuring and monitoring real-time speech quality at different nodes of speech communication networks with objective metrics is less costly, faster, and practical.
The
International Telecommunication Union (ITU) has standardized various objective speech quality assessment metrics. Instrumental metrics, as shown in Figure
1, are used to estimate the average user judgement of the quality of a service [
2]. Signal-based metrics employ the received (degraded) speech signal for speech quality estimation. They come in two types: full-reference metrics (also called “intrusive” or “double-ended” metrics) and no-reference metrics (also called “non-intrusive” or “single-ended” metrics).
Intrusive (reference-based) metrics usually calculate the distance between spectral representations of the transmitted reference signal and the received degraded signal. For example, Perceptual Evaluation of Speech Quality (PESQ) [
3], Perceptual Objective Listening Quality Assessment (POLQA) [
4], and ViSQOL [
5] and its improved version ViSQOL v3 [
6] are some popular intrusive metrics. Since one cannot access the original reference signal in most speech processing applications, intrusive metrics are not appropriate for real-time speech quality monitoring at different nodes of communication networks.
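To illustrate the distance-based idea behind intrusive metrics (a toy sketch only, not PESQ or POLQA, which add perceptual transforms and time alignment), the following computes a simple log-spectral distance between a reference and a degraded signal:

```python
import numpy as np
from scipy.signal import stft

def log_spectral_distance(reference, degraded, fs=8000, nperseg=256):
    """Toy full-reference measure: mean log-spectral distance (in dB)
    between the magnitude spectra of reference and degraded signals."""
    n = min(len(reference), len(degraded))  # real metrics time-align first
    _, _, R = stft(reference[:n], fs=fs, nperseg=nperseg)
    _, _, D = stft(degraded[:n], fs=fs, nperseg=nperseg)
    eps = 1e-10  # guard against log of zero
    log_diff = 20.0 * np.log10((np.abs(R) + eps) / (np.abs(D) + eps))
    # RMS distance across frequency, averaged over frames.
    return float(np.mean(np.sqrt(np.mean(log_diff ** 2, axis=0))))
```

A larger distance indicates a greater spectral deviation of the degraded signal from the reference and, hence, lower predicted quality.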
Non-intrusive (no-reference) speech quality metrics are preferred for real-time monitoring of speech quality and for scenarios where a reference speech signal is unavailable. These metrics use only the degraded speech signal to predict speech quality and can easily be deployed at the endpoints of VoIP channels or at any node of a communication network for monitoring the quality of speech. For the assessment of narrow-band speech signals, two no-reference speech quality metrics have been standardized [
2]: first, the ITU-recommended P.563 [
7] and, second, the American National Standards Institute standardized “Auditory non-intrusive quality estimation plus (ANIQUE+) [
8].” ANIQUE+ is a perceptual model that simulates the functional roles of the human auditory system (that is, it is based on human perception of speech and audio) and improves the modelling of quality estimation through a statistical learning paradigm [
8]. An implementation of ANIQUE+ is available only commercially, while P.563 is publicly available and currently in force. Moreover, neither ANIQUE+ nor P.563 is a context-aware or context-sensitive quality metric.
The examples of non-standardized speech quality prediction metrics include
Low Complexity Speech Quality Assessment (LCQA) [
9] and
Deep Neural Network (DNN)–based speech quality prediction metrics [
10,
11,
12,
13,
14]. The LCQA algorithm has low complexity (execution time) and is designed to monitor the quality of speech over a network. It estimates speech quality by mapping global statistical features, obtained from speech codecs, through a Gaussian Mixture Model [15]. For each frame, parameters are extracted from the speech coding, and global features such as the mean, variance, skewness, and kurtosis of these per-frame parameters are calculated [16]. However, LCQA is restricted to a parametric representation of the input speech signal, without any perceptual transform. Recent work on DNN-based speech quality prediction metrics includes [10, 11, 12, 13, 14]. For example, Reference [12] predicts speech quality by extracting Mel Frequency Cepstral Coefficient (MFCC) features and training a DNN, and Reference [13] predicts speech quality from the speech signal by training a convolutional neural network. The author of Reference [14] uses deep autoencoder and sub-band autoencoder features to train an artificial neural network (ANN) on noisy speech samples for predicting speech quality. Reported results indicate that these DNN-based metrics perform poorly under competing-speaker degradations. Moreover, these metrics predict speech quality directly, without first identifying or classifying the context (noise class) of the input speech signal.
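To make the global-feature idea concrete, the sketch below computes per-frame descriptors and summarizes each one by its mean, variance, skewness, and kurtosis, as LCQA does; the per-frame descriptors used here (log energy and zero-crossing rate) are hypothetical stand-ins for the speech-coding parameters actually used in [9]:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def frame_features(x, frame_len=240, hop=120):
    """Hypothetical per-frame stand-ins for speech-coding parameters:
    log energy and zero-crossing rate (30 ms frames at 8 kHz)."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        feats.append([log_energy, zcr])
    return np.asarray(feats)  # shape: (num_frames, num_params)

def global_features(x):
    """LCQA-style global statistics of each per-frame parameter."""
    f = frame_features(x)
    return np.concatenate([f.mean(axis=0), f.var(axis=0),
                           skew(f, axis=0), kurtosis(f, axis=0)])

# In LCQA, a Gaussian Mixture Model (e.g., sklearn.mixture.GaussianMixture)
# then maps such a global feature vector to a speech quality estimate.
```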
Parametric metrics [
17], for example, the E-Model [
18], estimate the quality of speech using network and terminal parameters. Network delay and packet loss are the key network parameters; jitter buffer overflow, jitter buffer delay, coding distortions, and echo cancellation are the key terminal parameters. Impairments of the received speech signal are predicted from these parameters, and the resulting rating factor is then converted into a
Mean Opinion Score (MOS).
Some recent studies on speech quality monitoring using parametric metrics cover wired, wireless, or atmospheric degradations in speech signals [
19,
20,
21,
22,
23]. However, the E-model has limitations: it cannot clearly represent the non-linear relationship between the perceptual characteristics of the speech signal and the network planning parameters, because the characteristics of the speech signal change dynamically. Moreover, a disadvantage of parametric metrics is that they do not involve the speech signal itself in the prediction of speech quality and are therefore not appropriate for predicting speech quality from signal-noise characteristics [
2]. For real-time measurement and monitoring of speech quality, a no-reference signal-based speech quality prediction metric is therefore the most appropriate.
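For concreteness, the rating-factor-to-MOS conversion standardized for the E-model in ITU-T G.107 can be written as a small function (this is the published formula, shown here purely as an illustration):

```python
def r_to_mos(r):
    """ITU-T G.107 conversion of the E-model rating factor R to MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6

print(round(r_to_mos(80.0), 2))  # -> 4.02: R = 80 maps to roughly MOS 4
```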
The motivation to propose a Context-aware QoE Prediction Metric (CAQoE) arose while reviewing the literature and finding that no signal-based context-aware speech quality prediction metric exists. The proposed metric has three main components: (i) a context-classifier that classifies the context (noise type) of the speech signal, (ii) a Voice Activity Detector (VAD) that identifies the voiced segments present in the noisy signal, and (iii) a Context-specific Speech Quality Estimation Model (CSQM) that predicts context-specific speech quality. Training the context-classifier and the CSQM well requires a large noisy speech database. However, because the NOIZEUS speech corpus provides only a small number of speech samples per noise class, we also address the challenges that a small training database poses, both for building an accurate machine learning (ML) classifier of the speech signal’s context and for training the CSQM to estimate speech quality precisely.
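A schematic view of how these three components might compose at prediction time is sketched below; every name in it is a hypothetical placeholder for illustration, not the actual implementation:

```python
def predict_quality_caqoe(degraded_signal, context_classifier, vad, csqm_models):
    """Schematic CAQoE pipeline (hypothetical interfaces):
    1. classify the noise context of the degraded signal,
    2. keep the voiced segments found by the VAD,
    3. apply the matching context-specific quality model (CSQM)."""
    context = context_classifier.predict(degraded_signal)    # e.g., "car"
    voiced = vad.extract_voiced_segments(degraded_signal)
    return csqm_models[context].predict_mos(voiced)          # estimated MOS
```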
The rest of this article is organized as follows: the use of machine learning and the concept of QoE in speech processing applications are presented in Section 2. The experimental dataset used to evaluate speech quality is described in Section
3. Section
4 presents a detailed explanation of the proposed metric and each of its associated components. The evaluation methodology for each component and for the overall speech quality metric is described in Section
5. Section
6 presents the system setup used to execute the programs and the choices of parameters and hyper-parameters. The results for each component and for the overall proposed metric are presented and discussed in Section
7. Section
8 presents the summary, which includes the key contributions, the limitations of the proposed metric, and plans for future work.
2 Quality of Experience and Machine Learning
The degree of delight or annoyance of the user while using a particular application or service is defined as the QoE. It results from the fulfillment of the customer’s expectations with respect to the utility and/or enjoyment of that application or service in light of their personality and current state [
24]. Various factors influence the QoE during a VoIP call, namely system, network, content, context of use, and user factors. The type of channel (mono or stereo), the position of the microphone, central processing unit overload, and so on are key service and system factors. Jitter, packet loss, and delay of the transmitted speech signal are the network factors. Content, that is, the characteristics of speech and voice, may be affected by processing and can influence the QoE. The location where the service is used is part of the contextual factor, for example, noisy environments such as a car, street, or train, in contrast to a quiet home. Finally, the users themselves bring an expectation of good quality.
In view of this, a novel intelligent technique is needed to design an appropriate data-driven, context-aware QoE metric that can predict speech quality in real time and can be used efficiently for speech quality monitoring. Using
artificial intelligence (AI) and ML techniques, smart decision-making algorithms can be introduced into mobile devices to improve the QoE and the performance gains of the end-user. Moreover, the proposed speech quality prediction metric, CAQoE, can easily be deployed by internet service providers to continuously measure and monitor quality-of-service performance by detecting impairments and identifying the context (noise type). With this assistance, potential root causes can be identified, and QoE-aware management actions can then be put in place to react and maintain end-user QoE levels [
25].
3 The Speech Quality Evaluation Dataset
Different datasets suit different purposes. For example, the ITU-T P.Supplement-23 database [
26] contains the coded version of speech utterances used in the ITU-T 8 kbps codec characterization tests [
27]. Experiment-1 examines the G.729 codec with coded speech samples, which are thus not useful for our experimentation. Experiment-2 investigates the effect of background noise on transmission quality, but its noise is not classified into types such as car, street, or train noise; moreover, its assessment method is comparison category rating rather than ACR, so it is not suitable for this work. Experiment-3 investigates the effects of channel degradations using coded speech samples and is thus also not useful for our experimentation. Therefore, the ITU-T P.Supplement-23 database is not suitable for testing the proposed metric, because it does not contain the different types/classes of degradations required.
As different types of environmental noise degrade speech, the performance of the proposed context-aware speech quality prediction metric is investigated using noisy speech samples from a publicly available noisy dataset, NOIZEUS [
28]. Three male and three female speakers pronounced 30 phonetically balanced IEEE English sentences. Four real-world noises (babble, car, street, and train) at two signal-to-noise ratios (SNRs), 5 and 10 dB, are used to degrade each sentence. The noises are taken from the AURORA database [
29]. Each sentence is down-sampled from 25 kHz to 8 kHz, yielding narrow-band noisy speech samples. For down-sampling, the “resample” function resamples the input speech signal at \(p/q\) times the original sampling rate. It applies an FIR antialiasing low-pass filter to the input speech signal and compensates for the delay introduced by the filter. The average duration of each utterance is 3 seconds. All the noisy speech samples are saved in .WAV format (16-bit PCM, mono).
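As an illustrative sketch of such corpus preparation (NOIZEUS itself ships the noisy files ready-made, and the file names below are hypothetical), the following mixes a noise recording into a clean utterance at a target SNR and down-samples the result from 25 kHz to 8 kHz with a polyphase FIR resampler that behaves like the “resample” function described above:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR in dB."""
    noise = np.resize(noise, clean.shape)  # tile/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

fs, clean = wavfile.read("sp01.wav")       # hypothetical 25 kHz utterance
_, noise = wavfile.read("car_noise.wav")   # hypothetical AURORA noise file

noisy = mix_at_snr(clean.astype(np.float64), noise.astype(np.float64), 5.0)
# Down-sample 25 kHz -> 8 kHz, i.e., p/q = 8/25 of the original rate.
# resample_poly applies an FIR anti-aliasing low-pass filter and
# compensates for the delay the filter introduces.
noisy_8k = resample_poly(noisy, up=8, down=25)
wavfile.write("sp01_car_sn5.wav", 8000,
              np.clip(noisy_8k, -32768, 32767).astype(np.int16))
```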