Voice Quality with ITU-T P.863 ‘POLQA’: A Rohde & Schwarz Company
SwissQual AG
Allmendweg 8 CH-4528 Zuchwil Switzerland
t +41 32 686 65 65 f +41 32 686 65 66 e info@swissqual.com
www.swissqual.com
No part of this publication may be copied, distributed, transmitted, transcribed, stored in a retrieval system, or translated
into any human or computer language without the prior written permission of SwissQual AG.
Confidential materials.
All information in this document is regarded as commercially valuable, protected and privileged intellectual property, and is
provided under the terms of existing Non-Disclosure Agreements or as commercial-in-confidence material.
When you refer to a SwissQual technology or product, you must acknowledge the respective text or logo trademark
somewhere in your text.
SwissQual®, Seven.Five®, SQuad®, QualiPoc®, NetQual®, VQuad®, Diversity® as well as the following logos are
registered trademarks of SwissQual AG.
Diversity Explorer™, Diversity Ranger™, Diversity Unattended™, NiNA+™, NiNA™, NQAgent™, NQComm™, NQDI™,
NQTM™, NQView™, NQWeb™, QPControl™, QPView™, QualiPoc Freerider™, QualiPoc iQ™, QualiPoc Mobile™,
QualiPoc Static™, QualiWatch-M™, QualiWatch-S™, SystemInspector™, TestManager™, VMon™, VQuad-HD™ are
trademarks of SwissQual AG.
SwissQual acknowledges the following trademarks for company names and products:
Adobe®, Adobe Acrobat®, and Adobe Postscript® are trademarks of Adobe Systems Incorporated.
Intel®, Intel Itanium®, Intel Pentium®, and Intel Xeon™ are trademarks or registered trademarks of Intel Corporation.
Microsoft®, Microsoft Windows®, Microsoft Windows NT®, and Windows Vista® are either registered trademarks or
trademarks of Microsoft Corporation in the United States and/or other countries.
Contents
Voice Quality with ITU-T P.863 ‘POLQA’
6 Conclusion
CONFIDENTIAL MATERIALS
Voice Quality with ITU-T P.863 ‘POLQA’ – Application Note
© 2000 - 2012 SwissQual AG
Figures
Figure 1: Application of a full-reference psycho-acoustic model in telecommunication networks
Figure 2: Scheme of a full-reference psycho-acoustic motivated speech quality model
Figure 3: Basic scheme of the main components of P.863 ‘POLQA’
Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts
Figure 5: Illustration of assigned signal parts and the optimal ‘path’ of signal correspondences
Figure 6: Example of an aligned pair of reference and degraded signal
Figure 7: Block-scheme of POLQA as in ITU-T P.863
Figure 8: Application of masking slopes to the Bark spectrum
Figure 9: Consideration of fully and partially masked spectral parts
Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking
Figure 11: Insertion and capturing in a speech test setup
Figure 12: IRS in send and receive direction as specified in ITU-T P.48
Figure 13: P.863 ‘POLQA’ narrowband main result representation in NQDI
Figure 14: P.863 ‘POLQA’ narrowband detail result representation in NQDI
Figure 15: P.863 ‘POLQA’ test selection in NQDI
Figure 16: P.863 ‘POLQA’ statistical report in MS EXCEL
Figure 17: P.863 ‘POLQA’ wideband main result representation in NQDI
Figure 18: P.863 ‘POLQA’ wideband audio bandwidth representation in NQDI
Figure 19: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI
Figure 20: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI with signal interruptions
Figure 21: Distribution of predicted MOS scores by P.862.1 ‘PESQ’
Figure 22: Distribution of predicted MOS scores by P.863 ‘POLQA’
Figure 23: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode and NB mode
Figure 24: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode in wideband networks
Tables
Table 1: Improvement in performance of P.863 ‘POLQA’ to P.862 ‘PESQ’
Table 2: Typical predicted MOS-LQ values for common transmission techniques
Table 3: Typical P.863 ‘POLQA’ scores for common transmission techniques
Table 4: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in high qualitative UMTS/GSM setups
Table 5: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in common real field setups
Table 6: Comparison of different speech samples in common real field setups
Table 7: Comparison of the NB and SWB mode of P.863 ‘POLQA’ in common real field setups
speech frames. All of this considerably changes the physical signal, without necessarily affecting its
qualitative perception. The correct rating of these types of signal distortions is a clear shortcoming of PESQ
and is now solved by POLQA.
[Figure: scheme of a full-reference speech quality model. The distorted signal and the reference signal (a copy of the high-quality speech signal) each pass through a model of the device (e.g. handset) and a psycho-acoustic model (frequency and intensity warping, masking); a distance / similarity measure followed by a cognitive model yields MOS-LQO.]
[Figure: main components of the model. After time alignment, a perceptual model with idealization produces an internal representation of the ideal (from the reference) and an internal representation of the output (from the degraded signal); the difference in internal representation is passed to a cognitive model, which yields the quality.]
Time Alignment
Why does POLQA perform time alignment?
POLQA and other objective measures following the same base structure compare the (spectral) short-term
characteristics of the reference signal and the degraded signal frame by frame. The alignment marks
corresponding sections in both signals. Only this way can the correct frames be compared to each other.
What makes it challenging?
Aligning two signals is simple if the delay between them is constant and the transmission is linear; only an
offset has to be compensated. Unsynchronized devices (clock drift) are more complicated: they lead to a
constantly increasing or decreasing ‘delay’. Here the compensation is not constant, but at least it changes
linearly over time. Even more challenging are processing components that transmit individual parts of the
signal with different delays. These can stretch or compress speech pauses, but also speech parts. The
stretching or compressing can either preserve the pitch or simply ‘warp’ the entire signal part.
In all these cases, each individual short frame of the degraded signal (usually 32ms in length) has to be
assigned to a corresponding frame in the reference signal.
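The simplest case described above, a constant delay, can be illustrated with a plain cross-correlation. This is only a minimal sketch of a common technique, not the alignment procedure of P.863; the function name and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def estimate_constant_delay(reference, degraded, sample_rate):
    """Estimate the delay of `degraded` relative to `reference` in seconds.

    Illustrative only: assumes one constant delay over the whole file.
    """
    ref = np.asarray(reference, dtype=float)
    deg = np.asarray(degraded, dtype=float)
    # Full cross-correlation; the position of the peak marks the best lag.
    xcorr = np.correlate(deg, ref, mode="full")
    # Index 0 of the 'full' output corresponds to lag -(len(ref) - 1).
    lag = int(np.argmax(xcorr)) - (len(ref) - 1)
    return lag / sample_rate
```

A positive result means the degraded signal lags behind the reference. Clock drift and variable delays need the piecewise treatment described in the following paragraphs.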
How can it be done in a robust and fast way?
First, POLQA identifies signal parts where the delay can be assumed to be constant and flags them as
‘landmarks’. These parts can be of different lengths; in the simplest case, a single part covers the entire
signal (if the delay is constant over the entire file).
[Figure: reference and processed signal parts linked by correspondences with confidence]
Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts
In a second step, the areas between these landmarks are analyzed. For this purpose, the signal is subdivided
step by step into a series of smaller parts. Each part has an assigned corresponding part in the other
signal.
Each assigned signal part is given a value that rates the confidence of the assignment. In less confident
areas a wider signal range is analyzed, whereas the assignment correspondences of parts with a high
confidence are considered as fixed.
This approach allows a very efficient and robust search structure since the search range becomes more and
more restricted as more landmarks are set. The result is a kind of matrix with corresponding signal parts and
associated search ranges.
Figure 5: Illustration of assigned signal parts and the optimal ‘path’ of signal correspondences
A Viterbi-like algorithm then calculates the most likely ‘path’ through this matrix and fixes the corresponding
signal parts.
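The idea of a most likely ‘path’ can be sketched with a small dynamic-programming search. The example below is a hedged illustration, not the algorithm of P.863: the data layout (a list of candidate delays with confidences per signal part), the function name and the `jump_penalty` parameter are all assumptions.

```python
def best_delay_path(candidates, jump_penalty=0.01):
    """Pick one delay per signal part so that the total score is maximal.

    candidates[i] is a list of (delay_ms, confidence) tuples for part i.
    The score of a path is the sum of confidences minus a penalty for
    every change in delay between neighbouring parts (hypothetical rule).
    """
    # score[j]: best accumulated score of any path ending in candidate j.
    score = [conf for _, conf in candidates[0]]
    back = []  # back-pointers, one list per transition
    for prev, cur in zip(candidates, candidates[1:]):
        new_score, pointers = [], []
        for delay, conf in cur:
            options = [score[k] - jump_penalty * abs(delay - prev[k][0])
                       for k in range(len(prev))]
            k_best = max(range(len(prev)), key=options.__getitem__)
            new_score.append(options[k_best] + conf)
            pointers.append(k_best)
        score = new_score
        back.append(pointers)
    # Trace the best path back from the last part to the first.
    j = max(range(len(score)), key=score.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j][0] for i, j in enumerate(path)]
```

With `[[(100, 0.9)], [(100, 0.5), (180, 0.6)], [(100, 0.9), (180, 0.2)]]` a purely local choice would pick 180 ms for the middle part, while the path search keeps 100 ms throughout because the jump would cost more than it gains.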
The end result of the time alignment step is a correspondence table with the start and end times of each
signal part and its correspondence in the reference. Parts of the degraded signal with no correspondence in
the reference (i.e. inserted or added parts), as well as parts of the reference signal that are missing in the
degraded signal, are marked as well. The following signal graph illustrates a practical example. The upper
graph shows the (complete) reference signal, the lower graph shows the received and degraded signal.
The green areas denote signal parts assigned with high confidence; the blue ones are those with lower
confidence. The red signal part indicates a part of the reference signal that was lost during transmission and
is no longer present in the degraded signal. Unassigned silent parts (white) are not used for direct
comparison but rather for an analysis of the annoyance of the noise floor within them.
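A correspondence table of this kind could be represented as follows. This is a hypothetical data structure for illustration only; the class, field and function names are assumptions and do not reflect the internal format of P.863 or of any SwissQual product.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start, end) in seconds

@dataclass
class Correspondence:
    degraded: Optional[Span]   # None: reference part is missing in the degraded signal
    reference: Optional[Span]  # None: part was inserted during transmission
    confidence: float = 0.0    # how reliable the assignment is (0..1, assumed scale)

def lost_reference_share(table: List[Correspondence],
                         reference_duration: float) -> float:
    """Fraction of the reference signal with no counterpart in the degraded signal."""
    lost = sum(c.reference[1] - c.reference[0]
               for c in table
               if c.degraded is None and c.reference is not None)
    return lost / reference_duration
```

For a 6 s reference with one lost part of 0.24 s, `lost_reference_share` returns about 0.04, i.e. 4% of the original speech.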
Psycho-acoustic model
Like all models that share the basic approach of POLQA, the psycho-acoustic model starts
with a global level alignment followed by a frame-wise spectral analysis of overlapping frames. As is usual in
these models, a short-term level scaling is applied as well, and a cosine-based window followed by
an FFT converts the audio signal from the time domain to the spectral domain.
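The frame-wise analysis can be sketched as below. The 32 ms frame length is taken from the time alignment section; the 50% overlap, the Hann window as the ‘cosine-based’ window and the function name are assumptions for illustration, not values quoted from P.863.

```python
import numpy as np

def short_term_spectra(signal, sample_rate, frame_ms=32.0, overlap=0.5):
    """Return the power spectrum of each overlapping, windowed frame.

    Result shape: (number of frames, frame_len // 2 + 1).
    """
    x = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    window = np.hanning(frame_len)  # cosine-based (Hann) window
    n_frames = 1 + (len(x) - frame_len) // hop if len(x) >= frame_len else 0
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame)) ** 2
    return spectra
```

At 8 kHz sampling this gives 256-sample frames with a bin spacing of 31.25 Hz.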
The block scheme of the POLQA psycho-acoustic model is shown in the figure below.
[Figure: block scheme of the POLQA psycho-acoustic model. Blocks shown: scaling towards playback level and scaling towards degraded; idealization; windowed FFT (both signals); frequency response, noise estimation and reverb (FRQ, NOI, RVB indicators); frequency response compensation; masking (both signals); perceptual subtraction; asymmetry processing; cognitive model (combination of individual indicators, training on subjective reference scores, mapping into MOS scale); output: predicted listening quality MOS-LQO.]
However, there are three parts that make P.863 ‘POLQA’ different from established standards such as
P.862 ‘PESQ’:
- Removal or reduction of individual distortion types and their separate consideration
- Idealization of the reference signal
- Sharpened loudness spectra
[Figure panels: loudness (Sone) over frequency (Bark), showing unmasked, partially masked and masked spectral parts]
Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking
Finally, we get a loudness spectrum that represents the individual spectral parts as they contribute to
perception. This means that fully masked parts are taken out, while partially masked parts are attenuated.
These modified spectra of the reference and the degraded signal are then compared and differences are
considered as ‘perceptible’ differences. The big advantage of the ‘sharpened’ approach is the remaining high
resolution of the spectrum. It allows a high spectral resolution in the analysis, as required e.g. for a valid
qualitative assessment of the reproduction of fine spectral structures in upper bands by compression
algorithms.
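The effect described above, fully masked parts removed and partially masked parts attenuated, can be sketched with a toy masking rule. The symmetric slope, its value and the subtraction of the masking threshold are illustrative assumptions only; P.863 defines its own masking slopes and processing.

```python
import numpy as np

def apply_masking(bark_spectrum, slope=0.3):
    """Toy spectral masking on a loudness spectrum given per Bark band.

    Each band masks its neighbours with a level that decays by `slope`
    per band of distance (assumed, symmetric). A band below the strongest
    masking level it receives is removed; a band above it is reduced by
    that level.
    """
    s = np.asarray(bark_spectrum, dtype=float)
    bands = np.arange(len(s))
    distance = np.abs(bands[:, None] - bands[None, :])
    # spread[i, j]: masking level in band i caused by band j
    spread = s[None, :] * slope ** distance
    np.fill_diagonal(spread, 0.0)  # a band does not mask itself
    threshold = spread.max(axis=1)
    return np.clip(s - threshold, 0.0, None)
```

For the input `[0, 10, 4, 0.5, 0]` the strong band keeps most of its loudness, its neighbour with loudness 4 is attenuated to about 1, and the weak band with loudness 0.5 is taken out entirely.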
In addition, the P.863 development was intended to extend the scope of P.862, mainly by:
- Extension to super-wideband (50 to 14’000 Hz)
- Qualitative prediction of intermediate bandwidth, changes in audio bandwidth, bandwidth extension
- Acoustical ‘interfaces’, echoes, reverberations
- Sound presentation level
Due to the wide scope of P.863, development and evaluation required a huge amount of test data, that is,
speech samples covering this variety of degradations and scored by human listeners in defined subjective
experiments. In the end, a total of 62 subjectively scored data sets containing more than 45’000 voice
samples were used for the evaluation of P.863 ‘POLQA’.
These data sets¹ were used for calculating the prediction performance by means of residual square errors or
correlation coefficients. The residual square error or, as in earlier times, Pearson’s correlation coefficient
is the indicator for the accuracy of the objective measure; it is given by the remaining prediction error relative to the
‘true’ scores obtained in the subjective tests.
These values give an overview of the performance in general. However, the numbers actually reached depend
on the construction of the data set and the kind of conditions it contains. There are always test
conditions that a model can predict ‘easily’ and accurately (e.g. noises, waveform codecs and
so on) and others where the deviation is higher (usually combinations of distortions). The occurrence of such
conditions in a data set has a strong influence on these figures. This is not only due to the objective
prediction method but is also caused by uncertainties of the listeners in the auditory tests.
For the P.863 ‘POLQA’ evaluation, ITU-T has chosen a statistical approach that is based on an r.m.s.e.
calculation but takes the uncertainty of the subjectively derived MOS values into account. Based on these
figures, the performance evaluation of P.863 ‘POLQA’ compared to P.862.1 and P.862.2 ‘PESQ’ was done.
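The following sketch shows how such figures can be computed. The `rmse_star` function reduces each prediction error by the 95% confidence interval of the subjective score before squaring, which is a formulation along the lines of ITU-T P.1401; that this is exactly the statistic used for Table 1, as well as the function names and the `dof` parameter, are assumptions.

```python
import numpy as np

def rmse(mos_subjective, mos_predicted):
    """Plain root mean square error between subjective and predicted MOS."""
    s = np.asarray(mos_subjective, dtype=float)
    p = np.asarray(mos_predicted, dtype=float)
    return float(np.sqrt(np.mean((s - p) ** 2)))

def rmse_star(mos_subjective, mos_predicted, ci95, dof=0):
    """r.m.s.e. that ignores errors inside the subjective confidence interval.

    ci95: 95% confidence interval of each subjective MOS value.
    dof:  degrees of freedom used up by a mapping function (assumed 0 here).
    """
    s, p, ci = (np.asarray(a, dtype=float)
                for a in (mos_subjective, mos_predicted, ci95))
    excess = np.maximum(np.abs(s - p) - ci, 0.0)
    return float(np.sqrt(np.sum(excess ** 2) / (len(s) - dof)))

def pearson(mos_subjective, mos_predicted):
    """Pearson's correlation coefficient between the two score series."""
    return float(np.corrcoef(mos_subjective, mos_predicted)[0, 1])
```

Because errors smaller than the listeners' own uncertainty do not count, `rmse_star` is never larger than `rmse` on the same data when `dof` is 0.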
Table 1: Improvement in performance of P.863 ‘POLQA’ to P.862 ‘PESQ’
rmse*                        P.862.1 ‘PESQ’   P.863 ‘POLQA’ in NB mode   Improvement by
Classical narrowband exp.    0.157            0.123                      22%
Advanced narrowband exp.     0.227            0.154                      32%
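The improvement column is consistent with a relative reduction of the rmse* value. The short check below assumes that definition (it is not stated explicitly in the table); the function name is chosen for illustration.

```python
def relative_improvement(rmse_old, rmse_new):
    """Relative reduction of the prediction error, as a fraction of the old value."""
    return (rmse_old - rmse_new) / rmse_old

# Values from Table 1
classical = relative_improvement(0.157, 0.123)  # about 0.217
advanced = relative_improvement(0.227, 0.154)   # about 0.322
```

Rounded to whole percent, these give the 22% and 32% listed in the table.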
¹ A data set, also often called experiment or database, is a set of speech files processed or transmitted under different
real field or simulated conditions and scored subjectively. A data set usually consists of about 200 individual speech
samples. The prediction accuracy is calculated by comparing the MOS scores given by the listeners with the
prediction by the objective measure, e.g. P.863 ‘POLQA’.
The so-called classical set of narrowband experiments covers 22 data sets used in ITU-T already for
standardization efforts from the mid 90’s until about 2003. They contain common codec and noise
distortions, mobile channels of the 2nd and 3rd generation as well as VoIP as it was state of the art at the
millennium. Even though these databases cover distortions that were already used during the development
of P.862 ‘PESQ’, the new method P.863 ‘POLQA’ shows even higher prediction accuracy here.
The advanced set of narrowband experiments is more focused on the latest coding technologies, frame loss
concealment strategies, noise reduction and of course 3rd and 4th generation mobile as well as the newest
VoIP implementations. This set is based on 15 data sets. The improvement reached with the new method
P.863 ‘POLQA’ is evident. This set covers a wide range of test conditions of the latest technologies, for which P.863
was designed.
Finally, there was a set of common wideband data as well. It covers 7 different data sets. Here the
improvement over P.862.2 ‘PESQ-WB’ is extremely high.
[Figure: electrical interface, network / real channel, electrical interface]
Similarly, the sending direction is modeled in this narrow-band setup as well. The source speech signal
is inserted into an electrical interface: a PSTN or ISDN line, or the microphone input of a mobile
device. In reality, the signal at this point has already passed the microphone and some voice processing
components. To emulate this part of the signal path, a model of a typical narrowband microphone is applied. This
is called IRS send, since it models the device in the sending direction. It can be thought of as a weak
telephony band-pass with a quite strong pre-emphasis up to 3 kHz. This makes the speech sound a bit
‘sharp’ but gives higher intelligibility in background noise situations.²
Figure 11 schematizes the idea behind a narrow-band test. The modeled sending device allows a direct
electrical coupling to the channel under test and guarantees reproducible results independent of the
microphone actually used.
The frequency responses for the two filters modeling the device are given in Figure 12. It is clearly visible
that there is a bandwidth limitation to the telephony band, although a slightly wider band can pass than just
300 to 3400Hz.
[Two plots: IRS send direction (ITU-T P.48) and IRS rcv direction (ITU-T P.48); a / dB (-30 to 10) over f / Hz (0 to 4000)]
Figure 12: IRS in send and receive direction as specified in ITU-T P.48
While defined level and impedance requirements are given for ISDN and PSTN interfaces and fulfilled by the
interface devices, mobile phones only offer the headset connector as a proprietary interface.
SwissQual’s connector interface for mobile phones is adjusted for this type of interface. It applies the correct
level, adjusts the frequency response and matches the impedance of each individual phone type, thus
enabling a quasi-standard electrical network termination point even for mobile handsets.
² This characteristic is taken from older carbon microphones: the pre-emphasis was meant to compensate for the low-pass
characteristic of the inductively loaded analogue lines of that time.
³ The value of -26 dB relates to an overload point of 32767/-32768 as used with 16-bit resolution in the digital signal
domain.
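A level relative to the 16-bit overload point, as mentioned in footnote 3, can be computed as in the sketch below. It uses the plain RMS of the whole signal as a simplification; a speech level meter that excludes pauses would give somewhat different values. The function names and the default target are assumptions for illustration.

```python
import numpy as np

FULL_SCALE = 32768.0  # overload point of 16-bit samples

def level_dbov(samples):
    """RMS level in dB relative to the 16-bit overload point (non-silent input)."""
    x = np.asarray(samples, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(rms / FULL_SCALE)

def scale_to_level(samples, target_dbov=-26.0):
    """Scale a signal so that its RMS level reaches the target, as int16."""
    x = np.asarray(samples, dtype=float)
    gain = 10.0 ** ((target_dbov - level_dbov(x)) / 20.0)
    return np.clip(np.round(x * gain), -32768, 32767).astype(np.int16)
```

A level of -26 dB corresponds to an RMS value of roughly 1642 on the 16-bit scale.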
Linear distortions
Transparent transmission, ~40 – ~3800 Hz               4.50   4.50   4.50
Transparent transmission, ~180 – ~3500 Hz (G.712)      4.40   4.50   4.30
Transparent transmission, ~200 – ~3500 Hz (IRSsend)    4.50   4.50   4.40
Transparent transmission, 300 – 3400 Hz (box block)    4.10   4.30   3.60
IRSsend + G.711 (A-Law standard PCM)                   4.40   4.40   4.30
Codec conditions
IRSsend + EFR / AMR 12.2 kbps                          4.15   4.15   4.20
IRSsend + EFR (real loss-free connection)              4.10   4.15   4.10
IRSsend + QCELP 13 kbps                                3.90   4.00   4.00
IRSsend + EVRC 9.5 kbps                                3.75   3.90   3.90
IRSsend + EVRC-B 9.3 kbps                              3.75   4.00   3.90
IRSsend + AMR 7.95 kbps                                3.90   4.00   3.95
IRSsend + AMR 6.70 kbps                                3.75   3.90   3.85
AMR 4.75 kbps                                          3.40   3.70   3.65
⁴ ITU-T and 3GPP do not recommend the use of the P.862 family for EVRC-type codecs.
The codecs are used as reference SW implementations. In addition, one EFR condition is shown as it
behaves in a real loss-free channel, using a commercial Nokia handset as access device to the network. The
channel was terminated by an ISDN card device running G.711 A-Law.
First, P.863 ‘POLQA’ gives a very slightly more pessimistic prediction than SQuad08.
However, for practical use cases this absolute difference is negligible. Compared to P.862.1, the higher rates
of AMR match very well, whereas the lower rates are scored higher by P.863. In addition, the EVRC-type
codecs are scored higher and more realistically by P.863, and especially by SQuad08, than by P.862.1.
P.863 ‘POLQA’ considers linear distortions and bandwidth limitations in its score. For super-wideband mode
this is obvious: there, a signal is always compared to a super-wideband reference (50 to 14000 Hz). It is
important to note that P.863 ‘POLQA’ in narrow-band mode uses a ‘full narrow-band’ signal (~50 to
3800 Hz) as reference. P.863 ‘POLQA’ itself applies an IRSrcv filter to this signal. This means that
limitations narrowing this bandwidth will lead to a predicted distortion. P.863 ‘POLQA’ thus takes the actual channel
filters and band-pass characteristics in the microphone and loudspeaker path of the mobile phone used
into account more than P.862 ‘PESQ’ did.⁵
SwissQual’s SQuad08 also considers linear distortion in narrow-band mode; however, it is less sensitive than
P.863 ‘POLQA’ and is supposed to be less dependent on the phone actually used and its internal filtering.
SwissQual’s speech quality suite offers two methods for predicting listening quality: The known SQuad08
and the new ITU-T P.863 ‘POLQA’. Both models may be combined with ITU-T P.862 ‘PESQ’ as an option.
The entire framework as known from SQuad including the voice samples, the insertion and capturing
procedure and – of course – all of the additional signal analysis results are used and available for
P.863 ‘POLQA’ in the same way.
⁵ Since P.863 ‘POLQA’ measures the actual spectral loss of the speech signal, the actual impact of band limitations
depends on the spectral power distribution of the speech sample. That means some samples are affected more and others
less by this filtering due to their spectral characteristics, e.g. by losing more or fewer high-frequency parts.
In addition to the global values for the entire speech sample, graphs illustrate the quality profile over the
sample duration, the signal envelopes as well as the signal gain
same table. The results for each algorithm are given in a separate column.
⁶ For narrow-band mode P.863 ‘POLQA’ applies an IRS receive filter that emulates a narrow-band handset
(see Figure 12: IRS in send and receive direction as specified in ITU-T P.48).
This is the difference to the narrow-band case: the recorded signal is compared to a
super-wideband reference. In the same way, the recorded signal is not post-filtered, so that no band
limitation is introduced; this models a receiving HiFi headphone.
That means that in the case of a ‘full-band’ audio channel (e.g. a VoIP connection using full audio bandwidth or an
application using MP3 with a sufficient bitrate as in video or audio streaming), the recorded signal matches
the reference in its bandwidth. In the case of a common wideband or even a narrow-band channel or device, the
bandwidth becomes limited during transmission. When this signal is recorded and compared to the full
reference, the spectral loss is weighted as degradation.
Of course, the exploration of a wideband channel also requires the insertion of a signal with sufficient
bandwidth. To actually feed wideband signals into the channel, new voice samples were recorded. They
have no perceptible bandwidth limitation and are stored at 32 kHz sampling frequency in a separate
reference folder ‘Speech-Wideband’ or ‘Speech-Wideband POLQA’ respectively. As usual, the samples are
constructed out of a male and a female spoken sentence and have a constant length of 6 s. Thus,
continuity with the narrowband tests is fully maintained.
For the time being SwissQual provides samples in:
- German (German pronunciation)
- German (Swiss pronunciation)
- British English
- Italian
- Dutch
Each language sample is provided without any pre-filtering (except for a 50 – 14’000 Hz band-pass) and
named, for example, GE_fm_wide.wav. As specified for wideband devices, the microphone path is considered as flat in
the transmission band. This means that no IRSsend filter, as required for narrow-band, is applied. The signal remains ‘flat’,
without any further band limitation and without any pre-emphasis as in the IRS.
⁷ For more detailed information, please refer to:
‘White Paper – About MOS and Quality Measurements’ published by SwissQual AG in 2011.
Table 3: Typical P.863 ‘POLQA’ scores for common transmission techniques in a wideband and a narrowband context
Columns: P.863 in super-wideband (50-14000 Hz); P.863 in narrowband (300-3400 Hz)
It can be seen that the rank order of the systems remains independent of the test scenario. The upper
range of the wideband scale is used only for the high-quality wideband voice samples. The common
narrowband scenarios are compressed to the lower 60% of the scale and thus show a smaller gradient as
well.
In case of optimizing and benchmarking pure narrowband networks and applications, the common
narrowband test application can be used without any problems. The individual systems are more clearly
discriminated due to the wider scale range used.
For optimizing wideband applications and networks and especially for benchmarking of wideband networks
against narrowband ones, a wideband test application is required.
Firstly, the degradations in wideband mode can only be assessed in a wideband test application and
secondly, a wideband signal can only ‘show’ its better quality against narrowband in wideband mode.
Note: Narrowband MOS-LQ values and wideband MOS-LQ values must never be mixed or directly
compared. They are referring to different interpretations of the MOS scale.
were considered in the huge training set for SQuad and P.863 ‘POLQA’.
The main focus of Diversity’s wideband test solution is of course the evaluation and benchmarking of
wideband channels in cellular networks.
An additional application area for wideband voice testing in Diversity is video streaming. In video streaming
audio codecs are usually used; these don’t have any bandwidth restriction, except in very low bitrate
conditions. Consequently, Speech Wideband as a test case is also applied to video streaming starting with
Release 10.2 of Diversity and completed in Release 11.0 with the full support of ITU-T P.863 ‘POLQA’.
The lower and upper bounds are marked with blue lines. As is clearly visible, Diversity and ITU-T P.863 make
use of real super-wideband signals. The frequency scale here ends at 16’000 Hz; this corresponds to an
internal sampling frequency of 32’000 Hz.
A good example of the difference between the two algorithms is the treatment of interruptions and lost
speech. Here P.862 ‘PESQ’ is suspected of scoring inaccurately and usually too optimistically. In the example,
almost 4% of the original speech was lost, yet P.862 ‘PESQ’ gives a score of 3.2, while P.863 ‘POLQA’
predicts only 2.7, which appears closer to the perceived score here.
Figure 20: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI with signal interruptions
When a larger number of quality scores obtained in a drive test is analyzed, the picture remains almost the same.
The following figures are based on a drive test and a collection of data from a European operator. The
speech sample used was American English and each given number is based on a collection of around 100
individual scores.
⁸ P.862 ‘PESQ’ defines the algorithm technically. The actual transformation from the P.862 outcome to a MOS-like scale
is defined in P.862.1. All predicted MOS scores in this document are computed in accordance with P.862 and were
converted to the MOS domain according to P.862.1.
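The P.862.1 transformation from the raw P.862 outcome to the MOS-like scale is a logistic mapping. The sketch below uses the coefficients of ITU-T P.862.1 as commonly published; the function name is chosen for illustration, and the Recommendation itself remains the authoritative source.

```python
import math

def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw P.862 score (about -0.5 to 4.5) to MOS-LQO per P.862.1."""
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))
```

The mapping is monotonic and saturates at both ends, so the best raw score of 4.5 yields a MOS-LQO of about 4.55 rather than 5.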
Table 4: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in high qualitative UMTS/GSM setups
Columns: Average (P.862.1 PESQ, P.863 POLQA); Maximum (P.862.1 PESQ, P.863 POLQA)
Looking only at ‘Downlink’, which is usually the less critical direction, the PESQ and POLQA averages
differ by just 0.02, which is completely negligible. There are small differences in
average between the phones and the two technologies GSM and UMTS. But the behavior is always the
same for either method, e.g. GSM 900 is scored lower by 0.2 MOS on average with both methods.
In Uplink the situation is slightly different. Here P.863 ‘POLQA’ scores slightly lower than PESQ, on average
by 0.15 MOS. This effect has several causes, the main one being the more restricted audio bandwidth
that results from using the microphone path of the mobile device, as is the case in Uplink. By contrast, the Downlink
uses the (wider) loudspeaker path of the phone. The former P.862 ‘PESQ’ compensates the frequency
response of the channel and therefore mostly ‘ignores’ that band limitation. P.863 ‘POLQA’ considers
changes in bandwidth as they are perceived by a user, and consequently a limitation will lead to a slightly
lower score here.
Besides the average values, the distribution of the predicted values provides information about the measures’
behavior. The following two graphs are based on the downlink scores of Device ‘A’ in UMTS 2100 as above.
[Histogram: PDF, number of values (0% to 50%) over Listening Quality in bins of 0.1 MOS from 1 to 4.9]
Figure 21: Distribution of predicted MOS scores by P.862.1 ‘PESQ’ (Device A, UMTS, Downlink as in Table 4)
[Histogram: PDF, number of values (0% to 50%) over Listening Quality in bins of 0.1 MOS from 1 to 4.9]
Figure 22: Distribution of predicted MOS scores by P.863 ‘POLQA’ (Device A, UMTS, Downlink as in Table 4)
Both distribution functions are very close and concentrate a large majority of the scores in the range of 4.0 to
4.2, which corresponds to the best quality in error-free connections. It is logical that a certain quality cannot be
exceeded: it is set by the coding scheme, the channel limits and other included voice processing. Even in
undistorted conditions these insert a certain amount of degradation, which defines the upper level that cannot be
exceeded in this setup and causes the steep decline towards higher values on the right-hand side. Usually,
the majority of scores are in this region, which corresponds to error-free transmission.
Towards lower values, the distribution falls off more gradually. Values in this region indicate degradations in
addition to the unavoidable distortions. In cellular networks these problems are usually interruptions (due to
handovers), fallback to lower bitrates in the case of AMR (due to bad radio conditions) and frame losses that
were concealed artificially by the AMR decoder. In principle there could be other distortions as well,
e.g. transcodings in the case of special routing or noise bursts coupled into analogue parts of the PSTN.
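A distribution like the ones in Figures 21 and 22 can be derived from a list of predicted scores as sketched below. The bin width of 0.1 MOS matches the axis of the figures; the function name, the range of 1.0 to 5.0 and the use of NumPy are assumptions for illustration.

```python
import numpy as np

def mos_distribution(scores, bin_width=0.1, lo=1.0, hi=5.0):
    """Return bin edges and the share of scores per bin in percent."""
    n_bins = int(round((hi - lo) / bin_width))
    edges = np.linspace(lo, hi, n_bins + 1)
    counts, _ = np.histogram(scores, bins=edges)
    share = 100.0 * counts / max(len(scores), 1)
    return edges, share
```

The bin with the highest share shows where most scores are located, which for an error-free connection is close to the maximum the codec and device allow.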
Regarding the absolute maximum as shown in Table 4, there is no difference between the phones and the
technologies used. This means that the achievable quality is identical for both, and that the slightly differing
averages are caused by individual test conditions, e.g. slightly different RF coupling or a few more bad
channels in the averaging process. It should be noted that the maximum reached is the same as that obtained
by processing the same speech sample through an AMR 12.2 kbps codec in offline emulation. This indicates
that no further distortions are introduced by the phone or by constant speech processing components in the
network.
Table 5: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in common real field setups
                                     Average        Maximum        80% Percentile
                                     PESQ   POLQA   PESQ   POLQA   PESQ   POLQA
Network 2, Device C, UMTS            3.35   3.42    3.69   3.69    3.54   3.56
Network 3, Device D, UMTS            3.98   3.79    4.10   3.97    4.03   3.87
Network 4, Device E, CDMA / EVRC     3.30   3.29    3.76   3.66    3.62   3.54
Network 5, Device F, CDMA / EVRC-B   3.33   3.44    3.77   3.84    3.56   3.64
For network ‘2’ the situation is different. Apart from a device that applies some gain and noise control, the
network here is limited to AMR at 5.9 kbps. This significantly reduces the achievable quality compared to a
network/device combination as in network ‘1’.
Network ‘3’ lies somewhere between networks ‘1’ and ‘2’. It enables AMR at 12.2 kbps, but the handset used
is not as transparent as devices ‘A’ and ‘B’. P.863 ‘POLQA’ scores these device characteristics lower
than P.862 ‘PESQ’.
The two real field CDMA networks are also in the range of networks ‘2’ and ‘3’. The quality is determined by
the coding schemes used. For EVRC-B in particular, the quality scores are higher than with P.862 ‘PESQ’.
However, aggressive noise and gain control also have a strong influence on the achieved scores. In the end,
even the maximum scores are lower than what would be expected from plain encoding distortions.
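The per-network comparison above can be retraced with a little arithmetic. The following is a minimal sketch that subtracts the P.862.1 ‘PESQ’ value from the P.863 ‘POLQA’ value for each row of Table 5. It assumes that the six values per row are ordered as average, maximum and 80% percentile, each as a PESQ/POLQA pair. This order is not stated next to the values, but it is consistent with the remarks on networks ‘3’ and ‘5’. The names `TABLE_5` and `polqa_minus_pesq` are illustrative only.

```python
# Values copied from Table 5. Assumed column order (not stated with the values):
# (PESQ avg, POLQA avg, PESQ max, POLQA max, PESQ 80% percentile, POLQA 80% percentile)
TABLE_5 = {
    "Network 2 (Device C, UMTS)": (3.35, 3.42, 3.69, 3.69, 3.54, 3.56),
    "Network 3 (Device D, UMTS)": (3.98, 3.79, 4.10, 3.97, 4.03, 3.87),
    "Network 4 (Device E, CDMA / EVRC)": (3.30, 3.29, 3.76, 3.66, 3.62, 3.54),
    "Network 5 (Device F, CDMA / EVRC-B)": (3.33, 3.44, 3.77, 3.84, 3.56, 3.64),
}


def polqa_minus_pesq(row):
    """Return the POLQA - PESQ difference for average, maximum and 80% percentile."""
    pesq_avg, polqa_avg, pesq_max, polqa_max, pesq_p80, polqa_p80 = row
    return {
        "average": round(polqa_avg - pesq_avg, 2),
        "maximum": round(polqa_max - pesq_max, 2),
        "p80": round(polqa_p80 - pesq_p80, 2),
    }


deltas = {name: polqa_minus_pesq(row) for name, row in TABLE_5.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta}")
```

Under this reading, a negative difference (as for network ‘3’) means that P.863 ‘POLQA’ rates the setup lower than P.862.1 ‘PESQ’, and a positive one (as for EVRC-B) means it rates the setup higher.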
The quality figures achieved in real field measurements depend, of course, on the RF conditions in
the network. However, it has to be considered that a certain quality cannot be exceeded due to fixed speech
processing components in the channel and in the device. Beyond comparing averages, a closer look at
the distribution of the scores and at the values where most of the scores are located gives useful information
about potential reasons for non-perfect quality.
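A closer look at a score distribution can be sketched as follows. The example sorts MOS scores into bins of width 0.1, as in the distribution figures of this chapter, and reports the bin that holds most of the scores. The function names, the bin handling and the score list are assumptions made for illustration. The scores are invented and are not measured data.

```python
from collections import Counter


def mos_histogram(scores, width=0.1, lo=1.0, hi=5.0):
    """Return {lower bin edge: share of values} for MOS scores between lo and hi."""
    n_bins = round((hi - lo) / width)
    counts = Counter()
    for s in scores:
        # The small epsilon guards against floating-point error at bin edges;
        # the index is clamped so that out-of-range scores land in the outer bins.
        idx = int((s - lo) / width + 1e-9)
        idx = max(0, min(idx, n_bins - 1))
        counts[idx] += 1
    total = len(scores)
    return {round(lo + i * width, 1): counts[i] / total for i in sorted(counts)}


def modal_bin(hist):
    """Lower edge of the bin holding the largest share of values."""
    return max(hist, key=hist.get)


# Hypothetical scores, invented for illustration only (not measured data).
scores = [4.05, 4.12, 4.08, 4.15, 3.95, 3.60, 4.10, 2.85, 4.02, 4.11]
hist = mos_histogram(scores)
print("modal bin starts at", modal_bin(hist))
```

The position of the modal bin approximates the quality reached in error-free transmission, while the share of values in the bins below it shows how often additional degradations occurred.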
To get an impression of how large this deviation can be, the following analysis was made. Nine different speech
samples were transmitted consecutively in a phone call during a drive test in a real UMTS network. A total of
30 calls were made. It can be assumed that the ‘distribution of real channel quality’ was the same for all nine
samples.
The following table shows the averages, the absolute maximum values and the 80% percentiles of the MOS
scores obtained with the nine speech samples. For a better overview, the samples are grouped by language.
The test situation is the ‘reference situation’ as above, i.e. network 1 (European UMTS 2100 MHz, Device ‘B’
as a quasi-transparent device and ‘uncritical’ downlink only).
Table 6: Comparison of different speech samples in common real field setups
In general, there is a considerable difference between the samples. The averages and the maximum values
span a range of >0.2 MOS for P.862 ‘PESQ’ and even >0.3 MOS for P.863 ‘POLQA’. There are two reasons
for this. First, the individual samples are treated slightly differently by the voice processing in the channel;
they are affected to different degrees by e.g. band-pass filtering or compression.
Second, P.863 ‘POLQA’ takes into account the talker’s timbre, i.e. the spectral power distribution of the
reference and degraded signals. Since the talkers’ individual characteristics and the actual recording
conditions of the reference speech samples differ, the P.863 ‘POLQA’ scores differ slightly too.
The effect is largely systematic, i.e. a speech sample that is scored slightly lower will tend towards lower
scores under all realistic test conditions. Therefore, when comparing MOS values from different
investigations, the influence of the speech sample used should not be overlooked. Ideally, results that are to
be compared to each other should be based on the same speech sample or the same selection of samples.
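The three statistics used in this comparison (average, absolute maximum and 80% percentile) can be computed per speech sample as in the sketch below. The text does not state how the 80% percentile is calculated, so the nearest-rank definition used here is an assumption. The sample names and scores are invented for illustration and are not taken from Table 6.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(scores):
    """Average, absolute maximum and 80% percentile of a list of MOS scores."""
    return {
        "average": sum(scores) / len(scores),
        "maximum": max(scores),
        "p80": percentile(scores, 80),
    }


# Hypothetical scores per speech sample, invented for illustration only.
scores_by_sample = {
    "sample_1": [4.05, 4.12, 4.08, 4.15, 3.95, 3.60, 4.10, 2.85, 4.02, 4.11],
    "sample_2": [3.90, 3.95, 3.88, 4.00, 3.70, 3.97, 3.40, 3.92, 3.85, 3.96],
}
summary = {name: summarize(s) for name, s in scores_by_sample.items()}
for name, stats in summary.items():
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

Keeping the statistics separate per sample makes a systematic offset between samples visible before any results are pooled.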
Table 7: Comparison of the NB and SWB mode of P.863 ‘POLQA’ in common real field setups
                               P.863 ‘POLQA’
                               Average       Maximum       80% Percentile
                               NB     SWB    NB     SWB    NB     SWB
Network 1, UMTS DL, Device A   4.06   3.03   4.17   3.25   4.13   3.17
The achievable quality scores of 3.17 (80% percentile) or 3.25 (absolute maximum in the collection) fit the
simulated value of 3.2 given in Table 3 quite well (which itself is an average over a set of different speech
samples).
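The size of the shift between the two modes follows directly from Table 7. The short sketch below computes the NB minus SWB difference for each statistic, and the distance of the SWB values from the simulated value of 3.2. The variable names are illustrative only. All numbers are copied from Table 7 and from the text above.

```python
# Values copied from Table 7 (P.863 'POLQA', Network 1, UMTS DL, Device A).
NB = {"average": 4.06, "maximum": 4.17, "p80": 4.13}
SWB = {"average": 3.03, "maximum": 3.25, "p80": 3.17}
SIMULATED_SWB = 3.2  # simulated value quoted in the text with reference to Table 3

# How much lower the same narrowband connection scores on the SWB scale.
shift = {key: round(NB[key] - SWB[key], 2) for key in NB}

# Distance of the measured SWB values from the simulated value.
gap_to_simulation = {
    key: round(SWB[key] - SIMULATED_SWB, 2) for key in ("maximum", "p80")
}

print("NB minus SWB:", shift)
print("SWB minus simulated:", gap_to_simulation)
```

The same narrowband connection thus scores roughly one MOS lower when it is rated in SWB mode, while the measured SWB values stay within 0.05 MOS of the simulated value.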
In a more detailed analysis, the distributions of the two test cases are compared in Figure 23. It can be seen
that in SWB mode the typical shape of the MOS distribution is shifted and compressed towards the lower end
of the scale.
Figure 23: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode (lower graph) and NB mode (upper
graph) using Device A, UMTS, Downlink as in Table 4
[Two histograms; x-axis: Listening Quality (MOS bins of width 0.1); y-axis: PDF, number of values (0% to 50%)]
It has to be considered that SWB scores will not match the figures obtained with NB in the past. Values from
the two modes must not be mixed or compared with each other. However, it is only a question of time until
the market works with quality scores obtained in super-wideband mode and adapts to the lower range of
scores achievable for narrowband connections.
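One way to avoid mixing the two scales in a reporting tool is to store the analysis mode together with each score and to refuse to aggregate across modes. The following is a minimal sketch of that idea. `MosScore` and `average_mos` are hypothetical names invented for this example and are not part of any SwissQual product or of P.863.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class MosScore:
    """A predicted MOS value together with the mode it was obtained in."""
    value: float
    mode: str  # "NB" or "SWB"


def average_mos(scores):
    """Average the scores, but only if all of them come from the same mode."""
    modes = {s.mode for s in scores}
    if len(modes) != 1:
        raise ValueError(f"expected scores from exactly one mode, got {sorted(modes)}")
    return mean(s.value for s in scores)


# Hypothetical values, invented for illustration only.
nb_scores = [MosScore(4.0, "NB"), MosScore(4.2, "NB")]
print(round(average_mos(nb_scores), 2))
```

Passing a list that contains both NB and SWB scores raises a `ValueError` instead of silently producing an average that lies on neither scale.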
The most important reason for using P.863 ‘POLQA’ in SWB mode is of course the evaluation of wideband
channels and networks, both for comparison to each other and to traditional narrowband systems.
Today in 2011, only a few mobile networks are equipped with wideband transmission capabilities, and these
are often restricted to mobile-to-mobile connections in transcoding-free operational mode. On the other hand,
VoIP services have long been using wideband and even super-wideband transmission, and
with P.863 ‘POLQA’ there is now an appropriate objective measure of speech quality for these systems.
The following example for wideband transmission is based on a collection of speech samples obtained in a
mobile network that was equipped with AMR-WB at most locations.
Figure 24: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode in wideband capable networks
[Histogram; x-axis: Listening Quality (MOS bins of width 0.1); y-axis: PDF, number of values (0% to 30%)]
It can clearly be observed that the majority of the quality scores are in a range from 3.7 to 3.9, which
represents the achievable quality for AMR-WB at 12.65 kbps (as shown in Table 4). The lower scores are
partially caused by transmission errors and by a few narrowband connections in this selection. AMR-NB at
12.2 kbps will result in a predicted MOS of 3.2 or lower in this dataset.
6 Conclusion
With P.863 ‘POLQA’ a new measure for objective speech quality assessment has been standardized,
serving today’s and future demands of voice quality testing. P.863 ‘POLQA’ is embedded in SwissQual’s
strong SQuad application framework which computes a powerful set of additional information characterizing
the analyzed speech signal.
The actual implementation of P.863 ‘POLQA’ was thoroughly optimized for speed, allowing resource-saving
calculation in a fraction of real time on SwissQual’s platform. The SQuad framework that provides P.863
‘POLQA’ can also compute P.862 ‘PESQ’ scores in parallel to P.863 ‘POLQA’ in NB mode as an option for
customers who are interested in a direct comparison between the two measures.
Of course, the SQuad measurement suite can also be equipped with SQuad08 as a MOS predictor.
SQuad08 is technically compatible with P.863 and serves the same applications as P.863 ‘POLQA’.