Sound System Engineering and Optimization: The Effects of Multiple Arrivals on the Intelligibility of Reinforced Speech
Abstract
Résumé
Dedication
Preface
Considering the magnitude of this project, it was known from the outset
that much help, advice and assistance would be required from many people –
however it was not known just how many people and how much help would be
needed. The author would like to take this opportunity to acknowledge, and
extend his gratitude to the following people for their contributions
to this dissertation. Dr. Roger Schwenke at Meyer Sound for providing the
author with enough knowledge of the inner workings of their hardware, without
divulging trade secrets, to effectively use it in this research. Dr. Geoffrey
Groocock, at the College of Veterinary Medicine at Cornell University, for his
contributions and assistance with statistical analysis using non-parametric testing
methods. Also, for having been a dear friend for the better part of 20 years. Dr.
Terrell Finney, Rayburn Dobson and Patti Hall, for permission to use, and
assistance scheduling, the facilities of the College Conservatory of Music at the
University of Cincinnati. Stirling Shelton, Stevenson Miller and Richard
Palmer, at the College Conservatory of Music, for the use of rigging hardware
and for their assistance flying loudspeakers. The National Institute for
Occupational Safety and Health (NIOSH) for the loan of many pieces of
equipment used in this research project, including the KEMAR, measurement
microphones, preamplifiers, adaptors and calibrator. Chucri A. (Chuck)
Kardous, at the Cincinnati division of NIOSH, for graciously offering his time,
equipment and laboratory space. Dr. Babette Verbsky, at the Cincinnati division
of NIOSH, for her assistance developing the author’s body of literature on
subjective testing. Claudia Norman, of the Institutional Review Board at the
University of Cincinnati, for all of the meetings, for proofreading the research
protocols and for answering so many questions about the process. Lynda McNeil,
of the Research Ethics Board at McGill University, for her assistance
understanding and compiling the submission for approval. Sun Hee Kil for her
unwavering dedication and help during the stimulus recording sessions. Martin
Garneret and David Hunter, for their logistical support of the pilot study. Nik
Tranby & Dr. Sungyoung Kim, for their assistance with software programming.
Teri Waters for the initial pass at proofreading this document, and for logistical
support of the second half of phase 2 of the main study. Velma Dawson for the
use of her basement, and for enduring the associated inconvenience. And last, but
not least, thanks go to all of the subjects who participated in these studies -
“Circle the thanks again.”
Table of Contents
Abstract ............................................................................................................... ii
Résumé ............................................................................................................... iii
Dedication .......................................................................................................... iv
Preface ................................................................................................................. v
Table of Contents ............................................................................................ viii
List of Tables ................................................................................................... xiii
List of Figures .................................................................................................. xvi
Chapters:
1. Introduction ............................................................................................. 1
1.1 Motivation ......................................................................................... 2
1.2 Project Overview .............................................................................. 3
1.3 Research Questions & Variables ...................................................... 4
1.3.1 Project Summary ...................................................................... 6
2. Review of Literature ............................................................................... 9
2.1 Speech Intelligibility ......................................................................... 9
2.1.1 Factors That Affect Speech Intelligibility .............................. 10
2.1.2 Objective Methods for Estimating Speech Intelligibility ...... 20
2.1.2.1 Narrow Band Effects ..................................................... 24
2.1.3 Subjective Methods for Evaluating Speech Intelligibility ..... 27
2.1.3.1 A Review of Various Testing Methods ........................ 27
2.1.3.2 Conclusions ................................................................... 30
2.1.3.3 Use of the Modified Rhyme Test .................................. 31
2.2 Sound System Design & Optimization .......................................... 34
2.2.1 Types of Loudspeaker Arrays ............................................... 36
2.2.2 Multiple Arrivals & Summation ........................................... 41
2.2.2.1 Background & Measurable Effects ............................... 42
2.2.2.2 Subjective Effects ......................................................... 50
List of Tables
Table 3.1 Methods used to measure the level of running speech with
results ............................................................................................78
Table 5.1 Corrective equalization settled upon for use on all stimuli............95
Table 6.9 Results of ANOVA and Kruskal-Wallis tests for the effect
of stimulus presentation order on adjusted score (original
data set). Results are shown for all word lists evaluated
and for the first 10 sets evaluated by each subject ......................113
Table 6.10 Results of ANOVA and Kruskal-Wallis tests for the effect
of MRT word list on adjusted score (SNR stratified data
set) ...............................................................................................116
Table 7.1 Variable values used in the first phase of the main study ...........125
Table 7.2 Tests for homogeneity of variance for the six data sets:
Kolmogorov-Smirnov and Shapiro-Wilk statistics .....................132
Table 7.10 Results showing generating class for MCTA using 88%
and 90% pass criteria ..................................................................144
Table 8.1 Variable values used in the second phase of the main study .......160
Table 8.2 Tests for homogeneity of variance for the four data sets
generated (49w censored data) ....................................................164
Table 8.8 Results showing generating class for MCTA using 83%
and 85% pass criteria ..................................................................171
Table 9.1 Octave weights used for STI calculations in EASERA ...............178
List of Figures
Figure 2.12 Summation of equal level sine waves with 180° relative
phase .............................................................................................44
Figure 2.14 Equal relative phase summation of audio signals with non-
identical angle of incidence to the receiver (Reprinted with
permission from [124], © Elsevier Limited, 2007) ......................46
Figure 2.16 Summation with delay for a complex waveform with four
different frequency components ...................................................47
Figure 2.18 The effect of a 0.1 ms time offset on the frequency and
phase response of summed coherent audio signals with
equal level and relative phase (Reprinted with permission
from [124], © Elsevier Limited, 2007) .........................................49
Figure 2.21 MTF graphs of the 500 Hz band, showing the interaction
effects of SNR and delay time on intelligibility (dark
lines), compared to the MTF graphs of the unaffected
transmission system (Reprinted with permission from
[167], © Audio Engineering Society, 1973) .................................56
Figure 2.22 Measured RaSTI values vs. delay time and echo level
obtained by injecting an electronically delayed echo into
the measurement process. The bottom line corresponds to
an echo with level equal to the direct signal (Reprinted
with permission from [170], © Audio Engineering Society,
1988) .............................................................................................58
Figure 3.6 Magnitude vs. frequency response of the UPA and two
UPM loudspeakers, as measured on the axis of each
loudspeaker ...................................................................................83
Figure 4.7 Free- and diffuse-field responses for blocked ear canal
(Reprinted with permission from [14] © John Wiley &
Sons, 2006) ....................................................................................92
Figure 6.1 Box and whisker plot of adjusted score vs. SNR for all
tested levels of SNR ....................................................................104
Figure 6.4 Detrended normal Q-Q plot of adjusted score in the SNR
stratified data set .........................................................................107
Figure 6.5 Box and whisker plot of adjusted score vs. array type for
SNR stratified data set ................................................................112
Figure 6.6 Box and whisker plot of adjusted score vs. SNR for all
tested levels of SNR. Outliers are identified by data point
index number ..............................................................................114
Figure 6.7 Box and whisker plot of adjusted score vs. subject (SNR
stratified data set) .........................................................................115
Figure 6.8 Box and whisker plot of adjusted score vs. word list (SNR
stratified data set) .........................................................................116
Figure 7.2 Box and whisker plot of adjusted score (full data set) ................129
Figure 7.3 Box and whisker plot of adjusted score vs. subject (full
data set) .......................................................................................130
Figure 7.5 Detrended normal Q-Q plot of adjusted score (Strat_7 data
set) ...............................................................................................132
Figure 7.6 Box and whisker plot of adjusted score vs. array type
(Strat_7 data set) .........................................................................134
Figure 7.7 Box and whisker plot of adjusted score vs. delay time
(Strat_7 data set) .........................................................................135
Figure 7.8 Box and whisker plot of adjusted score vs. delay time, by
SNR (Strat_7 data set) .................................................................137
Figure 7.9 Box and whisker plot of adjusted score vs. array type, by
SNR (Strat_7 data set) .................................................................138
Figure 7.10 Box and whisker plot of adjusted score vs. delay time, by
array type (Strat_7 data set) .........................................................141
Figure 7.11 Box and whisker plot of adjusted score vs. presentation
order (Strat_7 data set) ................................................................146
Figure 7.12 Box and whisker plot of adjusted score vs. word list
(Strat_7 data set) .........................................................................147
Figure 7.13 Box and whisker plot of adjusted score vs. word list
(Strat_7, 49w data set) .................................................................149
Figure 8.1 Box and whisker plot of adjusted score (full 49w data set) ........164
Figure 8.2 Box and whisker plot of adjusted score vs. subject (full
49w data set). Note that, as subjects 12 and 28 did not
finish all of the testing sessions, their user numbers have
been shifted to 112 and 128 for ease of identification and
exclusion .....................................................................................165
Figure 8.3 Box and whisker plot of adjusted score vs. array type
(Strat_6, 49w data set) .................................................................168
Figure 8.4 Box and whisker plot of adjusted score vs. delay time, by
array type (Strat_6 data set) .........................................................172
Figure 8.5 Box and whisker plot of adjusted score vs. presentation
order (Strat_6 data set) ................................................................173
Figure 9.2 Box and whisker plot of STI vs. delay time (all treatments) .......181
Figure 9.4 Box and whisker plot of STI vs. SNR (all treatments) ...............183
Figure 9.5 Box and whisker plot of STI vs. array type (all treatments) ........184
Figure 9.6 3-dimensional box and whisker plot of STI vs. delay time,
by SNR (all treatments) ...............................................................185
Figure 9.7 Box and whisker plot of STI vs. SNR, by array type (all
treatments) ...................................................................................186
Figure 9.8 Box and whisker plot of STI vs. SNR, by array type (all
treatments containing SNR conditions 6 dB, 3 dB and 0
dB)................................................................................................187
Figure 9.9 Box and whisker plot of STI vs. delay time, by array type
(all treatments) .............................................................................188
Figure 9.10 Box and whisker plot of STI vs. delay time, by array type
(all treatments containing SNR conditions 6 dB, 3 dB and
0 dB).............................................................................................189
1. Introduction
together or located some distance apart, there will be some difference in distance
between each loudspeaker and an audience member [1, 124]. Differences in
distance translate to differences in the travel time of sound, resulting in multiple
arrivals: Time-delayed copies of the reinforced signal arriving and summing at a
listener. The summation of multiple time-delayed sound signals results in a
pattern of undesirable frequency response irregularities known as comb-filters [42,
46].
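The comb-filter pattern can be sketched numerically. Below is a minimal example, assuming two unit-gain copies of the same signal; the function name and values are illustrative and not drawn from the studies described here.

```python
import numpy as np

def comb_filter_magnitude(freqs_hz, delay_s, gain=1.0):
    """Magnitude of H(f) = 1 + gain * exp(-j*2*pi*f*delay): the frequency
    response of a signal summed with a level-scaled, time-delayed copy."""
    return np.abs(1.0 + gain * np.exp(-2j * np.pi * freqs_hz * delay_s))

# A 1 ms offset between two equal-level arrivals:
freqs = np.array([0.0, 500.0, 1000.0])
mags = comb_filter_magnitude(freqs, delay_s=0.001)
# 0 Hz: the copies sum in phase (magnitude 2). 500 Hz: a 1 ms delay is
# half a cycle, so the copies cancel (magnitude 0). 1000 Hz: in phase again.
```

A 1 ms offset places the first null at 500 Hz and further nulls at every odd multiple; shorter delays spread the nulls further apart in frequency, which is the characteristic "comb" pattern.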
The negative effects of comb-filtering can be combated. Time-delays can
be electronically added to speaker signals, in an attempt to align, or intentionally
misalign, the timing of the arrivals of multiple loudspeaker broadcasts at the
location of an individual listener in the audience [11, 94, 124]. As there are many
listeners in an audience, it is physically impossible to achieve ideal loudspeaker
time-alignment for every seat. Sound engineers are forced to
select points within the audience where the signals from multiple loudspeakers
will be aligned in time.
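The alignment arithmetic itself is simple; a hedged sketch follows, where the speed of sound and the distances are illustrative placeholders, not the alignment procedure used in this project.

```python
# Speed of sound is approximate (air at roughly 20 degrees C).
SPEED_OF_SOUND_M_S = 343.0

def alignment_delay_ms(near_dist_m, far_dist_m, c=SPEED_OF_SOUND_M_S):
    """Electronic delay (ms) added to the nearer loudspeaker's feed so
    that both arrivals coincide at one chosen listener position."""
    return (far_dist_m - near_dist_m) / c * 1000.0

# A listener 10 m from the main array and 3.5 m from a nearer loudspeaker:
delay = alignment_delay_ms(3.5, 10.0)  # about 19 ms
```

A delay chosen this way is exact only at that one position; listeners elsewhere still receive misaligned arrivals, which is the trade-off described above.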
There are several differing theories regarding how to decide alignment
locations [4, 11, 28, 42, 53, 124, 131] but to date, none have addressed the effects
on intelligibility for the surrounding locations which are not time-aligned [115].
Research (e.g. [130, 170]) has shown that non-aligned multiple arrivals can
negatively affect intelligibility. This same research has also shown that alignment
through the use of electronic delay compensation can ameliorate the problem at
specific audience locations. However, most studies used electronic intelligibility
estimation methods and were limited in the number and types of loudspeaker
arrays studied. Through the use of subjective evaluation and objective
measurement methods, this series of studies will expand upon the body of
literature, delineating the specific effects of delay time and two specific array
geometries on the intelligibility of reproduced speech.
1.1 Motivation
In recent years, work has been done to examine the physical and electrical
aspects of sound system design and optimization [28, 42, 124]. It is known that
For the cases of musical theatre and music concerts, sound engineers are
faced with the challenge of maintaining an aesthetic balance of level between
voice and music while still maintaining intelligibility. The presence of music
amounts to an increase in the level of background noise, thus a decrease in the
signal-to-noise ratio of the speech signal [126, 162]. This research project will
therefore also include background noise level (signal-to-noise ratio) as an
experimental variable, to investigate interactions between the level of non-speech
sounds and speech intelligibility. Though the effect of signal-to-noise ratio (SNR)
itself is known, the potential for SNR to obscure the effects of other variables is
not.
While the use of music as the non-speech signal would be the most
ecologically valid, previous research has identified that temporal variations in
characteristics such as spectrum are a likely cause of unwanted variability in test
results [73]. In the interest of controlling variables, and limiting potential
contextual clues and listener preference biases, pink-weighted pseudo-random
noise, exhibiting average spectral and power characteristics similar to popular
music, will be used in the place of music.
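As a rough illustration of pink-weighted pseudo-random noise: the sketch below is not the actual test signal used in these studies (whose spectral and power characteristics were matched to popular music); it simply shapes white noise toward a 1/f power slope.

```python
import numpy as np

def pink_noise(n_samples, seed=0):
    """Pseudo-random noise with an approximately pink (1/f power)
    spectrum: white noise whose spectral bins are scaled by 1/sqrt(f)."""
    rng = np.random.default_rng(seed)   # fixed seed -> repeatable noise
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    f = np.fft.rfftfreq(n_samples)
    f[0] = f[1]                         # avoid division by zero at DC
    spectrum /= np.sqrt(f)              # amplitude ~ 1/sqrt(f), power ~ 1/f
    pink = np.fft.irfft(spectrum, n_samples)
    return pink / np.max(np.abs(pink))  # normalize to the range [-1, 1]

noise = pink_noise(4096)
```

Because the generator is seeded, the same "random" signal can be regenerated exactly, which is one reason pseudo-random noise is preferred over music for controlled testing.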
To obtain the stimuli for this study, the MRT word lists will be reproduced
by a sound system set up in Corbett Auditorium at the University of Cincinnati.
A head and torso simulator (manikin with microphones in the ear canals) will be
used to capture and record the sound produced by the sound reinforcement system.
Stimulus sets corresponding to different experimental treatments will be obtained
by delivering and capturing the MRT word lists using different sound system
alignment conditions. These different alignment conditions correspond to
variations in the four independent variables to be investigated: Delay time
between the arrivals from two loudspeakers, level offsets between the arrivals, the
orientation and focus of the two loudspeakers (vertical point-destination vs.
horizontal point-source array) and the signal-to-noise ratio.
Though the associated stimuli were recorded, it should be noted that the
variable “level offset” was ultimately not used in any of the studies in this project.
It is included here, and mentioned throughout this dissertation, as it was present in
the original project design and its incorporation was considered during the design
process of each study. While it was not possible to evaluate the effects of level
offset within the scope of this project, it is the author’s intention that this variable
will be used in future studies.
In chapters 3–5 the acquisition and preparation of the stimulus sound files
will be detailed, as will the headphone suitability study.
Chapter 3 – Stimuli were recorded for use in subjective evaluation and
measurements were made for comparison with the results from listening tests.
Chapter 4 – A study was conducted to determine which, if any, available
headphones would be acceptable for use in listening tests.
Chapter 5 – Stimuli were prepared and compensatory equalization was
applied to counter the effects of the recording apparatus.
Chapter 7 – The first phase of the main study was conducted. The goals
of this phase were to investigate a reduced set of variable treatments, to test 6
hypotheses and to make note of effects and relationships observed in an attempt to
further narrow the number of variable treatments before proceeding to the second
phase of the main study.
Tested Hypotheses:
Ho1: Delay time between multiple arrivals does not affect the
intelligibility of speech reproduced by a sound system.
Ho2: Signal-to-noise ratio does not affect the intelligibility of speech
reproduced by a sound system.
Ho3: Array geometry does not affect the intelligibility of speech
reproduced by a sound system.
Ho4: (interaction) Signal-to-noise ratio does not affect how delay time
between multiple arrivals affects the intelligibility of speech
reproduced by a sound system.
Ho5: (interaction) Signal-to-noise ratio does not affect how array
geometry affects the intelligibility of speech reproduced by a sound
system.
Ho6: (interaction) Array geometry does not affect how delay time
between multiple arrivals affects the intelligibility of speech
reproduced by a sound system.
2. Review of Literature
paths. Energy that travels directly from the loudspeaker to the listener, and thus
has the shortest path and travel time, is termed the direct sound (see chapter 2.2.2
for an explanation of sound propagation). Energy that arrives at the listener via
paths including a small number of reflections will reach the listener a short time
after the direct sound, and will generally be lower in level than the direct sound
due to energy absorption by the reflecting surfaces. This group of arrivals is
termed the early reflections, or early sound. Because there is a relatively small
number of possible transmission paths that contain only a few reflections, the
number of early reflections is also small and thus the arrivals tend to be spaced apart in time. Also,
due to room geometry, the majority of early reflections tend to be incident on the
listener from angles in front of the lateral plane [17].
As the number of reflections in a transmission path increases, the number
of potential paths increases. Eventually, some time after the reception of the
direct sound, a condition is reached in which the number of reflections arriving at
the listener is so numerous, and the reflections are so tightly packed in time, that
individual reflections cannot be distinguished from the mass. The sound energy
of these later-arriving reflections is termed the reverberant energy. Due to the
increased number of possible transmission paths, reverberant energy is not limited
to frontal incidence, but rather tends to be incident from all angles [17].¹ The
total sound reaching a listener will thus include the components of the direct
sound, early reflections and reverberant energy.
Reverberant sound pressure in an impulse response usually decreases with
time due to absorption during the increased number of reflections. Eventually, as
the number of reflections increases, the level of the arrivals will drop below the
audible threshold. In essence, the time required for the level of sound energy in a
room to decay to a point below the audible threshold is referred to as the
reverberation time of the room [42]. Lingering reverberant energy has the
potential to mask the direct sound produced by a loudspeaker, depending on the
¹ A special condition, a diffuse reverberant field, exists when reverberant energy
arrives from all directions, with equal level vs. direction. For the purposes of
discussion in this dissertation, a reverberant sound field is not assumed to be
diffuse.
relative level of the reverberant energy compared to the level of the direct sound.
Thus both the reverberation time and the ratio of direct-to-reverberant sound
energy (D/R ratio) are factors that affect intelligibility [106, 107, 110, 116, 143].
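One illustrative way to compute a D/R ratio from a measured impulse response is to split its energy at a cutoff time. The sketch below is not a method from this dissertation; the 50 ms cutoff is a placeholder, since the appropriate cutoff is itself a matter of debate, and the decaying exponential stands in for a real measured response.

```python
import numpy as np

def direct_to_reverberant_db(ir, fs, cutoff_ms=50.0):
    """Rough D/R ratio (dB): energy in the impulse response before the
    cutoff vs. energy after it. The 50 ms cutoff is a placeholder only."""
    split = int(fs * cutoff_ms / 1000.0)
    early = np.sum(ir[:split] ** 2)
    late = np.sum(ir[split:] ** 2)
    return 10.0 * np.log10(early / late)

# Synthetic exponentially decaying "impulse response" whose envelope
# falls by roughly 60 dB over one second (an RT60 of about 1 s):
fs = 48000
t = np.arange(fs) / fs
ir = np.exp(-6.9 * t)
dr_db = direct_to_reverberant_db(ir, fs)
```

For this particular decay and cutoff the early and late energies come out nearly equal, i.e. a D/R ratio near 0 dB; a real measured response would also contain a discrete direct-sound peak and early reflections.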
2) Discrete Echoes
With notable exceptions [67], the direct sound is generally perceived as
being the source of the sound. However the auditory system incorporates a
portion of the early sound into the direct sound. This process, termed fusion (or
subjective masking), is a result of a phenomenon called post-masking, wherein a
signal has the ability to mask other signals that arrive shortly after the original
[107, 183]. As such, the portion of the early sound that arrives soon enough after
the direct sound that it can be effectively fused is indistinguishable from, and
serves to fortify the level of, the direct sound. The portion of the early sound that
does not arrive soon enough to be fused will be perceived as individual reflections
(echoes) [107]. Perceivable echoes have a negative impact on intelligibility, as
they are not only a distraction for listeners but, like reverberant energy, serve to
decrease the D/R ratio [67, 107, 116].
The question arises regarding the temporal location of the cutoff point
between the direct and reverberant energy for calculation of the D/R ratio. There
is considerable debate and data in the literature with regard to the specific
integration time used in the fusion process (e.g. [17, 41, 67, 107]). There is
further debate as to whether fusion affects different perceptual attributes, such as
loudness and intelligibility, in the same manner or to the same degree. It is
suggested by some [41] that fusion plays different roles for echo perception and
intelligibility, in that even undetectable echoes can affect intelligibility. At this
time, it seems clear that there is still uncertainty as to the effect on intelligibility
of echoes with short delay time, which is why the author has focused on the topic
as one of the primary research questions in this dissertation.
Haas [67] references earlier investigations on echo disturbance in
telecommunications. These studies provide widely varying conclusions, placing
the cutoff delay time required for an echo, with level equal to the original signal,
to cause noticeable deterioration in sound quality between 40 ms and 100 ms.
Haas’s own work [67] toward determining what he termed the “critical delay
difference” concludes that 50% of listeners can detect an audible disturbance
at delays beyond roughly 40 ms for equal-level echoes incident from the front, with
higher critical delays, ranging up to 60 ms, for echoes incident from the side, rear and above.
Lochner and Burger [107] and Mapp [114] report similar relationships between
critical delay difference and angle of incidence of reflections. Haas also
determined that the delay time required for disturbance increases for echoes with
level lower than the original signal.
Davis and Davis [41] cite previous works that have determined the cutoff
for what “constitutes a detrimental reflection” to be anywhere from 0 ms to 95 ms.
Conclusions from their own experiments comparing objective and subjective
assessment results suggest that with regard to intelligibility, as opposed to the
effects of fusion on loudness, no integration time is used. Klepper [94] cites
previous works that have determined the cutoff to be between 10 ms and 60 ms,
with shorter times corresponding to discrete echoes vs. echoes in the presence of
reverberant “fill”. Janssen [87], on the other hand, used the single value of 50 ms
for his integration time.
Results from the multi-decade series of studies carried out by Peutz [144]
indicate that, compared to the magnitude of effects due to other factors, echoes
have a relatively small effect on intelligibility. However Peutz states that his
results show that decreases in intelligibility resulting from a horizontally arriving
echo are noticeable at delay times as short as 15 ms, with increasing detrimental
effect up to 45 ms, above which point the effect becomes constant.
As mentioned previously, there is some question as to whether fusion plays the
same role in echo detection as in intelligibility. Lochner and Burger [107]
state that level gains due to fusion, for delays of 95 ms and under, will increase
intelligibility. However, Lochner and Burger also state that total fusion is only
achieved up to 30 ms, beyond which the contribution of the echo diminishes [105].
In the work of Davis and Davis [41], it is acknowledged that fusion does occur up
to around 50 ms to 80 ms, but the conclusions of their study suggest that, while
some portion (10 ms–20 ms) of the fused sound does cause an increase in
perceived loudness, it does not lead to gains in intelligibility.
It has been further suggested [38, 111] that the auditory system could be
viewed as employing various fusion integration times, depending on the perceptual
attribute in question. Citing the findings of others, Mapp suggests an integration
time of 3 ms for transient effects, 35 ms–60 ms for information (such as
speech and music) and 125 ms–200 ms for perceived loudness.
3) Background noise
Background noise, like reverberant energy, has the potential to mask
speech signals, resulting in decreased intelligibility [18, 20, 107]. In essence,
reverberant energy can be viewed as a type of background noise, though the two
factors are addressed separately. Background noise is a blanket term used to
describe many possible distracters to speech reception, including electronic noise
(noise, buzz and hum in a sound system), noise due to air movement (heating and
ventilation systems), other speech signals and music.
The metric used to describe the level of background noise, and its effect
on intelligibility, is the signal-to-noise ratio (SNR, in dB) – the ratio of the level
of the speech signal to the level of background noise. Results from the work of
Peutz [144] indicate that variance in intelligibility can be found for values of SNR
ranging from −5 dB to 25 dB.
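SNR in dB follows directly from the RMS levels of the two signals. A minimal sketch is shown below; the level-measurement procedure actually used for running speech in these studies is more involved (see Table 3.1).

```python
import numpy as np

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB from the RMS levels of two signals."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    return 20.0 * np.log10(rms(speech) / rms(noise))

tone = np.sin(np.linspace(0.0, 100.0, 48000))
equal = snr_db(tone, tone)          # equal levels -> 0 dB
doubled = snr_db(2.0 * tone, tone)  # doubling the level -> about +6 dB
```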
4) Spectrum
There are several aspects of spectrum that relate to speech intelligibility,
which can generally be broken into three categories: Production, reception and
transmission. Figure 2.1 shows what could be considered the average spectrum of
speech [59], though it is known that variations exist between male and female
spectra [110]. It is also known that the formants of vowel sounds are relatively
low in frequency (approximately 200 Hz–800 Hz) compared to consonants which
extend to higher frequencies (approximately 1 kHz–8 kHz) [111, 117].
Comparing these frequency ranges to figure 2.1, it can be seen that the amount of
energy produced by vowels is greater than the amount produced by consonants.
² A rigorous definition of Q is beyond the scope of this text. The interested reader
is directed to [142].
Over the past century, a variety of objective methods have been developed
for the estimation of speech intelligibility over a communication system. These
methods employ a combination of acoustical calculations and the
electroacoustical measurement of many of the factors listed in the previous
section. Through correlation with the empirical results from subjective testing,
relationships have been determined that allow objective calculation-based and
measurement-based methods to produce a result indicating intelligibility. In
recent years, several of these methods have been included in measurement
software as well as acoustical prediction programs [3, 49].
Over time the accuracy and precision of objective methods have increased,
as more of the factors that affect intelligibility have been incorporated into the
calculations and measurement protocols. However, to date, no such method is
believed to account for all of the factors that affect speech intelligibility [117].
Recent investigations have explored the integration of binaural mechanisms,
higher frequency resolution and pre- and post-masking sensitivity [19, 37, 102,
155].
reverberation were not considered in the original design of the metric. Later
adaptations by Kryter [97] and others incorporate the use of 1/3- and 1-octave
frequency bands, and provide adjustment factors to account for reverberation. In
the end, the treatment of reverberation has been found lacking [114] and the
method does not allow for the separate analysis of intelligibility impairment due
to the loss of articulation of vowels and consonants [143].
3) Articulation Loss of Consonants
The percent loss of the articulation of consonants (%ALcons), developed by
Peutz [143] and later modified by Peutz [144] and Davis [40], is a predictive
metric based primarily on acoustical parameters of a room and loudspeaker
directivity. The metric focuses on consonants as Peutz found losses in consonants
to be higher and a more accurate predictor than losses in vowels [143].
Though there are several variations of the %ALcons equation, the general
form includes reverberation time, room volume, listener-loudspeaker distance,
loudspeaker Q and a factor to account for interference from other loudspeakers
[41]. Originally developed as a calculated predictor of the intelligibility of
speech, %ALcons was later implemented as a direct measurement method in
Techron TEF measurement devices [39]. Additionally, it is possible to
determine %ALcons via conversion from other measurements such as speech
transmission index (by Farrel Becker, presented in [42]).
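The general form can be illustrated with a sketch. The specific formula below is a commonly cited SI variant of Peutz's equation and is stated here as an assumption (several variations exist, as noted above), including the conventional ceiling of 9 × RT60 for listeners far beyond the critical distance; the interference factor for additional loudspeakers is omitted.

```python
def alcons_peutz(distance_m, rt60_s, volume_m3, q):
    """Predicted %ALcons from listener distance (m), reverberation time (s),
    room volume (m^3) and loudspeaker directivity factor Q.
    Assumed form: 200 * r^2 * RT60^2 / (V * Q), capped at 9 * RT60."""
    alcons = 200.0 * distance_m ** 2 * rt60_s ** 2 / (volume_m3 * q)
    return min(alcons, 9.0 * rt60_s)

# Example: 20 m distance, RT60 = 1.5 s, 10,000 m^3 hall, Q = 10
print(round(alcons_peutz(20.0, 1.5, 10000.0, 10.0), 2))  # prints 1.8
```

Lower values indicate better intelligibility; the cap models the reverberation-dominated far field.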
The predictive and measurement accuracy of the %ALcons method has
been shown to have high correlation to the results obtained through subjective
assessment [41, 143], though correlation appears to decrease under conditions of
diminished intelligibility [114]. The primary limitation of %ALcons is that its
formulae and measurements are confined to the 2 kHz 1-octave band; the method
is thus not sensitive to bandwidth, frequency response anomalies or critical-band
masking in other bands resulting from noise or equalization.
Additionally, %ALcons does not account for SNR, distinct echoes, distortion or
non-linear effects, and it is difficult to correctly determine the effect of
interference from other loudspeakers [110].
present in the sound system, differences in excitation should not affect test results
[114].
Further developments with STI include variations in frequency weighting
of the transmission indices to account for male and female speech spectra [112].
5) Rapid Speech Transmission Index
Full STI measurements are difficult to perform and the computational
requirements were beyond the ability of the computers available at the time of its
inception [110]. The Rapid Speech Transmission Index (RaSTI) was introduced
and integrated into a hand-held measurement system, incorporating a nine-
measurement subset of the full STI. RaSTI measurements include four
modulation frequencies in the 500 Hz octave band, and five frequencies in the 2
kHz band [77]. The RaSTI test signal is a speech-like modulated noise signal.
RaSTI suffers many of the drawbacks of STI, and it has long been
recognized that the full STI measurement is more accurate than the smaller RaSTI
measurement [114]. Also, as with %ALcons, the use of a reduced set of frequency
bands leads to lack of sensitivity in many areas of the spectrum.
Recently, a version of the STI measurement for public address (STIPa) has
been introduced, though it has yet to be standardized. Like RaSTI, STIPa uses a
modulated speech-like test signal and a subset of the full number of STI
measurements [114]. However, unlike RaSTI, the calculation of STIPa uses 14
measurements spanning all seven of the STI frequency bands. Thus, though STIPa
uses a reduced number of measurements, it should still be more sensitive to the
effects of frequency anomalies than RaSTI. RaSTI and STIPa metrics can both be
derived from IR measurements. However, STIPa results obtained in such a way
are referred to as “STIPa equivalent” [49, 118].
the sound system and, despite the fact that this system would obviously greatly
impair intelligibility, the RaSTI measurement returned near-perfect results (RaSTI
= 0.98). The measurement scheme was unable to detect this impairment due to
lack of resolution, as its measurements are based solely on information in the 500
Hz and 2 kHz bands. It is likewise conceivable that narrow-band frequency
effects may elude other measurement methods with inadequate frequency
response resolution (e.g. STI and %ALcons).
sound systems may lead to erroneous results. While this research may allow for
further investigation of such error, the use of other measurement methods should
be investigated.
Such clues could be present if sentences were to be used as the core stimuli for
testing. Another issue is that of subject learning. Even with the use of sentences
that are not highly predictable [50] or that are semantically anomalous [139],
learning effects make it so that the same sentence cannot be used twice for the
same listener, resulting in the need for very large stimulus sets [156].
The prohibitively large stimulus sets required, coupled with the ambiguity
resulting from semantic and contextual clues, make sentence-based tests
undesirable for use in communication system evaluation requiring repeated
evaluations by each subject.
With regard to intelligibility, consonants carry more useful information
and contain less energy than vowels, and are therefore more susceptible to
degradation in a communication system [5]. Thus, intelligibility is primarily a
byproduct of a listener’s ability to differentiate between consonant-based
phonemes [143]. As such, most single-word based listening tests that study
intelligibility focus on the ability of subjects to distinguish between consonants
[156].
Fairbanks [55] details the desire to create a speech intelligibility test based
on phonemic differentiation: one that would use words as stimuli, test subjects'
ability to identify those words and produce results relatable to the task
of identifying everyday speech. The result of his work is the Fairbanks rhyme test
(FRT): A testing method which employs monosyllabic English words, and tests
subjects’ ability to discriminate between consonantal and consonant-vowel
transitional variations. The test contains 250 words which are grouped into 50
ensembles, each containing 5 rhyming words with identical stem spelling (e.g.
-eel, -ook, -ale). The rhyming words differ only by the initial consonant.
Each test is constructed by selecting one word from each ensemble,
resulting in a 50-word test list. Subjects are provided with a response sheet, and
responses are indicated by writing the first letter of the word in the blank space
preceding each of the 50 word stems. For each word stem, Fairbanks estimates
that there are between 6 and 16 possible words in the English vocabulary from
which a subject can choose.
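The ensemble-based construction described above can be sketched as follows. The three ensembles shown are illustrative stand-ins, not Fairbanks's actual 250-word set (50 ensembles of 5 words each).

```python
import random

# Sketch of Fairbanks-style test-list construction: one word is drawn
# from each rhyming ensemble to form a test list. Hypothetical ensembles.
ensembles = [
    ["feel", "heel", "keel", "peel", "reel"],  # -eel stem
    ["book", "cook", "hook", "look", "took"],  # -ook stem
    ["bale", "gale", "male", "pale", "tale"],  # -ale stem
]

def build_test_list(ensembles, rng=random):
    """Select one word from each ensemble to form a test list."""
    return [rng.choice(words) for words in ensembles]

test_list = build_test_list(ensembles)
print(len(test_list))  # prints 3: one word per ensemble
```

With the full 50-ensemble set, the same selection yields the 50-word lists described above.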
a carrier sentence. Like the FRT, the PB is an open format test – meaning that
responses are generated by subjects, rather than selected from a list of alternatives.
One major disadvantage of the PB is the training time required to achieve
stable results from listeners. Additionally, subjects continue to show
improvement in scores even after sufficient training, sometimes taking weeks, is
completed. As such, comparison of test results obtained at different phases of
testing is difficult [156].
The diagnostic rhyme test (DRT) is the final test mentioned in the ANSI
recommendation. The DRT employs a smaller set of stimuli than the similar MRT
and FRT. The DRT is a closed format test, containing 192 stimulus
words grouped into 96 rhyming pairs. Stimuli are delivered, without a carrier
sentence, to subjects who are presented with the two-alternative forced choice
task of identifying the correct word [176].
As the DRT stimulus set is smaller than those of its rhyming predecessors,
this method has the advantage of shorter testing times. However, this method has
the disadvantage that, within each word pair, differences are found only between
the first consonants of the words.
Other examples of tests that employ the presentation of single
monosyllabic English words include the speech identification test (SIT) [71] and
the California consonant test (CCT) [141]. The SIT was devised as part of an
experiment which studied the effects of context on both listeners’ ability to
identify words, and their certainty about said identifications [71]. Though similar
to the MRT, the CCT was developed specifically for the purpose of clinical
testing on hearing impairments [141].
2.1.3.2 Conclusions
The MRT seems the most appropriate test for use in this research project.
The use of a test which employs monosyllabic stimuli will allow for repetitive
testing on subjects, and will remove the potential for errors in results due to
contextual and semantic clues. The subject training requirements for use of the
MRT are far less than what is required for the PB.
Unlike the FRT and DRT, the MRT includes variation in both initial and
final consonants. When the potential effects of post-masking are considered [183],
it is clear that the MRT has the potential for sensitivity to a wider variety of
factors that affect intelligibility (e.g. masking of the initial-consonant phoneme by
the carrier sentence and the final-consonant phoneme by the stimulus word’s own
vowel).
absolute pass/fail criteria, nor is it concerned with variations associated with the
use of different talkers. For example, it is not the intent of this document to prove
the effective function of an emergency address system.
Among other specifications, the recommendation indicates that a
minimum of five talkers and five listeners should be used and that, as the talker
has a greater effect on intelligibility, more talkers than listeners should be used.
The document also recommends that a minimum of three MRT words lists should
be evaluated by each listener for each variable treatment.
While a review of the literature involving practical application of the MRT
does show a common testing method, it also shows a great number of differences
between the specific implementations of the test. As the purposes and overall
goals of an experiment will vary from study to study, so too it seems do the
applications of testing methods. The obvious difference between the reviewed
studies lies in the number of subjects used. Subject population size is governed
by the experimental design of a study, type of subjects used (e.g. naïve vs. expert)
and the amount of statistical strength needed to show significance [14]. However,
other notable variations between uses of the MRT do exist, including:
1) Number of word lists used
In the literature, the number of MRT word lists used ranges from one list
[73, 123] to the full set of six, as seen in Kreul et al. [95, 96], Atkinson and
Catellier’s study on the intelligibility of radio-communication systems for
firefighters [9] and many others. Examples also exist of the use of three lists [99,
122].
2) Number of word lists used per variable treatment
Beyer et al. [21], studying the validity of the use of the MRT for clinical
study and Nixon [138], studying the effectiveness of the MRT response foil words,
used the original Stanford Research Institute (SRI) recordings made by Kreul et al.
[95] which include an incomplete matrix of six lists and three talkers. Kusumoto
et al. [99], in a study of speech enhancement in reverberant environments, used
three word lists, but only one word list per reverberation time treatment. Nabelek
and Mason [136] used six lists total, but only two per treatment, to study the
[Footnote 3: Multiple main and satellite systems can be used for multi-channel
(e.g. Left-Right and Left-Center-Right) reproduction.]
subsystem arrays. Each of these arrays, primary and subsystem, may be oriented
vertically or horizontally.
[Footnote 4: A thorough discussion of individual Fresnel/Fraunhofer, chaotic and
collective Fraunhofer regions of array radiation is beyond the scope of this
dissertation. The interested reader is directed to [69].]
The sine wave is one of the fundamental building blocks of sound. Figure
2.9 shows some of the basic properties of a sine wave, including amplitude,
wavelength and the propagation of sound over time.

[Footnote 5: It should be noted that the discussion of multiple arrivals and
summation in this text is intended to describe the summation of direct, coherent
sound radiated from loudspeakers and/or discrete echoes, not the summation of
reverberant energy.]

As a sine wave propagates,
its amplitude alternates back and forth between peak and minimum amplitude.
The speed at which the wave completes its cycles is called frequency (f), and is
measured in cycles per second (Hz). One full alternation (from peak to minimum
back to peak) is called one cycle of the wave, and is also referred to as 360
degrees of rotation through a cycle. The time required to complete one full cycle,
the period of the sine wave, is equal to the inverse of the wave’s frequency, as
shown in equation 2.2. The physical length (in air) of one cycle of the wave, the
wavelength (λ), is determined by equation 2.3.
T = 1/f (Eq. 2.2)

λ = c/f = cT (Eq. 2.3)
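As a minimal numerical check of equations 2.2 and 2.3, assuming a speed of sound c of 343 m/s (a typical room-temperature value):

```python
# Numerical check of Eq. 2.2 (T = 1/f) and Eq. 2.3 (lambda = c/f = c*T),
# assuming c = 343 m/s for air at room temperature.
c = 343.0  # speed of sound, m/s

def period(f_hz):
    return 1.0 / f_hz  # Eq. 2.2, seconds

def wavelength(f_hz):
    return c / f_hz    # Eq. 2.3, metres

print(period(200.0))      # prints 0.005 (5 ms)
print(wavelength(200.0))  # prints 1.715 (~1.7 m, as used later in the text)
```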
When comparing two sine waves, one can describe many things including
differences in amplitudes, frequencies and a third term: Relative phase. Relative
phase is expressed in terms of degrees, radians or wavelength, and describes
whether two waves are at the same point in their respective cycles. For example,
two waves that are at identical points in their cycles are deemed to have a zero-
degree difference in relative phase (equal), while waves that are offset by ½-
wavelength (figure 2.10) are deemed to have 180° difference in relative phase
(inverted).
Basic Summation
When two sound waves arrive at a location in space they have the
potential to constructively or destructively sum [42]. For example, if two sine
waves of equal frequency, relative phase and amplitude arrive at a point in space,
the summation of the two waves will produce a resulting wave with twice the
amplitude of the original waves (figure 2.11). Alternatively, if the same two sine
waves have a relative phase difference of 180°, the resulting summation will yield
total cancellation (figure 2.12). The term summation refers to both the addition
and cancellation of waves.
Figure 2.11 Summation of equal level sine waves with 0° relative phase
Figure 2.12 Summation of equal level sine waves with 180° relative
phase
points out in his book, “1 + 1 = 1 (± 1)”. In other words, the summation of two
equal level signals can produce a result anywhere between double the original
level (+6 dB) and no level at all (−∞ dB) depending on the relative phase of the
two summed signals. When the summed sine waves are not of equal level, the
potential for addition and cancellation is not as great.
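The "1 + 1 = 1 (± 1)" behaviour can be sketched numerically. The helper below is a hypothetical illustration, not a formula from the text: it returns the level of the sum of two coherent sine waves, relative to the louder one, for a given relative phase and level difference.

```python
import math

def summation_gain_db(phase_deg, level_diff_db=0.0):
    """Level of the sum of two coherent sine waves, in dB relative to the
    louder one, for a given relative phase and level difference."""
    r = 10.0 ** (-abs(level_diff_db) / 20.0)  # quieter-to-louder ratio
    phi = math.radians(phase_deg)
    mag = math.sqrt(1.0 + r * r + 2.0 * r * math.cos(phi))
    return float("-inf") if mag == 0.0 else 20.0 * math.log10(mag)

print(round(summation_gain_db(0.0), 2))         # prints 6.02 (doubled level)
print(summation_gain_db(180.0))                 # prints -inf (total cancellation)
print(round(summation_gain_db(180.0, 6.0), 1))  # prints -6.0 (partial cancellation)
```

The third case shows why unequal levels limit the depth of cancellation: a 6 dB level difference leaves about 6 dB of attenuation rather than a complete null.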
Figure 2.14 Equal relative phase summation of audio signals with non-
identical angle of incidence to the receiver (Reprinted with
permission from [124])
Figure 2.15 Equal relative phase summation of sine waves and coherent
complex audio signals (Reprinted with permission from
[124])
While the properties of summation are the same, the results are somewhat
more complicated. As complex signals are effectively composed of a
combination of sine waves, complex summation involves the many individual
summations of these components. Figure 2.16 describes a simple example of the
complex summation of identical signals that contain four frequency components
each. The arrivals of the two signals at the spatial summation point differ by
some amount of delay. As each frequency has a different period, a shift in time
will effect a different relative phase shift for each frequency. Thus, summation of
each of these frequency components will yield different results. The example
shown in figure 2.16, involves the summation of an original signal with a copy
which has been delayed by 5 ms (approximately 1.7 m in air, 1 wavelength at 200
Hz). In this example, the 100 Hz components (½ λ offset) completely cancel,
while the 125 Hz components (5/8 λ offset) only partially cancel. The 200 Hz
components (1 λ offset) fully sum since, as seen in figure 2.13, an offset of one full
wavelength is effectively no offset: A 360° offset is equivalent to a 0° offset. It
should be noted that, in practical application, signals rarely perfectly sum or
cancel [124].
Figure 2.16 Summation with delay for a complex waveform with four
different frequency components
Figure 2.17 shows the result of the summation of two real-world signals
with the same delay time (5 ms), but for a more complex signal. The total
cancellation at 100 Hz (½ λ) is still present, as are the effects at the other three
frequencies used in figure 2.16. Proceeding higher in frequency, we note
cancellation at 300 Hz (3/2 λ), summation at 400 Hz (2 λ), cancellation at 500 Hz
(5/2 λ) and so on as wavelength offset increases vs. frequency. The result of the
time-delayed summation of identical complex signals is a frequency response
structure called a comb filter [46, 54].
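The notch and peak series described above follows directly from the delay: for an offset of τ seconds, notches fall at odd multiples of 1/(2τ) and peaks at integer multiples of 1/τ. A small sketch (the helper names are illustrative):

```python
# Comb-filter notch and peak frequencies for two arrivals offset by tau
# seconds: notches at odd multiples of 1/(2*tau), peaks at multiples of 1/tau.
def comb_notches(tau_s, f_max_hz):
    f0 = 1.0 / (2.0 * tau_s)  # first notch frequency
    return [f0 * k for k in range(1, int(f_max_hz / f0) + 1, 2)]

def comb_peaks(tau_s, f_max_hz):
    f1 = 1.0 / tau_s  # first peak frequency
    return [f1 * k for k in range(1, int(f_max_hz / f1) + 1)]

tau = 0.005  # the 5 ms offset used in figures 2.16 and 2.17
print(comb_notches(tau, 600.0))  # prints [100.0, 300.0, 500.0]
print(comb_peaks(tau, 600.0))    # prints [200.0, 400.0, 600.0]
```

These match the cancellations at 100 Hz, 300 Hz and 500 Hz and the summations at 200 Hz and 400 Hz noted in the text.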
Figure 2.18 The effect of a 0.1 ms time offset on the frequency and
phase response of summed coherent audio signals with
equal level and relative phase (Reprinted with permission
from [124])
Figure 2.19 The effect of a 1 ms time offset on the frequency and phase
response of summed coherent audio signals with equal
level and relative phase (Reprinted with permission from
[124])
Multiple arrivals with short delays (0 ms–20 ms) create loudness, tonal
coloration and spatial localization effects. Temporal fusion in the auditory system
will integrate two arrivals falling in this region and effect an increase in perceived
loudness. A short delay time can also shift the perceived location of a sound source
toward the earlier arrival, a phenomenon known as the precedence effect [67], and
can create severe comb filters in the spectrum of the received, summed sound. As
delay time increases, the strength of
the precedence effect diminishes. Also, analogous to the phenomenon known as
the Schroeder frequency [108], the frequency above which the density of modal
resonances in a room is so high that individual modes are inaudible and the
response may be treated statistically, longer delay times between multiple arrivals
lead to greater notch density, eventually reaching a point wherein the resulting
comb filter is inaudible.
Though there is some dissent, it is generally agreed that delay times of
medium length (20 ms–80 ms) between multiple arrivals will still incur an
increase in perceived level due to fusion [17, 41, 150]. Also in this time region, a
spatial enhancement effect is seen, in the form of enlargement of the perceived
width of the sound source (the auditory source width or ASW) [17, 150]. From
the perspective of architectural acoustics, increased ASW is believed to add
pleasantness and fullness of tone to sound sources.
The transition point between medium and long delay times is also vague,
and depends on the nature and number of arrivals. In the case of many arrivals,
e.g. reverberation, the perceived loudness of the original arrival is not affected.
However, as arrivals tend to reach a listener from many directions, including the
rear, the listener will experience an increase in the spatial enhancement known as
listener envelopment (LEV) [150]. For the case of few arrivals with long delays,
a listener will likely perceive these arrivals as separate distinct echoes.
between arrivals, resulting in different patterns of comb filters for different areas
of the audience. The consequence of this is wide variance in the spectrum of
sound received across the total audience that cannot be removed through
loudspeaker equalization [124].
In order to preserve uniformity of sound system frequency response over
the entire audience, the arrivals of signals from multiple loudspeakers would need
to be aligned for all locations in the audience listening area. However this is
rarely possible. As such, two distinctly different theories, each employing the use
of electronically added delays, have developed regarding the method to
compensate for the effects of multiple arrivals. Detailed below, these two theories
are alignment and intentional misalignment.
Acknowledging that a single delay time cannot align the arrivals of two
loudspeakers over an appreciable listening area, researchers have developed
methods to intentionally misalign loudspeaker arrivals. As can be seen in figure
2.18 and figure 2.19, small offsets in arrival time (0.1 ms–1 ms) result in a sparse
comb filter pattern with wide notches in the frequency band attributable to speech
intelligibility (approximately 250 Hz–8 kHz). Conversely larger time offsets,
such as 10 ms as seen in figure 2.20, result in a more compact structure of comb
notches. The argument for misalignment is that the tight grouping of notches
resulting from larger offsets would be less detrimental to sound quality than the
sparse notch structure resulting from smaller offsets. It should be noted that while
studies on misalignment have addressed sound and speech quality, speech
intelligibility has been only peripherally discussed.
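The sparse-versus-compact argument can be quantified by counting comb notches in the speech band; the 250 Hz–8 kHz limits follow the text, while the counting helper itself is an illustrative sketch.

```python
import math

# Count of comb notches falling in a nominal speech band (250 Hz - 8 kHz)
# for a given arrival-time offset. Notches sit at odd multiples of
# 1/(2*tau), so their density grows linearly with delay.
def notch_count(tau_s, f_lo=250.0, f_hi=8000.0):
    f0 = 1.0 / (2.0 * tau_s)  # first notch frequency
    n_hi = math.floor((f_hi / f0 - 1.0) / 2.0)
    n_lo = math.ceil((f_lo / f0 - 1.0) / 2.0)
    return max(0, n_hi - n_lo + 1)

for tau_ms in (0.1, 1.0, 10.0):
    print(tau_ms, "ms ->", notch_count(tau_ms / 1000.0), "notches")
# 0.1 ms -> 1 wide notch; 1.0 ms -> 8 notches; 10.0 ms -> 78 narrow notches
```

The jump from one wide notch at 0.1 ms to dozens of narrow notches at 10 ms is the basis of the misalignment argument.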
Mochimaru [131] points to the fact that, for acoustical summation, the
level difference between peak and notch decreases to less than 6 dB for
frequencies above the 3/2 λ notch. He states that if the 3/2 λ notch falls at a
frequency below the effective range of a loudspeaker, the resultant level variance
due to combing would be less than 6 dB. Thus Mochimaru recommends that a
minimum offset between arrivals, corresponding to 2 λ at the lowest reproducible
frequency, should be ensured through the use of electronic delay.
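Mochimaru's minimum-offset rule (two wavelengths at the lowest reproducible frequency) reduces to a delay of 2/f_low; a small sketch, with the delay returned in milliseconds:

```python
# Mochimaru's rule as described above: the offset between arrivals should be
# at least two wavelengths at the lowest frequency the loudspeaker
# reproduces, i.e. a delay of 2/f_low (returned here in milliseconds).
def min_offset_ms(f_low_hz):
    return 2.0 / f_low_hz * 1000.0

print(min_offset_ms(500.0))   # prints 4.0  (ms)
print(min_offset_ms(2000.0))  # prints 1.0  (ms)
print(min_offset_ms(100.0))   # prints 20.0 (ms, full-range enclosure)
```

These reproduce the 4 ms–1 ms range cited below for 500 Hz–2 kHz horn arrays, and the 20 ms figure for a 100 Hz low edge.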
For coupled point-source array configurations in which the low- and high-
frequency transducers are separate, such as in “long-throw” and “near-far” horn
arrays, the delay times required to create misalignment are quite short (2 λ at 500
Hz–2 kHz ranging from 4 ms–1 ms). However in more modern loudspeaker
arrays, with the majority of frequencies being reproduced by transducers in the
same enclosure, misalignment would require much larger delay times (e.g. 2 λ at
100 Hz would be 20 ms). Additionally, some sound system configurations make
it difficult to ensure that a minimum arrival offset can be maintained. For
example, when loudspeakers are located near the audience (e.g. an uncoupled
array of front-fill loudspeakers), large delay offsets may be needed to maintain
minimum offsets for listeners seated close to a loudspeaker whose signal is
intended to arrive second.
El-Saghir and Maher [53] offered a similar misalignment technique called
“milli-delay”. Again citing that very small time offsets between arrivals can
cause audible comb filtering, results of their study suggest the use of intentional
time offsets in the range of 10 ms–35 ms to compact the pattern of combing to the
point of inaudibility. As with Mochimaru’s study, El-Saghir and Maher focus on
coupled point-source arrays that are located some distance away from audience
members.
Augspurger et al. [11] propose a different method of intentional
misalignment. Their theory was that loudspeaker arrivals should be aligned as
closely as possible and that the use of electronic stereo synthesis, to pre-comb
loudspeaker signals, could effectively “scramble” the residual audible effects of
combing due to multiple arrivals. Through subjective testing to evaluate fidelity,
sharpness and brightness, it was found that this method could be effective in the
800 Hz–5 kHz range. However Augspurger et al. caution that the use of this
alignment technique could result in timbral changes and that the frequency region
of effectiveness is highly dependent on the ability to phase-match transducers. As
with the previous studies mentioned, the loudspeaker arrays used in this study are
located well away from audience members.
delay time between arrivals changes, creating a combing structure – but level
differential changes as well, resulting in reduced depth of comb notches [1]. For
uncoupled arrays, the spacing of loudspeakers is an additional factor in
determining the useable listening area. As is the case with coupled point sources,
the location(s) of highest volatility for uncoupled arrays is found where the levels
from two loudspeaker arrivals are equal. Again, alignment of arrivals at the point
of maximum volatility results in acceptable levels of spectral variance throughout
a larger listening area [124].
An additional viewpoint regarding the alignment of loudspeaker signals
has been offered by Brown [28], with regard to the preservation of stereo imaging
of reinforced sound in large rooms. With stereo reproduction, one must consider
not only the effects of the summation of coherent loudspeaker signals, but also the
summation of incoherent microphone signals. An example given is the use of a
spaced pair of microphones to capture a two-channel stereophonic sound image of
an instrument(s) (e.g. drum overhead). Each microphone will capture sound
arriving from its side of the instrument as well as a delayed sound arriving from
the opposite side. As time delays are involved, the summation of these
microphone signals could result in comb filtering [28, 124].
In his discussion of electrical vs. acoustical summation, Brown [28]
advocates the maintenance of stereo signals in supplementary satellite
loudspeaker systems. More relevant to this discussion, Brown also states that the
arrivals from satellite systems should lag the arrivals of the main sound system
and that, by employing the precedence effect, the main sound system will be the
perceived sound source.
have been as diverse as the transmission methods studied. The works cited herein
focus on determining the effects of multiple arrivals on intelligibility.
Steeneken & Houtgast [167] published a series of studies including the
effects of SNR and a single echo on intelligibility as measured by the full STI and
subjective tests (PB word list). Figure 2.21 shows the measured MTF’s for the
500 Hz band as well as PB word scores. As expected, inspection of the graphs for
the “+∞ dB” SNR (no noise) treatments reveals notches in the MTF at 10 Hz, 5
Hz and 2.5 Hz (for the 50 ms, 100 ms and 200 ms conditions respectively).
Figure 2.21 MTF graphs of the 500 Hz band, showing the interaction
effects of SNR and delay time on intelligibility (dark lines),
compared to the MTF graphs of the unaffected transmission
system (Reprinted with permission from [167])
However, two interesting things can be found in these graphs. The first is
that a higher noise level (lower SNR) appears to mitigate the effects of the echo –
the notches found in the no noise treatments are less defined in the graphs of the
higher noise conditions. Also the PB word scores, again for the no noise
treatments, are 96%, 93% and 94% respectively. What is not shown is that the
word score for the unaffected transmission system was 99%. It is interesting that,
though the reductions in word score were minimal, it would appear that there is at
least some correlation between the measurements and the subjective results. If
the echo were equal in level to the direct signal, it follows that the effect of the
delay would have been greater. Angle of incidence of the echo is not mentioned,
but is assumed to be from the same direction as the direct sound.
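The notch positions noted for the no-noise treatments follow directly from the echo delay: a single echo delayed by τ seconds cancels modulation at the frequency whose period is twice the delay, F = 1/(2τ). A quick check:

```python
# A single echo delayed by tau seconds produces an MTF dip at the
# modulation frequency whose period is twice the delay: F = 1/(2*tau).
def mtf_dip_hz(tau_s):
    return 1.0 / (2.0 * tau_s)

for tau_ms in (50, 100, 200):
    print(tau_ms, "ms ->", mtf_dip_hz(tau_ms / 1000.0), "Hz")
# 50 ms -> 10.0 Hz, 100 ms -> 5.0 Hz, 200 ms -> 2.5 Hz
```

The same relation explains the 12.5 Hz cancellation for a 40 ms echo discussed in connection with RaSTI below.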
An investigation by Peutz [144] using %ALcons indicates that, for a
horizontally arriving echo, intelligibility appears to be affected (loss of
0.5 %ALcons) with delay times as short as 15 ms. The results of the study further
indicate that intelligibility will decrease for delay times up to 45 ms (loss of 3%
ALcons), remaining constant beyond that delay time. Again it is unclear as to how
well this calculation correlates with subjective impression, and whether a
difference of 0.5–3 %ALcons would be detectable in practical application. In a
study by Davis and Davis [41] it is stated that echoes arriving within 3 ms of the
direct sound can affect intelligibility by creating deep comb filter notches in the
spectrum of speech. Though RaSTI and %ALcons measurements were made for
such delay times, it is unclear as to whether these findings are supported by
results from subjective evaluation.
Teuber and Völker [170] conducted a series of studies involving the use of
RaSTI to determine the effects of single and multiple echoes, and the use of
compensatory delay, on intelligibility. In a laboratory study, employing a digital
delay device to create echoes of the RaSTI test signal, Teuber and Völker
measured the effects of varied delay times and echo level. The results, shown in
figure 2.22, indicate that RaSTI values drop significantly between 20 ms and 60
ms, with a continued overall decrease as delay time increases. Similar results
were found when multiple echoes were injected into the measurement. As
mentioned previously, echoes create comb filters with spectra determined by
delay time. A summed echo with delay time of 40 ms will have its ½ λ notch at
12.5 Hz, which is very close to one of the modulator frequencies used by RaSTI
(see [110] for RaSTI MTF matrix). Similarly, echoes between 40 ms and 70 ms
will have an effect on the RaSTI MTF at the upper two modulator frequencies.
Echoes with longer delay times will result in ½ λ cancellations at lower modulator
frequencies with 3/2 λ cancellations eventually entering the MTF spectrum.
Considering that RaSTI only uses 9 measurements, it is debatable whether these
data are indicative of actual intelligibility.
Figure 2.22 Measured RaSTI values vs. delay time and echo level
obtained by injecting an electronically delayed echo into
the measurement process. The bottom line corresponds to
an echo with level equal to the direct signal (Reprinted with
permission from [170])
whether array geometry would affect the degree to which delay times between
multiple arrivals degrade intelligibility.
While it is questionable whether objective measurement methods are
capable of dealing with the types of phenomena addressed in this text, it is not the
author’s intention to discredit the use of objective measurements. On the contrary,
such methods are extremely useful, and further research can only serve to
improve these well established procedures. As such, one goal of the current
research project would be to add to the body of knowledge regarding the absolute
effects of multiple arrivals, evaluating the correlation between measurements and
subjective impression and incorporating effects due to variables such as array
orientation. However, a second more practical goal could be to determine the
weight, or credence, that should be afforded these potential effects. A sound
system designer is often, for a variety of reasons, forced to make compromises
that result in deviation from an ideal design. It would be valuable to know the
relative contributions of misalignment and array geometry to the reduction of
intelligibility, so that such knowledge could be factored into design decisions.
For this research project, it was decided to employ binaural recording to
capture the stimuli needed for subjective testing. Binaural recording
is a 2-channel recording method that relies on the creation and capture of sound
field modifications caused by the human anatomy [92, 132]. Binaural recordings
capture inter-aural level differences (ILDs), inter-aural time differences (ITDs)
and the shadowing effects and resonances caused by the geometry of the ears,
head and (if applicable) torso.
In theory, if all of the auditory cues found in an actual sound field are
present in, and delivered to a listener by, a binaural recording, the listener will
experience all of the cues found in the original sound field. As such, the listener
would be virtually transported to the original sound field [146, 173].
For subjective evaluations of sound system performance in a room, it is
necessary for listeners to experience the sound field which exists in the natural
2.3.1 Limitations
increase in both localization errors for sound sources in the medial plane and
errors in the judgment of sound source distance [129, 134, 135]. Given that
localization of sound sources in the medial plane does not rely on ILDs or ITDs,
and that ILD and ITD are not effective distance cues (for sources beyond 1 meter),
the aforementioned errors are most likely caused by differences in the head-
related transfer functions between the recording ear and the listener’s ear [150,
161]. In other words, the individual listener’s ear geometry is different from that
of the ear used for recording. The result is the presentation of inaccurate static
cues to the listener [134].
An additional complication is variance in the size and shape of the head
and torso between listeners. Though the design of head-and-torso simulators has
been based on anthropometric measurements of many thousands of people, the
resulting dimensions used for construction have been determined by averaging
measured data [31, 32, 35]. As with differences in ear geometry, differences in
head and torso size between the simulator and real listener will lead to inaccurate
presentation of static cues [64].
The question remains: would the use of binaural recording be viable for
the research project detailed in this document? The project focuses solely on the
intelligibility of speech reproduced by a sound system. As the entire sound
reproduction system is nominally in front of the listening position, and
reverberation is not a studied variable, the maintenance of sound cues resulting
from non-frontal incidence would be of little concern. Sound source distance
cues would also be of little concern. As seen in figure 2.23 from Møller et al.,
most localization errors are front-back confusions, elevation (above) confusions
and distance confusions [135]. Very few localization errors (when using a KEMAR manikin,
for example) occur between front and front high, or between ±45 degree locations
on the forward half of the horizontal plane.
Though care should be taken with regard to transduction and coupling
issues, the literature suggests that binaural recording would be acceptable for this
research project.
however differences exist even between standards. Burandt et al. [31] point out
that the dimensions listed in the standards are not representative of the size of the
average person. They also state that headphones often do not stay properly situated
on the HATS.
The HATS used for the capture of stimuli for this research project was the
Knowles Electronics Manikin for Acoustic Research (KEMAR). The KEMAR
HATS conforms to the geometrical and acoustical requirements of the ANSI and
ITU recommendations [7, 85, 129]. More detail on the specific KEMAR setup
used is included in chapter 3.1.4.
A HATS can create differences in ILD, ITD and spectrum between the
locations of the ears on the manikin. The addition of pinnae serves to simulate the
spectral shadowing and resonance functions of portions of the outer ear. However
if recording microphones are located at the eardrum position, ear canal simulators
are required to recreate the resonance and acoustical coupling functions of the
remainder of the outer ear [33, 159, 173].
A flat-plate coupler, with measurement microphone flush-mounted in a
hole in the center, is one way to measure the response of headphones [159, 172].
However, if it is desired to know the response of the headphone at the eardrum, an
ear-canal simulator is required.
An ear-canal simulator provides a bridge between the recording
microphone (artificial eardrum) and headphone driver. The simulator attempts to
replicate the resistive and reactive components of the acoustical impedance of the
human outer ear, thus providing a realistic acoustical load for the driver. As with
transfer gain of voltage through electrical circuits (e.g. a voltage divider), the
impedance of the headphone-driven load affects the transfer of sound pressure
from the entrance of the ear canal to the eardrum [132, 133, 177]. Thus, an ear
canal simulator that better approximates the loading characteristics of the outer
ear will provide more accurate sound pressures at the artificial eardrum. Note that
6
The effective frequency range of IEC 711 compliant ear canal simulators is
limited to frequencies below 8 kHz.
system and the resulting output of the system is the frequency-domain product of
the effects of the system (the transfer function) with the input.
If sound passes sequentially through more than one system, the resulting
output is the frequency-domain product of all of the individual transfer functions
with the input. Figure 2.25 details this type of compound transfer function for the
case of an individual listening to a sound system in a room. 7
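The frequency-domain cascading described above can be sketched in a few lines. The array names are hypothetical, and each transfer function is assumed to be sampled on the same frequency grid as the input spectrum.

```python
import numpy as np

def cascade_output(input_spectrum, transfer_functions):
    """Output spectrum after a signal passes sequentially through several
    linear systems: the product of the input with each transfer function."""
    out = np.asarray(input_spectrum, dtype=complex)
    for H in transfer_functions:  # e.g. sound system, room, outer ear, ...
        out = out * np.asarray(H, dtype=complex)
    return out
```

Because multiplication is commutative, the order of the systems in the chain does not change the resulting spectrum, though it matters physically for where a measurement or equalization point is placed.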
For the case of binaural recording and headphone playback, capture and
playback are two discrete processes (see figure 2.26) [146]. Comparison reveals
that the compound transfer functions of in situ evaluation and evaluation via
binaural recording/playback are quite similar. However the binaural method
contains several additional individual transfer functions: The concha and ear
canal of the manikin, the recording system and the headphone [64, 100]. If these
three elements were to be completely removed from the compound transfer
7
The author acknowledges that these are highly simplified diagrams, and that the
effects of the sound system and room, and the neurological and cognitive
processes of the auditory mechanism, are vastly more complicated than single
transfer functions would suggest.
Before work could begin on the series of studies detailed in this document,
it was first necessary to create stimuli for the research subjects to evaluate. It was
decided to employ the Modified Rhyme Test (MRT) for these studies and, as such,
the MRT word lists would need to be processed through a transmission line –
transmitter(s), medium and receiver(s) – in order to create the project stimuli. As
specific properties of the transmission line were the focus of these studies, the
different combinations of these properties were defined as treatments before the
recording process began. Once recording began, it was necessary to have a
procedure for manipulating and verifying that the properties of the transmission
line matched the desired properties of each treatment.
This chapter details the audio systems and equipment, variable treatments
and procedures used to obtain the stimuli which were used in both the pilot and
main studies of this project.
The venue used for recording stimuli was Corbett Auditorium. Containing
approximately 650 seats in the orchestra level and 200 in the balcony, it is the
largest of four performance spaces at the College Conservatory of Music (CCM),
University of Cincinnati. The auditorium is rectangular in shape, with a moderate
wall splay from the middle to the rear of the house, and has an approximate
volume of 16,000 m³. The volume is reduced to around 9,000 m³ when its large
orchestral shell is in place (as it was for this study), producing a measured average
reverberation time of 1.55 s (see Figure 3.1). 8 The ambient background noise
level ranges from a low of 30 dB SPL(A) to a high of 37 dB SPL(A) when the
HVAC system is active.
The construction of the theatre is primarily of wood; however, the side and
rear walls contain quadratic-residue diffusers, and the flat fascia of the balcony
contains absorptive paneling. The floating ceiling is composed of semi-reflective,
perforated acoustic panels. The seats, which were in the upright position for all
measurements, are composed of wood covered in porous upholstery. The floors
8
Reverberation time measurements conformed to ISO 3382:1997 [83] for “Low
Coverage” conditions, with the exception that a directional sound source (Genelec
1032a) was used. As all of the stimulus reproducing sound sources involved in
this study would be nominally directional, and the impulse responses generated
would resemble those from a directional source, the Genelec was considered to be
an acceptable substitute for an omni-directional sound source in these
measurements.
for the noise subsystem was set to 100 Hz, which eliminated the need to angle the
woofer enclosures. The mid-high-frequency loudspeakers were located 5.1
meters from center with a 30 degree angle towards center, yielding a 60 degree
subtended angle at the center listening position.
in the loudspeaker control processors such that all impulse responses were aligned
to the UPA. Time alignment was not performed on the noise subsystem aside
from phase alignment of the woofers to the mid-high loudspeakers.
Prior to level alignment, all input channel faders on the console were set at
unity and all power amplifier input attenuators (where applicable) were set to
−0 dB. The level of each loudspeaker was adjusted only at the outputs of the
mixing console. The level was set such that a MLS signal delivered to the input
of the mixing console, with pre-fader level of −12 dB FS in the channel strip,
would produce a signal that measured 80 dB(C) SPL at the measurement position.
Verification of level calibration was performed at the end of each day.
The Meyer UPA and center UPM (vertical array configuration) were
calibrated using the measurement system’s microphone in the manikin-center
position (see appendices A, B and C for details). The center and house-right
Meyer UPMs (horizontal array configuration) were equalized and initially level
and time aligned with the measurement microphone located on-axis with each
loudspeaker. Calibration of the left and right channels (noise reproduction) was
carried out with the measurement microphone located at the manikin-center
position.
For the studies in this project, it will be of interest to know the ratio of
levels between speech signals and noise signals (signal-to-noise ratio). In order to
calculate these figures, it will be necessary to know both the level of the noise
distracter and speech signals.
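Provided both levels are measured the same way (same weighting, same integration), the ratio follows directly as a difference in dB. A minimal illustration:

```python
def snr_db(speech_level_db, noise_level_db):
    """Signal-to-noise ratio as the difference of two levels in dB SPL.
    Assumes both were measured with the same weighting and integration."""
    return speech_level_db - noise_level_db
```

For example, a speech level of 80 dB against a noise distracter at 83 dB gives an SNR of −3 dB.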
Procedures for measuring sound pressure level in a sound field are well
documented (e.g. [126]). Psophometric filters can be used to produce
measurement results which more closely relate to perceived loudness.
Measurement integration times can be adjusted so that results represent
instantaneous or continuous (peak vs. average) sound pressure levels, and many
levels of averaging in between. Several problems arise, however, when
attempting to measure the level of running speech [118].
Figure 3.2 shows a fairly typical waveform for a piece of modern rock-
and-roll music. 9 Upon visual inspection, a clear difference is noticed between the
peak and average levels. In contrast, figure 3.3 shows the waveform of running
speech (MRT word list A). While there is still a clear difference between the
peak and average levels of the speech signal, there is also a good deal of
nominally silent space between utterances. These silent spaces are a result of the
pace at which the talker delivered the list of words.
9
This excerpt is taken from the intro section of “Sunday Morning After” by
Amanda Marshall, from the album Everybody’s Got a Story (as used in the
Technical Ear Training course at McGill University).
of reducing the level of the measured signal – like adding a large number of zeros
to the list of numbers to be averaged.
As the end goal is to compare the level of a noise signal with that of the
speech signal, the silence between speech signals is not of interest and should
somehow be removed from the measurement. This can be accomplished in one of
two ways: 1) measure only while the speech signal is present, or 2) remove the
silence from the signal prior to measurement.
In figure 3.4, we see the waveform of the same MRT list, but with the
silent spaces removed through editing (referred to as the condensed word list).
Measurement of this signal should produce a result which indicates the level of
the actual speech signals, rather than the level of the overall signal. Measurement
of this modified signal is in accordance with ITU-T P.56, wherein the level of
speech is measured including the brief low- to zero-levels (down to at least 16 dB
below the average speech level), but excluding noise that is not part of the speech
signal [84].
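A simplified, energy-gated version of this measurement can be sketched as below. This is not the full ITU-T P.56 algorithm (which tracks a speech envelope with a hangover time); it merely discards frames more than a fixed number of dB below the loudest frame, in the spirit of the threshold described above.

```python
import numpy as np

def active_speech_level(x, fs, frame_ms=20, gate_db=16):
    """Level of speech with silent gaps excluded (simplified sketch,
    not the full ITU-T P.56 method). Returns dB re full scale."""
    frame = int(fs * frame_ms / 1000)
    n = len(x) // frame
    power = np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1)
    # Keep frames within gate_db of the loudest frame; in P.56 the
    # threshold is referenced to the average speech level instead.
    active = power > power.max() * 10 ** (-gate_db / 10)
    return 10 * np.log10(np.mean(power[active]))
```

Feeding this function the unaltered word list or the condensed word list would mimic the difference between the measurement methods compared in table 3.1: the silent frames that drag down the average are simply excluded.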
Table 3.1 details four of the methods used to measure and calibrate the
level of speech signals. These methods differ in integration time (FFT size), the
number of measurements that are averaged together to produce the result and
speech signal used (stimulus). They also differ in measurement method:
Methods 3 and 4 rely solely on the measurement equipment, while methods 1 and
2 required the author to “only measure while the speech signal is present.”
The results of methods 1 and 2 have the greatest degree of variance. This
was to be expected considering that the unaltered word list was used. Also,
considering that the condensed word list was used for measurement methods 3
and 4, it is not surprising that the results of method 3 indicate levels which are
slightly higher. The results from method 4 however indicate that this increase in
level disappears with a sufficient number of averages.
It is clear that the different methods produce different results, both in
terms of overall level and range of variance. What is interesting is that the mid
points in the range of results for each method are quite close.
Parameters              Method 1             Method 2             Method 3            Method 4
Measurement Equipment   EASERA & Researcher  EASERA & Researcher  EASERA              EASERA
FFT Size                8192 (186 ms)        32768 (743 ms)       32768 (743 ms)      32768 (743 ms)
Number of Averages      2 (372 ms)           1 (743 ms)           4 (3 sec)           8 (5.9 sec)
Stimulus                MRT List A           MRT List A           Condensed & Looped  Condensed & Looped
Result (dB C-weighted)  78.6-81.6            78.9-80.8            80.5-81.7           79.5-80.6
Mid Point               80.1                 79.9                 81.1                80.1
Table 3.1 Methods used to measure the level of running speech, with
results.
[32]. The KEMAR unit used for these recordings was fitted with two neck rings
(two rings corresponding to the mean shoulder-to-ear height for males; the mean
for females is zero rings) and the “two-sigma-ears” (auricles). The ears were further fitted
with Industrial Research Products (IRP) model DB-050 ear canal extensions,
connecting to IRP model DB-100 Zwislocki-style occluded ear simulators. The
transducers used were Brüel & Kjaer type 4165 (½” diaphragm) microphone
capsules, which were connected to a Brüel & Kjaer type 2807 power supply via
type 2660 preamplifiers and type UA-0122 microphone adaptors (see appendix A
for schematic). Figure 3.5 shows the location and orientation of transducers,
couplers and adaptors inside of the KEMAR head.
Microphone signals were sent to the Yamaha PM5D mixing console for
fine level adjustment and A/D conversion. Signals were then routed to a
MOTU 896HD MKII for connection to a Macintosh MacBook and were recorded
using Cubase LE 4. All recordings were captured at a resolution of 96 kHz, 24 bit.
Prior to physical mounting in the KEMAR unit, the level of each
microphone was calibrated (re: 94 dB SPL @ 1 kHz) using a Larson Davis CAL-
200 acoustic calibrator and the measurement system in “live mode” (detailed in
chapter 3.1.2). Level adjustment was performed at input channel head amplifiers
in the mixing console. This calibration was repeated every time the KEMAR unit
was moved or if microphone signals were interrupted (power down of power
supply). Verification of calibrated levels was performed at the end of every day.
The KEMAR manikin was placed in a different location for each of the
two different loudspeaker array geometries studied (see appendices B and C for
plan and section views). The manikin-center position was located on the theatre’s
center line, 8.7 m downstage from the measurement origin and was used for
recording the vertical array configuration. The manikin mid-right position, used
to record the horizontal array configuration, was located 2.9 m house-right of
center and 8.6 m downstage from the origin. When in the mid-right position, the
manikin was angled slightly to face halfway between the two front fill
loudspeakers.
3.2 Stimuli
As mentioned, the Modified Rhyme Test (MRT) was selected for this
series of studies. Three of the six word lists (lists A, D and F) were used,
employing one trained, native English-speaking, male talker with no discernible
accent. Each of the test stimuli (target words) was embedded in a carrier sentence.
The original recordings were purchased through a distribution company,
but appear to have been made in a non-reverberant vocal recording booth. 10 It
should be noted that in the set of 50 six-word ensembles that were used in these
studies, several ensembles differ from those detailed in the ANSI standard [5].
10
The raw MRT recordings were purchased through Cosmos Distributing, 4744
Westridge Dr, Kelowna, B.C. Canada V1W 1A1.
The three MRT lists were delivered through the reproduction sound
system under a variety of combinations of the experimental variables (treatments).
Without knowing where the results from the series of studies would lead,
and considering the difficulty of coordinating equipment and facilities, it was
decided to record stimuli using a wide variety of variable treatments. The
experimental variables in question for these studies would be 1) delay time
between the arrivals from two loudspeakers, 2) level offset between the multiple
arrivals and 3) array geometry (vertical vs. horizontal array). The three MRT lists
were recorded for each of the variable combinations (treatments) in the 6×2×2
matrix formed by delay time, level offset and array geometry. The fourth
experimental variable, Signal-to-Noise Ratio (SNR), would be synthesized (via
electronic mixing) at a later date (see chapter 5.2).
Delay Time (ms) 0 5 10 20 30 40
Level Offset (dB SPL) 0 6
Array Geometry Vert Hor
Noise Level (dB SPL) ≈ 30 68 71 74 77 80 83
Table 3.2 System optimization variable values used during the
capture process
One possible approach toward studying the variable array type would be
to isolate all of the associated variables (e.g. on- vs. off-axis listening, monaural
vs. binaural listening and measurement/equalization location). However, as
mentioned in chapters 1.2 and 2.2.1, the ecological/organic approach of this
research project led to the inclusion of two commonly employed loudspeaker
array types. The result of this approach is that array geometry becomes a
complex, or compound variable – an amalgam of several variables.
On one hand, one could view array geometry as an insufficiently
controlled or nuisance variable. Alternatively, one could view this as a study of
inherently off-axis from both loudspeakers. Thus, while the comb filters observed
in the off-axis response of the front fills could be corrected via equalization, this
would not be done in practical application. As such, the equalization of all
loudspeakers was performed using on-axis measurements.
Figure 3.6 Magnitude vs. frequency response of the UPA and two
UPM loudspeakers, as measured on the axis of each
loudspeaker
3.3 Procedures
different sound pressure levels, as delivered by the noise reproduction system and
recorded at each of the manikin locations (12 noise recordings total). If unwanted
noises were detected during any of the recordings (doors, airplanes, piano movers,
etc.), said recording was restarted.
The listening tests conducted in the pilot and main studies would use a
head-mounted auditory display to deliver the stimuli to subjects. This raised the
obvious question of which specific auditory display to use. Toole [173] suggests
that the preferred solution is to use earphones which are inserted into the ear canal.
This type of device generally offers greater response consistency because 1) it
altogether avoids reflections caused by the concha and pinna, and 2) it is
easier to achieve consistent device placement on/in the head, and thus more
consistent coupling of the drivers to the eardrum [173].
While in-ear devices are preferable, for this study it was decided to focus
on circum-aural and supra-aural headphones for the sake of practicality, comfort,
equipment availability and hygiene. The goal therefore was to identify the pair of
available headphones which provide the most consistent mounting on a listener’s
head and eardrum, and which also have a reasonably flat magnitude versus
frequency response.
4.3 Results
The results from these measurements can be seen in figures 4.1 through
4.4. The figures show the results for each headphone, with repeated
measurements superimposed, on a graph of magnitude versus frequency using
1/24th-octave smoothing.
One can see from these graphs that several of these devices do indeed fail
to be consistently mounted on the manikin’s head. 11 In the graph for the supra-
aural MDR-7506 (figure 4.1), for example, there is variance in low-frequency
response indicating variation in the quality of seal between the headphone and the
pinna. While fairly consistent in the middle frequencies, note the variance due to
reflections in the outer ear at and above 2 kHz. The MDR-V600 (figure 4.2)
produced similar results though the degree of variance was much higher. The
graphs for these two headphones were evaluated during the measurement process
and, after only five repeated measurements, it was concluded that these two
headphones would be inadequate.
In figure 4.3, one can see that the response of the RS-1 is consistent (±1
dB) up to approximately 4.7 kHz, indicating an acceptable seal between the
headphone and the manikin’s pinna. However, the degree of variance above this
11
The harmonic noise found in the low-frequency region of each graph was
caused by the power supply of the measurement laptop.
point is an indication of changes in the pattern of reflections in the outer ear. This
suggests inconsistency in the placement of the headphone, likely due to the fact
that the headphone is supra-aural.
While there are some irregularities at the very low and very high
frequencies, the response curves for the HD-650 (figure 4.4) are consistent (±1 dB)
in the range from 40 Hz–8 kHz, which is the functional range of the Zwislocki-
type ear canal simulator [80]. This indicates that this circum-aural headphone is
making a consistent seal with the manikin’s head. It also indicates that though
there is evidence of some degree of change in the pattern of reflections in the
outer ear, these changes are not affecting the frequency range (125 Hz–8 kHz)
that impacts speech intelligibility [110].
4.4 Discussion
The two pro-audio quality headphones (Sony MDR-7506 and MDR-V600) both
failed to meet the first two criteria. The RS-1 met the second but failed the first.
The HD-650 was the only headphone to meet both of the first two criteria (for 40
Hz–10 kHz).
Though the various headphones produced varying degrees of consistency,
none of the headphones tested produced a response curve which could be
considered flat. Remembering that measurements were made at the eardrum of
the manikin, it would be useful to consider the transfer function of the path
between the headphone driver and the eardrum of the manikin.
Figure 4.5 shows the response of a KEMAR manikin with a Zwislocki-
type occluded ear canal simulator [90, 185]. When comparing this response with
that of the HD-650 one notes that both contain a prominent narrow peak at
approximately 2.7 kHz and a wider peak centered at roughly 4 kHz. As noted by
Klepko [92], the peak between 2 kHz and 3 kHz is a result of resonance in the
concha and is not directionally dependent. This frequency response trend is also
reported in the ITU-T P.58 recommendation [85] regarding free and diffuse field
response tolerances for head-and-torso simulators.
By visually removing the effects of the manikin, the resulting response for
the headphone would contain 1) a slight elevation in the low frequencies, 2) high-
frequency roll-off starting at around 9 kHz, and 3) the effects of outer ear
reflections. Given that the headphone response will show evidence of outer ear
reflections for any headphone-ear coupling, and that the pattern of reflections will
change from ear to ear, it was decided to focus on the components of the
headphone response that are not reflection-based.
Figure 4.6 shows the response for the HD-650 with 1/3rd-octave
smoothing. This level of smoothing, though inappropriate for many applications,
is used here to remove many of the effects of outer ear reflections through
averaging. What remains is more a spectral tilt than a response, but it allows for
better inspection of the point of high-frequency roll-off. Again, visually
removing the effect of KEMAR from the response, one can see that the response
Figure 4.7 Free- and diffuse-field responses for blocked ear canal
(Reprinted with permission from [14])
4.5 Conclusions
The Sony MDR-7506, Sony MDR-V600 and Grado RS-1 do not meet the
declared criteria of suitability for use in this research project.
The Sennheiser HD-650, through 10 repeated measurements, met the first
two criteria. The device produced results which indicated consistent reflection
patterns in the outer ear, coupling of the headphone to the head and coupling of
the driver to the eardrum in the frequency range from 40 Hz–10 kHz, which is
acceptable for this research project.
As for the third criterion, the Sennheiser HD-650 has an acceptably flat
frequency response from 50 Hz–8 kHz. Though the high-frequency limit is lower
than would be desired, a reasonably conservative high-shelf filter, applied during
stimulus equalization, could conceivably extend the response to 10 or 12 kHz
with minimal degradation to the integrity of the stimuli. As such, the HD-650
would meet all of the criteria and is thus acceptable for use in this research project.
5. Preparation of Stimuli
The stimuli that were acquired from the Corbett Auditorium recordings
would need to be processed before they could be used in listening tests. In their
captured form, the stimuli consisted of 24 wave files, each containing three 50-
word MRT lists recorded sequentially and without additional noise, and twelve
wave files containing noise recorded at different levels and locations.
The stimulus files that would be used for this project’s studies would need
to be equalized, mixed with noise recordings and then parsed. The result would
be 108 wave files used for the pilot, 2400 files for phase 1, and 1200 files for
phase 2.
5.1 Equalization
The use of IIR filters was rejected due to this filter type’s potential for
causality and instability issues [140]. If an FIR filter were to be used
to invert the impulse response of the headphone-KEMAR combination, this
would lead to narrow-band notches and peaks in the region above 4 kHz which, as
stated above, would not be appropriate. The remaining options are parametric
equalization or no equalization.
As seen in figures 4.5 and 4.6, both responses have a prominent peak in
the mid- to high-frequency region. It is reasonable to conclude that some sort of
equalization would be needed, and this was verified upon listening to the raw
stimulus recordings. It was decided that the solution would be to employ
parametric equalization filters to cancel the narrow peak at 2.7 kHz and the wider
peak centered at 3 kHz, and to smooth the overall spectral tilt of the response.
Though slight differences exist (likely due to the response of the headphones),
this is essentially the diffuse-field corrective equalization method used by the
Etymotic ER-11 KEMAR microphone preamplifier, with the addition of a high-
frequency boost similar to that used by Toole [173].
The corrective equalization detailed in table 5.1 was applied, before
merging and parsing, to all sound files obtained from the Corbett Auditorium
recordings using Sony’s Sound Forge software package. A graph comparing the
original and corrected transfer functions is shown in figure 5.1.
Filter Type  Frequency (Hz)  B.W. (Octaves) / Slope (dB/Oct.)  Gain     Purpose
PEQ          2700            0.4                               −6 dB    Cancel narrow peak
PEQ          1600            1.0                               −2.5 dB  Cancel wide peak
PEQ          3000            2.5                               −5 dB    Cancel wide peak
PEQ          150             1.6                               −3 dB    Spectral tilt
H-Shelf      6500            6                                 +6 dB    Spectral tilt
Table 5.1 Corrective equalization settled upon for use on all stimuli
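As one concrete realization, a parametric (peaking) section like those in table 5.1 can be expressed as a biquad using the widely used "Audio EQ Cookbook" formulas. This is a sketch of the filter type, not the internals of the Sound Forge processor actually used.

```python
import numpy as np

def peaking_eq(fs, f0, gain_db, bw_octaves):
    """Biquad coefficients (b, a) for one parametric peaking filter,
    following the RBJ Audio EQ Cookbook bandwidth form."""
    A = 10 ** (gain_db / 40)                  # sqrt of the linear peak gain
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) * np.sinh(np.log(2) / 2 * bw_octaves * w0 / np.sin(w0))
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]
```

At the center frequency the filter's gain is exactly gain_db; for instance, peaking_eq(44100, 2700, -6, 0.4) realizes a filter matching the first row of table 5.1 at a 44.1 kHz sampling rate.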
As mentioned in chapter 3.3.1, the MRT word lists (recorded under each
of the 24 treatments) and twelve different noise distracter recordings were
captured separately. Considering that the ambient noise level of the recording
space was 30‒40 dB SPL below the level of the speech and noise distracter
signals, the electronic summation of two such signals would produce a negligible
increase in system noise level. It was also considered that speech and noise
distracter signals would be produced by separate loudspeakers, thus there was no
possibility for inter-modulation distortion between signals within a loudspeaker.
As such it was determined, through discussions with advisors, that it would be
viable to use electronic rather than acoustic summation of the speech and noise
distracter signals.
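The electronic summation described in this section can be sketched as follows. The function and its RMS-based level calibration are an assumption about implementation, not the author's exact procedure.

```python
import numpy as np

def mix_at_snr(speech, noise, target_snr_db):
    """Electronically sum a speech recording with a noise recording,
    scaling the noise so the mix has the target SNR (RMS-based sketch)."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    current_snr = 20 * np.log10(rms(speech) / rms(noise))
    gain = 10 ** ((current_snr - target_snr_db) / 20)  # linear noise gain
    return speech + gain * noise
```

Because the two signals were recorded separately, any target SNR can be synthesized from one pair of recordings, which is what made the later manipulation of SNR (chapter 5.2) practical.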
5.3 Parsing
After the appropriate noise level was added, the stimulus files would need
to be parsed into separate and new files of the desired size/number of MRT words.
The stimulus files that would be used for the pilot study would contain a single
50-word MRT list. Each file was approximately three minutes long.
The stimulus files that would be used for both phases of the main study
would be further parsed, such that each 3.5 sec wave file contained a single word
from an MRT wordlist. The composition of the files used for both phases of the
main study was as follows: the noise fades up over 0.75 sec, the target word
plays within the carrier sentence, and the noise then fades out over 0.5 sec.
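The file composition just described can be sketched as follows. The sampling rate, array names and linear fade shapes are illustrative assumptions.

```python
import numpy as np

def build_clip(word, noise, fs, fade_in=0.75, fade_out=0.5, length=3.5):
    """Assemble one 3.5 s stimulus: noise fades up over 0.75 s, the
    carrier sentence with its target word plays, and the noise fades
    out over 0.5 s (illustrative sketch of the described layout)."""
    n = int(length * fs)
    env = np.ones(n)
    up, down = int(fade_in * fs), int(fade_out * fs)
    env[:up] = np.linspace(0.0, 1.0, up)        # linear fade-up
    env[-down:] = np.linspace(1.0, 0.0, down)   # linear fade-out
    clip = noise[:n] * env
    # Speech (already mixed at its calibrated level) starts after the fade-up.
    clip[up:up + len(word)] += word[:n - up]
    return clip
```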
6. Pilot Study
Given that it takes approximately 3‒5 minutes for one subject to evaluate
one MRT word list, it could take a single subject over 40 hours to fully evaluate a
7×6×2×2 matrix of variable treatments using three word lists per treatment. It is
fairly obvious that this type of study would be foolish to attempt and impossible
to complete. It was therefore decided to employ a pilot study in this research
project, the goal of which would be to reduce the total number of variable
treatments to a more manageable figure.
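The 40-hour figure above follows from simple arithmetic, spelled out here using the 5-minute upper bound per list:

```python
treatments = 7 * 6 * 2 * 2       # SNR x delay time x level offset x geometry
lists = treatments * 3           # three MRT word lists per treatment
hours = lists * 5 / 60           # at roughly 5 minutes per 50-word list
print(treatments, lists, hours)  # 168 504 42.0
```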
After the stimulus preparation process was completed, the author (and
others) listened to several of the MRT word lists as recorded under several
variable treatments. It was noted that word identification was extremely easy for
treatments that contained no added noise, and that identification was extremely
difficult for treatments that contained the highest level of added noise. This
preliminary evaluation, coupled with data from the literature [45, 107, 144], led to
the conclusion that it would be important to identify the range of the variable
Signal-to-Noise Ratio (SNR) that would yield variance in intelligibility scores
without overpowering the effects of the other experimental variables.
The pilot study would have three main purposes:
1) Identify a useful range for the variable Signal-to-Noise Ratio
2) Identify a range of values for the other experimental variables that
would be of interest for subsequent studies
3) Determine whether any issues or problems exist with testing procedures
or principles.
6.1 Hypotheses
such the hypotheses that could be tested would likely have to be limited to first-
order effects if there would be any hope of finding statistical strength or
significance in the results. The resulting hypotheses were as follows:
Ho1: Delay time between multiple arrivals does not affect the
intelligibility of speech reproduced by a sound system.
Ho2: Signal-to-noise ratio does not affect the intelligibility of speech
reproduced by a sound system.
Ho3: Array geometry does not affect the intelligibility of speech
reproduced by a sound system.
It should be noted that while the second null hypothesis has been rejected
many times in many different studies (e.g. [45, 107, 144]), its presence in this
study would serve as a test of the methods and procedures of the study itself.
At 40 hours per subject, the full set of variable treatments would be too
large to evaluate even in a pilot study. As such it was necessary to construct a
reduced set of variable treatments – one that would allow for the fulfillment of the
pilot’s three main purposes and also provide insight towards evaluating its three
hypotheses.
As can be seen in table 6.1, it was decided to use all but one of the
possible values for the variable SNR. During the author’s preliminary evaluation
of the stimuli (noted above), it was determined that the highest SNR value that
still contained added noise (SNR = 12 dB) would not have a significant impact on
intelligibility scores. While the highest SNR value (SNR = 50 dB, the condition
with no added noise) was also believed to be inadequate to provide significant
impairment for subjects, it was left in this study as a proof of concept to show the
need for the manipulation of SNR.
Delay Time (ms):     5, 20, 40
Approx. SNR (dB):    50, 9, 6, 3, 0, −3
Array Geometry:      Vert, Hor
Table 6.1 Variable values used in the pilot study.
6.3 Equipment
Subjects in the pilot study listened to stimuli via headphones and recorded
their answers on provided response sheets (see appendix J). The stimulus
playback system consisted of a MacBook laptop computer, MOTU 896 HD audio
interface and a pair of Sennheiser HD-650 headphones. Stimulus audio files were
played using iTunes. As mentioned in chapter 5.2, all audio files had a sample
rate of 44.1 kHz and 16-bit resolution. The track labels for each stimulus indicated the order in
which the files should be played, and made no mention of the underlying
experimental design.
The level of playback was calibrated using a hand-held sound level meter
(IEC 651 Type II). The meter was attached to a 6 cc coupler to approximate ear
canal loading effects, and positioned on the left headphone using a flat-plate
coupler. Playback level was adjusted at the MOTU such that playback of a
stimulus set which contained the loudest noise level (SNR = −3 dB) would
produce a measured result of 83 dB SPL (C-weighted, slow integration) – the
level of the original sound field, as recorded in the original acoustical
environment.
6.4 Subjects
This study used four native English-speaking subjects, three male and one
female, ranging from 24 to 31 years of age. All subjects were professional audio
engineers with a Bachelor’s degree or higher in fields relating to sound
engineering. All subjects were verified to have unimpaired hearing (re: 25 dB HL
at octave frequencies from 250 Hz to 8 kHz) through the administration of
hearing acuity tests [62]. Subjects were compensated $5 USD for each listening
session.
6.5 Locations
6.6 Procedures
All four of the subjects evaluated the 108 stimulus sets in nine sessions,
each session containing 12 stimulus sets and taking approximately 36 minutes (45
minutes including breaks) to complete. At the beginning of the first session, each
subject was given written and oral instructions regarding the types of sounds they
would be evaluating, operation of the playback device and the method of response
(see appendix H for instructions).
The subject would then undergo a training process to become familiarized
with the stimuli and testing procedures. The training used for this study involved
the evaluation of 6 stimulus sets comprised of: All three word lists delivered
under variable treatment 1 (5 ms, 50 dB, Vertical Array), list A under treatment
19 (5 ms, 50 dB, Horizontal), list D under treatment 18 (40 ms, −3 dB, Vertical)
and list F under treatment 29 (20 ms, 3 dB, Horizontal). Though this is obviously
not all-inclusive of the total number of treatments used in this study, this
combination follows established recommendations by providing subjects with the
opportunity to hear all of the individual words that will be presented (under one of
the most intelligible variable treatments), and experience the magnitude of the
differences between auditory attributes of the various treatments to be used in the
study [14]. This level of training was deemed appropriate as this study employs
identification rather than discrimination or scaling tasks, and does not involve the
study of listener preference.
Once the training process was complete, the test administrator spoke with
the subject to verify that they understood the instructions and operation of the
The response sheets for the listening tests were scored and the results
recorded as the number of correct responses out of 50. Though the probability of
a subject randomly guessing the correct answer is quite low (16.7%) for a six-
alternative forced choice task, it was decided that the results should be adjusted to
account for this probability, as is recommended by ANSI standard S3.2 [5], using
the following equation:
Adjusted Score (Ra) = (Correct Responses − Incorrect Responses) / (Number of Choices − 1)    (Eq. 6.1)
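Equation 6.1 can be implemented directly. A minimal Python sketch (the function name and default are illustrative, not taken from the study software):

```python
def adjusted_score(correct, incorrect, n_choices=6):
    """Chance-adjusted intelligibility score per Eq. 6.1:
    Ra = (correct - incorrect) / (n_choices - 1)."""
    return (correct - incorrect) / (n_choices - 1)

# 38 of 50 words correct (76%) in a six-alternative forced-choice task:
print(adjusted_score(38, 12))   # 5.2
```

With all 50 responses correct the formula returns the maximum adjusted score of 10.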
Thus the 50 dB SNR variable value should be excluded from future studies and,
in fact, from much of the analysis in this study. Similarly the distribution of the
data for the 9 dB case should be considered. Though not exhibiting an obvious
ceiling effect, the data shows that perfect scores (10 out of 10) were recorded,
suggesting that it would be possible to find such an effect if a larger subject
population were used.
Figure 6.1 Box and whisker plot of adjusted score vs. SNR for all
tested levels of SNR.
For the data sets corresponding to the remaining SNR values, there
appears to be variance between and within groups with no evidence of a ceiling
effect. For the set of treatments where SNR = −3 dB however, note that the
highest recorded adjusted score is a 5.2, which in terms of percentage of words
correctly identified corresponds to a score of 76%. This suggests that the level of
the noise present in these treatments could be too high, making it too difficult for
subjects to perform the necessary identification tasks. This data also suggests that
the strength of the effect of SNR at this level may overpower the effects of the
other experimental variables. Also, considering that variability is higher for lower
scores [156], as can be seen from the results in figure 6.1, the SNR value of −3 dB
should likely be excluded from future studies.
The data from the remaining three groups of SNR values show that there
is variance within each group and between groups, allowing for both the study of
the effect of SNR and the effects of the other experimental variables within SNR
groups. For future studies it would therefore make sense to exclude the SNR
values of −3 dB, 9 dB and 50 dB. With the exception of investigating the
effectiveness of listener training and within-study learning, the remainder of the
pilot study analysis will also exclude the data from treatments including these
three SNR values. The remaining data is referred to as the SNR stratified data set.
A frequency plot (figure 6.2) of the original data set confirms that the data
is left-skewed and that the upper limit of the range of adjusted score may be the
culprit. A frequency plot of the SNR stratified data set shows that ceiling effects
have been eliminated and that degree of skew has been reduced (figure 6.3). The
result is a distribution that more closely resembles a normal distribution, though
still exhibiting a bi-modal shape.
The standard method for the analysis of numerical data obtained from
listening tests is analysis of variance (ANOVA) [14]. The ANOVA process
however makes several assumptions, including assumptions about the
homogeneity of variance of the data to be analyzed [153]. Table 6.2 shows the
results of two standard tests for homogeneity of variance for both the original and
SNR stratified data sets. As can be seen from the probability statistics from each
of these tests (p < 0.001), it is clear that neither of these data sets contains
Figure 6.4 Detrended normal Q-Q plot of adjusted score in the SNR
stratified data set.
As it is the standard (and expected) analysis method, ANOVA results will also be
presented.
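The workflow of checking the ANOVA assumptions and reporting a non-parametric test alongside can be sketched with SciPy (the group data here are synthetic, for illustration only):

```python
import numpy as np
from scipy.stats import levene, f_oneway, kruskal

rng = np.random.default_rng(0)
# Synthetic adjusted scores for three hypothetical SNR groups:
groups = [rng.normal(mean, 1.0, size=24) for mean in (8.0, 7.0, 5.5)]

lev_stat, lev_p = levene(*groups)   # homogeneity-of-variance check
f_stat, f_p = f_oneway(*groups)     # parametric one-way ANOVA
h_stat, h_p = kruskal(*groups)      # non-parametric Kruskal-Wallis

# A small lev_p means the homogeneity assumption is violated and the
# Kruskal-Wallis result is the safer of the two to rely on.
```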
The results of the first-order analysis of the data are shown in table 6.3.
Not surprisingly, SNR and subject have a very significant effect on the results.
As mentioned previously, the effect of SNR on the intelligibility of speech
through transmission systems is well documented (e.g. [45, 107, 144]). It is also
easy to conceive that different people, having different listening skills and hearing
acuity, might have differing performance in this type of test. The performance of
one subject was of particular interest and will be detailed in chapter 6.8.
Variable                 ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Subject                         7.277       <0.001        14.642      0.002
Word List                       2.396        0.094         4.289      0.117
Delay Time (5–40 ms)            3.622        0.028         7.104      0.029
Delay Time (5–20 ms)            0.237        0.627         0.201      0.654
Delay Time (20–40 ms)           4.151        0.043         4.316      0.038
Array Type                      9.262        0.003         9.120      0.003
SNR                            86.029       <0.001       104.528     <0.001
Table 6.3 Results of ANOVA and Kruskal-Wallis tests for first-order
effects of experimental variables on adjusted score (SNR
stratified data set)
6.8 Discussion
The first, and probably most obvious, point that might be made is that
more data points are required before this project’s research questions can be
adequately addressed. Yet even with the small data set generated by this
preliminary study, there are several things that can be learned and applied to the
remainder of this project.
From table 6.3 it can be seen that array geometry has a fairly strong and
very significant effect on scores (K-W test: χ² Stat = 9.120, Sig. = 0.003) in the
SNR stratified data set. If one takes a moment to consider the actual array
geometries, the cause of this effect becomes clear.
The choice to use separate sets of noise recordings for the different array
types is another possible factor that could have contributed to the observed effect
of array type. Because two different sets of noise recordings were used, noise
ultimately must be viewed as an insufficiently controlled variable, and the
possibility must therefore be considered that it was the cause of the observed effect.
Figure 6.5 Box and whisker plot of adjusted score vs. array type for
SNR stratified data set
As mentioned earlier, one of the main purposes of this pilot study was to
identify problems with the experimental design and methods before proceeding to
the main studies. The use of two different noise sets is a design element that
would apparently fall into that group.
In determining whether there were issues with the testing procedures and
principles, one area of particular interest was the training of listening test subjects.
Subjects need to engage in sufficient training such that they understand how to
use the testing apparatus, understand the method and/or scales used to indicate
responses, and are familiar with the magnitude of variance of auditory attributes
between the variable treatments that will be presented during the tests. This
raises the question: how much training is enough? A subject’s time is limited and,
given the tradeoff between training time and testing time available in a testing
session, it is important to find an appropriate balance between the two.
If a subject’s training is insufficient, one way that this may manifest itself
is in a noticeable improvement in subject performance over the course of the
experiment. Table 6.9 shows the results of the analysis of the effect of
presentation order on adjusted scores. Remembering that the presentation order
of variable treatments was randomized for each individual subject, a presentation
order effect would likely indicate some form of learning on the part of the
subjects. As can be seen, the results indicate that presentation order does not have
a significant effect on scores. Considering the large number of word lists that
each subject was asked to evaluate, and the fact that evidence of a learning effect
could be obscured by such a large data set, the analysis was repeated using only
the first 10 lists evaluated by each subject. Again, the results indicate that there is
no significant effect. Combined with positive feedback from subjects, the data
leads to the conclusion that the amount of training provided was adequate.
Variable        ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Order (all)            0.888        0.765        97.429      0.735
Order (1–10)           0.846        0.581         9.473      0.395
Table 6.9 Results of ANOVA and Kruskal-Wallis tests for the effect
of stimulus presentation order on adjusted score (original
data set). Results are shown for all word lists evaluated and
for the first 10 sets evaluated by each subject.
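The per-subject randomization of presentation order described above can be sketched as follows (the treatment IDs and seeding scheme are illustrative, not the study's actual procedure):

```python
import random

TREATMENTS = list(range(1, 37))   # hypothetical treatment IDs 1..36

def presentation_order(subject_id, seed=0):
    """Return an independently shuffled treatment order for one subject.

    Seeding on (seed, subject_id) makes each subject's order
    reproducible while still differing between subjects.
    """
    rng = random.Random((seed, subject_id))
    order = TREATMENTS[:]          # copy so the master list is untouched
    rng.shuffle(order)
    return order
```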
Bech and Zacharov [14] discuss listening test duration and the importance
that a researcher must place on preventing subjects’ loss of attention or boredom.
They note that 20 minutes appears to be a suitable duration for a listening session,
and that sessions lasting 30‒40 minutes are also acceptable if subjects are able to
take breaks when they feel themselves getting fatigued or bored.
The listening sessions for the study detailed in this chapter, which lasted
approximately 36 minutes, fall into the latter of these groups. As such, subjects
were advised, verbally and in writing, regarding the importance of breaks. As can
be seen in figure 6.7, there is some question as to whether this session duration is
appropriate for all subjects.
Figure 6.6 is a reprise of figure 6.1 from chapter 6.7, with the addition of
outlier identification. It is interesting to note that all of the outliers come from the
data set of subject #4, whose data point indices range from 325 to 432. During
the debriefing after the subject’s final session, the subject indicated to the author
that there were several points during the overall testing process when the subject
was aware of the onset of fatigue and probably waited too long before taking a
break. Figure 6.7 further shows that the data set acquired from subject #4
contains a significantly different range of variance towards the lower bounds of
the scale. The outlier seen in the data for subject #3 suggests that this individual
may have experienced a similar period of fatigue.
Figure 6.6 Box and whisker plot of adjusted score vs. SNR for all
tested levels of SNR. Outliers are identified by data point
index number.
Figure 6.7 Box and whisker plot of adjusted score vs. subject (SNR
stratified data set)
The information from the debriefing and from figures 6.6 and 6.7
reinforces the importance of breaks in the testing process. It is clear that for
future studies session duration and break spacing should be reevaluated, and
possibly adjusted, to avoid issues arising from listener fatigue.
both statistical tests fail to rule out the possibility that the effect is solely due
to chance.
Variable    ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Word List          2.396        0.094         4.289      0.117
Table 6.10 Results of ANOVA and Kruskal-Wallis tests for the effect
of MRT word list on adjusted score (SNR stratified data set)
Figure 6.8 Box and whisker plot of adjusted score vs. word list (SNR
stratified data set)
will always be “SIT”, the challenge is lost and the evaluation of further stimulus
sets becomes moot. However if the choice of word list used for evaluation of a
particular treatment will have no effect on the results, it would be possible to have
subjects evaluate stimulus treatments using only one list, randomized by treatment.
This would amount to a 66% reduction of total test size and time.
It has been mentioned that, given the meager amount of data, the results of
this study should be viewed as a preliminary venture. Significance seen (or more
likely, not seen) in these results may be purely a result of the size of the study.
After all, the data contains only 12 data points for any given treatment. However,
with 144 data points per word list, confidence in the results can be somewhat
higher. When one considers the potential for design reduction that this variable
presents, it is certainly worth further investigation.
6.9 Conclusions
As stated at the beginning of this chapter, the three main purposes of this
study were to find a useful range for the variable SNR, attempt to reduce the
number of treatments needed for future studies and determine if there are any
problems with the design of the study or testing procedures. Additionally, three
research questions were posed in the form of hypotheses.
6.9.1 Hypotheses
Ho1: Delay time between multiple arrivals does not affect the
intelligibility of speech reproduced by a sound system.
From the results of both parametric and non-parametric tests, there is
sufficient evidence to reject this null hypothesis. The results indicate that the
variable delay time does have a significant effect on intelligibility scores. The
tests also indicate that this variable begins to have an effect somewhere in the
region between 20 ms and 40 ms.
This research project was charged with finding correlations between sound
system optimization parameters and speech intelligibility. What originally
resulted was a four-dimensional matrix of variables containing 168 possible
variable treatments. This pilot study was implemented as a way to cast a broad
net over the research questions, in an attempt to reduce the size and complexity of
the task at hand. To that end, the study was a success. In addition, the net that
was cast has returned some information that will serve as veritable guideposts for
the following main study. Issues with study design and testing methodology have
been identified and addressed, and preliminary data has indicated that there is
indeed value in continuing the project.
The pilot study began with three purposes and three questions. Heading
into the next study, some answers, and many more questions, have been
discovered.
The pilot study was able to significantly reduce the size of the project’s
variable matrix from 7×6×2×2 (168 treatments) to 3×6×2×2 (72 treatments).
Additionally the results from the pilot, though not conclusive, do suggest a range
of interest for the variable delay time, which could further reduce the size of the
matrix to 3×3×2×2 (36 treatments). Finally, the results from the pilot suggest that
it may be possible to reduce testing time by reducing the number of MRT word
lists needed to evaluate each treatment.
36 treatments would be a manageable set of stimuli if the number of
required word lists were reduced. However, suggestions are not conclusions and,
at this point, it has not been conclusively shown that it would be appropriate to
carry out the requisite reductions in the treatment matrix. Though it is an
improvement over the size of the original treatment matrix, 72 treatments remains
a prohibitively large number of stimuli to attempt conclusive subjective
evaluation. It is clear that further reduction is necessary.
This first phase of the main study will serve as an intermediary stage
between the pilot study and the second phase of the main study. It could be
considered a method of successive approximation: Each study in this project
seeks to evaluate hypotheses and facilitate greater clarity and statistical strength in
subsequent studies. Casting a less broad net over the variable treatments, the
main goal of this phase would therefore be the further reduction of the number of
variable treatments such that the size of the treatment matrix used in the second
phase would be manageable.
This first phase of the main study would have 4 main points:
1) Attempt to further reduce the number of treatments of interest
2) Determine whether the effects of delay time values in the region below 20
ms remain insignificant with a larger test group.
3) Test the validity of using only one MRT word list per treatment
4) Evaluate hypotheses
7.1 Hypotheses
Ho6: (interaction) Array geometry does not affect how delay time
between multiple arrivals affects the intelligibility of speech
reproduced by a sound system.
7.3 Equipment
As with the pilot study, subjects would evaluate binaural recordings via
headphone display. The audio playback equipment used in the current study
includes an IBM Lenovo S10 Ideapad (PC-based) netbook computer, Lexicon
Lambda USB audio interface and Sennheiser HD-650 circum-aural headphones.
All audio files had a sample rate of 44.1 kHz and 16-bit resolution.
The level of playback was calibrated using a hand-held sound level meter
(IEC 651 Type II). The meter was attached to a 6 cc coupler to approximate ear
canal effects, and positioned on the left headphone using a flat-plate coupler.
Playback level was adjusted at the audio interface such that playback of a
stimulus set which contained the loudest noise level used in this study (SNR = 0
7.4 Subjects
7.5 Locations
Listening tests were conducted at two locations. For the first 14 subjects,
the location used was the Sound Design Studio at the College Conservatory of
Music, University of Cincinnati, OH, USA. For the remainder of the subjects,
tests were carried out during off-hours in the main backstage area of the Norton
Opera Hall at the Chautauqua Institution, Chautauqua, NY, USA. Both spaces
were found to have background noise levels corresponding to NC-30 or less [82]
when measured (see chapter 3.1.2 for measurement equipment specifications).
7.6 Procedures
All 25 of the subjects evaluated one MRT word list for each of the 16
variable treatments. For each subject, this was completed in two sessions, each
session containing 8 stimulus sets and taking approximately 36 minutes (45
minutes including breaks) to complete. At the beginning of each subject’s first
session, a hearing acuity test was administered to verify that the subject did not
have a hearing impairment. Each subject was then given written and oral
instructions regarding the types of sounds they would be evaluating, operation of
the playback device and the method of response (see appendix H for instructions).
Results of the listening tests were stored by the test software, scored as the
number of correct responses out of 50. As was the case in the previous study, the
results were adjusted to account for the probability of chance-guessing using the
following equation:
Adjusted Score (Ra) = (Correct Responses − Incorrect Responses) / (Number of Choices − 1)    (Eq. 7.1)
Figure 7.2 Box and whisker plot of adjusted score (full data set)
Figure 7.2 shows the range and general distribution of the adjusted scores
obtained from all subjects for all treatments used in this study. Identified in the
plot are three outliers, falling more than 1.5 times the box length from the 25th
Figure 7.3 Box and whisker plot of adjusted score vs. subject (full data
set)
Figures 7.4 and 7.5 show the histogram and detrended Q-Q plot for the
full set of data. As was the case in the pilot study, the data does not conform to a
normal distribution. This conclusion is confirmed by the results of two tests of
normality (Table 7.2) as, for the Kolmogorov-Smirnov and
Figure 7.5 Detrended normal Q-Q plot of adjusted score (Strat_7 data
set)
Additionally the author has been made aware of, and will thus employ, a form of
log-linear analysis called multiway contingency tables analysis (MCTA) [120,
165, 180].
The results of the first order analysis of the Strat_7 data are shown in
Table 7.3. Subject is once again, and unsurprisingly, found to have an effect on
results.
Variable                 ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Subject                         4.811      < 0.001        89.946    < 0.001
Delay Time (0–40 ms)            4.326        0.005        11.329      0.010
Delay Time (0–10 ms)            0.565        0.569         1.094      0.579
Delay Time (10–40 ms)           9.239        0.003         7.747      0.005
Array Type                      3.789        0.052         3.488      0.062
SNR                           156.175      < 0.001       119.157    < 0.001
Table 7.3 Results of ANOVA and Kruskal-Wallis tests for first-order
effects of experimental variables on adjusted score (Strat_7
data set)
Also, the variable SNR shows high strength and significance. As the first-order
effect of SNR was included as a dummy variable to check the function of the
experiment, this result raises confidence in the testing methodology and execution.
The first order effect of array type was also included as a dummy variable.
Though the box plot (figure 7.6) of adjusted score vs. array type does indicate
some difference between the two arrays, results from this initial analysis of the
effect of array type do not quite meet the standard for sufficient statistical
significance (p = 0.062 > 0.05).12 It is possible that said effect may become
significant when further stratified data sets are analyzed.
12
For the analysis of the data set that included subject seven, both the ANOVA
and Kruskal-Wallis tests found array type to be a significant variable (p = 0.044
and 0.043, respectively).
Figure 7.6 Box and whisker plot of adjusted score vs. array type
(Strat_7 data set)
Delay time is seen to have a significant effect for the ranges of 0 ms–40
ms and 10 ms–40 ms; however, no significant effect is found in the range 0 ms–10
ms. As can be seen in figure 7.7, the results for delay times of 0 ms and 5 ms are
nearly identical. The results for the 10 ms treatments show a slight reduction in
overall variance, though the median value is lower than those found for the 0 ms
and 5 ms treatments.
As seen in table 7.4, neither strength nor significance is found for
any of the three 2-way interactions. It was suggested in chapter 6.7.2 that the
acquisition of more data points could possibly reveal significant interaction
effects, however 24 points per treatment (as opposed to twelve) has not yielded
further clarity.
Figure 7.7 Box and whisker plot of adjusted score vs. delay time
(Strat_7 data set)
Variable                   ANOVA F-Stat   ANOVA Sig.
SNR × Delay Time                  0.592        0.621
Delay Time × Array Type           1.776        0.151
SNR × Array Type                  2.582        0.109
Table 7.4 Results of ANOVA for second-order effects of
experimental variables on adjusted score (Strat_7 data set)
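Interaction tests like those in Table 7.4 are commonly run as a two-way ANOVA. A sketch using statsmodels on synthetic data (the factor levels and effect sizes are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
# Synthetic trials: delay-time and array-type factors, adjusted-score response.
df = pd.DataFrame({
    "delay": np.repeat([0, 5, 10, 40], 24),
    "array": np.tile(["Vert", "Hor"], 48),
})
df["score"] = 8.0 - 0.03 * df["delay"] + rng.normal(0.0, 1.0, len(df))

# Type II ANOVA; the C(delay):C(array) row is the interaction effect.
model = smf.ols("score ~ C(delay) * C(array)", data=df).fit()
table = anova_lm(model, typ=2)
```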
The work of Lochner and Burger [107] showed that the effects of rollover
interact with SNR (see chapter 2.1.1). Also, the work of Steeneken and Houtgast
[167], with regard to the measured effects of delay and SNR on the MTF,
indicates that the effect of delay may be mitigated when higher SNR values are
used (see chapter 2.2.4). Based on the findings from these studies, and
considering the relative strength of the effect of SNR, it has been proposed by the
author that said effect could serve to obscure the effects of array type and delay
time. Thus the Strat_7 data set was further stratified by SNR into two data sets:
Strat_7-SNR6 and Strat_7-SNR0. The results of the analyses for these data sets
are shown in tables 7.5 and 7.6.
Variable                 ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Delay Time (0–5 ms)             3.354        0.070         3.000      0.068
Delay Time (0–10 ms)            1.83         0.164         3.331      0.189
Delay Time (0–40 ms)            4.599        0.004        14.821      0.002
Delay Time (5–10 ms)            0.707        0.403         1.050      0.305
Delay Time (5–40 ms)            7.373        0.001        14.974      0.001
Delay Time (10–40 ms)           8.069        0.006         7.397      0.007
Array Type                      9.437        0.002         8.210      0.004
Table 7.5 Results of ANOVA and Kruskal-Wallis tests for effects of
delay time and array type on adjusted score (Strat_7-SNR6
data set)
The effect of delay time is more easily identified in the higher SNR
condition. Both strength and significance are increased for the 0 ms–40 ms, 5
ms‒40 ms and 10 ms–40 ms ranges. An illustration of these increases can be seen
in figure 7.8. For the 6 dB SNR data, a comparison of medians shows a
downward trend in intelligibility scores as delay time increases. This trend is not
seen for the 0 dB SNR data.13
Figure 7.8 Box and whisker plot of adjusted score vs. delay time, by
SNR (Strat_7 data set)
It is possible that the addition of more data points could refine the results
for the lower SNR data, removing some degree of variance, thus revealing a
relationship between delay time and scores. However, as was noted in chapter 6.7,
13
The analysis of the data set that includes subject seven revealed that, for the
SNR0 condition, the only significant effect was delay time in the range 10 ms–40
ms.
Figure 7.9 Box and whisker plot of adjusted score vs. array type, by
SNR (Strat_7 data set)
A similar trend is found to exist for the effects of array type, shown in
figure 7.9. Contrary to the results from the analysis for the full Strat_7 data set,
array type is seen to have a clear effect at the higher SNR condition (χ² = 8.210, p
= 0.004). From a review of the works of Holman [74] and Shirley et al. [162,
163], it was anticipated that the scores obtained from the vertical array type would
be higher. Of particular interest in these results is that, while the scores for the
vertical array are higher for the 6 dB SNR, there is no distinguishable difference
in scores for the 0 dB SNR. These results would suggest that the gains in
intelligibility usually afforded by the use of a center channel are reduced if not
negated by the presence of higher noise levels.
The results of the various two-way ANOVA tests for the complete Strat_7
data set did not reveal significant interactions between variables. However, as
seen in the previous section, it is possible to uncover variable relationships by
stratifying the data set by a single variable.
The Strat_7 data set was again divided into two data sets, this time
according to array type: Strat_7-Vert and Strat_7-Hor. Tables 7.7 and 7.8 show
the results of analyses carried out on these two data sets.
First, one can see that the strength of the effect of SNR is greater for the
vertical array. This is in agreement with the results from the previous section,
shown in figure 7.9.
Variable                 ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Delay Time (0–5 ms)             3.565        0.062         3.292      0.070
Delay Time (0–10 ms)            1.842        0.162         3.142      0.208
Delay Time (0–40 ms)            1.931        0.126         5.182      0.159
Delay Time (5–10 ms)            0.735        0.394         0.749      0.387
Delay Time (5–40 ms)            2.312        0.103         4.242      0.120
Delay Time (10–40 ms)           1.635        0.204         1.286      0.257
SNR                           107.941      < 0.001        72.526    < 0.001
Table 7.7 Results of ANOVA and Kruskal-Wallis tests for effects of
delay time and SNR on adjusted score (Strat_7-Vert data
set)
From these results it would appear that delay time does not have a
significant effect on intelligibility scores for the vertical array configuration.
Results for the horizontal configuration, on the other hand, indicate that there are
significant differences in scores for the delay time ranges of 0 ms–40 ms, 5 ms–40
ms and 10 ms–40 ms.
A graphical comparison of the intelligibility scores from both arrays is
shown in figure 7.10. There appears to be good agreement between the scores for
both array types for the 0 ms and 5 ms conditions. At 40 ms it is clear that there
is a difference in scores for the two arrays. While the statistical analysis of the
effect of the 5 ms–10 ms delay range for the horizontal array did not show
significance (χ² = 0.186, p = 0.666), an inference could be made from the plot. It
appears that the separate effects of delay time on the two arrays begin to diverge
for delay times greater than 5 ms, with effects becoming significant somewhere
between 10 ms and 40 ms.
These results are surprising. The experiments reported by Haas [67]
indicate that the critical delay difference, required for an echo to disturb listening,
increases as the angle of echo incidence deviates from front/center. While the
vertical array configuration does employ a loudspeaker located at an elevated
angle, Haas’ results indicate that elevation has less impact on lengthening the
critical delay difference than lateral angular-offset. As such, one would expect to
see the scores for the vertical array decline before those from the horizontal array.
Figure 7.10 Box and whisker plot of adjusted score vs. delay time, by
array type (Strat_7 data set)
One possible explanation for this disparity lies in the difference between
the goals of the studies of Haas and the current research project. The
investigation of echo detection/disturbance is not the same as the investigation of
intelligibility. As mentioned in chapters 2.1.1 and 2.2.2.2, there is debate
surrounding the relationship between the fusion (post-masking) of early
reflections and intelligibility. The results of the current study would seem to
agree with other researchers [41, 111] indicating that fusion plays different roles
for echo perception and intelligibility.
Another possible explanation, along the same lines, has to do with the fact
that the current study does not focus explicitly on angle of incidence; rather it
examines two different types of arrays – each differing in orientation and focus.
Rather than being purely an issue of monaural vs. binaural hearing, the observed
effect of array type may equally reflect a compound effect that also includes
contributions from array orientation and focus. This is an area of interest for
further study.
Also, based on the work of Mochimaru [130] and others [11, 41, 170], it is
unlikely that delay time would have no significant effect for the vertical array
type. What is more likely is that the effect of delay time is stronger for the
horizontal geometry, and that tests on the data from this study were merely unable
to find significance for the less-strong effect in the vertical geometry.
This process of factor removal repeats until no factors can be removed without affecting the prediction accuracy
of the model. The remaining factors are referred to as the generating class.
For the current study, the MCTA would contain one 4-way contingency
table including array type, SNR, delay time and “pass/fail”. Several 3-way tables
were also constructed using the four stratified data sets (by SNR and by array type)
previously mentioned. An example of a 3-way contingency table is shown in
table 7.9. Note that all independent variables have been converted into
categorical form.
Coding: Array Type: Vert = 1, Hor = 2; SNR: SNR0 = 1, SNR6 = 2;
Delay Time: 0 ms = 1, 5 ms = 2, 10 ms = 3, 40 ms = 4; Pass/Fail: Fail = 1, Pass = 2

Array Type   SNR   Delay Time   Pass/Fail   Frequency
1            2     1            1           12
1            2     1            2           13
1            2     2            1            6
1            2     2            2           19
1            2     3            1            7
1            2     3            2           18
1            2     4            1           11
1            2     4            2           14
2            2     1            1           12
2            2     1            2           13
2            2     2            1            7
2            2     2            2           18
2            2     3            1           11
2            2     3            2           14
2            2     4            1           20
2            2     4            2            5
Table 7.9 Multiway contingency table for Strat_7-SNR6 data set,
pass criterion: adjusted score ≥ 8 (90%)
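The pass/fail transformation behind a table such as 7.9 can be sketched as a simple tally. The data layout and the example records below are illustrative, not the study's actual data:

```python
from collections import Counter

def contingency_counts(trials, pass_threshold=8.0):
    """Tally pass/fail frequencies per (array type, SNR, delay) cell.

    `trials` is a list of (array_type, snr, delay_ms, adjusted_score)
    tuples -- a hypothetical layout, not the study's records.
    """
    counts = Counter()
    for array_type, snr, delay_ms, score in trials:
        outcome = "pass" if score >= pass_threshold else "fail"
        counts[(array_type, snr, delay_ms, outcome)] += 1
    return counts

# Illustrative records: (array, SNR dB, delay ms, adjusted score)
trials = [
    ("vert", 6, 0, 8.8), ("vert", 6, 0, 7.6),
    ("hor", 6, 40, 6.4), ("hor", 6, 40, 9.2),
]
table = contingency_counts(trials)
print(table[("vert", 6, 0, "pass")])  # 1
```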
In order for MCTA to be possible, the scalar result data from this study
would need to be transformed into categorical frequency data. To do this,
a pass/fail criterion must be set. Additional restrictions of MCTA dictate that the
choice of this criterion must result in no frequencies being less than one, and in
no more than 20% of frequencies being less than 5. Analysis of adjusted score
by treatment revealed that scores of 7.6 (88% correct) and 8 (90% correct) would
fulfill these requirements. Analysis was performed using both of these criteria
and results are reported in table 7.10.
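The cell-frequency restrictions on the choice of criterion can be checked mechanically. A minimal sketch, using the frequencies listed in table 7.9:

```python
def criterion_admissible(frequencies):
    """Check the MCTA cell-frequency restrictions described in the text:
    no cell frequency below 1, and at most 20% of cells below 5."""
    if any(f < 1 for f in frequencies):
        return False
    small = sum(1 for f in frequencies if f < 5)
    return small <= 0.2 * len(frequencies)

# Frequencies from table 7.9 (Strat_7-SNR6, 90% criterion):
freqs = [12, 13, 6, 19, 7, 18, 11, 14, 12, 13, 7, 18, 11, 14, 20, 5]
print(criterion_admissible(freqs))  # True
```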
MCTA of the Strat_7 data set, using the 88% criterion, did not reveal any
significant effects aside from the first-order effect of the dummy variable SNR. 14
Analysis of the SNR-stratified data sets (Strat_7-SNR0 and Strat_7-SNR6)
revealed that array type was removed from the model early in the process for the
SNR0 set, yet was included in the generating class for the SNR6 data set. This
result is in agreement with the results of the parametric and non-parametric tests
detailed in tables 7.5 and 7.6 and figure 7.9. The results of MCTA (88%) on the
array-stratified data determined that SNR was the only effect in the generating
class for the vertical array type, while SNR and delay time composed the
generating class for the horizontal array type. These results are also in accord
with the results from parametric and non-parametric tests (shown in tables 7.7 and
7.8 and figure 7.10).
Data Set        88% Crit.          90% Crit.
Strat_7         SNR                SNR, Delay Time
Strat_7-SNR0    None               None
Strat_7-SNR6    Array Type         Delay Time, Array Type
Strat_7-Vert    SNR                SNR
Strat_7-Hor     SNR, Delay Time    SNR, Delay Time
Table 7.10 Results showing generating class for MCTA using 88% and
90% pass criteria
Using the 90% pass criterion, MCTA of the full Strat_7 data set revealed
that both SNR and delay time compose the generating class. Analysis of the two
SNR-stratified data sets revealed that delay time and array type were significant
for the SNR6 condition but not for SNR0. Analysis of the array-stratified data
sets found delay time significant for the horizontal but not vertical array type.
The use of the two different pass criteria has shown that the choice of the
criterion point has an effect on the sensitivity of the MCTA test. The specific
14 For all MCTA tests, the maximum number of iterations was set to 10 and the criterion for significance was set to 0.05.
implications of these differences are unclear (e.g. whether delay time alone is
incapable of reducing scores below 88%). What is clear is that the results of the
90% pass criterion tests are in agreement with the results found earlier in this
chapter, namely: 1) Delay time and array type are significant effects, though
these effects are reduced / obscured at low SNR levels, and 2) delay time is found
to have a significant effect for the horizontal array geometry.
7.8 Discussion
When compared to the results of the pilot study, it can be seen that the
increase in size of the subject population has led to greater statistical strength and
significance in the results of the current study. The increased number of data
points has also allowed for significant results to be found in stratified analyses.
In addition to evaluating the first- and second-order effects of the
experimental variables, it was also of interest to evaluate the effects of potential
nuisance variables.
As was the case with the pilot study, it was desired to know whether
subjects received a sufficient amount of training prior to beginning the battery of
subjective evaluations. Considering that the presentation order of variable
treatments was randomized for each subject, if the analysis of the effects of
presentation order on scores were to reveal an improvement in subject
performance, this would be an indication of learning. As mentioned in chapter
6.8.2, learning during the first few sessions would indicate inadequate training,
and learning over the course of all 16 sessions could indicate that an insufficient
number of word lists were used. The results of this analysis are shown in table
7.11 and figure 7.11.
Inspection of the plot reveals no trend over the entire span of sessions.
However there is a slight upward trend in medians for the first three sessions,
suggesting that learning may be present over the short term. One might also infer
some degree of a decrease in variance beyond session number nine, which could
be an additional indicator of the quality and sufficiency of subject training.
Variable      ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Order (all)   0.939          0.521        13.004        0.602
Order (1-4)   0.614          0.608         1.574        0.665
Table 7.11 Results of ANOVA and Kruskal-Wallis tests for effects of
presentation order on adjusted scores (Strat_7 data set)
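For reference, the Kruskal-Wallis statistic reported throughout these tables can be computed directly from the ranked scores. A minimal sketch of the H statistic: ranks are averaged over ties, but the tie-correction divisor applied by full statistics packages is omitted, and no p-value is computed:

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for k independent samples (no tie correction)."""
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign average ranks to tied values
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2.0
        i = j
    rank_sums = [0.0] * len(groups)
    for (value, g), r in zip(pooled, ranks):
        rank_sums[g] += r
    # H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
    h = 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(sample) for rs, sample in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    return h

print(round(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), 6))  # 7.2
```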
Figure 7.11 Box and whisker plot of adjusted score vs. presentation
order (Strat_7 data set)
Figure 7.12 Box and whisker plot of adjusted score vs. word list
(Strat_7 data set)
As seen in table 7.12, there is little doubt that word list does indeed have
an effect. Specifically, as seen in figure 7.12, though the results from lists two
and three (MRT lists D and F, respectively) appear quite similar, there is a clear
difference between those and the results from list one (MRT list A). This result is
curious considering that studies noted in the literature have used reduced sets of
word lists (e.g. [9, 73, 122, 123]) or assigned different word lists to different
treatments (e.g. [21, 95, 99, 136, 138]).
Inspection of the matrix of subject responses showed that subjects
consistently selected the incorrect response for word number 48 in list A. The
correct word “BAT” was mistaken for the word “BATH”, the correct answer for
list 3 (MRT list F). The author listened to the two sound files used to create these
stimuli and indeed, even with no added noise, the two words were virtually
indistinguishable. As it was possible, though unlikely, that an error was made
during stimulus preparation, a visual inspection of the two waveforms and
spectrograms was performed on samples from the same treatment, verifying that
the two words were indeed different.
On the adjusted score scale (-10 ≤ Ra ≤ 10), each word missed
corresponds to a 0.4 point drop in score. Such a drop, as would be found when
subjects routinely miss the word bat, could account for the differences in scores
seen between word list number one and the other two lists. From the analysis, it
is evident that the assignment of different word lists to different treatments
constituted an insufficiently controlled variable or, at best, introduced the
additional error associated with a fractional factorial design. Thus, unfortunately,
future studies should employ a fully populated word-list-by-treatment matrix.
The question remains as to the validity of the results of the current study.
As Bech and Zacharov explain, the lack of control of a variable constitutes a
disturbing (or nuisance) variable [14]. They state that the method for dealing with
such variables is either to control them or to employ randomization that will break
relationships between the nuisance variable and independent variables. They
further state that such randomization increases a statistical model’s error
component and/or residual variance.
As mentioned, for this study word lists were randomly assigned to
treatments for each subject, using a uniform distribution. Thus any increase in
error and/or variance due to the difference between word lists would be spread
across all treatments. Even so, an attempt at post-hoc control of this nuisance
variable is mandated if one is to have faith in the results of the study.
The offending data, word number 48, was censored from the results of all
word lists and the full statistical analysis presented in this chapter was repeated.
The lists resulting from the removal of the 48th word contained 49 words each;
the corresponding data were identified with the marker “49w”, as compared to
the original 50-word (“50w”) data.
Figure 7.13 Box and whisker plot of adjusted score vs. word list
(Strat_7, 49w data set)
Figure 7.13 shows the box and whisker plot of the effect of word list for
the 49w data set. It would appear that the censoring of the results for the 48th
word in all data effectively compensates for the differences observed between
word lists, given that the means for the three word lists are now identical. The
results shown in table 7.13 confirm that the effects of word list are no longer
found to be significant. Differences do exist in result variances between the three
word lists and the statistical strength and confidence are not negligible.
Aside from the effects of word list, the majority of results obtained did not
differ between the 50w and 49w data sets in the current study. For effects that
were found to be significant with the 50w data set, the removal of the residual
variance caused by the 48th word merely increased the statistical strength and
confidence. 15 For most of the effects that were not found significant in the
analysis of the 50w data set, the same was found for the 49w data set. Two
notable exceptions to this were the effect of array type for the Strat_7 data set and
the effect of delay time (0 ms–5 ms) in the Strat_7-Vert data set.
Variable           ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Array Type (50w)   3.789          0.052        3.488         0.062
Array Type (49w)   6.365          0.012        6.323         0.012
Table 7.14 Differences between statistics for array type between the
50w and 49w data sets (Strat_7 data set)
15 For example, the analysis of the full data set revealed that the statistics for delay time (10 ms–40 ms) shifted from χ² = 7.877, p = 0.005 (50w) to χ² = 9.121, p = 0.003 (49w).
As seen in table 7.14, the post-hoc control of this nuisance variable has increased the clarity of the statistical test, rendering a
clearly significant effect where borderline significance was previously found.
Likewise, the reduction in residual variance has yielded significance
regarding the effect of short delay times. As seen in table 7.15, there is sufficient
evidence to conclude that there is a difference between scores for the 0 ms and 5
ms conditions for the vertical array type.
Variable                    ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Delay Time (0–5 ms) (50w)   3.565          0.062        3.292         0.070
Delay Time (0–5 ms) (49w)   4.007          0.048        3.925         0.048
Table 7.15 Differences between statistics for 0 ms–5 ms delay times
between the 50w and 49w data sets (Strat_7-Vert data set)
It can be seen in figure 7.10 that scores are generally lower for the 0 ms
condition than for the 5 ms condition. What is interesting is that these differences
are only found to be significant for the vertical array type. In fact, as shown in
table 7.16, it would appear that this effect is completely negligible for the
horizontal array type.
Variable                           ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Delay Time (0–5 ms) (Horizontal)   0.058          0.810        0.005         0.941
Delay Time (0–5 ms) (Vertical)     4.007          0.048        3.925         0.048
Table 7.16 Differences between statistics for 0 ms–5 ms delay times
between the Strat_7-Hor and Strat_7-Vert sets (49w data
sets)
Upon investigation of this result, it was discovered that a systematic calibration error was made during the stimulus
capture process. As seen in figure 3.8 (chapter 3.3), the calibration microphone
used for time alignment was placed just above the head of KEMAR. While this
would not affect the time alignment of loudspeakers in the horizontal array
geometry, it would affect alignment for the vertical array.
The source-to-receiver distance would differ between the measurement
microphone and KEMAR’s ears. The distances
between the source and the two receivers would be approximately equal for the
front-fill loudspeaker, however the distance to the main loudspeaker (Meyer
UPA-1p) would be approximately 10 cm (≈ 0.3 ms travel time) shorter for the
case of the measurement microphone. Thus for the 0 ms condition, the
summation of the signals from the main and fill loudspeakers at the ears would
actually have a 0.3 ms offset, resulting in a ½ λ notch near 1.6 kHz, 3/2 λ notch at
4.8 kHz, 5/2 λ notch at 8 kHz, etc.
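The notch frequencies quoted above follow from the relationship that a two-path offset of τ seconds places destructive interference at odd multiples of 1/(2τ). A sketch, assuming c = 343 m/s (the chapter's ≈ 0.3 ms figure and rounded notch values describe the same relationship):

```python
def comb_notches(path_diff_m, speed_of_sound=343.0, count=3):
    """First `count` comb-filter notch frequencies (Hz) produced when two
    equal-level copies of a signal arrive offset by `path_diff_m` metres."""
    tau = path_diff_m / speed_of_sound          # travel-time offset in seconds
    return [round((2 * k + 1) / (2 * tau)) for k in range(count)]

# 10 cm path difference -> tau ≈ 0.29 ms
print(comb_notches(0.10))  # [1715, 5145, 8575]
```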
It is somewhat unfortunate that this error makes it impossible to determine
whether there would be a significant difference in intelligibility scores between
the 0 ms and 5 ms conditions for the vertical array type. However, if an error
can ever be fortuitous, this may have been such a case. Based on previous
studies and personal communications, it was never anticipated that a 5 ms offset
in arrivals would have a significant impact on intelligibility [11, 28, 125, 131, 144].
As mentioned in chapter 2.2.3, the two popular methods for loudspeaker
alignment are absolute alignment and intentional misalignment. What the results
found in this study indicate is that, at least for vertically oriented point-destination
arrays, very short delay times (e.g. 0.3 ms) do indeed have a greater negative
impact on intelligibility than only moderately short (e.g. 5 ms) delay times. These
subjective results confirm the findings of C. Davis [39] that were obtained
through objective (RaSTI and %ALcons) methods. Coincidentally, the calibration
error found in this study corresponds to the exact delay offset used by Davis.
7.9 Conclusions
As with the pilot study, this first phase of the main study had several main
points, including the evaluation of a variety of hypotheses. This section will
address these questions and tasks, as well as indications of possible directions for
the final phase of the main study.
7.9.1 Hypotheses
Ho1: Delay time between multiple arrivals does not affect the
intelligibility of speech reproduced by a sound system.
From the results of both parametric and non-parametric tests, there is
sufficient evidence to reject this null hypothesis. The results indicate that the
variable delay time does have a significant effect on intelligibility scores. The
tests indicate that this variable begins to have an effect somewhere in the region
between 10 ms and 40 ms. The tests further indicate that the effects of very short
delay times also have a negative impact on intelligibility scores.
Ho4: (interaction) Signal-to-noise ratio does not affect how delay time
between multiple arrivals affects the intelligibility of speech
reproduced by a sound system.
The second-order ANOVA results do not indicate that an interaction effect
exists between SNR and delay time. However independent analysis of the two
SNR-stratified data sets reveals that an interaction does exist. It was observed
that higher noise levels (lower SNR) obscure much of the effect of delay time on
intelligibility. The results found in this study provide sufficient evidence to reject
this null hypothesis.
Ho6: (interaction) Array geometry does not affect how delay time
between multiple arrivals affects the intelligibility of speech
reproduced by a sound system.
Once again, the second-order ANOVA was unable to detect an interaction.
Through stratification by array type, it was found that the effect of longer delay
times on intelligibility is greater for the horizontal array geometry. This result
differs from the suggested, though not significant, trend seen in the pilot study.
The result also would seem to deviate from the expected, given the results of the
studies by Haas [67]. It was also seen that the effects of very short delay times
are also significant, at least for the vertical array geometry. Indubitably, this
relationship warrants additional investigation.
2) Determine if the effects of delay time values in the region less than
20 ms are still insignificant with a larger test group.
It has been seen that delay time begins to have an effect somewhere in the
range between 10 ms and 40 ms. These results are in agreement with those of the
pilot study. It is clear that the 5 ms variable value can be excluded from further
study. It would however be useful to include the 30 ms value, as this would
provide increased time resolution for determining the amount of delay required to
affect intelligibility.
The effects of very short delay times (e.g. under 5 ms) remain of interest.
However, as stimuli for this range of delay times were not captured for this
research project, such effects cannot be addressed at this time.
3) Test the validity of using only one MRT word list per treatment
Results indicate a clear difference between the scores obtained using MRT
list A and those obtained from using lists D and F. Analysis of the results from
each of these word lists indicated one possible source of variance. However as
not all of the variance can be accounted for, it is recommended that a full factorial
design (list by treatment) should be used in future studies in the interest of
controlling this potential nuisance variable.
4) Evaluate hypotheses
As can be seen in the previous section, all six of the hypotheses posed at
the beginning of this study have been evaluated, resulting in the discovery of
several interesting variable interactions.
The interaction between array type and delay time merits further study,
particularly as the results found were unexpected. In this arena, the question
arises regarding differences in the critical delay time required for each type of
array to produce a significantly adverse effect on intelligibility.
An additional avenue for potential future research would be to investigate
the extent to which a difference in intelligibility exists between the 0 ms and 5 ms
delay time conditions. It would appear that, as delay time increases from 0 ms, a
region of clear impairment exists, followed by a region of minimal impairment,
and followed by another region of clear impairment. As this relates directly to the
question of alignment vs. intentional misalignment of loudspeaker arrays, it seems
worthy of extended study.
At the end of this, the penultimate study, the ratio of questions answered
to new questions encountered has begun to tilt in the favor of the researcher.
While several questions remain, the final study will focus in one direction, leaving
many of these questions unaddressed.
After the completion of the pilot study and Phase 1 of the main study, it
was clear that there were many potential avenues for further research. In the final
study of this research project, it was decided to follow one of these paths – the
investigation of the interaction between array type and delay time. By focusing
on fewer variable treatments, it would be possible to obtain a greater number of
data points per treatment using the same number of subjects, yielding greater
power in the statistical tests.
This second phase of the main study would have two main points:
1) Evaluate hypotheses
2) Identify potential future research questions that the current study would
be unable to address.
It was discovered during the first phase that an issue existed with one of
the words in word list number one. It should be noted that this issue was not
uncovered prior to the commencement of the current study. As such, analysis of
the data obtained in this study will include results for the censored (49w) data set,
denoting any differences found between the censored and non-censored data sets.
8.1 Hypotheses
Ho1: Delay time between multiple arrivals does not affect the
intelligibility of speech reproduced by a sound system.
Employing a different range for the variable delay time, this hypothesis
will be tested. If the null hypothesis is rejected, it will allow for the determination
of the amount of delay required to effect a negative change in intelligibility.
Ho2: (interaction) Array geometry does not affect how delay time
between multiple arrivals affects the intelligibility of speech
reproduced by a sound system.
If the first null hypothesis is again rejected, it will be possible for the
second hypothesis to be tested. Using the new range of delay times, this study
will attempt to confirm the findings of the first phase of the main study. Further,
this study will attempt to determine whether the amount of delay required to
effect a negative change in intelligibility differs by array type.
The variable values to be used for this study are detailed in table 8.1.
These values form a 4×1×2 matrix of eight total treatments. As all subjects would
evaluate each treatment with each word list, the total number of evaluations (sets)
per subject would be 24. The presentation order of these sets was randomized for
each of the subjects using the Matlab function “randperm”.
Delay Time (ms) 10 20 30 40
Approx. SNR (dB) 3
Array Geometry Vert Hor
Table 8.1 Variable values used in the second phase of the main study.
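The per-subject randomization performed with Matlab's randperm can be mirrored as follows; the seed argument here is an illustrative addition for reproducibility, not part of the study's procedure:

```python
import random

def presentation_order(n_sets=24, seed=None):
    """Return a random permutation of set indices 1..n_sets, analogous to
    Matlab's randperm(n_sets). A fresh order is drawn for each subject."""
    rng = random.Random(seed)
    order = list(range(1, n_sets + 1))
    rng.shuffle(order)
    return order

print(presentation_order(24, seed=7))
```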
8.3 Equipment
possible choices. The program would only allow an individual stimulus file to be
played once.
8.4 Subjects
8.5 Locations
Listening tests were conducted at two locations. For the first 16 subjects,
the location used was the Sound Design Studio at the College Conservatory of
Music, University of Cincinnati, OH, USA. For the remainder of the subjects,
tests were carried out at a listening facility in Merriam, KS, USA. Both spaces
were found to have background noise levels corresponding to NC-30 or less [82]
when measured (see chapter 3.1.2 for measurement equipment specifications).
8.6 Procedures
All 33 of the subjects evaluated three MRT word lists for each of the 8
variable treatments. For each subject, this was completed in three sessions, each
session containing 8 stimulus sets and taking approximately 36 minutes (45
minutes including breaks) to complete. At the beginning of each subject’s first
session, a hearing acuity test was administered to verify that the subject did not
have a hearing impairment. Each subject was then given written and oral
instructions regarding the types of sounds they would be evaluating, operation of
the playback device and the method of response (see appendix H for instructions).
The subject would then undergo a training process to become familiarized
with the stimuli and testing procedures. The training used for this study involved
the evaluation of 4 stimulus sets comprised of: List A delivered under variable
treatment 1 (10 ms, Vertical Array), list D under treatment 4 (40 ms, Vertical
Array) and list F under treatments 5 (10 ms, Horizontal) and 8 (40 ms, Horizontal).
This training set provided subjects with the opportunity to hear all of the
individual words that would be presented, and experience the magnitude of the
differences between auditory attributes of the various treatments to be used in the
study.
Once the training process was complete, the test administrator spoke with
the subject to verify that they understood the instructions and operation of the
apparatus, and to remind the subject of the importance of taking breaks to
minimize fatigue and distraction. As was the case in the previous study, a fixed
policy regarding the spacing of breaks was implemented. The subject was
instructed that, while they were free to pause the testing process at any point, they
would be required to take a 1- to 2-minute break after the completion of every two
50-word sets (approximately every 8‒10 minutes).
The subject then began the first session of stimulus evaluation. At the
completion of the listening session, the subject was debriefed to determine if they
had any concerns about the testing procedure and if they had experienced any
perceived hazards or issues with the testing apparatus. As mentioned, one subject
Results of the listening tests were stored by the test software and, as both
50w and 49w data sets would be analyzed, the results were scored as the number
of correct responses out of both 50 and 49. As was the case in the previous study,
the results were adjusted to account for the probability of chance-guessing using
the following equation:
Adjusted Score (Ra) = (Correct Responses – Incorrect Responses) / (Number of Choices – 1)    (Eq. 8.1)
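With the MRT's six response alternatives, Eq. 8.1 yields the –10 to +10 range and the 0.4-point cost per missed word noted in chapter 7. A minimal sketch:

```python
def adjusted_score(correct, incorrect, n_choices=6):
    """Chance-corrected score Ra = (C - I) / (choices - 1)  (Eq. 8.1).
    n_choices defaults to the MRT's six response alternatives."""
    return (correct - incorrect) / (n_choices - 1)

print(adjusted_score(50, 0))   # 10.0  (perfect 50-word list)
print(adjusted_score(0, 50))   # -10.0
print(adjusted_score(49, 1))   # 9.6   (one miss costs 0.4 points)
```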
As was the case in the previous study, the variance and means were
examined for each subject. Figure 8.1 shows the range and general distribution of
the adjusted scores obtained from all subjects for all treatments used in this study.
Identified in the plot are a number of outliers, falling more than 1.5 times the box
length from the 25th percentile.
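The outlier rule described above is the usual Tukey box-plot criterion, with the box length equal to the inter-quartile range. A sketch; the quartile interpolation used here is one common convention and may differ slightly from the plotting package used in the study:

```python
def quartiles(data):
    """25th and 75th percentiles via linear interpolation (a simple sketch)."""
    s = sorted(data)
    def pct(p):
        idx = p * (len(s) - 1)
        lo = int(idx)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (idx - lo) * (s[hi] - s[lo])
    return pct(0.25), pct(0.75)

def box_outliers(scores):
    """Values falling more than 1.5 box lengths (IQRs) outside the box."""
    q1, q3 = quartiles(scores)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in scores if x < lo or x > hi]

print(box_outliers([8.8, 9.2, 9.2, 9.6, 9.6, 10.0, 4.0]))  # [4.0]
```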
Figure 8.1 Box and whisker plot of adjusted score (full 49w data set)
Figure 8.2 Box and whisker plot of adjusted score vs. subject (full
49w data set). Note that, as subjects 12 and 28 did not
finish all of the testing sessions, their user numbers have
been shifted to 112 and 128 for ease of identification and
exclusion.
For subject six, not only is there wide variance in the subject’s scores, but it can
of these scores is well below the 25th percentile of any other subject.
As it was not possible to definitively determine the causes of the variance
or low scores, two sets of analyses were performed as recommended in [14]. The
results from both sets were quite similar, with only minor changes in statistical
strength and significance between. As such, the results from the data set that
excludes subject six, as well as the aforementioned exclusion of subjects 12 and
28, will be reported in detail (Strat_6 data set). Results from further stratification
of the Strat_6 data set by array type will also be reported.
As mentioned in chapter 7.8.2, an issue was found with the 48th word in
the first word list. As such, the results of the previous study were reanalyzed
using a data set that excluded the results for the 48th words from each word list
(49w data set). For the current study, the same data censoring method was
applied and statistical analysis was performed on both the censored and
uncensored data sets.
As can be seen from tables 8.3 and 8.4, there is little difference between
the analysis results from the two data sets. The added control of the nuisance
variable word list provided by the 49w data set does not change which effects are
or are not significant. For significant effects, statistical strength and confidence
are increased. Little change is noted for effects that were not found significant.
The effect of word list, which will be discussed further in chapter 8.8.2, is reduced
but not eliminated.
The results shown here confirm the findings from the previous study as
delay time has a clearly significant effect in the range of 10 ms–40 ms. As
significant effects were found in the ranges of 20 ms–40 ms and 30 ms–40 ms, yet
no significant effect was found in the ranges of 10 ms–30 ms or 20 ms–30 ms,
these results indicate that delay time begins to have a significant effect on
intelligibility somewhere above 30 ms.
would generally be expected, the underlying reasons for the difference would not.
Said reasons will be explored in the next section.
Figure 8.3 Box and whisker plot of adjusted score vs. array type
(Strat_6, 49w data set)
                                     Out of 50               Out of 49
Variable                             ANOVA F-Stat   Sig.     ANOVA F-Stat   Sig.
Array Type × Delay Time (10–40 ms)   2.241          0.082    2.200          0.087
Array Type × Delay Time (20–40 ms)   3.309          0.037    3.172          0.043
Array Type × Delay Time (30–40 ms)   3.905          0.049    3.116          0.078
Table 8.5 Results of ANOVA for second-order effects of
experimental variables on adjusted score (Strat_6 data set)
The results from the 49w and 50w data sets showed no difference when
stratified by array type. The results from the two array types however displayed
great difference. As can be seen in tables 8.6 and 8.7, several delay time ranges
have an effect for the horizontal array type, while no significant effect on
intelligibility scores can be found for the vertical array.
The increased clarity provided through stratification shows the same
effects to be significant (vs. the analysis of the Strat_6 data set); however the
strength and confidence are greatly increased. From these analyses it is clear,
though again unexpected, that delayed multiple arrivals have a greater negative
impact on speech intelligibility when delivered from a pair of loudspeakers
oriented in a horizontal point-source array. It is also clear that noticeable
detriment is found for delay times above 30 ms. This is not to say that shorter
delay times have no effect on intelligibility. Rather, the findings suggest that, for
delay times shorter than 30 ms, the injurious effects are difficult to identify.
Other known factors, such as SNR and reverberation time, will likely carry more
weight in the ultimate determination of the intelligibility of a speech
reinforcement system. In other words, if the time offset between multiple arrivals
is kept to less than 30 ms and intelligibility is still found lacking for a sound
system, one should examine other factors for the underlying cause of the
significant impairment.
points per treatment and the use of only one SNR value, the requirements for
MCTA could be fulfilled by a greater number of criteria. 16 As such, criteria that
lie on or in between the medians of scores for the various treatments were selected.
The results of MCTA (table 8.8) were the same for the 50w and 49w data
sets, and are both in agreement with the results obtained through the other
analysis methods – namely that delay time has a significant effect on intelligibility
scores for the horizontal, but not vertical, array geometry. It should be noted that
for the 49w data set, the second-order interaction of array type × delay time
showed borderline significance (p = 0.074) for the 85% pass criterion. This is an
indication that the upper pass/fail criterion point did not provide sufficient
statistical resolution to detect the relationship.
Data Set       83% Crit.                 85% Crit.
Strat_6        Array Type × Delay Time   Array Type & Delay Time
Strat_6-Vert   None                      None
Strat_6-Hor    Delay Time                Delay Time
Table 8.8 Results showing generating class for MCTA using 83% and
85% pass criteria.
8.8 Discussion
It is clear from the various statistical tests used that the effect of delay time
on intelligibility scores is different for the two types of arrays studied. Figure 8.4
shows these differing relationships. It is clear from the graph that, as delay time
between arrivals increases, scores for the horizontal array decrease while scores
for the vertical array do not.
One can see that the median scores in the vertical array are identical for
the various levels of the variable delay time. Variance seen in the scores could be
an indication that an effect does exist. However it is also possible that, as the
statistical tests detailed in this study were unable to find significance, the power of
any such effect would pale in comparison to the effects of other factors. This is a
16 MCTA requires that all frequencies must be greater than 0 and no more than 20% of the count frequencies should be less than five.
172
good indication of the amount of weight one should give to this specific factor
when designing or optimizing sound systems for intelligibility. The variance seen
in these scores could also merely be caused by factors such as differences
between test subjects and/or word lists.
The same could be said about the observed variance in scores for the
horizontal array type. Specifically, comparison of the score distributions for the
20 ms and 30 ms conditions shows a drop in median score as well as increased
variance and a larger inter-quartile range. While this may be an indication that
delay time begins to have an effect in the 20 ms–30 ms range, the effect was not
pronounced enough to be found significant. Again, this suggests the amount of
weight that should be given this factor with regard to the intelligibility of
reinforced speech.
Figure 8.4 Box and whisker plot of adjusted score vs. delay time, by
array type (Strat_6 data set)
As was the case with the previous two studies, it was desired to know
whether subjects received a sufficient amount of training prior to beginning the
battery of subjective evaluations. Considering that the presentation order of
variable treatments was randomized for each subject, if the analysis of the effects
of presentation order on scores were to reveal an improvement in subject
performance, this would be an indication of learning. As mentioned in the
previous chapters, learning during the first few sessions would indicate
inadequate training, and learning over the course of all 24 sessions could indicate
that an insufficient number of word lists were used. The results of this analysis
are shown in figure 8.5 and table 8.9.
Figure 8.5 Box and whisker plot of adjusted score vs. presentation
order (Strat_6 data set)
No learning trend is apparent over the span of 24 sessions. The only element of interest is the one outlier found in the
results of the first session.
Variable        ANOVA F-Stat   ANOVA Sig.   K-W χ² Stat   K-W Sig.
Order (1-24)    1.116          0.321        25.573        0.321
Order (1-8)     0.732          0.645        5.635         0.583
Order (1-4)     0.333          0.810        0.504         0.918
Table 8.9 Results of ANOVA and Kruskal-Wallis tests for effects of
presentation order on adjusted scores (Strat_6, 49w data set)
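The learning-effect check in table 8.9 pairs a parametric and a non-parametric test on the same grouping. A minimal sketch, using synthetic stand-in scores rather than the Strat_6 data:

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(7)
# Synthetic stand-in scores: 24 presentation-order groups, no true trend
groups = [rng.normal(85, 5, 10) for _ in range(24)]

f_stat, f_p = f_oneway(*groups)      # parametric one-way ANOVA
h_stat, h_p = kruskal(*groups)       # non-parametric Kruskal-Wallis
print(f"ANOVA F = {f_stat:.3f} (p = {f_p:.3f}); "
      f"K-W H = {h_stat:.3f} (p = {h_p:.3f})")
```

Running both tests, as in the table, guards against the ANOVA's normality assumption being violated by bounded word-score data.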
The strength of the effect was found to be considerably reduced in the 49w
data set, indicating that the removal of the offending data points was prudent.
However, word list remains a significant factor. While an effect is present, it
would have no more impact on the results of the study than the “subject” variable,
as a full factorial design of treatment vs. word list was used.
The results do indicate that, for future studies involving the same MRT
word lists used here, it would be necessary to treat word list as a nuisance variable
to be controlled through full factorial design.
8.9 Conclusions
As with the previous two studies, this final phase of the main study
involved the addressing of main points and the evaluation of hypotheses.
However, unlike the previous studies, the scope and the number of research
questions had been reduced, allowing for greater focus and increased clarity. This
section will address the research questions posed and points of the study, as well
as possible directions for future study.
8.9.1 Hypotheses
Ho1: Delay time between multiple arrivals does not affect the
intelligibility of speech reproduced by a sound system.
The results of the statistical tests used in this study provide sufficient
evidence to reject this null hypothesis. Negative effects on intelligibility were
found to be significant for multiple arrivals separated by a 40 ms delay. The
specific amount of delay required to cause significant detriment was found to lie
in the range of 30 ms–40 ms.
Ho2: (interaction) Array geometry does not affect how delay time
between multiple arrivals affects the intelligibility of speech
reproduced by a sound system.
The results of this study also provide sufficient evidence to reject this null
hypothesis. It was found that time offsets between multiple arrivals, in the range
of 10 ms–40 ms, have no significant effect on intelligibility for the vertical array
type. However, a significant effect was found for the horizontal array type under
the same conditions.
Measurements were made for all variable treatments and noise levels as
indicated in table 3.2. As the impulse response measurement method was
dual-channel FFT using maximum length sequences (MLS), the measurements had an
inherently high immunity to noise. For example, the measured frequency
response of the system varied little between very low- and very high-noise
conditions. An example of the differences between these measurements can be
seen in figure 9.1. As is also seen in the figure, because the frequency
response of the vocal reproduction system rolls off in the lower-frequency range,
differences become significant in the frequency region below 100 Hz. Due to the
measurement method’s immunity to noise, it was necessary to manually input the
signal level and SNR (per octave) prior to intelligibility calculations.
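The noise immunity described above follows from the correlation properties of the excitation. The sketch below is illustrative only (the sequence length, toy impulse response and noise level are arbitrary, not those of the measurement rig): circularly cross-correlating the noisy output with the MLS recovers the impulse response, while uncorrelated noise averages toward zero.

```python
import numpy as np
from scipy.signal import max_len_seq

mls = max_len_seq(14)[0] * 2.0 - 1.0     # N = 2**14 - 1 samples, values ±1
N = len(mls)

h_true = np.zeros(64)                    # toy impulse response:
h_true[0], h_true[40] = 1.0, 0.5         # direct arrival plus one echo

rng = np.random.default_rng(0)
# "Measured" output: circular convolution of the MLS with h, plus
# heavy uncorrelated noise
y = np.real(np.fft.ifft(np.fft.fft(mls) * np.fft.fft(h_true, N)))
y += rng.normal(scale=0.5, size=N)

# Circular cross-correlation with the known MLS recovers h; the noise
# term averages toward zero, giving the immunity described above
h_est = np.real(np.fft.ifft(np.fft.fft(y) * np.conj(np.fft.fft(mls)))) / N

print(np.round(h_est[0], 2), np.round(h_est[40], 2))
```

The residual noise in the estimate shrinks roughly as 1/√N, which is why even the very low-SNR conditions changed the measured response so little.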
Through subjective testing, significance was found for each of the 1-way
and 2-way effects of delay time, array type and SNR. The following is a
comparison of the objective and subjective findings for each of these effects. As
the variable level offset was not used in the subjective testing processes, only the
results from treatments using the 0 dB level offset condition will be reported.
Delay Time
Delay time was found to begin to have a significant first-order effect for
offsets greater than 30 ms. As seen in figure 9.2, the STI scores remain fairly
constant over the range of 5 ms–30 ms, with a considerable drop between 30 ms
and 40 ms. While this is in agreement with the results of subjective testing, a
curiosity is noted with regard to the drop in STI score between the 0 ms and 5 ms
delay times.
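The mechanism behind this pattern can be illustrated with a simplified, single-band calculation. For a direct arrival plus one equal-level echo at delay τ, the modulation transfer function reduces to m(F) = |cos(πFτ)|, and the core STI step (per IEC 60268-16) maps each modulation index to an apparent SNR clipped to ±15 dB and rescaled to [0, 1]. This is a sketch, not the full STI computation, which weights many band and modulation-frequency combinations:

```python
import numpy as np

def mtf_echo(F, tau):
    # MTF for a direct arrival plus one equal-level echo delayed by
    # tau seconds: m(F) = |cos(pi * F * tau)|
    return np.abs(np.cos(np.pi * F * tau))

def transmission_index(m):
    # Core STI step: modulation index -> apparent SNR (dB), clipped
    # to +/-15 dB, then rescaled to a transmission index in [0, 1]
    snr = np.clip(10 * np.log10(m / (1 - m)), -15.0, 15.0)
    return (snr + 15.0) / 30.0

F = 12.5  # highest standard STI modulation frequency, Hz
for tau_ms in (5, 30, 40):
    m = mtf_echo(F, tau_ms * 1e-3)
    print(tau_ms, round(float(m), 3), round(float(transmission_index(m)), 3))
```

Even this toy model shows the modulation depth at the fastest envelope rates collapsing as the echo delay approaches 40 ms, consistent with the drop in STI between 30 ms and 40 ms.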
Figure 9.2 Box and whisker plot of STI vs. delay time (all treatments)
Figure 9.4 Box and whisker plot of STI vs. SNR (all treatments)
Array Type
The first-order effects of array type were found to be significant for the
49w data sets in both the first and second phases of the main study. As seen in
figure 9.5, the nearly coincident ranges of values and large degree of overlap
between the inter-quartile ranges of the STI scores for the two array types make it
difficult to definitively establish that a difference exists. Though some
difference can be noted, the equivalent plot of results from subjective testing
(seen in figure 8.3, chapter 8.7.2) shows a clearer difference, and thus a greater
apparent first-order effect.
Figure 9.5 Box and whisker plot of STI vs. array type (all treatments)
Subjective testing found the effects of delay time to be reduced under lower
SNR conditions. The results from objective measurements, shown in figure 9.6,
appear to confirm these findings. A comparison of the STI scores for the 50 dB
and −3 dB SNR values shows that differences in score due to delay time are
significantly reduced for the lower SNR condition. As such, it is reasonable to
conclude that, in terms of detecting the interaction effects of SNR × delay time,
the STI measurement method functions in a manner similar to human perception.
Figure 9.6 3-dimensional box and whisker plot of STI vs. delay time,
by SNR (all treatments)
continue to show a reduction in the size of the range of scores for lower SNR
conditions.
Figure 9.7 Box and whisker plot of STI vs. SNR, by array type (all
treatments)
may be due to differences between the STI and word score scales, it is also
possible that this data indicates that the STI measurement method is not fully
sensitive to the effects of array type.
Figure 9.8 Box and whisker plot of STI vs. SNR, by array type (all
treatments containing SNR conditions 6 dB, 3 dB and 0 dB)
17 As previously mentioned, the effects of very short delay times were not tested
for the horizontal array type.
scores for the array types for the 0 ms condition. It should be reiterated that the
short (0.3 ms) delay time found in the audio recordings was not present in the
signals received by the measurement microphone, yet the notches in the frequency
response of the loudspeakers in the horizontal array type (seen in figure 3.7,
chapter 3.2.2) would be captured by the measurement microphone. While it is
possible that the underestimated effects of masking, due to the lack of spectral
shaping of the test signal, could reduce the effect of the higher-frequency notch on
the STI score, the notch at around 1.2 kHz would be virtually unaffected by
spectral shaping. Given that there is no difference between the scores for the two
array types under the 0 ms condition, it is reasonable to conclude that the STI
measurement method is not sensitive to the narrow-band frequency response
anomalies caused by very short delay times.
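The narrow-band anomaly in question can be illustrated with a minimal two-arrival model: equal-level copies of a signal summed after a delay τ produce notches at odd multiples of 1/(2τ). The 0.3 ms value below matches the short offset discussed above; the notch positions of the measured arrays also depend on geometry and relative level, so this is a sketch rather than a model of the actual system.

```python
import numpy as np

def comb(f, tau):
    # Magnitude response of unit-level direct sound plus a copy
    # delayed by tau seconds: |H(f)| = |1 + exp(-j*2*pi*f*tau)|
    return np.abs(1 + np.exp(-2j * np.pi * f * tau))

tau = 0.3e-3                 # 0.3 ms inter-arrival offset
first_notch = 1 / (2 * tau)  # first cancellation, ~1667 Hz
first_peak = 1 / tau         # first reinforcement, ~3333 Hz
print(round(first_notch), comb(first_notch, tau), comb(first_peak, tau))
```

Because such notches are far narrower than the octave-band analysis used by STI, the measurement averages over them, which is consistent with the insensitivity concluded above.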
Figure 9.9 Box and whisker plot of STI vs. delay time, by array type
(all treatments)
Another pattern of note is that scores remain very similar between the two
array types in the range of 5 ms–30 ms. Given the results shown in figures 7.10
(chapter 7.7.5) and 8.4 (chapter 8.8), one would expect to see large differences
between scores for the 30 ms condition as well as less overall change in the scores
for the vertical array. In the interest of a more direct comparison, a plot of the
STI scores for the 0 dB–6 dB SNR conditions has been included in figure 9.10.
Figure 9.10 Box and whisker plot of STI vs. delay time, by array type
(all treatments containing SNR conditions 6 dB, 3 dB and 0
dB)
differences in the scores between array types, which is not detected by the STI
measurement.
The third thing of note is that the STI measurements correctly identify that
a difference in intelligibility exists between the two arrays at the 40 ms level.
Again, as the measurement method is single-channel, the results suggest that at
least some portion of the difference between scores is due to a monaural effect.
9.2.1 Conclusions
10. Discussion
“E = MC² ± 3 dB”
- David Engstrom (in [43])
Perhaps a bit whimsical, but this observation reminds one that the
accumulation of knowledge often carries with it the introduction of new unknowns.
It also reminds one that certainty is rarely certain. Research, however, is a
process of successive approximation, wherein each step has the potential to yield
a greater understanding of the world around us.
The research project detailed in this dissertation focused on the complex
question of how sound system optimization affects the intelligibility of reinforced
speech. As an early attempt in the deliberation process, the goal of the project
was to deconstruct this larger question, identifying noteworthy research avenues
and following several of the identified paths. As many of the potential research
questions were unknown at the onset of the project, an ecological approach was
adopted, studying real-world reinforcement scenarios in an actual performance
space.
Through the use of subjective and objective testing methods, many
potential research paths were discovered and a variety of hypotheses were
evaluated. The following two sections discuss the findings of this series of
studies, including both questions answered and unanswered.
Through subjective testing, it was found that all three of the experimental
variables used in this research project (SNR, delay time and array geometry) had
significant first-order effects on the intelligibility of reinforced speech. Through
stratified analysis of the various data sets, it was found that all three of the
second-order interactions between these variables were also significant.
The observed effects and relationships are as follows:
1) Delay times greater than 30 ms have an effect for the horizontal array
geometry.
2) Very short delay times (0.3 ms) have an effect for the vertical array
geometry.
3) Scores for the horizontal array geometry were found to be lower than
scores for the vertical array geometry.
4) As SNR decreases, the magnitude of the effects of delay time and array
geometry also decreases.
When the results of subjective tests were compared with the results of
objective measurements (STI), both correlations and discrepancies were
uncovered:
1) STI appears to accurately identify the effects of delay times greater than
30 ms.
2) STI also accurately predicts that scores are generally lower for the
horizontal array type.
3) STI predicts a larger negative effect, versus the results of subjective
testing, for delay times in the 5 ms–30 ms range.
The results of this research project, however, indicate that not all loudspeaker
arrays are equal.
For the vertically-oriented point-destination array, the comb filter created
by the very short time offset had a noticeable effect on intelligibility scores. For
this same array, the addition of a 5 ms delay (4.7 ms total offset) improved scores,
indicating that misalignment improves intelligibility. For the horizontally-
oriented point-source array, a very similar comb filter was present in the
frequency response for the 0 ms condition. For this array, the addition of a 5 ms
delay did not improve intelligibility scores. The conditions studied with these two
arrays are analogous to the difference between acoustical and electrical
summation in a sound system. In one case, the comb filtering is created through
summation at the listener; in the other, the comb filter exists in the signal received
from each loudspeaker.18 The results of this study suggest that misalignment is
effective at combating the effects of comb filters created by acoustical summation,
but is not effective at improving the intelligibility of “pre-combed” signals.
Additionally, the trend of lower scores for the horizontal vs. vertical arrays may
be due to the inherent comb filter in the response of the horizontal array, or to the
existence of a physical sound source in front of the listener. Further investigation
is, of course, required to rule out the effects of the complex variable array
geometry.
Though this research project was not charged with determining the
appropriate value for the integration time (if any) to be used with regard to fusion,
the results do suggest one conclusion. Given that delay times begin to have a
significant effect in the region between 30 ms–40 ms, and the fact that very short
delay times can have an effect, the author proposes that fusion may indeed play a
role in speech intelligibility for delay times less than 30 ms–40 ms, and that the
effects of very short delay times on intelligibility could be due to frequency
response anomalies rather than a lack of fusion.
18 For the case of the horizontal array, the comb filter is created via acoustical
summation of the signals from multiple drivers within each loudspeaker. As the
individual drivers were not delayed relative to each other, the analogy of electrical
summation still holds.
In addition to results regarding very short delay times, the studies in this
research project uncovered an interaction between delay time and array geometry
for longer delay times. Results indicate that multiple arrivals have a greater
potential to negatively affect intelligibility for the horizontal array geometry. As
this relationship was also, to some degree, detected by STI measurements, it is
unlikely that binaural listening is the sole cause of this interaction.
Finally, results from the comparison of subjective and objective
assessment methods clearly indicate that objective measurements can provide
inaccurate results for conditions involving multiple arrivals. In terms of
measurement reliability, these results imply that objective measures should not
be used to evaluate the intelligibility of time-delayed multiple arrivals. Further,
prior to performing objective measurements, impulse response measurements are
needed to ensure that time-delayed multiple arrivals are not present in the
received test signal.
Through the course of this project, many potential research questions were
uncovered. Several of these questions have been addressed. Alas, owing to a
variety of factors, a number of questions have not. The charge of research,
however, is not restricted to the answering of questions. Rather, research should
create questions as well – both to sustain, and to spark new interest. This section
contains a number of prospective research questions for future study.
The first, and perhaps most obvious, avenue for future study is the
deconstruction of the compound variable array geometry. This research project
has shown that differences do exist between the relative intelligibilities of the two
array geometries. The question remains as to what factors cause these differences.
Isolation and control of the variables plane (medial vs. horizontal) and array focus
(point source vs. point destination) could aid in determining whether, or to what
degree, the observed effect of array geometry is due to hearing method (binaural
vs. monaural), on- vs. off-axis listening, equalization location or the existence of a
The motivation for this research project manifested from the author’s own
work as a sound system designer and system engineer. Often encountering
situations in which design and optimization decisions were required, yet no
information was available to guide these decisions, best-fit solutions were usually
found through a process of trial, error and modification. While much of the
19 The question was asked during the hearing acuity test: “What ear do you use
when talking on the telephone?”
available knowledge in the field of live reinforcement has been garnered by such
empirical means, a point can be reached wherein these methods are incapable of
providing further illumination. It is at that point that structured, scientific
research is required.
The author was recently asked how the knowledge gained from this
research project will affect his future design and optimization decisions. While it
will take time to fully digest these findings and assess their practical implications,
several decision-making aids are already apparent with regard to alignment.
Multiple arrivals with delay times ranging between 5 ms–30 ms do not
significantly affect intelligibility, though they do affect sound quality. Thus, if
intelligibility is of paramount importance in a given reinforcement situation,
intentional misalignment of vertically oriented point-destination arrays is
warranted for the preservation of intelligibility at the cost of the overall sound
quality of the vocal reinforcement system. Conversely, one may place higher,
though not sole, priority on sound quality. Though the effects of very short delay
times are not yet known for this type of array, due to the smaller region of overlap
and volatility, the use of aligned, horizontally oriented point-source arrays will
yield better overall quality and negatively affect intelligibility for the smallest
number of people.
The question of how to optimally optimize a sound system is far from
answered. While many aspects of the greater question have been clarified, many
questions remain. It is with great pleasure, and great reverence for the many
scientists, researchers and sound engineers who have framed the grimoire of
sound system engineering, that the author offers this dissertation as a contribution
to the ongoing effort to answer these questions, and to increase the body of
knowledge in the field of live sound reinforcement.
Reference List
[2] Ahnert, W., Feistel, R., “EARS Auralization Software,” J. Audio Eng. Soc.,
41(11), 894-904, 1993.
[4] Ahnert, W., Feistel, S., Maier, T., Miron, A. R., “Loudspeaker Time
Alignment using Live Sound Measurements,” from the proceedings of the 124th
convention of the Audio Engineering Society, Amsterdam, the Netherlands,
Preprint 7433, May 2008.
[6] ANSI, “American National Standard for an Occluded Ear Simulator,” (ANSI
S3.25-1979), American National Standards Institute, New York, NY, 1979.
[7] ANSI, “Specification for Manikin for Simulated in situ Airborne Acoustic
Measurements,” (ANSI S3.36-1985), American National Standards Institute, New
York, NY, 1985.
[8] Antilla, M., Kataja, J., Valimaka, V., “Sound Directivity Control Using
Striped Panel Loudspeakers,” from the proceedings of the 110th convention of the
Audio Engineering Society, Amsterdam, The Netherlands, Preprint 5306, May
2001.
[10] Augspurger, G., Brawley, J., “An Improved Colinear Array,” from the
proceedings of the 74th convention of the Audio Engineering Society, New York,
NY, Preprint 2047, Oct. 1983.
[11] Augspurger, G., Bech, S., Brook, R., Cohen, E., Eargle, J., Schindler, T. A.,
“Use of Stereo Synthesis to Reduce Subjective/Objective Interference Effects:
The perception of Comb Filtering, Part 2,” from the proceedings of the 87th
convention of the Audio Engineering Society, New York, U.S.A., Preprint 2862,
Oct. 1989.
[13] Bauer, B. B., Rosenheck, A. J., Abbagnaro, L. A., “External-Ear Replica for
Acoustical Testing,” J. Acoust. Soc. Am. 42(1), 204-207, July 1967.
[14] Bech, S., Zacharov, N., Perceptual Audio Evaluation – Theory, Method and
Application, John Wiley & Sons Ltd., West Sussex, England, 2006.
[15] Bell, D. W., Kreul, E. J., Nixon, J. C., “Reliability of the Modified Rhyme
Test for Hearing,” J. of Speech and Hearing Research, 15, 287-295, 1972.
[17] Beranek, L. L., Concert and Opera Halls: How They Sound, Acoustical
Society of America, Woodbury, New York, 1996.
[19] Beutalmann, R., Brand, T., Kollmeier, B., “Prediction of Binaural Speech
Intelligibility with Frequency-Dependent Interaural Phase Differences,” J. Acoust.
Soc. Am. 126(3), 1359-1368, Sept. 2009.
[20] Beutelmann, R., Brand, T., Kollmeier, B., “Revision, Extension, and
Evaluation of a Binaural Speech Intelligibility Model,” J. Acoust. Soc. Am.
127(4), 2479-2497, April 2010.
[21] Beyer, M. R., Webster, J. C., Dague, D. M., “Revalidation of the Clinical
Test Version of the Modified Rhyme Words,” J. of Speech and Hearing Research,
12, 374-378, 1969.
[22] Blauert, J., Pösselt, C., “Application of Modeling Tools in the Process of
Planning Electronic Room Acoustics,” from the proceedings of the 6th
international conference of the Audio Engineering Society, Nashville, TN, May
1988.
[25] Boner, C. P., Boner, C. R., “A Procedure for Controlling Room-Ring Modes
and Feedback Modes in Sound Systems with Narrow-Band Filters,” J. Audio Eng.
Soc., 13, 297-299, 1965.
[27] Breshears, V., Hinz, R., “An Integrated 3-Way Constant Directivity Speaker
Array,” from the proceedings of the 101st convention of the Audio Engineering
Society, Los Angeles, CA, Preprint 4323, Nov. 1996.
[29] Brüel & Kjaer, Head and Torso Simulator Type 4128. Product Information.
[31] Burandt, U., Pösselt, C., Ambrozus, S., Hosenfeld, M., Knauff, V.,
“Anthropometric Contribution to Standardizing Manikins for Artificial-Head
Microphones and to Measuring Headphones and Ear Protectors,” Applied
Ergonomics, 22.6, 373-378, 1991.
[33] Burkhard, M. D., “Measuring the Constants of Ear Simulators,” J. Audio Eng.
Soc., 25(12), 1008-1015, 1977.
[34] Cabot, R. C., “Equalization, Current Practice and New Directions,” from the
proceedings of the 6th international conference of the Audio Engineering Society,
Nashville, Tennessee, May 1988.
[35] Christensen, F., Jensen, C. B., Møller, H., “The Design of VALDEMAR –
An Artificial Head for Binaural Recording Purposes,” from the proceedings of the
109th convention of the Audio Engineering Society, Los Angeles, California,
Preprint 5253, Sept. 2000.
[38] Dau, T., Kollmeier, B., Kohlrausch, A., “Modeling Auditory Processing of
Amplitude Modulation. II. Spectral and Temporal Integration,” J. Acoust. Soc.
Am. 102(5), Pt. 1, Nov. 1997.
[39] Davis, C., “Measurement of %Alcons,” J. Audio Eng. Soc., 34(11), 1986.
[40] Davis, D., “Equivalent Acoustic Distance,” J. Audio Eng. Soc., 21(8), 646-
649, 1973.
[42] Davis, D., Davis, C., Sound System Engineering, 2nd Edition, Focal Press,
Newton, MA, 1997.
[43] Davis, D., Davis, C., If Bad Sound Were Fatal, Audio would be the Leading
Cause of Death, 1stBooks, Bloomington, IN, 2004.
[45] Dubbelboer, F., Houtgast, T., “A Detailed Study on the Effects of Noise on
Speech Intelligibility,” J. Acoust. Soc. Am., 122(5), 2865-2871, 2007.
[46] Eargle, J., Loudspeaker Handbook, 2nd Edition, Kluwer Academic Publishers,
Norwell, MA, 2003.
[47] Eargle, J., Gander, M., “Historical Perspectives and Technology Overview of
Loudspeakers for Sound Reinforcement,” J. Audio Eng. Soc., 52(4), 412-432,
2004.
[52] Elkins, E. F., “Evaluation of Modified Rhyme Test Results from Impaired-
and Normal-Hearing Listeners,” J. of Speech and Hearing Research, 14, 589-595,
1971.
[53] El-Saghir, E., Maher, M., “Virtual Shifting of Speaker Array Components,”
from the proceedings of the 106th convention of the Audio Engineering Society,
Munich, Germany, Preprint 4893, May 1999.
[55] Fairbanks, G., “Test of Phonetic Differentiation: The Rhyme Test,” J. Acoust.
Soc. Am., 30(7), 596-600, Jul. 1958.
[56] Fidlin, P., Carlson, D., “The Basic Concepts and Problems Associated with
Large-Scale Concert Sound Loudspeaker Arrays,” from the proceedings of the
86th convention of the Audio Engineering Society, Hamburg, Germany, Preprint
2802, Mar. 1989.
[58] Fletcher, H., Galt, R. H., “The Perception of Speech and its Relation to
Telephony,” J. Acoust. Soc. Am. 22(2), 89-151, Mar. 1950.
[65] Green, I., Maxfield, J., “Public Address Systems”, J. Audio Eng. Soc., 25(4),
184-195, 1977.
[67] Haas, H., “The Influence of a Single Echo on the Audibility of Speech,” J.
Audio Eng. Soc., 20(2), 146-159, 1972.
[68] Harrell, J., “End-Fire Line Array of Loudspeakers,” J. Audio Eng. Soc.,
43(7/8), 581-591, 1995.
[69] Heil, C., “Sound Fields Radiated by Multiple Sound Source Arrays,” from
the proceedings of the 92nd convention of the Audio Engineering Society, Vienna,
Austria, Preprint 3269, Mar. 1992.
[70] Hilliard, J., “Unbaffled Loudspeaker Column Arrays,” J. Audio Eng. Soc.,
672-673, 1970.
[71] Hochhaus, L. and Antes, J. R., “Speech identification and ‘knowing that you
know’.” Journal of Perception and Psychophysics, 13, 131-132, 1973.
[73] Holley, S. C., Lerman, J., Randolph, K., “A Comparison of the Intelligibility
of Esophageal, Electrolaryngeal, and Normal Speech in Quiet and in Noise,” J.
Communication Disorders, 16, 143-155, 1983.
[74] Holman, T., “New Factors in Sound for Cinema and Television,” J. Audio
Eng. Soc., 39(7/8), 529-539, 1991.
[75] House, A. S., Williams, C. E., Hecker, M. H. L., Kryter, K. D., “Articulation-
Testing Methods: Consonantal Differentiation with a Closed-Response Set,” J.
Acoust. Soc. Am., 37(1), 158-166, 1965.
[77] Houtgast, T., Steeneken, H. J. M., “A Review of the MTF Concept in Room
Acoustics and its use for Estimating Speech Intelligibility in Auditoria,” J. Acoust.
Soc. Am., 77(3), 1069-1077, Mar. 1985.
[78] Humes, L. E., Boney, S., Loven, F., “Further Validation of the Speech
Transmission Index (STI),” J. Speech and Hearing Research, 30, 403-410, 1987.
[81] IEC, “Provisional Head and Torso Simulator for Acoustic Measurements on
Air Conduction Hearing Aids,” (IEC 959, First Edition), International
Electrotechnical Commission, Geneva, Switzerland, 1990.
[82] ISO, “Background Acoustic Noise Levels in Theatres, Review Rooms and
Dubbing Rooms,” (ISO-9568), International Standards Organization, Geneva,
Switzerland, 1993.
[85] ITU, “Head and Torso Simulator for Telephonometry,” (ITU-T P.58),
International Telecommunications Union, Geneva, Switzerland, 1996.
[87] Janssen, J. H., “A Method for the Calculation of Speech Intelligibility under
Conditions of Reverberation and Noise,” Acustica, 7, 305-310, 1957.
[88] Keele, D. B., “Effective Performance of Belles Arrays,” J. Audio Eng. Soc.,
38(10), 723-748, 1990.
[89] Keele, D. B., “Implementation of Straight Line and Flat Panel Constant
Beamwidth Transducer (CBT) Loudspeaker Arrays using Signal Delays,” from
the proceedings of the 113th convention of the Audio Engineering Society, Los
Angeles, CA, Preprint 5653, Oct. 2002.
[95] Kreul, E. J., Nixon, J. C., Kryter, K. D., Bell, D. W., Lang, J. S., Schubert, E.
D., “A Proposed Clinical Test of Speech Discrimination,” J. Speech and Hearing
Research, 11, 536-552, 1968.
[96] Kreul, E. J., Bell, D. W., Nixon, J. C., “Factors Affecting Speech
Discrimination Test Difficulty,” J. of Speech and Hearing Research, 12, 281-287,
1969.
[97] Kryter, K. D., “Methods for the Calculation and use of the Articulation
Index,” J. Acoust. Soc. Am., 34(11), 1689-1697, 1962.
[98] Kuriyama, J., Tokko, T., Suzuki, S., Hashiguchi, Y., “A Compact Digital
Signal Processing System Consisting of DSP Modules,” from the proceedings of
the 86th convention of the Audio Engineering Society, Hamburg, Germany,
Preprint 2771, Mar. 1989.
[99] Kusumoto, A., Arai, T., Kinoshita, K., Hodoshima, N., Vaughn, N.,
“Modulation Enhancement of Speech by a Pre-Processing Algorithm for
Improving Intelligibility in Reverberant Environments,” Speech Communication,
45, 101-113, 2005.
[100] Larcher, V., Vandernoot, G., Jot, J., “Equalization Methods in Binaural
Technology,” from the proceedings of the 105th convention of the Audio
Engineering Society, San Francisco, California, Sept. 1998.
[101] Leembruggen, G., Packer, N., Goldburg, B., Backstrom, D., “Development
of a Shaded, Beam Steered Line Array Loudspeaker with Integral Amplification
and DSP Processing,” from the proceedings of the 105th convention of the Audio
Engineering Society, San Francisco, California, Sept. 1998.
[102] Leembruggen, G., Hippler, M., Mapp, P., “Further Investigations into
Improving STI’s Recognition of the Effects of Poor Frequency Response on
Subjective Intelligibility,” from the proceedings of the 128th convention of the
Audio Engineering Society, London, UK, May 2010.
[103] Leonard, J., Theatre Sound, Routledge, New York, NY, 2001.
[105] Lochner, J. P. A., Burger, J. F., “The Subjective Masking of Short Time
Delayed Echoes by Their Primary Sounds and Their Contribution to the
Intelligibility of Speech,” Acustica, 8, 1-10, 1958.
[109] Mapp, P., “A Comparison between STI and RASTI Speech Intelligibility
Measurement Systems,” from the proceedings of the 100th convention of the
Audio Engineering Society, Copenhagen, Denmark, Preprint 4279, May 1996.
[111] Mapp, P., “From Loudspeaker to Ear – Measurement and Reality,” from the
proceedings of the 12th UK conference of the Audio Engineering Society, London,
UK, April 1997.
[112] Mapp, P., “Relationships between Speech Intelligibility Measures for Sound
Systems,” from the proceedings of the 112th convention of the Audio Engineering
Society, Munich, Germany, Preprint 5604, May 2002.
[113] Mapp, P., “Modifying STI to Better Reflect Subjective Impression,” from
the proceedings of the 21st international conference of the Audio Engineering
Society, St. Petersburg, Russia, June 2002.
[116] Mapp, P., “Intelligibility – Winning the Acoustics Battle,” from the
proceedings of the 18th U.K. conference of the Audio Engineering Society,
London, England, Apr. 2003.
[118] Mapp, P., “Systematic & Common Errors in Sound System STI and
Intelligibility Measurements,” from the proceedings of the 117th convention of the
Audio Engineering Society, San Francisco, CA, Preprint 6271, Oct. 2004.
[122] Matthews, R., Legg, S., Charlton, S., “The Effect of Cell Phone Type on
Drivers Subjective Workload During Concurrent Driving and Conversing,”
Accident Analysis and Prevention, 35, 451-457, 2003.
[123] McBride, M., Hodges, M., French, J., “Speech Intelligibility Differences of
Male and Female Vocal Signals Transmitted Through Bone Conduction in
Background Noise: Implications for Voice Communication Headset Design,” Int.
J. of Ind. Ergonomics, 38, 1038-1044, 2008.
[124] McCarthy, B, Sound Systems: Design and Optimization, 1st Edition, Focal
Press, Burlington, MA, 2007.
[128] Meyer, J., Seidel, F., “Large Arrays: Measured Free-Field Polar Patterns
Compared to a Theoretical Model of a Curved Surface Source,” J. Audio Eng.
Soc., 38(4), 260–270, 1990.
[129] Minnaar, P., Olsen, S., Christensen, F., Møller, H., “Localization with
Binaural Recordings from Artificial and Human Heads,” J. Audio Eng. Soc.,
49(5), 323-336, 2001.
[131] Mochimaru, A., “An Advanced Concept for Loudspeaker Array Design in
Sound Reinforcement Systems,” from the proceedings of the 91st convention of
the Audio Engineering Society, New York, NY, Preprint 3139, Oct. 1991.
[133] Møller, H., Hammershøi, D., Jensen, C., Sørensen, M., “Transfer
Characteristics of Headphones Measured on Human Ears,” J. Audio Eng. Soc.,
43(4), 203-217, 1995.
[134] Møller, H., Jensen, C., Hammershøi, D., Sørensen, M., “Using a Typical
Human Subject for Binaural Recording,” from the proceedings of the 100th
convention of the Audio Engineering Society, Copenhagen, Denmark, Preprint
4157, May 1996.
[135] Møller, H., Hammershøi, D., Jensen, C., Sørensen, M., “Evaluation of
Artificial Heads in Listening Tests,” J. Audio Eng. Soc., 47(3), 83-100, 1999.
[136] Nabelek, A. K., Mason, D., “Effect of Noise and Reverberation on Binaural
and Monaural Word Identification by Subjects with Various Audiograms,” J. of
Speech and Hearing Research, 24, 375-383, 1981.
[138] Nixon, J. C., “Investigation of the Response Foils of the Modified Rhyme
Hearing Test,” J. of Speech and Hearing Research, 16, 658-666, 1973.
[139] Nye, P. W., Gaitenby, J., “Consonant Intelligibility in Synthetic Speech and
in a Natural Control (Modified Rhyme Test Results),” Haskins Laboratories
Status Report on Speech Research, SR-33, 77-91, 1973.
[145] Peutz, V. M. A., “Speech Information and Speech Intelligibility,” from the
proceedings of the 85th convention of the Audio Engineering Society, Los
Angeles, CA, Preprint 2732, Nov. 1988.
[148] Rife, D., Vanderkooy, J., “Transfer Function Measurement with Maximum
Length Sequences,” J. Audio Eng. Soc., 37(6), 419-444, 1989.
[149] Rijk, K., Breuer, F., Peutz, V., “Speech Intelligibility in Some German
Sports Stadiums,” J. Audio Eng. Soc., 39(1/2), 37-46, 1991.
[150] Rumsey, F., Spatial Audio, Focal Press, Oxford, England, 2001.
[151] Sachs, R. M., Burkhard, M. D., “Earphone Pressure Response in Ears and
Couplers,” from the proceedings of the 83rd meeting of the Acoustical Society of
America, Buffalo, NY, Apr. 1972.
[152] Sachs, R. M., Burkhard, M. D., “Zwislocki Coupler Evaluation with Insert
Earphones,” Technical Report 8, Knowles Electronics, Nov. 1972.
[153] Scheaffer, R. L., McClave, J. T., Probability and Statistics for Engineers,
Duxbury Press, Belmont, California, 1995.
[161] Shinn-Cunningham, B., “Distance Cues for Virtual Auditory Space,” from
the proceedings of the 1st Pacific Rim conference on multimedia of the Institute
of Electrical and Electronics Engineers, Sydney, Australia, Dec. 2000.
[163] Shirley, B., Churchill, C., “The Effect of Stereo Crosstalk on Intelligibility:
Comparison of a Phantom Stereo Image and a Central Loudspeaker Source,” J.
Audio Eng. Soc., 55(10), 852-863, 2007.
[165] Smith, D., “Discrete Element Line Arrays – Their Modeling and
Optimization,” J. Audio Eng. Soc., 45(11), 949-964, 1997.
[167] Steeneken, H., Houtgast, T., “A Fast Method for the Determination of the
Intelligibility of Running Speech,” from the proceedings of the 44th convention of
the Audio Engineering Society, Rotterdam, The Netherlands, Feb. 1973.
[170] Teuber, W., Völker, E., “Improved Speech Intelligibility by Use of Time
Delays in Sound Systems,” from the proceedings of the 84th convention of the
Audio Engineering Society, Paris, France, Preprint 2562, Mar. 1988.
[172] Toole, F. E., “The Acoustics and Psychoacoustics of Headphones,” from the
proceedings of the 2nd international conference of the Audio Engineering Society,
Anaheim, CA, Preprint 1006, May 1984.
[174] Ureda, M., “‘J’ and ‘Spiral’ Line Arrays,” from the proceedings of the 111th
convention of the Audio Engineering Society, New York, NY, Preprint 5485,
Sept. 2001.
[175] Ureda, M., “Line Arrays: Theory and Applications,” from the proceedings
of the 110th convention of the Audio Engineering Society, Amsterdam, The
Netherlands, Preprint 5304, May 2001.
[178] Van der Wal, M., Start, E. W., De Vries, D., “Design of Logarithmically
Spaced Constant-Directivity Transducer Arrays,” J. Audio Eng. Soc., 44(6),
497-507, 1996.
[180] Wickens, T. D., Multiway Contingency Tables Analysis for the Social
Sciences, Psychology Press, New York, 1989.
[183] Zwicker, E., Fastl, H., Psychoacoustics: Facts and Models, 2nd updated
Edition, Springer, Berlin, Germany, 1999.
[184] Zwislocki, J., “An Acoustic Coupler for Earphone Calibration,” Report
LSC-S-7, Laboratory of Sensory Communication, Syracuse University, Syracuse,
NY, 1970.
[185] Zwislocki, J., “An Ear-Like Coupler for Earphone Calibration,” Report
LSC-S-9, Laboratory of Sensory Communication, Syracuse University, Syracuse,
NY, 1971.