
[TES 2012 KEYNOTE]

Encounters in the Republic of Heaven
by Trevor Wishart
As a composer of electroacoustic music, I have a particular interest in the human
voice. When performing as a solo free improviser, I explore the outer reaches of
my own voice, using only amplification. The electroacoustic work Tongues of Fire
takes as its source material a short fragment of such an improvisation. I’ve also
developed new methods of notating extended vocal sounds and used these to
make fully notated scores for performance, such as Anticredos and the VOX
cycle of vocal works for professional performers. In the studio I’ve concentrated
on developing software tools and musical approaches for organising sounds
collected from the real world — traffic, birdsong and, in particular, human
speech and other utterances. The signal processing software is written in “C”
and forms the core of the Composers Desktop Project (CDP) suite of programs,
while the Sound Loom, its independent graphic interface, is written in Tcl/Tk.
More details about my approaches to both sound processing and musical
organization can be found in the books On Sonic Art (1996), Audible Design: A
Plain and Easy Introduction to Sound Composition (1994) and the recently
published Sound Composition (2012).

Electroacoustics and the Voice


The first software tools I wrote were concerned with morphing one
recognizable sound into another. Developed at IRCAM in the 1980s, these
tools took recordings of time-extended speech syllables (e.g., “zz”) and
morphed them into similar environmental sounds (in this case, a swarm
of bees), interpolating between the amplitudes and frequencies of the
data in the Phase Vocoder representations of the sound spectra.
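
In outline, the interpolation might look like this (a simplified Python sketch, not the original C code; it assumes the two sounds have already been analysed into phase-vocoder frames of equal size, and the geometric frequency interpolation is one plausible choice rather than the method actually used):

    import numpy as np

    def morph_frames(src_amp, src_frq, goal_amp, goal_frq, t):
        # src_amp/src_frq and goal_amp/goal_frq are per-channel amplitude
        # and frequency arrays from the two analyses; t runs from
        # 0.0 (pure source) to 1.0 (pure goal).
        amp = (1.0 - t) * src_amp + t * goal_amp
        # Geometric interpolation of frequencies, so the glide is linear
        # in perceived pitch (straight linear interpolation also works).
        safe = np.maximum(src_frq, 1e-9)
        frq = safe * (goal_frq / safe) ** t
        return amp, frq
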
Whilst exploring these possibilities, I also discovered alternative ways to
achieve convincing sound morphs from vocal sounds. For example, the
vocal syllable “ko→u” can be morphed gradually into a bell sound through
a series of intermediate steps. At each stage the spectrum of the source
is stretched further in frequency in a non-linear way so that the simple
relationships between the partial frequencies of the original sound (they
are multiples of the fundamental frequency) are gradually negated and
the spectrum becomes increasingly inharmonic. The bell-like result also
depends on the specific morphology of the vocal syllable beginning with a
brief broad-band attack (“k”) which will eventually morph into the clang of
the bell, settling on a steady pitch (“o”) from which the high frequencies
are gradually filtered out by narrowing the vocal cavity (“o→u”). This
mimics the way in which the high frequencies of a bell’s resonance decay
more rapidly than the low frequencies, as they are more quickly absorbed
by the environment.
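
One plausible way to realize such a non-linear stretch is a power-law warp of the partial frequencies (a sketch, not the CDP implementation; the exponent s and its values below are illustrative, stepped upward over the intermediate stages of the morph):

    import numpy as np

    def stretch_partials(freqs, fundamental, s):
        # Power-law stretch: the partial at n * fundamental moves to
        # fundamental * n**s. With s = 1.0 the spectrum is unchanged;
        # s > 1.0 pushes the upper partials progressively sharper, so the
        # integer ratios of the harmonic series are gradually negated
        # and the spectrum becomes increasingly inharmonic, bell-like.
        n = np.asarray(freqs, dtype=float) / fundamental
        return fundamental * n ** s

    # Successive stages of the morph might step s from 1.0 towards 1.2:
    harmonics = 220.0 * np.arange(1, 9)
    for s in (1.0, 1.05, 1.1, 1.2):
        print(stretch_partials(harmonics, 220.0, s))
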
However, such processes take a certain perceptual time to evolve (so that
we have sufficient time to recognize both the source and the goal, as well
as the morph between them) and are not directly appropriate to dealing
with the rapidly changing spectra of normal speech.
Two particular aspects are worth emphasizing about working with speech
recordings. In general, creating electroacoustic music for spoken voices
differs in kind from traditional scored vocal music. For notated musics
(whether it’s Renaissance vocal music or opera, for example) there’s a
performance tradition that determines the kind of vocal sonority or
articulation that you can expect. Individual performers will bring slightly
different qualities to the interpretation, but there’s a consistency of vocal
quality that you can rely upon. In this situation the voice aspires to the
condition of an “instrument”, something with a reliable, reproducible
timbre. In contrast, amplified popular vocalists — Elvis, Mick Jagger,
Björk, Janis Joplin, Lily Allen — all trade upon the unique quality of their
vocal production to market their particular brand of vocality. Once we
begin to deal with recorded speech in the electroacoustic domain, we are
much closer to the popular music model than the classical notated
tradition of composing for the voice; we are faced with the sound of a
unique human being and the particular quirks of the recorded materials
we have collected.
Secondly, when words and music meet in the traditional context, we are
usually concerned with “setting” the words to the music — providing some
appropriate sequence of pitches to complement the sequence of timbres
(and meanings) provided by the text. In contrast, what interests me is
uncovering the musical features inherent in spoken language and using
these as the basis of sonic organization.
Previous Approaches to Working with the Voice
In a previous work, Globalalia (2004), I decided to organize the materials
at the level of the syllable. Poetically speaking, the piece is concerned
with what we have in common as human speech communicators.
Although there are many millions of words in all the world’s languages,
and one language may be incomprehensible to the speakers of a different
language, all these languages are built from a much smaller set of
sounds, the syllables. So Globalalia is a celebration of human speech
through this shared vocabulary of sound objects.
I began by asking colleagues around the world to collect speech from
local radio stations. In addition, a friend who is a language teacher had
access to a worldwide array of broadcast material via her two satellite TV
dishes. After this collection process, I had accumulated sources in 26
different languages and proceeded to cut these into their constituent
syllables. The editing process is less straightforward than it appears
because, in real speech, syllables are part of a continuous stream and flow
into one another. Eventually I devised a program which took into account
the slight overlap between syllables when editing them apart but (as is
often the case) only perfected this program when I had almost finished
the task of dissecting my material. This resulted in a set of over 8300
source sounds.
To help organize this material I created a musical database for the Sound
Loom, in which arbitrary properties (anything from the generally agreed
“pitch” or “duration” to the personally defined “fuzzy” or “I like this”) can
be assigned user-defined values of any type (numeric, verbal, codings,
etc.), together with tools to search the sources for specific materials. For
Globalalia, I used the properties:
• original language;
• gender;
• start consonant or consonant cluster (e.g., “skr”), if any;
• vowel or vowel glide (e.g., a→oo);
• end consonant or consonant cluster, if any;
• pitch or pitch glide;
• vocal quality (e.g., shouted, raspy, breathy, etc.).
I was then able to interrogate the database to gather together sounds
with specific properties, e.g.: all syllables beginning with “m” and the
vowel “a”, in a particular pitch range, with gliding pitch.
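
In outline, such a property database can be as simple as a list of records with free-form fields; a toy Python sketch (the field names and file names are hypothetical stand-ins for the properties listed above):

    syllables = [
        {"file": "syl_0001.wav", "language": "Dutch", "gender": "f",
         "start": "m", "vowel": "a", "end": None,
         "pitch": "glide", "quality": "breathy"},
        # ... one entry per source sound (over 8300 in Globalalia)
    ]

    def search(db, **criteria):
        # Return all entries whose properties match every criterion.
        return [s for s in db
                if all(s.get(key) == value for key, value in criteria.items())]

    # e.g. all syllables beginning with "m", vowel "a", with gliding pitch:
    hits = search(syllables, start="m", vowel="a", pitch="glide")
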
The syllables have differing musical properties. For example, the syllable
“ma” can be time-stretched as a whole to “mmmmmmaaaaaaa” without
losing its perceived recognizability, whereas the syllable “ka” cannot be
similarly time-stretched, as the “k” is a transient sound (a rapid change of
state) and is destroyed by the time-stretching process; the iterative
(rolled) “rr” is even more problematic. This suggests that different musical
processes might be appropriate to composing with the different sound
objects. Hence Globalalia is essentially a set of studies worked on
different sets of syllables (e.g., “ma” and Dutch “rrr”, or the attacked
sibilants “ts”, “ks” and “ps”) and is bound together in a rondo-like
structure by a thematic utterance constructed from a wide variety of
syllables taken from many languages and individual speakers, which
recurs, with variation, at key points in the piece. This may sound a little
dull, but this is not the case — listen to the following example (the “pi”,
“pa”, “bo” etc. study).
Audio 1 (1:09). Excerpt of Trevor Wishart’s Globalalia (2004) featuring a passage using
a small set of syllables spoken by several speakers.

The slightly humorous plucked-string-like sounds towards the end result
from time-stretching the very short vowels of these syllables, which are
unstable in pitch, so that we hear gliding tones once they are time-
stretched.
With each musical project I like to tackle some new technical challenge,
as I enjoy this aspect of composing. In Globalalia, the main technical
innovation was a set of programs to time-stretch vocal iteratives, like
rolled “rr” sounds or vocal grit. The technical problem arises from the fact
that iteratives are sequences of attacked events. As we’ve discussed, the
“k” in “ka” or the “p” in “pa” cannot simply be time-stretched if we are to
preserve their “k”-ness or “p”-ness, but we could use a time-stretching
function that preserved the initial “k” or “p” part of the sound but then
time-stretched the “a” tail. This is what is happening in the plucked-
string-like sounds above.
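
Such an attack-preserving stretch can be driven by a time-varying stretch factor; a minimal sketch (the function name and ramp shape are illustrative, and the factor would be fed to any time-stretching routine that accepts a per-frame value):

    import numpy as np

    def stretch_factor(t, attack_dur, max_stretch, ramp_dur=0.05):
        # 1.0 (no stretch) over the attack, then a smooth ramp up to
        # max_stretch for the tail, preserving the "k"-ness or "p"-ness
        # of the onset while the vowel is extended.
        if t < attack_dur:
            return 1.0
        ramp = min((t - attack_dur) / ramp_dur, 1.0)
        return 1.0 + ramp * (max_stretch - 1.0)

    # e.g. for a "ka" with a 40 ms attack, stretching the "a" tail 8x:
    factors = [stretch_factor(t, 0.04, 8.0) for t in np.arange(0, 0.5, 0.01)]
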
Unfortunately, with iteratives we have a whole set of attacks and would
need to define a stretching function which varied with every attack in the
source. Also, we would then end up with a sound like “r---r---r---r---r---”,
a series of widely time-separated tongue flaps, which is not really what
we want. Rather we want to convert “rrrr” into “rrrrrrrrrrrrrrrr.”
Superficially it would seem obvious merely to form a loop of our short set
of tongue flaps and repeat the material. Unfortunately, perceptually
speaking, this does not work. In fact, when I first did this and played back
the resulting sound, I thought I had played the wrong sound as the result
was utterly unlike the source — it sounded self-evidently synthetic. It
seems that the human brain has a very good exact-repetition detector
which causes it to categorize sounds, quite spontaneously, as “unnatural”
or “implausible”, so our new sound is not heard as related to our natural
source. The tongue flaps in a natural rolled “r” are all of slightly different
loudness, slightly irregularly timed and slightly different in sound quality.
To create a plausible extension of such a sound we need to preserve this
subtle randomness.
So the new process first searches for the attacks in the sound (which
repeat approximately every 50 milliseconds) with an appropriately sized
envelope-tracking window (about three times smaller than the typical flap
duration). This gives us an indication of where the tongue flaps are. The
process then searches away from the peaks of the flaps to find the
minimum energy troughs between the events and finally cuts the flaps
apart at (identically oriented) zero crossings. (The zero cuts avoid
introducing splices into the sounds, which might subtly alter the sonority
at this time scale). We then reconstruct our source, using random
permutations of this edited set of flaps. For example, with flap sequence
abcd, we generate random permutations of the order — e.g., dbac, cadb,
etc. — and then join these together avoiding repetitions between
permutations — e.g., abcd-cadb-dbac is fine, but abcd-dbac-cabd is not,
as it gives us a repetition of flap c. In this way, we preserve the
randomness of amplitude, timing and sonority of the original and the
resulting sounds are completely plausible — they appear to be source
recordings themselves. In the following example, the first sound of each
set is the source and the ensuing sounds are derived via this special time-
stretching process.
Audio 2 (0:15). Example of time stretching in Globalalia.
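
In outline, the permutation-joining step might look like this (a simplified Python sketch, not the CDP code; flaps stands for the list of tongue-flap segments already cut apart at zero crossings, and at least two flaps are assumed):

    import random

    def extend_iterative(flaps, n_permutations):
        # Rebuild a longer iterative ("rrrr" -> "rrrrrrrrrrrrrrrr") from
        # the edited flap segments. Each pass appends a random permutation
        # of the flaps, re-shuffling whenever the new permutation would
        # start with the flap that ended the previous one (the exact
        # repetition the ear detects as synthetic).
        out, last = [], None
        for _ in range(n_permutations):
            order = list(range(len(flaps)))
            random.shuffle(order)
            while order[0] == last:      # avoid a repeat across the join
                random.shuffle(order)
            out.extend(flaps[i] for i in order)
            last = order[-1]
        return out
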

Once, however, we have achieved this plausible extension we can go on
to develop the material in more radical ways — slowing the tempo,
synchronising the events in different streams, focusing the pitch quality of
the material through filtering, moving the streams in space — creating an
apparently surreal sound-landscape, where the listener might imagine
events recorded in an unfamiliar world rather than the output of a
synthetic mechanism.
Audio 3 (1:12). Excerpt of Globalalia showing the use of time-stretched elements in a
surreal soundscape.

There are two important things to note about working with speech at the
syllable level. First of all, we have eliminated the meaning or narrative
content of the material — we don’t need to deal with this at all, although
certain expressive aspects of speech utterance persist in the syllables
themselves and contribute to the way the music is organized. Even more
significantly, we have eliminated the perception of individual speakers,
dissolving language in a kind of universal Ur-speech where individual
human utterances are subsumed in a music of sonority.
A different approach to spoken language is used in The Division of Labour,
where the text plays the sonic equivalent of a melodic subject in
traditional music. Instead of a series of pitches which we can then
transform — by transposition, time expansion or contraction, modal
change from major to minor and so on, broadly conserving the original
sequence — we have a series of sound objects, the syllables, which can
form a similar template for variation. The words here are taken from
Adam Smith’s The Wealth of Nations, one of the sacred texts of our
materialist culture. (1) They describe the division of labour in an
Edinburgh pin factory and are sufficiently significant to appear on the
back of the British £20 note.

1. The full title is An Inquiry into the Nature and Causes of the Wealth of Nations.
Adam Smith is concerned to show that this process leads to an enormous
increase in industrial productivity. However, a side effect of the process is
that work becomes monotonous and unsatisfying. The piece, therefore,
takes the original text, spoken in a Scottish accent by Alex Gordon, an old
friend of mine, and then “divides the labour,” generating a diverse set of
variations on the original recorded material, in a sense musically
demonstrating the efficacy of the division of labour. Each variation
preserves the order of phrases in the original text, even where the
transformations are extremely radical. Towards the end of the piece,
however, the original text recurs in its original setting but the syllables
within each phrase have been scrambled so the text loses all meaning.

“Encounters” and the Music of Speech


Phrases
For the piece Encounters in the Republic of Heaven, I wanted to explore
musical features of natural speech which only become apparent on a
larger time scale, at the level of the spoken phrase — pitch contour (or
melody) and the implied harmonic field; tempo, meter and rhythm and,
especially, the sonority of individual voices. We are all able to instantly
recognize a large number of individual speakers from our friends and
acquaintances; would it be possible, by some clever technical process, to
extract and musically distil the essence of an individual human voice? A
second aim was to somehow represent the musical diversity of human
speech across an entire community, and for this reason I decided, for the
first time, to work in 8-channel surround sound so that the community of
speakers would surround the audience. As a result, as the project
developed, I spent a good deal of time extending almost all the CDP
software to work in a multi-channel context and developing new sound
spatialization tools (for example, to rotate the entire frame of an
8-channel scene), new approaches to reverberating or texturing over
surround sound and an environment to allow files with from one to
sixteen channels to be mixed together in any conceivable spatial
distribution.
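
As an illustration of the frame-rotation idea (not the actual CDP code), each channel of an 8-channel ring can be redistributed between the pair of loudspeakers adjacent to its rotated position, for instance with constant-power panning:

    import numpy as np

    def rotate_ring(frame, angle):
        # frame: float array of shape (samples, n) for a ring of n
        # loudspeakers (n = 8 here); angle: rotation in radians.
        n = frame.shape[1]
        out = np.zeros_like(frame)
        shift = angle / (2 * np.pi) * n          # rotation in channel units
        for k in range(n):
            pos = (k + shift) % n                # rotated position on the ring
            lo = int(pos) % n
            hi = (lo + 1) % n
            frac = pos - int(pos)
            # Constant-power pan between the two adjacent loudspeakers.
            out[:, lo] += frame[:, k] * np.cos(frac * np.pi / 2)
            out[:, hi] += frame[:, k] * np.sin(frac * np.pi / 2)
        return out
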
The idea for this project had been in mind for a long time, but collecting
recordings of natural speech is not straightforward. You can’t simply
wander up to someone in a pub, tell them they have a very interesting
voice and switch on a recorder! I imagined I would make such recordings
in my home region of Yorkshire, where I would more easily be accepted
as a regular, if slightly mad, member of the local community, but finding
the situations where recording might be appropriate and establishing
sufficient rapport with the recordees needs a good deal of organization
and local knowledge, plus high quality portable recording equipment and
transport (I don’t drive) and of course, the finance to proceed with the
project. I could not see a way forward with this idea until the post of
Composer Fellow was advertised at the University of Durham, in the North
East of England. This was not my home region, but sufficiently close and
sufficiently similar in industrial culture, to attract my interest. In addition,
the Durham post required input to the local community and experience in
electroacoustic music. This seemed an ideal opportunity so I applied for
and was appointed to the three-year post and proceeded with the project,
with Durham as my base.
The first year of the project was largely taken up with establishing
contacts in the community through existing local organizations — local
government leisure services, schools, dialect societies, community arts
projects, old people’s centres, local poets and so on — arranging
meetings and making recordings. I wanted to capture a cross-section of
human voices in the community, men and women of all ages; the
youngest person I recorded was 4 years old and the oldest 93. In schools,
I ran composing workshops for the children, where they composed their
own pieces based on speech rhythms, as a quid pro quo for my recording
work. In the recording sessions the youngest children talked freely, but
teenagers were more problematic. One approach was to ask the
headmaster to send me those children who talked too much in lessons (!).
Faced with a microphone, however, many teenagers were suddenly
tongue-tied. Only by getting together a group of at least three kids could
I guarantee animated speech. However, the problem now arose that they
interrupted one another’s utterances. Hence, separating the end-
overlapped phrases became a technical challenge and I developed new
spectral cleaning processes, integrating them with existing CDP programs,
to allow me to do this in a sufficiently detailed way. Recording older
people proved easier as there were many reminiscence groups or
discussion gatherings where people were happy to talk to others about
their life experiences and to share their opinions. In other situations,
providing a relaxed atmosphere for natural conversation to occur (rather
than an interview situation or a studio setting) meant that the incidental
noises of a house, a crackling fire or a pub had to be removed from the
recordings.
In this kind of project it’s not possible to choose in advance the exact
vocal characteristics of the people who will be recorded, so many more
voices were recorded than were finally used. Only after the recordings
had been gathered could choices be made about the materials. Firstly, I
wanted, in some way, to represent the diversity of human beings through
the diversity of the sounds of speech. So as well as representing each
gender and age band, I needed voices with contrasting sonic qualities to
present a diversity of sonic substance. Secondly, I had to decide which
recorded phrases I wanted to use in the piece — typically I would have
one to two hours of recordings of a voice but would end up using no more
than two minutes of material. Finally there was the “radio programme
editing” of the materials, removing hesitations, word or phrase repetition
and various glossolalia (but in some cases collecting these together for
musical use).
Cataloguing the Materials
In order to keep track of the material, I extended and developed the tools
associated with the source database, allowing me to enter melodic
contours and rhythmic shapes (graphically), and texts spoken as
properties of the sounds. Associated programs allowed me to statistically
analyse the melodic / harmonic, textual and rhythmic content of the
materials, and search for common melodic shapes or harmonic content,
common tempi, or common words or sub-words (e.g., “mem” in
“remember” and “memory”).
This was not entirely straightforward. For example, I developed ways to
search texts for similar sounding word starts and ends, but the bizarre
nature of English spelling made word endings particularly problematic. I
also developed a tool to graphically enter motivic (pitch timing)
information for each vocal phrase. I had already written software to
directly track the pitch of speech (allowing for interpolation over silences
and pitch-free sibilants). In general, the pitch contours of speech are free
of lattice constraints (the pitches don’t lie on some pre-existing lattice like
the tempered scale) and research suggests that when we first hear
speech we are not (consciously) aware of its melodic shape. However, if a
speech recording is immediately repeated we do tend to pick up a melodic
contour (and with a third repetition the melodic effect is even more
pronounced). My own experience suggests that we approximate these
melodic patterns to the lattices with which we are familiar (they appear to
fall somewhere close to the tempered scale, at least to listeners immersed
in tempered-scale musics). But because the speech lines are not strings
of steady pitches, there is a certain fluidity to how we assign scale pitches
to the speech syllables. Finally, for certain parts of the piece I wanted to
be able to bring into tune the speech materials of different speakers, so I
needed these tempered approximations to fall on the same tuning grid
(it would be possible, in principle, for two melodies each to lie on a
tempered scale, but for the two scales to be a quarter-tone apart, so
that they would not harmonically gel). Thus I tracked the pitch contour of each
speech phrase to a concert-pitch tempered-scale approximation.
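
Snapping a tracked contour to that grid is essentially a rounding operation in equal-tempered (MIDI-style) pitch space; a minimal sketch, assuming a voiced (non-zero) frequency contour in Hz from the pitch tracker:

    import numpy as np

    def to_tempered(freqs_hz, a4=440.0):
        # Round a pitch contour (Hz) to the nearest notes of the
        # concert-pitch tempered scale, so that contours from different
        # speakers land on the same tuning grid.
        midi = 69 + 12 * np.log2(np.asarray(freqs_hz, dtype=float) / a4)
        return a4 * 2 ** ((np.round(midi) - 69) / 12)
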
Statistical analysis of the tempi of phrases (not of complete sentences)
threw up an interesting pattern. One can plot the tempo against the
number of phrases that have that tempo (the phrase population) and one
might expect a graph that starts at zero (no phrases are spoken) where
the tempo is too slow for normal speech, rises gradually towards some
average value (typical speech speed) and then falls away gradually to
zero (where speech phrases would be too rapid to understand or even to
articulate). Surprisingly, however, the graph turned out to have
prominent peaks (lots of phrases at these tempi) at crotchet = 120
(dance music tempo), crotchet = 180 (triplets in the same tempo) and a
lesser peak at crotchet = 144 (symphonic Allegro)!! This proved useful
when choosing phrases to synchronize in the rhythmic section of Act 1
(see below).
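
That analysis is, in essence, a population count over tempo; a minimal sketch, assuming each phrase's tempo has already been estimated in beats per minute (the bin width and range are illustrative choices):

    import numpy as np

    def tempo_histogram(tempi_bpm, bin_width=6):
        # Count how many phrases fall at each tempo. For the Encounters
        # material this plot showed prominent peaks at crotchet = 120,
        # 144 and 180 rather than one smooth hump.
        edges = np.arange(40, 240, bin_width)
        counts, edges = np.histogram(tempi_bpm, bins=edges)
        return counts, edges
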
My original idea to extract the essence of the sonority of an individual
voice proved impossible in practice — what characterizes an individual
voice is too complex a confection of tempi, hesitation types, melodic tics
and glossolalia to capture easily with any technical procedure. But some
voices had such strong characteristics (nasality, grittiness or cross-
register breaks in the pitch contour) that I was able to develop these
further.
The selection, sifting, cleaning and cataloguing of these recordings took
up a further 18 months of the project.
Aesthetic and Other Constraints
Working at the level of the phrase, two new issues arose. First of all, it
was no longer possible to ignore the narrative content of the speech. In
most of the scored vocal music I have composed, I have used invented
languages (with no explicit semantic content) to allow me complete
control of the sonic content. Here, however, there was no way to avoid
dealing with the narrative, so the piece combines story-telling and sound
art, reshaping the telling to musical demands.
Thus, in addition to the radio-style editing (see above) the stories are
slightly reshaped, e.g., key phrases are repeated as refrains, in the
manner of simple poetry, without altering the substance of what is being
said.
Secondly, it was no longer possible to ignore the presence of real
individuals in the recordings. It’s often said that Native Americans
originally objected to being photographed because the process was
thought to capture something of a person’s soul. If this is true of
photography, it is even more true when recording a person’s voice. If I
then take these recordings and develop and manipulate them, ethical
issues are involved. If this were my own voice, the voice of a willing
friend or a professional musician, I would be happy to stretch and warp it
in any way that seemed musically necessary. However, most of the
people I recorded were non-professionals who would not necessarily have
any knowledge of or interest in the esoteric aspects of contemporary
music. I therefore felt that I could not treat their voices with complete
abandon — there would need to be constraints on the processes I applied
to the voices, not unduly warping or disfiguring the original speech.
However, all composing involves working within constraints, so this was not
a major æsthetic problem.

Over the long preparatory period, the final form of the piece gradually
took shape. It is divided into four Acts of approximately 20 minutes each.
Acts 1 to 3 have four portrait sections based on an individual speaker or a
group of children, telling stories. These are presented mainly in wide
stereo. Each Act also has a central interlude in 8-channel surround sound,
using a multitude of speaking voices but organized differently in each
case: in Act 1, in terms of their tempo and rhythm; in Act 2 in terms of
sonorities (syllable or phoneme qualities) in the text; and in Act 3 in
terms of the harmonic field of groups of spoken phrases. And each Act
also has a surround sound finale, in which materials previously derived
from the voices (in earlier sections of the Act) are developed more freely
(they are no longer so strongly tied to the narrative). Act 4 has just two
portraits and then reworks the materials from previous Interludes and
Finales, culminating in the transformation of the speaking voices into song
(see below).
The whole work is bracketed by the “voicewind” sound, an extremely
dense texture of voices where all vocal detail is lost; what remains is a
band of noise which judders over the eight loudspeakers like the sound of
strong wind blowing around one’s ears. At the opening of Act 1 (and
Act 3) the texture gradually thins to reveal the voices, while at the end of
the piece, the texture of speaking voices gets denser and denser,
returning to the sound of wind with which the piece begins.
The Portraits
Having extracted and refined the text materials, I had to decide how to
create each portrait. Perhaps the most obvious way to track the melodic
contour of speech is to use other musical instruments to imitate the
melodic line, and in the first example of the portraits, this is what I’m
doing. However, this is the only place in the piece where I use sources not
derived from the recorded speech. Here, the male narrator recalls going
to a beer festival dressed as a belly dancer and dancing with various men
who mistake him for a woman. Three brass players perform a rhythmic
figure behind the speech, picking up prominent melodic motifs from the
speech line such as “you turn up in fancy dress,” “and as the night went
on,” “had to take the yashmak off” and particularly “It’s a bloke!”
Audio 4 (0:57). Excerpt of Trevor Wishart’s Encounters in the Republic of Heaven (2011)
featuring the beer festival portrait, with rhythmic figures in a brass band related to
speech contour and rhythm.

In another portrait (The Dancer’s Tale) the pitch contour of the spoken
voice is tracked using a filter (approximated to the tempered scale) which
is applied to the voice line at the original moment-to-moment pitch and
all of its harmonics. With a low Q, this merely adds a warm resonance to
the speech line and nudges it towards its tempered-scale approximation.
With high Q, the strong speech markers (e.g., the sibilants) are
obliterated and we are left with the pure pitch contour of the speech line.
Other filters are used in banks which reproduce the entire harmonic field
of a particular phrase, filtering the whole phrase. In fact, all portraits use
a wide variety of approaches to their material. Here, for example, you will
hear vocal hesitations and glossolalia, or vocal sibilants, gathered
together in textural groupings.
Audio 5 (0:54). Excerpt of the dancer portrait in Encounters, with harmonic fields
extracted from the voice line.
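
A plausible sketch of such a tracking filter bank (using scipy's standard peaking filters as a stand-in for the CDP filters; a static pitch is used here for simplicity, whereas the piece re-tunes the bank as the tracked pitch moves):

    import numpy as np
    from scipy.signal import iirpeak, lfilter

    def harmonic_resonator(x, sr, f0, q, n_harmonics=8):
        # Sum of peaking filters at f0 and its harmonics. With low Q this
        # adds a warm resonance at the tracked speech pitch; with high Q
        # the sibilants vanish and only the pitch contour remains.
        y = np.zeros_like(x)
        for h in range(1, n_harmonics + 1):
            f = f0 * h
            if f >= sr / 2:              # stop below the Nyquist frequency
                break
            b, a = iirpeak(f, q, fs=sr)
            y += lfilter(b, a, x)
        return y / n_harmonics
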

I’ve previously mentioned the various difficulties encountered in recording
teenage kids. One further problem arises in that, once they are persuaded
to chat they tend to talk about various personal or embarrassing things
which they probably would not want to be broadcast to the world (which
might include their peers, or their parents!). I therefore needed a way to
capture the excitement of “gossiping” without revealing the slightly risqué
content. To do this I used an envelope follower with a large window set to
recognize individual vocal syllables, then retained the centre of each
syllable, discarding the onset and tail. The syllable cores were then
rejoined in a rapid fixed-tempo stream. This reconstruction maintained
the pitch contour, the vowel stream and the expressiveness of the speech
line (e.g., laughter is perceptible) while completely disguising the
semantic content. This material is then juxtaposed with clear text
utterances (e.g., “ginger hair!!”) with different processes applied.
Audio 6 (0:56). Excerpt of the teenager portrait in Encounters, with the gossiping group
rendered in a rhythmic string that blurs the clarity of their discussion.
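
A sketch of the core-extraction step (simplified; it assumes the syllable boundaries have already been located by the envelope follower, and the fixed grid length is an illustrative constant):

    import numpy as np

    def syllable_cores(x, boundaries, keep=0.5, grid_samples=3000):
        # Keep the central `keep` fraction of each syllable, discarding
        # onset and tail (and with them most of the consonant information
        # that carries the words), then butt the cores together in a rapid
        # fixed-tempo stream: pitch contour and expressiveness survive,
        # semantic content does not.
        out = []
        for start, end in boundaries:        # sample indices per syllable
            length = end - start
            trim = int(length * (1.0 - keep) / 2)
            core = x[start + trim : end - trim]
            slot = np.zeros(grid_samples)
            slot[: min(len(core), grid_samples)] = core[:grid_samples]
            out.append(slot)
        return np.concatenate(out)
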

The next example illustrates one of the more successful attempts to work
with the actual sonority of an individual speaker. The speaker is a
93-year-old woman who lived on a remote farm in Upper Teesdale. Her
voice has distinct cross-register breaks, both up and down and often by
the interval of a fifth, particularly when her speech becomes animated. I
extracted many examples of the cross-break articulations and used the
pitch-tracking filters, time-stretching and other approaches to develop
articulated events like the individual notes of a bagpipe-like musical
instrument. This voice-derived instrument is then used to accompany the
voice.
Audio 7 (0:46). Excerpt of the Teesdale woman portrait in Encounters, in which vocal
leaps across a natural register break are used to create an accompanying bagpipe-like
instrument.

So far, the examples preserve quite closely the authentic voice and
narrative of the speaker. The next storyteller is an experimental poet and
because of this I felt I could take more liberties with the treatment of her
voice. She had lived in both Liverpool and Newcastle and had a striking
accent and a strongly nasal intonation. Various techniques are used to
extend and develop the speech. Vowels in the phrase “Heathcliffe come
here!!” are extended in time by a new process which recognizes the
individual wave packets in the vocal stream (more details below), while
the syllables of the word “democracy” are repeated, permuted (in fact,
using patterning from English bell-ringing practice) and simultaneously
gradually spectrally morphed, becoming more bell-like with time.
Audio 8 (0:54). Excerpt of the experimental poet portrait in Encounters, with the sound
qualities of certain vowels extended and bell-ringing patterns applied to the repeated
articulation of single words.
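
The bell-ringing patterning mentioned above is, at its simplest, “plain hunt”, in which adjacent pairs are swapped in alternation until the original order returns; a sketch applied to syllables (the segmentation of “democracy” here is illustrative):

    def plain_hunt(items):
        # Generate one full cycle of plain-hunt permutations, as used in
        # English change-ringing: adjacent pairs are swapped, starting
        # alternately from the first and the second position.
        row = list(items)
        rows = [row[:]]
        for step in range(2 * len(items)):
            start = step % 2
            for i in range(start, len(row) - 1, 2):
                row[i], row[i + 1] = row[i + 1], row[i]
            rows.append(row[:])
        return rows

    for r in plain_hunt(["de", "mo", "cra", "cy"]):
        print("-".join(r))
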

The Multi-Channel Sections


Materials derived from the two voices we have just heard are worked
together in the finale of Act 3, illustrating the more abstracted, less
narrative-based character of these finale sections. Syllables of the older
woman are time-extended and float over the texture sometimes sounding
like horns, and vocal flutters in her articulation are extended into
fluttering events. The spectral morph of the democracy syllables from the
poet’s voice are developed by both time extension and stacking (copies of
the source, resampled at different rates and therefore different durations
and pitches — in this case, at octaves — are superimposed in such a way
that their attacks synchronize precisely) to produce giant bell-like attacks.
Audio 9 (0:55). Excerpt of the finale of Act 3 of Encounters with more abstract
exploration of the two women’s voices than in the preceding portraits. This example
(and all those that follow) is a stereo reduction of the 8-channel master.
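
A sketch of the stacking operation (simplified; plain linear-interpolation resampling stands in for whatever resampling the CDP tools use, and the attack is assumed to sit at sample 0 so that the copies align automatically):

    import numpy as np

    def stack_octaves(x, octaves=(0, -1, -2)):
        # Superimpose resampled copies of a sound at octave
        # transpositions so that their attacks coincide, producing a
        # giant composite bell-like attack. Downward octaves halve the
        # read speed and so double the duration.
        copies = []
        for o in octaves:
            step = 2.0 ** o                  # read speed: 0.5 = octave down
            idx = np.arange(0, len(x) - 1, step)
            copies.append(np.interp(idx, np.arange(len(x)), x))
        out = np.zeros(max(len(c) for c in copies))
        for c in copies:
            out[: len(c)] += c
        return out / len(copies)
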

The central sections of each Act take materials from all the speaking
voices and present them in surround sound so that the audience is
enveloped in the community of speakers. Each act treats this collection
differently. In Act 1 the tempi and rhythms of spoken phrases are
coordinated. Using the statistical information from the database, I was
able to select vocal phrases of the same tempo and carefully synchronize
these in the 8-channel mix. However, no matter how carefully this was
done, the result sounded simply like a crowd. Only by making very subtle
changes to the timing within phrases — changes so small that, in most
cases it is not possible to tell the difference between the original and the
time-modified phrase when played back-to-back — was I able to achieve
the rhythmic locking of the various voices. I tried various approaches to
time modification but in the end the simplest — deleting tiny slivers of
sound at the lowest energy points between syllables or inserting tiny
slivers of silence between syllables — proved the most effective.
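
A sketch of that sliver-editing operation (simplified; the 5 ms energy window and the function name are illustrative):

    import numpy as np

    def adjust_gap(x, sr, region, delta_ms):
        # Shift the timing within a phrase by editing at the quietest
        # point of an inter-syllable region. Positive delta_ms inserts
        # that much silence; negative delta_ms deletes that much sound.
        # The slivers are small enough that the edited phrase normally
        # sounds identical to the original.
        start, end = region                  # sample range between syllables
        win = int(0.005 * sr)                # 5 ms energy window
        energy = np.array([np.sum(x[i:i + win] ** 2)
                           for i in range(start, end - win)])
        cut = start + int(np.argmin(energy))
        n = int(abs(delta_ms) / 1000 * sr)
        if delta_ms > 0:
            return np.concatenate([x[:cut], np.zeros(n), x[cut:]])
        return np.concatenate([x[:cut], x[cut + n:]])
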
Various rhythmic / spatial approaches are used — the initial phrases put
successive syllables on different, adjacent channels of the 8-channel ring,
so the speech phrase jump-circles around the audience. But for the most
part, complete short phrases are placed on single loudspeakers, with
some later use of echoes falling in the driving tempo; tutti accents occur
on several or all channels simultaneously, whilst some sustained sounds or
textures pan in a circular fashion around the space. Near the end, clipped
syllable fragments are worked in double tempo.
Audio 10 (0:54). Excerpt of the central section of Act 1 of Encounters, in which the
natural speech rhythms of many people are coordinated over an array of eight speakers
(reduced here to stereo).

In Act 3, the speech is coordinated in terms of its melody and implied
harmonic field. As you might imagine, much speech lies in a narrow pitch
band, generating chromatic clusters of notes as its associated harmonic
fields. This may be interesting to hear once or twice but could become
musically tedious. I therefore used the database search facility to find
spoken phrases containing larger pitch intervals (a minor third or greater)
and to correlate these with the same pitches and pitch intervals in other
phrases, allowing me to gather these materials together and generate
interesting harmonic progressions between the groups of speech phrases
themselves. Filters resonating at the pitches (and harmonics) of these
harmonic fields amplify the harmonic resonance, and sometimes these
resonances float away from the vocal sources in the surround space.
Speech into Song
In Act 4, the various threads of the previous acts are drawn together
leading to the finale in which the speaking voices burst into song. In order
to achieve this I developed ideas from the iteration extension described
earlier. Spoken (voiced) vowels consist of small wave packets. The speed
at which the wave packets come past determines the pitch of the voice,
while the form of the wave packet determines the vowel we hear. Using a
very tiny envelope window we can detect these wave packets, in a
parallel fashion to the detection of the tongue flaps in an iterative (rolled)
“rr” sound, but on a much smaller time scale. There are several difficulties
to overcome. The first is that the packet size changes with pitch, so we
can no longer use a fixed-size envelope window to track the wave
packets. We must start with a pitch-detection procedure (I use harmonic
peaks detection on the phase vocoder data) and then generate an
envelope window about one third of the wavelength of the perceived
pitch. We also have to deal with all the unpitched (and silent) events in
the speech stream in some rational way. Once we have detected the
packets, a new problem arises if we want to change the pitch. In real
speech, if the pitch of a given vowel (let’s say “aa”) goes down by one
octave, the packet becomes twice as long, so to transpose the original
voice down by the same octave we have to devise some means of
extending the packet so it is “the same shape” yet longer. It wasn’t clear
to me how this could be achieved in a perceptually plausible way.
However, experimentation revealed that simply by inserting silence
between the packets and thus changing their timing, not only did the
pitch of the voice fall but the vowel quality was perfectly preserved. Even
transposed two octaves down, where the new signal was three-quarters
silence, a plausible — though at this transposition slightly gritty — vocal
“aa” was produced. Similarly, by overlapping the packets and thus
shortening the time between them, the voice could be made to rise in
pitch. Due to the signal overlap involved here, beyond around one octave
up the signal began to become implausibly resonant with the internal
echoes involved. However, this was enough pitch play to allow me to
develop spoken vowels into sung lines, adding random-varied vibrato of
the packet repeat rate for a more expressive cantato (the sound without
vibrato was heard earlier in the “Heathcliffe” examples).
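
In outline, the downward transposition by packet respacing might look like this (a simplified sketch, close in spirit to what is known elsewhere as pitch-synchronous overlap-add; it assumes the wave packets have already been cut apart using the pitch-synchronous envelope window described above, and upward shifts would overlap the packets instead):

    import numpy as np

    def respace_packets(packets, ratio):
        # Lower the perceived pitch by inserting silence between the
        # detected wave packets: ratio 2.0 = one octave down, doubling
        # each packet period while leaving the packet shape, and hence
        # the vowel, untouched. At ratio 4.0 (two octaves down) the
        # signal is three-quarters silence yet still reads as the vowel.
        out = []
        for p in packets:                    # each packet: 1-D sample array
            gap = int(len(p) * (ratio - 1.0))
            out.append(p)
            out.append(np.zeros(gap))
        return np.concatenate(out)
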
Audio 11 (0:53). Excerpt from the finale in Act 4 of Encounters, with individual voices
heard earlier in the piece now combined in a group speech-song passage.

Biography

Trevor Wishart (*1946) is a composer / performer specialising in sound
metamorphosis and constructing software to make it possible (Sound Loom /
Composers Desktop Project). He has lived and worked as composer-in-residence in
Australia, Canada, Germany, Holland, Sweden and the USA. Currently residing in
the North of England, Wishart creates music with his own voice, for professional
groups, or in imaginary worlds conjured up in the studio. His æsthetic and
technical ideas are described in the books On Sonic Art, Audible Design and Sound
Composition (2012). Works include Red Bird, Tongues of Fire, Two Women,
Imago and Globalalia. He has received commissions from Paris Biennale, DAAD
in Berlin, French Ministry of Culture and BBC Proms. In 2008 he was awarded
the Giga-Hertz Grand Prize for his life’s work. Between 2006 and 2010 he was
composer-in-residence at Durham University (North East England) and during
2011, Artist-in-Residence at the University of Oxford.
http://www.trevorwishart.co.uk
eContact! 15.2 — Toronto Electroacoustic Symposium (TES) 2012 (May / mai
2013). Montréal: Communauté électroacoustique canadienne / Canadian
Electroacoustic Community
