In The Search of The Right Format For Immersive Sound
Jorge Medina Victoria
ABSTRACT
Immersive experiences are becoming the state of the art for delivering an artistic message
with audio content. There are different technological approaches to achieving three-dimensional
sound. The right choice depends on the recording, distribution and playback format, in
addition to the technical characteristics of the format, its usability and its integration with
established technologies. In this paper, a broad summary of the three currently dominant
technologies, channel-based systems, ambisonics and object-based systems, is presented,
and the technologies are compared with regard to human perception.
INTRODUCTION
Immersive audio is a concept widely used to describe techniques that deliver spatial audio
content in a setup around the listener [7] or through headphones. The majority of those
techniques and formats, which are also referred to as 3D–Audio, present audio information
not only in the horizontal plane but also with elevated sound sources and even in some cases
with sources under the horizontal listening plane.
To achieve complete immersion in a virtual world, it is necessary to fool or trick the senses;
the role of vision is predominant. However, sound information reinforces and supports
everything that is not seen [8] and is a fundamental part of the creation of a virtual world.
This paper is a short review of the most relevant methods used to capture and reproduce
immersive sound. An overview of the advantages and drawbacks relative to the different
techniques may facilitate the choice of an audio setup, regarding the available resources and
the expected acoustic environment for the listener.
Immersive audio content can be delivered mainly in three different formats: channel-based,
scene-based and object-based [7]. There are also mixed approaches, which depend strongly on
the context and combination of advantages of the different methods.
The usual approach is to use the same method from the capture to the playback of the audio
material. However, it is also possible to use one method for capturing or recording the sound
and a different one for its presentation or reproduction [2].
Channel-based
This approach is based mainly on amplitude panning. A straightforward relationship between
audio streams and loudspeakers is assumed [2], [7]. It is the traditional way of producing
sound for music recordings, film, etc. Level or amplitude differences between loudspeakers
create a phantom source, so that sound can be localized. This works well in the
horizontal plane [2]. The amplitude differences, produced through methods such as panning
or the faders of a mixing console, can also be applied to setups that include elevated
loudspeakers. However, localization in the vertical plane does not work as well as in the
horizontal plane and is strongly dependent on the sound source [5].
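As a simple illustration of the amplitude differences described above, the sketch below implements the common constant-power (sine/cosine) pan law for a loudspeaker pair; the function name and the [-1, 1] pan range are illustrative choices, not taken from the paper.

```python
import math

def constant_power_pan(pan: float) -> tuple[float, float]:
    """Constant-power stereo pan law.

    pan: -1.0 (hard left) .. 0.0 (centre) .. +1.0 (hard right).
    Returns (left_gain, right_gain); the summed power
    left**2 + right**2 is always 1.0, so perceived loudness
    stays roughly constant as the phantom source moves.
    """
    theta = (pan + 1.0) * math.pi / 4.0  # map [-1, 1] onto [0, pi/2]
    return math.cos(theta), math.sin(theta)

# Centre position: both channels carry the signal at -3 dB (gain 1/sqrt(2)).
left, right = constant_power_pan(0.0)
```

The same principle, applied pairwise, underlies panning across larger loudspeaker layouts, including pairs in the height layer.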
In some setups, amplitude panning plays no role at all. The setup developed by Morten
Lindberg and known as the “2L-Cube” is an example of a 1:1 approach: the signals
recorded by two arrays of omnidirectional pressure microphones are routed directly to the
loudspeakers [4]. The first array, in the horizontal plane, consists of five microphones; the
second, placed above the horizontal array, consists of four. The two arrays match the number
of loudspeakers in the horizontal and height layers, respectively.
The diversity of available recording techniques and the robustness of formats such as 5.1 are
among the advantages of the channel-based approach. A downside, especially for immersive
sound, is the dependence on the reproduction setup: the production and recording of sound
sources are tied to a specific, inflexible loudspeaker layout. In the worst case, a separate mix
has to be produced for each target loudspeaker arrangement.
Object-based
In this sound representation, the sound objects and the information about them are kept
separate. The ancillary information, which describes parameters such as spatial position, is
known as metadata [7]. The artistic intention is expressed through the metadata, which
indicates how the audio elements should be translated to the reproduction system. Metadata
can be generated during mixing or automatically [8]. In the final stage, an audio rendering
engine delivers a data stream for the different formats, from mono to 3D-Audio.
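The separation of audio and metadata described above can be sketched as follows; the object fields and the minimal two-loudspeaker renderer are hypothetical simplifications of what a full rendering engine does, and all names are illustrative.

```python
import math
from dataclasses import dataclass

@dataclass
class AudioObject:
    """A monophonic sound object plus its rendering metadata."""
    azimuth: float      # degrees; 0 = front, positive toward the left
    gain: float = 1.0   # linear level

def render_gains(obj: AudioObject,
                 spk_azimuths: tuple[float, float]) -> tuple[float, float]:
    """Derive constant-power gains for one loudspeaker pair from the
    object's positional metadata. The renderer, not the delivered mix,
    decides the final channel signals, which is what makes the format
    independent of the loudspeaker setup."""
    left_az, right_az = spk_azimuths              # e.g. (+30.0, -30.0) for stereo
    frac = (obj.azimuth - right_az) / (left_az - right_az)
    frac = min(max(frac, 0.0), 1.0)               # clamp sources outside the pair
    theta = frac * math.pi / 2.0
    return math.sin(theta) * obj.gain, math.cos(theta) * obj.gain
```

Because only the metadata is fixed, the same object can be rendered to a different layout simply by calling the renderer with other loudspeaker azimuths.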
The object-based method is an alternative that is rapidly gaining acceptance because of its
flexibility. Distribution is possible to a variety of end devices such as smartphones or tablets,
and reproduction can target loudspeakers in different configurations or binaural playback
over headphones [7]. The method works especially well when the sound sources are discrete
elements, as in the case of monophonic audio sources.
Advantages
• Independence of the loudspeaker setup
• Personalization and interactivity
Drawbacks
• Demanding workflow
• Preferably monophonic sources
Ambisonics
The development of ambisonics started over four decades ago. The idea behind it is the
control and representation of a sound field at a specific position for a loudspeaker
arrangement, based on its decomposition into spherical harmonics [1]. The sound scene is
encoded directionally. The original mathematical framework enabled First Order Ambisonics
(FOA); a further development is known as Higher Order Ambisonics (HOA).
Figure 3: Ambisonics spherical harmonics up to 2nd order.
Producing audio material for the ambisonics scene-based representation normally involves
two stages: encoding and decoding. The flexibility and manipulability of these stages make
ambisonics a very interesting method for the recording and reproduction of 3D-Audio. In the
encoding process, audio signals are captured with an A-Format microphone. This four-channel
signal can easily be converted to B-Format or even rendered binaurally, which makes
ambisonics a very practical tool for virtual reality applications.
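A mono source can also be encoded directly into B-Format by evaluating the first-order spherical harmonics in the source direction. The sketch below uses the traditional FuMa convention (with the customary -3 dB weight on W); the function name is an illustrative choice.

```python
import math

def encode_foa(sample: float, azimuth_deg: float,
               elevation_deg: float) -> tuple[float, float, float, float]:
    """Encode one mono sample into first-order B-Format (FuMa convention).

    Returns (W, X, Y, Z): W is the omnidirectional component, while
    X, Y, Z are figure-of-eight components along the front-back,
    left-right and up-down axes.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)               # omni, FuMa -3 dB weighting
    x = sample * math.cos(az) * math.cos(el)  # front-back
    y = sample * math.sin(az) * math.cos(el)  # left-right
    z = sample * math.sin(el)                 # up-down
    return w, x, y, z
```

Note that modern workflows often use the AmbiX convention (ACN ordering, SN3D normalization) instead, which changes the channel order and the weight on W.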
For virtual reality, object-based and scene-based methods are the most flexible options. In the
case of ambisonics, the mathematical framework enables easy manipulation (e.g. rotation) of
the signal in three-dimensional space, which is very practical when using head-mounted
displays and headphones [8]. In addition to offering the advantages of scene-based methods,
object-based methods can produce complex dynamic soundscapes: the metadata, which
defines position, level, width and other attributes of the sound source, can be modified
dynamically according to the scene or the artistic intention [2], [8].
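The rotation mentioned above is particularly cheap in first-order ambisonics: a yaw rotation about the vertical axis only mixes the X and Y components, while W and Z are untouched. A minimal sketch, assuming the common B-Format axis convention (X front-back, Y left-right, Z up-down); whether the angle is the scene rotation or the negated head yaw depends on the tracking convention.

```python
import math

def rotate_foa_yaw(w: float, x: float, y: float, z: float,
                   yaw_deg: float) -> tuple[float, float, float, float]:
    """Rotate a first-order B-Format sample about the vertical axis.

    Only a 2x2 rotation of (X, Y) is needed; the omnidirectional W
    and the vertical Z components are invariant under yaw. This is
    why head-tracked binaural rendering from ambisonics is so cheap.
    """
    a = math.radians(yaw_deg)
    x_rot = x * math.cos(a) - y * math.sin(a)
    y_rot = x * math.sin(a) + y * math.cos(a)
    return w, x_rot, y_rot, z
```

Higher-order material rotates just as exactly, only with larger per-order rotation matrices.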
DISCUSSION
Among the most relevant attributes of immersive sound are envelopment, localization, a large
sweet spot, decorrelation of signals and timbre. No single technical approach delivers all of
these features well enough on its own to produce a full sense of immersion. The right format
is therefore always a compromise between the available resources, time and the desired
result. This paper has given an overview of the most representative methods for producing
immersive audio and presented information that may be helpful when choosing a technical
format to develop an acoustic idea.
REFERENCES
[1] Ahrens, Jens, Analytic Methods of Sound Field Synthesis, Springer-Verlag, Berlin
Heidelberg, 2012
[3] Fischer, Cédric; Zingler, Dominik; Medina Victoria, Jorge Enrique, Extending the
Double-MS Configuration for 3D Audio Requirements, VDT Tonmeistertagung, Cologne,
Germany, Nov. 2016.
[4] Kim, Sungyoung, Chapter 7 Height Channels, Immersive Sound, The art and science of
binaural and multi-channel audio / edited by Agnieszka Roginska and Paul Geluso, Focal
Press, New York, 2018.
[5] Lee, Hyunkook, The Relationship Between Interchannel Time and Level Differences in
Vertical Sound Localization and Masking, Audio Engineering Society 131st Convention,
New York, NY, USA, 2011.
[6] Riitano, Lucca; Medina Victoria, Jorge Enrique, Comparison between Different
Microphone Arrays for 3D-Audio, Audio Engineering Society 144th Convention,
Milan, Italy, 2018.
[7] Rumsey, Francis, Immersive Audio: Objects, Mixing, and Rendering, Audio Engineering
Society Journal, Volume 64, Number 7/8, 2016, pp. 584-588.
[8] Susal, Joel; Krauss, Kurt; Tsingos, Nicolas; Altman, Marcus, Immersive Audio for VR,
Audio Engineering Society, Audio for Virtual and Augmented Reality Conference Paper, Sep
30 – Oct 1, 2016, Los Angeles, CA, USA.