Audio: the missing element in multimedia

Tom Dussek

The author

Tom Dussek is a musician and audio specialist. He recently graduated from the MA Design for Interactive Media at Middlesex University, where he specialised in developing the uses of sound in multimedia.

Abstract

The importance of sound as a channel of communication is emphasised, and attention is drawn to some of its special properties. The sound capabilities of the Macintosh, as an example of an affordable multimedia computer, are briefly discussed, and are compared with the lesser capabilities of the multimedia packages which run on this platform. A work-around to an example problem is discussed, followed by an introduction to the use of MIDI from within multimedia packages. As an illustration of the potential and the difficulties of using sound within current multimedia packages, two projects are described, and a plea is made for sound to be given more serious consideration by developers both of multimedia titles and of the authoring software used to create them.

The importance of sound

The use of sound within computer-based multimedia has mostly been subordinate to visual components (still images, moving imagery and text), only acting as a confirmation that an action has been carried out: a "click" sound accompanying the operation of an on-screen button, or a piece of music masking the delay of loading data or other such system operation. This approach to the use of sound has served to relegate the considerable powers of the user's auditory input, the sense of hearing, to a rather low and unsatisfactory level. The success of the music industry and the importance of sound and music to film indicate how a substantial element of our culture relies on our sensitivity to sound, to provide both information and emotional effect. Indeed, the prominent idea that an essential component in the functional design of multimedia is that there should be an "off" button for the auditory component indicates that sound is often used in a way which is not integral to the overall functionality of many multimedia products. This underuses an accurate, sensitive and emotive means by which the user may experience multimedia, via information received by the ears, and in doing so unreasonably restricts the bandwidth for communication.

Sound control

Within a multimedia package, it should be possible for the designer to incorporate elements of auditory control which are as comprehensive as those of visual control. Functionality such as the panning of a stereo image, pitch shifting, tonal filtering, effects-processing such as echo, reverberation or harmonic distortion, synchronisation of discrete tracks of sound, variable quantisation of data (in fact, all the psychoacoustic techniques and data manipulation tasks that have been developed by the professional audio industry) should ultimately be at the disposal of the designer and user of interactive multimedia products.

The Apple Sound Manager (v.3), whilst a relatively basic audio management system, offers 16-bit audio recording, multiple sound channels, stereo capability and audio processing in terms of level control over each channel. There are many third party dedicated audio systems which utilise the capabilities of the Apple Sound Manager, increasing the audio functionality of the Macintosh by the incorporation of discrete software and/or hardware. However, multimedia authoring packages, and the multimedia products made with them, rarely take advantage of all the audio control possibilities of the Sound Manager, let alone add any increased functionality by the incorporation of additional audio software into the product.

This report will consider the limitations in the provision of sound within MacroMedia Director (note 1), a dominant multimedia authoring and delivery package for the Apple Macintosh computer (now also available for Windows), and will review the techniques used by the author and others at the Centre for Electronic Arts, Middlesex University, to overcome the control limitations of Director with regard to all but the simplest of auditory operations. This report will also endeavour to offer possible areas for development in the use and control of sound within multimedia, and as such should not be taken simply as a critique of Director.

Auditory control in MacroMedia Director

MacroMedia Director is surprisingly limited in its audio capabilities. Whilst it offers much to the multimedia designer in the control of animation, text, objects and colour, the provision for the control of sound is meagre. Indeed, of around 200 functions provided by Lingo (the programming, or scripting, language used by the package), only eight deal directly with audio. These functions enable the setting of a sound file's playback volume, the channel in which it will play, commands to start or stop a sound, and functions to return rudimentary status information as to the operation of a sound file.

It could be argued that sound requires little more than these controls with which to operate, following the standard control interface offered by the magnetic tape recorder (though with the notable exception of a "pause" control). However, the possibilities for the provision of greater auditory functionality within multimedia are considerable. Why not make it possible for the sound to be reversed, under the control of the user? Or for the sound to track the cursor, using spatial psychoacoustic techniques such as stereo imaging or the subjective height of sound in relation to pitch to give the user a spatial image of the cursor position within the screen? Or to enable tonal filtering of a sound to take place, providing the user with environmental information from the sound they are interacting with? Or pitch-shifting? Or through the implementation of time-stretching, or other sound manipulation techniques exclusive to audio in the digital domain, to provide new interactive auditory experiences never before offered to the user? These possibilities would provide a wealth of creative opportunities to the designer of interactive multimedia, and subsequently to the user. And, whilst Director is very comprehensive in facilitating the control of visual components, and thus enabling an interactive visual experience of depth and substance, there is no such translation of component-level control to the audio.

However, we must first consider what should be possible in terms of auditory control given the present technical levels of popular delivery platforms like the Macintosh.

The possibilities for greater auditory control

Within Director, visual components (PICT files, note 2) are kept in a small database as "castmembers", and called to the "stage" as and when required, following the overall "movie" metaphor of the authoring package. The audio components (typically AIFF files, note 3) are treated in the same way, being called as a castmember when required. AIFF files may also be stored externally from the Director movie. Sound can also be stored as a QuickTime movie, and replayed using QuickTime control methods, enabling the user to pause and resume playback of the sound without having to return to the beginning of the sound file as is the case with normal Lingo controlled AIFF playback.

A PICT file and an AIFF file are the same, in the sense that they are both digital data. However, there is a fundamental difference in the operation of visual and auditory components when presented to the user. They are complementary modes of information, since, "Sound exists in time and over space; vision exists in space and over time." (Mountford, SJ, and Gaver, WW, 1990). Visual elements, once instigated, will remain seen until they are removed. Once an image has been put onto the screen in a particular place, no more is required of the computer's processors than to refresh the screen. Sound, however, existing "in time and over space", requires processing for as long as it is being heard. A sound file must be replayed in a linear fashion, and only when the sound ends will the requirements on the processors end.

Some of the previously cited examples of the possible control and manipulation of sound within an authoring package involve a great deal of real-time processing. Operations such as time-stretching (note 4) are particularly "processor-intensive", and as such are possibly beyond the capabilities of any current popular delivery platform, although the potential for some form of partial processing with audio could extend the limits of what is presently deemed technically possible in terms of speed and consistency of delivery. Other audio manipulation and control functions, such as the use of reverberation, whilst worthy of consideration, would possibly require the use of more sophisticated DSP (Digital Signal Processor) chips to work well enough to offer the designer and the user the flexibility, control and quality required to make such functions usable in "real time". Suitable chips will very probably be incorporated into popular computers at some point, and it is as well to consider their potential benefit to multimedia products at this stage in the development of multimedia. There are, however, auditory control functions that are quite within the technical limits of computers currently in popular use.

The demands of simple real-time processing of an auditory stereo image parallel those of moving a graphic image around the screen in real time, and should require little (if any) more processing. Already, the playback volume of a sound file can be altered in real time from Lingo (making use of the Apple Sound Manager), indicating that a certain amount of processing of audio is quite within the capabilities of Director and the Macintosh. It would seem, then, that the lack of anything other than the most basic auditory control operations within Director is due more to oversight than to technical limitations, though this is possibly itself due to the lack of user demand for auditory control. The fact that the techniques described below actually work is further indication that the real time processing of certain parameters of sound is attainable, and suggests that they should be supported as an inherent part of Director/Lingo operations.

Experiments in greater auditory control

The techniques described in this section indicate how the control of sound can be used to provide a great deal of spatial and identifying information to the user, with particular reference to techniques developed for "A Three-Dimensional Audio Interface for Local Area Networks" (PGDip Project: Tom Dussek, Anthony McGaw, Nicola Wells, 1994). The techniques used to overcome the limitations on the control of sound within Director will be described. There were two approaches to this project, one which used AIFF sound files, and one which used MIDI (note 5) to control an external synthesiser.

An essential element of the "Audio Interface" project was the auditory communication of spatial information. This involved providing the user with stereo imaging for the x-axis, pitch-changing sounds for the y-axis, (following research into the subjective height of sound in relation to pitch by Peters (Peters, RW, 1960)) and using changes in volume level for the (virtual) z-axis. The x and z axes were therefore dependent on volume level changes, and the y-axis on discrete pitch-based instances.
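
The axis mapping described above can be sketched in a few lines. This is an illustration only, not the project's Lingo code: the function name, the normalised 0.0 to 1.0 coordinates, the number of pitch steps and the choice that nearer objects are louder are all assumptions made for the example.

```python
def position_to_audio(x, y, z, num_pitches=8):
    """Map a normalised 3-D position to audio parameters.

    All coordinates are assumed to lie in 0.0..1.0 (hypothetical convention).
    Returns (pan, pitch_index, level):
      pan         -- 0.0 is fully left, 1.0 is fully right (x-axis)
      pitch_index -- index of one of num_pitches discrete pitches (y-axis)
      level       -- 0.0..1.0 volume, louder when z is small (z-axis)
    """
    def clamp(v):
        return max(0.0, min(1.0, v))

    x, y, z = clamp(x), clamp(y), clamp(z)
    pan = x                                   # stereo image follows x
    # Higher on screen means a higher pitch step; y == 1.0 maps to the top step.
    pitch_index = min(int(y * num_pitches), num_pitches - 1)
    level = 1.0 - z                           # assumed: distance reduces volume
    return pan, pitch_index, level
```

The pitch index is discrete because, as the text notes, the y-axis relied on separate pitch-based instances rather than a continuously variable parameter.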

The approach using AIFF files required adaptation of the basic audio control functionality provided by Director. To enable the control of a stereo image, whereby a sound can "pan" across the user's auditory perceptive field, for example making a sound travel from left to right, requires the attenuating and incrementing of the volume levels of the sound. In this example, the level of the sound on the left must decrease as the level of the sound on the right increases. This panning function is not provided by Lingo, and a solution must therefore be improvised. The Macintosh will play stereo AIFF files, but Lingo does not permit the alteration of the sound levels of individual tracks within the AIFF file. Only the level of the entire audio file may be adjusted. It would appear that whilst the Apple Sound Manager has "Stereo Mixing Capability", Director does not take advantage of this. If the user is to be offered control over the position of the sound within the stereo image, it is therefore impossible to set the balance of the left and right tracks at the recording stage. Our approach was to record the same sound into two AIFF files, one containing the sound on the left track with silence on the right, the other having silence on the left and sound on the right.

As both files are stereo, the Macintosh will replay them as such, and the overall result will be that a stereo image will be heard. The user's actions may then be used to adjust the volume level of one of the two sound files, and, if this adjustment is linked in inverse proportion to the other file, panning of the stereo image is achieved. This stereo imaging technique indicates the possibility for real time parametric modulation of audio files from within Director.
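
The inverse linkage between the two file volumes can be illustrated as follows. This is a minimal sketch rather than the original Lingo: the function name is hypothetical, and the full-scale volume of 255 is an assumption that should be replaced by whatever range the host package uses.

```python
MAX_VOLUME = 255  # assumed full-scale volume value; not taken from the article


def pan_volumes(pan, max_volume=MAX_VOLUME):
    """Return (left_volume, right_volume) for a pan position.

    pan runs from 0.0 (fully left) to 1.0 (fully right). The two volumes are
    linked inversely: whatever is added to one file is taken from the other,
    so they always sum to max_volume.
    """
    pan = max(0.0, min(1.0, pan))
    right = round(pan * max_volume)   # volume for the right-only file
    left = max_volume - right         # volume for the left-only file
    return left, right
```

In use, the left value would be applied to the sound channel playing the left-only file and the right value to the channel playing the right-only file each time the user moves the control. A linear law is the simplest reading of "inverse proportion"; other pan laws are possible.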

However, this technique is not without its drawbacks. Firstly, it uses four playback channels to carry out an operation that should only require two, were Lingo to take proper advantage of the Apple Sound Manager. As Director and the Macintosh are limited in the number of audio channels which can be played at any one time (the exact number is not indicated by any of the literature associated with Director, although it appears to be seven), this doubling of channel requirements severely limits the possibilities for this kind of stereo control.

The second major problem is that of synchronising the two AIFF files, since, being separate files called by two discrete Lingo programme statements, they cannot be started in the same instant. If the sound is anything other than a continuous tone (which may still be subject to a phasing effect if the two sound files are played together asynchronously), then the delay in the second AIFF file being triggered after the instigation of the first will cause a slight but very audible echo, as the second file is running behind the first. Our hearing is extremely sensitive to discrepancies in audio information (this is why a QuickTime Movie gives priority to audio playback, at the expense of constant visual frame rates, as we are more tolerant of visual than auditory inconsistencies) and the psychoacoustic effect of the stereo image is completely lost. Instead of the two files being interpreted as one sound bearing spatial information, they become two discrete sounds. To avoid this delay, a period of silence equal to the delay time in the second file being triggered was added to the beginning of the first AIFF file. Provided that the pre-delay of the first file was accurate enough, this technique resulted in the synchronous playback of the two files. The drawback with this technique is that the amount of pre-delay required is dependent on the speed of the delivery computer. If the computer were faster or slower than the computer on which the project was developed, the files would no longer be synchronised.
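
The pre-delay arithmetic can be made concrete with a short sketch. The function names, the representation of audio as a list of per-frame tuples and the example figures are assumptions for illustration; the article does not state the delay that was actually measured.

```python
def predelay_frames(delay_ms, sample_rate):
    """Number of silent sample frames that last delay_ms at sample_rate (Hz)."""
    return round(delay_ms * sample_rate / 1000)


def add_predelay(frames, delay_ms, sample_rate, channels=2):
    """Prepend silence to the file that is triggered first.

    frames is a list of tuples, one tuple of sample values per frame.
    The returned list starts with enough all-zero frames to cover delay_ms.
    """
    silence = [(0,) * channels] * predelay_frames(delay_ms, sample_rate)
    return silence + list(frames)
```

For example, a hypothetical trigger delay of 20 ms at a 44100 Hz sample rate corresponds to 882 frames of silence. Because the delay is measured on one machine and then fixed in the file, the figure is only correct for computers of the same speed, which is the drawback described above.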

With the Three-Dimensional Audio Interface project, y-axis movement, which was not dependent on volume level, used pitch to indicate the height of a visual object. As there is no provision within Lingo to alter the pitch of a sound file, multiple AIFF files had to be recorded at different pitches, and imported into the cast of the movie. They could then be replayed from Lingo, switching between the AIFF castmembers to alter the pitch of the sound. This is very similar to multiple PICT file castmembers being used to create an animation. However, whilst it is reasonable to accept that the amount of processing required to manipulate an image in real time to generate animation is beyond the capabilities of current popular computer technology, thus necessitating the use of multiple pre-drawn discrete images, the same is not true of sound files. The only parameter of the file that must be processed to achieve pitch shifting is the frequency. This is done by relatively simple multiplication (or division) of the fundamental frequency of the sample rate, and, even when following the slight anomalies of the equally tempered tuning of western musical tradition, should provide little challenge to the processing capabilities of the Macintosh. There are considerations of audio quality in this, as the reduction of the sample rate to lower the pitch of a sample will cause the high frequencies of the sampled tone to be lost, and timbral factors such as the attack or decay of the sampled sound will alter in direct proportion to the sample rate alteration, resulting in unnatural modulation of the tone, but this may be rectified by the judicious use of a degree of multi-sampling and construction of a sound using separate components such as the attack, sustain, decay and release of the sound, following the principles developed by the music industry. 
Rectifying the problems of sample rate alteration is not an insurmountable problem, and the provision for pitch shifting sound files from Lingo would greatly increase the potential for more comprehensive control of sound within multimedia.
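
The multiplication referred to above can be written out as a sketch. In equal temperament each semitone corresponds to a frequency ratio of the twelfth root of two, and replaying a sample at that multiple of its original rate shifts its pitch accordingly. The function names and the use of linear interpolation are choices made for this example, not a description of any existing Director facility.

```python
def pitch_ratio(semitones):
    """Playback-rate multiplier for a shift of the given number of
    equal-tempered semitones (positive raises pitch, negative lowers it)."""
    return 2.0 ** (semitones / 12.0)


def resample(samples, ratio):
    """Step through samples at `ratio` times normal speed.

    Values between stored samples are estimated by linear interpolation.
    A ratio above 1.0 raises the pitch and shortens the sound; a ratio
    below 1.0 lowers the pitch and lengthens it.
    """
    if len(samples) < 2:
        return list(samples)
    out = []
    pos = 0.0
    last = len(samples) - 1
    while pos < last:
        i = int(pos)                 # stored sample at or before pos
        frac = pos - i               # how far pos lies towards the next sample
        out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
        pos += ratio
    return out
```

The change in length is the side effect mentioned in the text: attack and decay times scale with the ratio, which is why multi-sampling is suggested for larger shifts.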

The use of external control functions with Director

The use of additional hardware devices, triggered by external program modules attached to Director (X-Objects, note 6), demonstrates how more sophisticated control of sound provides an exciting opportunity to the multimedia designer in the use of sound.

The use of MIDI signals, transmitted under the control of the user to an external synthesiser, (which then generates the sound) is, for the purposes of this report, an example of the use of dedicated audio control functions within multimedia, as opposed to the use of synthesised sound.

Incorporating X-Objects (in this case a commercial product, "HyperMIDI", by Earlevel Engineering, California) enabled all the control parameters offered by MIDI to be accessed from Director. These parameters are comprehensive, as MIDI has developed from basic note-trigger information to offer many more functional controls. MIDI is a control protocol, and therefore does not process the actual sound, so direct manipulation of the sounds was not possible. Numbers between 0 and 127 (as required by the MIDI protocol) were generated from Lingo by the user's interaction, and these were sent (along with the appropriate identification information) to the synthesiser, which then responded to the control information. This enabled a comprehensive control system to be established. Note values (and thus pitch values) were easily altered, panning was easily attained, as was the setting of volume levels for a sound. Rudimentary equations were used to provide the receiving synthesiser with values based on the user's interaction, and the whole system worked very smoothly.
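
The kind of "rudimentary equation" involved can be sketched as below. The byte layouts follow the standard MIDI channel messages (note-on status 0x90, control change status 0xB0, controller 7 for channel volume, controller 10 for pan). The helper names are hypothetical, and nothing here represents the HyperMIDI interface, which is not documented in this report.

```python
VOLUME_CONTROLLER = 7    # standard MIDI controller number for channel volume
PAN_CONTROLLER = 10      # standard MIDI controller number for pan


def to_midi_value(fraction):
    """Scale a value in 0.0..1.0 to the 0..127 range of a MIDI data byte."""
    fraction = max(0.0, min(1.0, fraction))
    return round(fraction * 127)


def note_on(channel, note, velocity):
    """Three-byte note-on message; channel is 0..15, note and velocity 0..127."""
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])


def control_change(channel, controller, value):
    """Three-byte control change message, e.g. for pan or volume."""
    return bytes([0xB0 | (channel & 0x0F), controller & 0x7F, value & 0x7F])
```

A horizontal cursor position expressed as a fraction of the screen width could then be sent as `control_change(0, PAN_CONTROLLER, to_midi_value(fraction))`, with the synthesiser, not the computer, carrying out the panning.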

This use of external MIDI controlled sound generation is notable for the purposes of this report in that it demonstrates the functional advantage gained by a multimedia product from the use of dedicated sound control systems. Subsequent projects undertaken using this system, in particular "The Audio Playground" (MA Project: Dussek 1994), have explored further the use of sound as an interpretation of the physical actions of the user, in which the sound responds to the dynamics of the user's action with the mouse, by the setting of the velocity parameter within MIDI. Non-auditory responses (Davis, RC et al., 1955) triggered in the user by the use of sound-driven narrative structure were also explored, allowing emotive exploration, discovery and shocks. These projects required the comprehensive and accurate control of sound across various parameters, and resulted in an interactive experience which stimulated, excited and absorbed the users using sound as an integral component to the user's interaction.
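
One way of deriving the MIDI velocity parameter from the dynamics of a mouse movement is sketched here. It is an assumption about how such a mapping might look, not the method used in "The Audio Playground": the function name, the use of pointer speed and the 2000 pixels-per-second ceiling are all hypothetical.

```python
import math


def mouse_velocity(x0, y0, x1, y1, dt_seconds, max_speed=2000.0):
    """Map pointer speed between two mouse samples to a MIDI velocity.

    Speed is measured in pixels per second and scaled so that max_speed
    (an assumed ceiling) or faster gives the maximum velocity of 127.
    The result never falls below 1, because a note-on with velocity 0
    is conventionally treated as a note-off.
    """
    if dt_seconds <= 0:
        raise ValueError("dt_seconds must be positive")
    speed = math.hypot(x1 - x0, y1 - y0) / dt_seconds
    fraction = min(speed / max_speed, 1.0)
    return max(1, round(fraction * 127))
```

A fast sweep of the mouse would thus trigger a louder, brighter note on most synthesisers than a slow drag, giving the sound a direct relationship to the physical gesture.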

It is clear what could be achieved with more comprehensive audio control within multimedia authoring packages, of which Director is only an example. The use of MIDI, whilst offering a great deal to the designer in terms of auditory control, is currently impractical past the prototyping stage, due to the requirement for external hardware. However, it does draw attention to the possibilities for controlling sound which packages like Director should begin to exploit. We can assume that hardware specifications will increase, with the greater use of DSP chips and the development of comprehensive but relatively cheap digital audio manipulation products by companies such as Yamaha, E-mu Systems and Roland (among many others). In the meantime, packages should make full use of the hardware and system software facilities which are already present so that the designer, and thus the user, is enabled to explore many more auditory possibilities.

Developing the use of sound within multimedia

If comprehensive audio capabilities were available to multimedia authors, would they be fully exploited? There is a clear need for greater attention to sound in both design and production (and this has implications for the way in which designers and developers are trained). Designers and developers must begin to include sound in a way which is more substantial and satisfying than has currently been exhibited if we are to experience the full potential of multimedia. Production staff should not tolerate the level of technical expertise that some now do in relation to sound. If we are to accept that most audio from a CD-ROM or other multimedia product will be replayed through the rather poor quality amplifiers and loudspeakers built into delivery platforms, then the audio should be recorded with this in mind. There is a parallel here with the television industry, where signal compressors, filtering and very high quality recording devices are used to ensure the highest possible standards from the domestic television's low fidelity sound reproduction system. In television, the end result achieves an acceptable degree of quality and intelligibility, yet in multimedia products the experience and knowledge of the professional audio industry is all too rarely used. Even where multimedia publishers are investing in recording studios to provide an acceptable level of sound quality within their products, what is subsequently done with the auditory component often remains pedestrian and uninspiring. The emotive use of sound in film, to build tension, to make us jump, to constitute, contradict or reinforce certain elements of the narrative, offers a wealth of finely crafted techniques, yet these seem to have been largely ignored. MacroMedia may well feel that there is little point in developing more comprehensive sound control functions for Director if multimedia designers show no desire to use them. 
A demand must therefore be created, and serious thought must be given to the use of sound as an integral component of the interactive experience. It is important to start considering how we may provide as much for the user aurally as we do visually: in its present stage of development, audio is still the 'missing element' in multimedia.

Tom Dussek, 1994

Notes

Note 1

This report deals with Director Version 3. Whilst version 4 is now available, its release is too recent to permit a fair appraisal of the package and its audio capabilities.

Note 2

PICT files are a graphic file format for saving information created in a paint or draw application on the Macintosh.

Note 3

AIFF, Audio Interchange File Format, is a standard for exchanging sound information between applications, required for sound files which are played back from within Director.

Note 4

Time stretching is a process whereby a sound sample is made longer (or shorter) in duration by the adding in (or removing) of estimated samples at regular intervals over the length (in time) of the sample. This alters the duration of the sample without altering the pitch.

Note 5

MIDI, Musical Instrument Digital Interface, is a communication protocol for electronic devices, e.g. synthesisers, to transmit and receive control information.

Note 6

"X-Objects" are modules of code which may be linked to Director to increase its functionality in some specific manner. The incorporation of "HyperMIDI" X-objects into Director allows MIDI information to be transmitted (or received) from Director, a function which Lingo is otherwise unable to perform.

References

Davis, R C, Buchwald, A M and Frankman, R W,

"Autonomic and Muscular Responses and their Relation to Simple Stimuli", Psychology Monographs 69, no. 405, 1955

Mountford, S J and Gaver, W W,

"Talking and Listening to Computers", from "The Art of Human Computer Interface Design", ed. Laurel, B., Addison Wesley, 1990, p. 322.

Pater, W,

"The School of Giorgione", 1873, in "The Renaissance", Fontana, 1961, London, p.135.

Peters, R W,

"Research of Psychological Parameters of Sound", report WADD TR 60-249, AD 240814, Wright Air Development Centre, Aerospace Medical Lab., Wright Patterson Airforce Base, Ohio, 1960.