
Using Audio in Multimedia

Introduction

When considering multimedia applications audio is often neglected. Traditionally computers have relied on visual interfaces, and audio facilities were very limited. Now, however, most personal computers will have sound cards and speakers, and the hardware to upgrade those that do not is relatively cheap.

Audio can be used to enhance multimedia applications in a number of ways, for example in delivering lectures over the web, music used to add interest and emotion to a presentation, and other non-speech audio used as part of a general interface.

This introductory paper will look at some of the reasons for using audio, provide an overview of digital audio file formats and look at some novel audio interfaces.

One of the main uses of audio in a networked environment is in videoconferencing applications. Videoconferencing is beyond the scope of this paper, but for more information see the AGOCG briefing paper 'Introduction to Video Conferencing'.

Why Use Audio

Perhaps the most obvious advantage of using audio is that it can provide an interface for visually disabled users. However, audio also offers a number of advantages for all users:
  • It can convey meaning, providing an extra channel of information. It allows redundancy to be incorporated into the presentation of information, so that if the meaning is unclear to a user using visual information alone, the audio may clarify it.
  • Different learners use different learning strategies, and audio can provide additional information to support different learning styles, for example some users may learn more by hearing than reading a piece of text.
  • Audio can add a sense of realism. Cultural associations with music allow you to convey emotion, time period, geographic location, etc. However, when using audio in this way you must be aware that meanings may differ in different cultures. Methods of sound spatialisation are now available, giving the effect of 3D sound, and allowing environmental acoustic effects, such as reverberation, to be added. For example, for the Windows platform, Microsoft has defined the device-independent DirectSound interface for spatial sound as part of DirectX.
  • It is useful for directing attention to important events. Non-speech audio may be readily identified by users, for example the sound of breaking glass to signify an error. Since audio can grab the user's attention so successfully, it must be used carefully so as not to unduly distract from other media.
  • It can add interest to a presentation or program.
  • Ease of communication - users may respond better to the spoken word than other media. For example, in a company presentation, 'sound bites' from satisfied customers can be used.
There are however a number of disadvantages to using audio:
  • Like most media, files can be large. However files sizes can be reduced by various methods (see File Formats), and streamed audio can be delivered over the Web (see Streaming).
  • Audio can easily be overused, and when sounds are used continually users tend to tune them out. When used in a complex environment it can increase the likelihood of cognitive overload. Studies have shown that while congruent use of audio and video can enhance comprehension and learning, incongruent material can significantly reduce it: where multiple media are used, they should be highly related to each other to be most effective.
  • For most people, audio is not as memorable as visual media.
  • Good quality audio can be difficult to produce, and like other media most commercial audio, particularly music, is subject to copyright.
  • Users must have appropriate hardware and software. In an open plan environment this must include headphones.

File Formats

There are a large number of audio formats, but in all of them the file size (and quality) depends on:
  • Sampling frequency
  • Bit depth
  • Number of channels (mono, stereo)
  • Lossiness of compression
The easiest way to reduce file size is to switch from stereo to mono. You immediately lose half the data, and for many audio files it will have only a small effect on perceived quality.

Bit depth, or sample size, is the amount of information stored for each point - equivalent to the bits/pixel in an image file. This is usually 8 or 16 bits.

Frequency is the number of times per second the sound was sampled - the higher the frequency, the better the quality. In practice the frequency is usually set at one of a number of predetermined figures, most commonly 11 kHz, 22 kHz and 44.1 kHz. 22 kHz is very common in computer sound file formats; 44.1 kHz is the standard for audio compact discs.

The total size of a mono, uncompressed sound file will be sample rate * bit depth * duration; stereo sound will be twice this. For example, CD-quality sound is 16-bit, 44.1 kHz stereo, which uncompressed comes to about 10.5 MB per minute.
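The calculation above can be sketched in a few lines of Python (the function name here is purely illustrative):

```python
# Uncompressed (PCM) audio size: sample rate x bytes per sample x channels x duration.
def audio_size_bytes(sample_rate_hz, bit_depth, channels, seconds):
    """Return the size in bytes of uncompressed audio."""
    return sample_rate_hz * (bit_depth // 8) * channels * seconds

# One minute of CD-quality audio: 44.1 kHz, 16-bit, stereo.
print(audio_size_bytes(44100, 16, 2, 60))  # 10584000 bytes - about 10.5 MB
# Switching to mono immediately halves the size.
print(audio_size_bytes(44100, 16, 1, 60))  # 5292000 bytes
```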

The most common sound formats found on the Web are WAV, a Microsoft format; AU, primarily a UNIX-based format; AIFF (Audio Interchange File Format), mainly used on Macs and SGIs; and streamed formats such as RealAudio (.ra).

Recently MP3 files have become more popular, particularly for storing CD-quality audio. MP3 refers to the MPEG (Moving Picture Experts Group) Layer 3 audio encoding scheme, which is defined within both the MPEG-1 and MPEG-2 standards. The audio encoding scheme in MPEG-2 differs from that in MPEG-1 only in that it was extended to support very low bitrate applications.

MP3 can provide about 12:1 compression from a 44.1 kHz, 16-bit stereo WAV file without noticeable degradation of sound quality. Much higher compression ratios can be obtained, but at the cost of poorer sound quality. However, MP3 is reasonably CPU intensive, encoding much more so than decoding; playback is not recommended on machines slower than a Pentium or equivalent.
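As a rough check on these figures, CD-quality PCM has a bitrate of 44,100 samples/s * 16 bits * 2 channels, and dividing by 12 lands close to the 128 kbit/s setting commonly used by MP3 encoders:

```python
# CD-quality PCM bitrate: 44.1 kHz x 16 bits x 2 channels.
pcm_bps = 44100 * 16 * 2
print(pcm_bps)                     # 1411200 bits per second

# At roughly 12:1 compression, the MP3 bitrate comes out near
# the familiar 128 kbit/s encoder setting.
print(round(pcm_bps / 12 / 1000))  # 118 (kbit/s)
```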

MIDI (Musical Instrument Digital Interface) files are different from the audio formats described above. MIDI is a communications standard developed for electronic musical instruments and computers. In some ways it is the sound equivalent of vector graphics: it is not digitized sound, but a series of commands which a MIDI playback device interprets to reproduce the sound, for example the pressing of a piano key. Like vector graphics, MIDI files are very compact; however, the sounds produced from a MIDI file depend on the playback device, and the same file may sound different from one machine to the next. MIDI files are only suitable for recording music; they cannot be used to store dialogue. They are also more difficult to edit and manipulate than digitized sound files, though if you have the necessary skills every detail can be manipulated.

Streaming

Until relatively recently, to listen to an audio file or play a video over the Web the whole file first had to be downloaded. This changed with the release of RealAudio from Progressive Networks. RealAudio, and other similar products that have followed for both audio and video, allow streaming over the Internet. Streaming means that the audio or video file is played in real-time on the user's machine, without the need to store it as a local file first.
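The essential difference from a plain download can be sketched in Python. This is only an illustration of the idea: `play_chunk` is a hypothetical callback standing in for a real playback API, and a `BytesIO` buffer stands in for the network connection.

```python
import io

def stream(source, play_chunk, chunk_size=4096):
    """Read a file-like source in chunks, handing each chunk to the
    player as it arrives rather than waiting for the whole file."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        play_chunk(chunk)  # playback can begin before the transfer finishes

# Usage: collect the chunks a player would have received.
received = []
stream(io.BytesIO(b"\x00" * 10000), received.append)
print(len(received))  # 3 chunks (4096 + 4096 + 1808 bytes)
```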

To play a RealMedia file, a link is included in the HTML document to a metafile, which gives the location of the media file held on a RealServer. When the link is selected, the RealMedia player is invoked on the client and begins to stream the media file. Generally the web browser plug-ins to play streamed media files are freely available, but the server software to deliver them must be purchased.

There are now many products available which support streaming of various audio and video formats, including MPEG, AVI and QuickTime; examples include RealMedia (www.realmedia.com), Microsoft's Media Player (www.microsoft.com/Windows/MediaPlayer) and Xing's StreamWorks (www.xingtech.com).

For an example of streaming, and more information about it, see the presentation 'Streaming Multimedia on the Web' by Les Howles of the Department of Learning Technology and Distance Education, University of Wisconsin (http://www.wisc.edu/learntech/HTMLStreamPres/StreamPres.html).

VRML

The Virtual Reality Modelling Language (VRML, often pronounced 'vermal') was designed to allow 3D 'worlds' to be delivered over the World Wide Web (WWW). Although it is usually thought of in the context of graphics only, VRML 97 supports the inclusion of spatialised, 3D audio, giving the listener a sense of the location of a virtual sound source in a virtual listening space.

A VRML file consists of a collection of objects, called nodes, containing parameters or fields which modify the node. Audio is supported through the use of several nodes:

  • Sound node, which allows you to specify the spatial details for the sound, with fields such as direction, intensity and location.
  • AudioClip node, which provides the source of sound data for the Sound node.
  • MovieTexture node, which can also be used as the source of sound data for a Sound node, in the form of an MPEG-1 file.
For more information about VRML see 'Introduction to VRML'.

Audio Interfaces

There are a number of scenarios in which an audio interface, or a combined audio/visual interface, may be more useful than a standard visual-only interface. One example is the increasingly popular Personal Digital Assistant (PDA). PDAs are no longer restricted to simple address books and electronic diaries, but can now also act as Internet terminals, word processors, etc. Their main limitation is their very small screen size, and here an audio interface may be more useful than a traditional graphical user interface (GUI).

Aural Style Sheets

Aural Cascading Style Sheets are currently being investigated by the WWW Consortium. They are designed to make WWW documents more accessible to visually impaired users. This group includes not only the blind and partially sighted, but anyone for whom visual presentation is inappropriate or whose eyes are engaged in another task, e.g. driving. Properties proposed in the aural CSS include:
  • Volume
  • Pause - a pause before and/or after an element
  • Cue - an auditory cue played before and/or after an element
  • Play-during - a background sound played during an element; like Cue, this can help to distinguish various semantic elements
  • Spatial properties, which can be used to generate stereo sound, for example to distinguish between two different 'voices'
  • Speech properties, which allow different voices, speaking rates, etc. to be specified

Wearable Audio Computing

The need for a "hands-and-eyes free" interface was recognized in the development of Nomadic Radio, a distributed computing platform designed to be worn round the neck, giving access to a variety of functions through an auditory interface using various auditory cues and speech input/output. It makes use of the "cocktail party effect": humans can listen to several audio streams simultaneously, selectively focus on the one that is of interest, and relegate the rest to the background. This allows users to be aware of messages or events without the interface requiring their full attention.

Summary

Hardware and software support for audio has improved greatly in the last few years on most platforms, allowing audio to be used much more widely. Delivery of audio over networks has benefited from the development of streaming technologies, allowing existing audio files to be delivered in real time to many users, even those using slow connections such as modems. 3D-spatialised audio is now supported in a number of ways, including VRML and vendor specific implementations such as Microsoft's DirectX.

The use of non-speech audio to provide new interfaces looks set to increase, both by providing additional methods of accessing existing applications and through the development of innovative products such as 'wearable' computers.

Careful use of audio can add significantly to the ease of use, effectiveness and appeal of many applications. However, poor use of audio can detract from an application, making interfaces harder to use and adding to users' cognitive load.

References

How to Select the Appropriate Media. Jim Martin. Martin Information Services, Inc., April 1996. http://ettu618.edu.polyu.edu.hk/Umbrella/Marticles/Articles/Article8.html

Technology in Music Education: A Demonstration of Integrating Multimedia into Web Pages for Music Education. Steven G. Estrella. Temple University. http://fred.music.temple.edu/multimedia/outline.html

The Role of Non-speech Audio and its Applicability in Multi-Sensory Systems - A Review. Meera S. Datta. F13 Conference 98, NIIT Ltd. http://206.214.38.80/conferences/F1398/Meera.html

Multimedia Online: Notes from the Field - Audio, Video, Animation. A Step in the Right Direction. The NODE - learning technologies network, June 1998. http://node.on.ca/tfl/fieldnotes/mclarke.html

Integrating Synchronous and Asynchronous Teaching Technologies. Robin Mason. Institute of Educational Technology, Open University. OTD Report No. 11. http://www-iet.open.ac.uk/iet/otd/otd11.html

Auditory Cues for Browsing, Surfing, and Navigating the WWW: The Audible Web. Michael C. Albers. Sun Microsystems, Inc. http://java.sun.com/people/mca/papers/ICAD96/ICAD96_AW.html

The MIT Wearable Computing Web Page. http://lcs.www.media.mit.edu/projects/wearables/INDEX.HTMl

MPEG Audio Layer 3 - Information File. Fraunhofer-Gesellschaft. http://www.iis.fhg.de/amm/techinf/layer3/INDEX.HTMl
