This presentation studies the basic psychoacoustic and physioacoustic phenomena behind multi-channel reproduction. The dynamic range of human hearing is the basis of the specifications for the audio baseband frequency range and dynamic range. Besides directional hearing, critical bands and masking are discussed as a basis for understanding the new bit-rate-reduced digital audio transmission and recording methods. The basic theory of directional hearing in 4π space is studied under Pinna cues of sound direction. These findings are also used in virtual sound source technology. The theory also offers a very good electrical solution for correcting the acoustical irregularities of a speaker system, improving the stability of the stereo image. The distance of a sound source is difficult to perceive in sound reproduction. Some reasons for this difficulty are explained in the chapter Perception of distance.


The most important hearing characteristic for multi-channel audio is the localization of the sound source. Other important phenomena are the hearing area, critical bands and masking, especially for new digital multi-channel systems, where these features are used to reduce the number of bits in the bitstream.




The dynamic range of hearing, i.e. the area between the threshold in quiet and the threshold of pain, is a plane in which audible sounds can be displayed. In its normal form, it is plotted with frequency on a logarithmic scale as the abscissa and sound pressure level in dB on a linear scale as the ordinate. Both axes are therefore effectively logarithmic, because the level is related to the logarithm of sound pressure. The critical-band rate may also be used as the abscissa; this scale is closer to the characteristics of our hearing system than frequency.


The usual display of the dynamic range of hearing is shown in figure 1. On the right, the ordinate scales are sound intensity in watts per square meter and sound pressure in pascals. Sound pressure level is given for a free-field condition relative to 2 × 10⁻⁵ Pa. Sound intensity level is plotted relative to 10⁻¹² W/m².
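The relation between the ordinate scales of figure 1 can be sketched in a few lines of Python. This is only an illustration: the function names are chosen here and are not from the original text; the two reference values are the ones quoted above.

```python
import math

P_REF_PA = 2e-5      # reference sound pressure from the text, in Pa
I_REF_W_M2 = 1e-12   # reference sound intensity from the text, in W/m^2


def sound_pressure_level_db(pressure_pa):
    """Sound pressure level in dB relative to 2e-5 Pa."""
    return 20.0 * math.log10(pressure_pa / P_REF_PA)


def sound_intensity_level_db(intensity_w_m2):
    """Sound intensity level in dB relative to 1e-12 W/m^2."""
    return 10.0 * math.log10(intensity_w_m2 / I_REF_W_M2)
```

A pressure ten times the reference gives 20 dB, and an intensity ten times the reference gives 10 dB, which is why the linear dB axis corresponds to logarithmic pressure and intensity axes.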

Figure 1 Dynamic range of hearing. The ordinate scale is not only expressed in sound pressure level but also in sound intensity and sound pressure.




It is assumed that our hearing system processes sounds in relatively narrow frequency bands. The part of a noise that is most effective in masking a test tone is the part of its spectrum lying near the tone. Masking is achieved when the power of the test tone equals the power of that part of the noise spectrum that lies near the tone and produces the masking effect. The parts of the noise spectrum farther from the test tone do not contribute to masking. The characteristic frequency bands defined in this way, the critical bands, have a bandwidth within which the acoustic power of the noise equals that of the tone when the tone is just masked.


Data from many subjects have been collected to produce a reasonable estimate of the width of the critical band. Although the lowest critical bandwidth in the audible frequency region may be very close to 80 Hz, it is attractive to add the inaudible range from 0 Hz to 20 Hz to that critical band and to assume that the lowest critical band ranges from 0 Hz to 100 Hz. Using this approximation, figure 2 shows the average for levels between the quiet threshold and 90 dB. The critical bandwidth tends to increase slightly for levels above 70 dB.



Figure 2 Critical bandwidth as a function of frequency. Approximations for low and high frequency ranges are indicated by broken lines [ZWFA90].




Masking plays a very important role in everyday life. For a conversation on the sidewalk of a quiet street, little power is necessary for the speakers to understand each other. However, if a loud truck passes by, the conversation is disturbed. The speakers can no longer hear each other if speech power is kept constant.

Figure 3 Schematic drawing to illustrate and characterize the regions within which premasking, simultaneous masking and postmasking occur. Note that postmasking uses a different time origin than premasking and simultaneous masking [ZWFA90].


Figure 4 Level of test tone just masked by critical-band wide noise with center frequency of 1 kHz and different levels as a function of the frequency of the test tone [ZWFA90].


Masking effects can be measured not only when masker and test tone are present simultaneously, but also when they are not. In the latter case, the test sound has to be a short burst or sound impulse. If it is presented before the masker stimulus is switched on, the masking effect is called pre-stimulus masking, or premasking (figure 3). This effect is not very strong. If the test sound is presented after the masker is switched off, a quite pronounced effect occurs, which is called post-stimulus masking, or postmasking. Figure 4 shows the dependence of the masked threshold on the level of a narrow-band noise centered at 1 kHz. Narrow-band noise means a noise with a bandwidth equal to or smaller than the critical bandwidth (about 100 Hz below 500 Hz and 0.2 f above 500 Hz).
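The rule of thumb for the critical bandwidth quoted above can be written as a small Python sketch. The function names are illustrative only, and the two-segment approximation is the coarse one given in the text, not the measured curve of figure 2.

```python
def critical_bandwidth_hz(center_hz):
    """Coarse approximation from the text: about 100 Hz below 500 Hz
    and 0.2 * f above 500 Hz."""
    if center_hz < 0:
        raise ValueError("frequency must be non-negative")
    if center_hz <= 500.0:
        return 100.0
    return 0.2 * center_hz


def is_narrow_band(bandwidth_hz, center_hz):
    """True if a noise band is no wider than the critical band at its center."""
    return bandwidth_hz <= critical_bandwidth_hz(center_hz)
```

With this approximation a 160 Hz wide noise centered at 1 kHz counts as narrow-band (the critical bandwidth there is about 200 Hz), while the same bandwidth centered at 400 Hz does not.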




Human hearing is well attuned to detecting the direction of sounds in the horizontal plane [JUTO84]. It is most sensitive below 1 kHz. For frequencies over 2 kHz the sensitivity drops only slightly. In the frequency range from 1 kHz to 2 kHz the ability to detect directions is lower. The main mechanism for separating sound directions is the time difference between the ears below 1.5 kHz and the level difference between the ears above 1.5 kHz.


If the same sound reaches the two ears at different times or with different levels, the sound source is localized in a direction that depends on the loudness or on the order of arrival. Level and time differences are interchangeable within limits, and the amounts needed to compensate for each other are frequency dependent. At low frequencies the time needed to compensate for level is about 2 µs/dB, and at high frequencies up to 100 µs/dB is needed. When the time difference exceeds 2 ms, the mechanism no longer works and the first arriving sound is totally dominant.
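The trading between level and time differences can be illustrated with a minimal Python sketch. It assumes a simple linear trade with a caller-supplied trading ratio; the text gives only the two end values (about 2 µs/dB and up to 100 µs/dB) and the 2 ms limit, so the function names and the linear model are assumptions made here.

```python
PRECEDENCE_LIMIT_US = 2000.0  # 2 ms limit from the text, in microseconds


def compensating_delay_us(level_diff_db, trading_ratio_us_per_db):
    """Delay that would offset a given interaural level difference,
    assuming a linear trade (an assumption of this sketch)."""
    return level_diff_db * trading_ratio_us_per_db


def trade_level_for_time(level_diff_db, trading_ratio_us_per_db):
    """Return the compensating delay in microseconds, or None when the
    required delay exceeds 2 ms and the first arriving sound dominates."""
    delay_us = compensating_delay_us(level_diff_db, trading_ratio_us_per_db)
    if abs(delay_us) > PRECEDENCE_LIMIT_US:
        return None
    return delay_us
```

For example, a 6 dB level difference corresponds to 12 µs at 2 µs/dB and to 600 µs at 100 µs/dB, while 30 dB at 100 µs/dB would need 3 ms and therefore falls outside the range in which trading works.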


Pinna cues of sound direction


Interaural differences in time and level are considered to be the major factors in directional hearing. They cannot, however, account for localization in three-dimensional space. Sound sources that cause identical interaural time differences lie on a hyperbolic cone around the interaural axis. The mechanism by which humans detect the elevation of a sound source is not well known, nor is front-back discrimination. One way to approach the problem is to study the function of the pinnae with sound sources in three-dimensional space.


Figure 5 Descriptive diagram of pinna


Above 1 kHz, directional information appears to be processed mainly in the frequency domain as spectral variations, while only secondary cues are handled in the time domain. Below 1 kHz the dimensions of the head and ears are such that time delay is the only primary cue to the direction of sound. The first 500 µs to 1 ms is dominant for detecting the source. Repeated directional information arriving after 1 ms has a progressively lower effect until it has no impact at 10 ms (Haas effect?) [HBR88]. For early man this was all that was necessary for survival: there was no time for a complex process to detect the direction of an arriving sound. The consequence is that directional cues have to be as simple and as clear as possible.


Figure 6 Frequency of the judgments front (f), behind (b) and above (a) for one-third-octave bands of noise [JEBL70]. Directional bands: bordered, at the 90 % level of significance; shaded, most likely.

Figure 7 Spectral features that vary monotonically with decreasing sound source elevation in vertical plane (above) and increasing source azimuth in horizontal plane [HOHA94].  


The processing of cues has to be effective and well suited to the neural system. Some experiments conclude that, for simplicity, directional information is in some cases processed monaurally [HBR88].


In the head-related transfer function (HRTF) there appears to be a notch in the high-frequency range whose position is a function of the elevation of the sound source. The notch itself may not be the primary cue, but its left slope, which varies systematically and monotonically from 6 to 13 kHz as the elevation increases, certainly is. This is the only cue below ear level. Although a spectral slope is difficult for electronic equipment to measure, detecting it is simple for the neural system, whereas detecting a spectral minimum is a more complex task. Experimentally, the sensation of an elevated sound source can be achieved by [HOHA94]

** 3.9...8.0 kHz low-pass cutoffs. Increasing the cutoff within this frequency range increases the elevation angle from 0 to 60 degrees.

** 4.0...7.2 kHz band-pass filtering, which is perceived in front with an elevation of 60 degrees.

** 7.4...10.8 kHz notches, which increase the elevation from 0 to 60 degrees as the notch frequency increases.

** 10.3 kHz low-pass filtering, which causes an elevation of 90 degrees.

** 8.1...9.1 kHz band-pass filtering, which causes the sensation of 90 to 60 degrees elevation in the rear section.

** 12.0...17.8 kHz notches, which cause an elevation of 90 degrees.
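The cues listed above can be collected into a small lookup table, sketched here in Python. The data values are those of the list; the record layout, the field names and the helper function are illustrative choices made for this sketch, and no interpolation inside a range is attempted because the text only states that the relation is monotonic.

```python
from typing import NamedTuple, Tuple


class ElevationCue(NamedTuple):
    filter_type: str                     # "low-pass", "band-pass" or "notch"
    freq_khz: Tuple[float, float]        # frequency range of the filter
    elevation_deg: Tuple[float, float]   # (min, max) perceived elevation
    section: str                         # "front", "rear" or "" if not stated


# Values copied from the list above [HOHA94].
CUES = [
    ElevationCue("low-pass", (3.9, 8.0), (0.0, 60.0), ""),
    ElevationCue("band-pass", (4.0, 7.2), (60.0, 60.0), "front"),
    ElevationCue("notch", (7.4, 10.8), (0.0, 60.0), ""),
    ElevationCue("low-pass", (10.3, 10.3), (90.0, 90.0), ""),
    ElevationCue("band-pass", (8.1, 9.1), (60.0, 90.0), "rear"),
    ElevationCue("notch", (12.0, 17.8), (90.0, 90.0), ""),
]


def cues_for_elevation(target_deg):
    """Return the listed cues whose elevation range contains target_deg."""
    return [c for c in CUES
            if c.elevation_deg[0] <= target_deg <= c.elevation_deg[1]]
```

Calling `cues_for_elevation(90.0)` returns the 10.3 kHz low-pass, the 8.1...9.1 kHz band-pass and the 12.0...17.8 kHz notch entries, which are the three cues associated with an overhead sensation in the list.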


Besides the level, the right-hand edge of a spectral notch is shifted toward higher frequencies in the right ear when the sound source moves clockwise. This cue works from 40 to 180 degrees. In the frontal section a double notch works as a directional cue from -40 to +30 degrees. It is the same double notch that signals elevation at zero azimuth. In horizontal-plane localization the concha plays the major role in the HRTF.


The sensation of a sound source behind the listener can be achieved with 13.2...15.5 kHz high-pass or 14.5 kHz band-pass filtered noise. In the first case there is no elevation and in the latter the source is elevated by 30 degrees. The common denominator is a spectral slope that rises toward high frequencies. In the HRTF there is also a deep notch at 16 kHz that does not appear in the frontal section. Front-back discrimination appears to be based on a level difference in a band at these frequencies. In the lower frequency band from 3.75...7.5 kHz there is, depending on the azimuth angle, a deep notch for rear sound sources but none for frontal sources. A third mechanism for achieving a rear sensation might be a boosted frequency band between 7 and 12 kHz.


The frequency bands that affect the sensation of a front, back or overhead sound source, with one-third-octave band noise as the stimulus, are presented in figure 6. Figure 7 shows the main pinna cues that are probably active in detecting the elevation and azimuth of the sound source. In these measurements a torso in an anechoic chamber was used, with the microphone mounted in the right ear of the torso.


Perception of distance 


Human hearing is not very accurate in detecting the distance of a sound source. Loudness is perhaps the most widely accepted cue for distance. In anechoic conditions loudness decreases inversely with distance, and a change of about 20 phon [MBGA69] is needed for the perceived distance to halve. In reverberant conditions the situation is a little different. The decrease in loudness is not as large as in anechoic conditions, and the loudness difference needed to double the perceived distance varies from 22 to 41 phon [SØNI93]. For accurate perception of distance the absolute loudness must be known. The distance of the source is overestimated at low sound levels, but the estimate improves considerably as the sound level increases. Variations between individuals are large. In anechoic conditions it is almost impossible to estimate the distance when loudness is the only cue.
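For comparison with the perceptual figures above, the physical level change with distance in a free field can be sketched as follows. The sketch assumes an ideal point source whose sound pressure is inversely proportional to distance; it gives the change in sound pressure level in dB, not the perceived loudness in phon, and the function name is chosen here.

```python
import math


def free_field_level_change_db(reference_distance_m, distance_m):
    """Change in sound pressure level when moving from reference_distance_m
    to distance_m, assuming pressure is inversely proportional to distance."""
    if reference_distance_m <= 0 or distance_m <= 0:
        raise ValueError("distances must be positive")
    return -20.0 * math.log10(distance_m / reference_distance_m)
```

Doubling the distance lowers the level by about 6 dB under this assumption, which is much less than the roughly 20 phon quoted above for a halving or doubling of the perceived distance.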


The shape of the frequency spectrum is another factor that may give a cue to the distance of a sound source. Air damping is frequency dependent: the loss at high frequencies is greater than at low frequencies. Besides direct air damping, the many reflections in reverberant rooms lose energy through absorption at the walls. This loss depends on the wall materials. Estimating the distance of a sound source therefore requires knowledge of the characteristics of the room and of the frequency spectrum of the source.


The ratio between direct sound, early reflections and reverberation also offers a cue to the distance of a sound source. In this case a known sound and known room characteristics are required for good perception. With a known source such as the human voice, the reverberation ratio appears to be the most important factor in estimating the distance [SØNI93]. Binaural differences may also sometimes be a valuable cue, although the significance of this factor is not very strong. Experiments have also shown that the distance of a source is overestimated when the source is at an angle to the listener rather than directly in front.




It has long been known that there is a high correlation between auditory spaciousness and preference in concert-hall types of sound fields. This is confirmed by psychoacoustic experiments [JBWL86]. The correlation of preference versus spaciousness is presented in figure 8. Normally, however, spaciousness information is minimized in recorded or broadcast program material. A very common technique for recording music is to place the microphones close to the instruments. With this method the signal level is highest above the noise, and external disturbances such as air conditioning, airplanes and traffic are minimized. Pop and rock music is normally recorded in sound studios, and it is common for the final recording to be assembled from short takes. It is not unusual that the entire band does not record at the same time. These sound studios are heavily damped, producing dry acoustics, while close-range microphones and multi-channel recorders are used. In reproduction, the spatial information has to arrive from the correct directions to give a satisfactory impression of spaciousness. The only way to achieve this is to add it afterward.


Figure 8 Plot of spaciousness versus preference [JBWL86]. As test material a motif of Mozart's Jupiter symphony has been used.  

Figure 9 A classic presentation of reflections in a concert hall (Capitol Theater in Yakima, Washington, USA) [HECH79]. The sound source is marked as A and the receiver as B. The direct sound is c and the reflected sounds are d and e.


Besides direct sound, reflected sounds can also be heard in a closed space (figure 9). Sound is reflected from the walls, the ceiling and the floor. The intensity of reflected sound decreases over time for three reasons. First, the sound pressure decreases inversely with distance. Second, at every reflection the sound wave loses energy; how much depends on the construction and material of the reflecting surface. The energy loss depends on frequency, so the spectrum of the sound is reshaped during the reflections. Third, the air attenuates sound, and this attenuation is also frequency dependent.
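The three loss mechanisms can be combined into a rough single-band estimate of the level of one reflection relative to the direct sound. This Python sketch is an illustration under simplifying assumptions that are not in the original text: a point source, one energy absorption coefficient per surface hit, and air attenuation given as a constant dB per meter. All parameter names are hypothetical, and the frequency dependence mentioned above would require evaluating the function separately for each band.

```python
import math


def reflection_level_db(direct_path_m, reflected_path_m,
                        absorption_coeffs, air_loss_db_per_m=0.0):
    """Level of a reflection relative to the direct sound, in dB.

    absorption_coeffs holds one energy absorption coefficient (0 <= a < 1)
    for each surface the reflection has hit (assumed model).
    """
    if direct_path_m <= 0 or reflected_path_m < direct_path_m:
        raise ValueError("reflected path must be at least as long as the direct path")
    # 1) Spreading: sound pressure is inversely proportional to path length.
    spreading_db = -20.0 * math.log10(reflected_path_m / direct_path_m)
    # 2) Energy lost at each reflecting surface.
    surface_db = 0.0
    for a in absorption_coeffs:
        if not 0.0 <= a < 1.0:
            raise ValueError("absorption coefficient must be in [0, 1)")
        surface_db += 10.0 * math.log10(1.0 - a)
    # 3) Air attenuation over the extra path length.
    air_db = -air_loss_db_per_m * (reflected_path_m - direct_path_m)
    return spreading_db + surface_db + air_db
```

As an example, a reflection that travels 20 m when the direct path is 10 m and that hits one surface absorbing half of the energy arrives about 9 dB below the direct sound, before any air attenuation is counted.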




Figure 10 Schematic presentation of the impulse response of a room. The time scale is only an example, not an actual one.

Table 1 Effect of a single reflection [MBAM81]

Delay (ms)   Frontal reflection   Lateral reflection
0            Loudness             Apparent image size or image shift
5            Tone coloration      -
10-20        -                    Tone coloration
3-60         -                    Spatial impression
-            Echo disturbance     Echo disturbance


The reflected sound field is divided into early reflections and reverberation. The response to a sound impulse is presented in figure 10. The border between early reflections and reverberation is not sharp, but it lies somewhere between 50 and 100 ms. Because early reflections cannot be separated from the direct sound, they increase the total loudness. Reflections that create the sensation of two different sounds belong to the reverberation field. If direct and reflected sounds have the same intensity they are perceived as two sounds [ALHA76]. At 100 ms and with a level difference of 6 dB they are perceived as one source. The subjective effect of a single reflection is described in table 1.


Spaciousness is a multi-dimensional perceptual attribute. The most important cues for spaciousness are early reflections. They are not the only cues, but the other components of the later sound and the reverberation field offer only complementary information. In cases where the level and number of early lateral reflections are low, the reverberation can be used as a substitute. Sometimes the reverberation is perceived as a disturbance.


To be valuable for spaciousness, the early reflections have to arrive from directions other than that of the direct sound. All spectral components of those reflections carry information about the room, so reflectors have to work over a wide frequency band to be effective. Different spectral components of the reflections have different effects. Frequencies below 3 kHz in lateral reflections mainly expand the sensation of depth. If the reflections contain components above 3 kHz, the perception of a broadening of the room is prominent. In figure 11 the reflection patterns of two different shapes of halls are studied [JEBO85].


Figure 11 Directions of early reflections in two different shapes of halls. The reflection pattern of a rectangular room is on the left, and on the right is that of a fan-shaped hall with a straight rear wall [JEBO85].

Figure 12 Spatial impression against lateral delayed reflection pair at 40 degree angle with Mozart's motif. 95 % confidence limit presented as vertical lines [MBAM81].  


During the first 100 ms the sound is reflected over a hundred times, creating an equal number of image sources. At present, signal processors for consumer use are not powerful enough to handle this amount of audio data, so data reduction is needed. Only the most important reflections can be taken into account. Reflections from the front section are ignored, as are those arriving from the same direction within 1 ms of each other. Elevated image sources are also discarded. This should leave only five to ten reflections to process, few enough for consumer processors. The effect of one reflection pair at an angle of 40 degrees is presented in figure 12.
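The data reduction described above can be sketched as a simple filter over a list of reflections. The text gives only the rules (drop frontal reflections, drop elevated image sources, merge reflections from the same direction within 1 ms, keep five to ten); every numeric threshold other than the 1 ms window, the record layout, and the choice to keep the earliest reflections when too many remain are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflection:
    delay_ms: float       # arrival time after the direct sound
    azimuth_deg: float    # 0 = straight ahead, range -180..180
    elevation_deg: float  # 0 = ear level
    level_db: float       # level relative to the direct sound


def angular_difference_deg(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def reduce_reflections(reflections,
                       frontal_half_angle_deg=30.0,   # assumed threshold
                       max_elevation_deg=15.0,        # assumed threshold
                       same_direction_tol_deg=10.0,   # assumed threshold
                       merge_window_ms=1.0,           # 1 ms, from the text
                       max_count=10):                 # upper bound from the text
    kept = []
    for r in sorted(reflections, key=lambda x: x.delay_ms):
        if abs(r.azimuth_deg) < frontal_half_angle_deg:
            continue  # reflection from the front section
        if r.elevation_deg > max_elevation_deg:
            continue  # elevated image source
        duplicate = any(
            r.delay_ms - k.delay_ms <= merge_window_ms
            and angular_difference_deg(r.azimuth_deg, k.azimuth_deg)
            <= same_direction_tol_deg
            for k in kept
        )
        if duplicate:
            continue  # same direction within the merge window
        kept.append(r)
    return kept[:max_count]  # keeps the earliest ones (assumption)
```

A ranking by level or by perceptual importance could replace the final truncation; the text does not say how "the most important reflections" are chosen.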


Figure 13 Quality of listening experience according to count of reproduction channels [NMKOS71] when music is used as program source in frontal section.  




The psychoacoustic targets of multi-channel reproduction systems can be split into two main groups. The goal of the first group is to create super directionality, so that a sound source can be located anywhere around the listener. Results with such systems are impressive, but they sometimes tire listeners. The other possibility is to create an ambient field surrounding the listener, as in a concert hall.




If accurate localization everywhere around the listener is desired, more than two channels and speakers have to be used. Many experiments have shown that the four channels used in quadraphony are not enough [MCA68], and the need for more channels is obvious (figure 13). The first step is to place a fifth channel between the front left and right speakers. Sound localization with this system is somewhat confusing and the result is not very good [GTGP76]. A six-channel system is also possible. The extra speakers can be placed between the left and right channels in both the front and the rear. In this layout the new speakers provide a real source at the points where phantom accuracy is highest. The alternative is to place the new speakers where phantom accuracy is lowest, at the sides of the listener. Some experimenters [GTGP76] prefer this six-channel configuration over the first one.


In rectangular quadraphony the most accurate phantom localization is achieved in the front quadrant. The image is easily perceived, and it shifts smoothly and sharply with the inter-channel level difference. If out-of-phase phantoms are used, the image is unstable and front-rear confusions are common; the sound source has also been localized outside the speakers. The phantom source behavior in the rear quadrant is very similar in both the in-phase and out-of-phase cases. The shift with inter-channel level difference is sometimes sharper, and rear-front confusions increase slightly. In the side quadrants the localization of a phantom source is almost impossible to achieve [RICA79].


Figure 14 Preferred angles of front speakers

Figure 15 Phantom source localization in four speaker system between front and right side speaker according to level difference [RICA79]


In the arrangement presented on the right side of figure 4.2, phantom source localization is almost as good as in the frontal section of the rectangular quadraphony system. A phantom source generated in phase shifts smoothly from the center to the side speaker according to the difference in sound level. Front-rear confusion is a little higher than in the front quadrant of the rectangular system. Out-of-phase signals between the front and rear speakers caused an extremely vague phantom, and the rate of front-rear confusion was high. The behavior of the front-side speakers is presented in figure 15. If the levels of the signals are nearly equal, the sound is localized to the rear and on the opposite side. Sources in the two rear quadrants have very unstable, non-localized phantoms, and the rear-front confusion rate is also very high. This system is totally useless for creating any localized rear source [RICA79].


Inter-channel phase difference


It is a known phenomenon that an inter-channel phase difference can shift a phantom source toward the leading speaker. Compared with inter-channel level difference, this method of controlling the phantom source is rarely used. The reason may be that experiments have shown that localization based on inter-channel phase is difficult to achieve and that the stereo image is not well defined. Theoretically, the inter-channel phase delay can be reduced to an inter-channel level difference [FOED8X].


Figure 16 Stereophonic system geometry


The stereophonic geometry presented in figure 16 has a phase delay, expressed in radians, between the speakers at the listening point. The wavefront H(x) [BBE85] generated along the x-axis is




where L and R are the amplitudes of the left and right channels, k = 2π/λ is the wave number and λ is the wavelength. For the listening geometry in figure 16 and the frequencies to be considered [BBE85]






Substituting equations (2) and (3) into equation (1), neglecting the terms that carry no directional information as well as the radial amplitude reduction of the divergent wavefront, assuming that variations in the speaker polar diagram are small over the area occupied by the head, and assuming that there is no level difference between the left and right sources, equation (1) can be written




The equation (4) has the maximum




and respectively




If the delay is equal to 0, the maximum is at x = 0 on the center axis of the speakers, and if the delay is not 0 the maximum shifts toward the leading speaker.


The direction of the arriving sound can be calculated from the equation [CDV57]




where Le and Re are the average instantaneous sound pressures from the left and right loudspeakers in the left and right ears respectively. Assuming the listener is on the center axis of the speakers, the sound pressures of the interference pattern due to the phase difference can be calculated from equation (4) at the positions of the ears, -xm ≤ x ≤ xm. Substituting the calculated values into equation (7) gives the estimated direction. The sound pressures over the distance between the ears, 14 cm, are calculated in figure xx with the geometry of figure 16 set to w = 230 cm, h = 200 cm and f = 500 Hz. The calculated and measured directions are shown in figure 17. In the 500 Hz experiment one-third-octave pink noise was used.


The phase difference is a usable cue for sound localization only at frequencies below 1.5 kHz, because of the size of the head. At high frequencies the interference pattern has more than one minimum or maximum over the distance between the ears, and the pressures at the ears are no longer proportional to the pressures of the speakers.
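The role of head size can be made concrete by comparing the wavelength with the 14 cm ear spacing used above. The following Python sketch is only a rough plane-wave comparison: the speed of sound of 343 m/s is an assumption, the function names are chosen here, and the actual spacing of maxima and minima in the interference pattern also depends on the speaker geometry, which this sketch ignores.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C
EAR_SPACING_M = 0.14        # distance between the ears used in the text


def wavelength_m(frequency_hz):
    """Wavelength in meters at the given frequency."""
    return SPEED_OF_SOUND_M_S / frequency_hz


def wave_number(frequency_hz):
    """Wave number k = 2*pi / wavelength, in radians per meter."""
    return 2.0 * math.pi / wavelength_m(frequency_hz)


def ear_spacing_in_wavelengths(frequency_hz):
    """Ear spacing expressed as a fraction of the wavelength."""
    return EAR_SPACING_M / wavelength_m(frequency_hz)
```

At 500 Hz the ear spacing is about a fifth of the wavelength, while at 5 kHz it spans about two wavelengths, which illustrates why the pressures at the two ears no longer follow a single smooth lobe of the pattern at high frequencies.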


Figure 17 Phantom source positioning based on inter-channel phase difference in geometry presented in figure 16 (frequency 500 Hz) [FOED8X].



Besides accurate 360-degree localization, another objective of a surround sound system is to create an ambient sound field with the rear channels. In this case accurate localization is provided only in front of the listeners, and in all other directions the sound is not localized. The task is now somewhat easier, and a satisfactory result can be achieved with as few as three channels. The best ambiance is achieved by placing the surround speakers at the sides of the listeners [NMKOS71] instead of at the rear. If more channels are used, rear channels can also be used in conjunction with the side speakers. The relative quality of sound reproduction as a function of the number of speakers is presented in figure 13. In that figure super directivity has been studied, but with an orchestra located in front of the listeners as the sound source and with side and rear speakers producing ambiance signals.



[ALHA76] Alpo Halme, "Rakennus- ja huoneakustiikka", Otakustantamo, 3rd unrevised edition, 1987


[BBE85] J. C. Bennett, K. Barker, F. O. Edeko, "A New Approach to the Assessment of Stereophonic Sound System Performance", Journal of the Audio Engineering Society, vol. 33, pp. 314...321, May 1985


[CDV57] H. M. Clark, G. F. Dutton, P. B. Vanderlyn, "The 'Stereosonic' Recording and Reproducing System", Proceedings of the Institution of Electrical Engineers, vol. 104, Part B, pp. 417...430, 1957


[FOED8X] F. O. Edeko, "Image Localization and Interchannel Phase Difference", Electronics & Wireless World, vol. , pp. 799...802


[GTGP76] G. Theile, G. Plenge, "Localization of Lateral Phantom Sources", conference presentation, Audio Engineering Society, Zurich, Switzerland, March 2...5, 1976


[HBR88] E. R. Hafter, T. N. Buell, V. M. Richards, "Onset-Coding in Lateralization: Its Form, Site, and Function", in Auditory Function, G. M. Edelman, W. E. Gall, W. M. Cowan, Eds., Wiley, New York, pp. 647...676, 1988




[HOHA94] Hok-Loe Han, "Measuring a Dummy Head in Search of Pinna Cues", Journal of the Audio Engineering Society, vol. 42, no. 1/2, pp. 15...37, January 1994


[JBWL86] Jens Blauert and Werner Lindemann, "Auditory Spaciousness: Some Further Psychoacoustic Analyses", J. Acoust. Soc. Am., pp. 533...542, 1986


[JEBL70] Jens Blauert, "Sound Localization in the Median Plane", Acustica, vol. 22, pp. 205...213, 1969/1970


[JEBO85] Jeffrey Borish, "An Auditorium Simulator for Domestic Use", Journal of the Audio Engineering Society, vol. 33, no. 5, pp. 330...341, May 1985


[JUTO84] Juha Törönen, "Kaiuttimien tuotekehitys- ja tutkimusmenetelmien kehitysmahdollisuudet", Valtion teknillinen tutkimuskeskus, Tiedotteita 273, Espoo, 1984


[MBAM81] M. Barron and A. H. Marshall, "Spatial Impression Due to Early Lateral Reflections in Concert Halls: The Derivation of a Physical Measure", Journal of Sound and Vibration, 77(2), pp. 211...232, 1981


[MCA68] M. Camras, "Approach to Recreating a Sound Field", J. Acoust. Soc. Am., vol. 43, pp. 1425...1431, 1968


[NMKOS71] Takeshi Nakayama, Tanetoshi Miura, Osamu Kosaka, Michio Okamoto and Takeo Shiga, "Subjective Assessment of Multichannel Reproduction", Journal of the Audio Engineering Society, vol. 19, October 1971


[RICA79] Richard Cabot, "A Triphonic Sound Reproduction System Using Coincident Microphones", Journal of the Audio Engineering Society, vol. 27, pp. 965...969, December 1979


[SØNI93] Søren H. Nielsen, "Auditory Distance Perception in Different Rooms", Journal of the Audio Engineering Society, vol. 41, pp. 755...770, October 1993


[ZWFA90] E. Zwicker, H. Fastl, "Psychoacoustics: Facts and Models", Springer-Verlag, 1990