Psychoacoustics

Psychoacoustics constitutes a branch of psychophysics dedicated to the scientific investigation of sound perception by the human auditory system. This interdisciplinary field explores the psychological responses associated with auditory stimuli, including noise, speech, and music, drawing upon principles from psychology, acoustics, electronic engineering, physics, biology, physiology, and computer science.

Foundational Context

Hearing is not a purely mechanical phenomenon of wave propagation; it is also a sensory and perceptual event. When a person hears something, sound waves traveling through the air reach the ear, where they are converted into neural action potentials. These nerve impulses then travel to the brain, where they are perceived. For many acoustical problems, such as audio processing, it is therefore advantageous to account not only for the mechanics of the environment but also for the role of the ear and the brain in shaping the listening experience.

For example, the inner ear conducts significant signal processing during the conversion of sound waveforms into neural stimuli, a process that can render certain differences between waveforms imperceptible. This physiological characteristic is exploited by data compression techniques, such as MP3. Moreover, the auditory system exhibits a nonlinear response to varying sound intensity levels, a phenomenon known as loudness. Telephone networks and audio noise reduction systems utilize this principle by nonlinearly compressing data samples before transmission and subsequently expanding them for playback. A further effect of the ear's nonlinear response is the generation of phantom beat notes, or intermodulation distortion products, when sounds of closely related frequencies occur.
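The compress-then-expand idea can be illustrated with mu-law companding, one curve used in telephony (ITU-T G.711). The following is a minimal sketch of the continuous mu-law formula only, not of the 8-bit G.711 codec itself; the function names are illustrative.

```python
import math

MU = 255  # parameter value used by the mu-law variant of G.711

def mu_law_compress(x, mu=MU):
    """Map a sample in [-1, 1] to [-1, 1], giving quiet levels more of the range."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress: recover the original sample for playback."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

# Quiet samples are boosted before transmission and restored afterwards.
for x in (0.01, 0.1, 0.5, 1.0):
    y = mu_law_compress(x)
    print(f"{x:5.2f} -> {y:.3f} -> {mu_law_expand(y):.3f}")
```

Because the curve is logarithmic, a fixed number of quantization levels is spent more densely on quiet samples, where the ear is comparatively more sensitive to error.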

At least five perceptual features are commonly used in psychoacoustic practice: loudness, which quantifies perceived volume; roughness, which represents sensory dissonance; sharpness, which relates to spectral distribution; tonalness, which is based on the ratio of tonal spectral peaks; and spaciousness, which predicts perceived spatial extent.

An alternative methodology for music genre recognition or recommendation involves the exclusion of a wide range of objective features that lack direct correlation with human perception. However, certain low-level features, despite not being directly tied to human or physical perception, can nonetheless contribute to advancing the understanding of psychoacoustics.

Root mean square (RMS) amplitude is a simple measure of signal level that is often used as a rough proxy for loudness, which makes it useful for monitoring volume. Spectral rolloff helps assess frequency balance, while spectral flatness indicates how noise-like (as opposed to tonal) a signal is. Lastly, inter-channel cross-correlation estimates the perceptual relationship between the sound received by one ear and that received by the other.
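These low-level features can be sketched in a few lines. The definitions below are common textbook ones and are assumptions here: rolloff as the frequency below which 85% of spectral power lies, flatness as the ratio of geometric to arithmetic mean of the power spectrum, and cross-correlation taken at zero lag only. The function names are illustrative, and the spectrum is assumed to be precomputed.

```python
import math

def rms(samples):
    """Root mean square amplitude of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def spectral_rolloff(freqs, power, fraction=0.85):
    """Lowest frequency at which the cumulative power reaches `fraction` of the total."""
    target = fraction * sum(power)
    running = 0.0
    for f, p in zip(freqs, power):
        running += p
        if running >= target:
            return f
    return freqs[-1]

def spectral_flatness(power):
    """Geometric mean over arithmetic mean: near 1 for flat spectra, near 0 for tonal ones."""
    eps = 1e-12  # avoids log(0) for empty bins
    log_mean = sum(math.log(p + eps) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power) + eps)

def cross_correlation(left, right):
    """Normalized zero-lag correlation between two channels, in [-1, 1]."""
    num = sum(l * r for l, r in zip(left, right))
    den = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return num / den if den else 0.0
```

A fuller interaural measure would search over a range of lags rather than only zero lag; that step is omitted to keep the sketch short.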

Perceptual Limits

The human auditory system is nominally capable of perceiving sounds within the frequency range of 20 to 20000 Hz. This upper limit typically declines with age, resulting in most adults being unable to detect frequencies above 16000 Hz. Under ideal laboratory conditions, the lowest frequency recognized as a musical tone is 12 Hz. Furthermore, tones between 4 and 16 Hz can be perceived through the body's tactile sense.

Human perception of audio signal time separation has been measured at less than 10 μs. This observation does not signify that frequencies above 100 kHz (1/10 μs) are audible, but rather indicates that temporal discrimination is not directly coupled with the frequency range of audibility.

The frequency resolution of the human ear is approximately 3.6 Hz within the octave range of 1000–2000 Hz. This implies that pitch changes greater than 3.6 Hz are perceptible in a clinical context. However, even smaller pitch differences can be discerned through other mechanisms. For example, the interference of two pitches often results in a repetitive variation in the tone's volume. This amplitude modulation occurs at a frequency equal to the difference between the two tones' frequencies and is known as beating.
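The beat rate follows from the sum-to-product identity for two equal-amplitude sinusoids. The symbols f_1, f_2, and t (time) are introduced here for illustration:

```latex
\sin(2\pi f_1 t) + \sin(2\pi f_2 t)
  = 2 \cos\bigl(\pi (f_1 - f_2)\, t\bigr)\, \sin\bigl(\pi (f_1 + f_2)\, t\bigr)
```

The sine factor is a tone at the average frequency (f_1 + f_2)/2, and the cosine factor is a slowly varying envelope. Because loudness depends on the magnitude of the envelope, it peaks |f_1 - f_2| times per second. Tones at 440 Hz and 443 Hz, for example, beat 3 times per second.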

Western musical notation utilizes a semitone scale that is logarithmic with respect to frequency, rather than linear. Conversely, scales such as the mel and Bark scales have been developed directly from human auditory perception research. Although these are primarily applied in perceptual studies, not musical composition, they demonstrate an approximately logarithmic frequency relationship at higher ranges and a nearly linear one at lower frequencies.
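The difference between these scales can be seen numerically. The sketch below uses two widely cited closed-form approximations, which are assumptions here since the text does not specify any formula: the 2595·log10(1 + f/700) mel formula and Traunmüller's Bark approximation.

```python
import math

def semitones_between(f1_hz, f2_hz):
    """Musical interval in equal-tempered semitones (purely logarithmic)."""
    return 12 * math.log2(f2_hz / f1_hz)

def hz_to_mel(f_hz):
    """Common mel approximation: roughly linear below ~700 Hz, logarithmic above."""
    return 2595 * math.log10(1 + f_hz / 700)

def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark critical-band scale."""
    return 26.81 * f_hz / (1960 + f_hz) - 0.53

# The same 100 Hz step spans far more mels at low frequencies than at high ones.
print(hz_to_mel(200) - hz_to_mel(100))
print(hz_to_mel(8100) - hz_to_mel(8000))
```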

The range of sound intensities perceptible to humans is exceptionally vast. The human eardrum exhibits sensitivity to minute fluctuations in sound pressure, capable of detecting changes ranging from a few micropascals (μPa) up to values exceeding 100 kPa. Consequently, sound pressure levels are logarithmically quantified, with all pressures standardized against a reference of 20 μPa (equivalent to 1.97385×10⁻¹⁰ atm). This establishes the lower threshold of audibility at 0 dB; however, the upper limit remains less precisely delineated, primarily concerning the risk of inducing noise-related hearing impairment.
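The logarithmic measure described here is the sound pressure level, 20·log10(p / p_ref). A minimal sketch follows; the constant and function names are illustrative.

```python
import math

P_REF_PA = 20e-6  # reference pressure of 20 micropascals

def spl_db(pressure_pa):
    """Sound pressure level in dB relative to 20 micropascals."""
    return 20 * math.log10(pressure_pa / P_REF_PA)

def pressure_from_spl(level_db):
    """Inverse conversion: pressure in pascals for a given level in dB SPL."""
    return P_REF_PA * 10 ** (level_db / 20)

print(spl_db(20e-6))   # the reference pressure itself sits at 0 dB
print(spl_db(100e3))   # 100 kPa corresponds to roughly 194 dB
```

Each factor of 10 in pressure adds 20 dB, so the span from 20 μPa to 100 kPa fits within about 194 dB.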

Detailed investigations into the lower bounds of audibility reveal that the minimum intensity required for sound perception is contingent upon its frequency. By systematically measuring this minimal intensity across a spectrum of test tone frequencies, a frequency-dependent absolute threshold of hearing (ATH) curve can be established. The human ear typically exhibits peak sensitivity, corresponding to its lowest ATH, within the 1–5 kHz range. Nevertheless, this threshold is subject to age-related variations, with older individuals generally demonstrating diminished sensitivity above 2 kHz.
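One widely used closed-form approximation of the ATH curve is attributed to Terhardt and appears often in perceptual-coding literature. It is an assumption here, since the text gives no formula, and it models a young listener rather than the age-related changes mentioned above.

```python
import math

def ath_db_spl(freq_hz):
    """Approximate threshold in quiet (dB SPL) using Terhardt's formula."""
    k = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * k ** -0.8                         # steep rise toward low frequencies
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)  # sensitivity dip near 3.3 kHz
            + 1e-3 * k ** 4)                         # steep rise toward high frequencies

# Scan the audible range in 100 Hz steps to find the most sensitive frequency.
most_sensitive_hz = min(range(100, 16001, 100), key=ath_db_spl)
print(most_sensitive_hz, round(ath_db_spl(most_sensitive_hz), 2))
```

Under this approximation the minimum falls inside the 1–5 kHz region noted in the text.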

The absolute threshold of hearing (ATH) represents the lowest boundary among the equal-loudness contours. These contours delineate the sound pressure level (dB SPL) across the audible frequency spectrum at which sounds are perceived to possess equivalent loudness. Fletcher and Munson conducted the initial measurements of equal-loudness contours at Bell Labs in 1933, employing pure tones delivered through headphones; their collected data are known as Fletcher–Munson curves. Due to the inherent challenges in quantifying subjective loudness, these curves were derived by averaging data from numerous participants. In 1956, Robinson and Dadson refined this methodology, generating a revised set of equal-loudness curves for a frontal sound source evaluated within an anechoic chamber. These Robinson-Dadson curves were subsequently standardized as ISO 226 in 1986. A revision of ISO 226 occurred in 2003, incorporating data compiled from 12 international research initiatives.

Sound Localization

Sound localization refers to the cognitive process by which the spatial origin of an auditory stimulus is identified. The brain leverages minute interaural disparities in loudness, tonal characteristics, and temporal arrival to ascertain sound source locations. Spatial localization can be characterized by three-dimensional parameters: azimuth (horizontal angle), zenith (vertical angle), and either distance (for stationary sounds) or velocity (for moving sounds). Humans, akin to most quadrupedal species, demonstrate proficiency in discerning horizontal sound directions but exhibit reduced accuracy in vertical localization, primarily attributable to the symmetrical placement of their auditory organs. Conversely, certain owl species possess asymmetrically positioned ears, enabling them to detect sound across all three spatial planes—an evolutionary adaptation facilitating the nocturnal hunting of small mammals.
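The interaural time difference available for horizontal localization can be estimated with Woodworth's spherical-head model. This is a simplified sketch, not a method described in the text: the head radius and speed of sound below are assumed typical values, and the formula is intended for azimuths between 0 and 90 degrees.

```python
import math

HEAD_RADIUS_M = 0.0875        # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0    # assumed speed of sound in air at about 20 degrees C

def itd_seconds(azimuth_deg):
    """Interaural time difference for a distant source (Woodworth model)."""
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))

for az in (0, 30, 60, 90):
    print(az, round(itd_seconds(az) * 1e6), "microseconds")
```

Under these assumptions the difference stays below a millisecond even for a source directly to one side.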

Masking Effects

Consider a signal that a listener can hear when no other sound is present. When the same signal is played together with another sound, it may need to be more intense before the listener can hear it. The interfering sound is termed the masker, and the resulting loss of audibility is called masking. The masker does not need to contain the frequency components of the masked signal for masking to occur. Masking typically arises when a signal and a masker are presented simultaneously (for example, when a whisper is drowned out by a shout), so that the listener cannot hear the weaker signal. Masking can also affect a signal that occurs just before the masker starts (backward masking) or just after the masker stops (forward masking). For instance, a single sudden loud clap can make sounds that immediately precede or follow it inaudible. Backward masking is typically weaker than forward masking. Auditory masking has been studied extensively in psychoacoustic research and is exploited by lossy audio encoders such as MP3.

Missing Fundamental

When exposed to a harmonic series of frequencies, such as 2f, 3f, 4f, 5f, and so forth (where f denotes a particular frequency), human perception typically identifies the pitch as f.
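For idealized, exactly harmonic components at whole-number frequencies, the implied pitch is the greatest common divisor of the component frequencies. This is a toy illustration only; it is not a model of how the auditory system extracts pitch, and the function name is illustrative.

```python
from functools import reduce
from math import gcd

def implied_fundamental(harmonics_hz):
    """Greatest common divisor of integer harmonic frequencies, in Hz."""
    return reduce(gcd, harmonics_hz)

# Harmonics 2f..5f of a 200 Hz tone, with the 200 Hz component itself absent.
print(implied_fundamental([400, 600, 800, 1000]))
```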

Music

Psychoacoustics encompasses subjects and research pertinent to both music psychology and music therapy. Theorists, including Benjamin Boretz, contend that certain psychoacoustic findings hold significance exclusively within a musical framework.

Irv Teibel's Environments series LPs, produced between 1969 and 1979, represent an early commercial offering of sounds specifically designed to augment psychological capacities.

Applied Psychoacoustics

Psychoacoustics has historically maintained a symbiotic relationship with computer science. Notable internet pioneers, J. C. R. Licklider and Bob Taylor, both pursued graduate studies in psychoacoustics. Similarly, BBN Technologies initially focused on acoustics consulting prior to its involvement in constructing the inaugural packet-switched network.

Licklider authored a significant paper titled "A duplex theory of pitch perception".

Psychoacoustics finds application across numerous domains of software development, where engineers implement established and experimental mathematical models in digital signal processing. Many audio compression codecs, including MP3 and Opus, employ a psychoacoustic model to enhance compression ratios. The efficacy of traditional audio systems in reproducing music within theaters and residences is largely attributable to psychoacoustics. Moreover, psychoacoustic principles have led to innovative audio systems, such as psychoacoustic sound field synthesis.

Researchers have also explored, with limited success, the development of novel acoustic weapons capable of emitting frequencies that could impair, harm, or be lethal. Psychoacoustics is further utilized in sonification to render multiple independent data dimensions audible and readily interpretable. This facilitates auditory guidance without requiring spatial audio, finding use in sonification-based computer games and other applications like drone operation and image-guided surgery.

Contemporary musical applications also leverage psychoacoustics, as musicians and artists continually craft new auditory experiences by masking undesirable instrument frequencies, thereby accentuating others. A further application involves the design of compact or lower-fidelity loudspeakers, which can exploit the phenomenon of missing fundamentals to simulate bass notes at frequencies below their physical production capabilities.

Automobile manufacturers meticulously engineer the acoustic properties of their engines and even vehicle doors to achieve specific sound profiles.

Perceptual Audio Coding

The psychoacoustic model facilitates high-quality lossy signal compression by identifying components of a digital audio signal that can be eliminated or reproduced at a lower fidelity without a substantial degradation in the perceived sound quality. This significantly enhances the overall compression ratio, with psychoacoustic analysis frequently yielding compressed music files that are one-tenth to one-twelfth the size of high-quality masters, yet exhibit a proportionally smaller loss in discernible quality. This type of compression is integral to almost all contemporary lossy audio compression formats. Examples of these formats include Dolby Digital (AC-3), MP3, Opus, Ogg Vorbis, AAC, WMA, MPEG-1 Layer II (employed for digital audio broadcasting in various nations), and ATRAC, the compression technology utilized in MiniDisc and certain Walkman models.

Psychoacoustics is fundamentally grounded in human anatomy, particularly the auditory system's limitations in sound perception. The primary constraints include:

High-frequency limit
Absolute threshold of hearing
Temporal masking (encompassing forward masking and backward masking)
Simultaneous masking (alternatively termed spectral masking)

A compression algorithm can assign lower priority to sounds outside the range of human hearing and reduce the precision of individual frequency components according to the masking expected at each one. By shifting bits from the less important components to the more important ones, the algorithm ensures that the sounds a listener is most likely to perceive are represented most accurately.
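The reallocation of bits can be sketched as a greedy loop over frequency bands. This is a simplified illustration, not the procedure of any particular codec. It assumes each band has a signal-to-mask ratio (SMR, in dB) supplied by a psychoacoustic model, and that each extra bit lowers quantization noise by about 6.02 dB. The function name and the example values are hypothetical.

```python
DB_PER_BIT = 6.02  # approximate SNR gain per extra bit of uniform quantization

def allocate_bits(smr_db, total_bits, max_bits_per_band=16):
    """Give each bit to the band whose quantization noise is most audible."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        # Noise-to-mask ratio per band: positive means the noise is still audible.
        nmr = [s - DB_PER_BIT * b for s, b in zip(smr_db, bits)]
        candidates = [i for i, b in enumerate(bits) if b < max_bits_per_band]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: nmr[i])
        if nmr[worst] <= 0:
            break  # all remaining noise is already below the masking threshold
        bits[worst] += 1
    return bits

# Hypothetical SMR values for four bands; the third band is fully masked.
print(allocate_bits([30, 12, -5, 20], total_bits=8))
```

A band whose SMR is negative receives no bits at all, which is where most of the savings come from.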

Audio encoders employ a perceptual (psychoacoustic) model to analyze audio, determining the necessary precision for each frequency band or temporal segment. The outcomes of this analysis subsequently guide the adjustment of coding precision across varying frequencies and time, utilizing a suite of coding tools specific to the audio encoding format, given that different formats support distinct toolsets.

These coding tools include, but are not limited to:

Frequency filtering (e.g., low-pass, high-pass)
Transform window selection, encompassing size and model parameters
Joint stereo coding
Parametric stereo
Sample requantization
Non-linear quantization
Vector quantization
Temporal noise shaping (TNS)
Perceptual noise substitution (PNS)
Spectral band replication (SBR)

Many encoders incorporate a rate control algorithm to maintain the output bitrate of the encoded audio within specified constraints. Should transparent coding prove unattainable at the desired bitrate, these algorithms will modify coding precision—thereby introducing distortion—across different segments of the sound spectrum. This adjustment is guided by data derived from the psychoacoustic model, continuing until the target bitrate is achieved.
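A rate control loop of this kind can be sketched as follows. This is a toy model under stated assumptions, not the algorithm of any real encoder: each band's signal-to-mask ratio (SMR, in dB) is taken as given, each bit is assumed to buy about 6.02 dB of noise reduction, and distortion is introduced by letting the noise exceed the masking threshold in 1 dB steps. All names and values are hypothetical.

```python
import math

DB_PER_BIT = 6.02  # approximate SNR gain per extra bit of uniform quantization

def bits_needed(smr_db, allowed_excess_db):
    """Bits per band so that noise exceeds the mask by at most allowed_excess_db."""
    return [max(0, math.ceil((s - allowed_excess_db) / DB_PER_BIT)) for s in smr_db]

def fit_to_budget(smr_db, bit_budget, step_db=1.0):
    """Relax precision step by step until the allocation fits the bit budget."""
    excess_db = 0.0  # 0 dB means transparent coding: all noise stays masked
    while True:
        bits = bits_needed(smr_db, excess_db)
        if sum(bits) <= bit_budget:
            return bits, excess_db
        excess_db += step_db  # accept slightly more audible distortion

bits, excess_db = fit_to_budget([30, 12, -5, 20], bit_budget=8)
print(bits, excess_db)
```

When the budget is large enough, the loop returns at once with zero excess, which corresponds to the transparent case.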
