Computational Perception Group


Machine Hearing Research

The sounds of falling rain, running water, crackling fire, and howling wind are immediately recognisable. But what makes them sound the way they do?

The answer must lie in the statistics of the sounds. For instance, no two "rain" sounds are identical, because the precise arrangement of falling water droplets is never repeated. Consequently, the perceptual similarity of two rain sounds cannot be derived from a direct comparison of their waveforms. Instead, the similarity must be derived at the level of the statistics of the sounds, that is, the aspects of the waveform that relate to the rate of falling raindrops, the distribution of droplet sizes, and so on.
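What might such statistics look like? One illustrative family, loosely inspired by the texture statistics of McDermott and Simoncelli, is the moments (mean, spread, skewness) of the amplitude envelope in each frequency band: impulsive textures such as rain give more skewed envelopes than steady ones such as a stream. A minimal numpy sketch, where the band edges and smoothing window are arbitrary illustrative choices rather than the group's actual model:

```python
import numpy as np

def band_envelope_stats(x, sr, bands, win=256):
    """Per-band envelope statistics: mean, standard deviation, skewness."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    kernel = np.ones(win) / win
    stats = []
    for lo, hi in bands:
        Xb = np.where((freqs >= lo) & (freqs < hi), X, 0.0)
        band = np.fft.irfft(Xb, n=len(x))                  # crude FFT band-pass
        env = np.convolve(np.abs(band), kernel, "same")    # smoothed envelope
        m, s = env.mean(), env.std()
        stats.append((m, s, ((env - m) ** 3).mean() / (s ** 3 + 1e-12)))
    return np.array(stats)                                 # shape (n_bands, 3)

rng = np.random.default_rng(0)
sr = 16000
steady = rng.standard_normal(sr)                   # stream-like: dense noise
sparse = steady * (rng.random(sr) < 0.05)          # rain-like: sparse impulses
bands = [(100, 2000), (2000, 6000)]
steady_stats = band_envelope_stats(steady, sr, bands)
sparse_stats = band_envelope_stats(sparse, sr, bands)
# The sparse, impulsive signal has the more skewed band envelopes.
```

Two signals with identical long-term spectra can thus still be told apart by envelope statistics, which is the sense in which rain and a stream "sound different".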

Perhaps surprisingly, a relatively small set of statistics is sufficient to describe a large number of sounds. For example, please listen to this introductory example, made from sounds collected on a camping trip. The energy at each time-frequency point is plotted below and the sounds are labelled.

The clip tells a story: a person starts by a campfire and then walks past a stream to their tent. The wind starts to howl, and the person reaches their tent just in time before it rains at the end of the clip.

Remarkably, all the sounds in this clip are synthetic (except for the sound of the closing zip). They are produced from a single 'generative model', which has been trained on natural sounds and learns to produce natural-sounding versions by capturing their statistics. The model is particularly good at synthesising auditory textures like fire, running water, wind and rain.

Here are some more examples showing the successes and failures of the method:

[Audio example table: for each of Stream, Fire, Wind, Rain, Twigs, Applause, and Speech, one or more synthetic versions (Synthetic 1, 2, 3) are paired with the corresponding original recordings (Original 1, 2, 3).]

The model works by learning the important statistics of these sounds. It can then produce new synthetic versions, of arbitrary duration, by ensuring the new sounds match the statistics of the original. This shows that auditory textures are often defined statistically, a fact first demonstrated by Josh McDermott and Eero Simoncelli.
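As a toy instance of statistic matching, far simpler than the actual generative model and shown only to illustrate the principle, one can generate fresh noise of any duration and rescale its spectrum so that its long-term band energies match those measured from an original recording:

```python
import numpy as np

def band_energies(x, sr, bands):
    """Average spectral energy per sample in each frequency band."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / sr)
    return np.array([np.sum(np.abs(X[(f >= lo) & (f < hi)]) ** 2) / len(x)
                     for lo, hi in bands])

def synthesize(target, sr, out_len, bands, seed=0):
    """New noise, of any length, whose band energies match the target's."""
    rng = np.random.default_rng(seed)
    N = np.fft.rfft(rng.standard_normal(out_len))
    f = np.fft.rfftfreq(out_len, 1.0 / sr)
    out = np.zeros_like(N)                 # complex spectrum, zero outside bands
    for (lo, hi), e in zip(bands, band_energies(target, sr, bands)):
        m = (f >= lo) & (f < hi)
        cur = np.sum(np.abs(N[m]) ** 2) / out_len
        out[m] = N[m] * np.sqrt(e / (cur + 1e-12))
    return np.fft.irfft(out, n=out_len)

sr = 16000
rng = np.random.default_rng(1)
target = np.convolve(rng.standard_normal(sr), np.ones(8) / 8, "same")  # dull noise
bands = [(0, 1000), (1000, 4000), (4000, 8000)]
synth = synthesize(target, sr, out_len=3 * sr, bands=bands)  # 3x the duration
```

The synthetic signal is new random material, yet matches the measured statistics; richer texture models match many more statistics (envelope moments, correlations across bands) in the same spirit.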

Characterising the statistics of natural sounds is important in practice. For example, when your car's automatic speech-recognition system tries to figure out what you are saying with traffic noise in the background, it will often fail. Its performance can be improved by removing the traffic noise, and this can be done by exploiting the difference between the statistics of the traffic noise and those of the speech.

Here is a preliminary example of this technology in action: a mixture of speech and running water. The statistical algorithm separates the two sources based on their statistics, producing the restored speech.
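A generic, minimal version of this idea is plain spectral subtraction (not the group's actual separation algorithm; frame size and signal parameters below are arbitrary): estimate the noise's average magnitude spectrum from a noise-only recording, then subtract it from each short frame of the mixture.

```python
import numpy as np

def frames(x, n, hop):
    starts = np.arange(0, len(x) - n + 1, hop)
    return np.stack([x[i:i + n] for i in starts]), starts

def spectral_subtract(noisy, noise_ref, n=512, hop=256):
    """Suppress stationary noise using its average magnitude spectrum."""
    win = np.hanning(n)
    noise_mag = np.mean(np.abs(np.fft.rfft(frames(noise_ref, n, hop)[0] * win,
                                           axis=1)), axis=0)
    F, starts = frames(noisy, n, hop)
    S = np.fft.rfft(F * win, axis=1)
    mag = np.maximum(np.abs(S) - noise_mag, 0.0)          # floor at zero
    rec = np.fft.irfft(mag * np.exp(1j * np.angle(S)), n=n, axis=1)
    out = np.zeros(len(noisy))
    wsum = np.zeros(len(noisy))
    for r, i in zip(rec, starts):                         # overlap-add
        out[i:i + n] += r * win
        wsum[i:i + n] += win ** 2
    return out / np.maximum(wsum, 1e-8)

rng = np.random.default_rng(2)
t = np.arange(16384) / 16000
clean = np.sin(2 * np.pi * 440 * t)                      # stand-in for speech
noisy = clean + 0.3 * rng.standard_normal(len(t))
noise_ref = 0.3 * rng.standard_normal(len(t))            # noise-only recording
denoised = spectral_subtract(noisy, noise_ref)
# Interior samples (away from edge effects) end up closer to the clean signal.
i0, i1 = 512, len(t) - 512
mse_noisy = np.mean((noisy[i0:i1] - clean[i0:i1]) ** 2)
mse_denoised = np.mean((denoised[i0:i1] - clean[i0:i1]) ** 2)
```

This only works because the noise statistics (its average spectrum) are assumed known and stationary; the harder source-separation problem tackled here is learning the statistics of both sources.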

We are applying the audio texture technology to audio denoising and source separation problems, synthesis of audio textures for creative industries and computer games, and to audio therapy for tinnitus.

Importantly, by generating synthetic sounds, this work also reveals the statistics to which auditory processing is sensitive. This is an important practical tool for understanding how hearing operates.

Related papers

Related talks
