Speech as a "modulated signal"

Speech, like many interesting, natural sounds, is a dynamic signal, i.e. its amplitude and frequency content change over time. One interesting question asked by Elliott & Theunissen is whether speech has "characteristic" time-varying amplitude and frequency distributions. Do the "temporal and spectral modulations" of speech have to fall within certain parameter ranges for speech to be comprehensible or recognizable? What temporal and spectral modulations does speech normally exhibit? And are there particular modulations that are "necessary" to make speech identifiable or comprehensible?

Elliott and Theunissen addressed this question by calculating the "modulation spectra" of speech as shown here:

[Figure: modulation spectra of speech]


Such modulation spectra are "invertible", meaning that (provided you are skilled at digital signal processing) you can go from the modulation spectrum back to the original sound. You can also remove certain ranges of modulation before going back, and then ask whether the speech remains comprehensible without those modulations.
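To make the idea concrete, here is a minimal sketch of one common way to compute a modulation spectrum: take the two-dimensional Fourier transform of a log-amplitude spectrogram. This is an illustration under stated assumptions, not necessarily the exact procedure used by Elliott & Theunissen. The function names, the Hann window, the window and hop sizes, and the synthetic test signal are all choices made for this example.

```python
import numpy as np

def log_spectrogram(x, win_len=512, hop=128):
    """Log-amplitude spectrogram, shape (n_freqs, n_frames). Window sizes are assumptions."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)],
        axis=1,
    )
    # Small offset avoids log(0) in silent bins.
    return np.log(np.abs(np.fft.rfft(frames, axis=0)) + 1e-6)

def modulation_spectrum(log_spec, fs, win_len=512, hop=128):
    """2D FFT of the log spectrogram, plus the two modulation axes."""
    n_freqs, n_frames = log_spec.shape
    mod = np.fft.fft2(log_spec)
    df_khz = (fs / win_len) / 1000.0  # spacing between frequency bins, in kHz
    dt = hop / fs                     # spacing between time frames, in seconds
    spectral_mod = np.fft.fftfreq(n_freqs, d=df_khz)  # cycles/kHz
    temporal_mod = np.fft.fftfreq(n_frames, d=dt)     # Hz
    return mod, spectral_mod, temporal_mod

# Synthetic stand-in for speech: a 1 kHz tone whose amplitude varies at 4 Hz.
fs = 8000
t = np.arange(2 * fs) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)

spec = log_spectrogram(x)
mod, s_mod, t_mod = modulation_spectrum(spec, fs)

# The 2D transform itself is exactly invertible: we recover the log spectrogram.
recovered = np.real(np.fft.ifft2(mod))
```

Note that this round trip only returns the log spectrogram. Getting from there back to a waveform also requires recovering the phase that the spectrogram discards, typically with an iterative estimation step, which is the part that calls for the signal processing skill mentioned above and is not shown here.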

Here are some examples. First, an original speech sound:

Now the same speech sample with all temporal and spectral modulations filtered out except for the "core" region with spectral modulations of less than 4 cycles/kHz and temporal modulations between 1 and 7 Hz. The sample remains comprehensible but sounds very artificial.

An interesting result from this decomposition into spectral and temporal modulations is that the "meaning" of a speech sample "lives in a different part of modulation space" from voice pitch or speaker identity.

Consider this example where all temporal modulations are preserved, but only spectral modulations below 0.5 cycles/kHz are preserved. This preserves speech formants, so the speech remains comprehensible, but pitch information is mostly lost and we can no longer tell whether it is a male or female speaker:

And compare this against a sample where all temporal modulations faster than 3 Hz are filtered out. Now we are missing the time structure important for carrying "meaning", and the sentence becomes harder to understand, but we can still easily identify the voice pitch and gender of the speaker:
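Both manipulations amount to low-pass masks applied in modulation space. The sketch below illustrates this on a toy "log spectrogram" built from four known ripples rather than on real speech. The helper `lowpass_modulations`, its parameters, and the toy grid are hypothetical names and values chosen for this example, and the step from a filtered spectrogram back to audio is not shown.

```python
import numpy as np

def lowpass_modulations(log_spec, df_khz, dt, max_spectral=None, max_temporal=None):
    """Zero out modulations above the given cutoffs; return the filtered log spectrogram.

    log_spec: array of shape (n_freqs, n_frames)
    df_khz:   spacing between frequency bins, in kHz
    dt:       spacing between time frames, in seconds
    """
    n_freqs, n_frames = log_spec.shape
    s_mod = np.abs(np.fft.fftfreq(n_freqs, d=df_khz))[:, None]  # cycles/kHz
    t_mod = np.abs(np.fft.fftfreq(n_frames, d=dt))[None, :]     # Hz
    mask = np.ones((n_freqs, n_frames), dtype=bool)
    if max_spectral is not None:
        mask &= s_mod <= max_spectral
    if max_temporal is not None:
        mask &= t_mod <= max_temporal
    # The mask is symmetric in positive and negative modulations, so the result is real.
    return np.real(np.fft.ifft2(np.fft.fft2(log_spec) * mask))

# Toy grid: 64 frequency bins spanning 8 kHz, 100 frames spanning 1 s.
df_khz, dt = 0.125, 0.01
f = np.arange(64) * df_khz
t = np.arange(100) * dt

coarse_spec = np.cos(2 * np.pi * 0.25 * f)[:, None]  # 0.25 cycles/kHz, formant-like
fine_spec = np.cos(2 * np.pi * 2.0 * f)[:, None]     # 2 cycles/kHz, pitch-harmonic-like
slow_time = np.cos(2 * np.pi * 2.0 * t)[None, :]     # 2 Hz
fast_time = np.cos(2 * np.pi * 10.0 * t)[None, :]    # 10 Hz
toy = coarse_spec + fine_spec + slow_time + fast_time

# Keep all temporal modulations, but only spectral modulations up to 0.5 cycles/kHz.
no_fine_spectral = lowpass_modulations(toy, df_khz, dt, max_spectral=0.5)

# Keep all spectral modulations, but only temporal modulations up to 3 Hz.
no_fast_temporal = lowpass_modulations(toy, df_khz, dt, max_temporal=3.0)
```

In the first result only the fine spectral ripple is gone, and in the second only the fast temporal ripple is gone, which mirrors how the two audio examples lose pitch cues and fast time structure respectively.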