Audio Deep Learning: Audio Representation (Spectrograms)
In audio deep learning, sound waveforms are commonly converted into image-like time-frequency representations called spectrograms, so that 2D convolutional networks can process audio signals.
Sound Wave to Spectrogram Conversion
Applying a Fourier transform to successive short windows turns a 1D audio waveform into a 2D representation of frequency content over time.
Short-Time Fourier Transform (STFT)
Sound is recorded as a 1D waveform: a sequence of samples representing air pressure variations over time. To analyze how the frequency content changes over time, we use the Short-Time Fourier Transform (STFT). The STFT divides the sampled waveform into short overlapping windows and applies the Fast Fourier Transform (FFT) to each window:
\\(X(t, f) = \\sum_{n=-\\infty}^{\\infty} x[n] w[n - t] e^{-j 2 \\pi f n}\\)
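Here \\(x[n]\\) is the sampled signal, \\(w\\) is the window function, \\(t\\) is the position of the window, and \\(f\\) is the frequency in cycles per sample. This computes the complex frequency spectrum of each window. Plotting the magnitude (or squared magnitude, i.e. power) of these spectra sequentially along the time axis creates a 2D image called a spectrogram. The horizontal axis represents time, the vertical axis represents frequency, and the intensity of each pixel represents how strongly that frequency is present at that point in time.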
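The windowing-plus-FFT procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `stft_magnitude`, the Hann window, the 8 kHz sample rate, and the values `n_fft=256` and `hop_length=128` are all assumptions chosen for the example.

```python
import numpy as np

def stft_magnitude(x, n_fft=256, hop_length=128):
    """Return a (n_fft // 2 + 1, n_frames) magnitude spectrogram."""
    window = np.hanning(n_fft)                      # taper w[n] applied to each frame
    n_frames = 1 + (len(x) - n_fft) // hop_length   # only frames that fit entirely
    frames = np.stack([
        x[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000                          # assumed for this example
t = np.arange(sample_rate) / sample_rate    # one second of sample times
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz test tone

spec = stft_magnitude(x)
peak_bin = spec.mean(axis=1).argmax()       # strongest frequency bin on average
peak_hz = peak_bin * sample_rate / 256      # bin index -> Hz (bin width = sr / n_fft)
print(spec.shape, peak_hz)
```

Each column of `spec` is the spectrum of one window, so the rows index frequency and the columns index time, matching the axes of the spectrogram described above.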
The Mel Scale and Mel Spectrograms
Human hearing does not perceive frequency changes linearly. We are much more sensitive to differences in low frequencies than high frequencies. To model this, the linear frequency scale is warped into the Mel Scale using the formula:
\\(m = 2595 \\log_{10} \\left(1 + \\frac{f}{700}\\right)\\)
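Here \\(f\\) is the frequency in hertz and \\(m\\) is the corresponding value in mels. A Mel Spectrogram is a spectrogram whose frequency axis has been mapped to the Mel scale, typically by pooling the linear STFT bins into a smaller number of bands with a bank of overlapping triangular filters spaced evenly in mels. This representation aligns the frequency bands with human auditory perception, which makes it effective for tasks like speech recognition and music classification.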
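As a small worked example, the Mel formula and its algebraic inverse can be written directly in Python. The helper names `hz_to_mel`, `mel_to_hz`, and `mel_band_edges` are hypothetical, and the 0 to 8000 Hz range with 8 bands is an arbitrary choice for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using the formula above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_mels):
    """Edge frequencies in Hz of n_mels bands spaced evenly on the Mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    # n_mels overlapping bands need n_mels + 2 edge points.
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]

edges = mel_band_edges(0.0, 8000.0, n_mels=8)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print([round(e) for e in edges])
print([round(w) for w in widths])
```

The spacing in `widths` grows with frequency: equal steps in mels cover narrow ranges in Hz at the low end and wide ranges at the high end, which is how the scale devotes more resolution to low frequencies.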
Log-Mel Spectrograms and Dynamic Range
Converting spectrogram power values to decibels roughly matches human loudness perception and keeps input values in a range that is easier for networks to train on.
Log-Scaling Amplitude
The range of sound intensities that humans can hear is vast. To handle this, the amplitude values of the Mel Spectrogram are log-scaled (converted to decibels):
\\(dB = 10 \\log_{10} \\left(\\frac{P}{P_{ref}}\\right)\\)
where \\(P\\) is the power of the signal and \\(P_{ref}\\) is a reference power, the level that maps to 0 dB. Log-Mel spectrograms compress the dynamic range of the audio, highlighting low-volume details and matching human loudness perception, which prevents high-amplitude components from dominating the feature representation.
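The decibel formula can be checked with a few values. The sketch below assumes a reference power of 1.0 and adds a small floor so that silent bins do not produce `log10(0)`; both choices, like the function name `power_to_db`, are assumptions made for this example.

```python
import math

def power_to_db(power, ref=1.0, floor=1e-10):
    """10 * log10(power / ref), with a floor to avoid taking log10 of zero."""
    return 10.0 * math.log10(max(power, floor) / ref)

# Powers spanning six orders of magnitude...
powers = [1.0, 1e-2, 1e-4, 1e-6]
# ...land on an evenly spaced scale of roughly 0, -20, -40, -60 dB.
db = [power_to_db(p) for p in powers]
print(db)

# A silent bin is clamped to the floor instead of raising an error.
print(power_to_db(0.0))
```

Every factor of 100 in power becomes a fixed step of 20 dB, so quiet details end up numerically much closer to loud ones than they are in the raw power values.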
Hyperparameter Tuning for Spectrograms
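Creating a spectrogram involves tuning key hyperparameters: n_fft (the FFT size, which usually also serves as the window length in samples), hop_length (the stride between consecutive windows, in samples), and n_mels (the number of Mel bands).
A large n_fft increases frequency resolution, since the bin spacing is sample_rate / n_fft, but degrades temporal resolution, because each window spans n_fft / sample_rate seconds and blurs events in time. A small n_fft does the opposite. A smaller hop_length produces more time frames, and n_mels sets the number of frequency rows in the output. Balancing these parameters matters for audio classification: a representation that is too coarse in either time or frequency can smear the cues the network needs.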
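The trade-off can be made concrete with some arithmetic. The helper below is a hypothetical convenience function, and the 16 kHz sample rate and the rule `hop_length = n_fft // 2` are assumptions picked for the illustration.

```python
def stft_resolution(sample_rate, n_fft, hop_length):
    """Summarize the time and frequency resolution implied by STFT settings."""
    return {
        "bin_width_hz": sample_rate / n_fft,            # spacing between frequency bins
        "window_ms": 1000.0 * n_fft / sample_rate,      # duration covered by one window
        "hop_ms": 1000.0 * hop_length / sample_rate,    # time step between frames
        "n_freq_bins": n_fft // 2 + 1,                  # bins for a real-valued signal
    }

for n_fft in (256, 1024, 4096):
    r = stft_resolution(16000, n_fft, hop_length=n_fft // 2)
    print(f"n_fft={n_fft:5d}  bin={r['bin_width_hz']:7.3f} Hz  "
          f"window={r['window_ms']:6.1f} ms  hop={r['hop_ms']:6.1f} ms")
```

At 16 kHz, going from `n_fft=256` to `n_fft=4096` narrows the bins from 62.5 Hz to about 3.9 Hz while stretching each window from 16 ms to 256 ms, which is the resolution trade-off described above.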
PyTorch Implementation using torchaudio
Let's build a Mel Spectrogram transformation pipeline in PyTorch using the torchaudio library.
Generating Mel Spectrograms in PyTorch
PyTorch provides the torchaudio library to handle audio signals. The code below shows how to load an audio file, generate a Mel Spectrogram, and apply log-scaling.
In this transform, the 16,000 audio samples are mapped to a 64x32 2D representation, where 64 is the number of Mel bands (n_mels) and 32 is the number of time frames, so the audio can be treated like a single-channel image.
Spectrogram Normalization
Just like images, spectrograms are usually normalized before being fed to a CNN. The code below shows how to normalize a log-Mel spectrogram tensor to have zero mean and unit variance.
<pre><code class="language-python"># log_mel_spec is a log-Mel spectrogram tensor, e.g. of shape (1, 64, 32)
# Standardize features to zero mean and unit variance
mean = log_mel_spec.mean()
std = log_mel_spec.std()
normalized_spec = (log_mel_spec - mean) / (std + 1e-8)  # epsilon avoids division by zero

print("Normalized Mean:", normalized_spec.mean().item())
print("Normalized Std:", normalized_spec.std().item())</code></pre>
Standardizing the spectrogram keeps input values in a consistent range, which helps avoid unstable gradients during training and lets CNN filters converge faster on audio features.