In deep learning, audio is typically encoded by converting the continuous sound wave into a digital representation called an audio signal. The process involves several steps:
1. Sampling: The continuous audio signal is measured at regular intervals; the sampling rate is typically expressed in kilohertz (kHz). This creates a series of discrete data points.
2. Quantization: Each sampled point is then mapped to a numerical value at a fixed precision, typically expressed as a bit depth. This quantization turns the amplitude of the audio wave at each point into a digital value (see the sketch after this list).
3. Encoding: The quantized values can be further encoded using different formats and compression techniques to reduce file size or tailor them for specific applications.
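As a rough illustration of steps 1 and 2, the NumPy sketch below samples a synthetic sine wave and quantizes it to 16-bit integers. The 16 kHz sample rate, 440 Hz tone, and 16-bit depth are illustrative assumptions, not values prescribed by the text.

```python
# A minimal sketch of sampling and quantization using NumPy.
import numpy as np

sample_rate = 16_000          # samples per second (16 kHz) -- illustrative choice
duration = 1.0                # seconds
bit_depth = 16                # bits per sample -- illustrative choice

# Sampling: evaluate a continuous 440 Hz sine wave at discrete time points.
t = np.arange(0, duration, 1.0 / sample_rate)
waveform = np.sin(2 * np.pi * 440 * t)          # amplitude in [-1, 1]

# Quantization: map each amplitude to one of 2**bit_depth integer levels.
max_int = 2 ** (bit_depth - 1) - 1              # 32767 for 16-bit audio
quantized = np.round(waveform * max_int).astype(np.int16)

print(quantized.shape)   # (16000,) -> one second of 16 kHz audio
print(quantized.dtype)   # int16    -> 16-bit quantized samples
```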
Additionally, for deep learning applications, the raw audio can be transformed into different formats that might be more useful for model training, such as:
- Spectrograms: These are visual representations of the spectrum of frequencies in a signal as it varies with time, which are useful for models that work in the frequency domain.
- Mel Frequency Cepstral Coefficients (MFCCs): These coefficients form a representation that captures the timbre or texture of the sound, which is highly useful in speech and audio recognition tasks (see the sketch after this list).
- Wavelet Transforms: This method provides a way to analyze non-stationary signals at different scales or resolutions.
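As a hedged sketch of the two most common transformations above, the snippet below extracts a mel spectrogram and MFCCs with librosa. The file path "speech.wav" is hypothetical, and the frame and filter-bank parameters are common choices rather than prescribed values.

```python
# Mel spectrogram and MFCC extraction with librosa (sketch under assumptions).
import librosa

# Load the raw waveform, resampled to 16 kHz; "speech.wav" is a placeholder path.
y, sr = librosa.load("speech.wav", sr=16_000)

# Spectrogram on the mel scale (power), then converted to decibels.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel)

# MFCCs: a compact description of timbre, widely used in speech tasks.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(mel_db.shape)   # (80, n_frames) -- mel bands x time frames
print(mfcc.shape)     # (13, n_frames) -- coefficients x time frames
```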
These transformations help in capturing various aspects of audio data that are important for tasks like speech recognition, music generation, and sound classification.
For deep learning purposes, raw audio is commonly transformed into spectrograms such as mel spectrograms. These spectrograms are still temporal, meaning their size depends on the length of the raw audio, so the audio is usually segmented into slices of fixed length before the spectrograms are computed. The model is then fed this sequence of spectrograms as input. Mel spectrograms have become a standard way to represent audio in deep learning because of the popularity of speech-to-text tasks.
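A minimal sketch of this fixed-length segmentation step, again assuming librosa and the hypothetical "speech.wav"; the 2-second slice length and mel parameters are arbitrary illustrative choices.

```python
# Segment a waveform into fixed-length slices, then compute one mel
# spectrogram per slice so every input to the model has the same shape.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16_000)   # "speech.wav" is a placeholder

slice_seconds = 2.0                              # illustrative slice length
slice_samples = int(slice_seconds * sr)

# Drop the trailing remainder so all slices have exactly slice_samples samples.
n_slices = len(y) // slice_samples
slices = y[: n_slices * slice_samples].reshape(n_slices, slice_samples)

# One mel spectrogram (in dB) per slice; identical shapes allow stacking
# into a single batch array for the model.
mels = np.stack([
    librosa.power_to_db(
        librosa.feature.melspectrogram(y=s, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    )
    for s in slices
])

print(mels.shape)   # (n_slices, 80, frames_per_slice)
```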