Context: Because speech and music signals are generally intended for human listeners, it is natural to use knowledge about the human auditory system in audio signal processing. This usually involves a so-called "perceptually motivated" analysis of the signal, that is, an analysis that approximates the one performed by the human auditory system. Because the auditory time-frequency resolution is not fixed but varies with frequency and level, this requires a transform with variable resolution. Still, many audio processing techniques rely on the short-time Fourier (also known as Gabor) or cosine transforms to analyze, process, and re-synthesize sounds. In these transforms, the resolution is fixed.
Current approaches to perceptually motivated time-frequency analysis include linear (e.g. gammatone) and nonlinear auditory filter banks (e.g. gammachirp, DRNL filter bank), complex auditory models that attempt to replicate the nonlinear processing in the auditory system, and time-frequency transforms that are partially adapted to perception (e.g. wavelets, blocks of uniform filter banks). While auditory filter banks and auditory models approximate the auditory resolution well, most of them are either not invertible or only approximately invertible. In contrast, time-frequency transforms allow for signal reconstruction but only approximate the auditory resolution.
Method: This work introduces a new linear time-frequency transform constructed to provide perceptually motivated time-frequency analysis, perfect reconstruction, computational efficiency, and adaptable resolution and redundancy. We rely on the mathematical theory of frames and the recent non-stationary Gabor transform to formulate a transform whose frequency resolution can be matched to any perceptual frequency scale, such as the ERB, Bark, or Mel scale. From a filter bank viewpoint, the proposed transform corresponds to an oversampled filter bank whose filters are linearly distributed on a perceptual frequency scale.
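To illustrate what "linearly distributed on a perceptual frequency scale" can mean in practice, the following minimal Python sketch places center frequencies at equal steps on the ERB-number scale. It assumes the Glasberg and Moore (1990) ERB formulas, which the text above does not specify. The function names and the `filters_per_erb` parameter are hypothetical and are not taken from the LTFAT toolbox.

```python
import math

def hz_to_erb_number(f_hz):
    """Map frequency in Hz to the ERB-number scale (assumed Glasberg & Moore 1990)."""
    return 9.2645 * math.log(1.0 + 0.00437 * f_hz)

def erb_number_to_hz(e):
    """Inverse of hz_to_erb_number."""
    return (math.exp(e / 9.2645) - 1.0) / 0.00437

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at frequency f_hz (same assumed model)."""
    return 24.7 * (1.0 + 0.00437 * f_hz)

def center_frequencies(fs, filters_per_erb=1):
    """Center frequencies from 0 Hz up to at most fs/2, equally spaced on the ERB-number scale.

    filters_per_erb is a hypothetical name for the number of filters per critical band.
    """
    e_max = hz_to_erb_number(fs / 2.0)
    n_steps = int(math.floor(e_max * filters_per_erb))
    step = 1.0 / filters_per_erb
    return [erb_number_to_hz(k * step) for k in range(n_steps + 1)]

if __name__ == "__main__":
    fc = center_frequencies(44100, filters_per_erb=1)
    # Spacing in Hz grows with frequency, while spacing on the ERB-number scale is constant.
    for f in fc[:5] + fc[-3:]:
        print(f"fc = {f:9.1f} Hz, ERB = {erb_bandwidth(f):7.1f} Hz")
```

Increasing `filters_per_erb` refines the frequency resolution by placing more filters within each critical band. Other perceptual scales (Bark, Mel) could be substituted by swapping the two mapping functions.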
Results: The transform is implemented as a non-uniform filter bank that features two parameters to adapt the resolution and redundancy: the number of filters per critical band and the down-sampling factors. The proposed filter bank design, performed entirely in the frequency domain, also allows for various filter shapes and for uniform or non-uniform filter bank settings. The time-frequency resolution and redundancy of the transform can be adapted without affecting its perfect reconstruction property, down to redundancies close to 1.
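The link between down-sampling factors and redundancy can be sketched as follows. This is an illustration under stated assumptions, not the toolbox implementation: redundancy is taken as the number of coefficients per input sample over the listed channels (one of several conventions). Each integer down-sampling factor is chosen so that the channel sampling rate is at least as large as the filter's frequency support, which is the condition usually associated with the "painless" case in frame theory. The names `downsampling_factors`, `redundancy`, and `support_in_erbs`, as well as the ERB formula, are assumptions introduced here.

```python
import math

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz (assumed Glasberg & Moore 1990 model)."""
    return 24.7 * (1.0 + 0.00437 * f_hz)

def downsampling_factors(fs, center_freqs_hz, support_in_erbs=4.0):
    """Largest integer factor per channel such that fs / a_k >= filter support in Hz.

    support_in_erbs is a hypothetical parameter: the width of each filter's
    frequency support, expressed in multiples of the ERB at its center frequency.
    """
    factors = []
    for fc in center_freqs_hz:
        support_hz = support_in_erbs * erb_bandwidth(fc)
        factors.append(max(1, math.floor(fs / support_hz)))
    return factors

def redundancy(factors):
    """Coefficients per input sample: the sum of 1 / a_k over all channels."""
    return sum(1.0 / a for a in factors)

if __name__ == "__main__":
    fs = 44100
    fc = [0.0, 250.0, 1000.0, 4000.0, 16000.0]  # illustrative center frequencies only
    a = downsampling_factors(fs, fc)
    # Narrow low-frequency filters tolerate much larger down-sampling factors.
    print("down-sampling factors:", a)
    print("redundancy of these channels:", redundancy(a))
```

Narrower filters (a smaller `support_in_erbs`) permit larger down-sampling factors and therefore lower redundancy, whereas adding more filters per critical band raises it.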
My contribution: Principal investigator.
Potential applications: The transform mainly targets audio processing techniques that require an analysis-synthesis framework. The proposed filter bank is a potential substitute for the Gabor transform or the gammatone filter bank as used in audio applications.
Resources and related publications: We provide the code for computing the transform and its inverse in the Matlab/Octave toolbox LTFAT.
O. Derrien, T. Necciari, and P. Balazs. A quasi-orthogonal, invertible, and perceptually relevant time-frequency transform for audio coding. In Proceedings of EUSIPCO 2015, pages 804–808, Nice, France, September 2015. IEEE.
T. Necciari, P. Balazs, N. Holighaus, and P. Søndergaard. The ERBlet transform: An auditory-based time-frequency representation with perfect reconstruction. In Proceedings of ICASSP 2013, pages 498–502, Vancouver, Canada, May 2013. IEEE.
T. Necciari and P. Balazs. A perfectly invertible perception-based time-frequency transform for audio representation, analysis and synthesis. Presented at the AIA-DAGA 2013 Conference on Acoustics, March 2013.