# End-to-end Source Separation using Adaptive Front-ends

Source separation and other audio applications have traditionally relied on the use of short-time Fourier transforms as a front-end frequency domain representation step. The unavailability of a neural network equivalent to forward and inverse transforms hinders the implementation of end-to-end learning systems for these applications. In this work, we present an auto-encoder neural network that can act as an equivalent to short-time front-end transforms. We demonstrate the ability of the network to learn optimal, real-valued basis functions directly from the raw waveform of a signal and further show how it can be used as an adaptive front-end for supervised source separation.
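As a rough illustration of the idea, the sketch below mimics such a front-end in numpy: the raw waveform is framed like an STFT, projected onto an analysis basis `A`, and resynthesized by overlap-add through a separate synthesis basis `S`. In the paper these bases are learned by backpropagation; here random matrices stand in, and the frame length, hop, and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, hop = 64, 32                        # frame length and hop, as in an STFT

# Stand-ins for learned bases (in the paper these come from training).
A = rng.standard_normal((N, N)) * 0.1  # analysis basis functions
S = rng.standard_normal((N, N)) * 0.1  # synthesis basis functions

x = rng.standard_normal(512)           # raw input waveform

# Analysis: slide a window over the waveform, project each frame onto A.
starts = np.arange(0, len(x) - N + 1, hop)
frames = np.stack([x[i:i + N] for i in starts])
coeffs = frames @ A.T                  # latent "frequency-domain" representation

# Synthesis: project back through S and overlap-add the frames.
y = np.zeros(len(x))
for c, i in zip(coeffs @ S, starts):
    y[i:i + N] += c
```

A separation network would operate on `coeffs` (e.g., by estimating a mask) before the synthesis step, exactly where a magnitude-spectrogram model would sit in an STFT pipeline.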
- Mixture: Male-female speech mixture from the TIMIT database at 0 dB.
- Female DFT: The female voice separated using a DFT front-end.
- Female AET: The female voice separated using adaptive front-ends.
- Female ortho AET: The female voice separated using an adaptive, orthogonal front-end (i.e., the analysis transform is the transpose of the synthesis transform).
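The orthogonality constraint in the last condition can be checked in a few lines: if the analysis basis is orthonormal and synthesis is its transpose, the front-end reconstructs the input perfectly. The sketch below uses non-overlapping frames and a random orthonormal matrix as a stand-in for a learned basis; both are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
# A random orthonormal basis stands in for a learned one (assumption).
B, _ = np.linalg.qr(rng.standard_normal((N, N)))

x = rng.standard_normal(1024)
frames = x.reshape(-1, N)         # non-overlapping frames for simplicity
coeffs = frames @ B.T             # analysis transform
recon = (coeffs @ B).reshape(-1)  # synthesis = transpose of analysis
assert np.allclose(recon, x)      # orthonormality gives perfect reconstruction
```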
| #. | Mixture | Female DFT | Female AET | Female ortho AET |
|----|---------|------------|------------|------------------|