M. McLaren, M. Graciarena and Y. Lei, “Softsad: Integrated frame-based speech confidence for speaker recognition,” in Proc. 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
Abstract
In this paper we propose softSAD: the direct integration of speech posteriors into a speaker recognition system instead of using speech activity detection (SAD). SoftSAD improves the generalization of speech/non-speech models to unseen conditions by removing the need to make binary speech/non-speech decisions based on a threshold. Instead, softSAD explicitly integrates a per-frame speech posterior into the Baum-Welch statistics. We demonstrate the benefits of softSAD over SAD in severely mismatched conditions by evaluating a system developed for the National Institute of Standards and Technology (NIST) 2012 speaker recognition evaluation (SRE) on the channel-degraded Defense Advanced Research Projects Agency (DARPA) Robust Automatic Transcription of Speech (RATS) speaker identification task (and vice versa). We also show that softSAD provides benefits over SAD in matched conditions.
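The core idea — weighting each frame's contribution to the Baum-Welch statistics by its speech posterior, rather than keeping or discarding frames with a hard SAD threshold — can be sketched as follows. This is a minimal illustration (not the authors' code); the function name and array shapes are assumptions, and `resp` stands for per-frame UBM component responsibilities computed elsewhere.

```python
import numpy as np

def baum_welch_stats(features, resp, speech_post=None):
    """Zeroth- and first-order Baum-Welch statistics with soft frame weights.

    features:    (T, D) array of frame-level features
    resp:        (T, C) per-frame UBM component responsibilities
    speech_post: (T,) per-frame speech posteriors in [0, 1];
                 None reproduces the unweighted (keep-all-frames) case.
                 A binary 0/1 vector reproduces conventional hard SAD.
    """
    if speech_post is None:
        speech_post = np.ones(len(features))
    # softSAD: scale each frame's responsibilities by its speech posterior
    w = resp * speech_post[:, None]
    N = w.sum(axis=0)      # zeroth-order stats, shape (C,)
    F = w.T @ features     # first-order stats, shape (C, D)
    return N, F
```

With `speech_post = (posterior > threshold).astype(float)` this reduces exactly to hard SAD (summing statistics over the retained frames only), which is what makes the soft weighting a strict generalization of the thresholded approach.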