M. McLaren, M. Graciarena, and Y. Lei, “SoftSAD: Integrated frame-based speech confidence for speaker recognition,” in Proc. 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
In this paper, we propose softSAD: the direct integration of speech posteriors into a speaker recognition system in place of speech activity detection (SAD). SoftSAD improves the generalization of speech/non-speech models to unseen conditions by removing the need to make a binary, threshold-based speech/non-speech decision for each frame. Instead, softSAD explicitly integrates a per-frame speech posterior into the Baum-Welch statistics. We demonstrate the benefits of softSAD over SAD in severely mismatched conditions by evaluating a system developed for the National Institute of Standards and Technology (NIST) 2012 speaker recognition evaluation (SRE) on the channel-degraded Defense Advanced Research Projects Agency (DARPA) Robust Automatic Transcription of Speech (RATS) speaker identification task, and vice versa. We also show that softSAD provides benefits over SAD in matched conditions.
Index Terms— Speech activity detection, speaker identification, unseen conditions, mismatched conditions.
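The contrast between SAD and softSAD described in the abstract can be illustrated with a minimal sketch. It assumes a diagonal-covariance GMM universal background model and assumes that the speech posterior simply scales each frame's contribution to the zeroth- and first-order Baum-Welch statistics; the abstract does not give the exact formulation, so this is one plausible reading rather than the authors' implementation. The function names, the toy model parameters, and the 0.5 threshold are all hypothetical.

```python
import math


def component_posteriors(x, weights, means, variances):
    """Posterior of each diagonal-covariance Gaussian component for one frame."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
        log_terms.append(ll)
    # Normalize in the log domain for numerical stability.
    top = max(log_terms)
    unnorm = [math.exp(t - top) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def baum_welch_stats(frames, frame_weights, weights, means, variances):
    """Zeroth-order (N) and first-order (F) statistics with per-frame weights.

    Assumption: each frame's contribution is scaled by its weight.
    Weights in {0, 1} mimic hard SAD; weights in [0, 1] mimic softSAD.
    """
    n_comp = len(weights)
    dim = len(means[0])
    N = [0.0] * n_comp
    F = [[0.0] * dim for _ in range(n_comp)]
    for x, p in zip(frames, frame_weights):
        gamma = component_posteriors(x, weights, means, variances)
        for c in range(n_comp):
            g = p * gamma[c]
            N[c] += g
            for d in range(dim):
                F[c][d] += g * x[d]
    return N, F


# Hypothetical toy model: two components, two feature dimensions.
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [3.0, 3.0]]
ubm_variances = [[1.0, 1.0], [1.0, 1.0]]

frames = [[0.1, -0.2], [2.9, 3.1], [1.5, 1.5]]
speech_posteriors = [0.9, 0.2, 0.6]  # hypothetical output of a speech model

# Hard SAD: threshold the posterior, then keep or drop whole frames.
hard_weights = [1.0 if p > 0.5 else 0.0 for p in speech_posteriors]
N_hard, F_hard = baum_welch_stats(
    frames, hard_weights, ubm_weights, ubm_means, ubm_variances
)

# softSAD-style: use the posterior itself as the frame weight, no threshold.
N_soft, F_soft = baum_welch_stats(
    frames, speech_posteriors, ubm_weights, ubm_means, ubm_variances
)
```

In this sketch, the hard-SAD path discards the second frame entirely, whereas the soft path lets it contribute with weight 0.2, so no threshold needs to be tuned for a given condition.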