Scheffer, N., & Lei, Y. (2014, September). Content matching for short duration speaker recognition. In Interspeech (pp. 1317-1321).
This work attempts to tackle the problem of content mismatch for short duration speaker verification. Experiments are run on both text-dependent and text-independent protocols, where a larger amount of enrollment data is available in the latter. We recently proposed a framework based on a deep neural network that explicitly utilizes phonetic information, and showed increased performance on long duration utterances. We show how this new framework can also yield significant improvements for short duration. We then propose an innovative approach to perform content matching, i.e., transforming a text-independent trial into a text-dependent one by mining content from a speaker's enrollment data to match the test utterance. We show how content matching can be effectively done at the statistics level to enable the use of standard verification backends. Experiments, run on the RSR2015 and NIST SRE 2010 data sets, show relative improvements of 50% for cases where the content has been said during enrollment. While no significant improvements were observed for the general text-independent case, we believe that this work might pave the way for new research for speaker verification with very short utterances.
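As a rough illustration of what content matching "at the statistics level" could look like, the sketch below rescales a speaker's per-class enrollment statistics so that their occupancy profile mirrors that of the test utterance. This is only one possible reading and is not taken from the paper: the function name `match_content`, the capped-ratio weighting rule, and the toy numbers are all assumptions made for illustration. It assumes zeroth-order statistics (per-class occupancy counts) and first-order statistics (per-class weighted feature sums) have already been accumulated from phonetic class posteriors.

```python
def match_content(enroll_n, enroll_f, test_n, eps=1e-8):
    """Rescale enrollment statistics toward the test utterance's content.

    enroll_n: per-class occupancy counts from enrollment (zeroth order)
    enroll_f: per-class feature sums from enrollment (first order)
    test_n:   per-class occupancy counts from the test utterance

    Hypothetical rule (an assumption, not the paper's method): each class is
    weighted by min(1, test_count / enroll_count), so classes absent from the
    test are dropped and no class is ever scaled up.
    """
    matched_n, matched_f = [], []
    for n_e, f_e, n_t in zip(enroll_n, enroll_f, test_n):
        # Classes never observed in enrollment cannot be mined for content.
        w = min(1.0, n_t / n_e) if n_e > eps else 0.0
        matched_n.append(w * n_e)
        # Scaling count and sum by the same weight keeps the class mean intact.
        matched_f.append([w * x for x in f_e])
    return matched_n, matched_f


# Toy example: 3 phonetic classes, 2-dimensional features.
enroll_n = [10.0, 4.0, 0.0]
enroll_f = [[20.0, 10.0], [4.0, 8.0], [0.0, 0.0]]
test_n = [2.0, 0.0, 3.0]

matched_n, matched_f = match_content(enroll_n, enroll_f, test_n)
```

Because the output has the same shape as ordinary sufficient statistics, it could in principle be passed unchanged to a standard backend, which is the property the abstract highlights. In the toy example, the third class appears in the test but not in enrollment, so nothing can be matched for it.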