Improved Speaker Recognition Using DCT Coefficients as Features
We recently proposed the use of coefficients extracted from the 2D discrete cosine transform (DCT) of log Mel filter bank energies to improve speaker recognition over the traditional Mel frequency cepstral coefficients (MFCC) with appended deltas and double deltas (MFCC/deltas).
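The following is a minimal sketch of the 2D-DCT feature idea, assuming a precomputed log Mel filter bank energy matrix; the window size, filter count, retained-coefficient count, and the simple low-order selection are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from scipy.fft import dctn

def dct2d_features(log_mel, context=10, n_coeffs=60):
    """Slide a window of `context` frames over the log Mel energies
    (shape: n_frames x n_filters) and keep low-order 2D-DCT
    coefficients of each patch as the per-frame feature vector."""
    n_frames, n_filters = log_mel.shape
    feats = []
    for t in range(n_frames - context + 1):
        patch = log_mel[t:t + context, :]    # time x frequency patch
        coeffs = dctn(patch, norm="ortho")   # 2D DCT-II of the patch
        # Zig-zag selection of coefficients is common; simply flattening
        # and truncating is a simpler stand-in used here for illustration.
        feats.append(coeffs.flatten()[:n_coeffs])
    return np.array(feats)

# Example: 300 frames of 40-band log Mel energies -> windowed 2D-DCT features
log_mel = np.log(np.random.rand(300, 40) + 1e-6)
print(dct2d_features(log_mel).shape)         # (291, 60)
```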
Application of Convolutional Neural Networks to Speaker Recognition in Noisy Conditions
This paper applies a convolutional neural network (CNN) trained for automatic speech recognition (ASR) to the task of speaker identification (SID).
Spoken Language Recognition Based on Senone Posteriors
This paper explores in depth a recently proposed approach to spoken language recognition based on the estimated posteriors for a set of senones representing the phonetic space of one or more languages. A neural network (NN) is trained to estimate the posterior probabilities for the senones at a frame level. A feature vector is then derived for every sample using these posteriors. The effect of the language used in training the NN and the number of senones are studied. Speech-activity detection (SAD) and dimensionality reduction approaches are also explored and Gaussian and NN backends are compared. Results are presented on heavily degraded speech data. The proposed system is shown to give over 40% relative gain compared to a state-of-the-art language recognition system at sample durations from 3 to 120 seconds.
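As a minimal sketch of the pooling step described above, the snippet below turns frame-level senone posteriors into one fixed-length vector per sample; the mean-of-log-posteriors pooling and the SAD mask shown here are illustrative assumptions, as the paper studies several variants of each component:

```python
import numpy as np

def sample_feature(senone_posteriors, sad_mask, eps=1e-8):
    """senone_posteriors: (n_frames, n_senones) NN outputs, rows sum to 1.
    sad_mask: boolean (n_frames,) speech-activity decisions.
    Returns one vector of mean log posteriors over speech frames."""
    speech = senone_posteriors[sad_mask]      # drop non-speech frames
    return np.log(speech + eps).mean(axis=0)  # pool to sample level

# Example: 500 frames over 3,000 senones with random SAD decisions
post = np.random.dirichlet(np.ones(3000), size=500)
mask = np.random.rand(500) > 0.3
print(sample_feature(post, mask).shape)       # (3000,)
```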
A Deep Neural Network Speaker Verification System Targeting Microphone Speech
We recently proposed the use of deep neural networks (DNN) in place of Gaussian mixture models (GMM) in the i-vector extraction process for speaker recognition. In this work, we report the same achievement in DNN-based SID performance on microphone speech. We consider two approaches to DNN-based SID: one that uses the DNN to extract features, and another that uses the DNN during feature modeling.
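As an illustration of the first approach, the sketch below extracts the activations of a hidden bottleneck layer as per-frame SID features; the toy random weights stand in for a trained ASR-DNN, and all layer sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for trained ASR-DNN weights: input -> hidden -> bottleneck
W1, b1 = rng.standard_normal((40, 512)), np.zeros(512)
W2, b2 = rng.standard_normal((512, 80)), np.zeros(80)

def bottleneck_features(frames):
    """frames: (n_frames, 40) acoustic features.
    Returns 80-dim bottleneck activations to be used as SID features."""
    h = np.maximum(frames @ W1 + b1, 0.0)    # ReLU hidden layer
    return h @ W2 + b2                       # linear bottleneck layer

feats = bottleneck_features(rng.standard_normal((300, 40)))
print(feats.shape)                           # (300, 80)
```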
Trial-Based Calibration for Speaker Recognition in Unseen Conditions
This work presents Trial-Based Calibration (TBC), a novel, automated calibration technique robust to both unseen and widely varying conditions.
Application of Convolutional Neural Networks to Language Identification in Noisy Conditions
This paper proposes two novel frontends for robust language identification (LID) using a convolutional neural network (CNN) trained for automatic speech recognition (ASR).
Simplified VTS-Based I-Vector Extraction in Noise-Robust Speaker Recognition
A vector Taylor series (VTS) based i-vector extractor was recently proposed for noise-robust speaker recognition, extracting synthesized clean i-vectors to be used in the standard system back-end. This approach brings significant accuracy improvements in noisy speech conditions, but at such a large computational expense that using a state-of-the-art model size or running large-scale evaluations was impractical. In this work, we propose an efficient simplification scheme, named sVTS, to show that the VTS approach yields improvements in large-scale applications compared to state-of-the-art systems. In contrast to VTS, sVTS generates normalized Baum-Welch statistics and uses the standard i-vector model, making it straightforward to employ in a state-of-the-art i-vector speaker recognition system. Results presented on both the PRISM and the large NIST SRE'12 corpora show that sVTS i-vectors provide significant improvements in noisy conditions, and that our proposed simplification results in only a slight degradation with respect to the original VTS approach.
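For reference, the Baum-Welch statistics that feed a standard i-vector extractor are the per-component soft counts and mean-centered first-order sums; the minimal sketch below shows these two quantities (the further whitening by UBM covariances and the VTS compensation itself are omitted, and all shapes are illustrative):

```python
import numpy as np

def baum_welch_stats(features, posteriors, ubm_means):
    """features: (T, D) frames; posteriors: (T, C) per-frame component
    responsibilities; ubm_means: (C, D) UBM component means.
    Returns zeroth-order N (C,) and centered first-order F (C, D)."""
    N = posteriors.sum(axis=0)            # zeroth order: soft counts
    F = posteriors.T @ features           # first order: weighted sums
    F -= N[:, None] * ubm_means           # center by the UBM means
    return N, F

T, D, C = 200, 40, 8
x = np.random.randn(T, D)
post = np.random.dirichlet(np.ones(C), size=T)
N, F = baum_welch_stats(x, post, np.zeros((C, D)))
print(N.shape, F.shape)                   # (8,) (8, 40)
```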
A Novel Scheme for Speaker Recognition Using a Phonetically-Aware Deep Neural Network
We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR). Specifically, the DNN replaces the standard Gaussian mixture model (GMM) to produce frame alignments. The use of an ASR-DNN system in the speaker recognition pipeline is attractive as it integrates information from the speech content directly into the statistics, allowing the standard backends to remain unchanged. Improvements from the proposed framework over a state-of-the-art system are 30% relative at the equal error rate when evaluated on the telephone conditions of the 2012 NIST speaker recognition evaluation (SRE). The proposed framework is a successful way to efficiently leverage transcribed data for speaker recognition, thus opening up a wide spectrum of research directions.
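A minimal sketch of the key substitution follows: the per-frame alignments come from the ASR-DNN's senone posteriors rather than GMM responsibilities, and feed the usual sufficient-statistics accumulation so the i-vector backend is unchanged. The random posteriors below are a hypothetical stand-in for a trained network's outputs, and all shapes are illustrative:

```python
import numpy as np

def dnn_stats(features, dnn_posteriors):
    """features: (T, D) frames; dnn_posteriors: (T, K) senone posteriors.
    Each senone plays the role of one mixture component when
    accumulating the sufficient statistics."""
    N = dnn_posteriors.sum(axis=0)        # per-senone soft counts
    F = dnn_posteriors.T @ features       # per-senone first-order sums
    return N, F

T, D, K = 300, 60, 3000
feats = np.random.randn(T, D)
post = np.random.dirichlet(np.ones(K), size=T)  # stand-in DNN outputs
N, F = dnn_stats(feats, post)
print(N.shape, F.shape)                   # (3000,) (3000, 60)
```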