The output scores of most speaker recognition systems are not directly interpretable as stand-alone values. For this reason, a calibration step is usually performed on the scores to convert them into proper likelihood ratios (LR), which have a clear probabilistic interpretation. The standard calibration approach transforms the system scores using a linear function trained on data selected to closely match the evaluation conditions. This selection, though, is not feasible when the evaluation conditions are unknown. In previous work, we proposed a calibration approach for this scenario called trial-based calibration (TBC). TBC trains a separate calibration model for each test trial using data that is dynamically selected from a candidate training set to match the conditions of the trial. In this work, we extend the TBC method, proposing (1) a new similarity metric for selecting training data that results in significant gains over the one proposed in the original work, (2) a new option that enables the system to reject a trial when not enough matched data is available for training the calibration model, and (3) the use of regularization to improve the robustness of the calibration models trained for each trial. We test the proposed algorithms on a development set composed of several conditions and on the FBI multi-condition speaker recognition dataset. We demonstrate that the proposed approach reduces calibration loss to values close to 0 for most conditions when matched calibration data is available for selection, and that it can reject most trials for which relevant calibration data is unavailable.
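The standard linear calibration step mentioned above can be illustrated with a minimal sketch. It fits a scale `a` and an offset `b` so that `a * score + b` behaves like a log-likelihood ratio, by minimizing a prior-weighted cross-entropy with plain gradient descent. This is not the TBC method itself; the function names, the optimizer, the default hyperparameters, and the toy scores are all assumptions made for illustration.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_calibration(scores, labels, prior=0.5, lr=0.05, steps=5000):
    """Fit llr = a * score + b on labeled trials (1 = target, 0 = non-target).

    Hypothetical helper: minimizes prior-weighted cross-entropy by gradient
    descent. Real systems typically use a more robust optimizer.
    """
    a, b = 1.0, 0.0
    prior_logit = math.log(prior / (1.0 - prior))
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            # Posterior of "target" given the calibrated LLR and the prior.
            p = _sigmoid(a * s + b + prior_logit)
            # Weight each class so that its total weight equals its prior.
            w = prior / n_tar if y == 1 else (1.0 - prior) / n_non
            grad_a += w * (p - y) * s
            grad_b += w * (p - y)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def calibrate(score, a, b):
    """Map a raw score to a calibrated log-likelihood ratio."""
    return a * score + b


# Toy, made-up scores: the two classes overlap, so the fit stays finite.
toy_scores = [2.0, 3.0, 4.0, 1.0, -1.0, 0.0, 1.5, -2.0]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
a, b = train_linear_calibration(toy_scores, toy_labels)
```

In a trial-based setting, one would presumably rerun such a fit for each trial on the subset of candidate data judged similar to that trial; the selection metric and rejection rule are described in the paper, not here.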
Deep neural network (DNN)-based speaker embeddings have resulted in new, state-of-the-art text-independent speaker recognition technology. However, very limited effort has been made to understand DNN speaker embeddings. In this study, we aim to analyze how speaker recognition systems based on speaker embeddings behave with different front-end features, including the standard Mel frequency cepstral coefficients (MFCC), power normalized cepstral coefficients (PNCC), and perceptual linear prediction (PLP). Using a speaker recognition system based on DNN speaker embeddings and probabilistic linear discriminant analysis (PLDA), we compared different approaches to leveraging complementary information using score-, embedding-, and feature-level combination. We report our results for the Speakers in the Wild (SITW) and NIST SRE 2016 datasets. We found that the first and second embedding layers are complementary in nature. By applying score- and embedding-level fusion, we demonstrate relative improvements in equal error rate of 17% on NIST SRE 2016 and 10% on SITW over the baseline system.
Multi-domain language recognition involves the application of a language identification (LID) system to identify languages in more than one domain. This problem was the focus of the recent NIST LRE 2017, and this article presents the findings from the SRI team during system development for the evaluation. Approaches found to provide robustness in multi-domain LID include a domain-and-language-weighted Gaussian backend classifier, duration-aware calibration, and a source normalized multi-resolution neural network backend. The recently developed speaker embeddings technology is also applied to the task of language recognition, showing great potential for future LID research.
With the recent introduction of speaker embeddings for text-independent speaker recognition, many fundamental questions require addressing in order to fast-track the development of this new era of technology. Of particular interest is the ability of the speaker embeddings network to leverage artificially degraded data far more effectively than prior technologies, even in the evaluation of naturally degraded data. In this study, we aim to explore some of the fundamental requirements for building a good speaker embeddings extractor. We analyze the impact of voice activity detection, types of degradation, the amount of degraded data, and the number of speakers required for a good network. These aspects are analyzed over a large set of 11 conditions from 7 evaluation datasets. We lay out a set of recommendations for training the network based on the observed trends. By applying these recommendations to enhance the default recipe provided in the Kaldi toolkit, a significant gain of 13-21% on the Speakers in the Wild and NIST SRE’16 datasets is achieved.
To date, automatic spoken language detection research has largely been based on a closed-set paradigm, in which the languages to be detected are known prior to system application. In actual practice, such systems may face previously unseen languages (out-of-set (OOS) languages) which should be rejected, a common problem that has received limited attention from the research community. In this paper, we focus on situations in which either (1) the system-modeled languages are not observed during use or (2) the test data contains OOS languages that are unseen during modeling or calibration. In these situations, the common multi-class objective function for calibration of language-detection scores is problematic. We describe how the assumptions of multi-class calibration are not always fulfilled in a practical sense and explore applying global and language-dependent binary objective functions to relax system constraints. We contrast the benefits and sensitivities of the calibration approaches on practical scenarios by presenting results using both LRE09 data and 14 languages from the BABEL dataset. We show that the global binary approach is less sensitive to the characteristics of the training data and that OOS modeling with individual detectors is the best option when OOS test languages are not known to the system.
Unsupervised techniques for the adaptation of speaker recognition are important due to the problem of condition mismatch that is prevalent when applying speaker recognition technology to new conditions and the general scarcity of labeled ‘in-domain’ data. In the recent NIST 2016 Speaker Recognition Evaluation (SRE), symmetric score normalization (S-norm) and calibration using unlabeled in-domain data were shown to be beneficial. Because calibration requires speaker labels for training, speaker-clustering techniques were used to generate pseudo-speakers for learning calibration parameters in those cases where only unlabeled in-domain data was available. These methods performed well in SRE16. It is unclear, however, whether those techniques generalize well to other data sources. In this work, we first describe our SRICON-UAM team system submission for the NIST 2016 SRE and then benchmark these approaches on several distinctly different databases. Our analysis shows that while the benefit of S-norm is also observed across other datasets, applying speaker-clustered calibration provides considerably greater benefit to the system in the context of new acoustic conditions.
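For readers unfamiliar with S-norm, the sketch below shows the commonly used formulation: the raw trial score is z-normalized once with the statistics of the enrollment side's scores against a cohort and once with those of the test side, and the two results are averaged. The function name, argument names, and toy numbers are illustrative assumptions; how the cohort is drawn from the unlabeled in-domain data is not specified here.

```python
import statistics


def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization of one trial score.

    enroll_cohort_scores: scores of the enrollment model against a cohort.
    test_cohort_scores:   scores of the test segment against the same cohort.
    Both lists are assumed to be non-constant, so the deviations are non-zero.
    """
    mu_e = statistics.mean(enroll_cohort_scores)
    sd_e = statistics.pstdev(enroll_cohort_scores)
    mu_t = statistics.mean(test_cohort_scores)
    sd_t = statistics.pstdev(test_cohort_scores)
    # Average of the two z-normalized versions of the same raw score.
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)


# Toy example with made-up cohort scores.
normalized = s_norm(2.0, [0.0, 2.0], [1.0, 3.0])
```

Because the normalization only needs cohort scores and no speaker labels, it can be applied with unlabeled in-domain audio, which is what makes it attractive in the mismatch scenario described above.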
Speech conveys many things beyond content, including aspects of stance and attitude that have not been much studied. Considering 14 aspects of stance as they occur in radio news stories, we investigated the extent to which they could be inferred from prosody. By using time-spread prosodic features and by aggregating local estimates, many aspects of stance were at least somewhat predictable, with results significantly better than chance for many stance aspects, including, across English, Mandarin and Turkish, good, typical, local, background, new information, and relevant to a large group.
This article is concerned with the issue of calibration in the context of deep neural network (DNN)-based approaches to speaker recognition. DNNs have provided a new standard in technology when used in place of the traditional universal background model (UBM) for feature alignment, or to augment traditional features with those extracted from a bottleneck layer of the DNN. These techniques provide extremely good performance for constrained trial conditions that are well matched to development conditions. However, when applied to unseen conditions or a wide variety of conditions, some DNN-based techniques offer poor calibration performance. Through analysis on both the PRISM and the recently released Speakers in the Wild (SITW) corpora, we illustrate that bottleneck features hinder calibration if used in the calculation of first-order Baum-Welch statistics during i-vector extraction. We propose a hybrid alignment framework, which stems from our previous work in DNN senone alignment, that uses the bottleneck features only for the alignment of features during statistics calculation. This framework not only addresses the issue of calibration, but also provides a more computationally efficient system based on bottleneck features with improved discriminative power.
The newly collected Speakers in the Wild (SITW) database was central to a text-independent speaker recognition challenge held as part of a special session at Interspeech 2016. The SITW database is composed of audio recordings from 299 speakers collected from open-source media, with an average of 8 sessions per speaker. The recordings contain unconstrained or “wild” acoustic conditions, rarely found in large speaker recognition datasets, and multi-speaker recordings for both speaker enrollment and verification. This article provides details of the SITW speaker recognition challenge and an analysis of the evaluation results. There were 25 international teams involved in the challenge, of which 11 participated in an evaluation track. Teams were tasked with applying existing and novel speaker recognition algorithms to the challenges associated with the real-world conditions of SITW. We analyze some of the top-performing systems submitted during the evaluation and outline future research directions.