We describe the methodology for the collection and annotation of a large corpus of emotional speech data through crowdsourcing. The corpus offers 187 hours of data from 2,965 subjects. Data includes non-emotional recordings from each subject as well as recordings for five emotions: angry, happy-low-arousal, happy-high-arousal, neutral, and sad. The data consist of spontaneous speech elicited from subjects via a web-based tool. Subjects used their own personal recording equipment, resulting in a data set that contains variation in room acoustics, microphone, etc. This offers the advantage of matching the type of variation one would expect to see when exposing speech technology in the wild in a web-based environment. The annotation scheme covers the quality of emotion expressed through the tone of voice and what was said, along with common audioquality issues. We discuss lessons learned in the process of the creation of this corpus.
In this paper, we investigate several automatic transcription schemes for using raw bilingual broadcast news data in semi-supervised bilingual acoustic model training. Specifically, we compare the transcription quality provided by a bilingual ASR system with another system performing language diarization at the front-end followed by two monolingual ASR systems chosen based on the assigned language label. Our research focuses on the Frisian-Dutch code-switching (CS) speech that is extracted from the archives of a local radio broadcaster. Using 11 hours of manually transcribed Frisian speech as a reference, we aim to increase the amount of available training data by using these automatic transcription techniques. By merging the manually and automatically transcribed data, we learn bilingual acoustic models and run ASR experiments on the development and test data of the FAME! speech corpus to quantify the quality of the automatic transcriptions. Using these acoustic models, we present speech recognition and CS detection accuracies. The results demonstrate that applying language diarization to the raw speech data to enable using the monolingual resources improves the automatic transcription quality compared to a baseline system using a bilingual ASR system.
This paper describes a two-step approach for keyword spotting task in which a query-by-example (QbE) search is followed by noise robust exemplar matching (N-REM) rescoring. In the first stage, subsequence dynamic time warping is performed to detect keywords in search utterances. In the second stage, these target frame sequences are rescored using the reconstruction errors provided by the linear combination of the available exemplars extracted from the training data. Due to data sparsity, we align the target frame sequence and the exemplars to a common frame length and the exemplar weights are obtained by solving a convex optimization problem with non-negative sparse coding. We run keyword spotting experiments on the Air Traffic Control (ATC) database and evaluate performance of multiple distance metrics for calculating the weights and reconstruction errors using convolutional neural network (CNN) bottleneck features. The results demonstrate that the proposed two-step keyword spotting approach provides better keyword detection compared to a baseline with only QbE search
Tackling Unseen Acoustic Conditions in Query-by-Example Search Using Time and Frequency Convolution for Multilingual Deep Bottleneck Features
Standard keyword spotting based on Automatic Speech Recognition (ASR) cannot be used on low- and no-resource languages due to lack of annotated data and/or linguistic resources. In recent years, query-by-example (QbE) has emerged as an alternate way to enroll and find spoken queries in large audio corpora, yet mismatched and unseen acoustic conditions remain a difficult challenge given the lack of enrollment data. This paper revisits two neural network architectures developed for noise and channel-robust ASR, and applies them to building a state-of-art multilingual QbE system. By applying convolution in time or frequency across the spectrum, those convolutional bottlenecks learn more discriminative deep bottleneck features. In conjunction with dynamic time warping (DTW), these features enable robust QbE systems. We use the MediaEval 2014 QUESST data to evaluate robustness against language and channel mismatches, and add several levels of artificial noise to the data to evaluate performance in degraded acoustic environments. We also assess performance on an Air Traffic Control QbE task with more realistic and higher levels of distortion in the push-to-talk domain.
Analysis of Phonetic Markedness and Gestural Effort Measures for Acoustic Speech-Based Depression Classification
While acoustic-based links between clinical depression and abnormal speech have been established, there is still however little knowledge regarding what kinds of phonological content is most impacted. Moreover, for automatic speech-based depression classification and depression assessment elicitation protocols, even less is understood as to what phonemes or phoneme transitions provide the best analysis. In this paper we analyze articulatory measures to gain further insight into how articulation is affected by depression. In our investigative experiments, by partitioning acoustic speech data based on
lower to high densities of specific phonetic markedness and gestural effort, we demonstrate improvements in depressed/non-depressed classification accuracy and F1 scores.
Articulatory information can effectively model variability in speech and can improve speech recognition performance under varying acoustic conditions. Learning speaker-independent articulatory models has always been challenging, as speaker-specific information in the articulatory and acoustic spaces increases the complexity of the speech-to-articulatory space inverse modeling, which is already an ill-posed problem due to its inherent nonlinearity and non-uniqueness. This paper investigates using deep neural networks (DNN) and convolutional neural networks (CNNs) for mapping speech data into its corresponding articulatory space. Our results indicate that the CNN models perform better than their DNN counterparts for speech inversion. In addition, we used the inverse models to generate articulatory trajectories from speech for three different standard speech recognition tasks. To effectively model the articulatory features’ temporal modulations while retaining the acoustic features’ spatiotemporal signatures, we explored a joint modeling strategy to simultaneously learn both the acoustic and articulatory spaces. The results from multiple speech recognition tasks indicate that articulatory features can improve recognition performance when the acoustic and articulatory spaces are jointly learned with one common objective function.
Growth in voice-based applications and personalized systems has led to increasing demand for speech- analytics technologies that estimate the state of a speaker from speech. Such systems support a wide range of applications, from more traditional call-center monitoring, to health monitoring, to human-robot interactions, and more. To work seamlessly in real-world contexts, such systems must meet certain requirements, including for speed, customizability, ease of use, robustness, and live integration of both acoustic and lexical cues. This demo introduces SenSay AnalyticsTM, a platform that performs real-time speaker-state classification from spoken audio. SenSay is easily configured and is customizable to new domains, while its underlying architecture offers extensibility and scalability.
Speech recognition in varying background conditions is a challenging problem. Acoustic condition mismatch between training and evaluation data can significantly reduce recognition performance. For mismatched conditions, data-adaptation techniques are typically found to be useful, as they expose the acoustic model to the new data condition(s). Supervised adaptation techniques usually provide substantial performance improvement, but such gain is contingent on having labeled or transcribed data, which is often unavailable. The alternative is unsupervised adaptation, where feature-transform methods and model-adaptation techniques are typically explored. This work investigates robust features, feature-space maximum likelihood linear regression (fMLLR) transform, and deep convolutional nets to address the problem of unseen channel and noise conditions. In addition, the work investigates bottleneck (BN) features extracted from deep autoencoder (DAE) networks trained by using acoustic features extracted from the speech signal. We demonstrate that such representations not only produce robust systems but also that they can be used to perform data selection for unsupervised model adaptation. Our results indicate that the techniques presented in this paper significantly improve performance of speech recognition systems in unseen channel and noise conditions.
Interactive voice technologies can leverage biosignals, such as heart rate (HR), to infer the psychophysiological state of the user. Voice-based detection of HR is attractive because it does not require additional sensors. We predict HR from speech using the SRI BioFrustration Corpus. In contrast to previous studies we use continuous spontaneous speech as input. Results using random forests show modest but significant effects on HR prediction. We further explore the effects on HR of speaking itself, and contrast the effects when interactions induce neutral versus frustrated responses from users. Results reveal that regardless of the user’s emotional state, HR tends to increase while the user is engaged in speaking to a dialog system relative to a silent region right before speech, and that this effect is greater when the subject is expressing frustration. We also find that the user’s HR does not recover to pre-speaking levels as quickly after frustrated speech as it does after neutral speech. Implications and future directions are discussed.
This work addresses lexical unit discovery for languages without (usable) written resources. Previous work has addressed this problem using entirely unsupervised methodologies. Our approach in contrast investigates the use of linguistic and speaker knowledge which are often available even if text resources are not. We create a framework that benefits from such resources, not assuming orthographic representations and avoiding generation of word-level transcriptions. We adapt a universal phone recognizer to the target language and use it to convert audio into a searchable phone string for lexical unit discovery via fuzzy sub-string matching. Linguistic knowledge is used to constrain phone recognition output and to constrain lexical unit discovery on the phone recognizer output.
Target language speakers are used to assist a linguist in creating phonetic transcriptions for the adaptation of acoustic and language models, by respeaking more clearly a small portion of the target language audio. We also explore robust features and feature transform through deep auto-encoders for better phone recognition performance.
The proposed approach achieves lexical unit discovery performance comparable to state-of-the-art zero-resource methods. Since the system is built on phonetic recognition, discovered units are immediately interpretable. They can be used to automatically populate a pronunciation lexicon and enable iterative improvement through additional feedback from target language speakers.