Growth in voice-based applications and personalized systems has led to increasing demand for speech-analytics technologies that estimate the state of a speaker from speech. Such systems support a wide range of applications, from traditional call-center monitoring to health monitoring, human-robot interaction, and more. To work seamlessly in real-world contexts, such systems must meet certain requirements, including speed, customizability, ease of use, robustness, and live integration of both acoustic and lexical cues. This demo introduces SenSay Analytics™, a platform that performs real-time speaker-state classification from spoken audio. SenSay is easily configured and customizable to new domains, while its underlying architecture offers extensibility and scalability.
Interactive voice technologies can leverage biosignals, such as heart rate (HR), to infer the psychophysiological state of the user. Voice-based detection of HR is attractive because it does not require additional sensors. We predict HR from speech using the SRI BioFrustration Corpus. In contrast to previous studies, we use continuous spontaneous speech as input. Results using random forests show modest but significant effects on HR prediction. We further explore the effects of speaking itself on HR, and contrast the effects when interactions induce neutral versus frustrated responses from users. Results reveal that regardless of the user's emotional state, HR tends to increase while the user is engaged in speaking to a dialog system relative to a silent region right before speech, and that this effect is greater when the subject is expressing frustration. We also find that the user's HR does not recover to pre-speaking levels as quickly after frustrated speech as it does after neutral speech. Implications and future directions are discussed.
This work addresses lexical unit discovery for languages without (usable) written resources. Previous work has addressed this problem using entirely unsupervised methodologies. Our approach, in contrast, investigates the use of linguistic and speaker knowledge, which is often available even when text resources are not. We create a framework that benefits from such resources without assuming orthographic representations or generating word-level transcriptions. We adapt a universal phone recognizer to the target language and use it to convert audio into a searchable phone string for lexical unit discovery via fuzzy sub-string matching. Linguistic knowledge is used to constrain both the phone recognition output and the lexical unit discovery performed on it.
Target-language speakers assist a linguist in creating phonetic transcriptions for the adaptation of acoustic and language models by respeaking a small portion of the target-language audio more clearly. We also explore robust features and feature transforms learned with deep auto-encoders for better phone recognition performance.
The proposed approach achieves lexical unit discovery performance comparable to state-of-the-art zero-resource methods. Since the system is built on phonetic recognition, discovered units are immediately interpretable. They can be used to automatically populate a pronunciation lexicon and enable iterative improvement through additional feedback from target language speakers.
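The fuzzy sub-string matching step described above can be sketched as a banded search of a query phone sequence over the recognizer's output string. The phone labels, edit-distance threshold, and helper names below are illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch of lexical unit discovery via fuzzy sub-string
# matching over phone-recognizer output. Phone labels ("k ae t", etc.)
# and the max_dist threshold are invented for illustration.

def edit_distance(a, b):
    """Standard Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (pa != pb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_find(query, phones, max_dist=1):
    """Return (start, distance) for each window within max_dist edits."""
    hits = []
    for start in range(len(phones) - len(query) + 1):
        window = phones[start:start + len(query)]
        d = edit_distance(query, window)
        if d <= max_dist:
            hits.append((start, d))
    return hits

decoded = "sil k ae t sil k a t sil d og".split()  # noisy phone string
query = "k ae t".split()                            # candidate lexical unit
print(fuzzy_find(query, decoded))  # [(1, 0), (5, 1)]
```

Because matches are expressed directly in phone symbols, each discovered unit is readable as a candidate pronunciation, which is what makes the units immediately interpretable and usable to seed a lexicon.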
The rate at which endangered languages can be documented has been highly constrained by human factors. Although digital recording of natural speech in endangered languages may proceed at a fairly robust pace, transcription of this material is not only time-consuming but also severely limited by the lack of native-speaker personnel proficient in the orthography of their mother tongue. Our NSF-funded project in the Documenting Endangered Languages (DEL) program proposes to tackle this problem from two sides: first via a tool that helps native speakers become proficient in the orthographic conventions of their language, and second by using automatic speech recognition (ASR) output to assist in the transcription of newly recorded audio data. In the present study, we focus exclusively on progress in developing speech recognition for the language of interest, Yoloxóchitl Mixtec (YM), an Oto-Manguean language spoken by fewer than 5000 speakers on the Pacific coast of Guerrero, Mexico. In particular, we present results from an initial set of experiments and discuss future directions through which better and more robust acoustic models for endangered languages with limited resources can be created.
We introduce the SRI CLEO (Conversational Language about Everyday Objects) Speaker-State Corpus of speech, video, and biosignals. The goal of the corpus is to provide insight into the speech and physiological changes that result from subtle, context-based influences on affect and cognition. Speakers were prompted by collections of pictures of neutral everyday objects and were instructed to provide speech related to any subset of the objects for a preset period of time (120 or 180 seconds depending on task). The corpus provides signals for 43 speakers under four different speaker-state conditions: (1) neutral and emotionally charged audiovisual background; (2) cognitive load; (3) time pressure; and (4) various acted emotions. Unlike previous studies that have linked speaker state to the content of the speaking task itself, the CLEO prompts remain largely pragmatically, semantically, and affectively neutral across all conditions. This framework enables more direct comparisons across both conditions and speakers. The corpus also includes more traditional speaker tasks involving reading and free-form reporting of neutral and emotionally charged content. The recorded biosignals include skin conductance, respiration, blood pressure, and ECG. The corpus is in the final stages of processing and will be made available to the research community.
Prediction of heart rate changes from speech features during interaction with a misbehaving dialog system
Most research on detecting a speaker's cognitive state when interacting with a dialog system has been based on self-reports, or on hand-coded subjective judgments based on audio or audio-visual observations. This study examines two questions: (1) how do undesirable system responses affect people physiologically, and (2) to what extent can we predict physiological changes from the speech signal alone? To address these questions, we use a new corpus of simultaneous speech and high-quality physiological recordings in the product returns domain (the SRI BioFrustration Corpus). "Triggers" were used to frustrate users at specific times during the interaction to produce emotional responses at similar times across participants. For each of eight return tasks per participant, we compared speaker-normalized pre-trigger (cooperative system behavior) regions to post-trigger (uncooperative system behavior) regions. Results using random forest classifiers show that changes in spectral and temporal features of speech can predict heart rate changes with an accuracy of ~70%. Implications for future research and applications are discussed.
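The modeling setup above — a random forest classifier predicting the direction of HR change from speech-derived features — can be sketched roughly as follows. The data, feature names, and train/test split here are entirely synthetic stand-ins, not the paper's corpus or feature set:

```python
# Illustrative sketch only: a random forest classifying HR increase vs.
# decrease from two invented, speaker-normalized speech features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
# Hypothetical features, e.g. spectral tilt and speaking-rate change.
X = rng.normal(size=(n, 2))
# Synthetic label: HR rises when both features shift upward (plus noise).
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:150], y[:150])           # train on the first 150 examples
acc = clf.score(X[150:], y[150:])   # evaluate on the held-out 50
print(f"held-out accuracy: {acc:.2f}")
```

With real pre-/post-trigger feature deltas in place of the synthetic columns, the same fit/score pattern yields the kind of held-out accuracy figure reported above.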
We describe the SRI BioFrustration Corpus, an in-progress corpus of time-aligned audio, video, and autonomic nervous system signals recorded while users interact with a dialog system to make returns of faulty consumer items. The corpus offers two important advantages for the study of turn-taking under emotion. First, it contains state-of-the-art ECG, skin conductance, blood pressure, and respiration signals, along with multiple audio and video channels. Second, the collection paradigm is carefully controlled. Though the users believe they are interacting with an empathetic system, in reality the system afflicts each subject with an identical history of "frustration inducers." This approach enables detailed within- and across-speaker comparisons of the effect of physiological state on user behavior. Continuous signal recording enables studying the effect of frustration inducers with respect to speech-based system-directed turns, inter-turn regions, and system text-to-speech responses.
Though depression is a common mental health problem with significant impact on human society, it often goes undetected. We explore a diverse set of features based only on spoken audio to understand which features correlate with self-reported depression scores according to the Beck depression rating scale. These features, many of which are novel for this task, include (1) estimated articulatory trajectories during speech production, (2) acoustic characteristics, (3) acoustic-phonetic characteristics and (4) prosodic features. Features are modeled using a variety of approaches, including support vector regression, a Gaussian backend and decision trees. We report results on the AVEC-2014 depression dataset and find that individual systems range from 9.18 to 11.87 in root mean squared error (RMSE), and from 7.68 to 9.99 in mean absolute error (MAE). Initial fusion brings further improvement; fusion and feature selection work is still in progress.
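The two error metrics reported above, and the kind of score-level fusion mentioned as work in progress, can be written down compactly. The scores below are invented for illustration, and equal-weight averaging is only one simple fusion choice, not necessarily the one used in the paper:

```python
# Minimal sketch of RMSE and MAE over predicted depression scores, plus
# equal-weight score-level fusion of two hypothetical systems.
import math

def rmse(pred, true):
    """Root mean squared error between predictions and reference scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def mae(pred, true):
    """Mean absolute error between predictions and reference scores."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

true = [10, 20, 15, 30]      # invented reference depression scores
sys_a = [12, 18, 20, 26]     # e.g., an articulatory-feature system
sys_b = [8, 23, 13, 33]      # e.g., a prosodic-feature system
fused = [(a + b) / 2 for a, b in zip(sys_a, sys_b)]  # average the scores

for name, pred in [("A", sys_a), ("B", sys_b), ("fused", fused)]:
    print(name, round(rmse(pred, true), 2), round(mae(pred, true), 2))
```

On these toy numbers the fused system beats both individual systems on RMSE and MAE, mirroring the observation that fusion brings further improvement when component systems make complementary errors.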
Reverberation in speech degrades the performance of speech recognition systems, leading to higher word error rates. Human listeners can often ignore reverberation, indicating that the auditory system somehow compensates for reverberation degradations. In this work, we present robust acoustic features motivated by knowledge of human speech perception and production, and we demonstrate that these features provide reasonable robustness to reverberation effects compared to traditional mel-filterbank-based features. Using a single-feature system trained with the data distributed through the REVERB 2014 challenge on automatic speech recognition, we show modest relative reductions in word error rate (WER) of 12% and 0.2% compared to the mel-scale-feature-based […]
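The "relative reduction in WER" figure used above is a ratio of the absolute WER improvement to the baseline WER. The example values below are made up purely to show the arithmetic; they are not the systems' actual error rates:

```python
# Sketch of the relative WER reduction metric used when comparing a
# robust-feature front-end against a mel-filterbank baseline.
def relative_wer_reduction(baseline_wer, system_wer):
    """Percent relative WER reduction of system over baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Invented WERs: a 25.0% baseline improved to 22.0% is a 12% relative gain.
print(relative_wer_reduction(25.0, 22.0))  # 12.0
```

Note that a small absolute WER change can correspond to a much larger relative reduction when the baseline WER is low, which is why results are often reported in relative terms.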