Background: The diagnosis of posttraumatic stress disorder (PTSD) is usually based on clinical interviews or self‐report measures. Both approaches are subject to under‐ and over‐reporting of symptoms. An objective test is lacking. We have developed a classifier of PTSD based on objective speech‐marker features that discriminate PTSD cases from controls.
Methods: Speech samples were obtained from warzone‐exposed veterans, 52 cases with PTSD and 77 controls, assessed with the Clinician‐Administered PTSD Scale. Individuals with major depressive disorder (MDD) were excluded. Audio recordings of clinical interviews were used to obtain 40,526 speech features, which were input to a random forest (RF) algorithm.
Results: The selected RF used 18 speech features, and the receiver operating characteristic curve had an area under the curve (AUC) of 0.954. At a cut point of 0.423 for the probability of PTSD, Youden’s index was 0.787, and the overall correct classification rate was 89.1%. The probability of PTSD was higher for markers that indicated slower, more monotonous speech, less change in tonality, and less activation. Depression symptoms, alcohol use disorder, and TBI did not meet the statistical criteria to be considered confounders.
Conclusions: This study demonstrates that a speech‐based algorithm can objectively differentiate PTSD cases from controls. The RF classifier had a high AUC. Further validation in an independent sample is required, as is appraisal of the classifier’s ability to distinguish those with MDD only from those with PTSD comorbid with MDD.
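The cut-point selection described above can be illustrated with a short sketch. This is not the study's code: the features and labels below are synthetic stand-ins for the speech markers, and the sample sizes merely mirror the 52 cases and 77 controls. The sketch fits a random forest, computes the ROC AUC, and picks the probability threshold that maximizes Youden's index (sensitivity + specificity − 1).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the case/control speech-marker matrix
# (129 speakers, 18 features, mirroring the size of the selected model).
X, y = make_classification(n_samples=129, n_features=18, n_informative=10,
                           weights=[77 / 129], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
p = rf.predict_proba(X_te)[:, 1]          # predicted probability of the "case" class

fpr, tpr, thresholds = roc_curve(y_te, p)
youden = tpr - fpr                        # Youden's J = sensitivity + specificity - 1
best = np.argmax(youden)
print(f"AUC = {roc_auc_score(y_te, p):.3f}")
print(f"cut point = {thresholds[best]:.3f}, J = {youden[best]:.3f}")
```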
Mapping Individual to Group Level Collaboration Indicators Using Speech Data
Automatic detection of collaboration quality from the students’ speech could support teachers in monitoring group dynamics, diagnosing issues, and developing pedagogical intervention plans. To address the challenge of mapping characteristics of individuals’ speech to information about the group, we coded behavioral and learning-related indicators of collaboration at the individual level. In this work, we investigate the feasibility of predicting the quality of collaboration among a group of students working together to solve a math problem from human-labelled collaboration indicators. We use a corpus of 6th, 7th, and 8th grade students working in groups of three to solve math problems collaboratively. Researchers labelled both the group-level collaboration quality during each problem and the student-level collaboration indicators. Results using random forests reveal that the individual indicators of collaboration aid in the prediction of group collaboration quality.
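As a rough sketch of the individual-to-group mapping (not the study's code; the indicator names, data, and aggregation choices below are hypothetical), per-student indicators can be aggregated to group-level features and fed to a random forest:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical table: one row per student per problem, with binary indicators.
students = pd.DataFrame({
    "group_id":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "on_task":    [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0],
    "explaining": [1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
    "off_topic":  [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
})
# Hypothetical group-level collaboration-quality labels.
quality = pd.Series([1, 0, 1, 0], index=[1, 2, 3, 4], name="collab_quality")

# Map individual indicators to the group level (mean and sum per group here).
group_feats = students.groupby("group_id").agg(["mean", "sum"])
group_feats.columns = ["_".join(col) for col in group_feats.columns]

clf = RandomForestClassifier(random_state=0)
clf.fit(group_feats, quality.loc[group_feats.index])
print(clf.predict(group_feats))           # toy in-sample prediction
```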
Crowdsourcing Emotional Speech
We describe the methodology for the collection and annotation of a large corpus of emotional speech data through crowdsourcing. The corpus offers 187 hours of data from 2,965 subjects. Data include non-emotional recordings from each subject as well as recordings for five emotions: angry, happy-low-arousal, happy-high-arousal, neutral, and sad. The data consist of spontaneous speech elicited from subjects via a web-based tool. Subjects used their own personal recording equipment, resulting in a data set that contains variation in room acoustics, microphone, etc. This offers the advantage of matching the type of variation one would expect when deploying speech technology in the wild in a web-based environment. The annotation scheme covers the quality of emotion expressed through the tone of voice and what was said, along with common audio-quality issues. We discuss lessons learned in the process of creating this corpus.
Inferring Stance from Prosody
Speech conveys many things beyond content, including aspects of stance and attitude that have received little study. Considering 14 aspects of stance as they occur in radio news stories, we investigated the extent to which they could be inferred from prosody. By using time-spread prosodic features and aggregating local estimates, many aspects of stance were at least somewhat predictable, with results significantly better than chance, across English, Mandarin, and Turkish, for aspects including good, typical, local, background, new information, and relevant to a large group.
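A minimal sketch of the aggregation idea, under assumed inputs (synthetic prosodic windows and stance ratings, not the authors' features or corpus): window-level estimates from a simple regressor are averaged over each story to give a story-level prediction.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical data: each story yields many local windows of prosodic features.
n_stories, windows_per_story, n_feats = 20, 50, 12
X_local = rng.normal(size=(n_stories, windows_per_story, n_feats))
y_story = rng.uniform(size=n_stories)          # one stance rating per story

# Train on local windows, each inheriting its story's rating.
model = Ridge().fit(X_local.reshape(-1, n_feats),
                    np.repeat(y_story, windows_per_story))

# Aggregate local estimates into a story-level prediction by averaging.
local_preds = model.predict(X_local.reshape(-1, n_feats)).reshape(n_stories, -1)
story_preds = local_preds.mean(axis=1)

# In-sample correlation: illustrates the aggregation only, not reported results.
print(np.corrcoef(story_preds, y_story)[0, 1])
```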
Analysis and prediction of heart rate using speech features from natural speech
Interactive voice technologies can leverage biosignals, such as heart rate (HR), to infer the psychophysiological state of the user. Voice-based detection of HR is attractive because it does not require additional sensors. We predict HR from speech using the SRI BioFrustration Corpus. In contrast to previous studies, we use continuous spontaneous speech as input. Results using random forests show modest but significant effects on HR prediction. We further explore the effects on HR of speaking itself, and contrast the effects when interactions induce neutral versus frustrated responses from users. Results reveal that regardless of the user’s emotional state, HR tends to increase while the user is engaged in speaking to a dialog system relative to a silent region right before speech, and that this effect is greater when the subject is expressing frustration. We also find that the user’s HR does not recover to pre-speaking levels as quickly after frustrated speech as it does after neutral speech. Implications and future directions are discussed.
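A hedged illustration of the prediction setup (synthetic features and heart-rate values, not the SRI BioFrustration Corpus): a random forest regressor predicts heart rate from per-segment speech-feature statistics and is scored with cross-validation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_segments, n_feats = 200, 40
X = rng.normal(size=(n_segments, n_feats))        # e.g., spectral/prosodic statistics
hr = 70 + 5 * X[:, 0] + rng.normal(scale=8, size=n_segments)  # synthetic HR (bpm)

reg = RandomForestRegressor(n_estimators=300, random_state=0)
r2 = cross_val_score(reg, X, hr, cv=5, scoring="r2")
print("cross-validated R^2:", r2.mean())
```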
Noise and reverberation effects on depression detection from speech
Speech-based depression detection has gained importance in recent years, but most research has used relatively quiet conditions or examined a single corpus per study. Little is thus known about the robustness of speech cues in the wild. This study compares the effect of noise and reverberation on depression prediction using 1) standard mel-frequency cepstral coefficients (MFCCs), and 2) features designed for noise robustness, damped oscillator cepstral coefficients (DOCCs). Data come from the 2014 Audio-Visual Emotion Recognition Challenge (AVEC). Results using additive noise and reverberation reveal a consistent pattern of findings for multiple evaluation metrics under both matched and mismatched conditions. First and most notably: standard MFCC features suffer dramatically under test/train mismatch for both noise and reverberation; DOCC features are far more robust. Second, including higher-order cepstral coefficients is generally beneficial. Third, artificial neural networks tend to outperform support vector regression. Fourth, spontaneous speech appears to offer better robustness than read speech. Finally, a cross-corpus (and cross-language) experiment reveals better noise and reverberation robustness for DOCCs than for MFCCs. Implications and future directions for real-world robust depression detection are discussed.
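To make the train/test mismatch concrete, here is a small sketch assuming a librosa-style front end (not the challenge pipeline; the signal is a synthetic tone standing in for speech): MFCCs are computed from a clean signal and from the same signal with additive white noise at a chosen SNR.

```python
import numpy as np
import librosa

sr = 16000
# Stand-in "speech": a two-second 220 Hz tone (replace with a real recording).
y = np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr).astype(np.float32)

def add_noise(signal, snr_db, seed=0):
    """Add white noise at the requested signal-to-noise ratio (dB)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=signal.shape).astype(np.float32)
    sig_pow, noise_pow = np.mean(signal ** 2), np.mean(noise ** 2)
    noise *= np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return signal + noise

mfcc_clean = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
mfcc_noisy = librosa.feature.mfcc(y=add_noise(y, snr_db=5), sr=sr, n_mfcc=20)
# Mean absolute difference gives a crude sense of how much the features shift.
print(mfcc_clean.shape, np.mean(np.abs(mfcc_clean - mfcc_noisy)))
```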
Spoken Interaction Modeling for Automatic Assessment of Collaborative Learning
Collaborative learning is a key skill for student success, but simultaneous monitoring of multiple small groups is untenable for teachers. This study investigates whether automatic audio-based monitoring of interactions can predict collaboration quality. Data consist of hand-labeled 30-second segments from audio recordings of students as they collaborated on solving math problems. Two types of features were explored: speech activity features, which were computed at the group level; and prosodic features (pitch, energy, durational, and voice quality patterns), which were computed at the speaker level. For both feature types, normalized and unnormalized versions were investigated; the latter facilitate real-time processing applications. Results using boosting classifiers, evaluated by F-measure and accuracy, reveal that (1) both speech activity and prosody features predict quality well beyond a majority-class chance baseline; (2) speech activity features are the better predictors overall, but per-class performance using prosody suggests potential synergies; and (3) it may not be necessary to session-normalize features by speaker. These novel results have impact for educational settings, where the approach could support teachers in the monitoring of group dynamics, diagnosis of issues, and development of pedagogical intervention plans.
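The session-normalization question can be sketched as follows, with hypothetical feature names and synthetic labels (so both conditions land near chance; only the mechanics are shown): prosodic features are z-scored within speaker and compared against their unnormalized versions under a boosting classifier.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_segments = 300
seg = pd.DataFrame({
    "speaker":     rng.integers(0, 6, size=n_segments),
    "pitch_mean":  rng.normal(200, 30, size=n_segments),
    "energy_mean": rng.normal(60, 5, size=n_segments),
    "speech_rate": rng.normal(4.5, 0.7, size=n_segments),
})
labels = rng.integers(0, 2, size=n_segments)   # good vs. poor collaboration (synthetic)

# Session/speaker normalization: z-score each prosodic feature within speaker.
feat_cols = ["pitch_mean", "energy_mean", "speech_rate"]
norm = seg.groupby("speaker")[feat_cols].transform(lambda x: (x - x.mean()) / x.std())

clf = GradientBoostingClassifier(random_state=0)
# With synthetic labels both scores sit near chance; the point is the mechanics.
print("normalized  :", cross_val_score(clf, norm, labels, cv=5).mean())
print("unnormalized:", cross_val_score(clf, seg[feat_cols], labels, cv=5).mean())
```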
Prediction of heart rate changes from speech features during interaction with a misbehaving dialog system
Most research on detecting a speaker’s cognitive state when interacting with a dialog system has been based on self-reports, or on hand-coded subjective judgments based on audio or audio-visual observations. This study examines two questions: (1) how do undesirable system responses affect people physiologically, and (2) to what extent can we predict physiological changes from the speech signal alone? To address these questions, we use a new corpus of simultaneous speech and high-quality physiological recordings in the product returns domain (the SRI BioFrustration Corpus). “Triggers” were used to frustrate users at specific times during the interaction to produce emotional responses at similar times during the experiment across participants. For each of eight return tasks per participant, we compared speaker-normalized pre-trigger (cooperative system behavior) regions to post-trigger (uncooperative system behavior) regions. Results using random forest classifiers show that changes in spectral and temporal features of speech can predict heart rate changes with an accuracy of ~70%. Implications for future research and applications are discussed.
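As a final hedged sketch (synthetic, speaker-normalized feature vectors rather than the corpus features), a random forest classifier predicts whether heart rate rose between a pre-trigger and a post-trigger region:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_regions, n_feats = 160, 30
# Speaker-normalized spectral/temporal feature vectors (synthetic).
X = rng.normal(size=(n_regions, n_feats))
# Synthetic label: did heart rate rise from the pre- to the post-trigger region?
hr_increase = (X[:, 0] + rng.normal(scale=1.0, size=n_regions) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
acc = cross_val_score(clf, X, hr_increase, cv=5, scoring="accuracy")
print(f"mean cross-validated accuracy: {acc.mean():.2f}")
```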