V. Mitra, A. Tsiartas and E. Shriberg, “Noise and reverberation effects on depression detection from speech,” in Proc. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5795-5799.
Speech-based depression detection has gained importance in recent years, but most research has used relatively quiet conditions or examined a single corpus per study. Little is thus known about the robustness of speech cues in the wild. This study compares the effect of noise and reverberation on depression prediction using 1) standard mel-frequency cepstral coefficients (MFCCs), and 2) features designed for noise robustness, damped oscillator cepstral coefficients (DOCCs). Data come from the 2014 Audio-Visual Emotion Recognition Challenge (AVEC). Results using additive noise and reverberation reveal a consistent pattern of findings for multiple evaluation metrics under both matched and mismatched conditions. First and most notably: standard MFCC features suffer dramatically under test/train mismatch for both noise and reverberation; DOCC features are far more robust. Second, including higher-order cepstral coefficients is generally beneficial. Third, artificial neural networks tend to outperform support vector regression. Fourth, spontaneous speech appears to offer better robustness than read speech. Finally, a cross-corpus (and cross-language) experiment reveals better noise and reverberation robustness for DOCCs than for MFCCs. Implications and future directions for real-world robust depression detection are discussed.
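The abstract does not say how the additive-noise conditions were constructed, so the following is only a generic sketch of one common approach: scaling a noise recording so that the mixture reaches a chosen signal-to-noise ratio. The function names (`power`, `add_noise_at_snr`, `measured_snr_db`), the 10 dB target, and the synthetic sine and Gaussian signals are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def add_noise_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; not the paper's procedure.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = power(speech)
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # Choose gain g so that p_speech / (g^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB, treating (noisy - speech) as the noise component."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))


# Toy stand-ins (assumptions): a 220 Hz tone for "speech", Gaussian "noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = add_noise_at_snr(speech, noise, snr_db=10)
```

Under this sketch, a mismatched condition would correspond to training a model on features from `speech` and testing on features from `noisy`, while a matched condition would use the same degradation for both.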