We present a system for detecting lexical stress in English words spoken by English learners. The system was designed to be part of the EduSpeak® computer-assisted language learning (CALL) software. It uses both prosodic and spectral features to detect the level of stress (unstressed, primary, or secondary) for each syllable in a word. Features are computed on the vowels and include normalized energy, pitch, spectral tilt, and duration measurements, as well as log-posterior probabilities obtained from frame-level mel-frequency cepstral coefficients (MFCCs). Gaussian mixture models (GMMs) represent the distribution of these features for each stress class. The system is trained on utterances by L1-English children and tested on English speech from L1-English children and from L1-Japanese children with varying levels of English proficiency. Because it is trained only on data from L1-English speakers, the system can be applied to English utterances from speakers of any L1 without retraining. Furthermore, automatically determined stress patterns are used as the intended target, so hand-labeling of training data is not required; this allows us to train the system on a large amount of data. Our algorithm achieves an error rate of approximately 11% on English utterances from L1-English speakers and 20% on English utterances from L1-Japanese speakers. We show that all features, both spectral and prosodic, are necessary to achieve optimal performance on the data from L1-English speakers: MFCC log-posterior probabilities are the single best feature set, followed by duration, energy, pitch, and finally spectral tilt. For English utterances from L1-Japanese speakers, energy, MFCC log-posterior probabilities, and duration are the most important features.
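The classification scheme described above can be sketched in miniature: fit one model per stress class over vowel features and assign each syllable to the class whose model gives the highest likelihood. The sketch below is a deliberate simplification, assuming a single diagonal-covariance Gaussian per class (a one-component GMM) over just two illustrative features (normalized duration and energy); the feature values and helper names are hypothetical, and the actual system uses multi-component GMMs over the full feature set, including pitch, spectral tilt, and MFCC log-posteriors.

```python
import math

def train_gaussian(samples):
    """Fit a diagonal-covariance Gaussian (a 1-component GMM) to feature vectors."""
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / len(samples) for d in range(dims)]
    # Variance floor (1e-6) avoids degenerate zero-variance dimensions.
    variances = [max(sum((s[d] - means[d]) ** 2 for s in samples) / len(samples), 1e-6)
                 for d in range(dims)]
    return means, variances

def log_likelihood(x, model):
    """Log-likelihood of feature vector x under a diagonal Gaussian."""
    means, variances = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, means, variances))

def classify(x, models):
    """Assign the stress class whose model gives the highest log-likelihood."""
    return max(models, key=lambda c: log_likelihood(x, models[c]))

# Toy training data: (normalized duration, normalized energy) per vowel.
train = {
    "primary":    [(0.9, 0.8), (1.0, 0.9), (0.8, 0.85)],
    "secondary":  [(0.6, 0.5), (0.7, 0.55), (0.65, 0.6)],
    "unstressed": [(0.3, 0.2), (0.25, 0.3), (0.35, 0.25)],
}
models = {c: train_gaussian(xs) for c, xs in train.items()}
print(classify((0.95, 0.85), models))  # a long, loud vowel -> "primary"
```

In a full system the per-class likelihoods would come from GMMs trained on large amounts of automatically labeled L1-English speech, which is what makes the hand-labeling-free training described above practical.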
Lexical Stress Classification for Language Learning Using Spectral and Segmental Features
We present a system for detecting lexical stress in English words spoken by English learners. The system uses both spectral and segmental features to detect three levels of stress for each syllable in a word.
Detecting Leadership and Cohesion in Spoken Interactions
We present a system for detecting leadership and group cohesion in multiparty dialogs and broadcast conversations in English and Mandarin.
Unsupervised topic modeling for leader detection in spoken discourse
In this paper, we describe a method for leader detection in multiparty spoken discourse that relies on unsupervised topic modeling to segment the discourse automatically.
Detection of agreement and disagreement in broadcast conversations
We present approaches based on Conditional Random Fields (CRFs) for detecting agreement/disagreement between speakers in English broadcast conversation shows.
Automatic identification of speaker role and agreement/disagreement in broadcast conversation
We present supervised approaches for detecting speaker roles and agreement/disagreement between speakers in broadcast conversation shows in three languages: English, Arabic, and Mandarin.
Implementing SRI’s Pashto speech-to-speech translation system on a smartphone
We describe our recent effort implementing SRI’s UMPC-based Pashto speech-to-speech (S2S) translation system on a smartphone running the Android operating system.
EduSpeak®: A Speech Recognition and Pronunciation Scoring Toolkit for Computer-Aided Language Learning Applications
SRI International’s EduSpeak® system is an SDK that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology.
Recent advances in SRI’s IraqComm Iraqi Arabic-English speech-to-speech translation system
We summarize recent progress on SRI’s IraqComm™ Iraqi Arabic-English two-way speech-to-speech translation system.