We present a system for detecting lexical stress in English words spoken by English learners. This system was designed to be part of the EduSpeak® computer-assisted language learning (CALL) software. The system uses both prosodic and spectral features to detect the level of stress (unstressed, primary, or secondary) for each syllable in a word. Features are computed on the vowels and include normalized energy, pitch, spectral tilt, and duration measurements, as well as log-posterior probabilities obtained from the frame-level mel-frequency cepstral coefficients (MFCCs). Gaussian mixture models (GMMs) are used to represent the distribution of these features for each stress class. The system is trained on utterances by L1-English children and tested on English speech from both L1-English children and L1-Japanese children with varying levels of English proficiency. Because it is trained only on data from L1-English speakers, the system can be applied to English utterances spoken by speakers of any L1 without retraining. Furthermore, automatically determined stress patterns are used as the intended target; hand-labeling of training data is therefore not required, which allows us to use a large amount of data for training the system. Our algorithm achieves an error rate of approximately 11% on English utterances from L1-English speakers and 20% on English utterances from L1-Japanese speakers. We show that all features, both spectral and prosodic, are necessary to achieve optimal performance on the data from L1-English speakers: MFCC log-posterior probability features are the single best set of features, followed by duration, energy, pitch, and finally spectral tilt. For English utterances from L1-Japanese speakers, energy, MFCC log-posterior probabilities, and duration are the most important features.
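The per-class generative modeling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a single diagonal-covariance Gaussian per stress class stands in for the full GMMs, and the two-dimensional feature vectors (normalized energy, duration) are hypothetical stand-ins for the paper's full feature set.

```python
import numpy as np

# Hypothetical per-vowel feature vectors: [normalized energy, duration].
# The actual system also uses pitch, spectral tilt, and MFCC
# log-posterior features, and full GMMs rather than single Gaussians.

CLASSES = ("unstressed", "primary", "secondary")

def fit(features_by_class):
    """Estimate a diagonal Gaussian (mean, variance) per stress class."""
    return {c: (np.mean(X, axis=0), np.var(X, axis=0) + 1e-6)
            for c, X in features_by_class.items()}

def log_likelihood(params, x, c):
    """Diagonal-Gaussian log-likelihood of feature vector x under class c."""
    mu, var = params[c]
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var))

def predict(params, x):
    """Assign the stress class whose model gives the highest likelihood."""
    x = np.asarray(x, dtype=float)
    return max(CLASSES, key=lambda c: log_likelihood(params, x, c))

# Toy training data: primary-stressed vowels are louder and longer.
train = {
    "unstressed": np.array([[0.20, 0.050], [0.25, 0.060], [0.22, 0.055]]),
    "primary":    np.array([[0.90, 0.150], [0.85, 0.140], [0.95, 0.160]]),
    "secondary":  np.array([[0.50, 0.100], [0.55, 0.110], [0.52, 0.095]]),
}
params = fit(train)
```

In the paper's setting the training targets come from automatically determined stress patterns rather than hand labels, so `train` would be assembled from forced alignments of a large L1-English corpus.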
We present a system for detecting lexical stress in English words spoken by English learners. The system uses both spectral and segmental features to detect three levels of stress for each syllable in a word.
We review developments in the SRI Language Modeling Toolkit (SRILM) since 2002, when a previous paper on SRILM was published.
EduSpeak®: A Speech Recognition and Pronunciation Scoring Toolkit for Computer-Aided Language Learning Applications
SRI International’s EduSpeak® system is an SDK that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology.
The probabilistic optimum filtering (POF) algorithm is a piecewise linear transformation of the noisy speech feature space into the clean speech feature space. In this work we extend the POF algorithm to select noisy-to-clean feature mappings more accurately, by allowing different combinations of speech and noise to have combination-specific mappings selected depending on the observation. This is especially important in nonstationary environments, where different noise segments will result in different observations in the noisy feature space. Experimental results using stationary and nonstationary noises show the effectiveness of the proposed technique compared to the original approach. We also explored the use of the extended POF method to train a map with multiple noises in order to generalize over different noise types and handle unknown noise environments.
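The piecewise-linear mapping with observation-dependent selection can be sketched as follows. This is an illustrative toy, assuming isotropic Gaussian region models and made-up region centers and affine maps; the actual POF formulation and its extended selection scheme are defined in the paper.

```python
import numpy as np

def region_posteriors(y, centers, var=1.0):
    """Soft assignment of a noisy observation y to regions of the noisy
    feature space, using isotropic Gaussians (an illustrative stand-in
    for the combination-specific selection described above)."""
    d2 = np.sum((centers - y) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / var)
    return w / w.sum()

def pof_map(y, centers, Ws, bs):
    """Posterior-weighted piecewise-linear clean-feature estimate:
    each region r contributes its own affine map W_r @ y + b_r."""
    w = region_posteriors(y, centers)
    return sum(wi * (W @ y + b) for wi, W, b in zip(w, Ws, bs))

# Toy 1-D example: two well-separated regions with different corrections.
centers = np.array([[0.0], [10.0]])
Ws = [np.eye(1), np.eye(1)]
bs = [np.array([-1.0]), np.array([1.0])]
```

An observation near one region's center effectively selects that region's map, while observations between regions receive a smooth blend, which is what makes the scheme useful when nonstationary noise moves the observations around the feature space.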
We describe the development and conceptual evolution of handheld spoken phrase translation systems, beginning with an initial unidirectional system for translation of English phrases, and later extending to a limited bidirectional phrase translation system between English and Pashto, a major language of Afghanistan. We review the challenges posed by such projects, ranging from the constraints imposed by the computational platform to the limitations of the phrase translation approach when dealing with naive respondents. We discuss our proposed solutions, in terms of architecture, algorithms, and software features, as well as some field experience by users of initial prototypes.
We introduce SRI’s new speech recognition engine, DynaSpeak(TM), which is characterized by its scalability and flexibility, high recognition accuracy, memory and speed efficiency, adaptation capability, efficient grammar optimization, support for natural language parsing functionality, and operation based on integer arithmetic. These features are designed to address the needs of the fast-developing and changing domain of embedded and mobile computing platforms.
The EduSpeak(TM) system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology.
We extend the input transformation approach for adapting hybrid connectionist speech recognizers to allow multiple transformations to be trained. Previous work has shown the efficacy of the linear input transformation approach for speaker adaptation, but has focused only on training global transformations. This approach is clearly suboptimal since it assumes that a single transformation is appropriate for every region in the acoustic feature input space, that is, for every phonetic class, microphone, and noise level. In this paper, we propose a new algorithm to train mixtures of transformation networks (MTNs) in the hybrid connectionist recognition framework. This approach is based on the idea of partitioning the acoustic feature space into R regions and training an input transformation for each region. The transformations are combined probabilistically according to the degree to which the acoustic features belong to each region, where the combination weights are derived from a separate acoustic gating network (AGN). We apply the new algorithm to nonnative speaker adaptation, and present recognition results for the 1994 WSJ Spoke 3 development set. The MTN technique can also be used for noise- or microphone-robust recognition, or for other nonspeech neural network pattern recognition problems.
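The gated combination of region-specific transformations can be sketched as follows. This is a minimal sketch under stated assumptions: the acoustic gating network is reduced to a single softmax layer and the per-region transformations to linear maps, with made-up weights; the paper's actual AGN and transformation networks are trained within the hybrid connectionist framework.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mtn_transform(x, gate_W, gate_b, trans_Ws, trans_bs):
    """Mixture-of-transformation-networks input mapping (illustrative):
    the acoustic gating network yields softmax weights over R regions,
    and the R input transformations are combined with those weights."""
    g = softmax(gate_W @ x + gate_b)  # AGN output: p(region r | x)
    return sum(gi * (W @ x + b) for gi, W, b in zip(g, trans_Ws, trans_bs))

# Toy 1-D example with R = 2 regions and a sharply decisive gate.
gate_W = np.array([[10.0], [-10.0]])
gate_b = np.array([0.0, 0.0])
trans_Ws = [np.array([[2.0]]), np.array([[-3.0]])]
trans_bs = [np.array([0.0]), np.array([0.0])]
```

The transformed features would then be fed to the connectionist recognizer in place of the raw input, so only the gate and the R transformations need to be estimated during adaptation.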