Abstract
Variations in rate of speech (ROS) change both spectral features and word pronunciations, and these changes affect automatic speech recognition systems. To handle these ROS effects, we propose a set of parallel rate-specific acoustic and pronunciation models. Rate switching is permitted at word boundaries to allow within-sentence variation in speech rate, which is common in conversational speech. Because the rate-specific models are arranged in parallel and decoding is by maximum likelihood, our approach does not require estimating ROS before recognition, a step that is itself hard to achieve. We evaluate our models on a large-vocabulary conversational telephone speech recognition task. Experiments on the NIST 2000 Hub-5 development set show that word-level ROS-dependent modeling yields a 2.2% absolute reduction in word error rate over a rate-independent baseline system. […]
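As a rough illustration of why parallel rate-specific models remove the need for a separate ROS estimate, the toy sketch below assumes the word sequence and word boundaries are already fixed, and that each word has a log-likelihood under a "slow" and a "fast" model. The function name `best_rate_path`, the two rate labels, and all scores are hypothetical and not taken from the paper; an actual decoder would search over word hypotheses and rates jointly. With free switching at word boundaries, the maximum likelihood rate sequence in this simplified setting is just the better-scoring rate for each word.

```python
from typing import Dict, List, Tuple


def best_rate_path(word_scores: List[Dict[str, float]]) -> Tuple[List[str], float]:
    """Choose the highest-scoring rate label for each word.

    word_scores holds one dict per word, mapping a rate label to the
    (hypothetical) log-likelihood of that word under that rate-specific model.
    Because switching is free at word boundaries, each word is decided
    independently and the per-word maxima are summed.
    """
    labels: List[str] = []
    total = 0.0
    for scores in word_scores:
        rate = max(scores, key=scores.get)  # rate with the largest log-likelihood
        labels.append(rate)
        total += scores[rate]
    return labels, total


# Made-up scores for a three-word utterance (illustrative only).
scores = [
    {"slow": -12.0, "fast": -15.5},
    {"slow": -9.0, "fast": -7.25},
    {"slow": -20.0, "fast": -18.0},
]
labels, total = best_rate_path(scores)
print(labels, total)
```

The rate labels fall out of decoding as a by-product, so no rate estimate is needed beforehand; this is the property the abstract attributes to the parallel structure.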