Jing Zheng, H. Franco, Fuliang Weng, A. Sankar and H. Bratt, “Word-level rate of speech modeling using rate-specific phones and pronunciations,” 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), 2000, pp. 1775-1778 vol.3, doi: 10.1109/ICASSP.2000.862097.
Variations in rate of speech (ROS) produce changes in both spectral features and word pronunciations that affect ASR systems. To cope with these effects, we propose to use rate-specific phone models and pronunciations for ROS modeling at the word level. Each word is given three types of pronunciations (fast, slow, and medium), each consisting of the corresponding rate-specific phone models. This approach allows us to model within-sentence rate variation. To better model coarticulation effects, we introduce the concept of zero-length phones, which enables short phones to be skipped without having to change their neighboring phones’ contexts. A data-driven approach is used to prune the pronunciation dictionary derived from rules for phone reduction. We tested these approaches on the Hub 4 database and achieved a relative improvement of 2.0% over the baseline, an evaluation-quality version of SRI’s DECIPHER™ continuous speech recognition system, for clean native speech in the 1996 development set.
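The two dictionary ideas in the abstract (rate-specific pronunciations and zero-length phones) can be illustrated with a minimal sketch. This is not the authors' implementation: the phone symbols, the `_f`/`_m`/`_s` suffixes, the `:0` zero-length marker, the `sil` boundary symbol, and the `reducible` set are all hypothetical names chosen for illustration, and the abstract does not specify the actual reduction rules or the pruning procedure.

```python
from itertools import product

# Hypothetical suffixes marking which rate-specific phone model a label refers to.
RATES = {"fast": "_f", "medium": "_m", "slow": "_s"}


def rate_specific_pronunciations(phones):
    """Return one pronunciation per rate, built from rate-specific phone labels."""
    return {rate: [p + suffix for p in phones] for rate, suffix in RATES.items()}


def with_zero_length(phones, reducible):
    """Enumerate variants in which each reducible phone may become zero-length.

    A zero-length phone stays in the sequence (marked with ':0'), so it can be
    skipped in time while its neighbours keep their original contexts.
    """
    options = [(p, p + ":0") if p in reducible else (p,) for p in phones]
    return [list(variant) for variant in product(*options)]


def triphone_contexts(phones):
    """List (left, centre, right) contexts, ignoring the zero-length marker."""
    base = [p.split(":")[0] for p in phones]
    padded = ["sil"] + base + ["sil"]  # assumed word-boundary symbol
    return [(padded[i], padded[i + 1], padded[i + 2]) for i in range(len(base))]


if __name__ == "__main__":
    word = ["ae", "n", "d"]  # illustrative phone string for "and"
    print(rate_specific_pronunciations(word))
    for variant in with_zero_length(word, reducible={"d"}):
        print(variant, triphone_contexts(variant))
```

In this sketch, the variant with `d:0` yields the same triphone contexts as the full pronunciation, which is the property the abstract attributes to zero-length phones. Deleting the phone outright would instead change the right context of `n`.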