Stolcke, A., Bratt, H., Butzberger, J., Franco, H., Gadde, V., Plauche, M., Richey, C., Shriberg, E., Sönmez, M., Weng, F., & Zheng, J. (2000). The SRI March 2000 Hub-5 Conversational Speech Transcription System.
We describe SRI’s large-vocabulary conversational speech recognition system as used in the March 2000 NIST Hub-5E evaluation. The system performs four recognition passes: (1) bigram recognition with phone-loop-adapted, within-word triphone acoustic models, (2) lattice generation with transcription-mode-adapted models, (3) trigram lattice recognition with adapted cross-word triphone models, and (4) N-best rescoring and reranking with various additional knowledge sources. The system incorporates two new kinds of acoustic model: triphone models conditioned on speaking rate, and an explicit joint model of within-word phone durations. We also obtained an unusually large improvement from modeling cross-word pronunciation variants in “multiword” vocabulary items. The language model (LM) was enhanced with an “anti-LM” representing acoustically confusable word sequences. Finally, we applied a generalized ROVER algorithm to combine the N-best hypotheses from several systems based on different acoustic models.
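The fourth pass, N-best rescoring and reranking with additional knowledge sources, can be illustrated with a generic weighted log-linear combination of per-hypothesis scores. The abstract does not give the actual scoring formula, so the sketch below is an assumption throughout: the source names (`acoustic`, `lm`, `anti_lm`, `duration`), the weights, the negative sign on the anti-LM weight, and all score values are hypothetical and chosen only to show the mechanism.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hypothesis:
    """One entry of an N-best list with log-domain scores per knowledge source."""
    words: List[str]
    scores: Dict[str, float]


def combined_score(hyp: Hypothesis, weights: Dict[str, float]) -> float:
    # Weighted sum of log scores (a log-linear combination).
    return sum(w * hyp.scores[name] for name, w in weights.items())


def rerank(nbest: List[Hypothesis], weights: Dict[str, float]) -> List[Hypothesis]:
    # Highest combined score first.
    return sorted(nbest, key=lambda h: combined_score(h, weights), reverse=True)


# Hypothetical two-entry N-best list; all numbers are made up for illustration.
nbest = [
    Hypothesis(["i", "thing", "so"],
               {"acoustic": -99.0, "lm": -7.0, "anti_lm": -4.0, "duration": -3.5}),
    Hypothesis(["i", "think", "so"],
               {"acoustic": -100.0, "lm": -5.0, "anti_lm": -9.0, "duration": -3.0}),
]

# Hypothetical weights. The negative anti-LM weight is an assumption: it
# penalizes word sequences that the anti-LM considers likely confusions.
weights = {"acoustic": 1.0, "lm": 8.0, "anti_lm": -2.0, "duration": 1.0}

best = rerank(nbest, weights)[0]
print(" ".join(best.words), combined_score(best, weights))
```

With these made-up numbers, the acoustic score alone prefers the first hypothesis, while the combined score prefers the second. In practice the weights would be tuned on held-out data rather than set by hand.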