Data-Driven Lexicon Expansion for Mandarin Broadcast News and Conversation Speech Recognition

Citation

X. Lei, W. Wang and A. Stolcke, “Data-driven lexicon expansion for Mandarin broadcast news and conversation speech recognition,” 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 4329-4332, doi: 10.1109/ICASSP.2009.4960587.

Abstract

We present a data-driven framework for expanding the lexicon to improve Mandarin broadcast news and conversation speech recognition. The lexicon expansion includes the generation of pronunciation variants for frequent words and vocabulary augmentation with new words and phrases derived from the training data. To learn multiple pronunciations, we first generate all possible pronunciation candidates for a word from its character pronunciation network. The top pronunciation variants are then selected from forced alignment statistics. To augment the acoustic vocabulary, we propose an efficient algorithm that derives new words based on N-gram statistics. Experiments show that a dictionary expanded in this manner yields significant improvements on a Mandarin broadcast speech recognition task.


Read more from SRI

  • surgeons around a surgical robot

    The SRI research behind today’s surgical robotics

    Intuitive’s da Vinci 5 system represents a major leap in robotic-assisted medicine. It all started at SRI, which continues to advance teleoperation technologies.

  • a collage of digital graphs

    A banner year for quantum

    SRI-managed QED-C’s annual report on quantum trends captures an industry accelerating rapidly from technical promise toward major global impact.

  • ICE Cube containing SRI’s aerogel experiment, photographed prior to launch. Source: Aerospace Applications North America

    An SRI carbon capture experiment launches into space

    By synthesizing carbon-absorbing aerogels in microgravity, SRI research will give us a rare glimpse into how these materials could be radically improved.