Hub4 Language Modeling Using Domain Interpolation and Data Clustering


Weng, F., Stolcke, A., & Sankar, A. (1997). Hub4 language modeling using domain interpolation and data clustering. In DARPA Speech Recognition Workshop, Citeseer (Vol. 147).


In SRI’s language modeling experiments for the Hub4 domain, three basic approaches were pursued: interpolating multiple models estimated from Hub4 and non-Hub4 training data, adapting the language model (LM) to the focus conditions, and adapting the LM to different topic types.

In the first approach, we built separate LMs for the closely transcribed Hub4 material (acoustic training transcripts) and the loosely transcribed Hub4 material (LM training data), as well as the North-American Business News (NABN) and Switchboard training data, projected onto the Hub4 vocabulary. By interpolating the probabilities obtained from these models, we obtained a 20 percent reduction in perplexity and a 1.8 percent reduction in word error rate, compared to a baseline Hub4-only language model.
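As an illustration of how probabilities from several models can be combined, the following is a minimal sketch that assumes linear interpolation with weights tuned by expectation-maximization (EM) on held-out text. The abstract does not state the exact interpolation scheme or how the weights were chosen, so both are assumptions here. The function names and the per-word probabilities are hypothetical and are not taken from the paper.

```python
import math

def interpolate(probs, weights):
    """Linearly interpolate one word's probabilities from several LMs."""
    return sum(w * p for w, p in zip(weights, probs))

def perplexity(stream, weights):
    """Perplexity of the interpolated model over a stream of words.

    Each element of `stream` holds the probabilities that the component
    models assign to one word given its history.
    """
    log_sum = sum(math.log(interpolate(probs, weights)) for probs in stream)
    return math.exp(-log_sum / len(stream))

def em_weights(stream, n_models, iterations=20):
    """Estimate interpolation weights by EM, starting from uniform weights."""
    weights = [1.0 / n_models] * n_models
    for _ in range(iterations):
        counts = [0.0] * n_models
        for probs in stream:
            mix = interpolate(probs, weights)
            # E-step: posterior responsibility of each model for this word.
            for i, p in enumerate(probs):
                counts[i] += weights[i] * p / mix
        # M-step: new weights are the average responsibilities.
        weights = [c / len(stream) for c in counts]
    return weights

# Hypothetical per-word probabilities on held-out text, in the order:
# closely transcribed Hub4, loosely transcribed Hub4, NABN, Switchboard.
heldout = [
    (0.020, 0.010, 0.002, 0.001),
    (0.001, 0.015, 0.030, 0.002),
    (0.005, 0.004, 0.001, 0.040),
    (0.010, 0.012, 0.008, 0.003),
    (0.002, 0.020, 0.015, 0.001),
]

uniform = [0.25] * 4
tuned = em_weights(heldout, n_models=4)
print("uniform weights perplexity:", perplexity(heldout, uniform))
print("EM-tuned weights perplexity:", perplexity(heldout, tuned))
```

In this sketch the weights are tuned and scored on the same toy data for brevity. In practice the weights would be estimated on held-out data separate from the test set.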

Two adaptation approaches are also described: adapting language models to the speech styles correlated with different focus conditions, and building cluster-specific LM mixtures. These two approaches give some reduction in perplexity, but no significant reduction in word error rate.

Finally, we identify the problems and future directions of our work. 
