A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition


Sankar, A., & Lee, C. H. (1996). A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Transactions on Speech and Audio Processing, 4(3), 190-202.


We present a maximum-likelihood (ML) stochastic matching approach to decrease the acoustic mismatch between a test utterance and a given set of speech models so as to reduce the recognition performance degradation caused by distortions in the test utterance and/or the model set. We assume that the speech signal is modeled by a set of subword hidden Markov models (HMMs) Λ_X. The mismatch between the observed test utterance Y and the models Λ_X can be reduced in two ways: 1) by an inverse distortion function F_ν(·) that maps Y into an utterance X which matches better with the models Λ_X, and 2) by a model transformation function G_η(·) that maps Λ_X to the transformed model Λ_Y which matches better with the utterance Y. We assume the functional form of the transformations F_ν(·) or G_η(·) and estimate the parameters ν or η in a maximum-likelihood manner using the expectation-maximization (EM) algorithm. The choice of the form of F_ν(·) or G_η(·) is based on our prior knowledge of the nature of the acoustic mismatch. The stochastic matching algorithm operates only on the given test utterance and the given set of speech models, and no additional training data is required for the estimation of the mismatch prior to actual testing.
Experimental results are presented to study the properties of the proposed algorithm and to verify the efficacy of the approach in improving the performance of an HMM-based continuous speech recognition system in the presence of mismatch due to different transducers and transmission channels. The proposed stochastic matching algorithm is found to converge fast. Further, the recognition performance in mismatched conditions is greatly improved while the performance in matched conditions is well maintained. The stochastic matching algorithm was able to reduce the word error rate by about 70% in mismatched conditions.
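
As a rough illustration of the feature-space idea, the sketch below estimates a mismatch parameter by EM under strong simplifying assumptions that are not taken from the abstract: the inverse distortion is a single additive bias (x = y - b), the features are one-dimensional, and a two-component Gaussian mixture stands in for the HMM set. The names `estimate_bias` and `gaussian_pdf` are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def estimate_bias(ys, weights, means, variances, iterations=20):
    """EM estimate of an additive bias b, assuming y = x + b and x ~ GMM.

    Assumption: a fixed Gaussian mixture replaces the HMMs, and the
    distortion is a pure shift. Only the bias is updated; the model is not.
    """
    bias = 0.0
    for _ in range(iterations):
        num = 0.0
        den = 0.0
        for y in ys:
            x = y - bias  # apply the current inverse distortion
            # E-step: posterior probability of each mixture component
            likes = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(likes)
            if total == 0.0:
                continue  # skip frames that underflow numerically
            for like, m, v in zip(likes, means, variances):
                gamma = like / total
                # M-step accumulators: variance-weighted mean of (y - mean)
                num += gamma * (y - m) / v
                den += gamma / v
        bias = num / den
    return bias


# Synthetic "test utterance": samples drawn from the model, then shifted.
rng = random.Random(0)
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [0.25, 0.25]
true_bias = 0.7
ys = []
for _ in range(2000):
    k = 0 if rng.random() < weights[0] else 1
    ys.append(rng.gauss(means[k], math.sqrt(variances[k])) + true_bias)

estimated = estimate_bias(ys, weights, means, variances)
print(f"true bias {true_bias:.2f}, estimated bias {estimated:.2f}")
```

Like the approach described above, the sketch uses only the observed data and the fixed model; no separate adaptation data is involved. A model-space variant would instead shift the component means by the estimated bias and leave the observations untouched.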
