Fusion of acoustic, perceptual and production features for robust speech recognition in highly non-stationary noise

Citation

G. Sivaraman, V. Mitra, and C. Espy-Wilson, “Fusion of acoustic, perceptual and production features for robust speech recognition in highly non-stationary noise,” in Proc. 2nd CHiME Workshop on Machine Listening in Multisource Environments, 2013, pp. 65–70.

Abstract

Improving the robustness of speech recognition systems to cope with adverse background noise is a challenging research topic. Extraction of noise-robust acoustic features is one of the prominent methods for incorporating robustness into speech recognition systems. Prior studies have proposed several perceptually motivated noise-robust acoustic features; the normalized modulation cepstral coefficient (NMCC) is one such feature, which uses amplitude modulation estimates to create cepstrum-like parameters. Studies have shown that articulatory features, in combination with traditional mel-cepstral features, help to improve the robustness of speech recognition systems in noisy conditions. This paper shows that fusion of multiple noise-robust feature streams motivated by speech production and perception theories helps to significantly improve the robustness of traditional speech recognition systems. Keyword recognition accuracies on the CHiME-2 noisy-training task reveal that utilizing an optimal combination of noise-robust features improves accuracies by more than 6% absolute across all the different signal-to-noise ratios.
