Ferrer, L., Shriberg, E., & Stolcke, A. (2002). Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody. In Seventh International Conference on Spoken Language Processing.
We examine the problem of end-of-utterance (EOU) detection for real-time speech recognition, particularly in the context of a human-computer dialog system. Current EOU detection algorithms use only a simple pause threshold for making this decision, leading to two problems. First, users often pause inside utterances, resulting in a premature cutoff by the system. Second, when users really are done, the minimum system wait is always the threshold value, needlessly adding time to the interaction. We have developed a new approach to EOU detection that uses prosodic features to address both of these problems. Prosodic features are modeled by decision trees and combined with an event N-gram language model to obtain a score that measures the likelihood that any nonspeech region is an EOU. We find that this approach dramatically improves both the accuracy and speed of online EOU detection.
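The contrast between the pause-threshold baseline and a score-based decision can be illustrated with a minimal sketch. It does not reproduce the authors' models: the function names, the 500 ms threshold, the interpolation weight, and the log-linear combination rule are all assumptions chosen for illustration. The two input probabilities stand in for the outputs of a prosodic decision tree and an event N-gram language model.

```python
import math


def baseline_eou(pause_ms, threshold_ms=500.0):
    """Baseline: declare an EOU once the pause reaches a fixed threshold.

    The 500 ms default is an illustrative assumption, not a value from the paper.
    """
    return pause_ms >= threshold_ms


def combined_eou_score(p_prosody, p_lm, weight=0.5):
    """Combine two EOU posteriors into one score in [0, 1].

    Uses log-linear interpolation followed by renormalization. This
    combination rule is a hypothetical stand-in for the paper's method.
    """
    eps = 1e-12  # guard against log(0)
    log_yes = (weight * math.log(max(p_prosody, eps))
               + (1.0 - weight) * math.log(max(p_lm, eps)))
    log_no = (weight * math.log(max(1.0 - p_prosody, eps))
              + (1.0 - weight) * math.log(max(1.0 - p_lm, eps)))
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_yes, log_no)
    yes = math.exp(log_yes - m)
    no = math.exp(log_no - m)
    return yes / (yes + no)


def score_based_eou(p_prosody, p_lm, decision_threshold=0.5, weight=0.5):
    """Declare an EOU when the combined score reaches the decision threshold."""
    return combined_eou_score(p_prosody, p_lm, weight) >= decision_threshold
```

Under these assumptions, a short pause with strong prosodic and lexical evidence of completion can be accepted earlier than the fixed threshold would allow, while a long pause inside an utterance can be rejected when both models assign a low EOU probability.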