Baron, D., Shriberg, E., & Stolcke, A. (2002). Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues. In Seventh International Conference on Spoken Language Processing.
We investigate automatic approaches to finding “hidden” spontaneous speech events, such as sentence boundaries and disfluencies, in multi-party meetings. Hidden events are characterized prosodically by a large array of automatically extracted energy, duration, and pitch features, and are modeled by decision tree classifiers; lexical cues are modeled by N-gram language models. Both sources of information are combined in a hidden Markov model framework. Results show that combined classifiers achieve higher accuracy than either single knowledge source alone. We also study classifiers that use only the preceding context for predicting events, simulating online processing. We find that prosodic features are more robust than are language model features to this constraint. Finally, we examine the effect of automatic word recognition errors, in both training and testing, on classification accuracy. We find that lexical models degrade much more severely than do prosodic models in this case, again showing the relative robustness of prosodic information for hidden-event detection in natural conversation.
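The combination step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the event inventory, the function name `viterbi_combine`, the `lm_weight` parameter, and the use of an event-to-event transition table as a stand-in for the N-gram language model are all assumptions made for illustration. The sketch treats the event at each word boundary as a hidden state, converts the prosodic classifier's posteriors into scaled likelihoods by dividing by the event priors, and finds the best event sequence with Viterbi decoding.

```python
import math

# Hypothetical event inventory; the abstract names sentence boundaries and disfluencies.
EVENTS = ["none", "sentence_boundary", "disfluency"]


def viterbi_combine(prosody_posteriors, event_priors, lm_transitions, lm_weight=1.0):
    """Return the most likely event label for each word boundary.

    prosody_posteriors: list of dicts, P(event | prosodic features), one per boundary
                        (e.g. the output of a decision tree classifier).
    event_priors:       dict, P(event); also used as the initial state distribution.
    lm_transitions:     dict of dicts, P(event_t | event_{t-1}); a toy stand-in
                        (assumption) for a lexical N-gram language model.
    lm_weight:          assumed scaling factor on the language-model log scores.
    All probabilities are assumed to be strictly positive.
    """
    def emit(post, event):
        # Scaled likelihood: posterior divided by prior, in the log domain.
        return math.log(post[event]) - math.log(event_priors[event])

    # Initialise with the prior as the starting distribution.
    scores = {
        e: lm_weight * math.log(event_priors[e]) + emit(prosody_posteriors[0], e)
        for e in EVENTS
    }
    backpointers = []

    for post in prosody_posteriors[1:]:
        new_scores, pointers = {}, {}
        for e in EVENTS:
            # Best predecessor under the accumulated score plus the transition score.
            best_prev = max(
                EVENTS,
                key=lambda p: scores[p] + lm_weight * math.log(lm_transitions[p][e]),
            )
            new_scores[e] = (
                scores[best_prev]
                + lm_weight * math.log(lm_transitions[best_prev][e])
                + emit(post, e)
            )
            pointers[e] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    path = [max(EVENTS, key=lambda e: scores[e])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    priors = {"none": 0.8, "sentence_boundary": 0.15, "disfluency": 0.05}
    # Context-independent transitions: every row equals the prior (toy values).
    transitions = {e: dict(priors) for e in EVENTS}
    posteriors = [
        {"none": 0.9, "sentence_boundary": 0.05, "disfluency": 0.05},
        {"none": 0.2, "sentence_boundary": 0.7, "disfluency": 0.1},
        {"none": 0.3, "sentence_boundary": 0.1, "disfluency": 0.6},
    ]
    print(viterbi_combine(posteriors, priors, transitions))
```

With context-independent transitions, as in the toy data here, the decoder reduces to picking the highest prosodic posterior at each boundary; a transition table that depends on the previous event lets the sequence model override a locally ambiguous prosodic decision. Because each step of the recursion uses only the boundaries seen so far, the running scores are also available for an online setting like the one the abstract describes, although the traceback shown here is run over the complete sequence.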