K. Laskowski and E. Shriberg, “Comparing the contributions of context and prosody in text-independent dialog act recognition,” in Proc. 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5374–5377.
Automatic segmentation and classification of dialog acts (DAs; e.g., statements versus questions) are important for spoken language understanding (SLU). While most systems have relied on word and word-boundary information, interest in privacy-sensitive applications and non-ASR-based processing calls for a text-independent approach. We propose a framework for employing both speech/non-speech-based (“contextual”) features and prosodic features, and apply it to DA segmentation and classification in multiparty meetings. We find that: (1) contextual features are better for recognizing turn-edge DA types and DA boundary types, while prosodic features are better for finding floor mechanisms and backchannels; (2) the two knowledge sources are complementary for most of the DA types studied; and (3) for several DA types, the performance of the resulting system approaches that achieved with oracle lexical information. These results suggest that text-independent features hold significant promise for DA recognition, and possibly for other SLU tasks, particularly when words are not available.
Keywords— Dialog act tagging, Prosody, Turn taking, Speech activity modeling, Privacy-sensitive features, Meetings