Using Machine Learning to Cope with Imbalanced Classes in Natural Speech: Evidence from Sentence Boundary and Disfluency Detection

Citation

Liu, Y., Shriberg, E., Stolcke, A., & Harper, M. (2004). Using machine learning to cope with imbalanced classes in natural speech: Evidence from sentence boundary and disfluency detection. In Eighth International Conference on Spoken Language Processing.

Abstract

We investigate machine learning techniques for coping with highly skewed class distributions in two spontaneous speech processing tasks. Both tasks, sentence boundary and disfluency detection, provide important structural information for downstream language processing modules. We examine the effect of data set size, task, sampling method (no sampling, downsampling, oversampling, and ensemble sampling), and learning method (bagging, ensemble bagging, and boosting) for a decision tree prosody model. Results show that (1) bagging benefits both tasks, but to different degrees, (2) the benefit from ensemble bagging decreases as data size increases, and (3) boosting can outperform bagging under certain conditions.


Read more from SRI

  • surgeons around a surgical robot

    The SRI research behind today’s surgical robotics

    Intuitive’s da Vinci 5 system represents a major leap in robotic-assisted medicine. It all started at SRI, which continues to advance teleoperation technologies.

  • a collage of digital graphs

    A banner year for quantum

    SRI-managed QED-C’s annual report on quantum trends captures an industry accelerating rapidly from technical promise toward major global impact.

  • ICE Cube containing SRI’s aerogel experiment, photographed prior to launch. Source: Aerospace Applications North America

    An SRI carbon capture experiment launches into space

    By synthesizing carbon-absorbing aerogels in microgravity, SRI research will give us a rare glimpse into how these materials could be radically improved.