A Study in Machine Learning from Imbalanced Data for Sentence Boundary Detection in Speech | SRI International

October, 2006
Journal Name: Computer Speech & Language
Volume: 20
Number: 4
Abstract 

Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. [...]
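The abstract names sampling approaches and a bagging scheme but does not spell them out, so the following is only a minimal sketch of what such steps commonly look like, not the authors' implementation. It assumes each training example is a hypothetical `(features, label)` pair with label 1 for a sentence boundary (the minority class) and 0 for a nonsentence boundary; the function names are illustrative.

```python
import random

def split_by_label(data):
    """Separate (features, label) pairs into boundary (1) and non-boundary (0) lists."""
    boundary = [d for d in data if d[1] == 1]
    non_boundary = [d for d in data if d[1] == 0]
    return boundary, non_boundary

def downsample(data, rng):
    """Keep all boundary examples and an equally sized random subset of non-boundaries.

    Assumes boundaries are the minority class; rng.sample raises ValueError otherwise.
    """
    boundary, non_boundary = split_by_label(data)
    return boundary + rng.sample(non_boundary, len(boundary))

def oversample(data, rng):
    """Replicate boundary examples (drawn with replacement) until the classes are equal in size."""
    boundary, non_boundary = split_by_label(data)
    extra = rng.choices(boundary, k=len(non_boundary) - len(boundary))
    return boundary + extra + non_boundary

def bagged_training_sets(data, n_bags, rng):
    """Build n_bags bootstrap replicates, each resampled from its own downsampled set.

    In a bagging scheme, one classifier (e.g. a decision tree) would be trained
    per replicate and their outputs averaged; training is omitted here.
    """
    bags = []
    for _ in range(n_bags):
        balanced = downsample(data, rng)
        bags.append(rng.choices(balanced, k=len(balanced)))
    return bags

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: 3 boundary examples and 10 non-boundary examples (made up for illustration).
    toy = [((i,), 1) for i in range(3)] + [((i,), 0) for i in range(3, 13)]
    print(len(downsample(toy, rng)), len(oversample(toy, rng)))
```

Downsampling discards majority-class data while oversampling duplicates minority-class data; drawing several replicates, as in the last function, is one way to reuse more of the majority class across an ensemble.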
