Statistical Sentence Extraction for Information Distillation


D. Hakkani-Tur and G. Tur, “Statistical Sentence Extraction for Information Distillation,” 2007 IEEE International Conference on Acoustics, Speech and Signal Processing – ICASSP ’07, 2007, pp. IV-1-IV-4, doi: 10.1109/ICASSP.2007.367148.


Information distillation aims to extract the most useful pieces of information related to a given query from massive, possibly multilingual, audio and textual document sources. One critical component of a distillation engine is detecting the sentences to be extracted from each relevant document. In this paper, we present a statistical sentence extraction approach for distillation. We frame the task as a classification problem in which each candidate sentence in the documents is classified as relevant to the query or not. These documents may be in textual or audio format and in a number of languages. For audio documents, we use both manual and automatic transcriptions; for non-English documents, we use automatic translations. In this work, we use AdaBoost, a discriminative classification method, with both lexical and semantic features. The results indicate an 11%-13% relative improvement over a baseline keyword-spotting-based approach. We also show the robustness of our method on the audio subset of the document sources using manual and automatic transcriptions.
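The classification framing described above can be illustrated with a small sketch. This is not the system from the paper: the sentences, the query, the feature set (binary word presence plus a single query-overlap flag), and the helper names `lexical_features`, `stump_predict`, `train_adaboost`, and `predict` are all assumptions made for illustration, and the semantic features mentioned in the abstract are omitted. It implements plain discrete AdaBoost over one-feature decision stumps using only the Python standard library.

```python
import math

def lexical_features(sentence, query, vocabulary):
    """Binary word-presence features plus one query-overlap flag (assumed feature set)."""
    tokens = set(sentence.lower().split())
    query_tokens = set(query.lower().split())
    features = [1 if word in tokens else 0 for word in vocabulary]
    features.append(1 if tokens & query_tokens else 0)
    return features

def stump_predict(x, j, polarity):
    """Decision stump: vote `polarity` if feature j is on, otherwise the opposite."""
    return polarity if x[j] == 1 else -polarity

def train_adaboost(X, y, rounds=10):
    """Discrete AdaBoost with one-feature stumps; labels are +1 (relevant) or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, feature_index, polarity)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for j in range(len(X[0])):
            for polarity in (1, -1):
                err = sum(w for w, x, label in zip(weights, X, y)
                          if stump_predict(x, j, polarity) != label)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        raw_err, j, polarity = best
        if raw_err >= 0.5:
            break  # no stump is better than chance under the current weights
        err = max(raw_err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, polarity))
        if raw_err == 0:
            break  # a perfect stump leaves nothing to reweight
        # Increase the weight of misclassified sentences, then renormalize.
        weights = [w * math.exp(-alpha * label * stump_predict(x, j, polarity))
                   for w, x, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted stump votes: +1 means relevant to the query."""
    score = sum(alpha * stump_predict(x, j, polarity)
                for alpha, j, polarity in ensemble)
    return 1 if score > 0 else -1

# Toy data invented for illustration only (not from the paper).
query = "earthquake casualties"
labeled = [
    ("The earthquake killed dozens in the region", 1),
    ("Officials reported casualties after the earthquake", 1),
    ("Rescue teams counted casualties overnight", 1),
    ("The stock market closed higher today", -1),
    ("The team won the final match", -1),
    ("Weather will be sunny in the region", -1),
]
vocabulary = sorted({w for s, _ in labeled for w in s.lower().split()})
X = [lexical_features(s, query, vocabulary) for s, _ in labeled]
y = [label for _, label in labeled]
model = train_adaboost(X, y)
```

A new candidate sentence would then be scored with `predict(model, lexical_features(sentence, query, vocabulary))`. On data this small the query-overlap flag alone separates the classes, so the sketch reduces to keyword spotting; any gain over that kind of baseline would depend on richer features and far more training data than shown here.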

Index Terms: information distillation, information extraction, language understanding, speech processing, natural language processing
