M. Levit, D. Hakkani-Tur, G. Tur and D. Gillick, “Integrating several annotation layers for statistical information distillation,” 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007, pp. 671-676, doi: 10.1109/ASRU.2007.4430192.
We present a sentence extraction algorithm for Information Distillation, a task in which, for a given templated query, relevant passages must be extracted from massive audio and textual document sources. For each sentence of the relevant documents (which are assumed to be known from upstream stages), we employ statistical classification methods to estimate the extent of its relevance to the query. Two aspects of relevance are taken into account: the template (type) of the query and its slots (free-text descriptions of names, organizations, topics, events and so on, around which templates are centered). What distinguishes the presented method is the choice of features used for classification. We extract our features from charts, which are compilations of elements from various annotation levels, such as word transcriptions, syntactic and semantic parses, and Information Extraction annotations. In our experiments we show that this integrated approach outperforms a purely lexical baseline by as much as 30% relative in terms of F-measure. We also investigate the algorithm’s behavior under noisy conditions by comparing its performance on ASR output with its performance on the corresponding manual transcriptions.
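To make the phrase "30% relative in terms of F-measure" concrete, the following minimal sketch computes an F-measure from precision and recall and then the relative gain of one system over a baseline. The function names `f_measure` and `relative_gain` are hypothetical, and the precision and recall values are invented purely to illustrate the arithmetic; they are not results from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def relative_gain(baseline: float, system: float) -> float:
    """Improvement of `system` over `baseline`, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (system - baseline) / baseline


# Hypothetical scores, chosen only to illustrate the calculation.
baseline_f = f_measure(precision=0.40, recall=0.40)  # approximately 0.40
system_f = f_measure(precision=0.52, recall=0.52)    # approximately 0.52
gain = relative_gain(baseline_f, system_f)           # approximately 0.30

print(f"baseline F = {baseline_f:.2f}, system F = {system_f:.2f}, "
      f"relative gain = {gain:.0%}")
```

Under these assumed numbers, the absolute difference is 12 points of F-measure, while the relative gain is about 30% because it is measured against the baseline score.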