Yeh, E., Niekrasz, J., & Freitag, D. (2013, 27 October – 1 November). Unsupervised discovery and extraction of semi-structured regions in text via self-information. Paper presented at the Workshop on Automated Knowledge Base Construction, San Francisco, CA.
We describe ongoing work on a general method for identifying and extracting information from semi-structured regions of text embedded within a natural language document. These regions, usually written in an ad hoc schema, form structures such as tables, key-value listings, or long, repeated enumerations of properties. They present problems for standard information extraction algorithms that rely on regular grammatical text, because their information is encoded in a combination of spatial layout, boilerplate, and repeated strings. Unlike previous work in table extraction, which relies on a relatively noiseless two-dimensional layout, we aim to accommodate a wide variety of structure types. Our approach is unsupervised and is based on identifying regions of surprising regularity inside the document. Here, regularity is measured by self-information and is derived from patterns of semantically meaningful classes of text and visual layout. We present the results of an initial study assessing the ability of these measures to detect semi-structured text in a corpus culled from the web, showing that they outperform baseline methods on an average precision measure. We also present initial work that uses significant patterns to generate extraction rules, and conclude with a discussion of future directions.
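To make the notion of "surprising regularity" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy set of token classes (`NUM`, `KEY`, `WORD`), a background model that treats lines as independent, and add-one smoothing with a single slot for unseen patterns; the function names `line_pattern`, `background_model`, and `score_runs` are hypothetical. Under these assumptions, a run of k consecutive lines sharing a pattern with background probability p has self-information -k log2 p, so a long run of a pattern that is rare in the background scores highly.

```python
import math
import re
from collections import Counter
from itertools import groupby


def line_pattern(line):
    """Map a line to a coarse sequence of token classes (illustrative classes only)."""
    classes = []
    for tok in line.split():
        if re.fullmatch(r"\d+(?:[.,]\d+)*", tok):
            classes.append("NUM")   # plain or separated number
        elif tok.endswith(":"):
            classes.append("KEY")   # looks like the key of a key-value pair
        else:
            classes.append("WORD")  # everything else
    return " ".join(classes)


def background_model(lines):
    """Estimate pattern probabilities from background text with add-one smoothing."""
    counts = Counter(line_pattern(l) for l in lines)
    total = sum(counts.values())
    vocab = len(counts) + 1  # assumption: one extra slot shared by all unseen patterns
    return lambda pattern: (counts[pattern] + 1) / (total + vocab)


def score_runs(lines, prob):
    """Score each run of consecutive lines that share a pattern.

    Returns (start_index, run_length, pattern, bits), where bits is the
    self-information of the run under the independence assumption:
    -run_length * log2(prob(pattern)).
    """
    runs = []
    start = 0
    for pattern, group in groupby(line_pattern(l) for l in lines):
        length = len(list(group))
        bits = -length * math.log2(prob(pattern))
        runs.append((start, length, pattern, bits))
        start += length
    return runs
```

Ranking the runs returned by `score_runs` by their `bits` value gives candidate semi-structured regions. A fuller treatment would also need layout features and richer semantic classes, which this sketch omits.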