Unsupervised Discovery and Extraction of Semi-Structured Regions in Text Via Self-Information

Citation

Yeh, E., Niekrasz, J., & Freitag, D. (2013, 27 October – 1 November). Unsupervised discovery and extraction of semi-structured regions in text via self-information. Paper presented at the Workshop on Automated Knowledge Base Construction, San Francisco, CA.

Abstract

We describe ongoing work into a general method for identifying and extracting information from semi-structured regions of text that are embedded within a natural language document. These are regions of text, usually in an ad hoc schema, forming structures such as tables, key-value listings, or long and repeated enumerations of properties. They present problems for standard information extraction algorithms that rely on regular grammatical text, as information is encoded in a combination of spatial layout, boilerplate, and repeated strings. Unlike previous work in table extraction, which relies on a relatively noiseless two-dimensional layout, our aim is to accommodate a wide variety of structure types. Our approach is an unsupervised one, based on identifying regions of surprising regularity inside the document. Here, regularity is measured by self-information, and is derived from patterns of semantically meaningful classes of text and visual layout. We present the results of an initial study to assess the ability of these measures to detect semi-structured text in a corpus culled from the web, showing that they outperform baseline methods on an average precision measure. We present initial work that uses significant patterns to generate extraction rules, and conclude with a discussion of future directions of our work.
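
As a rough illustration of the quantity named in the abstract, the sketch below computes self-information, I(x) = -log2 p(x), for a line-level pattern of token classes. The abstract does not say how patterns are defined or how their probabilities are estimated, so everything here is an assumption made for illustration: the token classes (`NUM`, `WORD`, literal punctuation), the helper names, the background probability `p`, and the independence assumption used for repeated lines. It is not the authors' method.

```python
import math
import re


def line_pattern(line):
    """Map a line to a coarse tuple of token classes (hypothetical classes)."""
    classes = []
    for tok in re.findall(r"[0-9]+|[A-Za-z]+|[^\sA-Za-z0-9]", line):
        if tok.isdigit():
            classes.append("NUM")
        elif tok.isalpha():
            classes.append("WORD")
        else:
            classes.append(tok)  # keep punctuation as its own class
    return tuple(classes)


def self_information(p):
    """Self-information, in bits, of an event with probability p."""
    return -math.log2(p)


def run_self_information(p, k):
    """Bits of surprise for one pattern appearing on k lines, assuming each
    line independently shows the pattern with background probability p.
    Both p and the independence assumption are illustrative."""
    return k * self_information(p)


if __name__ == "__main__":
    lines = ["Name: Alice", "Age: 30", "Zip: 94025"]
    for line in lines:
        print(line_pattern(line))
    # A pattern with an assumed background probability of 0.25,
    # repeated on 4 lines.
    print(run_self_information(0.25, 4))
```

Under these assumptions, a pattern that is individually unlikely but repeats across many lines accumulates a large self-information score, which is one way to read "surprising regularity".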

