Information extraction using HMMs and shrinkage

Citation

Freitag, D., and McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.

Abstract

“Information extraction” refers to the process of converting documents to structured content summaries. Such summaries can be presented to users or be used by software agents engaged in text mining. This paper advocates for the use of HMMs for information extraction. The HMM state transition probabilities and word emission probabilities are learned from labeled training data. As in many learning problems, however, the lack of sufficient labeled training data hinders the reliability of the model. The key contribution of this paper is the use of relationships between HMM states and a statistical technique called “shrinkage” in order to significantly improve estimation of the HMM emission probabilities in the face of sparse training data. In experiments on seminar announcements and Reuters acquisitions articles, shrinkage is shown to reduce error by up to 40% and the resulting HMM outperforms a state-of-the-art rule-learning system.
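
To make the idea of shrinkage concrete, the following is a minimal sketch, not the paper's implementation. It assumes a state's emission distribution is smoothed by linearly interpolating three estimates: the state's own maximum-likelihood estimate, an estimate pooled over a group of related states (called the "parent" here), and a uniform distribution. The function names, the toy vocabulary, and the fixed mixture weights are all hypothetical and chosen by hand for illustration; the sketch does not attempt to show how such weights would be set in practice.

```python
from collections import Counter


def ml_estimate(words, vocab):
    """Maximum-likelihood emission estimate: relative frequency of each vocab word."""
    counts = Counter(words)
    total = len(words)
    return {w: (counts[w] / total if total else 0.0) for w in vocab}


def shrink(state_words, parent_words, vocab, weights=(0.6, 0.3, 0.1)):
    """Interpolate state, parent, and uniform estimates.

    `weights` are hypothetical hand-picked values and must sum to 1.
    """
    lam_state, lam_parent, lam_uniform = weights
    p_state = ml_estimate(state_words, vocab)
    p_parent = ml_estimate(parent_words, vocab)
    p_uniform = 1.0 / len(vocab)
    return {
        w: lam_state * p_state[w] + lam_parent * p_parent[w] + lam_uniform * p_uniform
        for w in vocab
    }


# Toy data: a sparsely observed target state and a pooled "parent" of related states.
vocab = ["room", "hall", "auditorium", "at", "in"]
target_words = ["room", "hall"]
parent_words = target_words + ["auditorium", "at", "in", "room"]

unsmoothed = ml_estimate(target_words, vocab)
smoothed = shrink(target_words, parent_words, vocab)
```

In this toy setup the word "auditorium" never occurs in the target state's training data, so its unsmoothed probability is zero, while the shrunk estimate borrows mass from the pooled and uniform distributions and still sums to one.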

