The ability to extract, index, and search digitized document images for relevant information is a
growing need in many business and government applications. Conventional approaches to this problem
involve the use of page, line, and character segmentation followed by optical character recognition
to convert the pixel information into symbol strings that can be manipulated. Document degradation,
however, destroys important information at the pixel level, which lowers the accuracy of the
characters recognized in the document and, in turn, the quality of the search results.
SRI has been conducting several related research efforts that explore the use
of collateral or contextual information to compensate for degradation-induced loss of
information. In general, these efforts use shape information from entire words to complement
character recognition; lexicons organized in domain-specific ways to enhance recognition; language
models that can focus attention on parts of a document; and information combined from graphical
and textual modalities within a single document.
The goal of our research is to produce systems
that can synergistically extract information from different parts of a printed document to produce
a consistent and searchable representation of the total information content.