Video Text Recognition

Automatic identification of the contents of video imagery would permit videos to be indexed in a convenient and meaningful way for later reference. As part of SRI's MAESTRO system, SRI developed a text extraction capability for video ("video OCR"). Video imagery often contains text that is semantically related to the rest of the scene depicted in the video, especially in broadcast news programs. Such text can take the form of computer-generated text that is overlaid on the imagery (such as captions indicating who is shown in the imagery, locations of the scenery, or topics of a news story), or text that appears as part of the video scene itself (such as a sign outside a place of business, or placards in front of conference participants).

Locating and recognizing text in video imagery is more difficult than many other OCR applications (e.g., reading printed matter) because of small character sizes and nonuniform backgrounds. In addition, recognizing scene text can be particularly difficult if the text is viewed from an oblique angle, if the text does not lie in a single plane, or if the text is blurred by motion of either the camera or the text. SRI has developed an approach that involves binarizing individual color video frames and then applying a commercially developed OCR engine. Figure 1 shows the recognition of text overlaid on a map, and Figure 2 shows two examples of the recognition of scene text.
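To make the binarization step concrete, here is a minimal sketch in Python. The page does not say which binarization method SRI uses, so this example substitutes Otsu's global threshold as an assumption; the function names (`to_gray`, `otsu_threshold`, `binarize`) and the `text_is_bright` parameter are hypothetical and exist only for illustration. The OCR engine itself is not shown.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def to_gray(frame: List[List[Pixel]]) -> List[List[int]]:
    """Convert an RGB frame to 8-bit luminance using ITU-R BT.601 weights."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in frame]


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the threshold t that maximizes the between-class variance
    of the two classes {v <= t} and {v > t} (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of pixel values <= t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(frame: List[List[Pixel]], text_is_bright: bool = True) -> List[List[int]]:
    """Return a 0/1 mask in which 1 marks assumed text pixels.

    Assumption: text is uniformly brighter (or darker) than its background,
    which often fails for the nonuniform backgrounds described above.
    """
    gray = to_gray(frame)
    t = otsu_threshold(gray)
    return [[1 if (v > t) == text_is_bright else 0 for v in row] for row in gray]
```

A single global threshold is the simplest choice and is likely too crude for real broadcast frames; a practical system would probably threshold locally, per text region, before handing the mask to an OCR engine.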

The accuracy of the recognition result can be improved substantially by postprocessing the OCR results with a lexicon of named entities extracted by MAESTRO from the audio or closed caption tracks. Figure 3 shows OCR recognition of a video overlay caption. Without help from a lexicon, the OCR result is "Blakjb" because the Prime Minister's hand forms the background behind the overlay caption. However, with the help of the lexicon of named entities extracted from the speech recognized in the audio track, the system is able to correctly recognize "Blair". In addition to successfully recognizing the text, the semantic content of the video scene itself (in this case, an image of Tony Blair) can be inferred from the text label, because a named entity detected in the video imagery often corresponds to a person, place, or organization depicted in the video scene.
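The lexicon postprocessing step can be illustrated with a short Python sketch. The page does not describe how OCR output is matched against the lexicon, so this example assumes a simple nearest-neighbor match by Levenshtein edit distance; the function names, the `max_ratio` cutoff, and the sample lexicon are hypothetical and are not taken from MAESTRO.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def correct_with_lexicon(token: str, lexicon: Iterable[str],
                         max_ratio: float = 0.5) -> str:
    """Replace an OCR token with the closest lexicon entry, if close enough.

    The token is returned unchanged when no entry lies within
    max_ratio * max(len(token), len(entry)) edits (an assumed cutoff).
    Comparison is case-insensitive; the lexicon's spelling is returned.
    """
    best, best_d = None, None
    for entry in lexicon:
        d = edit_distance(token.lower(), entry.lower())
        if best_d is None or d < best_d:
            best, best_d = entry, d
    if best is not None and best_d <= max_ratio * max(len(token), len(best)):
        return best
    return token


# Hypothetical lexicon of named entities, standing in for entities
# extracted from the audio or closed caption tracks.
lexicon = ["Blair", "London", "Downing Street"]
print(correct_with_lexicon("Blakjb", lexicon))
```

With this sample lexicon, "Blakjb" is three edits away from "Blair" and falls within the cutoff, while an unrelated token such as "Weather" is left as it is. A cutoff this loose would cause false corrections with a large lexicon, so a real system would likely weight candidates by OCR confidence or by how recently the entity was spoken.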

 

©2008 SRI International 333 Ravenswood Avenue, Menlo Park, CA 94025-3493
SRI International is an independent, nonprofit corporation.