Evaluation of the Video Text Recognition Process
The Video Text Recognition (VTR) process detects, tracks, and recognizes scene text in a video sequence. Its output is a sequence of recognition results, each consisting of the starting and ending frame numbers and the recognized text. Figure 1a shows one frame from a sample video sequence, and Figure 1b shows the recognition results. Five video sequences are available below. The ground truth data for these sequences is in the file 5examples-truth.txt. An ideal recognition process would report exactly one correct result for each instance of text in the ground truth, covering the entire duration of its appearance in the video, and would produce no extra reports (false alarms).
Figure 1. Example Video "Innovation": (a) Sample video frame; (b) Video text recognition results
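As a rough illustration of how recognition results might be scored against ground truth, the Python sketch below counts correct results, missed text lines, and false alarms. The `TextSpan` type, the `score` function, and the matching rule (overlapping frame ranges plus an exact string match) are assumptions made for this example; the actual matching criterion and the format of 5examples-truth.txt are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSpan:
    """One text instance: frame range plus text (hypothetical structure)."""
    start_frame: int
    end_frame: int
    text: str


def overlaps(a: TextSpan, b: TextSpan) -> bool:
    """True if the two frame ranges share at least one frame."""
    return a.start_frame <= b.end_frame and b.start_frame <= a.end_frame


def score(truth: list[TextSpan], results: list[TextSpan]) -> dict[str, int]:
    """Count correct, missed, and false-alarm results.

    Assumed rule: a result is correct if it overlaps an unmatched ground
    truth entry in time and its text matches exactly. Every other result,
    including a second report of an already matched entry, is counted as
    a false alarm.
    """
    matched: set[int] = set()
    false_alarms = 0
    for r in results:
        hit = None
        for i, t in enumerate(truth):
            if i not in matched and overlaps(r, t) and r.text == t.text:
                hit = i
                break
        if hit is None:
            false_alarms += 1
        else:
            matched.add(hit)
    correct = len(matched)
    return {
        "correct": correct,
        "missed": len(truth) - correct,
        "false_alarms": false_alarms,
    }
```

Under this rule, the ideal process described above would yield `missed == 0` and `false_alarms == 0`. A stricter variant could also require the reported frame range to cover most of the ground truth range, which this sketch does not check.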
Performance depends strongly on the specific characteristics of the text (e.g., contrast, size) and on the imaging conditions (e.g., motion blur). The parameter MinFrames in the VTR process sets the minimum number of frames for which a text region must be tracked before it is reported as text; text regions tracked for fewer frames are assumed to be false text and are discarded. The file 5examples-mintrack10.txt shows performance with MinFrames = 10, and 5examples-mintrack3.txt shows performance with MinFrames = 3. Comparing the two results, we see that reducing MinFrames from 10 to 3 increases the fraction of text lines correctly recognized and decreases the fraction of missed text lines for video segments with a large amount of camera motion (e.g., Caution-Iron and Fed Storage). However, the same reduction increases the false text rate in the other video segments.
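The effect of MinFrames can be pictured as a simple threshold on track length. The Python sketch below is only an illustration: the `Track` type, the `filter_tracks` function, and the sample track lengths are hypothetical and are not taken from the VTR implementation or from the result files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """A tracked text region (hypothetical structure for illustration)."""
    text: str
    frames_tracked: int


def filter_tracks(tracks: list[Track], min_frames: int) -> list[Track]:
    """Keep regions tracked for at least min_frames frames; drop the rest."""
    return [t for t in tracks if t.frames_tracked >= min_frames]


# Made-up example: one long track, one short track of real text (as might
# occur under heavy camera motion), and one short spurious track.
tracks = [
    Track("INNOVATION", 45),
    Track("CAUTION", 6),
    Track("x7", 4),
]

kept_at_10 = filter_tracks(tracks, min_frames=10)
kept_at_3 = filter_tracks(tracks, min_frames=3)
print([t.text for t in kept_at_10])
print([t.text for t in kept_at_3])
```

In this made-up case, the lower threshold recovers the short track of real text but also lets the spurious track through, which mirrors the trade-off between missed text lines and false text described above.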
Video Sequences
The following are 640 x 480 raw video files.