Text Extraction from Images Captured via Mobile and Digital Devices

B. Obuliraj
Department of Computer Science and Engineering,
Muthayammal Engineering College, Rasipuram
Email: obuliraj.avl@gmail.com
Abstract—This paper presents the development of a human-machine interactive software application, named 'Textract', for text extraction and recognition in natural scene images captured via mobile and digital devices. The texts are subsequently translated into another language (simplified Chinese in this project) so that a device such as a mobile phone serves as a portable language translator. In consideration of the resource-constrained nature of mobile devices, the proposed solution makes the best possible choices to balance recognition accuracy against processing speed.
Index Terms—edge detection,
labeling, segmentation, OCR.
I. INTRODUCTION
MOBILE devices (mobile phones, digital cameras, portable gaming devices, etc.) are widely available nowadays. These small and inexpensive devices facilitate new ways of interaction with the physical world. Over the years, mobile devices have gained increasing computational power and storage capacity. Beyond sending text messages and making voice calls, recent mobile phones also offer various features such as cameras, music and video playback, games, web browsing and even GPS.
Meanwhile, people have more opportunities to travel globally nowadays, and the language barrier is usually a big problem. Information on signboards is particularly important. The need for a fast mobile translation device for simple texts in natural scenes, especially signboards (e.g., Fig. 1), is obvious. It is natural to first think of mobile phones as the platform, since most mobile phones are equipped with cameras and people carry them almost everywhere they go.
Automatic text extraction from natural scene images is a popular yet challenging research topic. Primary challenges lie in the variety of texts: font, size, skew angle, distortion by slant and tilt, the shape of the object the text appears on, etc. Environmental factors such as uneven illumination and reflection, poor lighting conditions and complex backgrounds add further complications.
In developing a mobile application, one must always take into account the resource-constrained nature of mobile devices (limited processor speed, memory and battery). Take, for example, a Nokia N81 (8GB), a typical high-end mobile phone as of 2009. It has a CPU clock rate of 369 MHz, roughly 1/5 of a typical computer CPU (1.8 GHz). A program which takes 4 seconds on a computer may take 20 seconds on a mobile phone. In order to make the program reasonably fast, the techniques adopted in this paper may not deliver the best result among all available techniques, but instead deliver relatively good results in a short time.
Thus, a mobile solution for automatic text extraction and translation must not only take into account the complexity posed by the problem but also deliver results within a tolerable time. In developing Textract, we relax the modeling of the problem by making the following assumptions:
1) The application will only recognize a few commonly used non-italic font types, which are usually the case in natural scene images. It works best for sans-serif fonts and considerably well for serif fonts (see Fig. 2). However, it is not supposed to work on less common fonts like Blackletter, Script, etc.
Fig. 2. Sans-serif font (left) and serif font with serifs in red (right)
2) Texts in images should be sufficiently large and thick relative to the image frame, as compared to the small texts in a paper document. This assumption is usually valid because texts on signboards are usually large in order to attract attention.
3) Images should not be taken under poor lighting conditions or with uneven illumination.
4) Texts should be roughly horizontal; a significant skew angle is undesirable. The reason for not considering skew angles is that they can only be derived when there is sufficient text, while the number of letters on a signboard can be as few as two or three.
The application comprises two main stages: a processing stage and a recognition stage. This paper focuses more on the processing (text extraction) stage.
In the processing stage, the original color image captured by the mobile phone is first transformed into a grayscale image. Next, sharp transitions are detected using a revised Prewitt edge detection algorithm. Then the image is segmented into several regions, each of which can be regarded as an object. Finally, an elimination rule is specially developed to rule out abnormal objects (area too large or too small, width far greater than height, etc.).
In the recognition stage, characters are first grouped into words, and words are grouped into a sentence. The OCR (Optical Character Recognition) exploits two main features of the character objects: crosses and white distances, which are explained in detail in later sections. Finally, recognized words are checked against a simple dictionary which includes the words most commonly used when travelling. The translated meanings can be presented immediately.
II. PROCESSING STAGE
The processing stage can be subdivided into three sub-stages: pre-processing, region segmentation and post-processing. The flow chart in Fig. 3 depicts the steps involved in the processing stage.
A. Pre-processing

1) Gray Scale Transformation: In this paper, grayscale transformation means converting a color image to a grayscale image. This reduces both the computational complexity and the memory requirements significantly. For a specific pixel with RGB values R (red), G (green) and B (blue), the grayscale representation of the pixel can be derived from:

Y = 0.3R + 0.59G + 0.11B    (1)
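As an illustration of equation (1), the sketch below converts a packed RGB pixel array to grayscale. The class and method names, and the int[][] pixel layout, are assumptions for illustration and not the actual Textract code.

```java
/** Minimal sketch of the grayscale transformation in equation (1). */
public final class GrayScale {
    /**
     * Converts packed 0xRRGGBB pixels to grayscale values in [0, 255]
     * using Y = 0.3R + 0.59G + 0.11B.
     */
    public static int[][] toGray(int[][] rgb) {
        int h = rgb.length, w = rgb[0].length;
        int[][] gray = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int p = rgb[y][x];
                int r = (p >> 16) & 0xFF;
                int g = (p >> 8) & 0xFF;
                int b = p & 0xFF;
                gray[y][x] = (int) Math.round(0.3 * r + 0.59 * g + 0.11 * b);
            }
        }
        return gray;
    }
}
```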
2) Smoothing: The predominant use of smoothing is to suppress image noise. It is done using a median value model (median filtering), which makes use of redundancy in the image data: a pixel is replaced with the median value of itself and its 8 neighbours. The median value model preserves sharp transitions in the image, so that the image will not be blurred too much as a result of smoothing, as compared to other smoothing models like 2D Gaussian smoothing.
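A minimal sketch of the 3×3 median filtering described above; copying border pixels unchanged is an assumption made for brevity.

```java
import java.util.Arrays;

/** Minimal sketch of 3x3 median filtering on a grayscale image. */
public final class MedianFilter {
    public static int[][] filter(int[][] gray) {
        int h = gray.length, w = gray[0].length;
        int[][] out = new int[h][w];
        int[] window = new int[9];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (y == 0 || x == 0 || y == h - 1 || x == w - 1) {
                    out[y][x] = gray[y][x];           // copy border pixels unchanged
                    continue;
                }
                int k = 0;
                for (int dy = -1; dy <= 1; dy++)      // collect the pixel and its 8 neighbours
                    for (int dx = -1; dx <= 1; dx++)
                        window[k++] = gray[y + dy][x + dx];
                Arrays.sort(window);
                out[y][x] = window[4];                // median of the 9 values
            }
        }
        return out;
    }
}
```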
3) Contrast Enhancement: Contrast enhancement is applied to an image captured with a background color similar to the text color or under poor lighting conditions. A simplified enhancement algorithm is given by:

G(x, y) = (S(x, y) − min S) × 255 / (max S − min S)    (2)

where S(x, y) denotes the pixel value, and min S and max S are the minimum and maximum grayscale values of S(x, y) respectively. If max S and min S are close to each other, e.g., max S − min S ≤ 80, the grayscale image will be monotonous with a low contrast. Grayscale extension will increase contrast via extrapolation to achieve max S − min S = 255.
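A sketch of the grayscale extension in equation (2), applied only when the dynamic range is narrow (the max S − min S ≤ 80 test from the text). The class name and in-place update are illustrative choices, not the Textract implementation.

```java
/** Minimal sketch of the contrast stretch in equation (2). */
public final class ContrastStretch {
    /** Stretches the grayscale range to [0, 255] when max S - min S <= 80. */
    public static void enhance(int[][] gray) {
        int min = 255, max = 0;
        for (int[] row : gray)
            for (int v : row) {
                if (v < min) min = v;
                if (v > max) max = v;
            }
        if (max - min > 80 || max == min) return;   // contrast already acceptable
        for (int y = 0; y < gray.length; y++)
            for (int x = 0; x < gray[y].length; x++)
                gray[y][x] = (gray[y][x] - min) * 255 / (max - min);  // equation (2)
    }
}
```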
B. Region segmentation

The main objective of region segmentation is to segregate the gray image into several regions, so as to separate the background from the objects. Two of the most common techniques in segmentation are thresholding and edge detection. Thresholding is good for images with bimodal histograms, which imply a clear cut between background and objects. However, if the background illumination is uneven or the histogram is multimodal, which means more than three gray levels exist in the image, thresholding will fail. Besides, adaptive algorithms like Otsu's method (Otsu, 1979) must be used to get an optimal threshold, which is time consuming. Edge detection works well as long as the uniform illumination assumption holds.

1) Edge Detection: The goal of edge detection is to mark the points in the photo where the luminous intensity changes sharply. Edge detection algorithms generally compute a derivative of this intensity change. Several edge detection algorithms have been developed. Here, a revised first-order Prewitt (Prewitt) edge detector is applied to detect all possible edges. A threshold is also calculated so that edge values beyond the threshold turn black while all other pixels turn white.

Fig. 4. Convolution kernels

First, through the convolution operations, D(x, y) can be obtained by the following convolution computation:

D(x, y) = max_i(F_i ⊗ S(x, y))    (3)

where D(x, y) denotes the result after edge detection at (x, y), and F_i is the i-th kernel listed above. The following decision-making process is proposed to differentiate the edges:
1) Calculate the average value of D(x, y):

aver_E = [ Σ_{x=1..M} Σ_{y=1..N} D(x, y) ] / (M × N)    (4)
2) If D(x, y) > 2.4 × aver_E, then the pixel at (x, y) will be converted to black. Otherwise, it will be kept white.
The reason why Prewitt is used instead of a Canny edge detector (Canny, 1986), which is usually considered to give better results, is that the Canny detector usually applies a 5×5 Gaussian smoothing filter and a rather complicated edge-determination algorithm, which is not a viable option on a resource-constrained device.
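The sketch below illustrates equations (3) and (4) and the 2.4 × aver_E decision rule using two standard 3×3 Prewitt kernels. Since Fig. 4 is not reproduced here, the exact kernel set used by Textract is an assumption, as is the 0 = edge / 255 = background encoding.

```java
/** Sketch of the revised Prewitt edge detection and thresholding (eqs. 3 and 4). */
public final class PrewittEdges {
    // Two standard 3x3 Prewitt kernels (horizontal and vertical gradients);
    // the full kernel set from Fig. 4 may differ.
    private static final int[][][] KERNELS = {
        { {-1, 0, 1}, {-1, 0, 1}, {-1, 0, 1} },
        { {-1, -1, -1}, {0, 0, 0}, {1, 1, 1} }
    };

    /** Returns a binary map: 0 = black edge pixel, 255 = white background. */
    public static int[][] detect(int[][] gray) {
        int h = gray.length, w = gray[0].length;
        int[][] d = new int[h][w];
        long sum = 0;
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int max = 0;
                for (int[][] k : KERNELS) {               // D(x, y) = max_i(F_i (x) S(x, y))
                    int r = 0;
                    for (int dy = -1; dy <= 1; dy++)
                        for (int dx = -1; dx <= 1; dx++)
                            r += k[dy + 1][dx + 1] * gray[y + dy][x + dx];
                    max = Math.max(max, Math.abs(r));
                }
                d[y][x] = max;
                sum += max;
            }
        }
        double averE = (double) sum / (h * w);            // equation (4)
        int[][] edges = new int[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                edges[y][x] = (d[y][x] > 2.4 * averE) ? 0 : 255;  // decision rule
        return edges;
    }
}
```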
2) Fake Edge Reduction: After edge detection, many fake edges may exist in the image due to non-uniform texture in the original image. A reduction function is needed to remove these fake edges. Let Aver(x, y) denote the average value of the 8-neighbourhood of the pixel located at (x, y), and D(x, y) denote the value of the pixel. Then the following rule is defined:

If |Aver(x, y) − D(x, y)| > 127.5, then D(x, y) = 255 − D(x, y); otherwise D(x, y) is unchanged.    (5)
After this step, some isolated fake edges will be removed. Although this step cannot completely remove all fake edges, it reduces the computation required in the following steps.
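A brief sketch of rule (5), applied to the binary edge map (0 = edge, 255 = background). Processing the map in place is an assumption for simplicity; a copy could be used to avoid order effects.

```java
/** Sketch of the fake edge reduction rule in equation (5). */
public final class FakeEdgeReduction {
    public static void reduce(int[][] edges) {
        int h = edges.length, w = edges[0].length;
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int sum = 0;
                for (int dy = -1; dy <= 1; dy++)         // average of the 8 neighbours
                    for (int dx = -1; dx <= 1; dx++)
                        if (dy != 0 || dx != 0) sum += edges[y + dy][x + dx];
                double aver = sum / 8.0;
                if (Math.abs(aver - edges[y][x]) > 127.5)
                    edges[y][x] = 255 - edges[y][x];     // flip isolated pixels
            }
        }
    }
}
```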
3) Labeling: Connected component labeling seeks to assign a unique label to each subset of objects which are connected. A two-pass algorithm is developed to label the connected components in the image. It consists of assigning provisional labels and then resolving the final labels using a union-find method.
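A compact sketch of two-pass labeling with union-find; the 4-connectivity and the 0 = object pixel convention are assumptions, and Textract may well use 8-connectivity.

```java
/** Sketch of two-pass connected component labeling with union-find. */
public final class Labeling {
    private final int[] parent;

    private Labeling(int size) {
        parent = new int[size];
        for (int i = 0; i < size; i++) parent[i] = i;
    }

    private int find(int a) {                 // union-find with path compression
        while (parent[a] != a) { parent[a] = parent[parent[a]]; a = parent[a]; }
        return a;
    }

    private void union(int a, int b) { parent[find(a)] = find(b); }

    /** Labels black pixels (value 0) of a binary edge map; 0 means background. */
    public static int[][] label(int[][] edges) {
        int h = edges.length, w = edges[0].length;
        int[][] labels = new int[h][w];
        Labeling uf = new Labeling(h * w + 1);   // generous upper bound on label count
        int next = 1;
        // Pass 1: assign provisional labels and record equivalences.
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (edges[y][x] != 0) continue;                  // background pixel
                int up = (y > 0) ? labels[y - 1][x] : 0;
                int left = (x > 0) ? labels[y][x - 1] : 0;
                if (up == 0 && left == 0) labels[y][x] = next++;
                else if (up != 0 && left != 0) {
                    labels[y][x] = Math.min(up, left);
                    uf.union(up, left);                          // neighbours belong together
                } else labels[y][x] = Math.max(up, left);
            }
        }
        // Pass 2: replace provisional labels by their union-find representative.
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (labels[y][x] != 0) labels[y][x] = uf.find(labels[y][x]);
        return labels;
    }
}
```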
C. Post-processing
1) Abnormal Objects Removal (integrated with noise elimination for computational simplicity):
After the previous stage, each connected component is labeled and can be regarded as an object. The objective in this stage is to eliminate invalid objects and retain only the wanted ones, i.e., meaningful objects representing letters.
Abnormal objects usually include boundaries, signs, rifts, etc. This part involves a sequence of carefully designed rules for intelligently selecting the 'right' objects.
For convenience, the term 'enclosed' is defined as follows: if object A is enclosed by object B, the smallest rectangle that encloses B encloses the smallest rectangle that encloses A, as pictorially depicted in Fig. 7, where the letter 'd' encloses the solid object because the red rectangle is enclosed by the blue rectangle.
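A small sketch of the 'enclosed' test on bounding rectangles; the Rect class and its fields are illustrative, not the actual Textract data model.

```java
/** Axis-aligned bounding rectangle of a labeled object. */
final class Rect {
    int left, top, right, bottom;   // pixel coordinates of the smallest enclosing rectangle

    /** True if this rectangle lies completely inside the other rectangle. */
    boolean isEnclosedBy(Rect other) {
        return left >= other.left && top >= other.top
            && right <= other.right && bottom <= other.bottom;
    }
}
```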
Four steps are devised to filter out unwanted objects (a code sketch of the Step I criteria is given after Fig. 8). Note: objects in each step refer only to those remaining after the previous step.
Step I
• Any object touching the frame is eliminated. (This is based on the assumption that when a snapshot is taken, the text will not touch the frame.)
• Any object satisfying any one of the following criteria is eliminated:
– height > 0.8 × height of the image;
– width > 0.4 × width of the image;
– height < 15 and area < 150;
– height < 0.06 × height of the image;
– height/width > 16;
– width/height > 2;
– area > 0.08 × total area of the image.
Step II
• Any object satisfying any one of the following criteria is eliminated:
– area > 8 × average area;
– area < 0.15 × average area;
– area < max area / 20 (noise elimination).
Step III
• Any object satisfying any one of the following criteria is eliminated:
– height > 1.8 × average height;
– width > 1.8 × average width.
Step IV
• Any object which is enclosed by another object is eliminated.
Step I filters out abnormal objects using their absolute features (i.e., without utilizing relative features like average area and average height, which incorporate information from other objects). The reason for doing this is to ensure that no obviously abnormal objects corrupt the relative features.
Step II
filters out abnormal objects using relative features like average area and
maximum area. Step I ensures that objects at this step include as many valid
objects as possible.
Step III filters out abnormal objects, also using relative features, namely average height and average width. Step II also ensures that this step includes as many valid objects as possible.
Step IV filters out those objects which are enclosed by another one. Fig. 8 shows the results after abnormal object removal.
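The sketch below applies the Step I geometric criteria to object bounding boxes, reusing the illustrative Rect class from the earlier sketch. Using the bounding-box area as the object area is an assumption; Textract may count object pixels instead.

```java
import java.util.Iterator;
import java.util.List;

/** Sketch of the Step I absolute-feature filter for labeled objects. */
final class StepOneFilter {
    /** Removes objects whose position, size or aspect ratio marks them as abnormal. */
    static void apply(List<Rect> objects, int imgWidth, int imgHeight) {
        int totalArea = imgWidth * imgHeight;
        for (Iterator<Rect> it = objects.iterator(); it.hasNext();) {
            Rect r = it.next();
            int w = r.right - r.left + 1;
            int h = r.bottom - r.top + 1;
            int area = w * h;                          // bounding-box area as a proxy
            boolean touchesFrame = r.left == 0 || r.top == 0
                    || r.right == imgWidth - 1 || r.bottom == imgHeight - 1;
            if (touchesFrame
                    || h > 0.8 * imgHeight
                    || w > 0.4 * imgWidth
                    || (h < 15 && area < 150)
                    || h < 0.06 * imgHeight
                    || (double) h / w > 16
                    || (double) w / h > 2
                    || area > 0.08 * totalArea) {
                it.remove();                           // Step I criteria from the list above
            }
        }
    }
}
```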
2) Objects' Order Determination: This step basically determines the order of the letter objects when they are viewed as part of a word and a sentence. The order of the letters should correspond to their order when arranged as a word, i.e., 'W' is assigned order 1, 'A' is assigned order 2, etc.
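The paper does not spell out the ordering rule; a natural sketch, assuming roughly horizontal text (assumption 4), is to sort objects by the left edge of their bounding boxes, again reusing the illustrative Rect class.

```java
import java.util.Comparator;
import java.util.List;

/** Sketch of letter ordering for roughly horizontal text (assumed rule). */
final class LetterOrdering {
    /** Sorts objects left to right so their indices match reading order. */
    static void order(List<Rect> letters) {
        letters.sort(Comparator.comparingInt(r -> r.left));
    }
}
```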
III. RECOGNITION STAGE
This stage is about implementing a simple yet efficient OCR (Optical Character Recognition) algorithm. OCR is the translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. Artificial Neural Networks (ANNs) have gained a lot of popularity in pattern recognition in recent research, but an ANN needs a large number of training samples to train the network properly. There are many standard databases for handwritten text but not for printed text, especially text that has undergone the previous processing steps.
In this paper, since the texts to be recognized are all printed, template matching will yield considerably good results. A template matching method is developed to recognize the characters. It utilizes the two main features of a character: crosses and white distances.
Definitions of crosses and white distances: Think of a character 'A' strictly confined in a square box, as shown in Fig. 9. The white pixels are 0 and the blue pixels (i.e., the body of the character 'A') are 1.
1) A cross (either vertical or horizontal) is defined as the number of zero-one crossings encountered as the character is traversed from one side to the other (either vertically or horizontally), as shown pictorially in Fig. 9.
2) White distances are defined as the distance traversed from one direction (left, right, downwards, upwards) until the first 1 is met, as shown pictorially in Fig. 9.
For the example character in Fig. 9:
• Vertical crosses = {1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2};
• Horizontal crosses = {1, 1, 1, 2, 2, 2, 2, 1, 1, 1};
• Left white distances = {3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
• Right white distances = {4, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
• Top white distances = {13, 9, 6, 2, 0, 0, 2, 5, 9, 12};
• Bottom white distances = {0, 0, 2, 3, 3, 3, 3, 3, 0, 0}.
Using the information of vertical crosses, horizontal crosses, left white distances, right white distances, top white distances and bottom white distances, and comparing with a template of these 6 traits for all letters, characters can be differentiated with very high accuracy.
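A sketch of how the vertical crosses and left white distances could be computed from a binary character box (1 = text pixel). It mirrors the definitions above rather than the actual Textract code; the remaining four traits follow the same pattern.

```java
/** Sketch of the cross and white-distance features for a binary character box. */
final class CharacterFeatures {
    /** Counts 0-to-1 transitions down each column (vertical crosses). */
    static int[] verticalCrosses(int[][] box) {
        int h = box.length, w = box[0].length;
        int[] crosses = new int[w];
        for (int x = 0; x < w; x++) {
            int prev = 0;
            for (int y = 0; y < h; y++) {
                if (box[y][x] == 1 && prev == 0) crosses[x]++;
                prev = box[y][x];
            }
        }
        return crosses;
    }

    /** Distance from the left edge of each row to the first 1 (left white distances). */
    static int[] leftWhiteDistances(int[][] box) {
        int h = box.length, w = box[0].length;
        int[] dist = new int[h];
        for (int y = 0; y < h; y++) {
            int x = 0;
            while (x < w && box[y][x] == 0) x++;
            dist[y] = x;                       // equals w if the row is entirely white
        }
        return dist;
    }
}
```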
A group of templates containing all of this information for the 52 alphabetical letters (both capital and small letters) is prepared; the detected attributes of each letter are compared against the templates using the least squares method.
Finally,
letters identified from the image and converted to ASCII values are
subsequently filled into a linguistic model to be grouped into words and
checked against a dictionary. The meanings or translations of the texts will be
retrieved immediately.
Certainly, there are situations where a letter is misrecognized and the translation cannot be found; for example, the capital letter 'I' and the small letter 'l' are very much alike. Error-correcting models are built to tackle these situations.
The translation covers both words and phrases. Common phrases included in the dictionary, such as "car park", "fire hose reel" and "keep door closed", can be translated directly, while cases like "slippery when wet" are translated word by word.
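A minimal sketch of the phrase-first dictionary lookup described above; the map entries are illustrative placeholders, not the actual Textract dictionary, which targets simplified Chinese.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of phrase-first dictionary lookup with a word-by-word fallback. */
final class Translator {
    private final Map<String, String> dictionary = new HashMap<>();

    Translator() {
        // Illustrative entries only; the real dictionary maps to simplified Chinese.
        dictionary.put("car park", "<translation of 'car park'>");
        dictionary.put("keep door closed", "<translation of 'keep door closed'>");
        dictionary.put("wet", "<translation of 'wet'>");
    }

    /** Tries the whole phrase first, then falls back to translating each word. */
    String translate(String phrase) {
        String whole = dictionary.get(phrase.toLowerCase());
        if (whole != null) return whole;
        StringBuilder sb = new StringBuilder();
        for (String word : phrase.toLowerCase().split("\\s+")) {
            String t = dictionary.get(word);
            sb.append(t != null ? t : word).append(' ');   // leave unknown words as-is
        }
        return sb.toString().trim();
    }
}
```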
IV. PLATFORM & IMPLEMENTATION
The whole application is developed on the J2ME (Java 2 Micro Edition) platform, a specification of a subset of the Java platform for small, resource-constrained devices such as mobile phones and set-top boxes. Textract can be installed on any Java-enabled camera phone; such phones are widely available in the market nowadays. A sample emulation result in J2ME is shown in Fig. 10.
V. RESULTS
There is no standard database available for natural scene images captured by mobile devices, so 10 photos were captured manually at various places in the National University of Singapore. The results are shown in Table I.
It can be seen that Textract offers good results in recognizing large texts. It achieves high accuracy if photos are taken in accordance with the previous assumptions. However, it should be noted that Textract will fail when texts are too thin, as shown in the last image, where certain parts of the text fade out.
A solution proposed to remedy this problem, as implemented in this application, is to capture images at a larger size and allow the user to manually zoom in and crop the area of interest. In this way, the text is relatively larger and thicker compared to the cropped image. With user interaction, recognition accuracy increases significantly.
VI. IN A REAL PHONE
seconds. In the software, user interaction can be incorporated so that users are able to select an area of interest. As mobile phone processor speeds and memory sizes increase, the processing time will be greatly reduced. Also, there is still room for improvement in robustness against font types and font thickness, as well as in translating whole sentences instead of individual words using machine translation techniques.
REFERENCES
[1] R. Brinkmann, "Median filter," in The Art and Science of Digital Compositing, pp. 51-52, Morgan Kaufmann, 1999.
[2] J. C. Russ, "Correcting Imaging Defects," in The Image Processing Handbook, 5th ed., pp. 206-214, CRC Press, 2006.
[3] J. Yang, J. Gao, Y. Zhang, X. Chen and A. Waibel, "An Automatic Sign Recognition and Translation System," in Proc. PUI 2001, Orlando, FL, USA, November 15-16, 2001.