BUFFALO, N.Y. -- Following the tragic events of Sept. 11, 2001,
political analysts observed that in the interest of national
security and global understanding, more American scholars and
students should study Arabic.
While more students now may be taking courses in the Arabic
language, the lack of digital tools to access Arabic documents on
the Web puts these fields of study and those who pursue them at a
disadvantage.
Computer scientists at the University at Buffalo's Center for
Unified Biometrics and Sensors (CUBS) are remedying that by
developing optical character recognition (OCR) software for
handwritten and machine-printed Arabic documents.
The new software will make it possible to scan Arabic documents
digitally in search of specific information or keywords for
intelligence-gathering and other applications, according to Venu
Govindaraju, Ph.D., director of CUBS and principal investigator.
The UB researchers have received a two-year, $240,000 grant from
the federal Director of Central Intelligence Postdoctoral Research
Fellowship Program to develop the software, which will allow Arabic
documents to be digitized and posted on the Web.
The researchers have submitted a paper to IEEE Transactions on
Pattern Analysis and Machine Intelligence outlining what must be
done to achieve Arabic character recognition.
With up to 235 million speakers worldwide, Arabic is the fourth
most-spoken language in the world and for millions of Muslims it is
the language of their religious texts.
"Suppose you have several thousand Arabic documents and you want
them scanned for specific keywords so that you can narrow down the
number of documents that must be reviewed manually. Right now, this
cannot be done," says Govindaraju, professor of computer science
and engineering in the UB School of Engineering and Applied
Sciences.
He adds that the new software -- designed to be applicable to
both handwritten and machine-printed Arabic -- will be valuable
especially because handwritten annotations in the margins of a
machine-printed document often are of intrinsic interest.
By developing OCR software for Arabic handwriting and
machine-print, the UB researchers will increase access to modern
Arabic documents and resources, as well as ancient Arabic
manuscripts, helping to close the rapidly growing digital divide
between the English and non-English speaking worlds.
"The whole Internet is skewed toward people who speak English,"
observed Govindaraju. "The fear is that if an OCR is not developed
for a particular language, then all the classic texts in that
language will disappear into oblivion. The automation of the
interpretation of written Arabic will have major benefits for
The research also will help the UB group explore the use of
handwriting as a biometric, he added.
"Handwriting is what we consider a soft biometric," he noted.
"While it's not a trait that can be used to identify individuals,
it can be used to group individuals together and, in combination
with other, stronger biometrics, could be applied to more precise
He added that features of handwriting that show up even when an
individual is writing in a foreign language may reveal information
about his or her native language.
Arabic presents important challenges to computer science,
Govindaraju explained, because characters may take different forms
if they appear at the beginning, middle or end of a word;
boundaries between words are not always marked consistently; and
Arabic vowels are pronounced, but often not written.
"So in addition to the benefits for readers of Arabic, this
project will help push the frontiers of computer vision, pattern
recognition and artificial intelligence in general," he said.
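The position-dependent letter forms Govindaraju describes can be
seen in the Unicode character database, which lists separate
isolated, final, initial and medial presentation forms for a single
Arabic letter. The short Python sketch below is only an
illustration using the standard unicodedata module; the choice of
the letter beh and the helper name positional_forms are assumptions
made for this example and are not part of the UB software.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter

# Presentation-form code points for that same letter
# (Unicode block "Arabic Presentation Forms-B").
FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def positional_forms(chars):
    """Map each presentation form's name to (position, base letter).

    Hypothetical helper for illustration: it reads the compatibility
    decomposition, which looks like "<initial> 0628".
    """
    result = {}
    for ch in chars:
        tag, base_hex = unicodedata.decomposition(ch).split()
        result[unicodedata.name(ch)] = (tag.strip("<>"), chr(int(base_hex, 16)))
    return result


table = positional_forms(FORMS)
for name, (position, base) in table.items():
    print(f"{name}: {position} form of U+{ord(base):04X}")
```

Each of the four shapes maps back to the same underlying letter,
which is why a recognizer has to treat visually different glyphs as
one character.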
OCR software, Govindaraju explained, essentially trains the
computer to correctly interpret the images of a particular alphabet
based on "truthed" data, that is, numerous scanned images of
characters or words and their interpretation recorded by humans who
have examined the original images.
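As a rough illustration of learning from "truthed" data, the Python
sketch below labels a new character image by finding the most
similar human-labeled example (a one-nearest-neighbor rule). The
tiny 3-by-3 bitmaps, the Latin labels and the names TRUTHED, hamming
and classify are all invented for this example; it is a toy under
those assumptions and does not represent the method used by the UB
researchers.

```python
# Hypothetical truthed data: 3x3 binary glyph images, flattened row
# by row, each paired with the label a human reader assigned to it.
TRUTHED = [
    ((1, 1, 1,
      0, 1, 0,
      0, 1, 0), "T"),
    ((1, 0, 0,
      1, 0, 0,
      1, 1, 1), "L"),
    ((1, 1, 1,
      1, 0, 1,
      1, 1, 1), "O"),
]


def hamming(a, b):
    """Count the pixels at which two equal-sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(image, truthed=TRUTHED):
    """Return the label of the truthed example closest to the image."""
    _, label = min(truthed, key=lambda example: hamming(example[0], image))
    return label


# A "T" with one pixel missing from the bottom of its stem.
noisy_t = (1, 1, 1,
           0, 1, 0,
           0, 0, 0)
print(classify(noisy_t))
```

Real OCR systems work from far larger collections of labeled images
and far richer features, but the dependence on human-verified
examples is the same.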
Govindaraju was involved in the development at UB of the first
comprehensive OCR software for interpreting handwritten addresses
in English, a milestone that spurred research into handwriting
recognition that led to some applications now taken for granted,
such as personal digital assistants. He and his UB colleagues also
created a software tool that is the first step in developing OCR
software for Devanagari script, which will allow digitization of
documents in Sanskrit, Hindi and dozens of other Indian and South
Asian languages.
The University at Buffalo is a premier research-intensive
public university, the largest and most comprehensive campus in the
State University of New York.