This article is from the archives of the UB Reporter.

Digital tools for Arabic

CUBS developing OCR software for Arabic documents

Published: January 20, 2005

Contributing Editor

Following the tragic events of Sept. 11, 2001, political analysts observed that in the interest of national security and global understanding, more American scholars and students should study Arabic.

While more students now may be taking courses in the Arabic language, the lack of digital tools to access Arabic documents on the Web puts these fields of study and those who pursue them at a distinct disadvantage.


CUBS researchers are developing the first optical character recognition software for handwritten and machine-printed Arabic documents.

Computer scientists at UB's Center for Unified Biometrics and Sensors (CUBS) are remedying that by developing optical character recognition (OCR) software for handwritten and machine-printed Arabic documents.

The new software will make it possible to scan Arabic documents digitally in search of specific information or keywords for intelligence-gathering and other applications, according to Venu Govindaraju, director of CUBS and principal investigator.

The UB researchers have received a two-year, $240,000 grant from the federal Director of Central Intelligence Postdoctoral Research Fellowship Program to develop the software, which will allow Arabic documents to be digitized and posted on the Web.

The researchers have submitted a paper to IEEE Transactions on Pattern Analysis and Machine Intelligence outlining what needs to be done to accomplish Arabic character recognition.

With up to 235 million speakers worldwide, Arabic is the fourth most-spoken language in the world, and for millions of Muslims it is the language of their religious texts.

"Suppose you have several thousand Arabic documents and you want them scanned for specific keywords so that you can narrow down the number of documents that must be reviewed manually. Right now, this cannot be done," says Govindaraju, professor of computer science and engineering in the School of Engineering and Applied Sciences.

He adds that the new software—designed to be applicable to both handwritten and machine-printed Arabic—will be valuable especially because handwritten annotations in the margins of a machine-printed document often are of intrinsic interest.

By developing OCR software for Arabic handwriting and machine print, the UB researchers will increase access to modern Arabic documents and resources, as well as ancient Arabic manuscripts, helping to close the rapidly growing digital divide between the English-speaking and non-English-speaking worlds.

"The whole Internet is skewed toward people who speak English," observed Govindaraju. "The fear is that if an OCR is not developed for a particular language, then all the classic texts in that language will disappear into oblivion. The automation of the interpretation of written Arabic will have major benefits for numerous applications."

The research also will help the UB group explore the use of handwriting as a biometric, he added.

"Handwriting is what we consider a soft biometric," he noted. "While it's not a trait that can be used to identify individuals, it can be used to group individuals together and, in combination with other, stronger biometrics, could be applied to more precise identification."

He added that features of handwriting that show up even when an individual is writing in a foreign language may reveal information about his or her native language.

Arabic presents important challenges to computer science, Govindaraju explained, because characters may take different forms if they appear at the beginning, middle or end of a word; boundaries between words are not always marked consistently; and Arabic vowels are pronounced, but often not written.
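Two of these challenges, position-dependent letter shapes and unwritten vowels, can also be seen in how Arabic text is encoded digitally. The Python sketch below is only an illustration and is not the CUBS software: it uses the standard unicodedata module to show that one letter (beh) has four positional glyph code points in Unicode, and that a vocalized word matches its everyday unvocalized spelling only after the vowel marks are removed. The helper name normalize_arabic is hypothetical and chosen for this example.

```python
import unicodedata


def normalize_arabic(text: str) -> str:
    """Fold positional glyph forms to base letters and drop vowel marks.

    Hypothetical helper for illustration only.
    """
    # NFKC maps Arabic presentation forms (isolated, final, initial,
    # medial glyph code points) back to the underlying base letter.
    folded = unicodedata.normalize("NFKC", text)
    # Short vowels and similar marks are nonspacing combining marks
    # (Unicode category "Mn"); removing them leaves the consonant skeleton.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Mn")


# The four positional forms of the letter beh (base letter U+0628).
beh_forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]
for glyph in beh_forms:
    base = normalize_arabic(glyph)
    print(unicodedata.name(glyph), "->", unicodedata.name(base))

# A fully vocalized word (kaf, teh, beh, each followed by a fatha mark)
# and the same word as it is usually written, without vowel marks.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = "\u0643\u062A\u0628"
print(normalize_arabic(vocalized) == plain)
```

In scanned documents the same variation appears as pixels rather than code points, which is why a recognizer has to learn each positional shape and cannot rely on written vowels when matching keywords.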

"So in addition to the benefits for readers of Arabic, this project will help push the frontiers of computer vision, pattern recognition and artificial intelligence in general," he said.

OCR software, Govindaraju explained, essentially trains the computer to correctly interpret the images of a particular alphabet based on "truthed" data—that is, numerous scanned images of characters or words and their interpretation recorded by humans who have examined the original images.
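The idea of learning from truthed data can be sketched in a few lines of Python. The article does not say which recognition method CUBS uses, so this example substitutes a simple nearest-neighbor classifier as a stand-in; the 3x3 bitmaps, their labels, and the names TRUTHED, hamming and classify are all invented for illustration.

```python
from typing import List, Tuple

Bitmap = Tuple[int, ...]  # a flattened 3x3 binary image, row by row

# "Truthed" data: each image is paired with the label a human assigned.
# These toy patterns are made up for the example; they are not real glyphs.
TRUTHED: List[Tuple[Bitmap, str]] = [
    ((1, 1, 1, 0, 1, 0, 0, 1, 0), "T"),
    ((1, 0, 0, 1, 0, 0, 1, 1, 1), "L"),
    ((1, 1, 1, 1, 0, 1, 1, 1, 1), "O"),
]


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count the pixels at which two bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(image: Bitmap, truthed: List[Tuple[Bitmap, str]] = TRUTHED) -> str:
    """Return the label of the truthed example closest to the image."""
    _, label = min(truthed, key=lambda pair: hamming(image, pair[0]))
    return label


# A "T" with one pixel missing, as might result from a poor scan.
noisy_t: Bitmap = (1, 1, 1, 0, 1, 0, 0, 0, 0)
print(classify(noisy_t))  # prints "T"
```

A practical system would need far more truthed examples per character, and for Arabic it would need examples of every positional form, but the principle is the same: human-verified labels are what the software learns from.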

Govindaraju was involved in the development at UB of the first comprehensive OCR software for interpreting handwritten addresses in English, a milestone that spurred research into handwriting recognition that led to some applications now taken for granted, such as personal digital assistants. He and his UB colleagues also created a software tool that is the first step in developing OCR software for Devanagari script, which will allow digitization of documents in Sanskrit, Hindi and dozens of other Indian and South Asian languages.