This article is from the archives of the UB Reporter.

Bridging the digital divide

Software tools boost Web access to Indian-language documents

Published: March 6, 2003

By ELLEN GOLDBAUM
Contributing Editor

So, you think searching for things in English on the Internet is frustrating?

Well, try searching for documents written in ancient Sanskrit, modern Hindi or any of the dozens of Indian and South Asian languages that are based on the beautiful, intricate symbols of the Devanagari script.

Putting this valuable content online from printed sources in Devanagari requires optical character recognition (OCR), the tool necessary to turn any printed text document into a digital one.

"The lack of a good OCR for Devanagari has made it very difficult to make available on the Web the vast majority of Devanagari documents," said Venu Govindaraju, associate director of the Center of Excellence in Document Analysis and Recognition (CEDAR) and professor of computer science and engineering.

Now, with funding from the National Science Foundation, Govindaraju and his UB colleagues are taking a major step toward boosting online access to these documents.

The researchers happen to share not only expertise in machine-print and handwriting recognition, but also a rare passion for—and fluency in—Sanskrit and other Indian languages.

Their project, funded under a $487,000 grant from the NSF's International Digital Libraries initiative, endeavors to make Devanagari documents, ranging from ancient Sanskrit masterpieces, such as the Bhagavadgita and the Vedas, to contemporary documents in Hindi, Marathi and other Indian languages, easily accessible on the Web.

The researchers, based at CEDAR, have created a software tool that is the first step in developing OCR for Devanagari, ultimately allowing documents in this script to be widely searchable on the Web.

It will be presented by Govindaraju, who is the principal investigator, on Tuesday at the 13th International Workshop on Research Issues on Data Engineering in Hyderabad, India.

The UB researchers expect to make it available for free on the Web by the end of March.

"We are developing machine technologies to read Devanagari documents, whether they are contemporary documents written in Hindi or ancient documents that were handwritten on palm leaves," said Govindaraju.

The project, which involves collaboration with the Indian Statistical Institute in Kolkata, one of India's premier research institutions, takes an important step toward bridging the digital divide between the developed world and some developing nations, according to the UB researchers.

"The half-billion people around the world whose main language is Hindi, or based on Devanagari, are totally missing out on the 'information revolution,'" said Govindaraju. "In IT, the native languages all have taken a back seat."

While Sanskrit has been considered a "dead" language, he noted that in his native India a movement to revive it, in both written and spoken forms, has been gaining ground, and in certain regions schools are including Sanskrit in their curricula.

He and his UB colleagues on the project are among those in the U.S. who have rediscovered the language; they teach Sanskrit to their own children and hold classes in it at the Hindu Cultural Society of Western New York.

"The Indian civilization is 5,000 years old," said Govindaraju. "So there are many, many documents written in Devanagari script, but if we want to include them in a digital library in order to preserve access to them, we need to develop software that recognizes the script."

OCR, the UB researchers explain, essentially "trains" the computer to correctly interpret the images of a particular alphabet based on "truthed" data, that is, numerous scanned images of characters or words and their interpretation recorded by humans who have visually examined the original images.

About 15 years ago, CEDAR, the largest research center in the world devoted to developing new technologies that can recognize and read handwriting, developed the first comprehensive OCR for handwritten documents in English.

That turned out to be a milestone, spurring numerous new research projects into handwriting recognition that led to some of the applications now taken for granted, such as personal digital assistants.

"Similarly, we are expecting that the development of benchmarked OCR for Devanagari will trigger a groundswell of research in machine-reading technologies for these Indian languages," said Govindaraju.

To develop benchmarked OCRs, the UB researchers have constructed a dataset of 400 pages of Hindi and Sanskrit documents from books and periodicals, both ancient and contemporary, that is representative of the huge variety of documents available in these languages.

The researchers have used the tool they developed to record information about these documents that indicates how OCR for Devanagari should interpret each word. The researchers also plan to develop character databases and on-line dictionaries, text corpora and other tools for linguistic analysis that will be invaluable to the OCR community.
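
To illustrate how truthed data can serve as a benchmark, here is a minimal sketch in Python. It is not the UB tool itself. The function names, the sample words and the two metrics (word accuracy and character error rate based on edit distance) are assumptions chosen for illustration. The sketch also assumes that each OCR word has already been paired with its truthed word.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def score_ocr(truthed_words, ocr_words):
    """Compare OCR output with truthed words (assumes a one-to-one alignment)."""
    if len(truthed_words) != len(ocr_words):
        raise ValueError("this sketch assumes one OCR word per truthed word")
    pairs = list(zip(truthed_words, ocr_words))
    exact = sum(t == o for t, o in pairs)
    edits = sum(edit_distance(t, o) for t, o in pairs)
    total_chars = sum(len(t) for t in truthed_words)
    return {
        "word_accuracy": exact / len(truthed_words) if truthed_words else 0.0,
        "char_error_rate": edits / total_chars if total_chars else 0.0,
    }


# Hypothetical sample: three truthed words and an OCR result with one misread word.
truthed = ["धर्म", "क्षेत्र", "कर्म"]
ocr_output = ["धर्म", "क्षेत", "कर्म"]
print(score_ocr(truthed, ocr_output))
```

Note that this sketch counts Unicode code points, so a single Devanagari conjunct or a consonant with a vowel sign counts as several characters. A real evaluation tool might instead compare whole visual units.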

"The availability of our truthing and evaluation tool, together with the availability of new truthed Devanagari data, will spur greater research in the development of Devanagari OCR," said Srirangaraj Setlur, senior research scientist at CEDAR and co-investigator.

Vemulapati Ramanaprasad, senior research scientist at CEDAR, is also a co-investigator.

In the future, the UB researchers plan to extend the scope of this tool to include OCR evaluation for other Indian languages, such as Kannada, Malayalam, Tamil and Telugu, that do not use the Devanagari script, as well as for Arabic and Urdu.