This article is from the archives of the UB Reporter.

Urdu enters the digital age

The work that UB researchers led by Rohini Srihari have done for Urdu also could help the development of similar systems for other languages that lack basic, electronic resources, such as Dari and Somali. Photo: DOUGLAS LEVERE


By ELLEN GOLDBAUM
Published: March 10, 2011

From Libya to Bahrain to Egypt, social media sites like Twitter and Facebook continue to play a substantial role in the political unrest of the region, but the major languages in the region remain poorly served by basic, electronic resources that the West takes for granted. Without such resources, analysis of documents in those languages or any meaningful data mining of them is much more difficult.

Now, computer scientists at UB and at Janya Inc. have developed the first software system that will allow for computational processing of documents in Urdu, Pakistan’s national language and one of the world’s five most-spoken languages.

The system provides a foundation for data mining in Urdu and allows for more accurate transliteration, the conversion of words from Urdu's writing system into the Roman alphabet used for English. It also is helping the computer scientists develop sophisticated ways to begin to do sentiment analysis of social media content.
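At its simplest, transliteration can be pictured as a character-by-character mapping from Urdu letters to Roman ones. The sketch below is purely illustrative and is not the researchers' method: the letter table covers only a handful of Urdu characters, and real systems use context-sensitive rules, since Urdu script omits most short vowels.

```python
# Toy character-level transliteration (hypothetical and incomplete).
# Real Urdu-to-Roman systems must infer unwritten short vowels and
# handle letters whose sound depends on context.

URDU_TO_ROMAN = {
    "ا": "a",  # alif
    "ب": "b",  # be
    "س": "s",  # sin
    "ل": "l",  # lam
    "م": "m",  # mim
    "ن": "n",  # nun
    "ر": "r",  # re
}

def transliterate(word: str) -> str:
    """Map each Urdu character to a Roman letter, leaving unknowns as-is."""
    return "".join(URDU_TO_ROMAN.get(ch, ch) for ch in word)

print(transliterate("سلام"))  # prints "slam" (salaam, with vowels dropped)
```

Even this crude mapping shows why the problem is hard: without the unwritten vowels, "slam" is ambiguous, which is where statistical models trained on annotated data come in.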

“This is the first comprehensive, natural language processing system for Urdu,” says Rohini Srihari, UB associate professor of computer science and engineering and co-author of “An Information-Extraction System for Urdu—A Resource-Poor Language” with Smruthi Mukund, a doctoral candidate in the UB Department of Computer Science and Engineering, and Erik Peterson, a research scientist at Janya Inc.

It is a joint project between the UB computer science department and Janya Inc., an Amherst, N.Y., company founded by Srihari that is a leading provider of information-extraction technology in such languages as Chinese, Arabic, Pashto and Russian.

The work was discussed in a presentation Srihari gave last month at “Blogs & Bullets: Social Media and the Struggle for Political Change,” a conference jointly hosted at Stanford University by the U.S. Institute of Peace and the George Washington University School of Public Diplomacy. It also was published in ACM Transactions on Asian Language Information Processing in December.

“The system we developed provides the first full pipeline of electronic language processing capabilities in Urdu,” Srihari says. “It facilitates electronic tasks ranging from the simplest keyword search to sentiment analysis of social networks, where you use computational methods to analyze opinions in a country or culture.”

Srihari and her colleagues became interested in Urdu because they were looking at blogs in different cultures.

“The advent of the Web has really increased the amount of content in languages like Urdu,” says Srihari. “When you start looking at blogs in different cultures, you can really start to understand public sentiment and opinions.”

The problem, she notes, is that these languages don’t have the established electronic infrastructures that are taken for granted in English and the European languages, such as lexicons, annotated electronic dictionaries and well-developed ontologies that describe relationships among words and entities in documents.

“If you are trying to do sentiment analysis—to find out what are the main topics people are talking about in a country, is there intensity building up over something and who is swaying opinion—then you must have an information-extraction system,” she says.

Srihari explains that information extraction uses a combination of linguistics and computer science to extract salient information, such as entities, relationships between entities and events, from large collections of unstructured text.

“Now we have developed the first system that will recognize everything in a raw—unprocessed—Urdu document,” she says. “It will be able to plot all the interesting names, dates, times—all the entities that might be of interest in a particular set of documents. That’s what allows you to start data mining, whether it’s blogs, social networks or comments on a news site.”

She describes the information-extraction system they developed as a “pipeline of processing” that begins with simple processing, akin to looking a word up in a dictionary to find its meaning, and progresses to more complex processing, such as diagramming a sentence to find its subject and object and establishing context.

The system performs several functions, including word segmentation, which identifies the boundaries of individual words in running text; part-of-speech tagging, which labels each word as a noun, verb or other part of speech; and named-entity tagging, which identifies names of people, places, organizations, dates and other specific pieces of information and translates them into English.
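The three stages above chain together naturally: each one consumes the output of the one before it. The sketch below is a hypothetical, heavily simplified illustration of that pipeline shape, not the UB/Janya system; its romanized word lists and lookup tables are invented stand-ins for the lexicons and statistical models a real system would use.

```python
# Illustrative three-stage pipeline: segmentation -> POS tagging -> NER.
# All names and data are hypothetical; real systems replace these
# dictionary lookups with trained statistical models.

from typing import Dict, List, Tuple

# Tiny lookup tables standing in for a real lexicon and gazetteer,
# keyed on romanized Urdu words.
POS_LEXICON = {"lahore": "NOUN", "mein": "ADP", "barish": "NOUN", "hui": "VERB"}
GAZETTEER = {"lahore": "LOCATION"}

def segment(text: str) -> List[str]:
    # Whitespace splitting is a stand-in: real Urdu segmentation must
    # also recover word boundaries that the script leaves unmarked.
    return text.lower().split()

def tag_pos(tokens: List[str]) -> List[Tuple[str, str]]:
    # Unknown words get an "UNK" tag rather than a guess.
    return [(t, POS_LEXICON.get(t, "UNK")) for t in tokens]

def tag_entities(tokens: List[str]) -> List[Tuple[str, str]]:
    # "O" marks tokens that are not part of any named entity.
    return [(t, GAZETTEER.get(t, "O")) for t in tokens]

def pipeline(text: str) -> Dict[str, list]:
    tokens = segment(text)
    return {"tokens": tokens, "pos": tag_pos(tokens), "entities": tag_entities(tokens)}

result = pipeline("Lahore mein barish hui")  # romanized Urdu: "It rained in Lahore"
print(result["entities"][0])  # prints ('lahore', 'LOCATION')
```

Once documents have passed through such a pipeline, the tagged entities and relations become the structured records that data mining and sentiment analysis operate on.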

Because spoken Urdu shares much with Hindi, a language in which Srihari is fluent, the system she and her colleagues have developed draws on this similarity, exploiting some of the electronic resources that exist for Hindi.

The work the UB researchers have done for Urdu also could help the development of similar systems for other languages that lack basic, electronic resources, such as Dari and Somali, Srihari notes.