Digitizing Urdu: Software Will Improve Analysis of Documents, Social Networks in Pakistan's National Language

Release Date: March 3, 2011

BUFFALO, N.Y. -- From Libya to Bahrain to Egypt, social media sites like Twitter and Facebook continue to play a substantial role in the political unrest of the region, but the region's major languages remain poorly served by the basic electronic resources that the West takes for granted. Without such resources, analyzing documents in those languages, or mining them for data in any meaningful way, is much more difficult.

Now, computer scientists at the University at Buffalo and at Janya Inc. have developed the first software system that will allow for computational processing of documents in Urdu, Pakistan's national language and one of the world's five most-spoken languages.

View a video about the new software system here: http://www.youtube.com/watch?v=pxZoHlpTIn8

The system provides a foundation for data mining in Urdu and allows for more accurate transliteration, converting from Urdu's writing system into English. It also is helping the computer scientists develop sophisticated ways to begin to do sentiment analysis of social media content.

"This is the first comprehensive, natural language processing system for Urdu," says Rohini Srihari, PhD, UB associate professor of computer science and engineering and co-author of "An Information-Extraction System for Urdu -- A Resource-Poor Language" with Smruthi Mukund, a doctoral candidate in the UB Department of Computer Science and Engineering, and Erik Peterson, a research scientist at Janya Inc.

It is a joint project between the UB Department of Computer Science and Engineering and Janya Inc., an Amherst, N.Y., company founded by Srihari that is a leading provider of information extraction technology in languages that include Chinese, Arabic, Pashto and Russian.

The work was discussed in a presentation Srihari gave Feb. 24 at "Blogs & Bullets: Social Media and the Struggle for Political Change," a conference jointly hosted by the U.S. Institute of Peace and the George Washington University School of Public Diplomacy at Stanford University. It also was published in ACM Transactions on Asian Language Information Processing in December.

"The system we developed provides the first full pipeline of electronic language processing capabilities in Urdu," Srihari says. "It facilitates electronic tasks ranging from the simplest keyword search to sentiment analysis of social networks, where you use computational methods to analyze opinions in a country or culture."

Srihari and her colleagues became interested in Urdu because they were looking at blogs in different cultures.

"The advent of the Web has really increased the amount of content in languages like Urdu," says Srihari. "When you start looking at blogs in different cultures, you can really start to understand public sentiment and opinions."

The problem, she notes, is that these languages don't have the established electronic infrastructures that are taken for granted in English and the European languages, such as lexicons, annotated electronic dictionaries and well-developed ontologies that describe relationships among words and entities in documents.

"If you are trying to do sentiment analysis -- to find out what are the main topics people are talking about in a country, is there intensity building up over something and who is swaying opinion -- then you must have an information extraction system," she says.

According to Srihari, information extraction uses a combination of linguistics and computer science to extract salient information such as entities, relationships between entities and events from large collections of unstructured text.

"Now we have developed the first system that will recognize everything in a raw -- unprocessed -- Urdu document," she says. "It will be able to plot all the interesting names, dates, times, all the entities that might be of interest in a particular set of documents. That's what allows you to start data mining, whether it's blogs, social networks or comments on a news site."

She describes the information extraction system they developed as a "pipeline of processing" that begins with simple processing, akin to looking a word up in a dictionary to find its meaning, and progresses to more complex processing, such as diagramming a sentence to find its subject and object and establishing context.

The system performs several functions, including word segmentation, in which individual words are properly segmented; part-of-speech tagging, in which parts of speech are properly identified; and named-entity tagging, in which names of people, places, organizations, dates and other specific pieces of information are identified and translated into English.
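
The three stages named above can be illustrated with a minimal sketch in Python. This is not the UB/Janya system: the lexicon, the gazetteer, the tag names and the function names are all hypothetical, the sample input is unspaced Latin-script text standing in for Urdu, and simple dictionary lookup stands in for the models a real system would presumably use.

```python
from typing import Dict, List, Tuple

# Hypothetical toy resources (assumptions for illustration only).
# A real system would rely on large lexicons and trained models.
LEXICON: Dict[str, str] = {
    "srihari": "NOUN",
    "visited": "VERB",
    "buffalo": "NOUN",
    "in": "ADP",
    "march": "NOUN",
}
GAZETTEER: Dict[str, str] = {
    "srihari": "PERSON",
    "buffalo": "LOCATION",
    "march": "DATE",
}


def segment(text: str) -> List[str]:
    """Stage 1: greedy longest-match word segmentation over unspaced text."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                words.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit a one-character token.
            words.append(text[i])
            i += 1
    return words


def tag_pos(words: List[str]) -> List[Tuple[str, str]]:
    """Stage 2: part-of-speech tagging by lexicon lookup."""
    return [(w, LEXICON.get(w, "UNK")) for w in words]


def tag_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Stage 3: named-entity tagging by gazetteer lookup ("O" = no entity)."""
    return [(w, pos, GAZETTEER.get(w, "O")) for w, pos in tagged]


def run_pipeline(text: str) -> List[Tuple[str, str, str]]:
    """Chain the stages so each consumes the previous stage's output."""
    return tag_entities(tag_pos(segment(text)))


if __name__ == "__main__":
    for row in run_pipeline("sriharivisitedbuffaloinmarch"):
        print(row)
```

The point of the sketch is the ordering: entity tagging depends on part-of-speech tags, which depend on correct word boundaries, so an error in an early stage carries through to the later ones.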

Because spoken Urdu shares much with Hindi, a language in which Srihari is fluent, the system she and her colleagues have developed draws on this similarity, exploiting some of the electronic resources that exist for Hindi.

The work the UB researchers have done for Urdu also could help the development of similar systems for other languages that lack basic, electronic resources, such as Dari and Somali, Srihari says.

The University at Buffalo is a premier research-intensive public university, a flagship institution in the State University of New York system and its largest and most comprehensive campus. UB's more than 28,000 students pursue their academic interests through more than 300 undergraduate, graduate and professional degree programs. Founded in 1846, the University at Buffalo is a member of the Association of American Universities.

Media Contact Information

Ellen Goldbaum
News Content Manager
Medicine
Tel: 716-645-4605
goldbaum@buffalo.edu
Twitter: @UBmednews