is associate professor of computer science and engineering. She teaches classes on Web search and mining, and is founder and CEO of Janya Inc., a text analysis/text mining firm.
Why is Google’s data trove so valuable?
The sheer volume of data that they have is powerful—there’s so much of it and it’s so diverse. It reflects the voices of consumers, the voices of citizens, the voices of people across countries. One way they can exploit this information is through usage mining, which is tracking how people are using the Internet. They know what people are querying. Google has access to all sorts of information that marketers would love to get their hands on. When people query a brand, for instance, what are they querying for? Google was able to spot outbreaks of flu-like illnesses before government agencies could because government agencies rely on traditional reporting—waiting for hospitals to send in statistics—whereas Google relies on queries. They know what people are querying for and where those queries are coming from.
What other companies or organizations are investing in data mining on the Web, and why?
Practically everyone. The telecoms, credit card agencies, major retailers, airlines, e-commerce providers like Amazon—all of these entities are engaged in data mining. One emerging technology is socially targeted advertising. Companies that provide this service analyze the browsing patterns of brand loyalists, identify Internet users with similar browsing patterns and use that information to target advertising. The success stories of companies attracting new customers through socially targeted advertising are amazing.
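The matching step described above, finding Internet users whose browsing resembles that of brand loyalists, can be sketched roughly as follows. This is only an illustrative sketch, not how any particular company does it: the Jaccard overlap measure, the `find_lookalikes` helper, the 0.5 threshold and the example site names are all assumptions made for the example.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two sets of visited sites, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_lookalikes(loyalists, candidates, threshold=0.5):
    """Return IDs of candidates whose browsing resembles any loyalist's.

    Both arguments map a (hypothetical) user ID to the set of sites visited.
    """
    matches = []
    for user_id, sites in candidates.items():
        # Compare this candidate against every known brand loyalist.
        best = max(
            (jaccard(sites, seen) for seen in loyalists.values()),
            default=0.0,
        )
        if best >= threshold:
            matches.append(user_id)
    return sorted(matches)


# Made-up browsing histories, for illustration only.
loyalists = {
    "fan_1": {"trailrunning.example", "gearreviews.example", "brand.example"},
}
candidates = {
    "user_a": {"trailrunning.example", "gearreviews.example", "news.example"},
    "user_b": {"cooking.example", "news.example"},
}

print(find_lookalikes(loyalists, candidates))
```

Here `user_a` shares two of four distinct sites with the loyalist and so would be selected for the ad, while `user_b` shares none.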
What are some interesting challenges that researchers and companies face when mining data on the Web?
The No. 1 challenge is balancing privacy with data mining. We’ve come to a stage where we do less than we can for fear of spooking the public. How do you gain enough information to help a retailer without creating a backlash? You don’t want people to feel like you’re invading their privacy. There are technical challenges, like making sense of text with multiple languages or spelling mistakes, but it’s achieving that balance between data mining and privacy that is the No. 1 challenge.
What are some potential public benefits that could come from data mining?
Data mining has the potential to make a serious impact on societal problems. Trends emerge quickly on the Web, and that can be used in an advantageous way. Google’s ability to spot outbreaks of flu-like activity is one example. Law enforcement is another. We’ve heard that gang members often post on their Facebook pages what they did, so law enforcement agents frequently go and look at Facebook to glean additional information. In local communities, if the volume of communication or chatter about some topic increases to a certain level (maybe roads need fixing or there’s a dangerous traffic light), public officials might take notice.
How might data mining affect the average Internet user?
We’re going to see more of this socially targeted advertising and it might start making people wonder, “How did they know that I was interested in traveling to Peru, or that I was looking to buy this thing?” It’s one thing when you’re doing a Google search and you see some advertising appear on the side. It’s quite another thing when you’re reading the newspaper online and you suddenly see an ad that’s targeted specifically at you that’s unrelated to the content on the page. As people become more aware of how much their Internet activities reveal, they may become more wary about the way they communicate. We’re going to see more debate about privacy.