Wednesday, July 1, 2009

Keynote on Day 3: Relating Content through Web Usage by Ricardo Baeza-Yates

Ricardo is VP of Yahoo! Research in Barcelona, Santiago, and Haifa, Israel. He was a PhD student at the University of Waterloo and also maintains ties with universities in Spain. His talk is on relating content through web usage. He has written a book that is the standard text in information retrieval. According to Ricardo, web search is no longer about document retrieval; there is now a new breed of search experiences that draw on the Wisdom of Crowds behind Web 2.0. Search is evolving beyond documents towards identifying a user's task and helping to complete it. The challenges, however, are working online and scalability.

We now have more complete information available in a single search, such as shortcuts, deep links and enhanced results. But search comes down to content vs. intent: the premise is that users don't want to search, they just want to get tasks done and go straight to their answers. We search when we don't know what to ask or whom to ask. We are now moving from a web of pages to a web of objects. Objects have attributes, which will be missing, noisy or incomplete, but that is ok. Attributes define faceted search. The question, however, is how we get structured objects and attributes. They will come from metadata, the semantic web and ontologies, from web usage, and from building out an open ecosystem.

As the AOL experience showed, query and click data are private, which makes them hard to obtain, and crawling the web is expensive. James Surowiecki, a New Yorker columnist, wrote in his 2004 book that under the right circumstances, groups are remarkably intelligent. So what do we get from the wisdom of crowds? Popularity, diversity, quality and coverage. The wisdom of crowds is crucial for search ranking: we use text (web writers and editors), links (web publishers), and now tags (web taggers), and what Baeza-Yates turns to next is using all the queries.

Twenty years later, the basic ideas of cross references and dynamic links from Frank Tompa in 1988 are still relevant. Yahoo Research has some demos of its research; one is TagExplorer, which is based on tag similarity. How is this done? First, the mined tags need to be classified, and tag semantics are derived using WordNet. Baeza-Yates showed a demo of TagExplorer in which you can find tags related to locations, subjects and activities based on a query; he gave the example of Torino. Based on this and on finding similar pictures, we can tag pictures automatically. We could also suggest tags to people based on a picture, but if you do that in Flickr, it is not a folksonomy any more. The tags would be biased towards the algorithm, and that is what we don't want.
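
The talk did not spell out how TagExplorer measures tag similarity, so the following is only a minimal sketch of one common approach: treating two tags as similar when they tend to appear on the same photos, scored with Jaccard overlap. The function names (`tag_photo_index`, `related_tags`) and the toy photo data are made up for illustration and are not Yahoo's implementation.

```python
from collections import defaultdict


def tag_photo_index(photos):
    """Map each tag to the set of photo ids that carry it."""
    index = defaultdict(set)
    for photo_id, tags in photos.items():
        for tag in tags:
            index[tag].add(photo_id)
    return index


def jaccard(a, b):
    """Jaccard overlap of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_tags(tag, index, top_n=3):
    """Rank other tags by how often they co-occur with `tag` (assumed measure)."""
    target = index.get(tag, set())
    scores = []
    for other, photo_ids in index.items():
        if other == tag:
            continue
        score = jaccard(target, photo_ids)
        if score > 0:
            scores.append((other, score))
    # Highest score first; ties broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


if __name__ == "__main__":
    # Hypothetical toy data standing in for tagged Flickr photos.
    photos = {
        "p1": ["torino", "italy", "mole"],
        "p2": ["torino", "italy"],
        "p3": ["torino", "football"],
        "p4": ["italy", "rome"],
    }
    print(related_tags("torino", tag_photo_index(photos)))
```

A real system would presumably also use the WordNet-based classification mentioned above to group the related tags into locations, subjects and activities; that step is left out here.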

We can also do visual annotation by associating text with a visual area, which is done in Flickr, as is tagging people in Facebook. Content-based image retrieval works by first extracting visual features and describing them, and then building a visual vocabulary using k-means clustering. This is an example of combining tagging and visual image retrieval. Besides WordNet, you can also use Wikipedia search to drive the algorithm. Using this, Yahoo Research has created Correlator to find relations in Wikipedia. Correlator works by retrieving related sentences and ranking them.
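
To make the visual vocabulary step concrete, here is a small sketch in plain Python. It assumes the feature descriptors are already extracted as numeric vectors (the talk did not say which features are used), runs a basic k-means with a naive "first k points" initialisation, and then describes an image as a histogram of its nearest "visual words". All names and the 2-D toy descriptors are illustrative assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Basic Lloyd's k-means; initial centroids are simply the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                # Mean of the cluster members, one dimension at a time.
                new_centroids.append(
                    [sum(dim) / len(members) for dim in zip(*members)])
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def visual_word_histogram(descriptors, vocabulary):
    """Count how many descriptors fall on each visual word (centroid)."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        histogram[nearest(d, vocabulary)] += 1
    return histogram


if __name__ == "__main__":
    # Hypothetical 2-D descriptors; real ones have many more dimensions.
    training = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
                (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
    vocabulary = kmeans(training, k=2)
    print(visual_word_histogram([(0.5, 0.5), (9.0, 9.0), (10.0, 10.0)],
                                vocabulary))
```

Two images can then be compared through their histograms, much as text documents are compared through word counts.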

The next part of Baeza-Yates' talk is web usage. We can use clicks, which follow hyperlinks, and queries, which express user interest. For example, query q4 can be considered related to query q3 because the words in the pages are similar and because the user clicked them. We can see what people are looking for by mapping queries to the ODP (Open Directory Project). You can also do hierarchical clustering on the resulting graph (Francisco, Baeza-Yates and Oliveira).
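
As a rough illustration of relating queries through clicks, the sketch below represents each query by the URLs users clicked for it and compares two queries by the cosine of those click vectors. This is one plausible reading of the idea, not the method presented in the talk; the log format, function names and example URLs are assumptions.

```python
import math
from collections import Counter


def click_vectors(log):
    """Build one click-count vector per query from (query, clicked_url) pairs."""
    vectors = {}
    for query, url in log:
        vectors.setdefault(query, Counter())[url] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse click-count vectors."""
    dot = sum(a[url] * b[url] for url in a if url in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical click log: q3 and q4 lead to the same pages, q5 does not.
    log = [
        ("q3", "a.example"), ("q3", "b.example"),
        ("q4", "a.example"), ("q4", "b.example"),
        ("q5", "c.example"),
    ]
    vectors = click_vectors(log)
    print(cosine(vectors["q3"], vectors["q4"]))
    print(cosine(vectors["q3"], vectors["q5"]))
```

Queries whose similarity exceeds some threshold can be joined by an edge, which gives a query graph of the kind that clustering could then be run on.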

So what are some of the open issues? Data volume versus better algorithms, explicit versus implicit social networks (are there any fundamental similarities?), how to evaluate with (small) partial knowledge, and user aggregation vs. personalization. Together these form a virtuous cycle that improves the web.

Now it's time for questions. The first question was about how Yahoo Research's work on tagging and search compares with Wolfram Alpha. Baeza-Yates answered that the two come from different ends of the spectrum. Yahoo Research is also making some of its datasets, such as Yahoo Answers, available for researchers to use.
