
Semantics on the Web: How Do We Get There?


Presentation Transcript


  1. Semantics on the Web: How Do We Get There? Raghu Ramakrishnan, Chief Scientist for Audience & Cloud Computing, Yahoo!

  2. Prediction is very hard, especially about the future. Yogi Berra

  3. Trends in Search

  4. Search Intent. "I want to book a vacation in Tuscany": from Start to Finish. Premise: people don't want to search; people want to get tasks done. (Broder 2002, A Taxonomy of Web Search)

  5. Y! Shortcuts

  6. Google Base

  7. Opening Up Yahoo! Search. Phase 1: giving site owners and developers control over the appearance of Yahoo! Search results. Phase 2: BOSS takes Yahoo!'s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences. (Slide courtesy Prabhakar Raghavan)

  8. Search Results of the Future yelp.com Gawker babycenter New York Times epicurious LinkedIn answers.com webmd (Slide courtesy Andrew Tomkins)

  9. Custom Search Experiences Social Search Vertical Search Visual Search (Slide courtesy Prabhakar Raghavan)

  10. How Does SearchMonkey Work? (1) Site owners/publishers share structured data with Yahoo!. (2) Site owners & third-party developers build SearchMonkey apps. (3) Consumers customize their search experience with Enhanced Results. [Diagram: Acme.com's web pages and database feed the index via page extraction, RDF/microformat markup, a DataRSS feed, and web services.] (Slide courtesy Amit Kumar)

  11. BOSS Offerings. BOSS offers two options for companies and developers, and has partnered with top technology universities to drive search experimentation, innovation and research into next-generation search. API: a self-service, web-services model for developers and start-ups to quickly build and deploy new search experiences. CUSTOM: working with third parties to build a more relevant, brand/site-specific web search experience; this option is jointly built by Yahoo! and select partners. ACADEMIC: working with the following universities to allow for wide-scale research in the search field: University of Illinois Urbana-Champaign, Carnegie Mellon University, Stanford University, Purdue University, MIT, Indian Institute of Technology Bombay, University of Massachusetts. (Slide courtesy Prabhakar Raghavan)

  12. Partner Examples

  13. Aside: Cloud Computing APIs • On demand access to enormous datasets and computing power • Rapid application development • No maintenance worries for application developer

  14. Cloud Services. Compute-centric: virtual machines (e.g., EC2, PoolBoy, Tashi); HTC job scheduling (e.g., Condor); analytic/batch processing (e.g., Hadoop, SSDS). Data-centric: serving stores (e.g., Mobstor, MySQL, S3, Sherpa); archival/warehouse storage; messaging (ordered? guaranteed delivery? scalability in # of messages, producers, subscribers?).

  15. Google Co-Op. A query-based direct display, programmed by a Contributor: when a query matches a pattern provided by the Contributor, the SERP displays (query-specific) links programmed by that Contributor as a Subscribed Link. Users "opt in" by "subscribing" to them.

  16. Structured Search Content: Where Do We Get It? The Semantic Web? There are three ways to create semantically rich summaries that address the user's information needs: editorial, extraction, and UGC. Unleash community computing: the PeopleWeb! Challenge: design social interactions that lead to the creation and maintenance of high-quality structured content.

  17. Evolution of Online Communities

  18. Metadata. Metadata drove most advances in search from 1996 to the present. Estimated growth of metadata: anchortext, 100 MB/day; tags, 40 MB/day; pageviews, 100-200 GB/day; reviews, around 10 MB/day; ratings, small. Metadata is increasingly rich and available, but not yet useful in search, even though interactions on the web are currently limited by the fact that each site is essentially a silo.

  19. PeopleWeb: Site-Centric to People-Centric. Global Object Model: a common web-wide id for objects (incl. users); even common attributes? (e.g., pixels for camera objects). Portable Social Environment: as users move across sites, their personas and social networks will be carried along. Community Search: increased semantics on the web through community activity (another path to the goals of the Semantic Web). (Towards a PeopleWeb, Ramakrishnan & Tomkins, IEEE Computer, August 2007)

  20. Social Search

  21. Social Search • Explicitly open up search • Enable communities, sites and consumers to explicitly re-define search results (e.g., SearchMonkey, Boss) • Right unit for a “search result”? Can we “stitch together” more informative abstracts from multiple sources? • Creation of specialized ranking engines for different tasks, or different user communities • Implicitly leverage socially engaged users and their interactions • Learning from shared community interactions, and leveraging community interactions to create and refine content • Expanding search results to include sources of information • E.g., Experts, sub-communities of shared interest, particular search engines (in a world with many, this is valuable!) Reputation, Quality, Trust, Privacy

  22. Web Search Results for “Lisa”. Web search results are very diversified, covering pages about organizations, projects, people, events, etc. The latest news results for “Lisa” are mostly about people, because Lisa is a popular name. 41 results from My Web!

  23. Save / Tag Pages You Like. You can save/tag pages you like into My Web from the toolbar, a bookmarklet, or save buttons. Enter a note for personal recall and sharing. You can pick tags from the suggested tags, based on collaborative-tagging technology, with type-ahead based on the tags you have used. You can specify a sharing mode, and you can save a cached copy of the page content. (Slide courtesy: Raymie Stata)

  24. My Web 2.0 Search Results for “Lisa”. An excellent set of search results from my community, because a couple of people in my community are interested in Usenix LISA-related topics.

  25. Tech Support at COMPAQ “In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.” “Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.” – Steve Young, VP of Customer Care, Compaq

  26. How It Works. [Diagram: a customer asks a QUESTION; partner experts, customer champions, employees, or a support agent provide an ANSWER; the answer is added to the KNOWLEDGE BASE to power SELF SERVICE.]

  27. Timely Answers. 77% of answers were provided within 24 hours, with no effort to answer each question, no added experts, and no monetary incentives for enthusiasts. Of 6,845 questions, 74% were answered: 40% (2,057) within 3h, 65% (3,247) within 12h, 77% (3,862) within 24h, and 86% (4,328) within 48h.

  28. Power of Knowledge Creation. Knowledge creation acts as two shields in front of support agents: self-service resolves ~80% of support incidents, mass collaboration another 5-10%, and only the remainder become agent cases. (Averages from QUIQ implementations.)

  29. Mass Contribution. Users who on average provide only 2 answers contribute 50% of all answers: of 6,718 answers, the top users, just 7% (120) of contributing users, provide 50%, while the mass of users, 93% (1,503), contribute the other 50% (3,329).

  30. Interesting Problems • Question categorization • Detecting undesirable questions & answers • Identifying “trolls” • Ranking results in Answers search • Finding related questions • Estimating question & answer quality (Byron Dom: SIGIR 2006 talk)

  31. Better Search via Information Extraction. Extract, then exploit, structured data from raw text: "For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying…" Extraction yields a PEOPLE table with rows (Bill Gates, CEO, Microsoft), (Bill Veghte, VP, Microsoft), (Richard Stallman, Founder, Free Soft..); the query Select Name From PEOPLE Where Organization = 'Microsoft' then returns Bill Gates and Bill Veghte. (from Cohen's IE tutorial, 2003)
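The extract-then-query pipeline on this slide can be sketched minimally. The regex patterns, the PEOPLE row shape, and the trimmed sample text below are illustrative assumptions, not the tutorial's actual extractor:

```python
import re

# Sample text (abridged from the slide's example).
TEXT = (
    'For years, Microsoft Corporation CEO Bill Gates was against open source. '
    '"We can be open source. We love the concept of shared source," '
    'said Bill Veghte, a Microsoft VP.'
)

# Hypothetical pattern 1: "<Org> CEO <First Last>".
P1 = re.compile(r'([A-Z]\w+(?: Corporation)?) (CEO) ([A-Z]\w+ [A-Z]\w+)')
# Hypothetical pattern 2: "<First Last>, a <Org> VP".
P2 = re.compile(r'([A-Z]\w+ [A-Z]\w+), a ([A-Z]\w+) (VP)')

def extract_people(text):
    """Turn raw text into structured (Name, Title, Organization) rows."""
    rows = []
    for org, title, name in P1.findall(text):
        # Keep only the company name, dropping suffixes like "Corporation".
        rows.append({'Name': name, 'Title': title, 'Organization': org.split()[0]})
    for name, org, title in P2.findall(text):
        rows.append({'Name': name, 'Title': title, 'Organization': org})
    return rows

PEOPLE = extract_people(TEXT)

# The structured view now supports the slide's query:
# SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'
names = [r['Name'] for r in PEOPLE if r['Organization'] == 'Microsoft']
print(names)  # ['Bill Gates', 'Bill Veghte']
```

The point is the shape of the pipeline, not the patterns: once extraction materializes a table, ordinary structured queries work over what began as free text.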

  32. DBLife: Community Information Mgmt • Integrated information about a (focused) real-world community • Collaboratively built and maintained by the community • Semantic web, “bottom-up” • CIMple software package

  33. DBLife. Integrates data of the DB research community: 1,164 data sources, crawled daily; 11,000+ pages = 160+ MB / day.

  34. Entity Extraction and Resolution co-authors = A. Doan, Divesh Srivastava, ... Raghu Ramakrishnan
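Resolving variant mentions such as "A. Doan" vs. "AnHai Doan" to one entity can be sketched with a simple name-compatibility rule. This is a toy stand-in; a real system like DBLife combines far richer evidence (pages, co-occurrence, links), and the rule below assumes "First Last"-shaped names:

```python
def compatible(mention, canonical):
    """True if `mention` could abbreviate `canonical`: same last name,
    first name equal to or an initial of the canonical first name."""
    m_first, m_last = mention.rsplit(' ', 1)
    c_first, c_last = canonical.rsplit(' ', 1)
    if m_last.lower() != c_last.lower():
        return False
    m_first = m_first.rstrip('.')  # "A." -> "A"
    return c_first.lower().startswith(m_first.lower())

def resolve(mention, entities):
    """Resolve a mention only when exactly one entity is compatible."""
    matches = [e for e in entities if compatible(mention, e)]
    return matches[0] if len(matches) == 1 else None  # ambiguous -> unresolved

entities = ['AnHai Doan', 'Divesh Srivastava', 'Raghu Ramakrishnan']
print(resolve('A. Doan', entities))        # 'AnHai Doan'
print(resolve('D. Srivastava', entities))  # 'Divesh Srivastava'
```

Leaving ambiguous mentions unresolved, rather than guessing, is what lets downstream user feedback (later slides) correct the remainder.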

  35. Resulting ER Graph. [Diagram: the paper “Proactive Re-optimization” is connected by write edges to its authors Pedro Bizarro, Shivnath Babu, and David DeWitt; coauthor and advise edges link Bizarro, Babu, DeWitt, and Jennifer Widom; PC-member and PC-Chair edges connect them to SIGMOD 2005.]

  36. Mass Collaboration. We want to leverage user feedback to improve the quality of extraction over time. Maintaining an extracted “view” over a collection of documents is very costly, and feedback from users can help; in fact, distributing the maintenance task across a large group of users may be the best approach.

  37. Mass Collaboration Example. A user flags “Not David!”; the picture is removed if enough users vote “no”.

  38. Mass Collaboration Meets Spam Jeffrey F. Naughton swears that this is David J. DeWitt
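One way to keep the "vote to remove" scheme from being gamed, as in the spam scenario above, is to weight each vote by the voter's reputation, so a flood of low-reputation accounts cannot outvote a few trusted users. A minimal sketch; the threshold, default weight, and reputation values are invented for illustration:

```python
THRESHOLD = 3.0  # illustrative retraction threshold

def should_remove(votes, reputation):
    """votes: list of (user, 'yes'|'no') on an extracted fact;
    reputation: user -> weight (unknown users get a tiny default)."""
    weight_no = sum(reputation.get(u, 0.1) for u, v in votes if v == 'no')
    weight_yes = sum(reputation.get(u, 0.1) for u, v in votes if v == 'yes')
    return weight_no - weight_yes >= THRESHOLD

reputation = {'trusted1': 2.0, 'trusted2': 2.0}

# Two trusted users voting "no" clears the threshold: 4.0 >= 3.0.
votes = [('trusted1', 'no'), ('trusted2', 'no')]
print(should_remove(votes, reputation))  # True

# Twenty sockpuppet accounts barely move the total: 20 * 0.1 = 2.0 < 3.0.
spam_votes = [(f'sock{i}', 'no') for i in range(20)]
print(should_remove(spam_votes, reputation))  # False
```

Reputation itself would have to be earned (e.g., from a history of votes that agreed with eventual consensus), which is exactly the kind of quality/trust problem the Social Search slide lists.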

  39. DBLife • Faculty: AnHai Doan & Raghu Ramakrishnan • Students: P. DeRose, W. Shen, F. Chen, R. McCann, Y. Lee, M. Sayyadian • Prototype system up and running since early 2005 • Plan to release a public version of the system in Spring 2007 • 1,164 sources, crawled daily, 11,000+ pages / day • 160+ MB, 121,400+ people mentions, 5,600+ persons • See DE overview article, CIDR 2007 demo

  40. Information Extraction … and the challenge of managing it

  41. Economics of IE • Data is cheap; supervision is expensive • The cost of supervision, especially large, high-quality training sets, is high • By comparison, the cost of data is low • Therefore: rapid training-set construction / active-learning techniques; tolerance for little (or low-quality) supervision; take feedback and iterate rapidly
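The "rapid training-set construction / active learning" point can be illustrated with uncertainty sampling: ask the annotator to label the example the current classifier is least sure about, so each expensive label buys the most information. The toy scorer below is a stand-in for a real extraction classifier:

```python
def uncertainty(prob):
    """Distance from a confident decision; maximal at prob = 0.5."""
    return 1.0 - abs(prob - 0.5) * 2

def pick_next_to_label(unlabeled, classify):
    """Return the unlabeled example whose predicted probability of being
    a positive (e.g., 'this line is a paper record') is closest to 0.5."""
    return max(unlabeled, key=lambda x: uncertainty(classify(x)))

# Toy classifier output: probability that each line is a paper record.
scores = {'line A': 0.95, 'line B': 0.52, 'line C': 0.10}
best = pick_next_to_label(list(scores), scores.get)
print(best)  # 'line B' -- the classifier is nearly sure about A and C already
```

Labels spent on confidently classified examples (A, C) would change the model little; spending them where the model wavers is how a small supervision budget goes furthest.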

  42. Extraction Management Systems (e.g., Vertex, Societek): the Purple SOX project (SOcial eXtraction) • Application layer: shopping, travel, autos; academic portals (DBLife/MeYahoo); enthusiast platform; and many others • Operator library

  43. Example: Accepted Papers • Every conference comes with a slightly different format for accepted papers • We want to extract accepted papers directly (before they make their way into DBLP etc.) • Assume • Lots of background knowledge (e.g., DBLP from last year) • No supervision on the target page • What can you do?

  44. Down the Page a Bit

  45. Record Identification • To get started, we need to identify records • Hey, we could write an XPath, no? • So, what if no supervision is allowed? • Given a crude classifier for paper records, can we recursively split up this page?
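As a sketch of the unsupervised setting above, here is a greedy one-pass stand-in for the recursive splitting the slide describes: a crude classifier scores how likely each line is to start a paper record, and a new record begins at each high-scoring line. The page contents and the scoring rule are invented for illustration:

```python
def segment_records(lines, classify, threshold=0.5):
    """Group consecutive page lines into candidate records, starting a
    new record whenever the classifier believes a line begins one."""
    records, current = [], []
    for line in lines:
        if classify(line) >= threshold and current:
            records.append(current)  # close the previous record
            current = []
        current.append(line)
    if current:
        records.append(current)
    return records

# Hypothetical accepted-papers page: titles flush left, details indented.
page = [
    'Efficient Query Processing. A. Author, B. Author',
    '  (University of X)',
    'Scalable Extraction. C. Author',
    '  (University of Y)',
]

# Crude, unsupervised "record start" signal: an unindented line.
starts_record = lambda line: 1.0 if not line.startswith(' ') else 0.0

records = segment_records(page, starts_record)
print(len(records))  # 2
```

With no supervision on the target page, the classifier would come from background knowledge (e.g., last year's DBLP titles and author names) rather than hand-labeled examples, matching the assumptions on the previous slide.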

  46. First Level Splits
