Information Retrieval in Context

Presentation Transcript


  1. Information Retrieval in Context
     Presenter: Xuehua Shen (xshen@uiuc.edu)
     Xuehua Shen @CS, UIUC

  2. Presentation Layout
     • Problem Description
     • Terminology
     • Challenges
     • IntelliZap System [WWW2001]
     • Concerns

  3. Problem
     • Search engines have become a key source of information
       - 1998 [GVU WWW Study]: 85% of people use search engines to locate information
       - Now [Craig's Talk]: 500 million searches on the Internet per day, 150 million searches at Google per day
     • Efforts focus on coverage and relevance

  4. Web Search Fact
     • 3-5 billion web pages on the Web: huge and diverse information provided by the Web
     • On average 1.7 words per query [Eric Brewer, CACM 09/2002]: little information provided by users
     • Can search engines retrieve web pages very well?

  5. Context
     • Context may provide extra information that helps improve search result relevance
     • An example: searching for "flowers" [DirectHit 1999]
       - Men typically want sites that let them send flowers
       - Women often want sites that let them order flower seeds or plants for gardening
     • What context information is useful?

  6. Terminology
     • Ephemeral context: available within a single search session
       - Category [Inquirus2], document being viewed [Watson], feedback
     • Persistent context: accumulates incrementally over time and is used in subsequent sessions
       - User profile [My Yahoo!], query history and clickthrough data [Google]

  7. Terminology cont.
     • Personalization: the search engine uses context information to provide different search results for different users
     • Customization: users manually configure their preferences

  8. Challenges
     • How to capture and store useful information?
     • SearchPad [WWW2001]:
       - Server-proxy-client architecture
       - Users explicitly mark relevant pages
     • Any shortcomings? Better ways?

  9. Challenges cont.
     • There are many retrieval models and many user models, but how can they be merged?
     • Croft uses a language model to represent context

  10. Challenges
     • How to build such a system? Architecture: server side or client side? User interface?
     • Server side: scalability, privacy
     • Client side: communication of context information with the server

  11. Challenges
     • How to evaluate such work? Metrics?
     • HARD (High Accuracy Retrieval from Documents) track added this year: leverages additional information about the searcher and/or the search context

  12. IntelliZap – General Description
     • Assumption: a large fraction of searches originate while users are reading documents on their computers
     • Standpoint: context is the body of words surrounding a user-selected phrase
     • IntelliZap system: a meta search engine with context-based query augmentation, search engine selection, and reranking

  13. Walkthrough of IntelliZap

  14. Walkthrough cont.

  15. Walkthrough cont.

  16. Walkthrough cont.

  17. Walkthrough cont.

  18. How to use Context
     • Augment the query before sending it to the search engines
     • Rerank the results returned by the search engines

  19. How to collect the right amount of context
     • Don't include the whole document, as the Watson system does
     • Heuristic 1: establish the optimal context length as a function of the length of the text phrase and individual frequencies
     • Heuristic 2: relative weighting of the text and the context in the augmented query
       - Emphasize the marked text phrase
       - Weight of a context word: a monotonic function of its proximity to the text
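Heuristic 2 can be illustrated with a small Python sketch. It is not IntelliZap's actual weighting: the exponential decay, the `selected_weight` and `decay` values, and the function name `weight_context` are all assumptions chosen only to show a weight that falls monotonically with distance from the marked text.

```python
def weight_context(words, sel_start, sel_end, selected_weight=2.0, decay=0.9):
    """Weight each word; the marked text is words[sel_start:sel_end].

    Assumed scheme (not from the slides): marked words get a fixed high
    weight, context words get decay ** distance, which shrinks
    monotonically as the word lies farther from the marked text.
    """
    weighted = []
    for i, word in enumerate(words):
        if sel_start <= i < sel_end:
            weight = selected_weight  # emphasize the marked text phrase
        else:
            # distance in words to the nearest edge of the marked text
            distance = sel_start - i if i < sel_start else i - sel_end + 1
            weight = decay ** distance
        weighted.append((word, weight))
    return weighted


words = "the jaguar is a fast car built in england".split()
weights = weight_context(words, 1, 2)  # the user marked "jaguar"
```

Any other decreasing function of distance would satisfy the slide's description equally well.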

  20. Algorithm Overview

  21. Step 0: Semantic Network
     • Build the semantic network offline: a statistics-based semantic network
     • Linear combination of a vector-based correlation metric and a WordNet-based metric

  22. Semantic Network cont.
     • Vector-based correlation metric:
       - 27 knowledge domains (computer, business, etc.)
       - 10,000 sample documents from the Internet
       - Each word: a 27-dimensional vector
       - Correlation is used to measure distance
     • WordNet: captures semantic relations between words (hypernymy, hyponymy, meronymy, and holonymy)
       WordNet: http://www.cogsci.princeton.edu/~wn/
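The linear combination of the two metrics can be sketched in Python as follows. This is a minimal illustration under stated assumptions: Pearson correlation stands in for the unspecified "correlation", the mixing weight `alpha` is hypothetical, the WordNet-based score is passed in as a plain number rather than computed, and the toy vectors have 4 dimensions instead of 27.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length domain vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / (norm_u * norm_v)


def combined_similarity(vec_a, vec_b, wordnet_score, alpha=0.5):
    """Linear combination of the vector metric and a WordNet-based score.

    alpha is an assumed mixing weight; wordnet_score is assumed to be
    precomputed by some WordNet-based measure.
    """
    return alpha * pearson(vec_a, vec_b) + (1 - alpha) * wordnet_score


# Toy 4-dimensional vectors (the slides use 27 knowledge domains).
car = [1.0, 2.0, 3.0, 4.0]
automobile = [2.0, 4.0, 6.0, 8.0]
score = combined_similarity(car, automobile, wordnet_score=0.5)
```

With these toy inputs the vectors are perfectly correlated, so the score is roughly 0.5 * 1.0 + 0.5 * 0.5 = 0.75.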

  23. Step 1: Query Augmentation
     • Extract keywords from the context surrounding the user-selected text, using the semantic network; the context is typically about 50 words
     • Use a clustering algorithm to construct several queries on different topics
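A rough Python sketch of this step, under several assumptions: the semantic network is replaced by a hypothetical lookup table `TOY_SIMILARITY`, keywords are simply the context words most similar to the selected text, and a greedy threshold grouping stands in for the clustering algorithm the slides mention. The names `augment_queries`, `keep`, and `threshold` are illustrative only.

```python
# Hypothetical word-pair similarities standing in for the semantic network.
TOY_SIMILARITY = {
    frozenset(("jaguar", "car")): 0.9,
    frozenset(("jaguar", "cat")): 0.85,
    frozenset(("jaguar", "engine")): 0.8,
    frozenset(("jaguar", "jungle")): 0.7,
    frozenset(("car", "engine")): 0.9,
    frozenset(("cat", "jungle")): 0.8,
}


def toy_similarity(a, b):
    """Look up a pair in the toy table; unknown pairs score 0."""
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def augment_queries(selected, context_words, similarity, keep=4, threshold=0.5):
    """Build one augmented query per topic found in the context."""
    # 1. Keep the context words most related to the selected text.
    ranked = sorted(context_words, key=lambda w: similarity(selected, w), reverse=True)
    keywords = ranked[:keep]
    # 2. Greedy grouping: a keyword joins the first cluster whose seed word
    #    is similar enough, otherwise it starts a new cluster.
    clusters = []
    for word in keywords:
        for cluster in clusters:
            if similarity(cluster[0], word) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    # 3. Each cluster augments the selected text into a separate query.
    return [" ".join([selected] + cluster) for cluster in clusters]


context = ["the", "car", "engine", "jungle", "cat", "was"]
queries = augment_queries("jaguar", context, toy_similarity)
```

Here `queries` holds one query about the car sense and one about the animal sense of the selected word.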

  24. Step 2: Search Engine Selection
     • IntelliZap is a meta search engine
     • Several general search engines (such as Google, AltaVista)
     • For several domains, specific search engines (such as WebMD, FindLaw) are assigned a priori
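The a priori assignment can be pictured as a static lookup table. In the Python sketch below, the engine names come from the slide, while the domain labels (`"medicine"`, `"law"`), the function name `select_engines`, and the assumption that the query's domains are already known are all hypothetical.

```python
# General engines are always queried (names from the slide).
GENERAL_ENGINES = ["Google", "AltaVista"]

# Assumed a priori mapping from a knowledge domain to specialized engines.
DOMAIN_ENGINES = {
    "medicine": ["WebMD"],
    "law": ["FindLaw"],
}


def select_engines(query_domains):
    """Return the general engines plus any engines assigned to the domains."""
    engines = list(GENERAL_ENGINES)
    for domain in query_domains:
        for engine in DOMAIN_ENGINES.get(domain, []):
            if engine not in engines:  # avoid querying an engine twice
                engines.append(engine)
    return engines


medical_engines = select_engines(["medicine"])
```

A domain without an assigned engine simply falls back to the general engines.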

  25. Step 3: Results Reranking
     • Several lists of results are returned by several search engines
     • Use the semantic network to calculate the distance between result titles/summaries and the text/context
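A minimal Python sketch of merging and reranking, with one substitution stated plainly: Jaccard token overlap is used here in place of the semantic-network distance, which the slides do not specify in detail. The URLs, snippets, and the function names `jaccard` and `rerank` are made up for illustration.

```python
def jaccard(text_a, text_b):
    """Token-overlap similarity; a stand-in for the semantic network."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def rerank(result_lists, context, similarity=jaccard):
    """Merge per-engine result lists and sort by similarity to the context."""
    merged = {}
    for results in result_lists:
        for url, snippet in results:
            merged.setdefault(url, snippet)  # keep the first copy of a URL
    return sorted(
        merged.items(),
        key=lambda item: similarity(item[1], context),
        reverse=True,
    )


# Hypothetical results from two engines as (url, title/summary) pairs.
engine_a = [
    ("http://example.com/cat", "jaguar big cat jungle habitat"),
    ("http://example.com/car", "jaguar car dealer prices"),
]
engine_b = [
    ("http://example.com/car", "jaguar car dealer prices"),
    ("http://example.com/engine", "jaguar car engine review"),
]
ranked = rerank([engine_a, engine_b], context="jaguar car engine")
ranked_urls = [url for url, _ in ranked]
```

The duplicate URL returned by both engines appears once in the merged ranking.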

  26. Evaluation Method
     • State of the art: no benchmark is available
     • Subjects are recruited by an external agency
     • Subjects don't know the objective of the experiments; they are just asked to search and evaluate the results

  27. Experiment Result

  28. Experiment Results cont.

  29. Concerns
     • Privacy and security
       - My Yahoo! holds a database of information on millions of users
       - Users can be monitored through the queries they send!
     • Relevance consistency
       - Communication problem

  30. End
     • Thank you!

  31. Backup Slides

  32. Web Statistics
     • "Accessibility of Information on the Web", Steve Lawrence, Nature 1999

  33. Semantic Relation
     • Hypernymy: the semantic relation of being superordinate or belonging to a higher rank or class. Synonym: superordination
     • Hyponymy: the semantic relation of being subordinate or belonging to a lower rank or class. Synonym: subordination
     • Meronymy: the semantic relation that holds between a part and the whole. Synonym: part-to-whole relation
     • Holonymy: the semantic relation that holds between a whole and its parts. Synonym: whole-to-part relation
     • More at http://dictionary.metor.com/wnet/

  34. Clustering algorithm
     • Traditional clustering algorithms don't work because of the large amount of noise and the small amount of information available
       - 50 context words represented in a 27-dimensional space
     • Special high-dimensional clustering algorithm
       - Performs recurrent clustering analysis (averaging over iterations)
       - Refines the results statistically
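The slides give no details of the recurrent clustering procedure, so the Python sketch below shows only one common way to "average over iterations": run a randomized k-means several times and record how often each pair of points lands in the same cluster (a co-association matrix). The use of k-means, the values of `k`, `runs`, and `seed`, and the 2-dimensional toy points are all assumptions, not the IntelliZap algorithm.

```python
import random


def kmeans_labels(points, k, rng, iterations=10):
    """One randomized k-means run; returns a cluster label per point."""
    centers = rng.sample(points, k)  # random initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # assign each point to its nearest center (squared Euclidean distance)
        labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        # move each center to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def co_association(points, k=2, runs=20, seed=0):
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_labels(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[count / runs for count in row] for row in counts]


# Toy 2-dimensional points (the slides use 50 words in a 27-dimensional space).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
matrix = co_association(points)
```

Pairs with a high averaged value are stable cluster-mates across runs; thresholding the matrix is one simple way to refine the final grouping.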

  35. Limitations of the Web
     • Freshness
     • Coverage (only the publicly indexable web)
     • Bias (sites are not indexed equally)

  36. Several Systems--1
     • Inquirus2: meta search engine
     • Watson Project (Jay Budzik, NWU): uses the contents of full documents being edited in MS Word or viewed in Explorer
     • Remembrance Agent (Bradley Rhodes, MIT): software agent for just-in-time information retrieval

  37. Several Systems--2
     • Outride (renamed in 2001): formerly GroupFire, spun off from Xerox PARC in 2000

  38. Reference
     • [1] Graphics, Visualization and Usability Center, GVU's 10th WWW User Survey, 1998