Information Retrieval Part 2

Information Retrieval Part 2. Sissi 11/17/2008. Information Retrieval cont. Web-Based Document Search Page Rank Anchor Text Document Matching Inverted Lists. Page Rank. PR(A) : the page rank of page A. C(T): the number of outgoing links from page T.


Presentation Transcript


  1. Information Retrieval Part 2 Sissi 11/17/2008

  2. Information Retrieval cont. • Web-Based Document Search • Page Rank • Anchor Text • Document Matching • Inverted Lists

  3. Page Rank • PR(A): the page rank of page A. • C(T): the number of outgoing links from page T. • d: minimum value assigned to any page. • T: a page pointing to A.
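
The equation these symbols belong to did not survive in the transcript. A form consistent with the definitions on this slide and with the worked example on slide 5 (where d = 0.1 and the multiplier is 0.9 = 1 - d) would be the following; the indexed names T_1, ..., T_n for the pages pointing to A are an assumed notation:

```latex
PR(A) = d + (1 - d) \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}
```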

  4. Algorithm of Page Rank • Step 1: Use the PageRank equation to compute the PageRank of each page in the collection, using the latest PageRanks of the other pages. • Step 2: Repeat step 1 until no PageRank changes significantly.

  5. Example • Initial values: PR(A)=PR(B)=PR(C)=1, d=0.1 • In the first iteration: • PR(A)=0.1+0.9*(PR(B)+PR(C)) =0.1+0.9*(1+1) =1.9 • PR(B)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.955 • PR(C)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.955 • Final values (after repeated iterations): PR(A)=1.48, PR(B)=0.76, PR(C)=0.76
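
The iteration can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the link structure (A links to B and C, while B and C each link back to A) is inferred from the arithmetic in the example, and the names `sweep`, `pagerank`, `tol` and `max_sweeps` are hypothetical.

```python
def sweep(pr, links, d=0.1):
    """Update every page once, in place, always using the latest values.

    Returns the largest change made to any PageRank during the sweep.
    """
    biggest_change = 0.0
    for page in pr:
        # Sum PR(T) / C(T) over every page T that links to `page`.
        incoming = sum(pr[t] / len(out) for t, out in links.items() if page in out)
        new_value = d + (1 - d) * incoming
        biggest_change = max(biggest_change, abs(new_value - pr[page]))
        pr[page] = new_value
    return biggest_change


def pagerank(links, d=0.1, tol=1e-9, max_sweeps=1000):
    """Repeat sweeps until no PageRank changes by more than `tol`."""
    pr = {page: 1.0 for page in links}  # initial value 1 for every page
    for _ in range(max_sweeps):
        if sweep(pr, links, d) < tol:
            break
    return pr


# Assumed link structure: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

first = {page: 1.0 for page in links}
sweep(first, links)           # one iteration, starting from all ones
final = pagerank(links)       # iterate until the values settle

print(first)
print(final)
```

Under these assumptions the first sweep gives PR(A) of about 1.9 and PR(B) = PR(C) of about 0.955, and the values settle near 1.47, 0.76 and 0.76.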

  6. Anchor Text • The anchor text is the visible, clickable text in a hyperlink. • For example: • <a href="http://www.wikipedia.org">Wikipedia</a> • The anchor text is Wikipedia; the complex URL http://www.wikipedia.org/ displays on the web page as Wikipedia, contributing to a clean, easy-to-read text or document.
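
To make the distinction between the URL and its anchor text concrete, here is a small sketch that pulls both out of a piece of HTML using Python's standard `html.parser` module. The class name `AnchorTextParser` is hypothetical, and the sketch ignores nested or malformed links.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchor text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text between <a href=...> and </a> counts as anchor text.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = AnchorTextParser()
parser.feed('<a href="http://www.wikipedia.org">Wikipedia</a>')
print(parser.links)  # [('http://www.wikipedia.org', 'Wikipedia')]
```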

  7. Anchor Text • Anchor text usually gives the user relevant descriptive or contextual information about the content of the link’s destination. • The anchor text may or may not be related to the actual text of the URL of the link. • The words contained in the Anchor Text can determine the ranking that the page will receive by search engines.

  8. Common Misunderstanding • Webmasters sometimes misunderstand anchor text. • Instead of turning appropriate words inside a sentence into a clickable link, they frequently insert extra text to carry the link.

  9. Example • Today our troops have liberated another country from tyranny. To know more, click here. (Only the extra words “click here” are the link.) • The more concise way of coding that would be: Today our troops have liberated another country from tyranny. (Words within the sentence itself are the link.)

  10. Anchor Text • This proper method of linking is beneficial not only to users, but also to the webmasters as anchor text holds significant weight in search engine ranking. • Most search engine optimization experts recommend against using “click here” to designate a link.

  11. Google Bomb • In September 2000, the first Google bomb was created by Hugedisk Men’s Magazine, a now-defunct online humor magazine. • It linked the text “dumbmotherfucker” to a site selling George W. Bush-related merchandise. • A Google search for this term would return the pro-Bush online store as its top result. • After a fair amount of publicity, the merchandise site retained lawyers and sent a cease and desist letter to Hugedisk, thereby ending the Google bomb.

  12. Known Google Bombs • A search for “more evil than Satan” returned the home page of Microsoft. • A search for “miserable failure”, “worst president”, or “unelectable” returned the resume of George W. Bush on the White House website. • A search for “out of touch executives” or “out of touch management” returned the home page of Google. • Other, commercial uses

  13. Document Matching • An arbitrarily long document is the query, not just a few key words. • But the goal is still to rank and output an ordered list of relevant documents. • The most similar documents are found using the measures described earlier.

  14. Generalization of searching • Matching a document to a collection of documents looks like a tedious and expensive operation. • Even for a short query, comparison against every large document in the collection is a computationally intensive task.

  15. Example of document matching • Consider an online help desk, where a complete description of a problem is submitted. • That document could be matched to stored documents, hopefully finding descriptions of similar problems and solutions without having the user experiment with numerous key word searches.
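
A help-desk matcher of this kind might be sketched as follows. The slides refer to similarity measures "described earlier" that are not part of this transcript, so cosine similarity over raw word counts is used here purely as a stand-in assumption; the ticket names and texts, and the function names `word_counts`, `cosine` and `rank`, are all hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count its words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query_doc, stored_docs):
    """Return (score, name) pairs, most similar stored document first."""
    q = word_counts(query_doc)
    scored = [(cosine(q, word_counts(text)), name) for name, text in stored_docs.items()]
    return sorted(scored, reverse=True)


# Hypothetical stored problem descriptions.
stored = {
    "ticket-1": "Printer does not print after driver update; reinstalling the driver fixed it.",
    "ticket-2": "Cannot connect to wifi after password change; re-enter the new password.",
    "ticket-3": "Screen flickers when laptop is on battery power.",
}

# The whole problem description is the query, not a few key words.
query = "After the latest driver update my printer will not print any pages."

ranking = rank(query, stored)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

Note that every stored document is compared with the query here, which is exactly the cost that slide 14 warns about.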

  16. Summary • Search engines and document matchers are not focused on classification of new documents. • Their primary goal is to retrieve the most relevant documents from a collection of stored documents.

  17. Inverted Lists • What is an inverted list? • Instead of documents pointing to words, a list of words pointing to documents is the primary internal representation for processing queries and matching documents.
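
As a rough illustration of "words pointing to documents", the sketch below turns a small document collection into inverted lists. The function name `build_inverted_lists` and the three sample documents are hypothetical, and splitting on whitespace is a deliberately simple tokenizer.

```python
from collections import defaultdict


def build_inverted_lists(docs):
    """Map each word to the sorted IDs of the documents that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


# Hypothetical collection: document ID -> text.
docs = {
    1: "page rank of a page",
    2: "anchor text of a link",
    3: "rank documents",
}

index = build_inverted_lists(docs)
print(index["rank"])  # [1, 3]
print(index["of"])    # [1, 2]
```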

  18. Inverted Lists

  19. Example • Suppose the query contains words 100 and 200. • First process the list W(100) to update the similarity S(i) of each document i on that list: S(1)=0+1 S(2)=0+1 … • Then process W(200) in the same way: S(2)=1+1 …
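
The scoring pass in this example can be sketched as below. The postings are hypothetical and contain only the documents the slide actually shows (documents 1 and 2 for word 100, document 2 for word 200); the entries elided with "…" are left out, and adding 1 per matching word is the simple count used on the slide.

```python
from collections import defaultdict

# Hypothetical inverted lists: word ID -> IDs of documents containing the word.
inverted = {
    100: [1, 2],
    200: [2],
}


def score(query_words, inverted):
    """Add 1 to S(i) for every query word that appears in document i."""
    scores = defaultdict(int)  # S(i) starts at 0 for every document
    for word in query_words:
        # Only documents on this word's list are touched at all.
        for doc in inverted.get(word, []):
            scores[doc] += 1
    return dict(scores)


scores = score([100, 200], inverted)
print(scores)  # {1: 1, 2: 2}
```

Documents that share no word with the query never appear in the loop, which is where the efficiency comes from.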

  20. Summary • The inverted list is the key to the efficiency of information retrieval systems. • The inverted list has helped make nearest-neighbor methods a pragmatic possibility for prediction.

  21. Conclusion • Information retrieval methods are specialized nearest-neighbor methods, which are well-known prediction methods. • IR methods typically process unlabeled data, then order and display the retrieved documents. • IR methods involve no training and induce no new rules for classification.

  22. Thank You!
