
Authoritative Sources in a Hyperlinked Environment

Authoritative Sources in a Hyperlinked Environment. By: Jon M. Kleinberg Presented by: Yemin Shi CS-572 June 30 2011. Ranking for searching results.


Presentation Transcript


  1. Authoritative Sources in a Hyperlinked Environment By: Jon M. Kleinberg Presented by: Yemin Shi CS-572 June 30 2011

  2. Ranking for search results • Modern search engines may return millions of pages for a single query. That is far too many for a human user to preview, so we need a method to filter out a small set of the most authoritative results. • A ranking method processes the query results and puts the most useful information at the top of the list. • Link-based methods focus on the way pages reference one another and provide an efficient way to filter the authoritative results. • Queries: • Specific queries. E.g. “What does Dr. Chris Mattmann think of the presentations between 3:30 and 5:00 PM PDT on June 30, 2011?” – very few pages contain the answer, and it is difficult to identify them. • Broad-topic queries. E.g. “java” – too many pages; it is difficult for a traditional text-based search engine to find the authoritative ones. • Similar-page queries. E.g. “find pages similar to java” – similar to broad-topic queries.

  3. Relation to class material • HITS stands for Hypertext Induced Topic Search. • HITS was a pioneering link-based ranking method and is one of the major web ranking models mentioned in class. • This presentation goes into the details of how to calculate the “authority” and “hub” scores of pages, which were mentioned in class. • We will compare HITS with the other link-based algorithm, PageRank. • We will evaluate the pros and cons of the paper.

  4. Outline • Link-based algorithms • HITS algorithm • Constructing a Focused Subgraph of the WWW • Computing Hubs and Authorities • Comparison with PageRank • Extensions • Similar-Page Queries (modification) • Social Networks/Scientific Citations • Multiple Sets of Hubs and Authorities • Diffusion and generalization • Evaluation • Pros and Cons of the paper

  5. Link based ranking algorithm • Challenges of text-based ranking • www.harvard.edu is the most authoritative page for the query “harvard”. However, other pages may contain the keyword “harvard” more often. • Pages are not sufficiently self-descriptive: e.g. for the query “search engine”, Google does not use the term on its own pages. • The number of pages is too large to preview.

  6. Link based ranking algorithm • Links encode latent human judgment. • The creator of page p, by including a link to page q, has in some measure conferred authority on q. Pages need not be self-descriptive. • The authority criterion must balance relevance and popularity: for the query “automobile” we want the VW, Benz, and BMW web pages, but www.yahoo.com also has a large in-degree while lacking thematic unity with the query.

  7. Link based ranking algorithm • Authority: An authority is a page with many in-links. • The page may have good or authoritative content on some topic, so many people trust it and link to it. • Hub: A hub is a page with many out-links. • The page serves as an organizer of the information on a particular topic and points to many good authority pages on the topic.

  8. Link based ranking algorithm • PageRank (Brin & Page 1998): • Computed for all web pages before any query (query independent). • Computes authority only. • Fast at query time. • HITS • Performed on the set of web pages retrieved for each query (query dependent). • Computes both authorities and hubs. • More calculation needed; slow for real-time queries.

  9. HITS Algorithm • Step 1: Constructing a Focused Subgraph of the WWW. Requirements: Sq (the collection of pages with respect to query q) is small; Sq is rich in relevant pages; Sq contains most of the strongest authorities.
     Subgraph(q, E, t, d)
       q: a query string
       E: a text-based search engine   /* narrow down: from AltaVista */
       t, d: natural numbers
       Let Rq denote the top t results of E on q.
       Set Sq := Rq
       For each page p in Rq:   /* expanding */
         Add all pages that p points to into Sq.
         Add all pages that point to p into Sq. (If the number of these pages is greater than d, randomly select d of them and add those to Sq.)
         /* Limit: a single page in Rq can bring in at most d pages that point to it; otherwise it could bring in hundreds of thousands of extra pages. */
       /* Remove intrinsic links (used for website navigation); for anti-collusion, allow at most m pages from a single domain to point to any given page. */
       Return Sq
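
The expansion step above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the root set is passed in directly instead of being fetched from a search engine, the link structure is given as two hypothetical dictionaries (`out_links`, `in_links`), and the removal of intrinsic links and the per-domain limit m are left out.

```python
import random


def focused_subgraph(root_set, out_links, in_links, d, seed=0):
    """Expand the root set Rq into the base set Sq.

    root_set  : top-t pages returned by the text search engine (assumed given)
    out_links : dict mapping a page to the set of pages it points to
    in_links  : dict mapping a page to the set of pages that point to it
    d         : maximum number of in-linking pages a single root page may add
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    base = set(root_set)
    for p in root_set:
        # Every page that p points to joins the base set.
        base |= out_links.get(p, set())
        # Pages pointing to p join too, capped at d randomly chosen pages.
        parents = sorted(in_links.get(p, set()))
        if len(parents) > d:
            parents = rng.sample(parents, d)
        base.update(parents)
    return base


# Toy data: page "a" has three in-linking pages, but d = 2 admits only two.
out_links = {"a": {"x", "y"}, "b": {"y"}}
in_links = {"a": {"h1", "h2", "h3"}, "b": {"h1"}}
s_q = focused_subgraph(["a", "b"], out_links, in_links, d=2)
```

In the toy data, "h1" always ends up in the result because it also points to "b"; which of the other pages pointing to "a" are admitted depends on the random draw.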

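  10. HITS Algorithm • Step 2: Computing Hubs and Authorities. Rules: A good hub points to many good authorities. A good authority is pointed to by many good hubs. Authorities and hubs have a mutually reinforcing relationship. Let the authority score of page i be x(i) and the hub score of page i be y(i). The mutually reinforcing relationship is expressed by two update steps. I step: $x(i) = \sum_{j:\, j \to i} y(j)$, the sum of the hub scores of the pages that point to i. O step: $y(i) = \sum_{j:\, i \to j} x(j)$, the sum of the authority scores of the pages that i points to.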

  11. HITS Algorithm [Figure: example graph in which pages 2, 3, and 4 point to page 1, and page 1 points to pages 5, 6, and 7.] y(1) = x(5) + x(6) + x(7); x(1) = y(2) + y(3) + y(4)
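
To make the two update steps concrete, here is a small Python sketch that applies one I step and one O step to the seven-page example. The edge list is read off the two equations on this slide; the function names `i_step` and `o_step` and the choice to start from all-ones scores are assumptions made for illustration.

```python
# Edges (source, target): pages 2, 3, 4 point to page 1; page 1 points to 5, 6, 7.
edges = [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6), (1, 7)]
nodes = range(1, 8)


def i_step(edges, y):
    """Authority update: x(i) is the sum of y(j) over pages j that link to i."""
    x = {n: 0.0 for n in y}
    for src, dst in edges:
        x[dst] += y[src]
    return x


def o_step(edges, x):
    """Hub update: y(i) is the sum of x(j) over pages j that i links to."""
    y = {n: 0.0 for n in x}
    for src, dst in edges:
        y[src] += x[dst]
    return y


# Start from uniform hub scores and apply each step once.
y = {n: 1.0 for n in nodes}
x = i_step(edges, y)  # x[1] = y(2) + y(3) + y(4)
y = o_step(edges, x)  # y[1] = x(5) + x(6) + x(7)
```

After these two calls, page 1 is the only page with authority score above 1, because it is the only page with three in-links.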

  12. HITS Algorithm

  13. HITS Algorithm • Recap: • If A is a square matrix, a non-zero vector v is an eigenvector of A if there is a scalar λ such that Av = λv.
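
The eigenvector recap matters because the I and O steps can be written as matrix products. The following is a sketch under an assumed notation: here $A$ denotes the adjacency matrix of the focused subgraph, with $A_{ij} = 1$ when page i links to page j, and x and y are the vectors of authority and hub scores.

```latex
\begin{aligned}
\text{I step:}\quad x &\leftarrow A^{T} y, &\qquad \text{O step:}\quad y &\leftarrow A x \\
\text{so one full round gives}\quad x &\leftarrow A^{T} A \, x, &\qquad y &\leftarrow A A^{T} y
\end{aligned}
```

With normalization after every round, repeating these updates is power iteration, so under the conditions stated in Kleinberg's paper x converges to the principal eigenvector of $A^{T}A$ and y to the principal eigenvector of $AA^{T}$.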

  14. HITS Algorithm

  15. HITS Algorithm

  16. HITS Algorithm • The Iterate(G,k) procedure can be applied to filter out the top c authorities and top c hubs.
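
A minimal Python sketch of Iterate(G,k) followed by the top-c filter is shown below. It follows the procedure as described in Kleinberg's paper (start from all-ones vectors, apply the I step, then the O step, then normalize so the squared scores sum to 1), but the graph representation, the function names, and the toy graph are assumptions made for illustration.

```python
import math


def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {n: s / norm for n, s in scores.items()}


def iterate(nodes, edges, k):
    """Run k rounds of I step, O step, and normalization."""
    x = {n: 1.0 for n in nodes}  # authority scores
    y = {n: 1.0 for n in nodes}  # hub scores
    for _ in range(k):
        new_x = {n: 0.0 for n in nodes}
        for src, dst in edges:   # I step: collect hub scores of in-linking pages
            new_x[dst] += y[src]
        new_y = {n: 0.0 for n in nodes}
        for src, dst in edges:   # O step: collect the new authority scores
            new_y[src] += new_x[dst]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


def filter_top(nodes, edges, k, c):
    """Return the c pages with the largest authority and hub scores."""
    x, y = iterate(nodes, edges, k)
    authorities = sorted(nodes, key=lambda n: -x[n])[:c]
    hubs = sorted(nodes, key=lambda n: -y[n])[:c]
    return authorities, hubs


# Hypothetical toy graph: h1 links to three pages, h2 to two, h3 to one.
nodes = ["h1", "h2", "h3", "a1", "a2", "a3"]
edges = [("h1", "a1"), ("h1", "a2"), ("h1", "a3"),
         ("h2", "a1"), ("h2", "a2"),
         ("h3", "a1")]
authorities, hubs = filter_top(nodes, edges, k=20, c=2)
```

On this toy graph the filter reports "a1" and "a2" as the top authorities and "h1" and "h2" as the top hubs. A small fixed k is used here for simplicity; checking for convergence between rounds would be a natural refinement.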

  17. HITS Results • www.roadahead.com is ranked only 123rd by AltaVista. • Text-based search alone misses the authorities. • Text-based search plus link analysis works, even for authorities that do not contain many occurrences of the query string “Gates”.

  18. Related work • Similar-page queries: • Instead of “find t pages containing the string q”, • ask “find t pages pointing to p”. • Honda → Ford, Toyota, etc. • Social networks • Measure of standing by path counting (Katz) • Scientific citations • Multiple sets of hubs and authorities • The same query string can correspond to different meanings.

  19. Multiple set of Hubs and Authorities

  20. Highlights of the method • Developed a set of algorithmic tools for extracting information from the link structure of hyperlinked environments. • Formulated the notion of authority based on the relationship between a set of “authority” pages and a set of “hub” pages. • Proposed a heuristic algorithm to find these pages. • Surveyed variants and applications.

  21. Evaluation: HITS vs. PageRank • Eigengaps • The eigengap is the difference between the largest and second-largest eigenvalues of the matrix M. • Ng et al. (2001) compared the stability of the two algorithms under perturbation. Idea: The Cora database is a collection containing the citation (similar to link) information from several thousand papers in AI. If an article is truly authoritative or influential, then the addition or removal of a few links or citations should not change our minds about its having been influential. Based on this idea, Ng et al. constructed a set of five perturbed databases, in each of which 30% of the papers from the base set were randomly deleted.

  22. Evaluation: HITS vs. PageRank • HITS • PageRank

  23. Evaluation: HITS vs. PageRank • The eigenvectors of the matrices are indicated by the directions of the principal axes of the ellipses. • When the eigengap is small, a small perturbation can rotate the principal eigenvector by 45 degrees. When the eigengap is large, the same perturbation causes almost no change.

  24. Evaluation: Pros • Creative idea of splitting the authority concept into “Authority” and “Hub”, especially for 1998. • Efficient heuristic algorithm to solve for the authority weights and hub weights. • Query-driven dynamic ranking. • Solid theoretical background. • Abundant variants and applications.

  25. Evaluation: Cons • The result is not as robust as PageRank's when the link structure is perturbed. • Topic drift. • Inefficiency at run time. • User behavior information is not integrated.

  26. References • J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46 (1999). Also appears as IBM Research Report RJ 10076, May 1997. • A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. Proc. 24th International Conference on Research and Development in Information Retrieval (SIGIR), New York, NY: ACM Press, 2001. • Wikipedia: www.wikipedia.org

  27. Questions? • Thanks for your time!
