
Authoritative Sources in a Hyperlinked Environment


Presentation Transcript


  1. Authoritative Sources in a Hyperlinked Environment Presented by Lokesh Chikkakempanna

  2. Agenda • Introduction. • Central issue. • Queries. • Constructing a focused subgraph. • Computing hubs and authorities. • Extracting authorities and hubs. • Similar-page queries. • Conclusion.

  3. Introduction • The process of discovering pages that are relevant to a particular query. • A hyperlinked environment can be a rich source of information. • Analyzing the link structure of the WWW environment. • The WWW is a hypertext corpus of enormous complexity, and it continues to expand at a very fast rate. • High-level structure can only emerge through a complete analysis of the WWW environment.

  4. Central Issue • Distillation of broad search topics through the discovery of “authoritative” information sources. • Link analysis for discovering “authoritative” pages. • Improving the quality of search methods on the WWW is a rich and interesting problem, because a solution must be efficient in both computation and storage. • What should a typical search tool compute in the extra time it takes to produce results of greater value to the user? • There is no concretely defined objective function that corresponds to human notions of quality.

  5. Queries • Types of queries: • Specific queries: lead to the scarcity problem. • Broad-topic queries: lead to the abundance problem. • From a huge set of relevant pages, filter out and provide a small set of the most “authoritative” or “definitive” ones.

  6. Problems in identifying authorities • Example: “harvard”. • There are over a million pages on the web that use the term “harvard”. • Recall “TF” (term frequency). • How do we circumvent this problem?

  7. Link analysis • Human judgement is needed to formulate the notion of authority. • If a person includes a link to page q in page p, he has conferred authority on q in some measure. • What are the problems in this?

  8. Links may be created for various reasons. • Examples: • For navigational purposes. • Paid advertisements. • A hacker may create a bot that keeps adding links to all pages. • Solution?

  9. Link-based model for the conferral of authority • Identifies relevant, authoritative WWW pages for broad search topics. • Based on the relationship between authorities and hubs. • Exploits the equilibrium between authorities and hubs to develop an algorithm that identifies both types of pages simultaneously.

  10. The algorithm operates on a focused subgraph produced by a text-based search engine. • It produces a small collection of pages likely to contain the most authoritative pages for a given topic. • Example: AltaVista.

  11. Constructing a focused subgraph of the WWW • We can view any collection V of hyperlinked pages as a directed graph G = (V, E). • The nodes correspond to the pages. • An edge (p, q) indicates the presence of a link from p to q. • Construct a subgraph of the WWW on which the algorithm operates.

  12. The goal is to focus the computational effort on relevant pages. • (i) S(sigma) is relatively small. • (ii) S(sigma) is rich in relevant pages. • (iii) S(sigma) contains most (or many) of the strongest authorities. • How do we find such a collection of pages?

  13. Take the “t” highest-ranked pages for the query (sigma) from a text-based search engine. • These “t” pages are referred to as the root set R(sigma). • The root set satisfies conditions (i) and (ii). • It is far from satisfying (iii). Why?

  14. There are often extremely few links between pages in R(sigma), rendering it essentially structureless. • Example: the root set for the query “java” contained 15 links between pages in different domains, out of 200 * 199 possible links (t = 200).

  15. We can use the root set R(sigma) to produce an S(sigma) that satisfies all three conditions. • A strong authority may not be in the set R(sigma), but it is likely to be pointed to by at least one page in R(sigma). • Subgraph(sigma, E, t, d) • sigma: a query string; E: a text-based search engine; t and d: natural numbers.

  16. S(sigma) is obtained by growing R(sigma) to include any page pointed to by a page in R(sigma) and any page that points to a page in R(sigma). • A single page in R(sigma) brings at most d pages into S(sigma). • Does this S(sigma) contain authorities?
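A minimal Python sketch of this construction, assuming hypothetical helpers text_search, outlinks, and inlinks in place of the text-based search engine and a link index (these names are not from the presentation):

    def build_base_set(query, text_search, outlinks, inlinks, t=200, d=50):
        """Grow the root set R(sigma) into the base set S(sigma)."""
        root_set = text_search(query, limit=t)       # R(sigma): the t highest-ranked pages
        base_set = set(root_set)
        for page in root_set:
            base_set.update(outlinks(page))          # every page a root page points to
            base_set.update(inlinks(page, limit=d))  # at most d pages pointing to it
        return base_set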

  17. Heuristics to reduce S(sigma) • Two types of links: • Transverse: between pages with different domain names. • Intrinsic: between pages with the same domain name. • Remove all the intrinsic links to get a graph G(sigma).

  18. A large number of pages from a single domain may all point to a single page p. • This is often because of advertisements. • Allow only m ≈ 4-8 pages from a single domain to point to any given page p. • G(sigma) now contains many relevant pages and strong authorities.
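A sketch of these two heuristics under the assumption that pages are URL strings and the links of S(sigma) are given as (p, q) pairs; domain_of and prune_links are illustrative names only:

    from collections import defaultdict
    from urllib.parse import urlparse

    def domain_of(url):
        return urlparse(url).netloc

    def prune_links(links, m=6):
        """Drop intrinsic links, then keep at most m in-links per domain for any target."""
        # keep only transverse links (endpoints in different domains)
        transverse = [(p, q) for p, q in links if domain_of(p) != domain_of(q)]
        kept, count = [], defaultdict(int)
        for p, q in transverse:
            key = (domain_of(p), q)      # links from one domain to one target page
            if count[key] < m:           # m is roughly 4-8 in the presentation
                count[key] += 1
                kept.append((p, q))
        return kept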

  19. Computing hubs and authorities • Extracting authorities based on maximum in-degree does not work. • Example: for the query “java”, the largest in-degree pages consisted of www.gamelan.com and java.sun.com, together with advertising pages and the home page of Amazon. • While the first two are good answers, the others are not relevant to the query.

  20. Authoritative pages relevant to the initial query should not only have large in-degree; • since they are all authorities on a common topic, there should be considerable overlap in the sets of pages that point to them. • Thus, in addition to authorities, we should find what are called hub pages. • Hub pages: pages that have links to multiple relevant authoritative pages.

  21. Hub pages allow us to throw away unrelated pages with high in-degree. • Mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs. • We must break this circularity to identify hubs and authorities. • How?

  22. An iterative algorithm • Maintains and updates numerical weights for each page. • Each page p is associated with a non-negative authority weight x^p and a non-negative hub weight y^p. • Each type of weight is normalized so that its squares sum to 1. • Pages with larger x and y values are considered better authorities and hubs, respectively. • Two operations update the weights.

  23. The first (I) operation updates the authority weights: x^p is set to the sum of y^q over all pages q that link to p, i.e. over all q with (q, p) in E. The second (O) operation updates the hub weights as follows: y^p is set to the sum of x^q over all pages q that p links to, i.e. over all q with (p, q) in E.

  24. The set of authority weights is represented as a vector x with a coordinate for each page in G(sigma); similarly, the set of hub weights is represented as a vector y.
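In matrix form, with A the adjacency matrix of G(sigma) (A[p, q] = 1 when page p links to page q), the I operation is x <- A^T y and the O operation is y <- A x. A small numpy illustration of one round of updates on a made-up 3-page graph:

    import numpy as np

    # Toy adjacency matrix of G(sigma): A[p, q] = 1 if page p links to page q.
    A = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)

    x = np.ones(3)   # authority weights, one coordinate per page
    y = np.ones(3)   # hub weights

    x = A.T @ y      # I operation: x^p <- sum of y^q over pages q linking to p
    y = A @ x        # O operation: y^p <- sum of x^q over pages q that p links to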

  25. Iterate(G, k)
      G: a collection of n linked pages
      k: a natural number
      Let z denote the vector (1, 1, 1, ..., 1) in R^n.
      Set x_0 := z. Set y_0 := z.
      For i = 1, 2, ..., k:
        Apply the I operation to (x_{i-1}, y_{i-1}), obtaining new x-weights x'_i.
        Apply the O operation to (x'_i, y_{i-1}), obtaining new y-weights y'_i.
        Normalize x'_i, obtaining x_i.
        Normalize y'_i, obtaining y_i.
      End
      Return (x_k, y_k).
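A compact Python sketch of Iterate, representing G by its n x n adjacency matrix A; this is one reading of the pseudocode above, not code from the presentation:

    import numpy as np

    def iterate(A, k):
        """Iterate(G, k): return the authority and hub weight vectors (x_k, y_k)."""
        n = A.shape[0]
        x = np.ones(n)                    # x_0 := z = (1, ..., 1)
        y = np.ones(n)                    # y_0 := z
        for _ in range(k):
            x = A.T @ y                   # I operation on (x_{i-1}, y_{i-1})
            y = A @ x                     # O operation on (x'_i, y_{i-1})
            x = x / np.linalg.norm(x)     # normalize so the squares sum to 1
            y = y / np.linalg.norm(y)
        return x, y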

  26. Filter out the top c authorities and the top c hubs.
      Filter(G, k, c)
      G: a collection of n linked pages
      k, c: natural numbers
      (x_k, y_k) := Iterate(G, k).
      Report the pages with the c largest coordinates in x_k as authorities.
      Report the pages with the c largest coordinates in y_k as hubs.
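A sketch of Filter built on the iterate sketch above; pages is an assumed list mapping matrix row indices back to page identifiers:

    import numpy as np

    def filter_pages(A, pages, k=20, c=5):
        """Filter(G, k, c): report the top-c authorities and top-c hubs."""
        x_k, y_k = iterate(A, k)                                     # iterate sketch from slide 25
        authorities = [pages[i] for i in np.argsort(x_k)[::-1][:c]]  # c largest x_k coordinates
        hubs = [pages[i] for i in np.argsort(y_k)[::-1][:c]]         # c largest y_k coordinates
        return authorities, hubs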

  27. Filter is applied with G set equal to G(sigma) and c ≈ 5-10. • With arbitrarily large values of k, the sequences of vectors {x_k} and {y_k} converge to fixed points x* and y*. • What is R^n in the Iterate algorithm? • The eigenspace associated with λ, where λ is an eigenvalue of an n x n matrix M with the property that Mω = λω for some nonzero vector ω.
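As an illustration of the convergence claim (in the standard analysis of this scheme, x* coincides with the principal eigenvector of A^T A, and y* with that of A A^T), a small numpy check reusing the iterate sketch and the toy matrix from the earlier examples; both are assumptions for the sake of the demonstration:

    import numpy as np

    A = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)

    x_star, y_star = iterate(A, 100)     # many iterations: effectively converged

    # Principal eigenvector of A^T A, i.e. the one with the largest eigenvalue.
    vals, vecs = np.linalg.eigh(A.T @ A)
    principal = np.abs(vecs[:, np.argmax(vals)])

    print(np.allclose(x_star, principal, atol=1e-6))   # expected: True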

  28. Similar-Page Queries • The algorithm discussed can be applied to another type of problem. • Using the link structure to infer a notion of “similarity” among pages. • We begin with a page p and pose the request “Find t pages pointing to p”.
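A minimal sketch of forming the root set for a similar-page query, assuming the same hypothetical inlinks helper as before; from this root set, the subgraph construction and the iterate/filter computation proceed exactly as for broad-topic queries:

    def similar_page_root_set(p, inlinks, t=200):
        """Root set for the request 'Find t pages pointing to p'."""
        return set(inlinks(p, limit=t))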

  29. Conclusion • The approach developed here might be integrated into a study of traffic patterns on the WWW. • Future work could extend the approach to queries other than broad-topic queries. • It would be interesting to understand eigenvector-based heuristics completely in the context of the algorithms presented here.

  30. Thank You!
