
Ch 14. Link Analysis




  1. Padmini Srinivasan Computer Science Department http://cs.uiowa.edu/~psriniva padmini-srinivasan@uiowa.edu Ch 14. Link Analysis

  2. Web Search • Hard problem • Hats off to ‘information retrieval’ • Complex information needs • Keywords • Synonyms, polysemy (multiple meanings) • True homonyms: row (with an oar) vs. row (argue); delta (Greek letter) vs. delta (of a river) • Polysemous homonyms: mouth (of a river) vs. mouth (of an animal); right-‘hand’ person vs. ‘hand’ it to me • The age of intermediaries (BRS After Dark) • Diversity in writing + diversity in queries + diversity in indexing + diversity in motivations • Controlled vocabularies vs. free text • Majority rule? ‘Cornell’

  3. Web Search Peculiarities • Compared to the good old days • Needle in a haystack problem; many needles in many haystacks! Which ones to look for? • How distinct is this from the “traditional” methods for IR? Libraries etc. • Can we do without libraries? • Quality – a serious question? • Does redundancy promote quality? • Does collaboration promote quality? • Scale • Retrieve and FILTER/ORGANIZE • Satisfying versus satisficing

  4. Link Analysis • In-links and out-links; in-degree and out-degree • A matter of endorsement! (directional) • Akin to citations • What are the differences? Must a page link out? • Power laws all the way through!
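As a small illustration of in-degree and out-degree, the sketch below counts both from a list of directed links. The function name `degrees` and the toy graph are hypothetical, chosen only for this example.

```python
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) counters for directed (source, target) links."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1  # the source page endorses the target
        in_degree[target] += 1   # the target page receives the endorsement
    return in_degree, out_degree

# Hypothetical toy graph: A links to B and C, B links to C.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
in_degree, out_degree = degrees(edges)
print(in_degree["C"], out_degree["A"])  # prints: 2 2
```

Because links are directional, C is endorsed twice here without endorsing anyone, which is the asymmetry the slide points to.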

  5. Some studies • Kumar et al. (1999): Alexa web crawl from 1997, over 40 million nodes. Trawling the Web for emerging cyber-communities, Proc. 8th WWW, Apr 1999 • Probability that a page has in-degree k is proportional to 1/k^2 • Probability that a page has in-degree at least k is proportional to 1/k • Actual exponent slightly larger than 2 • Barabási and Albert (1999): studied the U. Notre Dame web site, with some extensions
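The two probabilities on this slide are consistent with each other. A rough sketch of why, approximating the tail sum by an integral (the approximation is an assumption that holds for large k):

```latex
P(\text{in-degree} = k) \;\propto\; \frac{1}{k^{2}}
\quad\Longrightarrow\quad
P(\text{in-degree} \ge k) \;\propto\; \sum_{j \ge k} \frac{1}{j^{2}}
\;\approx\; \int_{k}^{\infty} \frac{dx}{x^{2}}
\;=\; \frac{1}{k}.
```

With an exponent slightly larger than 2, the same argument gives a tail that falls off slightly faster than 1/k.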

  6. Broder et al., Graph Structure in the Web • Note that the exponent is different • Note also the deviation at the low end of the out-degree distribution

  7. Fractals? • Broder et al.: “almost fractal like quality for the power law in-degree and out-degree distributions, in that it appears both as a macroscopic phenomenon on the entire web, as a microscopic phenomenon at the level of a single university website, and at intermediate levels between these two.” • Source: Graph Structure in the Web

  8. Similar Studies • Donato et al., The Web as a Graph: How Far We Are, ACM TOIT, 2007 • In-degree: power law; exponent 2.1 (Fig. 4) • Out-degree: power-law fit not so good (Fig. 5) • Check out Fig. 8: distribution of strongly connected components (number of SCCs versus size of SCC). Power law; exponent 2.09 • Data: WebBase, Stanford crawl of 200 million pages (2001) • 39% OUT; 11% IN; 13% tendrils; 33% SCC (48 million); next-largest SCC: 10 thousand!

  9. Hubs & Authorities • In-links: votes • HITS algorithm: Hyperlink-Induced Topic Search • A good hub is one that points to good authorities [lists; directories] • A good authority is one that is pointed to by good hubs • A good hub need not be an authority, and vice versa • Authorities: those who have knowledge; hubs: those who know well who has knowledge • Dynamic estimation: repeated application of update rules. Converges!

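  10. Algorithm • First conduct retrieval, then compute hubs and authorities on the retrieved (relevant) set • Rank the retrieved set as a list of hubs and a list of authorities • Initialize hub and authority scores (say all to 1, or some other positive number) • Apply the authority score update rule: a page's authority score becomes the sum of the hub scores of the pages that point to it • Apply the hub score update rule: a page's hub score becomes the sum of the authority scores of the pages it points to • Repeat the two updates • Example: Fig. 14.15 and 14.18 (problem 3)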
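The steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the deck's own code: the function name `hits`, the fixed iteration count, and the toy graph are assumptions. The per-round normalization is also an added assumption, used only to keep the numbers from growing without bound.

```python
def hits(edges, iterations=50):
    """Sketch of hub/authority scoring on directed (source, target) links."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}   # initialize all scores to 1
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages pointing to each page.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        # Hub update: sum the (new) authority scores of the pages each page points to.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        # Assumption: rescale so each score list sums to 1 (only ratios matter).
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(hub.values()) or 1.0
        auth = {node: value / auth_total for node, value in auth.items()}
        hub = {node: value / hub_total for node, value in hub.items()}
    return hub, auth

# Hypothetical toy graph: h1 points to a1 and a2, h2 points only to a1.
hub, auth = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
print(sorted(auth, key=auth.get, reverse=True)[0])  # a1 is the top authority
```

In this toy graph a1 ends up as the strongest authority because both hubs point to it, and h1 ends up as the stronger hub because it points to both authorities.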

  11. It's all about convergence • First show how the update rules work with the matrices M and M^T (M transpose) • Then show the same using eigenvectors • Then show that the initialization of hub scores really does not matter, as long as it is a positive vector, i.e., all hub scores are initialized to a positive number
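A sketch of the argument outlined on this slide, under assumed notation that does not appear in the transcript: M is the adjacency matrix with M_{ij} = 1 when page i links to page j, and h^{(k)}, a^{(k)} are the hub and authority vectors after k rounds. The strict gap between the two largest eigenvalues is also an assumption.

```latex
% Update rules in matrix form, and what k rounds amount to.
\begin{align*}
a^{(k)} &= M^{T} h^{(k-1)}, & h^{(k)} &= M\, a^{(k)} \\
h^{(k)} &= (M M^{T})^{k}\, h^{(0)}, & a^{(k)} &= (M^{T} M)^{k-1} M^{T} h^{(0)}
\end{align*}
% M M^T is symmetric, so it has orthonormal eigenvectors z_1, ..., z_n with
% real eigenvalues c_1, ..., c_n; assume |c_1| > |c_2| >= ... >= |c_n|.
\begin{align*}
h^{(0)} = \sum_{i=1}^{n} q_i z_i
\quad\Longrightarrow\quad
\frac{h^{(k)}}{c_1^{k}}
= q_1 z_1 + \sum_{i=2}^{n} q_i \left(\frac{c_i}{c_1}\right)^{k} z_i
\;\xrightarrow[k \to \infty]{}\; q_1 z_1
\end{align*}
```

The limiting direction z_1 does not depend on h^{(0)}, provided q_1 is nonzero. A positive starting vector guarantees that, because z_1 can be chosen with non-negative entries, so its inner product with any positive vector is positive.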

  12. PageRank • Endorsements repeatedly move through out-links: A → B • Principle of repeated improvement: • Weight of the ‘current’ endorsement depends on the ‘current’ estimate of A’s PageRank • More important nodes convey higher endorsements • Values stabilize, until the network changes

  13. Calculation • Initialize: each node has PageRank = 1/n, where n is the number of nodes • Basic PageRank Update Rule: • A node divides its PageRank equally over its out-links. If it has no out-links, it keeps its PageRank. • The new PageRank of a node = sum of the PageRank shares it receives in that iteration. • Total PageRank stays constant, so no need for normalizing. • Iterate until convergence or for a fixed number of iterations.
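The basic update rule can be sketched as below. The function name `basic_pagerank`, the dictionary representation, and the toy graph are assumptions made for illustration. The sketch also assumes every linked-to node appears as a key in `out_links`.

```python
def basic_pagerank(out_links, iterations=50):
    """Sketch of the basic PageRank update; out_links maps node -> list of targets."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # initialize to 1/n
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node, targets in out_links.items():
            if targets:
                # Divide the node's PageRank equally over its out-links.
                share = rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no out-links keeps its PageRank.
                new_rank[node] += rank[node]
        rank = new_rank
    return rank

# Hypothetical toy graph: A -> B, A -> C, B -> C, C -> A.
rank = basic_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({node: round(value, 3) for node, value in rank.items()})
# prints: {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

The total stays at 1 throughout because every unit of PageRank is either passed along or kept, which is why no normalization step appears.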

  14. Equilibrium • No further changes in PageRanks • Degenerate cases exist (addressed by the Scaled PageRank Update Rule) • Equilibrium values need not be unique, except where the network is strongly connected.

  15. Slow leaks?

  16. Scaled PageRank Update Rule • Scaling factor s: between 0 and 1, generally between 0.8 and 0.9 • First apply the basic PageRank update rule. Then, for each page: • Scale down all values by s (say 0.9), so each page gets 0.9 * PageRank • Total PageRank is now s • Divide the remaining PageRank (1 - s) equally over all nodes • Get a unique set of values for each setting of s [shown later in proofs] • Random walk model [browsing, not searching]: probability of reaching a page = prob(arriving via an in-link) + prob(getting there at random)
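A minimal sketch of the scaled update, assuming the same dictionary representation as a basic PageRank loop. The function name `scaled_pagerank`, the default s = 0.9, and the toy graph are illustrative assumptions.

```python
def scaled_pagerank(out_links, s=0.9, iterations=50):
    """Sketch of the scaled PageRank update; out_links maps node -> list of targets."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Step 1: basic update (split rank over out-links; keep it if there are none).
        new_rank = {node: 0.0 for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                new_rank[node] += rank[node]
        # Step 2: scale every value by s, then spread the remaining (1 - s) equally.
        rank = {node: s * value + (1.0 - s) / n for node, value in new_rank.items()}
    return rank

# Hypothetical toy graph: C and D link only to each other, so under the basic
# rule they would gradually absorb all of the PageRank.
rank = scaled_pagerank({"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]})
print(rank["A"] > 0 and rank["B"] > 0)  # prints: True
```

In this toy graph A has no in-links, so its value is just its equal share of the redistributed (1 - s), while B receives that share plus what A passes on.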

  17. Summary • Link based analysis • Power laws: in-links, out-links etc. • Hubs and Authorities • convergence • PageRank • convergence
