26-01-2012 • Zoekmachines (Search Engines) • Gertjan van Noord 2013 • Lecture 6: PageRank
The web graph
[Figure: Page A and Page B connected by a hyperlink; labels: Anchor, hyperlink]
Links as sources of authenticity and authority
The Good, The Bad and The Unknown
• Good nodes won’t point to Bad nodes
• All other combinations plausible
[Figure: graph with one Good node, one Bad node, and four unknown (?) nodes]
Links as sources of authenticity and authority
• Good nodes won’t point to Bad nodes
• If you point to a Bad node, you’re Bad
• If a Good node points to you, you’re Good
[Figure: the same graph shown in three steps; the number of unknown (?) nodes drops from four to two to one as the rules are applied]
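The two labeling rules above can be sketched as a small propagation loop. This is only an illustration under assumptions: the function name `propagate_labels`, the edge list, and the node names are hypothetical, and a node keeps the first label it receives.

```python
def propagate_labels(edges, known):
    """Spread Good/Bad labels over a link graph until nothing changes.

    edges: list of (source, target) hyperlinks (hypothetical example data).
    known: dict mapping some nodes to "Good" or "Bad".
    """
    labels = dict(known)
    changed = True
    while changed:
        changed = False
        for source, target in edges:
            # Rule: if you point to a Bad node, you're Bad.
            if labels.get(target) == "Bad" and source not in labels:
                labels[source] = "Bad"
                changed = True
            # Rule: if a Good node points to you, you're Good.
            if labels.get(source) == "Good" and target not in labels:
                labels[target] = "Good"
                changed = True
    return labels


edges = [("g", "a"), ("a", "c"), ("b", "x"), ("d", "b"), ("u", "c")]
result = propagate_labels(edges, {"g": "Good", "x": "Bad"})
print(result)
```

In this toy graph, node `u` only points to a Good node, so neither rule applies and it stays unknown.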
Origins of PageRank: Citation analysis (1)
• Citation analysis: analysis of citations in the scientific literature.
• Example citation: “Miller (2001) has shown that physical activity alters the metabolism of estrogens.”
• We can view “Miller (2001)” as a hyperlink linking two scientific articles.
• Citation frequency can be used to measure the impact of an article.
  • Simplest measure: each article gets one vote, which is not very accurate.
• On the web: citation frequency = inlink count
  • A high inlink count does not necessarily mean high quality, mainly because of link spam.
Origins of PageRank: Citation analysis (2)
• Better measure: weighted citation frequency or citation rank
  • An article’s vote is weighted according to its citation impact.
• This is basically PageRank.
• PageRank was invented in the context of citation analysis by Pinski and Narin in the 1970s.
Sec. 21.2 PageRank scoring
• Imagine a browser doing a random walk on web pages:
  • Start at a random page
  • At each step, go out of the current page along one of the links on that page, equiprobably
• “In the steady state” each page has a long-term visit rate; this is the page’s score.
[Figure: a page with three outgoing links, each labeled 1/3]
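As a rough illustration of the long-term visit rate, the random walk can be simulated directly. The three-page graph and the name `simulate_visit_rates` are assumptions made for this sketch; the graph deliberately has no dead ends.

```python
import random


def simulate_visit_rates(links, steps=100_000, seed=0):
    """Estimate each page's long-term visit rate by simulating the walk.

    links: dict mapping each page to the list of pages it links to
    (hypothetical data; every page must have at least one outlink).
    """
    rng = random.Random(seed)
    pages = sorted(links)
    current = rng.choice(pages)  # start at a random page
    visits = {page: 0 for page in pages}
    for _ in range(steps):
        visits[current] += 1
        # Follow one of the links on the current page, equiprobably.
        current = rng.choice(links[current])
    return {page: count / steps for page, count in visits.items()}


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rates = simulate_visit_rates(links)
print(rates)
```

For this particular graph the exact steady-state rates are 0.4, 0.2 and 0.4 for A, B and C, so the simulated rates should land close to those values.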
Sec. 21.2 Not quite enough
• The web is full of dead ends.
• A random walk can get stuck in dead ends.
• It then makes no sense to talk about long-term visit rates.
Sec. 21.2 Teleporting
• At a dead end, jump to a random web page.
• At any non-dead end, with probability 10%, jump to a random web page.
  • With remaining probability (90%), go out on a random link.
  • The 10% is an adjustable parameter.
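The teleporting rules can be turned into explicit transition probabilities. The following is a minimal sketch, assuming a hypothetical three-page graph and a helper named `build_transition_matrix`; exact fractions are used so the rows can be checked by hand.

```python
from fractions import Fraction


def build_transition_matrix(links, teleport=Fraction(1, 10)):
    """Return (pages, matrix) for a random surfer with teleporting.

    links: dict mapping each page to its list of outlinks (hypothetical data).
    teleport: probability of jumping to a random page at a non-dead end.
    """
    pages = sorted(links)
    n = len(pages)
    matrix = []
    for page in pages:
        outlinks = links[page]
        if not outlinks:
            # Dead end: jump to a random page, all pages equally likely.
            row = [Fraction(1, n)] * n
        else:
            # With probability `teleport`, jump to any of the n pages.
            row = [teleport / n] * n
            # With the remaining probability, follow a random outlink.
            for target in outlinks:
                row[pages.index(target)] += (1 - teleport) / len(outlinks)
        matrix.append(row)
    return pages, matrix


links = {"A": ["B", "C"], "B": ["C"], "C": []}  # C is a dead end
pages, P = build_transition_matrix(links)
for page, row in zip(pages, P):
    print(page, [str(p) for p in row])
```

Every row sums to 1, so from any page the surfer always has somewhere to go, including from the dead end C.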
Sec. 21.2 Result of teleporting
• Now the walk cannot get stuck locally.
• There is a long-term rate at which any page is visited (our page rank).
• How do we compute this visit rate?
Sec. 21.2.1 Markov chains
• A Markov chain consists of n states, plus an n×n transition probability matrix P.
• For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i.
[Figure: states i and j, with the transition from i to j labeled Pij]
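A small sketch of how such a matrix might be represented and used follows. The matrix values and the helper names `is_transition_matrix` and `next_state` are assumptions for illustration, and states are numbered from 0 here rather than from 1.

```python
import random


def is_transition_matrix(P, tol=1e-9):
    """Check that P is square, non-negative, and that each row sums to 1."""
    n = len(P)
    return all(
        len(row) == n
        and all(p >= 0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in P
    )


def next_state(P, i, rng):
    """Draw the next state j with probability P[i][j], given current state i."""
    return rng.choices(range(len(P)), weights=P[i])[0]


# Hypothetical 3-state chain (0-indexed states).
P = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
print(is_transition_matrix(P))
print(next_state(P, 0, random.Random(0)))
```

Row i of the matrix is exactly the distribution over next states from state i, which is why it can be passed directly as sampling weights.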
Sec. 21.2.1 Probability vectors
• A probability (row) vector x = (x1, …, xn) tells us where the walk is at any point.
• E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i (1 ≤ i ≤ n), means we’re in state i.
• More generally, the vector x = (x1, …, xn) means the walk is in state i with probability xi.
Sec. 21.2.1 Change in probability vector
• If the probability vector is x = (x1, …, xn) at this step, what is it at the next step?
• Recall that row i of the transition probability matrix P tells us where we go next from state i.
• So from x, our next state is distributed as xP.
• The one after that is xP², then xP³, etc.
• (Where) does this converge?
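Repeatedly multiplying x by P can be sketched in a few lines of plain Python. The 3-state matrix below is a hypothetical example chosen for illustration, and `step` and `steady_state` are names invented for this sketch.

```python
def step(x, P):
    """One step of the walk: return the row vector x multiplied by P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


def steady_state(P, x, tol=1e-12, max_steps=10_000):
    """Iterate x, xP, xP^2, ... until the vector stops changing."""
    for _ in range(max_steps):
        nxt = step(x, P)
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical transition matrix; each row sums to 1.
P = [
    [1 / 6, 2 / 3, 1 / 6],
    [5 / 12, 1 / 6, 5 / 12],
    [1 / 6, 2 / 3, 1 / 6],
]
x0 = [1.0, 0.0, 0.0]  # start in the first state with certainty
pi = steady_state(P, x0)
print(pi)
```

For this matrix the iteration settles near (5/18, 4/9, 5/18), and starting from a different x0 leads to the same vector.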
An example
[Figure: example with states 1, 2, 3]
• Rows represent the state you come from
• Columns represent the state you go to
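(1/6 × 1/6) + (2/3 × 5/12) + (1/6 × 1/6) = 1/3
• Each term multiplies the probability of being in a state (1/6, 2/3, 1/6 for states 1, 2, 3) by the chance of going from that state to state 1.
• Chance of going from 1 to 1: 1/6
• Chance of going from 2 to 1: 5/12
• Chance of going from 3 to 1: 1/6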
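The arithmetic in this sum can be checked with exact fractions. The variable names are chosen for this sketch only; the numbers are the ones appearing in the sum.

```python
from fractions import Fraction

# Probability of currently being in states 1, 2, 3 (first factor of each term).
being_in = [Fraction(1, 6), Fraction(2, 3), Fraction(1, 6)]
# Chance of going from states 1, 2, 3 to state 1 (second factor of each term).
going_to_1 = [Fraction(1, 6), Fraction(5, 12), Fraction(1, 6)]

prob_state_1 = sum(p * q for p, q in zip(being_in, going_to_1))
print(prob_state_1)  # 1/3
```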
Sec. 19.2.2 Simplest forms of ranking
• First generation engines relied heavily on tf/idf
  • The top-ranked pages for the query “maui resort” were the ones containing the most “maui”s and “resort”s
• SEOs responded with dense repetitions of chosen terms
  • e.g., “maui resort maui resort maui resort”
  • Often, the repetitions would be in the same color as the background of the web page
    • Repeated terms got indexed by crawlers
    • But were not visible to humans on browsers
• Variant: repeated/misleading meta tags
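Why repetition worked can be seen with a toy scorer. This sketch uses raw term counts only and ignores idf and length normalization, so it is a deliberate simplification and not any real engine's formula; the function name and both example pages are made up.

```python
from collections import Counter


def term_frequency_score(query, document):
    """Toy score: how many times the query terms occur in the document."""
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query.lower().split())


query = "maui resort"
honest = "A quiet beach resort on Maui with ocean views"
stuffed = "maui resort maui resort maui resort cheap deals"

print(term_frequency_score(query, honest))   # 2
print(term_frequency_score(query, stuffed))  # 6
```

Under such a count-based score, the keyword-stuffed page outranks the honest one even though it says less.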
Sec. 19.2.2 Cloaking
• Serve fake content to search engine spider
• DNS cloaking: Switch IP address. Impersonate
[Figure: flowchart “Is this a Search Engine spider?” Y: serve SPAM; N: serve Real Doc]
Sec. 19.2.2 More spam techniques
• Doorway pages (pages optimized for a single keyword that re-direct to the real target page)
• Link spamming
  • Mutual admiration societies, hidden links, awards
  • Domain flooding (numerous domains that point or re-direct to a target page)
More on spam
• Web search engines have policies on SEO practices they tolerate/block
  • http://help.yahoo.com/help/us/ysearch/index.html
  • http://www.google.com/intl/en/webmasters/
• Adversarial IR: the unending (technical) battle between SEOs and web search engines
• Research: http://airweb.cse.lehigh.edu/