
Information Retrieval


Presentation Transcript


  1. Information Retrieval Link-based Ranking

  2. Ranking is crucial… “.. From our experimental data, we could observe that the top 20% of the pages with the highest number of incoming links obtained 70% of the new links after 7 months, while the bottom 60% of the pages obtained virtually no new incoming links during that period…” [ Cho et al., 04 ]

  3. Query processing • First retrieve all pages meeting the text query • Order these by their link popularity, combined with other scores: TF-IDF, title, anchor text, …

  4. Query-independent ordering • First generation: using link counts as simple measures of popularity. • Two basic suggestions: • Undirected popularity: each page gets a score given by the number of its in-links plus the number of its out-links (e.g., 3 + 2 = 5). • Directed popularity: score of a page = number of its in-links (e.g., 3). SPAM!!!!

  5. Second generation: PageRank • Each link has its own importance!! • PageRank is: • independent of the query • open to many interpretations…

  6. Basic Intuition… What about nodes with no out-links?

  7. Google’s Pagerank (with random jump) • r(i) = α · Σ_{j ∈ B(i)} r(j)/#out(j) + (1−α)/N • B(i): set of pages linking to i. • #out(j): number of outgoing links from j. • e: vector with all components equal to 1/sqrt{N} (used to express the random jump in matrix form).

  8. Pagerank: use in Search Engines • Preprocessing: • Given graph of links, build matrix P • Compute its principal eigenvector • The i-th entry is the pagerank of page i • We are interested in the relative order • Query processing: • Retrieve pages meeting query • Rank them by their pagerank • Order is query-independent

  9. Three different interpretations • Graph (intuitive interpretation) • What is the random jump? • What happens to the “real” web graph? • Matrix (easy for computation) • Turns out to be an eigenvector computation • Or a linear system solution • Markov Chain (useful to show convergence) • a sort of Usage Simulation

  10. Pagerank as Usage Simulation • Imagine a user doing a random walk on the web: • Start at a random page • At each step, go out of the current page along one of its links, equiprobably (e.g., with 3 out-links, each is taken with probability 1/3) • “In the steady state” each page has a long-term visit rate: use this as the page’s score.

  11. Not quite enough • The web is full of dead-ends. • Random walk can get stuck in dead-ends. • Cannot talk about long-term visit rates.

  12. Teleporting • At each step: • with probability 10%, jump to a random page (any node); • with the remaining probability (90%), follow an out-link chosen at random (a neighbor).
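The teleporting walk described above can be simulated directly. A minimal sketch in Python (the toy graph and function name are illustrative); note how the teleport step also rescues the surfer from dead ends:

```python
import random

def random_surfer(out_links, steps=200_000, teleport=0.1, seed=0):
    """Simulate a random surfer: with probability `teleport` jump to a
    uniformly random page, otherwise follow a random out-link.
    Returns the fraction of steps spent on each page (long-term visit rate)."""
    rng = random.Random(seed)
    pages = list(out_links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        links = out_links[current]
        if rng.random() < teleport or not links:  # teleport also escapes dead ends
            current = rng.choice(pages)
        else:
            current = rng.choice(links)
    return {p: v / steps for p, v in visits.items()}

# Toy graph with a dead end ('d' has no out-links).
graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a'], 'd': []}
rates = random_surfer(graph)
```

Without the teleport branch, the walk would stop making progress as soon as it hit `d`; with it, every page keeps a positive long-term visit rate.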

  13. Result of teleporting • Now we CANNOT get stuck locally. • There is a long-term rate at which any page is visited. • Not obvious → Markov chains

  14. Markov chains • A Markov chain consists of n states, plus an n×n transition probability matrix P. • At each step, we are in exactly one of the states. • For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of going from state i to state j. Pii > 0 is OK.
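As a concrete instance of the definition above, here is a tiny 3-state chain (the values are illustrative): each row of P sums to 1, and a distribution x over states evolves by one step as x·P:

```python
# A 3-state Markov chain: P[i][j] = probability of moving from state i to j.
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

# Every row must sum to 1: from each state, the next step goes somewhere.
for row in P:
    assert abs(sum(row) - 1.0) < 1e-12

def step(x, P):
    """One step of the chain: the next-state distribution is x * P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

x1 = step([1.0, 0.0, 0.0], P)  # start in state 0 with certainty
```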

  15. Markov chains • Clearly, for all i, Σ_j Pij = 1 (each row of P sums to 1). • Markov chains abstract random walks. • A Markov chain is ergodic if: • there is a path from any state to any other; • you can be in any state at every time step, with non-zero probability.

  16. Ergodic Markov chains • For any ergodic Markov chain, there is a unique long-term visit rate for each state. • Steady-state distribution. • Over a long time-period, we visit each state in proportion to this rate. • It doesn’t matter where we start.

  17. Computing steady-state probabilities • Let x = (x1, …, xn) denote the row vector of steady-state probabilities. • If our current position is described by x, then the next step is distributed as xP. • But x is the steady state, so x = xP. • Solving this matrix equation gives us x. • So x is a (left) eigenvector of P. • (It corresponds to the principal eigenvector of P, the one with the largest eigenvalue, which is 1.)

  18. Computing x by the Power Method • Recall: regardless of where we start, we eventually reach the steady state a. • Start with any distribution (say x = (1, 0, …, 0)). • After one step, we’re at xP; after two steps at xP², then xP³, and so on. • “Eventually” means: for “large” k, a = xP^k. • Algorithm: multiply x by increasing powers of P until the product looks stable.
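The power method on this slide can be sketched in a few lines of pure Python (illustrative only; a real implementation would use sparse matrix-vector products):

```python
def power_method(P, tol=1e-10, max_iter=1000):
    """Left principal eigenvector of a row-stochastic matrix P by power
    iteration: repeatedly replace x with x * P until it stops changing."""
    n = len(P)
    x = [1.0 / n] * n  # any starting distribution works
    for _ in range(max_iter):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(x, nxt)) < tol:
            return nxt
        x = nxt
    return x

# The 3-state chain used above as an example (values illustrative).
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
x = power_method(P)  # steady state: x satisfies x = x * P
```

For this P the fixed point is (4/9, 1/3, 2/9), which the iteration reaches quickly because the chain is ergodic.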

  19. Why do we need fast ranking? “…The link structure of the Web is significantly more dynamic than the contents on the Web. Every week, about 25% new links are created. After a year, about 80% of the links on the Web are replaced with new ones. This result indicates that search engines need to update link-based ranking metrics very often…” [Cho et al., 04]

  20. Accelerating PageRank • Web Graph Compression to fit in internal memory • Efficient external-memory implementation • Mathematical approaches • Combination of the above strategies

  21. Personalized PageRank

  22. Influencing PageRank (“Personalization”) • Input: • Web graph W • influence vector v (v: page → degree of influence) • Output: • rank vector r (r: page → importance w.r.t. v) • r = PR(W, v)

  23. Personalized PageRank • r = α · rP + (1 − α) · v • α is the damping factor • v is a personalization vector
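A minimal sketch of this recurrence in Python (the toy graph, names, and the choice to redistribute dead-end mass via v are illustrative assumptions):

```python
def personalized_pagerank(out_links, v, alpha=0.85, iters=100):
    """Personalized PageRank sketch: with probability alpha follow an
    out-link; otherwise jump according to the personalization vector v
    (a distribution over pages, not necessarily uniform)."""
    pages = list(out_links)
    r = {p: v[p] for p in pages}  # start from the preference vector
    for _ in range(iters):
        nxt = {p: (1 - alpha) * v[p] for p in pages}
        for p, links in out_links.items():
            if links:
                share = alpha * r[p] / len(links)
                for q in links:
                    nxt[q] += share
            else:  # dead end: redistribute its mass via v (assumption)
                for q in pages:
                    nxt[q] += alpha * r[p] * v[q]
        r = nxt
    return r

graph = {'a': ['b'], 'b': ['c'], 'c': ['a']}
v = {'a': 1.0, 'b': 0.0, 'c': 0.0}  # every jump lands on page 'a'
r = personalized_pagerank(graph, v)
```

Concentrating v on page 'a' biases the whole ranking toward 'a' and the pages it reaches quickly: here r('a') > r('b') > r('c').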

  24. HITS: Hubs and Authorities • A good hub page for a topic points to many authoritative pages for that topic. • A good authority page for a topic is pointed to by many good hubs for that topic. • Circular definition - will turn this into an iterative computation.

  25. HITS: Hypertext Induced Topic Search • It is query-dependent • Produces two scores per page: • Authority score • Hub score

  26. Hubs &amp; Authorities Calculation • [Diagram: the query retrieves a Root Set of pages, which is expanded into a larger Base Set]

  27. Assembling the base set • Root set typically has 200–1000 nodes. • Base set may have up to 5000 nodes. • How do you find the base-set nodes? • Follow out-links by parsing root-set pages. • Get in-links (and out-links) from a connectivity server. • (Actually, it suffices to text-index strings of the form href=“URL” to get the in-links to URL.)

  28. Authority and Hub scores • Example: pages 2, 3, 4 point to page 1, and page 1 points to pages 5, 6, 7. Then: h(1) = a(5) + a(6) + a(7) and a(1) = h(2) + h(3) + h(4)

  29. Iterative update • Repeat the following updates, for all x: • a(x) ← Σ_{y: y→x} h(y) • h(x) ← Σ_{y: x→y} a(y)

  30. HITS: Link Analysis Computation • a = Aᵀh and h = Aa (up to scaling), where: • a: vector of authorities’ scores • h: vector of hubs’ scores • A: adjacency matrix in which A(i,j) = 1 if i points to j • Thus h is an eigenvector of AAᵀ and a is an eigenvector of AᵀA (connected to the SVD of A).
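The computation above, iterated to a fixed point, might look like this in Python (the edge list and names are illustrative; rescaling uses the L2 norm):

```python
def hits(links, iters=50):
    """HITS sketch: a = A^T h (authorities get credit from hubs pointing at
    them), h = A a (hubs get credit for pointing at good authorities),
    rescaling after each round so the scores stay bounded."""
    pages = {p for p, q in links} | {q for p, q in links}
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iters):
        a = {p: sum(h[u] for u, w in links if w == p) for p in pages}
        h = {p: sum(a[w] for u, w in links if u == p) for p in pages}
        na = sum(x * x for x in a.values()) ** 0.5 or 1.0
        nh = sum(x * x for x in h.values()) ** 0.5 or 1.0
        a = {p: x / na for p, x in a.items()}
        h = {p: x / nh for p, x in h.items()}
    return a, h

# Tiny example: 'hub' points to both authorities, 'x' to only one.
edges = [('hub', 'auth1'), ('hub', 'auth2'), ('x', 'auth1')]
a, h = hits(edges)
```

As the theorem on the next slide states, a and h converge to the principal eigenvectors of AᵀA and AAᵀ; in the tiny example, 'hub' outranks 'x' as a hub and 'auth1' outranks 'auth2' as an authority.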

  31. Math Behind the Algorithm • Theorem (Kleinberg, 1998). The vectors a(p) and h(p) converge to the principal eigenvectors of AᵀA and AAᵀ, where A is the adjacency matrix of the (directed) Web subgraph.

  32. How many iterations? • Claim: relative values of scores will converge after a few iterations: • We only require the relative orders of the h() and a() scores - not their absolute values. • In practice ~5 iterations get you close to stability. • Scores are sensitive to small variations of the graph. Bad !!

  33. HITS Example Results • [Chart: authority and hubness weights for 15 sample pages]

  34. Weighting links • In hub/authority link analysis, can match text to query, then weight link.

  35. Weighting links • The weight should increase with the number of query terms in the anchor text. • E.g.: 1 + number of query terms. • Example: page x links to page y = www.ibm.com with anchor text “Armonk, NY-based computer giant IBM announced today”; the weight of this link for the query computer is 2.
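The heuristic on this slide is a one-liner; a sketch (whitespace tokenization and case folding are simplifying assumptions):

```python
def link_weight(anchor_text, query_terms):
    """Weight a link as 1 + number of query terms appearing in its
    anchor text (the heuristic from the slide)."""
    words = anchor_text.lower().split()
    return 1 + sum(1 for t in query_terms if t.lower() in words)

w = link_weight("Armonk, NY-based computer giant IBM announced today",
                ["computer"])
# matches the slide's example: weight 2 for the query "computer"
```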

  36. Practical HITS • Used by ASK.com • How to compute it efficiently??????

  37. Comparison • Pagerank • Pros: hard to spam; quality signal for all pages • Cons: non-trivial to compute; not query specific; doesn’t work on small graphs • Proven to be effective for general-purpose ranking • HITS &amp; Variants • Pros: query specific; works on small graphs • Cons: local graph structure can be manufactured (spam!); real-time computation is hard • Well suited for supervised directory construction

  38. SALSA • Proposed by Lempel and Moran (2001). • A probabilistic extension of the HITS algorithm. • The random walk follows hyperlinks both in the forward and in the backward direction. • Two separate random walks: • Hub walk • Authority walk

  39. Forming a Bipartite Graph in SALSA

  40. SALSA: Random Walks • Hub walk: • follow a forward link from a hub u_h to an authority w_a; • traverse a back-link from w_a to some hub v_h. • Authority walk: • follow a back-link from an authority w_a to a hub u_h; • traverse a forward link from u_h to an authority z_a. • Lempel and Moran (2001) proved that SALSA weights are more robust than HITS weights in some cases. • The authority score converges to the indegree, and the hub score to the outdegree.
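Because of the convergence result just quoted, on a single connected component the SALSA scores can simply be read off from the degrees instead of iterating. A sketch under that single-component assumption (the edge list is illustrative):

```python
def salsa_scores(links):
    """SALSA sketch: on one connected component of the bipartite support
    graph, the stationary authority weights are proportional to indegree
    and the hub weights to outdegree (Lempel & Moran's result), so we
    read them off directly rather than simulating the two walks."""
    pages = {p for edge in links for p in edge}
    indeg = {p: sum(1 for _, w in links if w == p) for p in pages}
    outdeg = {p: sum(1 for u, _ in links if u == p) for p in pages}
    total = len(links)
    authority = {p: d / total for p, d in indeg.items()}
    hub = {p: d / total for p, d in outdeg.items()}
    return authority, hub

# 'u' and 'v' act as hubs; 'w' and 'z' as authorities (one component).
edges = [('u', 'w'), ('v', 'w'), ('u', 'z')]
a, h = salsa_scores(edges)
```

With several components the true SALSA weights also depend on component sizes, so this shortcut only applies component by component.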

  41. Information Retrieval Spamming

  42. Spamming PageRank • Spam Farm (SF): rules of thumb • Use all available own pages in the SF • Accumulate the maximum number of in-links to the SF • NO links pointing outside the SF • Avoid dangling nodes within the SF

  43. Spamming HITS • Easy to spam: • create a new page p pointing to many authority pages (e.g., Yahoo, Google, etc.) → p becomes a good hub page • … then, on p, add a link to your home page

  44. Many applications • Ranking: Web, News, Financial Institutions, Peers, Human Trust, Viruses, Social Networks • You propose!

  45. Information Retrieval Other tools

  46. snippet

  47. Joint with A. Gullì, WWW 2005, Tokyo (JP)
