
Inverted Index

Inverted Index. Allows quick lookup of the IDs of documents that contain a particular word. A lexicon/dictionary (DIC) maps each word to its posting list (PL): Stanford → PL(Stanford), UCLA → PL(UCLA), MIT → PL(MIT), …. PageRank. A page is important if it is pointed to by many important pages.

dbaptiste

Presentation Transcript


  1. Inverted Index. Allows quick lookup of the IDs of documents that contain a particular word. A lexicon/dictionary (DIC) maps each word to its posting list (PL): Stanford → PL(Stanford), UCLA → PL(UCLA), MIT → PL(MIT), …
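A minimal sketch of the lexicon-to-posting-list structure described in slide 1. The function name `build_inverted_index`, the whitespace tokenization, the lowercasing, and the sample documents are all assumptions made for illustration; the slide does not specify them.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to a sorted posting list of document ids.

    `docs` is a dict {doc_id: text}. Tokenization is a plain
    lowercase whitespace split (an assumption, not from the slides).
    """
    index = defaultdict(list)
    # Visiting ids in sorted order keeps every posting list sorted.
    for doc_id in sorted(docs):
        # set() ensures a document appears once per word.
        for word in set(docs[doc_id].lower().split()):
            index[word].append(doc_id)
    return dict(index)


# Hypothetical sample documents.
docs = {1: "Stanford UCLA", 2: "UCLA MIT", 3: "MIT Stanford UCLA"}
index = build_inverted_index(docs)
posting_list = index.get("mit", [])  # ids of documents containing "mit"
```

Looking up a word is a single dictionary access, which is what makes the index "inverted": it goes from word to documents rather than from document to words.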

  2. PageRank. A page is important if it is pointed to by many important pages. $PR(p) = PR(p_1)/c_1 + \dots + PR(p_k)/c_k$, where $p_i$ is a page pointing to $p$ and $c_i$ is the number of links in $p_i$. The PageRank of $p$ is the sum of the shares of PageRank it receives from its parents. There is one equation for every page: N equations, N unknown variables. Junghoo "John" Cho (UCLA Computer Science)

  3. Example: Web of 1842. Three pages: Netscape (Ne), Microsoft (MS) and Amazon (Am). $PR(n) = PR(n)/2 + PR(a)/2$, $PR(m) = PR(a)/2$, $PR(a) = PR(n)/2 + PR(m)$.

  4. PageRank: Matrix Notation. Web graph matrix $M = \{m_{ij}\}$. Each page $i$ corresponds to row $i$ and column $i$ of the matrix $M$. $m_{ij} = 1/c$ if page $i$ is one of the $c$ children of page $j$; $m_{ij} = 0$ otherwise. PageRank vector: $r = (r_1, \dots, r_N)^T$, where $r_i$ is the PageRank of page $i$. PageRank equation: $r = M r$.
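As a worked instance of the matrix definition, the three equations of the Netscape/Microsoft/Amazon example can be written in matrix form. The page order (n, m, a) and the symbol $r$ for the PageRank vector are assumptions chosen here for illustration.

```latex
% Column j holds the shares that page j sends to its children:
% Netscape -> {Netscape, Amazon}, Microsoft -> {Amazon},
% Amazon -> {Netscape, Microsoft}.
\[
M =
\begin{pmatrix}
1/2 & 0 & 1/2 \\
0   & 0 & 1/2 \\
1/2 & 1 & 0
\end{pmatrix},
\qquad
r =
\begin{pmatrix}
PR(n) \\ PR(m) \\ PR(a)
\end{pmatrix},
\qquad
r = M r .
\]
% Row 1 reads PR(n) = PR(n)/2 + PR(a)/2,
% row 2 reads PR(m) = PR(a)/2,
% row 3 reads PR(a) = PR(n)/2 + PR(m).
```

Each column sums to 1 because every page distributes all of its importance among its children.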

  5. PageRank: Iterative Computation. Initially every page has one unit of importance. At each round, each page shares its importance among its children and receives new importance from its parents. Eventually the importance of each page reaches a limit. M is a stochastic matrix (each column sums to 1), so the total importance is preserved from round to round.
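The rounds described in slide 5 can be sketched as follows for the three-page example. The function name `pagerank_iterate`, the adjacency-dict representation, and the fixed number of rounds are assumptions; the link structure is read off the equations in slide 3.

```python
def pagerank_iterate(links, rounds=100):
    """Repeatedly share each page's importance among its children.

    `links` maps a page to the list of pages it links to. Every page
    must have at least one link (no dead ends) in this sketch.
    """
    rank = {page: 1.0 for page in links}  # one unit of importance each
    for _ in range(rounds):
        new_rank = {page: 0.0 for page in links}
        for page, children in links.items():
            share = rank[page] / len(children)
            for child in children:
                new_rank[child] += share
        rank = new_rank
    return rank


# Netscape links to itself and Amazon, Microsoft to Amazon,
# Amazon to Netscape and Microsoft.
WEB = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
ranks = pagerank_iterate(WEB)
```

For this graph the values settle near PR(n) = PR(a) = 6/5 and PR(m) = 3/5, which also satisfy the three equations of the example exactly.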

  6. Example: Web of 1842 [figure: graph of Ne (Netscape), MS (Microsoft) and Am (Amazon)]

  7. PageRank: Random Surfer Model. The PageRank of a page is the probability that a Web surfer reaches it after many clicks, following a random link at each click.
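A small simulation can illustrate the random surfer interpretation on the three-page example. The function name `random_surfer`, the number of clicks, and the fixed seed are assumptions made for this sketch.

```python
import random


def random_surfer(links, clicks=200_000, seed=0):
    """Follow random links and return the fraction of visits per page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    page = next(iter(links))   # start at an arbitrary page
    visits = dict.fromkeys(links, 0)
    for _ in range(clicks):
        page = rng.choice(links[page])  # click one outgoing link at random
        visits[page] += 1
    return {p: count / clicks for p, count in visits.items()}


# Same link structure as the Netscape/Microsoft/Amazon example.
WEB = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
visit_fraction = random_surfer(WEB)
```

The visit fractions should come out close to 0.4, 0.2 and 0.4 for Netscape, Microsoft and Amazon, that is, the PageRank values 6/5, 3/5 and 6/5 divided by the total importance of 3.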

  8. Problems on the Real Web. Dead end: a page with no outgoing links has nowhere to send its importance, so all importance eventually “leaks out of” the Web. Crawler trap: a group of one or more pages with no links out of the group, which accumulates all the importance of the Web.

  9. Example: Dead End. There is no link from Microsoft, so Microsoft is a dead end. [figure: graph of Ne, MS and Am]

  10. Example: Dead End [figure: graph of Ne, MS and Am]

  11. Solution to Dead End. Assume that the surfer jumps to a random page when at a dead end. [figure: graph of Ne, MS and Am]
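One way to sketch this fix is to treat a dead-end page as if it linked to every page, so its importance is spread evenly instead of leaking out. The function name and the exact link structure below are assumptions: the slides only state that Microsoft has no outgoing links, and the other links are taken from the earlier example.

```python
def pagerank_with_dead_ends(links, rounds=100):
    """Iterative PageRank where a dead end jumps to a random page."""
    pages = list(links)
    rank = {page: 1.0 for page in pages}
    for _ in range(rounds):
        new_rank = dict.fromkeys(pages, 0.0)
        for page in pages:
            # A page with no links behaves as if it linked to all pages.
            targets = links[page] or pages
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Microsoft ("m") has no outgoing links in this variant.
DEAD_END_WEB = {"n": ["n", "a"], "m": [], "a": ["n", "m"]}
ranks = pagerank_with_dead_ends(DEAD_END_WEB)
```

With the random jump in place the total importance stays at 3, and the values settle near 18/13, 9/13 and 12/13 for Netscape, Microsoft and Amazon.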

  12. Example: Crawler Trap. Microsoft has only a self-link, so it is a crawler trap. [figure: graph of Ne, MS and Am]

  13. Example: Crawler Trap [figure: graph of Ne, MS and Am]

  14. Crawler Trap: Damping Factor. “Tax” each page some fraction of its importance and distribute the collected tax equally among all pages. The tax rate is the probability that the surfer jumps to a random page. The example assumes a 20% tax.
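The taxation step can be sketched as below for the crawler-trap example with a 20% tax. The function name `pagerank_damped`, the parameter name `tax`, and the number of rounds are assumptions; the sketch also assumes that every page has at least one outgoing link.

```python
def pagerank_damped(links, tax=0.2, rounds=200):
    """Iterative PageRank where each page pays `tax` into a shared pool."""
    pages = list(links)
    rank = {page: 1.0 for page in pages}
    for _ in range(rounds):
        new_rank = dict.fromkeys(pages, 0.0)
        for page, children in links.items():
            # Only the untaxed part follows the page's links.
            share = (1.0 - tax) * rank[page] / len(children)
            for child in children:
                new_rank[child] += share
        # The collected tax is distributed equally among all pages.
        pool = tax * sum(rank.values())
        for page in pages:
            new_rank[page] += pool / len(pages)
        rank = new_rank
    return rank


# Microsoft ("m") links only to itself, forming a crawler trap.
TRAP_WEB = {"n": ["n", "a"], "m": ["m"], "a": ["n", "m"]}
ranks = pagerank_damped(TRAP_WEB)
```

Microsoft still ends up with the largest share (about 21/11, against 7/11 for Netscape and 5/11 for Amazon), but it no longer absorbs all of the importance.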

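  15. Algorithm KMP. Search for word W in document D, starting with m = 0 and i = 0, where T is the partial-match table with T[0] = -1:
      while (m + i) < |D| do:
          if W[i] = D[m + i]:
              let i = i + 1
              if i = |W|, return m
          otherwise:
              let m = m + i - T[i]
              if i > 0, let i = T[i]
      return no-match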
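A runnable sketch of the search loop above. The slide does not show how the table T is built, so `build_table` is an assumption that follows the common convention T[0] = -1, which the update `m = m + i - T[i]` relies on to advance when i = 0. Returning -1 for no-match is also a choice made here.

```python
def build_table(word):
    """Partial-match table: T[0] = -1, and T[i] is the length of the
    longest proper prefix of `word` that is a suffix of word[:i]."""
    table = [0] * len(word)
    table[0] = -1
    pos, cnd = 2, 0
    while pos < len(word):
        if word[pos - 1] == word[cnd]:
            cnd += 1
            table[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = table[cnd]  # fall back to a shorter prefix
        else:
            table[pos] = 0
            pos += 1
    return table


def kmp_search(doc, word):
    """Return the index of the first match of `word` in `doc`, or -1."""
    if not word:
        return 0
    table = build_table(word)
    m = i = 0  # m: start of the current match in doc, i: position in word
    while m + i < len(doc):
        if word[i] == doc[m + i]:
            i += 1
            if i == len(word):
                return m
        else:
            m = m + i - table[i]
            if i > 0:
                i = table[i]
    return -1


position = kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD")
```

Because m + i never decreases and m always advances on a mismatch, the loop performs at most about 2|D| character comparisons.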
