
ICS 215: Advances in Database Management System Technology Spring 2004


Presentation Transcript


  1. ICS 215: Advances in Database Management System Technology Spring 2004 Professor Chen Li Information and Computer Science University of California, Irvine

  2. Course Web Server • URL: http://www.ics.uci.edu/~ics215/ • All course info will be posted online • Instructor: Chen Li • ICS 424B, chenli@ics.uci.edu • Course general info: http://www.ics.uci.edu/~ics215/geninfo.html

  3. Topic today: Web Search • How did earlier search engines work? • How does Google work? • Readings: • Lawrence and Giles, Searching the World Wide Web, Science, 1998. • Brin and Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7 / Computer Networks 30(1-7): 107-117, 1998.

  4. Earlier Search Engines • Hotbot, Yahoo, Alta Vista, Northern Light, Excite, Infoseek, Lycos … • Main technique: “inverted index” • Conceptually: use a matrix to represent how many times a term appears in one page • # of columns = # of pages (huge!) • # of rows = # of terms (also huge!)

                 Page1   Page2   Page3   Page4   …
     ‘car’         1       0       1       0
     ‘toyota’      0       2       0       1     (page 2 mentions ‘toyota’ twice)
     ‘honda’       2       1       0       0
     …
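A minimal Python sketch of this idea, storing the term/page matrix sparsely as a dictionary from each term to per-page counts; the page names and contents are made up for illustration:

```python
from collections import defaultdict

# Toy "crawled" pages; in a real engine these would come from a crawler.
pages = {
    "page1": "car honda car",
    "page2": "toyota honda toyota",
    "page3": "car",
    "page4": "toyota",
}

# Inverted index: term -> {page -> # of occurrences}.
# This is the sparse version of the term/page matrix on the slide.
index = defaultdict(dict)
for page, text in pages.items():
    for term in text.lower().split():
        index[term][page] = index[term].get(page, 0) + 1

print(dict(index["toyota"]))   # {'page2': 2, 'page4': 1}
```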

  5. Search by Keywords • If the query has a single keyword, just return all the pages that contain that word • E.g., “toyota” → all pages containing “toyota”: page2, page4, … • There could be many, many such pages! • Solution: return first the pages in which the word occurs most frequently (see the sketch below)
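A hedged sketch of this frequency-based ranking, reusing the toy index layout from the previous snippet (term → {page: count}); the data is invented:

```python
# Toy inverted index: term -> {page -> occurrence count}.
index = {
    "toyota": {"page2": 2, "page4": 1},
    "honda":  {"page1": 1, "page2": 1},
}

def search_one(term):
    """Return pages containing `term`, most frequent occurrences first."""
    postings = index.get(term, {})
    return sorted(postings, key=postings.get, reverse=True)

print(search_one("toyota"))   # ['page2', 'page4']
```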

  6. Multi-keyword Search • For each keyword W, find the set of pages mentioning W • Intersect all of these sets of pages • This assumes an “AND” semantics over the keywords • Example: a search for “toyota honda” returns all the pages that mention both “toyota” and “honda” (see the sketch below)
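A minimal sketch of AND-style multi-keyword search over the same toy data (an illustration of the idea, not Google's actual implementation):

```python
from functools import reduce

# Toy inverted index: term -> set of pages containing the term.
index = {
    "toyota": {"page2", "page4"},
    "honda":  {"page1", "page2"},
}

def search_and(*terms):
    """Pages that contain every query term (AND semantics)."""
    postings = [index.get(t, set()) for t in terms]
    return reduce(set.intersection, postings) if postings else set()

print(search_and("toyota", "honda"))   # {'page2'}
```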

  7. Observations • The “matrix” can be huge: • The Web now has 4.2 billion pages! • There are many “terms” on the Web, and many of them are typos • It’s not easy to do the computation efficiently: • Given a word, find all the pages that contain it… • Intersect many sets of pages… • For these reasons, search engines never store this “matrix” in such a naive form.

  8. Problems • Spamming: • People want their pages ranked at the very top for a word search (e.g., “toyota”), so they repeat the word many, many times • Yet these pages may be unimportant compared to www.toyota.com, even if the latter mentions “toyota” only once (or not at all) • Frequency-based search engines can be easily “fooled”

  9. Closer look at the problems • The approach lacks a notion of the “importance” of each page on each topic • E.g.: our ICS 215 class page is not as “important” as Yahoo’s main page • A link from Yahoo is more important than a link from our class page • But how do we capture the importance of a page? • A guess: # of hits? → but where would we get that info? • # of inlinks to a page → Google’s main idea

  10. Google’s History • Started at Stanford’s DB group as a research project (Brin and Page) • Used to be at: google.stanford.edu • Very soon many people started liking it • Incorporated in 1998: www.google.com • The “largest” search engine now • Started other businesses: Froogle, Gmail, …

  11. PageRank • Intuition: • The importance of each page should be decided by what other pages “say” about this page • One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks) • Problem: • We can easily fool this technique by generating many dummy pages that point to our class page

  12. Details of PageRank • At the beginning, each page has weight 1 • In each iteration, each page propagates its current weight W to its N forward neighbors; each of them receives weight W/N • Meanwhile, each page accumulates the weights sent by its backward neighbors • Iterate until all weights converge; usually 6-7 iterations are enough • The final weight of each page is its importance • NOTE: Google now uses many additional techniques and heuristics for ranking; here we only cover some of the initial ideas (a small sketch of the basic iteration follows)
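A minimal sketch of this iteration in Python, over a small made-up link graph (not Google's actual implementation, which also has to handle dead ends, spider traps, and Web scale):

```python
# Forward links of a tiny made-up web graph: page -> pages it points to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

# Every page starts with weight 1 (as on the slide).
weights = {page: 1.0 for page in links}

for _ in range(7):                          # 6-7 iterations usually suffice
    new = {page: 0.0 for page in links}
    for page, outs in links.items():
        share = weights[page] / len(outs)   # W/N to each forward neighbor
        for target in outs:
            new[target] += share            # accumulate from backward neighbors
    weights = new

print(weights)   # higher weight = more "important" page
```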

  13. Example: MiniWeb • (Materials used courtesy of Jeff Ullman) • Our “MiniWeb” has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS) • Their weights are represented as a vector [Figure: link graph over Ne, Am, and MS with the weight vector] • For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS
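The slide's original figure is not preserved in this transcript. Assuming the link structure of Ullman's classic example (Ne links to itself and Am, Am links to Ne and MS, MS links only to Am), which is consistent with the propagation rule above and with the result on the next slide, one iteration in matrix form would be:

```latex
\begin{pmatrix} n_{\text{new}} \\ a_{\text{new}} \\ m_{\text{new}} \end{pmatrix}
=
\begin{pmatrix}
  1/2 & 1/2 & 0 \\
  1/2 & 0   & 1 \\
  0   & 1/2 & 0
\end{pmatrix}
\begin{pmatrix} n \\ a \\ m \end{pmatrix}
```

where n, a, m are the weights of Ne, Am, and MS, and each column of the matrix sums to 1.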

  14. Iterative computation • [Figure: table of the weight vector (Ne, Am, MS) over successive iterations] • Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft • Does it capture the intuition? Yes

  15. Observations • We cannot get absolute weights: we can only know (and we are only interested in) the relative weights of the pages • The matrix M is stochastic (each column sums to 1), so the iterations converge and compute the principal eigenvector of the matrix equation w = M · w, where w is the vector of page weights

  16. Problem 1 of the algorithm: dead ends • Suppose MS does not point to anybody • Result: the weights of the Web “leak out” [Figure: MiniWeb with MS having no outgoing links]

  17. Problem 2 of the algorithm: spider traps • Suppose MS only points to itself • Result: all the weight ends up at MS! [Figure: MiniWeb with MS linking only to itself]

  18. Google’s solution: “tax” each page • Like people paying taxes, each page pays part of its weight into a public pool, which is then distributed to all pages • Example: assume a 20% tax rate in the “spider trap” example (a sketch of the taxed iteration follows)
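A hedged sketch of the taxed iteration on the MiniWeb with the spider trap, under the same assumed link structure as above: with a 20% tax, each page propagates only 80% of its weight, and the pooled 20% is redistributed equally. This mirrors the later "damping factor" formulation and is not necessarily Google's exact production method:

```python
TAX = 0.20                      # 20% of each page's weight goes to the public pool

# Assumed MiniWeb graph with a spider trap: MS links only to itself.
links = {
    "Ne": ["Ne", "Am"],
    "Am": ["Ne", "MS"],
    "MS": ["MS"],
}

weights = {p: 1.0 for p in links}
n = len(links)

for _ in range(50):
    new = {p: 0.0 for p in links}
    for page, outs in links.items():
        share = (1 - TAX) * weights[page] / len(outs)   # taxed propagation
        for target in outs:
            new[target] += share
    pool = TAX * sum(weights.values())                  # the public pool
    for page in new:
        new[page] += pool / n                           # redistributed evenly
    weights = new

print(weights)   # MS no longer absorbs all of the weight
```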

  19. The War of Search Engines • More companies are realizing the importance of search engines • More competitors in the market: Microsoft, Yahoo!, etc.

  20. Next: HITS / Web communities • Readings: • Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5): 604-632, 1999. • Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Trawling the Web for Emerging Cyber-communities, WWW 1999.

  21. Hubs and Authorities • Motivation: find web pages related to a topic • E.g.: “find all web sites about automobiles” • “Authority”: a page that offers information about a topic • E.g.: DBLP is an authority on papers • E.g.: google.com, aj.com, teoma.com, lycos.com are authorities on web search • “Hub”: a page that doesn’t provide much information itself, but tells us where to find pages about a topic • E.g.: our ICS 215 page linking to pages about papers • E.g.: www.searchenginewatch.com is a hub for search engines

  22. Two values of a page • Each page has a hub value and an authority value. • In PageRank, each page has one value: “weight” • Two vectors: • H: hub values • A: authority values

  23. HITS algorithm: find hubs and authorities • First step: find pages related to the topic (e.g., “automobile”) and construct the corresponding “focused subgraph” • Find the root set S of pages containing the keyword (“automobile”) • Find all pages the pages in S point to, i.e., their forward neighbors • Find all pages that point to pages in S, i.e., their backward neighbors • Take the subgraph induced by all of these pages (see the sketch below) [Figure: root set expanded into the focused subgraph]
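A minimal sketch of this focused-subgraph construction, assuming we already have the set of pages containing the keyword and a forward-link map; all names and data here are illustrative:

```python
# Toy data: which pages contain the keyword, and who links to whom.
pages_with_keyword = {"p1", "p2"}                       # root set S
forward = {"p1": {"p3"}, "p2": {"p3", "p4"}, "p5": {"p1"}, "p3": set(), "p4": set()}

def focused_subgraph(root):
    """Root set plus its forward and backward neighbors, with induced edges."""
    nodes = set(root)
    for page in root:
        nodes |= forward.get(page, set())               # forward neighbors
    for page, outs in forward.items():
        if outs & root:                                 # backward neighbors
            nodes.add(page)
    edges = {(u, v) for u, outs in forward.items() for v in outs
             if u in nodes and v in nodes}
    return nodes, edges

print(focused_subgraph(pages_with_keyword))
```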

  24. Step 2: computing H and A • Initially: set every hub and authority value to 1 • In each iteration, the hub value of a page becomes the total authority value of its forward neighbors (after normalization) • The authority value of each page becomes the total hub value of its backward neighbors (after normalization) • Iterate until convergence (a sketch follows) [Figure: hub and authority update equations]
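A hedged sketch of the update step on a toy focused subgraph; the normalization shown here divides by the largest value, which is one common choice among several:

```python
# Toy focused subgraph: page -> forward neighbors.
forward = {"p1": {"p3"}, "p2": {"p3", "p4"}, "p3": set(), "p4": set(), "p5": {"p1"}}
backward = {p: {q for q, outs in forward.items() if p in outs} for p in forward}

hub = {p: 1.0 for p in forward}
auth = {p: 1.0 for p in forward}

def normalize(scores):
    top = max(scores.values()) or 1.0
    return {p: v / top for p, v in scores.items()}

for _ in range(20):
    # Authority of a page: total hub value of its backward neighbors.
    auth = normalize({p: sum(hub[q] for q in backward[p]) for p in forward})
    # Hub value of a page: total authority value of its forward neighbors.
    hub = normalize({p: sum(auth[q] for q in forward[p]) for p in forward})

print("authorities:", auth)
print("hubs:", hub)
```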

  25. Example: MiniWeb • [Figure: one iteration of hub/authority updates on the MiniWeb (Ne, Am, MS), with normalization at each step]

  26. Example: MiniWeb (continued) • [Figure: converged hub and authority values for Ne, Am, and MS]

  27. Trawling: finding online communities • Motivation: find groups of individuals who share a common interest, together with the Web pages most popular among them (similar to “hubs”) • Examples: • Web pages of NBA fans • Community of Turkish student organizations in the US • Fans of movie star Jack Lemmon • Applications: • Provide valuable and timely info for interested people • Represent the sociology of the web • Target advertising

  28. How: analyzing web structure • The pages of a community often do not reference each other • Competition • Different viewpoints • Main idea: “co-citation” • However, these pages are often pointed to by a large number of common pages • Example: the following two web sites are co-cited by many pages • http://kcm.co.kr/English/ • www.cyberkorean.com/church

  29. Bipartite subgraphs [Figure: bipartite graph with “fans” F on one side and “centers” C on the other] • Bipartite graph: two sets of nodes, F (“fans”) and C (“centers”), with edges going from F to C • Dense bipartite graph: there are “enough” edges between F and C • Complete bipartite graph: there is an edge from each node in F to each node in C • (i,j)-core: a complete bipartite graph with at least i nodes in F and j nodes in C • An (i,j)-core is a good signature for finding online communities • Usually i and j are between 3 and 9

  30. “Trawling”: finding cores • Find all (i,j)-cores in the Web graph • In particular: find the “fans” (or “hubs”) in the graph; the “centers” correspond to “authorities” • Challenge: the Web is huge; how do we find cores efficiently? • Experiments: 200M pages, 1 TB of data • Main idea: pruning • Step 1: use out-degrees • Rule: each fan must point to at least 6 different websites • Pruning result: 12% of all pages (= 24M pages) are potential fans • Retain only the links, and ignore page contents

  31. Step 2: eliminate mirrored pages • Many pages are mirrors (exact copies of the same page) • They can produce many spurious fans • Use a “shingling” method to identify and eliminate duplicates (a sketch follows) • Results: • 60% of the 24M potential-fan pages are removed • The # of potential centers is about 30 times the # of potential fans
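A hedged sketch of the shingling idea: represent each page by its set of w-word shingles and treat pages with highly overlapping shingle sets as duplicates. The window size, the similarity threshold, and the use of exact shingle sets (rather than hashed sketches) are simplifications for illustration:

```python
def shingles(text, w=4):
    """Set of w-word shingles of a page's text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Resemblance of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

page1 = "welcome to the toyota and honda fan page with many car links"
page2 = "welcome to the toyota and honda fan page with many car links today"

s1, s2 = shingles(page1), shingles(page2)
print(jaccard(s1, s2))            # close to 1.0 -> likely mirrors
print(jaccard(s1, s2) > 0.8)      # treat as duplicates above a chosen threshold
```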

  32. Step 3: use in-degrees of pages • Delete highly referenced pages, e.g., Yahoo, AltaVista • Reason: they are referenced for many different reasons and are unlikely to belong to an emerging community • Formally: remove all pages with more than k inlinks (k = 50, for instance) • Results: • 60M pages pointing to 20M pages • 2M potential fans

  33. Step 4: iterative pruning • To find (i,j)-cores: • Remove every potential fan whose # of out-links is < j • Remove every potential center whose # of in-links is < i • Repeat until nothing changes (a sketch follows)
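A minimal sketch of the iterative pruning loop on a toy set of candidate fan-to-center edges (the data is invented; the real computation runs out of core over hundreds of millions of pages):

```python
I, J = 3, 3                                   # looking for (3,3)-cores

# Toy candidate edges: fan page -> set of center pages it points to.
edges = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "c4"},
    "f4": {"c4"},
}

changed = True
while changed:
    changed = False
    # Prune fans that point to fewer than J surviving centers.
    for fan in [f for f, outs in edges.items() if len(outs) < J]:
        del edges[fan]
        changed = True
    # Prune centers pointed to by fewer than I surviving fans.
    indeg = {}
    for outs in edges.values():
        for c in outs:
            indeg[c] = indeg.get(c, 0) + 1
    weak = {c for c, d in indeg.items() if d < I}
    if weak:
        for fan in edges:
            edges[fan] -= weak
        changed = True

print(edges)   # surviving candidate fans and their centers
```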

  34. Step 5: inclusion-exclusion pruning • Idea: in each step, we either “include” a community or “exclude” a page from further contention • Check a page x with out-degree exactly j: x is a fan of an (i,j)-core if • at least i-1 other fans point to all of x’s forward neighbors (see the sketch below) • This condition can be checked efficiently using the indexes on fans and centers • Result: for (3,3)-cores, 5M pages remained • Final step: • Since the remaining graph is much smaller, we can afford to “enumerate” the remaining cores • Result: • about 75K (3,3)-cores • High-quality communities • See a few examples in the paper
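A hedged sketch of the fan check only, assuming we keep both a forward index (fan → centers) and an inverted index (center → fans); the names and data are illustrative:

```python
I, J = 3, 3

# Forward index: candidate fan -> the centers it points to.
fan_to_centers = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3"},
    "x":  {"c1", "c2", "c3"},
}
# Inverted index: center -> the fans pointing to it.
center_to_fans = {}
for fan, centers in fan_to_centers.items():
    for c in centers:
        center_to_fans.setdefault(c, set()).add(fan)

def is_core_fan(x):
    """x (with out-degree exactly J) is a fan of an (I,J)-core iff at least
    I-1 other fans point to every forward neighbor of x."""
    neighbors = fan_to_centers[x]
    if len(neighbors) != J:
        return False
    common = set.intersection(*(center_to_fans[c] for c in neighbors)) - {x}
    return len(common) >= I - 1

print(is_core_fan("x"))   # True: f1, f2, f3 all point to c1, c2, c3
```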
