
Web Search


Presentation Transcript


  1. Web Search Week 6 LBSC 796/INFM 718R March 9, 2011

  2. Washington Post, February 10, 2011

  3. What “Caused” the Web? • Affordable storage • 300,000 words/$ by 1995 • Adequate backbone capacity • 25,000 simultaneous transfers by 1995 • Adequate “last mile” bandwidth • 1 second/screen (of text) by 1995 • Display capability • 10% of US population could see images by 1995 • Effective search capabilities • Lycos and Yahoo! achieved useful scale in 1994-1995

  4. Defining the Web • HTTP, HTML, or URL? • Static, dynamic or streaming? • Public, protected, or internal?

  5. Number of Web Sites

  6. What’s a Web “Site”? • Any server at port 80? • Misses many servers at other ports • Some servers host unrelated content • Geocities • Some content requires specialized servers • rtsp

  7. Web Servers

  8. Web Pages (2005) Gulli and Signorini, 2005

  9. The Indexed Web in 2011

  10. Crawling the Web

  11. Basic crawl architecture

  12. URL frontier

  13. Mercator URL frontier • URLs flow in from the top into the frontier. • Front queues manage prioritization. • Back queues enforce politeness. • Each queue is FIFO.

  14. Mercator URL frontier: Front queues • Selection from the front queues is initiated by the back queues • Pick a front queue from which to select the next URL: round robin, randomly, or a more sophisticated variant • But with a bias in favor of high-priority front queues

  15. Mercator URL frontier: Back queues • When we have emptied a back queue q: • Repeat (i) pull URLs u from the front queues and (ii) add u to its corresponding back queue . . . • . . . until we get a u whose host does not have a back queue. • Then put u in q and create a heap entry for it.
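The frontier on slides 13-15 can be sketched in a few dozen lines. The Python below is a toy illustration only, not Mercator's actual design: the class name, the number of front queues, the one-back-queue-per-host scheme, and the fixed politeness delay are all assumptions made for the example.

    import heapq
    import time
    from collections import deque
    from urllib.parse import urlparse

    class ToyFrontier:
        """Toy Mercator-style frontier: FIFO front queues ordered by priority,
        one FIFO back queue per host, and a heap of (next allowed fetch time,
        host) entries that enforces politeness."""

        def __init__(self, priorities=3, delay=2.0):
            self.front = [deque() for _ in range(priorities)]  # index 0 = highest priority
            self.back = {}      # host -> deque of URLs waiting for that host
            self.heap = []      # (earliest fetch time, host), one entry per back queue
            self.delay = delay  # seconds between requests to the same host

        def add(self, url, priority=0):
            """URLs flow in at the top; a prioritizer picks the front queue."""
            self.front[min(priority, len(self.front) - 1)].append(url)

        def _refill(self):
            """Pull URLs from the front queues (high priority first) until one
            belongs to a host with no back queue; that URL starts a new back
            queue and gets a heap entry. URLs for known hosts join theirs."""
            for q in self.front:
                while q:
                    url = q.popleft()
                    host = urlparse(url).netloc
                    if host in self.back:
                        self.back[host].append(url)
                    else:
                        self.back[host] = deque([url])
                        heapq.heappush(self.heap, (time.monotonic(), host))
                        return

        def next_url(self):
            """Return the next URL that politeness allows us to fetch, or None."""
            if not self.heap:
                self._refill()
            if not self.heap:
                return None
            ready_at, host = heapq.heappop(self.heap)
            time.sleep(max(0.0, ready_at - time.monotonic()))
            url = self.back[host].popleft()
            if self.back[host]:   # host still has work: reschedule after the delay
                heapq.heappush(self.heap, (time.monotonic() + self.delay, host))
            else:                 # back queue emptied: drop it and refill from the front
                del self.back[host]
                self._refill()
            return url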

  16. Web Crawling Algorithm • Put a set of known sites on a queue • Repeat until the queue is empty: • Take the first page off the queue • Check to see if this page has been processed • If this page has not yet been processed: • Add this page to the index • Add each link on the current page to the queue • Record that this page has been processed
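A minimal, runnable rendering of this loop in Python might look as follows; fetch_page, extract_links, and index_page are hypothetical helpers standing in for the fetcher, link extractor, and indexer, which the slide does not specify.

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_links, index_page):
        """Breadth-first crawl loop following the algorithm on slide 16."""
        queue = deque(seed_urls)            # put a set of known sites on a queue
        processed = set()                   # pages we have already handled
        while queue:                        # repeat until the queue is empty
            url = queue.popleft()           # take the first page off the queue
            if url in processed:            # skip pages already processed
                continue
            page = fetch_page(url)
            index_page(url, page)           # add this page to the index
            queue.extend(extract_links(page))  # add each link on the page to the queue
            processed.add(url)              # record that this page has been processed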

  17. Link Structure of the Web

  18. Web Crawl Challenges • Politeness • Discovering “islands” and “peninsulas” • Duplicate and near-duplicate content • 30-40% of total content • Server and network loads • Dynamic content generation • Link rot • Changes at ~1% per week • Temporary server interruptions • Spider traps

  19. Duplicate Detection • Structural • Identical directory structure (e.g., mirrors, aliases) • Syntactic • Identical bytes • Identical markup (HTML, XML, …) • Semantic • Identical content • Similar content (e.g., with a different banner ad) • Related content (e.g., translated)

  20. Near-duplicates: Example

  21. Detecting near-duplicates • Compute similarity with an edit-distance measure • We want “syntactic” (as opposed to semantic) similarity. • True semantic similarity (similarity in content) is too difficult to compute. • We do not consider documents near-duplicates if they have the same content, but express it with different words. • Use similarity threshold θ to make the call “is/isn’t a near-duplicate”. • E.g., two documents are near-duplicates if similarity > θ = 80%.

  22. Represent each document as set of shingles • A shingle is simply a word n-gram. • Shingles are used as features to measure syntactic similarity of documents. • For example, for n = 3, “a rose is a rose is a rose” would be represented as this set of shingles: • { a-rose-is, rose-is-a, is-a-rose } • We can map shingles to 1..2^m (e.g., m = 64) by fingerprinting. • From now on: s_k refers to the shingle’s fingerprint in 1..2^m. • We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
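As an illustration of slides 21-22, the Python sketch below builds shingle sets, fingerprints each shingle (a 64-bit slice of MD5 is used purely as a stand-in for whatever fingerprint function the slides assume), and calls two documents near-duplicates when their Jaccard coefficient exceeds θ.

    import hashlib

    def shingles(text, n=3):
        """Return the set of fingerprinted word n-grams (shingles) for a document."""
        words = text.lower().split()
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        # Map each shingle into 1..2^m (m = 64) via a truncated MD5 fingerprint.
        return {int(hashlib.md5(g.encode()).hexdigest()[:16], 16) for g in grams}

    def jaccard(s1, s2):
        """Similarity of two documents = Jaccard coefficient of their shingle sets."""
        return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

    d1 = shingles("a rose is a rose is a rose")
    d2 = shingles("a rose is a rose is a flower")
    print(jaccard(d1, d2) > 0.8)   # near-duplicate call with threshold theta = 80%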

  23. Shingling: Summary • Input: N documents • Choose n-gram size for shingling, e.g., n = 5 • Pick 200 random permutations, represented as hash functions • Compute N sketches: 200 × N matrix shown on previous slide, one row per permutation, one column per document • Compute pairwise similarities • Transitive closure of documents with similarity > θ • Index only one document from each equivalence class
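The 200-permutation sketch can be illustrated with simple linear hash functions standing in for the random permutations (an approximation assumed here, not taken from the slides): each document keeps only the minimum value under each hash function, and the fraction of positions where two sketches agree estimates the Jaccard similarity of their shingle sets.

    import random

    def minhash_sketch(shingle_set, num_perm=200, prime=2**61 - 1, seed=0):
        """One sketch column: the minimum of each hash function over the shingles."""
        rng = random.Random(seed)           # same seed -> same 200 hash functions
        params = [(rng.randrange(1, prime), rng.randrange(prime))
                  for _ in range(num_perm)]
        return [min((a * s + b) % prime for s in shingle_set) for a, b in params]

    def estimated_similarity(sketch1, sketch2):
        """Fraction of agreeing positions estimates Jaccard similarity."""
        return sum(x == y for x, y in zip(sketch1, sketch2)) / len(sketch1)

    # Reusing the shingle sets d1 and d2 from the previous sketch:
    # print(estimated_similarity(minhash_sketch(d1), minhash_sketch(d2)))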

  24. Robots Exclusion Protocol • Depends on voluntary compliance by crawlers • Exclusion by site • Create a robots.txt file at the server’s top level • Indicate which directories not to crawl • Exclusion by document (in HTML head) • Not implemented by all crawlers <meta name="robots" content="noindex,nofollow">
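For exclusion by site, Python's standard urllib.robotparser module shows how a compliant crawler consults robots.txt before fetching; the host and user-agent name below are only examples.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.org/robots.txt")   # robots.txt at the server's top level
    rp.read()                                     # fetch and parse the file
    # A polite crawler checks every URL before requesting it:
    print(rp.can_fetch("MyCrawler", "http://example.org/private/page.html"))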

  25. Hands on: The Internet Archive • Web crawls since 1997 • http://archive.org • Check out the iSchool’s Web site in 1997 • Check out the history of your favorite site

  26. Indexing Anchor Text • A type of “document expansion” • Terms near links describe content of the target • Works even when you can’t index content • Image retrieval, uncrawled links, …

  27. Estimating Authority from Links (diagram: a hub page linking to authority pages)

  28. Simplified PageRank Algorithm • R(u) = c Σ_{v ∈ B_u} R(v) / N_v • R(u): PageRank score of page u • B_u: the set of pages that link to u • R(v): PageRank score of page v • N_v: number of links from page v • c: normalization factor

  29. PageRank Algorithm Example • Four pages: A links to B and C (weight ½ each), B links to A and D (weight ½ each), C links to A, and D links to A

  30. First iteration, starting from R1(A) = R1(B) = R1(C) = R1(D) = 2: • R2(A) = ½ × 2 + 1 × 2 + 1 × 2 = 5 • R2(B) = ½ × 2 = 1 • R2(C) = ½ × 2 = 1 • R2(D) = ½ × 2 = 1

  31. Second iteration: • R3(A) = ½ × 1 + 1 × 1 + 1 × 1 = 2.5 • R3(B) = ½ × 5 = 2.5 • R3(C) = ½ × 5 = 2.5 • R3(D) = ½ × 1 = 0.5

  32. Convergence • The scores settle at R(A) = 4, R(B) = 2, R(C) = 2, R(D) = 1, which reproduce themselves under the update: • R(A)' = ½ × 2 + 1 × 2 + 1 × 1 = 4 • R(B)' = ½ × 4 = 2 • R(C)' = ½ × 4 = 2 • R(D)' = ½ × 2 = 1
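Putting slides 28-32 together, the short Python sketch below runs the simplified PageRank update by power iteration on the example graph; the per-round renormalization plays the role of the factor c, and the scores settle in the ratio 4 : 2 : 2 : 1, matching the convergence slide.

    def simplified_pagerank(links, iterations=50):
        """Power iteration for the simplified PageRank formula on slide 28."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: 0.0 for p in pages}
            for v, outlinks in links.items():
                share = rank[v] / len(outlinks)       # R(v) / N_v
                for u in outlinks:
                    new[u] += share                   # sum over v in B_u
            total = sum(new.values())                 # normalization factor c
            rank = {p: r / total for p, r in new.items()}
        return rank

    # The example graph: A -> B, C;  B -> A, D;  C -> A;  D -> A.
    links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["A"]}
    print(simplified_pagerank(links))  # ≈ {A: 0.44, B: 0.22, C: 0.22, D: 0.11}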

  33. Index Spam • Goal: Manipulate rankings of an IR system • Multiple strategies: • Create bogus user-assigned metadata • Add invisible text (font in background color, …) • Alter your text to include desired query terms • “Link exchanges” create links to your page • Cloaking

  34. Spamming with Link Farms (diagram: pages A, B, and C, each with R = 1, supported by large sets of generated pages 1, 2, 3, … that link to them)

  35. Adversarial IR • Search is user-controlled suppression • Everything is known to the search system • Goal: avoid showing things the user doesn’t want • Other stakeholders have different goals • Authors risk little by wasting your time • Marketers hope for serendipitous interest

  36. “Safe Search” • Text • Whitelists and blacklists • Link structure • Image analysis

  37. Computational Advertising • Variant of a search problem • Jointly optimize relevance and revenue • Triangulating features • Queries • Clicks • Page visits • Auction markets

  38. Internet Users http://www.internetworldstats.com/

  39. Global Internet Users Web Pages

  40. Global Internet Users Native speakers, Global Reach projection for 2004 (as of Sept, 2003)

  41. Global Internet Users Web Pages Native speakers, Global Reach projection for 2004 (as of Sept, 2003)

  42. Search Engine Query Logs • A: Southeast Asia (Dec 27, 2004) • B: Indonesia (Mar 29, 2005) • C: Pakistan (Oct 10, 2005) • D: Hawaii (Oct 16, 2006) • E: Indonesia (Aug 8, 2007) • F: Peru (Aug 16, 2007)

  43. Query Statistics Pass, et al., “A Picture of Search,” 2007

  44. Pass, et al., “A Picture of Search,” 2007

  45. Temporal Variation Pass, et al., “A Picture of Search,” 2007

  46. Pass, et al., “A Picture of Search,” 2007

  47. 28% of Queries are Reformulations Pass, et al., “A Picture of Search,” 2007

  48. The Tracking Ecosystem http://wsj.com/wtk

  49. AOL User 4417749
