
Detecting Phrase-Level Duplication on the World Wide Web

Fetterly, Manasse, Najork. Paper presentation by Vinay Goel.


Presentation Transcript


  1. Detecting Phrase-Level Duplication on the World Wide Web Fetterly, Manasse, Najork Paper Presentation by: Vinay Goel

  2. Introduction • Problem • Identify instances of “slice and dice” generation • Example • German spammer • 1 million URLs originating from a single IP address (but using many host names) • Pages changed completely on every download • Pages consisted of grammatically well-formed sentences stitched together at random

  3. Goal • Find instances of sentence-level synthesis of web pages • More generally, find pages with an unusually large number of popular phrases

  4. The Data • Datasets • DS1 • BFS crawl starting at www.yahoo.com • 151 million HTML pages • DS2 • Large crawl conducted by MSN search • 96 million HTML pages chosen at random

  5. Finding Phrase Replication • Sampling • Reduce each document to a feature vector • Employ a variant of the shingling algorithm of Broder et al. • Significantly reduces the data volume

  6. Sampling method • Replace all HTML markup by white-space • k-phrases of a document: all sequences of k consecutive words • Treat the document as a circle: the last word is followed by the first word • An n-word document therefore has exactly n k-phrases
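
The circular k-phrase definition on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `k_phrases` and the crude tag-stripping regular expression are assumptions made for the example.

```python
import re


def k_phrases(html: str, k: int) -> list[tuple[str, ...]]:
    """Return the k-phrases of a document, treating the document as a circle."""
    # Assumption: anything between '<' and '>' counts as markup and is
    # replaced by white-space. Real HTML handling is more involved.
    text = re.sub(r"<[^>]*>", " ", html)
    words = text.split()
    n = len(words)
    if n == 0:
        return []
    # Phrase i starts at word i; indices wrap around, so the last word
    # is followed by the first one and there are exactly n phrases.
    return [tuple(words[(i + j) % n] for j in range(k)) for i in range(n)]


phrases = k_phrases("<p>a b c</p> d", 2)
```

For the four-word input above, the call yields four 2-phrases, the last of which wraps from `d` back to `a`.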

  7. Sampling method • Exploit properties of Rabin fingerprints • Rabin fingerprints support efficient extension and prefix deletion • Fingerprints of distinct bit patterns are distinct with high probability
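
To show why extension and prefix deletion matter, here is a small sliding-window sketch. It uses an integer polynomial hash modulo a prime as a stand-in; it is not a true Rabin fingerprint (those work with polynomials over GF(2)), and the names `BASE`, `MOD`, `extend`, and `delete_prefix` are assumptions made for the example.

```python
BASE = 257
MOD = (1 << 61) - 1  # a Mersenne prime, chosen here only for illustration


def extend(fp: int, token: int) -> int:
    """Fingerprint of the sequence with `token` appended."""
    return (fp * BASE + token) % MOD


def delete_prefix(fp: int, first_token: int, length: int) -> int:
    """Fingerprint of a `length`-token sequence with its first token removed."""
    return (fp - first_token * pow(BASE, length - 1, MOD)) % MOD


def fingerprint(tokens: list[int]) -> int:
    """Fingerprint of a whole token sequence, built by repeated extension."""
    fp = 0
    for t in tokens:
        fp = extend(fp, t)
    return fp


# Slide a 3-token window one step to the right without rehashing it.
tokens = [3, 1, 4, 1, 5]
window = fingerprint(tokens[0:3])
window = delete_prefix(window, tokens[0], 3)
window = extend(window, tokens[3])
```

After the two updates, `window` equals the fingerprint of `tokens[1:4]`, so each of the n phrase fingerprints costs a constant number of operations rather than k.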

  8. Computing feature vectors • Fingerprint each word in the document - gives n tokens • Compute fingerprint of each k-token phrase - gives n phrase fingerprints • Apply m different fingerprint functions • Retain the smallest of the n resulting values for each function • Vector of m fingerprints representative of document (elements referred to as shingles)
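
The steps on this slide can be sketched roughly as below. Two simplifications are assumptions of this example: salted BLAKE2b from Python's `hashlib` stands in for the m Rabin fingerprint functions, and phrases are hashed directly as text rather than as sequences of per-word fingerprints. The names `feature_vector` and `_fp` are hypothetical.

```python
import hashlib


def _fp(data: bytes, seed: int) -> int:
    """One member of a family of hash functions, selected by `seed`."""
    digest = hashlib.blake2b(
        data, digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def feature_vector(words: list[str], k: int, m: int) -> list[int]:
    """Return m shingles: for each function, the smallest phrase fingerprint."""
    n = len(words)
    if n == 0:
        return []
    # The n circular k-phrases of the document, encoded as bytes.
    phrases = [
        " ".join(words[(i + j) % n] for j in range(k)).encode("utf-8")
        for i in range(n)
    ]
    # For each of the m functions, keep only the minimum of the n values.
    return [min(_fp(p, seed) for p in phrases) for seed in range(m)]


v1 = feature_vector("a b c d e".split(), 3, 4)
v2 = feature_vector("c d e a b".split(), 3, 4)
```

Because the document is treated as a circle, the rotated word order in `v2` produces the same set of phrases and therefore the same vector as `v1`.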

  9. Duplicate Suppression • Replication rampant on the web • Clustered all pages in data set into equivalence classes • Each class contains all pages that are exact or near duplicates of one another
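
The slide does not say how the equivalence classes were computed. One possible sketch, assuming that two pages count as near duplicates when their feature vectors agree in at least `min_match` positions (a hypothetical criterion and parameter), uses a union-find structure over all page pairs:

```python
def cluster_duplicates(
    vectors: dict[str, list[int]], min_match: int
) -> list[set[str]]:
    """Group documents into classes of (assumed) near duplicates."""
    parent = {doc: doc for doc in vectors}

    def find(x: str) -> str:
        # Follow parent links to the class representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    docs = list(vectors)
    # Quadratic pairwise comparison: fine for a sketch, not for a web crawl.
    for idx, a in enumerate(docs):
        for b in docs[idx + 1:]:
            matches = sum(x == y for x, y in zip(vectors[a], vectors[b]))
            if matches >= min_match:
                parent[find(a)] = find(b)

    classes: dict[str, set[str]] = {}
    for doc in docs:
        classes.setdefault(find(doc), set()).add(doc)
    return list(classes.values())


groups = cluster_duplicates(
    {"a": [1, 2, 3], "b": [1, 2, 9], "c": [7, 8, 9]}, min_match=2
)
```

Here `a` and `b` share two of three shingles and end up in one class, while `c` stays alone. At the scale of the datasets in the talk, a pairwise loop would have to be replaced by grouping on shared shingles.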

  10. Popular phrases • Occur in more documents than would be expected by chance • Assumptions: • “Normal” web pages characterized by a generative model • Sought web pages - copying model (need to consider number of phrases, length of typical documents…)

  11. Popular Phrases • Limit attention to the shingles chosen by sampling functions • Phrase is popular if selected as shingle in sufficiently many documents • To determine popular phrases, consider triplets (i,s,d)
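
A minimal counting sketch follows. It reads the triplet (i,s,d) as sampling function i selecting shingle s in document d, which is an interpretation since the slide does not define the symbols. The threshold `min_docs` and the function name are hypothetical.

```python
from collections import Counter


def popular_shingles(
    vectors: dict[str, list[int]], min_docs: int
) -> dict[tuple[int, int], int]:
    """Map (i, s) to its document count, keeping only popular shingles."""
    counts: Counter[tuple[int, int]] = Counter()
    for doc_id, vector in vectors.items():
        # Each document d contributes exactly one triplet (i, s, d) per i.
        for i, s in enumerate(vector):
            counts[(i, s)] += 1
    return {key: c for key, c in counts.items() if c >= min_docs}


popular = popular_shingles(
    {"d1": [10, 20], "d2": [10, 21], "d3": [10, 20], "d4": [11, 22]},
    min_docs=2,
)
```

In this toy input, shingle 10 under function 0 appears in three documents and shingle 20 under function 1 in two, so both pass the threshold of 2.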

  12. Popular Phrases • First 24 most popular phrases not very interesting • Starting from the 36th phrase, discover phrases caused by machine generated content • Templatic form: common text, “fill in the blank” slots and optional • 60th phrase - instance of idiomatic phrase

  13. Zipfian Distribution

  14. Histogram of popular shingles per doc

  15. Covering set • Covering sets for shingles of each page • Approximate a minimum covering set using a greedy heuristic
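
The greedy heuristic can be sketched as the standard greedy approximation for set cover: repeatedly pick the page that covers the most still-uncovered shingles. The data layout (a set of shingles per page) and the name `greedy_cover` are assumptions of this example, not details taken from the talk.

```python
def greedy_cover(target: set[int], candidates: dict[str, set[int]]) -> list[str]:
    """Greedily choose pages whose shingles cover those of the target page."""
    uncovered = set(target)
    chosen: list[str] = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered shingles.
        best = max(
            candidates,
            key=lambda d: len(candidates[d] & uncovered),
            default=None,
        )
        if best is None or not candidates[best] & uncovered:
            break  # the remaining shingles appear in no candidate page
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


cover = greedy_cover(
    {1, 2, 3, 4, 5},
    {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}, "D": {9}},
)
```

The example picks `A` first (three shingles), then `C` (the remaining two). The size of the resulting list is the covering set size for the page.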

  16. Distribution of covering set sizes

  17. German spammer

  18. Looking for likely sources

  19. Conclusion • Power law distribution • Popular phrases • Often limited by design choices • Legal disclaimers • Navigational phrases • “fill in the blanks” • More replicated than original content
