
Learning URL Patterns for Webpage De-duplication

Learning URL Patterns for Webpage De-duplication. Authors: Hema Swetha Koppula… WSDM 2010. Reporter: Jing Chiu. Email: D9815013@mail.ntust.edu.tw. Outline: Introduction (Duplicate URLs, Problem Definition), Related Works, Algorithms (URL Preprocessing, Rule Generation), Evaluation, Conclusions.


Presentation Transcript


  1. Learning URL Patterns for Webpage De-duplication
     • Authors: Hema Swetha Koppula…
     • WSDM 2010
     • Reporter: Jing Chiu
     • Email: D9815013@mail.ntust.edu.tw
     • Data Mining & Machine Learning Lab

  2. Outline
     • Introduction
       • Duplicate URLs
       • Problem Definition
     • Related Works
     • Algorithms
       • URL Preprocessing
       • Rule Generation
     • Evaluation
     • Conclusions

  3. Introduction
     • Duplicate URLs
     • Problem Definition

  4. Duplicate URLs
     • Making URLs search engine friendly
       • http://en.wikipedia.org/wiki/Casino_Royale
       • http://en.wikipedia.org/?title=Casino_Royale
     • Session-id or cookie information present in URLs
       • http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8
       • http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8
     • Irrelevant or superfluous components in URLs
       • http://www.amazon.com/Lord-Rings/dp/B000634DCW
       • http://www.amazon.com/dp/B000634DCW
     • Webmasters construct URL representations with custom delimiters
       • http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2
       • http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2

  5. Problem Definition
     • Given: a set of duplicate clusters and their corresponding URLs
     • Learn Rules from the URL strings that can identify duplicates
     • Use the learned Rules to normalize unseen duplicate URLs into a unique normalized URL
     • Applications such as crawlers can apply these generalized Rules to a given URL to generate a normalized URL

  6. Related Works
     • Do Not Crawl in the DUST: Different URLs with Similar Text
       • Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld
       • Conference: International Conference on World Wide Web, 2007
       • DUST algorithm
         • Discovers substring substitution rules that transform URLs of similar content into one canonical URL
         • Rules are learned, each with a confidence measure, from URLs obtained from previous crawl logs or web server logs
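A substring substitution rule can be pictured as a pair (alpha, beta): wherever alpha occurs in a URL, replacing it with beta should lead to a page with similar content. The sketch below only applies one such rule; it is not the DUST algorithm, which also has to discover the rules and score them with a confidence measure. The function name `apply_substitution` and the choice to replace only the first occurrence are assumptions made for illustration, and the URLs are borrowed from the Duplicate URLs slide.

```python
def apply_substitution(url, alpha, beta):
    """Apply a substring substitution rule alpha -> beta to a URL.

    Assumption: only the first occurrence of alpha is replaced; a URL that
    does not contain alpha is returned unchanged.
    """
    if alpha not in url:
        return url
    return url.replace(alpha, beta, 1)


canonical = apply_substitution(
    "http://en.wikipedia.org/?title=Casino_Royale", "/?title=", "/wiki/"
)
print(canonical)
```

With the rule "/?title=" -> "/wiki/", the second Wikipedia URL from slide 4 is rewritten into the first one.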

  7. Related Works (cont.)
     • De-duping URLs via Rewrite Rules
       • Authors: A. Dasgupta, R. Kumar, and A. Sasturkar
       • Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
       • Considers a broader set of rule types, which subsume the DUST rules
         • DUST rules
         • Session-id rules
         • Irrelevant path components
         • Complicated rewrites
       • The algorithm learns rules from a cluster of URLs with similar page content
         • Such a cluster is referred to as a duplicate cluster or a dup cluster

  8. Algorithms
     • URL Preprocessing
       • Basic Tokenization
       • Deep Tokenization
     • Rule Generation
       • Pair-wise Rule Generation
       • Rule Generalization

  9. URL Preprocessing
     • Basic Tokenization
       • Uses the standard delimiters specified in RFC 1738
       • Extracted tokens:
         • Protocol
         • Hostname
         • Path components
         • Query-args
     • Deep Tokenization
       • Uses an unsupervised technique to learn custom URL encodings used by webmasters
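Basic tokenization can be approximated with Python's standard `urllib.parse` module, which splits a URL on the standard delimiters. This is a minimal sketch of the idea rather than the authors' implementation; the function name `basic_tokenize` and the dictionary layout are assumptions, and deep tokenization (learned custom delimiters) is not covered.

```python
from urllib.parse import urlsplit, parse_qsl


def basic_tokenize(url):
    """Split a URL into protocol, hostname, path components and query args."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.netloc,
        # Drop empty strings produced by leading or repeated slashes.
        "path": [p for p in parts.path.split("/") if p],
        # Keep query args as ordered (name, value) pairs.
        "query": parse_qsl(parts.query, keep_blank_values=True),
    }


tokens = basic_tokenize(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8"
)
print(tokens)
```

For the Stanford URL from slide 4, the session id ends up as its own `("sid", "67873")` query-arg token, which is what lets a later rule treat it separately from `cat`.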

  10. URL Preprocessing (cont.)

  11. Rule Generation
      • Definitions
        • URL
        • Rule
      • Example
        • u1: http://360.yahoo.com/friends-lttU7d6kIuGq
          • u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = lttU7d6kIuGq}
        • u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ
          • u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = nMfcaJRPUSMQ}
        • Rule
          • Context (C):
            • c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = -, c(k(3.3,1.1)) = nMfcaJRPUSMQ
          • Transformation (T):
            • t(k(3.3,1.1)) = lttU7d6kIuGq
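The rule in this example can be read as "if a URL matches the context exactly, overwrite the keys named in the transformation". The sketch below encodes u1, u2 and the rule as plain dictionaries to make that reading concrete. The tuple-of-strings key encoding and the function name `apply_rule` are assumptions for illustration, not the authors' data structures.

```python
# Keys k(x.i, y.j) are written as tuples of strings, e.g. k(3.1,1.3) -> ("3.1", "1.3").
u1 = {
    ("1", "3"): "http",
    ("2", "2"): "360.yahoo.com",
    ("3.1", "1.3"): "friends",
    ("3.2", "1.2"): "-",
    ("3.3", "1.1"): "lttU7d6kIuGq",
}
u2 = dict(u1)
u2[("3.3", "1.1")] = "nMfcaJRPUSMQ"

# The rule from the slide: the context is u2, the transformation rewrites one key.
context = dict(u2)
transformation = {("3.3", "1.1"): "lttU7d6kIuGq"}


def apply_rule(url, context, transformation):
    """Return the rewritten URL if every context key matches, else None."""
    for key, expected in context.items():
        if url.get(key) != expected:
            return None
    rewritten = dict(url)
    rewritten.update(transformation)
    return rewritten


print(apply_rule(u2, context, transformation) == u1)
print(apply_rule(u1, context, transformation))
```

Applied to u2 the rule yields u1; applied to u1 it does not fire, because the context value for k(3.3,1.1) does not match.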

  12. Rule Generation (cont.)
      • Pair-wise Rule Generation
        • Target Selection
        • Source Selection
      • Rule Generalization
        • Pair 1:
          • http://www.imdb.com/title/tt0810900/photogallery
          • http://www.imdb.com/title/tt0810900/mediaindex
        • Pair 2:
          • http://www.imdb.com/title/tt0053198/photogallery
          • http://www.imdb.com/title/tt0053198/mediaindex
        • Rule 1:
          • c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
        • Rule 2:
          • c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
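Rules 1 and 2 differ only in the title id, which suggests how generalization might work: keep the context values the rules agree on and replace the rest with the wildcard ∗ allowed in the context definition. The sketch below is a simplified stand-in under that assumption, not the generalization algorithm from the paper, which the slides do not spell out. The helper names `imdb_url`, `pairwise_rule` and `generalize` are hypothetical.

```python
WILDCARD = "*"


def imdb_url(title_id, last):
    """Build the key/value form of an IMDb URL as shown on the slide."""
    return {
        ("1", "5"): "http",
        ("2", "4"): "www.imdb.com",
        ("3", "3"): "title",
        ("4.1", "2.2"): "tt",
        ("4.2", "2.1"): title_id,
        ("5", "1"): last,
    }


def pairwise_rule(source, target):
    """Context = all of the source; transformation = keys that differ in the target.

    A key missing from the target maps to None, standing in for the symbol ⊥.
    """
    keys = source.keys() | target.keys()
    transformation = {
        k: target.get(k) for k in keys if source.get(k) != target.get(k)
    }
    return dict(source), transformation


def generalize(rule_a, rule_b):
    """Merge two rules with equal transformations by wildcarding disagreeing context values."""
    (ctx_a, t_a), (ctx_b, t_b) = rule_a, rule_b
    if t_a != t_b or ctx_a.keys() != ctx_b.keys():
        return None
    context = {k: v if ctx_b[k] == v else WILDCARD for k, v in ctx_a.items()}
    return context, dict(t_a)


rule_1 = pairwise_rule(imdb_url("0810900", "photogallery"), imdb_url("0810900", "mediaindex"))
rule_2 = pairwise_rule(imdb_url("0053198", "photogallery"), imdb_url("0053198", "mediaindex"))
general = generalize(rule_1, rule_2)
print(general)
```

Under these assumptions the merged rule keeps c(k(5,1)) = photogallery and t(k(5,1)) = mediaindex but no longer depends on a specific title id.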

  13. Evaluation
      • Dataset
      • Rule numbers after each step

  14. Evaluation (cont.)
      • Small dataset

  15. Evaluation (cont.)
      • Small dataset

  16. Evaluation (cont.)
      • Large dataset

  17. Evaluation (cont.)
      • Large dataset

  18. Conclusion
      • Presented a set of scalable and robust techniques for de-duplication of URLs
        • Basic and deep tokenization
        • Rule generation and generalization
      • Easy adaptability to the MapReduce paradigm
      • Evaluated effectiveness on both small and large datasets

  19. Thanks for your attention
      • Questions?

  20. Algorithm 1

  21. Algorithm 2

  22. Algorithm 3

  23. Algorithm 4

  24. Algorithm 5

  25. Definitions of URL
      • URL: a URL u is defined as a function
        • u : K → V ∪ {⊥}
        • K: keys
          • k(x.i,y.j)
          • x, y represent the position index from the start and from the end of the URL, respectively
          • i, j represent the deep token index
        • V: values
        • A key not present in the URL maps to ⊥
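To make the key notation concrete, the sketch below computes k(x.i,y.j) keys for a URL and reproduces the mapping for u1 from slide 11. It makes several assumptions: deep tokenization is replaced by a fixed split on "-" and "_" in path components (the paper learns delimiters with an unsupervised technique), query arguments are ignored, and keys are encoded as tuples of strings. The function name `url_to_keys` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_to_keys(url):
    """Map a URL to {(start_index, end_index): token} in the k(x.i, y.j) style."""
    parts = urlsplit(url)
    basic = [parts.scheme, parts.netloc] + [p for p in parts.path.split("/") if p]
    n = len(basic)
    keys = {}
    for x, token in enumerate(basic, start=1):
        y = n - x + 1  # position counted from the end of the URL
        if x > 2:
            # Stand-in for deep tokenization: split on "-" or "_", keeping delimiters.
            deep = [t for t in re.split(r"([-_])", token) if t]
        else:
            deep = [token]
        m = len(deep)
        if m == 1:
            keys[(str(x), str(y))] = token
            continue
        for i, sub in enumerate(deep, start=1):
            j = m - i + 1  # deep token index counted from the end
            keys[(f"{x}.{i}", f"{y}.{j}")] = sub
    return keys


u1 = url_to_keys("http://360.yahoo.com/friends-lttU7d6kIuGq")
print(u1)
# A key that is absent from the URL is reported as None, standing in for ⊥.
print(u1.get(("4", "1")))
```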

  26. Definitions of Rule
      • Rule: a Rule r is defined as a function
        • r : C → T
        • C: context
          • C : K → V ∪ {∗}
        • T: transformation
          • T : K → V ∪ {⊥, K'}
          • K' = K ∪ ValueConversions
          • ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K), ...}
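Read literally, the definition lets a transformation assign to each key one of three things: a literal value from V, the symbol ⊥ (drop the key), or an element of K', meaning the value of some key, optionally passed through a value conversion. The sketch below is one possible encoding of that reading. The class `FromKey`, the `DELETE` marker, the function `apply_transformation` and the example URL are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

DELETE = None  # stands in for ⊥: the key is removed from the URL


@dataclass(frozen=True)
class FromKey:
    """An element of K': copy the value of `key`, optionally converted."""
    key: Tuple[str, str]
    convert: Optional[Callable[[str], str]] = None


def apply_transformation(url, transformation):
    """Apply a transformation T to a URL given as a {key: value} dict."""
    result = dict(url)
    for key, target in transformation.items():
        if target is DELETE:
            result.pop(key, None)
        elif isinstance(target, FromKey):
            value = url.get(target.key)
            if value is None:
                result.pop(key, None)  # the source key is absent, so nothing to copy
            else:
                result[key] = target.convert(value) if target.convert else value
        else:
            result[key] = target  # literal value from V
    return result


# Hypothetical URL and rule: lowercase the hostname, drop the last path token.
url = {("1", "3"): "http", ("2", "2"): "WWW.Example.COM", ("3", "1"): "Index.HTML"}
transformation = {("2", "2"): FromKey(("2", "2"), str.lower), ("3", "1"): DELETE}
normalized = apply_transformation(url, transformation)
print(normalized)
```

Here `str.lower` plays the role of Lowercase(K); Uppercase, Encode and Decode would be passed in the same way.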

  27. Rule Coverage

  28. MapReduce
