This paper presents a method for identifying and normalizing duplicate URLs to improve the efficiency of web crawlers and search engines. It covers the problem of duplicate URLs, algorithms for URL preprocessing, rule generation, and the evaluation of these methods. Key elements include a discussion on making URLs search-engine friendly, the impact of session IDs and irrelevant components, and a comparison of existing algorithms for handling duplicates. Through experimental data, the study demonstrates the effectiveness of learned rules in generating unique normalized URLs.
Learning URL Patterns for Webpage De-duplication Authors: Hema Swetha Koppula… WSDM 2010 Reporter: Jing Chiu Email: D9815013@mail.ntust.edu.tw Data Mining & Machine Learning Lab
Outline • Introduction • Duplicate URLs • Problem Definition • Related Works • Algorithms • URL Preprocessing • Rule Generation • Evaluation • Conclusions
Introduction • Duplicate URLs • Problem Definition
Duplicate URLs • Making URLs search engine friendly • http://en.wikipedia.org/wiki/Casino_Royale • http://en.wikipedia.org/?title=Casino_Royale • Session-id or cookie information present in URLs • http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8 • http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8 • Irrelevant or superfluous components in URLs • http://www.amazon.com/Lord-Rings/dp/B000634DCW • http://www.amazon.com/dp/B000634DCW • Webmasters construct URL representations with custom delimiters • http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2 • http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2
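To make the session-id case concrete, here is a minimal sketch that maps the two Stanford URLs above to the same string by dropping one query argument. The parameter list `IRRELEVANT_PARAMS` and the function name are assumptions made for illustration; the rule is hand-written here, whereas the paper learns such rules from duplicate clusters.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical, hand-picked set of query arguments that do not affect page content.
IRRELEVANT_PARAMS = {"sid"}

def drop_irrelevant_params(url: str) -> str:
    """Remove query arguments listed in IRRELEVANT_PARAMS, keeping the rest in order."""
    parts = urlsplit(url)
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in IRRELEVANT_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

a = "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8"
b = "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8"
print(drop_irrelevant_params(a))
print(drop_irrelevant_params(a) == drop_irrelevant_params(b))
```

A fixed list like this only works for sites whose conventions are already known, which is the motivation for learning the rules instead.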
Problem Definition • Given a set of duplicate clusters and their corresponding URLs • Learn Rules from URL strings that can identify duplicates • Use the learned Rules to normalize unseen duplicate URLs into a unique normalized URL • Applications such as crawlers can apply these generalized Rules to a given URL to generate a normalized URL
Related Works • Do Not Crawl in the DUST: Different URLs with Similar Text • Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld • Conference: International Conference on World Wide Web 2007 • DUST algorithm • Discovers substring substitution rules that transform URLs of similar content into one canonical URL • Rules are learned, with a confidence measure, from URLs obtained from previous crawl logs or web server logs
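A substring substitution rule of the kind DUST discovers can be pictured as a pair (old, new) applied to the URL string. The sketch below is illustrative only: the rule is hand-picked from the Wikipedia example earlier in the deck, and the function name is an assumption; it does not show how DUST learns rules or computes confidence.

```python
def apply_substitution(url: str, old: str, new: str) -> str:
    """Apply one substring substitution rule (old -> new) to the first match, if any."""
    return url.replace(old, new, 1)

# Hand-picked rule for illustration, not a rule reported by the DUST authors.
rule = ("/?title=", "/wiki/")
u = "http://en.wikipedia.org/?title=Casino_Royale"
print(apply_substitution(u, *rule))
```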
Related Works (cont.) • De-duping URLs via Rewrite Rules • Authors: A. Dasgupta, R. Kumar, and A. Sasturkar • Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining • Considers a broader set of rule types that subsume the DUST rules • DUST rules • Session-id rules • Irrelevant path components • Complicated rewrites • The algorithm learns rules from a cluster of URLs with similar page content • Such a cluster is referred to as a duplicate cluster or dup cluster
Algorithms • URL Preprocessing • Basic Tokenization • Deep Tokenization • Rule Generation • Pair-wise Rule Generation • Rule Generalization
URL Preprocessing • Basic Tokenization • Uses the standard delimiters specified in RFC 1738 • Extracted tokens: • Protocol • Hostname • Path components • Query-args • Deep Tokenization • Uses an unsupervised technique to learn custom URL encodings used by webmasters
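Basic tokenization can be approximated with Python's standard `urllib.parse`, as in the sketch below. The function name and the choice to represent each query argument as a single `name=value` string are assumptions for illustration; deep tokenization is not shown, since it depends on encodings learned from data.

```python
from urllib.parse import urlsplit, parse_qsl

def basic_tokenize(url: str) -> list[str]:
    """Split a URL into protocol, hostname, path components, and query arguments."""
    parts = urlsplit(url)
    tokens = [parts.scheme, parts.netloc]
    # Path components are separated by "/"; empty segments are skipped.
    tokens += [segment for segment in parts.path.split("/") if segment]
    # Each query argument is kept as one "name=value" token (an assumption).
    tokens += [f"{name}={value}"
               for name, value in parse_qsl(parts.query, keep_blank_values=True)]
    return tokens

print(basic_tokenize("http://www.amazon.com/Lord-Rings/dp/B000634DCW"))
print(basic_tokenize("http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8"))
```

On the eBay example with `QQ` and `Z` delimiters, this basic pass leaves the whole custom-encoded segment as one token, which is the gap deep tokenization is meant to close.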
Rule Generation • Definitions • URL • Rule • Example • u1: http://360.yahoo.com/friends-lttU7d6kIuGq • u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = lttU7d6kIuGq} • u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ • u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = nMfcaJRPUSMQ} • Rule (source u2, target u1) • Context (C): • c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = -, c(k(3.3,1.1)) = nMfcaJRPUSMQ • Transformation (T): • t(k(3.3,1.1)) = lttU7d6kIuGq
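The example rule can be reproduced with a small sketch: the context is the full token map of the source URL, and the transformation records the target's value for every key where source and target differ. The function name, the use of plain strings as keys, and `None` as a stand-in for ⊥ are assumptions for illustration; source and target selection are not modeled.

```python
def pairwise_rule(source: dict, target: dict) -> tuple[dict, dict]:
    """Build (context, transformation) that rewrites `source` into `target`."""
    context = dict(source)  # every source token becomes part of the context
    transformation = {}
    for key in source.keys() | target.keys():
        if source.get(key) != target.get(key):
            # None stands in for ⊥ when the key is absent from the target.
            transformation[key] = target.get(key)
    return context, transformation

u1 = {"k(1,3)": "http", "k(2,2)": "360.yahoo.com", "k(3.1,1.3)": "friends",
      "k(3.2,1.2)": "-", "k(3.3,1.1)": "lttU7d6kIuGq"}
u2 = {"k(1,3)": "http", "k(2,2)": "360.yahoo.com", "k(3.1,1.3)": "friends",
      "k(3.2,1.2)": "-", "k(3.3,1.1)": "nMfcaJRPUSMQ"}

context, transformation = pairwise_rule(source=u2, target=u1)
print(transformation)  # {'k(3.3,1.1)': 'lttU7d6kIuGq'}
```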
Rule Generation (cont.) • Pair-wise Rule Generation • Target Selection • Source Selection • Rule Generalization • Pair 1: • http://www.imdb.com/title/tt0810900/photogallery • http://www.imdb.com/title/tt0810900/mediaindex • Pair 2: • http://www.imdb.com/title/tt0053198/photogallery • http://www.imdb.com/title/tt0053198/mediaindex • Rule 1: • c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex • Rule 2: • c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
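Rule 1 and Rule 2 differ only in the title id, so they can be merged into one rule whose context accepts any value at that key. The sketch below is a simplified stand-in that replaces disagreeing context values with a wildcard; it is not necessarily the generalization procedure the authors use, and the function name and rule representation are assumptions.

```python
WILDCARD = "*"

def generalize(rule_a, rule_b):
    """Merge two (context, transformation) rules, or return None if they are incompatible."""
    context_a, transform_a = rule_a
    context_b, transform_b = rule_b
    if transform_a != transform_b or context_a.keys() != context_b.keys():
        return None
    # Keep context values both rules agree on; wildcard the rest.
    context = {key: (value if context_b[key] == value else WILDCARD)
               for key, value in context_a.items()}
    return context, dict(transform_a)

def imdb_rule(title_id: str):
    context = {"k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
               "k(4.1,2.2)": "tt", "k(4.2,2.1)": title_id, "k(5,1)": "photogallery"}
    return context, {"k(5,1)": "mediaindex"}

rule_1 = imdb_rule("0810900")
rule_2 = imdb_rule("0053198")
general_context, general_transform = generalize(rule_1, rule_2)
print(general_context["k(4.2,2.1)"])  # *
```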
Evaluation • Dataset • Number of rules after each step
Evaluation (cont.) • Small dataset
Evaluation (cont.) • Large dataset
Conclusion • Presented a set of scalable and robust techniques for de-duplication of URLs • Basic and deep tokenization • Rule generation and generalization • Easy adaptability to the MapReduce paradigm • Evaluated effectiveness on both small and large datasets
Thanks for your attention • Questions?
Algorithm 1
Algorithm 2
Algorithm 3
Algorithm 4
Algorithm 5
Definitions of URL • URL: A URL u is defined as a function • u : K → V ∪ {⊥} • K: keys • k(x.i,y.j) • x, y represent the token position index from the start and from the end of the URL • i, j represent the deep token index from the start and from the end of the token • V: values • A key not present in the URL is denoted by ⊥
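The key scheme can be sketched as follows, assuming the URL has already been split into basic tokens and each basic token into a list of deep tokens. The function name and the input layout are assumptions; the key strings follow the notation of the 360.yahoo.com example, where the deep index is omitted for tokens that have only one deep token.

```python
def index_tokens(tokens: list[list[str]]) -> dict[str, str]:
    """Map keys k(x.i,y.j) to values for a URL given as basic tokens of deep tokens."""
    url_map = {}
    n = len(tokens)
    for x, deep_tokens in enumerate(tokens, start=1):
        y = n - x + 1                      # position counted from the end of the URL
        m = len(deep_tokens)
        for i, value in enumerate(deep_tokens, start=1):
            j = m - i + 1                  # deep index counted from the end of the token
            key = f"k({x},{y})" if m == 1 else f"k({x}.{i},{y}.{j})"
            url_map[key] = value
    return url_map

u1 = index_tokens([["http"], ["360.yahoo.com"], ["friends", "-", "lttU7d6kIuGq"]])
print(u1)
print(u1.get("k(4,1)"))  # None plays the role of ⊥ for a key not present
```

Indexing from both ends lets a rule refer to, say, the last path component regardless of how many components precede it.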
Definitions of Rule • RULE: A Rule r is defined as a function • r : C → T • C: context • C : K → V ∪ {∗} • T: transformation • T : K → V ∪ {⊥,K’} • K’ = K ∪ ValueConversions • ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K), ...}
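Read operationally, a rule fires when every context entry matches the URL (with ∗ matching any value) and then rewrites the keys named in the transformation. The sketch below is one possible reading under stated assumptions: `None` encodes ⊥ (drop the key), a `(conversion, source_key)` tuple encodes a value conversion, only two conversions are included, and a wildcard is taken to require that the key be present. All names are hypothetical.

```python
WILDCARD = "*"
CONVERSIONS = {"Lowercase": str.lower, "Uppercase": str.upper}  # partial, for illustration

def matches(context: dict, url_map: dict) -> bool:
    """True if every context key is present and equals the wanted value or a wildcard."""
    return all(key in url_map and (want == WILDCARD or url_map[key] == want)
               for key, want in context.items())

def apply_rule(context: dict, transformation: dict, url_map: dict) -> dict:
    """Return the rewritten token map, or an unchanged copy if the context does not match."""
    result = dict(url_map)
    if not matches(context, url_map):
        return result
    for key, action in transformation.items():
        if action is None:                      # ⊥: remove the key
            result.pop(key, None)
        elif isinstance(action, tuple):         # (conversion name, source key)
            name, source_key = action
            result[key] = CONVERSIONS[name](url_map[source_key])
        else:                                   # literal replacement value
            result[key] = action
    return result

context = {"k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
           "k(4.1,2.2)": "tt", "k(4.2,2.1)": WILDCARD, "k(5,1)": "photogallery"}
transformation = {"k(5,1)": "mediaindex"}

# Hypothetical unseen URL: http://www.imdb.com/title/tt0000001/photogallery
unseen = {"k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
          "k(4.1,2.2)": "tt", "k(4.2,2.1)": "0000001", "k(5,1)": "photogallery"}
print(apply_rule(context, transformation, unseen)["k(5,1)"])  # mediaindex
```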
Rule Coverage
MapReduce