Learning URL Patterns for Webpage De-duplication

Authors: Hema Swetha Koppula… • WSDM 2010 • Reporter: Jing Chiu • Email: D9815013@mail.ntust.edu.tw • Data Mining & Machine Learning Lab

Outline
• Introduction
  • Duplicate URLs
  • Problem Definition
• Related Works
• Algorithms
  • URL Preprocessing
  • Rule Generation
• Evaluation
• Conclusions

Introduction • Duplicate URLs • Problem Definition

Duplicate URLs
• Making URLs search engine friendly
  • http://en.wikipedia.org/wiki/Casino_Royale
  • http://en.wikipedia.org/?title=Casino_Royale
• Session-id or cookie information present in URLs
  • http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8
  • http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8
• Irrelevant or superfluous components in URLs
  • http://www.amazon.com/Lord-Rings/dp/B000634DCW
  • http://www.amazon.com/dp/B000634DCW
• Webmasters construct URL representations with custom delimiters
  • http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2
  • http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2

Problem Definition
• Given: a set of duplicate clusters and their corresponding URLs
• Learn Rules from the URL strings that can identify duplicates
• Use the learned Rules to normalize unseen duplicate URLs into a unique normalized URL
• Applications such as crawlers can apply these generalized Rules to a given URL to generate a normalized URL

Related Works
• "Do not crawl in the DUST: different URLs with similar text"
  • Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld
  • Conference: International Conference on World Wide Web, 2007
• DUST algorithm
  • Discovers substring substitution rules that transform URLs of similar content into one canonical URL
  • Rules are learned, with a confidence measure, from URLs obtained from previous crawl logs or web server logs

Related Works (cont.)
• "De-duping URLs via rewrite rules"
  • Authors: A. Dasgupta, R. Kumar, and A. Sasturkar
  • Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
• Considers a broader set of rule types that subsume the DUST rules
  • DUST rules
  • Session-id rules
  • Irrelevant path components
  • Complicated rewrites
• The algorithm learns rules from a cluster of URLs with similar page content
  • Such a cluster is referred to as a duplicate cluster or a dup cluster

Algorithms
• URL Preprocessing
  • Basic Tokenization
  • Deep Tokenization
• Rule Generation
  • Pair-wise Rule Generation
  • Rule Generalization

URL Preprocessing
• Basic Tokenization
  • Uses the standard delimiters specified in RFC 1738
  • Extracted tokens:
    • Protocol
    • Hostname
    • Path components
    • Query-args
• Deep Tokenization
  • Uses an unsupervised technique to learn custom URL encodings used by webmasters
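A minimal sketch of basic tokenization, assuming Python's standard `urllib.parse` module is an acceptable stand-in for splitting on the RFC 1738 delimiters. The function name `basic_tokenize` and the layout of the returned dictionary are illustrative choices, not taken from the paper, and deep tokenization is not attempted here.

```python
from urllib.parse import urlsplit, parse_qsl


def basic_tokenize(url):
    """Split a URL into protocol, hostname, path components and query-args."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        # Drop empty strings produced by leading or trailing slashes.
        "path": [p for p in parts.path.split("/") if p],
        # Keep query-args as ordered (name, value) pairs.
        "query": parse_qsl(parts.query, keep_blank_values=True),
    }


tokens = basic_tokenize(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8"
)
print(tokens["path"])   # ['degrees', 'mscs', 'faq', 'index.php']
print(tokens["query"])  # [('sid', '67873'), ('cat', '8')]
```

In this representation the two Stanford URLs from the Duplicate URLs slide differ only in the value of the `sid` query-arg, which is the kind of difference a session-id rule would need to ignore.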

URL Preprocessing (cont.)

Rule Generation
• Definitions
  • URL
  • Rule
• Example
  • u1: http://360.yahoo.com/friends-lttU7d6kIuGq
    • u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = lttU7d6kIuGq}
  • u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ
    • u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = nMfcaJRPUSMQ}
  • Rule
    • Context (C):
      • c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = -, c(k(3.3,1.1)) = nMfcaJRPUSMQ
    • Transformation (T):
      • t(k(3.3,1.1)) = lttU7d6kIuGq
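The example rule can be read as "if the URL matches the context, overwrite the keys named in the transformation". The sketch below illustrates that reading under some assumptions: keys are stored as pairs of strings such as `("3.3", "1.1")` for k(3.3,1.1), and the helper names `matches` and `apply_rule` are hypothetical rather than taken from the paper.

```python
# Key-value form of u1 and u2 from the slide; k(x,y) is written as (x, y).
u1 = {
    ("1", "3"): "http",
    ("2", "2"): "360.yahoo.com",
    ("3.1", "1.3"): "friends",
    ("3.2", "1.2"): "-",
    ("3.3", "1.1"): "lttU7d6kIuGq",
}
u2 = dict(u1)
u2[("3.3", "1.1")] = "nMfcaJRPUSMQ"

# The rule from the slide: the context is exactly u2, and the
# transformation rewrites the last deep token to the value found in u1.
context = dict(u2)
transformation = {("3.3", "1.1"): "lttU7d6kIuGq"}


def matches(context, url):
    """True if every key in the context has the required value in the URL."""
    return all(url.get(key) == value for key, value in context.items())


def apply_rule(context, transformation, url):
    """Return the transformed URL, or None when the context does not match."""
    if not matches(context, url):
        return None
    result = dict(url)
    result.update(transformation)
    return result


print(apply_rule(context, transformation, u2) == u1)  # True
print(apply_rule(context, transformation, u1))        # None
```

A rule this specific only ever rewrites u2 itself, which is why the pair-wise rules are generalized in the next step.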

Rule Generation (cont.)
• Pair-wise Rule Generation
  • Target Selection
  • Source Selection
• Rule Generalization
  • Pair 1:
    • http://www.imdb.com/title/tt0810900/photogallery
    • http://www.imdb.com/title/tt0810900/mediaindex
  • Pair 2:
    • http://www.imdb.com/title/tt0053198/photogallery
    • http://www.imdb.com/title/tt0053198/mediaindex
  • Rule 1:
    • c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
  • Rule 2:
    • c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
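Rule 1 and Rule 2 differ only in the value of c(k(4.2,2.1)), so one natural generalization replaces that value with the wildcard ∗ that the context definition allows. The sketch below shows this simple merge; it is an illustration only and is not claimed to be the generalization procedure used in the paper. The names `make_rule`, `generalize`, and `WILDCARD` are hypothetical.

```python
WILDCARD = "*"


def make_rule(title_id):
    """Build the (context, transformation) pair shown on the slide."""
    context = {
        ("1", "5"): "http",
        ("2", "4"): "www.imdb.com",
        ("3", "3"): "title",
        ("4.1", "2.2"): "tt",
        ("4.2", "2.1"): title_id,
        ("5", "1"): "photogallery",
    }
    transformation = {("5", "1"): "mediaindex"}
    return context, transformation


rule_1 = make_rule("0810900")
rule_2 = make_rule("0053198")


def generalize(rule_a, rule_b):
    """Merge two rules with the same transformation and the same context keys.

    Context values that disagree become a wildcard. Returns None when the
    rules are not compatible.
    """
    context_a, transformation_a = rule_a
    context_b, transformation_b = rule_b
    if transformation_a != transformation_b:
        return None
    if context_a.keys() != context_b.keys():
        return None
    merged = {
        key: value if context_b[key] == value else WILDCARD
        for key, value in context_a.items()
    }
    return merged, transformation_a


general_context, general_transformation = generalize(rule_1, rule_2)
print(general_context[("4.2", "2.1")])  # *
print(general_transformation)           # {('5', '1'): 'mediaindex'}
```

The merged rule rewrites `photogallery` to `mediaindex` for any title identifier, so it would also cover IMDb URLs that appear in neither training pair.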

Evaluation • Dataset • Rule Numbers after each step

Evaluation (cont.) • Small dataset

Evaluation (cont.) • Large dataset

Conclusion
• Presented a set of scalable and robust techniques for de-duplication of URLs
  • Basic and deep tokenization
  • Rule generation and generalization
• Easily adaptable to the MapReduce paradigm
• Evaluated effectiveness on both small and large datasets

Thanks for your attention • Questions?

Algorithm 1

Algorithm 3

Definitions of URL
• URL: A URL u is defined as a function
  • u : K → V ∪ {⊥}
• K: keys
  • k(x.i,y.j)
  • x, y represent the position index from the start and end of the URL
  • i, j represent the deep token index
• V: values
  • A key that is not present in the URL maps to ⊥
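To make the key notation concrete, the sketch below builds the key-value form of http://360.yahoo.com/friends-lttU7d6kIuGq from an already tokenized URL. It assumes, based on the example keys in the Rule Generation slide, that i and j count deep tokens from the start and end of their parent token and that a token with a single deep token gets a plain k(x,y) key. The function name `to_keys` and the nested-list input format are hypothetical.

```python
def to_keys(tokens):
    """Map a tokenized URL to {(from_start, from_end): value}.

    `tokens` is a list of basic tokens, each given as a list of its deep
    tokens. Positions are 1-based, counted from the start (x, i) and from
    the end (y, j).
    """
    n = len(tokens)
    url = {}
    for x, deep_tokens in enumerate(tokens, start=1):
        y = n - x + 1
        m = len(deep_tokens)
        for i, value in enumerate(deep_tokens, start=1):
            j = m - i + 1
            if m == 1:
                key = (str(x), str(y))
            else:
                key = (f"{x}.{i}", f"{y}.{j}")
            url[key] = value
    return url


u1 = to_keys([["http"], ["360.yahoo.com"], ["friends", "-", "lttU7d6kIuGq"]])
print(u1[("1", "3")])        # http
print(u1[("3.3", "1.1")])    # lttU7d6kIuGq
print(u1.get(("4", "0")))    # None, standing in for a key that is not present
```

Indexing from both ends means a rule can refer to "the last path component" or "the first path component" regardless of how long the URL is.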

Definitions of Rule
• Rule: A Rule r is defined as a function
  • r : C → T
• C: context
  • C : K → V ∪ {∗}
• T: transformation
  • T : K → V ∪ {⊥, K'}
  • K' = K ∪ ValueConversions
  • ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K), ...}
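The transformation definition allows three kinds of targets: a literal value from V, ⊥ (the key is removed), or an element of K', meaning the value of another key, optionally passed through a value conversion such as Lowercase. The sketch below is one possible reading of that definition. The class `KeyRef`, the function `apply_transformation`, and the example URL are all made up for illustration and do not come from the paper.

```python
BOTTOM = None  # stands in for the "key not present" symbol


class KeyRef:
    """A transformation target that copies the value of another key.

    `convert` is an optional value conversion, e.g. str.lower for Lowercase(K).
    """

    def __init__(self, key, convert=None):
        self.key = key
        self.convert = convert


def apply_transformation(transformation, url):
    """Return a new URL dict with the transformation applied."""
    result = dict(url)
    for key, target in transformation.items():
        if target is BOTTOM:
            result.pop(key, None)           # remove the key
        elif isinstance(target, KeyRef):
            value = url.get(target.key)
            if value is None:
                result.pop(key, None)       # source key absent: nothing to copy
            elif target.convert is not None:
                result[key] = target.convert(value)
            else:
                result[key] = value
        else:
            result[key] = target            # literal value
    return result


# Hypothetical URL: lowercase the hostname and drop the last path token.
url = {("1", "3"): "http", ("2", "2"): "WWW.Example.COM", ("3", "1"): "index.html"}
transformation = {
    ("2", "2"): KeyRef(("2", "2"), str.lower),
    ("3", "1"): BOTTOM,
}
print(apply_transformation(transformation, url))
# {('1', '3'): 'http', ('2', '2'): 'www.example.com'}
```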

Rule Coverage

MapReduce
