PageSim: A Novel Link-based Measure of Web Page Similarity LIN Zhenjiang, 28 April 2006 zjlin@cse.cuhk.edu.hk http://www.cse.cuhk.edu.hk/~zjlin
Outline • 1. Background • 2. Motivation • 3. Existing approaches • 4. PageSim: a new approach • 5. Demonstrations • 6. Conclusion and future work
1. Background I Mining the World-Wide Web I • Web mining: applying data mining techniques to automatically discover and extract information from Web documents and services (Etzioni, 1996). • Web mining research integrates research from several communities (Kosala and Blockeel, July 2000), such as: • Databases (DB) • Information retrieval (IR) • Sub-areas of machine learning (ML) • Natural language processing (NLP)
1. Background II Mining the World-Wide Web II • The WWW is a huge, widely distributed, global information source for: • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. • Hyperlink information • Access and usage information • Web site contents and organization
1. Background III Mining the World-Wide Web III • Growing and changing very rapidly • Broad diversity of user communities • Only a small portion of the information on the Web is truly relevant or useful to Web users • How to find high-quality Web pages on a specified topic? • WWW provides rich sources for data mining
1. Background IV Challenges on the Web • Finding relevant information • Creating knowledge from available information • Personalization of the information • Learning about customers / individual users • …
1. Background V Web Mining Taxonomy • Web Content Mining: extract/mine useful information or knowledge from web page contents, including text, images, audio, video, metadata, etc. • Web Structure Mining: discover useful knowledge from the structure of hyperlinks. • Web Usage Mining: the discovery of user access patterns from Web usage logs.
1. Background VI Web Structure Mining I • Hyperlinks convey a notion of authority • The Web consists not only of pages, but also of hyperlinks pointing from one page to another • These hyperlinks contain an enormous amount of latent human annotation • A hyperlink pointing to another Web page can be considered the author's endorsement of that page.
1. Background VII Web Structure Mining II • Web page categorization (Chakrabarti et al., 1998) • Discovering micro-communities on the Web — examples: the Clever system (Chakrabarti et al., 1999), Google (Brin and Page, 1998) • Schema discovery in semi-structured environments (identifying typical structuring information)
2. Motivation I Finding related or similar web pages I • web search engines
2. Motivation II Finding related or similar web pages II • web document classification
3. Existing approaches I • Text-based • Classic IR, Jaccard’s coefficient, Adamic/Adar • Pure link-based • Single-step: cocitation, common neighbor, … • Multi-step: • Companion (Dean, Henzinger, 1998) • SimRank (Jeh, Widom, 2002) • Hybrid • Anchor text based (Haveliwala et al. 2002)
3. Existing approaches II • Notations • Sim(a,b): similarity score of web pages a and b. • I(a): in-link neighbors of web page a. • O(a): out-link neighbors of web page a. • Common neighbor method • Sim(a,b) = |O(a)∩O(b)| = |{c, d}| = 2 • Cocitation method • Sim(a,b) = |I(a)∩I(b)| = |{c, d}| = 2
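Both single-step measures above reduce to plain set operations. The following sketch uses a small hypothetical graph (chosen so that both scores come out to 2, as on the slide); the function names are illustrative, not from the slides.

```python
# Single-step link-based similarity measures, sketched with set operations.
# A graph is a dict mapping each page to the set of pages it links to.

def out_links(graph, page):
    """O(page): the set of pages that `page` links to."""
    return graph.get(page, set())

def in_links(graph, page):
    """I(page): the set of pages that link to `page`."""
    return {u for u, targets in graph.items() if page in targets}

def common_neighbor(graph, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|."""
    return len(out_links(graph, a) & out_links(graph, b))

def cocitation(graph, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|."""
    return len(in_links(graph, a) & in_links(graph, b))

# Hypothetical graph: a and b both link to {c, d}, and c and d both
# link back to {a, b}, so both measures give a score of 2.
graph = {
    'a': {'c', 'd'},
    'b': {'c', 'd'},
    'c': {'a', 'b'},
    'd': {'a', 'b'},
}
print(common_neighbor(graph, 'a', 'b'))  # 2
print(cocitation(graph, 'a', 'b'))       # 2
```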
3. Existing approaches III • SimRank: “two pages are similar if they are referenced (cited, or linked to) by similar pages” • Base cases: (1) Sim(u,u) = 1; (2) Sim(u,v) = 0 if |I(u)|·|I(v)| = 0. • Recursive definition: Sim(u,v) = C / (|I(u)||I(v)|) · Σ_{i∈I(u)} Σ_{j∈I(v)} Sim(i,j) • C is a decay constant between 0 and 1. • The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 for u ≠ v.
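The recursive definition is computed by fixed-point iteration. A minimal sketch (the `simrank` function and its parameters are my naming, and the tiny test graph is hypothetical):

```python
from itertools import product

def simrank(nodes, in_links, C=0.8, iterations=10):
    """Iterative SimRank (Jeh and Widom, 2002).
    in_links maps each node to the set of nodes that link to it."""
    # Start from Sim(u,u) = 1, Sim(u,v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, nodes):
            if u == v:
                new[(u, v)] = 1.0            # Sim(u,u) = 1
            elif not in_links[u] or not in_links[v]:
                new[(u, v)] = 0.0            # a page with no in-links
            else:
                total = sum(sim[(i, j)]
                            for i in in_links[u]
                            for j in in_links[v])
                new[(u, v)] = C * total / (len(in_links[u]) * len(in_links[v]))
        sim = new
    return sim

# c links to both a and b, so a and b share their single citing page
# and Sim(a,b) converges to C * Sim(c,c) = 0.8.
in_links = {'a': {'c'}, 'b': {'c'}, 'c': set()}
scores = simrank(['a', 'b', 'c'], in_links)
print(round(scores[('a', 'b')], 3))  # 0.8
```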
4. PageSim: a new approach I • Two considerations • On the Web, not all links are equally important (a limitation of the common neighbor and cocitation methods). • A similarity measure should be able to measure the similarity between any two web pages (a limitation of SimRank). • PageSim takes both problems into account.
4. PageSim: a new approach II • Cocitation • Which page is more similar to d: c or e? • Suppose page a is YAHOO!'s homepage and b is a personal web page. Authoritative pages are more important.
4. PageSim: a new approach III • SimRank • Are a and b similar? • SimRank answers “NO”. Are these answers reasonable?
4. PageSim: a new approach IV • Page a linking to b and c means a “thinks” • b and c are similar. • both b and c are similar to a. • Intuitions • Page a spreads similarity to its neighbors. • Authoritative pages spread more similarity.
4. PageSim: a new approach V • PageSim • In PageSim, the PageRank (PR) score is used to measure the authority of a web page. PR assigns global importance scores to all web pages. • Each page spreads its own similarity score (its PR score) to its neighbors. • Each page also propagates other pages’ similarity scores to its neighbors. • After similarity score propagation finishes, each page holds an array of similarity scores. • PageRank score propagation
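One way to read the propagation rule above is as a breadth-first spread of each page's PR score along out-links, with a decay factor applied at every hop and the score split evenly over out-links. This is a sketch under those assumptions (the function name, the decay value, and the fixed-radius cutoff are illustrative; the radius cutoff anticipates the pruning discussed later in the slides):

```python
def pagesim_vectors(graph, pagerank, decay=0.8, radius=3):
    """Sketch of PageSim score propagation.
    graph: page -> set of out-link neighbors.
    pagerank: page -> PR score.
    Returns sv, where sv[p][q] is the amount of q's PR score that has
    accumulated at page p after propagating up to `radius` hops."""
    # Each page's vector starts with its own PR score.
    sv = {p: {p: float(pagerank[p])} for p in graph}
    # Portions of score still in flight: (current page, source, amount).
    frontier = [(p, p, float(pagerank[p])) for p in graph]
    for _ in range(radius):
        next_frontier = []
        for page, source, amount in frontier:
            out = graph[page]
            if not out:
                continue  # dangling page: nothing to spread
            share = decay * amount / len(out)  # split evenly over out-links
            for q in out:
                sv[q][source] = sv[q].get(source, 0.0) + share
                next_frontier.append((q, source, share))
        frontier = next_frontier
    return sv

# A single link a -> b with PR(a) = 100: b receives 0.8 * 100 = 80 from a.
sv = pagesim_vectors({'a': {'b'}, 'b': set()}, {'a': 100, 'b': 50})
print(sv['b']['a'])  # 80.0
```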
4. PageSim: a new approach VI • Example: similarity propagation (page a only) • PR(a)=100, PR(b)=55, PR(c)=102 • Each page propagates 80% of its similarity score evenly to its neighbors.
4. PageSim: a new approach VII • Example: similarity propagation II • PR(a)=100, PR(b)=55, PR(c)=102 • Each page contains a similarity score vector (SV). • SV(a) = (100, 35, 82), • SV(b) = (40, 55, 33), • SV(c) = (72, 44, 102), • PageSim score (PS) computation • PS(a,b) = Σ min(SV(a), SV(b)) (element-wise minimum) = 40 + 35 + 33 = 108 • Two pages are more similar if they share more common similarity scores.
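The PS computation above is just a sum of element-wise minima over the two similarity score vectors. A minimal sketch using the slide's numbers, with each vector stored as a dict keyed by source page (the dict representation is my choice, not from the slides):

```python
def pagesim_score(sv_a, sv_b):
    """PS(a,b): sum, over source pages present in both vectors, of the
    smaller of the two propagated scores. A missing entry counts as 0,
    so only shared sources contribute."""
    return sum(min(sv_a[p], sv_b[p]) for p in sv_a.keys() & sv_b.keys())

# Similarity score vectors from the slide's example, indexed by source page.
SV_a = {'a': 100, 'b': 35, 'c': 82}
SV_b = {'a': 40, 'b': 55, 'c': 33}
print(pagesim_score(SV_a, SV_b))  # min-by-min: 40 + 35 + 33 = 108
```

Note that the measure is symmetric by construction, matching the slide's observation that PS(a,b) = PS(b,a).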
4. PageSim: a new approach VIII • Example: similarity propagation III • PageSim score matrix PS_matrix = (PS(u,v))n×n (lower triangular):
  a: 217
  b: 108 128
  c: 189 117 219
• PS_matrix is symmetric: PS(a,b) = PS(b,a). • Any web page is most similar to itself: PS(u,u) = max over v of PS(u,v).
4. PageSim: a new approach IX • Propagation radius pruning I • The time complexity of propagating one page’s similarity score to all the others is O(kn), where k is the average number of out-links. • The similarity score propagated to distant pages is small enough to be omitted. • Limiting the propagation radius to r reduces the complexity of propagation to O(k^r).
4. PageSim: a new approach X • Propagation radius pruning II • Real data (CSE homepage) and synthetic data
5. Demonstrations I • Example 1: single link • PageSim matrix (lower triangular):
  a: 100
  b: 80 265
  c: 64 212 469.2
  d: 51.2 169.6 375.4 694.1
• PR = (100, 185, 257.2, 318.6) • SimRank matrix (lower triangular):
  1
  0 1
  0 0 1
  0 0 0 1
5. Demonstrations II • Example 2: loop link • PageSim matrix (lower triangular):
  a: 295.2
  b: 246.4 295.2
  c: 230.4 246.4 295.2
  d: 246.4 230.4 246.4 295.2
• PR = (100, 100, 100, 100) • SimRank matrix (lower triangular):
  1
  0 1
  0 0 1
  0 0 0 1
5. Demonstrations III • Example 3: more complex • PageSim matrix (lower triangular):
  1: 100.0
  2: 40.0 487.6
  3: 50.7 159.4 397.4
  4: 10.7 238.5 130.0 275.5
  5: 10.7 130.0 130.0 130.0 314.9
• PR = (100, 40.0, 50.7, 10.7, 10.7) • SimRank matrix (lower triangular):
  1: 1
  2: 0 1
  3: 0 0.25 1
  4: 0 0 0.5 1
  5: 0 0 0.5 1 1
• PageSim results • v3 is most similar to v1. • v4 is most similar to v2.
6. Conclusion and future work I • Conclusion • Web Mining • Web page similarity measures: text-based, link-based, and hybrid • PageSim: PageRank score propagation • Propagation radius pruning • PageSim vs SimRank
6. Conclusion and future work II • Future work • Evaluation of PageSim • Taking the traditional text-based similarity measure TF-IDF as ground truth. • Efficiency of computation • Since computing the PageSim score of two web pages is O(n), computing all n² pairs of pages is O(n³). • Storage issue • Since each page needs an array of length n to store similarity scores issued from all web pages, the storage needed by PageSim is O(n²).
Q & A Thank you!