
Web Page Clustering based on Web Community Extraction



Presentation Transcript


  1. Web Page Clustering based on Web Community Extraction Chikayama-Taura Lab. M2 Shim Wonbo

  2. Background Directory = Category

  3. Open Directory Project • Used by Google, Lycos, etc. • Categorizing Web pages by hand • Accurate • Slow to update • Unscalable

  4. World Wide Web • Rapid growth (= the number of clusters changes) • Updated daily (= cluster centers move) • Because of these two properties of the Web, a Web page clustering system that requires no human effort is needed.

  5. Purpose • Constructing a Web page clustering system which • finds clusters without human help • is scalable • clusters Web pages at high speed • clusters Web pages accurately

  6. Brief System View • (a) Web pages → DBG extraction → (b) Web communities → Partitioning of remaining pages based on TF-IDF → (c) Web page clustering

  7. Contribution • Web Community • A new Web community topology is defined. • The extracted Web communities show higher precision than those of existing work. • Web Page Clustering • Web communities are used as the centroids of clusters in TF-IDF space. • Experimental results show meaningful clusters.

  8. Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion

  9. Existing Work • Text-based clustering • Uses terms as features • Commonly used algorithms • ex) k-means, hierarchical clustering, density-based clustering • Link-based clustering • Also called Web community extraction • Extracts dense subgraphs from the Web graph • Combination of text and link information • ex) Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004]

  10. Text-based Clustering • Merit • Accurate (because the text is considered) • Problem • Unsupervised clustering • Hard to decide the number of clusters • Supervised learning and clustering • Difficult to label each training datum

  11. Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004] • Features: term frequency (P_term), out-links (P_out), in-links (P_in) • Similarity • Clustering Algorithm • An extension of the k-means algorithm

  12. Extraction of Web Community based on Link Analysis • An Approach to Find Related Communities Based on Bipartite Graphs [P. Krishna Reddy et al., 2001] • PlusDBG: Web Community Extraction Scheme Improving Both Precision and Pseudo-Recall [Saida et al., 2005]

  13. Terminology • Fan and Center • Bipartite Graph (BG) • Complete BG (CBG) • Dense BG (DBG) • [Figure: (a) CBG, (b) DBG; labels: Fan, Center, p, q]

  14. Algorithm for Extracting DBG [Reddy et al., 2001] • Finds a bipartite graph using co-citing and co-cited Web pages • Extracts a DBG from the above graph • [Figure: example of extracting DBG(3, 3) starting from a seed page]

  15. PlusDBG • Uses a distance defined by the co-citing page rate between two pages • Finds co-citing pages that are within a distance threshold • Extracts a DBG from the above graph • PlusDBG shows higher precision than DBG does.

  16. Web Community Extraction • Pros: high speed; finds topics across the Web • Con: may extract unrelated Web pages as a community

  17. Problem of DBG

  18. Improvement of PlusDBG

  19. Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion

  20. Proposal • Extracts Web communities using link structure. • Assigns remainders to the closest Web community in TF-IDF space.

  21. Proposed Web Community • Connecter • A fan that cites two centers. • Connectable • Two centers are connectable if they share more than two connecters. • Web Community • A Web community C is a DBG composed of connectable centers and connecters. • [Figure labels: connectable centers, connecter]
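
The connecter and connectable definitions can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the `in_links` dictionary (mapping each page to the set of pages that cite it), the function names, and the toy graph are assumptions made for this example. The "more than two connecters" threshold is taken from the slide.

```python
def connecters(in_links, c1, c2):
    """Return the fans that cite both centers c1 and c2."""
    return in_links.get(c1, set()) & in_links.get(c2, set())


def connectable(in_links, c1, c2, threshold=2):
    """Two centers are connectable if they share more than `threshold` connecters."""
    return len(connecters(in_links, c1, c2)) > threshold


# Hypothetical toy graph: fans a-d cite centers g, i and j.
in_links = {
    "g": {"a", "b", "c", "d"},
    "i": {"b", "c", "d"},
    "j": {"d"},
}

shared_gi = connecters(in_links, "g", "i")    # fans citing both g and i
gi_ok = connectable(in_links, "g", "i")       # three shared fans
gj_ok = connectable(in_links, "g", "j")       # only one shared fan
```

In this toy graph, g and i share three connecters and are connectable, while g and j share only one and are not.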

  22. Proposed Web Community • Every center is connectable to every other center.

  23. Extraction Algorithm • [Figure: worked example on nodes a to j] • First iteration: S={}, T={g}; S’={a,b,c,d}, T’={e,f,h,i}; t’=i, # connecters = 3 • Second iteration: S={b,c,d}, T={g,i}; T’={e,f,h,i,j}; t’=j, # connecters = 1 • Output Community = {a,b,c,d,e,f,g,h,i}

  24. Labeling Remainders • Remainder: a Web page that was not extracted as a member of any community. • Calculate the centroids of the Web communities. • Label each remainder with the ID of the closest Web community in TF-IDF space, where vi is the TF-IDF vector of a page v.
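
The labeling step can be sketched as a nearest-centroid assignment. The slide's own formulas did not survive in this transcript, so the code below makes assumptions that are not from the source: centroids are plain arithmetic means of member vectors, the distance is Euclidean, and the function names and toy vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Arithmetic mean of equally sized TF-IDF vectors (assumed centroid definition)."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(dim)]


def label_remainder(vec, centroids):
    """Return the ID of the community whose centroid is closest to vec.

    Euclidean distance is an assumption; the slides do not state the measure.
    """
    return min(centroids, key=lambda cid: math.dist(vec, centroids[cid]))


# Hypothetical two-term TF-IDF vectors for the members of two communities.
communities = {
    0: [[1.0, 0.0], [0.8, 0.2]],
    1: [[0.0, 1.0], [0.2, 0.8]],
}
centroids = {cid: centroid(members) for cid, members in communities.items()}

label_a = label_remainder([0.9, 0.1], centroids)
label_b = label_remainder([0.1, 0.7], centroids)
```

Because the centroids are computed once per community, labeling each remainder costs one distance computation per community.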

  25. Agenda • Introduction • Related Work • Proposal • Evaluation • Preprocess • Web community extraction • Labeling result • Conclusion

  26. Preprocess • Data set • 2.34 M pages, 20 M links • Almost 80% of the data set is Japanese pages. • Create a link-only file • Links pointing outside the data set are deleted. • Duplicate pages that share 90% of their links are deleted. • Pages including 50 links are deleted. • Remaining data set: 1.45 M pages, 5.09 M links • Create a TF-IDF file • TF-IDF used: • Parser: MeCab • Terms that appear in less than 0.1% or more than 90% of all documents are removed
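
The term-filtering rule in the preprocessing step can be sketched as a document-frequency filter. This is a minimal illustration under assumptions: documents are already tokenized into lists of terms (the slides use MeCab for this, which is not called here), and the function name and sample documents are hypothetical. The 0.1% and 90% bounds are the ones stated on the slide.

```python
from collections import Counter


def filter_terms(docs, min_ratio=0.001, max_ratio=0.9):
    """Drop terms whose document frequency is below min_ratio or above max_ratio."""
    n = len(docs)
    # Count each term once per document to get its document frequency.
    df = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, c in df.items() if min_ratio <= c / n <= max_ratio}
    return [[t for t in doc if t in keep] for doc in docs]


# Hypothetical tokenized documents: "the" occurs in all ten of them.
docs = [["the", "bike"]] * 3 + [["the", "movie"]] * 6 + [["the", "rare"]]

filtered = filter_terms(docs)
# A stricter lower bound, used only to show the rare-term side of the filter.
filtered_strict = filter_terms(docs, min_ratio=0.2)
```

With the default bounds only the term that occurs in every document is dropped; the stricter lower bound in the second call also drops the term that occurs in a single document.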

  27. Distribution of Web Community Size

  28. Distribution of Web Community Size

  29. Distance from centroids to term vectors

  30. Variance of distance

  31. Example of Web communities • About motor bike manufacturers and links. • http://bike.ak-m.jp/ • http://www.bike-cube.jp/ • http://bike.ak-m.jp/2006/01/post_32.html • http://www.bike-cube.jp/index.php • http://bike.ak-m.jp/2006/11/post_20.html • http://www.kymco.co.jp/ • http://www1.suzuki.co.jp/motor/ • http://www.yamaha-motor.jp/mc/ • http://bike.ak-m.jp/ • http://www.peugeot-moto.com/ • http://www.apriliajapan.co.jp/index.html • http://www.buell.jp/ • http://www.cagiva.co.jp/ • http://www.mitsuoka-motor.com/ • http://www.ducati.com/od/ducatijapan/jp/index.jhtml • http://www.triumphmotorcycles.com/japan/ • http://www.harley-davidson.co.jp/index.html • http://www.ktm-japan.co.jp/

  32. Comparing to ODP • Definition of precision • For a Web community C, let O_C be the subset of its pages that exist in ODP. • If |O_C| < 3, the precision of C is undefined. • For r in O_C, the Pscore of r is: • With Pscore, the precision of C is: • score(p, q) = 1 if p and q are in the same directory, 0 otherwise • Comparison against the 4th and 5th levels of ODP directories (e.g., Top/Regional/Japan/Arts/Movie) • Number of ODP pages included in the data set: 47,093
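
The pairwise score at a given directory level can be sketched as follows. Only `score` is illustrated, since the Pscore and precision formulas are not present in this transcript. The assumptions here are that ODP directories are given as slash-separated paths, that `Top` counts as level 1, and that a page whose path is shallower than the requested level matches nothing; the function names are hypothetical.

```python
def directory_at_level(path, level):
    """Truncate an ODP path to its first `level` components, or None if too shallow."""
    parts = path.strip("/").split("/")
    if len(parts) < level:
        return None
    return "/".join(parts[:level])


def score(path_p, path_q, level):
    """Return 1 if both pages fall in the same directory at `level`, else 0."""
    dp = directory_at_level(path_p, level)
    dq = directory_at_level(path_q, level)
    return 1 if dp is not None and dp == dq else 0


p = "Top/Regional/Japan/Arts/Movie"
q = "Top/Regional/Japan/Arts/Music"

same_at_4 = score(p, q, 4)   # both truncate to Top/Regional/Japan/Arts
same_at_5 = score(p, q, 5)   # Movie vs. Music
```

Evaluating at the 5th level is stricter than at the 4th, since two pages must agree on one more path component to score 1.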

  33. Comparing to ODP

  34. Precision of Web Communities(4th level)

  35. Precision of Web communities(5th level)

  36. Summary of Web Community Extraction • The proposed method extracted smaller Web communities than PlusDBG did. • Members of each community were closer to the centroid in TF-IDF space than members of PlusDBG communities were. • The proposed communities showed higher precision than PlusDBG's when compared against ODP.

  37. Labeling Result • Pages with fewer than 10 terms are ignored. • Comparison against ODP • ODP pages: 29,153 • ODP directories: 1,862

  38. Labeling Result (the 4th level)

  39. Labeling Result (the 5th level)

  40. Labeling example

  41. Labeling example

  42. Summary and Conclusion • A DBG structure is defined as the Web community topology. • Every pair of centers must be connectable. • Every fan is a connecter of centers. • The proposed DBG structure extracts more compact and more precise Web communities than existing work does. • Clustering based on Web community extraction is proposed. • The centroids of communities in TF-IDF space are used to label remainders. • The clustering results showed meaningful page groups.

  43. Future Work • Incorporating feature selection to improve the labeling result. • Clustering the extracted centroids.

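  44. Publications • (To be presented) Improvement of a Web Community Extraction Algorithm, Shim Wonbo, Kenjiro Taura, Takashi Chikayama, Data Engineering Workshop (DEWS), 2007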

  45. Thank you for your attention

  46. Extraction Algorithm • (1) Select a seed page t and set T={t}, S={}. • (2) Find S’, whose members cite any page in T. • (3) Find T’, whose members are cited by any page in S’ and are not in T. • (4) Determine whether t’∈T’ is connectable to all pages in T. • (5) If t’ is connectable, set T=T∪{t’} and S={connecters} and go to 2. • (6) If not, select another t’∈T’ and go to 4. • (7) If |S| > 3 and |T| > 3, extract the page set as a Web Community and delete it from the Web graph. • (8) If any t exists, go to 1.
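
The growth loop for a single seed (steps 1 to 7) can be sketched as below. This is an interpretation under stated assumptions, not the author's code: the graph is a dictionary from each page to the set of pages it cites, T’ is taken to be the pages cited by S’, S is taken to be the fans citing at least two centers in T, candidates are tried in sorted order, and a rejected candidate is never retried. Step 8 (repeating over seeds and deleting extracted pages from the graph) is left out. All names are hypothetical.

```python
from collections import defaultdict


def invert(out_links):
    """Build page -> set of citing pages from page -> set of cited pages."""
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)
    return in_links


def grow_community(out_links, seed, min_connecters=3):
    """Grow centers T and connecters S from one seed page."""
    in_links = invert(out_links)
    T, S, rejected = {seed}, set(), set()
    while True:
        # Step 2: S' = pages citing any center in T.
        S_prime = set()
        for t in T:
            S_prime |= in_links[t]
        # Step 3: T' = pages cited by S' that are neither in T nor rejected.
        T_prime = set()
        for s in S_prime:
            T_prime |= out_links.get(s, set())
        T_prime -= T | rejected
        # Steps 4-6: look for a candidate connectable to every center in T.
        accepted = None
        for cand in sorted(T_prime):
            if all(len(in_links[cand] & in_links[t]) >= min_connecters for t in T):
                accepted = cand
                break
            rejected.add(cand)
        if accepted is None:
            return S, T
        # Step 5: add the center; keep fans that cite at least two centers (assumption).
        T.add(accepted)
        fans = S_prime | in_links[accepted]
        S = {s for s in fans if len(out_links.get(s, set()) & T) >= 2}


def is_community(S, T, min_size=3):
    """Step 7: accept only if both sides are larger than min_size."""
    return len(S) > min_size and len(T) > min_size


# Hypothetical toy graph: fan -> set of cited pages.
out_links = {"a": {"g"}, "b": {"g", "i"}, "c": {"g", "i"}, "d": {"g", "i", "j"}}
S, T = grow_community(out_links, "g")
```

On this toy graph the seed g absorbs i (three shared fans) and rejects j (one shared fan). The result is too small to pass the size check in step 7.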
