Web Page Clustering based on Web Community Extraction Chikayama-Taura Lab. M2 Shim Wonbo
Background Directory = Category
Open Directory Project • Used by Google, Lycos, etc. • Web pages are categorized by hand • Accurate • Recently updated • Not scalable
World Wide Web • Rapid growth (= the number of clusters changes) • Updated daily (= cluster centers move) • Because of these two properties of the Web, a Web page clustering system that needs no human effort is required.
Purpose • Constructing a Web page clustering system which • finds clusters without human help • is scalable • clusters Web pages at high speed • clusters Web pages accurately
Brief System View • (a) Web pages • DBG extraction yields (b) Web communities • Partitioning of the remaining pages based on TF-IDF yields (c) Web page clustering
Contribution • Web Community • A new Web community topology is defined. • The extracted Web communities show higher precision than those of existing work. • Web Page Clustering • Web communities are used as the centroids of clusters in TF-IDF space. • Experimental results show meaningful clusters.
Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion
Existing Work • Text-based clustering • Uses terms as features • Commonly used algorithms • e.g., k-means, hierarchical clustering, density-based clustering • Link-based clustering • Also called Web community extraction • Extracts dense subgraphs from the Web graph • Combination of text and link information • e.g., Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004]
Text-based Clustering • Merit • Accurate (because the text is taken into account) • Problems • Unsupervised clustering • Hard to decide the number of clusters • Supervised learning and clustering • Hard to label every training datum
Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004] • Features • Term frequency (p_term), out-links (p_out), in-links (p_in) • Similarity • Clustering Algorithm • An extension of the k-means algorithm
Extraction of Web Community based on Link Analysis • An Approach to Find Related Communities Based on Bipartite Graphs [P. Krishna Reddy et al., 2001] • PlusDBG: Web Community Extraction Scheme Improving Both Precision and Pseudo-Recall [Saida et al., 2005]
Terminology • Fan and Center • Bipartite Graph (BG) • Complete BG (CBG) • Dense BG (DBG) • Figure: (a) CBG and (b) DBG, with fans and centers and the labels p and q
Algorithm for Extracting DBG [Reddy et al., 2001] • Finds a bipartite graph using co-citing and co-cited Web pages • Extracts a DBG from that graph • Figure: example starting from a seed page and extracting DBG(3, 3)
PlusDBG • Uses a distance between two pages defined by their rate of co-citing pages • Finds co-citing pages that are within a distance threshold • Extracts a DBG from the resulting graph • PlusDBG shows higher precision than DBG does.
Web Community Extraction • Pros • High speed • Finds topics across the Web • Cons • May extract unrelated Web pages as a community
Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion
Proposal • Extracts Web communities using link structure. • Assigns remainders to the closest Web community in TF-IDF space.
Proposed Web Community • Connecter • A fan that cites two centers. • Connectable • Two centers are connectable if they have more than two connecters. • Web Community • A Web community C is a DBG composed of connectable centers and their connecters.
Proposed Web Community • Every pair of centers is connectable.
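The connecter and connectable definitions can be illustrated with a minimal Python sketch. The function names, the `out_links` dictionary layout (page to the set of pages it cites), and the toy graph are assumptions made for illustration; only the "more than two connecters" threshold comes from the slide.

```python
def connecters(out_links, c1, c2):
    """Return the fans that cite both centers c1 and c2."""
    return {fan for fan, cited in out_links.items()
            if c1 in cited and c2 in cited}


def connectable(out_links, c1, c2, min_connecters=3):
    """Two centers are connectable if they have more than two connecters."""
    return len(connecters(out_links, c1, c2)) >= min_connecters


# Hypothetical toy graph: fans b, c, d cite centers g and i; only d cites j.
toy_links = {
    "b": {"g", "i"},
    "c": {"g", "i"},
    "d": {"g", "i", "j"},
}
```

With this toy graph, `g` and `i` share three connecters and are connectable, while `g` and `j` share only one and are not.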
Extraction Algorithm (example trace from the figure) • Initial state: S={}, T={g} • S’={a,b,c,d}, T’={e,f,h,i} • Candidate t’=i: # connecters = 3, giving T={g,i}, S={b,c,d} • T’={e,f,h,i,j} • Candidate t’=j: # connecters = 1 • Output Community = {a,b,c,d,e,f,g,h,i}
Labeling Remainders • Remainder: a Web page that was not extracted as a member of any community. • Calculate the centroids of the Web communities. • Label each remainder with the ID of the closest Web community in TF-IDF space, where v_i is the TF-IDF vector of a page v
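The labeling step can be sketched as nearest-centroid assignment. The slide's exact closeness formula is not preserved in this transcript, so cosine similarity is used here as an assumption; the sparse dict representation, the function names, and the sample vectors are likewise hypothetical.

```python
import math


def centroid(vectors):
    """Average a list of sparse TF-IDF vectors (dicts of term -> weight)."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_remainders(communities, remainders):
    """Map each remainder page to the ID of the most similar centroid."""
    centroids = {cid: centroid(vecs) for cid, vecs in communities.items()}
    return {page: max(centroids, key=lambda cid: cosine(vec, centroids[cid]))
            for page, vec in remainders.items()}


# Hypothetical data: two communities and two remainder pages.
communities = {
    "bikes": [{"bike": 1.0, "engine": 0.5}, {"bike": 0.8}],
    "movies": [{"film": 1.0}, {"film": 0.7, "actor": 0.4}],
}
remainders = {"p1": {"bike": 0.3, "helmet": 0.2}, "p2": {"actor": 0.9}}
labels = label_remainders(communities, remainders)
```

Because every remainder is compared only against community centroids, the cost grows with the number of communities rather than with the number of pages already clustered.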
Agenda • Introduction • Related Work • Proposal • Evaluation • Preprocess • Web community extraction • Labeling result • Conclusion
Preprocess • Data set • 2.34 M pages, 20 M links • Almost 80% of the data set consists of Japanese pages. • Create a link-only file • Links pointing outside the data set are deleted. • Pages that share 90% of their links are deleted as duplicates. • Pages including 50 links are deleted. • Remaining data set: 1.45 M pages, 5.09 M links • Create a TF-IDF file • Used TF-IDF: • Parser: MeCab • Terms that appear in fewer than 0.1% or more than 90% of all documents are removed
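The term filtering and weighting step might look like the sketch below. The TF-IDF formula on the slide is not preserved in this transcript, so the common tf × log(N / df) form is assumed; documents are assumed to be already tokenized (the slide uses MeCab for that), and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, min_df=0.001, max_df=0.9):
    """Build sparse TF-IDF vectors from tokenized documents.

    Terms whose document frequency is below min_df or above max_df
    (as fractions of all documents) are removed, as on the slide.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept = {term for term, count in doc_freq.items()
            if min_df <= count / n_docs <= max_df}
    vectors = []
    for doc in docs:
        term_freq = Counter(term for term in doc if term in kept)
        vectors.append({term: freq * math.log(n_docs / doc_freq[term])
                        for term, freq in term_freq.items()})
    return vectors


# Hypothetical tokenized pages; "the" occurs in every page and is dropped.
pages = [
    ["bike", "engine", "the"],
    ["bike", "the"],
    ["film", "the"],
    ["actor", "film", "the"],
]
vectors = tfidf_vectors(pages)
```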
Example of Web communities • About motor bike manufacturers and links. • http://bike.ak-m.jp/ • http://www.bike-cube.jp/ • http://bike.ak-m.jp/2006/01/post_32.html • http://www.bike-cube.jp/index.php • http://bike.ak-m.jp/2006/11/post_20.html • http://www.kymco.co.jp/ • http://www1.suzuki.co.jp/motor/ • http://www.yamaha-motor.jp/mc/ • http://bike.ak-m.jp/ • http://www.peugeot-moto.com/ • http://www.apriliajapan.co.jp/index.html • http://www.buell.jp/ • http://www.cagiva.co.jp/ • http://www.mitsuoka-motor.com/ • http://www.ducati.com/od/ducatijapan/jp/index.jhtml • http://www.triumphmotorcycles.com/japan/ • http://www.harley-davidson.co.jp/index.html • http://www.ktm-japan.co.jp/
Comparing to ODP • Definition of precision • For a Web community C, let O_C be the subset of its pages that exist in ODP. • If |O_C| < 3, the precision of C is undefined. • score(p, q) = 1 if p and q are in the same directory; score(p, q) = 0 otherwise • For r in O_C, the Pscore of r is: • With Pscore, the precision of C is: • Compared against the 4th and 5th levels of ODP directories (Top/Regional/Japan/Arts/Movie) • Number of ODP pages included in the data set: 47,093
Summary of Web Community Extraction • The proposed method extracted smaller Web communities than PlusDBG did. • Members of each community were closer to the centroid in TF-IDF space than members of PlusDBG communities were. • The proposed communities showed higher precision than PlusDBG’s when compared against ODP.
Labeling Result • Pages containing fewer than 10 terms are ignored. • Compared against ODP • ODP pages: 29,153 • ODP directories: 1,862
Summary and Conclusion • A DBG structure is defined as the Web community topology. • Every pair of centers must be connectable. • Every fan is a connecter of centers. • The proposed DBG structure extracts more compact and more precise Web communities than existing work does. • Clustering based on Web community extraction is proposed. • The centroids of communities in TF-IDF space are used to label the remainders. • The clustering result showed meaningful page groups.
Future Work • Incorporating feature selection to improve the labeling result. • Clustering the extracted centroids.
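Publications • (To be presented) Improvement of a Web Community Extraction Algorithm, Wonbo Shim, Kenjiro Taura, Takashi Chikayama, Data Engineering Workshop (DEWS), 2007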
Extraction Algorithm • 1. Select a seed page t and set T={t}, S={}. • 2. Find S’, whose members cite any page in T. • 3. Find T’, whose members are cited by any page in S’ and are not in T. • 4. Determine whether t’∈T’ is connectable to all pages in T. • 5. If t’ is connectable, set T=T∪{t’} and S={connecters} and go to 2. • 6. If not, select another t’∈T’ and go to 4. • 7. If |S| > 3 and |T| > 3, extract the page set as a Web Community and delete it from the Web graph. • 8. If any seed page t remains, go to 1.
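The steps above can be sketched for a single seed page as follows. This is one possible reading of the slide, not the author's implementation: the function name, the `out_links` layout (page to the set of pages it cites), the candidate order (sorted), the rule that S holds fans citing at least two current centers, and the toy graph are all assumptions. Deleting the community from the graph and looping over seeds (steps 7 and 8) are left to the caller.

```python
from collections import defaultdict


def extract_community(out_links, seed, min_connecters=3, min_size=3):
    """Grow a community around `seed`; return (fans, centers) or None."""
    in_links = defaultdict(set)
    for page, cited in out_links.items():
        for target in cited:
            in_links[target].add(page)

    centers = {seed}   # T on the slide
    fans = set()       # S on the slide
    rejected = set()   # candidates that failed the connectable test
    while True:
        # Step 2: S' = pages citing any current center.
        s_prime = set().union(*(in_links.get(t, set()) for t in centers))
        # Step 3: T' = pages cited by S' that are not centers yet.
        t_prime = set().union(*(out_links.get(s, set()) for s in s_prime))
        t_prime -= centers | rejected
        for cand in sorted(t_prime):
            # Step 4: the candidate must be connectable to every center.
            if all(len(in_links.get(cand, set()) & in_links.get(t, set()))
                   >= min_connecters for t in centers):
                # Step 5: accept it and recompute the connecters.
                centers.add(cand)
                fans = {s for t in centers for s in in_links.get(t, set())
                        if len(out_links[s] & centers) >= 2}
                break
            # Step 6: reject and try another candidate. Centers only grow,
            # so a rejected candidate can never become connectable later.
            rejected.add(cand)
        else:
            break  # no candidate was accepted; the community is final
    # Step 7: keep only communities that are large enough.
    if len(fans) > min_size and len(centers) > min_size:
        return fans, centers
    return None


# Hypothetical graph: fans a-d, candidate centers e-j, seed g.
toy_graph = {
    "a": {"e", "f", "g", "h"},
    "b": {"e", "f", "g", "h", "i"},
    "c": {"e", "f", "g", "h", "i"},
    "d": {"e", "f", "g", "h", "i", "j"},
}
result = extract_community(toy_graph, "g")
```

In this toy graph, `j` is cited only by `d`, so it never reaches three connecters with any center and stays outside the community.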