ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS

Beibei Li, Shuting Xu, Jun Zhang. Department of Computer Science, University of Kentucky. ACMSE’07.


Presentation Transcript


  1. ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS • Beibei Li, Shuting Xu, Jun Zhang • Department of Computer Science, University of Kentucky • ACMSE’07

  2. INTRODUCTION • blogs • highly opinionated personal online commentary • including hyperlinks to other resources • Technorati (July, 2006) • tracking more than 50 million blogs • about 175,000 blogs were created daily • size of the blogosphere doubles every six months • how many blog authors are updating their blogs regularly -> not clear

  3. INTRODUCTION(CON.) • analysis of the blogosphere in 2004 • more than two-thirds of public blogs are personal journals • knowledge blogs (k-blogs) -> mere 3 percent • due to the diverse background of the blog authors and readers • the blogosphere has hyper-accelerated the spread of information

  4. BLOGS VS. WEBPAGES • the major differences between blogs and standard web pages • blogs are dated • most blogs allow readers to place comments on each blog document • this creates communication channels between the blog authors and the readers • blog authors can place individual blogs into different categories • according to some predefined categories • the definitions of the categories may be different for different authors

  5. BLOG DOCUMENTS • use vector-space model to encode the blog web pages • each blog page can be viewed as a column vector • each word used can be considered as one row of the matrix • consider a blog page as three parts • blog title • blog body • the content of the blog page • comments of the authors and/or the readers
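
The encoding on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `word_page_matrix`, the whitespace tokenizer, and the three sample pages are assumptions made for the example. Each row is a word and each column is a page.

```python
from collections import Counter

def word_page_matrix(pages):
    """Build a word-by-page count matrix: one row per word, one column per page."""
    # Count word occurrences per page (naive whitespace tokenizer, an assumption).
    counts = [Counter(page.lower().split()) for page in pages]
    # The vocabulary is the sorted union of all words seen in any page.
    vocab = sorted(set().union(*counts))
    # matrix[i][j] = number of times vocab[i] occurs in pages[j].
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix

# Hypothetical sample pages, for illustration only.
pages = ["gun control law", "church service", "gun law debate law"]
vocab, A = word_page_matrix(pages)
```

Reading down column `j` of `A` gives the vector for page `j`; reading across a row shows how one word is distributed over the pages.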

  6. A SAMPLE BLOG PAGE

  7. Hypothesis • hypothesis • the use of title and comment words in the dataset will enhance the discrimination of the blog pages • and result in more accurate clustering solutions • reason • the words in the comments reflect the specific views, questions, and answers of the authors and the readers • they may carry more weight in discriminating individual blog pages

  8. DATA PREPARATION AND CLUSTERING • Data Preprocessing • selected three categories of blog files • gun control • church • Alzheimer’s disease • downloaded from Windows Live Spaces by searching with the key words • each entry has at least one comment • each category has 70 files for a total of 210 blog files • parsing -> convert into 3 parts -> stemming -> delete stop words -> count the number of occurrences of each word
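
The preprocessing steps listed on this slide might look roughly like the sketch below. Everything specific here is an assumption for illustration: the stop-word list is a tiny made-up sample, `naive_stem` is a crude suffix stripper standing in for whichever stemmer the authors used (the slides do not say), and the order of stop-word removal and stemming is one possible choice.

```python
import re
from collections import Counter

# Illustrative stop-word list; not the list used in the original work.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def count_terms(text):
    """Tokenize, drop stop words, stem, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)

def preprocess(title, body, comments):
    """Convert one blog page into three term-count maps, one per part."""
    return {
        "title": count_terms(title),
        "body": count_terms(body),
        "comments": count_terms(comments),
    }
```

Keeping the three parts separate at this stage is what allows them to be weighted differently later.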

  9. DATA PREPROCESSING(CON.) • represent each document by three vectors: title $v_t$, body $v_b$, and comments $v_c$ • vector for the whole document is a weighted sum of all three vectors: $v = w_t v_t + w_b v_b + w_c v_c$ • $w_t$ : title weight • $w_b$ : body weight • $w_c$ : comment weight

  10. DATA PREPROCESSING(CON.) • the word-page matrix A is composed of a set of such document vectors • A = (v1 … vm) • vij is the weighted number of occurrences of word i in document vj • to balance the influence of small and large documents • scale each document vector vj to have its Euclidean norm equal to 1
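
A small sketch of the weighting and scaling described on the last two slides. The helper names `combine` and `unit_norm`, the example counts, and the weight values are hypothetical; the slides do not give the weights actually used at this point.

```python
import math

def combine(title, body, comments, wt, wb, wc):
    """Weighted sum of the title, body, and comment count vectors.

    All three vectors must share the same word order.
    """
    return [wt * t + wb * b + wc * c for t, b, c in zip(title, body, comments)]

def unit_norm(vec):
    """Scale a vector so that its Euclidean norm equals 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    # An all-zero vector cannot be scaled; return it unchanged.
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Hypothetical counts over a three-word vocabulary, with made-up weights.
v = combine([1, 0, 0], [2, 1, 0], [0, 1, 2], wt=2.0, wb=1.0, wc=1.0)
v_hat = unit_norm(v)
```

After scaling, a long page and a short page with the same word proportions map to the same vector, which is the balancing effect the slide mentions.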

  11. Feature Selection • tf-idf • TI is the mean value of tf-idf over all the documents for each term • use TI to measure the quality of the term • the higher the TI value, the better the term is ranked
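
The TI ranking can be sketched as below. The transcript does not preserve the exact tf-idf formula, so this example assumes the common form tf × log(N / df); the function names and the `fraction` parameter are also assumptions made for illustration.

```python
import math

def mean_tfidf(matrix):
    """Return TI, the mean tf-idf over all documents, for each term.

    matrix[i][j] is the count of word i in document j.
    Assumes tf-idf = tf * log(N / df), which may differ from the original.
    """
    n_docs = len(matrix[0])
    ti = []
    for row in matrix:
        df = sum(1 for x in row if x > 0)  # number of documents containing the word
        idf = math.log(n_docs / df) if df else 0.0
        ti.append(sum(x * idf for x in row) / n_docs)
    return ti

def select_features(matrix, fraction):
    """Return the row indices of the top `fraction` of terms ranked by TI."""
    ti = mean_tfidf(matrix)
    keep = max(1, round(fraction * len(matrix)))
    ranked = sorted(range(len(matrix)), key=lambda i: ti[i], reverse=True)
    return sorted(ranked[:keep])
```

Under this assumed formula, a word that appears in every document gets a TI of zero and is the first to be dropped, since it does nothing to tell documents apart.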

  12. Clustering • k-means algorithm • 1. compute the Euclidean distance from each document to each cluster center; assign each document to the cluster with the smallest distance • 2. recompute each cluster center as the mean of its constituent documents • 3. repeat steps 1 and 2 until convergence is reached

  13. CLUSTERING(CON.) • criterion function f for the convergence • r : the step of the iterations • Edist(vi, cj) : computes the Euclidean distance from the document vi to a cluster center cj • given a convergence criterion ε, the k-means algorithm stops when $|f_{r+1} - f_r| < \varepsilon$
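
The k-means loop and its stopping rule can be sketched as follows. Two things are assumptions, because the transcript does not preserve them: the criterion function is taken to be the sum of Euclidean distances from each document to its assigned center, and the initial centers are simply passed in by the caller. The names `edist` and `kmeans` and the `max_iter` safeguard are illustrative.

```python
import math

def edist(v, c):
    """Euclidean distance between a document vector and a cluster center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, c)))

def kmeans(docs, centers, eps=1e-6, max_iter=100):
    """Cluster docs starting from the given centers; returns (labels, centers)."""
    f_prev = None
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each document to the nearest cluster center.
        labels = [
            min(range(len(centers)), key=lambda j: edist(v, centers[j]))
            for v in docs
        ]
        # Step 2: recompute each center as the mean of its member documents.
        new_centers = []
        for j, old in enumerate(centers):
            members = [v for v, lab in zip(docs, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(list(old))  # keep an empty cluster's center
        centers = new_centers
        # Assumed criterion: total distance of documents to their centers.
        f = sum(edist(v, centers[lab]) for v, lab in zip(docs, labels))
        # Stop when the criterion changes by less than eps between steps.
        if f_prev is not None and abs(f - f_prev) < eps:
            break
        f_prev = f
    return labels, centers
```

For example, `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], [[0, 0], [10, 10]])` separates the two nearby pairs and stops on the second pass, once the criterion value no longer changes.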

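  14. CLUSTERING METRICS • Entropy • gauges the distribution of each class of documents within each cluster • suppose there are q classes and the clustering algorithm returns k clusters • the entropy E of a cluster Sr of size nr is computed as $E(S_r) = -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r}$ • $n_r^i$ is the number of documents in the ith class that are assigned to the rth cluster • entropy of the entire clustering solution is computed as: $\mathrm{Entropy} = \sum_{r=1}^{k} \frac{n_r}{n} E(S_r)$, where n is the total number of documents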
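
A sketch of the entropy metric, assuming the widely used normalized form in which each cluster's entropy is divided by log q and the clusters are weighted by their share of the documents. The function name and the label-list input format are choices made for this example.

```python
import math
from collections import Counter

def clustering_entropy(classes, clusters):
    """Size-weighted entropy of a clustering; 0 is best, 1 is fully mixed.

    classes[i] is the true class of document i; clusters[i] is its cluster.
    """
    n = len(classes)
    q = len(set(classes))  # number of classes
    total = 0.0
    for r in set(clusters):
        members = [c for c, s in zip(classes, clusters) if s == r]
        n_r = len(members)
        # Entropy of the class distribution inside cluster r.
        e_r = -sum((m / n_r) * math.log(m / n_r) for m in Counter(members).values())
        if q > 1:
            e_r /= math.log(q)  # normalize to the range [0, 1]
        total += (n_r / n) * e_r
    return total
```

A clustering that puts each class in its own cluster scores 0, while one that splits every class evenly across the clusters scores 1.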

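  15. CLUSTERING METRICS(CON.) • Purity • the purity of the cluster Sr can be defined as $P(S_r) = \frac{1}{n_r} \max_i n_r^i$, the fraction of the cluster taken up by its largest class • purity value of the entire clustering solution is computed as $\mathrm{Purity} = \sum_{r=1}^{k} \frac{n_r}{n} P(S_r)$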
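
Purity can be sketched in the same style, assuming the usual definition: the share of each cluster taken up by its largest class, weighted by cluster size. The function name and input format are illustrative choices.

```python
from collections import Counter

def clustering_purity(classes, clusters):
    """Size-weighted purity of a clustering; 1 is best.

    classes[i] is the true class of document i; clusters[i] is its cluster.
    """
    n = len(classes)
    total = 0.0
    for r in set(clusters):
        members = [c for c, s in zip(classes, clusters) if s == r]
        largest = max(Counter(members).values())  # size of the dominant class
        # (n_r / n) * (largest / n_r) simplifies to largest / n.
        total += largest / n
    return total
```

Unlike entropy, higher is better here: a value of 1 means every cluster contains documents from a single class only.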

  16. EXPERIMENTAL RESULTS • influence of weight • results are not very good if only one of the title, body, or comments is used • the accuracy of clustering on the blog body is better than on the title or the comments • using all three parts improves the results considerably

  17. EXPERIMENTAL RESULTS • Feature Selection • use only the title and the body for clustering • reducing the percentage of the features used will not change the clustering accuracy • apply feature selection to all the blog content including the comments • with a certain percentage of features selected, the entropy value can be reduced -> making good use of the terms in comments can help increase clustering accuracy

  18. Summary • utilizing a particular feature of the blogs, the comments, to enhance the effectiveness of a clustering algorithm in classifying blog pages • Future work • consider the timing effect of the blogs • better clustering of blog documents • finding blog communities • the utilization of predefined category information may also improve the classification of blog files • experimenting with other data mining algorithms on blog datasets
