ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS

Beibei Li, Shuting Xu, Jun Zhang • Department of Computer Science, University of Kentucky • ACMSE’07

INTRODUCTION • blogs • highly opinionated personal online commentary • including hyperlinks to other resources • Technorati (July, 2006) • tracking more than 50 million blogs • about 175,000 blogs were created daily • size of the blogosphere doubles every six months • how many blog authors are updating their blogs regularly -> not clear

INTRODUCTION(CON.) • analysis of the blogosphere in 2004 • more than two-thirds of public blogs are personal journals • knowledge blogs (k-blogs) -> mere 3 percent • due to the diverse background of the blog authors and readers • the blogosphere has hyper-accelerated the spread of information

BLOGS VS. WEBPAGES • the major differences between blogs and standard web pages • blogs are dated • most blogs allow readers to place comments on each blog document • this creates communication channels between the blog authors and the readers • blog authors can place individual blogs into different categories • according to some predefined categories • the definitions of the categories may differ from author to author

BLOG DOCUMENTS • use vector-space model to encode the blog web pages • each blog page can be viewed as a column vector • each word used can be considered as one row of the matrix • consider a blog page as three parts • blog title • blog body • the content of the blog page • comments of the authors and/or the readers

A SAMPLE BLOG PAGE

HYPOTHESIS • hypothesis • the use of title and comment words in the dataset will enhance the discrimination of the blog pages • this will result in more accurate clustering solutions • reason • the words in the comments reflect the specific views, questions, and answers of the authors and the readers • they may carry more weight in discriminating individual blog pages

DATA PREPARATION AND CLUSTERING • Data Preprocessing • selected three categories of blog files • gun control • church • Alzheimer’s disease • downloaded from Windows Live Spaces by searching with the keywords • each entry has at least one comment • each category has 70 files for a total of 210 blog files • parsing (convert each file into 3 parts) • stemming • delete stop words • count the number of occurrences of each word
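The preprocessing steps listed on this slide can be sketched roughly as follows. This is only an illustration: the stop-word list, the `naive_stem` suffix stripper (a stand-in for a real stemmer such as Porter's), and the dictionary layout of a parsed page are all assumptions, not the authors' implementation.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "on", "for"}


def naive_stem(word):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_terms(text):
    """Tokenize, stem, delete stop words, and count term occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(token) for token in tokens]
    return Counter(stem for stem in stems if stem not in STOP_WORDS)


def preprocess(page):
    """Turn a parsed page (title, body, comments strings) into three term counts."""
    return {part: count_terms(page[part]) for part in ("title", "body", "comments")}
```

Each blog file then yields three separate term-count tables, one per part, which is what the weighting step on the next slide operates on.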

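DATA PREPROCESSING(CON.) • represent each document by three vectors: a title vector $v_t$, a body vector $v_b$, and a comment vector $v_c$ • vector for the whole document is a weighted sum of all three vectors: $v = w_t v_t + w_b v_b + w_c v_c$ • $w_t$ : title weight • $w_b$ : body weight • $w_c$ : comment weight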
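A minimal sketch of the weighted sum, assuming each of the three part vectors is stored as a term-to-count dictionary. The function name `combine_parts` and the default weights are hypothetical; the slides do not give the weight values at this point.

```python
def combine_parts(title, body, comments, wt=1.0, wb=1.0, wc=1.0):
    """Return the weighted sum wt*title + wb*body + wc*comments as a dict."""
    combined = {}
    for weight, counts in ((wt, title), (wb, body), (wc, comments)):
        for term, count in counts.items():
            combined[term] = combined.get(term, 0.0) + weight * count
    return combined
```

Setting two of the three weights to zero reproduces the single-part experiments (title only, body only, or comments only).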

DATA PREPROCESSING(CON.) • the word-page matrix A is composed of a set of such document vectors • A = (v1 … vm) • vij is the weighted occurrences of the word i in the document vj • to balance the influence of small size and large size documents • scale each document vector vj to have its Euclidean norm equal to 1
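The word-page matrix with unit-length columns might be built along these lines. This is a sketch with plain Python lists; the names `unit_normalize` and `build_matrix`, and the choice to sort the vocabulary alphabetically, are assumptions made for illustration.

```python
import math


def unit_normalize(vec):
    """Scale a term->weight dict so that its Euclidean norm equals 1."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    if norm == 0:
        return dict(vec)  # an empty document stays as it is
    return {term: x / norm for term, x in vec.items()}


def build_matrix(doc_vectors):
    """Return (vocabulary, A), where A[i][j] is the scaled weight of word i in document j."""
    vocab = sorted({term for vec in doc_vectors for term in vec})
    scaled = [unit_normalize(vec) for vec in doc_vectors]
    matrix = [[s.get(term, 0.0) for s in scaled] for term in vocab]
    return vocab, matrix
```

After scaling, a long post and a short post with the same word proportions map to the same column, which is the balancing effect the slide describes.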

FEATURE SELECTION • tf-idf • TI is the mean value of tf-idf over all the documents for each term • use TI to measure the quality of the term • the higher the TI value is, the better the term ranks
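A rough sketch of TI-based ranking follows. The slide does not say which tf-idf variant was used, so the form below (term weight times log(m / document frequency)) is an assumption, as are the function names and the `fraction` parameter for the share of features to keep.

```python
import math


def ti_scores(doc_vectors):
    """Mean tf-idf of each term over all documents (assumed idf = log(m / df))."""
    m = len(doc_vectors)
    df = {}
    for vec in doc_vectors:
        for term, weight in vec.items():
            if weight > 0:
                df[term] = df.get(term, 0) + 1
    scores = {}
    for term, doc_freq in df.items():
        idf = math.log(m / doc_freq)
        scores[term] = sum(vec.get(term, 0.0) * idf for vec in doc_vectors) / m
    return scores


def select_features(doc_vectors, fraction):
    """Keep the top `fraction` of terms, ranked by TI (ties broken alphabetically)."""
    scores = ti_scores(doc_vectors)
    keep = max(1, round(fraction * len(scores)))
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:keep]
```

Under this variant a term that occurs in every document gets a TI of zero and is dropped first, which matches the intent of ranking terms by how well they discriminate between pages.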

CLUSTERING • k-means algorithm • 1. compute the Euclidean distance from each document to each cluster center; assign each document to the cluster with the smallest distance • 2. recompute each cluster center as the mean of its constituent documents • 3. repeat steps 1 and 2 until convergence is reached

CLUSTERING(CON.) • criterion function $f_r$ for the convergence • r : the step of the iterations • Edist(vi, cj) : the Euclidean distance from the document vi to a cluster center cj • given a convergence criterion ε, the k-means algorithm stops when $|f_{r+1} - f_r| < \varepsilon$
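The k-means loop and its stopping rule can be sketched as below. Two things are assumptions here: the criterion function is taken to be the sum of Euclidean distances from each document to its assigned center (the exact formula is not preserved in this transcript), and the first k documents serve as initial centers only to keep the example deterministic.

```python
import math


def edist(v, c):
    """Euclidean distance between a document vector and a cluster center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, c)))


def kmeans(docs, k, eps=1e-6, max_iter=100):
    """Cluster dense document vectors; returns (labels, centers)."""
    centers = [list(doc) for doc in docs[:k]]  # assumed initialization
    prev_f = None
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each document to its nearest center.
        labels = [min(range(k), key=lambda j: edist(v, centers[j])) for v in docs]
        # Step 2: recompute each center as the mean of its members.
        for j in range(k):
            members = [v for v, label in zip(docs, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Assumed criterion f_r: total distance of documents to their centers.
        f = sum(edist(v, centers[label]) for v, label in zip(docs, labels))
        if prev_f is not None and abs(f - prev_f) < eps:
            break  # |f_{r+1} - f_r| < eps
        prev_f = f
    return labels, centers
```

In practice the initial centers are usually chosen at random and the run repeated, since k-means only finds a local optimum.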

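CLUSTERING METRICS • Entropy • gauges the distribution of each class of documents within each cluster • suppose there are q classes and the clustering algorithm returns k clusters • the entropy E of a cluster Sr of size nr is computed as $E(S_r) = -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r}$ • $n_r^i$ is the number of documents in the ith class that are assigned to the rth cluster • entropy of the entire clustering solution is computed as: $Entropy = \sum_{r=1}^{k} \frac{n_r}{n} E(S_r)$, where n is the total number of documents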
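A small sketch of the entropy metric, assuming the usual normalized form: the class distribution inside each cluster is scored with Shannon entropy divided by log q, and the cluster scores are averaged with weights proportional to cluster size. Each cluster is represented here as a list of the true class labels of its documents, which is a layout chosen for illustration.

```python
import math
from collections import Counter


def cluster_entropy(class_labels, q):
    """Normalized entropy of one cluster, given the true classes of its documents."""
    n_r = len(class_labels)
    counts = Counter(class_labels)
    total = sum((c / n_r) * math.log(c / n_r) for c in counts.values())
    return -total / math.log(q)


def total_entropy(clusters, q):
    """Size-weighted average of the per-cluster entropies."""
    n = sum(len(cluster) for cluster in clusters)
    return sum(len(c) / n * cluster_entropy(c, q) for c in clusters if c)
```

A value of 0 means every cluster contains documents of a single class; lower is better.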

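CLUSTERING METRICS(CON.) • Purity • the purity of the cluster Sr can be defined as $P(S_r) = \frac{1}{n_r} \max_i n_r^i$, where $n_r^i$ is the number of documents of the ith class in cluster Sr and $n_r$ is the cluster size • purity value of the entire clustering solution is computed as $Purity = \sum_{r=1}^{k} \frac{n_r}{n} P(S_r)$, where n is the total number of documents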
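Purity can be sketched the same way, assuming the common definition: the share of a cluster taken up by its largest class, averaged over clusters with weights proportional to cluster size. As before, a cluster is given as a list of the true class labels of its documents, and the function names are hypothetical.

```python
from collections import Counter


def cluster_purity(class_labels):
    """Fraction of the cluster that belongs to its most frequent class."""
    return max(Counter(class_labels).values()) / len(class_labels)


def total_purity(clusters):
    """Size-weighted average of the per-cluster purities."""
    n = sum(len(cluster) for cluster in clusters)
    return sum(len(c) / n * cluster_purity(c) for c in clusters if c)
```

A purity of 1 means each cluster holds a single class; higher is better, the opposite direction from entropy.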

EXPERIMENTAL RESULTS • influence of weight • results are not very good if only one of the title, body, or comments is used • clustering on the blog body alone is more accurate than on the title or the comments alone • using all three parts together improves accuracy considerably

EXPERIMENTAL RESULTS(CON.) • Feature Selection • use only the title and the body for clustering • reducing the percentage of the features used will not change the clustering accuracy • apply feature selection to all the blog content including the comments • with a certain percentage of features selected, the entropy value can be reduced • making good use of the terms in comments can help increase clustering accuracy

SUMMARY • utilizing a particular feature of the blogs, the comments, to enhance the effectiveness of a clustering algorithm in classifying blog pages • Future work • consider the timing effect of the blogs • better clustering of blog documents • finding blog communities • the utilization of predefined category information may also improve the classification of blog files • experimenting with other data mining algorithms on blog datasets
