
V. Clustering






  1. V. Clustering 2007.2.10. Artificial Intelligence Laboratory, Lee Seung-hee. Text: Text Mining, pages 82-93

  2. Outline • V.1 Clustering tasks in text analysis • V.2 The general clustering problem • V.3 Clustering algorithm • V.4 Clustering of textual data

  3. Clustering • Clustering • An unsupervised process through which objects are classified into groups called clusters. (cf. categorization, which is a supervised process.) • Applications: data mining, document retrieval, image segmentation, pattern classification.

  4. V.1 Clustering tasks in text analysis(1/2) • Cluster hypothesis: "Relevant documents tend to be more similar to each other than to nonrelevant ones." • If the cluster hypothesis holds for a particular document collection, then clustering the documents may help improve search effectiveness. • Improving search recall • When a query matches a document, its whole cluster can be returned • Improving search precision • By grouping the documents into a much smaller number of groups of related documents

  5. V.1 Clustering tasks in text analysis(2/2) • Scatter/Gather browsing method • Purpose: to enhance the efficiency of human browsing of a document collection when a specific search query cannot be formulated. • Session 1: the document collection is scattered into a set of clusters. • Session 2: the selected clusters are gathered into a new subcollection, with which the process may be repeated. • Reference • http://www2.parc.com/istl/projects/ia/sg-background.html • Query-specific clustering is also possible; hierarchical clustering is appealing here.

  6. V.2 Clustering problem(1/2) • Clustering tasks • problem representation • definition of proximity measures • actual clustering of objects • data abstraction • evaluation • Problem representation • Basically, an optimization problem. • Goal: select the best among all possible groupings of objects • Similarity function: a clustering quality function. • Feature extraction / feature selection • In a vector space model, • objects are vectors in the high-dimensional feature space. • the similarity function is the distance between the vectors in some metric

  7. V.2 Clustering problem(2/2) • Similarity measures • Euclidean distance • The cosine similarity measure is the most common
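The two measures named on this slide can be written out concretely. A minimal pure-Python sketch (representing vectors as plain lists of numbers is an assumption of this example):

```python
import math

def euclidean(x, y):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    # cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```

For document vectors, cosine similarity is usually preferred because it ignores document length: two documents with the same term proportions score 1 regardless of their size.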

  8. V.3 Clustering algorithm (1/9) • Flat clustering: a single partition of a set of objects into disjoint groups. • Hierarchical clustering: a nested series of partitions. • Hard clustering: every object belongs to exactly one cluster. • Soft clustering: objects may belong to several clusters with a fractional degree of membership in each.

  9. V.3 Clustering algorithm (2/9) • Agglomerative algorithms: begin with each object in a separate cluster and successively merge clusters until a stopping criterion is satisfied. • Divisive algorithms: begin with a single cluster containing all objects and perform splitting until a stopping criterion is satisfied. • Shuffling algorithms: iteratively redistribute objects among clusters

  10. V.3 Clustering algorithm (3/9) • k-means algorithm(1/2) • hard, flat, shuffling algorithm

  11. V.3 Clustering algorithm (4/9) • example of K-means algorithm
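The hard, flat, shuffling procedure described on these slides can be illustrated in a few lines. A minimal pure-Python k-means sketch (Lloyd-style assignment and update steps; seeding by random sampling of data points is one common choice, not the only one):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # alternate between assigning each point to its nearest centroid
    # and recomputing each centroid as the mean of its cluster
    rnd = random.Random(seed)
    centroids = list(rnd.sample(points, k))  # seeds: k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest centroid by squared Euclidean distance
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        for i, c in enumerate(clusters):
            # update step: centroid becomes the mean of its cluster
            if c:
                centroids[i] = tuple(sum(col) / len(c) for col in zip(*c))
    return centroids, clusters
```

On well-separated data the loop converges in a handful of iterations; as the next slide notes, a bad seed selection can still leave it in a local optimum.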

  12. V.3 Clustering algorithm (5/9) • K-means algorithm(2/2) • Simple and efficient • Complexity O(kn) • A bad initial selection of seeds can lead to a local optimum. • k-means suboptimality also exists -> Buckshot algorithm, ISO-DATA algorithm • Maximizes the quality function Q

  13. V.3 Clustering algorithm (6/9) • EM-based probabilistic clustering algorithm(1/2) • Soft, flat, probabilistic
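As an illustration of the soft, flat, probabilistic scheme named here, the following is a minimal EM sketch for a two-component one-dimensional Gaussian mixture (the initialisation at the data extremes and the fixed iteration count are assumptions of this example, not part of the text):

```python
import math

def em_gmm_1d(xs, iters=50):
    mu = [min(xs), max(xs)]       # crude initialisation at the data extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: soft assignment -- responsibility of each component for x
        resp = []
        for x in xs:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate parameters from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk or 1e-6
            pi[k] = nk / len(xs)
    return mu, var, pi, resp
```

Unlike k-means, every point keeps a fractional membership in both components; hardening the assignment (taking the argmax of each responsibility row) recovers a k-means-like result.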

  14. V.3 Clustering algorithm (7/9)

  15. V.3 Clustering algorithm (8/9) • Hierarchical agglomerative Clustering • single-link method • Complete-link method • Average-link method
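A naive sketch of the single-link variant listed above (purely illustrative and far from efficient; complete-link and average-link differ only in replacing the `min` over pairwise distances with `max` or the mean):

```python
def single_link(points, num_clusters):
    # agglomerative clustering: start with singletons and repeatedly merge
    # the two closest clusters, where the distance between clusters is the
    # minimum pairwise distance between their members (single link)
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```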

  16. V.3 Clustering algorithm (9/9)

  17. Other clustering algorithms • minimal spanning tree • nearest neighbor clustering • Buckshot algorithm

  18. V.4 clustering of textual data(1/6) • Representation of the text clustering problem • Objects have a very complex and rich internal structure. • Documents must be converted into vectors in the feature space. • Bag-of-words document representation. • Reducing the dimensionality • Local methods: delete unimportant components from individual document vectors. • Global methods: latent semantic indexing (LSI)
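The bag-of-words conversion mentioned on this slide can be sketched as follows (whitespace tokenisation and lowercasing are simplifying assumptions; real systems add stemming and stop-word removal):

```python
from collections import Counter

def bag_of_words(docs):
    # build a shared vocabulary, then map each document to a term-count
    # vector over that vocabulary (its coordinates in the feature space)
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w, c in Counter(d.lower().split()).items():
            v[index[w]] = c
        vectors.append(v)
    return vocab, vectors
```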

  19. V.4 clustering of textual data(2/6) • Latent semantic indexing • Maps the N-dimensional feature space F onto a lower-dimensional subspace V. • LSI is based upon applying the SVD to the term-document matrix.

  20. V.4 clustering of textual data(3/6) • Singular value decomposition (SVD): A = UDV^T • U: column-orthonormal m x r matrix • D: diagonal r x r matrix whose diagonal elements are the singular values of A • V: column-orthonormal n x r matrix • U^T U = V^T V = I • Dimension reduction
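A full SVD is beyond a short sketch, but the leading singular triple, the direction LSI keeps first, can be found by power iteration on A^T A. A minimal pure-Python illustration (dense lists of lists; the fixed iteration count is an assumption):

```python
import math

def top_singular(A, iters=100):
    # power iteration on A^T A converges to the leading right singular
    # vector v; then sigma = |A v| and u = A v / sigma complete the triple
    m, n = len(A), len(A[0])

    def matvec(v):                       # compute A v
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]

    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Av = matvec(v)
        # w = A^T (A v)
        w = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return u, sigma, v
```

Repeating this on the deflated matrix (A minus sigma * u v^T) yields the next triples, which is the rank-k truncation LSI uses for dimension reduction.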

  21. V.4 clustering of textual data(4/6) • Medoids: actual documents that are most similar to the centroids • Using naïve Bayes mixture models with the EM clustering algorithm
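The medoid mentioned here can be picked directly from the cluster members. A minimal sketch using total squared distance to the rest of the cluster as the criterion (one common formulation; the slide's phrasing, similarity to the centroid, is an equivalent idea for Euclidean data):

```python
def medoid(cluster):
    # the medoid is the actual member with the smallest total (squared)
    # distance to all other members of the cluster
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(cluster, key=lambda p: sum(dist2(p, q) for q in cluster))
```

Unlike a centroid, a medoid is always a real document, so it can serve directly as a cluster label or summary.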

  22. V.4 clustering of textual data(5/6) • Data abstraction in text clustering • generating meaningful and concise description of the cluster. • method of generating the label automatically • a title of the medoid document • several words common to the cluster documents can be shown. • a distinctive noun phrase.

  23. V.4 clustering of textual data(6/6) • Evaluation of text clustering: how good is the quality of the result? • Purity • Assume {L1,L2,...,Ln} are the manually labeled classes of documents, and {C1,C2,...,Cm} are the clusters returned by the clustering process • Entropy, mutual information between classes and clusters
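Purity, as defined from the labeled classes {L1,...,Ln} and clusters {C1,...,Cm} above, can be computed directly: each cluster is credited with the size of its majority class, and the total is normalised by the number of documents. A minimal sketch:

```python
from collections import Counter

def purity(clusters, labels):
    # clusters: list of lists of document ids
    # labels: mapping from document id to its manually assigned class
    total = sum(len(c) for c in clusters)
    majority = 0
    for c in clusters:
        counts = Counter(labels[d] for d in c)
        majority += counts.most_common(1)[0][1]  # size of the majority class
    return majority / total
```

Purity alone is easy to game (singleton clusters score a perfect 1), which is why the slide also lists entropy and mutual information between classes and clusters.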
