
Matjaž Juršič, Vid Podpečan, Nada Lavrač



Presentation Transcript


  1. http://kt.jis.si Fuzzy Clustering of Documents Matjaž Juršič, Vid Podpečan, Nada Lavrač

  2. Overview
     • Basic Concepts
       - Clustering
       - Fuzzy Clustering
       - Clustering of Documents
     • Problem Domain
       - Conference Papers Clustering (Phase 1)
       - Combining Constraint-Based & Fuzzy Clustering
       - Conference Papers Clustering (Phase 2)
     • Fuzzy Clustering of Documents
       - C-Means Algorithm
       - Distance Measure
       - Comparison of Crisp & Fuzzy Clustering
       - Time Complexity
     • Further Work

  3. Clustering
     • An important unsupervised learning problem: finding structure in a collection of unlabeled data.
     • Dividing data into groups (clusters) such that:
       - "similar" objects are in the same cluster,
       - "dissimilar" objects are in different clusters.
     • Problems:
       - choosing a correct similarity/distance function between objects,
       - evaluating clustering results.

  4. Fuzzy Clustering
     • No sharp boundaries between clusters.
     • Each data object can belong to more than one cluster (with a certain degree of membership).
     • Example: membership of a "red square" data object:
       - 70% in the "red" cluster,
       - 30% in the "green" cluster.

  5. Clustering of Documents
     • Bag of Words & Vector Space Model
       - text is represented as an unordered collection of words
       - words are weighted using tf-idf (term frequency–inverse document frequency)
       - document = one vector in a high-dimensional space
       - similarity = cosine similarity between vectors
     • Text-Garden Software Library (www.textmining.net)
       - collection of text-mining software tools (text analysis, model generation, document classification/clustering, web crawling, ...)
       - C++ library
       - developed at JSI
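The bag-of-words pipeline on this slide can be sketched in a few lines of plain Python. This is only an illustration, not Text-Garden's API: the function names, the whitespace tokenizer, and the `log(N / df)` form of the idf weight are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "fuzzy clustering of documents",
    "crisp clustering of documents",
    "web crawling tools",
]
vecs = tfidf_vectors(docs)
```

With these three toy documents, the first two share most of their terms and so score a higher cosine similarity than the first and third, which share none.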

  6. Conference Papers Clustering (Phase 1)
     • Problem: group conference papers by content into a predefined session schedule.
     • Approach: constraint-based clustering of the papers.
     • [Figure: example session schedule: Session A (3 papers), coffee break, Session B (4 papers), lunch break, Session C (4 papers), coffee break, Session D (3 papers)]

  7. Combining Constraint-Based & Fuzzy Clustering
     • Phase 1
       - Solution: constraint-based clustering (CBC).
       - Difficulties: CBC can get stuck in a local minimum; the resulting schedule is often of low quality; user interaction is needed to repair the schedule.
     • Phase 2 needed
       - Run fuzzy clustering (FC) with the initial clusters from CBC.
       - If the output clusters of FC differ from those of CBC, repeat everything.
       - If the clusters of FC are equal to those of CBC, show the new information to the user.
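The Phase 2 control flow could look roughly like the sketch below. Everything named here is hypothetical: `run_cbc` and `run_fc` stand in for the two clustering steps, the fuzzy result is compared with CBC by assigning each paper to its highest-membership session, and `max_rounds` is an added safety bound that the slide does not mention.

```python
def schedule_with_feedback(papers, run_cbc, run_fc, max_rounds=10):
    """Alternate CBC and FC until the hardened FC clusters agree with CBC.

    run_cbc(papers) -> list of session indices, one per paper (hypothetical).
    run_fc(papers, assignment) -> list of membership rows, one per paper (hypothetical).
    """
    for round_no in range(1, max_rounds + 1):
        assignment = run_cbc(papers)
        memberships = run_fc(papers, assignment)
        # Harden the fuzzy result: each paper goes to its highest-membership session.
        hardened = [max(range(len(row)), key=row.__getitem__) for row in memberships]
        if hardened == assignment:
            # Agreement: the memberships are the extra information shown to the user.
            return assignment, memberships, round_no
    return None  # no agreement within max_rounds
```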

  8. Conference Papers Clustering (Phase 2)
     • Run fuzzy clustering on the Phase 1 results to:
       - gain insight into result quality,
       - identify problematic papers.
     • [Figure: example session schedule (Sessions A to D with coffee and lunch breaks), annotated with the percentages 25%, 10%, 42%, 13%, 37%]
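One plausible way to "identify problematic papers" is to flag every paper whose fuzzy membership in its assigned session is low. The sketch below assumes that reading; the function name, the data layout, and the 0.5 threshold are illustrative choices and are not taken from the slides.

```python
def problematic_papers(assignment, memberships, threshold=0.5):
    """Return (paper index, assigned session, membership) for weakly attached papers.

    assignment[i] is the session index of paper i; memberships[i][s] is the
    fuzzy membership of paper i in session s (each row sums to 1).
    """
    flagged = []
    for paper, (session, row) in enumerate(zip(assignment, memberships)):
        if row[session] < threshold:
            flagged.append((paper, session, row[session]))
    return flagged

# Toy data: three papers, two sessions.
assignment = [0, 0, 1]
memberships = [[0.90, 0.10], [0.42, 0.58], [0.25, 0.75]]
flagged = problematic_papers(assignment, memberships)
```

Here only the second paper is flagged: it sits in session 0 but belongs to it with a membership of just 0.42.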

  9. C-Means Algorithm
     • generate initial (random) cluster centres
     • repeat
       - for each example, calculate the membership weights
       - for each cluster, recompute the new centre
     • until the difference between the clusters of two consecutive iterations drops under some threshold
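The loop above can be sketched as follows. The slide does not give the update formulas, so this uses the textbook fuzzy c-means updates as an assumption, along with Euclidean distance (rather than a cosine-based one) to keep the example short. The fuzzifier `m`, `tol`, `max_iter`, and `seed` are illustrative parameters.

```python
import math
import random

def fuzzy_c_means(points, n_clusters, m=2.0, tol=1e-4, max_iter=100, seed=0):
    """Return (centres, memberships) for a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, n_clusters)]  # random initial centres
    dim = len(points[0])
    memberships = []
    for _ in range(max_iter):
        # 1) Membership weights: closer centres get larger weights; each row sums to 1.
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centres]
            if any(d == 0.0 for d in dists):
                # The point coincides with a centre: split membership among those centres.
                zeros = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [z / sum(zeros) for z in zeros]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                row = [v / sum(inv) for v in inv]
            memberships.append(row)
        # 2) New centres: mean of all points, weighted by membership ** m.
        new_centres = []
        for k in range(n_clusters):
            weights = [row[k] ** m for row in memberships]
            total = sum(weights)
            new_centres.append(
                [sum(w * p[i] for w, p in zip(weights, points)) / total for i in range(dim)]
            )
        # 3) Stop once no centre has moved by more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, memberships

points = [(0, 0), (0, 1), (2, 0), (10, 10), (10, 12), (13, 10)]
centres, memberships = fuzzy_c_means(points, n_clusters=2)
```

Unlike k-means, every point contributes to every centre; the exponent `m` controls how quickly that contribution fades with distance.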

  10. Distance Measure
     • Vector space: the usual similarity measure is cosine similarity.
     • C-Means explicitly needs a distance (dissimilarity), not a similarity:
       - There are many possibilities: [formulas not preserved in the transcript]
       - None has ideal properties.
       - Experimental evaluation shows no significant difference.
       - We used: [formula not preserved in the transcript]
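The formulas from this slide did not survive in the transcript, so the authors' candidates and their final choice are unknown. Purely as an illustration, here are three commonly used ways of turning cosine similarity into a dissimilarity; none of them is claimed to be the one used in the talk.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def one_minus_cosine(u, v):
    """Cosine distance: 0 for parallel vectors, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(u, v)

def angular_distance(u, v):
    """Angle between the vectors in radians; the clamp guards against rounding."""
    return math.acos(max(-1.0, min(1.0, cosine_similarity(u, v))))

def chord_distance(u, v):
    """Euclidean distance between the two vectors after scaling them to unit length."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine_similarity(u, v)))
```

All three are monotone in the cosine similarity, so they rank neighbours identically; they differ in scale and in whether the triangle inequality holds (it does for the angular and chord versions, but not for `1 - cos`).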

  11. Comparison of Crisp & Fuzzy Clustering

  12. Time Complexity
     • If the dimensionality of the vectors is much higher than the number of clusters, the time complexity is comparable to that of k-means (this holds for document clustering).
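A back-of-the-envelope count shows why this can hold. It assumes the textbook fuzzy c-means updates (the slide itself gives no formulas), with $n$ documents, $c$ clusters, and $d$ dimensions; the symbols are introduced here for illustration.

```latex
\begin{align*}
T_{\text{k-means}} &= O(n\,c\,d)
  && \text{per iteration: distances and centre updates} \\
T_{\text{fuzzy c-means}} &= O(n\,c\,d) + O(n\,c^{2})
  && \text{per iteration: the same, plus naively computed membership weights} \\
d \gg c \;&\Longrightarrow\; n\,c^{2} \ll n\,c\,d
  \;\Longrightarrow\; T_{\text{fuzzy c-means}} = O(n\,c\,d)
\end{align*}
```

For documents, $d$ is the vocabulary size (typically thousands of terms) while $c$ is a handful of sessions, so the extra membership term is negligible.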

  13. Further Work
     • Evaluation
       - Test scenarios
       - Benchmarks
       - Using data from past conferences
     • User Interface
       - Web interface for semi-automatic conference schedule creation
     • Algorithms Fine-Tuning
     • …

  14. Discussion
     • Contacts: matjaz.jursic@ijs.si, vid.podpecan@ijs.si, nada.lavrac@ijs.si
     • Thank you for your attention
