Small World Clustering Algorithms

Presentation Transcript


  1. Small World Clustering Algorithms Brant Chee

  2. Experiments • 3 clustering algorithms • Complete Link (Cluto) • K means (Cluto) • Small World

  3. Test Collections

  4. Experimental Setup • Parameters left at package defaults • Clustered with n = 50, 100, 150 and 200. • Clusters with fewer than 4 or more than 50 elements were eliminated, and the clustering that resulted in fewer than 40 clusters was chosen for evaluation.
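
The filtering and selection rule on this slide can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names and the input shape (a mapping from n to a list of clusters) are assumptions, and since the slide does not say what happens when several values of n qualify, the sketch simply takes the smallest such n.

```python
def filter_clusters(clusters, min_size=4, max_size=50):
    """Drop clusters with fewer than min_size or more than max_size elements."""
    return [c for c in clusters if min_size <= len(c) <= max_size]


def choose_clustering(clusterings, max_clusters=40):
    """Return (n, kept_clusters) for the smallest n whose filtered clustering
    has fewer than max_clusters clusters, or None if no run qualifies.

    `clusterings` maps the requested cluster count n to a list of clusters.
    Picking the smallest qualifying n is an assumption, not stated on the slide.
    """
    for n in sorted(clusterings):
        kept = filter_clusters(clusterings[n])
        if len(kept) < max_clusters:
            return n, kept
    return None
```

For example, passing the four runs as `{50: [...], 100: [...], 150: [...], 200: [...]}` returns the first run that leaves fewer than 40 clusters after filtering.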

  5. Quantitative Results

  6. Quantitative Results II

  7. Qualitative Evaluation • 2 criteria: Utility and Coherence • 3-point scale: 1 good, 2 poor, 3 bad • Good: >60% of articles • Poor: 41-59% • Bad: <40% • Evaluate terms in cluster to get context.

  8. Quantitative Results Cont…

  9. Sample Session

  10. Other Approaches Statistical Methods

  11. Other Clustering Approaches • Can we choose other types of clustering algorithms which could provide better quality results or provide better cluster labels? • SOM (Self Organizing Map) • Slow for high numbers of dimensions and large numbers of objects. • Carrot2 • Slow for large numbers of items. • Huge memory consumption.

  12. Random Projection • Can we reduce the dimensionality of vectors (e.g., 50,000 → 1,000) while preserving distances? • Speed up similarity calculations • Various methods: • Random projection • “Latent semantic indexing” • Multidimensional scaling

  13. Very Sparse Random Projections • Let A ∈ ℝ^(n×D) be our n points in D dimensions • Multiply A by a random matrix R ∈ ℝ^(D×k) • R has entries in {−1, 0, 1}, each drawn with a fixed probability • Cost: O(nDk + n²k)
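
The projection on this slide can be sketched in plain Python as below. The slide does not preserve the entry probabilities, so the sketch assumes the common scheme in which entries are √s · {+1, 0, −1} with probabilities 1/(2s), 1 − 1/s and 1/(2s); the parameter `s`, the 1/√k scaling and the function names are all assumptions for illustration. The O(n²k) term in the cost presumably covers computing all pairwise distances in the reduced space, which is not shown here.

```python
import math
import random


def sparse_random_matrix(D, k, s=3.0, seed=0):
    """Build a D x k matrix with entries sqrt(s) * {+1, 0, -1}.

    Assumed probabilities (not given on the slide):
    +1 and -1 each with probability 1/(2s), 0 with probability 1 - 1/s.
    """
    rng = random.Random(seed)
    scale = math.sqrt(s)
    matrix = []
    for _ in range(D):
        row = []
        for _ in range(k):
            u = rng.random()
            if u < 1.0 / (2.0 * s):
                row.append(scale)
            elif u < 1.0 / s:
                row.append(-scale)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix


def project(A, R):
    """Compute B = (1/sqrt(k)) * A R for n points in D dimensions.

    Takes at most O(nDk) multiply-adds; zeros in A are skipped.
    """
    k = len(R[0])
    norm = 1.0 / math.sqrt(k)
    B = []
    for point in A:
        out = [0.0] * k
        for j, value in enumerate(point):
            if value == 0.0:
                continue  # sparse term vectors are mostly zeros
            r_row = R[j]
            for c in range(k):
                out[c] += value * r_row[c]
        B.append([norm * x for x in out])
    return B
```

With larger `s` most entries of R are zero, which is what makes the projection cheap for high-dimensional term vectors.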

  14. Reducing Dimensionality • Bank dataset: 11,000 articles from 11 categories in DMOZ. • 11,000 articles reduced from 30K terms, with a 1 GB heap, in 11 s. • Increase in purity and decrease in entropy (measures of clustering quality).
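
Purity and entropy, the two quality measures named on this slide, can be computed as in the sketch below. It assumes each cluster is given as a list of the true category labels of its members, and it uses the unnormalized, size-weighted entropy in bits; some packages normalize by the log of the number of categories, so absolute values may differ from those reported in the talk.

```python
import math
from collections import Counter


def purity(clusters):
    """Fraction of items that belong to the majority category of their cluster.

    `clusters` is a list of lists of true category labels, one list per cluster.
    """
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return majority / total


def entropy(clusters):
    """Size-weighted average of per-cluster label entropy, in bits."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue
        h = 0.0
        for count in Counter(c).values():
            p = count / len(c)
            h -= p * math.log2(p)
        result += (len(c) / total) * h
    return result
```

Higher purity and lower entropy both indicate clusters that line up more closely with the known categories.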

  15. MI on Phrases • More context than single words • More meaningful term clusters

  16. Other approaches Knowledge Intensive Approaches

  17. Hypernym • “Is-a” relationship • Shakespeare is an author. • Pug is a dog. • Implicitly hierarchical. • Basis of many ontologies and semantic networks. • WordNet • UMLS

  18. Portion of the UMLS Semantic Network: Biologic Function

  19. Hypernym Relations • NP such as {, NP}* {(or | and)} NP • Vegetables such as Beets, Carrots and Peas. • Such NP as {NP,}* {(or|and)} NP • …works by such authors as Herrick, Goldsmith and Shakespeare. • NP {, NP}* {,} or|and other NP • Bruises, …, broken bones or other injuries • NP {,} including {NP,} * {or|and} NP • All common-law countries, including Canada and England … • NP {,} especially {NP,} * {or|and} NP • … most European countries, especially France, England and Spain.
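
As a rough illustration of the first pattern on this slide ("NP such as NP, NP and NP"), the sketch below extracts (hypernym, hyponym) pairs with a regular expression. It is deliberately crude: a noun phrase is approximated by a single word, whereas a real extractor would use a part-of-speech tagger or NP chunker. The function and pattern names are hypothetical.

```python
import re

# A single word stands in for a noun phrase (an assumption of this sketch).
# The lookahead keeps "and"/"or" from being consumed as a list item.
SUCH_AS = re.compile(
    r"(?P<hyper>\w+) such as "
    r"(?P<list>\w+(?:, (?!and\b|or\b)\w+)*(?:,? (?:and|or) \w+)?)",
    re.IGNORECASE,
)

# Splits the matched list on commas and on the final "and"/"or".
SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+", re.IGNORECASE)


def extract_such_as(sentence):
    """Return lowercase (hypernym, hyponym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hyper").lower()
        for item in SEPARATOR.split(match.group("list")):
            if item:
                pairs.append((hypernym, item.lower()))
    return pairs


print(extract_such_as("Vegetables such as Beets, Carrots and Peas."))
# [('vegetables', 'beets'), ('vegetables', 'carrots'), ('vegetables', 'peas')]
```

The other patterns ("such NP as", "or other", "including", "especially") could be handled the same way with one expression each; in "or other" the hypernym comes after the list instead of before it.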

  20. Uses of Hypernym Trees • Search • Query Expansion • Faceted metadata • Clustering • Parent node defines a cluster • Keyword assignment
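
Two of these uses, clustering by parent node and query expansion, can be sketched from a flat list of (hypernym, hyponym) pairs such as (organs, liver). The sketch handles only one level of the tree, and the function names are hypothetical.

```python
from collections import defaultdict


def clusters_from_hypernyms(pairs):
    """Group hyponyms under their hypernym; each parent node defines one cluster."""
    clusters = defaultdict(list)
    for hypernym, hyponym in pairs:
        if hyponym not in clusters[hypernym]:  # ignore repeated pairs
            clusters[hypernym].append(hyponym)
    return dict(clusters)


def expand_query(term, clusters):
    """Query expansion: the term itself plus any hyponyms recorded under it."""
    return [term] + clusters.get(term, [])


pairs = [("organs", "liver"), ("organs", "kidney"), ("substances", "zinc")]
clusters = clusters_from_hypernyms(pairs)
print(expand_query("organs", clusters))  # ['organs', 'liver', 'kidney']
```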

  21. Trivial Hypernyms • organic compounds d-ribose • organic compounds d-arabinose • organic compounds l-arabinose • organic compounds sucrose • substances cortisone • substances vitamins a and c • substances zinc • organs liver • organs kidney • sugar-containing products honey • sugar-containing products jam • sugar-containing products glucose • sugar-containing products fruit juice concentrates • sugar-containing products tomato • largely populated countries china • largely populated countries russia

  22. Bad Hypernyms • suicidal patients appears • other agents plasmin • other agents plasminogen • such common sensations illness • phenomena founder effects • phenomena migration • phenomena gene flow • clinical manifestations 80 • chemical agents homocystine • no other explanation anencephaly • conditions azure a-0.5 % nahco3 solution • conditions ph 8.1 • fewer side-effects vegetative disfunction • techniques carpentier • techniques 's ring

  23. Good? Hypernyms • entirely synthetic steroids norgestrel and quingestanol • menstrual disorders metrorrhagia • menstrual disorders oligoamenorrhea • menstrual disorders amenorrhea • mild venous disorders swollen veins • mild venous disorders heavy limbs • mild venous disorders varicosities • obstructive pulmonary lung diseases alveolar proteinosis • obstructive pulmonary lung diseases pneumonia • obstructive pulmonary lung diseases asthma • obstructive pulmonary lung diseases bronchiectasis • obstructive pulmonary lung diseases cystic fibrosis • choline analogues n,n'-dimethylethanolamine • choline analogues n-monomethylethanolamine • choline analogues ethanolamine • 3alpha-oh-containing steroids androsterone
