Explore clustering algorithms like K-means and Complete Link for small world test collections. Evaluate results quantitatively and qualitatively using utility and coherence criteria. Investigate approaches for reducing dimensionality and improving clustering quality through hypernym trees. Discover relevant hypernyms for various domains.
Small World Clustering Algorithms Brant Chee
Experiments • 3 clustering algorithms • Complete Link (Cluto) • K-means (Cluto) • Small World
Experimental Setup • Parameters left at package defaults • Clustered with n = 50, 100, 150 and 200. • Clusters with fewer than 4 or more than 50 elements were eliminated, and the clustering that resulted in fewer than 40 clusters was chosen for evaluation.
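The selection rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary layout (`n` mapped to a list of clusters) and the choice to take the smallest qualifying `n` when several runs qualify are assumptions, not details given in the slides.

```python
def filter_clusters(clusters, min_size=4, max_size=50):
    """Drop clusters with fewer than min_size or more than max_size elements."""
    return [c for c in clusters if min_size <= len(c) <= max_size]


def choose_clustering(clusterings, max_clusters=40):
    """Pick a run whose filtered result has fewer than max_clusters clusters.

    clusterings maps n (requested number of clusters) to a list of clusters.
    Assumption: when several runs qualify, the smallest n wins.
    Returns (n, kept_clusters), or None if no run qualifies.
    """
    for n in sorted(clusterings):
        kept = filter_clusters(clusterings[n])
        if len(kept) < max_clusters:
            return n, kept
    return None
```

Calling `choose_clustering({50: run50, 100: run100, 150: run150, 200: run200})` would return the chosen `n` together with the clusters that survive the size filter.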
Qualitative Evaluation • 2 Criteria: Utility and Coherence • 3 point scale: 1 good, 2 poor, 3 bad • Good: >60% of articles • Poor: 59-41% • Bad: <40% • Evaluate terms in cluster to get context.
Other Approaches Statistical Methods
Other Clustering Approaches • Can we choose other types of clustering algorithms which could provide better quality results or provide better cluster labels? • SOM (Self Organizing Map) • Slow for high numbers of dimensions and large numbers of objects. • Carrot2 • Slow for large numbers of items. • Huge memory consumption.
Random Projection • Can we reduce the dimensionality of vectors (e.g. 50,000 → 1,000) while preserving distances? • Speed up similarity calculations • Various methods: • Random projection. • “Latent semantic indexing”. • Multidimensional scaling
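Very Sparse Random Projections • Let $A \in \mathbb{R}^{n \times D}$ be our n points in D dimensions • Multiply A by a random matrix $R \in \mathbb{R}^{D \times k}$ • R has entries in {−1, 0, 1}, drawn at random with fixed probabilities • Cost: $O(nDk + n^2 k)$ ($nDk$ for the projection, $n^2 k$ for all pairwise distances)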
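A minimal pure-Python sketch of such a projection follows. The slide does not state the entry probabilities, so the code assumes the scheme from the very sparse random projections literature: entries are √s·{+1, 0, −1} with probabilities 1/(2s), 1 − 1/s and 1/(2s), and s = √D by default. The function names and the 1/√k scaling are likewise assumptions made for illustration.

```python
import math
import random


def sparse_projection_matrix(D, k, s=None, seed=0):
    """Build a D x k matrix with entries sqrt(s) * {+1, 0, -1}.

    Assumed probabilities: 1/(2s), 1 - 1/s, 1/(2s); s defaults to sqrt(D).
    """
    rng = random.Random(seed)
    s = math.sqrt(D) if s is None else s
    scale = math.sqrt(s)
    R = []
    for _ in range(D):
        row = []
        for _ in range(k):
            u = rng.random()
            if u < 1 / (2 * s):
                row.append(scale)
            elif u < 1 / s:
                row.append(-scale)
            else:
                row.append(0.0)
        R.append(row)
    return R


def project(A, R):
    """Return (1 / sqrt(k)) * A R for a list of n row vectors A."""
    k = len(R[0])
    norm = 1 / math.sqrt(k)
    projected = []
    for x in A:
        y = [0.0] * k
        for i, xi in enumerate(x):
            if xi:  # skip zero coordinates of sparse term vectors
                row = R[i]
                for j in range(k):
                    if row[j]:
                        y[j] += xi * row[j]
        projected.append([norm * v for v in y])
    return projected
```

Because most entries of R are zero, most multiplications are skipped, which is where the speed-up over a dense random matrix comes from.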
Reducing Dimensionality • Bank dataset: 11,000 articles from 11 categories in DMOZ. • 11,000 articles reduced from 30K terms, with a 1 GB heap, in 11 s. • Increase in purity and decrease in entropy (measures of clustering quality).
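For reference, purity and entropy can be computed as below. This is a sketch of one common definition, not necessarily the exact variant used in these experiments: each cluster is given as a non-empty list of the gold category labels of its members, and entropy is the size-weighted average of per-cluster entropies in bits (some packages additionally normalize by the log of the number of categories).

```python
import math
from collections import Counter


def purity(clusters):
    """Fraction of items that belong to the majority category of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / total


def entropy(clusters):
    """Size-weighted average of per-cluster category entropy, in bits."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(c)
        h = -sum((m / len(c)) * math.log2(m / len(c)) for m in counts.values())
        result += len(c) / total * h
    return result
```

Higher purity and lower entropy both indicate clusters that line up more closely with the known categories.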
MI on Phrases • More context than single words • More meaningful term clusters
Other approaches Knowledge Intensive Approaches
Hypernym • “Is-a” relationship • Shakespeare is an author. • Pug is a dog. • Implicitly hierarchical. • Basis of many ontologies and semantic networks. • WordNet • UMLS
Hypernym Relations • NP such as {NP,}* {(or|and)} NP • Vegetables such as beets, carrots and peas. • such NP as {NP,}* {(or|and)} NP • … works by such authors as Herrick, Goldsmith and Shakespeare. • NP {, NP}* {,} (or|and) other NP • Bruises, …, broken bones or other injuries • NP {,} including {NP,}* {or|and} NP • All common-law countries, including Canada and England … • NP {,} especially {NP,}* {or|and} NP • … most European countries, especially France, England and Spain.
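The first pattern can be approximated with a regular expression. The sketch below is deliberately crude and rests on stated assumptions: it has no noun-phrase chunker, so it treats the single word before "such as" as the hypernym and every comma-, "and"- or "or"-separated item up to the end of the clause as a hyponym. The names `SUCH_AS`, `SPLIT` and `such_as_pairs` are illustrative only.

```python
import re

# Hypernym: the single word before "such as" (stand-in for a real NP chunk).
SUCH_AS = re.compile(r"(\w[\w-]*)\s+such as\s+([^.;]+)")
# Separators between hyponyms: ", and ", ", or ", ",", " and ", " or ".
SPLIT = re.compile(r"\s*,\s*(?:and|or)\s+|\s*,\s*|\s+(?:and|or)\s+")


def such_as_pairs(text):
    """Return (hypernym, hyponym) pairs found by the 'NP such as ...' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1).lower()
        for item in SPLIT.split(match.group(2).strip()):
            if item:
                pairs.append((hypernym, item.strip().lower()))
    return pairs


print(such_as_pairs("Vegetables such as Beets, Carrots and Peas."))
# [('vegetables', 'beets'), ('vegetables', 'carrots'), ('vegetables', 'peas')]
```

A real extractor would run a part-of-speech tagger or chunker first, so that multi-word noun phrases such as "most European countries" are captured whole.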
Uses of Hypernym Trees • Search • Query Expansion • Faceted metadata • Clustering • Parent node defines a cluster • Keyword assignment
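As a rough illustration of "parent node defines a cluster", the sketch below groups documents under the hypernym of any term they contain. It assumes a flat list of (hypernym, hyponym) pairs rather than a full tree, one parent per hyponym, and a hypothetical `docs` mapping from document id to its terms; none of these details come from the slides.

```python
from collections import defaultdict


def cluster_by_hypernym(pairs, docs):
    """Group document ids under the hypernym (parent) of the terms they contain.

    pairs: iterable of (hypernym, hyponym) tuples.
    docs:  dict mapping a document id to an iterable of its terms.
    """
    parent = {hyponym: hypernym for hypernym, hyponym in pairs}
    clusters = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            if term in parent:
                clusters[parent[term]].add(doc_id)
    return dict(clusters)
```

With pairs such as `("organs", "liver")` and `("organs", "kidney")`, documents mentioning either term land in the same "organs" cluster, and the hypernym doubles as the cluster label.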
Trivial Hypernyms • organic compounds d-ribose • organic compounds d-arabinose • organic compounds l-arabinose • organic compounds sucrose • substances cortisone • substances vitamins a and c • substances zinc • organs liver • organs kidney • sugar-containing products honey • sugar-containing products jam • sugar-containing products glucose • sugar-containing products fruit juice concentrates • sugar-containing products tomato • largely populated countries china • largely populated countries russia
Bad Hypernyms • suicidal patients appears • other agents plasmin • other agents plasminogen • such common sensations illness • phenomena founder effects • phenomena migration • phenomena gene flow • clinical manifestations 80 • chemical agents homocystine • no other explanation anencephaly • conditions azure a-0.5 % nahco3 solution • conditions ph 8.1 • fewer side-effects vegetative disfunction • techniques carpentier • techniques 's ring
Good? Hypernyms • entirely synthetic steroids norgestrel and quingestanol • menstrual disorders metrorrhagia • menstrual disorders oligoamenorrhea • menstrual disorders amenorrhea • mild venous disorders swollen veins • mild venous disorders heavy limbs • mild venous disorders varicosities • obstructive pulmonary lung diseases alveolar proteinosis • obstructive pulmonary lung diseases pneumonia • obstructive pulmonary lung diseases asthma • obstructive pulmonary lung diseases bronchiectasis • obstructive pulmonary lung diseases cystic fibrosis • choline analogues n,n'-dimethylethanolamine • choline analogues n-monomethylethanolamine • choline analogues ethanolamine • 3alpha-oh-containing steroids androsterone