Measuring Distance Input for Multidimensional Scaling and Clustering
Distances and Similarities • Both are ways of measuring how similar two objects are • Distances increase as objects are less similar. The distance of an object to itself is 0 • Similarities increase as objects are more similar. The similarity of an object to itself is the maximum value for the similarity measure
Distance Examples • Mileage between two towns measured as straight-line (Euclidean) distance (“as the crow flies”), as driving distance, or as great-circle (spherical) distance • Instead of geographic locations, we can treat measurements such as the length, width, and thickness of an artifact as defining its position
Similarity Examples • The number of characteristics two objects have in common (cultural traits, genes, presence/absence traits) • Similarity measures can be converted to distances by subtracting each similarity from the maximum possible similarity
Interval/Ratio Measures • Manhattan Distance (or City Block, 1-norm) • Euclidean Distance (2-norm) and Squared Euclidean Distance • Minkowski Distance (p-norm) • Chebyshev Distance (Maximum Distance, infinity norm)
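As a minimal sketch, the four norms can be compared on one pair of objects with dist() from base R. The two artifacts and their measurements are made-up values for illustration only.

```r
# Two hypothetical artifacts: length, width, thickness (made-up values)
x <- rbind(A = c(10, 4, 2),
           B = c( 7, 8, 2))

# Absolute differences between A and B are 3, 4, and 0
d_manhattan <- dist(x, method = "manhattan")          # 3 + 4 + 0
d_euclidean <- dist(x, method = "euclidean")          # sqrt(3^2 + 4^2 + 0^2)
d_minkowski <- dist(x, method = "minkowski", p = 3)   # (3^3 + 4^3 + 0^3)^(1/3)
d_chebyshev <- dist(x, method = "maximum")            # max(3, 4, 0)

# Squared Euclidean distance is just the square of the Euclidean distance
d_sq_euclid <- d_euclidean^2

round(c(manhattan = as.numeric(d_manhattan),
        euclidean = as.numeric(d_euclidean),
        minkowski3 = as.numeric(d_minkowski),
        chebyshev = as.numeric(d_chebyshev),
        sq_euclidean = as.numeric(d_sq_euclid)), 3)
```

As p grows, the Minkowski distance moves from the Manhattan value toward the Chebyshev value, which is why the four measures are usually presented as one family.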
Counts • Ecologists use counts of species between plots to analyze compositional changes in community structure • Bray-Curtis compares the number of specimens and number of overlapping species
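Definitions: Bray-Curtis Dissimilarity BC(j,k) = sum_i |x_ij - x_ik| / sum_i (x_ij + x_ik), where x_ij is the count of species i in sample j. Note: If samples j and k are percentages, each sample sums to 100, so the denominator becomes 200.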
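A minimal sketch of the calculation, assuming the standard Bray-Curtis definition: the sum of absolute differences in counts divided by the sum of all counts in both samples. The function name bray_curtis and the plot counts are hypothetical.

```r
# Hypothetical helper implementing the standard Bray-Curtis dissimilarity
bray_curtis <- function(j, k) sum(abs(j - k)) / sum(j + k)

# Made-up species counts for two plots (same four species in each)
plot_j <- c(10, 0, 5, 5)
plot_k <- c( 5, 5, 0, 10)

bc_counts <- bray_curtis(plot_j, plot_k)

# Converting each plot to percentages makes each sample sum to 100,
# so the denominator is 100 + 100 = 200
pct_j <- 100 * plot_j / sum(plot_j)
pct_k <- 100 * plot_k / sum(plot_k)
denominator <- sum(pct_j + pct_k)
bc_percent <- bray_curtis(pct_j, pct_k)

c(counts = bc_counts, percent = bc_percent, denominator = denominator)
```

The two plots here have equal totals, so the count-based and percentage-based values agree; with unequal totals they generally differ.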
Ordinal Measures • Few measures are designed specifically for rank data, but rank correlation coefficients (Spearman, Kendall) can be used
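A minimal sketch of the rank correlation approach using cor() from base R. The rankings are made up, and turning the correlation into a distance as 1 minus the coefficient is one common choice assumed here, not something the slides specify.

```r
# Made-up rankings of the same five items on two objects
rank_a <- c(1, 2, 3, 4, 5)
rank_b <- c(2, 1, 3, 5, 4)

rho <- cor(rank_a, rank_b, method = "spearman")
tau <- cor(rank_a, rank_b, method = "kendall")

# Assumed conversion: identical rankings give 0, reversed rankings give 2
d_rho <- 1 - rho
d_tau <- 1 - tau

c(spearman = rho, kendall = tau, d_spearman = d_rho, d_kendall = d_tau)
```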
Dichotomies • Can use interval/ratio measures • Numerous options based on 2x2 table • Many similarity measures based on weighting of presence/presence and absence/absence • Subtract from 1 to create distances
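Definitions • In the 2x2 table, a = present in both objects, b and c = present in only one of the two, d = absent in both • Simple Matching Coefficient: (a+d)/(a+b+c+d) • Jaccard’s Coefficient (asymmetric binary): a/(a+b+c) • Phi and Yule’s Q are measures of association • ade4 and proxy have many different options for dichotomies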
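A minimal sketch of the two coefficients, counting the four cells of the 2x2 table directly from two presence/absence vectors. The trait vectors and the n_a to n_d names are hypothetical.

```r
# Made-up presence (1) / absence (0) of six traits on two objects
x <- c(1, 1, 0, 0, 1, 0)
y <- c(1, 0, 0, 1, 1, 0)

# Cells of the 2x2 table
n_a <- sum(x == 1 & y == 1)  # present in both
n_b <- sum(x == 1 & y == 0)  # present in x only
n_c <- sum(x == 0 & y == 1)  # present in y only
n_d <- sum(x == 0 & y == 0)  # absent in both

simple_matching <- (n_a + n_d) / (n_a + n_b + n_c + n_d)
jaccard         <- n_a / (n_a + n_b + n_c)

# Subtract from 1 to turn each similarity into a distance
d_simple  <- 1 - simple_matching
d_jaccard <- 1 - jaccard

# dist(method = "binary") in base R gives the Jaccard distance
d_binary <- as.numeric(dist(rbind(x, y), method = "binary"))

c(simple_matching = simple_matching, jaccard = jaccard,
  d_jaccard = d_jaccard, d_binary = d_binary)
```

Jaccard ignores the absent/absent cell, so two objects that merely lack the same traits are not counted as similar.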
Nominal Variables • Similarity can be measured with chi-square-based measures • Convert to multiple dichotomies • E.g. Temper: Sand, Silt, Gravel becomes three variables: TSand, TSilt, TGravel • Then use measures for dichotomies/metric variables
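A minimal sketch of the conversion to dichotomies using model.matrix() from base R. The four temper values are made up; the column names follow the TSand, TSilt, TGravel pattern from the example.

```r
# Made-up temper classifications for four sherds
temper <- factor(c("Sand", "Silt", "Gravel", "Sand"))

# One 0/1 column per category (no intercept, so no category is dropped)
dummies <- model.matrix(~ temper - 1)
colnames(dummies) <- paste0("T", levels(temper))  # TGravel, TSand, TSilt
dummies

# The dichotomies can now go into a measure for binary data,
# here the Jaccard distance from dist()
d_temper <- as.matrix(dist(dummies, method = "binary"))
d_temper
```

With a single nominal variable the result is simply 0 for sherds in the same category and 1 otherwise; the conversion matters once these columns are combined with other variables.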
Multiple Types • Gower’s Index is the only measure here that computes a similarity index using variables with different levels of measurement. Take the mean of the per-variable scores: • Presence/Absence – Jaccard • Categorical – 1 if the same, 0 if not • Interval/Ratio/Ranks – 1 minus the absolute difference divided by the range
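A minimal hand-rolled sketch of Gower's similarity for one presence/absence variable, one categorical variable, and one measurement. The data frame sherds and the function gower_pair are hypothetical. The measurement is scored as 1 minus the absolute difference over the range so that all three scores are similarities, and, following the Jaccard rule, a presence/absence variable that is absent in both objects is left out of the mean.

```r
# Made-up data: Slip is presence/absence, Temper is categorical,
# Thickness is a ratio-scale measurement
sherds <- data.frame(
  Slip      = c(1, 1, 0),
  Temper    = c("Sand", "Silt", "Sand"),
  Thickness = c(4, 6, 8)
)
thick_range <- diff(range(sherds$Thickness))

# Hypothetical helper: Gower similarity between rows i and j
gower_pair <- function(i, j) {
  both_absent <- sherds$Slip[i] == 0 && sherds$Slip[j] == 0
  scores <- c(
    Slip      = as.numeric(sherds$Slip[i] == 1 && sherds$Slip[j] == 1),
    Temper    = as.numeric(sherds$Temper[i] == sherds$Temper[j]),
    Thickness = 1 - abs(sherds$Thickness[i] - sherds$Thickness[j]) / thick_range
  )
  # Jaccard-style weighting: skip Slip when it is absent in both rows
  weights <- c(Slip = if (both_absent) 0 else 1, Temper = 1, Thickness = 1)
  sum(weights * scores) / sum(weights)
}

s12 <- gower_pair(1, 2)  # scores 1, 0, 0.5
s13 <- gower_pair(1, 3)  # scores 0, 1, 0
c(s12 = s12, s13 = s13)
```

For real analyses, daisy() in the cluster package offers metric = "gower", which returns dissimilarities rather than similarities.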
Issues • Weighting – how to weight variables with different variances – standardization, weighting • Correlations between variables – how (and whether) to take correlations into account (Mahalanobis Distances)
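A minimal sketch of both issues using base R. The measurements are made up, chosen so that Weight has a much larger variance than Length and the two are correlated.

```r
# Made-up measurements for five points
pts <- data.frame(Length = c(40, 45, 50, 55, 62),
                  Weight = c(300, 420, 380, 650, 700))

# Raw Euclidean distances are dominated by Weight, the high-variance variable
raw_d <- dist(pts)

# Standardizing to z-scores gives each variable a standard deviation of 1,
# so both contribute on the same scale
z     <- scale(pts)
std_d <- dist(z)

# Mahalanobis distance also accounts for the Length-Weight correlation.
# mahalanobis() returns SQUARED distances, here from each point to the centroid
m2 <- mahalanobis(pts, center = colMeans(pts), cov = cov(pts))

round(as.matrix(raw_d), 1)
round(as.matrix(std_d), 2)
round(m2, 3)
```

Standardization only equalizes the variances; it does nothing about correlated variables counting the same information twice, which is the case Mahalanobis distance is meant to handle.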
Distance Matrix • For simple analyses, dist() in base R provides euclidean, maximum, manhattan, canberra, binary (Jaccard), and minkowski • Many other measures are available in other packages: see ade4, amap, cluster, ecodist, labdsv, proxy, and vegan
# Load Darl
# Rcmdr to create scatterplot matrix
> Euclid <- dist(Darl[,2:5])
> Euclid
          35-3043   35-2871   35-2866   36-3619   36-3520
35-2871 11.437657
35-2866  5.380520  6.542935
36-3619 14.621217  3.682391  9.570266
36-3520 15.309148  4.068169 10.163661  1.757840
36-3036  7.760155  4.442972  2.495997  7.195832  7.860662
# scatterplot() comes from the car package, which Rcmdr loads
> scatterplot(Width~Length, reg.line=lm, smooth=FALSE, spread=FALSE, pch=16, id.n=6, boxplots=FALSE, ellipse=TRUE, grid=FALSE, data=Darl)
# mahalanobis() returns squared distances from the centroid;
# colMeans() is needed because mean() no longer returns column means for a data frame
> mahalanobis(Darl[,2:3], colMeans(Darl[,2:3]), cov=cov(Darl[,2:3]))
  35-3043   35-2871   35-2866   36-3619   36-3520   36-3036
2.2577596 1.8173684 0.4641912 2.9652763 1.7527347 0.7426699
> install.packages("ecodist")
> library(ecodist)
> Mahal <- distance(Darl[,2:3], method="mahalanobis")
> Mahal
          35-3043   35-2871   35-2866   36-3619   36-3520
35-2871 4.9367446
35-2866 0.6900956 2.8905096
36-3619 8.5903617 7.5849187 4.7250487
36-3520 6.8826044 0.6084649 3.6631704 4.9720621
36-3036 2.4467510 4.8835727 0.8163226 1.9192663 4.3901066
# Rcmdr
> .PC <- princomp(~Length+Weight, cor=TRUE, data=Darl)
> Darl$PC1 <- .PC$scores[,1]
> Darl$PC2 <- .PC$scores[,2]
# Typed commands
> PCDist <- dist(Darl[,6:7])
> PCDist
          35-3043   35-2871   35-2866   36-3619   36-3520
35-2871 2.5498737
35-2866 2.1968323 1.1918768
36-3619 3.7858013 1.2539806 1.9883494
36-3520 4.2220041 1.8034110 2.1957351 0.7029308
36-3036 2.6677120 0.9201698 0.5717135 1.4339465 1.6290415
> scatterplot(PC2~PC1, reg.line=FALSE, smooth=FALSE, spread=FALSE, grid=FALSE, boxplots=FALSE, pch=16, ellipse=TRUE, id.n=6, span=0.5, data=Darl)
[1] "35-3043" "35-2866" "36-3619" "36-3520" "35-2871" "36-3036"