Distance Metric
Measures the dissimilarity between two data points. A metric is a function, d, of two points X and Y, such that:
  d(X, Y) is positive definite: if $X \neq Y$, $d(X, Y) > 0$; if $X = Y$, $d(X, Y) = 0$
  d(X, Y) is symmetric: $d(X, Y) = d(Y, X)$
  d(X, Y) satisfies the triangle inequality: $d(X, Y) + d(Y, Z) \geq d(X, Z)$
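Standard Distance Metrics
Minkowski distance or $L_p$ distance: $d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
Manhattan distance ($p = 1$): $d_1(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
Euclidean distance ($p = 2$): $d_2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Max distance ($p = \infty$): $d_\infty(X, Y) = \max_{i=1}^{n} |x_i - y_i|$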
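An Example
A two-dimensional space with X = (2, 1) and Y = (6, 4); Z is the right-angle corner of the triangle XZY, with XZ = 4 and ZY = 3.
Manhattan: $d_1(X, Y) = XZ + ZY = 4 + 3 = 7$
Euclidean: $d_2(X, Y) = XY = 5$
Max: $d_\infty(X, Y) = \max(XZ, ZY) = XZ = 4$
So $d_1 \geq d_2 \geq d_\infty$. For any positive integer $p$, $d_p(X, Y) \geq d_{p+1}(X, Y)$.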
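The example can be checked numerically. The following is a minimal sketch, assuming unit weights and treating `p = math.inf` as the max distance; the function name `minkowski` is chosen here for illustration and does not come from the slides.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equal-length points (unit weights assumed)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # p = infinity is taken to mean the max distance
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

X, Y = (2, 1), (6, 4)
d1 = minkowski(X, Y, 1)            # Manhattan
d2 = minkowski(X, Y, 2)            # Euclidean
d_inf = minkowski(X, Y, math.inf)  # Max
print(d1, d2, d_inf)
```

The three values reproduce the slide's 7, 5, and 4, and they appear in decreasing order as the inequality states.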
HOBbit Similarity
These notes contain NDSU confidential & Proprietary material. Patents pending on bSQ, Ptree technology.
Higher Order Bit (HOBbit) similarity: $\mathrm{HOBbitS}(A, B) = \max \{ s : a_i = b_i \text{ for all } 1 \leq i \leq s \}$
  A, B: two scalars (integers)
  $a_i$, $b_i$: $i$th bit of A and B (left to right)
  m: number of bits
Example:
  Bit position:  1 2 3 4 5 6 7 8
  x1:            0 1 1 0 1 0 0 1
  y1:            0 1 1 1 1 1 0 1
  HOBbitS(x1, y1) = 3

  Bit position:  1 2 3 4 5 6 7 8
  x2:            0 1 0 1 1 1 0 1
  y2:            0 1 0 1 0 0 0 0
  HOBbitS(x2, y2) = 4
HOBbit Distance (High Order Bifurcation bit)
HOBbit distance between two scalar values A and B: $d_v(A, B) = m - \mathrm{HOBbitS}(A, B)$
HOBbit distance for X and Y: $d_h(X, Y) = \max_{i=1}^{n} d_v(x_i, y_i)$
Example:
  Bit position:  1 2 3 4 5 6 7 8
  x1:            0 1 1 0 1 0 0 1
  y1:            0 1 1 1 1 1 0 1
  HOBbitS(x1, y1) = 3, so $d_v(x_1, y_1) = 8 - 3 = 5$

  Bit position:  1 2 3 4 5 6 7 8
  x2:            0 1 0 1 1 1 0 1
  y2:            0 1 0 1 0 0 0 0
  HOBbitS(x2, y2) = 4, so $d_v(x_2, y_2) = 8 - 4 = 4$
In our example (considering 2-dimensional data): $d_h(X, Y) = \max(5, 4) = 5$
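The HOBbit similarity and distance in this example can be sketched in a few lines. This is an illustrative sketch, not the Ptree-based implementation the notes refer to; it assumes non-negative integers that fit in `m` bits, and the function names are chosen here. The number of matching leading bits is obtained as `m` minus the bit length of the XOR of the two values.

```python
def hobbit_similarity(a, b, m=8):
    """Number of leading bits (out of m) on which a and b agree."""
    diff = (a ^ b) & ((1 << m) - 1)   # 1-bits mark positions where a and b differ
    return m - diff.bit_length()      # bits above the highest differing bit match

def hobbit_distance_scalar(a, b, m=8):
    """d_v(A, B) = m - HOBbitS(A, B)."""
    return m - hobbit_similarity(a, b, m)

def hobbit_distance(x, y, m=8):
    """d_h(X, Y): the largest per-coordinate HOBbit distance."""
    return max(hobbit_distance_scalar(a, b, m) for a, b in zip(x, y))

# Bit patterns from the example above
x1, y1 = 0b01101001, 0b01111101
x2, y2 = 0b01011101, 0b01010000

s1 = hobbit_similarity(x1, y1)
s2 = hobbit_similarity(x2, y2)
dh = hobbit_distance((x1, x2), (y1, y2))
print(s1, s2, dh)
```

With the bit patterns above this gives similarities 3 and 4 and a 2-dimensional distance of 5, matching the slide.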
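HOBbit Distance Is a Metric
HOBbit distance is positive definite: if $X = Y$, $d_h(X, Y) = 0$; if $X \neq Y$, $d_h(X, Y) > 0$
HOBbit distance is symmetric: $d_h(X, Y) = d_h(Y, X)$
HOBbit distance satisfies the triangle inequality: $d_h(X, Y) + d_h(Y, Z) \geq d_h(X, Z)$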
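The three properties can be spot-checked by brute force on a small domain. The sketch below is a sanity check rather than a proof; it assumes 2-dimensional points with 3-bit coordinates (`M_BITS` is a value picked here to keep the search small) and uses the fact that, for values fitting in m bits, m minus the HOBbit similarity equals the bit length of the XOR.

```python
from itertools import product

M_BITS = 3  # assumed small bit width so the exhaustive check stays fast

def hobbit_distance(x, y):
    # For values that fit in m bits, m - HOBbitS(a, b) equals the bit length
    # of a XOR b, so the per-coordinate distance needs no explicit m.
    return max((a ^ b).bit_length() for a, b in zip(x, y))

# All 64 two-dimensional points with 3-bit coordinates
points = list(product(range(1 << M_BITS), repeat=2))
d = {(p, q): hobbit_distance(p, q) for p in points for q in points}

positive_definite = all((d[p, q] == 0) == (p == q) for p in points for q in points)
symmetric = all(d[p, q] == d[q, p] for p in points for q in points)
triangle = all(d[p, q] + d[q, r] >= d[p, r]
               for p in points for q in points for r in points)
print(positive_definite, symmetric, triangle)
```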
Neighborhood of a Point
The neighborhood of a target point, T, is a set of points, S, such that $X \in S$ if and only if $d(T, X) \leq r$. If X is a point on the boundary, $d(T, X) = r$.
[Figure: neighborhoods of T, each of width 2r, under the Manhattan, Euclidean, Max, and HOBbit distances, each with a boundary point X.]
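The membership test can be written directly from the definition. The following sketch uses a target point, radius, and test point chosen here for illustration; it shows that the same point can fall inside the neighborhood under one distance and outside it under another.

```python
import math

def manhattan(t, x):
    return sum(abs(a - b) for a, b in zip(t, x))

def euclidean(t, x):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, x)))

def max_dist(t, x):
    return max(abs(a - b) for a, b in zip(t, x))

def in_neighborhood(t, x, r, dist):
    """X is in the neighborhood of T exactly when d(T, X) <= r."""
    return dist(t, x) <= r

T, r = (0.0, 0.0), 1.0   # illustrative target point and radius
X = (0.6, 0.6)           # illustrative test point

membership = {
    "Manhattan": in_neighborhood(T, X, r, manhattan),
    "Euclidean": in_neighborhood(T, X, r, euclidean),
    "Max": in_neighborhood(T, X, r, max_dist),
}
print(membership)
```

Here X lies outside the Manhattan neighborhood (distance 1.2) but inside the Euclidean and max neighborhoods, consistent with the nesting diamond, disk, square of the three shapes.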
Decision Boundary
The decision boundary between points A and B is the locus of the points X satisfying $d(A, X) = d(B, X)$.
[Figure: decision boundaries between A and B for the Manhattan, Euclidean, and max distances, in two panels labeled > 45° and < 45°; the boundary separates regions R1 and R2.]
The decision boundary for HOBbit distance is perpendicular to the axis that makes the max distance.
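Because the boundary depends on the distance function, a point can be assigned to A under one metric and to B under another. The sketch below illustrates this with points A, B, and X chosen here as an example; the helper `nearer` is hypothetical and simply compares the two distances.

```python
import math

DISTANCES = {
    "Manhattan": lambda p, q: sum(abs(a - b) for a, b in zip(p, q)),
    "Euclidean": lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))),
    "Max": lambda p, q: max(abs(a - b) for a, b in zip(p, q)),
}

def nearer(a, b, x, dist):
    """Return which side of the decision boundary between a and b contains x."""
    da, db = dist(a, x), dist(b, x)
    if da == db:
        return "boundary"
    return "A" if da < db else "B"

A, B = (0.0, 0.0), (4.0, 2.0)   # illustrative points
X = (2.2, 0.0)                  # illustrative query point

sides = {name: nearer(A, B, X, d) for name, d in DISTANCES.items()}
print(sides)
```

For this choice, the Manhattan and Euclidean distances place X on A's side, while the max distance places it on B's side.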
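Minkowski Metrics
$L_p$-metrics (aka Minkowski metrics): $d_p(X, Y) = \left( \sum_{i=1}^{n} w_i |x_i - y_i|^p \right)^{1/p}$ (weights $w_i$ assumed $= 1$)
[Figure: unit disk boundaries for p = 1 (Manhattan), p = 2 (Euclidean), p = 3, 4, ..., p = ∞ (chessboard), and p = 1/2, 1/3, 1/4, ...]
$d_{\max}(X, Y) \equiv \max_i |x_i - y_i|$ equals $d_\infty(X, Y) \equiv \lim_{p \to \infty} d_p(X, Y)$.
Proof (sort of): $\lim_{p \to \infty} \left\{ \sum_{i=1}^{n} a_i^p \right\}^{1/p} = \max(a_i) \equiv b$. For $p$ large enough, the other $a_i^p \ll b^p$, since $y = x^p$ is increasingly convex, so $\sum_{i=1}^{n} a_i^p \approx k \cdot b^p$ ($k$ = multiplicity of $b$ in the sum), so $\left\{ \sum_{i=1}^{n} a_i^p \right\}^{1/p} \approx k^{1/p} \cdot b$ and $k^{1/p} \to 1$.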
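The informal limit argument can be tightened with a two-sided bound. The following is a standard squeeze argument, offered as a supplement to the slide's sketch; it assumes $a_i \geq 0$ and $b = \max_i a_i$, as in the text.

```latex
% Assumes a_i >= 0 and b = max_i a_i. The largest term alone gives the lower
% bound; replacing every term by the largest gives the upper bound.
b^p \;\le\; \sum_{i=1}^{n} a_i^p \;\le\; n\, b^p
\quad\Longrightarrow\quad
b \;\le\; \Bigl(\sum_{i=1}^{n} a_i^p\Bigr)^{1/p} \;\le\; n^{1/p}\, b
\;\xrightarrow[\;p \to \infty\;]{}\; b .
```

Since $n^{1/p} \to 1$, both bounds converge to $b$, so the $L_p$ distance converges to the max distance.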
P > 1 $L_p$ metrics

  q      x1    y1   x2    y2   Lq distance x to y
  2      .5    0    .5    0    .7071067812
  4      .5    0    .5    0    .5946035575
  9      .5    0    .5    0    .5400298694
  100    .5    0    .5    0    .503477775
  MAX    .5    0    .5    0    .5

  q      x1    y1   x2    y2   Lq distance x to y
  2      .71   0    .71   0    1.0
  3      .71   0    .71   0    .8908987181
  7      .71   0    .71   0    .7807091822
  100    .71   0    .71   0    .7120250978
  MAX    .71   0    .71   0    .7071067812

  q      x1    y1   x2    y2   Lq distance x to y
  2      .99   0    .99   0    1.4000714267
  8      .99   0    .99   0    1.0796026553
  100    .99   0    .99   0    .9968859946
  1000   .99   0    .99   0    .9906864536
  MAX    .99   0    .99   0    .99

  q      x1    y1   x2    y2   Lq distance x to y
  2      1     0    1     0    1.4142135624
  9      1     0    1     0    1.0800597389
  100    1     0    1     0    1.0069555501
  1000   1     0    1     0    1.0006933875
  MAX    1     0    1     0    1

  q      x1    y1   x2    y2   Lq distance x to y
  2      .9    0    .1    0    .9055385138
  9      .9    0    .1    0    .9000000003
  100    .9    0    .1    0    .9
  1000   .9    0    .1    0    .9
  MAX    .9    0    .1    0    .9

  q      x1    y1   x2    y2   Lq distance x to y
  2      3     0    3     0    4.2426406871
  3      3     0    3     0    3.7797631497
  8      3     0    3     0    3.271523198
  100    3     0    3     0    3.0208666502
  MAX    3     0    3     0    3

  q      x1    y1   x2    y2   Lq distance x to y
  6      90    0    45    0    90.232863532
  9      90    0    45    0    90.019514317
  100    90    0    45    0    90
  MAX    90    0    45    0    90
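The tabulated values can be reproduced with a direct evaluation of the formula. The sketch below recomputes the rows for x1 = x2 = .5 and y1 = y2 = 0, assuming unit weights; the function name `lq_distance` is chosen here for illustration.

```python
def lq_distance(x, y, q):
    """L_q distance between two points, with unit weights."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x, y = (0.5, 0.5), (0.0, 0.0)
rows = {q: lq_distance(x, y, q) for q in (2, 4, 9, 100)}
d_max = max(abs(a - b) for a, b in zip(x, y))

for q, dist in rows.items():
    print(q, round(dist, 10))
print("MAX", d_max)
```

The distances shrink toward the MAX row as q grows, which is the behavior the tables are meant to show.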
P < 1 $L_p$ metrics

  q      x1    y1   x2    y2   Lq distance x to y
  1      .1    0    .1    0    .2
  .8     .1    0    .1    0    .238
  .4     .1    0    .1    0    .566
  .2     .1    0    .1    0    3.2
  .1     .1    0    .1    0    102
  .04    .1    0    .1    0    3355443
  .02    .1    0    .1    0    112589990684263
  .01    .1    0    .1    0    1.2676 E+29
  2      .1    0    .1    0    .141421356

  q      x1    y1   x2    y2   Lq distance x to y
  1      .5    0    .5    0    1
  .8     .5    0    .5    0    1.19
  .4     .5    0    .5    0    2.83
  .2     .5    0    .5    0    16
  .1     .5    0    .5    0    512
  .04    .5    0    .5    0    16777216
  .02    .5    0    .5    0    5.63 E+14
  .01    .5    0    .5    0    6.34 E+29
  2      .5    0    .5    0    .7071

  q      x1    y1   x2    y2   Lq distance x to y
  1      .9    0    0.1   0    1
  .8     .9    0    0.1   0    1.098
  .4     .9    0    0.1   0    2.1445
  .2     .9    0    0.1   0    10.82
  .1     .9    0    0.1   0    326.27
  .04    .9    0    0.1   0    10312196.962
  .02    .9    0    0.1   0    341871052443154
  .01    .9    0    0.1   0    3.8 E+29
  2      .9    0    0.1   0    .906

For exponents less than 1, written as $1/p$: $d_{1/p}(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{1/p} \right)^{p}$
For $p = 0$ (limit as $p \to 0$), $L_p$ doesn't exist (does not converge).
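A short numerical check shows how quickly these values grow and why exponents below 1 behave differently from the p ≥ 1 case. The sketch below recomputes one table entry and then tests the triangle inequality at q = 0.5 on three points chosen here for illustration; that $L_q$ with q < 1 violates the triangle inequality is a standard fact and is not stated on the slide.

```python
def lq_distance(x, y, q):
    """L_q formula applied with any positive exponent q, including q < 1."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# One entry from the tables above: q = .2 with x1 = x2 = .1 and y1 = y2 = 0
d_02 = lq_distance((0.1, 0.1), (0.0, 0.0), 0.2)

# Triangle inequality test at q = 0.5 (illustrative points)
X, Y, Z = (0, 0), (1, 0), (1, 1)
q = 0.5
two_legs = lq_distance(X, Y, q) + lq_distance(Y, Z, q)
direct = lq_distance(X, Z, q)
print(d_02, two_legs, direct)
```

Going through Y costs 1 + 1 = 2, while the direct distance from X to Z is (1 + 1)^2 = 4, so the triangle inequality fails for this exponent.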
Min dissimilarity function
The $d_{\min}$ function, $d_{\min}(X, Y) = \min_{i=1}^{n} |x_i - y_i|$, is strange. It is not even a pseudo-metric.
[Figure: the unit disk of $d_{\min}$.]
And the neighborhood of the blue point relative to the red point (the set of points closer to the blue than the red) is strangely shaped!
http://www.cs.ndsu.nodak.edu/~serazi/research/Distance.html
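One way to see why the min function fails to be a pseudo-metric is to look for a triangle inequality violation. The sketch below uses three points chosen here for illustration: consecutive points share a coordinate, so each leg has distance 0, yet the endpoints are far apart.

```python
def d_min(x, y):
    """Smallest coordinate-wise absolute difference."""
    return min(abs(a - b) for a, b in zip(x, y))

X, Y, Z = (0, 0), (0, 5), (5, 5)   # illustrative points

leg_xy = d_min(X, Y)   # X and Y share their first coordinate
leg_yz = d_min(Y, Z)   # Y and Z share their second coordinate
direct = d_min(X, Z)   # X and Z differ by 5 in both coordinates
print(leg_xy, leg_yz, direct)
```

A pseudo-metric may assign distance 0 to distinct points, but it must still satisfy the triangle inequality; here 0 + 0 < 5, so it does not.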
Other Interesting Metrics
Canberra metric: $d_c(X, Y) = \sum_{i=1}^{n} |x_i - y_i| / (x_i + y_i)$, a normalized Manhattan distance
Squared chord metric: $d_{sc}(X, Y) = \sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2$. Already discussed as $L_p$ with $p = 1/2$.
Squared chi-squared metric: $d_{chi}(X, Y) = \sum_{i=1}^{n} (x_i - y_i)^2 / (x_i + y_i)$
Scalar product metric: $d_{sp}(X, Y) = X \cdot Y = \sum_{i=1}^{n} x_i y_i$
Hyperbolic metrics (which map infinite space 1-1 onto a sphere)
Which are rotationally invariant? Translation invariant? Other?
Some notes on distance functions can be found at http://www.cs.ndsu.NoDak.edu/~datasurg/distance_similarity.pdf
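The formulas above translate directly into code. The following is a minimal sketch assuming non-negative coordinates; skipping terms whose denominator $x_i + y_i$ is zero is a convention adopted here, not something the slide specifies, and the squared chord version assumes the square-root form of the formula. The sample points are chosen for illustration.

```python
import math

def canberra(x, y):
    # Terms with x_i + y_i == 0 are skipped (an assumed convention)
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b != 0)

def squared_chord(x, y):
    # Requires non-negative coordinates because of the square roots
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    # Terms with x_i + y_i == 0 are skipped (an assumed convention)
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b != 0)

def scalar_product(x, y):
    return sum(a * b for a, b in zip(x, y))

X, Y = (1.0, 4.0), (3.0, 4.0)   # illustrative points
results = {
    "canberra": canberra(X, Y),
    "squared_chord": squared_chord(X, Y),
    "squared_chi_squared": squared_chi_squared(X, Y),
    "scalar_product": scalar_product(X, Y),
}
print(results)
```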