This research paper discusses the problem of point pattern matching and similarity search for point sets under translation. It introduces a novel algorithm that embeds point sets into a metric space, enabling efficient similarity search. The algorithm is designed to handle outliers in the point sets and can be computed in polynomial time.
Embedding and Similarity Search for Point Sets under Translation
Minkyoung Cho and David M. Mount
University of Maryland
SoCG 2008
Point Pattern Matching
• Given two point sets P and Q, find Q' ⊆ Q to minimize
• Dist(P, Q') = min_t dist(tP, Q'),
• where t is a geometric transformation (e.g., translation, rotation, …).
[Figure: example point sets P and Q.]
Point Pattern Similarity Search
• A collection of point sets S = {P1, P2, …, PN} has been preprocessed.
• Given a query set Q, find the (approximate) nearest Pi with respect to a distance function and a transformation group.
[Figure: a query set Q compared against the collection S = {P1, P2, …, PN}.]
Results
• EMD: Earth Mover's Distance
• SD: Symmetric Difference Distance
Problem Definition
• Point Pattern Similarity Searching:
 - Distance Measure: Symmetric Difference Distance
 - Error Model: Outliers (but no noise)
 - Transformation: Translation
 - Restriction: Coordinates are integers
• Symmetric difference example: P = {p1, p2, p3, p4} and Q = {p1, p2, p5, p6} differ in the four points {p3, p4, p5, p6}.
• Translation example: P = {0,12,14,23,35,54,59,64}, Q = {15,17,20,26,38,57,65,67}, t = 3.
 - Q − t = {12,14,17,23,35,54,62,64}
 - P = {0,12,14,23,35,54,59,64}
 - Common points: {12,14,23,35,54,64}
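The distance in this slide can be checked by brute force on its own example. The following is a minimal sketch, not the embedding algorithm of the talk; the function names are hypothetical. It only tries translations that align some point of P with some point of Q, because a translation that matches nothing cannot do better than |P| + |Q|.

```python
def sym_diff_at(P, Q, t):
    """Size of the symmetric difference between P translated by t and Q."""
    return len({p + t for p in P} ^ set(Q))


def sym_diff_under_translation(P, Q):
    """Minimum of |(P + t) symmetric-difference Q| over all integer translations t."""
    # Only translations that map some p onto some q can create a match.
    candidates = {q - p for p in P for q in Q}
    if not candidates:  # one of the sets is empty
        return len(set(P)) + len(set(Q))
    return min(sym_diff_at(P, Q, t) for t in candidates)


# The example from the slide.
P = [0, 12, 14, 23, 35, 54, 59, 64]
Q = [15, 17, 20, 26, 38, 57, 65, 67]
print(sym_diff_at(P, Q, 3))  # 4: two outliers in P and two in Q
```

The brute force takes time on the order of |P|·|Q|·(|P| + |Q|) for a single pair, which is why a collection of N sets calls for an embedding rather than pairwise comparison.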
Motivation: Sources of Complexity
• The difficulty comes from the combination of translation and outliers.
• Translation only:
 - Translate each point set so that its leftmost point lies at the origin.
 - Matching is then trivial.
• Outliers only:
 - Reduces to nearest-neighbor search in the Hamming cube (by hashing or random sampling).
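A small sketch of why the two cases interact badly, using a hypothetical `normalize` helper and made-up data: aligning the leftmost point to the origin makes translated copies identical, but a single outlier to the left of the true points shifts the whole normalized set.

```python
def normalize(points):
    """Translate a 1-D point set so that its leftmost point sits at the origin."""
    leftmost = min(points)
    return sorted(p - leftmost for p in points)


clean = [12, 14, 23, 35]
shifted = [p + 3 for p in clean]   # same pattern, translated by 3
with_outlier = shifted + [1]       # one outlier left of every true point

print(normalize(clean) == normalize(shifted))   # True: translation alone is easy
print(normalize(with_outlier))                  # anchored on the outlier instead
```

After normalization the outlier-contaminated set shares only the origin with the clean one, so the Hamming-cube reduction no longer sees the common pattern.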
Intuition
[Figure: an embedding f maps the query Q and each point set P1, P2, …, PN into a metric space.]
Embedding: Basic Definitions
• Given metric spaces (X, d) and (X', d'), a map f: X → X' is called an embedding.
• The contraction of f is the maximum factor by which distances are shrunk, i.e., $\max_{x \neq y} \frac{d(x, y)}{d'(f(x), f(y))}$.
• The expansion or stretch of f is the maximum factor by which distances are stretched: $\max_{x \neq y} \frac{d'(f(x), f(y))}{d(x, y)}$.
• The distortion of f is the product of the contraction and expansion.
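These definitions can be evaluated directly on a finite sample of points. The sketch below is illustrative only; the function name `distortion` and its arguments are assumptions, and it requires distinct points and an injective map so that no distance is zero.

```python
from itertools import combinations


def distortion(points, f, d, d_prime):
    """Return (contraction, expansion, distortion) of f over a finite point sample."""
    contraction = 0.0
    expansion = 0.0
    for x, y in combinations(points, 2):
        before = d(x, y)                # distance in the source space
        after = d_prime(f(x), f(y))     # distance in the target space
        contraction = max(contraction, before / after)
        expansion = max(expansion, after / before)
    return contraction, expansion, contraction * expansion


def line_dist(a, b):
    return abs(a - b)


# Uniform scaling by 2 stretches every distance by the same factor,
# so contraction 0.5 and expansion 2.0 cancel to distortion 1.0.
print(distortion([0, 1, 3], lambda x: 2 * x, line_dist, line_dist))
```

Distortion 1 therefore means "isometric up to scaling", which is why the product, rather than either factor alone, is the quantity of interest.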
Main Result: Preliminaries
• Main result: There exists a randomized embedding that maps point sets under symmetric difference with respect to translation into the metric space L1 with distortion $O(\log^2 n)$.
• Assumptions:
 - Each point set has at most n elements and lies in dimension d.
 - Coordinates are integers of magnitude polynomial in n.
• Distance function: symmetric difference with respect to translation
 - <PΔQ> = min_t |(P + t) Δ Q|
• Target metric: L1
Outline of Algorithm
• 1. Transform d-dimensional points into 1-dimensional points. (Distortion: 1)
• 2. Reduce the domain size using a linear hash function. (Distortion: O(1))
• 3. Make the representation invariant under translation. (Distortion: $O(\log^2 n)$)
• 4. Reduce the target domain size using a universal hash function. (Distortion: O(1))
[Figure: the example set {3,6,10,14,22} is hashed to a bit vector of length O(n log n) and then turned into a collection of bit patterns {101010, …, 010100, …, 11101}.]
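Translation Invariant
[Figure: the bit vector P = 1 0 0 1 0 0 1 0 0 0 1 of length s, drawn cyclically, is read with a probe of size ρ = 4 at each translation, giving … {1101, 0000, 0010, 1100, 0001, 1010}.]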
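One way to read this slide is sketched below. It is an illustration under assumptions, not the authors' code: the probe positions are fixed by hand rather than drawn at random, and the helper names are hypothetical. The point set is hashed with x mod s into a cyclic bit vector, the probe reads ρ bits at every one of the s cyclic translations, and the multiset of patterns is kept. Translating the input only rotates the bit vector, so the multiset does not change.

```python
from collections import Counter


def hash_to_bits(points, s):
    """Bit vector of length s with a 1 at every position x mod s."""
    bits = [0] * s
    for x in points:
        bits[x % s] = 1
    return bits


def translation_invariant(points, s, probes):
    """Multiset of bit patterns read at the probe positions over all s cyclic shifts."""
    bits = hash_to_bits(points, s)
    patterns = Counter()
    for t in range(s):
        # (hP + t)[q] is the bit that lands on position q after shifting by t.
        pattern = tuple(bits[(q - t) % s] for q in probes)
        patterns[pattern] += 1
    return patterns


P = [3, 6, 10, 14, 22]
probes = [0, 1, 3, 7]            # hand-picked probe of size 4, for illustration
inv_P = translation_invariant(P, 11, probes)
inv_shifted = translation_invariant([p + 5 for p in P], 11, probes)
print(inv_P == inv_shifted)      # True: the multiset ignores the translation
```

Comparing two such multisets (for example by the L1 distance between their count vectors) is then meaningful without first searching for the aligning translation.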
Intuition
[Figure: bit vectors hP and hQ of length s with the probe positions marked.]
• Φ2P = {10,01,00,10,01,00,10,00,00,01,00}
• Φ2Q = {10,00,01,00,11,00,10,01,00,11,00}
• If one of the probes hits a mismatched position, then the generated bit patterns may differ.
• Φ4P = {1101,0000,0010,1100,0000,0001,1000,0010,0101,0000,0010}
• Φ4Q = {1011,0100,0010,0101,1000,0011,1100,0010,0100,1001,0000}
• The probability that one of the probes hits a mismatched position increases as the probe size increases.
Relationship between ρ (probe size) and δ*
• δ: estimated distance
• δ*: original distance
[Plot: distance of invariants as $s/2^i$ increases; regions labeled Expectation, Unknown, and Upper bound; threshold > 2s − 2.]
Embedding
• δ: estimated distance
• δ*: original distance
[Plot: distance of invariants (vertical axis marked at .5 and 1) against probe sizes $2^0, 2^1, 2^2, \ldots, 2^L, \ldots, 2^H, \ldots, 2^{\log 2n} = 2n$.]
Build Time
• The expensive operations are building the invariant and hashing over a large domain.
• Building the invariant: (# of probes) × (# of translations)
 - Trivial: O(s) × s = O(n log n) × O(n log n) = $O(n^2 \log^2 n)$
• Universal hash function: (# of elements) × (matrix operation) = (# of elements) × (input size) × (output size)
 - Trivial: O(s) × O(s) × O(log s) = $O(s^2 \log s)$ = $O(n^2 \log^3 n)$
• We can improve this to $O(n \log^3 n)$ if we merge the two operations.
• Surprise!!!
Merge Two Operations
[Figure: the bit vector P of length s is combined with the rows r0, …, r_{log s} of the hash matrix H to produce the outputs y0, y1, y2, …, y_{s−1}.]
• Convolution can be computed in O(n log n) time, where n is the size of the array.
Main Result: Formal Statement
• Given failure probability β, there exists a randomized embedding from a point set P into a vector ΨP of dimension $O(n \log^2 n \log(1/\beta))$ such that for any P, Q
• This embedding can be computed in time $O(n \log^4 n \log(1/\beta))$.
Open Problems
• Q1. Can we improve the distortion bound (currently $O(\log^2 n)$)?
 - Cormode & Muthukrishnan show how to embed a string under edit distance with moves into L1 with O(log n log* n) distortion.
• Q2. Can we derandomize the algorithm?
 - Cormode & Muthukrishnan's algorithm is deterministic.
• Q3. Can we improve the space/time complexities?
Other Extensions
• Q1. Can we support a distance measure that is robust to noisy data (e.g., Hausdorff distance)?
• Q2. Can we handle other transformation groups?
 - integer scaling?
 - integer scaling + translation?
 - affine transformations over finite vector spaces?
Translation Invariant
• P = {3,6,10,14,22}
• h(x) = x mod s (e.g., s = 11)
• hP = 1 0 0 1 0 0 1 0 0 0 1 (bit vector of length s, read cyclically)
• With probe size ρ = 4, the bit patterns are … {1101, 0000, 0010, 1100, 0001, 1010}
• ΦρP = {13, 0, 2, 12, 1, …, 10}
• h'(x): (for simplicity, x mod 10)
[Figure: the resulting count vector ΦρP over the h' values 0 to 9.]
Trial 1: Geometric Hashing for Translation
• Naïve version:
 - Space complexity is $O(N n^2)$, since the frame size is 1.
 - With outliers in a query, the number of queries increases.
• Adaptive version:
 - To reduce space complexity, store only c transformed sets; the number of queries then increases.
 - Outliers may lead to false matches, which increases the probability of false positives.
Geometric Hashing with Outliers (delete)
• Depending on the number of outliers $r$ and the frame size $k$, more queries are needed to get a correct result.
• Method 1. Pr[choose a valid frame set] = (1 − r/n)^k
• Method 2. (r + 1) different trials (deterministic)
• Method 3. Pigeonhole principle: Pr[choose a valid frame set] = 1 − r/(n/k)
• [Grimson & Huttenlocher 90]: Outliers lead to false matches and increase the probability of false positives.
d-Dimension → 1-Dimension
• Let u be the maximum coordinate value of any point. Then we can map a d-dimensional point set to a 1-dimensional point set with coordinates of size at most $(3u)^d$, without changing the symmetric difference distance under translation.
[Figure: the 2-dimensional points (5,3) and (1,1) laid out row by row as one 1-dimensional bit vector; labels [6,15], [21,30], 1, 35.]
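A common way to realize such a map is a mixed-radix encoding with base 3u, sketched below under the assumption that all coordinates lie in [0, u]. The slide does not spell out the exact map, so `flatten_point` is a hypothetical stand-in. The encoding is linear, so a d-dimensional translation becomes a 1-dimensional one, and the padding factor 3 keeps coordinate differences from carrying into neighbouring digits.

```python
def flatten_point(point, u):
    """Encode a d-dimensional integer point as one integer in base 3u.

    Assumes every coordinate lies in [0, u]; the first coordinate is the
    least significant digit.
    """
    base = 3 * u
    value = 0
    for coord in reversed(point):
        value = value * base + coord
    return value


def flatten_set(points, u):
    """Apply flatten_point to every point of a set."""
    return sorted(flatten_point(p, u) for p in points)


u = 5
print(flatten_set([(5, 3), (1, 1)], u))   # [16, 50]
```

Because the map is linear, `flatten_point(p + t) - flatten_point(p)` is the same for every p, which is what lets matches under a d-dimensional translation survive as matches under a single integer shift.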
# of Primes & Collision Prob.
• Collision probability
 - h(x) = x mod s, where s is a prime number in Θ(n log n) chosen uniformly at random
 - For x ≠ y: Pr[h(x) = h(y)] = Pr[(x mod s) = (y mod s)] = Pr[(x − y) mod s = 0]
 - Since x, y ∈ $Z_{n^c}$, |x − y| < $n^c$, so x − y has fewer than c prime divisors in this range.
 - Pr[h(x) = h(y)] < c/(# of primes) = O(1/n)
• Prime Number Theorem
 - There are Θ(m / log m) prime numbers between 1 and m.
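The choice of modulus can be sketched as follows. This is an illustration only: the range [n log n, 2 n log n] is one assumed reading of "a prime in Θ(n log n)", and the helper names are hypothetical. Two distinct values collide exactly when the chosen prime divides their difference.

```python
import math
import random


def is_prime(m):
    """Trial division; adequate for moduli of size about n log n."""
    if m < 2:
        return False
    return all(m % k for k in range(2, math.isqrt(m) + 1))


def random_prime_modulus(n, rng):
    """Pick a prime uniformly from [n log n, 2 n log n] (assumed range)."""
    lo = max(2, math.ceil(n * math.log2(n)))
    candidates = [m for m in range(lo, 2 * lo + 1) if is_prime(m)]
    return rng.choice(candidates)


def collides(x, y, s):
    """True when x and y hash to the same bucket under h(x) = x mod s."""
    return x % s == y % s


rng = random.Random(1)
s = random_prime_modulus(8, rng)
print(s, collides(5, 5 + 3 * s, s), collides(5, 6, s))
```

Bertrand's postulate guarantees that the candidate list is never empty, and the Prime Number Theorem says it contains on the order of n primes, which is where the O(1/n) collision bound comes from.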
Distance Distortion by Hashing
• We can achieve O(1) distortion with a hash function whose collision probability is O(1/n).
• Note that collisions can only contract the distance.
Linear Hash Function (X)
• h(x) = x mod s, where s is a prime number in Θ(n log n)
• Linearity: h(x + t) = (h(x) + h(t)) mod s
• Under translation: ΦρP = Φρ(P + t)
[Figure: P = {3,6,10,14,22} hashed to the bit vector 1 0 0 1 0 0 1 0 0 0 1 of length s.]
Universal Hash Function for Large Domain
• Since the maximum probe size is O(n log n), the input domain of the hash function has size $2^{O(n \log n)}$. However, it contains only Θ(n log n) elements.
• H: $2^s \to 2^k$
• H(x) = R x + b (mod (2,2,…,2)), where R is a random k × s binary matrix and b is a random k-bit vector.
• Time complexity:
 - Computing one value: O(k s) = O((log n) · n log n) = $O(n \log^2 n)$
 - For all s (= O(n log n)) values, the time is $O(n^2 \log^3 n)$.
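The hash H(x) = R x + b over bits can be sketched in a few lines. This is a minimal illustration with hypothetical names; bit vectors are plain lists of 0/1 and the random generator is passed in explicitly so the example is reproducible.

```python
import random


def make_gf2_hash(s, k, rng):
    """Return H(x) = R x + b (mod 2) for a random k x s bit matrix R and k-bit b."""
    R = [[rng.randrange(2) for _ in range(s)] for _ in range(k)]
    b = [rng.randrange(2) for _ in range(k)]

    def H(x):
        # One dot product per output bit: O(k * s) work per hashed value.
        return tuple(
            (sum(r * xi for r, xi in zip(row, x)) + bi) % 2
            for row, bi in zip(R, b)
        )

    return H


H = make_gf2_hash(8, 3, random.Random(0))
x = [1, 0, 1, 1, 0, 0, 1, 0]
print(len(H(x)))  # 3 output bits
```

The O(k·s) cost per value is the term that makes hashing all s probe readings expensive when done one at a time.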
Relationship between ρ and δ*
• δ is a guessed distance
• δ* is an optimal distance
[Plot: horizontal axis $s/2^i$; regions labeled Expectation, Unknown, and Upper bound; threshold > 2s − 2.]
Effect of Hash Functions
[Figure: effect of the hash functions h and h'.]
Merge Two Operations using FFT & Convolution
• П = random_probe(ρ, s)
• For t = 1, …, s: x(t) = (hP + t)[П] // make an invariant
• For t = 1, …, s:
 - x'(t) = H x(t) + b (mod (2,2,2,…,2)) // H: O(log s) × ρ matrix
 - ΦρP[x'(t)]++
• Time complexity: O(s) × O(matrix multiplication) = O(s) × O(s log s)
• ------------------------------------------------------------------------
• H = [r1, r2, …, r_{O(log s)}]' // ri: a binary row vector
• H x(t) = [r1 x(t), r2 x(t), r3 x(t), …, r_{O(log s)} x(t)]'
• ri x(t) = ri (hP + t)[П] = (hP + t)[П ri]
• [ri x(0), ri x(1), …, ri x(s)] = fliplr(hP) ∗ [П ri] // one convolution per row
• Time complexity: O(log s) × O(convolution) = O(log s) × O(s log s)
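The identity behind the merge can be checked on a small case. The sketch below uses hypothetical names, takes (hP + t)[q] to mean the bit at position (q − t) mod s, and evaluates the correlation naively rather than with an FFT. It shows that, for one hash row ri, the dot products ri · x(t) over all translations t are the circular cross-correlation of hP with a length-s mask that carries the bits of ri at the probe positions. That correlation is the part an FFT-based convolution would compute in O(s log s).

```python
def per_translation(hp, probes, row):
    """r . x(t) mod 2, computed separately for every translation t."""
    s = len(hp)
    return [
        sum(row[j] * hp[(probes[j] - t) % s] for j in range(len(probes))) % 2
        for t in range(s)
    ]


def via_correlation(hp, probes, row):
    """Same values, written as one circular cross-correlation of hp with a mask."""
    s = len(hp)
    mask = [0] * s
    for pos, bit in zip(probes, row):   # scatter the row onto the probe positions
        mask[pos] = bit
    # Naive O(s^2) evaluation; an FFT would produce all s lags at once.
    return [sum(mask[u] * hp[(u - t) % s] for u in range(s)) % 2 for t in range(s)]


hp = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]   # hashed bit vector, s = 11
probes = [0, 2, 5, 9]                     # distinct probe positions (illustrative)
row = [1, 0, 1, 1]                        # one row r_i of the hash matrix
print(per_translation(hp, probes, row) == via_correlation(hp, probes, row))  # True
```

Repeating this for each of the O(log s) rows gives every hashed probe reading with O(log s) correlations instead of s separate matrix-vector products.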