Embedding and Similarity Search for Point Sets under Translation

This research paper discusses the problem of point pattern matching and similarity search for point sets under translation. It introduces a novel algorithm that embeds point sets into a metric space, enabling efficient similarity search. The algorithm is designed to handle outliers in the point sets and can be computed in polynomial time.

Presentation Transcript


  1. Embedding and Similarity Search for Point Sets under Translation • Minkyoung Cho and David M. Mount • University of Maryland • SoCG 2008

  2. Point Pattern Matching • Given two point sets P and Q, find Q' ⊆ Q to minimize Dist(P, Q') = min_t dist(tP, Q'), where t is a geometric transformation (e.g., translation, rotation, …).

  3. Point Pattern Similarity Search • A collection of point sets S = {P1, P2, …, PN} has been preprocessed. Given a query set Q, find the (approximate) nearest Pi with respect to a distance function and a transformation group.

  4. Results • EMD: Earth Mover's Distance • SD: Symmetric Difference Distance

  5. Problem Definition • Point Pattern Similarity Searching: • Distance Measure: Symmetric Difference Distance (e.g., P = {p1,p2,p3,p4} and Q = {p1,p2,p5,p6}) • Error Model: Outliers (but No Noise) • Transformation: Translation • Restriction: Coordinates are integers • Example: P = {0,12,14,23,35,54,59,64}, Q = {15,17,20,26,38,57,65,67}, t = 3. Shifting Q by 3 gives {12,14,17,23,35,54,62,64}, which shares {12,14,23,35,54,64} with P.
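
The distance on this slide can be sketched with a brute-force search, assuming 1-dimensional integer point sets. The function name `sym_diff_under_translation` is hypothetical, and trying every candidate translation is only an illustration of the definition, not the method of the talk.

```python
def sym_diff_under_translation(P, Q):
    """Return (distance, t) minimizing |(P + t) Δ Q| over integer translations t."""
    P, Q = set(P), set(Q)
    # If no pair of points is matched, the distance is |P| + |Q| for any t.
    best = (len(P) + len(Q), 0)
    # Only translations that align some p with some q can do better.
    for t in sorted({q - p for p in P for q in Q}):
        d = len({p + t for p in P} ^ Q)
        if d < best[0]:
            best = (d, t)
    return best


# Point sets from the slide.
P = {0, 12, 14, 23, 35, 54, 59, 64}
Q = {15, 17, 20, 26, 38, 57, 65, 67}
print(sym_diff_under_translation(P, Q))  # (4, 3)
```

With t = 3 six points coincide, so two outliers remain on each side and the distance is 4.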

  6. Motivation: Sources of Complexity • Combination of Translation + Outliers • Translation only: translate the point set so that its leftmost point is at the origin; matching is then trivial • Outliers only: reduce to nearest neighbor search in the Hamming cube (by hashing or random sampling)
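
A small sketch of why the combination is the hard case, assuming 1-dimensional integer point sets; the helper name `canonical` is hypothetical. Aligning the leftmost point to the origin removes a pure translation, but a single outlier at the left end shifts the whole canonical form.

```python
def canonical(P):
    """Translate a 1-D point set so that its leftmost point sits at the origin."""
    lo = min(P)
    return {p - lo for p in P}


# Translation only: canonical forms coincide, so matching is trivial.
A = {3, 6, 10, 14, 22}
B = {a + 7 for a in A}
print(canonical(A) == canonical(B))  # True

# Translation + outliers: the leftmost point of P (0) is an outlier, so the
# canonical forms are misaligned even though P + 3 and Q share six points.
P = {0, 12, 14, 23, 35, 54, 59, 64}
Q = {15, 17, 20, 26, 38, 57, 65, 67}
print(len(canonical(P) ^ canonical(Q)))  # 12
```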

  7. Intuition • [Figure: the map f sends each point set P1, P2, …, PN and the query Q into a metric space]

  8. Embedding: Basic Definitions • Given metric spaces (X, d) and (X', d'), a map f: X → X' is called an embedding. • The contraction of f is the maximum factor by which distances are shrunk. • The expansion or stretch of f is the maximum factor by which distances are stretched. • The distortion of f is the product of the contraction and expansion.
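
The formulas for these quantities did not survive in the transcript. The standard definitions, which match the wording of the slide, can be written as follows (the operator names are notation chosen here, not taken from the slides):

```latex
\[
\operatorname{contraction}(f) = \max_{x \neq y \in X} \frac{d(x, y)}{d'(f(x), f(y))},
\qquad
\operatorname{expansion}(f) = \max_{x \neq y \in X} \frac{d'(f(x), f(y))}{d(x, y)},
\]
\[
\operatorname{distortion}(f) = \operatorname{contraction}(f) \cdot \operatorname{expansion}(f).
\]
```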

  9. Main Result: Preliminaries • Main result: There exists a randomized embedding that maps point sets under symmetric difference with respect to translation into the metric space L1 with distortion O(log^2 n). • Assumptions: • Each point set has at most n elements and lies in dimension d. • Coordinates are integers of magnitude polynomial in n. • Distance function: symmetric difference with respect to translation, <PΔQ> = min_t |(P + t)ΔQ| • Target metric: L1

  10. Outline of Algorithm • 1. Transform d-dimensional points into 1-dimensional points. (Distortion: 1) • 2. Reduce the domain size using a linear hash function. (Distortion: O(1)) • 3. Make invariant under translation. (Distortion: O(log^2 n)) • 4. Reduce the target domain size using a universal hash function. (Distortion: O(1)) • [Figure: the point set {3,6,10,14,22} becomes a bit vector of length O(n log n) and then a collection of bit patterns {101010, …, 010100, …, 11101}]

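  11. Translation Invariant • P = 1 0 0 1 0 0 1 0 0 0 1 (bit vector of length s; the figure repeats the leading bits 1 0 0 1 0 0 for cyclic wrap-around) • Probes of size ρ = 4 yield bit patterns such as 1101, 0000, 0010, 1100, 0001, 1010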

  12. Intuition • [Figure: bit vectors hP and hQ of length s; extracted bit rows: 1 1 0 0 0 0 0 1 0 1 0 / 0 1 1 0 0 0 0 1 0 0 0] • If one of the probes hits a mismatched position, then the generated bit patterns may differ. • Φ2P = {10,01,00,10,01,00,10,00,00,01,00} • Φ2Q = {10,00,01,00,11,00,10,01,00,11,00} • The probability that one of the probes hits a mismatched position increases as the probe size increases. • Φ4P = {1101,0000,0010,1100,0000,0001,1000,0010,0101,0000,0010} • Φ4Q = {1011,0100,0010,0101,1000,0011,1100,0010,0100,1001,0000}

  13. Relationship between ρ (probe size) and δ* • δ: estimated distance • δ*: original distance • [Plot: distance of invariants as s/2^i increases, with regions labeled Expectation, Unknown, and Upper bound (> 2s-2)]

  14. Embedding • δ: estimated distance • δ*: original distance • [Plot: distance of invariants, with ticks 0.5 and 1 on one axis and 2^0, 2^1, 2^2, …, 2^L, …, 2^H, …, 2^(log 2n) = 2n on the other]

  15. Build Time • The expensive operations are building the invariant and hashing over a large domain. • Building the invariant: (# of probes) * (# of translations) • Trivial: O(s) * s = O(n log n) * O(n log n) = O(n^2 log^2 n) • Universal hash function: (# of elements) * (matrix operation) = (# of elements) * (input size) * (output size) • Trivial: O(s) * O(s) * O(log s) = O(s^2 log s) = O(n^2 log^3 n) • We can improve this to O(n log^3 n) if we merge the two operations. • Surprise!!!

  16. Merge Two Operations • [Figure: the bit vector of P (length s) is combined with the rows r0, …, r_(log s) of the hash matrix H to produce the outputs y0, y1, y2, …, y_(s-1)] • Convolution can be computed in O(n log n) time, where n is the size of the array.

  17. Main Result: Formal Statement • Given failure probability β, there exists a randomized embedding from a point set P into a vector ΨP of dimension O(n (log^2 n) log(1/β)) such that for any P, Q • [inequality not preserved in the transcript] • This embedding can be computed in time O(n (log^4 n) log(1/β)).

  18. Open Problems • Q1. Can we improve the distortion bound (currently O(log^2 n))? • Cormode & Muthukrishnan show how to embed strings under edit distance with moves into L1 with O(log n log* n) distortion. • Q2. Can we derandomize the algorithm? • Cormode & Muthukrishnan's algorithm is deterministic. • Q3. Can we improve the space/time complexities?

  19. Other Extensions • Q1. Can we support a distance measure that is robust to noisy data (e.g., Hausdorff distance)? • Q2. Can we handle other transformation groups? • - integer scaling? • - integer scaling + translation? • - affine transformations over finite vector spaces? • Point Pattern Similarity Searching (current setting): • Distance Measure: Symmetric Difference Distance • Error Model: Outliers (but No Noise) • Transformation: Translation • Restriction: Coordinates are integral

  20. Thank You!

  21. Translation Invariant • P = {3,6,10,14,22} • h(x) = x mod s (e.g., s = 11) • hP = 1 0 0 1 0 0 1 0 0 0 1 (length s; the figure repeats the leading bits 1 0 0 1 0 0 for cyclic wrap-around) • ρ = 4, giving the bit patterns {1101, 0000, 0010, 1100, 0001, …, 1010} • ΦρP = {13,0,2,12,1,…,10} • h'(x): (for simplicity, x mod 10) • [Figure: ΦρP after applying h', drawn as an array indexed 0 to 9]
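
A minimal sketch of the invariant on this slide, under stated assumptions: the names `hashed_bits` and `invariant` are hypothetical, the probe is a random set of ρ = 4 offsets (the slide does not list the probe it used, so the patterns will differ from those shown), and the invariant is kept as a multiset of bit patterns collected over all s cyclic shifts.

```python
import random
from collections import Counter


def hashed_bits(P, s):
    """Bit vector of length s with a 1 at every position x mod s, for x in P."""
    bits = [0] * s
    for x in P:
        bits[x % s] = 1
    return bits


def invariant(bits, probe):
    """Multiset of bit patterns read at the probe offsets over all s cyclic shifts."""
    s = len(bits)
    return Counter(
        tuple(bits[(t + j) % s] for j in probe) for t in range(s)
    )


s = 11
P = [3, 6, 10, 14, 22]
rng = random.Random(0)                    # fixed seed, only for reproducibility
probe = sorted(rng.sample(range(s), 4))   # probe size rho = 4 (assumed random offsets)

print(hashed_bits(P, s))  # [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]

# Translating P only rotates the bit vector, so the multiset is unchanged.
shifted = [x + 5 for x in P]
print(invariant(hashed_bits(P, s), probe) == invariant(hashed_bits(shifted, s), probe))  # True
```

Because a translation of P becomes a cyclic rotation of hP, collecting the patterns over every shift discards the translation while keeping information about the relative positions of the points.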

  22. Trial 1: Geometric Hashing for Translation • Naïve version: • - Space complexity is O(N n^2) since the frame size is 1. • - With outliers in a query, the number of queries increases. • Adaptive version: • - To reduce space complexity we can store only c transformed sets, but then the number of queries increases. • Outliers may lead to false matches, which increases the probability of false positives.

  23. Geometric Hashing with Outliers (delete) • Depending on the number of outliers r and the frame size k, more queries are needed to get a correct result. • Method 1: Pr[choose a valid frame set] = (1 - r/n)^k • Method 2: (r + 1) different trials (deterministic) • Method 3: pigeonhole principle, Pr[choose a valid frame set] = 1 - r/(n/k) • [Grimson & Huttenlocher 90]: Outliers lead to false matches and increase the probability of false positives.

  24. d-Dimension → 1-Dimension • Let u be the maximum coordinate value of any point. Then we can map a d-dimensional point set to a 1-dimensional point set with coordinates of size at most (3u)^d without changing the symmetric difference distance under translation. • [Figure: a 2-dimensional grid containing the points (1,1) and (5,3) is unrolled row by row into a 1-dimensional bit vector over positions 1 to 35, with the ranges [6,15] and [21,30] lying between consecutive rows]
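
One way to realize such a map is a base-3u positional encoding, sketched below under the assumption that all coordinates lie in [0, u]. The names `to_1d` and `sym_diff_under_translation` are hypothetical, and the brute-force distance is included only to check the claim on a small example.

```python
def to_1d(point, u):
    """Encode a d-dimensional point with coordinates in [0, u] as one integer.

    The encoding is linear in the coordinates, so translating a point set in
    d dimensions translates its image in 1 dimension.
    """
    base = 3 * u
    return sum(c * base**i for i, c in enumerate(point))


def sym_diff_under_translation(P, Q):
    """Brute-force min over t of |(P + t) Δ Q| for sets of equal-length tuples."""
    P, Q = set(P), set(Q)
    best = len(P) + len(Q)
    for p0 in P:
        for q0 in Q:
            t = tuple(b - a for a, b in zip(p0, q0))
            moved = {tuple(a + d for a, d in zip(p, t)) for p in P}
            best = min(best, len(moved ^ Q))
    return best


u = 5
P2 = {(1, 1), (5, 3), (2, 2), (4, 1)}   # (4, 1) is an outlier
Q2 = {(0, 2), (4, 4), (1, 3), (5, 5)}   # (5, 5) is an outlier
P1 = {(to_1d(p, u),) for p in P2}
Q1 = {(to_1d(q, u),) for q in Q2}
print(sym_diff_under_translation(P2, Q2), sym_diff_under_translation(P1, Q1))  # 2 2
```

The factor 3 leaves enough room between consecutive rows that a 1-dimensional translation cannot align points in a way that no d-dimensional translation could.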

  25. # of Primes & Collision Prob. • Collision Probability • h(x) = x mod s, where s is a prime number in Θ(n log n) chosen uniformly at random • For x != y: • Pr[h(x) = h(y)] = Pr[(x mod s) = (y mod s)] = Pr[(x - y) mod s = 0] • Since x, y ∈ Z_(n^c), |x - y| < n^c. • Pr[h(x) = h(y)] < c/(# of primes) = O(1/n) • Prime Number Theorem • There are Θ(m / log m) prime numbers between 1 and m.
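
A sketch of drawing such a hash function, under assumptions not stated on the slide: the prime is drawn uniformly from the interval [m, 2m] with m = ceil(n log2 n), and the helper names `primes_in` and `random_prime_modulus` are hypothetical.

```python
import math
import random


def primes_in(lo, hi):
    """All primes p with lo <= p <= hi, by a simple sieve of Eratosthenes."""
    sieve = [True] * (hi + 1)
    sieve[0:2] = [False, False]
    for i in range(2, math.isqrt(hi) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(range(i * i, hi + 1, i))
    return [p for p in range(max(lo, 2), hi + 1) if sieve[p]]


def random_prime_modulus(n, rng):
    """Pick s uniformly among the primes in [m, 2m], where m = ceil(n log2 n)."""
    m = max(2, math.ceil(n * math.log2(n)))
    return rng.choice(primes_in(m, 2 * m))


rng = random.Random(1)   # fixed seed, only for reproducibility
n = 8
s = random_prime_modulus(n, rng)


def h(x):
    return x % s


# Two distinct values collide exactly when s divides their difference.
x, y = 1234, 1234 + 7 * s
print(h(x) == h(y))                         # True
# The hash is linear modulo s, so it commutes with translation.
t = 567
print(h(x + t) == (h(x) + h(t)) % s)        # True
```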

  26. Distance Distortion by Hashing • We can achieve O(1) distortion with a hash function whose collision probability is O(1/n). • Note that collisions can only contract the distance, never expand it.

  27. Linear Hash Function (X) • h(x) = x mod s, where s is a prime number in Θ(n log n) • Linearity: h(x + t) = h(x) + h(t) (mod s) • Translation invariance: ΦρP = Φρ(P+t) • Example: P = {3,6,10,14,22} hashes to the bit vector 1 0 0 1 0 0 1 0 0 0 1 of length s

  28. Distance Distortion by Hashing (X) • We can achieve O(1) distortion with a hash function whose collision probability is O(1/n). • Note that collisions can only contract the distance, never expand it.

  29. Universal Hash Function for large domain • Since the maximum probe size is O(n log n), the input domain of the hash function has size 2^O(n log n). However, it contains only Θ(n log n) elements. • H: {0,1}^s → {0,1}^k • H(x) = R x + b (mod (2,2,…,2)) • R: a random k x s matrix • b: a random k-bit vector • Time Complexity: • To compute one value: O(k s) = O((log n) n log n) = O(n log^2 n) • For all s (= O(n log n)) values, the time is O(n^2 log^3 n).
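
The hash function H(x) = R x + b over bits can be sketched as follows. The name `random_linear_hash` is hypothetical, and bit vectors are represented as tuples of 0/1 integers for readability rather than packed for speed.

```python
import random


def random_linear_hash(s, k, rng):
    """Draw H(x) = R x + b (mod 2): R is a random k x s bit matrix, b a random k-bit vector."""
    R = [[rng.randrange(2) for _ in range(s)] for _ in range(k)]
    b = [rng.randrange(2) for _ in range(k)]

    def H(x):
        # One output bit per row: parity of the selected input bits plus the offset bit.
        return tuple(
            (sum(r_j & x_j for r_j, x_j in zip(row, x)) + b_i) % 2
            for row, b_i in zip(R, b)
        )

    return H


rng = random.Random(0)   # fixed seed, only for reproducibility
s, k = 11, 4
H = random_linear_hash(s, k, rng)
x = (1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1)
print(len(H(x)))  # 4
```

Each evaluation touches all k x s matrix entries, which is the O(k s) cost per value quoted on the slide.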

  30. Relationship between ρ and δ* • δ is a guessed distance • δ* is an optimal distance • [Plot: regions labeled Expectation, Unknown, and Upper bound (> 2s-2), plotted against s/2^i]

  31. Effect of Hash Functions • [Figure: effect of the hash functions h and h']

  32. Merge Two Operations using FFT & Convolution • П = random_probe(ρ, s) • For t = 1, …, s: x(t) = (hP + t)[П] // make an invariant • For t = 1, …, s: • x'(t) = H x(t) + b (mod (2,2,…,2)) // H: O(log s) x ρ matrix • ΦρP[x'(t)]++ • Time Complexity: O(s) * O(matrix multi) = O(s) * O(s log s) = O(s^2 log s) • ------------------------------------------------------------------------ • H = [r1, r2, …, rO(log s)]' // ri: a binary row vector • H x(t) = [r1 x(t), r2 x(t), r3 x(t), …, rO(log s) x(t)]' • ri x(t) = ri (hP + t)[П] = Σ (hP + t)[П ri], the sum of the bits of hP + t at the probe positions selected by ri • [ri x(0), ri x(1), …, ri x(s)] = fliplr(hP) * [П ri], where * denotes convolution • Time Complexity: O(log s) * O(convolution) = O(log s) * O(s log s) = O(s log^2 s)
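
The merging idea can be checked on a small example. The sketch below is an illustration under assumptions, not the authors' implementation: shifts are indexed t = 0, …, s-1, a shift by t reads position (p + t) mod s, the FFT is a plain recursive radix-2 routine, and the names `parity_bits_naive` and `parity_bits_fft` are hypothetical. It computes, for one row r of H, the bit r · x(t) mod 2 for every shift t, first directly and then with one convolution.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def convolve(a, b):
    """Linear convolution of two integer sequences via FFT."""
    m = len(a) + len(b) - 1
    size = 1
    while size < m:
        size *= 2
    fa = fft(list(a) + [0] * (size - len(a)))
    fb = fft(list(b) + [0] * (size - len(b)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    return [round(z.real / size) for z in prod[:m]]


def parity_bits_naive(bits, probe, r):
    """For every cyclic shift t, the bit r . x(t) mod 2, where x(t) reads bits at probe + t."""
    s = len(bits)
    return [
        sum(r_j * bits[(p + t) % s] for r_j, p in zip(r, probe)) % 2
        for t in range(s)
    ]


def parity_bits_fft(bits, probe, r):
    """Same output, computed for all shifts at once with a single convolution."""
    s = len(bits)
    mask = [0] * s
    for r_j, p in zip(r, probe):
        mask[p] += r_j                              # probe positions selected by the row r
    conv = convolve(list(bits) * 2, mask[::-1])     # doubling handles the wrap-around
    return [conv[t + s - 1] % 2 for t in range(s)]


bits = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]   # hP from the slides (s = 11)
probe = [0, 2, 5, 9]                        # hypothetical probe positions
r = [1, 0, 1, 1]                            # hypothetical row of H
print(parity_bits_naive(bits, probe, r) == parity_bits_fft(bits, probe, r))  # True
```

Repeating the convolution once per row of H gives O(log s) convolutions of cost O(s log s) each, in place of a separate matrix-vector product for every shift.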

  33. Build Time
