Converting Categories to Numbers for Approximate Nearest Neighbor Search

  1. Converting Categories to Numbers for Approximate Nearest Neighbor Search 嘉義大學資工系 郭煌政 2004/10/20

  2. Outline • Introduction • Motivation • Measurement • Algorithms • Experiments • Conclusion

  3. Introduction • Memory-Based Reasoning • Case-Based Reasoning • Instance-Based Learning • Given a training dataset and a new object, predict the class (target value) of the new object. • Focus on tabular data

  4. Introduction • K Nearest Neighbor Search • Compute the similarity between the new object and each object in the training dataset. • Linear time in the size of the dataset • Similarity: Euclidean distance • Multi-dimensional Index • Spatial data structures, such as the R-tree • Numeric data

  5. Introduction • Indexing on Categorical Data? • Linear order of the categories • Is there an existing correct ordering? • What is the best ordering? • Store the mapped data in a multi-dimensional data structure as a filtering mechanism

  6. Measurement for Ordering • Ordering Problem: Given an undirected weighted complete graph, a simple path is an ordering of the vertices. The edge weights are the distances between pairs of vertices. The ordering problem is to find a path, called an ordering path, of maximal value according to a given scoring function.

  7. Measurement for Ordering • Relationship Scoring: Reasonable Ordering Score • In an ordering path <v1, v2, …, vn>, a 3-tuple <vi-1, vi, vi+1> is reasonable if and only if dist(vi-1, vi+1) ≧ dist(vi-1, vi) and dist(vi-1, vi+1) ≧ dist(vi, vi+1).
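The slide leaves the aggregate score implicit, but the 1/3 no-ordering baseline quoted later in the deck (slide 27) is consistent with scoring a path by the fraction of its 3-tuples that are reasonable. A minimal sketch under that assumption, with `dist` a symmetric distance matrix (dict of dicts) and `reasonable_score` an illustrative name:

```python
def reasonable_score(path, dist):
    """Fraction of interior 3-tuples <v[i-1], v[i], v[i+1]> that are
    'reasonable': the outer pair is at least as far apart as either
    outer vertex is from the middle one.  A random ordering scores
    about 1/3, matching the no-ordering baseline quoted later."""
    if len(path) < 3:
        return 1.0
    good = sum(1 for i in range(1, len(path) - 1)
               if dist[path[i - 1]][path[i + 1]] >= dist[path[i - 1]][path[i]]
               and dist[path[i - 1]][path[i + 1]] >= dist[path[i]][path[i + 1]])
    return good / (len(path) - 2)
```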

  8. Measurement for Mapping • Pairwise Difference Scoring • Normalized distance matrix • Mapping values of categories • Distm(vi, vj) = |mapping(vi) - mapping(vj)|
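The transcript does not preserve the scoring formula itself, but the conclusion slide names root mean squared error, so one plausible reading is the RMSE between the normalized distances and the mapped differences over all category pairs. A sketch under that assumption (`pairwise_rmse` is an illustrative name):

```python
from itertools import combinations
from math import sqrt

def pairwise_rmse(mapping, dist, categories):
    """RMSE between the normalized category distances dist(u, v) and
    the mapped differences |mapping(u) - mapping(v)|, over all pairs."""
    errs = [(dist[u][v] - abs(mapping[u] - mapping[v])) ** 2
            for u, v in combinations(categories, 2)]
    return sqrt(sum(errs) / len(errs))
```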

  9. Algorithms • Prim-like Ordering • Kruskal-like Ordering • Divisive Ordering • GA-based Ordering • A vertex represents a category • A graph represents a distance matrix

  10. Prim-like Ordering Algorithm • Prim's Minimum Spanning Tree • Initially, choose a least-weight edge (u, v) • Add the edge to the tree; S = {u, v} • Choose a least-weight edge connecting a vertex in S to a vertex w not in S • Add the edge to the tree; add w to S • Repeat until all vertices are in S

  11. Prim-like Ordering Algorithm • Prim-like Ordering • Choose a least-weight edge (u, v) • Add the edge to the ordering path; S = {u, v} • Choose a least-weight edge connecting a vertex in S to a vertex w not in S • If the edge would break the path (create a cycle or a branch), discard it and choose again • Else, add the edge to the ordering path; add w to S • Repeat until all vertices are in S
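Since an edge to a vertex outside S can never close a cycle, the discard rule effectively restricts growth to the two ends of the path. A sketch under that reading (function and parameter names are illustrative; `dist` is a symmetric dict-of-dicts matrix):

```python
def prim_like_ordering(categories, dist):
    """Grow one path Prim-style: start from a least-weight edge, then
    repeatedly attach the nearest unused category to either endpoint,
    so the partial solution always stays a simple path."""
    u, v = min(((a, b) for a in categories for b in categories if a != b),
               key=lambda e: dist[e[0]][e[1]])
    path, used = [u, v], {u, v}
    while len(used) < len(categories):
        # Least-weight edge from either endpoint to an unused category.
        end, w = min(((e, c) for e in (path[0], path[-1])
                      for c in categories if c not in used),
                     key=lambda p: dist[p[0]][p[1]])
        if end == path[0]:
            path.insert(0, w)
        else:
            path.append(w)
        used.add(w)
    return path
```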

  12. Kruskal-like Ordering Algorithm • Kruskal's Minimum Spanning Tree • Initially, choose a least-weight edge (u, v) • Add the edge to the tree; S = {u, v} • Choose a least-weight edge, as long as the edge does not create a cycle in the tree • Add the edge to the tree; add the two vertices to S • Repeat until all vertices are in S

  13. Kruskal-like Ordering Algorithm • Kruskal-like Ordering • Initially, choose a least-weight edge (u, v) and add it to the ordering path; S = {u, v} • Choose a least-weight edge, as long as it does not create a cycle and the degree of each vertex on the path stays ≦ 2 • Add the edge to the ordering path; add the two vertices to S • Repeat until all vertices are in S • A heap can be used to speed up choosing the least-weight edge
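A sketch of the Kruskal-like variant, assuming `categories` is a list and `dist` a symmetric dict-of-dicts matrix; a heap could replace the full sort, as the last bullet suggests:

```python
def kruskal_like_ordering(categories, dist):
    """Accept edges in increasing weight, rejecting any that would close
    a cycle (union-find test) or push a vertex's degree past 2; the
    accepted edges form a single path, which is then walked out."""
    parent = {c: c for c in categories}
    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    degree = {c: 0 for c in categories}
    adj = {c: [] for c in categories}
    edges = sorted((dist[a][b], a, b) for i, a in enumerate(categories)
                   for b in categories[i + 1:])
    for w, a, b in edges:
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            degree[a] += 1; degree[b] += 1
            adj[a].append(b); adj[b].append(a)
    start = next(c for c in categories if degree[c] == 1)  # a path endpoint
    path, prev = [start], None
    while len(path) < len(categories):
        nxt = next(n for n in adj[path[-1]] if n != prev)
        prev = path[-1]
        path.append(nxt)
    return path
```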

  14. Divisive Ordering Algorithm • Idea: • Pick a central vertex, and split the remaining vertices • Build a binary tree: the vertices are the leaves • Central Vertex:

  15. Divisive Ordering Algorithm • [Figure: binary tree with central vertex P and subtrees A (leaves AL, AR) and B (leaves BL, BR)] • AR is closer to P than AL is. • BL is closer to P than BR is.

  16. Clustering • Splitting a Set of Vertices into Two Groups • Each group has at least one vertex • Close (similar) vertices in the same group; distant vertices in different groups • Clustering Algorithms • Two clusters

  17. Clustering • Clustering • Grouping a set of objects into classes of similar objects • Agglomerative Hierarchical Clustering Algorithm • Singleton clusters • Merge similar clusters

  18. Clustering • Clustering Algorithm: Cluster Similarity • Single link: dist(Ci, Cj) = min(dist(p, q)), p in Ci, q in Cj • Complete link: dist(Ci, Cj) = max(dist(p, q)), p in Ci, q in Cj • Average link (adopted in our study): dist(Ci, Cj) = avg(dist(p, q)), p in Ci, q in Cj • Others
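Each linkage translates directly into code; a sketch, with clusters as lists of categories and `dist` the pairwise distance matrix:

```python
def single_link(ci, cj, dist):
    return min(dist[p][q] for p in ci for q in cj)

def complete_link(ci, cj, dist):
    return max(dist[p][q] for p in ci for q in cj)

def average_link(ci, cj, dist):
    # The variant adopted in this study.
    return sum(dist[p][q] for p in ci for q in cj) / (len(ci) * len(cj))
```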

  19. Clustering • Clustering Implementation Issues • Which pair of clusters to merge: keep a cluster-to-cluster similarity for each pair • Recursively partition sets of vertices while building the binary tree: a non-recursive version uses an explicit stack
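Putting the pieces together, a sketch of the two-way split used by the divisive ordering, assuming the agglomerative merging simply runs until two clusters remain (the slides give the ingredients, not this exact loop); `linkage` would be, e.g., the `average_link` above:

```python
def two_way_split(vertices, dist, linkage):
    """Agglomerative clustering run down to exactly two clusters.
    For clarity this recomputes linkages every round; a real
    implementation would cache the cluster-to-cluster similarities,
    as the slide notes."""
    clusters = [[v] for v in vertices]            # start from singletons
    while len(clusters) > 2:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]], dist))
        clusters[i].extend(clusters.pop(j))       # merge the closest pair
    return clusters
```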

  20. GA Ordering Algorithm • Genetic Algorithms for Optimization Problems • Chromosome: a solution • Population: a pool of solutions • Genetic Operations • Crossover • Mutation

  21. GA Ordering Algorithm • Encoding a Solution • Binary string • Ordered list of categories (in our ordering problem) • Fitness Function • Reasonable ordering score • Selecting Chromosomes for Crossover • High fitness value => high selection probability
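The slide only requires that high fitness mean high selection probability; fitness-proportional (roulette-wheel) selection is one standard way to realize that, sketched here as an assumption:

```python
import random

def select_parent(population, fitness):
    """Pick a chromosome with probability proportional to its fitness
    (here, its reasonable ordering score)."""
    weights = [fitness(c) for c in population]
    return random.choices(population, weights=weights, k=1)[0]
```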

  22. GA Ordering Algorithm • Crossover • Single point • Multiple points • Mask • Crossover of AB | CDE and BD | AEC • Results in ABAEC and BDCDE => illegal (duplicated categories)

  23. GA Ordering Algorithm • Repair illegal chromosome ABAEC • AB*EC => fill D into the * position • Repair illegal chromosome ABABC • AB**C • D and E are missing • Whichever is closer to B is filled into the first * position
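A sketch of this repair step, assuming gaps are filled left to right (so each gap's left neighbor is already concrete) and "closest" refers to the category distance matrix:

```python
def repair(chrom, categories, dist):
    """Turn the second occurrence of any duplicated category into a gap,
    then fill each gap with the missing category closest to its left
    neighbor (the ABABC -> AB**C example above)."""
    chrom, seen, gaps = list(chrom), set(), []
    for i, c in enumerate(chrom):
        if c in seen:
            gaps.append(i)            # duplicate => becomes a gap (*)
        else:
            seen.add(c)
    missing = [c for c in categories if c not in seen]
    for i in gaps:
        left = chrom[i - 1]           # never a gap: gaps are filled in order
        best = min(missing, key=lambda m: dist[left][m])
        missing.remove(best)
        chrom[i] = best
    return chrom
```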

  24. Mapping Function • Ordering Path <v1, v2, …, vn> • Mapping(vi) =
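The formula itself did not survive the transcript. Purely as an assumption, one natural construction maps each category to its cumulative distance along the ordering path, normalized into [0, 1]:

```python
def build_mapping(path, dist):
    """ASSUMED reconstruction (the slide's formula is not preserved):
    each category's value is its cumulative distance along the ordering
    path, scaled into [0, 1]."""
    pos = [0.0]
    for a, b in zip(path, path[1:]):
        pos.append(pos[-1] + dist[a][b])
    total = pos[-1] or 1.0            # guard for a degenerate path
    return {v: p / total for v, p in zip(path, pos)}
```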

  25. Experiments • Synthetic Data (width/length = 5)

  26. Experiments • Synthetic Data (width/length = 10)

  27. Experiments • Synthetic Data: Reasonable Ordering Score for the Divisive Algorithm • width/length = 5 => 0.82 • width/length = 10 => 0.9 • No ordering => 1/3 • The divisive algorithm is better than the Prim-like algorithm when the number of categories is > 100

  28. Experiments • Synthetic Data (width/length = 5)

  29. Experiments • Synthetic Data (width/length = 10)

  30. Experiments • Divisive ordering is the best among the three ordering algorithms • For the divisive ordering algorithm on > 100 categories, RMSE scores are around 0.07 when width/length = 5 and around 0.05 when width/length = 10. • Prim-like ordering algorithm: 0.12 and 0.1, respectively.

  31. Experiments • The "Census-Income" dataset from the University of California, Irvine (UCI) KDD Archive • 33 nominal attributes, 7 continuous attributes • Sample 5000 records for the training dataset • Sample 2000 records for the approximate KNN search experiment

  32. Experiments • Distance Matrix: distance between two categories • V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS: Clustering Categorical Data Using Summaries," ACM KDD, 1999 • D = {d1, d2, …, dn} is a set of n tuples • D ⊆ D1 × D2 × … × Dk, where Di is a categorical domain, for 1 ≦ i ≦ k • di = <ci1, ci2, …, cik>

  33. Experiments

  34. Experiments • Approximate KNN – nominal attributes

  35. Experiments • Approximate KNN – nominal attributes

  36. Experiments • Approximate KNN – nominal attributes

  37. Experiments • Approximate KNN – all attributes

  38. Experiments • Approximate KNN – all attributes

  39. Experiments • Approximate KNN – all attributes

  40. Conclusion • Developed Ordering Algorithms • Prim-like • Kruskal-like • Divisive • GA-based • Devised Measurements • Reasonable ordering score • Root mean squared error

  41. Conclusion • What next? • New categories => a new mapping function • New index structures? • Train the mapping function for a given ordering path

  42. Thank you.
