
Operators for Similarity Search





Presentation Transcript


  1. Operators for Similarity Search Deepak Padmanabhan, PhD Centre for Data Sciences and Scalable Computing The Queen’s University of Belfast United Kingdom deepaksp@acm.org

  2. Similarity Search in Action Image Search Web Pages Similar Movies (Tastekid.com)

  3. Similarity and Cognition

  4. Similarity and Cognition "This sense of sameness is the very keel and backbone of our thinking. … the mind makes continual use of the notion of sameness, and if deprived of it, would have a different structure from what it has." (Principles of Psychology, William James, 1890)

  5. Similarity and Cognition "Similarity is fundamental for learning, knowledge and thought, for only our sense of similarity allows us to order things into kinds so that these can function as stimulus meanings. Reasonable expectation depends on the similarity of circumstances and on our tendency to expect that similar causes will have similar effects." (Ontological Relativity and Other Essays, Quine, 1969)

  6. Geometric Similarity Model [Figure: objects O1 and O2 represented as points in a geometric feature space]

  7. Diagnosticity Principle • Similarity and grouping are related • Features that are used to cluster have disproportionate influence

  8. Pairwise Similarities: Object Representation and Similarity Measures

  9. Example Representations

  10. Estimating Similarity Between Objects [Figure: two car records compared attribute by attribute] CAR1023 (Remarks: Good condition, Model: Passat V6, Year: 2002, Battery Voltage: 12.9V, …) vs. CAR560 (Remarks: Nice condition, Model: Passat, Year: 2000, Battery Voltage: 12.6V, …). Each attribute pair is scored by an appropriate measure (text similarity for the remarks, domain ontology/knowledge for the model, numeric comparison for year and battery voltage), yielding per-attribute scores {0.6, 0.8, 0.75, 0.9}. These can then be aggregated: min = 0.60, min2 = 0.75, avg = 0.76, max = 0.90, or left unaggregated (noagg).
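As a minimal sketch (not part of the original deck), the per-attribute scores for the two car records can be aggregated with the operators named on the slide:

```python
# Per-attribute similarity scores for CAR1023 vs. CAR560 (from the slide)
scores = [0.60, 0.80, 0.75, 0.90]

agg_min = min(scores)                # worst-matching attribute: 0.60
agg_max = max(scores)                # best-matching attribute: 0.90
agg_avg = sum(scores) / len(scores)  # 0.7625, shown as 0.76 on the slide
agg_min2 = sorted(scores)[1]         # second-worst match (min2): 0.75
# "noagg" keeps the whole score vector, deferring the decision to a filter
```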

  11. Outline for the Rest • Construction-based Classification • Property-based Classification • Some Directions

  12. Problem Overview [Figure: pipeline from query Q = (q1, …, qn) and object D = (d1, …, dn) through scoring, aggregation and filtering to the result indicator I(Q,D)] • Scoring operators: assign a score vector S(Q,D) to each object by comparing it to the query object • Aggregation operators: aggregate the score vector into a smaller number of values • Selection/filter operators: select a subset of objects based on whether they satisfy a criterion, e.g., skyline, rank-based or threshold-based

  13. Common Operations • Aggregation operations • Weighted Sum • Max • Min • Distance • N-Match • Filter operations • Skyline • Rank (Top-k) • Threshold (Bounding Box, Range query) Different combinations lead to different operators

  14. Weighted Sum Top-k Query Q:(3,4), weights W(X) = 1, W(Y) = 2; Top-k filter: sort by score and choose the k best. d((2,2)) = 1*1 + 2*2 = 5, d((1,3)) = 1*2 + 2*1 = 4, d((1,4)) = 1*2 + 2*0 = 2, d((5,1)) = 1*2 + 2*3 = 8, d((3,3)) = 1*0 + 2*1 = 2, d((6,3)) = 1*3 + 2*1 = 5, d((2,6)) = 1*1 + 2*2 = 5, d((5,6)) = 1*2 + 2*2 = 6. Sorted: (1,4) and (3,3) at 2, then (1,3) at 4, then (2,2), (6,3) and (2,6) at 5, (5,6) at 6, (5,1) at 8. [Figure: with equal weights, the locus of equal scores is a diamond around Q] Useful when all attributes need to be considered.
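As a minimal sketch (not from the deck itself), the weighted-sum aggregation plus top-k filter with the slide's query, weights and points:

```python
# Weighted L1 distance with W(X)=1, W(Y)=2, query Q=(3,4), as on the slide
Q = (3, 4)
W = (1, 2)
points = [(2, 2), (1, 3), (1, 4), (5, 1), (3, 3), (6, 3), (2, 6), (5, 6)]

def weighted_l1(p, q=Q, w=W):
    # aggregate the per-attribute distances into one score by weighted sum
    return sum(wi * abs(pi - qi) for wi, pi, qi in zip(w, p, q))

# Top-k filter: sort by aggregated score and keep the k best
k = 3
topk = sorted(points, key=weighted_l1)[:k]
```

Both (1,4) and (3,3) score 2, so they head the ranking, followed by (1,3) at 4.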

  15. Max Top-k Query Q:(3,4); the score is the maximum per-attribute distance. d((2,2)) = 2, d((1,3)) = 2, d((1,4)) = 2, d((5,1)) = 3, d((3,3)) = 1, d((6,3)) = 3, d((2,6)) = 2, d((5,6)) = 2. [Figure: the locus of equal scores is a square around Q] Useful when the maximum dissimilarity needs to be bounded.

  16. Min Top-k Query Q:(3,4); the score is the minimum per-attribute distance. d((2,2)) = 1, d((1,3)) = 1, d((1,4)) = 0, d((5,1)) = 2, d((3,3)) = 0, d((6,3)) = 1, d((2,6)) = 1, d((5,6)) = 2. [Figure: locus of equal scores around Q] Useful when the best matching attribute is sufficient.
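A sketch (not from the deck) showing how the Max and Min aggregators plug into the same pipeline, with only the aggregation function changing:

```python
# Max / Min aggregation over per-attribute distances to Q=(3,4), as on
# the slides; the same top-k filter can then be applied to either score.
Q = (3, 4)
points = [(2, 2), (1, 3), (1, 4), (5, 1), (3, 3), (6, 3), (2, 6), (5, 6)]

def agg_distance(p, agg):
    # aggregate the per-attribute distance vector with the given operator
    return agg(abs(pi - qi) for pi, qi in zip(p, Q))

max_scores = {p: agg_distance(p, max) for p in points}
min_scores = {p: agg_distance(p, min) for p in points}
```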

  17. Skyline An object is said to dominate another if the latter is farther away from the query than the former on all dimensions (it can be equal on some, but not all). All objects that are not dominated by any other are output as results. [Figure: domination regions of (1,4), (2,6), (5,6) and (3,3) with respect to Q:(3,4)] Results: (5,6), (2,6), (3,3), (1,4). Useful when attribute scores cannot be aggregated.
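As a sketch (not from the deck), the skyline filter on the slide's points, using the quadrant-restricted dominance the figure's domination regions suggest: a point dominates only points lying beyond it on the same side of the query in every dimension.

```python
Q = (3, 4)
points = [(2, 6), (5, 6), (1, 4), (3, 3), (1, 3), (6, 3), (2, 2), (5, 1)]

def dominates(p, r, q=Q):
    # p dominates r: on every dimension p lies between q and r (no farther
    # on any dimension, same side of q), and strictly closer on at least one
    no_worse = all(
        pi == qi or ((pi - qi) * (ri - qi) > 0 and abs(pi - qi) <= abs(ri - qi))
        for pi, ri, qi in zip(p, r, q)
    )
    strictly_better = any(
        abs(pi - qi) < abs(ri - qi) for pi, ri, qi in zip(p, r, q)
    )
    return no_worse and strictly_better

skyline = [p for p in points
           if not any(dominates(o, p) for o in points if o != p)]
```

This reproduces the result set on the slide: (2,6), (5,6), (1,4) and (3,3) survive; the rest are dominated by (3,3).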

  18. Range Query L2 aggregation + threshold filter: return all objects within L2 distance r of the query. [Figure: circle of radius r around Q:(3,4)]

  19. Bounding Box Null aggregation + threshold filter: each attribute has its own threshold (rx, ry). [Figure: axis-aligned rectangle of half-widths rx and ry around Q:(3,4)]
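A sketch (not from the deck) contrasting the two threshold filters on the earlier slides' points; the radii r, rx, ry are illustrative:

```python
import math

Q = (3, 4)
points = [(2, 6), (5, 6), (1, 4), (3, 3), (1, 3), (6, 3), (2, 2), (5, 1)]

def range_query(r):
    # L2 aggregation + threshold: keep points within distance r of Q
    return [p for p in points if math.dist(p, Q) <= r]

def bounding_box(rx, ry):
    # null aggregation + per-dimension thresholds
    return [p for p in points
            if abs(p[0] - Q[0]) <= rx and abs(p[1] - Q[1]) <= ry]
```

A bounding box with rx = ry = r always contains the L2 range result for the same r, since the per-dimension test is weaker than the aggregated one.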

  20. K-N-Match N-match aggregation + rank filter: the K-N-Match operator ranks objects based on the match on the N-th best matching attribute. [Figure: points around Q:(3,4) with regions for N = 1 and N = 2] Useful when at least N attributes should match.
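As a sketch (not from the deck), K-N-Match on this slide's points: each object is scored by its N-th smallest per-attribute distance to Q, then a rank filter keeps the K lowest-scoring objects.

```python
Q = (3, 4)
points = [(2, 6), (4, 5), (1, 4), (1, 3), (6, 3), (2, 2), (3, 2), (5, 1)]

def n_match_score(p, n):
    # N-match aggregation: the N-th best (smallest) per-attribute distance
    diffs = sorted(abs(pi - qi) for pi, qi in zip(p, Q))
    return diffs[n - 1]

def k_n_match(k, n):
    # rank filter over the N-match scores
    return sorted(points, key=lambda p: n_match_score(p, n))[:k]
```

With n = 1 a single well-matching attribute suffices, so (1,4) and (3,2) win; with n = 2 both attributes must match, and (4,5) comes out on top.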

  21. Summarizing the Construction-based Classification

  22. Property-based Classification • Ordered vs. Unordered Output • Whether there is an ordering in the output result set • Subset vs. All Attributes • Whether all attributes contribute to deciding the membership in the result set

  23. Ordered vs. Unordered Output Applicable to selection/filter operators. [Figure: Top-k produces a ranked result list (R1, R2, R3) around the query; Skyline produces an unordered result set]

  24. Subset vs. All Attributes Applicable to aggregation operators. [Figure: the scoring/aggregation/filter pipeline from the Problem Overview slide] We focus on the construction of I(Q,D) for this classification.

  25. Some Example I(Q,D)s • Weighted Sum • Range Query • Bounding Box/Skyline • Max • Min • K-N-Match All Attributes Needed “Some” Attributes Enough

  26. Classification Overview [Figure: operators classified along the aggregation and selection/filter axes]

  27. “Add-on” Features for Similarity Operators • Indirection (Reverse Operators) • Multiple Queries • Diversity • Visibility • Subspaces • Typed Data (Chromaticity)

  28. Reverse Operators • Range Query: Get me all the restaurants within 1km of my home • This is a common consumer usage scenario • E.g., a user searching for restaurants to dine at • Reverse Range Query: Get me all the users for whom my restaurant is within 1km • This is more of a service provider question • E.g., finding potential consumers to whom targeted marketing may be directed • This reversal can be applied to various operators • E.g., Reverse Skyline, Reverse kNN, …
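A minimal sketch (names, coordinates and radius are illustrative, not from the deck) of the reversal: the forward query asks which objects fall within distance r of the query, while the reverse asks which users have the query object within their radius.

```python
import math

# hypothetical user locations and a restaurant location
users = {"u1": (0.2, 0.3), "u2": (2.5, 2.5), "u3": (0.8, 0.1)}
restaurant = (0.5, 0.5)

def reverse_range(q, user_points, r):
    # every user for whom q lies within distance r
    return [name for name, p in user_points.items() if math.dist(p, q) <= r]
```

With symmetric distances the reverse range query is a range query issued from the service provider's side; the reversal becomes non-trivial for asymmetric operators such as kNN.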

  29. Multiple Queries [Figure: Home, Office and Club locations with nearby restaurants/pubs] I plan to leave from the office, go to the club and then get home. I need to get some dinner somewhere during this travel. Give me restaurants or pubs that are within 1km of any of these three locations. This corresponds to a range query using multiple query points. The merging operator here is the OR operator, since we would be content with places that are close to any one of these queries.
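A sketch of the multi-query range search with OR merging (coordinates and place names are illustrative placeholders, not from the deck):

```python
import math

home, office, club = (0, 0), (4, 0), (2, 3)
places = {"pub_a": (0.5, 0.5), "rest_b": (4.2, 0.3), "rest_c": (10, 10)}

def multi_query_range(queries, candidate_points, r):
    # OR merge: keep a place if it is within r of ANY query point
    return [name for name, p in candidate_points.items()
            if any(math.dist(p, q) <= r for q in queries)]
```

Replacing `any` with `all` would give the AND merge instead, i.e., places close to every stop on the trip.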

  30. Diversity [Figure: points plotted on Cost vs. Rating axes] The plain 3 nearest neighbours are not very diverse and are very similar to each other. A diversity constraint ensures that the pairwise distance between any two results is lower bounded, and thus returns a more diverse set.
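One simple way to realize the constraint (a greedy sketch, not the deck's algorithm): take candidates in order of distance to the query, skipping any that come within the minimum separation of an already chosen result.

```python
import math

def diverse_knn(q, candidates, k, min_sep):
    # greedily pick nearest candidates whose pairwise distance to every
    # already-chosen result is at least min_sep (the lower bound)
    results = []
    for p in sorted(candidates, key=lambda p: math.dist(p, q)):
        if all(math.dist(p, r) >= min_sep for r in results):
            results.append(p)
        if len(results) == k:
            break
    return results
```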

  31. Visibility Constraints Return the k nearest neighbours that are visible from the query point. [Figure: points d1, …, d8 around Q, with obstacles blocking visibility] Example (K = 3): kNN = {d4, d5, d6}, but VkNN = {d4, d1, d2}.

  32. Subspaces: Subspace Range Search Find objects within a threshold distance in a user-specified subset of dimensions. [Figure: points d1, …, d8 around Q on Expense vs. Rating axes] Dimensions = {Expense, Rating}: R = {d4, d5, d6}. Dimensions = {Rating}: R = {d1, d2, d4, d5, d6, d8}. Dimensions = {Expense}: R = {d4, d5, d6}.
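A sketch (points and threshold are illustrative, not the slide's data) of restricting the range threshold to a user-chosen subset of dimensions, with the dimension names from the slide:

```python
import math

dims = {"Expense": 0, "Rating": 1}  # dimension name -> coordinate index

def subspace_range(q, candidates, r, subspace):
    # apply the L2 threshold using only the chosen dimensions
    idx = [dims[d] for d in subspace]
    return [p for p in candidates
            if math.dist([p[i] for i in idx], [q[i] for i in idx]) <= r]
```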

  33. Typed Data: Chromaticity Find objects (of class A) that have the query object (of class B) in their kNN result sets. Example: people (class P) and restaurants (class R); the query is from class R, the results from class P. [Figure: people p1, …, p6 and restaurants r1, …, r3] Find the bi-chromatic rkNN set of a restaurant: RNN(r1) = {p2, r3}, but Bi-RNN(r1) = {p2, p1} and Bi-RNN(r3) = {p6}.
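A sketch of the bichromatic reverse kNN (coordinates are illustrative, not the slide's layout): the query is a restaurant, and the result is every person whose k nearest restaurants include it.

```python
import math

restaurants = {"r1": (0, 0), "r2": (5, 5), "r3": (9, 0)}
people = {"p1": (1, 0), "p2": (0, 1), "p3": (8, 1)}

def bi_rknn(query, k=1):
    # a person qualifies if the query restaurant is among their k nearest
    result = []
    for name, loc in people.items():
        knn = sorted(restaurants,
                     key=lambda r: math.dist(loc, restaurants[r]))[:k]
        if query in knn:
            result.append(name)
    return result
```

Note the asymmetry that chromaticity introduces: the kNN sets are computed from the people's side, over restaurants only.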

  34. Summary of operators

  35. The Road Ahead • Plethora of choices in each step leads to the large variety of similarity search operators • And keeps researchers busy • Choices in • Similarity measures • Aggregation operators • Selection/filter operators • Additional features • Algorithmic features • Are we done yet?

  36. Let us invent some new Operators

  37. N-Match-BB • A bounding box query where at least N attribute bounds need to be satisfied • An adaptation of K-N-Match to bounding boxes • Unordered output; subset of attributes [Figure: for 1-Match-BB around Q, data points inside either of the two overlapping rectangles are OK]
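A sketch of the proposed operator (not an existing implementation): keep an object if at least N of its per-dimension bounds hold, adapting K-N-Match's "N attributes suffice" idea to the bounding-box filter.

```python
def n_match_bb(q, candidates, radii, n):
    # count the satisfied per-dimension bounds; keep if at least n hold
    return [p for p in candidates
            if sum(abs(pi - qi) <= r
                   for pi, qi, r in zip(p, q, radii)) >= n]
```

With n equal to the number of dimensions this degenerates to the ordinary bounding box; with n = 1 it accepts the union of the attribute-wise bands.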

  38. Multi-Query Bichromatic Reverse kNN • Combination of • Weighted Sum • Top-k Filter • Reverse (Indirection) • Multi-Query • Chromaticity • Example Use Case: Of the three chosen locations for Café X (all three are intended to be opened), find people who would find at least one of these locations among the k closest cafes
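A sketch of the composed operator for the café use case (all names and coordinates are illustrative): a person is returned if any of the candidate locations appears among their k nearest cafés.

```python
import math

cafes = {"X@a": (0, 0), "X@b": (6, 0), "X@c": (0, 6), "other": (3, 3)}
people = {"p1": (1, 0), "p2": (3, 2), "p3": (5, 1)}
candidates = {"X@a", "X@b", "X@c"}  # the three planned Cafe X locations

def mq_bi_rknn(k=1):
    hits = []
    for name, loc in people.items():
        # bichromatic: each person's kNN is computed over cafes only
        knn = sorted(cafes, key=lambda c: math.dist(loc, cafes[c]))[:k]
        if candidates & set(knn):  # multi-query OR merge over the set
            hits.append(name)
    return hits
```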

  39. Miscellaneous • Revisiting algorithms on new platforms • Hadoop/MR • Interpretability in results • Can results of similarity search be shown in a manner that highlights the intuitive similarity between the query and the result? • Syntactic and semantic features • Understand the dichotomy between syntactic similarity (e.g., shape similarity) and semantic similarity (e.g., two images being similar because both are maps) • Would modelling them differently, and learning when to weigh each highly, lead to more effective similarity search? • Contextual similarity; conditioning on user history • On searching for "IBM Watson", a travelling person should be shown IBM Watson Labs, whereas a technologist should be shown the IBM Watson system

  40. “Similarity lies in the eyes of the beholder”* Thank You! Questions/Comments? deepaksp@acm.org deepakp7@gmail.com * (Adapted from a famous quote) from http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt
