Tree-based indexing methods for similarity search in metric and nonmetric spaces

Tree-based indexing methods for similarity search in metric and nonmetric spaces. Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague. Mgr. Jakub Lokoč. Supervisor: Doc. RNDr. Tomáš Skopal, Ph.D.

Presentation Transcript


  1. Tree-based indexing methods for similarity search in metric and nonmetric spaces. Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague. Mgr. Jakub Lokoč. Supervisor: Doc. RNDr. Tomáš Skopal, Ph.D. MFF UK, Prague

  2. Presentation outline
     • Introduction
       • Similarity search
       • M-tree
     • Contributions & Results
       • Metric search
       • Nonmetric search
     • Outlook

  3. Similarity search
     • How to search in large collections of unstructured data?
       • We cannot use relational databases or textual annotation
       • Content-based similarity searching
       • Similarity → distance function δ → metric vs. nonmetric search
       • Feature extraction → feature space
     • Problems of similarity searching
       • Effectiveness → selection of complex descriptors and an (often expensive) distance function (not a DB problem)
       • Efficiency → indexing → exact vs. approximate search
     [Figure labels: query object, feature extraction, similarity evaluation]

  4. Similarity search - variants of δ
     • δ is metric
       • Allows indexing by metric access methods (e.g., M-tree)
       • Objects can be organized into separate clusters
     • δ is nonmetric
       • Robust similarity functions suitable for domain experts
       • Not constrained by metric axioms, but metric access methods then provide only approximate search
     • In our work, we have focused on FAST similarity search in metric and nonmetric spaces by M-tree

  5. M-tree
     • Structure and properties
       • Dynamic, balanced, and paged tree structure (like, e.g., B+-tree or R-tree)
       • The leaves are clusters of indexed objects Oj (ground objects)
       • Routing entries in the inner nodes represent hyper-spherical metric regions (Oi, rOi), recursively bounding the object clusters in the leaves
       • The triangle inequality allows irrelevant M-tree branches (i.e., their metric regions) to be discarded during query evaluation
     [Figure: range query Q (Euclidean 2D space)]
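
The pruning rule on the M-tree slide can be sketched as follows. This is a minimal illustration, not the M-tree implementation behind the presentation: the `Node` class, the `range_query` function, and the choice of Euclidean distance are assumptions made for the example. A subtree is skipped whenever δ(Q, Oi) > r + rOi, because the triangle inequality then guarantees that no object inside the region (Oi, rOi) lies within distance r of Q.

```python
from dataclasses import dataclass, field
from math import dist  # Euclidean distance, standing in for a generic metric δ


@dataclass
class Node:
    """Hypothetical M-tree node: a ball (center, radius) covering its subtree."""
    center: tuple
    radius: float
    children: list = field(default_factory=list)  # routing entries (inner node)
    objects: list = field(default_factory=list)   # ground objects (leaf)


def range_query(node, q, r, delta=dist):
    """Return all ground objects o under `node` with delta(q, o) <= r."""
    if not node.children:  # leaf: check every ground object directly
        return [o for o in node.objects if delta(q, o) <= r]
    hits = []
    for child in node.children:
        # Triangle inequality: if delta(q, center) > r + radius, the query
        # ball and the child's region cannot intersect, so skip the subtree.
        if delta(q, child.center) <= r + child.radius:
            hits.extend(range_query(child, q, r, delta))
    return hits


# Toy tree (assumed data): two leaf clusters under one root.
leaf_a = Node(center=(0, 0), radius=1.5, objects=[(0, 0), (1, 0), (0, 1)])
leaf_b = Node(center=(10, 10), radius=1.5, objects=[(10, 10), (11, 10)])
root = Node(center=(5, 5), radius=10.0, children=[leaf_a, leaf_b])

result = range_query(root, (0.5, 0), 1.0)
```

In this toy tree the region of `leaf_b` is discarded after a single distance computation to its center, so none of its ground objects are compared with the query. The sketch omits refinements that a real M-tree may apply, such as reusing stored distances to avoid computing δ(Q, Oi) at all.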

  6. Contributions to M-tree
     • New construction techniques
       • Forced reinserting
       • Hybrid-way leaf selection
       • Parallel dynamic batch loading
     • Nonmetric search
       • M-tree variant - NM-tree

  7. Forced reinserting
     • Insert new object O11
     • Remove O8 and O6 and insert them into the stack
     • Decrease the region's radius (to O11)
     • Insert O6 from the stack
     • Remove O2 and insert it into the stack
     • Decrease the region's radius (to O6)
     • Insert O2 from the stack
     • Insert O8 from the stack
     [Figure: M-tree regions over objects O1 to O11, with the reinsertion STACK]

  8. Hybrid-way leaf selection
     • First phase of inserting = find a suitable leaf for the new object
     • Classic selection strategies
       • Single-way – fast indexing, less compact hierarchy
       • Multi-way – slower indexing, more compact hierarchy
     • Our approach
       • User controls how many branches are visited
       • Finds a suboptimal leaf node
       • May return a full leaf node

  9. Experimental results: CoPhIR (color layout and structure), dimension 76, database size 250,000

  10. Parallel dynamic batch loading
      • Steps of one iteration: (1) aggregation, (2) parallel batch loading, (3) traditional inserting
      • Objects not inserted during parallel batch loading
        • "Split generating" – will be inserted in the traditional way (exploiting limited parallelism)
        • Postponed – will be inserted during the next batch
      • To find scalability bottlenecks we measured
        • Parallel batch loading time – PI
        • Time of traditional inserts causing a split – ICS
        • Time of traditional inserts not causing a split – INCS

  11. Experimental results: CoPhIR, database size 1,000,000; dimension 76 (12 + 64); L5.123456 distance; inner/leaf node size 24 / 25; cache size 512 MB

  12. Nonmetric search
      • Metric properties – too restrictive
        • Triangle inequality is the most attacked one
      • Semimetric distances (e.g., in molecular biology)
      • But how to search efficiently?
      [Slide labels: metric axioms (identity, non-negativity, symmetry, triangle inequality); two 2NN query examples]
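
To make the point about the triangle inequality concrete, here is a small sketch of a distance that satisfies identity, non-negativity, and symmetry but not the triangle inequality. The choice of squared Euclidean distance and the helper names `squared_l2` and `triangle_violations` are assumptions made for illustration; they do not come from the slides.

```python
from itertools import permutations


def squared_l2(x, y):
    """Squared Euclidean distance: a semimetric (no triangle inequality)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def triangle_violations(points, delta):
    """Return ordered triplets (x, y, z) with delta(x, z) > delta(x, y) + delta(y, z)."""
    return [
        (x, y, z)
        for x, y, z in permutations(points, 3)
        if delta(x, z) > delta(x, y) + delta(y, z)
    ]


# Three collinear points (assumed toy data).
points = [(0,), (1,), (2,)]
violations = triangle_violations(points, squared_l2)
```

For the three collinear points, going directly from 0 to 2 costs 4 under `squared_l2`, while the detour through 1 costs 1 + 1 = 2, so the triplet violates the triangle inequality in both directions.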

  13. Nonmetric search
      • Related work
        • MAMs can employ a semimetric dS for approximate search
        • Semimetric behavior can be tuned by transformation functions f (e.g., we can turn a semimetric into a metric dM = fM(dS))
        • More metric behavior – more precise, but slower search
        • Less metric behavior – less precise, but faster search
        • However, the M-tree is fixed to the employed (semi)metric (black-box distance)

  14. NM-tree
      • The trick
        • We use inversely symmetric transformation functions: dS = f^-1(f(dS))
        • fei and fM are evaluated in the initial phase
        • We index data using dM = fM(dS) (to allow exact searching)
        • Stored distances dM can be transformed back to dS = fM^-1(dM)
      • Retrieval precision ei at query time
        • dei = fei(fM^-1(fM(dS))) or just dei = fei(dS)
        • Metric search in upper levels (by dM)
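
The round trip between dS, dM, and dei can be sketched on a toy example. Everything concrete here is an assumption made for illustration and is not taken from the slides: squared Euclidean distance plays the role of dS, the square root plays the role of fM (which makes dM the ordinary Euclidean metric), and the power function `d ** w` is a hypothetical family of query-time functions fe.

```python
from math import isclose, sqrt


def d_s(x, y):
    """Semimetric dS: squared Euclidean distance (hypothetical example)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def f_m(d):
    """fM: increasing with f(0) = 0; here it turns dS into the Euclidean metric."""
    return sqrt(d)


def f_m_inv(d):
    """Inverse of fM, so that dS = fM^-1(fM(dS))."""
    return d * d


def make_f_e(w):
    """Hypothetical family of query-time functions fe(d) = d ** w."""
    return lambda d: d ** w


x, y, z = (0.0,), (1.0,), (2.0,)

# Index time: store metric distances dM = fM(dS).
d_m_xz = f_m(d_s(x, z))  # 2.0
d_m_xy, d_m_yz = f_m(d_s(x, y)), f_m(d_s(y, z))

# Query time: recover dS from the stored dM, then apply the chosen fe.
f_e = make_f_e(0.75)
d_e_from_index = f_e(f_m_inv(d_m_xz))
d_e_direct = f_e(d_s(x, z))
```

In this toy setting the exponent 0.75 lies between 0.5 (fully metric) and 1 (the original semimetric). The stored distances never have to be recomputed from the raw objects: applying the inverse of fM to the indexed value gives the same dei as evaluating fe on dS directly.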

  15. Experimental results

  16. Outlook
      • Metric search
        • Combination of more sophisticated M-tree construction techniques and parallelism
        • Adapting the techniques to M-tree descendants
        • Employing the M-tree as a dynamic clustering technique
      • Nonmetric search
        • Finding better "nonmetric to metric" transformation functions
        • Reusing other MAMs for nonmetric search

  17. References
      • Ciaccia, P., Patella, M., and Zezula, P. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997
      • Zezula, P., Savino, P., Rabitti, F., Amato, G., and Ciaccia, P. Processing M-Tree with Parallel Resources. EDBT 1998
      • Skopal, T., Pokorny, J., Kratky, M., and Snasel, V. Revisiting M-tree Building Principles. ADBIS 2003, LNCS 2798, Springer
      • Skopal, T. Unified Framework for Fast Exact and Approximate Search in Dissimilarity Spaces. TODS 2007, ACM

  18. Publications
      • Lokoč, J. and Skopal, T. On Reinsertions in M-tree. SISAP 2008, IEEE
      • Skopal, T. and Lokoč, J. NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces. DEXA 2008, LNCS 5181, Springer
      • Skopal, T. and Lokoč, J. New Dynamic Construction Techniques for M-tree. Journal of Discrete Algorithms, Elsevier, 2009
      • Lokoč, J. Parallel Dynamic Batch Loading in the M-tree. SISAP 2009, IEEE
      • Novák, J., Skopal, T., Hoksza, D., and Lokoč, J. Improving the Similarity Search of Tandem Mass Spectra using Metric Access Methods. SISAP 2010, ACM
      • Lokoč, J. and Skopal, T. On Applications of Parameterized Hyperplane Partitioning. SISAP 2010, ACM
      • Skopal, T. and Lokoč, J. Answering Metric Skyline Queries by PM-tree. DATESO 2010, CEUR
      • Skopal, T., Lokoč, J., and Bustos, B. D-cache: Universal Distance Cache for Metric Access Methods. Major revision, Transactions on Knowledge and Data Engineering

  19. Citations
      • Lokoč, J. and Skopal, T. 2008. On Reinsertions in M-tree. In SISAP ’08: Proceedings of the First International Workshop on Similarity Search and Applications. IEEE Computer Society, Washington, DC, USA, 121–128.
        • Roberto Uribe-Paredes, Gonzalo Navarro. EGNAT: A Fully Dynamic Metric Access Method for Secondary Memory. In SISAP ’09: Proceedings of the Second International Workshop on Similarity Search and Applications, pp. 57-64, August 29-30, 2009, Prague, Czech Republic
        • Marcos R. Vieira, Fabio J. T. Chino, Agma J. M. Traina, Caetano Traina Jr. Revisiting the DBM-Tree. Journal of Information and Data Management, Vol 1, No 1 (2010)
        • Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs, International Joint Conference on Computational Sciences and Optimization, IEEE, 2010

  20. Citations
      • Skopal, T. and Lokoč, J. 2009. New Dynamic Construction Techniques for M-tree. Journal of Discrete Algorithms, Elsevier 7 (1): 62–77.
        • Marcos R. Vieira, Fabio J. T. Chino, Agma J. M. Traina, Caetano Traina Jr. Revisiting the DBM-Tree. Journal of Information and Data Management, Vol 1, No 1 (2010)
        • Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs, International Joint Conference on Computational Sciences and Optimization, IEEE, 2010
        • Kaster D., Bueno R., Bugatti P., Traina A., Traina C. Jr., Incorporating Metric Access Methods for Similarity Searching on Oracle Database, SBBD 2009

  21. Citations
      • Lokoč, J. 2009. Parallel Dynamic Batch Loading in the M-tree. In SISAP ’09: Proceedings of the Second International Workshop on Similarity Search and Applications, pp. 117-123, August 29-30, 2009, Prague, Czech Republic
        • Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs, International Joint Conference on Computational Sciences and Optimization, IEEE, 2010

  22. Thank you for your attention

  23. Answers (V. Dohnal)
      • σmax is not defined
        • σmax is the maximal distance in the distance space
      • Similarity join (SJ) is not a multi-example query type
        • I agree - SJ is rather a complex operator consisting of multiple single-example queries
      • What other costs must be taken into account?
        • When the distance function is cheap (e.g., Lp metrics), we have to take into account the internal overhead of the particular MAM (e.g., pivot space filtering in pivot tables)
      • Missing database size for figure 1.14
        • DbSize = 100,000

  24. Answers (V. Dohnal)
      • How are leaf node overflows solved during stack processing in conservative reinsertions?
        • We perform a regular split
      • If HW leaf selection is unsuccessful, SW leaf selection is used. Does SW leaf selection employ pre-computed distances from HW?
        • We do not use distances from HW leaf selection, since HW leaf selection is usually successful and hence we have kept the algorithm simple (which reduces internal CPU costs)
        • Moreover, it can be solved by the D-cache (see publications)

  25. Answers (V. Dohnal)
      • How is the number of dimensions (x axis) changed in figure 3.6?
        • We used a 76-dimensional vector concatenated from two features (12 + 64) and took prefixes of this vector
      • What causes the fluctuations in query costs in figure 3.9?
        • Reinserting behavior is chaotic with respect to the increasing number of removed objects
      • A radius change can be propagated to the upper levels of the M-tree; how is this process synchronized?
        • The radius is not propagated to the upper levels (to improve parallel performance), but this is a topic of our future work

  26. Answers (V. Dohnal)
      • What algorithms have been used during the first two steps of the parallel batch loading iteration?
        • In the first step, we just used a simple list for aggregating new objects. In the second step, each thread used SW leaf selection with exclusive locks for radius updates.
      • What is the motivation for the random heuristic?
        • The random heuristic can be faster when the distance measure is cheap. Moreover, we wanted to test whether randomly selected objects cause more splits.
      • DB size is 1,000,000 and batch size is 200; why is the number of iterations > 5000?
        • Because not all objects from a batch are inserted during one iteration.
      • ICS and INCS stand for the number of real insertions (ICS = number of leaf node splits)
      • What is residue time?
        • Residue aggregates real-time overhead and I/O cost.
      All other comments will be addressed in the online version, and I thank you for them.
