DANG Tran Khanh, KÜNG Josef, WAGNER Roland Institute for Applied Knowledge Processing (FAW)

ε-ISA: AN INCREMENTAL LOWER BOUND APPROACH FOR EFFICIENTLY FINDING APPROXIMATE NEAREST NEIGHBOR OF COMPLEX VAGUE QUERIES DANG Tran Khanh, KÜNG Josef, WAGNER Roland Institute for Applied Knowledge Processing (FAW) Johannes Kepler University of Linz, Austria

OUTLINE • Complex Vague Queries in the Vague Query System (VQS) ◦ Similarity search problem of the VQS in the conventional DBMSs • Incremental hyper-Sphere Approach (ISA) ◦ Overcome shortcomings of Incremental hyper-Cube Approach (ICA) • ε-ISA: Finding Approximate Nearest Neighbors of Complex Vague Queries ◦ The issue of the dimensionality curse ◦ The issue of increasing the query condition number • Experimental Results • Conclusions

COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM • The VQS: • Introduced by Kueng and Palkoska 1997 • Support similarity search capabilities in the conventional DBMSs: return to users records semantically close to a given query • One of the VQS’s basic ideas: • NCR-Tables (Numeric-Coordinate-Representation-Tables): keep numeric semantic information of non-numeric attributes

COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM • NCR-Tables – an example: SELECT FROM Car WHERE Col IS ‘dark blue’ INTO myResultTable; [Figure: an NCR-table, showing the NCR-key (the fuzzy field) and the NCR-columns]
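The NCR-Table idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the table name `NCR_COLOUR`, the helper `closest_values`, and the coordinate values are invented and are not taken from the VQS. It only shows how numeric coordinates let a non-numeric value such as ‘dark blue’ be compared with other values by distance.

```python
import math

# Hypothetical NCR-Table: NCR-key (fuzzy field value) -> NCR-columns (numeric coordinates).
# The coordinates below are invented for illustration only.
NCR_COLOUR = {
    "dark blue": (0.0, 0.0, 0.55),
    "blue": (0.0, 0.0, 1.0),
    "black": (0.0, 0.0, 0.0),
    "red": (1.0, 0.0, 0.0),
}

def closest_values(ncr_table, query_key):
    """Return all NCR-keys ordered by Euclidean distance to the query key."""
    q = ncr_table[query_key]
    return sorted(ncr_table, key=lambda k: math.dist(q, ncr_table[k]))

ranking = closest_values(NCR_COLOUR, "dark blue")
```

With these made-up coordinates, ‘blue’ ranks ahead of ‘black’ and ‘red’ as a substitute for ‘dark blue’, which is the kind of semantic closeness a vague query is meant to exploit.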

COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM • Complex Vague Queries in VQS: a simplified view of the problem [Figure: the vague query processing module uses Index 1 … Index n over NCR-Table 1 … NCR-Table n, together with a query relation whose columns are Attribute 1 … Attribute n and whose rows run from Value_11 … Value_n1 to Value_1k … Value_nk]

COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM • The issue of the dimensionality curse [Weber et al 1998; Beyer et al 1999] • NCR-Tables with high-dimensional data: • The probability of overlaps between a query and data regions is very high, and thus the performance of multidimensional access methods (MAMs) decreases significantly • A linear scan over the whole data set would perform better than MAMs • Approximate nearest neighbor problem: • dist(Q, P) ≤ (1+ε)·dist(Q, P’) (1), where P’ is the exact nearest neighbor of Q and P is the returned (1+ε)-approximate nearest neighbor • Studied almost exclusively for single data sets: single-feature nearest neighbor (S-FNN) queries [Arya et al 1998, Kleinberg 1997, Amato et al 2000, Ciaccia and Patella 2000, etc.]
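Condition (1) can be checked directly on a small data set. The sketch below is illustrative only: the function name `is_approx_nn` and the sample points are assumptions, and the exact nearest neighbor is found by a brute-force scan rather than by an index.

```python
import math

def is_approx_nn(q, p, data, eps):
    """True if p satisfies dist(q, p) <= (1 + eps) * dist(q, exact NN of q in data)."""
    exact_nn_dist = min(math.dist(q, x) for x in data)
    return math.dist(q, p) <= (1.0 + eps) * exact_nn_dist

# Illustrative points: the exact nearest neighbor of q is (1.0, 0.0) at distance 1.0.
data = [(1.0, 0.0), (0.0, 1.2), (3.0, 3.0)]
q = (0.0, 0.0)

accepted = is_approx_nn(q, (0.0, 1.2), data, eps=0.5)   # 1.2 <= 1.5
rejected = is_approx_nn(q, (0.0, 1.2), data, eps=0.1)   # 1.2 >  1.1
```

A larger ε accepts points farther from the query, which is what allows an approximate search to stop earlier.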

COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM • Solving Complex Vague Queries in VQS: “Random access” [Fagin 1996] is impossible [Figure: NCR-Tables (attribute, domain, and values x1, x2, … and y1, y2, …) and a query relation with columns Attr1 and Attr2 and rows such as (x1, y1), (x1, y2), (x2, y1)]

COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM • Incremental hyper-Cube Approach (ICA) [Kueng and Palkoska 1999] • Issues with the ICA: see [Dang et al 2002a, Dang et al 2002b] for the details • How to determine the initial hyper-cubes? • How to extend the hyper-cubes when necessary? • Unnecessary disk pages and objects are accessed • Disk accesses are repeated • Only the best match record is returned (not the top-k records)

INCREMENTAL HYPER-SPHERE APPROACH (ISA) • Input: • A query relation/view S • A complex vague query Q with n query conditions qi (i=1, 2… n) • Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure Fi • Output: • Best match record/tuple Tmin for Q, Tmin ∈ S. Ties are arbitrarily broken. • Step 1: Search on each Fi for the corresponding qi using the adapted incremental algorithm for hyper-sphere range queries. • Step 2: Combine the search results from all qi to find at least one appropriate record in S, i.e. a record that contains the returned NCR-Values with respect to each query condition. If no appropriate record is found, go back to step 1. • Step 3: Compute total distances/scores for the found records using formula 2 and find a record Tmin with the minimum total distance TDcur. Ties are arbitrarily broken.

INCREMENTAL HYPER-SPHERE APPROACH (ISA)

INCREMENTAL HYPER-SPHERE APPROACH (ISA) • Step 4: Compute the maximum searching radius for each qi with respect to TDcur using formula 3 and continue the search as in steps 1, 2 and 3 until one of the following two conditions holds: (a) the current searching radius of each qi is greater than or equal to its maximum searching radius; (b) a new appropriate record Tnew is found with total distance TDnew<TDcur • Step 5: If condition (a) holds, return Tmin as the best match for Q. Otherwise, i.e. condition (b) holds, replace Tmin with Tnew (so TDcur is replaced with the smaller value TDnew) and go back to step 4
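The five steps can be sketched as a small in-memory simulation. This is a sketch under stated assumptions, not the authors' implementation. Formulas 2 and 3 are not reproduced in the slide text, so the code assumes a total distance that is a weighted sum of distances already normalized to [0, 1], with weights summing to 1, and a maximum radius of TDcur / wi derived from that assumption. Each index Fi is replaced by a list pre-sorted by distance, and the names `isa_best_match`, `records`, `ranked` and `weights` are hypothetical. The sketch also merges the "no record found yet" loop of steps 1 and 2 into the main loop by starting with TDcur = ∞.

```python
def isa_best_match(records, ranked, weights):
    """
    records: list of tuples, one NCR-Value per query condition (the relation S).
    ranked:  ranked[i] is a list of (distance, ncr_value) sorted by ascending
             distance; it stands in for incremental search on index F_i.
    weights: positive weights, ASSUMED to sum to 1 (assumed form of formula 2).
    Returns (best_record, total_distance); (None, inf) if nothing matches.
    """
    n = len(ranked)
    pos = [0] * n                   # next unread entry in each feature space
    seen = [{} for _ in range(n)]   # NCR-Values returned so far -> distance
    scored = set()                  # records whose total distance is known
    best, td_cur = None, float("inf")

    while True:
        # Step 4 (ASSUMED formula 3): a record whose i-th distance reaches
        # td_cur / w_i cannot have a total distance below td_cur.
        max_radius = [td_cur / w for w in weights]
        active = [i for i in range(n)
                  if pos[i] < len(ranked[i])
                  and ranked[i][pos[i]][0] < max_radius[i]]
        if not active:              # condition (a): every search has stopped
            return best, td_cur

        # Step 1: extend each active search by one returned NCR-Value.
        for i in active:
            dist, value = ranked[i][pos[i]]
            pos[i] += 1
            seen[i][value] = dist

        # Steps 2-3: score records whose values have all been returned.
        for rec in records:
            if rec not in scored and all(rec[i] in seen[i] for i in range(n)):
                scored.add(rec)
                td = sum(weights[i] * seen[i][rec[i]] for i in range(n))
                if td < td_cur:     # condition (b): a better record was found
                    best, td_cur = rec, td

# Illustrative data: two query conditions with equal weights.
ranked = [
    [(0.0, "a1"), (0.1, "a2"), (0.6, "a3")],
    [(0.0, "b1"), (0.2, "b2"), (0.9, "b3")],
]
records = [("a1", "b3"), ("a2", "b2"), ("a3", "b1")]
best, td = isa_best_match(records, ranked, [0.5, 0.5])
```

In this example the search stops after two values per feature space: once ("a2", "b2") is scored, the next unread distances (0.6 and 0.9) already exceed the assumed maximum radii, so the remaining values are never read. Rescanning all records on every round keeps the sketch short; a real system would join against S through the fuzzy fields instead.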

INCREMENTAL HYPER-SPHERE APPROACH (ISA) • Modifying ISA to retrieve top-k records: see [Dang et al 2002b] • High-dimensional feature spaces and/or an increasing number of query conditions → ISA performance decreases

ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES • CVQ = M-FNN (Multi-Feature Nearest Neighbor) query • Using lower bound total distance (LBTD)

ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES • Input: • A query relation/view S • A complex vague query Q with n query conditions qi (i=1, 2… n) • Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure Fi • A real ε>0 used as the error tolerance • Output: • (1+ε)-approximate NN record/tuple Tapp for Q, Tapp ∈ S. Ties are arbitrarily broken. • Step 1: Search on each Fi for the corresponding qi using the adapted incremental algorithm for hyper-sphere range queries. • Step 2: Combine the search results from all qi to find at least one appropriate record in S, i.e. a record that contains the returned NCR-Values with respect to each query condition. If no appropriate record is found, go back to step 1. • Step 3: Compute total distances/scores for the found records using formula 2 and find a record Tapp with the minimum total distance TDcur. Ties are arbitrarily broken.

ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES • Step 4: Let di be the distance from query condition qi to the last NCR-Value returned in the corresponding feature space, which is managed by Fi. Compute LBTD as follows: LBTD = min {TDcur, di}, i=1,2…n (5) • Step 5: If TDcur <= (1+ε)LBTD, return Tapp as a (1+ε)-approximate NN record for Q. Otherwise, go to step 6 • Step 6: Compute the maximum searching radius for each qi with respect to TDcur using formula 3 and continue the search as in steps 1 to 5 until the algorithm stops at step 5. If the current searching radius of a certain qi is greater than or equal to its maximum searching radius, searching on Fi is stopped
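The stopping test of steps 4 and 5 can be sketched as follows. The sketch implements formula (5) exactly as it appears on the slide, and assumes that each di is already expressed on the same scale as TDcur; the slide text does not state how di is normalized or weighted. The function names `lbtd` and `can_stop` are hypothetical.

```python
def lbtd(td_cur, last_distances):
    """Formula (5) as printed: the minimum over TDcur and all d_i."""
    return min([td_cur, *last_distances])

def can_stop(td_cur, last_distances, eps):
    """Step 5: stop once TDcur <= (1 + eps) * LBTD."""
    return td_cur <= (1.0 + eps) * lbtd(td_cur, last_distances)

# Illustrative values: the best record so far has TDcur = 0.3 and the last
# NCR-Values returned in the two feature spaces lie at 0.2 and 0.28.
stop_loose = can_stop(0.3, [0.2, 0.28], eps=1.0)   # 0.3 <= 2.0 * 0.2
stop_tight = can_stop(0.3, [0.2, 0.28], eps=0.2)   # 0.3 >  1.2 * 0.2
```

With a tolerant ε the search ends immediately, while a small ε forces it to continue until the lower bound rises closer to TDcur. Once every di has reached TDcur, LBTD equals TDcur and the test passes for any ε>0.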

ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES • Lower Bound Total Distance: an example [Figure: example with points labeled A, B, C, D]

ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES • Approximate k-nearest neighbors • See our paper for more details

EXPERIMENTAL RESULTS • Data sets: • Uniformly distributed: 2, 4, and 8 dimensions (100K objects for each of them) • Real: 9 and 16 dimensions (more than 64K feature vectors of images, URL: http://kdd.ics.uci.edu/) • Using the SH-tree [Dang et al 2001a] to manage multidimensional data • Page size: 8KB • 100 query points were randomly selected from each corresponding data set • ...

EXPERIMENTAL RESULTS • 2-condition (4-d and 8-d) NN queries, different ε values

EXPERIMENTAL RESULTS • 2-condition (4-d) k-NN queries, ε = 0.2

EXPERIMENTAL RESULTS • 3-condition (2-d) NN queries, different ε values • 2-condition NN queries (9-d and 16-d real data sets), ε=1 • ε=1 means that an error of up to 100% is tolerated • ε-ISA saved about 4.5% of the affected objects and 1% of the disk accesses for the 16-d data set, while keeping the accuracy at 71% • One notable fact here is that the effective epsilon, calculated as introduced in (Arya et al. 1998), is quite low, only 0.23. This is a very promising result.

CONCLUSIONS • ε-ISA: An Incremental Lower Bound Approach for Efficiently Finding Approximate Nearest Neighbor of Multi-Feature Queries in VQS • ε-ISA is one of the pioneering solutions for dealing with this problem • ε-ISA is very useful for application domains in which the returned results need not be exact but only similar or approximately similar (within a certain error tolerance) to a given query. The experimental results support this. With a suitable ε value, ε-ISA can save a very high percentage of the costs, both IO-cost and CPU-cost, while still preserving a very high accuracy of the returned results • ε-ISA is applicable not only to numeric domains such as NCR-tables, but also to any ranked input • Application areas: TIS (tourist information systems), GIS, digital libraries, multimedia systems, etc.

Thank you !! • More information • URL: http://www.faw.uni-linz.ac.at/ • E-mail: {khanh, jkueng, rwagner}@faw.uni-linz.ac.at

Research related to dealing with complex vague queries • The A0 algorithm [Fagin 1996] (there are some improvements of Fagin’s algorithm, see the paper for more details): • Finding top-k matches for a user query involving several multimedia attributes • Problem: this algorithm assumes that random access is possible in the system. This assumption holds only if the following three conditions hold: • (1) there is at least one key for each subsystem, • (2) there is a mapping between the keys, • (3) the mapping is one-to-one • In VQS: condition (1) is always satisfied (each fuzzy field is the key of the corresponding NCR-table), but there is no one-to-one mapping between the fuzzy fields • → Cannot be applied to our problem

Research related to dealing with complex vague queries (cont.) • Other approaches for multimedia databases: [Ortega et al 1997, Chaudhuri et al 1996, Boehm K. et al 2001] (see our paper) • Chaudhuri et al. 1999 introduced a solution to translate a top-k multi-feature query to a range query that the conventional DBMS can process. This approach employs information in the histograms kept by a relational system …

ISA and J* algorithm
