

  1. Database Seminar. The Gauss-Tree: Efficient Object Identification in Databases of Probabilistic Feature Vectors. Authors: Christian Böhm, Alexey Pryakhin, Matthias Schubert. Published in: ICDE 2006. Presenter: Chui Chun Kit. Supervisor: Dr. Kao. Date: 18th Jan 2007

  2. Presentation Outline • Introduction • Object identification problem • Motivation of modeling data with uncertainty • The Gaussian Uncertainty Model • Identification Queries • k-most likely identification query (k-MLIQ) • Threshold identification query (TIQ) • General Solution • The Gauss-Tree Data Structure • Experimental Results • Conclusion

  3. Introduction • Suppose Peter is a detective … • The police force maintains a number of criminal databases; one of them stores the pictures of criminals and suspects. • One typical task for Peter is to search for suspects who look similar to a picture he took or to a drawing provided by a witness. • E.g. given the query image, retrieve the most similar image from the database. [Figure: a query image compared against the images in the database]

  4. Introduction • Suppose Peter is a detective … • To do the similarity search, each object is pre-processed to extract its features. • Each feature describes one property of the image, e.g. the breadth of the nose. • An image can be described by a set of features; a set of feature values forms a feature vector. • The collection of feature vectors forms a database of images; we work on this database rather than on the original images. • A d-dimensional feature vector consists of d feature values describing the object. [Figure: the "breadth of nose" feature, with values 2 cm and 3 cm for two database images and 3.2 cm for the query image]

  5. Introduction • Suppose Peter is a detective … • By defining a distance function (e.g. the Euclidean distance) on the feature vectors, we can assume that the distance between feature vectors corresponds to the dissimilarity of the objects.

  6. Introduction • To identify the most similar object to the query image, we can retrieve the nearest neighbor of the query image in the database. • Image 2 is regarded as the most similar object to the query image because the distance between image 2 and the query image is the smallest. [Figure: query q and images 1 and 2 plotted against breadth of nose and breadth of lips]

  7. Introduction • In reality, the database images, as well as the query image, are often represented by feature vectors with a varying degree of exactness or uncertainty. • For example, most data collections of facial images contain images that were not all taken under the same illumination or sharpness. • The observed feature value may differ from the true feature value; two feature values describing the same object can be significantly different from each other. • Data uncertainty can be represented by an uncertainty region: although the observed value of the breadth of lips of the query image is 3.2 cm, its actual value may vary from 2 to 4 cm. • Here, the feature value "breadth of lips" of image 3 is quite different from that of the query image; nevertheless, once we consider data uncertainty, image 3 becomes more likely to be the answer than image 2. [Figure: uncertainty regions around q and images 1, 2 and 3 in the feature space]

  8. Introduction • The degree of similarity between observed and exact values can vary from feature to feature, because some features cannot be determined as exactly as others. • E.g. it is much easier to determine the breadth of the nose than the breadth of the lips. (Why?) • This kind of data uncertainty is common in face recognition, fingerprint analysis and voice recognition. • This motivates the authors to propose an uncertainty model for the feature vectors.

  9. The Gaussian Uncertainty Model

  10. The Gaussian Uncertainty Model • A feature value µi in a feature vector is an observation, e.g. from a sensor. • Due to measurement error, the observed value µi may differ from the true value xi. • The authors assume that the measurement error follows a Gaussian distribution around the true feature value xi (the mean) with a known standard deviation δi. • Graphically, a Gaussian distribution is a bell-shaped probability density function, as written below.
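
For reference, the Gaussian density N_{xi,δi} used in the following slides has the standard form:

```latex
N_{x_i,\delta_i}(\mu_i) \;=\; \frac{1}{\sqrt{2\pi}\,\delta_i}\,\exp\!\left(-\frac{(\mu_i - x_i)^2}{2\delta_i^2}\right)
```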

  11. The Gaussian Uncertainty Model • The probability density that the value µi is observed, given that xi is the true value, can be calculated by substituting µi into the Gaussian function N_{xi,δi}. • However, we often want the reverse: the probability density of the true feature value xi for the observed feature value µi.

  12. The Gaussian Uncertainty Model • Fortunately, Gaussians are symmetric, so the question from the previous slide can be turned around.

  13. The Gaussian Uncertainty Model • Fortunately, Gaussians are symmetric: the probability density of µi in the Gaussian of xi is the same as the probability density of xi in the Gaussian of µi. • Therefore, we can use the Gaussian distribution with mean equal to the observed value µi to determine the probability density of the true feature value xi (e.g. coming from the query), as written below.
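
In symbols, the symmetry used here reads (it follows directly from the density depending only on the squared difference):

```latex
N_{x_i,\delta_i}(\mu_i) \;=\; N_{\mu_i,\delta_i}(x_i)
```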

  14. The Gaussian Uncertainty Model • A probabilistic feature vector (pfv) is a vector consisting of d pairs of observed feature values µi and standard deviations δi. • Each pair of µi and δi defines a Gaussian distribution of the true feature value xi. [Figure: probability densities of three 2-dimensional probabilistic feature vectors over features F1 and F2]

  15. The Gaussian Uncertainty Model • A probabilistic feature vector (pfv) is a vector consisting of d pairs of observed feature values µi and standard deviations δi; each pair defines a Gaussian distribution of the true feature value xi. • For d-dimensional feature vectors, the probability density for observing a query vector q of actual values, under the condition that we already observed v for the same data object, can be calculated as shown below. • Here the object v is a pfv; the query q is a feature vector without uncertainty.
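
Assuming the per-dimension errors are independent, the density the slide refers to is the product of the per-dimension Gaussians (a reconstruction in the notation above):

```latex
p(q \mid v) \;=\; \prod_{i=1}^{d} N_{\mu_i,\delta_i}(q_i)
```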

  16. Identification Queries

  17. Queries on a database of pfv • For the identification task, we would like to calculate the conditional probability that a query q belongs to a pfv v, under the condition that q belongs to one pfv of the set of all considered pfvs in the database. • Recall that we have already discussed how to calculate p(q|v); we therefore use Bayes' theorem to rewrite the conditional probability, as shown below, where p(q|w) is the conditional probability of observing q under the condition that we have already observed w for the same object. • P(v) is the probability that object v is the answer to a query at all. • The authors assume that P(v) is the same for every object, and thus we can cancel it in the fraction when using P(v|q) for comparison.
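
The Bayesian rewrite described above, with the uniform prior cancelling in the second step (a reconstruction of the slide's formula):

```latex
P(v \mid q) \;=\; \frac{p(q \mid v)\,P(v)}{\sum_{w \in DB} p(q \mid w)\,P(w)}
\;=\; \frac{p(q \mid v)}{\sum_{w \in DB} p(q \mid w)}
```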

  18. Queries on a database of pfv • Once we can determine the probability P(v|q), we can define two types of queries (a sketch of both follows below): • Threshold Identification Query (TIQ): e.g. give me all persons in the database that could be shown on a given image with a probability of at least 10%. • k-Most-Likely Identification Query (k-MLIQ): e.g. give me the 10 most likely persons in the database that are shown on a given image. • The feature vectors of the query and of the objects can both be uncertain, i.e. they are all probabilistic feature vectors.
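
As a concrete illustration, here is a minimal sketch of both query types over a sequential scan, assuming an exact query vector and pfvs stored as hypothetical lists of (mu, sigma) pairs; the Gauss-tree index comes later:

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density N_{mu,sigma}(x)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def p_q_given_v(q, v):
    """p(q|v) for an exact query vector q and a pfv v = list of (mu, sigma) pairs."""
    p = 1.0
    for qi, (mu, sigma) in zip(q, v):
        p *= gauss(qi, mu, sigma)
    return p

def posteriors(q, db):
    """P(v|q) for every pfv v in db, assuming a uniform prior P(v)."""
    likelihoods = [p_q_given_v(q, v) for v in db]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

def tiq(q, db, eps):
    """Threshold Identification Query: all objects with P(v|q) >= eps."""
    return [(i, p) for i, p in enumerate(posteriors(q, db)) if p >= eps]

def k_mliq(q, db, k):
    """k-Most-Likely Identification Query: the k objects with the highest P(v|q)."""
    ranked = sorted(enumerate(posteriors(q, db)), key=lambda ip: ip[1], reverse=True)
    return ranked[:k]
```

For example, `k_mliq(q, db, 1)` returns the single most likely object, while `tiq(q, db, 0.12)` would additionally report any object whose posterior reaches 12%, matching the example on slide 20.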

  19. Queries on a database of pfv • Given three facial images O1, O2 and O3 of varying quality that are stored in a database, and a query image Q. • Feature F1 is particularly sensitive to the rotation angle; F2 is sensitive to illumination. • Object O1 was taken under good conditions: both features are relatively accurate. • For O3 the rotation was bad but the illumination was good. • For the query object, the rotation was good but the illumination was bad.

  20. Queries on a database of pfv • We can recognize that O3 must be the object with the highest probability of describing the same object as specified by the query Q. • From the previous formulae: P(O1|Q) = 10%, P(O2|Q) = 13%, P(O3|Q) = 77%. • A k-MLIQ with k = 1 would report O3 as the result. • A TIQ with a threshold probability ε = 12% would additionally report O2.

  21. Queries on a database of pfv • A similarity query using the Euclidean distance would obtain three rather similar distances: d(Q,O1) = 1.53, d(Q,O2) = 1.97, d(Q,O3) = 1.74. • Thus the nearest neighbor would be O1, which is excluded when the variances are taken into account.

  22. Preliminary experiment • To further illustrate that the NN approach is not suitable for uncertain data, the authors conducted a preliminary experiment comparing the precision and recall rates of: • the nearest-neighbor query (does not consider data uncertainty), and • the most-likely identification query (considers data uncertainty). • Recall is the percentage of correct answers retrieved. • Precision is the percentage of correct objects in the answer set.

  23. Preliminary experiment • Dataset (an image database): • 10,987 database objects. • 27-dimensional color histograms. • To describe the images as probabilistic feature vectors, the authors complemented each dimension with a randomly generated standard deviation. • 100 objects were selected and modified to serve as query objects: for each, a new observed mean value was generated w.r.t. the corresponding Gaussian.

  24. Preliminary experiment • MLIQ, which considers data uncertainty, achieved 98% precision and recall. • The rates are not 100% because the query images are perturbed, and a few of them may look very different from the original image. • The NN query, which uses the mean value only, displayed a precision and recall rate of 42%. • Even when the result size is increased, the recall rate does not increase much, but the precision rate drops significantly. • Therefore, the authors conclude that the nearest-neighbor method, which uses only the mean value as the true feature value, is not suitable for similarity search over uncertain data. [Figure: precision and recall vs. result-set size; x2 means retrieving the 2-NN]

  25. General Solution: when the query is also a probabilistic feature vector

  26. Queries on a database of pfv • Given a query vector q with exact feature values and a database of pfvs v, we would like to determine P(v|q) for each pfv v; here we apply Bayes' theorem, and P(v) and P(w) are the same for all objects. • For this we calculate p(q|v), where d is the dimensionality: for each dimension i, we calculate the probability density for observing the exact value qi under the condition that we already observed vi for the same object, i.e. a Gaussian density. • Question: how about when q is itself a pfv? Then we need to calculate p(qi|vi) where the query value qi is uncertain, i.e. itself a Gaussian distribution.

  27. Queries on a database of pfv • If both the object v and the query q are pfvs, we have to consider all possible positions when calculating p(qi|vi). • Recall that if the query value q is an exact value, we simply substitute q into the Gaussian of v to obtain p(q|v); in this case, however, q is also a pfv, with mean µq and standard deviation δq. [Figure: the probability distribution of the query q in dimension i (mean µq) and of an object v in dimension i (mean µv)]

  28. Queries on a database of pfv • If both the object v and the query q are pfvs, we have to consider all possible positions when calculating p(qi|vi): for every position x', the probability that x' is the true value of the object v times the probability that x' is the true value of the query q gives the probability that both v and q have the true value x'; integrating over all x' yields p(qi|vi). [Figure: the two Gaussians with a candidate true value x' between µv and µq]

  29. The joint probability • An interesting lemma, shown below, gives the joint probability of the query pfv q and the object pfv v (two Gaussians) in closed form. • The lemma reduces the more general case to the easier case in which one of the objects is exact and the other is a pfv.
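
Reconstructed in the notation above, the lemma states that the integral over the product of two Gaussians is itself a Gaussian density, evaluated at one mean with the combined standard deviation:

```latex
p(q_i \mid v_i)
\;=\; \int_{-\infty}^{\infty} N_{\mu_{q,i},\,\delta_{q,i}}(x)\; N_{\mu_{v,i},\,\delta_{v,i}}(x)\; dx
\;=\; N_{\mu_{v,i},\,\sqrt{\delta_{q,i}^2 + \delta_{v,i}^2}}\!\left(\mu_{q,i}\right)
```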

  30. The joint probability • Instead of calculating the integral, we can calculate the joint probability p(qi|vi) by substituting the mean of the query value into a single Gaussian function centered at the object's mean, with the combined standard deviation. [Figure: the two Gaussians with means µv and µq collapsed into one evaluation]

  31. Queries on a database of pfv • Given a query pfv q and a database of pfvs v: • We would like to calculate P(v|q) for each pfv v. • For this we calculate p(q|v), where d is the dimensionality. • For each dimension i, we calculate the probability density for observing the pfv qi under the condition that we already observed vi for the same object, i.e. the joint probability between q and v (see the sketch below).
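
Extending the sequential-scan sketch from slide 18 (reusing its gauss helper), the lemma makes the uncertain-query case a one-line change per dimension:

```python
import math

def p_q_given_v_pfv(q, v):
    """p(q|v) when the query q is itself a pfv: per dimension, evaluate a
    Gaussian centered at the object's mean, with the combined standard
    deviation, at the query's mean (the closed form of the lemma).
    Both q and v are lists of (mu, sigma) pairs; gauss() is defined above."""
    p = 1.0
    for (mu_q, sigma_q), (mu_v, sigma_v) in zip(q, v):
        combined = math.sqrt(sigma_q ** 2 + sigma_v ** 2)
        p *= gauss(mu_q, mu_v, combined)
    return p
```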

  32. General Solution • The general solution is already a stand-alone solution operating on top of a sequential scan of the database. • The authors propose the Gauss-tree, an index over the database objects, to improve query efficiency. • The general approach is then used as a refinement step.

  33. The Gauss-Tree

  34. The Gauss-Tree • The Gauss-tree is a balanced tree from the R-tree family. • It indexes not the space of the spatial objects (i.e. the Gaussians) but instead the parameter space: • mean • standard deviation (uncertainty value)

  35. The Gauss-Tree • A Gauss-tree of degree M is a search tree where the following properties hold: • The root has between 1 and M entries. • All other inner nodes have between M/2 and M entries each. • A leaf node has between M and 2M entries. • An inner node with k entries has k child nodes.

  36. The Gauss-Tree • A Gauss-tree of degree M is a search tree where the following properties hold: • An entry of a non-leaf node is a minimum bounding rectangle (MBR) of dimensionality 2*d, defining upper and lower bounds for • every feature value: [µmin, µmax], and • every uncertainty value (standard deviation): [δmin, δmax]. • For 1-dimensional feature vectors, the MBR is 2-dimensional. (A data-layout sketch follows below.)
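
As a data-layout sketch (hypothetical field names; the paper does not prescribe a concrete layout), one directory entry might look like this:

```python
from dataclasses import dataclass

@dataclass
class GaussTreeEntry:
    """One entry of a non-leaf Gauss-tree node: a 2*d-dimensional MBR in
    parameter space, bounding the means and standard deviations of all
    pfvs stored in the child's subtree."""
    mu_min: list       # lower bound of the mean, per dimension
    mu_max: list       # upper bound of the mean, per dimension
    sigma_min: list    # lower bound of the standard deviation, per dimension
    sigma_max: list    # upper bound of the standard deviation, per dimension
    child: object = None   # the child node this entry points to
```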

  37. The Gauss-Tree • An object A inside a node represents a Gaussian distribution, here with mean 7 and standard deviation 1. • Object B has the same mean value as object A but a larger standard deviation. • Note that the MBR bounds only the parameter space; the probability density for positions outside the mean range does NOT equal zero! [Figure: the densities of objects A to E over x, and the same objects as points in the (µ, δ) parameter space with µ between 7 and 8 and δ around 1]

  38. The Gauss-Tree • We would like to derive the maximum of the probability densities of the objects stored in the subtree of a node. • Graphically, this is the upper envelope of the probability densities of the objects stored in the node/subtree. • Again, the MBR bounds only the parameter space; the probability density for positions outside the mean range does NOT equal zero! [Figure: the densities of A to E and their upper envelope]

  39. The Gauss-Tree • Which distribution corresponds to the maximum probability density in each of the colored regions? • It is obvious that the distributions covering the area with x < µmin must have their mean equal to µmin (the closest admissible mean). • The problem is to determine the standard deviation δ such that N_{µmin,δ}(x) is maximized. • In the example, the distribution of object C yields the maximum probability density in the leftmost area because C has the smallest mean and the largest variance, while object A yields the maximum in the adjacent area because A has the smallest mean and variance.

  40. The Gauss-Tree • We can determine the δ value which maximizes N_{µ,δ}(x) by setting the derivative with respect to δ to zero. • (As before: the distributions covering the area with x < µmin must have mean µmin; the problem is to determine the standard deviation δ such that N_{µmin,δ}(x) is maximized.)

  41. The Gauss-Tree • We can determine the δ value which maximizes N_{µ,δ}(x) by setting the derivative with respect to δ to zero. • The only positive solution obtains a local maximum at δ = |x − µ| (derivation below).
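
The derivation behind this: differentiating the Gaussian density with respect to δ and setting the result to zero,

```latex
\frac{\partial}{\partial \delta}\, N_{\mu,\delta}(x)
\;=\; N_{\mu,\delta}(x)\left(\frac{(x-\mu)^2}{\delta^3} - \frac{1}{\delta}\right) = 0
\quad\Longrightarrow\quad \delta \;=\; |x - \mu|
```

The density increases with δ for δ < |x − µ| and decreases for δ > |x − µ|, so within a bounded range [δmin, δmax] the maximizing δ is |x − µ| clipped to that range.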

  42. The Gauss-Tree • The only positive solution obtains a local maximum at δ = |x − µ|. • Therefore, we can determine the standard deviation for area (I), where x is smaller than µmin − δmax, i.e. where |x − µmin| exceeds every admissible δ. [Figure: region (I) at the left tail of the node's densities]

  43. The Gauss-Tree • Therefore, we derive that the standard deviation must be equal to δmax in area (I), since the density still increases with δ as long as δ < |x − µ|. • Similarly for the other areas (a consolidated sketch follows below). [Figure: region (I) with the bounding density drawn in]
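
Consolidating slides 39 to 43, a hedged sketch of the per-dimension upper-bound density of a node: clip the best mean and the best standard deviation to the node's parameter ranges (a consolidation of the piecewise cases rather than the paper's exact pseudocode; gauss() as in the sketch on slide 18):

```python
def max_density(x, mu_min, mu_max, sigma_min, sigma_max):
    """Upper bound on N_{mu,sigma}(x) over all mu in [mu_min, mu_max]
    and all sigma in [sigma_min, sigma_max]."""
    # For any fixed sigma, the density grows as mu approaches x,
    # so the best admissible mean is the one closest to x.
    mu = min(max(x, mu_min), mu_max)
    # For fixed mu, the density peaks at sigma = |x - mu|; clip it
    # to the admissible range (area (I) ends up with sigma_max).
    sigma = min(max(abs(x - mu), sigma_min), sigma_max)
    return gauss(x, mu, sigma)
```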

  44. The Gauss-Tree • With the maximizing δ defined for every area, we can obtain the upper bound of the probability density of a query value q which is an exact value. • Question: how about the upper bound of the probability density of a query value with Gaussian uncertainty? • Recall the lemma for the joint probability between two pfvs qi and vi: the joint probability between two Gaussians is a single Gaussian with combined standard deviation. • Therefore, the upper bound of the joint probability between a pfv q and the objects in a node/subtree is obtained in the same way, using the combined standard deviation.

  45. The Gauss-Tree • Lower bound (skipped). • Sum (skipped). • The accuracy of the approximation of the sum is bounded.

  46. Using the Gauss-Tree

  47. Using the Gauss-Tree • Recall that the k-most likely identification query (k-MLIQ) reports the objects v for which the probability-based similarity P(v|q) is maximal. • We briefly describe how to use the Gauss-tree to filter nodes/objects for an MLIQ (k = 1): • Best-first search strategy: maintain a priority queue of the nodes to be visited. • Question: how do we determine the priority of each tree node? That is, we have to bound p(q|v), i.e. the joint probability, whose upper bound can be obtained from the Gauss-tree.

  48. Using the Gauss-Tree • Let a be a node of the Gauss-tree; its priority attribute is defined as the product, over all dimensions i, of the per-dimension upper bounds of the joint probability between the query object q and the database objects in the subtree of a. • That is, the ordering key corresponds to the upper bound of p(q|v) for any v inside node a.

  49. Using the Gauss-Tree • Brief description of the algorithm for finding the MLIQ: • Initially, the queue contains only the root. • The algorithm runs in a loop which removes the top element from the queue, loads the corresponding node and re-inserts its children into the queue. • The algorithm keeps a candidate object in a variable: the pfv with the maximum probability that has been seen so far in any of the leaf nodes.

  50. Using the Gauss-Tree • The algorithm stops when a pfv has been found whose probability exceeds the priority of the top element of the queue, as sketched below. • The priority of the top element is the upper bound of the joint probability between the query object and the database objects stored under its subtree; the candidate holds the maximum joint probability between the query object and the database objects seen so far. At that point, no unvisited subtree can contain a better answer.
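
A sketch of this best-first search for k = 1, assuming hypothetical node objects with is_leaf, children and entries attributes, and an upper_bound(q, node) built from the per-dimension bound sketched after slide 43:

```python
import heapq

def mliq(root, q, upper_bound, p_q_given_v):
    """Most-likely identification query (k = 1) via best-first search.
    Nodes are expanded in decreasing order of their upper bound on p(q|v);
    heapq is a min-heap, so priorities are stored negated."""
    queue = [(-upper_bound(q, root), id(root), root)]
    best_p, best_v = 0.0, None
    while queue:
        neg_bound, _, node = heapq.heappop(queue)
        if -neg_bound <= best_p:
            break  # no unvisited subtree can beat the current candidate
        if node.is_leaf:
            for v in node.entries:        # the entries of a leaf are pfvs
                p = p_q_given_v(q, v)
                if p > best_p:
                    best_p, best_v = p, v
        else:
            for child in node.children:
                heapq.heappush(queue, (-upper_bound(q, child), id(child), child))
    return best_v, best_p
```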
