
Density-Based Clustering of Uncertain Data (KDD2005)

HKU Department of Computer Science, Database Research Seminar, 18th May 2006. Density-Based Clustering of Uncertain Data (KDD2005). Authors: Hans-Peter Kriegel and Martin Pfeifle. Presenter: Chui Chun Kit (MPhil), ckchui@cs.hku.hk, http://www.cs.hku.hk/~ckchui


Presentation Transcript


  1. HKU Department of Computer Science, Database Research Seminar, 18th May 2006 • Density-Based Clustering of Uncertain Data (KDD2005) • Authors: Hans-Peter Kriegel and Martin Pfeifle • Presenter: Chui Chun Kit (MPhil), ckchui@cs.hku.hk, http://www.cs.hku.hk/~ckchui • Supervisor: Dr. Benjamin C.M. Kao

  2. Presentation Outline • Introduction • What is clustering? • Density-based similarity measurement • DBSCAN • Issues in moving from mining certain data to uncertain data • Why do data exhibit uncertainty? • How to represent / model data uncertainty? • How to represent the distance between two uncertain objects? • Theoretical foundation of changing DBSCAN to FDBSCAN • FDBSCAN • From DBSCAN to FDBSCAN • Computational Issues • Experimental Results • Conclusions

  3. Introduction

  4. What is Clustering? • Problem description • A set of objects • A similarity measurement • Discover groups of similar objects • More precisely, find sets of objects whose intra-cluster similarity is high while the inter-cluster similarity is relatively low.

  5. Different Clusters Discovered by Different Similarity Measurements • Distance-based • Density-based • Pattern-based • …etc

  6. Density-based clustering • The main reason why we recognize the clusters is that within each cluster we have a typical density of objects which is considerably higher than outside of the cluster. • The clusters are separated by regions of low object density (noise). • Any clusters? [Figure: objects plotted on x and y axes.]

  7. Density-based clustering • The main reason why we recognize the clusters is that within each cluster we have a typical density of objects which is considerably higher than outside of the cluster. • The clusters are separated by regions of low object density (noise). • Density-based clustering can detect arbitrary cluster shapes.

  8. Key idea of density-based clustering • Density constraint for objects to form clusters • Intuitively, for each object of a cluster, the neighborhood of a given radius has to contain at least a minimum number of objects (the density constraint). • I.e. the density in the neighborhood has to exceed some threshold. • Objects that do not belong to any cluster are regarded as noise.

  9. Previous Work on Density-Based Clustering • DBSCAN • A density-based clustering algorithm • Works on data with no uncertainty • The uncertain-data version of DBSCAN is presented later.

  10. DBSCAN • Two important definitions of DBSCAN • Core objects • Directly density-reachable • Density-reachable (skip) • Density-connected (skip) • To keep the discussion short, the last two definitions are skipped.

  11. DBSCAN Definition 1: Core Object • Given the density constraint (µ and ε) • An object o is defined as a core object iff there are µ or more objects within the ε-range of o. • In practice, we can conduct a range search on object o with radius ε; if µ or more objects are returned, then o is a core object.

  12. DBSCAN Definition 1: Core Object • Example (µ=5) • Is o1 a core object? Since there are 5 objects within the ε-range of o1, o1 is a core object. • Since there are 5 objects within the ε-range of o2, o2 is a core object too. [Figure: objects o1 and o2, each with its ε-range.]
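The range-search test from Definition 1 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slides: objects are 2-D points, d is the Euclidean distance, o counts as a member of its own ε-range, and the names `eps_neighborhood` and `is_core_object` and the sample data are hypothetical.

```python
import math

def eps_neighborhood(points, o, eps):
    """Range search: all points whose distance to o is at most eps.

    Assumption: o itself counts as a member of its own neighborhood.
    """
    return [p for p in points if math.dist(p, o) <= eps]

def is_core_object(points, o, eps, mu):
    """o is a core object iff its eps-range contains mu or more objects."""
    return len(eps_neighborhood(points, o, eps)) >= mu

# Hypothetical data: a tight group of five points plus one far-away point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (3.0, 3.0)]
print(is_core_object(points, (0.0, 0.0), eps=1.0, mu=5))
print(is_core_object(points, (3.0, 3.0), eps=1.0, mu=5))
```

A linear scan is used for the range search to keep the sketch short; a spatial index would normally replace it on larger data.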

  13. DBSCAN Definition 2: Directly density-reachable • An object p is directly density-reachable from o if the following conditions are satisfied • 1st condition: o is a core object • 2nd condition: d(p,o) ≤ ε

  14. DBSCAN Definition 2: Directly density-reachable • Example (µ=5) • Question: Is o2 directly density-reachable from o1? • 1st condition: Is o1 a core object? Since there are 5 objects within the ε-range of o1, o1 is a core object. • 2nd condition: Is d(o2,o1) ≤ ε? Yes, o2 is within the ε-range of o1. • Thus, o2 is directly density-reachable from o1.
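The two conditions of Definition 2 translate directly into a predicate. The sketch below assumes 2-D points with Euclidean distance and counts o inside its own ε-range; the function names and the data are hypothetical. It also illustrates that the relation is not symmetric: a border point can be reached from a core object, but not the other way round.

```python
import math

def is_core_object(points, o, eps, mu):
    """1st condition: at least mu objects lie within the eps-range of o."""
    return sum(1 for q in points if math.dist(q, o) <= eps) >= mu

def directly_density_reachable(points, p, o, eps, mu):
    """p is directly density-reachable from o iff o is a core object
    and d(p, o) <= eps."""
    return is_core_object(points, o, eps, mu) and math.dist(p, o) <= eps

# Hypothetical data: a dense group and one border point at (1.4, 0.0).
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (1.4, 0.0)]
core, border = (0.5, 0.0), (1.4, 0.0)
print(directly_density_reachable(points, border, core, eps=1.0, mu=5))
print(directly_density_reachable(points, core, border, eps=1.0, mu=5))
```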

  15. DBSCAN: How does it work? Brief idea… • Search for clusters by checking the ε-neighborhood of each object in the database. • If a core object o is found, a new cluster containing o and its directly density-reachable objects is created. • DBSCAN then iteratively collects the objects that are directly density-reachable from the objects already in the cluster.

  16. DBSCAN • Example (µ=5) • Arbitrarily pick a point, e.g. o1, and check whether it is a core object… • o1 is a core object, so a cluster is created with o1 and all objects directly density-reachable from o1. • DBSCAN continues to “expand” the cluster by adding objects which are directly density-reachable from cluster objects. • Since a1 is not a core object, a2 is NOT directly density-reachable from a1, so a2 is NOT added to the cluster. • If the current cluster does not expand, pick another point for the next iteration. • Eventually, clusters are formed; objects that are not assigned to any cluster are regarded as noise. [Figure: objects o1, o2, a1, a2 with their ε-ranges.]
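The expansion procedure walked through above can be sketched as follows. This is a minimal illustration of the general DBSCAN idea, not the presenter's or the authors' code; it assumes 2-D points, Euclidean distance, a linear-scan range search, and hypothetical names (`region_query`, `dbscan`, `NOISE`).

```python
import math
from collections import deque

NOISE = -1  # label for objects that end up in no cluster

def region_query(points, i, eps):
    """Indices of all points within the eps-range of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, mu):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already processed
        neighbors = region_query(points, i, eps)
        if len(neighbors) < mu:
            labels[i] = NOISE             # not a core object (may become a border point later)
            continue
        labels[i] = cluster_id            # core object found: start a new cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id    # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= mu:    # only core objects expand the cluster
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels
```

Only core objects push their neighbors onto the queue, which mirrors the a1/a2 case in the example: a non-core object can join a cluster but cannot pull further objects in.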

  17. From Certain Data to Uncertain Data

  18. From certain to uncertain data: Five major issues … • Why do data exhibit uncertainty? • How to represent / model data uncertainty? • How to represent the distance between two uncertain objects? • What is a core object in uncertain data? • What is directly density-reachable in uncertain data?

  19. From certain to uncertain data: Five major issues … • Why do data exhibit uncertainty? • How to represent / model data uncertainty? • How to represent the distance between two uncertain objects? • What is a core object in uncertain data? • What is directly density-reachable in uncertain data?

  20. Why do data exhibit uncertainty? • In many modern application areas, e.g. the clustering of moving objects or sensor databases, only uncertain data is available. • For instance, in the area of mobile services, the objects continuously change their positions so that exact positional information is often not available.

  21. Why do data exhibit uncertainty? • In application areas such as clustering of distributed feature vectors, only approximate information is transmitted to a central server site, because of security concerns or limited bandwidth.

  22. Uncertain Data (Example) • Somewhere in a tropical rain forest… • Location tracking of a group of about 300 Chimpanzees. • An implanted device reports the location of each Chimpanzee regularly. • However, the reported location is not precise; the device only returns the area in which the Chimpanzee is located. • This area is called an uncertainty region. • Assume the Chimpanzee is equally likely to be at any location inside the uncertainty region.

  23. Uncertain Data (Example) • Chimpanzee society is complicated; some young Chimpanzees may gather to fight against the leader. • Zoologists are interested in studying the factors that affect the formation of different groups (clusters) inside the Chimpanzee society.

  24. Uncertain Data (Example) • One observation is that Chimpanzees of the same group usually stay close together. • Assume that each Chimpanzee belongs to one group only. • Density-based clustering can help to discover the Chimpanzee groups (clusters).

  25. Uncertain Data (Example) • Somewhere in the tropical rain forest… [Figure: uncertainty regions of 15 Chimpanzees reported by the location tracking devices (the location of each Chimpanzee), grouped into clusters; axes x and y.]

  26. From certain to uncertain data: Five major issues … • Why do data exhibit uncertainty? • How to represent / model data uncertainty? • How to represent the distance between two uncertain objects? • What is a core object in uncertain data? • What is directly density-reachable in uncertain data?

  27. Representing Uncertain Objects • Probability density functions of 1-D objects [Figure: probability against value (e.g. temperature).] • Probability density functions of 2-D objects [Figure: probability over the x-y plane.]

  28. Representing Uncertain Objects • The probability that an object o has a value between a and b is the area under its probability density function between a and b. [Figure: probability density functions of 1-D objects over value (e.g. temperature), with the area between a and b marked.] • Question: What is the distance between two uncertain objects o and o’?

  29. From certain to uncertain data: Five major issues … • Why do data exhibit uncertainty? • How to represent / model data uncertainty? • How to represent the distance between two uncertain objects? • What is a core object in uncertain data? • What is directly density-reachable in uncertain data?

  30. How to represent the distance between uncertain objects? • Distance Density Function pd(o,o’) • Distance Distribution Function Pd(o,o’)(b) • Distance expectation value Ed(o,o’) • Aggregated value • Information loss

  31. How to represent the distance between uncertain objects? • Distance Density Function pd(o,o’) • Distance Distribution Function Pd(o,o’)(b) • Distance expectation value Ed(o,o’) • Aggregated value • Information loss

  32. Distance Density Function pd(o,o’) • Express the distance between two objects by means of a probability density function. • Let d be a distance function. • Let P(a ≤ d(o,o’) ≤ b) denote the probability that d(o,o’) is between a and b. • A probability density function pd(o,o’) is called a distance density function if the following condition holds: P(a ≤ d(o,o’) ≤ b) = ∫ pd(o,o’)(x) dx, integrated over x from a to b.

  33. Distance Density Function pd(o,o’) • The probability density functions (pdfs) of the uncertain data items are considered independent. • The distance density function expresses the distance between two uncertain objects by means of a pdf. [Figure: pdfs of two uncertain objects over value (e.g. temperature), and the distance density function pd(o,o’) plotted as probability against the distance between o and o’, with its value marked at a distance dis.]

  34. Distance Density Function pd(o,o’) • The distance density function represents the distance between two uncertain objects. [Figure: pd(o,o’) plotted as probability against the distance between o and o’, starting at 0.]

  35. Distance Density Function pd(o,o’) • From the distance density function, the probability that the distance between two uncertain objects is between a and b is given by the area under pd(o,o’) between a and b: Area = P(a ≤ d(o,o’) ≤ b). • The total area under pd(o,o’), from the minimum possible distance to the maximum possible distance between o and o’, is 1.

  36. How to represent the distance between uncertain objects? • Distance Density Function pd(o,o’) • Distance Distribution Function Pd(o,o’)(b) • Distance expectation value Ed(o,o’) • Aggregated value • Information loss

  37. Distance Distribution Function • Captures the probability that the distance between two uncertain objects is smaller than or equal to a value b. • Useful in density-based clustering for expressing the probability that d(o’,o) ≤ b, which corresponds to the 2nd condition for directly density-reachable in DBSCAN.

  38. Distance Distribution Function • In density-based clustering, when evaluating whether an object o’ is directly density-reachable from o, we may want to ask: what is the probability that o and o’ are close to each other, i.e. that the distance between o and o’ is smaller than or equal to b? • The distance distribution function Pd(o,o’)(b) is the answer. [Figure: probability density functions (pdfs) of o and o’.]

  39. Distance Distribution Function • The distance distribution function Pd(o,o’)(b) is equal to the integral of the distance density function pd(o,o’) from negative infinity to b. [Figure: distance density function pd(o,o’) plotted as probability against the distance between o and o’, with the area from 0 up to b marked.]
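One way to get a feel for Pd(o,o’)(b) is to estimate it by sampling. The sketch below is an illustration under explicit assumptions that go beyond the slides: each uncertain object is uniformly distributed over an axis-aligned rectangle (in the spirit of the Chimpanzee example), the two objects are independent, d is the Euclidean distance, and the names `sample_uniform` and `distance_distribution` are hypothetical.

```python
import math
import random

def sample_uniform(box, rng):
    """Draw one location uniformly from an axis-aligned uncertainty region.

    box is ((x_lo, y_lo), (x_hi, y_hi)).
    """
    (x_lo, y_lo), (x_hi, y_hi) = box
    return (rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi))

def distance_distribution(box_o, box_p, b, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(d(o, o') <= b) for two independent objects."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        if math.dist(sample_uniform(box_o, rng), sample_uniform(box_p, rng)) <= b:
            hits += 1
    return hits / n_samples

# Hypothetical uncertainty regions: two unit squares, one unit apart.
o = ((0.0, 0.0), (1.0, 1.0))
o_prime = ((2.0, 0.0), (3.0, 1.0))
print(distance_distribution(o, o_prime, b=2.0))
```

Below the minimum possible distance the estimate is 0, above the maximum possible distance it is 1, and in between it rises with b, which matches the picture of accumulating area under the distance density function.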

  40. How to represent the distance between uncertain objects? • Distance Density Function pd(o,o’) • Distance Distribution Function Pd(o,o’)(b) • Distance Expectation Value Ed(o,o’) • Aggregated value • Information loss

  41. Distance Expectation Value Ed(o,o’) • Represents the distance between two uncertain objects by one numerical value: the average distance between the two objects, aggregated from the distance density function. • Advantage: since the distance between two uncertain objects is represented by a single value, traditional clustering algorithms (e.g. DBSCAN) work. • Disadvantage: information loss.
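The aggregation into a single value can be illustrated with the same sampling idea. The sketch assumes independent objects that are uniformly distributed over axis-aligned rectangles and a Euclidean distance; none of this is prescribed by the slides, and the names `sample_uniform` and `expected_distance` are hypothetical.

```python
import math
import random

def sample_uniform(box, rng):
    """Draw one location uniformly from the rectangle ((x_lo, y_lo), (x_hi, y_hi))."""
    (x_lo, y_lo), (x_hi, y_hi) = box
    return (rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi))

def expected_distance(box_o, box_p, n_samples=20000, seed=0):
    """Monte Carlo estimate of the distance expectation value E_d(o, o')."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += math.dist(sample_uniform(box_o, rng), sample_uniform(box_p, rng))
    return total / n_samples

# Hypothetical uncertainty regions: two unit squares, one unit apart.
o = ((0.0, 0.0), (1.0, 1.0))
o_prime = ((2.0, 0.0), (3.0, 1.0))
print(expected_distance(o, o_prime))
```

The result is one number per pair of objects, so it can be fed into an unmodified DBSCAN, but it no longer says how widely the possible distances spread around that average. That is the information loss named on the slide.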

  42. From certain to uncertain data: Five major issues … • Why do data exhibit uncertainty? • How to represent / model data uncertainty? • How to represent the distance between two uncertain objects? • What is a core object in uncertain data? • What is directly density-reachable in uncertain data?

  43. Theoretical Foundations I: Core Object Probability • The core object probability of an object o is the probability that o is a core object. • The formula for the core object probability is derived starting from the core object definition of DBSCAN…

  44. Theoretical Foundations I: Core Object Probability • In DBSCAN, an object o is a core object if the density constraint (µ and ε) is satisfied, • i.e. there are µ or more objects p within the ε-range of o (d(p,o) ≤ ε). • The probability that an object o is a core object is the probability that the density constraint is satisfied, i.e. the probability that there are µ or more objects p with d(p,o) ≤ ε.

  45. Theoretical Foundations I: Core Object Probability • Example (µ=5) • If ε is large enough, the core object probability of o is obviously 1. • If ε is small, what is the core object probability of o? Sometimes d(p,o) ≤ ε and sometimes d(p,o) > ε. [Figure: probability density functions (pdfs) of o and p, with the ε-range of o.]

  46. Theoretical Foundations I: Core Object Probability • Consider each subset A of the database D whose cardinality is greater than or equal to µ.

  47. Theoretical Foundations I: Core Object Probability • Consider each subset A of the database D whose cardinality is greater than or equal to µ. • For each such A, determine the probability that exactly the objects p of A satisfy d(p,o) ≤ ε, and no other objects in D\A do.

  48. Theoretical Foundations I: Core Object Probability • Recall that the distance distribution function Pd(o,o’)(b) is the probability that the distance between two uncertain objects is smaller than or equal to a value b. • The probability that exactly the objects p of A satisfy d(p,o) ≤ ε, and no other objects in D\A do, has two parts. • First part: the probability that ALL objects p in A have d(p,o) ≤ ε. • Second part: the probability that ALL objects p in D\A do NOT have d(p,o) ≤ ε.
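Read literally, the description above sums, over every subset A of D with at least µ objects, the product of the distance distribution values Pd(p,o)(ε) for p in A and of (1 − Pd(p,o)(ε)) for p in D\A. The sketch below implements that reading by brute-force enumeration. It is an illustration under assumptions: the events d(p,o) ≤ ε are treated as independent across objects p, the per-object probabilities are supplied by the caller, whether o counts itself (with probability 1) is left to the caller, and the name `core_object_probability` is hypothetical.

```python
from itertools import combinations
from math import prod

def core_object_probability(neighbor_probs, mu):
    """Probability that at least mu objects lie within the eps-range of o.

    neighbor_probs[i] is assumed to be P(d(p_i, o) <= eps) for the i-th
    object of the database D, with these events treated as independent.
    """
    n = len(neighbor_probs)
    indices = range(n)
    total = 0.0
    for size in range(mu, n + 1):                 # every cardinality |A| >= mu
        for subset in combinations(indices, size):
            inside = set(subset)                  # the objects of A
            # First part: all p in A are within eps.
            # Second part: all p in D \ A are not.
            total += prod(
                neighbor_probs[i] if i in inside else 1.0 - neighbor_probs[i]
                for i in indices
            )
    return total

# Hypothetical values of P(d(p, o) <= eps) for three objects p.
print(core_object_probability([0.5, 0.5, 0.5], mu=2))
```

The enumeration visits a number of subsets that grows exponentially with the size of D, so this form is only practical for very small inputs.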

  49. From certain to uncertain data: Five major issues … • Why do data exhibit uncertainty? • How to represent / model data uncertainty? • How to represent the distance between two uncertain objects? • What is a core object in uncertain data? • What is directly density-reachable in uncertain data?

  50. Theoretical Foundations II: Reachability Probability • Consider the probability that p is directly density-reachable from o. • In DBSCAN, an object p is directly density-reachable from o if • 1st condition: o is a core object • 2nd condition: d(p,o) ≤ ε • Treating the two conditions as independent events is incorrect. Why? The two events are dependent on each other: these two conditions are NOT independent!
