
Scalable Data Mining




  1. autonlab.org Scalable Data Mining Brigham Anderson, Andrew Moore, Dan Pelleg, Alex Gray, Bob Nichols, Andy Connolly The Auton Lab, Carnegie Mellon University www.autonlab.org

  2. Outline autonlab.org • Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology

  3. Outline autonlab.org • Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology

  4. Nearest Neighbor - Naïve Approach • Given a query point X. • Scan through each point Y. • Takes O(N) time for each query! [Figure: 33 distance computations for a single query]
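For reference, the naive scan is only a few lines; a minimal NumPy sketch (the function name and the random test data are ours, not the deck's):

```python
import numpy as np

def nearest_neighbor_naive(query, points):
    """Scan every point and return the one closest to the query: O(N) per query."""
    dists = np.linalg.norm(points - query, axis=1)  # distance to all N points
    return points[np.argmin(dists)]

# Usage: 500 random 2-D points, one query
pts = np.random.rand(500, 2)
print(nearest_neighbor_naive(np.array([0.5, 0.5]), pts))
```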

  5. Speeding Up Nearest Neighbor • We can speed up the search for the nearest neighbor: • Examine nearby points first. • Ignore any points that are farther than the nearest point found so far. • Do this using a KD-tree: • Tree-based data structure • Recursively partitions points into axis-aligned boxes.

  6. KD-Tree Construction We start with a list of n-dimensional points.

  7. KD-Tree Construction We can split the points into 2 groups by choosing a dimension X and a value V and separating the points into X > V and X <= V. [Diagram: root node splitting on X > .5, with YES/NO branches]

  8. KD-Tree Construction We can then consider each group separately and possibly split again (along the same or a different dimension). [Diagram: the X > .5 split from the previous slide]

  9. KD-Tree Construction We can then consider each group separately and possibly split again (along the same or a different dimension). [Diagram: the X > .5 node gains a child splitting on Y > .1]

  10. KD-Tree Construction We can keep splitting the points in each set to create a tree structure. Each node with no children (leaf node) contains a list of points.

  11. KD-Tree Construction We will keep one additional piece of information at each node: the (tight) bounds of the points at or below this node.

  12. KD-Tree Construction Use heuristics to make splitting decisions: • Which dimension do we split along? The widest one. • Which value do we split at? The median value of the split dimension over the points. • When do we stop? When there are fewer than m points left OR the box has hit some minimum width.
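A minimal sketch of this construction in Python/NumPy, following the slide's heuristics (widest dimension, median split, stop below m points or a minimum box width); the class name, parameter defaults, and the degenerate-split guard are our choices, not the deck's:

```python
import numpy as np

class KDNode:
    def __init__(self, points):
        # Tight bounding box of all points at or below this node (slide 11)
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)
        self.points = points            # kept only at leaves
        self.left = self.right = None
        self.split_dim = self.split_val = None

def build_kdtree(points, m=10, min_width=1e-6):
    """Split along the widest dimension at the median; stop when fewer than
    m points remain or the box hits a minimum width (slide 12)."""
    node = KDNode(points)
    widths = node.hi - node.lo
    if len(points) < m or widths.max() <= min_width:
        return node                                     # leaf
    node.split_dim = int(np.argmax(widths))             # widest dimension
    node.split_val = float(np.median(points[:, node.split_dim]))
    right = points[:, node.split_dim] > node.split_val  # X > V goes right
    if right.all() or not right.any():                  # degenerate split: stop
        return node
    node.left = build_kdtree(points[~right], m, min_width)
    node.right = build_kdtree(points[right], m, min_width)
    node.points = None                                  # internal node holds no points
    return node
```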

  13. Outline autonlab.org • Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology

  14. Nearest Neighbor with KD Trees We traverse the tree looking for the nearest neighbor of the query point.

  15. Nearest Neighbor with KD Trees Examine nearby points first: Explore the branch of the tree that is closest to the query point first.

  16. Nearest Neighbor with KD Trees Examine nearby points first: Explore the branch of the tree that is closest to the query point first.

  17. Nearest Neighbor with KD Trees When we reach a leaf node: compute the distance to each point in the node.

  18. Nearest Neighbor with KD Trees When we reach a leaf node: compute the distance to each point in the node.

  19. Nearest Neighbor with KD Trees Then we can backtrack and try the other branch at each node visited.

  20. Nearest Neighbor with KD Trees Each time a new closest point is found, we can update the distance bound.

  21. Nearest Neighbor with KD Trees Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.

  22. Nearest Neighbor with KD Trees Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.

  23. Nearest Neighbor with KD Trees Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.
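Putting the pieces together (nearby branch first, backtracking, pruning against the bounds), a sketch that builds on the construction code above; the helper name box_mindist and the (distance, point) return pair are our choices:

```python
import numpy as np

def box_mindist(node, q):
    """Lower bound on the distance from q to any point inside the node's box."""
    gap = np.maximum(node.lo - q, 0) + np.maximum(q - node.hi, 0)
    return np.linalg.norm(gap)

def nn_search(node, q, best_d=np.inf, best_x=None):
    # Prune: if even the closest face of the box is farther than the best
    # point found so far, nothing below this node can be the nearest neighbor.
    if node is None or box_mindist(node, q) >= best_d:
        return best_d, best_x
    if node.points is not None:                        # leaf: check each point
        d = np.linalg.norm(node.points - q, axis=1)
        i = int(np.argmin(d))
        if d[i] < best_d:
            best_d, best_x = d[i], node.points[i]
        return best_d, best_x
    # Examine nearby points first: descend into the closer child, then backtrack
    near, far = ((node.left, node.right) if q[node.split_dim] <= node.split_val
                 else (node.right, node.left))
    best_d, best_x = nn_search(near, q, best_d, best_x)
    return nn_search(far, q, best_d, best_x)

# Usage: d, x = nn_search(build_kdtree(pts), np.array([0.5, 0.5]))
```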

  24. Metric Trees • Kd-trees are rendered worse-than-useless in higher dimensions. • Metric trees require only a metric space (a well-behaved distance function), not coordinate axes to split on.
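The pruning machinery needs nothing more than the triangle inequality: if a metric-tree node stores a center point and a radius covering everything beneath it, the bound below plays the role that the box's mindist played for kd-trees. A sketch (names ours):

```python
def ball_mindist(q, center, radius, dist):
    """Lower bound on dist(q, x) for any x inside the node's ball, by the
    triangle inequality: dist(q, x) >= dist(q, center) - radius.
    Works for ANY metric dist(), with no coordinates required."""
    return max(dist(q, center) - radius, 0.0)
```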

  25. Outline autonlab.org • Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology

  26. What does k-means do?

  27. K-means • Ask user how many clusters they’d like. (e.g. k=5)

  28. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations

  29. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)

  30. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns

  31. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns… • …and jumps there • …Repeat until terminated! • For a basic tutorial on k-means, see any machine learning textbook, or www.cs.cmu.edu/~awm/tutorials/kmeans.html

  32. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns
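A direct NumPy sketch of the loop above, before any tree acceleration; initialization by sampling k datapoints and the no-movement termination test are our simplifications:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial guess
    for _ in range(iters):
        # Each datapoint finds out which center it's closest to ("ownership")
        owner = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # Each center finds the centroid of the points it owns... and jumps there
        new_centers = np.array([X[owner == j].mean(axis=0) if (owner == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break                                           # terminated
        centers = new_centers
    return centers, owner
```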

  33. K-means search guts Every center must compute Σxi of all the points it owns. [Figure: four centers, each thinking “I must compute Σxi of all points I own”]

  34. K-means search guts Each center: “I will compute Σxi of all points I own in the rectangle.” [Figure: the four centers and the bounding rectangle of the data]

  35. K-means search guts Each center: “I will compute Σxi of all points I own in the left subrectangle, then the right subrectangle, then add ’em.”

  36. In recursive call… Each center: “I will compute Σxi of all points I own in the rectangle.”

  37. Find center nearest to rectangle In recursive call… Each center: “I will compute Σxi of all points I own in the rectangle.”

  38. Find center nearest to rectangle • For each other center: can it own any points in the rectangle? In recursive call… Each center: “I will compute Σxi of all points I own in the rectangle.”

  39. Find center nearest to rectangle • For each other center: can it own any points in the rectangle? In recursive call… Each center: “I will compute Σxi of all points I own in the rectangle.”

  40. Find center nearest to rectangle • For each other center: can it own any points in the rectangle? • If not, PRUNE Pruning: the nearest center can just grab Σxi from the value cached in the node.

  41. Find center nearest to rectangle • For each other center: can it own any points in the rectangle? • If yes, recurse... Pruning was not possible at the previous level… but maybe there’s some optimization possible anyway?

  42. Find center nearest to rectangle • For each other center: can it own any points in the rectangle? • If yes, recurse... Blacklisting: a hopeless center never needs to be considered in any recursion.
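A sketch of one accelerated pass, assuming the kd-tree nodes from the earlier sketch are extended to cache node.count and node.sum_x (the number of points beneath the node and their vector sum Σxi); can_own implements the pairwise “can this center own any point in the rectangle?” test via the box corner lying furthest into the rival’s halfspace. All names are ours; see Pelleg & Moore (KDD-99) for the exact algorithm:

```python
import numpy as np

def can_own(c, cstar, lo, hi):
    """Can center c own ANY point in the box [lo, hi] against rival cstar?
    Points closer to c than to cstar form a halfspace, so it suffices to
    test the box corner lying furthest in the direction c - cstar."""
    corner = np.where(c > cstar, hi, lo)
    return np.linalg.norm(corner - c) < np.linalg.norm(corner - cstar)

def kmeans_step(node, centers, candidates, stats):
    """Accumulate (count, Σxi) per center for one k-means iteration.
    stats: list of [count, vector_sum], one entry per center."""
    # Find the candidate center nearest to this node's rectangle
    cstar = min(candidates,
                key=lambda j: np.linalg.norm(np.maximum(node.lo - centers[j], 0)
                                             + np.maximum(centers[j] - node.hi, 0)))
    # Blacklisting: a center that cannot own any point here is dropped from
    # this node AND every recursion below it
    alive = [j for j in candidates
             if j == cstar or can_own(centers[j], centers[cstar], node.lo, node.hi)]
    if len(alive) == 1:
        stats[cstar][0] += node.count      # PRUNE: grab the cached statistics
        stats[cstar][1] += node.sum_x
    elif node.points is not None:          # leaf: settle ownership point by point
        for x in node.points:
            j = min(alive, key=lambda c: np.linalg.norm(x - centers[c]))
            stats[j][0] += 1
            stats[j][1] += x
    else:                                  # otherwise, recurse on both children
        kmeans_step(node.left, centers, alive, stats)
        kmeans_step(node.right, centers, alive, stats)
```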

  43. Example • Example generated by: Dan Pelleg and Andrew Moore, “Accelerating Exact k-means Algorithms with Geometric Reasoning,” Proc. Conference on Knowledge Discovery in Databases (KDD-99), 1999 (available on www.autonlab.org)

  44. K-means continues…

  45. K-means continues…

  46. K-means continues…

  47. K-means continues…

  48. K-means continues…

  49. K-means continues…

  50. K-means continues…
