
Using Trees to Depict a Forest

Using Trees to Depict a Forest. Bin Liu, H. V. Jagadish EECS, University of Michigan, Ann Arbor Presented by Sergey Shepshelvich. Motivation. In interactive database querying, we often get more results than we can comprehend immediately.


Presentation Transcript


  1. Using Trees to Depict a Forest Bin Liu, H. V. Jagadish EECS, University of Michigan, Ann Arbor Presented by Sergey Shepshelvich

  2. Motivation • In interactive database querying, we often get more results than we can comprehend immediately • When do you actually click over 2-3 pages of results? • 85% of users never go to the second page! • What to display on the first page?

  3. Standard solutions • Sorting by attributes • Computationally expensive • Similar results can be distributed many pages apart • Ranking • Hard to estimate the user's preference • In database queries, all tuples are equally relevant! • What to do when there are millions of results?

  4. Make the First Page Count • Human beings are very capable of learning from examples • Show the most “representative” results • Best help users learn what is in the result set • User can decide further actions based on representatives

  5. The Proposal: MusiqLens Experience (Model-driven Usable Systems for Information Querying)

  6. Suppose a user wants a 2005 Civic but there are too many of them…

  7. MusiqLens on the Car Data

  8. MusiqLens on the Car Data

  9. After Zooming in: 2005 Honda Civics ~ ID 132

  10. After Filtering by “Price < 9,500”

  11. Challenges • Representation Modeling: finding a suitable metric • What is the best set of representatives? • Representative finding • How to find them efficiently? • Query Refinement • How to efficiently adapt to user’s query operations?

  12. Finding a Suitable Metric • Users should be the ultimate judge • Which metric generates the representatives that I can learn the most from? • User study to evaluate different representation modeling

  13. Metric Candidates • Sort by attributes • Uniform random sampling • Small clusters are missed • Density-biased sampling • Sample more from sparse regions, less from dense regions • Sort by typicality • Based on probabilistic modeling • K-medoids

  14. Metric Candidates - K-medoids • A medoid of a cluster is the object whose dissimilarity to others is smallest • Average medoid and max medoid • K-medoids are k objects, each from a different cluster where the object is the medoid • Why not K-means? • K-means cluster centers do not exist in database • We must present real objects to users
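
A minimal sketch of the medoid definition on this slide, assuming Euclidean distance and reading "average medoid" as the object that minimizes the average distance to the other cluster members and "max medoid" as the one that minimizes the maximum distance. The function name `medoid` and the sample cluster are illustrative and do not come from the presentation.

```python
import math

def medoid(points, mode="average"):
    """Return the member of `points` with the smallest dissimilarity to the rest.

    mode="average" minimizes the mean distance to all members;
    mode="max" minimizes the largest distance to any member.
    Both readings are assumptions about the slide's terminology.
    """
    def cost(p):
        dists = [math.dist(p, q) for q in points]
        return sum(dists) / len(points) if mode == "average" else max(dists)
    return min(points, key=cost)

# Hypothetical one-dimensional cluster embedded in 2D.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
avg_medoid = medoid(cluster, "average")
max_medoid = medoid(cluster, "max")
print(avg_medoid, max_medoid)
```

Both results are actual members of the cluster, whereas the arithmetic mean of this cluster, (3.2, 0.0), is not, which is the reason the slide gives for preferring k-medoids over k-means.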

  15. Plotting the Candidates Data: Yahoo! Autos, 3922 data points. Price and mileage are normalized to 0..1

  16. Plotting the Candidates - Typicality

  17. Plotting the Candidates - k-medoids

  18. User Study Procedure • Users are given: • 7 sets of data, generated using the 7 candidate methods • Each set consists of 8 representative points • Users predict 4 more data points • That are most likely in the data set • Should not pick those already given • Measure the prediction error

  19. Verdict • K-medoids is the winner • In this paper, the authors choose average k-medoids • The proposed algorithm can be extended to max-medoids with small changes

  20. Challenges • Representation Modeling: finding a suitable metric • What is the best set of representatives? • Representative finding • How to find them efficiently? • Query Refinement • How to efficiently adapt to user’s query operations?

  21. Cover Tree Based Algorithm • Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006 • Briefly discuss Cover Tree properties • See Cover Tree based algorithms for computing k-medoids

  22. Cover Tree Properties (1) Nesting: $C_i \subset C_{i+1}$ for all $i$, where $C_i$ is the set of nodes at level $i$. Figure: Points in the Data (One Dimension)

  23. Cover Tree Properties (2) Covering: a node in $C_i$ is within distance $1/2^i$ of its children in $C_{i+1}$. The distance from a node to any descendant is less than $1/2^{i-1}$. This value is called the “span” of the node.

  24. Cover Tree Properties (3) Separation: nodes in $C_i$ are separated by at least $1/2^i$. Note: $i$ is allowed to be negative to satisfy the above conditions.
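
The three properties can be checked mechanically. The sketch below assumes levels are numbered downward from the root and that level i uses the radius 1/2^i; this convention is an assumption, and the original cover tree paper indexes levels differently. `check_cover_tree_levels` and the sample levels are hypothetical names and data.

```python
import math

def radius(i):
    # Assumed convention: level i covers and separates at distance 1/2^i.
    return 1.0 / 2 ** i

def check_cover_tree_levels(levels):
    """levels maps a level index i to the list of points in C_i."""
    for i in sorted(levels):
        pts = levels[i]
        # Separation: distinct nodes in C_i are at least radius(i) apart.
        for a in range(len(pts)):
            for b in range(a + 1, len(pts)):
                if math.dist(pts[a], pts[b]) < radius(i):
                    return False
        if i + 1 in levels:
            finer = levels[i + 1]
            # Nesting: every point of C_i also appears in C_{i+1}.
            if not set(pts) <= set(finer):
                return False
            # Covering: every point of C_{i+1} has a possible parent in C_i
            # within radius(i).
            for p in finer:
                if not any(math.dist(p, q) <= radius(i) for q in pts):
                    return False
    return True

good = {
    0: [(0.0,)],
    1: [(0.0,), (0.75,)],
    2: [(0.0,), (0.75,), (0.375,), (1.125,)],
}
bad = {0: [(0.0,)], 1: [(0.0,), (0.25,)]}  # violates separation at level 1
print(check_cover_tree_levels(good), check_cover_tree_levels(bad))
```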

  25. Additional Stats for Cover Tree (2D Example) • Density (DS): number of points in the subtree • Centroid (CT): geometric center of points in the subtree • Figure labels: node p, subtrees with DS = 10 and DS = 3
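
As an illustration of these two statistics, the sketch below computes density and centroid bottom-up on a simplified tree in which each node holds one distinct point. That simplification is an assumption: in a real cover tree a point reappears at lower levels because of nesting. `Node` and `subtree_stats` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    point: tuple
    children: list = field(default_factory=list)

def subtree_stats(node):
    """Return (density, centroid) of all points in the subtree rooted at node."""
    density = 1
    sums = list(node.point)
    for child in node.children:
        child_ds, child_ct = subtree_stats(child)
        density += child_ds
        # A child's centroid times its density restores its coordinate sums.
        for d, coord in enumerate(child_ct):
            sums[d] += coord * child_ds
    return density, tuple(s / density for s in sums)

root = Node((0.0, 0.0), [
    Node((2.0, 0.0)),
    Node((0.0, 2.0), [Node((2.0, 2.0))]),
])
ds, ct = subtree_stats(root)
print(ds, ct)
```

Storing DS and CT on each node lets later steps approximate a whole subtree by a single weighted point without visiting its descendants.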

  26. k-medoid Algorithm Outline • Descend the cover tree to a level with more than k nodes • Choose k initial points as the first set of medoids (seeds) • Bad seeds can lead to local minima with a high distance cost • Assign nodes to medoids and update repeatedly until the medoids converge

  27. Cover Tree Based Seeding • Descend the cover tree to a level with more than k nodes (denote it as level m) • Use the parent level as the starting point for seeds • Each node has a weight, calculated as the product of span and density (the contribution of the subtree to the distance cost) • Expand nodes using a priority queue • Fetch the first k nodes from the queue as seeds
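
One possible reading of the seeding steps, sketched with Python's `heapq`: start from the nodes of the parent level, repeatedly expand the heaviest node into its children until at least k candidates are available, then take the k heaviest. The expansion rule and stopping condition are assumptions, since the slide does not spell them out, and `Node` and `choose_seeds` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Node:
    name: str
    span: float
    density: int
    children: list = field(default_factory=list)

    @property
    def weight(self):
        # Weight as defined on the slide: span times density.
        return self.span * self.density

def choose_seeds(start_level, k):
    tie = count()  # tie-breaker so the heap never compares Node objects
    heap = [(-n.weight, next(tie), n) for n in start_level]
    heapq.heapify(heap)
    leaves = []  # popped nodes that cannot be expanded further
    while heap and len(heap) + len(leaves) < k:
        _, _, node = heapq.heappop(heap)
        if not node.children:
            leaves.append(node)
            continue
        for child in node.children:
            heapq.heappush(heap, (-child.weight, next(tie), child))
    ranked = leaves + [n for _, _, n in sorted(heap)]
    return ranked[:k]

a = Node("A", 1.0, 6, [Node("A1", 0.5, 4), Node("A2", 0.5, 2)])
b = Node("B", 1.0, 3, [Node("B1", 0.5, 3)])
seeds = [n.name for n in choose_seeds([a, b], k=3)]
print(seeds)
```

Expanding the heaviest node first spends the seed budget on the subtrees that contribute most to the distance cost.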

  28. A Simple Example: k = 4 • Spans by level: 2, 1, 1/2, 1/4 • Priority queue on node weight (density * span), two successive states: S3 (5), S8 (3), S5 (2); then S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2) • Figure: final set of seeds

  29. Update Process • Initially, assign all nodes to closest seed to form clusters • For each cluster, calculate the geometric center • Use centroid and density information to approximate subtree • Find the node that is closest to the geometric center, designate as a new medoid • Repeat from step 1 until medoids converge
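
The loop above can be sketched as follows, assuming Euclidean distance and representing each node as a hypothetical `(point, centroid, density)` triple, so that the stored centroid and density stand in for the node's subtree. `update_medoids` is an illustrative name, not the authors' code.

```python
import math

def update_medoids(nodes, seeds, max_iter=100):
    """nodes: list of (point, centroid, density); seeds: indices into nodes."""
    medoids = list(seeds)
    for _ in range(max_iter):
        # Step 1: assign every node to its closest current medoid.
        clusters = {m: [] for m in medoids}
        for idx, (point, _, _) in enumerate(nodes):
            closest = min(medoids, key=lambda m: math.dist(point, nodes[m][0]))
            clusters[closest].append(idx)
        new_medoids = []
        for m, members in clusters.items():
            # Step 2: density-weighted geometric center of the cluster.
            total = sum(nodes[i][2] for i in members)
            dim = len(nodes[m][0])
            center = tuple(
                sum(nodes[i][1][d] * nodes[i][2] for i in members) / total
                for d in range(dim)
            )
            # Step 3: the node closest to that center becomes the new medoid.
            new_medoids.append(
                min(members, key=lambda i: math.dist(nodes[i][0], center))
            )
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids

xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
nodes = [((x, 0.0), (x, 0.0), 1) for x in xs]
result = update_medoids(nodes, seeds=[0, 3])
print(result)
```

The sketch assumes the seeds are distinct points, so that every cluster keeps at least its own medoid as a member.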

  30. Challenges • Representation Modeling: finding a suitable metric • What is the best set of representatives? • Representative finding • How to find them efficiently? • Query Refinement • How to efficiently adapt to user’s query operations?

  31. Query Adaptation • Handle user actions • Zooming • Selection (filtering)

  32. Zooming • Zooming • Expand all nodes assigned to the medoid • Run k-medoid algorithm on the new set of nodes

  33. Selection • Effect of selection on a node • Completely invalid • Fully valid • Partially valid • Estimate the validity percentage (VG) of each node • Multiply the VG with weight of each node
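
A small sketch of the weight adjustment, using the earlier "Price < 9,500" filter as the selection. Here the validity percentage is estimated by scanning a node's points, which is an assumption made for illustration; the presentation does not say how the estimate is obtained, and the function names and prices are hypothetical.

```python
def validity(values, predicate):
    """Fraction of a node's values that satisfy the selection predicate."""
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)

def classify(vg):
    if vg == 0.0:
        return "completely invalid"
    if vg == 1.0:
        return "fully valid"
    return "partially valid"

def adjusted_weight(span, density, vg):
    # Original weight (span * density) scaled by the validity percentage.
    return span * density * vg

prices = [8000, 9000, 9400, 12000]  # hypothetical prices under one node
vg = validity(prices, lambda price: price < 9500)
status = classify(vg)
weight = adjusted_weight(span=0.5, density=len(prices), vg=vg)
print(vg, status, weight)
```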

  34. Experiments – Initial Medoid Quality • Compare with the R-tree based method by M. Ester, H. Kriegel, and X. Xu • Data sets • Synthetic dataset: 2D points with Zipf distribution • Real dataset: LA data set from R-tree Portal, 130k points • Measurement • Time to compute the medoids • Average distance from a data point to its medoid
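
The quality measurement named above, the average distance from a data point to its medoid, could be computed as in this sketch. It assumes Euclidean distance and that each point belongs to its nearest medoid; the function name and sample data are illustrative.

```python
import math

def average_distance_to_medoid(points, medoids):
    """Mean distance from each point to its nearest medoid."""
    return sum(
        min(math.dist(p, m) for m in medoids) for p in points
    ) / len(points)

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
medoids = [(1.0, 0.0), (10.0, 0.0)]
avg = average_distance_to_medoid(points, medoids)
print(avg)
```

Lower values mean the chosen representatives sit closer to the data they stand for.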

  35. Results on Synthetic Data • Plots: Distance, Time • For various sizes of data, the cover-tree based method outperforms the R-tree based method

  36. Results on Real Data For various k values, Cover-tree based method outperforms R-tree based method on real data

  37. Query Adaptation • Compare with re-building the cover tree and running the k-medoid algorithm from scratch • Plots: Synthetic Data, Real Data • Time cost of re-building is orders of magnitude higher than incremental computation

  38. Conclusion • The authors proposed the MusiqLens framework for solving the many-answer problem • The authors conducted a user study to select a metric for choosing representatives • The authors proposed an efficient method for computing and maintaining the representatives under user actions • Part of the database usability project at Univ. of Michigan • Led by Prof. H.V. Jagadish • http://www.eecs.umich.edu/db/usable/
