Utilizing Trees for Forest Representation in Database Querying

Using Trees to Depict a Forest • Bin Liu, H. V. Jagadish • EECS, University of Michigan, Ann Arbor • Presented by Sergey Shepshelvich

Motivation • In interactive database querying, we often get more results than we can comprehend immediately • When do you actually click over 2-3 pages of results? • 85% of users never go to the second page! • What to display on the first page?

Standard solutions • Sorting by attributes • Computationally expensive • Similar results can be distributed many pages apart • Ranking • Hard to estimate the user's preference • In database queries, all tuples are equally relevant! • What to do when there are millions of results?

Make the First Page Count • Human beings are very capable of learning from examples • Show the most “representative” results • These best help users learn what is in the result set • Users can decide further actions based on the representatives

The Proposal: MusiqLens Experience (Model-driven Usable Systems for Information Querying)

Suppose a user wants a 2005 Civic but there are too many of them…

MusiqLens on the Car Data

After Zooming In: 2005 Honda Civics ~ ID 132

After Filtering by “Price < 9,500”

Challenges • Representation Modeling: finding a suitable metric • What is the best set of representatives? • Representative finding • How to find them efficiently? • Query Refinement • How to efficiently adapt to user’s query operations?

Finding a Suitable Metric • Users should be the ultimate judge • Which metric generates the representatives that I can learn the most from? • A user study evaluates the different representation models

Metric Candidates • Sort by attributes • Uniform random sampling • Small clusters are missed • Density-biased sampling • Sample more from sparse regions, less from dense regions • Sort by typicality • Based on probabilistic modeling • K-medoids

Metric Candidates - K-medoids • A medoid of a cluster is the object whose dissimilarity to the other objects in the cluster is smallest • Two variants: average medoid and max medoid • K-medoids are k objects, each the medoid of a different cluster • Why not K-means? • K-means cluster centers generally do not exist in the database • We must present real objects to users
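The medoid definition above can be illustrated with a minimal Python sketch. It assumes Euclidean distance as the dissimilarity and reads "average medoid" as the object minimizing the summed (equivalently, average) distance and "max medoid" as the object minimizing the largest distance; the slides do not spell out either definition, and the function names and sample cluster are hypothetical.

```python
import math

def average_medoid(points):
    """Object with the smallest total (hence average) distance to the others."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

def max_medoid(points):
    """Object with the smallest worst-case distance to the others (assumed reading)."""
    return min(points, key=lambda p: max(math.dist(p, q) for q in points))

# Hypothetical one-cluster data set with one far-away point.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]

# The arithmetic mean (what K-means would report) is not an actual data point.
centroid = tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

print(average_medoid(cluster))  # a real object from the cluster
print(max_medoid(cluster))      # also a real object, possibly a different one
print(centroid in cluster)      # the mean need not exist in the data
```

Both medoids are rows that exist in the data, which is the property the slide asks for; the mean is only a computed location.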

Plotting the Candidates • Data: Yahoo! Autos, 3922 data points • Price and mileage are normalized to 0..1

Plotting the Candidates - Typicality

Plotting the Candidates - K-medoids

User Study Procedure • Users are given: • 7 sets of data, generated using the 7 candidate methods • Each set consists of 8 representative points • Users predict 4 more data points • That are most likely in the data set • Should not pick those already given • Measure the prediction error

Verdict • K-medoids is the winner • In this paper, the authors choose average k-medoids • The proposed algorithm can be extended to max-medoids with small changes

Cover Tree Based Algorithm • Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006 • Briefly discuss Cover Tree properties • See Cover Tree based algorithms for computing k-medoids

Cover Tree Properties (1) • Nesting: for all $i$, $C_i \subset C_{i+1}$, where $C_i$ is the set of nodes at level $i$ • Figure: Points in the Data (One Dimension)

Cover Tree Properties (2) • Covering: a node in $C_i$ is within distance $1/2^i$ of its children in $C_{i+1}$ • The distance from a node in $C_i$ to any descendant is less than $1/2^{i-1}$. This value is called the “span” of the node.

Cover Tree Properties (3) • Separation: nodes in $C_i$ are separated by at least $1/2^i$ • Note: $i$ is allowed to be negative to satisfy the above conditions.
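The three properties can be checked mechanically on a small example. The sketch below is not from the slides: it assumes levels are numbered from 0 at the root with the level index growing downward, that level i has scale 1/2^i (an assumption consistent with the spans 2, 1, 1/2, 1/4 shown in the later example), and it uses a hypothetical list-of-levels representation rather than a real cover tree implementation.

```python
def check_cover_levels(levels, parent, dist):
    """Check nesting, covering and separation on an explicit list of levels.

    levels[i] is the list of points at level i (assumed scale 1/2**i).
    parent[(i + 1, p)] is the level-i parent of point p at level i + 1.
    """
    for i, nodes in enumerate(levels):
        scale = 1.0 / 2 ** i
        # Separation: nodes on the same level are at least `scale` apart.
        for a_idx, a in enumerate(nodes):
            for b in nodes[a_idx + 1:]:
                if dist(a, b) < scale:
                    return False
        if i + 1 < len(levels):
            below = levels[i + 1]
            # Nesting: every node of level i reappears on level i + 1.
            if not set(nodes) <= set(below):
                return False
            # Covering: each child lies within `scale` of its parent.
            for child in below:
                p = parent[(i + 1, child)]
                if p not in nodes or dist(p, child) > scale:
                    return False
    return True

def abs_dist(a, b):
    """Distance between two one-dimensional points."""
    return abs(a - b)

# Hypothetical one-dimensional data in [0, 1).
levels = [[0.0], [0.0, 0.6], [0.0, 0.3, 0.6, 0.9]]
parent = {
    (1, 0.0): 0.0, (1, 0.6): 0.0,
    (2, 0.0): 0.0, (2, 0.3): 0.0, (2, 0.6): 0.6, (2, 0.9): 0.6,
}
print(check_cover_levels(levels, parent, abs_dist))

# Two level-1 nodes closer than 1/2 violate separation.
bad_levels = [[0.0], [0.0, 0.3]]
bad_parent = {(1, 0.0): 0.0, (1, 0.3): 0.0}
print(check_cover_levels(bad_levels, bad_parent, abs_dist))
```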

Additional Stats for Cover Tree (2D Example) • Density (DS): number of points in the subtree • Centroid (CT): geometric center of points in the subtree • Figure labels: p, DS = 10, DS = 3
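A minimal sketch of how the two statistics could be computed bottom-up. It assumes a simplified tree in which only leaf nodes stand for data points and internal nodes merely group them; the `Node` class and `subtree_stats` function are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    point: tuple
    children: list = field(default_factory=list)

def subtree_stats(node):
    """Return (density, centroid) of the subtree rooted at `node`.

    Density counts the leaf points; the centroid is their mean, obtained by
    combining each child's centroid weighted by that child's density.
    """
    if not node.children:
        return 1, node.point
    total = 0
    sums = [0.0] * len(node.point)
    for child in node.children:
        density, centroid = subtree_stats(child)
        total += density
        for d, coord in enumerate(centroid):
            sums[d] += density * coord
    return total, tuple(s / total for s in sums)

# Hypothetical 2D tree: two internal nodes with two leaf points each.
left = Node((0.0, 0.0), [Node((0.0, 0.0)), Node((0.0, 2.0))])
right = Node((4.0, 0.0), [Node((4.0, 0.0)), Node((4.0, 2.0))])
root = Node((0.0, 0.0), [left, right])

print(subtree_stats(left))
print(subtree_stats(root))
```

Storing the pair on each node would let later steps approximate a whole subtree by one weighted point instead of visiting every descendant.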

k-medoid Algorithm Outline • We descend the cover tree to a level with more than k nodes • Choose an initial k points as the first set of medoids (seeds) • Bad seeds can lead to local minima with a high distance cost • Assign nodes and repeatedly update until the medoids converge

Cover Tree Based Seeding • Descend the cover tree to a level with more than k nodes (denote as level m) • Use the parent level as the starting point for seeds • Each node has a weight, calculated as the product of span and density (the contribution of the subtree to the distance cost) • Expand nodes using a priority queue • Fetch the first k nodes from the queue as seeds
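The seeding steps can be sketched with Python's `heapq`. This is one plausible reading, not the authors' exact procedure: it assumes the heaviest node is repeatedly replaced by its children until at least k candidates exist, and that a node at level i has span 1/2^(i-1). The `Node` class, the `pick_seeds` function, and the sample densities are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    level: int
    density: int
    children: list = field(default_factory=list)

    @property
    def weight(self):
        # Assumed span of a level-i node: 1 / 2**(i - 1).
        return self.density / 2 ** (self.level - 1)

def pick_seeds(start_nodes, k):
    """Expand the heaviest node until k candidates exist, then take the top k."""
    # heapq is a min-heap, so weights are negated; names break ties.
    heap = [(-n.weight, n.name, n) for n in start_nodes]
    heapq.heapify(heap)
    leaves = []  # popped nodes that have no children to expand into
    while heap and len(heap) + len(leaves) < k:
        _, _, node = heapq.heappop(heap)
        if node.children:
            for child in node.children:
                heapq.heappush(heap, (-child.weight, child.name, child))
        else:
            leaves.append((-node.weight, node.name, node))
    ranked = sorted(heap + leaves)
    return [node.name for _, _, node in ranked[:k]]

# Hypothetical parent level (level 1, span 1) with three nodes.
a = Node("A", 1, 5, [Node("A", 2, 2), Node("D", 2, 2), Node("E", 2, 1)])
b = Node("B", 1, 3, [Node("B", 2, 3)])
c = Node("C", 1, 2, [Node("C", 2, 2)])

print(pick_seeds([a, b, c], k=4))
```

Weighting by span times density favors large, spread-out subtrees, which are the ones contributing most to the distance cost if left represented by a single seed.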

A Simple Example: k = 4 • Figure: tree levels labeled Span = 2, Span = 1, Span = 1/2, Span = 1/4 • Priority queue on node weight (density * span), first state: S3 (5), S8 (3), S5 (2) • Second state: S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2) • Figure: final set of seeds

Update Process • Initially, assign all nodes to closest seed to form clusters • For each cluster, calculate the geometric center • Use centroid and density information to approximate subtree • Find the node that is closest to the geometric center, designate as a new medoid • Repeat from step 1 until medoids converge
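The update loop can be sketched as follows. The sketch assumes each node is summarized by its centroid and density (so a subtree counts as one weighted point) and uses Euclidean distance; `update_medoids` and the sample nodes are hypothetical, and the real algorithm works on cover tree nodes rather than a flat list.

```python
import math

def update_medoids(nodes, seeds, max_iter=100):
    """Iterate assignment and medoid update until the medoids stop changing.

    nodes: list of (centroid, density) pairs approximating subtrees.
    seeds: indices into `nodes` used as the initial medoids.
    """
    medoids = list(seeds)
    for _ in range(max_iter):
        # Step 1: assign every node to its closest current medoid.
        clusters = {m: [] for m in medoids}
        for idx, (centroid, _) in enumerate(nodes):
            closest = min(medoids, key=lambda m: math.dist(centroid, nodes[m][0]))
            clusters[closest].append(idx)
        # Step 2: density-weighted geometric center of each cluster,
        # then the member node closest to that center becomes the medoid.
        new_medoids = []
        for members in clusters.values():
            total = sum(nodes[i][1] for i in members)
            dims = len(nodes[members[0]][0])
            center = tuple(
                sum(nodes[i][0][d] * nodes[i][1] for i in members) / total
                for d in range(dims)
            )
            new_medoids.append(
                min(members, key=lambda i: math.dist(nodes[i][0], center))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids

# Hypothetical nodes: (centroid, density).
nodes = [
    ((0.0, 0.0), 1), ((1.0, 0.0), 1), ((2.0, 0.0), 3),
    ((10.0, 0.0), 1), ((11.0, 0.0), 1), ((12.0, 0.0), 1),
]
print(update_medoids(nodes, seeds=[0, 3]))
```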

Query Adaptation • Handle user actions • Zooming • Selection (filtering)

Zooming • Expand all nodes assigned to the medoid • Run the k-medoid algorithm on the new set of nodes

Selection • Effect of selection on a node • Completely invalid • Fully valid • Partially valid • Estimate the validity percentage (VG) of each node • Multiply the VG with weight of each node
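A small sketch of the weight adjustment under selection. The slides do not say how the validity percentage is estimated, so this assumes each node records the minimum and maximum of the filtered attribute in its subtree and that values are spread uniformly between them; the function names and price ranges are hypothetical.

```python
def validity_fraction(lo, hi, threshold):
    """Estimated share of a node's points satisfying `value < threshold`.

    Assumes values are spread uniformly over [lo, hi] (an illustrative
    assumption, not the paper's estimator).
    """
    if hi <= lo:
        return 1.0 if lo < threshold else 0.0
    return min(1.0, max(0.0, (threshold - lo) / (hi - lo)))

def adjusted_weight(weight, lo, hi, threshold):
    """Scale a node's weight by its estimated validity fraction."""
    return weight * validity_fraction(lo, hi, threshold)

# Filter "Price < 9,500" applied to three hypothetical nodes of weight 4.0.
print(adjusted_weight(4.0, 10000, 12000, 9500))  # completely invalid
print(adjusted_weight(4.0, 7000, 9000, 9500))    # fully valid
print(adjusted_weight(4.0, 9000, 10000, 9500))   # partially valid
```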

Experiments - Initial Medoid Quality • Compare with the R-tree based method by M. Ester, H. Kriegel, and X. Xu • Data sets • Synthetic dataset: 2D points with Zipf distribution • Real dataset: LA data set from R-tree Portal, 130k points • Measurement • Time to compute the medoids • Average distance from a data point to its medoid
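The two measurements can be expressed in a few lines. This sketch assumes Euclidean distance and that each point belongs to its nearest medoid; `average_distance_to_medoid`, `timed`, and the sample data are hypothetical and not the experimental harness used by the authors.

```python
import math
import time

def average_distance_to_medoid(points, medoids):
    """Mean distance from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points) / len(points)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
medoids = [(1.0, 0.0), (11.0, 0.0)]

cost, seconds = timed(average_distance_to_medoid, points, medoids)
print(cost, seconds)
```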

Results on Synthetic Data • Figure panels: Distance, Time • For various sizes of data, the cover-tree based method outperforms the R-tree based method

Results on Real Data • For various k values, the cover-tree based method outperforms the R-tree based method on real data

Query Adaptation • Compare with re-building the cover tree and running the k-medoid algorithm from scratch • Figure panels: Synthetic Data, Real Data • The time cost of re-building is orders of magnitude higher than incremental computation

Conclusion • Authors proposed MusiqLens framework for solving the many-answer problem • Authors conducted user study to select a metric for choosing representatives • Authors proposed efficient method for computing and maintaining the representatives under user actions • Part of the database usability project at Univ. of Michigan • Led by Prof. H.V. Jagadish • http://www.eecs.umich.edu/db/usable/
