Explore the Canopies Algorithm for clustering high-dimensional data efficiently, reducing costly comparisons and focusing efforts on potential matches. This method involves creating overlapping canopies to group data points and streamline the clustering process. Experimental validation suggests optimal threshold values for effective canopies. Learn about Greedy Agglomerative Clustering (GAC) and EM Clustering techniques in conjunction with the Canopies Algorithm for accurate reference matching. Dive into complexity analysis, practical results, and insightful discussions on metric interactions and clustering strategies.
The Canopies Algorithm
from “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching”
Andrew McCallum, Kamal Nigam, Lyle H. Ungar
Presented by Danny Wyatt
Record Linkage Methods
• As classification [Fellegi & Sunter]
  • Data point is a pair of records
  • Each pair is classified as “match” or “not match”
  • Post-process with transitive closure
• As clustering
  • Data point is an individual record
  • All records in a cluster are considered a match
  • No transitive closure if no cluster overlap
Motivation
• Either way, n² such evaluations must be performed
• Evaluations can be expensive
  • Many features to compare
  • Costly metrics (e.g. string edit distance)
• Non-matches far outnumber matches
• Can we quickly eliminate obvious non-matches to focus effort?
Canopies
• A fast comparison groups the data into overlapping “canopies”
• The expensive comparison for full clustering is only performed for pairs in the same canopy
• No loss in accuracy if: “For every traditional cluster, there exists a canopy such that all elements of the cluster are in the canopy”
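The quoted condition can be checked mechanically when reference clusters are available. The sketch below is only an illustration: the function name `satisfies_canopy_property` and the representation of clusters and canopies as sets of record indices are assumptions, not something the slides specify.

```python
def satisfies_canopy_property(clusters, canopies):
    """Return True if every cluster fits entirely inside at least one canopy.

    clusters, canopies: iterables of sets of record indices (assumed layout).
    """
    canopies = list(canopies)
    return all(
        any(cluster <= canopy for canopy in canopies)  # subset test
        for cluster in clusters
    )


# Toy usage: cluster {0, 1} is split across canopies in the second call.
print(satisfies_canopy_property([{0, 1}, {2}], [{0, 1, 2}]))
print(satisfies_canopy_property([{0, 1}, {2}], [{0}, {1, 2}]))
```

When the check fails for some cluster, pairs from that cluster will never reach the expensive comparison, which is where accuracy can be lost.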
Creating Canopies
• Define two thresholds
  • Tight: T1
  • Loose: T2
• Put all records into a set S
• While S is not empty
  • Remove any record r from S and create a canopy centered at r
  • For each other record ri, compute cheap distance d from r to ri
    • If d < T2, place ri in r’s canopy
    • If d < T1, remove ri from S
Creating Canopies
• Points can be in more than one canopy
• Points within the tight threshold will not start a new canopy
• Final number of canopies depends on threshold values and distance metric
• Experimental validation suggests that T1 and T2 should be equal
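The canopy-creation loop from the slides can be sketched in a few lines of Python. This is a minimal reading of the slide, not the authors' implementation: `make_canopies`, its parameters, and the choice to scan every record (rather than only those still in S) are assumptions made for illustration. `tight` corresponds to T1 and `loose` to T2 as the slides name them.

```python
import random


def make_canopies(records, cheap_distance, tight, loose, seed=0):
    """Group records into overlapping canopies using a cheap distance.

    Returns a list of (center_index, member_index_set) pairs.
    Assumes tight <= loose, matching T1 (tight) and T2 (loose) on the slide.
    """
    if tight > loose:
        raise ValueError("tight threshold must not exceed loose threshold")
    rng = random.Random(seed)
    remaining = set(range(len(records)))  # the set S
    canopies = []
    while remaining:
        # "Remove any record r from S": here, a seeded random pick.
        center = rng.choice(sorted(remaining))
        remaining.discard(center)
        members = {center}
        for i, rec in enumerate(records):
            if i == center:
                continue
            d = cheap_distance(records[center], rec)
            if d < loose:
                members.add(i)        # joins this canopy
            if d < tight:
                remaining.discard(i)  # can no longer start a canopy
        canopies.append((center, members))
    return canopies


# Toy usage on 1-D points with absolute difference as the cheap distance.
points = [0.0, 0.5, 1.0, 5.0, 5.2, 10.0]
for center, members in make_canopies(points, lambda a, b: abs(a - b), 1.0, 2.0):
    print(center, sorted(members))
```

Because every record is either a center or was removed by a center within the tight threshold, each record ends up in at least one canopy, and no two centers lie within the tight threshold of each other.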
Canopies and GAC
• Greedy Agglomerative Clustering
  • Make fully connected graph with a node for each data point
  • Edge weights are computed distances
  • Run Kruskal’s MST algorithm, stopping when you have a forest of k trees
  • Each tree is a cluster
• With Canopies
  • Only create edges between points in the same canopy
  • Run as before
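A rough sketch of the canopy-restricted variant follows, using Kruskal's algorithm with a union-find structure. The function name `canopy_gac` and the data layout (canopies as sets of record indices) are assumptions for illustration. Only pairs that share a canopy ever reach `expensive_distance`.

```python
from itertools import combinations


def canopy_gac(records, canopies, expensive_distance, k):
    """Kruskal-style agglomerative clustering over canopy-sharing pairs only.

    canopies: iterable of sets of record indices (assumed layout).
    Returns a list of clusters, each a set of record indices.
    """
    # Candidate edges: pairs that appear together in at least one canopy.
    pairs = set()
    for members in canopies:
        pairs.update(combinations(sorted(members), 2))
    edges = sorted(
        (expensive_distance(records[i], records[j]), i, j) for i, j in pairs
    )

    parent = list(range(len(records)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    n_trees = len(records)
    for _, i, j in edges:  # shortest edges first
        if n_trees <= k:
            break          # forest of k trees reached
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            n_trees -= 1

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())
```

One consequence of dropping cross-canopy edges: if the canopies do not connect the data, the loop runs out of edges and the result can contain more than k clusters.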
EM Clustering
• Create k cluster prototypes c1…ck
• Until convergence
  • Compute distance from each record to each prototype ( O(kn) )
  • Use that distance to compute probability of each prototype given the data
  • Move the prototypes to maximize their probabilities
Canopies and EM Clustering
• Method 1
  • Distances from a prototype to data points are only computed within the canopies containing the prototype
  • Note that prototypes can cross canopies
• Method 2
  • Same as 1, but also use all canopy centers to account for outside data points
• Method 3
  • Same as 1, but dynamically create and destroy prototypes using existing techniques
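One possible reading of Method 1 is sketched below for 1-D data, as a soft k-means style EM with a fixed Gaussian width. Several details are assumptions of this sketch and do not come from the slides: the name `canopy_em_1d`, the fixed `sigma`, the fixed iteration count in place of a convergence test, and the rule that a prototype counts as inside a canopy when it lies within `loose` of that canopy's center.

```python
import math


def canopy_em_1d(points, canopies, prototypes, loose, sigma=1.0, iters=20):
    """Soft k-means style EM on 1-D points, restricted by canopies.

    canopies: list of (center_value, member_index_set) pairs (assumed layout).
    Returns the final prototype positions.
    """
    protos = list(prototypes)
    for _ in range(iters):
        # Points visible to each prototype: union of the canopies containing it.
        visible = []
        for p in protos:
            vis = set()
            for center, members in canopies:
                if abs(p - center) < loose:
                    vis |= members
            visible.append(vis)

        # E-step: responsibilities only for visible (prototype, point) pairs.
        weights = [dict() for _ in protos]
        for i, x in enumerate(points):
            lik = {
                j: math.exp(-((x - protos[j]) ** 2) / (2 * sigma ** 2))
                for j in range(len(protos))
                if i in visible[j]
            }
            total = sum(lik.values())
            if total > 0:
                for j, v in lik.items():
                    weights[j][i] = v / total

        # M-step: move each prototype to the weighted mean of its points.
        for j, w in enumerate(weights):
            mass = sum(w.values())
            if mass > 0:
                protos[j] = sum(points[i] * r for i, r in w.items()) / mass
    return protos
```

Because canopy membership of each prototype is recomputed every iteration, a prototype that drifts toward an overlapping region can pick up a neighboring canopy's points, which is how prototypes can cross canopies in this sketch.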
Complexity
• n : number of data points
• c : number of canopies
• f : average number of canopies covering a data point
• Thus, expect fn/c data points per canopy
• Total distance comparisons needed becomes O(c · (fn/c)²) = O(f²n²/c)
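To get a feel for the savings, the count can be evaluated for concrete numbers, assuming the expensive metric is computed for every pair inside each canopy. In the sketch below, n is the Cora subset size quoted in the slides, while the values of c and f are hypothetical and chosen only for illustration; `expected_comparisons` is likewise an illustrative name.

```python
def expected_comparisons(n, c, f):
    """Expected expensive-distance evaluations with canopies.

    Each of the c canopies holds about f*n/c points and is compared pairwise,
    giving c * (f*n/c)**2 evaluations up to constant factors.
    """
    per_canopy = f * n / c
    return c * per_canopy ** 2


n = 1916          # citations in the labeled Cora subset (from the slides)
c, f = 100, 2.0   # hypothetical values, not taken from the slides
ratio = expected_comparisons(n, c, f) / n ** 2
print(ratio)      # fraction of the n**2 baseline; algebraically f**2 / c
```

The ratio depends only on f and c, so many canopies with little overlap give the largest reduction.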
Reference Matching Results
• Labeled subset of Cora data
  • 1916 citations to 121 distinct papers
• Cheap metric
  • Based on shared words in citations
  • Inverted index makes finding that fast
• Expensive metric
  • Customized string edit distance between extracted author, title, date, and venue fields
• GAC for final clustering
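The role of the inverted index in the cheap metric can be illustrated with a small sketch. It uses a raw count of shared words as a stand-in and is not the exact metric from the paper; the function names and the toy strings are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for i, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word].add(i)
    return index


def shared_word_counts(query_id, docs, index):
    """Count shared words between one document and every document that
    shares at least one word with it, touching only the posting lists."""
    counts = defaultdict(int)
    for word in set(docs[query_id].lower().split()):
        for j in index[word]:
            if j != query_id:
                counts[j] += 1
    return dict(counts)


# Toy strings, not real Cora records.
docs = [
    "McCallum Nigam Ungar canopies 2000",
    "canopies clustering McCallum",
    "Fellegi Sunter record linkage",
]
index = build_inverted_index(docs)
print(shared_word_counts(0, docs, index))
```

Documents with no words in common never appear in the result, so the cheap pass skips them without computing any distance.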
Discussion
• How do cheap and expensive distance metrics interact?
  • Ensure the canopies property
  • Maximize number of canopies
  • Minimize overlap
• Probabilistic extraction, probabilistic clustering
  • How do the two interact?
• Canopies and classification-based linkage
  • Only calculate pair data points for records in the same canopy