New Algorithms for Efficient High-Dimensional Nonparametric Classification
This paper presents novel algorithms for improving high-dimensional nonparametric classification using k-Nearest Neighbors (k-NN). It addresses the computational challenges of conventional k-NN search, especially for skewed-class data. Key contributions include KNS1 for standard k-NN search, KNS2 tailored to skewed classes, and KNS3, which evaluates lower and upper bounds on the number of positive neighbors. Comprehensive experiments on real data indicate that these algorithms reduce computational cost while maintaining classification accuracy, paving the way for enhanced machine learning applications in high-dimensional spaces.
Presentation Transcript
New Algorithms for Efficient High-Dimensional Nonparametric Classification Ting Liu, Andrew W. Moore, and Alexander Gray
Overview • Introduction • k Nearest Neighbors (k-NN) • KNS1: conventional k-NN search • New algorithms for k-NN classification • KNS2: for skewed-class data • KNS3: "are at least t of the k NN positive?" • Results • Comments
Introduction: k-NN • k-NN • Nonparametric classification method. • Given a data set of n data points, it finds the k closest points to a query point q and chooses the label corresponding to the majority among them. • Computational cost is too high in many settings, especially in the high-dimensional case.
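As a concrete reference point, the brute-force version of what all three algorithms accelerate can be sketched in a few lines (a minimal numpy illustration; the function name and data layout are ours, not from the slides):

```python
import numpy as np

def knn_classify(q, points, labels, k):
    """Brute-force k-NN: majority label among the k closest points to q.
    Cost is O(n * d) per query, which motivates the ball-tree methods."""
    dists = np.linalg.norm(points - q, axis=1)   # distance from q to every point
    nearest = labels[np.argsort(dists)[:k]]      # labels of the k closest points
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]               # majority vote
```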
Introduction: KNS1 • KNS1: • Conventional k-NN search with a ball-tree. • Ball-Tree (binary): • Root node represents the full set of points. • Leaf node contains some points. • Non-leaf node has two child nodes. • Pivot of a node: one of the points in the node, or the centroid of the points. • Radius of a node: the maximum distance from the pivot to any point in the node.
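The ball-tree structure described above can be sketched as follows (a minimal illustration assuming numpy; the split heuristic and `leaf_size` choice are ours, not prescribed by the slides):

```python
import numpy as np

class BallNode:
    """One node of a binary ball-tree over a set of points."""
    def __init__(self, points):
        self.points = points
        # Pivot: here the centroid of the node's points (a member point also works).
        self.pivot = points.mean(axis=0)
        # Radius: max distance from the pivot to any point in the node.
        self.radius = float(np.max(np.linalg.norm(points - self.pivot, axis=1)))
        self.left = self.right = None

def build_ball_tree(points, leaf_size=2):
    """Recursively split along the dimension of largest spread (one simple heuristic)."""
    node = BallNode(points)
    if len(points) > leaf_size:
        d = int(np.argmax(np.ptp(points, axis=0)))   # widest dimension
        order = np.argsort(points[:, d])
        half = len(points) // 2
        node.left = build_ball_tree(points[order[:half]], leaf_size)
        node.right = build_ball_tree(points[order[half:]], leaf_size)
    return node
```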
Introduction: KNS1 • Bound the distance from a query point q to any point x in a node: |q − pivot| − radius ≤ |q − x| ≤ |q − pivot| + radius. • Trade off the cost of construction against the tightness of the radius of the balls.
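The pivot/radius bound gives cheap lower and upper limits on the distance from q to anything inside a ball, which is what makes pruning possible; a small sketch (function names are illustrative):

```python
import numpy as np

def min_possible_dist(q, pivot, radius):
    """Lower bound on |q - x| for any point x in the ball: |q - pivot| - radius,
    clipped at zero (q may lie inside the ball)."""
    return max(0.0, float(np.linalg.norm(q - pivot)) - radius)

def max_possible_dist(q, pivot, radius):
    """Upper bound on |q - x| for any point x in the ball."""
    return float(np.linalg.norm(q - pivot)) + radius
```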
Introduction: KNS1 • Recursive procedure: PSout = BallKNN(PSin, Node) • PSin consists of the k-NN of q in V (the set of points searched so far) • PSout consists of the k-NN of q in V ∪ Node
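The BallKNN recursion can be sketched roughly as below, with nodes represented as plain dicts for brevity (an illustrative stand-in for the paper's pseudocode, not its exact form). The key move is the prune: if even the closest possible point in a ball cannot beat the current kth-nearest distance, the whole subtree is skipped.

```python
import heapq
import numpy as np

def ball_knn(q, node, k, best=None):
    """Recursive k-NN search. `best` is a max-heap (via negated distances) of the
    k nearest points found in the volume searched so far; it plays the role of
    PSin on entry and PSout on return."""
    if best is None:
        best = []
    # Prune: lower bound on the distance from q to anything in this ball.
    lower = max(0.0, float(np.linalg.norm(q - node['pivot'])) - node['radius'])
    if len(best) == k and lower >= -best[0][0]:
        return best                       # the whole node can be skipped
    if node['left'] is None:              # leaf: examine points directly
        for p in node['points']:
            d = float(np.linalg.norm(q - p))
            if len(best) < k:
                heapq.heappush(best, (-d, tuple(p)))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, tuple(p)))
    else:
        # Visit the child whose pivot is closer to q first (tightens the prune).
        children = sorted((node['left'], node['right']),
                          key=lambda c: float(np.linalg.norm(q - c['pivot'])))
        for child in children:
            best = ball_knn(q, child, k, best)
    return best
```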
KNS2 • KNS2: • For skewed-class data: one class is much more frequent than the other. • Find the number of the k NN that are in the positive class without explicitly finding the k-NN set. • Basic idea: • Build two ball-trees: Postree (small), Negtree (large). • "Find positive": search Postree with KNS1 to find Possetk, the k-NN set of q among the positive points. • "Insert negative": search Negtree, using Possetk as bounds to prune faraway nodes and to count the negative points that must be inserted into the true nearest-neighbor set.
KNS2 • Definitions: • Dists = {Dist1, …, Distk}: the distances to the k nearest positive neighbors of q, sorted in increasing order. • V: the set of points in the negative balls visited so far. • (n, C): n is the number of positive points among the k NN of q; C = {C1, …, Cn}, where Ci is the number of negative points in V closer to q than the ith positive neighbor.
KNS2 Step 2, "insert negative", is implemented by the recursive function (nout, Cout) = NegCount(nin, Cin, Node, jparent, Dists) • (nin, Cin) summarize the interesting negative points for V; • (nout, Cout) summarize the interesting negative points for V and Node.
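The bookkeeping that NegCount maintains can be checked against a brute-force stand-in (no ball-tree pruning here; the function name and layout are ours). The invariant it exploits: the ith positive neighbor (0-indexed) stays in the k-NN set exactly when i closer positives plus Ci closer negatives still leave it inside the top k.

```python
import numpy as np

def count_positives_in_knn(q, pos_pts, neg_pts, k):
    """Brute-force version of KNS2's answer: how many of q's k NN are positive.
    Step 1 ("find positive"): sorted distances to the nearest positive points.
    Step 2 ("insert negative"): count negatives that displace each of them."""
    pos_d = np.sort(np.linalg.norm(pos_pts - q, axis=1))[:k]   # Dists
    neg_d = np.linalg.norm(neg_pts - q, axis=1)
    n = 0
    for i, d in enumerate(pos_d):        # i = 0 .. k-1
        c_i = int(np.sum(neg_d < d))     # negatives closer than this positive (Ci)
        if i + c_i < k:                  # it still fits among the k nearest
            n += 1
    return n
```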
KNS3 • KNS3 • "Are at least t of the k nearest neighbors positive?" • No constraint on class skewness. • Proposition: the answer is "yes" iff the distance to the tth nearest positive is smaller than the distance to the mth nearest negative, where m + t = k + 1. • Instead of computing these distances exactly, compute lower and upper bounds on them until the comparison is decided.
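The proposition can be demonstrated with a brute-force check (illustrative only; it assumes at least t positive and m negative points exist, and KNS3 itself never computes these distances exactly, only tightens bounds on them):

```python
import numpy as np

def at_least_t_positive(q, pos_pts, neg_pts, k, t):
    """Brute-force form of KNS3's question: are at least t of q's k NN positive?
    With m = k - t + 1 (so m + t = k + 1), this holds iff the t-th nearest
    positive is closer than the m-th nearest negative."""
    pos_d = np.sort(np.linalg.norm(pos_pts - q, axis=1))
    neg_d = np.sort(np.linalg.norm(neg_pts - q, axis=1))
    m = k - t + 1
    return bool(pos_d[t - 1] < neg_d[m - 1])
```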
KNS3 P is a set of balls from Postree, N consists of balls from Negtree.
Experimental results • Real data
Experimental results k = 9, t = ceiling(k/2). Randomly pick 1% of the negative records and 50% of the positive records as the test set (986 points); train on the remaining 87,372 data points.
Comments • Why k-NN? It is a standard baseline. • No free lunch: • For uniformly distributed high-dimensional data, there is no benefit. • The results therefore suggest that the intrinsic dimensionality of real data is much lower.