
Local one class optimization


Presentation Transcript


  1. Local one class optimization Gal Chechik, Stanford; joint work with Koby Crammer, Hebrew University of Jerusalem

  2. The one-class problem: Find a subset of similar/typical samples. Formally: find a ball of a given radius (under some metric) that covers as many data points as possible (related to the set-covering problem).
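A minimal sketch of this covering view, assuming an L2 metric and using the data points themselves as candidate centers (the function name best_ball and the radius R are illustrative, not from the slides):

```python
import numpy as np

def best_ball(X, candidates, R):
    """Return the candidate center that covers the most points within radius R."""
    best_center, best_count = None, -1
    for w in candidates:
        covered = int(np.sum(np.linalg.norm(X - w, axis=1) <= R))
        if covered > best_count:
            best_center, best_count = w, covered
    return best_center, best_count

# Example: use the samples themselves as candidate centers.
X = np.random.randn(200, 2)
center, count = best_ball(X, X, R=0.5)
```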

  3. Motivation I Unsupervised setting: Sometimes we wish to model small parts of the data and ignore the rest. This happens when many data points are irrelevant. Examples: • Finding sets of co-expressed genes in genome-wide experiments: identify the relevant genes out of thousands of irrelevant ones. • Finding a set of documents on the same topic in a heterogeneous corpus.

  4. Motivation II Supervised setting: Learning from positive samples only. Examples: • Protein interactions • Intrusion detection. These applications care about a low false-positive rate.

  5. Current approaches Often treat the problem as outlier and novelty detection: most samples are relevant. Current approaches use: • A convex cost function (Schölkopf 95, Tax and Duin 99, Ben-Hur et al. 2001). • A parameter that affects the size or weight of the ball. • Bias towards the center of mass: when searching for a small ball, the center of the optimal ball sits at the global center of mass, w* = argmin_w Σ_x (x − w)², missing the interesting structures.
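A small numerical check of this bias (not from the slides; the data layout loosely mimics the synthetic example on the next slide): the minimizer of the summed squared distance is the mean of all samples, background noise included.

```python
import numpy as np

rng = np.random.default_rng(0)
# two tight clusters plus uniform background noise
X = np.vstack([rng.normal(-3.0, 0.3, (50, 2)),
               rng.normal(+3.0, 0.3, (50, 2)),
               rng.uniform(-6.0, 6.0, (200, 2))])

def total_sq_cost(w):
    """Sum over all samples of the squared L2 distance to w."""
    return float(np.sum(np.linalg.norm(X - w, axis=1) ** 2))

w_star = X.mean(axis=0)                     # the global center of mass
# moving w away from the mean can only increase the cost
assert total_sq_cost(w_star) <= total_sq_cost(w_star + 0.1)
```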

  6. Current approaches Example with synthetic data: 2 Gaussians + uniform background. [Figure: comparison of convex one-class (OSU-SVM) vs. local one-class.]

  7. How do we do it: • A cost function designed for small sets • A probabilistic approach: allow soft assignment to the set • Regularized optimization

  8. 1. A cost function for small sets • The case where only a few samples are relevant • Use a cost function that is flat for samples not in the set • Two parameters: • Divergence measure D_BF • Flat cost K • Indifferent to the position of “irrelevant” samples. • Solutions converge to the center of mass when the ball is large.

  9. 2. A probabilistic formulation • We are given m samples in a d-dimensional space or simplex, indexed by x. • p(x) is the prior distribution over samples. • c = {TRUE, FALSE} is an R.V. that characterizes assignment to the interesting set (the “Ball”). • p(c|x) reflects our belief that the sample x is “interesting”. • The cost function is D = p(c|x) D_BF(w‖v_x) + (1 − p(c|x)) K, where D_BF is a divergence measure, to be discussed later.
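As a sketch, the per-sample cost from this slide can be written directly (the names p_cx, v_x, and d_bf are illustrative, with d_bf standing for whatever Bregman divergence is chosen later):

```python
def sample_cost(p_cx, w, v_x, K, d_bf):
    """D = p(c|x) * D_BF(w || v_x) + (1 - p(c|x)) * K."""
    return p_cx * d_bf(w, v_x) + (1.0 - p_cx) * K
```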

  10. 3. Regularized optimization The goal: minimize the mean cost + regularization: min_{p(c|x), w} β ⟨D_BF,K(p(c|x), w; v_x)⟩_p(c,x) + I(C;X) • The first term measures the mean distortion: ⟨D_BF,K(p(c|x), w; v_x)⟩ = Σ_x p(x) [p(c|x) D_BF(w‖v_x) + (1 − p(c|x)) K] • The second term regularizes the compression of the data (removes information about X): I(C;X) = H(X) − H(X|C). It pushes for putting many points in the set. • This target function is not convex.
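A hedged sketch of evaluating this objective for a given soft assignment (the array names p_x, p_cx, V and the helper d_bf are illustrative; the TRUE/FALSE marginal p(c) is computed from p(c|x) and p(x)):

```python
import numpy as np

def objective(p_x, p_cx, w, V, K, beta, d_bf, eps=1e-12):
    """beta * <D_BF,K> + I(C;X), for samples V, priors p_x, assignments p_cx = p(c=TRUE|x)."""
    divs = np.array([d_bf(w, v) for v in V])
    # mean distortion <D_BF,K> under p(c, x)
    distortion = np.sum(p_x * (p_cx * divs + (1.0 - p_cx) * K))
    # mutual information I(C;X) = sum_x p(x) sum_c p(c|x) log(p(c|x) / p(c))
    p_assign = np.stack([p_cx, 1.0 - p_cx], axis=1)          # p(c|x) for c in {TRUE, FALSE}
    p_marg = np.array([np.sum(p_x * p_cx), np.sum(p_x * (1.0 - p_cx))])
    info = np.sum(p_x[:, None] * p_assign * np.log((p_assign + eps) / (p_marg + eps)))
    return beta * distortion + info
```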

  11. To solve the problem • It turns out that for a family of divergence functions, called Bregman divergences, we can analytically describe properties of the optimal solution. • The proof follows the analysis of the Information Bottleneck method (Tishby, Pereira & Bialek, 1999).

  12. Bregman divergences • A Bregman divergence is defined by a convex function F (in our case F(v) = Σ_i f(v_i)) • Common examples: L2 norm: f(x) = ½x²; Itakura-Saito: f(x) = −log(x); D_KL: f(x) = x log(x); unnormalized relative entropy: f(x) = x log(x) − x • Lemma (convexity of the Bregman ball): the set of points {v s.t. B_F(v‖w) < R} is convex.
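A short sketch of computing B_F(v‖w) = F(v) − F(w) − ⟨∇F(w), v − w⟩ for the separable case F(v) = Σ_i f(v_i), instantiated with the four f's listed above (helper names are illustrative):

```python
import numpy as np

def bregman(v, w, f, df):
    """B_F(v || w) for F(v) = sum_i f(v_i), with df the derivative of f."""
    return float(np.sum(f(v) - f(w) - df(w) * (v - w)))

# L2 norm, f(x) = x^2/2  -> half the squared Euclidean distance
l2 = lambda v, w: bregman(v, w, lambda x: 0.5 * x**2, lambda x: x)
# Itakura-Saito, f(x) = -log(x)
itakura_saito = lambda v, w: bregman(v, w, lambda x: -np.log(x), lambda x: -1.0 / x)
# KL divergence (for distributions on the simplex), f(x) = x log(x)
kl = lambda v, w: bregman(v, w, lambda x: x * np.log(x), lambda x: np.log(x) + 1.0)
# Unnormalized relative entropy, f(x) = x log(x) - x
ure = lambda v, w: bregman(v, w, lambda x: x * np.log(x) - x, lambda x: np.log(x))
```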

  13. Properties of the solution The one-class solutions obey three fixed-point equations. When β→∞, the best assignment for each sample x is the one that minimizes its cost.

  14. The effect of K • K controls the nature of the solution. • K is the cost of leaving a point out of the ball. • Large K => large radius & many points in the set. • For the L2 norm, K is formally related to the prior of a single Gaussian fit to the subset. • A full description of the data may require solving for the complete spectrum of K values.

  15. Algorithm: One-Class IB Adapting the sequential-IB algorithm: One-Class IB: Input: a set of m points v_x, a divergence B_F, a cost K. Output: centroid w, assignment p(c|x). Optimization method: • Iterate sample by sample, trying to modify the status of a single sample. • One-step look-ahead: re-fit the model and decide whether to change the assignment of the sample. • This uses a simple formula because of the nice properties of Bregman divergences. • Search in the dual space of samples, rather than in the space of the parameters w.
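A minimal hard-assignment sketch of this sequential, sample-by-sample procedure for the L2 divergence; it re-fits the centroid in full rather than using the Bregman shortcut mentioned above, and all names are illustrative rather than the authors' implementation:

```python
import numpy as np

def one_class_ib_l2(V, K, n_sweeps=20, seed=0):
    """Greedy one-class optimization: flip one sample at a time with a one-step look-ahead."""
    rng = np.random.default_rng(seed)
    m = len(V)
    in_set = rng.random(m) < 0.5                     # random initial assignment

    def total_cost(mask):
        if not mask.any():                           # empty set: every point pays the flat cost K
            return K * m
        w = V[mask].mean(axis=0)                     # L2 centroid of the current set
        return float(np.sum((V[mask] - w) ** 2)) + K * int(np.sum(~mask))

    cost = total_cost(in_set)
    for _ in range(n_sweeps):
        changed = False
        for i in rng.permutation(m):
            trial = in_set.copy()
            trial[i] = ~trial[i]                     # look ahead: flip sample i and re-fit
            trial_cost = total_cost(trial)
            if trial_cost < cost:
                in_set, cost, changed = trial, trial_cost, True
        if not changed:                              # no flip improved the cost: local optimum
            break
    w = V[in_set].mean(axis=0) if in_set.any() else V.mean(axis=0)
    return w, in_set
```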

  16. Experiments 1: information retrieval Five most frequent categories of Reuters-21578. Each document is represented as a multinomial distribution over 2000 terms. The experimental setup, for each category: • train with half of the positive documents, • test with all remaining documents. Compared one-class IB with a convex one-class method that uses a convex loss function (Crammer & Singer, 2003), controlled by a single parameter η that determines the weight of the class.
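As a sketch of the document representation assumed here, each document is mapped to normalized term frequencies over a fixed vocabulary (the 2000-term vocabulary and the function name are illustrative):

```python
from collections import Counter
import numpy as np

def doc_to_multinomial(tokens, vocab):
    """tokens: list of words in one document; vocab: dict mapping term -> index (~2000 terms)."""
    v = np.zeros(len(vocab))
    for term, count in Counter(tokens).items():
        if term in vocab:
            v[vocab[term]] += count
    total = v.sum()
    return v / total if total > 0 else v             # a point on the term simplex
```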

  17. Experiments 1: information retrieval Compare precision-recall performance for a range of K/μ values. [Figure: precision-recall curves.]

  18. Experiments 1: information retrieval Centroids of clusters, and their distances from the center of mass

  19. Experiments 2: gene expression A typical application: searching for small but interesting sets of genes. Genes are represented by their expression profiles across tissues from different patients. The Alizadeh (2000) dataset (B-cell lymphoma tissues) includes mortality data, which can be used as an objective way to validate the quality of the selected genes.

  20. Experiments 2: gene expression One-class IB compared with one-class SVM (L2). For a series of K values, the gene set with the lowest loss was found (10 restarts). The selected genes were then used for regression against the mortality data. [Figure: significance of the regression prediction (p-value), from good to bad.]

  21. Future work: finding ALL relevant subsets • Complete characterization of all interesting subsets in the data. • Assume we have a function that assigns an interest value to each subset; we search the space of subsets for all its local maxima. • This requires defining locality: a natural measure of locality in subset space is the Hamming distance. • A complete characterization of the data requires a description over a range of local neighborhoods.
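A toy sketch of the local-maximum test this implies, assuming a user-supplied interest(mask) score over boolean subset indicators (both names are hypothetical):

```python
def is_local_maximum(mask, interest):
    """mask: boolean array over samples; interest: callable scoring such subsets."""
    base = interest(mask)
    for i in range(len(mask)):
        neighbor = mask.copy()
        neighbor[i] = not neighbor[i]       # flip one sample: a Hamming-distance-1 neighbor
        if interest(neighbor) > base:
            return False
    return True
```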

  22. Future work: multiple one-class • Synthetic example: two overlapping Gaussians and uniform background noise

  23. Conclusions • We focus on learning a one-class model for cases where a small ball is sought. • We formalize the problem using the Information Bottleneck and derive its formal solutions. • One-class IB performs well in the regime of small subsets.
