
Learning with information of features 2009-06-05


Presentation Transcript


  1. Learning with information of features 2009-06-05

  2. Contents Motivation Incorporating prior knowledge on features into learning (AISTATS’07) Regularized learning with networks of features (NIPS’08) Conclusion

  3. Contents Motivation Incorporating prior knowledge on features into learning (AISTATS’07) Regularized learning with networks of features (NIPS’08) Conclusion

  4. Motivation Given data X ∈ R^(n×d) plus prior information about the samples: manifold structure information → LapSVM; transformation invariance → VSVM, ISSVM; permutation invariance → π-SVM; imbalance information → SVM for imbalanced distributions; cluster structure information → Structure SVM.

  5. Motivation Information in the sample space (the space spanned by samples)

  6. Motivation Prior information in the feature or attribute space (the space spanned by features)

  7. Motivation X + prior information about features → better generalization

  8. Contents Motivation Incorporating prior knowledge on features into learning (AISTATS’07) Regularized learning with networks of features (NIPS’08) Conclusion

  9. Incorporating prior knowledge on features into learning (AISTATS’07) • Motivation • Kernel design by meta-features • A toy example • Handwritten digit recognition aided by meta-features • Towards a theory of meta-features

  10. Incorporating prior knowledge on features into learning (AISTATS’07) • Motivation • Kernel design by meta-features • A toy example • Handwritten digit recognition aided by meta-features • Towards a theory of meta-features

  11. Incorporating prior knowledge on features into learning (AISTATS’07) Image recognition task: each feature is a pixel (gray level). The coordinate (x, y) of a pixel can be treated as a feature of the feature, i.e., a meta-feature. Features with similar meta-features (more specifically, adjacent pixels) should be assigned similar weights. The paper proposes a framework for incorporating meta-features into learning.

  12. Incorporating prior knowledge on features into learning (AISTATS’07) • Motivation • Kernel design by meta-features • A toy example • Handwritten digit recognition aided by meta-features • Towards a theory of meta-features

  13. Kernel design by meta-features In the standard linear SVM we solve the usual regularized hinge-loss objective; this can be viewed as finding the maximum a posteriori (MAP) hypothesis under a Gaussian prior on w. The covariance matrix C equals the identity matrix, i.e., all weights are assumed to be independent and to have the same variance.
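The objective itself appears only in the slide image; the following is a compact reconstruction of the standard form and its MAP reading (the notation here is mine, not copied from the slides):

```latex
\[
\min_{w,\,\xi}\ \tfrac{1}{2}\,w^\top C^{-1} w \;+\; \lambda \sum_{i} \xi_i
\qquad \text{s.t.}\quad y_i\, w^\top x_i \ \ge\ 1 - \xi_i,\ \ \xi_i \ge 0 .
\]
% With C = I the penalty is the usual \|w\|^2 and this is the standard linear SVM.
% Minimizing it corresponds to the MAP estimate of w under a Gaussian prior
% w ~ N(0, C) combined with the hinge loss; replacing I by a covariance C built
% from meta-features yields the prior used on the following slides.
```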

  14. Kernel design by meta-features We can use meta-features to create a better prior on w: features with similar meta-features are expected to have similar weights, i.e., the weights should be a smooth function of the meta-features. Use a Gaussian prior on w defined by a covariance matrix C, where the covariance between a pair of weights is a decreasing function of the distance between their meta-features.
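A minimal sketch of this construction, assuming pixel features on an H×W grid and a squared-exponential covariance over coordinate distance; the length scale and the whitening trick at the end are illustrative choices, not taken from the paper:

```python
import numpy as np

def metafeature_covariance(coords, length_scale=2.0):
    """Covariance over feature weights: a decreasing function of the distance
    between the features' meta-features (here, pixel coordinates)."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

# Pixel features of an H x W image: the meta-feature of each pixel is its (x, y).
H, W = 28, 28
coords = np.array([(x, y) for y in range(H) for x in range(W)], dtype=float)
C = metafeature_covariance(coords)                  # shape (H*W, H*W)

# One way to use the prior w ~ N(0, C): transform inputs by C^{1/2}. A standard
# L2-regularized linear learner on x' = C^{1/2} x then implicitly pays the
# penalty w^T C^{-1} w in the original feature space.
vals, vecs = np.linalg.eigh(C + 1e-8 * np.eye(len(C)))
C_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
# X_train_t = X_train @ C_half   # then fit, e.g., a linear SVM on X_train_t
```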

  15. Kernel design by meta-features • The invariance is incorporated through the assumption that the weights are smooth in the meta-feature space. • Gaussian process: x → y, smoothness of y in the feature space. This work: u → w, smoothness of the weight w in the meta-feature space.

  16. Incorporating prior knowledge on features into learning (AISTATS’07) • Motivation • Kernel design by meta-features • A toy example • Handwritten digit recognition aided by meta-features • Towards a theory of meta-features

  17. A toy problem MNIST dataset (2 vs. 5)

  18. Incorporating prior knowledge on features into learning (AISTATS’07) • Motivation • Kernel design by meta-features • A toy example • Handwritten digit recognition aided by meta-features • Towards a theory of meta-features

  19. Handwritten digit recognition aided by meta-features

  20. Handwritten digit recognition aided by meta-features Define features and meta-features (the same height is used for all isosceles triangles).

  21. Handwritten digit recognition aided by meta-features 3-input features: 40×20×20 = 16,000 (40 stands for u_r and u_φ, and 20×20 for the center position). 2-input features: 8,000 (the same feature is obtained under a rotation of 180°). Total: 24,000 features.

  22. Handwritten digit recognition aided by meta-features Define the covariance matrix: • Weights of features with different sizes, orientations, or numbers of inputs are uncorrelated. • This gives 40 + 20 identical blocks of size 400×400.

  23. Handwritten digit recognition aided by meta-features

  24. Incorporating prior knowledge on features into learning (AISTATS’07) • Motivation • Kernel design by meta-features • A toy example • Handwritten digit recognition aided by meta-features • Towards a theory of meta-features

  25. Towards a theory of meta-features

  26. Towards a theory of meta-features

  27. Towards a theory of meta-features

  28. Towards a theory of meta-features

  30. Contents Motivation Incorporating prior knowledge on features into learning (AISTATS’07) Regularized learning with networks of features (NIPS’08) Conclusion

  31. Regularized learning with networks of features (NIPS’08) • Motivation • Regularized learning with networks of features • Extensions to feature network regularization • Experiment

  32. Regularized learning with networks of features (NIPS’08) • Motivation • Regularized learning with networks of features • Extensions to feature network regularization • Experiment

  33. Motivation • In supervised learning problems, we may know which features yield similar information about the target variable. • When predicting the topic of a document, we may know that two words are synonyms. • In image recognition, we know which pixels are adjacent. • Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model.

  34. Regularized learning with networks of features (NIPS’08) • Motivation • Regularized learning with networks of features • Extensions to feature network regularization • Experiment

  35. Regularized learning with networks of features A directed network or graph of features, G: vertices are the features of the model; edges link features whose weights are believed to be similar; P_ij is the weight of the directed edge from vertex i to vertex j.

  36. Regularized learning with networks of features • Minimizing the above loss function is equivalent to finding the MAP estimate for w, where w is a priori normally distributed with mean zero and covariance matrix 2M⁻¹. • If P is sparse (only kd entries, with k << d), the additional matrix multiply is O(d), yet the covariance structure over w that it constructs can be dense. • The feature-network regularization penalty is identical to LLE, except that the embedding is found for feature weights rather than for data instances.
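A minimal sketch of such a penalty attached to a logistic loss; the function names, and the choice of logistic rather than hinge loss, are mine rather than the paper's:

```python
import numpy as np

def feature_network_penalty(w, P, alpha=1.0):
    """alpha * || (I - P) w ||^2 -- the LLE-style penalty on feature weights.
    P: (d x d) matrix of directed edge weights (dense or scipy.sparse),
    with each row summing to 1."""
    r = w - P @ w
    return alpha * float(r @ r)

def objective(w, X, y, P, alpha=1.0):
    """Logistic loss (labels y in {-1, +1}) plus the feature-network penalty."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins))) + feature_network_penalty(w, P, alpha)

def penalty_grad(w, P, alpha=1.0):
    """Gradient of the penalty, i.e. 2*alpha*M w with M = (I - P)^T (I - P).
    With a sparse P this costs only a few sparse mat-vecs per evaluation, even
    though the implied prior covariance (proportional to M^{-1}) can be dense."""
    r = w - P @ w
    return 2.0 * alpha * (r - P.T @ r)
```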

  37. Regularized learning with networks of features (NIPS’08) • Motivation • Regularized learning with networks of features • Extensions to feature network regularization • Experiment

  38. Extensions to feature network regularization • Regularizing with classes of features: in machine learning, features can often be grouped into classes such that all weights of the features in a given class are drawn from the same underlying distribution. Assume k disjoint classes of features whose weights are drawn i.i.d. from N(μ_i, σ²), with μ_i unknown but σ² known and shared across all classes. Connecting every pair of features within a class makes the number of edges scale quadratically in the clique sizes, resulting in feature graphs that are not sparse.

  39. Extensions to feature network regularization Solution: introduce a virtual vertex u_k for each class, connect every feature in the class to u_k with unit edge weight, and optimize over the u_k as well.
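One way to write the virtual-vertex construction (notation mine; the paper's exact formulation may differ): each class c gets a free hub value u_c connected to its members by unit edges,

```latex
\[
\Omega(w, u) \;=\; \alpha \sum_{c=1}^{k} \sum_{i \in \mathrm{class}(c)} \bigl(w_i - u_c\bigr)^2 .
\]
% Minimizing over u_c gives u_c = the mean weight of class c, so the penalty
% reduces to the within-class variance of the weights -- the same effect as a
% fully connected clique, but with one edge per feature instead of O(n_c^2).
```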

  40. Extensions to feature network regularization • Incorporating feature dissimilarities • Regularizing features with the graph Laplacian: the network penalty penalizes each feature equally, while the Laplacian penalty penalizes each edge equally, so the Laplacian penalty focuses most of the regularization cost on features with many neighbors.
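Side by side, with s_ij denoting edge similarity weights (notation mine):

```latex
\[
\underbrace{\sum_{i}\Bigl(w_i - \sum_{j} P_{ij}\, w_j\Bigr)^{2}}_{\text{network penalty (one term per feature)}}
\qquad\text{vs.}\qquad
\underbrace{w^\top L\, w \;=\; \sum_{(i,j)\in E} s_{ij}\,(w_i - w_j)^2}_{\text{Laplacian penalty (one term per edge)}}
\]
% One common way to encode a dissimilarity edge is to penalize (w_i + w_j)^2
% instead, encouraging w_i and w_j to take opposite signs.
```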

  41. Regularized learning with networks of features (NIPS’08) • Motivation • Regularized learning with networks of features • Extensions to feature network regularization • Experiment

  42. Experiments • 20 Newsgroups Features: the 11,376 words that occur in at least 20 documents. Feature similarity: each word is represented by a binary vector denoting its presence/absence in the 20,000 documents, and similarity is the cosine between these binary vectors (25 nearest neighbors per word).
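A sketch of that graph construction, assuming a binary document-term matrix B; the row normalization (so each row of P sums to 1, as the network penalty above expects) and the variable names are my choices:

```python
import numpy as np
from scipy import sparse

def knn_feature_graph(B, k=25):
    """Build a sparse, row-normalized feature graph P from a binary
    document-term matrix B (documents x words): cosine similarity between
    word columns, keeping the k nearest neighbors per word."""
    B = sparse.csc_matrix(B, dtype=float)
    norms = np.sqrt(B.multiply(B).sum(axis=0)).A1 + 1e-12
    # Dense for clarity; for an ~11k-word vocabulary you would keep this
    # sparse or use approximate nearest neighbors.
    S = (B.T @ B).toarray() / np.outer(norms, norms)   # cosine similarities
    np.fill_diagonal(S, 0.0)
    d = S.shape[0]
    P = np.zeros_like(S)
    for i in range(d):
        nn = np.argsort(S[i])[-k:]                     # k most similar words
        P[i, nn] = S[i, nn]
    row_sums = P.sum(axis=1, keepdims=True) + 1e-12
    return sparse.csr_matrix(P / row_sums)             # rows sum to 1
```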

  43. Experiments on 20 Newsgroups

  44. Experiments • Sentiment classification (product-review datasets, sentimentally charged words from the SentiWordNet dataset) 1. 200 words from SentiWordNet that also occurred in the product reviews at least 100 times. Words with high positive and high negative sentiment scores form a ‘positive word cluster’ and a ‘negative word cluster’, each attached to a virtual feature, with a dissimilarity edge between the two virtual features.

  45. Sentiment Classification

  46. Sentiment Classification 2. Compute the correlations of all features with the SentiWordNet features, so that each word is represented as a 200-dimensional vector of correlations with these highly charged sentiment words. Feature similarity is then computed from those vectors (100 nearest neighbors per word).
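A sketch of this second similarity measure, which can feed the same knn_feature_graph idea as above; X and senti_idx are placeholder names, not from the paper:

```python
import numpy as np

def sentiment_correlation_vectors(X, senti_idx):
    """Represent each word by its Pearson correlations with the 200 SentiWordNet
    seed words. X: (documents x words) occurrence matrix; senti_idx: column
    indices of the seed words."""
    Xc = X - X.mean(axis=0, keepdims=True)
    Xc /= (X.std(axis=0, keepdims=True) + 1e-12)
    return (Xc.T @ Xc[:, senti_idx]) / X.shape[0]      # (n_words, 200)

# V = sentiment_correlation_vectors(X, senti_idx)
# The 100-nearest-neighbor feature graph is then built from cosine similarity
# between rows of V, analogously to the 20 Newsgroups construction above.
```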

  47. Sentiment Classification

  48. Contents Motivation Incorporating prior knowledge on features into learning (AISTATS’07) Regularized learning with networks of features (NIPS’08) Conclusion

  49. Conclusion • Both works rest on a smoothness assumption over feature weights. • The definition of the meta-features or of the feature-similarity graph is restricted to specific applications. • Could feature information be derived directly from the given data, e.g., the discriminative power of individual features?

  50. Conclusion • Fisher’s discriminant ratio (F1): emphasizes the geometrical characteristics of the class distributions, or more specifically, the manner in which the classes are separated, which is most critical for classification accuracy. • Ratio of the separated region (F2) • Feature efficiency (F3)
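For the first measure, the usual per-feature definition for a two-class problem is sketched below; whether this is exactly the variant the slide intends is an assumption:

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Per-feature Fisher's discriminant ratio for a two-class problem:
    F1_j = (mu1_j - mu2_j)^2 / (var1_j + var2_j).
    Larger values indicate features that separate the two classes better."""
    classes = np.unique(y)
    assert len(classes) == 2, "F1 as written here is for two classes"
    X1, X2 = X[y == classes[0]], X[y == classes[1]]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0) + 1e-12
    return num / den
```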
