
Testing the Significance of Attribute Interactions






Presentation Transcript


  1. Testing the Significance of Attribute Interactions Aleks Jakulin & Ivan Bratko Faculty of Computer and Information Science University of Ljubljana Slovenia

  2. Overview • Interactions: • The key to understanding many peculiarities in machine learning. • Feature importance measures the 2-way interaction between an attribute and the label, but there are interactions of higher orders. • An information-theoretic view of interactions: • Information theory provides a simple “algebra” of interactions, based on summing and subtracting entropy terms (e.g., mutual information). • Part-to-whole approximations: • An interaction is an irreducible dependence. Information-theoretic expressions are model comparisons! • Significance testing: • As with all model comparisons, we can investigate the significance of the model difference.

  3. Example 1: Feature Subset Selection with NBC. The calibration of the classifier (the expected likelihood of an instance’s label) first improves, then deteriorates as we add attributes. The optimal number is ~8 attributes. Are the first few attributes important, and the rest just noise?

  4. Example 1: Feature Subset Selection with NBC NO! We sorted the attributes from the worst to the best. It is some of the best attributes that deteriorate the performance! Why?

  5. Example 2: Spiral/XOR/Parity Problems. Either attribute (x or y) is irrelevant on its own. Together, they make a perfect blue/red classifier.
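The parity behaviour above can be checked numerically. A minimal sketch (assuming empirical distributions over the four equally likely XOR cases) shows that each attribute alone carries zero information gain about the label, while the pair carries a full bit:

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy (bits) of the empirical distribution of a sequence."""
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in Counter(symbols).values())

def info_gain(xs, label):
    """Mutual information I(X;C) = H(X) + H(C) - H(X,C)."""
    return entropy(xs) + entropy(label) - entropy(list(zip(xs, label)))

# The four equally likely XOR/parity cases: c = x XOR y.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
c = [a ^ b for a, b in zip(x, y)]

print(info_gain(x, c))                # 0.0 bits: x alone is useless
print(info_gain(y, c))                # 0.0 bits: y alone is useless
print(info_gain(list(zip(x, y)), c))  # 1.0 bit: together they determine c
```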

  6. What is going on? Interactions. A diagram links the label C with the attributes A and B. The 2-way interactions are: the importance of attribute A (between A and C), the importance of attribute B (between B and C), and the attribute correlation (between A and B). The 3-way interaction is what is common to A, B and C together, and cannot be inferred from any subset of attributes.

  7. Quantification: Shannon’s Entropy. H(A) is the information which came with the knowledge of A; the entropy is computed from the empirical probability distribution (e.g., p = [0.2, 0.8] for C). H(AB) is the joint entropy. I(A;C) = H(A) + H(C) - H(AC) is the mutual information or information gain: how much do A and C have in common? H(C|A) = H(C) - I(A;C) is the conditional entropy: the remaining uncertainty in C after learning A.
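These identities can be verified numerically. A small sketch, assuming an arbitrary toy joint distribution P(A,C) whose C-marginal matches the slide’s p = [0.2, 0.8]:

```python
from math import log2

# Assumed toy joint P(A, C) over A in {0,1}, C in {0,1};
# its C-marginal is p = [0.2, 0.8] as on the slide.
P = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.5}

def H(dist):
    """Shannon entropy (bits) of a probability mapping."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(P, idx):
    """Marginal distribution over coordinate idx of the joint."""
    out = {}
    for key, p in P.items():
        out[key[idx]] = out.get(key[idx], 0.0) + p
    return out

PA, PC = marginal(P, 0), marginal(P, 1)
I_AC = H(PA) + H(PC) - H(P)    # mutual information I(A;C)
H_C_given_A = H(P) - H(PA)     # H(C|A) = H(AC) - H(A)

# The slide's identity: H(C|A) = H(C) - I(A;C)
assert abs(H_C_given_A - (H(PC) - I_AC)) < 1e-12
print(round(I_AC, 4), round(H_C_given_A, 4))
```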

  8. Interaction Information How informative are A and B together? I(A;B;C) := I(AB;C) - I(A;C) - I(B;C) = I(B;C|A) - I(B;C) = I(A;C|B) - I(A;C) • (Partial) history of independent reinventions: • Quastler ‘53 (Info. Theory in Biology) - measure of specificity • McGill ‘54 (Psychometrika) - interaction information • Han ‘80 (Information & Control) - multiple mutual information • Yeung ‘91 (IEEE Trans. On Inf. Theory) - mutual information • Grabisch&Roubens ‘99 (I. J. of Game Theory) - Banzhaf interaction index • Matsuda ‘00 (Physical Review E) - higher-order mutual inf. • Brenner et al. ‘00 (Neural Computation) - average synergy • Demšar ’02 (A thesis in machine learning) - relative information gain • Bell ‘03 (NIPS02, ICA2003) - co-information • Jakulin ‘02 - interaction gain
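A short sketch of the definition I(A;B;C) := I(AB;C) - I(A;C) - I(B;C), using assumed toy samples: XOR yields a positive (synergistic) value, while three identical attributes yield a negative (redundant) one:

```python
from collections import Counter
from math import log2

def entropy(seq):
    """Shannon entropy (bits) of the empirical distribution of a sequence."""
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in Counter(seq).values())

def mi(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def interaction_info(a, b, c):
    """I(A;B;C) := I(AB;C) - I(A;C) - I(B;C)."""
    ab = list(zip(a, b))
    return mi(ab, c) - mi(a, c) - mi(b, c)

# Synergy: c = a XOR b gives positive interaction information (+1 bit).
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(interaction_info(a, b, [x ^ y for x, y in zip(a, b)]))  # 1.0

# Redundancy: a, b and c identical gives a negative value (-1 bit).
r = [0, 1, 0, 1]
print(interaction_info(r, r, r))  # -1.0
```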

  9. Applications: Interaction Graphs. CMC domain: the label is the ‘contraceptive method’ used by a couple. Information gain, 100% × I(A;C)/H(C): the attribute “explains” 1.98% of label entropy. A positive interaction, 100% × I(A;B;C)/H(C): the two attributes are in a synergy; treating them holistically may result in 1.85% extra uncertainty explained. A negative interaction, 100% × I(A;B;C)/H(C): the two attributes are slightly redundant; 1.15% of label uncertainty is explained by each of the two attributes.

  10. Interaction as Attribute Proximity. In the proximity diagram, information gain separates uninformative from informative attributes, while interaction strength determines distance: weakly interacting attributes lie far apart, strongly interacting ones lie close together, and cluster “tightness” ranges from loose to tight.

  11. Part-to-Whole Approximation. Mutual information as a part-to-whole comparison: the whole is P(A,B), the parts are {P(A), P(B)}, and the approximation is P'(a,b) = P(a)P(b). The Kullback-Leibler divergence serves as the measure of difference: D(P(A,B) || P(A)P(B)) = I(A;B). The same comparison also applies to predictive accuracy.
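The part-to-whole identity can be checked directly: the KL divergence between an (assumed toy) joint P(A,B) and the product of its marginals equals the mutual information I(A;B):

```python
from math import log2

# Assumed toy joint P(A, B) over binary A and B.
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal(P, idx):
    """Marginal distribution over coordinate idx of the joint."""
    m = {}
    for k, p in P.items():
        m[k[idx]] = m.get(k[idx], 0.0) + p
    return m

def H(dist):
    """Shannon entropy (bits) of a probability mapping."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

PA, PB = marginal(P, 0), marginal(P, 1)

# Part-to-whole approximation: parts {P(A), P(B)} -> P'(a,b) = P(a)P(b).
approx = {(a, b): PA[a] * PB[b] for a in PA for b in PB}

# KL divergence D(P || P') ...
kl = sum(p * log2(p / approx[k]) for k, p in P.items() if p > 0)

# ... equals the mutual information I(A;B) = H(A) + H(B) - H(A,B).
assert abs(kl - (H(PA) + H(PB) - H(P))) < 1e-12
print(round(kl, 4))
```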

  12. Kirkwood Superposition Approximation. P'(a,b,c) = P(a,b)P(b,c)P(a,c) / (P(a)P(b)P(c)) is a closed-form part-to-whole approximation, a special case of the Kikuchi and mean-field approximations. P' is not normalized, which explains the negative interaction information. It is not optimal (loglinear models beat it).
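A numeric sketch, with an assumed toy joint P(A,B,C), showing that the Kirkwood superposition approximation built from the pairwise marginals need not sum to 1:

```python
# Assumed toy joint P(A, B, C) over binary variables.
P = {(0, 0, 0): 0.2, (0, 0, 1): 0.1, (0, 1, 0): 0.1, (0, 1, 1): 0.1,
     (1, 0, 0): 0.1, (1, 0, 1): 0.1, (1, 1, 0): 0.1, (1, 1, 1): 0.2}

def marg(P, idxs):
    """Marginal distribution over the given coordinate indices."""
    m = {}
    for k, p in P.items():
        key = tuple(k[i] for i in idxs)
        m[key] = m.get(key, 0.0) + p
    return m

PAB, PBC, PAC = marg(P, (0, 1)), marg(P, (1, 2)), marg(P, (0, 2))
PA, PB, PC = marg(P, (0,)), marg(P, (1,)), marg(P, (2,))

# Kirkwood superposition approximation:
#   P'(a,b,c) = P(a,b) P(b,c) P(a,c) / (P(a) P(b) P(c))
kirkwood = {(a, b, c): PAB[a, b] * PBC[b, c] * PAC[a, c]
                       / (PA[a,] * PB[b,] * PC[c,])
            for (a, b, c) in P}

total = sum(kirkwood.values())
print(total)  # 1.008 for this joint: not a normalized distribution
```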

  13. Significance Testing • Tries to answer the question: “When is P much better than P’?” • It is based on the realization that even the correct probabilistic model P can be expected to make an error on a sample of finite size. • The notion of self-loss captures the distribution of the loss of the complex model (“variance”). • The notion of approximation loss captures the loss caused by using a simpler model (“bias”). • P is significantly better than P’ when the error made by P’ is greater than the self-loss in at least 95% of cases; the P-value is then at most 0.05.

  14. Test-Bootstrap Protocol. To obtain the self-loss distribution, we perturb the test data, which is a bootstrap sample from the whole data set. As the loss function, we employ the KL-divergence. This is very similar to assuming that D(P’||P) has a χ2 distribution.
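A simplified sketch of the test-bootstrap idea (the models, sample size, and number of resamples here are assumptions, not the paper’s setup): draw bootstrap resamples of the test data, record the KL loss of the true model P (the self-loss) and of a simpler model, and report how often the self-loss reaches the simpler model’s loss:

```python
import random
from math import log2

random.seed(0)

def kl(p, q):
    """D(p || q) in bits over a shared finite support."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def empirical(sample, k):
    """Empirical distribution of a sample over outcomes 0..k-1."""
    counts = [0] * k
    for x in sample:
        counts[x] += 1
    return [c / len(sample) for c in counts]

# Assumed setup: the "complex" model P is the true source; P_alt is a
# simpler (part-to-whole style) approximation. Both are over 4 outcomes.
P     = [0.1, 0.2, 0.3, 0.4]
P_alt = [0.25, 0.25, 0.25, 0.25]
data  = random.choices(range(4), weights=P, k=500)

self_loss, alt_loss = [], []
for _ in range(2000):
    boot = random.choices(data, k=len(data))  # perturb the test data
    emp = empirical(boot, 4)
    self_loss.append(kl(emp, P))       # loss of the correct model itself
    alt_loss.append(kl(emp, P_alt))    # loss of the simpler approximation

# P-value: how often the self-loss reaches the approximation's loss.
p_value = sum(s >= a for s, a in zip(self_loss, alt_loss)) / len(self_loss)
print(p_value)  # small -> P is significantly better than P_alt
```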

  15. Self-Loss

  16. Cross-Validation Protocol • P-values ignore the variation in approximation loss and the generalization power of a classifier. • CV-values are based instead on a cross-validation perturbation procedure.

  17. The Myth of Average Performance. The distribution of the difference in loss between the two models spans both outcomes: at one end the interaction (complex) model wins, at the other the approximation (simple) model wins. How much do the mode/median/mean of this distribution tell you about which model to select?

  18. Summary • The existence of an interaction implies the need for a more complex model that joins the attributes. • Feature relevance is an interaction of order 2. • If there is no interaction, a complex model is unnecessary. • Information theory provides an approximate “algebra” for investigating interactions. • The difference between two models is a distribution, not a scalar. • Occam’s P-Razor: Pick the simplest model among those that are not significantly worse than the best one.
