Multivariate Data Analysis

Presentation Transcript


  1. Multivariate Data Analysis Helge Voss (MPI–K, Heidelberg), Christoph Rosemann (DESY) Terascale Statistics School, Mainz, April 2011

  2. MVA Literature / Software Packages... a biased selection Literature: • T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning", Springer 2001 • C.M. Bishop, "Pattern Recognition and Machine Learning", Springer 2006 Software packages for Multivariate Data Analysis/Classification • individual classifier software, e.g. "JETNET" (C. Peterson, T. Rognvaldsson, L. Loennblad) and many other packages • attempts to provide "all inclusive" packages • StatPatternRecognition: I. Narsky, arXiv: physics/0507143 http://www.hep.caltech.edu/~narsky/spr.html • TMVA: Höcker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039 http://tmva.sf.net or every ROOT distribution (development moved from SourceForge to the ROOT repository) • WEKA: http://www.cs.waikato.ac.nz/ml/weka/ • "R": a huge data analysis library: http://www.r-project.org/ Conferences: PHYSTAT, ACAT, … Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  3. HEP Experiments: Simulated Higgs Event in CMS • That's what a "typical" Higgs event looks like (underlying ~23 'minimum bias' events) • And not only that: these events happen only in a tiny fraction of the collisions, O(10⁻¹¹) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  4. HEP Experiments: Event Signatures in the Detector • And while the needle in the haystack would at least be in one piece: • these particles need to be reconstructed from decay products • decay products need to be reconstructed from detector signatures • etc. Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  5. Event Classification in High Energy Physics (HEP) • Most HEP analyses require discrimination of signal from background: • event level (Higgs searches, SUSY searches, top-mass measurement, …) • cone level (tau vs. quark-jet reconstruction, …) • track level (particle identification, …) • secondary vertex finding (b-tagging) • flavour tagging • etc. • Input information from multiple variables from various sources • kinematic variables (masses, momenta, decay angles, …) • event properties (jet/lepton multiplicity, sum of charges, …) • event shape (sphericity, Fox-Wolfram moments, …) • detector response (silicon hits, dE/dx, Cherenkov angle, shower profiles, muon hits, …) • etc. • Traditionally, a few powerful input variables were combined; • new methods allow up to 100 and more variables w/o loss of classification power, e.g. MiniBooNE, NIM A 543 (2005) 577, or D0 single top, Phys. Rev. D 78, 012005 (2008) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  6. Event Classification [figure: three scatter plots of S and B events in the (x1, x2) plane, separated by rectangular cuts, a linear boundary, and a non-linear boundary] • Suppose a data sample of two types of events with class labels Signal and Background (we restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, otherwise analyses can be staged) • how to set the decision boundary to select events of type S? • we have discriminating variables x1, x2, …: rectangular cuts? A linear boundary? A non-linear one? • How can we decide which one to use? • Once decided on a class of boundaries, how to find the "optimal" one? Low variance (stable), high bias methods ↔ high variance, small bias methods Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  7. Regression [figure: three plots of f(x) vs. x, e.g. Energy vs. Cluster Size, with constant, linear and non-linear fits] • how to estimate a "functional behaviour" from a given set of "known measurements"? • assume for example "D" variables that somehow characterize the shower in your calorimeter. Constant? Linear? Non-linear? • seems trivial? → The human brain has very good pattern recognition capabilities! • seems trivial? → what if you have many input variables? Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  8. Regression → model functional behaviour [figure: a 1-D example f(x) and a 2-D example f(x,y), with events generated according to an underlying distribution] • Assume for example "D" variables that somehow characterize the shower in your calorimeter. • Monte Carlo or test beam → data sample with measured cluster observables + known particle energy = calibration function (energy == surface in D+1 dimensional space) • better known: (linear) regression → fit a known analytic function • e.g. for the above 2-D example a reasonable function would be: f(x,y) = ax² + by² + c • what if we don't have a reasonable "model"? → need something more general: • e.g. piecewise defined splines, kernel estimators, decision trees to approximate f(x) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  9. Event Classification • Find a mapping from the D-dimensional input-observable = "feature" space to a one-dimensional output → class label y(x) • Each event, whether Signal or Background, has "D" measured variables. • most general form: y = y(x): R^D → R, with x = {x1, …, xD} ∈ R^D the input variables in "feature space" • plotting (histogramming) the resulting y(x) values: • Can you see what this would look like for the regression problem? Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  10. Classification ↔ Regression Classification: • Each event, whether Signal or Background, has "D" measured variables. • y(x): R^D → R: "test statistic" in the D-dimensional space of input variables • y(x) = const: surface defining the decision boundary, with y(B) → 0, y(S) → 1 Regression: • Each event has "D" measured variables + one function value (e.g. cluster shape variables in the ECAL + the particle's energy) • find y(x): R^D → R • y(x) = const → hyperplanes where the target function is constant • Now y(x) needs to be built such that it best approximates the target, not such that it best separates signal from background. Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  11. Event Classification • y(x): R^D → R: the mapping from the "feature space" (observables) to one output variable • PDFB(y), PDFS(y): normalised distributions of y = y(x) for background and signal events (i.e. the "functions" that describe the shape of the distributions); with y = y(x) one can also write PDFB(y(x)), PDFS(y(x)): probability densities for background and signal • now let's assume we have an unknown event from the example above for which y(x) = 0.2 → PDFB(y(x)) = 1.5 and PDFS(y(x)) = 0.45 • let fS and fB be the fractions of signal and background events in the sample; then fS·PDFS(y(x)) / (fS·PDFS(y(x)) + fB·PDFB(y(x))) is the probability of an event with measured x = {x1, …, xD} that gives y(x) to be of type signal Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
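
To make the numbers concrete, here is a quick check of the signal probability for the slide's values, assuming (this is not stated on the slide) equal sample fractions fS = fB = 0.5:

```python
# Conditional signal probability for an event with discriminator value y(x):
#   P(S | y) = fS * PDF_S(y) / (fS * PDF_S(y) + fB * PDF_B(y))
# PDF values taken from the slide; fS = fB = 0.5 is an assumed example.
fS, fB = 0.5, 0.5
pdf_S, pdf_B = 0.45, 1.5

p_signal = fS * pdf_S / (fS * pdf_S + fB * pdf_B)
print(f"P(signal | y(x) = 0.2) = {p_signal:.2f}")   # ~0.23 for these inputs
```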

  12. Neyman-Pearson Lemma [figure: ROC curve, 1−ε_backgr. vs. ε_signal; the diagonal corresponds to random guessing, curves further from it to better classification; the "limit" in the ROC curve is given by the likelihood ratio] • Neyman-Pearson: the likelihood ratio used as "selection criterion" y(x) gives, for each selection efficiency, the best possible background rejection, i.e. it maximises the area under the "Receiver Operating Characteristic" (ROC) curve • varying the cut y(x) > "cut" moves the working point (efficiency and purity) along the ROC curve • how to choose the "cut"? → need to know prior probabilities (S, B abundances) • measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(ε·p) • discovery of a signal (typically S<<B): maximum of S/√B • precision measurement: high purity (p) • trigger selection: high efficiency (ε) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
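
A minimal sketch of how a ROC curve is traced out by varying the cut on y(x), using assumed Gaussian-shaped toy distributions for the classifier output (not data from the lecture):

```python
import numpy as np

# Toy y(x) outputs (assumed Gaussian shapes, just for illustration):
rng = np.random.default_rng(1)
y_sig = rng.normal(1.0, 0.5, 10000)   # signal tends towards y ~ 1
y_bkg = rng.normal(0.0, 0.5, 10000)   # background tends towards y ~ 0

# Scan the cut y(x) > cut and record (signal efficiency, background rejection).
cuts = np.linspace(-2, 3, 200)
eff_sig = [(y_sig > c).mean() for c in cuts]          # epsilon_signal
rej_bkg = [1.0 - (y_bkg > c).mean() for c in cuts]    # 1 - epsilon_background

# Each cut value is one working point on the ROC curve; a better-separating
# test statistic y(x) gives a larger area under the curve.
for c in (0.0, 0.5, 1.0):
    i = int(np.argmin(np.abs(cuts - c)))
    print(f"cut = {c:4.1f}  eff_S = {eff_sig[i]:.2f}  1-eff_B = {rej_bkg[i]:.2f}")
```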

  13. Efficiency or Purity? [figure: decision-vs-truth table (Type-1 / Type-2 errors) and ROC curves of two classifiers y'(x) and y''(x) that cross each other; the "limit" in the ROC curve is given by the likelihood ratio] • if the discriminating function y(x) = "true likelihood ratio": the optimal working point for a specific analysis lies somewhere on the ROC curve • if y(x) ≠ "true likelihood ratio", things are different: which one of the two classifiers is the better one? • y'(x) might be better for a specific working point than y''(x) and vice versa (e.g. one has a small Type-1 error but a large Type-2 error, the other the opposite) • application of a "cost matrix" → favour one over the other in the training process; e.g. cancer diagnostics: the consequences of a Type-2 error are much more severe (death due to missing treatment) compared to a Type-1 error (unnecessary distress) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  14. MVA and Machine Learning • The previous slides were basically the idea of "Multivariate Analysis" (MVA) • reminder: what about "standard cuts"? • Finding y(x): R^D → R • given a certain type of model class y(x) • in an automatic way using "known" or "previously solved" events • i.e. learn from known "patterns" → supervised machine learning • Of course… there's no magic, we still need to: • choose the discriminating variables • choose the class of models (linear, non-linear, flexible or less flexible) • watch generalization properties • tune the "learning parameters" → bias vs. variance trade-off • consider the trade-off between statistical and systematic uncertainties Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  15. Event Classification • What about Neyman-Pearson? • Unfortunately, the true probability density functions are typically unknown: → the Neyman-Pearson lemma doesn't really help us directly → rest of the lecture: strategies to get a "good" test statistic y(x) • 2 different ways, using "training" events (e.g. Monte Carlo): • estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector influences) → likelihood ratio • e.g. D-dimensional histogram, kernel density estimators, … • find a "discrimination function" y(x) and corresponding decision boundary (i.e. hyperplane* in the "feature space": y(x) = const) that optimally separates signal from background • e.g. Linear Discriminator, Neural Networks, … * a hyperplane in the strict sense goes through the origin; here I mean "affine set" to be precise Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  16. Nearest Neighbour and Kernel Density Estimator [figure: training events in the (x1, x2) plane and a rectangular volume V of side h around the test point "x"] • estimate the probability density P(x) in D-dimensional space • the only thing at our disposal is our "training data": "events" distributed according to P(x) • say we want to know P(x) at "this" point "x" • one expects to find in a volume V around point "x" N·∫_V P(x)dx events from a dataset with N events • for a chosen rectangular volume → K events, where k(u) is called a Kernel function: rectangular Parzen window • K (from the "training data") → estimate of the average P(x) in the volume V: ∫_V P(x)dx = K/N → Kernel Density estimator of the probability density • Classification: determine PDFS(x) and PDFB(x) → likelihood ratio as classifier! Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  17. Nearest Neighbour and Kernel Density Estimator [figure: as on the previous slide] • estimate the probability density P(x) in D-dimensional space • the only thing at our disposal is our "training data": "events" distributed according to P(x) • say we want to know P(x) at "this" point "x" • one expects to find in a volume V around point "x" N·∫_V P(x)dx events from a dataset with N events • for a chosen rectangular volume → K events, where k(u) is called a Kernel function: rectangular Parzen window • K (from the "training data") → estimate of the average P(x) in the volume V: ∫_V P(x)dx = K/N • Regression: if each event with (x1, x2) carries a "function value" f(x1, x2) (e.g. the energy of the incident particle) → take the average function value of the events in the volume Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  18. Nearest Neighbour and Kernel Density Estimator [figure: as on the previous slides, now with signal and background events mixed] • estimate the probability density P(x) in D-dimensional space • the only thing at our disposal is our "training data": "events" distributed according to P(x) • say we want to know P(x) at "this" point "x" • one expects to find in a volume V around point "x" N·∫_V P(x)dx events from a dataset with N events • for the chosen rectangular volume → K events • determine K from the "training data" with signal and background mixed together • kNN: k-Nearest Neighbours → use the relative number of events of the various classes amongst the k nearest neighbours Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
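
A minimal numpy sketch of the kNN idea just described: the signal probability at a test point is the fraction of signal events among its k nearest training events. The toy data and the plain Euclidean distance metric are assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy training data in (x1, x2): signal around (1,1), background around (0,0).
X_sig = rng.normal(1.0, 0.7, size=(500, 2))
X_bkg = rng.normal(0.0, 0.7, size=(500, 2))
X = np.vstack([X_sig, X_bkg])
is_signal = np.concatenate([np.ones(500), np.zeros(500)])

def knn_signal_probability(x, k=20):
    """Fraction of signal events among the k nearest training events."""
    dist = np.linalg.norm(X - x, axis=1)      # Euclidean distance (assumption)
    nearest = np.argsort(dist)[:k]
    return is_signal[nearest].mean()

print(knn_signal_probability(np.array([1.0, 1.0])))   # close to 1
print(knn_signal_probability(np.array([0.0, 0.0])))   # close to 0
print(knn_signal_probability(np.array([0.5, 0.5])))   # somewhere in between
```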

  19. Kernel Density Estimator • Parzen window: "rectangular Kernel" → discontinuities at the window edges • smoother model for P(x) when using smooth Kernel functions, e.g. a Gaussian [figure: individual kernels and their average forming the probability density estimator] • place a "Gaussian" around each "training data point" and sum up their contributions at arbitrary points "x" → P(x) • h: "size" of the Kernel → "smoothing parameter" • there is a large variety of possible Kernel functions Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  20. Kernel Density Estimator: a general probability density estimator using kernel K • h: "size" of the Kernel → "smoothing parameter" • the chosen size of the "smoothing parameter" is more important than the kernel function • h too small: overtraining • h too large: not sensitive to features in P(x) • for Gaussian kernels, the optimum in terms of MISE (mean integrated squared error) is given by: h_xi = (4/(3N))^(1/5) · σ_xi, with σ_xi = RMS in variable xi (Christopher M. Bishop) • a drawback of Kernel density estimators: evaluation for any test event involves ALL TRAINING DATA → typically very time consuming • binary search trees (i.e. kd-trees) are typically used in kNN methods to speed up searching Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
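
A minimal 1-D sketch of a Gaussian kernel density estimator with the rule-of-thumb bandwidth quoted above; the toy training sample is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
train = rng.normal(0.0, 1.0, 1000)            # training events in one variable

# Rule-of-thumb smoothing parameter from the slide (Gaussian kernels):
N = len(train)
h = (4.0 / (3.0 * N)) ** 0.2 * train.std()

def kde(x):
    """Sum of one Gaussian kernel per training event, evaluated at points x."""
    u = (x - train[:, None]) / h
    return np.exp(-0.5 * u**2).sum(axis=0) / (N * h * np.sqrt(2 * np.pi))

grid = np.linspace(-4, 4, 9)
print(np.round(kde(grid), 3))                 # approximates the true density
# Note: evaluation loops over ALL training events -- the cost the slide warns about.
```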

  21. "Curse of Dimensionality" Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press. We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to lack of Monte Carlo events. Shortcoming of nearest-neighbour strategies: • in higher-dimensional classification/regression cases the idea of looking at "training events" in a reasonably small "vicinity" of the space point to be classified becomes difficult: consider the total phase-space volume V = 1^D (unit cube); to enclose a given fraction of the volume, a cube needs an edge length of (fraction)^(1/D) in each variable • in 10 dimensions: in order to capture 1% of the phase space → 63% of the range in each variable is necessary → that's not "local" anymore… • Therefore we still need to develop all the alternative classification/regression techniques Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
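
A quick check of the 63% number: the edge length needed to enclose a given volume fraction of the unit cube is fraction^(1/D).

```python
# Edge length per variable needed to enclose a given fraction of a unit cube:
#   edge = fraction ** (1/D)
for D in (1, 2, 5, 10):
    edge = 0.01 ** (1.0 / D)
    print(f"D = {D:2d}: capturing 1% of the volume needs {edge:.0%} of the range per variable")
# D = 10 gives ~63%, i.e. the neighbourhood is no longer 'local'.
```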

  22. Your Turn! • Not yet convinced that "Multivariate Techniques" are what you want to use? • Still believe in "cuts"? → Exercise 1 and 2 Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  23. Support Vector Machines • Neural Networks → complicated to find the "global optimum" of the weights • separation hyperplane: "wiggly" functional behaviour • remember: "piecewise" defined by a linear combination of "activation functions" at the nodes • kNN (multidimensional likelihood) → awfully slow • calculating the MVA output involves, for each test event, the evaluation of ALL training events • Projective Likelihoods don't deal with correlated variables → try to get the best of all worlds… Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  24. Support Vector Machine • Linear decision boundaries can be constructed using only measures of distances (= inner (scalar) products) • → leads to a quadratic optimisation problem • The decision boundary in the end is defined only by the training events that are closest to the boundary • Variable transformations can be powerful, i.e. moving into a higher-dimensional space may allow you to separate non-linear problems with linear decision boundaries Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  25. Support Vector Machines [figure: separable and non-separable data in the (x1, x2) plane, with the optimal hyperplane, its margin and the support vectors] • Find the hyperplane that best separates signal from background • Linear decision boundary • Best separation: maximum distance (margin) between the events closest to the hyperplane (the support vectors) • If the data are non-separable, add a misclassification cost parameter C·Σᵢξᵢ to the minimisation function • The solution of largest margin depends only on inner products of the support vectors (distances) → quadratic minimisation problem Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  26. Support Vector Machines • Find the hyperplane that best separates signal from background • Best separation: maximum distance (margin) between the events closest to the hyperplane (the support vectors) • Linear decision boundary • If the data are non-separable, add a misclassification cost parameter C·Σᵢξᵢ to the minimisation function • The solution depends only on inner products of the support vectors → quadratic minimisation problem • Non-linear cases: • Transform the variables, Ф(x), into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  27. Support Vector Machines [figure: non-separable data in (x1, x2) mapped by Ф(x1, x2) to a space (x1, x2, x3) where the data become linearly separable] • Find the hyperplane that best separates signal from background • Best separation: maximum distance (margin) between the events closest to the hyperplane (the support vectors) • Linear decision boundary • If the data are non-separable, add a misclassification cost parameter C·Σᵢξᵢ to the minimisation function • The solution depends only on inner products of the support vectors → quadratic minimisation problem • Non-linear cases: • Transform the variables into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data • The explicit transformation doesn't need to be specified. Only the "scalar product" (inner product) is needed: x·x′ → Ф(x)·Ф(x′) • certain Kernel functions can be interpreted as scalar products between transformed vectors in the higher-dimensional feature space, e.g. Gaussian, polynomial, sigmoid • Choose a Kernel and fit the hyperplane using the linear techniques developed above • The Kernel size parameter typically needs careful tuning! (Overtraining!) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
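
A hedged sketch using scikit-learn's SVC with a Gaussian (RBF) kernel, not the TMVA implementation used in the lecture, on an assumed toy dataset with circular correlations; C and gamma play the roles of the cost and kernel-size parameters discussed above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Toy circular-correlation data: signal inside a ring, background outside.
X = rng.uniform(-2, 2, size=(2000, 2))
r = np.hypot(X[:, 0], X[:, 1])
y = (r < 1.0).astype(int)                     # 1 = signal, 0 = background

# Gaussian (RBF) kernel; gamma acts as the kernel-size parameter and
# typically needs careful tuning (overtraining risk), as noted on the slide.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X[:1500], y[:1500])

print("test accuracy:", clf.score(X[1500:], y[1500:]))
print("number of support vectors:", clf.n_support_.sum())
```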

  28. Support Vector Machines SVM: the Kernel size parameter; example: Gaussian Kernels • Kernel size (σ of the Gaussian) chosen properly for the given problem • Kernel size (σ of the Gaussian) chosen too large: → not enough "flexibility" in the underlying transformation • colour code: red → large signal probability Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  29. Boosted Decision Trees • Decision Tree: sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  30. Boosted Decision Trees • Decision Tree: sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background • used for a long time in general "data-mining" applications, less known in (High Energy) Physics • similar to "simple cuts": leaf node → set of cuts, "cubes" in phase space → signal or background • independent of monotonous variable transformations, immune against outliers • weak variables are ignored (and don't (much) deteriorate performance) • Disadvantage → very sensitive to statistical fluctuations in the training data • Boosted Decision Trees (1996): combine a whole forest of decision trees, derived from the same sample, e.g. using different event weights • overcomes the stability problem → became popular in HEP since MiniBooNE, B. Roe et al., NIM A 543 (2005) 577 Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  31. Growing a Decision Tree • start with the training sample at the root node • split the training sample at the node into two, using a cut in the one variable that gives the best separation gain • continue splitting until: • minimal #events per node • maximum number of nodes • maximum depth specified • (a split doesn't give a minimum separation gain) → not a good idea → see "chessboard" • leaf nodes classify S, B according to the majority of events, or give an S/B probability • Why no multiple branches (splits) per node? • Fragments the data too quickly; also: multiple splits per node = series of binary node splits • What about multivariate splits? • time consuming • other methods are more adapted for such correlations Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  32. Growing a Decision Tree • start with the training sample at the root node • split the training sample at the node into two, using a cut in the variable that gives the best separation gain (see the sketch below) • continue splitting until: • minimal #events per node • maximum number of nodes • maximum depth specified • a split doesn't give a minimum separation gain • or: split "until the end" and prune afterwards • leaf nodes classify S, B according to the majority of events, or give an S/B probability • Why no multiple branches (splits) per node? • Fragments the data too quickly; also: multiple splits per node = series of binary node splits • What about multivariate splits? • time consuming, other methods are more adapted for such correlations • we'll see later that for "boosted" DTs weak (dull) classifiers are often better anyway Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
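
A minimal sketch of the node-splitting step: scan cut values in every variable and keep the split with the best separation gain. The Gini index is used here as the separation measure and the toy Gaussian data are an assumption; the lecture does not fix a particular criterion at this point.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of 0/1 class labels."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Scan all variables and cut values, return the split with the best gain."""
    n, n_var = X.shape
    parent = gini(y)
    best = (None, None, 0.0)                       # (variable, cut, gain)
    for ivar in range(n_var):
        for cut in np.unique(X[:, ivar]):
            left = y[X[:, ivar] <= cut]
            right = y[X[:, ivar] > cut]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if gain > best[2]:
                best = (ivar, cut, gain)
    return best

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(1, 1, (200, 2)), rng.normal(-1, 1, (200, 2))])
y = np.concatenate([np.ones(200), np.zeros(200)])
print(best_split(X, y))    # expected: a cut near 0 in one of the two variables
```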

  33. Decision Tree Pruning • "Real life" example of an optimally pruned Decision Tree [figure: decision tree before and after pruning] • Pruning algorithms are developed and applied on individual trees • optimally pruned single trees are not necessarily optimal in a forest! • actually they tend to be TOO big when boosted, no matter how hard you prune! • Boosted Decision Trees: use very small trees grown with "early stopping" Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  34. Boosting Training Sample → classifier C(0)(x) → re-weight → Weighted Sample → classifier C(1)(x) → re-weight → Weighted Sample → classifier C(2)(x) → re-weight → Weighted Sample → classifier C(3)(x) → re-weight → … → classifier C(m)(x) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  35. Adaptive Boosting (AdaBoost) Training Sample → classifier C(0)(x) → re-weight → Weighted Sample → classifier C(1)(x) → re-weight → Weighted Sample → classifier C(2)(x) → re-weight → … → classifier C(m)(x) • AdaBoost re-weights events misclassified by the previous classifier by the factor (1−err)/err, where err is that classifier's weighted error rate • AdaBoost also weights the classifiers themselves using the error rate of the individual classifier, in the standard AdaBoost scheme by α = ln((1−err)/err) (see the sketch below) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
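
A minimal from-scratch sketch of this re-weighting loop, assuming scikit-learn decision stumps as the weak classifier and the standard discrete-AdaBoost weights; the TMVA implementation differs in details.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_trees=50):
    """y in {-1, +1}; returns a list of (alpha, stump) pairs."""
    w = np.full(len(y), 1.0 / len(y))                # event weights
    ensemble = []
    for _ in range(n_trees):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()           # weighted error rate
        if err >= 0.5 or err == 0.0:
            break
        alpha = np.log((1.0 - err) / err)            # classifier weight
        w[pred != y] *= (1.0 - err) / err            # boost misclassified events
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    """Weighted sum of the individual classifier responses."""
    score = sum(alpha * stump.predict(X) for alpha, stump in ensemble)
    return np.sign(score)

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.7, 1, (500, 2)), rng.normal(-0.7, 1, (500, 2))])
y = np.concatenate([np.ones(500), -np.ones(500)])
model = adaboost_train(X, y)
print("training accuracy:", (adaboost_predict(model, X) == y).mean())
```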

  36. AdaBoost: A simple demonstration • The example (somewhat artificial… but nice for demonstration): • data file with three "bumps" • weak classifier: one single simple "cut" ↔ decision tree stump, classifying the two sides var(i) > x and var(i) <= x as S and B • Two reasonable cuts: a) Var0 > 0.5 → εsignal = 66%, εbkg ≈ 0%, misclassified events in total 16.5%, or b) Var0 < −0.5 → εsignal = 33%, εbkg ≈ 0%, misclassified events in total 33% • the training of a single decision tree stump will find "cut a)" Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  37. AdaBoost: A simple demonstration • The first "tree", choosing cut a), will give an error fraction err = 0.165 • before building the next "tree": weight the wrongly classified training events by (1−err)/err ≈ 5 → the next "tree" effectively sees a re-weighted data sample and hence will choose "cut b)": Var0 < −0.5 • The combined classifier, Tree1 + Tree2: the (weighted) average of the responses of both trees to a test event is able to separate signal from background as well as one would expect from the most powerful classifier Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  38. Boosting at Work • Another plot showing how important it is to choose good “tuning parameters” Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  39. Generalised Classifier Boosting Training Sample → classifier C(0)(x) → re-weight → Weighted Sample → classifier C(1)(x) → re-weight → Weighted Sample → classifier C(2)(x) → re-weight → … → classifier C(m)(x) • Principle (just as in BDT): multiple training cycles, each time wrongly classified events get a higher event weight • The response is the weighted sum of each classifier's response • Boosting will be interesting especially for methods like Cuts, MLP, and SVM Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  40. AdaBoost On a linear Classifier (e.g. Fisher) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  41. AdaBoost On a linear Classifier (e.g. Fisher) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  42. Bagging and Randomised Trees Other classifier combinations: • Bagging: • combine trees grown from "bootstrap" samples (i.e. re-sample the training data with replacement) • Randomised Trees (Random Forest: trademark L. Breiman, A. Cutler): • combine trees grown with: • a random bootstrap sample (or subset) of the training data only • considering at each node only a random subset of variables for the split • NO pruning! • These combined classifiers work surprisingly well, are very stable, and are almost perfect "out of the box" classifiers (see the sketch below) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
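
A minimal sketch using scikit-learn's RandomForestClassifier on assumed toy data: bootstrap=True re-samples the training data with replacement, max_features picks a random subset of variables per split, and the trees are left unpruned, matching the recipe above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
# Toy data: two overlapping Gaussian classes in four variables.
X = np.vstack([rng.normal(0.5, 1, (1000, 4)), rng.normal(-0.5, 1, (1000, 4))])
y = np.concatenate([np.ones(1000), np.zeros(1000)])
idx = rng.permutation(len(y))                 # shuffle before splitting
X, y = X[idx], y[idx]

# Bootstrap samples + a random subset of variables per split, no pruning.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X[:1500], y[:1500])
print("test accuracy:", forest.score(X[1500:], y[1500:]))
```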

  43. Regression Trees • Rather than calling leaves Signal or Background, one can also give them "values" (i.e. the "mean value" of all target values attributed to the training events that end up in the node) → Regression Tree • Node splitting: separation gain → gain in variance (RMS) of the target function • Boosting: error fraction → "distance" measure from the mean: linear, square or exponential • Use this to model ANY non-analytic function of which you have "training data", e.g. the energy in your calorimeter as a function of shower parameters, using e.g. training data from a test beam (see the sketch below) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
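
A hedged sketch of boosted regression trees on an assumed toy calibration problem ("cluster observables → particle energy"), using scikit-learn's GradientBoostingRegressor; this is gradient boosting rather than the AdaBoost-style boosting described above, but the regression-tree ingredients are the same: splits reduce the target variance and leaves return mean target values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(8)
# Toy 'test beam': two cluster observables and a known particle energy.
cluster_size = rng.uniform(1, 10, 5000)
cluster_E = rng.uniform(5, 100, 5000)
true_energy = 1.1 * cluster_E + 0.3 * cluster_size + rng.normal(0, 1, 5000)

X = np.column_stack([cluster_size, cluster_E])
reg = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.1)
reg.fit(X[:4000], true_energy[:4000])

# The trained forest acts as the calibration function f(cluster observables).
pred = reg.predict(X[4000:])
print("calibration residual RMS:", np.std(pred - true_energy[4000:]))
```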

  44. Regression Trees • Leaf nodes: one output value • Regression trees seem to need, despite boosting, larger trees Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  45. Learning with Rule Ensembles Friedman-Popescu, Tech. Rep., Stat. Dept., Stanford U., 2003 • Following the RuleFit approach by Friedman-Popescu • The model is a linear combination of rules, where a rule is a sequence of cuts: RuleFit classifier = sum of rules (cut sequences: rm = 1 if all cuts are satisfied, = 0 otherwise) + a linear Fisher term in the normalised discriminating event variables • The problem to solve is: • Create the rule ensemble: use a forest of decision trees • Fit the coefficients am, bk: gradient directed regularization minimising the risk (Friedman et al.) • Pruning removes topologically equal rules (same variables in the cut sequence) • Aside on the word "rule": one of the elementary cellular automaton rules (Wolfram 1983, 2002) specifies the next colour of a cell, depending on its colour and its immediate neighbours; its rule outcomes are encoded in the binary representation 30 = 00011110₂. (See the sketch below for the RuleFit idea.) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
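
A deliberately simplified, hypothetical sketch of the RuleFit idea: each rule is an indicator of a cut sequence and the model is a sparse linear combination of rule outputs plus linear terms. Here the rules are written by hand instead of being harvested from a forest of decision trees, and an L1-penalised logistic fit stands in for the gradient-directed regularisation of Friedman-Popescu.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0.8, 1, (1000, 2)), rng.normal(-0.8, 1, (1000, 2))])
y = np.concatenate([np.ones(1000), np.zeros(1000)])

# Hand-written example rules: r_m(x) = 1 if all cuts are satisfied, 0 otherwise.
# (RuleFit would harvest these cut sequences from a forest of decision trees.)
rules = [
    lambda x: (x[:, 0] > 0.0) & (x[:, 1] > 0.0),
    lambda x: (x[:, 0] > 0.5),
    lambda x: (x[:, 0] < -0.5) & (x[:, 1] < 0.5),
]
rule_features = np.column_stack([r(X).astype(float) for r in rules])

# Linear (Fisher-like) terms in the raw variables + rule terms, fitted with an
# L1 penalty so that unhelpful rules end up with zero coefficients.
features = np.hstack([X, rule_features])
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(features, y)
print("coefficients (x1, x2, rules...):", np.round(model.coef_[0], 2))
```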

  46. See some Classifiers at Work Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  47. Toy Examples: Linear-, Cross-, Circular Correlations • Illustrate the behaviour of linear and nonlinear classifiers Linear correlations (same for signal and background) Linear correlations (opposite for signal and background) Circular correlations (same for signal and background) Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  48. Weight Variables by Classifier Output • How well do the classifiers resolve the various correlation patterns? • Correlation patterns shown: linear correlations (same for signal and background); cross-linear correlations (opposite for signal and background); circular correlations (same for signal and background) • Classifiers compared: Multi-dim. Likelihood (kNN), Likelihood, Likelihood-D, Fisher, BDT, MLP Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  49. Final Classifier Performance • Background rejection versus signal efficiency curve: Linear Example Circular Example Cross Example Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis

  50. Stability with Respect to Irrelevant Variables • Toy example with 2 discriminating and 4 non-discriminating variables Terascale Statistics School, Mainz, April 2011 ― Multivariate Data Analysis
