
Linear Discriminant Analysis (LDA) for selection cuts






Presentation Transcript


  1. Linear Discriminant Analysis (LDA) for selection cuts. Julien Faivre, ALICE week, Utrecht, 14 June 2005.
  • Motivations
  • Why LDA?
  • How does it work?
  • Concrete examples
  • Conclusions and S. Antonio's present

  2. Initial motivations: some particles are critical at all p_T and in all collision systems.
  • Observables, and where statistics is needed:
    - Production yields: p-p collisions, low p_T
    - Spectra slope: all p_T
    - ⟨p_T⟩, azimuthal anisotropy (v2): all p_T
    - Scaled spectra (R_CP, R_AA), v2: peripheral collisions, p-p, high p_T
  • Examples of initial S/N ratios: 10^-10 and 10^-11 at RHIC, D0 at LHC = 10^-8
  • We need more statistics, and fast and easy selection optimization: apply a pattern classification method.

  3. Basic strategy: the « classical cuts »
  [Figure: signal and background populations in the (Variable 1, Variable 2) plane, with rectangular cut boundaries.]
  • We want to extract signal out of background.
  • « Classical cuts »: example with n = 2 variables (actual analyses use 5 to 30+).
  • For a good efficiency on signal (recognition), pollution by background is high (false alarms).
  • A compromise has to be found between good efficiency and high S/N ratio.
  • Tuning the cuts is long and difficult.

  4. Which pattern classification method?
  • Candidates: Bayesian decision theory; Markov fields, hidden Markov models; nearest neighbours; Parzen windows; Linear Discriminant Analysis; neural networks; unsupervised learning methods.
  • Linear Discriminant Analysis (LDA): linear; simple training; simple tuning, hence fast tuning; linear cut shape, but multi-cut makes it OK; connected cut region only.
  • Neural networks: non-linear; complex training, with a risk of overtraining; layers and neurons to choose, hence long tuning; non-linear cut shape; non-connected cut regions possible.
  • The cut shape is the only advantage of neural nets, so choose LDA.
  • Not an absolute answer; just tried it and it turns out to work fine.

  5. LDA mechanism:
  [Figure: signal and background in the (Variable 1, Variable 2) plane, with the best LDA axis drawn.]
  • Simplest idea: cut along a linear combination of the n observables, i.e. along the LDA axis.
  • Cut on the scalar product of the candidate's observable vector with the LDA axis.
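  A minimal sketch (illustrative names, not the ALICE class described later) of what « cut on the scalar product » means in code:

    #include <numeric>
    #include <vector>

    // Illustrative only: project a candidate's observables onto an LDA axis
    // and keep the candidate if the projection exceeds the tuned cut value.
    bool passLdaCut(const std::vector<double>& observables,
                    const std::vector<double>& ldaAxis,
                    double cutValue)
    {
      // Scalar product u . x of the LDA direction with the candidate.
      double projection = std::inner_product(observables.begin(), observables.end(),
                                             ldaAxis.begin(), 0.0);
      return projection > cutValue;
    }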

  6. LDA criterion: Fisher
  • We need a criterion to find the LDA direction; the direction found will depend on the criterion chosen.
  • Fisher criterion (widely used): the projection of the points on the LDA direction gives the distributions of classes 1 and 2 along this direction.
  • μ_i = mean of distribution i, σ_i = width of distribution i.
  • μ_1 and μ_2 have to be as far as possible from each other, and σ_1 and σ_2 have to be as small as possible, i.e. maximise (μ_2 − μ_1)² / (σ_1² + σ_2²).
  [Figure: the two projected distributions along the LDA axis, with their means μ_1, μ_2 and widths σ_1, σ_2.]

  7. Improvements needed:
  • Fisher-LDA doesn't work for us: too much background, too little signal, and the background covers all the area where the signal lies.
  • Fisher-LDA « considers » the distributions as Gaussian (mean and width), so it is insensitive to local parts of the distributions.
  [Figure: projected distributions (log scale), one case where Fisher works well (not us) and one where it does not (us).]
  • Solutions: apply several successive LDA cuts; change the criterion from Fisher to an « optimized » one.

  8. Multi-cut LDA & optimized criterion:
  [Figure: signal and background in the (Variable 1, Variable 2) plane, with the 1st and 2nd best LDA axes.]
  • Criterion « optimized I »: given an efficiency of the k-th LDA cut on the signal, maximise the number of background candidates cut.
  • Fisher is global, hence irrelevant for multi-cut LDA: we have to find a criterion that depends locally on the distributions, not globally.
  • More cuts = better description of the « signal/background boundary ».
  • BUT: with many cuts, it tends to describe the boundary too locally.
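  A minimal sketch (illustrative, not the actual tool) of how several successive LDA cuts combine: a candidate is kept only if its projection on every LDA axis passes the corresponding cut value.

    #include <numeric>
    #include <vector>

    // One LDA cut = a direction in observable space plus a cut value on the projection.
    struct LdaCut {
      std::vector<double> axis;   // linear-combination coefficients (the LDA direction)
      double cutValue;            // tuned threshold on the scalar product
    };

    // Multi-cut LDA: the candidate must pass every successive cut.
    bool passAllLdaCuts(const std::vector<double>& observables,
                        const std::vector<LdaCut>& cuts)
    {
      for (const LdaCut& cut : cuts) {
        double projection = std::inner_product(observables.begin(), observables.end(),
                                               cut.axis.begin(), 0.0);
        if (projection <= cut.cutValue) return false;  // rejected by this cut
      }
      return true;  // passed all LDA cuts
    }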

  9. Non-linear approaches: caution with the description of the boundary, a too-local description means bad performance.
  • Straight line: over the training sample, mmmh…; over the test sample, not so bad.
  • Curve: over the training sample, still not satisfied; over the test sample, very good.
  • Almost candidate-per-candidate: over the training sample, happy; over the test sample, very bad.
  • Case of LDA: the more cuts, the better, and the limit is known (determined from the number of background candidates cut), so everything is under control!

  10. LDA cut-tuning:
  [Figure: relative uncertainty vs LDA tightening for the 28th, 29th, 30th and 31st LDA directions; the best LDA cut value gives the minimal relative uncertainty with LDA, showing the gain over the classical cuts.]

  11. LDA for STAR's hyperons:
  • Jeff Speltz's 62 GeV … K (topological) analysis (SQM 2004): going from the classical cuts to LDA gives +63 % signal.

  12. LDA in ALICE:
  • Ludovic Gaudichet: strange particles (topologically), K0s, Λ, then Ξ and Ω:
    - Neural nets don't even reach the optimized classical cuts.
    - Cascaded neural nets do, but don't do better.
    - LDA seems to do better (ongoing study).
  • J.F.: charmed meson D0 in Kπ (topologically). Very preliminary results on the p_T-integrated raw yield (Pb-Pb central); the « current classical cuts » are those of Andrea Dainese's thesis and the PPR:
    - Statistical relative uncertainty (ΔS/S) on PID-filtered candidates: current classical = 4.4 %, LDA = 2.1 %, i.e. 2.1 times better.
    - Statistical relative uncertainty on "unfiltered" candidates (just (…,…)'s out): current classical = 4.3 %, LDA = 1.6 %, i.e. 2.7 times better.
    - Looking at the LDA distributions, a new classical set was found: it does 1.6 times better than the current classical cuts.

  13. LDA in ALICE (comparison), VERY PRELIMINARY!!
  [Figure: comparison of the optimized classical cuts and LDA.]

  14. LDA in ALICE (performance): PID-filtered D0's, with quite tight classical pre-cuts applied.
  [Figures: significance vs signal, and purity-efficiency plot, showing the LDA cuts, the optimal LDA cut (tuned with respect to the relative uncertainty), the current classical cuts and the new classical cuts.]

  15. LDA in ALICE (tuning):
  • Tuning = search for the minimum of a valley-shaped 1-dimensional function.
  • Two hypotheses of background estimation.
  [Figure: relative uncertainty vs efficiency (with zoom), showing LDA, the current classical cuts, the new classical cuts and the optimal LDA cut.]

  16. Conclusion:
  • The method we have now: linear; easy implementation (like classical cuts) and the class is ready! (see next slide); better usage of the N-dimensional information; multi-cut, so not as limited as Fisher; provides a transformation from R^n to R, hence trivial optimization of the cuts; we know when the limit (too local) is reached.
  • Performance: better than classical cuts.
  • Cut-tuning: obvious (with classical cuts it is a nightmare), which is convenient for other centrality classes, collision energies, colliding systems and p_T ranges.
  • Also provides systematics: LDA vs classical, changing the LDA cut value, LDA set 1 vs LDA set 2.
  • Cherry on the cake: optimal usage of the ITS for particles with long cτ (…, K, …, …): 6 layers & 3 daughter tracks, i.e. 7³ = 343 hit combinations / sets of classical cuts!! Add 3 variables to the LDA (number of hits of each daughter) and the ITS cut-tuning is automatic.
  • The strategy could be: 1. tune the LDA, 2. derive classical cuts from the LDA.

  17. S. Antonio's present: available tool.
  • A C++ class which performs LDA is available.
  • It calculates the LDA cuts with the chosen method, parameters and variable rescaling.
  • It has a function Pass to check whether a candidate passes the calculated cuts.
  • Plug-and-play: whatever the analysis, no change in the code is required.
  • « Universal » input format (tables).
  • Ready-to-use: options have default values, so no need to worry for a first look.
  • The code is documented (examples included), and full documentation about LDA and the optimization is available.
  • An example of filtering code which makes plots like in the previous slide is available.
  • Not yet on the web; send an e-mail (julien.faivre@pd.infn.it).
  • Statistics needed for training: with the optimized criterion, it looks like 2000 S and N after cuts are enough.

  18. BACKUP

  19. Rotating (backup slide, Padova, 22 February 2005):
  [Figure: effect of rotating on the invariant-mass spectrum (GeV/c²): a fake Xi destroyed, another fake Xi created, a real Xi, or nothing.]
  • Destroys signal.
  • Keeps background.
  • Destroys some correlations, which has to be studied.

  20. Pattern classification:
  • Learning: p classes of objects of the same type; n observables, defined for all the classes; p samples of N_k objects for each class k.
  • Here, 2 classes: signal (real Xis) and background (combinatorial). 1 type: Xi vertex. Observables: dca's, decay length, number of hits, etc.
  • Background sample: real data. Signal sample: simulation (embedding).
  • Goal: classify a new object into one of the defined classes. Usage: observed Xi vertex = signal or background. (A minimal data-layout sketch follows below.)
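  A minimal sketch (purely illustrative; the actual « universal » table format is the one distributed with the class) of how this learning input can be laid out in C++:

    #include <string>
    #include <vector>

    // One candidate = its n observables (dca's, decay length, number of hits, ...).
    using Candidate = std::vector<double>;

    // One class (e.g. signal from embedding, background from real data) = a named
    // sample of N_k candidates, all sharing the same n observables.
    struct TrainingClass {
      std::string name;               // "signal" or "background"
      std::vector<Candidate> sample;  // the N_k candidates of this class
    };

    // Learning input: p classes of objects of the same type (here p = 2).
    using TrainingSet = std::vector<TrainingClass>;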

  21. Fisher criterion:
  • Fisher criterion: maximisation of J(u) = (μ_1 − μ_2)² / (σ_1² + σ_2²).
  • No need for a maximisation algorithm: the LDA direction u is directly given in closed form from the within-class scatter matrix and the mean vectors (written out below).
  • All done with simple matrix operations; calculating the axis is way faster than reading the data.
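  For reference, the Fisher-criterion expressions this slide refers to, written out (this is the textbook Fisher-LDA result, consistent with the definitions on the two following slides):

    \[
      J(\mathbf{u}) = \frac{(\mu_1-\mu_2)^2}{\sigma_1^2+\sigma_2^2}
                    = \frac{\mathbf{u}^{T} S_B\, \mathbf{u}}{\mathbf{u}^{T} S_W\, \mathbf{u}},
      \qquad
      S_W = S_1 + S_2,
      \qquad
      S_B = (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^{T},
    \]
    \[
      \text{maximised by} \qquad \mathbf{u} \;\propto\; S_W^{-1}\,(\mathbf{m}_1-\mathbf{m}_2),
    \]
    where \(\mathbf{m}_i\) are the class mean vectors and \(S_i\) the within-class scatter matrices.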

  22. Mathematically speaking (I):
  • Fisher criterion: maximisation of J(u) = (μ_1 − μ_2)² / (σ_1² + σ_2²).
  • Let u be the vector of the LDA axis, and x_k the vector of the k-th candidate of the training (learning) sample.
  • Mean for class i (vector): m_i = (1/N_i) Σ_{k in class i} x_k.
  • Mean of the projection on u for class i: μ_i = u^T m_i.
  • So: (μ_1 − μ_2)² = u^T (m_1 − m_2)(m_1 − m_2)^T u.

  23. Mathematically speaking (II):
  • Now the widths: σ_i² = Σ_{k in class i} (u^T x_k − μ_i)².
  • Let's define the within-class scatter matrices S_i = Σ_{k in class i} (x_k − m_i)(x_k − m_i)^T, and S_W = S_1 + S_2.
  • So: σ_1² + σ_2² = u^T S_W u, and maximising J(u) gives u ∝ S_W⁻¹ (m_1 − m_2).
  • In-one-shot booking of the matrix: accumulate Σ x_k and Σ x_k x_k^T in a single pass over the sample, then S_i = Σ x_k x_k^T − N_i m_i m_i^T (see the sketch below).
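  A minimal sketch (illustrative, not the author's class) of this one-pass « booking » of the scatter matrix: accumulate Σx and Σxxᵀ while reading the sample, then form the mean vector and the scatter matrix at the end.

    #include <vector>

    // Accumulate sum(x), sum(x x^T) and N in one pass over a class sample,
    // then build the mean vector m and the scatter matrix S = sum(x x^T) - N m m^T.
    struct ScatterAccumulator {
      int n;                       // number of observables
      long nCandidates = 0;
      std::vector<double> sumX;    // size n
      std::vector<double> sumXXt;  // size n*n, row-major

      explicit ScatterAccumulator(int nObs)
        : n(nObs), sumX(nObs, 0.0), sumXXt(nObs * nObs, 0.0) {}

      void AddCandidate(const std::vector<double>& x) {
        ++nCandidates;
        for (int i = 0; i < n; ++i) {
          sumX[i] += x[i];
          for (int j = 0; j < n; ++j) sumXXt[i * n + j] += x[i] * x[j];
        }
      }

      std::vector<double> Mean() const {
        std::vector<double> m(n);
        for (int i = 0; i < n; ++i) m[i] = sumX[i] / nCandidates;
        return m;
      }

      std::vector<double> Scatter() const {  // S_i, row-major n x n
        std::vector<double> m = Mean(), S(n * n);
        for (int i = 0; i < n; ++i)
          for (int j = 0; j < n; ++j)
            S[i * n + j] = sumXXt[i * n + j] - nCandidates * m[i] * m[j];
        return S;
      }
    };
    // S_W = S_signal + S_background; the Fisher axis is u ~ S_W^{-1} (m_1 - m_2).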

  24. Algorithm for the optimized criterion:
  • First find the Fisher LDA direction, as a starting point.
  • Define a « performance function »: vector u → performance figure.
  • Maximize the « performance figure » by varying the direction of u.
  • Several methods for the maximisation: easy and fast, one coordinate at a time; fancy and powerful, a genetic algorithm.

  25. One coordinate at a time:
  • Change the direction of u by steps of a constant angle: 8° to start, then 4°, 2°, 1°, eventually 0.5°.
  • Change the 1st coordinate of u until the performance figure reaches a maximum.
  • Change all the other coordinates like this, one by one.
  • Then try again with the 1st coordinate, and with the other ones.
  • When there is no improvement anymore: divide the angle step by 2 and do the whole thing again.
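  A minimal sketch (illustrative; the performance function is supplied by the user, and the way a « rotation in one coordinate » is implemented here is an assumption) of this coordinate-at-a-time maximisation:

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;

    static double norm(const Vec& u) {
      double s = 0.0; for (double x : u) s += x * x; return std::sqrt(s);
    }
    static void normalise(Vec& u) {
      double n = norm(u); for (double& x : u) x /= n;
    }

    // Coordinate-at-a-time maximisation of a performance function phi(u):
    // nudge one coordinate at a time by a fixed angular step, keep changes that
    // improve phi, and halve the step when a full sweep brings no improvement.
    Vec optimiseDirection(Vec u, const std::function<double(const Vec&)>& phi) {
      const double kPi = 3.14159265358979323846;
      const double degree = kPi / 180.0;
      const double signs[2] = {+1.0, -1.0};
      normalise(u);
      double best = phi(u);
      // Angle steps: 8 degrees, then 4, 2, 1, eventually 0.5.
      for (double step = 8.0 * degree; step >= 0.5 * degree; step /= 2.0) {
        bool improved = true;
        while (improved) {                       // sweep until no coordinate improves
          improved = false;
          for (std::size_t i = 0; i < u.size(); ++i) {
            for (double sign : signs) {
              Vec trial = u;
              trial[i] += sign * std::tan(step); // |u| = 1, so nudge by tan(step)
              normalise(trial);
              double p = phi(trial);
              while (p > best) {                 // keep going while it improves
                u = trial; best = p; improved = true;
                trial[i] += sign * std::tan(step);
                normalise(trial);
                p = phi(trial);
              }
            }
          }
        }
      }
      return u;
    }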

  26. Genetic algorithm (I):
  • Problem with the « one-coordinate-at-a-time » algorithm: it is likely to fall into a local maximum different from the absolute maximum.
  • So: use a genetic algorithm!
  • Like genetic evolution: a pool of chromosomes; generations (evolution, reproduction); Darwinist selection; mutations.

  27. Genetic algorithm (II):
  • Start with p chromosomes (p vectors u_k) made randomly from the Fisher direction.
  • Calculate the performance figure of each u_k.
  • Order the p vectors by decreasing value of the performance figure.
  • Keep only the first m vectors (Darwinist selection).
  • Have them make children: build a new set of p chromosomes from the m selected ones and combinations of them.
  • In the children chromosomes, introduce some mutations (randomly modify a coordinate).
  • The new pool is ready: go back to the evaluation step.
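  A minimal sketch of this genetic loop (illustrative; population size, number of kept parents, mutation width and the use of pairwise averages as « combinations » are assumptions, not the author's settings):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <random>
    #include <vector>

    using Vec = std::vector<double>;

    static void normalise(Vec& u) {
      double s = 0.0; for (double x : u) s += x * x;
      s = std::sqrt(s); for (double& x : u) x /= s;
    }

    // Genetic maximisation of phi(u), seeded from the Fisher direction.
    Vec geneticOptimise(const Vec& fisherDirection,
                        const std::function<double(const Vec&)>& phi,
                        int populationSize = 40, int nKept = 10,
                        int nGenerations = 100)
    {
      std::mt19937 rng(12345);
      std::normal_distribution<double> gauss(0.0, 0.05);
      std::uniform_int_distribution<int> pickParent(0, nKept - 1);
      std::uniform_int_distribution<int> pickCoord(0, (int)fisherDirection.size() - 1);

      // Initial pool: random perturbations of the Fisher direction.
      std::vector<Vec> pool(populationSize, fisherDirection);
      for (Vec& u : pool) {
        for (double& x : u) x += gauss(rng);
        normalise(u);
      }

      for (int gen = 0; gen < nGenerations; ++gen) {
        // Darwinist selection: order by decreasing performance figure, keep the best m.
        std::sort(pool.begin(), pool.end(),
                  [&](const Vec& a, const Vec& b) { return phi(a) > phi(b); });
        std::vector<Vec> next(pool.begin(), pool.begin() + nKept);
        // Children: combinations (here, averages) of two kept parents, plus a mutation.
        while ((int)next.size() < populationSize) {
          const Vec& p1 = pool[pickParent(rng)];
          const Vec& p2 = pool[pickParent(rng)];
          Vec child(p1.size());
          for (std::size_t i = 0; i < child.size(); ++i) child[i] = 0.5 * (p1[i] + p2[i]);
          child[pickCoord(rng)] += gauss(rng);   // mutation: randomly modify a coordinate
          normalise(child);
          next.push_back(child);
        }
        pool = std::move(next);
      }
      std::sort(pool.begin(), pool.end(),
                [&](const Vec& a, const Vec& b) { return phi(a) > phi(b); });
      return pool.front();  // best direction found
    }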

  28. Statistics needed:
  • Fisher-LDA: the samples need to have more than 10 000 candidates each; this does not seem to depend on the number of observables (tested with n = 10 and n = 22).
  • Optimized criteria: need much more. Guess: at minimum 50 000 candidates per sample, maybe up to 500 000? Depends on the number of observables.

  29. Statistics needed (II):
  • Optimised criterion: we can't look at the oscillations to know whether the statistics are sufficient!
  [Figure: optimised criterion at step 1 and at step 2, in the (Variable 1, Variable 2) plane.]

  30. Statistics needed (III), solutions:
  • Try all the combinations of k out of n observables (never used). Problem: the number is huge (2^n − 1): n = 5 gives 31 combinations, n = 10 gives 1023, n = 20 gives 1 048 575!
  • Use underoptimal LDA (widely used): see next slide.
  • Use PCA, Principal Components Analysis (widely used): see the slide after next.

  31. Part V – Various things:
  • The projection of the LDA direction from the n-dimensional space onto a k-dimensional sub-space is not the LDA direction of the projection of the samples from the n-dimensional space onto that k-dimensional sub-space.
  • The more observables, the better. Mathematically: adding an observable can't lower the discriminancy. Practically: it can, because of the limited statistics available for training.
  • LDA (multi-cut) can't do worse than cutting on each observable, because cutting on each observable is a particular case of multi-cut LDA! If it does worse: the criterion isn't good, or the efficiency of the cuts is not well chosen.

  32. Underoptimal LDA:
  • Calculate the discriminancy of each of the n observables, and choose the observable that has the highest discriminancy.
  • Calculate the discriminancy of each pair of observables containing the previously found one, and choose the most discriminating pair.
  • Etc. with triplets, up to the desired number of directions.
  • Problem: the most discriminating pair containing the most discriminating direction is not necessarily the actual most discriminating pair (a greedy-selection sketch follows below).
  [Figure: the most discriminating direction, the most discriminating pair containing it, and the actual most discriminating pair.]
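  A minimal sketch (illustrative; the discriminancy function is a placeholder to be supplied, e.g. the Fisher figure of merit of the subset) of this greedy forward selection:

    #include <functional>
    #include <vector>

    // Greedy forward selection of observables for underoptimal LDA:
    // start from the single most discriminating observable, then repeatedly add
    // the observable giving the most discriminating subset containing the ones
    // already selected. Cheaper than the 2^n - 1 exhaustive search, but it can
    // miss the actual best subset (see the figure above).
    std::vector<int> greedySelect(int nObservables, int nWanted,
                                  const std::function<double(const std::vector<int>&)>&
                                      discriminancy)
    {
      std::vector<int> selected;
      std::vector<bool> used(nObservables, false);
      while ((int)selected.size() < nWanted && (int)selected.size() < nObservables) {
        int bestObs = -1;
        double bestValue = -1e300;
        for (int i = 0; i < nObservables; ++i) {
          if (used[i]) continue;
          std::vector<int> trial = selected;
          trial.push_back(i);
          double value = discriminancy(trial);
          if (value > bestValue) { bestValue = value; bestObs = i; }
        }
        selected.push_back(bestObs);
        used[bestObs] = true;
      }
      return selected;
    }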

  33. PCA – Principal Components Analysis (I):
  • Tool used in data reduction (e.g. image compression); see the ROOT class description of TPrincipal.
  • Finds along which directions (linear combinations of observables) most of the information lies.
  [Figure: primary component axis x1 and secondary component axis x2 in the (Variable 1, Variable 2) plane; the main information of a point is x1, so dropping x2 isn't important.]

  34. PCA – Principal Components Analysis (II):
  • All is matrix-based: easy. The « informativeness » of a direction is given by the normalised eigenvalues.
  • Use with LDA, prior to finding the axis: the observables form a basis B1 of the n-dimensional space; apply PCA over the signal+background samples (together) to get a basis B2 of the n-dimensional space; choose the k most informative directions, C2, a subset of B2; calculate the LDA axis in the space defined by C2.
  • If several LDA directions? No problem: apply PCA but keep all the information of the candidates, just don't use it all for the LDA; the PCA will give a different sub-space for each step.
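  A minimal sketch (illustrative; the number of kept components is arbitrary) of using ROOT's TPrincipal, as suggested above, to express candidates in the principal-component basis before the LDA step:

    #include <vector>
    #include "TPrincipal.h"
    #include "TVectorD.h"

    // Run PCA over the combined signal+background sample, then express each
    // candidate in the principal-component basis; the LDA axis can then be
    // searched in the first nKept (most informative) components only.
    std::vector<std::vector<double>> toPrincipalComponents(
        const std::vector<std::vector<double>>& candidates, int nObservables, int nKept)
    {
      TPrincipal pca(nObservables, "ND");          // "N": normalise, "D": store data
      for (const auto& x : candidates)
        pca.AddRow(x.data());
      pca.MakePrincipals();                        // diagonalise the covariance matrix

      // Normalised eigenvalues give the « informativeness » of each direction;
      // inspect them to decide how many components to keep.
      const TVectorD* eigenValues = pca.GetEigenValues();
      (void)eigenValues;

      std::vector<std::vector<double>> transformed;
      std::vector<double> p(nObservables);
      for (const auto& x : candidates) {
        pca.X2P(x.data(), p.data());               // project onto the principal axes
        transformed.emplace_back(p.begin(), p.begin() + nKept);  // keep k components
      }
      return transformed;
    }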

  35. PCA – Principal Components Analysis (III):
  • Problem of using PCA prior to LDA: whether to use it or not is purely empirical, and the percentage of the eigenvalues to keep is also purely empirical.
  [Figure: a case in the (Variable 1, Variable 2) plane where the best discriminating axis is not the PCA 1st direction.]

  36. PCA – Principal Components Analysis (IV): difference between PCA and LDA, with the letters O and Q as an example:
  • PCA finds where most of the information is: the most important part of O and Q is a big round shape, so applying PCA means that both O and Q become O.
  • LDA finds where most of the difference is: the difference between O and Q is the little line at the bottom-right, so applying LDA means finding this little line.

  37. Influence of an LDA cut:
  • Useful to know whether the LDA cuts steeply or uniformly along each direction.
  • f_k = distribution of a sample along the direction of observable k; g_k = the same, after the LDA cut; F = the normalised integral of f.
  • h(x) = (g/f)(F⁻¹(x)); a quantity Q built from h is such that Q = 0 means the cut is uniform and Q = 1 means the cut is steep.
  [Figure: g/f as a function of F, between 0 and 1.]
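  Written out, the quantities defined on this slide (the exact formula for Q was in a figure and is not recoverable from the transcript; a natural choice, not necessarily the author's, would measure how far h deviates from a constant):

    \[
      F_k(y) = \frac{\int_{-\infty}^{y} f_k(t)\,dt}{\int_{-\infty}^{+\infty} f_k(t)\,dt},
      \qquad
      h(x) = \frac{g_k}{f_k}\!\left(F_k^{-1}(x)\right), \quad x \in [0,1],
    \]
    with \(Q = 0\) when the cut is uniform (\(h\) constant) and \(Q = 1\) when the cut is steep.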

  38. V0 decay topology:
  [Figure: sketch of the V0 decay topology.]
