
Presentation Transcript


  1. High Throughput Target Identification. Stan Young, NISS; Doug Hawkins, U Minnesota; Christophe Lambert, Golden Helix. Machine Learning, Statistics, and Discovery, 25 June 03.

  2. Micro Array Literature

  3. Guilt by Association : You are known by the company you keep.

  4. Data Matrix. Goal: associations over the genes. (Figure: gene × tissue data matrix, with the "guilty" gene highlighted.)

  5. Goals: 1. Associations. 2. Deep associations – beyond first-level correlations. 3. Uncover multiple mechanisms.

  6. Problems • n << p. • Strong correlations. • Missing values. • Non-normal distributions. • Outliers. • Multiple testing.

  7. Technical Approach • Recursive partitioning. • Resampling-based, adjusted p-values. • Multiple trees.

  8. Recursive Partitioning • Tasks • Create classes. • How to split. • How to stop.

  9. Recursive Partitioning vs. Clustering – Differences:
  Recursive partitioning: top-down analysis; can use any type of descriptor; uses biological activities to determine which features matter; produces a classification tree for interpretation and prediction; big N is not a problem; missing values are OK; with multiple trees, big p is OK.
  Clustering: often bottom-up; uses "gestalt" matching; requires an external method for determining the right feature set; difficult to interpret or use for prediction; big N is a severe problem!

  10. Forming Classes, Categories, Groups
  Profession         Av. Income
  Baseball Players   1.5M
  Football Players   1.2M
  Doctors            0.8M
  Dentists           0.5M
  Lawyers            0.23M
  Professors         0.09M
  ...

  11. Forming Classes from “Continuous” Descriptor How many “cuts” and where to make them?

  12. Splitting: t-test. Parent node: n = 1650, ave = 0.34, sd = 0.81. The splitter TT: NN-CC divides it into n = 1614 (ave = 0.29, sd = 0.73) and n = 36 (ave = 2.60, sd = 0.9).
  t = Signal / Noise = (2.60 - 0.29) / (0.734 * sqrt(1/36 + 1/1614)) = 18.68
  Raw p (rP) = 2.03E-70; adjusted p (aP) = 1.30E-66.
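
A minimal sketch (not the presenters' code; numpy is assumed) of the pooled-standard-deviation two-sample t statistic used above to score the split; the toy groups mirror the slide's n, ave, and sd.

```python
import numpy as np

def split_t_statistic(y_left, y_right):
    """Two-sample t = (difference in means) / (pooled sd * sqrt(1/n1 + 1/n2))."""
    n1, n2 = len(y_left), len(y_right)
    pooled_var = ((n1 - 1) * np.var(y_left, ddof=1) +
                  (n2 - 1) * np.var(y_right, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var) * np.sqrt(1.0 / n1 + 1.0 / n2)
    return (np.mean(y_left) - np.mean(y_right)) / se

# Toy data drawn to match the slide's group summaries (n, ave, sd):
rng = np.random.default_rng(0)
cc = rng.normal(2.60, 0.90, 36)       # n = 36,   ave ~ 2.60, sd ~ 0.90
nn = rng.normal(0.29, 0.73, 1614)     # n = 1614, ave ~ 0.29, sd ~ 0.73
print(split_t_statistic(cc, nn))      # roughly 18.7, as on the slide
```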

  13. Splitting: F-test.
  F = Signal / Noise = Among Var / Within Var = [Σ(X̄i. - X̄..)² / df1] / [Σ(Xij - X̄i.)² / df2]
  Parent node: n = 1650, ave = 0.34, sd = 0.81; split into n = 61 (ave = 1.29, sd = 0.83), n = 1553 (ave = 0.21, sd = 0.73), and n = 36 (ave = 2.60, sd = 0.9).
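
A companion sketch (again illustrative, not the original software) of the one-way F statistic for a multi-way split, following the among-variance / within-variance form above.

```python
import numpy as np

def split_f_statistic(groups):
    """F = [sum_i n_i*(mean_i - grand_mean)^2 / df1] / [sum_ij (x_ij - mean_i)^2 / df2],
    where df1 = k - 1 groups and df2 = N - k observations."""
    y = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean, k, N = y.mean(), len(groups), len(y)
    among = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    within = sum(np.sum((np.asarray(g) - np.mean(g)) ** 2) for g in groups) / (N - k)
    return among / within
```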

  14. How to Stop Examine each current terminal node. Stop if no variable/class has a significant split, multiplicity adjusted.

  15. Levels of Multiple Testing • Raw p-value. • Adjust for class formation, segmentation. • Adjust for multiple predictors. • Adjust for multiple splits in the tree. • Adjust for multiple trees.
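
Purely to illustrate how these levels stack (the deck's actual correction is the resampling-based adjustment of slide 22), a Bonferroni-style cascade that inflates one raw p-value once per layer; the function and argument names are hypothetical.

```python
def layered_bonferroni(raw_p, n_cuts, n_predictors, n_splits, n_trees):
    """Conservative upper bound: multiply the raw p-value by the number of
    tests at each level (cuts, predictors, splits in the tree, trees)."""
    p = raw_p
    for n_tests in (n_cuts, n_predictors, n_splits, n_trees):
        p = min(1.0, p * n_tests)
    return p

# e.g. layered_bonferroni(1e-8, n_cuts=5, n_predictors=1453, n_splits=4, n_trees=1000)
```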

  16. Understanding Observations: Multiple Mechanisms. Conditionally important descriptors. NB: splitting variables govern the process and are linked to the response variable.

  17. Multiple Mechanisms

  18. Reality: Example Data. 60 tissues, 1453 genes. Gene 510 is the "guilty" gene, the Y.

  19. 1st Split of Gene 510 (Guilty Gene)

  20. Split Selection: 14 splitters with adjusted p-value < 0.05.

  21. Histogram: non-normal, hence resampling-based p-values make sense.

  22. Resampling-based Adjusted p-value
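
A hedged sketch of one way such an adjusted p-value can be computed for a node's best split: permute the response, re-find the best |t| over all candidate splitters, and see how often the permuted best beats the observed best. The single median cut per predictor and the use of scipy are simplifying assumptions, not the presenters' implementation.

```python
import numpy as np
from scipy import stats

def best_abs_t(y, X):
    """Best |t| over every predictor column, splitting at its median (toy search)."""
    best = 0.0
    for j in range(X.shape[1]):
        cut = np.median(X[:, j])
        left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
        if len(left) > 1 and len(right) > 1:
            t, _ = stats.ttest_ind(left, right)
            best = max(best, abs(t))
    return best

def resampling_adjusted_p(y, X, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = best_abs_t(y, X)                    # best split score on the real data
    exceed = sum(best_abs_t(rng.permutation(y), X) >= observed
                 for _ in range(n_perm))           # how often the null does better
    return (exceed + 1) / (n_perm + 1)             # adjusted over all splitters/cuts
```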

  23. Single Tree RP Drawbacks • Data greedy. • Only one view of the data. May miss other mechanisms. • Highly correlated variables may be obscured. • Higher order interactions may be masked. • No formal mechanisms for follow-up experimental design. • Disposition of outliers is difficult.

  24. Multiple Trees, how and why? Etc.

  25. How do you get multiple trees? • Bootstrap the sample, one tree per sample. • Randomize over valid splitters. Etc.
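
A sketch of the first bullet, using scikit-learn as a stand-in for the Golden Helix software (max_depth and min_samples_leaf are illustrative choices): grow one regression tree per bootstrap resample of the tissues. The second bullet, randomizing over valid splitters, corresponds roughly to limiting max_features per split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bootstrap_trees(X, y, n_trees=1000, seed=0):
    """Grow one small regression tree per bootstrap resample of the tissues."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))       # resample with replacement
        tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees
```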

  26. Random Tree Browsing, 1000 Trees.

  27. Example Tree

  28. 1st Split

  29. Example Tree, 2nd Split

  30. Conclusion for Gene G510 If G518 < -0.56 and G790 < -1.46 then G510 = 1.10 +/- 0.30

  31. Using Multiple Trees to Understand variables • Which variables matter? • How to rank variables in importance. • Correlations. • Synergistic variables.
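
An illustrative follow-on (assumed, not taken from the deck): given the bootstrap trees from the sketch above, one crude importance ranking is how often each gene is chosen as a splitter; genes that repeatedly split together in the same tree hint at correlated or synergistic predictors.

```python
import numpy as np

def splitter_counts(trees, n_features):
    """Count how often each feature (gene) is used as a splitter across trees."""
    counts = np.zeros(n_features, dtype=int)
    for tree in trees:
        used = tree.tree_.feature            # scikit-learn marks leaf nodes with -2
        for j in used[used >= 0]:
            counts[j] += 1
    return counts

# Example: rank the candidate "guilty-by-association" genes.
# counts = splitter_counts(bootstrap_trees(X, y), X.shape[1])
# top10 = np.argsort(counts)[::-1][:10]
```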

  32. Correlation/Interaction Matrix. Red = synergistic.

  33. Summary • Reviewed recursive partitioning. • Demonstrated multiple-tree RP's capabilities: find associated genes; group correlated predictors (genes); find synergistic predictors (genes that predict together). • Used RP to understand a complex data set.

  34. Needed research • Real data sets with known answers. • Benchmarking. • Linking to gene annotations. • Scale (1,000*10,000). • Multiple testing in complex data sets. • Good visualization methods. • Outlier detection for large data sets. • Missing values. (see NISS paper 123)

  35. Teams. U Waterloo: Will Welch, Hugh Chipman, Marcia Wang, Yan Yuan. NC State University: Jacqueline Hughes-Oliver, Katja Rimlinger. U. Minnesota: Douglas Hawkins. NISS: Alan Karr (consider post docs). GSK: Lei Zhu, Ray Lam.

  36. References/Contact • www.goldenhelix.com. • www.recursive-partitioning.com. • www.niss.org, papers 122 and 123. • young@niss.org • GSK patent.

  37. Questions
