
Object Orie’d Data Analysis, Last Time


Presentation Transcript


  1. Object Orie’d Data Analysis, Last Time HDLSS Asymptotics • Studied from Dual Viewpoint NCI 60 Data • Visualization – found (DWD) directions that showed clusters of cancer types • Investigated with DiProPerm test HDLSS hypothesis testing

  2. HDLSS Asymptotics Interesting Idea from Travis Gaydos: Interpret from viewpoint of dual space Recall from Aug. 25: for Z_1, Z_2 ~ N(0, I_d), as d → ∞: • Distance to origin: ||Z_i|| ≈ d^(1/2) • Pairwise Distance: ||Z_1 - Z_2|| ≈ (2d)^(1/2) • Angle from origin: angle(Z_1, Z_2) ≈ 90°
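The three limits recalled on this slide are the standard HDLSS geometric representation (Hall, Marron and Neeman 2005): for standard Gaussian data, distance to the origin grows like d^(1/2), pairwise distance like (2d)^(1/2), and angles approach 90°. A quick Monte Carlo sketch (an illustration added here, not from the slides) checks all three numerically:

```python
# Monte Carlo check of the HDLSS geometric representation for Z_i ~ N(0, I_d):
# as d grows, ||Z_i|| is close to sqrt(d), ||Z_1 - Z_2|| is close to sqrt(2d),
# and the angle between Z_1 and Z_2 is close to 90 degrees.
import math
import random

def hdlss_stats(d, seed=0):
    rng = random.Random(seed)
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm1 = math.sqrt(sum(x * x for x in z1))
    norm2 = math.sqrt(sum(x * x for x in z2))
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
    cos_angle = sum(a * b for a, b in zip(z1, z2)) / (norm1 * norm2)
    return norm1, dist, math.degrees(math.acos(cos_angle))

for d in (100, 10000):
    norm1, dist, angle = hdlss_stats(d)
    print(d, norm1 / math.sqrt(d), dist / math.sqrt(2 * d), angle)
```

For d = 10000 the two ratios land very close to 1 and the angle very close to 90°, which is exactly the dual-space picture the slide refers to.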

  3. HDLSS Asymptotics – Dual View Would be interesting to try: Study (i.e. explore conditions for): • Consistency • Strong Inconsistency for PCA direction vectors, from this viewpoint Perhaps other things as well…

  4. NCI 60 Data Recall from: • Aug. 28 • Aug. 30 NCI 60 Cancer Cell Lines Microarray Data • Explored Data Combination • cDNA & Affymetrix Measurements • Right answer is known

  5. Real Clusters in NCI 60 Data • Simple Visual Approach: • Randomly relabel data (Cancer Types) • Recompute DWD dir’ns & visualization • Get heuristic impression from this • Deeper Approach • Formal Hypothesis Testing • (Done later)

  6. Real Clusters in NCI 60 Data? From Aug. 30: Simple Visual Approach: • Randomly relabel data (Cancer Types) • Recompute DWD dir’ns & visualization • Get heuristic impression from this • Some types appeared signif’ly different • Others did not Deeper Approach: Formal Hypothesis Testing

  7. HDLSS Hypothesis Testing Approach: DiProPerm Test DIrection – PROjection – PERMutation Ideas: • Find an appropriate Direction vector • Project data into that 1-d subspace • Construct a 1-d test statistic • Analyze significance by Permutation
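The four DiProPerm steps can be sketched in a few lines. This is a hypothetical minimal version, not the authors' implementation: it substitutes the simple mean-difference vector for the DWD direction (DWD itself requires a second-order cone solver), and uses the difference of projected class means as the 1-d statistic; function names are my own.

```python
# Minimal DiProPerm sketch: Direction, Projection, Permutation.
import random

def mean_diff_direction(X, y):
    # Direction vector: difference of the two class means (stand-in for DWD).
    d = len(X[0])
    m1, m0 = [0.0] * d, [0.0] * d
    n1 = sum(y)
    n0 = len(y) - n1
    for xi, yi in zip(X, y):
        target = m1 if yi == 1 else m0
        for j in range(d):
            target[j] += xi[j]
    return [a / n1 - b / n0 for a, b in zip(m1, m0)]

def diproperm_stat(X, y):
    # Project onto the direction; 1-d statistic = |difference of projected means|.
    w = mean_diff_direction(X, y)
    proj = [sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]
    p1 = [p for p, yi in zip(proj, y) if yi == 1]
    p0 = [p for p, yi in zip(proj, y) if yi == 0]
    return abs(sum(p1) / len(p1) - sum(p0) / len(p0))

def diproperm_pvalue(X, y, n_perm=200, seed=0):
    # Permutation null: relabel, then recompute BOTH direction and statistic.
    rng = random.Random(seed)
    t_obs = diproperm_stat(X, y)
    labels = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if diproperm_stat(X, labels) >= t_obs:
            count += 1
    return count / n_perm
```

A reported p-value of 0, as on several slides below, simply means that none of the permuted statistics reached the observed one.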

  8. DiProPerm Simple Example 1, Totally Separate Results: • Random relabelling gives much smaller Ts • Quantiles (over 1000 sim’s) give p-val of 0 • I.e. Strongly conclusive • Conclude sub-populations are different

  9. Needed final verification of Cross-platform Normal’n Is statistical power actually improved? Is there benefit to data combo by DWD? More data → more power? Will study now

  10. Needed final verification of Cross-platform Normal’n Summary of Results: P-values Combined Results Better for 7 out of 8 cases Combined significant, Affy not, in 3 cases (Leukemia, NSCLC, Renal) Shows combining platforms often worthwhile (because more data gives more power) Comparison with previous heuristics…

  11. Revisit Real Data (Cont.) Previous Heuristic Results (from rand re-ord): • Strong Clust’s: Melanoma, Leukemia, Renal (Statistically Sign’t, as expected) • Weak Clust’s: CNS, Ovarian, Colon (Not Sign’t, as expected) • Not Clust’s: NSCLC, Breast (Surprising result, not consistent with vis’n)

  12. Revisit Real Data (Cont.) Sungkyu Jung Question: How are those results driven by sample size? Add sample size to above table….

  13. Revisit Real Data (Cont.) Previous Heuristic Results (from rand re-ord), with sample sizes: • Strong Clust’s: Melanoma (18), Leukemia (12), Renal (14) (Statistically Sign’t, as expected) • Weak Clust’s: CNS (12), Ovarian (8), Colon (12) (Not Sign’t, as expected) • Not Clust’s: NSCLC (18), Breast (12) (Surprising result, not consistent with vis’n)

  14. Revisit Real Data (Cont.) Sungkyu Jung Question: How are those results driven by sample size? Add sample size to above table…. Good idea: Surprising result perhaps indeed due to larger sample size

  15. DiProPerm Test Particulate Matter Data Consulting Class Project, for: Lindsay Wichers, Penn Watkinson, EPA Analysis by: Chihoon Lee

  16. DiProPerm – Particulate Matter Data: • Measure Heart Rate of Rats • Over time (several days) • Treat with Particulate Matter • Study effect • See differences between treatments? • Statistically significant?

  17. DiProPerm – Particulate Matter

  18. DiProPerm – Particulate Matter Notes on curve view of data: • Clear day – night effect • Apparent changes after treatment • Stronger effect for higher dose • Effect diminishes over time • Statistically significant differences? • How does “signal” compare to “noise”?

  19. DiProPerm – Particulate Matter Alternate view of data: • Each curve is a “data point” • Study distribution of these “points” • Show replicates as points • To indicate “signal” vs. “noise” issues

  20. DiProPerm – Particulate Matter

  21. DiProPerm – Particulate Matter Notes on PCA & DWD dir’n scatterplots: • Dose effect looks strong (PC2 direction) • Systematic Pattern of colors • Ordered by doses • Suggests important differences • Statistically significant differences? • How does “signal” compare to “noise”? Address by DiProPerm tests

  22. DiProPerm – Particulate Matter Look for differences over 48 hours: • Run DiProPerm • Test Control vs. High Dose • Study difference over long time scale

  23. DiProPerm – Particulate Matter

  24. DiProPerm – Particulate Matter DiProPerm Results: • P-value = 0.056 • Not quite significant • “Noise” just overtakes “signal” • Perhaps Interval of 48 hours is too long • So try smaller interval Day 0, 9 AM – 3 PM

  25. DiProPerm – Particulate Matter DiProPerm Results: Day 0, 9 AM – 3 PM

  26. DiProPerm – Particulate Matter DiProPerm Results: Day 0, 9 AM – 3 PM • Results consistent with data curves: • C vs. H strongly different • C vs. M & L vs. H significantly different • Others not quite significant • For more, related, results, see Wichers, Lee, et al (2007)

  27. DiProPerm – Particulate Matter

  28. HDLSS Hypothesis Testing – DiProPerm test Chuck Perou’s 500 Breast Cancer data Based on data Merging (using DWD), total # = 512: • UNC GEO, # = 102 • UNC UnP, # = 93 • NKI Pub, # = 220 • NKI 97, # = 97

  29. Perou 500 Data Simple PCA view of combined data Hard to see structure

  30. Perou 500 Data PCA view – colored by cancer type Shows up in PC1 (vs. others) (good data Combo by DWD)

  31. Perou 500 Data PCA view – add symbols for source No obvious source effect (good data Combo by DWD)

  32. Perou 500 Data Rotate axes for better type separation Carefully chosen DWD views Separates quite well

  33. Perou 500 Data How distinct are classes? Compare “signal” vs. “noise” Measure statistical significance Using DiProPerm test

  34. Perou 500 Data DiProPerm test: Normal vs. Rest Pval = 0.22 Not strong evidence

  35. Perou 500 Data DiProPerm test: Normal vs. Rest Pval = 0.22, Not strong evidence OK, since “normal” means: biopsy missed tumor But mostly from cancer patients Instead compare with “true normals”

  36. Perou 500 Data DiProPerm test: True Normal vs. Rest Pval = 2.30e-06, Very strong evidence Makes sense.

  37. Perou 500 Data DiProPerm test: Cancer classes • Luminals vs {Her2 & Basals}, pval = 0 • Her2 vs Basals, pval = 0 • Lum A vs Lum B, pval = 0.0068 All strongly conclusive Adds statistical significance to earlier results

  38. Perou 500 Data Interesting questions: • Was the DWD combination essential? • Were individual groups sign’t anyway? • What was value of DWD combo?

  39. Perou 500 Data DiProPerm Luminals vs {Her2 & Basals}: • All Combined, p-val = 0 • UNC Combo, p-val = 0 • NKI Combo, p-val = 0 • UNC GEO, p-val = 0 • UNC UnP, p-val = 3e-14 • NKI Pub, p-val = 1e-11 • NKI 97, p-val = 0.00078

  40. Perou 500 Data DiProPerm Luminal A vs Luminal B: • All Combined, p-val = 0.0068 • UNC Combo, p-val = 0.214 • NKI Combo, p-val = 0.014 • UNC GEO, p-val = 0.298 • UNC UnP, p-val = 0.396 • NKI Pub, p-val = 0.052 • NKI 97, p-val = 0.246 (shows clear value to combining)

  41. Perou 500 Data DiProPerm Her2 vs Basals: • All Combined, p-val = 0 • UNC Combo, p-val = 0 • NKI Combo, p-val = 0 • UNC GEO, p-val = 0 • UNC UnP, p-val = 0.02 • NKI Pub, p-val = 0.00008 • NKI 97, p-val = 0.246

  42. Perou 500 Data Drawback of DiProPerm here: • Classes found by clustering • Different from e.g. NCI 60 classes • So maybe not surprising they are different from random • I.e. find significant differences • Does this really mean: Cluster is really there??? Needs deeper thought…

  43. HDLSS Hypothesis Testing – DiProPerm test Many Open Questions on DiProPerm Test: • Which Direction is “Best”? • Which 1-d Projected test statistic? • Permutation vs. alternatives (bootstrap?) • How do these interact? • What are asymptotic properties?

  44. Clustering Idea: Given data • Assign each object to a class • Of similar objects • Completely data driven • I.e. assign labels to data • “Unsupervised Learning” Contrast to Classification (Discrimination) • With predetermined classes • “Supervised Learning”

  45. Clustering Important References: • MacQueen (1967) • Hartigan (1975) • Kaufman and Rousseeuw (2005)

  46. K-means Clustering Main Idea: for data X_1, …, X_n, partition indices among classes Given index sets C_1, …, C_K that partition {1, …, n} • represent clusters by “class mean” Xbar_k = (1/n_k) * sum_{i in C_k} X_i, where n_k = #C_k (within class means)

  47. K-means Clustering Given index sets C_1, …, C_K Measure how well clustered, using Within Class Sum of Squares: WSS = sum_k sum_{i in C_k} ||X_i - Xbar_k||^2 Weak point: not very interpretable

  48. K-means Clustering Common Variation: Put on scale of proportions (i.e. in [0,1]) By dividing “within class SS” by “overall SS” Gives Cluster Index: CI = [sum_k sum_{i in C_k} ||X_i - Xbar_k||^2] / [sum_i ||X_i - Xbar||^2]

  49. K-means Clustering Notes on Cluster Index: • CI = 0 when all data at cluster means • CI small when gives tight clustering (within SS contains little variation) • CI big when gives poor clustering (within SS contains most of variation) • CI = 1 when all cluster means are same

  50. K-means Clustering Clustering Goal: • Given data X_1, …, X_n • Choose classes C_1, …, C_K • To minimize CI
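The Cluster Index can be computed directly from its definition. The sketch below is an added illustration, not from the slides; the function name and the list-of-index-lists encoding of the partition are my own choices:

```python
# Cluster Index: CI = (within-class sum of squares) / (total sum of squares),
# so CI lies in [0, 1]; it is small for tight clusters and 1 when all
# cluster means coincide with the overall mean.
def cluster_index(X, classes):
    n = len(X)
    d = len(X[0])
    overall = [sum(x[j] for x in X) / n for j in range(d)]
    total_ss = sum((x[j] - overall[j]) ** 2 for x in X for j in range(d))
    within_ss = 0.0
    for idx in classes:  # each element of `classes` is a nonempty list of indices
        mean = [sum(X[i][j] for i in idx) / len(idx) for j in range(d)]
        within_ss += sum((X[i][j] - mean[j]) ** 2 for i in idx for j in range(d))
    return within_ss / total_ss
```

Since the total sum of squares is fixed by the data, minimizing CI over partitions is the same as minimizing the within-class sum of squares, which is exactly the K-means objective.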
