
High Dimensional Data Analysis

The Sixth Sino-Japan-Korea Bioinformatics Training Course Yu Shyr ( 石 瑜 ), Ph.D. March 30, 2007 Yu.Shyr@vanderbilt.edu. High Dimensional Data Analysis. Shyr, Yu ( 石 瑜 ) , Ph.D. Ingram Distinguished Professor of Cancer Research Director


Presentation Transcript


  1. The Sixth Sino-Japan-Korea Bioinformatics Training Course Yu Shyr (石 瑜), Ph.D. March 30, 2007 Yu.Shyr@vanderbilt.edu High Dimensional Data Analysis

  2. Shyr, Yu (石 瑜 ), Ph.D. Ingram Distinguished Professor of Cancer Research Director Cancer Biostatistics & High Dimensional Data Analysis Center Professor & Chief Division of Cancer Biostatistics, Dept of Biostatistics Vanderbilt University School of Medicine Yu.Shyr@vanderbilt

  3. Aims • Introduction to high-throughput assays: today and tomorrow. • Issues in the experimental design of high-throughput assays. • Data pre-processing topics, including quality control assessment. • Methods of data analysis: data mining, pattern recognition, class comparison, model prediction. • Visualization tools. • How to avoid common mistakes in analyzing data derived from high-throughput assays.

  4. Bioinformatics • The term “bioinformatics” is a fairly recent invention, not appearing in the literature until around 1991. The first efforts in bioinformatics can be traced back to the early application of computers to biology in the 1950s and 1960s. • Bioinformatics can generally be defined as the study of how information technologies (computer science, statistics and probability, and mathematics) are used to solve problems in biology. • The current concept of bioinformatics can be described as the convergence of two technology revolutions: the explosive growth in biotechnology, paralleled by the explosive growth in information technology.

  5. Biostatistics, Bioinformatics, Data Mining • Biostatistics • Bioinformatics • Data Mining Gene profile, Protein profile Hypothesis testing vs. Hypothesis generating

  6. Types of Biomedical Studies • In vitro – cell lines • In vivo – animal models • Clinical Trials – Phase I, II, III, IV • Cohort Study – Prospective, Retrospective • Case-Control Study – Nested Case-Control

  7. High Dimensional Data • The major challenge in high-throughput experiments, e.g., microarray, MALDI-TOF, or SELDI-TOF data, is that the data are often high dimensional. • When the number of dimensions reaches thousands or more, the computational time for pattern recognition algorithms can become unreasonable. This is especially a problem when some of the features are not discriminatory.

  8. High Dimensional Data • Irrelevant features may reduce the accuracy of some algorithms. For example (Witten 1999), experiments with a decision tree classifier have shown that adding a random binary feature to standard datasets can degrade classification performance by 5-10%. • Furthermore, in many pattern recognition tasks, the number of features represents the dimension of a search space: the larger the number of features, the greater the dimension of the search space, and the harder the problem.

  9. Issues in the Analysis of High Dimensional Data • Experiment Design • Measurement • Preprocessing ♦ Filtering, Baseline Correction, Normalization ♦ Profile Alignment, Transformation, Variance Correction • QCA (Quality Control Assessment) • Feature Selection • Classification

  10. Issues in the Analysis of High Dimensional Data • Computational Validation ♦ Estimate the classification error rate ♦ Bootstrapping, k-fold validation, leave-one-out validation • Significance Testing of the Achieved Classification Error • Validation – blind test cohort • Validation – laboratory technology, e.g., RT-PCR, pathway analysis • Reporting the results – graphics & tables
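The computational validation step listed on slide 10 can be illustrated with a minimal sketch. It estimates the classification error rate by k-fold cross-validation, with leave-one-out as the special case k = n. The nearest-centroid classifier, the function names, and the simulated two-group data are assumptions made for illustration; the slides do not say which classifier was used.

```python
import random

def nearest_centroid_fit(X, y):
    """Return {label: centroid} computed from training rows X with labels y."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def nearest_centroid_predict(centroids, row):
    """Predict the label whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))

def kfold_error(X, y, k, seed=0):
    """Estimate the misclassification rate by k-fold cross-validation.

    Setting k == len(X) gives leave-one-out validation.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(nearest_centroid_predict(model, X[i]) != y[i] for i in fold)
    return errors / len(X)

# Simulated (hypothetical) data: two well-separated groups of 20 samples each.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)] + \
    [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(20)]
y = [0] * 20 + [1] * 20

cv_error = kfold_error(X, y, k=5)
loo_error = kfold_error(X, y, k=len(X))
print("5-fold error:", cv_error, "leave-one-out error:", loo_error)
```

In a real high-dimensional study, any feature selection would need to be repeated inside each training fold. Selecting features on the full data set first would bias the error estimate downward.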

  11. Experiment Design • Study Objectives: • Class Discovery (unsupervised) • Class Comparison (supervised) • Class Prediction (supervised)

  12. Outcome Measurement: Microarray [Figure: images scanned with Laser 1 and Laser 2, and an overlay of the two images]

  13. Outcome Measurement: MALDI-TOF

  14. Experiment Design – Bias Reduction – Randomization • Simple Randomization • Stratified Permuted Block Randomization • Adaptive Randomization • Goal – Remove the Bias!!
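As a hedged illustration of the stratified permuted block randomization named on slide 14, the sketch below builds an allocation schedule in which every block of four assignments contains each arm equally often, separately within each stratum. The arm labels, block size, stratum names, and function names are all hypothetical choices, not details taken from the slides.

```python
import random

def permuted_block_schedule(n_subjects, arms=("A", "B"), block_size=4, rng=None):
    """Assign subjects to arms in randomly permuted, balanced blocks."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = rng or random.Random()
    schedule = []
    while len(schedule) < n_subjects:
        block = list(arms) * (block_size // len(arms))  # balanced block
        rng.shuffle(block)                              # random order within the block
        schedule.extend(block)
    return schedule[:n_subjects]

def stratified_schedule(stratum_sizes, arms=("A", "B"), block_size=4, seed=None):
    """Build an independent permuted-block schedule for each stratum."""
    rng = random.Random(seed)
    return {name: permuted_block_schedule(n, arms, block_size, rng)
            for name, n in stratum_sizes.items()}

# Hypothetical strata; the sizes are multiples of the block size, so every block is complete.
schedule = stratified_schedule({"stage I-II": 12, "stage III-IV": 8}, seed=2007)
for stratum, assignments in schedule.items():
    print(stratum, assignments)
```

Blocking keeps the arms balanced over time within each stratum, which is why it is often preferred to simple randomization in small studies.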

  15. Quality Control Assessment: Reproducibility • Coefficient of Variation (CV) = SD / Mean • Intra-class Correlation Coefficient (ICC) = Intra / (Intra + Inter) • Variance Component Analysis: mixed/random effect model. The model: investigators, day, spot, machine, lab, etc. • Goal – Make sure the data are reproducible!!
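The coefficient of variation on slide 15 is the sample standard deviation divided by the mean. A minimal sketch follows; the replicate intensities are made-up numbers for illustration only.

```python
import statistics

def coefficient_of_variation(values):
    """CV = sample standard deviation / mean (unitless; often reported as a percentage)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical intensities of one peak measured in five replicate runs.
replicates = [98.0, 102.0, 100.0, 101.0, 99.0]
cv = coefficient_of_variation(replicates)
print(f"CV = {100 * cv:.2f}%")
```

A small CV across replicates of the same specimen suggests that the measurement is reproducible. The value that counts as acceptable depends on the assay.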

  16. Source of Variability for MALDI-TOF Data • Biological Heterogeneity in Population • Specimen Collection/Handling Effects - Tumor: surgical related effects - Cell Line: culture condition • Biological Heterogeneity in Specimen • Laser power variation

  17. Normalization & Filtering • Normalization - Model-based methods - Ratio-based methods • Filtering - Scratch - Bubble - Dust specks - Noise

  18. Experiment Design –How many replications?

  19. Number of replications required (Power = 80%, Type I error = 5%), by intra-case variance, inter-case variance, and subsample number (m):

      Intra-case variance   |      0.2       |      0.5       |      1.0
      Inter-case variance   | 0.2  0.5  1.0  | 0.2  0.5  1.0  | 0.2  0.5  1.0
      m = 1                 |  6   11   19   | 11   16   24   | 19   24   32
      m = 5                 |  4    9   17   |  5   10   17   |  7   11   19
      m = 20                |  4    8   16   |  4    9   16   |  4    9   17

  20. Quality Control: Intra Class Correlation: Training 1-ICC Weight

  21. Intra Class Correlation: Training and Testing Combined 1-ICC Weight

  22. Intra Class Correlation: Training, Testing, and Combined

  23. Expression Profiles day1 day2 day3 day4

  24. The Results from the Cluster Analysis

  25. Why?

  26. Experiment Design - Sample Size Estimation The size of the study should be considered early in the planning phase. Biomedical studies should have sufficient statistical power to detect differences between groups considered to be of clinical interest. Therefore, calculation of sample size with provision for adequate levels of significance and power is an essential part of pla

  27. Experiment Design - Sample Size Estimation

  28. Experiment Design - Sample Size Estimation

  29. Sample Size Estimation and Power Analysis • In vitro – cell lines • In vivo – animal study • Clinical study • Microarray experiment - http://bioinformatics.mdanderson.org/MicroarraySampleSize/ • MALDI-TOF - simulation

  30. Sample Size Estimation – An Example of MALDI-TOF Study We simulated a 2,000-variable dataset for a sample size of 140 (70 patients with PR+SD and 70 patients with PD). Each variable represents a certain protein. Using 2,500 such simulations, we determined that cluster analysis has greater than 80% power to correctly classify observations into two response groups at a misclassification rate of less than 5% per group when there is a 2 standard deviation difference in the means of 1% (20) of the 2,000 variables.

  31. False Discovery Rate Methods for High-Dimensional Data • Massively Univariate Modeling: fit a model for each feature/biomarker (gene, protein) • Which of 50,000 features are significant? α = 0.05 → 2,500 false positives!

  32. Solutions for the Multiple Comparison Problem • A MCP Solution Must Control False Positives • How to measure multiple false positives? • Familywise Error Rate (FWER) • Chance of any false positives • Controlled by Bonferroni & Random Field Methods • False Discovery Rate (FDR) • Proportion of false positives among rejected tests
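The arithmetic behind the 2,500 false positives, and the Bonferroni threshold that controls the familywise error rate instead, can be checked in a few lines. The sketch assumes the worst case in which every one of the 50,000 null hypotheses is true; the variable names are illustrative.

```python
n_features = 50_000   # number of features tested, as on slide 31
alpha = 0.05          # per-test significance level

# If every null hypothesis were true, this many tests would be significant by chance.
expected_false_positives = n_features * alpha

# Bonferroni: test each feature at alpha / n_features to keep FWER at or below alpha.
bonferroni_threshold = alpha / n_features

print(f"Expected false positives at alpha = {alpha}: {expected_false_positives:.0f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold:.1e}")
```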

  33. False Discovery Rate • Observed FDR • Obs FDR = V0R/(V1R+V0R) = V0R/NR • If NR = 0, obsFDR = 0 • Only know NR, not how many are true or false • Control is on the expected FDR • FDR = E(obsFDR)

  34. p(i) i / v q / c(v) Benjamini & Hochberg Procedure • Select desired limit q on FDR • Order p-values, p(1)p(2) ...  p(V) • Let r be largest i such that • Reject all hypotheses corresponding top(1), ... , p(r). JRSS-B (1995)57:289-300 1 p(i) p-value i/v q/c(v) 0 0 1 i / v

  35. Benjamini & Hochberg: Key Properties • FDR is controlledE(obsFDR)  q m0/V • Conservative, if large fraction of nulls false • Adaptive • Threshold depends on amount of feature • More feature, More small p-values,More p(i) less than i/V q/c(V)

  36. Benjamini & Hochberg Procedure • c(V) = 1 • Positive Regression Dependency on Subsets • P(X1c1, X2c2, ..., Xkck | Xi=xi) is non-decreasing in xi • Only required of test statistics for which null true • Special cases include • Independence • Multivariate Normal with all positive correlations • Same, but studentized with common std. err. • c(V) = i=1,...,V 1/i log(V)+0.5772 • Arbitrary covariance structure Benjamini &Yekutieli (2001).Ann. Stat.29:1165-1188
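To see how much the arbitrary-dependence correction on slide 36 costs, the sketch below compares the exact harmonic sum c(V) with the approximation log(V) + 0.5772. Taking V = 50,000 to match the earlier slide is an assumption, as are the function names.

```python
import math

def c_exact(V):
    """c(V) = sum of 1/i for i = 1..V (the V-th harmonic number)."""
    return sum(1.0 / i for i in range(1, V + 1))

def c_approx(V):
    """Approximation log(V) + 0.5772 (natural log plus the Euler-Mascheroni constant)."""
    return math.log(V) + 0.5772

V = 50_000
exact = c_exact(V)
approx = c_approx(V)
print(f"c(V) exact = {exact:.4f}, approx = {approx:.4f}")
print(f"BH thresholds shrink by a factor of about {exact:.1f} under arbitrary dependence")
```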

  37. Other FDR Methods • John Storey (JRSS-B (2002) 64:479-498) • pFDR “Positive FDR” • FDR conditional on one or more rejections • Critical threshold is fixed, not estimated • pFDR and Empirical Bayes • Asymptotically valid under “clumpy” dependence • James Troendle (JSPI (2000) 84:139-158) • Normal theory FDR • More powerful than BH FDR • Requires numerical integration to obtain thresholds • Exactly valid if whole correlation matrix known

  38. Reflex MALDI TOF Mass Spectrometer [Figure labels: Laser Optics, Nitrogen Laser (337 nm), TOF Analyzer, Microchannel Detector, MALDI Target, Ion Mirror, Ion Grid]

  39. Time-of-Flight Mass Spectrometry (TOF-MS) [Figure: linear TOF schematic with an ionizing probe (start), accelerating voltage +/- U, and ion detector (MCP); ions M1, M2, M3 produce ion signals at flight times t1, t2, t3]

  40. Basic Descriptions of the Data Preprocessing: Registration, Denoising, Baseline correction, Normalization, Peak selection, Peak alignment or Binning. Math Model for MS Data Preprocessing • From a mathematical point of view, an MS data set is nothing but a signal function defined on a time or m/z domain. • An observed MALDI MS signal f(x) is often modeled as the superposition of three components: f(x) = B(x) + S(x) + e(x), where B(x) is a slowly varying “baseline”, S(x) is the “true” signal to be extracted, and e(x) represents high-frequency machine noise.
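The three-component description on slide 40 can be made concrete by simulating a toy spectrum as the sum of a slowly varying baseline B(x), a peak signal S(x), and random noise e(x). Every shape and number below (exponential baseline, Gaussian peaks, noise level, function names) is a hypothetical choice for illustration; the slides do not specify them.

```python
import math
import random

def baseline(x):
    """B(x): a slowly decaying baseline (hypothetical exponential shape)."""
    return 50.0 * math.exp(-x / 300.0)

def true_signal(x, peaks):
    """S(x): a sum of Gaussian peaks given as (center, height, width) tuples."""
    return sum(h * math.exp(-0.5 * ((x - mu) / w) ** 2) for mu, h, w in peaks)

def simulate_spectrum(n_points=1000, noise_sd=1.0, seed=0):
    """Return x, f, B, S, e with f(x) = B(x) + S(x) + e(x) at each sample."""
    rng = random.Random(seed)
    peaks = [(200.0, 40.0, 4.0), (550.0, 25.0, 6.0), (800.0, 15.0, 8.0)]  # made up
    xs = list(range(n_points))
    B = [baseline(x) for x in xs]
    S = [true_signal(x, peaks) for x in xs]
    e = [rng.gauss(0.0, noise_sd) for _ in xs]  # high-frequency noise
    f = [b + s + n for b, s, n in zip(B, S, e)]
    return xs, f, B, S, e

xs, f, B, S, e = simulate_spectrum()
print("observed intensity near the first peak:", round(f[200], 2))
```

Under this view, baseline correction estimates and subtracts B(x), denoising suppresses e(x), and peak selection works on what remains of S(x).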

  41. Features of the Wave-Spec Preprocessing Package • MATLAB: an interactive software system for numerical computation and graphics that uses matrix-based techniques to solve problems. • Wavelet: the FBI's image coding standard for digitized fingerprints; successful at reproducing the true signal by removing noise at specific energy levels. • Common Metric: for denoising, normalization, and S/N calculation for peak selection. • Efficient Binning: application of a statistical estimation curve of the peak distribution. • Input: raw data. Output: data set of binned peaks.

  42. Data registration & Wavelet Denoising • Discrete signal processing requires a consistent, fixed sampling step. • MALDI MS data have a different (decreasing) sampling rate over different mass intervals. • Wavelet Decomposition & Reconstruction: (1) decompose (calculate coefficients); (2) choose a threshold and perform thresholding; (3) reconstruct (from the coefficients and wavelet functions). • The adaptive stationary discrete wavelet denoising method offers both good L2 performance and smoothness. • The adaptive denoising method is based on the noise distribution: different threshold values are set at different mass intervals and frequency levels.
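The decompose, threshold, reconstruct cycle on slide 42 can be sketched with a plain decimated Haar transform. This is a simplification: the slides describe an adaptive stationary (shift-invariant) wavelet method in MATLAB, which is not reproduced here. The Haar wavelet, the soft-thresholding rule, the MAD-based universal threshold, and all names are assumptions made for illustration. Thresholds can be passed per level to mimic the idea of different thresholds at different frequency levels.

```python
import math
import random

def haar_decompose(signal, levels):
    """Return (approximation, [details at level 1, 2, ...]); len(signal) must be divisible by 2**levels."""
    approx, details, s = list(signal), [], math.sqrt(2.0)
    for _ in range(levels):
        half = len(approx) // 2
        d = [(approx[2 * i] - approx[2 * i + 1]) / s for i in range(half)]
        approx = [(approx[2 * i] + approx[2 * i + 1]) / s for i in range(half)]
        details.append(d)
    return approx, details

def haar_reconstruct(approx, details):
    """Invert haar_decompose from the coarsest level back to the original length."""
    current, s = list(approx), math.sqrt(2.0)
    for d in reversed(details):
        nxt = []
        for a_i, d_i in zip(current, d):
            nxt.append((a_i + d_i) / s)
            nxt.append((a_i - d_i) / s)
        current = nxt
    return current

def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t; coefficients smaller than t become zero."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(signal, levels=3, thresholds=None):
    """Decompose, threshold the detail coefficients level by level, and reconstruct."""
    approx, details = haar_decompose(signal, levels)
    if thresholds is None:
        # Assumed default: universal threshold with the noise SD estimated from the
        # median absolute finest-level detail coefficient.
        finest = sorted(abs(c) for c in details[0])
        sigma = finest[len(finest) // 2] / 0.6745
        thresholds = [sigma * math.sqrt(2.0 * math.log(len(signal)))] * levels
    shrunk = [soft_threshold(d, t) for d, t in zip(details, thresholds)]
    return haar_reconstruct(approx, shrunk)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical piecewise-constant "true" signal plus Gaussian noise.
rng = random.Random(0)
clean = [0.0] * 64 + [10.0] * 64 + [3.0] * 128
noisy = [v + rng.gauss(0.0, 1.0) for v in clean]
denoised = denoise(noisy, levels=3)
print("MSE before:", round(mse(noisy, clean), 3), "after:", round(mse(denoised, clean), 3))
```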

  43. Pre- Calibration

  44. Post Calibration 1. Accurate m/z peak positions (matching the theoretical values). 2. Less variation in peak positions. 3. Easy to handle large datasets in batch mode.

  45. Convolution Based Calibration Algorithm 1. Simulate the known peaks (choose peaks with high prevalence across spectra and a clear pattern by feedback 80%). 2. Convolve each spectrum with the simulated known peaks (Gaussian or Beta). The maximum occurs when the two peak shapes match best. 3. The linear shift (in units) that makes multiple peaks match best is the optimal shift. Note: all processing is done in the time domain.
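The calibration idea on slide 45 can be sketched as a search for the integer shift that maximizes the overlap between a spectrum and a template of simulated known peaks. The Gaussian peak shape, the peak positions, the search range, and the function names are hypothetical. The brute-force sliding product below is one simple way to evaluate the match and is not necessarily how the authors implemented it.

```python
import math

def gaussian_template(n_points, centers, width):
    """Simulate known peaks as unit-height Gaussians at the given sample positions."""
    return [sum(math.exp(-0.5 * ((i - c) / width) ** 2) for c in centers)
            for i in range(n_points)]

def best_linear_shift(spectrum, template, max_shift):
    """Return the shift s maximizing sum_i spectrum[i + s] * template[i]."""
    n = len(spectrum)
    best_s, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        score = 0.0
        for i, t in enumerate(template):
            j = i + s
            if 0 <= j < n:
                score += spectrum[j] * t
        if score > best_score:
            best_s, best_score = s, score
    return best_s

def apply_shift(spectrum, s):
    """Re-index the spectrum so that its peaks line up with the template."""
    n = len(spectrum)
    return [spectrum[i + s] if 0 <= i + s < n else 0.0 for i in range(n)]

known_peaks = [100, 250, 400]                               # hypothetical positions
template = gaussian_template(500, known_peaks, width=3.0)
observed = gaussian_template(500, [c + 7 for c in known_peaks], width=3.0)  # drifted by 7

shift = best_linear_shift(observed, template, max_shift=20)
aligned = apply_shift(observed, shift)
print("estimated shift:", shift)
```

Using several known peaks at once makes the estimated shift less sensitive to any single noisy or missing peak.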

  46. Pre- Calibration

  47. Post Calibration

  48. Wavelet Denoising • Wavelet methods have been used to denoise signals in a wide variety of contexts. • The wavelet method analyzes the data in both the time and frequency domains to extract more useful information. • An adaptive stationary discrete wavelet denoising method, which is shift-invariant and efficient in denoising, is applied in our research.

  49. SDWT Decomposition

  50. Energy Distribution
