High Dimensional Data Analysis

The Sixth Sino-Japan-Korea Bioinformatics Training Course. Yu Shyr (石瑜), Ph.D. March 30, 2007. Yu.Shyr@vanderbilt.edu

Shyr, Yu (石瑜), Ph.D. Ingram Distinguished Professor of Cancer Research; Director, Cancer Biostatistics & High Dimensional Data Analysis Center; Professor & Chief, Division of Cancer Biostatistics, Dept of Biostatistics, Vanderbilt University School of Medicine. Yu.Shyr@vanderbilt

Aims
• Introduce high-throughput assays, today and tomorrow.
• Discuss issues in the experimental design of high-throughput assays.
• Cover data pre-processing topics, including quality control assessment.
• Present methods of data analysis: data mining, pattern recognition, class comparison, and model prediction.
• Survey visualization tools.
• Show how to avoid common mistakes in analyzing data derived from high-throughput assays.

Bioinformatics
• The term “bioinformatics” is a fairly recent invention, not appearing in the literature until around 1991. The first efforts in bioinformatics, however, can be traced back to the early application of computers to biology in the 1950s and 1960s.
• Bioinformatics can generally be defined as the study of how information technologies (computer science, statistics and probability, and mathematics) are used to solve problems in biology.
• The current concept of bioinformatics can be described as the convergence of two technology revolutions: the explosive growth in biotechnology, paralleled by the explosive growth in information technology.

Biostatistics, Bioinformatics, Data Mining
• Biostatistics
• Bioinformatics
• Data Mining
Gene profile, protein profile
Hypothesis testing vs. hypothesis generating

Types of Biomedical Studies
• In vitro – cell lines
• In vivo – animal models
• Clinical Trials – Phase I, II, III, IV
• Cohort Study – Prospective, Retrospective
• Case-Control Study – Nested Case-Control

High Dimensional Data
• The major challenge in high-throughput experiments (e.g., microarray, MALDI-TOF, or SELDI-TOF data) is that the data are often high dimensional.
• When the number of dimensions reaches thousands or more, the computational time for pattern recognition algorithms can become unreasonable. This is especially a problem when some of the features are not discriminatory.

High Dimensional Data
• Irrelevant features may reduce the accuracy of some algorithms. For example (Witten 1999), experiments with a decision tree classifier have shown that adding a random binary feature to standard datasets can degrade classification performance by 5-10%.
• Furthermore, in many pattern recognition tasks, the number of features represents the dimension of a search space: the larger the number of features, the greater the dimension of the search space, and the harder the problem.

Issues in the Analysis of High Dimensional Data
• Experiment Design
• Measurement
• Preprocessing
  ♦ Filtering, Baseline Correction, Normalization
  ♦ Profile Alignment, Transformation, Variance Correction
• QCA (Quality Control Assessment)
• Feature Selection
• Classification

Issues in the Analysis of High Dimensional Data
• Computational Validation
  ♦ Estimate the classification error rate
  ♦ Bootstrapping, k-fold validation, leave-one-out validation
• Significance Testing of the Achieved Classification Error
• Validation – blind test cohort
• Validation – laboratory technology, e.g., RT-PCR, pathway analysis
• Reporting the results – graphics & tables
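As an illustration of the computational validation step, here is a minimal sketch of k-fold cross-validation for estimating a classification error rate. The nearest-centroid classifier and all function names are assumptions chosen for brevity, not the method used in the course. Setting k equal to the number of samples gives leave-one-out validation. In a real analysis, any feature selection would also have to be repeated inside each fold to keep the error estimate honest.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def nearest_centroid_fit(X, y):
    """Compute one mean vector (centroid) per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))

def cv_error(X, y, k, seed=0):
    """Estimate the misclassification rate by k-fold cross-validation."""
    errors = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(X)) if i not in held_out]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(nearest_centroid_predict(model, X[i]) != y[i] for i in test_idx)
    return errors / len(X)
```

Each sample is held out exactly once, so the returned value is the fraction of samples misclassified when they were not part of the training data.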

Experiment Design
• Study Objectives:
  - Class Discovery (unsupervised)
  - Class Comparison (supervised)
  - Class Prediction (supervised)

Outcome Measurement: Microarray
[Figure: scans from Laser 1 and Laser 2, and an overlay of the two images]

Outcome Measurement: MALDI-TOF

Experiment Design – Bias Reduction – Randomization
• Simple Randomization
• Stratified Permuted Block Randomization
• Adaptive Randomization
• Goal – remove the bias!
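A minimal sketch of stratified permuted block randomization follows. The arm labels, stratum names, block size, and function names are hypothetical; the same idea could be used to randomize the order in which specimens are processed.

```python
import random

def permuted_block_schedule(n_subjects, arms=("A", "B"), block_size=4, seed=0):
    """Build an assignment list from randomly permuted, balanced blocks."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        block = list(arms) * (block_size // len(arms))  # equal counts per arm
        rng.shuffle(block)                              # random order within the block
        schedule.extend(block)
    return schedule[:n_subjects]

def stratified_schedule(strata_sizes, block_size=4, seed=0):
    """Run an independent permuted-block schedule within each stratum."""
    return {
        stratum: permuted_block_schedule(n, block_size=block_size, seed=seed + i)
        for i, (stratum, n) in enumerate(sorted(strata_sizes.items()))
    }

# Hypothetical strata: 8 early-stage and 12 late-stage subjects.
plan = stratified_schedule({"stage I-II": 8, "stage III-IV": 12}, block_size=4)
```

Because every complete block contains each arm equally often, the arms stay balanced within each stratum throughout enrollment.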

Quality Control Assessment – Reproducibility
• Coefficient of Variation (CV): SD / Mean
• Intra-class Correlation Coefficient (ICC): Inter / (Inter + Intra), i.e., the between-class variance as a fraction of the total variance
• Variance Component Analysis: mixed/random effect model, with factors such as investigator, day, spot, machine, and lab
• Goal – make sure the data are reproducible!
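The two reproducibility measures can be sketched as below. The ICC here is the usual one-way random-effects ANOVA estimate for a balanced design (the same number of replicates per specimen); that choice, the function names, and the example values are assumptions for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

def icc_oneway(groups):
    """ICC from a balanced one-way random-effects ANOVA.

    groups: list of lists, one inner list of k replicate measurements per specimen.
    """
    n = len(groups)
    k = len(groups[0])
    if any(len(g) != k for g in groups):
        raise ValueError("balanced design required: same replicate count per group")
    grand = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    # Between-group and within-group mean squares.
    msb = k * sum((m - grand) ** 2 for m in group_means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical intensities: three specimens, three replicate spectra each.
replicates = [[10.1, 9.9, 10.0], [12.2, 11.8, 12.0], [8.1, 7.9, 8.0]]
icc = icc_oneway(replicates)
```

An ICC near 1 means that most of the variance comes from differences between specimens rather than from replicate-to-replicate noise.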

Sources of Variability for MALDI-TOF Data
• Biological heterogeneity in the population
• Specimen collection/handling effects
  - Tumor: surgery-related effects
  - Cell line: culture conditions
• Biological heterogeneity in the specimen
• Laser power variation

Normalization & Filtering
• Normalization
  - Model-based methods
  - Ratio-based methods
• Filtering
  - Scratches
  - Bubbles
  - Dust specks
  - Noise

Experiment Design – How many replications?

Number of replications required (power = 80%, Type I error = 5%)

                Intra-case var. 0.2    Intra-case var. 0.5    Intra-case var. 1.0
Subsample       Inter-case variance    Inter-case variance    Inter-case variance
number (m)      0.2    0.5    1.0      0.2    0.5    1.0      0.2    0.5    1.0
1                 6     11     19       11     16     24       19     24     32
5                 4      9     17        5     10     17        7     11     19
20                4      8     16        4      9     16        4      9     17
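The slide does not state how these replication counts were derived. As a rough sketch only, the standard two-group normal-approximation formula below gives values close to most of the entries, under the assumptions (mine, not the source's) of a two-sided test, a unit mean difference `delta`, and a per-case variance of inter-case variance plus intra-case variance divided by m.

```python
from math import ceil
from statistics import NormalDist

def replications_per_group(inter_var, intra_var, m, delta=1.0, alpha=0.05, power=0.80):
    """Cases per group so a two-sided z-test detects a mean difference `delta`.

    Averaging m subsamples per case shrinks only the intra-case component.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    var_of_case_mean = inter_var + intra_var / m
    return ceil(2 * z ** 2 * var_of_case_mean / delta ** 2)
```

For example, `replications_per_group(1.0, 1.0, 1)` returns 32 and `replications_per_group(0.5, 0.5, 1)` returns 16. A few other cells differ by one, presumably because of rounding or a different underlying method. The pattern matches the table: extra subsamples help most when intra-case variance dominates.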

Quality Control: Intra Class Correlation: Training 1-ICC Weight

Intra Class Correlation: Training and Testing Combined 1-ICC Weight

Intra Class Correlation: Training, Testing, and Combined

Expression Profiles day1 day2 day3 day4

The Results from the Cluster Analysis

Why?

Experiment Design - Sample Size Estimation
The size of the study should be considered early in the planning phase. Biomedical research studies should have sufficient statistical power to detect differences between groups that are considered to be of clinical interest. Therefore, calculation of sample size with provision for adequate levels of significance and power is an essential part of pla

Experiment Design - Sample Size Estimation

Sample Size Estimation and Power Analysis
• In vitro – cell lines
• In vivo – animal study
• Clinical study
• Microarray experiment – http://bioinformatics.mdanderson.org/MicroarraySampleSize/
• MALDI-TOF – simulation

Sample Size Estimation – An Example of a MALDI-TOF Study
We simulated a 2,000-variable dataset for a sample size of 140 (70 patients with PR+SD and 70 patients with PD). Each variable represents a certain protein. Using 2,500 such simulations, we determined that cluster analysis has greater than 80% power to correctly classify observations into two response groups, at a misclassification rate of less than 5% per group, when there is a 2 standard deviation difference in the means of 1% (20) of the 2,000 variables.

False Discovery Rate Methods for High-Dimensional Data
• Massively univariate modeling: fit a model for each feature/biomarker (gene, protein)
• Which of 50,000 features are significant? At α = 0.05, about 2,500 false positives are expected if no feature has a real effect!

Solutions for the Multiple Comparison Problem
• An MCP solution must control false positives
• How to measure multiple false positives?
• Familywise Error Rate (FWER)
  - Chance of any false positives
  - Controlled by Bonferroni & random field methods
• False Discovery Rate (FDR)
  - Proportion of false positives among rejected tests
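The arithmetic behind the multiple comparison problem can be sketched as follows. The function names are illustrative, and the FWER formula assumes independent tests with every null hypothesis true.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def fwer_independent(n_tests, alpha):
    """Chance of at least one false positive among independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the FWER at or below alpha."""
    return alpha / n_tests

print(expected_false_positives(50_000, 0.05))
print(fwer_independent(50_000, 0.05))
print(bonferroni_threshold(50_000, 0.05))
```

With 50,000 features the uncorrected FWER is essentially 1, while the Bonferroni threshold drops to about one in a million per test, which is why the less strict FDR criterion is attractive.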

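False Discovery Rate
• Observed FDR
  - obsFDR = V0R / (V1R + V0R) = V0R / NR, where V0R is the number of rejected true nulls (false positives), V1R the number of rejected false nulls, and NR the total number of rejections
  - If NR = 0, obsFDR = 0
• We only know NR, not how many rejections are true or false
  - Control is on the expected FDR
  - FDR = E(obsFDR)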
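A small sketch of the observed FDR definition, including the convention for zero rejections. The argument names are illustrative; in practice the split into false and true rejections is unknown, so this can only be computed in simulations.

```python
def observed_fdr(false_rejections, true_rejections):
    """obsFDR = V0R / NR, defined as 0 when nothing is rejected."""
    n_rejected = false_rejections + true_rejections
    if n_rejected == 0:
        return 0.0
    return false_rejections / n_rejected
```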

p(i) i / v q / c(v) Benjamini & Hochberg Procedure • Select desired limit q on FDR • Order p-values, p(1)p(2) ...  p(V) • Let r be largest i such that • Reject all hypotheses corresponding top(1), ... , p(r). JRSS-B (1995)57:289-300 1 p(i) p-value i/v q/c(v) 0 0 1 i / v

Benjamini & Hochberg: Key Properties
• FDR is controlled: E(obsFDR) ≤ q × m0/V, where m0 is the number of true nulls
  - Conservative if a large fraction of the nulls are false
• Adaptive
  - The threshold depends on how many features carry real effects
  - More such features mean more small p-values, so more p(i) fall below (i/V) × q / c(V)

Benjamini & Hochberg Procedure
• c(V) = 1
  - Positive Regression Dependency on Subsets
  - P(X1 ≥ c1, X2 ≥ c2, ..., Xk ≥ ck | Xi = xi) is non-decreasing in xi
  - Only required of the test statistics for which the null is true
  - Special cases include:
    - Independence
    - Multivariate Normal with all positive correlations
    - Same, but studentized with a common standard error
• c(V) = Σ(i=1,...,V) 1/i ≈ log(V) + 0.5772
  - Arbitrary covariance structure
Benjamini & Yekutieli (2001). Ann. Stat. 29:1165-1188
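A quick numerical check of the correction factor for arbitrary dependence, comparing the exact harmonic sum with the log(V) + 0.5772 approximation. The function names are illustrative.

```python
from math import log

def c_harmonic(v):
    """Exact correction factor: the sum of 1/i for i = 1..V."""
    return sum(1.0 / i for i in range(1, v + 1))

def c_approx(v):
    """Approximation using the Euler-Mascheroni constant, rounded to 0.5772."""
    return log(v) + 0.5772

print(c_harmonic(50_000), c_approx(50_000))
```

For V = 50,000 the factor is a little over 11, so the BH thresholds become roughly eleven times stricter when no assumption is made about the dependence structure.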

Other FDR Methods
• John Storey (JRSS-B (2002) 64:479-498)
  - pFDR, the “Positive FDR”
  - FDR conditional on one or more rejections
  - Critical threshold is fixed, not estimated
  - pFDR and Empirical Bayes
  - Asymptotically valid under “clumpy” dependence
• James Troendle (JSPI (2000) 84:139-158)
  - Normal theory FDR
  - More powerful than BH FDR
  - Requires numerical integration to obtain thresholds
  - Exactly valid if the whole correlation matrix is known

Reflex MALDI-TOF Mass Spectrometer
[Diagram labels: laser optics, nitrogen laser (337 nm), TOF analyzer, microchannel detector, MALDI target, ion mirror, ion grid]

Time-of-Flight Mass Spectrometry (TOF-MS)
[Diagram: linear TOF. An ionizing probe (start) accelerates ions M1, M2, M3 through a potential of +/- U toward the ion detector (MCP); the ion signals arrive at times t1, t2, t3]

Basic Descriptions of the Data Preprocessing
• Registration
• Denoising
• Baseline correction
• Normalization
• Peak selection
• Peak alignment or binning

Math Model for MS Data Preprocessing
• From a mathematical point of view, an MS data set is simply a signal function defined on a time or m/z domain.
• An observed MALDI MS signal f(x) is often modeled as the superposition of three components: f(x) = B(x) + S(x) + e(x),
• where B(x) is a slowly varying “baseline”, S(x) is the “true” signal to be extracted, and e(x) represents high-frequency machine noise.
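To make the three-component model concrete, the sketch below simulates a slowly varying baseline, a single Gaussian peak, and random noise, then removes a crude baseline estimate. The rolling-minimum baseline, the shapes, and all parameter values are assumptions made for this illustration; they are not the preprocessing method described in these slides.

```python
import math
import random

def simulate_spectrum(n=200, peak_center=120, seed=1):
    """Return (observed, baseline, signal) with observed = B + S + e."""
    rng = random.Random(seed)
    baseline = [5.0 * math.exp(-x / 100.0) for x in range(n)]            # B(x)
    signal = [10.0 * math.exp(-0.5 * ((x - peak_center) / 3.0) ** 2)
              for x in range(n)]                                         # S(x)
    noise = [rng.gauss(0.0, 0.1) for _ in range(n)]                      # e(x)
    observed = [b + s + e for b, s, e in zip(baseline, signal, noise)]
    return observed, baseline, signal

def rolling_min_baseline(observed, half_window=25):
    """Crude baseline estimate: the minimum within a sliding window."""
    n = len(observed)
    return [min(observed[max(0, i - half_window):min(n, i + half_window + 1)])
            for i in range(n)]

observed, baseline, signal = simulate_spectrum()
estimated = rolling_min_baseline(observed)
corrected = [f - b for f, b in zip(observed, estimated)]
peak_index = max(range(len(corrected)), key=corrected.__getitem__)
```

The window has to be wider than the peaks, otherwise the rolling minimum would follow the peak itself and subtract part of the true signal.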

Features of the Wave-Spec Preprocessing Package
• MATLAB: an interactive software system for numerical computation and graphics that uses matrix-based techniques to solve problems.
• Wavelet: the basis of the FBI's image coding standard for digitized fingerprints; effective at recovering the true signal by removing noise at specific energy levels.
• Common metric: used for denoising, normalization, and the S/N calculation for peak selection.
• Efficient binning: applies a statistically estimated curve of the peak distribution.
• Input: raw data. Output: a data set of binned peaks.

Data Registration & Wavelet Denoising
• Discrete signal processing requires a consistent, fixed sample step.
• MALDI MS data have a different (decreasing) sample rate across mass intervals.
• Wavelet decomposition & reconstruction:
  1. Decompose (calculate coefficients)
  2. Choose a threshold and perform thresholding
  3. Reconstruct (from the coefficients and wavelet functions)
• The adaptive stationary discrete wavelet denoising method offers both good L2 performance and smoothness.
• The adaptive denoising method is based on the noise distribution: different threshold values are set at different mass intervals and frequency levels.
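The registration requirement (a fixed sample step) can be illustrated with simple linear interpolation onto a uniform grid. This is a sketch under the assumption that linear interpolation is acceptable; the slides do not say which resampling scheme the package uses, and the function name is hypothetical.

```python
from bisect import bisect_right

def resample_uniform(x, y, step):
    """Linearly interpolate samples (x, y), with x strictly increasing,
    onto a grid that starts at x[0] and advances by a fixed step."""
    n_points = int((x[-1] - x[0]) / step) + 1
    grid = [x[0] + k * step for k in range(n_points)]
    values = []
    for g in grid:
        # Index of the right-hand neighbor, clamped to a valid segment.
        j = min(max(bisect_right(x, g), 1), len(x) - 1)
        x0, x1 = x[j - 1], x[j]
        w = (g - x0) / (x1 - x0)
        values.append(y[j - 1] + w * (y[j] - y[j - 1]))
    return grid, values

# Unevenly spaced samples resampled to a fixed step of 0.5.
grid, values = resample_uniform([0, 1, 3, 6, 10], [0, 2, 6, 12, 20], 0.5)
```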

Pre-Calibration

Post-Calibration
1. Accurate m/z peak positions (matching the theoretical values)
2. Less variation in the peak positions
3. Large datasets are easy to handle in batch mode

Convolution-Based Calibration Algorithm
1. Simulate the known peaks (choose peaks with high prevalence across spectra and a clear pattern, by feedback, 80%).
2. Convolve each spectrum with the simulated known peaks (Gaussian or Beta shape). The maximum occurs where the two peak shapes match best.
3. The linear shift that makes multiple peaks match best is the optimal shift.
Note: all processing is done in the time domain.
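The three steps can be sketched as a sliding dot product between a spectrum and a template of simulated Gaussian peaks. The peak positions, width, search range, and function names are hypothetical, and an integer shift in index (time) units is assumed.

```python
import math

def gaussian_template(length, centers, width):
    """Simulate known peaks as a sum of Gaussians on an index (time) axis."""
    return [sum(math.exp(-0.5 * ((i - c) / width) ** 2) for c in centers)
            for i in range(length)]

def best_shift(spectrum, template, max_shift):
    """Return the integer shift s that maximizes sum_i spectrum[i + s] * template[i]."""
    n = len(spectrum)
    def score(s):
        return sum(spectrum[i + s] * template[i]
                   for i in range(n) if 0 <= i + s < n)
    return max(range(-max_shift, max_shift + 1), key=score)

def apply_shift(spectrum, s):
    """Move the feature at index i + s to index i, padding with zeros."""
    n = len(spectrum)
    return [spectrum[i + s] if 0 <= i + s < n else 0.0 for i in range(n)]

known_centers = [50, 120, 160]                       # hypothetical reference peaks
template = gaussian_template(200, known_centers, 2.0)
observed = gaussian_template(200, [c + 7 for c in known_centers], 2.0)  # off by 7
shift = best_shift(observed, template, max_shift=15)
calibrated = apply_shift(observed, shift)
```

Using several reference peaks at once makes the estimate more robust than aligning on a single peak, since one shift has to explain all of them.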

Pre-Calibration

Post-Calibration

Wavelet Denoising
• Wavelet methods have been used to denoise signals in a wide variety of contexts.
• Wavelet methods analyze the data in both the time and frequency domains to extract more useful information.
• An adaptive stationary discrete wavelet denoising method, which is shift-invariant and efficient, is applied in our research.
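The decompose, threshold, reconstruct cycle can be sketched with a plain decimated Haar transform and a single universal soft threshold. This is a deliberately simpler stand-in: it is neither stationary (shift-invariant) nor adaptive across mass intervals like the method described above, and all names and parameter values are illustrative.

```python
import math
import random
from statistics import median

def haar_forward(signal, levels):
    """Decompose into a coarse approximation plus detail coefficients per level."""
    approx, details = list(signal), []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details

def haar_inverse(approx, details):
    """Reconstruct the signal from the approximation and detail coefficients."""
    approx = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(approx, detail):
            rebuilt.extend([(s + d) / math.sqrt(2), (s - d) / math.sqrt(2)])
        approx = rebuilt
    return approx

def soft_threshold(value, t):
    """Shrink a coefficient toward zero by t, zeroing anything smaller than t."""
    return math.copysign(max(abs(value) - t, 0.0), value)

def haar_denoise(signal, levels=4):
    """Soft-threshold all detail coefficients with one universal threshold."""
    n = len(signal)
    if n % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = haar_forward(signal, levels)
    sigma = median(abs(d) for d in details[0]) / 0.6745   # noise scale from finest level
    t = sigma * math.sqrt(2.0 * math.log(n))              # universal threshold
    shrunk = [[soft_threshold(d, t) for d in level] for level in details]
    return haar_inverse(approx, shrunk)

rng = random.Random(0)
clean = [0.0] * 64 + [5.0] * 64                 # hypothetical step-shaped signal
noisy = [c + rng.gauss(0.0, 0.5) for c in clean]
denoised = haar_denoise(noisy, levels=4)
```

An adaptive variant would replace the single threshold `t` with values chosen separately for each level and each mass interval.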

SDWT Decomposition

Energy Distribution
