
Calibration, error modeling and variance stabilization of microarray data

Calibration, error modeling and variance stabilization of microarray data. Anja von Heydebreck, Martin Vingron (MPI Molecular Genetics, Berlin); Wolfgang Huber, Holger Sültmann, Annemarie Poustka (Div. Molecular Genome Analysis, DKFZ Heidelberg). Overview. IAM, Univ. HD, 13 Feb 2003.


Presentation Transcript


  1. Calibration, error modeling and variance stabilization of microarray data. Anja von Heydebreck, Martin Vingron (MPI Molecular Genetics, Berlin); Wolfgang Huber, Holger Sültmann, Annemarie Poustka (Div. Molecular Genome Analysis, DKFZ Heidelberg)

  2. Overview (IAM, Univ. HD, 13 Feb 2003) o What are microarrays good for? - functional genomics - cancer o What do they measure? (very schematically) o The data and the challenges o Model formulation o Simplification by variance stabilizing transformation o Parameter estimation o Results - simulated data - real data

  3. 15 Feb 2001: "The human genome is sequenced"

  4. But what does the sequence do? >gi|22046029|ref|NT_029998.5|Hs7_30253 Homo sapiens chromosome 7 reference genomic contig GATCTTATCTATCATGTTCACCTCCCAAGAGGTGAACATATCCCCCAAAGCCTGATAGAGAGAAGATGCTCATTAATATTTAATGCATGACCATGTGCAGACTTGGGAGGAAAAATATGCCTCAGCCTATCAATATTGGACCTTAATAAACAAGGATGTTTCTGCATCATTTCCCCACAACACCGAACAAGTGTGGCTCACTGTGGATGTTTAAGCAAATGCATTGTTTTTCCAGTTATATATCTGGTAGAGATGAGGCCATTGATAGGAATGGGAAGACGATCTCCTTTTATTTTGATGACCCAGCATGGCTGAACACTCAGTGACTACCACTGCACTTTGTTGTACTTTCAGCATTAGAGATGCCAGCCCTGTAGGATATAAAACAGGAACATCTAGTCCTCAATTATATTCAGAATTACTCAAGTCTTAGAAGCACCACTTGTCTTTTTTCAAGGGAGAGAAATGCTCAAGTGATGGGCTGAAGTGAAGGGAGGGAGTCACTCACTTGAACGGTTCCCTTAGGCTGTGTGGATGCAAACAGCATTAGACAATGACACTGACAGTGGGAAATGCACTGGAGACGATGACTGGCAAAGCCCTCCTTTTCTCCCCATCCACTATAGATACTGACAGCAAAGGGTTTGTCACAATGACAACTATACACTCCCAATATCACAGAAGAAGGAGGAATAAAAGGGTATATTATGAGTGACTGAAGTTTAGAATAAATTAATAAATATTATGTCCCTCATCCATAGAAACCACAAAGGTCTAGTAAGGCTAAGGATATAACAAGAAAATAATATGAATATTTGCTTCCCCTTCCTAGTGTAATAGAGTAAGTTACAAATGGCTTCAGGAAGGGGAGAGAGGAAGAAGAGTGGATGAGATACGTAAGAGTGCTTGAGGGCTAATTTTATGAAAGCTTTGGGAAGTTTTAAGAAAAAGAAAAGCTATTTTTCAAGGTACATGTGTGTATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAAAGACAGAAGAAAGAGGGAGACCTAAGAAGACTATGAGACACTAAGAGAAAAATTAAGGTAAAAAAGACACACACTTAGAAAAACACACATAGGGAGGAGGGAGGAGGTTAAGACATTTTACTATGTGCTGTGAATGGAAACTACAAACCATTTTTGATATATGCAATATATATACATATATACACACATATACATATGTATTTAAATATTTAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTATGTCACTCTGCCCAGGCTGCAGTACAGTGGTTGTTCACAGTCATGATCATAGCACATTATAGCCTTGAACTCCTGGGCTCAAGCAACCCTCCTGTATTAGTCTCCCCAGTAGTTGGGATTACTAGCATATGCCACCATGTCCACCTTTATGCTTTTTAAAGTGAAAAACCATACTAAGAATGAGGCAGCTCAACTTAATAATAAAAACATTTCAAATGTAAAGAAATTTACAAAAGAAAAACAATCAACCCCATTAAAATTGGGCAAAGGGAATGAACAGACACTTTTCAAAAGAATACATGCATGCAGCCAACAAACATACAAAAAAAAAGTTCAACATCACTGATCATTAGAGAAATGCAAATCAAAACCATAATGAGATACCATCTCACACCAGTCAGAATAGCTATCATTAAAAAGTCAAAAAATAACAGATGCTAGTGAGGCTATGGAGAAAAGGGAATGCTTATACACTGTTGTTGGGTGTGCAAATCAGTTCAATCATTGTGCAAGGAAAGTGATTCCTCAAAGAGCTAAAAGCAGAGCTACCATTCGACCCAGTAATCCCACTACTGGGTATATACCCAGATGAAT
ATAAACCATTCTACCATAAAGACACATGCATACAAATGTTCATTGCAGCACTGTTCACAATAGCAAAAGTATGGGATCAACCTAATGCCCATCAATGACAGATTGGATAAAGAAAATGTGGTACATATACACCATGGAATACTATGCCGCCATTAAAAATGATATCATGTCTTTTGCTGGAATATGGATGGACCTTCTATTATCCTTAGCAAACTAATGCAGGAACAGAAAACCAAATATAGCATACTCTCAGTTATAAGTGGGAGCTAAA

  5. Transcription and translation: DNA → (transcription) → mRNA → (translation) → Protein → Regulatory network → Organism

  6. Functional Genomics Goals: o experimentally identify all transcripts o characterize their function o characterize their interactions (a.k.a. ‘systems biology’) ⇒ many levels of detail ⇒ contingent on what is experimentally accessible ⇒ separate noise and technical artifacts from biologically relevant observations

  7. Functional genomics and cancer Cancer: somatic cells acquire mutations to become anti-social, proliferate excessively Current cancer classifications: o affected organ o cell type of origin o apparent grade of de-differentiation Goals: o molecular taxonomy: more precise, causal o molecular diagnosis: better estimation of risk and thus treatment strategy o molecular therapy (new drugs: "silver bullets")

  8. microarrays
     samples: mRNA from tissue biopsies, cell lines
     probes: gene-specific DNA strands
     fluorescent detection of the amount of sample-probe binding

              tissue A   tissue B   tissue C
     ErbB2    0.02       1.12       2.12
     VIM      1.1        5.8        1.8
     ALDH4    2.2        0.6        1.0
     CASP4    0.01       0.72       0.12
     LAMA4    1.32       1.67       0.67
     MCAM     4.2        2.93       3.31

  9. Which genes are differentially transcribed? [Figure: log-ratio for two comparisons, same-same and tumor-normal]

  10. Sources of variation: amount of RNA in the biopsy; efficiencies of RNA extraction, reverse transcription, labeling, fluorescent detection; probe purity and length distribution; spotting efficiency, spot size; cross-/unspecific hybridization; stray signal. Systematic: o similar effect on many measurements o corrections can be estimated from data ⇒ Calibration. Stochastic: o too random to be explicitly accounted for o remain as “noise” ⇒ Error model.

  11. Fundamental challenges Need to estimate from the data: ⇒ calibration ⇒ error bars To consider: ⇒ no. of probes is large, no. of arrays small ⇒ large dynamic range ⇒ parametric vs. non-parametric ⇒ power ⇒ outliers and heavy tails ⇒ variance-bias trade-off

  12. microarray data: i = 1…O(10^2) samples, k = 1…O(10^4) genes

  13. measured intensity = offset + gain × true abundance. Offset: a_i per-sample offset; L_ik local background provided by image analysis; ε_ik ~ N(0, b_i^2 s_1^2) “additive noise”. Gain: b_i per-sample normalization factor; b_k sequence-wise probe efficiency; η_ik ~ N(0, s_2^2) “multiplicative noise”.

  14. Parameters. data: y_ki, e.g. (n=20000) × (d=100) - sample normalization and offset (a_i, b_i): 2 × d - scale of additive noise (s_1^2): 1 - scale of multiplicative noise (s_2^2): 1 (= asymptotic CV) - probe efficiency (b_k): n

  15. data (cDNA slide): the variance-mean dependence. model: ⇒ relation between u = E(Y_ik) and v = Var(Y_ik)
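The relation between u and v can be written out under one plausible reading of the model terms. The composition below (offset = a_i + L_ik + additive noise, gain = b_i b_k exp(multiplicative noise), with the two noise terms independent) is an assumption; the transcript lists the terms but not how they are combined. The abbreviation m is introduced here only for the derivation.

```latex
% Assumed composition of the model; m is a shorthand for the noiseless signal.
\[
Y_{ik} = a_{ik} + b_{ik}\,x_{ik}, \qquad
a_{ik} = a_i + L_{ik} + \varepsilon_{ik}, \qquad
b_{ik} = b_i\,b_k\,e^{\eta_{ik}}, \qquad
m := b_i\,b_k\,x_{ik}
\]
% Moments of the log-normal factor: E e^{\eta} = e^{s_2^2/2},
% Var e^{\eta} = e^{s_2^2}(e^{s_2^2}-1).
\[
u = E(Y_{ik}) = a_i + L_{ik} + m\,e^{s_2^2/2}
\]
\[
v = \mathrm{Var}(Y_{ik})
  = b_i^2 s_1^2 + m^2 e^{s_2^2}\bigl(e^{s_2^2}-1\bigr)
  = c^2\,(u - a_i - L_{ik})^2 + b_i^2 s_1^2,
\qquad c^2 = e^{s_2^2}-1
\]
```

Under these assumptions the variance is a quadratic function of the mean: a constant floor from the additive noise plus a term with constant coefficient of variation c.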

  16. variance stabilization: X_u a family of random variables with E X_u = u, Var X_u = v(u). Define a transformation f ⇒ Var f(X_u) approximately independent of u. Derivation: linear approximation
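The formula on this slide is not preserved in the transcript. The standard linear-approximation (delta method) argument it refers to can be sketched as follows; the choice of the constant 1 for the stabilized variance is a convention assumed here.

```latex
% First-order Taylor expansion of f around the mean u
\[
f(X_u) \approx f(u) + f'(u)\,(X_u - u)
\quad\Longrightarrow\quad
\mathrm{Var}\,f(X_u) \approx f'(u)^2\,v(u)
\]
% Requiring the right-hand side to equal 1 for all u
\[
f'(u) = \frac{1}{\sqrt{v(u)}}
\quad\Longrightarrow\quad
f(x) = \int^{x} \frac{du}{\sqrt{v(u)}}
\]
```

The approximation is only as good as the linearization, so it is worth checking by simulation for any particular v(u).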

  17. variance stabilizing transformation [Figure; axis labels: E(X), sd(X), E(h(X)), sd(h(X))]

  18. variance stabilizing transformations 1.) constant variance (‘additive’) 2.) constant CV (‘multiplicative’) 3.) offset 4.) additive and multiplicative
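The formulas for the four cases are missing from the transcript. Applying the delta-method rule f(x) = ∫ du / sqrt(v(u)) to each variance function gives the following; the symbols s (additive standard deviation), γ (coefficient of variation) and u_0 (offset) are names chosen here for illustration.

```latex
\[
\begin{array}{lll}
\text{1.) additive}          & v(u) = s^2                     & f(u) = u / s \\[4pt]
\text{2.) multiplicative}    & v(u) = \gamma^2 u^2            & f(u) = \dfrac{1}{\gamma}\,\log u \\[8pt]
\text{3.) offset}            & v(u) = \gamma^2 (u+u_0)^2      & f(u) = \dfrac{1}{\gamma}\,\log (u+u_0) \\[8pt]
\text{4.) additive and multiplicative}
                             & v(u) = \gamma^2 (u+u_0)^2 + s^2
                             & f(u) = \dfrac{1}{\gamma}\,\operatorname{arsinh}\!\left(\dfrac{u+u_0}{s/\gamma}\right)
\end{array}
\]
```

In case 4 the ratio s/γ acts as a single scale constant inside the arsinh, and the prefactor 1/γ only rescales the result.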

  19. the arsinh transformation [Figure legend: dashed line: log u; solid line: arsinh((u+u_o)/c)]
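A small numerical sketch of how the two curves relate: arsinh(x) = log(x + sqrt(x^2 + 1)), so for large arguments it behaves like log(2x), while near zero it is approximately linear and stays defined for negative values. The function name `arsinh_transform` and the default parameter values are illustrative, not taken from the slides.

```python
import numpy as np

def arsinh_transform(u, u0=0.0, c=1.0):
    """arsinh((u + u0) / c); u0 and c are illustrative offset and scale."""
    return np.arcsinh((np.asarray(u, dtype=float) + u0) / c)

u = np.array([-3.0, 0.0, 0.5, 10.0, 1e4])

# Closed form of arsinh in terms of the logarithm
closed_form = np.log(u + np.sqrt(u ** 2 + 1.0))

# For large u, arsinh(u) approaches log(2u) = log(u) + log(2)
gap_at_large_u = arsinh_transform(1e4) - np.log(2e4)

print(arsinh_transform(u))
print(gap_at_large_u)
```

Unlike log u, the transformation is finite at u = 0 and antisymmetric for negative arguments, which matters for background-subtracted intensities.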

  20. profile likelihood. transformed scale: model simplified. data: y_ki, e.g. (n=20000) × (d=100) - sample normalization and offset (a_i, b_i): 2 × d - noise (c^2): 1 - true abundance of gene k (μ_k): n

  21. Here: profile log-likelihood
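The expression itself is not preserved in the transcript. A minimal sketch of a profile log-likelihood, assuming the transformed-scale model arsinh(a_i + b_i y_ki) = μ_k + noise with noise variance c^2: for fixed (a, b) the gene means and c^2 are replaced by their maximum likelihood estimates, and the Jacobian of the transformation is added. The parameterization a_i + b_i y and the function name `profile_loglik` are assumptions made for this example.

```python
import numpy as np

def profile_loglik(y, a, b):
    """Profile log-likelihood (up to an additive constant) of
    arsinh(a_i + b_i * y_ki) = mu_k + eps_ki, eps_ki ~ N(0, c^2).

    y: (n genes, d samples) array; a, b: length-d arrays, b > 0.
    """
    n, d = y.shape
    z = a + b * y                               # broadcast over samples
    h = np.arcsinh(z)
    mu_hat = h.mean(axis=1, keepdims=True)      # MLE of mu_k given (a, b)
    c2_hat = np.mean((h - mu_hat) ** 2)         # MLE of c^2 given (a, b)
    # log |dh/dy| = log(b_i) - 0.5 * log(1 + z^2), summed over all entries
    log_jacobian = np.sum(np.log(b) - 0.5 * np.log1p(z ** 2))
    return -0.5 * n * d * np.log(c2_hat) + log_jacobian

# Simulated data from the assumed model (all values are made up)
rng = np.random.default_rng(0)
n, d = 2000, 4
a_true = np.array([0.5, -0.3, 0.2, 0.0])
b_true = np.array([1.0, 0.5, 2.0, 1.5])
mu = rng.uniform(0.0, 6.0, size=(n, 1))
y = (np.sinh(mu + 0.2 * rng.normal(size=(n, d))) - a_true) / b_true

pll_true = profile_loglik(y, a_true, b_true)
pll_bad = profile_loglik(y, a_true + np.array([2.0, 0.0, 0.0, 0.0]), b_true)
print(pll_true, pll_bad)
```

Maximizing this function over the 2 × d calibration parameters (for example with a general-purpose optimizer) would give the profile maximum likelihood estimate; the Jacobian term is what prevents the fit from shrinking all transformed values together.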

  22. resistant regression o profile maximum likelihood estimator: sensitive to deviations from normality o model assumes μ_ki = μ_k for all i - differentially transcribed genes act as outliers o robust variant of PML estimator, à la Least Trimmed Sum of Squares regression o works as long as <50% of genes are differentially transcribed

  23. Least trimmed sum of squares regression: minimize - least sum of squares - least trimmed sum of squares
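To illustrate the idea on a plain straight-line fit: least squares minimizes the sum of all squared residuals, whereas least trimmed squares minimizes only the sum of the h smallest ones. The sketch below uses random two-point starts followed by concentration steps (refit on the h best-fitting points); this search strategy, the function name `lts_fit` and all parameter values are assumptions for the example, not the authors' estimator.

```python
import numpy as np

def lts_fit(x, y, keep=0.5, n_starts=50, n_csteps=10, seed=0):
    """Least trimmed squares fit of y = beta0 + beta1 * x (illustrative)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones_like(x), x])
    n = len(y)
    h = int(np.ceil(keep * n))                  # number of residuals kept
    best_obj, best_beta = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=2, replace=False)   # random elemental start
        for _ in range(n_csteps):
            beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
            idx = np.argsort((y - X @ beta) ** 2)[:h]  # keep h best points
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()   # trimmed sum of squares
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

# Made-up data: true line 1 + 2x, with about 30% of points shifted upwards
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 200)
y[x > 7.0] += 30.0

X_full = np.column_stack([np.ones_like(x), x])
ols, *_ = np.linalg.lstsq(X_full, y, rcond=None)
lts = lts_fit(x, y)
print("least squares slope:", ols[1], " trimmed slope:", lts[1])
```

With keep=0.5 the fit tolerates just under half of the points being outliers, which mirrors the "<50%" condition for differentially transcribed genes.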

  24. Results o verification of the approximation o sample size dependence o outliers, heavy tails o sensitivity and selectivity

  25. Verification of the approximation: Y_μ = μ e^η + ν, η ~ N(0, s_η^2), ν ~ N(0,1); h(y) = arsinh(cy) with c^2 = exp(s_η^2) - 1
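A quick Monte Carlo check of this setup, as a sketch: draw Y for several values of μ, apply h(y) = arsinh(cy), and compare the standard deviations before and after. The linear approximation predicts sd(h(Y)) ≈ c for every μ. The value s_η = 0.2, the grid of μ values and the number of draws are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
s_eta = 0.2                                   # sd of the multiplicative noise
c = np.sqrt(np.exp(s_eta ** 2) - 1.0)         # c^2 = exp(s_eta^2) - 1
mus = np.array([0.0, 1.0, 5.0, 20.0, 100.0, 1000.0])
n_draws = 200_000

def simulate(mu):
    """Draw Y = mu * exp(eta) + nu with eta ~ N(0, s_eta^2), nu ~ N(0, 1)."""
    eta = rng.normal(0.0, s_eta, n_draws)
    nu = rng.normal(0.0, 1.0, n_draws)
    return mu * np.exp(eta) + nu

sd_raw = np.array([simulate(mu).std() for mu in mus])
sd_h = np.array([np.arcsinh(c * simulate(mu)).std() for mu in mus])

for mu, raw, stab in zip(mus, sd_raw, sd_h):
    print(f"mu={mu:8.1f}  sd(Y)={raw:9.3f}  sd(h(Y))={stab:.4f}  c={c:.4f}")
```

On the raw scale the standard deviation grows with μ by orders of magnitude, while on the transformed scale it stays close to c across the whole range.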

  26. evaluation: effects of different data transformations [Figure; axes: rank(average) and difference red-green]

  27. Normal QQ-plot

  28. Sample size dependence

  29. evaluation: sensitivity / specificity in detecting differential abundance o Data: paired tumor/normal tissue from 19 kidney cancers, in color flip duplicates on 38 cDNA slides with 4000 genes each. o 6 different strategies for normalization and quantification of differential abundance o Calculate for each gene & each method: t-statistics, permutation-p o Allowing the same number of false positives for each method, compare the number of genes they find.
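One way a per-gene t-statistic with permutation p-value could be computed for paired data is a sign-flip permutation test on the tumor-minus-normal differences. The slides do not say which permutation scheme was used, so the sign-flip scheme, the function names and the simulated data below are all assumptions for illustration.

```python
import numpy as np

def paired_t(diff):
    """One-sample t-statistic per row of a (genes x pairs) difference matrix."""
    n = diff.shape[1]
    return diff.mean(axis=1) / (diff.std(axis=1, ddof=1) / np.sqrt(n))

def signflip_pvalues(diff, n_perm=999, seed=0):
    """Two-sided permutation p-values from random sign flips of whole pairs."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(paired_t(diff))
    exceed = np.zeros(diff.shape[0])
    for _ in range(n_perm):
        # the same flip is applied to all genes of a pair
        signs = rng.choice([-1.0, 1.0], size=diff.shape[1])
        exceed += np.abs(paired_t(diff * signs)) >= t_obs
    return (1.0 + exceed) / (1.0 + n_perm)

# Made-up example: 200 genes, 19 pairs, gene 0 strongly up in tumor
rng = np.random.default_rng(1)
diff = rng.normal(0.0, 1.0, size=(200, 19))
diff[0] += 3.0

p = signflip_pvalues(diff)
print("p-value of the shifted gene:", p[0])
```

Methods can then be compared by fixing a common number of false positives and counting how many genes each one calls.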

  30. evaluation: comparison of methods [Figure panels: one-sided test for “up”, one-sided test for “down”] more accurate quantification of differential expression ⇒ higher sensitivity / specificity

  31. Coefficient of variation cDNA slide: H. Sueltmann

  32. application: Identification of differentially expressed genes k by F-(type) statistic. Data matrix n × d with n ≫ d (k: genes, i: samples) o within one row: remove intensity dependence of the variance o across rows: often d<10 - pool variance estimation (“fold change criterion”) - use regularized variances

  33. Summary o estimate calibration and variance parameters from the data o applicable to cDNA chips, filters, and oligo chips (e.g. Affy) o avoid the problems of (log-)ratios at low intensities, in particular the sensitive dependence on small fluctuations o software available: www.bioconductor.org o related work: B. Durbin, D. Rocke (UC Davis)

  34. Acknowledgements. Uni Heidelberg: Günther Sawitzki. MPI Molekulare Genetik: Tim Beißbarth. DFCI Harvard: Robert Gentleman. UMC Leiden: Judith Boer. RZPD: Anke Schroth, Bernd Korn. DKFZ Heidelberg, Molecular Genome Analysis: Frank Bergmann, Andreas Buneß, Katharina Finis, Florian Haller, Yvonne Keßler, Jörg Schneider, Klaus Steiner, Stephanie Süß, Markus Vogt, Friederike Wilmer. … and many more!

  35. Other variance stabilizing transformations
