STAT 254, Lecture 1: An overview

Presentation Transcript


  1. STAT 254, Lecture 1: An overview • Cell biology, microarray, statistics • Bioinformatics and Statistics • Topics to cover • Keep a skeptical eye on everything you read or hear • Keep an eye on the bigger picture while working on specifics • The shaping of bioinformatics falls on your shoulders • What to take home: not just microarray or high-throughput data analysis methods, but a set of skills and ways of thinking about quantitative biology

  2. Exploratory data analysis: multivariate, high dimensional (20 min)

  3. Study of Gene Expression: Statistics, Biology, and Microarrays. IMS ENAR Conference. Time: March 31, 2003. Place: Tampa, FL. Ker-Chau Li, Statistics Department, UCLA, kcli@stat.ucla.edu

  4. Outline • Review of cell biology • Microarray gene expression data collection • Cell-cycle gene expression (main data set) • PCA/nested regression; SIR (dimension reduction) • Similarity analysis: clustering (why popular?) • Liquid association • Closing remarks. New statistical concept, fueled by Stein’s lemma. Justification for IMS.

  5. PART I. Cellular Biology Macromolecules: DNA, mRNA, protein

  6. Why is biology hot? Because of

  7. Human Genome Project. Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but effective resource and technological advances have accelerated the expected completion date to 2003. Project goals are to ■ identify all the approximate 30,000 genes in human DNA, ■ determine the sequences of the 3 billion chemical base pairs that make up human DNA, ■ store this information in databases, ■ improve tools for data analysis, ■ transfer related technologies to the private sector, and ■ address the ethical, legal, and social issues (ELSI) that may arise from the project. Recent Milestones: ■ June 2000: completion of a working draft of the entire human genome ■ February 2001: analyses of the working draft are published. Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

  8. Future Challenges: What We Still Don’t Know • Predicted vs experimentally determined gene function {1} • Gene regulation {2} (upstream regulatory region) • Coordination of gene expression, protein synthesis, and post-translational events {3} • Gene number, exact locations, and functions • DNA sequence organization • Chromosomal structure and organization • Noncoding DNA types, amount, distribution, information content, and functions • Interaction of proteins in complex molecular machines • Evolutionary conservation among organisms • Protein conservation (structure and function) • Proteomes (total protein content and function) in organisms • Correlation of SNPs (single-base DNA variations among individuals) with health and disease • Disease-susceptibility prediction based on gene sequence variation • Genes involved in complex traits and multigene diseases • Complex systems biology including microbial consortia useful for environmental restoration • Developmental genetics, genomics. Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

  9. Medicine and the New Genomics • Gene Testing • Gene Therapy • Pharmacogenomics Anticipated Benefits • improved diagnosis of disease • earlier detection of genetic predispositions to disease • rational drug design • gene therapy and control systems for drugs • personalized, custom drugs Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

  10. Anticipated Benefits: Agriculture, Livestock Breeding, and Bioprocessing • disease-, insect-, and drought-resistant crops • healthier, more productive, disease-resistant farm animals • more nutritious produce • biopesticides • edible vaccines incorporated into food products • new environmental cleanup uses for plants like tobacco. Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

  11. How does the cell work? The guiding principle is the so-called Central dogma of cell biology

  15. Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

  16. Gene to protein: 4 nucleotides and 20 amino acids. Protein is synthesized from amino acids by the ribosome

  17. Gene to Protein. The mediator: mRNA. Transcription, translation

  18. Transcription and translation

  19. PART II. Microarray Genome-wide expression profiling

  20. Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*

  21. Microarray

  22. MicroArray • Allows measuring the mRNA level of thousands of genes in one experiment -- system level response • The data generation can be fully automated by robots • Common experimental themes: • Time Course (when) • Tissue Type (where) • Response (under what conditions) • Perturbation: Mutation/Knockout, Knock-in • Over-expression

  23. Reverse transcription. Colors: Cy3 (green), Cy5 (red)

  24. Example 1 (5 min): Comparative expression. Normal versus cancer cells; ALL versus AML. E. Lander’s group at MIT

  25. PART III. Statistics: Low-level analysis; Comparative expression; Feature extraction; Clustering/classification; Pearson correlation; Liquid association

  26. (not to be covered) Issues related to image quality • Convert an image into a number representing the ratio of the levels of expression between red and green channels • Color bias • Spatial, tip, spot effects • Background noise • cDNA, oligonucleotide arrays

  27. Genome-wide expression profile: a basic structure. The data form an n × p matrix: rows are genes (Gene 1, Gene 2, …, Gene n), columns are conditions (cond1, cond2, …, condp), and entry xij is the expression level of gene i under condition j.

  28. Cond1, cond2, …, condp denote the various environmental conditions, time points, cell types, etc. under which mRNA samples are taken. Note: numerous cells are involved. Data quality issues: 1. chip (manufacturer) 2. mRNA sample (user). It is important to have a homogeneous sample so that cellular signals can be amplified. Yeast cell-cycle data: ideally all cells are engaged in the same activities (synchronization).

  29. An application: a two-class problem. ALL (acute lymphoblastic leukemia) versus AML (acute myeloid leukemia)

  30. Which genes to select? They have a method • For each gene (row), compute a score defined by (sample mean of X − sample mean of Y) divided by (standard deviation of X + standard deviation of Y) • X = ALL, Y = AML • Genes (rows) with the highest scores are selected. That seems to work well. • 34 new leukemia samples • 29 are predicted with 100% accuracy; 5 are weak prediction cases. Seems to work! Improvement?
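The score on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the original group's code: the function names, the toy expression values, and the choice of the n − 1 sample standard deviation (`statistics.stdev`) are all assumptions, since the slide does not say which standard deviation was used.

```python
from statistics import mean, stdev


def signal_to_noise(x, y):
    """Score for one gene: (mean(x) - mean(y)) / (sd(x) + sd(y)).

    x holds the gene's expression values in class X (ALL),
    y the values in class Y (AML). Assumes sd(x) + sd(y) > 0.
    """
    return (mean(x) - mean(y)) / (stdev(x) + stdev(y))


def rank_genes(rows, labels, top=2):
    """Return the `top` (score, gene) pairs with the highest scores.

    rows   : dict mapping a gene name to its expression values (one per sample)
    labels : class label of each sample, "ALL" or "AML"
    """
    scored = []
    for name, values in rows.items():
        x = [v for v, lab in zip(values, labels) if lab == "ALL"]
        y = [v for v, lab in zip(values, labels) if lab == "AML"]
        scored.append((signal_to_noise(x, y), name))
    scored.sort(reverse=True)  # highest score first
    return scored[:top]


if __name__ == "__main__":
    # Hypothetical toy data: 3 ALL samples followed by 3 AML samples.
    labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
    rows = {
        "geneA": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
        "geneB": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    }
    print(rank_genes(rows, labels, top=1))
```

Ranking by the signed score picks genes that are higher in ALL; ranking by the absolute value instead would also pick genes that are higher in AML. Which variant is appropriate depends on the analysis and is not specified on the slide.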

  31. Study of cell-cycle regulated genes • Rate of cell growth and division varies • Yeast (120 min), insect egg (15-30 min); nerve cell (no); fibroblast (healing wounds) • Regulation: irregular growth causes cancer • Goal: find what genes are expressed at each stage of the cell cycle • Yeast cells; Spellman et al. (2000) • Fourier analysis: cyclic pattern
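One simple way to look for a cyclic pattern with Fourier analysis is to project a gene's mean-centered time course onto a sine and a cosine at the presumed cell-cycle frequency and take the resulting amplitude. The sketch below illustrates that idea only; it does not reproduce the scoring used in the cited yeast study, and the function name, the sampling times, and the period passed in are assumptions.

```python
from math import cos, pi, sin, sqrt


def periodic_score(times, values, period):
    """Amplitude of the Fourier component of `values` at the given period.

    times  : sampling times (same unit as `period`, e.g. minutes)
    values : expression levels of one gene at those times
    period : presumed cycle length (an input assumed known here)
    """
    n = len(values)
    center = sum(values) / n          # remove the constant level first
    omega = 2.0 * pi / period         # angular frequency of the cycle
    s = sum((v - center) * sin(omega * t) for t, v in zip(times, values))
    c = sum((v - center) * cos(omega * t) for t, v in zip(times, values))
    # Scaled so that a pure unit-amplitude sinusoid sampled evenly over
    # whole periods scores about 1.
    return 2.0 * sqrt(s * s + c * c) / n


if __name__ == "__main__":
    # Hypothetical sampling: every 10 minutes, two full 60-minute cycles.
    times = [10 * k for k in range(12)]
    cyclic = [sin(2.0 * pi * t / 60.0) for t in times]
    flat = [3.0] * len(times)
    print(periodic_score(times, cyclic, 60.0), periodic_score(times, flat, 60.0))
```

Genes can then be ranked by this amplitude, with the phase (from the ratio of the sine and cosine sums) indicating where in the cycle expression peaks. The score depends on the assumed period, so a poorly chosen period will hide genuinely cyclic genes.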

  32. Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.) Most visible event

  33. Example of the time curve: Histone Genes: (HTT2) ORF: YNL031C Time course: Histone

  34. EBP2: YKL172W TSM1: YCR042C YOR263C

  35. Why does clustering make sense biologically? The rationale is: genes with a high degree of expression similarity are likely to be functionally related and may participate in common pathways. They may be co-regulated by common upstream regulatory factors. Rationale behind massive gene expression analysis: simply put, profile similarity implies functional association.

  36. Protein rarely works as a single unit Some protein complexes

  37. Gene profiles and correlation • Pearson's correlation coefficient, a simple way of describing the strength of linear association between a pair of random variables, has become the most popular measure of gene expression similarity. • 1. Cluster analysis: average linkage, self-organizing map, K-means, ... • 2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, … • 3. Dimension reduction methods: PCA (SVD)
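As a concrete reference for the similarity measure named on this slide, here is a minimal sketch of Pearson's correlation between two gene profiles, plus the gene-by-gene similarity matrix that clustering methods typically start from. The function names and the toy profiles are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles.

    Assumes neither profile is constant (otherwise the denominator is 0).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def similarity_matrix(profiles):
    """Matrix of pairwise correlations; entry [i][j] compares genes i and j."""
    return [[pearson(p, q) for q in profiles] for p in profiles]


if __name__ == "__main__":
    # Hypothetical profiles of three genes over four conditions.
    profiles = [
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.0],   # same shape as the first gene
        [8.0, 6.0, 4.0, 2.0],   # mirror image of the first gene
    ]
    for row in similarity_matrix(profiles):
        print(row)
```

Because the correlation ignores scale and offset, two genes with very different expression levels but the same shape across conditions are treated as fully similar.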

  38. The correlation coefficient (CC) has been used by Gauss, Bravais, Edgeworth … Its sweeping impact in data analysis is due to Galton (1822-1911), “Typical laws of heredity in man”. Karl Pearson modified and popularized its use. It is a building block in multivariate analysis, of which clustering, classification, and dimension reduction are recurrent themes. As a statistician, how can you ignore the time order? (Isn’t it true that the use of sample correlation relies on the assumption that the data are i.i.d.?)

  39. Other methods for finding gene clusters • Bayesian clustering: normal mixture, (hidden) indicator • PCA plot, projection pursuit, grand tour • Multidimensional scaling (bi-plot for categorical responses, showing both cases (genes) and variables (different clustering methods), displaying results from many different clustering procedures) • Generalized association plot (Chen 2001, Statistica Sinica) • PLAID model (Lazzeroni and Owen, Statistica Sinica 2002)

  40. [Figure: 1st, 2nd, and 3rd PCA directions; eigenvalues]

  41. Phase Assignment [Figure: two panels, Smooth and Non-smooth, showing gene counts by cell-cycle phase (G1, S, S/G2, G2/M, M/G1)]

  42. ARG1 Glutamate ARG2 Book a flight from LA to KEGG, JAPAN in less than 10 seconds

  43. ARG1 8th place negative Y Head X Compute LA(X,Y|Z) for all Z Backdoor Rank and find leading genes Adapted from KEGG
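The step "compute LA(X,Y|Z) for all Z, rank and find leading genes" can be sketched as follows. The estimator here is an assumption based on the usual formulation of liquid association, not something stated on the slide: each profile is converted to normal scores and standardized, and LA(X,Y|Z) is estimated by the average of the triple product X·Y·Z, which is where Stein's lemma enters. The function names and the toy data are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def normal_scores(values):
    """Replace each value by the normal quantile of its rank.

    Ties are broken by position; this sketch assumes distinct values.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    std_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores


def standardize(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def liquid_association(x, y, z):
    """Estimate LA(X,Y|Z) as the mean of X*Y*Z after transformation.

    A positive value suggests the X-Y correlation rises as Z increases,
    a negative value that it falls.
    """
    xs = standardize(normal_scores(x))
    ys = standardize(normal_scores(y))
    zs = standardize(normal_scores(z))
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)


def leading_genes(x, y, candidates, top=3):
    """Rank candidate Z genes by |LA(X,Y|Z)|, largest first."""
    scored = [(liquid_association(x, y, z), name) for name, z in candidates.items()]
    scored.sort(key=lambda pair: abs(pair[0]), reverse=True)
    return scored[:top]


if __name__ == "__main__":
    # Hypothetical toy data: X and Y move oppositely when Z is low
    # and together when Z is high.
    x = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0]
    y = [-1.0, 2.0, -3.0, 4.0, 5.0, -6.0, 7.0, -8.0]
    z = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    print(liquid_association(x, y, z))
```

Scanning every gene as a candidate Z for a fixed pair (X, Y) costs one pass per candidate, so a genome-wide scan is cheap for a single pair. Tie handling and significance assessment are left out of this sketch.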

  44. Coverage of bioinformatics by areas | topics Sequence analysis Linkage, pedigree Microarray DNA RNA Protein EST Drug Evolution Functional prediction SNP Alternative splicing System modeling Pathway discovery Promoter Motif Domain Drug -gene -protein Protein-protein 3-D structure Protein -gene TRANSFAC

  45. Coverage of Bioinformatics by expertise (hat, not person) Computer scientist Statistician/mathematician Biologist (raw data provider) (huge data volume) (Crude oil) Oil-refining (Noise, garbage, or ignorance?) Make researcher’s life easier (pipeline) Data cleaning Data mining Pattern searching /comparison (Bio-information distilling/ Bio-data refining) Web page browsing Literature searching Physical/Math/prob/stat models, computer optimization Data base/ visualization Generalization/inference Gene Ontology

  46. Math. Modeling: a nightmare Current Next mRNA FITNESS mRNA Observed mRNA protein kinase hidden ATP, GTP, cAMP, etc Cytoplasm Nucleus Mitochondria Vacuolar localization FUNCTION Statistical methods become useful DNA methylation, chromatin structure Nutrients- carbon, nitrogen sources Temperature Water

  47. Bioinformatics(knowledge integration center) • When • Where • Who • What • Why • Cell level • Organ level • Organism level • Species level • Ecology system level
