660 likes | 862 Vues
DNA microarrays and functional genomics. Frédéric Devaux Laboratoire de génétique moléculaire Ecole Normale Supérieure. High-throughput analyses in molecular biology. Transcription. Translation. DNA. RNA. Proteins. Genome. Transcriptome. Proteome.
E N D
DNA microarrays and functional genomics Frédéric Devaux Laboratoire de génétique moléculaire Ecole Normale Supérieure
High-throughput analyses in molecular biology Transcription Translation DNA RNA Proteins Genome Transcriptome Proteome
Functional genomics: a chance to deal with life complexity! Humans cells: 25 000 genes 400 000 RNAs 1- 10 000 000 proteins 1 000 000 of small molecules (carbohydrates, ions, etc…) 1014 cells Infinity of possible interactions… Dynamic in time! High throughput, genome-wide, approaches are needed to describe and understand the functional integration in living organisms.
Microarray principle relies on the Watson-Crick complementary of DNA strands A DNA strand can be used as a probe to capture and detect its complementary strand.
Microarrays: general principle Sample= unknown nucleic acid mix (ex: total RNAs), labelled Scanner Glass slides with many (up to 1 000 000) specific probes hybridization Quantification of the signal present at each probe = Qualitative and semi-quantitative determination of the sample composition for millions of different RNA or DNA species
In situ synthesis of DNA chips (ex: Affymetrix) Lumière + A--- Lumière + G--- Masque céramique G G Groupement bloquant photolabile A A A A A A A A A A G G G G G Etc... Population n°1 hybridization Labeling Lecture au scanner à fluorescence Streptavidine-Phycoérythrine PM MM Identification et semi-quantification des acides nucléiques de la population n°1 Pour chaque séquence cible: 14 oligonucléotides PM: complémentarité parfaite MM: PM muté d ’une base=contrôle négatif
Population test Population reference Expression ratio between the two populations oligonucleotides Labeling Cy3 Cy5 Automatic spotting hybridization Scanner Spotted DNA microarrays Eisen et Brown, Methods in Enzymology 1999
The new generation: tiling arrays for a complete coverage of genomes D’après Johnson et al., Trends in genetics 2005
Application: toward the complete yeast transcriptome David et al., PNAS 2006
DNA microarrays: labeling samples Direct labeling Chemical labeling Random or specific primers Matrix Matrice Interaction chimique directe entre l’acide nucléique et une molécule porteuse du fluorochrome Reverse Transcription (RT) (ARN) Klenow (ADN) 5µg ARN total 5 µg total RNA
DNA microarrays: labeling samples Indirect labeling Random or specific primers Matrix Reverse Transcription (RT) (ARN) Klenow (ADN) Aminoallyl moiety Ester moiety Fluorochrome Nucleotide 1 µg total RNA
Les puces à ADN: Marquage des échantillons à étudier T7 promoter Matrix Reverse Transcription (RT) (ARN) Klenow (ADN) Transcription in vitro (T7 polymérase) 100 ng total RNA to single cells…
Experiments Interpretation Databases Microarray experiment Data representation and predictions Image analyses Differentially expressed genes Normalisation Data analyses Highthroughput = a lot of bioinformatic analyses Experimental design
What can I study with DNA chips? Gene expression Transcription RNA stability Protein/DNA and protein/RNA interactions Subcellular localisation of RNA or DNA Replicaiton dynamics Microorganism detection Comparative Genomic Hybridization Sequence variations (Single Nucleotide Polymorphism)
Concept 1: GENE CLUSTERS
Clustering of genes based on their expression profiles G1 Temps 800 genes clustered in 5 groups Spellman et Al., Molecular Biology of the Cell 1998
Some target genes of the PDR system are associated with mitochondrial gene clusters. In silico prediction YLR346c YGR035c YOR049c Mitochondrial defects activate the PDR system via PDR3. Only a subset of PDR3 target is activated. Some genes not identified as PDR3 target depend on PDR3 for their activation. ? PDR3 Lipid metabolism PDR16 Biological confirmation MFS permease Stress Oxydation HXT11 ABC transporter IPT1 HXT9 GRE2 FLR1 YAL061w PDR5 YOR1 PDR15 YKL071w ICT1 SNQ2 PDR10 Hallstrom et al., JBC 275 (2000); Devaux et al., FEBS letters 515 (2002) Extracting biological information from models. The example of PDR3 and mitochondria. Functional discovery via a compendium of expression profiles. Hughes TR et al. (Cell. 2000 Jul 7;102(1):109-26.)
Using gene clusters as markers for cancer typing 1- Clustering of 15 000 human genes based on their expression profiles in 55 breast tumors of the same « type ». 2- Identification of clusters discriminating different tumor groups. Cathy Nguyen, TAGC/CIML, Marseille
Using gene clusters as markers for cancer typing 3- Clustering of tumors based on their gene expression in one cluster 4-Correlation with survival= diagnostic tools Tumeurs I+ = 3 death among 27 Tumeurs I- = 14 death among 28 5- adapted treatments ? Cathy Nguyen, TAGC/CIML, Marseille
Concept 2: Gene and protein networks
Functional genomics: a chance to deal with life complexity! Yeast subnetwork for pseudohyphal growth Priz et al., Genome Res. 2004 14: 380-390.
HSF1 RPN4 PDR1 PDR3 YRR1 YAP1 AFT2 Analysing the functioning of gene networks Gene clusters Chemical stress (drugs) Transcriptome analyses Predictions confirmation by knock-out and ChIP experiments Gene ontology Promoter analyses Previous knowledge Prediction of the underlying regulatory network Dynamic model of drug response Comprehensive view of drug response networks
Selenite response in yeast Selenite 1mM vs 0 Temps (min) Temps (min) 2 5 10 20 40 60 80 2 5 10 20 40 60 80 735 genes Up-regulated 779 genes Down-regulated Selenium = essential oligo-element (metalloide), anti-cancer and anti-aging potential but toxic at high doses!
Integrating all these information in one picture Ex: yeast reponse to the metalloïde arsenate with cytoscape Haugen et al., Genome biology 2004, 5:R95
Transcriptional network controlling selenite response Salin, Fardeau et al., manuscrit en préparation Hahn et al., Mol. Microbiol. 2006 Haugen et al., Genome Biology 2004
Deciphering the whole eucaryotic transcription code: dream or reality?? Lee et al, Science 2002
Web: www.biologie.ens.fr/microarrays.html Hot Line: devaux@biologie.ens.fr Puces: http://transcriptome.ens.fr/sgdb/
Kinetic response of yeast to the antimitotic drug Benomyl Lelandais G., Lucau-Danila A. et al., MCB 2004
Usually the Log base 2 (Log2) is chosen because it fits with the observed range of biological effects observed (usually less than 100 fold repression or induction). Unchanged Log2 (ratio) = 0 Two fold induction Log2 (ratio) = 1 Two fold repression Log2 (ratio) = -1 Log transformation of the data Why? To put repressed and activated genes on the same range. Ratio: repressed within [0;1[ activated within ]1;+oo[ Log (Ratio): repressed within ]-oo; 0[ activated within ]0;+oo[
Log transformation of the data Why? To work with a normal, gaussian distribution: allow the use of parametric mathematical tests for data normalisation, search for differentially expressed genes, gene clustering. Number of genes Standard Deviation Log2(ratio) Median
Representation of microarray data: the MA plot M=log2R/G vs A=(log2R + log2G) / 2 What? M (log2(ratio)) A (log2(intensity)) Why? To take into account both the ratio of fluorescence and the intensity of the signal.
How? The principles of microarray normalisation: Possibility n°1: « The majority of the genes present on the array are unvariant » (median or LOWESS methods) Normalisation of the data Why? Because biases exist. General biases, similar for all genes, can be easily normalised: Cy3 is more powerful than Cy5 The scanner detection can be biased The exact amount of RNA, RT efficiency, etc… can vary from one condition to another Possibility n°2: « The majority of the genes present on the array are variant but there are as many up effect as down effects» (median method) Possibility n°3: « Majority of genes vary and highly biased distribution of up and down effects ». Methods: control genes, not supposed to vary (internal, ex:actine, GAPDH, genomic DNA, etc… or external (spike RNAs)). Problem: you need a lot of, actually unvariant, with different range of abundance,… Or use non parametric methods (ex: percentile ranks (Vivienne’s speech, Friday morning)).
Normalisation of the data: global median Before After Problem: The same for all range of intensities, but sometimes normalisation depends on the intensity!
Normalisation of the data: global Lowess The theory: M A Local linear regression Regression curve Essential parameters: window size and number of dots in each window
Normalisation of the data: global Lowess Application to microarray data: After Before Problem: need a significant number of spots in each « window » to be correct.
Normalisation of the data: specific biases They are biases which affects specifically some sequences: The main one is the Cy dye effects: some DNA or RNA sequences are more easily labelled with one fluorochrome. Solution: This bias can be averaged if, in replicate experiments, the dyes are switched (dye swap). SAMPLE 1 vs SAMPLE 2 Exp1: SAMPLE 2 vs SAMPLE 1 Exp2:
Looking for the significance of expression variations From which ratio is a difference significant?? Methods will depend on the number of replicates that you have
Looking for the significance of expression variations Answer 1: simple, statistical tests, for each ratio you obtain a p value that reflects the chance to be truly different from the « null » hypothesis. Ex: Gaussian test. Standard Deviation Median
Looking for the significance of expression variations Answer 2: Mock experiments to set up the significance of each ratio in a real experiment Real experiment (sample 1 vs sample 2) Mock experiment (sample 1 vs sample 1) Differentially expressed genes Experimental cut-off
MA plot log2(ICy5/ICy3) (log2(ICy5) + log2(ICy3))/2 N(m,s) VARAN makes statistical tests, analysing log2(ratio) distribution on a sliding window along the « intensity » axis (cf. LOESS). VRAN can use either a theoretical gaussian distribution or a mock experiment as a reference. Rinf0.99 Rsup0.99 m-2s m+2s log2 (ICy5/ICy3) Looking for the significance of expression variations Answer 1bis and 2bis: Significance depends on the intensity X The expression ratio of a gene does not have the same significance depending on the level of expression of this gene In every cases, if the number of replicates is less than 4, it is worth setting up filters for low intensity measures
Looking for the significance of expression variations Problems: Statistical tests based on the gaussian hypothesis do not always correspond to the biological truth (ex: in a gaussian tests, the cut off will depend on the number of varying genes) Mock experiments does not always correspond to the reality of your « real » experiments: it depends on how you are reproducible! Solution: Make more replicates (> 4) increase the power of your statistical tests or allow you to use more powerful methods (see next ex.) and get reality closer to theory.
Looking for the significance of expression variations Answer 3: With more than 4 replicates, you can use powerful statistical tests. Ex: Statistical Analysis of Microarrays (SAM) Replicate experiments p1 p2 Experimental p values p3 p4 Exprimental data Replicate experiments Median reference p values (p value obtained by chance using the real experimental variability) p’1 - - X 100 p’2 - p’3 - - p’4 Reference data generated by random permutation of experimental data
Looking for the significance of expression variations Answer 3: With more than 4 replicates, you can use powerful statistical tests. Ex: Statistical Analysis of Microarrays (SAM) SAM plot Observed p value Significant induction expected p value by chance Significant repression This method also gives a False Discovery Rate (FDR) which reflects the confidence of the data and helps to chose cut off
Ex. : Exp 2 Exp 1 Exp 2 Gene 1 Val(1,1) Val(1,2) Val(2,2) Gene 2 Val(2,1) Val(2,2) Val(1,2) Exp 1 Val(1,1) Val(2,1) Nuage de points How to cluster genes according to their expression profile? • Généralisation : n genes, 2 exp n genes, 3 exp n genes, N exp.
Gène Each gene is a dot in a n dimension space (n= number of experiments). How to deal with this and define gene groups? Visualise data in a limited space Split the genes into groups depending on their behavior How to cluster genes according to their expression profile? • Dimension reduction • Clustering
N-dimension Space 2-dimension Space PCA Exp 2 Projection Axis 1 Axis 2 Exp 1 More information is kept on axis 2 How to cluster genes according to their expression profile? Dimension reduction: the theory Gene 1 Gene 2 Dimension reduction: the Principal Component Analysis