Expression Profiling

caelan
Presentation Transcript


  1. Expression Profiling

  2. Microarrays vs. RNA.seq

  3. Question: What’s a microarray?
     Answer: A microarray is a high-density array of “molecules” attached to a solid support.
     1) What does high-density mean?
        - millimeters, microns, sub-micron?
        - Affymetrix patent: 1000 probes/cm²
     2) What kind of molecules?
        - nucleic acids, proteins, organics, cells, tissues
     3) How are they attached?
        - specifically, non-specifically, covalently, non-covalently
     4) What kind of support?
        - glass, nylon, other polymers

  4. Array Options • 1 - 70mers Up to 390,000 features per array

  5. Experimental Design 1
     • Key point: we know beforehand what sequence is in each position on the array
     • Synergy with genome sequencing projects
     • What kind of experiments can I do on a DNA chip?
       • Any assay whose readout is the enrichment of nucleic acid sequences
     • What is measured?
       • The fluorescent intensity, or ratio of intensities, at a particular location on the array

  6. Experimental Design 2 (figure labels: mRNA1, mRNA2, cDNA1, cDNA2)

  7. Scaling Signal Intensities
     - Total fluorescence is scaled to be equal in both experiments
     - Every spot on the array is multiplied by the same scaling factor
     Where:
       N = number of genes on the array
       I = intensity on the array
       T = some set arbitrary number
     Assuming total RNA is constant between experiments
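
The scaling formula itself did not survive in this transcript. One reading consistent with the definitions on the slide is a scaling factor of T divided by the summed intensity of all N spots. The sketch below assumes that form; the function name, the default value of T, and the intensities are hypothetical.

```python
def scale_intensities(intensities, target_total=1_000_000.0):
    """Scale one array so that its summed intensity equals target_total (T).

    Assumed form: factor = T / sum(I_1 ... I_N). The slide's own formula
    is not preserved, so treat this as one plausible reading.
    """
    total = sum(intensities)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    factor = target_total / total  # the same factor is applied to every spot
    return [value * factor for value in intensities]


# Hypothetical intensities for the same four genes in two experiments.
experiment_1 = [120.0, 300.0, 80.0, 500.0]
experiment_2 = [240.0, 610.0, 150.0, 1000.0]

scaled_1 = scale_intensities(experiment_1)
scaled_2 = scale_intensities(experiment_2)
```

After scaling, both arrays sum to the same total, so per-gene intensities can be compared directly. Ratios between genes within one array are unchanged.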

  8. Fold-Changes
     Why use the log? Answer: it provides symmetry around zero.
     10 copies mRNA / 20 copies mRNA = 0.5
     20 copies mRNA / 10 copies mRNA = 2.0
     But:
     log2(10 copies mRNA / 20 copies mRNA) = -1
     log2(20 copies mRNA / 10 copies mRNA) = 1
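
The symmetry argument can be checked in a couple of lines. This is a minimal sketch; the function name is hypothetical.

```python
import math


def log2_fold_change(numerator_copies, denominator_copies):
    """Return log2 of the ratio of two copy numbers."""
    return math.log2(numerator_copies / denominator_copies)


halved = log2_fold_change(10, 20)   # ratio 0.5
doubled = log2_fold_change(20, 10)  # ratio 2.0
print(halved, doubled)  # -1.0 1.0
```

A halving and a doubling land at equal distances on either side of zero, whereas the raw ratios 0.5 and 2.0 are not symmetric around 1.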

  9. RNA.seq • Rapidly replacing array-based methods • Workflow: mRNA → cDNA → sequence → count

  10. Counting can be non-trivial • 3’ poly-T priming bias • 5’ and 3’ UTR boundaries • Alternative splicing • Cryptic exons • Cryptic start sites • Paired-end reads can help

  11. RPKM
      • Reads Per Kilobase per Million reads
      Assuming total RNA is constant between experiments
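
The RPKM formula is missing from this transcript. In its standard definition, RPKM = 10^9 × X / (C × L), where X is the number of reads mapping to the gene, C is the total number of mapped reads, and L is the gene length in bases. X and C follow the notation used on later slides; L and the function name are assumptions introduced here.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    Dividing by (length / 1e3) and by (total reads / 1e6) is the same
    as multiplying the raw count by 1e9 / (length * total reads).
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total reads must be positive")
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000-base transcript,
# in a library of 10 million mapped reads.
example_rpkm = rpkm(500, 2_000, 10_000_000)
print(example_rpkm)  # 25.0
```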

  12. RNA.seq continued • No genome sequence necessary (de novo transcriptome assembly) • Dynamic range and sensitivity limited only by sequencing capacity • Specificity an issue with short reads • Splice sites • 5’ and 3’ UTR mapping • Multiplexing!

  13. Multiplexing with sample barcodes: for each sample (Sample 1, Sample 2), mRNA → cDNA → ligate barcoded sequencing primers → pool samples and sequence

  14. Finding significant fold changes
      Is X1/C1 less than X2/C2?
      C1 = total number of mapped reads in sample 1
      X1 = number of reads in sample 1 that map to gene X
      C2 = total number of mapped reads in sample 2
      X2 = number of reads in sample 2 that map to gene X

  15. Significance testing with Hypergeometric (Fisher's Exact Test)
      Sample 1: X1 of C1 reads map to gene X
      Pooled sample: X1+X2 of C1+C2 reads map to gene X
      H0: X1/C1 = (X1+X2)/(C1+C2)
      Ha: X1/C1 < (X1+X2)/(C1+C2)

  16. Significance testing with Hypergeometric (Fisher's Exact Test) Remember that:
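
The formula that followed "Remember that:" is not preserved. As a sketch of the test described on the two slides above, the code below treats sample 1 as C1 reads drawn without replacement from the pooled C1+C2 reads and sums the hypergeometric probabilities of seeing X1 or fewer gene-X reads (the lower tail, matching Ha). The function names and the example counts are hypothetical, and log-gamma is used only so that large read counts do not overflow.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_lower_tail(x1, c1, x2, c2):
    """P(K <= x1) when c1 reads are drawn from a pool of c1 + c2 reads
    that contains x1 + x2 reads mapping to gene X."""
    pool_total = c1 + c2
    pool_hits = x1 + x2
    pool_misses = pool_total - pool_hits
    k_min = max(0, c1 - pool_misses)  # fewest hits the draw can contain
    log_denominator = log_comb(pool_total, c1)
    p = 0.0
    for k in range(k_min, x1 + 1):
        log_term = (log_comb(pool_hits, k)
                    + log_comb(pool_misses, c1 - k)
                    - log_denominator)
        p += math.exp(log_term)
    return min(p, 1.0)


# Hypothetical counts: gene X gets 1 of 10 reads in sample 1
# and 5 of 10 reads in sample 2.
p_value = hypergeom_lower_tail(x1=1, c1=10, x2=5, c2=10)
```

A small p-value here suggests that gene X is under-represented in sample 1 relative to the pool. The loop runs over at most X1+1 terms, so it stays cheap for lowly expressed genes even when C1 and C2 are in the millions.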

  17. …but with replication we just revert to t-tests (more or less)
                   Sample 1     Sample 2
      Replicate A  C1A, X1A     C2A, X2A
      Replicate B  C1B, X1B     C2B, X2B
      Replicate C  C1C, X1C     C2C, X2C

  18. …but with replication we just revert to t-tests (more or less) Is (X1A/C1A, X1B/C1B, X1C/C1C) different from (X2A/C2A, X2B/C2B, X2C/C2C) ?
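
The slides do not say which t-test is meant. The sketch below uses Welch's unequal-variance version as one reasonable choice and computes only the t statistic and its degrees of freedom; turning these into a p-value needs a t-distribution table or a statistics library. The read counts are hypothetical.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard errors from the sample variances (n - 1 denominator).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical (X, C) pairs for replicates A, B, C of each sample.
sample_1 = [(120, 1_000_000), (135, 1_100_000), (110, 950_000)]
sample_2 = [(260, 1_050_000), (240, 1_000_000), (300, 1_200_000)]

fractions_1 = [x / c for x, c in sample_1]  # X1A/C1A, X1B/C1B, X1C/C1C
fractions_2 = [x / c for x, c in sample_2]  # X2A/C2A, X2B/C2B, X2C/C2C
t_statistic, degrees_of_freedom = welch_t(fractions_1, fractions_2)
```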

  19. Gaussian (Normal) Distributions I • Mean (x̄) • Standard deviation (s) • Figure: distribution of height (x-axis) against # of people (y-axis)

  20. What is a significant change? • Arbitrary • 2-fold • Top 20 • Two sample t-test • P-values, multiple hypotheses, and the Bonferroni correction • SAM, Tusher et al. (2001) PNAS 98,5116-5121 • ANOVA
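
Of the options on this slide, the Bonferroni correction is simple enough to sketch directly: with m hypotheses tested, each p-value is multiplied by m (capped at 1) before comparison with the significance level. The function name, the default alpha, and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return a list of (adjusted_p, is_significant), one entry per test."""
    number_of_tests = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * number_of_tests)  # cap at 1
        results.append((adjusted, adjusted <= alpha))
    return results


# Hypothetical p-values for three genes.
gene_results = bonferroni([0.001, 0.02, 0.04])
significant_flags = [is_significant for _, is_significant in gene_results]
```

With thousands of genes on an array, the multiplier is in the thousands, which is why a per-gene p-value of 0.04 is rarely meaningful on its own.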

  21. Microarray vs. RNA.seq • Cost, Time, Throughput • Serial vs. Parallel • Sensitivity • Specificity • Signal to Noise • Dynamic Range

  22. Gene Clustering • Metrics for determining coexpression • Unsupervised Clustering • Supervised Clustering

  23. Gene Clustering (figure label: Gene 1)

  24. Clustering Gene Expression Data
      • Choose a distance metric
        • Pearson Correlation
        • Spearman Correlation
        • Euclidean Distance
        • Mutual Information
      • Choose clustering algorithm
        • Hierarchical
        • Agglomerative
        • Principal Component Analysis
        • Super-paramagnetic and others
      (figure: vectors v1 to v5 plotted against Condition 1 and Condition 2)

  25. Pearson Correlation Coefficient
      • Compares scaled profiles!
      • Can detect inverse relationships
      • Most commonly used
      • Spearman rank correlation technically more correct
      Where:
        n = number of conditions
        x̄ = average expression of gene x in all n conditions
        ȳ = average expression of gene y in all n conditions
        s_x = standard deviation of x
        s_y = standard deviation of y
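
The slide's formula is not preserved. The sketch below computes the Pearson coefficient as the sum of products of deviations divided by the square root of the product of the summed squared deviations, which is algebraically equivalent to the standardized form however the standard deviations are normalized. The function name and the expression profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return covariation / math.sqrt(ss_x * ss_y)


# Hypothetical profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, 10x the scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # mirror image of gene_a

r_same_shape = pearson(gene_a, gene_b)
r_inverse = pearson(gene_a, gene_c)
```

The first pair correlates at +1 despite the tenfold difference in magnitude ("compares scaled profiles"), and the second pair correlates at -1, which is how inverse relationships show up.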

  26. Correlation Examples Raw Data Normalized Correlation = 0.94 Correlation = -0.087

  27. Correlation Pitfalls 1 Correlation=0.97

  28. Correlation Pitfalls 2 Correlation=-0.02

  29. Avoid Pitfalls By Filtering The Data • Remove Genes that do not reach some threshold level in at least one (or more) conditions • Remove genes whose stdev/mean ratio does not reach some threshold • For spotted arrays, remove genes whose stdev does not reach some threshold

  30. Euclidean Distance
      • Based on Pythagoras: a² + b² = c² (figure: right triangle with sides a, b, c)
      • Scaled versus unscaled
      • Cannot detect inverse relationships
      For Gene X = (x1, x2, …, xn) and Gene Y = (y1, y2, …, yn):
      $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
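
A minimal sketch of the Euclidean distance between two expression profiles, using hypothetical data, shows why scaling matters and why inverse relationships go undetected.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape, 10x the scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # mirror image of gene_a

distance_scaled_copy = euclidean(gene_a, gene_b)
distance_mirror = euclidean(gene_a, gene_c)
```

On unscaled data the mirror-image profile is much closer to gene_a than the scaled copy is, so an inverse relationship just looks like "fairly near", and a shared shape at a different magnitude looks like "far away".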

  31. Clustering: Example 1, Step 1. Algorithm: Hierarchical, Distance Metric: Correlation (figure labels: A, D)

  32. Clustering: Example 1, Step 2. Algorithm: Hierarchical, Distance Metric: Correlation (figure labels: A, B, D, C)

  33. Clustering: Example 1, Step 3. Algorithm: Hierarchical, Distance Metric: Correlation (figure labels: A, B, D, C, E)

  34. Tree View, Eisen et al. (1998) PNAS 95: 14863-14868 (figure axes: conditions, genes)

  35. Hierarchical Clustering Summary
      Advantages
      • Easy
      • Very visual
      • Flexible (mean, median, etc.)
      Disadvantages
      • Unrelated genes are eventually joined
      • Hard to define clusters
      • Manual interpretation often required
      (figure labels: A, B, D, C, E)
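
The hierarchical example above can be sketched as a small agglomerative procedure. This version assumes average linkage and a distance of 1 minus the Pearson correlation; the slides name neither the linkage rule nor the exact distance, and the five profiles are hypothetical data chosen so that A joins D first, then B joins C, then E joins the A/D cluster.

```python
import math


def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for identical shapes, 2 for mirror images."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - covariation / math.sqrt(ss_x * ss_y)


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    distances = [correlation_distance(profiles[i], profiles[j])
                 for i in cluster_a for j in cluster_b]
    return sum(distances) / len(distances)


def hierarchical_cluster(profiles):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical expression profiles over four conditions.
profiles = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [4.0, 3.0, 2.0, 1.0],
    "C": [8.0, 6.0, 3.0, 2.0],
    "D": [2.0, 4.0, 6.0, 9.0],
    "E": [1.0, 3.0, 2.0, 4.0],
}
merges = hierarchical_cluster(profiles)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The last merge joins two clusters whose profiles are nearly mirror images, at a distance close to 2. That is the "unrelated genes are eventually joined" disadvantage in action: the procedure always ends with a single tree, and deciding where to cut it is left to the analyst.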

  36. Clustering: Example 2, Step 1. Algorithm: k-means, Distance Metric: Euclidean Distance (figure: cluster centers k1, k2, k3)

  37. Clustering: Example 2, Step 2. Algorithm: k-means, Distance Metric: Euclidean Distance (figure: cluster centers k1, k2, k3)

  38. Clustering: Example 2, Step 3. Algorithm: k-means, Distance Metric: Euclidean Distance (figure: cluster centers k1, k2, k3)

  39. Clustering: Example 2, Step 4. Algorithm: k-means, Distance Metric: Euclidean Distance (figure: cluster centers k1, k2, k3)

  40. Clustering: Example 2, Step 5. Algorithm: k-means, Distance Metric: Euclidean Distance (figure: cluster centers k1, k2, k3)

  41. K-means algorithm
      1) Pick a number (k) of cluster centers
      2) Assign every gene to its nearest cluster center
      3) Move each cluster center to the mean of its assigned genes
      4) Repeat 2-3 until convergence
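
The steps above can be sketched as follows. The starting centers are passed in explicitly so the run is reproducible; the function name, the two-condition data points, and the starting centers are all hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math


def kmeans(points, initial_centers, max_iterations=100):
    """Return (assignments, centers) after iterating to convergence."""
    centers = [tuple(c) for c in initial_centers]
    assignments = []
    for _ in range(max_iterations):
        # Step 2: assign every gene to its nearest cluster center.
        assignments = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Step 3: move each center to the mean of its assigned genes.
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centers.append(old_center)  # empty cluster stays put
        # Step 4: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return assignments, centers


# Hypothetical genes measured in two conditions, with k = 2.
genes = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
         (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignments, centers = kmeans(genes, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
```

Rerunning with different `initial_centers` is a cheap way to see how much the final clusters depend on the starting positions.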

  42. K-means clustering summary
      Advantages
      • Genes automatically assigned to clusters
      • Can vary starting locations of cluster centers to determine initial condition dependence
      Disadvantages
      • Must pick number of clusters beforehand
      • All genes forced into a cluster

  43. Keep in Mind. • Clustering is NOT an analysis in itself. • Clustering cannot NOT work.

  44. Evaluating/Analyzing Clusters 1 • Measure spread within and between clusters

  45. Evaluating/Analyzing Clusters 2: Enrichment of genes with similar functions
      • MIPS (Munich Information Center for Protein Sequences) http://mips.gsf.de/
      • GO (Gene Ontology) Annotations http://www.geneontology.org/
      • KEGG (Kyoto Encyclopedia of Genes and Genomes) http://www.genome.ad.jp/kegg/kegg2.html

  46. Example A particular cluster has 25 coexpressed genes in it. 15 of these genes are annotated as being involved in rRNA transcription. Is 15/25 significant?

  47. Hypergeometric Probability Distribution: the “overlap problem” or sampling without replacement
      • N = number of genes in the genome (6000 for yeast)
      • n = number of genes in the cluster (25)
      • m = number of rRNA transcription genes (109 from MIPS)
      • s = number of rRNA transcription genes in the cluster (15)
      (figure labels: N, m, n, s)

  48. Hypergeometric Probability Distribution (figure labels: N, m, n, s)

  49. Hypergeometric Probability Distribution with N = 6000, m = 109, n = 25, s = 15
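
The formula on these slides is not preserved. As a sketch of the calculation, the code below sums the hypergeometric probabilities of finding s or more annotated genes in a cluster of n genes drawn without replacement from a genome of N genes, m of which carry the annotation. The function name is hypothetical; the numbers are those of the rRNA transcription example. `math.comb` requires Python 3.8 or later.

```python
from math import comb


def enrichment_p_value(genome_size, annotated, cluster_size, overlap):
    """P(at least `overlap` annotated genes in the cluster) under
    sampling without replacement (hypergeometric upper tail)."""
    largest_overlap = min(annotated, cluster_size)
    favourable = sum(
        comb(annotated, k) * comb(genome_size - annotated, cluster_size - k)
        for k in range(overlap, largest_overlap + 1)
    )
    # Exact integer arithmetic until the final division.
    return favourable / comb(genome_size, cluster_size)


# N = 6000 genes, m = 109 rRNA transcription genes,
# n = 25 genes in the cluster, s = 15 of them annotated.
p_value = enrichment_p_value(6000, 109, 25, 15)
print(p_value)
```

By chance alone one would expect about 25 × 109 / 6000 ≈ 0.45 rRNA transcription genes in a 25-gene cluster, so an overlap of 15 gives a vanishingly small p-value.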

  50. Therefore in our example… (N = 6000, m = 109, n = 25, s = 15)
      • 15/25 rRNA transcription genes in the cluster is significant.
      • BUT…
      • 10 out of 25 genes in the cluster are not rRNA transcription genes.
      • 94 rRNA transcription genes are not in the cluster.
      • What about the other genes in the cluster?
