High-Throughput Sequencing

Presentation Transcript

  1. High-Throughput Sequencing Advanced Microarray Analysis BIOS 691-803, 2008 Dr. Mark Reimers, VCU

  2. Quantitative HTS - Outline • Technology • Preprocessing • Quantitative analysis • Applications • ChIP-Seq • RNA-Seq • Methyl-Seq

  3. The Technology • Most sequencing proceeds by addition of fluor-labeled bases • Do this in parallel on a flat surface • Capture each stage with good camera • Align images

  4. Roche - 454 • Parallel Pyrosequencing on beads

  5. Mardis, Trends in Genetics

  6. 454 Sequencing Operation

  7. Illumina - Solexa

  8. ABI SOLiD • Resequencing each fragment with different primers • Reconstruct each fragment separately

  9. Paired-End Reads

  10. Issues • Pre-processing • Base calling • Mapping reads • QA • Quantitative analysis • Variation and noise • Biases • Models • Accuracy and validation

  11. Pre-processing – Base Calling • Not all steps completed properly • Sequence can lag behind or skip ahead • Hence most light spots a mixture of different colors • Simple rule: use brightest signal
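
The "use brightest signal" rule can be sketched as follows. This is a minimal illustration, not any vendor's base caller: the per-cycle intensity dictionaries and the `purity` helper are assumptions made for the example.

```python
# Hypothetical data: one dict of channel intensities per sequencing cycle.
def call_bases(cycles):
    """Call each base as the channel with the brightest signal."""
    return "".join(max(cycle, key=cycle.get) for cycle in cycles)


def purity(cycle):
    """Brightest intensity divided by the total; low values flag mixed spots."""
    total = sum(cycle.values())
    return max(cycle.values()) / total if total > 0 else 0.0


cycles = [
    {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},  # clean signal
    {"A": 0.10, "C": 0.70, "G": 0.15, "T": 0.05},
    {"A": 0.30, "C": 0.25, "G": 0.35, "T": 0.10},  # mixed colors, weak call
]
read = call_bases(cycles)
purities = [purity(c) for c in cycles]
```

A low purity value marks the cycles where lagging or skipping has mixed the colors, which is where the simple rule is least reliable.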

  12. Types of mismatches in uniquely mapped tags with a single mismatch are profoundly asymmetric and biased • Courtesy Thierry-Mieg

  13. Typical Errors in Base-Calling

  14. Position of single mismatch in uniquely mapped tags • Courtesy Thierry-Mieg

  15. Improving Base-Calling with SVM

  16. Pre-processing – Mapping Reads • Huge numbers (10M – 70M) • BLAT (2002 high-speed method) • Eland (proprietary Illumina) • Other new methods: MAQ, SOAP

  17. Quality Assessment • Fraction of reads mapping to targets • Typically 5-10M reads per lane, of which 60-80% map to targets • Some repetitive sequence

  18. Comparing Samples - A Simple Normalization • Different numbers of counts per lane • Divide counts in a region of interest (a genomic region, a gene, or an exon) by all counts in the lane, scaled per million reads (total per million reads, TPM) • To compare genomic regions of different lengths, also divide by the length of the region in kilobases: TPKM (total per kilobase per million)
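
As a worked example of this normalization, a minimal sketch follows. The function names and the example numbers are illustrative assumptions.

```python
def tpm(count, total_reads):
    """Counts in a region per million reads in the lane."""
    return count * 1e6 / total_reads


def tpkm(count, total_reads, region_length_bp):
    """TPM further divided by region length in kilobases."""
    return tpm(count, total_reads) * 1e3 / region_length_bp


# Hypothetical lane: 500 reads fall in a 2,000 bp region out of 10M reads.
region_tpm = tpm(500, 10_000_000)
region_tpkm = tpkm(500, 10_000_000, 2_000)
```

Here the region gets 50 reads per million, and 25 per kilobase per million once its 2 kb length is taken into account.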

  19. Quant. Analysis - Variation • Poisson model often used for random variation • Most HTS data ‘over-dispersed’ relative to Poisson • Negative Binomial often used • Parameter fitted
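
One way to see over-dispersion is to compare the sample variance with the mean, which are equal under a Poisson model. The sketch below also fits a Negative Binomial dispersion by the method of moments, assuming the parameterization variance = mean + dispersion × mean². The slide does not say how the parameter is fitted, so this estimator and the replicate counts are assumptions for illustration.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 for Poisson, above 1 if over-dispersed."""
    return variance(counts) / mean(counts)


def nb_dispersion_moments(counts):
    """Method-of-moments estimate of phi in var = mu + phi * mu**2."""
    mu, var = mean(counts), variance(counts)
    return max(0.0, (var - mu) / (mu * mu))


# Hypothetical replicate counts for one region.
counts = [10, 20, 30, 40, 50]
index = dispersion_index(counts)
phi = nb_dispersion_moments(counts)
```

With these counts the mean is 30 and the sample variance is 250, so the data are far more variable than Poisson would predict.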

  20. Quantitative Analysis - Biases • Not all regions represented equally • GC rich regions represented more • Independent of GC some chromosome regions represented more • Euchromatin bias • Sequence initiation site biases • ‘Mapability’ biases – some regions won’t have any uniquely mapped tags

  21. GC Bias • Density of reads depends strongly on GC content of regions • [Figure: read density vs. GC content (%)]
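
A relationship like this can be checked by binning genomic windows by GC content and averaging the read counts in each bin. The following is a minimal sketch; the window sequences, counts, and 10% bin width are made up for the example.

```python
from collections import defaultdict


def gc_percent(seq):
    """Integer GC percentage over A/C/G/T bases; None if no such bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100 * gc // acgt if acgt else None


def mean_count_by_gc_bin(windows, bin_width=10):
    """windows: iterable of (sequence, read_count). Returns {bin_start: mean}."""
    totals = defaultdict(lambda: [0, 0])  # bin -> [sum of counts, n windows]
    for seq, count in windows:
        pct = gc_percent(seq)
        if pct is None:
            continue
        bin_start = pct // bin_width * bin_width
        totals[bin_start][0] += count
        totals[bin_start][1] += 1
    return {b: s / n for b, (s, n) in sorted(totals.items())}


# Hypothetical windows with their read counts.
windows = [
    ("ATATATATAT", 5),
    ("ATGCATGCAT", 12),
    ("GCGCGCATAT", 20),
    ("GGCCATATAT", 14),
]
profile = mean_count_by_gc_bin(windows)
```

Plotting the resulting means against the bin starts gives the kind of read density vs. GC content curve the slide refers to.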

  22. Genomic Position Biases • Count tags from randomly sheared DNA in red with GC content in blue

  23. Start Position Bias

  24. Consistent Start Position Bias Counts per start site in lane 1 vs lane 2

  25. RNA-Seq

  26. RNA-Seq Data • Gene Model • Kidney Reads • Liver Reads • From Marioni et al 2008

  27. Accuracy of Illumina RNA-Seq

  28. Comparing RNA-Seq & Affy • Issues • How replicable is RNA-Seq? • How consistent are the two technologies? • Which is better? • Marioni et al, Genome Research, 2008

  29. Comparing Fold-Changes • D.E. by ILM • Red >250 • Green <250 • Black Not DE by ILM

  30. Model for Variation • Poisson counts hypergeometric comparison • Make uniform p-values by adding random term • Use lower tails only
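
The three bullets can be read as follows: conditional on the lane totals and a gene's total count, the number of that gene's reads in lane 1 is hypergeometric; because counts are discrete, a uniform random term is added so that null p-values are uniform; only the lower tail is used. The sketch below is one interpretation under those assumptions, not a reproduction of the published analysis, and the counts are invented.

```python
import random
from math import comb


def hypergeom_pmf(x, gene_total, lane1_total, lane2_total):
    """P(X = x) that x of the gene's reads fall in lane 1, margins fixed."""
    n = lane1_total + lane2_total
    if x < 0 or x > gene_total or x > lane1_total:
        return 0.0
    if gene_total - x > lane2_total:
        return 0.0
    ways = comb(gene_total, x) * comb(n - gene_total, lane1_total - x)
    return ways / comb(n, lane1_total)


def randomized_lower_p(x, gene_total, lane1_total, lane2_total, u):
    """Lower-tail p-value P(X < x) + u * P(X = x), with u uniform on [0, 1)."""
    below = sum(
        hypergeom_pmf(i, gene_total, lane1_total, lane2_total) for i in range(x)
    )
    return below + u * hypergeom_pmf(x, gene_total, lane1_total, lane2_total)


# Hypothetical gene: 3 reads in lane 1 and 7 in lane 2, 1000 reads per lane.
p_value = randomized_lower_p(3, 10, 1000, 1000, random.random())
```

Passing `u` explicitly keeps the function deterministic for checking; in use it is drawn fresh for each gene.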

  31. False Positive Rates • QQ-plots of p-values between tech. reps

  32. Different Concentrations are NOT Comparable! • QQ-plots of p-values between 3pM and 1.5 pM

  33. Normalization of RNA-Seq • Robinson et al noticed that most genes appeared less expressed in liver • Fig 1 from Robinson & Oshlack, Genome Biology 2010

  34. A Better Normalization for RNA-Seq - TMM • Drop extremes of ratios • Drop very high count genes • Compute trimmed means of samples • Center log-ratios between samples
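
A simplified, unweighted sketch of the trimmed-mean idea follows. It is not the published TMM implementation, which also weights genes by precision; the trim fractions, function name, and counts are assumptions for illustration.

```python
import math


def tmm_log2_factor(sample, reference, trim_m=0.3, trim_a=0.05):
    """Trimmed mean of log2 ratios of proportions (unweighted simplification)."""
    n_s, n_r = sum(sample), sum(reference)
    m_vals, a_vals = [], []
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:  # log ratios need non-zero counts
            p_s, p_r = y_s / n_s, y_r / n_r
            m_vals.append(math.log2(p_s / p_r))        # log ratio (M)
            a_vals.append(0.5 * math.log2(p_s * p_r))  # mean log level (A)
    n = len(m_vals)
    cut_m, cut_a = int(n * trim_m), int(n * trim_a)
    by_m = sorted(range(n), key=lambda i: m_vals[i])
    by_a = sorted(range(n), key=lambda i: a_vals[i])
    # Drop extreme ratios and extreme abundances (both tails in this sketch).
    keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
    if not keep:
        raise ValueError("no genes left after trimming")
    return sum(m_vals[i] for i in keep) / len(keep)


# Hypothetical counts: one very highly expressed gene in the sample
# makes every other gene look half as expressed by proportion.
reference = [100] * 10
sample = [100] * 9 + [1100]
log2_factor = tmm_log2_factor(sample, reference)
```

Subtracting the returned value from each gene's log ratio centers the log-ratios between the two samples, which is the last step on the slide.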

  35. New Things to do with RNA-Seq • Allele-specific expression • Splice variation • Between tissues • In disease • Alternate initiation sites • Select 5’ capped RNA fragments • Alternate termination

  36. Allelic Comparison • It is possible to compare allele-specific expression counts • Sample from VCU • Replicate samples • P-values for binomial tests of equality • About half show differential expression!
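
A binomial test of equality for one heterozygous site can be sketched as below: under equal expression, each read comes from either allele with probability 0.5. The read counts are hypothetical and the function name is an assumption.

```python
from math import comb


def allelic_binomial_p(ref_reads, alt_reads):
    """Two-sided exact binomial p-value for H0: both alleles equally expressed."""
    n = ref_reads + alt_reads
    k = min(ref_reads, alt_reads)
    # With p = 0.5 the distribution is symmetric, so double the smaller tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical site: 15 reads carry the reference allele, 5 the alternate.
p_skewed = allelic_binomial_p(15, 5)
p_balanced = allelic_binomial_p(10, 10)
```

For the 15 vs. 5 split the p-value is about 0.041, while a 10 vs. 10 split gives 1.0.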

  37. Detecting Splice Variation • Deep sequencing shows up clear variation in exon usage • Wang et al Nature 2008

  38. Tissue Map of Splice Variation From Wang et al • Brain is most distinctive • Individuals seem to differ • Cell lines seem to have distinct splice patterns

  39. Splicing is Complex • Many different splice operations exist • Only some of these characterized by counting exon reads

  40. Issues in Detecting Splice Variants • Counts in exons reflect biases (as yet uncharacterized) as well as actual abundance • Reads that bridge splice junctions would be definitive but mapping is very dubious with short (<40 base) reads • All possible splice junctions are not known • Hard to even search through the known ones

  41. Methodology for Splice Variants • Count reads mapped to exons and compare ratios across samples • Wang et al, and most others • Count reads that cross splice junctions
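
The exon-counting approach can be sketched as follows: compute each exon's share of the gene's reads in each sample, then compare the shares as log ratios. The pseudocount, exon names, and counts are assumptions for illustration, and the biases noted on the previous slide are not corrected for here.

```python
import math


def exon_usage(exon_counts, pseudocount=0.5):
    """Each exon's share of the gene's reads, with a small pseudocount."""
    total = sum(exon_counts.values()) + pseudocount * len(exon_counts)
    return {e: (c + pseudocount) / total for e, c in exon_counts.items()}


def log2_usage_ratios(sample_a, sample_b, pseudocount=0.5):
    """log2 of exon usage in sample_a over sample_b, for shared exons."""
    usage_a = exon_usage(sample_a, pseudocount)
    usage_b = exon_usage(sample_b, pseudocount)
    return {e: math.log2(usage_a[e] / usage_b[e]) for e in usage_a if e in usage_b}


# Hypothetical exon read counts for one gene in two samples.
sample_a = {"exon1": 100, "exon2": 50, "exon3": 50}
sample_b = {"exon1": 100, "exon2": 10, "exon3": 90}
ratios = log2_usage_ratios(sample_a, sample_b)
```

An exon with a ratio far from zero, such as `exon2` here, is a candidate for differential usage between the samples.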

  42. Methodology for Finding Junctions

  43. ChIP-Seq

  44. Chromatin Immuno-precipitation

  45. ChIP-Seq Workflow • Cross-link proteins to DNA • Fragment DNA • Extract with antibody • Reverse cross links • Sequence fragments • DO CONTROLS!

  46. ChIP-Seq Data • From Rozowsky et al, Nature Biotech 2009

  47. ChIP-Seq vs ChIP-chip

  48. Peak-Finding - Simple • Extend tags and count overlap • How much to extend?
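
A minimal sketch of the simple approach: extend each tag from its 5' end in the direction of its strand, then find where the most extended tags overlap. The 200-base extension is an assumed fragment length (the open question on the slide), and the tag positions are made up.

```python
from collections import Counter


def coverage_segments(tags, extend=200):
    """tags: (five_prime_position, strand). Returns (start, end, depth) runs."""
    delta = Counter()
    for pos, strand in tags:
        if strand == "+":
            start, end = pos, pos + extend          # extend rightwards
        else:
            start, end = pos - extend + 1, pos + 1  # extend leftwards
        delta[start] += 1
        delta[end] -= 1
    segments, depth, prev = [], 0, None
    for pos in sorted(delta):
        if prev is not None and depth > 0:
            segments.append((prev, pos, depth))     # half-open [prev, pos)
        depth += delta[pos]
        prev = pos
    return segments


# Hypothetical tags around one binding site.
tags = [(100, "+"), (120, "+"), (350, "-"), (370, "-")]
peak = max(coverage_segments(tags), key=lambda seg: seg[2])
```

With these tags all four extended fragments overlap on positions 171 to 299, so that segment is reported as the peak.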

  49. Peak Finding – Better • Tags starting on opposite strands are likely to start at opposite ends • Identifying the cross-over point leads to improved accuracy
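
One simple way to locate the cross-over point, assuming a single binding site in the window, is to take the midpoint between the typical plus-strand and minus-strand tag starts. Using medians is an assumption of this sketch, not a method named on the slide, and the tag positions are invented.

```python
from statistics import median


def crossover_estimate(tags):
    """tags: (five_prime_position, strand). Returns (site, strand_shift)."""
    plus = [pos for pos, strand in tags if strand == "+"]
    minus = [pos for pos, strand in tags if strand == "-"]
    if not plus or not minus:
        raise ValueError("need tags on both strands")
    m_plus, m_minus = median(plus), median(minus)
    # Plus-strand tags start upstream of the site, minus-strand tags downstream.
    return (m_plus + m_minus) / 2, m_minus - m_plus


# Hypothetical tags around one binding site.
tags = [(100, "+"), (120, "+"), (350, "-"), (370, "-")]
site, shift = crossover_estimate(tags)
```

The strand shift also gives a rough, data-driven fragment length, which is one answer to how far tags should be extended.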