
Concepts and methods in sequencing and genome assembly

BCM-2002. Concepts and methods in sequencing and genome assembly. B. Franz LANG, Département de Biochimie. Office: H307-15. Email: Franz.Lang@Umontreal.ca. Outline: Concepts in DNA and RNA sequencing; Sequencing technologies; Random genome sequencing, with/without cloning


Presentation Transcript


  1. BCM-2002 Concepts and methods in sequencing and genome assembly B. Franz LANG, Département de Biochimie. Office: H307-15. Email: Franz.Lang@Umontreal.ca

  2. Outline • Concepts in DNA and RNA sequencing • Sequencing technologies • Random genome sequencing, with/without cloning • Data formats of results: autoradiograms, traces, FASTQ and base call qualities • Sequencing and assembly artifacts

  3. Concepts in DNA and RNA sequencing Reminder • DNA and RNA are polar (5’ P; 3’ OH), charged biopolymers made up of nucleotides. • By convention, sequences are always written from 5’ (left) to 3’ (right); otherwise, the polarity has to be indicated. • DNA usually occurs in double-stranded, antiparallel, perfectly base-paired form: • 5’ AGCTATTGATTTCCTTGG 3’ • 3’ TCGATAACTAAAGGAACC 5’ • RNAs are most often single-stranded and may form secondary and tertiary base pairs (intramolecular, or with other molecules). Single-stranded DNA does the same. • For sequencing, DNAs and RNAs have to be denatured and single-stranded, without structure.
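The antiparallel pairing of the duplex shown on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of the lecture.

```python
# Watson-Crick pairing rules for DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement. If seq is written 5'->3',
    the result reads 3'->5' (the lower strand as drawn on the slide)."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Complementary strand rewritten in the conventional 5'->3' direction."""
    return complement(seq)[::-1]

top = "AGCTATTGATTTCCTTGG"          # upper strand from the slide, 5'->3'
print(complement(top))              # lower strand, 3'->5'
print(reverse_complement(top))      # lower strand, 5'->3'
```

Writing the lower strand 5’ to 3’ requires reversing it, which is why sequence databases store only one strand and derive the other on demand.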

  4. Concepts in DNA and RNA sequencing Principles; see also Maniatis (a popular biochemistry cookbook): The two original sequencing techniques are the enzymatic synthesis method of Sanger et al. (1977) and the chemical degradation method of Maxam and Gilbert (1977). Note that Maxam and Gilbert sequencing is slow, uses highly toxic/carcinogenic substances, and is no longer used, except for special applications such as mapping of protein binding to DNA. Next-generation sequencing (NGS) techniques have taken over for genome projects (see below). They do not require electrophoretic techniques but instead use various nanotechnological approaches. By far the most popular technology at present is Illumina.

  5. Concepts in DNA and RNA sequencing Principle: Although very different in approach, both Maxam/Gilbert and Sanger produce populations of (radio- or fluorochrome-) labeled oligonucleotides that all start at the same site of a given DNA/RNA and end at a given nucleotide (G, A, T/U, C), as determined by the sequencing biochemistry (nucleotide-specific termination of DNA synthesis, or nucleotide-specific cleavage). [Figure: cleavage at random methylated-G (meG) sites yields ‘visible’ radioactive fragments.] Note that in any sequencing technology, only the labeled single-stranded DNAs or RNAs are sequenced; unlabeled material does not matter. When several molecules carry the same label, they first need to be separated (e.g., by electrophoresis).

  6. Concepts in DNA and RNA sequencing Electrophoretic separation, and detection principles: These populations of oligonucleotides are then resolved by electrophoresis under conditions that discriminate size differences at the single nucleotide level (PAGE). When loaded into four adjacent lanes of a sequencing gel, the order of nucleotides can be read directly from an image after visualizing the radioactive or any other label (see below). When sequence reactions are marked with four different fluorescent dyes (the current standard of Sanger sequencing), these can be loaded on a single lane (or capillary), and read automatically and continuously as different-wavelength light emissions, generated by laser excitation.
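The four-lane readout described above can be mimicked in Python. This is a minimal sketch under idealized assumptions: every possible termination product is present, fragment lengths are counted from the end of the primer, and the gel resolves single-nucleotide differences perfectly. The function names are illustrative.

```python
BASES = "ACGT"

def sanger_ladders(synthesized):
    """For each base-specific reaction (one gel lane), list the lengths of
    all labeled fragments that end in that base."""
    lanes = {base: [] for base in BASES}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].append(length)
    return lanes

def read_gel(lanes):
    """Read the gel from the shortest fragment to the longest: the lane in
    which each successive band appears gives the next nucleotide."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)

lanes = sanger_ladders("GATTC")
print(lanes)             # band positions per lane
print(read_gel(lanes))   # sequence recovered from the band order
```

With four fluorescent dyes the same logic applies, except that all bands migrate in one lane or capillary and the dye color replaces the lane identity.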

  7. Concepts in DNA and RNA sequencing Principles of RNA sequencing: RNA is sequenced similarly to DNA: either directly by chemical methods (Maxam-Gilbert-like, yet inefficient and slow), by a Sanger-like synthesis protocol with reverse transcriptase (to produce cDNA sequence ladders), or after conversion to cDNA by regular DNA sequencing procedures (Sanger or NGS technologies). RNA classes may be separated by size (microRNAs, tRNAs, rRNAs …), or eukaryotic mRNAs carrying a 3’ poly-A tail may be enriched by purification on an oligo-dT column. RNA sequencing may also provide more information than just the primary sequence. Most RNAs have distinct start and processing sites. High-volume RNA sequencing (NGS, called RNA-seq) allows precise identification of starts and stops and measurement of relative quantities (i.e., quantitative mapping of RNA 5’ and 3’ ends).

  8. 2. Sequencing technologies • 2.1. Maxam and Gilbert (chemical) • Requires large amounts of highly purified DNA fragments (e.g., restriction fragments). • Single radioactive label, can be on double- or single-stranded DNA. • Nucleotide-specific, partial chemical modification (random along DNA). • Chemical cleavage at modified nucleotides. • Denaturation (heat, formamide), to allow uniform electrophoresis of single-stranded DNA molecules that are perfectly linear and without secondary structure (if not: sequencing artifacts). • High-resolution slab gel PAGE, followed by autoradiography. • Reading (up to a few hundred nt/reaction) usually by a human expert. • Several days of labor with a few gel runs provide at best 10 kbp of sequence.

  9. 2.1. Maxam-Gilbert sequencing – summary Slow, many DNA purification steps, requires lots of DNA, toxic reagents, no automation available, relatively short reads of up to a few hundred nt.

  10. 2. Sequencing technologies • 2.2. Sanger (enzymatic synthesis) • Unique start of sequencing ladder is determined by a sequencing primer, hybridized to DNA or RNA. Purity of template is not an issue (!), a huge advantage. • DNA polymerase (reverse transcriptase for RNA) for primer elongation. • Nucleotide-specific termination (random) with one of four dideoxy-nucleotides that are mixed with the four regular nucleotides.

  11. 2. Sequencing technologies • 2.2. Sanger (enzymatic synthesis) • Label may be radioactive or a fluorescent dye on • Primer itself (e.g., 5’ ³²P; dye label added during primer synthesis). • Nucleotides incorporated during synthesis (e.g., ³²P, ³⁵S). • Dideoxy-nucleotides (different dyes emitting different colors; single-lane or capillary sequencing is possible: current standard). • High-resolution slab gel or capillary electrophoresis • Autoradiography or automated reading of migrating fragments (laser, with camera or diodes). • Several days of labor may produce ~100 kbp of sequence. Robotic procedures for template purification and sequence reactions allow scale-up.

  12. 2.2. Sanger (enzymatic synthesis), summary In this example, the primer is labeled, therefore requiring gel separation in four lanes

  13. 2. Sequencing technologies 2.3. 454 Technology – Roche GS FLX pyrosequencing (several hundred Mb per run; advantage: reads up to 1,000 nt) Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the "sequencing by synthesis" principle. It differs from Sanger sequencing in that it relies on the detection of pyrophosphate release upon nucleotide incorporation, rather than chain termination with dideoxynucleotides. Originally the leader in NGS, this technology is less effective and more error-prone than Illumina, and will therefore be abandoned by the company in a few years!

  14. 2. Sequencing technologies 2.3. 454 Technology – Roche GS FLX

  15. 2. Sequencing technologies 2.3. 454 Technology – Roche GS FLX (multiplex reaction in oil emulsion droplets)

  16. 2. Sequencing technologies 2.3. 454 Technology – Roche GS FLX

  17. 2. Sequencing technologies 2.3. 454 Technology – Roche GS FLX • DNA polymerase incorporates the correct, complementary dNTPs onto the template. This incorporation releases pyrophosphate (PPi) stoichiometrically. • ATP sulfurylase quantitatively converts PPi to ATP in the presence of adenosine 5´ phosphosulfate. This ATP fuels the luciferase-mediated conversion of luciferin to oxyluciferin, which generates visible light. • Unincorporated nucleotides and ATP are degraded by apyrase, and the reaction can restart.
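Because pyrophosphate is released stoichiometrically, the light signal of each nucleotide flow is proportional to the number of bases incorporated in that flow. The sketch below converts a synthesized sequence into such an idealized, noise-free flowgram. The cyclic flow order "TACG" and the function name are assumptions made for illustration.

```python
def flowgram(read, flow_order="TACG"):
    """Return the ideal signal (number of incorporated bases) for each
    nucleotide flow needed to synthesize `read`.

    flow_order is an assumed cyclic order in which dNTPs are supplied.
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases never flowed: {sorted(unknown)}")
    signals = []
    pos = 0
    flow = 0
    while pos < len(read):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A homopolymer stretch is incorporated within a single flow.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flow += 1
    return signals

for base, signal in flowgram("TTAGC"):
    print(base, signal)
```

A signal of 0 means the flowed nucleotide did not match the next template position; a signal of 2 or more marks a homopolymer, which is where real instruments become imprecise.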

  18. 2. Sequencing technologies 2.4. Illumina (several Gb per run; reads up to 300 nt)

  19. 2. Sequencing technologies 2.4. Illumina

  20. 2. Sequencing technologies 2.4. Illumina

  21. 2. Sequencing technologies 2.4. Illumina Base calling example for two clusters

  22. 2. Sequencing technologies 2.5. ABI SOLiD – sequencing by ligation (2,000 Mb per run; but only 35 nt/read) A library of DNA fragments, ligated with universal sequence adaptors, is attached to the surface of magnetic beads (one fragment per bead). Emulsion PCR taking place in micro-reactors amplifies the fragments, which are then covalently bound to a glass slide. SOLiD technology applies a rather complicated ligation/cleavage procedure. Partially degenerate, fluorescently labeled DNA octamers with dinucleotide sequence recognition cores are hybridized to the template, and perfectly annealing sequences are ligated to the primer. After imaging, unextended strands are capped and fluorophores are cleaved. Repetitions of new priming, primer removal, and ligation cycles will in the end cover a stretch of 35 nt twice (redundantly), which improves the accuracy of base calling. Yet the value of a 35 nt read is dwindling in the face of other NGS technologies producing longer reads almost every year (e.g., Illumina promising 300 nt for 2014). [Figure: first ligation cycle and cleavage.]
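Since each probe recognizes a dinucleotide, every template position is interrogated twice, and the read comes out as a series of colors rather than bases. The sketch below uses the commonly described two-base encoding, in which a color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The color numbering and function names are illustrative assumptions, not taken from the slides.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode_colors(seq):
    """One color (0-3) per overlapping dinucleotide of seq."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base (in practice the
    last adaptor base) and the color calls."""
    current = CODE[first_base]
    seq = [first_base]
    for color in colors:
        current ^= color          # XOR is its own inverse
        seq.append(BASE[current])
    return "".join(seq)

colors = encode_colors("ATGGA")
print(colors)
print(decode_colors("A", colors))
```

In this encoding a true single-nucleotide change alters two adjacent colors, whereas an isolated measurement error alters only one, which is one way the redundant coverage helps base calling.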

  23. 2. Sequencing technologies …. and so on ….

  24. 2. Sequencing technologies 2.6. Ion Torrent (100+ Mb per run; up to 200 nt/read) Incorporation of a deoxyribonucleotide triphosphate (dNTP) into a primed, growing DNA strand involves the release of pyrophosphate and of a hydrogen ion that is measured on a semiconductor chip. Microwells, each containing one single-stranded template DNA molecule plus a DNA polymerase, are sequentially flooded with A, C, G or T. An introduced dNTP is incorporated into the growing complementary strand only if it is complementary to the next unpaired nucleotide on the template strand. If several identical nucleotides follow each other, the signal strength correlates with the number of identical incorporated nucleotides. The series of electrical pulses is translated into a DNA sequence, without intermediate signal conversion, the use of labeled nucleotides, or error-prone intermediate amplification steps. However, the signal precision is lower than with 454, Illumina, and SOLiD technologies.
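Translating the pulse series into a sequence amounts to estimating, for each nucleotide flow, how many identical bases were incorporated. The sketch below rounds each normalized signal to the nearest integer. The flow order "ACGT", the normalization (1.0 per incorporated base), and the function name are assumptions for illustration; real base callers use more elaborate signal models.

```python
def call_bases(signals, flow_order="ACGT"):
    """Convert normalized per-flow signals into a base sequence.

    signals[i] is the measured signal of flow i, scaled so that one
    incorporated nucleotide gives about 1.0 (an assumed normalization).
    """
    read = []
    for flow, signal in enumerate(signals):
        base = flow_order[flow % len(flow_order)]
        count = max(0, int(signal + 0.5))   # nearest non-negative integer
        read.append(base * count)
    return "".join(read)

print(call_bases([1.9, 0.1, 1.1, 0.9]))       # clean signals
print(len(call_bases([6.4, 0.0, 0.0, 0.0])))  # long homopolymer, low estimate
print(len(call_bases([6.6, 0.0, 0.0, 0.0])))  # nearly the same signal, one base more
```

The last two calls illustrate why limited signal precision mainly causes insertion and deletion errors in homopolymers: a small change in signal height shifts the estimated run length by one.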

  25. 2. Sequencing technologies 2.7. Pacific Biosciences (+/- 3,000 nt/read, up to 15,000) The PacBio RS II is a single-molecule, real-time DNA sequencing system that provides the longest read lengths of any available sequencing technology; however, in comparison to all other NGS technologies it has the lowest precision. Sequencing occurs on SMRT Cells, each containing thousands of Zero-Mode Waveguides (ZMWs) in which polymerases are immobilized. The ZMWs provide a way of directly watching DNA polymerase with a high-resolution camera as it performs sequencing by synthesis (fluorescence measurement; four different fluorochrome-labeled nucleotides). The long read length is precious for the assembly of genomes, in particular in regions containing long sequence repeats that otherwise cause problems in genome assembly. In addition, it detects DNA base modifications using the kinetics of the polymerization reaction during sequencing.

  26. 2. Sequencing technologies 2.7. Pacific Biosciences (+/- 3,000 nt/read, up to 15,000)

  27. Sequencing technologies – comparison from 2012 Quail et al. BMC Genomics 2012, 13:341

  28. 3. Random genome sequencing comparison 3.1. Sanger, Maxam-Gilbert, with cloning: DNA is not amplified in vitro and therefore has no DNA amplification artifacts. Each clone receives an original piece of DNA in a plasmid that is multiplied by E. coli.

  29. 3. Random genome sequencing comparison 3.2. NGS procedures without cloning, using DNAs either attached to nano-chips (micro wells) or in oil drop emulsion. 454, Illumina, SOLiD: DNA is highly PCR-amplified. Errors may therefore come from PCR amplification artifacts. Pacific Biosciences and Ion Torrent technologies both read single molecules directly, without prior PCR amplification. In contrast, their relatively high error rate is due to the signal imprecision itself.

  30. 4. Data formats of results – autoradiograms, traces, FASTQ and base call qualities Trace file typical for Sanger sequencing, with base call qualities indicated by the height of blue bars and Q numbers. The advantage of this format is easy spotting of artifacts by a human expert. The typical NGS format (FASTQ) only reports the sequence plus the quality, encoded in machine-readable format.

  31. 4. Data formats of results – quality scores in FASTQ format The typical NGS format (FASTQ) only reports the sequence plus the quality, encoded in machine-readable format.
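A FASTQ record has four lines: an identifier starting with "@", the sequence, a "+" separator, and one quality character per base. The sketch below decodes such a record, assuming the Phred+33 encoding (Sanger and current Illumina files; older Illumina pipelines used an offset of 64). A Phred score Q corresponds to an error probability P = 10^(-Q/10). The example record and function names are made up for illustration.

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()                      # '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Decode quality characters into integer Phred scores."""
    return [ord(ch) - offset for ch in qual]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

example = io.StringIO("@read1\nACGT\n+\nI5!+\n")   # hypothetical record
for name, seq, qual in parse_fastq(example):
    for base, q in zip(seq, phred_scores(qual)):
        print(name, base, q, error_probability(q))
```

A score of 20 thus means one expected error in 100 calls, and 40 means one in 10,000.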

  32. 4. Data formats of results – quality scores

  33. 4. Data formats of results – quality scores

  34. 4. Data formats of results – quality scores (Illumina example)

  35. 5. Artifacts in sequencing and sequence assembly Denatured DNA is not linear when it folds back on itself; it then migrates differently on the sequencing gel (Sanger). • Reason: secondary structures, mainly in G+C-rich regions • Effect: ‘compression’ zones in the sequencing ladder • Solutions: (i) sequence the DNA in both directions, on the complementary strands, since sequencing artifacts due to folding are not symmetric; (ii) for Sanger sequencing, use nucleotide analogs that minimize secondary structure folding, like deaza-NTP, deaza-dITP, or ITP (instead of NTPs or dGTP, respectively)
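Since compressions occur mainly in G+C-rich regions, a simple first check is to scan a sequence for windows with high G+C content. This is a minimal sketch; the window size and threshold are arbitrary illustrative values, and G+C content alone does not prove that a stable secondary structure forms.

```python
def gc_content(seq):
    """Fraction of G and C in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)

def gc_rich_windows(seq, window=8, threshold=0.75):
    """Start positions (0-based) of all windows whose G+C fraction is at
    least `threshold`. Both parameters are illustrative assumptions."""
    return [
        start
        for start in range(len(seq) - window + 1)
        if gc_content(seq[start:start + window]) >= threshold
    ]

seq = "ATATGGCCGGCCATAT"
print(gc_content(seq))
print(gc_rich_windows(seq))
```

Flagged regions are candidates for sequencing from both strands or with nucleotide analogs, as listed in the solutions above.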

  36. 5. Artifacts in sequencing and assembly Sequencing ladders terminate prematurely or contain ‘holes’ Reasons: • (i) sequencing reactions over-modified (M&G), or terminator concentrations too high (Sanger); • (ii) strong nucleotide bias, like long runs of A or T that cause many polymerases to fall off the template (Sanger)

  37. 5. Artifacts in sequencing and assembly Uncertain number of identical nucleotides in a row (homopolymers; > 6) Reasons: • Amplification errors by DNA polymerase (Illumina, 454) • Signal ambiguity when estimating the number of identical nucleotides from the height of a single signal (Illumina, very high error with 454)
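Positions at risk can be listed by scanning a sequence for long homopolymer runs. This is a minimal sketch using a regular expression; the default minimum length of 7 mirrors the "> 6" given above, and the function name is illustrative.

```python
import re

def homopolymer_runs(seq, min_len=7):
    """Return (start, base, length) for every run of at least min_len
    identical nucleotides; start is 0-based."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (match.start(), match.group(1), len(match.group(0)))
        for match in pattern.finditer(seq.upper())
    ]

seq = "ACGT" + "A" * 8 + "TTTG"
print(homopolymer_runs(seq))
```

Base calls inside the reported runs deserve extra scrutiny, for example by comparing reads from different technologies.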

  38. 5. Artifacts in sequencing and assembly Reads that only partially fit a genome sequence (one of the worst artifacts) Reasons: • Ligation of separate pieces into one fragment during primer ligation (applies to all technologies using primer ligation) • Partial deletion of sequence during PCR at repeat sequences and folded structures (all technologies using PCR amplification)
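A read that fits a genome only in parts can be recognized when no single location explains the whole read but its two halves each match somewhere. The toy sketch below uses exact string matching on one strand only; real read mappers tolerate mismatches and search both strands. The function name, the labels, and the `min_part` parameter are illustrative assumptions.

```python
def classify_read(read, genome, min_part=5):
    """Classify a read as 'full match', 'chimeric' or 'no match'.

    Returns (label, split), where split is the position in the read at
    which a chimeric read was cut, or None otherwise.
    """
    if read in genome:
        return ("full match", None)
    # The read does not fit as a whole: try to explain it as two pieces.
    for split in range(min_part, len(read) - min_part + 1):
        left, right = read[:split], read[split:]
        if left in genome and right in genome:
            return ("chimeric", split)
    return ("no match", None)

genome = "ACGTTGCAAGGCTTAACCGGTATCG"        # made-up toy genome
chimera = genome[0:8] + genome[15:23]       # two distant pieces joined
print(classify_read(genome[3:15], genome))
print(classify_read(chimera, genome))
print(classify_read("TTTTTTTTTTTT", genome))
```

The same pattern results from a ligation chimera and from a PCR-mediated deletion; distinguishing the two requires looking at where the two pieces map.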

  39. This is it, folks!
