Assembly Group Presentation II


Presentation Transcript


  1. Assembly Group Presentation II • Robert Arthur, Kevin Lee, Xing Liu, Pushkar Pande, Gena Tang, Racchit Thapliyal, Tianjun Ye

  2. Presentation Overview • Sequencing Methods • Experimental comparison of De Bruijn graph and Overlap graph assemblers • Preliminary Results • Lab Exercise

  3. Sequencing Methods • Sanger Sequencing • Cycle sequencing reaction • ddNTP-terminated, dye-labeled products • High-resolution electrophoretic separation • Parallelized in 96 or 384 capillaries • Read lengths up to 1 kb • Raw accuracy up to 99.999% • Costs about $0.50 per kb

  4. Sequencing Methods • Second Gen. Sequencing • Cyclical array methods • 454 • Illumina • AB SOLiD • Polonator • HeliScope • Platforms vary in biochemistry and array generation, yet are conceptually similar in workflow

  5. Illumina

  6. Illumina continued

  7. AB SOLiD

  8. 454 Pyrosequencing • Create a DNA library • Ligate adaptors to fragments • Emulsion PCR • Agarose beads • Oil, water, PCR reagents • Results in 1 million copies of the fragment on each bead

  9. More 454 • Beads arrayed into a picotiter plate • Immobilized via addition of enzyme-containing beads • Each well contains exactly 1 bead • Bst polymerase, luciferase, apyrase, ATP sulfurylase used

  10. Even more 454: Example of Output • Measures the presence or absence of each nucleotide at any given position • Flow order: TACG • Key: TCAG • [Figure: flowgram with signal levels labeled 1-mer, 2-mer, 3-mer, 4-mer]

  11. Videos (454 Workflow)

  12. Videos (Pyrosequencing) • Note: we did not choose the music

  13. Comparison of 2nd Gen Platforms

  14. Presentation overview • Sequencing Methods • Experimental comparison of De Bruijn graph and Overlap graph assemblers • Preliminary Results • Lab Exercise

  15. De Bruijn Graph assemblers and Overlap Graph assemblers • De Bruijn Graph assemblers • Velvet, ABySS, EULER • Overlap Graph assemblers • Newbler, Edena, SSAKE, VCAKE

  16. Synthetic Data Used for Experiments • Wrote a C program to simulate reads from a reference genome with a specific read length, coverage and base error rate • Human chr 22, ~33.5M bases • Streptococcus suis, NC_012925.1, ~2M bases • Helicobacter acinonychis Sheeba, ~1.5M bases • Wrote another C program to measure the quality of assemblers • N50 length • No. of contigs • Max contig length • No. of mis-assembled contigs
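The two programs on this slide were written in C and are not reproduced in the transcript. As a rough illustration only, the following Python sketch shows one way such a read simulator and contig-metric calculator could look. The function names `simulate_reads` and `contig_stats` are hypothetical, reads are sampled uniformly from one strand, errors are modeled as substitutions only, and counting mis-assembled contigs is left out because it would require aligning contigs back to the reference.

```python
import random


def simulate_reads(genome, read_len, coverage, error_rate, seed=0):
    """Sample fixed-length reads uniformly and inject substitution errors.

    Assumption: forward strand only, substitutions only (no indels).
    """
    rng = random.Random(seed)
    # Number of reads needed to reach the requested coverage.
    n_reads = int(coverage * len(genome) / read_len)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


def contig_stats(contig_lengths):
    """Return number of contigs, max contig length and N50."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50: length of the contig at which the running sum of the
        # longest contigs first reaches half of the total assembly size.
        if 2 * running >= total:
            n50 = length
            break
    return {
        "n_contigs": len(lengths),
        "max_len": lengths[0] if lengths else 0,
        "n50": n50,
    }
```

For example, `simulate_reads(genome, 35, 50, 0.001)` would mimic the "35 read length, 50X coverage, error rate 0.1%" setting, and `contig_stats` could then be applied to the contig lengths reported by an assembler.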

  17. Read Length • De Bruijn graph assemblers are only suitable for short-read data • K limitation • Use a hash table or sorting to index K-mers • Need a unique key (value) to represent each K-mer • K = 16: 4^16 = 2^32 <-> 32-bit integer (unsigned int) • K = 32: 4^32 = 2^64 <-> 64-bit integer (unsigned long long) • K > 32? <-> multiple integers are needed to represent the hash table key
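The key-size argument on this slide follows from packing each base into 2 bits. A minimal Python sketch of that encoding is shown below; the mapping A=0, C=1, G=2, T=3 and the names `encode_kmer` and `decode_kmer` are assumptions chosen for illustration, not taken from any particular assembler.

```python
# Assumed 2-bit code per base; any fixed one-to-one mapping would work.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a K-mer into an integer, 2 bits per base."""
    key = 0
    for base in kmer:
        key = (key << 2) | CODE[base]
    return key


def decode_kmer(key, k):
    """Recover the K-mer string from its packed integer key."""
    bases = []
    for _ in range(k):
        bases.append("ACGT"[key & 3])  # lowest 2 bits hold the last base
        key >>= 2
    return "".join(reversed(bases))


# The largest key for a given K is the all-T K-mer.
print(encode_kmer("T" * 16).bit_length())  # 32 bits: fits an unsigned int
print(encode_kmer("T" * 32).bit_length())  # 64 bits: fits an unsigned long long
print(encode_kmer("T" * 33).bit_length())  # 66 bits: no longer fits in 64 bits
```

Python integers grow as needed, so the sketch only makes the bit counts visible. In C the same packing would overflow a single 64-bit word once K exceeds 32, which is why larger K needs several integers per key.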

  18. Simulate reads from Streptococcus suis • Read length 300, 50X coverage, error rate 0.1% • Velvet default: K <= 31, so we use 31 • Recompile Velvet, K = 99

  19. Quality and Accuracy • It is stated in the literature that the “De Bruijn based approach is prone to false positives” and that the “Overlap graph has better quality”

  20. Simulate reads from Helicobacter acinonychis Sheeba • Read length 35, 50X coverage, error rate 0.1%

  21. Simulate reads from Streptococcus suis • Read length 35, 50X coverage, error rate 0.1%

  22. Runtime and Memory Usage • Overlap graph based assemblers are computationally expensive and use more memory • All-to-all alignment of reads, O(n^2) • Use more memory to store the overlap graph • Typically, the number of reads is much larger than the number of K-mers • Especially for short-read data • With the same coverage and genome length, shorter reads mean more reads • It is stated that De Bruijn graphs are more suitable for NGS data • Shorter reads and high throughput
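The point that shorter reads mean more reads, and therefore many more read pairs to compare, can be checked with simple arithmetic. The sketch below assumes a round 2,000,000-base genome (the slides give ~2M for Streptococcus suis); the helper names are hypothetical.

```python
def expected_read_count(genome_length, coverage, read_length):
    """Reads needed so that total read bases = coverage * genome length."""
    return coverage * genome_length // read_length


def all_pairs(n_reads):
    """Number of read pairs in a naive all-to-all comparison, O(n^2)."""
    return n_reads * (n_reads - 1) // 2


genome_length = 2_000_000  # assumed round figure for a ~2M base genome
for read_length in (300, 50, 35):
    n = expected_read_count(genome_length, 20, read_length)
    print(read_length, n, all_pairs(n))
```

At 20X coverage this gives 800,000 reads for 50-base reads, close to the 802995 reads used in the runtime experiment, and a naive all-to-all comparison of that many reads would involve roughly 3.2 × 10^11 pairs.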

  23. Simulate reads from Streptococcus suis • 802995 reads • Read length 50, 20X coverage, error rate 0.1% • Xeon E5530 2.4 GHz

  24. However! • Recent advances in pattern matching algorithms and techniques enable the use of overlap graphs • Suffix tree, suffix array, prefix array, compressed suffix array • Suffix array • Able to find overlaps between reads in linear time • Using a compressed suffix array can significantly reduce the memory requirements of overlap graph assemblers • Examples • D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research 18:802-809, 2008. • J. T. Simpson and R. Durbin, Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-i373, 2010. • Pasqual • Pushkar and I have developed a parallel sequence assembler based on the overlap graph in our research project
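The idea of replacing all-to-all alignment with an index lookup can be illustrated with a much simpler structure than a suffix array or FM-index. The sketch below uses a plain hash index over read prefixes as a stand-in; it is not the method of Edena, SGA or Pasqual, and it only finds exact suffix-prefix overlaps on the forward strand. The name `find_overlaps` is hypothetical.

```python
from collections import defaultdict


def find_overlaps(reads, min_overlap):
    """Find exact suffix-prefix overlaps of at least min_overlap bases.

    Returns (i, j, length) triples meaning that a suffix of reads[i]
    equals a prefix of reads[j] of the given length.
    """
    # Index every read by its first min_overlap bases.
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        if len(read) >= min_overlap:
            prefix_index[read[:min_overlap]].append(j)

    overlaps = []
    for i, read in enumerate(reads):
        # Start at 1 so that only proper suffixes are considered.
        for start in range(1, len(read) - min_overlap + 1):
            seed = read[start:start + min_overlap]
            for j in prefix_index.get(seed, ()):
                if j == i:
                    continue
                suffix = read[start:]
                # Verify that the whole suffix matches, not just the seed.
                if reads[j].startswith(suffix):
                    overlaps.append((i, j, len(suffix)))
    return overlaps


print(find_overlaps(["ACGTAC", "GTACCA", "CCATTT"], 3))
# [(0, 1, 4), (1, 2, 3)]
```

Each read is compared only against reads that share a seed with one of its suffixes, rather than against every other read. A suffix array or FM-index serves the same purpose while also supporting compressed, memory-efficient storage.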

  25. Simulate reads from Human chr22 • 6978908 reads • 50 read length, 20X coverage, error rate 0.1% • Xeon E5530 2.4 GHz with 4 cores/8 threads

  26. Mixed Length Reads • H. influenzae • Read lengths 30 ~ 300 • Velvet does not work • K is fixed • If we use a big K, reads shorter than K cannot be assembled • If we use a small K, it is difficult to assemble the long reads • Overlap graph assemblers do not have this issue • Newbler
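Why a fixed K excludes short reads can be seen by counting the K-mers a read contributes: a read of length L yields L - K + 1 K-mers, which is zero once L < K. The sketch below is only an illustration of that count; `kmers` and `usable_kmer_count` are hypothetical helper names.

```python
def kmers(read, k):
    """All K-mers of a read, in order; empty if the read is shorter than K."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def usable_kmer_count(read_length, k):
    """Number of K-mers a read of the given length contributes."""
    return max(0, read_length - k + 1)


for k in (21, 31):
    for read_length in (30, 300):
        print(k, read_length, usable_kmer_count(read_length, k))
```

With K = 31 a 30-base read contributes no K-mers at all, so it cannot enter the graph, while with K = 21 it contributes 10. The long reads are then also cut into 21-mers, which discards most of the long-range information they carry.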

  27. Conclusion • Controversial • The relation between the De Bruijn graph and the Overlap graph is still unclear • We can still conclude from the experiments • Regarding quality and accuracy, Overlap graph assemblers are thought to be better than De Bruijn graph assemblers • De Bruijn graph assemblers do not work for long reads • De Bruijn graph assemblers do not work for mixed-length reads (K is fixed) • Traditional overlap graph assemblers are slower and use more memory, but the latest assemblers are better than De Bruijn graph assemblers

  28. Presentation overview • Sequencing Methods • Experimental comparison of De Bruijn graph and Overlap graph assemblers • Preliminary Results • Lab Exercise

  29. Quality score and length distribution

  30. Quality score and length distribution

  31. Quality score and length distribution

  32. Quality score and length distribution

  33. Quality score and length distribution

  34. Quality score and length distribution

  35. Velvet • Input: Fasta/Fastq • Output: Fasta
      $> velveth <output_dir> <k-mer length> -fasta -long <reads.fasta>
      $> velvetg <output_dir>

  36. WGS assembler (Celera) • >50 separate programs make up the Celera Assembler pipeline • runCA script helps manage them all • Input: frg format • Output: Fasta
      $> sffToCA -trim soft -libraryname ${Id}-trimsoft -output ${Id}-trimsoft ${Id}.sff
      $> runCA -p ${Id} -d ${Id} ovlConcurrency=4 ${Id}-trimsoft.frg

  37. Newbler • Input: .sff • Output: Fasta
      $> runAssembly <reads.sff>    // de novo assembly

  38. MIRA • MIRA stands for Mimicking Intelligent Read Assembly • Input: Fasta + qual + trace info • Output: Fasta, Ace
      $> sff_extract -s ${Id}_in.454.fasta -q ${Id}_in.454.fasta.qual -x ${Id}_traceinfo_in.454.xml ${Id}.sff
      $> mira --project=${Id} --job=denovo,genome,normal,454 -GE:not=4 >& ${Id}_assembly.log

  39. Eagle view - M19107.ace

  40. Eagle view - M19501.ace

  41. Works Cited • “Next-generation DNA sequencing”, Shendure et al., http://compgenomics2011.biology.gatech.edu/images/f/f9/Shendure-NatureBiotechnology-2008.pdf • “Next-generation DNA sequencing methods”, Mardis et al., http://compgenomics2011.biology.gatech.edu/images/5/59/Mardis-AnnuRevGenet-2008.pdf

  42. Presentation overview • Sequencing Methods • Experimental comparison of De Bruijn graph and Overlap graph assemblers • Preliminary Results • Lab Exercise

  43. Lab Exercise • Download the Lab Exercise file from the Genome Assembly wiki page
