
Biological Motivation for Fragment Assembly

Biological Motivation for Fragment Assembly. Rhys Price Jones Anne R. Haake. What is fragment assembly?. The reconstruction of the contiguous chromosomal DNA sequence from short, experimentally-generated fragments.



  1. Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake

  2. What is fragment assembly? • The reconstruction of the contiguous chromosomal DNA sequence from short, experimentally-generated fragments. • The sequence reassembly process must realign the short fragments, in the correct order, and then generate a consensus sequence.

  3. A Simple Case • Suppose target sequence is known to be about 10 bp • Sequenced fragments are: ACCGT CGTGC TTAC TACCGT

  4. --ACCGT--
     ----CGTGC
     TTAC-----
     -TACCGT--
     _________
     TTACCGTGC

     Overlaps between fragments and the estimated length of the target sequence guide the assembly.
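The overlap-guided merge on the two slides above can be sketched in a few lines of Python. This is only an illustrative greedy heuristic (repeatedly merge the pair with the longest exact suffix/prefix overlap), not the method the slides prescribe; the function names `overlap` and `greedy_assemble` are hypothetical, and the sketch assumes error-free fragments that are all in the same orientation.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Greedy merge by longest exact overlap (illustrative assumption only)."""
    # Drop duplicates and fragments wholly contained in another fragment.
    frags = [f for f in sorted(set(fragments))
             if not any(f != g and f in g for g in fragments)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; return the separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


print(greedy_assemble(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))
# → ['TTACCGTGC']
```

On this toy input the greedy choice recovers the 9 bp consensus from the slide. With sequencing errors, repeats, or unknown orientation, a simple greedy merge like this can go wrong.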

  5. Why is fragment assembly important? • We need reliable, complete genomic sequences of humans and of model organisms • Base-pair sequence is the most basic piece of DNA information (gene structure and function are described by sequence)

  6. Why fragment the DNA in the first place? • The human genome is large: ~3 × 10^9 base pairs long • Sequencers can generate sequences only approx. 500-600 bp long at a time
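A back-of-the-envelope calculation shows the scale implied by these two numbers. The sketch below assumes reads tile the genome perfectly with no overlap when `coverage` is 1; the `coverage` parameter (average number of reads covering each base) and the function name `reads_needed` are assumptions added for illustration, not figures from the slide.

```python
import math


def reads_needed(genome_bp, read_bp, coverage=1.0):
    """Number of reads N such that N * read_bp / genome_bp >= coverage."""
    return math.ceil(coverage * genome_bp / read_bp)


genome = 3_000_000_000  # ~3 x 10^9 bp, from the slide

print(reads_needed(genome, 500))  # → 6000000
print(reads_needed(genome, 600))  # → 5000000
```

Even in this idealized case, several million reads are needed to span the genome once. Any requirement that each base be read more than once scales the count proportionally.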

  7. Solutions? • Directed Sequencing: use custom primers to sequence step by step along the genomic DNA. This is a slow and expensive process. • Shotgun Sequencing: DNA is extracted, fragmented (e.g. sheared), cloned, sequenced from both ends of each clone, reassembled, and finished (gaps are closed)

  8. Solutions? • Cloning of fragments is accomplished using different vectors, chosen according to the size of the fragments (the inserts placed into the vector). • Large fragments: YACs 1 Mb, BACs 100-200 kb • Intermediate: Cosmids, Lambda • Small: Plasmids, M13

  9. Human Genome Project vs Celera • HGP: initially used “tiling set” of large clones that cover genome • ends of the tiling set clones sequenced to allow ordering/mapping to the chromosome • individual clones subjected to shotgun sequencing • the sequences from the clones (shotgun fragments) then reassembled

  10. Celera: Whole Genome Sequencing • Celera (which won the race) took a whole-genome sequencing strategy • cloned all of the fragmented human genome into 3 clone libraries of different insert sizes • sequenced both ends of each clone • reassembled the sequences • advances in automated sequencing speed and accuracy were key to the success of the Celera approach

  11. Another Reason Fragment Assembly is Important: • Assembling and/or clustering sets of expressed sequence tags (ESTs) • The problem is that ESTs are partial sequences and may span more than one exon (intron sequences, present in the genomic sequence, have been spliced out) • Identifying ESTs and assigning them to genes is aided by finding overlaps with other ESTs.

  12. Biological issues present some challenges for algorithm development • DNA sequencing data is imperfect • Every base in the DNA should be covered several times (at least twice; once in each direction) to minimize the effects of random errors • Base-calling errors (errors in determining the base identity from the DNA sequencer trace) can occur because the quality of traces is not always high. • Capillary sequencing has reduced the errors caused by lane bleed-through in slab gel sequencing

  13. Base-calling software (e.g. Phred) attempts to assign a base to each position in the sequence, along with quality data • The quality of the sequence tends to degrade at the ends. • Vector sequence may also contaminate the ends. • NHGRI standard: 99.99% accuracy before submission of sequence to GenBank.
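To make the quality data concrete: a Phred quality score Q is defined from the estimated error probability P of a base call as Q = -10 log10(P), so the 99.99% accuracy figure corresponds to Q = 40. The sketch below shows that conversion plus a naive trimming of low-quality read ends. The function names and the cutoff `min_q=20` are assumptions chosen for illustration, not values from the slide or from the Phred software itself.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def trim_low_quality_ends(seq, quals, min_q=20):
    """Strip bases from both ends while their quality is below min_q.

    min_q=20 is a hypothetical threshold used only for this example.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]


print(round(error_prob_to_phred(1e-4)))  # 99.99% accuracy → 40
print(trim_low_quality_ends("ACGTAC", [5, 10, 30, 40, 35, 8]))  # → GTA
```

Real pipelines typically use windowed or cumulative-score trimming and screen for vector sequence separately. This end-stripping loop is only meant to show why quality values matter before assembly.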

  14. A big issue: • The human genome contains many repeats • Highly repetitive: not transcribed, role unknown, present in millions of copies. Satellite (5-50 bp), Minisatellite (12-100 bp), Microsatellite (2-6 bp) • Moderately repetitive: some are transcribed, present in up to hundreds of thousands of copies • Larger repeats with high copy number: telomeres, SINEs (e.g. Alu), LINEs, tRNAs, rRNAs

  15. Another issue: • Orientation of the fragments is unknown • Is the input fragment or its reverse complement a substring of the consensus?

      Fragment    Placement in the layout
      CACGT       CACGT--------
      ACGT        -ACGT--------
      ACTACG      --CGTAGT-----   (reverse complement)
      GTACT       -----AGTAC---   (reverse complement)
      ACTGA       --------ACTGA
      CTGA        ---------CTGA
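The orientation question on this slide can be checked mechanically. The sketch below assumes the consensus `CACGTAGTACTGA`, which is read off the layout column by column rather than stated on the slide; the names `reverse_complement` and `placements` are hypothetical helpers for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def placements(fragment, consensus):
    """All exact matches of the fragment, or its reverse complement,
    as (orientation, 0-based offset) pairs."""
    hits = []
    for label, s in (("forward", fragment),
                     ("reverse", reverse_complement(fragment))):
        pos = consensus.find(s)
        while pos != -1:
            hits.append((label, pos))
            pos = consensus.find(s, pos + 1)
    return hits


consensus = "CACGTAGTACTGA"  # assumption: inferred from the layout above

print(reverse_complement("ACTACG"))      # → CGTAGT
print(placements("CACGT", consensus))    # → [('forward', 0)]
print(placements("ACTACG", consensus))   # → [('reverse', 2)]
print(placements("GTACT", consensus))    # → [('forward', 6), ('reverse', 5)]
```

Note that GTACT matches this consensus in both orientations. This is a small example of why an assembler cannot always settle orientation from a single fragment and has to weigh all overlaps together.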

  16. Yet another issue: • Chimeras (mixed or heterogeneous DNA) may be introduced during the cloning process • DNA from non-contiguous regions of the chromosome may be joined in one clone, and host DNA may be introduced as well (for example, when growing plasmids in E. coli, the E. coli chromosomal DNA often contaminates clones)

  17. General Considerations: • The algorithms used to generate the consensus sequence must take the biological issues into account. • Need to consider prior biological information when analyzing a program’s assembly output. • e.g. known chromosomal sites or DNA fingerprinting data may be inconsistent with the program’s assembly output.
