1 / 22

The Sequence and Assembly of the Highly Duplicated Regions of the Human Genome

Vicky Choi Department of Genetics Case Western Reserve University. The Sequence and Assembly of the Highly Duplicated Regions of the Human Genome. Repeats of the Human Genome. High-copy repeats. e.g. ALU, L1. Segmental duplications (low-copy repeats). Large block (>200Kb)

mabli
Télécharger la présentation

The Sequence and Assembly of the Highly Duplicated Regions of the Human Genome

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Vicky Choi Department of Genetics Case Western Reserve University The Sequence and Assembly of the Highly Duplicated Regions of the Human Genome

  2. Repeats of the Human Genome • High-copy repeats e.g. ALU, L1 • Segmental duplications (low-copy repeats) • Large block (>200Kb) • Highly Similar (>97%) • Without characteristic sequences

  3. Identification of Duplications Sequence Assembly vs. Repeats vs. Sequence Assembly • HGP Goal: To provide an accurate reference sequence of the euchromatic portions of the genome

  4. Public: Clone-Based Sequencing Celera: Whole-Genome Shotgun Sequencing Human Genome Human Genome 500bp read BAC library (100-200Kb) BAC 100-200Kb fragment Draft, Finished Duplications Detection Method: Public Clones + Celera Random Reads Jeff Bailey & Celera Genomics

  5. 5-10 copies 10-20 copies ~40 copies Over-representation of Celera reads in the duplicated regions of clones

  6. 500 450 400 350 300 Coverage Duplicated 250 X Chromosome 200 Autosome 150 100 50 0 0 2 4 6 8 10 12 14 Copy # of Duplication Depth of Coverage vs. Copy Number of Duplication

  7. 47 At 81 100 5kb WGS Read Cov 200 400 Unique Duplication Duplicon Transition (Unique to Repeat)

  8. Segmental Duplication Database(SDD) • Scanned 32,610 clones • Statistics: 8,595 regions (2,972 clones) with 130.4 Mb—duplications 94-100%. • Confirmation: • Detected 42/43 FISH positive clones • Detected 21/23 Sequence duplications

  9. True Overlap Repeat-induced Overlap True Overlap vs. Repeat-induced Overlap

  10. A overlaps with B B B B A A A A A overlaps with C C C Yes. Consistent Overlap Inconsistent Overlap No. C Consistent Overlaps Does B overlap with C?

  11. Necessary… But Not Sufficient Long Repeats Under-represented

  12. Basic Idea of the Assembly Algorithm 1.Classify Overlaps: U-Overlaps & D-Overlaps 2.“Conservatively” assemble fragments into subcontigs 2.1 Assemble consistent U-Overlapping fragments 2.2 Assemble the overlapping BACs (u-driven)

  13. Clone Graph Algorithm (cont.) 3. Interval Realization of the Clone Graph 4. Orient and order subcontigs

  14. Contig 1 Contig 2 6. Undetermined D-overlaps Algorithm (cont.) 5. Eliminate the “interior” D-overlaps Need Additional Information

  15. SDD: 8,595 regions 13,186 fragments (3,949 BACs) contain duplications (>95% similar) (total length 247 Mbp) 403,937 overlaps = 379,507 (U-overlap) + 24,430 (D-overlap) 8,233 (u-driven) + 2,880 + 10,715 + 2,602(interior) Build 28 (Dec 2001 freeze) (Greg Schuler)

  16. 165 interior BAC pairs collapse in Build 28, 89 BACs are not in TPF

  17. Incorporation of Duplications Data Assembled BAC Length Build 28 + SDD Build 28 250K - 300K 418 461 300K - 500K 480 1328 500K - 800K 13 798 800K - 1M 0 248 1M - 2M 0 496 2M - 3M 0 129 3M - 10M 0 259 10M - 20M 0 67 911 3786 BACs contain duplications 140 699

  18. AC011767 AC004583 AC055876 AC025138 AC027267 AC018348 AC026495 AC016033 Example : 15q11-13 ~4Mbp (Devin Locke) (HERC2) Build 28 :

  19. 8 separate contigs

  20. Conclusion • SDD allows to assemble duplicated regions more conservatively, improving the assembly quality • Problematic regions: 165 “collapsed” regions, 89 BACs are not in TPF, “warped” regions • Incorporate additional data: e.g Phrap scores, BAC ends information Paralogous Sequence Variations(PSV)

  21. Jeff Bailey Greg Schuler (NCBI) Zhiping Gu (Celera Genomics) Peter Li (Celera Genomics) Martin Farach-Colton (Rutgers University) Evan Eichler Program in Mathematics and Molecular Biology (PMMB) Burroughs Wellcome (BWF) fellowship

  22. Build29

More Related