
How to Build a Horse

Explore the process of sequencing, assembling, and validating the horse genome to study diseases, infer evolution, and understand genetic conditions analogous to humans.

mleon

Presentation Transcript


  1. How to Build a Horse Megan Smedinghoff

  2. Background • In February 2007, the Broad Institute released a draft genome of the horse (Equus caballus) • The project cost $15 million and was funded by the National Human Genome Research Institute and the National Institutes of Health • 300,000 bacterial artificial chromosomes (BACs) were provided by the University of Veterinary Medicine in Hanover, Germany, and the Helmholtz Centre for Infection Research in Braunschweig, Germany

  3. Horse Genome Statistics • The horse genome contains approximately 2.7 billion base pairs • The assembly was done using 6.8-fold coverage • The sequenced horse was a Thoroughbred mare named Twilight from Cornell University [Photo: Twilight posing for a picture at Cornell]
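The coverage figure ties together genome size, read length, and read count through coverage = (number of reads × read length) / genome size. The sketch below works that arithmetic backwards to a read count. The 750 bp read length is an assumption borrowed from the end-read length on the sequencing slide; the slides do not state the average usable read length, and `reads_needed` is a hypothetical helper name.

```python
import math


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Smallest read count N with N * read_length / genome_size >= coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


GENOME_SIZE_BP = 2_700_000_000  # approximate horse genome size (from the slide)
COVERAGE = 6.8                  # fold coverage (from the slide)
READ_LENGTH_BP = 750            # ASSUMPTION: average read length, not stated in the slides

n_reads = reads_needed(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP)
print(f"{n_reads:,} reads at {READ_LENGTH_BP} bp each")
```

Under these assumptions the project would need on the order of 24 million reads; a shorter effective read length after trimming would push that number up.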

  4. Why Sequence the Horse? • Allows scientists to study diseases that primarily affect horses, such as glanders • SNP information can be used to connect DNA to physical characteristics and explain differences between breeds • Because very few large mammals have been sequenced, the horse genome can tell us a great deal about mammals in general

  5. How the Horse Genome Affects Us • There are over 80 known genetic conditions in the horse that are analogous to human disorders • Horses have some conditions traditionally found in humans such as allergies and arthritis • Having the complete horse genome helps infer the order of evolution • Horse Racing?

  6. Project Proposal • Reassemble the horse genome using the Celera Assembler • Use existing UMD software to compare my assembly with the Broad assembly and produce a reconciled horse genome • Deposit the improved assembly in GenBank • Advisor: Jim Yorke

  7. Introduction to Genome Sequencing [Figure: DNA target sample → shear → size select (e.g., 10 Kbp ± 8% std. dev.) → ligate & clone into vector → sequence 750 bp end reads (mates) from the primer] Slide courtesy of Art Delcher

  8. How Genomes are Assembled • Trim the Reads → Calculate Overlaps → Build Unitigs → Build Contigs → Build Scaffolds → Closure

  9. Assembly: Calculating Overlaps • Compare every pair of reads to find every overlap of at least a certain length (~40 bp) • Each pair of reads must be compared in both forward and reverse orientation • Comparisons must take into account the possibility of sequencing errors and use alignment algorithms such as Smith-Waterman [Figure: the 5'/3' orientations in which Read A and Read B can overlap]
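The overlap step above can be sketched as a brute-force suffix-prefix comparison. This is a deliberately simplified stand-in: it counts mismatches only (no insertions or deletions), whereas the slide names Smith-Waterman, which also handles gaps. The function names are hypothetical, and the 40 bp and 6% defaults are borrowed from the Celera Assembler slide.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def suffix_prefix_overlap(a: str, b: str, min_len: int, max_error: float) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with at most `max_error` fraction of mismatches; 0 if none reaches min_len.
    Assumes min_len >= 1. Mismatch-only: this is NOT Smith-Waterman."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_error * length:
            return length
    return 0


def best_overlap(a: str, b: str, min_len: int = 40, max_error: float = 0.06):
    """Try read b in forward and reverse-complement orientation; keep the longer overlap."""
    fwd = suffix_prefix_overlap(a, b, min_len, max_error)
    rev = suffix_prefix_overlap(a, reverse_complement(b), min_len, max_error)
    return (fwd, "forward") if fwd >= rev else (rev, "reverse")


# Toy reads, far shorter than real ones, so min_len is lowered for illustration.
read_a = "ACGTACGTTTGACC"
read_b = "TTGACCGGA"
print(best_overlap(read_a, read_b, min_len=4, max_error=0.0))
```

Comparing every pair this way is quadratic in the number of reads, which is why real assemblers first narrow the candidates with k-mer seeds, as the Arachne and Celera slides describe.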

  10. Assembly: Creating Unitigs • A unitig is a set of reads that have been linked together based on overlaps • A unitig has no ambiguities: its reads fit together in only one way [Figure: reads aligned beneath the unitig they form]

  11. Assembly: Creating Unitigs (cont.) • Best Buddy Algorithm for Unitig Assembly: if read A's longest overlap is with read B and read B's longest overlap is with read A, then reads A and B are best buddies [Figure: two arrangements of reads A, B, C, and D; in the first, Read A and Read B are best buddies; in the second, Read A and Read B are NOT best buddies]
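The best-buddy rule can be sketched in a few lines, assuming overlap lengths between read pairs have already been computed. This follows the slide's per-read formulation; the function name and the overlap lengths in the example are hypothetical, and ties are broken by whichever overlap is seen first.

```python
def best_buddies(overlaps):
    """Return the set of read pairs that are each other's longest overlap.

    `overlaps` maps an unordered pair (read_x, read_y) to an overlap length.
    """
    best = {}  # read -> (longest overlap length, partner read)
    for (x, y), length in overlaps.items():
        for read, partner in ((x, y), (y, x)):
            if read not in best or length > best[read][0]:
                best[read] = (length, partner)
    # A pair qualifies only when the preference is mutual.
    return {
        frozenset((read, partner))
        for read, (_, partner) in best.items()
        if best[partner][1] == read
    }


# Hypothetical overlap lengths in bp for four reads.
overlaps = {("A", "B"): 300, ("B", "C"): 120, ("C", "D"): 200}
for pair in sorted(sorted(p) for p in best_buddies(overlaps)):
    print(pair)
```

Here B also overlaps C, but since B prefers A and C prefers D, only A-B and C-D are linked; the weaker B-C overlap is left for later stages to resolve.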

  12. Assembly: Creating Contigs • A contig is a set of overlapping unitigs • Contigs are assembled by using mate pair information • Since we know the distance between mates and the orientation of the mates, we can infer the placement of the unitigs [Figure: Unitig A and Unitig B linked by Read 1 and Read 2, which are mates]

  13. Assembly: Building Scaffolds • Scaffolds are built from contigs • The orientation and approximate distances between contigs are inferred from mate pair information • When possible, the gaps between contigs are filled in with leftover sequence [Figure: a scaffold made of Contig A and Contig B, each built from reads]
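The distance inference above comes down to subtraction: if a clone of known insert size has one mate in Contig A and the other in Contig B, whatever part of the insert is not covered by either contig must lie in the gap. The sketch below assumes Contig A precedes Contig B, both in forward orientation, and averages over several mate pairs. The function name, coordinates, and 10 Kbp insert size are illustrative assumptions.

```python
from statistics import mean


def estimate_gap(len_a: int, links, insert_size: int) -> float:
    """Estimate the gap between contig A and contig B from mate-pair links.

    Each link is (start_in_a, end_in_b): the insert starts at offset start_in_a
    in contig A and ends at offset end_in_b in contig B, so
        insert_size = (len_a - start_in_a) + gap + end_in_b.
    A negative estimate would suggest the contigs actually overlap.
    """
    return mean(insert_size - (len_a - start) - end for start, end in links)


# Hypothetical example: a 50 Kbp contig A and two mate pairs from 10 Kbp inserts.
gap = estimate_gap(len_a=50_000, links=[(46_000, 3_000), (47_500, 4_400)], insert_size=10_000)
print(f"estimated gap: {gap:.0f} bp")
```

Because insert sizes vary (the sequencing slide quotes ± 8%), a single link gives only a rough estimate; averaging several links tightens it.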

  14. Arachne Assembler • 24-mer indexing • Any two reads that share at least one 24-mer are paired • Each pair is scored • Contigs are created by merging paired pairs • Repeat regions are avoided during contig assembly but used during scaffold assembly • Subreads are placed after scaffold assembly [Photo: Serafim Batzoglou, Arachne author]
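The first two bullets describe a k-mer index: rather than comparing every read against every other, record which reads contain each k-mer and pair only reads that share one. A minimal sketch of that idea follows; it is not Arachne's implementation, it ignores reverse complements and repeat handling, and `candidate_pairs` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=24):
    """Pair up reads that share at least one exact k-mer.

    `reads` maps read name -> sequence. Forward strand only.
    """
    index = defaultdict(set)  # k-mer -> names of reads containing it
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    pairs = set()
    for names in index.values():
        # Every two reads listed under the same k-mer become a candidate pair.
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Toy reads with k lowered to 4 so the example fits on a line.
reads = {"r1": "ACGTACGA", "r2": "TACGATTT", "r3": "GGGGCCCC"}
print(candidate_pairs(reads, k=4))
```

Only the pairs returned here would go on to the scoring step, which keeps the expensive alignment work away from reads that have nothing in common.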

  15. Celera Assembler • Find overlaps of at least 40 bp with less than 6% error • Overlaps are found using 22-mers • After overlaps are calculated, Celera Assembler performs error correction using a voting algorithm • Contigs are assembled using the best buddy algorithm • Scaffolds are assembled from mate pair information • Scaffold gaps are filled when possible [Photo: Gene Myers, former vice president of Celera Genomics]
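The slide does not spell out the voting rules, so the following is only a generic illustration of vote-based error correction, not Celera Assembler's actual algorithm. It assumes gap-free alignments: each overlapping read is given with the offset at which it lines up against the read being corrected, and a base is replaced when at least `min_votes` overlapping reads agree on a different base and outvote the original. All names and thresholds are hypothetical.

```python
from collections import Counter


def correct_read(read: str, aligned, min_votes: int = 2) -> str:
    """Correct `read` by per-position majority vote of overlapping reads.

    `aligned` is a list of (offset, seq) where seq[0] lines up with
    read[offset]; offset may be negative. Gap-free alignment is ASSUMED.
    """
    corrected = []
    for i, base in enumerate(read):
        votes = Counter()
        for offset, seq in aligned:
            j = i - offset  # position in the overlapping read
            if 0 <= j < len(seq):
                votes[seq[j]] += 1
        if votes:
            top, count = votes.most_common(1)[0]
            # Replace only when the winner differs, has enough support,
            # and beats the votes for the original base.
            if top != base and count >= min_votes and count > votes[base]:
                base = top
        corrected.append(base)
    return "".join(corrected)


# The read has a T where three overlapping reads all show an A.
noisy = "ACGTTCGA"
neighbors = [(0, "ACGTACGA"), (2, "GTACGA"), (-2, "TTACGTAC")]
print(correct_read(noisy, neighbors))
```

Requiring several agreeing votes keeps a single erroneous neighbor from introducing a new error, at the cost of leaving low-coverage positions uncorrected.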

  16. Project Expectations • Fall 2007: Produce Celera Assembly • Spring 2008: Produce Reconciled Assembly • General Goals: Tackle the unexpected problems that accompany genome assembly; document my work; validate my work wherever possible

  17. Validation • Genome assemblies are not perfect • I plan to validate my assembly by comparing it to the current draft • I expect a difference of about 1.5% between the Celera assembly and the Broad assembly • I will use MUMmer to measure similarity between genomes

  18. MUMmer • MUMmer is a piece of software created by CBCB that is used to compare genomes • MUMmer locates strings of at least 18 bp that are present in both genomes • Plotting the results makes it easy to see insertions, deletions, inversions, etc. [Graphs courtesy of Adam Phillippy]
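To make the matching step concrete, here is a naive sketch that finds exact shared strings between two sequences by seeding on k-mers that occur once in each sequence and extending each seed in both directions. It is only an illustration of the idea: MUMmer itself uses a suffix-tree-based index and a stricter definition of unique matches, and this version checks the forward strand only. Function names are hypothetical; the default of 18 comes from the slide.

```python
from collections import defaultdict


def unique_kmer_positions(seq: str, k: int):
    """Map each k-mer that occurs exactly once in `seq` to its position."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos[0] for kmer, pos in positions.items() if len(pos) == 1}


def exact_matches(ref: str, qry: str, min_len: int = 18):
    """Return sorted (ref_start, qry_start, length) exact matches of >= min_len."""
    ref_unique = unique_kmer_positions(ref, min_len)
    qry_unique = unique_kmer_positions(qry, min_len)
    matches = set()
    for kmer, r in ref_unique.items():
        q = qry_unique.get(kmer)
        if q is None:
            continue
        # Extend the seed to the left as far as the sequences agree.
        while r > 0 and q > 0 and ref[r - 1] == qry[q - 1]:
            r -= 1
            q -= 1
        # Then measure the full match length from the new start.
        length = 0
        while r + length < len(ref) and q + length < len(qry) and ref[r + length] == qry[q + length]:
            length += 1
        matches.add((r, q, length))
    return sorted(matches)


# Toy sequences with min_len lowered to 4 for illustration.
print(exact_matches("AAACGTGCATTT", "CCACGTGCAGG", min_len=4))
```

Plotting each match as a diagonal segment from (ref_start, qry_start) gives the kind of dot plot the slide refers to: a shift between consecutive diagonals indicates an insertion or deletion.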

  19. Implementation Details • I plan to use the Genome cluster at the University of Maryland to produce my assembly • Much of my project will use existing software • I intend to use Perl to write any additional scripts that may be needed

  20. Time Permitting • The University of Maryland has recently produced a lot of software for the genome assembly pipeline, much of which has not been tested on large genomes • I hope to use programs like the UMD overlapper and Figaro to see how these programs affect my assembly [Photos: Mihai Pop, James White]

  21. Acknowledgements • James Yorke, Aleksey Zimin, and the Genome Group for advising me on the nature of this project • Steven Salzberg, Art Delcher, and Adam Phillippy for giving lectures and producing slides on genome assembly topics • Gene Myers's paper on Drosophila • Serafim Batzoglou's paper on Arachne • Wikipedia
