The Human Reference Assembly

The Human Reference Assembly: how the sequence is made. Deanna M. Church, Staff Scientist, NCBI. Short Course in Medical Genetics 2013. @deannachurch. Valerie Schneider, NCBI. http://genomereference.org. HGP goals: throughput 500 Mb/year; cost < $0.25 per base.



Presentation Transcript


  1. The Human Reference Assembly: how the sequence is made. Deanna M. Church, Staff Scientist, NCBI. Short Course in Medical Genetics 2013. @deannachurch

  2. Valerie Schneider, NCBI http://genomereference.org

  3. HGP goals. Throughput: 500 Mb/year. Cost: < $0.25 per base. Variation: 100,000 SNPs mapped. (Collins FS et al., 1998)

  4. 1999 2000 2005 2011 2010 Steve Sherry, NCBI

  5. APOBEC cluster. Black: deletion; white: insertion. (Kidd et al., 2007)

  6. Reference assembly history Genome Research, May, 1997

  7. Reference assembly history. WGS: Sanger reads. Restrict and make libraries (2, 4, 8, 10, 40, 150 kb). End-sequence all clones and retain pairing information (“mate-pairs”); each end sequence is referred to as a read. Find sequence overlaps. [Figure labels: tails, WGS contig, scaffold]
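The "find sequence overlaps" step can be illustrated with a toy sketch. It assumes error-free reads on the same strand and merges them greedily on the longest suffix-prefix match; real assemblers also handle sequencing errors, reverse complements, repeats, and mate-pair constraints. The function names and the three reads (cut from the short sequence shown on slide 8) are made up for illustration.

```python
from typing import Optional


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge_reads(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Join `b` onto the end of `a` if they overlap, else return None."""
    k = suffix_prefix_overlap(a, b, min_len)
    return a + b[k:] if k else None


# Hypothetical reads, already in left-to-right order (an assumption).
reads = ["ATTTTCCCTTCTG", "TTCTGAAATGATG", "TGATGAAAGAGTC"]

contig = reads[0]
for read in reads[1:]:
    merged = merge_reads(contig, read)
    if merged is None:
        break  # no overlap: a real assembler would start a new contig here
    contig = merged

print(contig)
```

Each pair of neighbouring reads here shares five bases, so the three reads collapse into a single contig.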

  8. Reference assembly history A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

  9. Reference assembly history Schatz et al, 2010

  10. Reference assembly history. Clone based assemblies: shotgun sequence, then assemble. Deeper sequence coverage rarely resolves all gaps; “finishers” go in to manually fill the gaps, often by PCR. [Figure labels: BAC insert, BAC vector, fold sequence, gaps]
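The slide gives no formula for how gaps depend on coverage. As a rough illustration only, the sketch below uses the idealized Lander-Waterman model, which is an assumption brought in here: reads of uniform length land uniformly at random and any overlap is detectable. Under that model the uncovered fraction is e^(-c) and the expected number of contigs is N·e^(-c), where c is the fold coverage and N the number of reads. The function names and the example numbers (a 150 kb clone, 500-base reads) are hypothetical.

```python
import math


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of the target not covered by any read (idealized)."""
    return math.exp(-coverage)


def expected_contigs(target_len: float, read_len: float, coverage: float) -> float:
    """Expected number of contigs, roughly the number of gaps to close."""
    n_reads = coverage * target_len / read_len
    return n_reads * math.exp(-coverage)


# Hypothetical example: one 150 kb BAC insert sequenced with 500-base reads.
for c in (2, 4, 8, 10):
    print(c, round(fraction_uncovered(c), 5), round(expected_contigs(150_000, 500, c), 2))
```

Even this best-case model only approaches zero gaps asymptotically. Real projects also face repeats and cloning bias, which the model ignores, so it says nothing about the gaps that finishers close by hand.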

  11. Reference assembly history. Build sequence contigs based on contigs defined in the TPF: check for orientation consistencies; select switch points; instantiate sequence for further analysis. [Figure labels: switch point, consensus sequence]

  12. NCBI36

  13. http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321

  14. Reference assembly history. Ideally… [Figure: columns of ordered markers A through O compared against a non-sequence based map, with one marker labelled “N (flip)”]

  15. Reference assembly history. More like… [Figure: the A through O marker columns again, now with extra markers V through Z and a “?” among them]

  16. Sequence vs. non-sequence based maps: Mmu7. [Figure columns: WI Genetic, WI/MRC RH]

  17. An assembly is a MODEL of the genome

  18. Reference assembly history. Fragmented genomes tend to have fewer frame shifts. Alexander Souvorov, NCBI

  19. Reference assembly history Fragmented genomes tend to have more partial models Alexander Souvorov, NCBI

  20. Reference assembly history

  21. Reference assembly history. [Figure: enrichment, observed vs. expected, across human PANTHER classifications (biological process)] Evan Eichler, University of Washington

  22. Center sequence distribution: NCBI36

  23. Finding the data

  24. Church et al., 2011 PLoS http://genomereference.org

  25. Finding the data Issue tracking system (JIRA) publicly available

  26. Finding the data

  27. Finding the data HG-110 AC021180.6 AC149643.1

  28. Finding the data

  29. Putting the genome together Tiling Path File (TPF)

  30. Putting the genome together http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/overlap/

  31. Putting the genome together. Legend: serious alignment problem requiring review; minor alignment problem; excellent alignment; certificate submitted, not yet approved; certificate submitted and approved; join not evaluated; valid, contained clones.

  32. Putting the genome together

  33. Putting the genome together

  34. Putting the genome together. AGP: A Golden Path. Provides instructions for building a sequence • Defines component sequences used to build scaffolds/chromosomes • Switch points • Defines gaps and types. GRC produces: GenBank components -> scaffolds; GenBank components -> chromosome; scaffolds -> chromosome.
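How an AGP file "provides instructions for building a sequence" can be sketched as follows. The column layout (object, start, end, part number, component type, then either component id/begin/end/orientation or gap length/type/linkage/evidence) follows the public NCBI AGP specification. The object name `chrToy`, the component ids `cloneA.1` and `cloneB.1`, and their sequences are invented for the example. This is a minimal sketch, not the GRC's actual tooling.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def build_from_agp(agp_text: str, components: dict) -> dict:
    """Instantiate each AGP object by concatenating its parts in file order."""
    parts = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split("\t")
        obj, comp_type = cols[0], cols[4]
        if comp_type in ("N", "U"):
            piece = "N" * int(cols[5])  # gap rows: column 6 is the gap length
        else:
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            # AGP coordinates are 1-based and inclusive; using only part of a
            # component is how a switch point is expressed.
            piece = components[comp_id][beg - 1:end]
            if orient == "-":
                piece = reverse_complement(piece)
        parts.setdefault(obj, []).append(piece)
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Hypothetical components and a three-row AGP (component, gap, component).
components = {"cloneA.1": "ACGTACGT", "cloneB.1": "TTGGCCAA"}
rows = [
    ("chrToy", "1", "8", "1", "F", "cloneA.1", "1", "8", "+"),
    ("chrToy", "9", "13", "2", "N", "5", "contig", "no", "na"),
    ("chrToy", "14", "19", "3", "F", "cloneB.1", "3", "8", "-"),
]
agp_text = "\n".join("\t".join(row) for row in rows)

built = build_from_agp(agp_text, components)
print(built["chrToy"])
```

The third row takes bases 3 to 8 of the second clone on the minus strand, so the sketch reverse-complements that slice before appending it after the five-base gap.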

  35. Putting the genome together

  36. Large-scale variation complicates genome assembly. Sequences from haplotype 1; sequences from haplotype 2. Old assembly model: compress into a consensus. New assembly model: represent both haplotypes.

  37. GRCh37 (hg19). Alternate loci: MAPT, UGT2B17, MHC (7 alternate haplotypes at the MHC). Alternate loci released as: FASTA; AGP; alignment to chromosome. http://genomereference.org

  38. Data model. Assembly (e.g. GRCh37): Primary Assembly; ALT 1 through ALT 9; non-nuclear assembly unit (e.g. MT). Labelled regions: PAR; genomic regions UGT2B17, MAPT, MHC.
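One way to picture this data model in code is a small set of dataclasses. The class and field names are hypothetical, and the assignment of loci to ALT units below is made up for the example: the slides say only that there are nine ALT units and that seven alternate haplotypes sit at the MHC. Treat it as a sketch of the shape of the model, not the GRC's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssemblyUnit:
    name: str
    scaffolds: List[str] = field(default_factory=list)


@dataclass
class GenomicRegion:
    name: str
    # ALT unit name -> alternate scaffold representing this region there
    alt_scaffolds: Dict[str, str] = field(default_factory=dict)


@dataclass
class Assembly:
    name: str
    primary: AssemblyUnit
    alt_units: List[AssemblyUnit]
    non_nuclear: AssemblyUnit
    regions: List[GenomicRegion]

    def representations(self, region_name: str) -> int:
        """Primary copy plus every alternate scaffold for the region."""
        region = next(r for r in self.regions if r.name == region_name)
        return 1 + len(region.alt_scaffolds)


# Illustrative placement only: ALT 1-7 hold MHC, ALT 8 MAPT, ALT 9 UGT2B17.
region_of_unit = ["MHC"] * 7 + ["MAPT", "UGT2B17"]
regions = {name: GenomicRegion(name) for name in ("MHC", "MAPT", "UGT2B17")}
alt_units = []
for i, region_name in enumerate(region_of_unit, start=1):
    scaffold = f"{region_name}_alt_{i}"  # hypothetical scaffold name
    alt_units.append(AssemblyUnit(f"ALT {i}", [scaffold]))
    regions[region_name].alt_scaffolds[f"ALT {i}"] = scaffold

grch37 = Assembly(
    name="GRCh37",
    primary=AssemblyUnit("Primary Assembly"),
    alt_units=alt_units,
    non_nuclear=AssemblyUnit("non-nuclear (e.g. MT)"),
    regions=list(regions.values()),
)
print(grch37.representations("MHC"))
```

Keeping regions separate from assembly units lets one region point at several alternate scaffolds while the primary assembly stays a single path.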

  39. Take home messages • Assemblies are not genomes, they are models of genomes • All eukaryotic assemblies have some issues • Mis-assemblies • Missing variation • Assembly evidence is important • Assemblies are not static (if you are lucky!)
