The Human Reference Assembly How the sequence is made Deanna M. Church Staff Scientist, NCBI Short Course in Medical Genetics 2013 @deannachurch
Valerie Schneider, NCBI http://genomereference.org
HGP Goals Throughput: 500 Mb/year Cost: < $0.25 per base Variation: 100,000 SNPs mapped Collins FS et al, 1998
1999 2000 2005 2011 2010 Steve Sherry, NCBI
BLACK: Deletion White: Insertion Kidd et al, 2007 APOBEC cluster
Reference assembly history Genome Research, May, 1997
Reference assembly history WGS: Sanger Reads • Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb • End-sequence all clones and retain pairing information (“mate-pairs”) • Each end sequence is referred to as a read • Find sequence overlaps Figure labels: tails, WGS contig, Scaffold
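As a rough illustration of the "find sequence overlaps" step, the sketch below looks for the longest exact suffix-to-prefix overlap between two reads and merges them into a contig. The function names, the minimum overlap length and the toy reads are assumptions made for this example. Real assemblers tolerate sequencing errors and use mate-pair information to order contigs into scaffolds, which this exact-match sketch does not attempt.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads on their overlap, or return None if they do not overlap."""
    k = longest_overlap(left, right, min_len)
    if k == 0:
        return None
    return left + right[k:]


# Toy reads (invented for illustration).
read_a = "ATTTTCCCTTCTGAAATG"
read_b = "GAAATGATGAAAGAGTC"

contig = merge_reads(read_a, read_b)
print(longest_overlap(read_a, read_b), contig)
```

Checking the longest overlap first keeps the sketch short; it is quadratic in read length and only suitable for toy input.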
Reference assembly history A T T T T C C C T T C T G A A A T G A T G A A A G A G T C
Reference assembly history Schatz et al, 2010
Reference assembly history Clone based assemblies • Shotgun sequence • Assemble • Deeper sequence coverage rarely resolves all gaps • “Finishers” go in to manually fill the gaps, often by PCR Figure labels: BAC insert, BAC vector, Fold sequence, Gaps
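One way to get a feel for coverage and gaps is the textbook Lander-Waterman idealization, which is brought in here as an outside model and is not something the slides present. It assumes reads land uniformly at random, so the expected fraction of uncovered bases at c-fold coverage is exp(-c). The insert size and read length below are assumed values for illustration only. Gaps in real projects often come from repeats or cloning bias, which this model ignores; that is one reason extra depth alone may not close them.

```python
import math


def uncovered_fraction(coverage: float) -> float:
    """Expected fraction of bases hit by no read at the given fold coverage."""
    return math.exp(-coverage)


def expected_contigs(num_reads: int, coverage: float) -> float:
    """Expected number of contigs ("islands") for num_reads random reads.

    The number of gaps between contigs is roughly the same count.
    """
    return num_reads * math.exp(-coverage)


INSERT_LEN = 150_000  # assumed BAC insert size, in bases
READ_LEN = 500        # assumed Sanger read length, in bases

for fold in (1, 2, 4, 8, 10):
    # Number of reads needed to reach this fold coverage of one insert.
    num_reads = fold * INSERT_LEN // READ_LEN
    print(f"{fold:>2}x  uncovered={uncovered_fraction(fold):.5f}  "
          f"expected contigs={expected_contigs(num_reads, fold):.2f}")
```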
Reference assembly history Build sequence contigs based on contigs defined in TPF. Check for orientation consistencies Select switch points Instantiate sequence for further analysis Switch point Consensus sequence
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321
A B C F F D G G E H H F K K G L L H A A I B B J C C K D D L M N O O O N (flip) N Reference assembly history Ideally… Non-sequence based Map
A A A A B B B B C C C D D Y Z E Y F X ? G W H H H H I I I J J J J V K L L L M M M M N N N N O O O O Reference assembly history More like…
WI Genetic WI/MRC RH Sequence vs. Non-sequence based maps Mmu7
Reference assembly history Fragmented genomes tend to have fewer frame shifts Alexander Souvorov, NCBI
Reference assembly history Fragmented genomes tend to have more partial models Alexander Souvorov, NCBI
Reference assembly history [Figure: Human PANTHER classifications (biological process); enrichment, observed vs. expected, per protein class] Evan Eichler, University of Washington
Church et al., 2011 PLoS http://genomereference.org
Finding the data Issue tracking system (JIRA) publicly available
Finding the data HG-110 AC021180.6 AC149643.1
Putting the genome together Tiling Path File (TPF)
Putting the genome together http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/overlap/
Putting the genome together Serious alignment problem requiring review Minor alignment problem Excellent alignment Certificate submitted, not yet approved Certificate submitted and approved Join not evaluated Valid, contained clones
Putting the genome together AGP: A Golden Path Provides instructions for building a sequence • Defines component sequences used to build scaffolds/chromosomes • Switch points • Defines gaps and types GRC produces: GenBank components->Scaffolds, GenBank components->Chromosome, Scaffolds->Chromosome
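The sketch below shows how an AGP file can act as building instructions. It reads tab-separated AGP lines, takes the stated slice of each component, reverse-complements it when the orientation is "-", and writes runs of N for gap lines. The object name, clone names, coordinates and sequences are all invented for this example, and the parser covers only the columns used here, so treat it as a reading aid and not a validator.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def build_from_agp(agp_text: str, components: dict) -> dict:
    """Return {object name: sequence} built from AGP lines."""
    built = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        obj, obj_beg, obj_end, _part, ctype = fields[:5]
        if ctype in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            piece = "N" * int(fields[5])
        else:
            comp_id, beg, end, orient = fields[5:9]
            # Component coordinates are 1-based and inclusive.
            piece = components[comp_id][int(beg) - 1:int(end)]
            if orient == "-":
                piece = reverse_complement(piece)
        if len(piece) != int(obj_end) - int(obj_beg) + 1:
            raise ValueError(f"length mismatch on line: {line!r}")
        built[obj] = built.get(obj, "") + piece
    return built


# Toy input: every name, coordinate and base here is invented.
COMPONENTS = {"CLONE_A.1": "ACGTACGT", "CLONE_B.1": "TTGACCGG"}
AGP_TEXT = (
    "# toy AGP\n"
    "chrToy\t1\t8\t1\tF\tCLONE_A.1\t1\t8\t+\n"
    "chrToy\t9\t13\t2\tN\t5\tscaffold\tyes\tpaired-ends\n"
    "chrToy\t14\t19\t3\tF\tCLONE_B.1\t3\t8\t-\n"
)

print(build_from_agp(AGP_TEXT, COMPONENTS))
```

Starting CLONE_B.1 at base 3 instead of base 1 is how a switch point shows up in the file: the path begins using the clone partway through.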
Large-Scale Variation Complicates Genome Assembly Sequences from haplotype 1 Sequences from haplotype 2 Old Assembly model: compress into a consensus New Assembly model: represent both haplotypes
GRCh37 (hg19) MAPT UGT2B17 MHC 7 alternate haplotypes at the MHC Alternate loci released as: FASTA, AGP, alignment to chromosome http://genomereference.org
ALT 1 Data Model Non-nuclear assembly unit (e.g. MT) Assembly (e.g. GRCh37) ALT 2 PAR Primary Assembly ALT 6 ALT 3 Genomic Region (UGT2B17) Genomic Region (MAPT) Genomic Region (MHC) ALT 7 ALT 4 ALT 8 ALT 5 ALT 9
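The data model on this slide can be sketched as a small set of record types: an assembly holds a primary assembly unit, non-nuclear units, and genomic regions, and each region points at the alternate-locus units that cover it. The class and field names are hypothetical, PAR handling is omitted, and the assignment of particular ALT numbers to particular regions is an assumption made for illustration (the slides state only that the MHC has 7 alternate haplotypes).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GenomicRegion:
    name: str
    alt_units: List[str] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    primary_unit: str
    non_nuclear_units: List[str]
    regions: List[GenomicRegion]

    def alt_units(self) -> List[str]:
        """All alternate-locus units, across every region."""
        return [unit for region in self.regions for unit in region.alt_units]

    def region_of(self, alt_unit: str) -> Optional[str]:
        """Name of the region an alternate-locus unit belongs to, if any."""
        for region in self.regions:
            if alt_unit in region.alt_units:
                return region.name
        return None


grch37 = Assembly(
    name="GRCh37",
    primary_unit="Primary Assembly",
    non_nuclear_units=["MT"],
    regions=[
        # ALT-to-region numbering below is assumed, not taken from the slides.
        GenomicRegion("MHC", [f"ALT {i}" for i in range(1, 8)]),
        GenomicRegion("UGT2B17", ["ALT 8"]),
        GenomicRegion("MAPT", ["ALT 9"]),
    ],
)

print(len(grch37.alt_units()), grch37.region_of("ALT 3"))
```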
Take home messages • Assemblies are not genomes, they are models of genomes • All eukaryotic assemblies have some issues • Mis-assemblies • Missing variation • Assembly evidence is important • Assemblies are not static (if you are lucky!)