Genome Annotation

Genome Annotation. BCB 660, October 20, 2011. From Carson Holt. Annotations: automated; ab initio (based on genomic sequence alone); involves comparisons to known proteins (BLAST similarity); sequence motifs such as start/stop codons and intron/exon boundaries; evidence-based (ESTs).

kapila

Presentation Transcript


  1. Genome Annotation BCB 660 October 20, 2011

  2. From Carson Holt

  3. Annotations
     • Automated
       • Ab initio (based on genomic sequence alone)
         • Involves comparisons to known proteins (BLAST similarity)
         • Sequence motifs such as start/stop codons, intron/exon boundaries
       • Evidence-based (ESTs)
         • Involves alignment of experimental EST (cDNA) data to a gene prediction
     • Manual
       • Manual curation of genes predicted automatically
       • Check gene structure, presence of conserved domains, match of ESTs to gene prediction
       • Align to related genes/proteins and look for oddities (missing exons, early stop codons, etc.)
       • Annotation can then be manually edited
       • May also involve assigning function (based on sequence similarity, conserved domains) via Gene Ontology
     • Structural: exons, introns, UTRs, splice forms, etc.
     • Functional: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.
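
To make the "sequence motifs such as start/stop codons" idea concrete, here is a minimal Python sketch that scans the forward strand for open reading frames. It is only an illustration of motif-based scanning, not how any real gene predictor works (those use trained statistical models and also handle introns and the reverse strand). The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
# Standard stop codons and the canonical start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, 0-based half-open, for forward-strand ORFs.

    An ORF here runs from the first ATG in a reading frame to the next
    in-frame stop codon (inclusive). min_codons counts start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open start codon, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next start codon in this frame
    return sorted(orfs)


if __name__ == "__main__":
    example = "CCATGAAATAGCC"
    for start, end in find_orfs(example):
        print(start, end, example[start:end])
```

In eukaryotic genomes a scan like this is only a starting point, because coding sequence is interrupted by introns; that is why the slide lists intron/exon boundaries alongside start and stop codons.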

  4. Classic strategy
     • Combine ab initio and evidence-based gene predictors to produce a consensus predicted gene set
     • Ask the community to pitch in and manually annotate as many genes as possible
     • Leads to great variability in the quality of different genome annotations, and often to many versions of official gene sets

  5. NGS and the future of genome annotation
     • In 2010, 1300 eukaryotic genome projects were underway. Assuming 10,000 genes per genome, 13,000,000 new annotations will be needed, so quality control and maintenance become an issue
     • Some organizations are dedicated to genome annotation (e.g. ENSEMBL and VectorBase), but covering 1300 genomes that way will not be feasible
     • Need for high-quality, automated annotation pipelines that are easy to use by small research groups without extensive bioinformatics expertise
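
The workload estimate on this slide is a single multiplication, spelled out below in Python so the assumption (10,000 genes per genome) is easy to vary. The function name `annotation_workload` is hypothetical.

```python
def annotation_workload(genome_projects, genes_per_genome):
    """Estimate the number of gene annotations needed across all projects."""
    return genome_projects * genes_per_genome


if __name__ == "__main__":
    # Figures from the slide: 1300 projects, assumed 10,000 genes per genome.
    total = annotation_workload(1300, 10_000)
    print(f"{total:,} annotations")
```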

  6. MAKER Pipeline: Especially effective for Emerging Eukaryote Model Organisms
     • Incorporates ab initio and evidence-based gene predictors
     • Gene predictions are run a first time
     • Then a small subset of the genome assembly is used to train gene predictors (building genome-specific HMMs)
     • Then trained gene predictors are run again on whole genome
     • ** Really nice if you don’t have a basis to start from (e.g. de novo gene prediction)

  7. What does MAKER do?
     • Identifies and masks out repeat elements
     • Aligns ESTs to the genome
     • Aligns proteins to the genome
     • Produces ab initio gene predictions
     • Synthesizes these data into final annotations
     • Produces evidence-based quality values for downstream annotation management

  8. MAKER Steps involved
     1. Compute phase
        • RepeatMasker
        • BLAST
        • Exonerate
        • SNAP (and other gene predictors)
     2. Filter/cluster phase
        • Identify/remove marginal predictions and alignments based on quality scores/cutoffs, etc.
        • Cluster to identify overlapping alignments/predictions, to remove redundancy and assess weight of evidence
     3. Polish
        • Realigns BLAST hits to obtain greater precision at exon boundaries (Exonerate)
     4. Synthesis
        • Collect evidence for each annotation, using EST evidence
        • Evidence scores plus sequences (genomic, EST, coding, intron) passed to SNAP
        • SNAP then uses this evidence to retrain and alter its internal HMM
     5. Annotate
        • Post-processing of SNAP prediction, recombine with evidence to generate complete annotations
        • Output is a GFF3 annotation that can be imported into genome browsers
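
The filter/cluster phase can be pictured with a small Python sketch: drop hits below a score cutoff, then group the survivors whose genomic coordinates overlap. This is a generic interval-clustering illustration under simple assumptions (one sequence, one strand, 1-based inclusive coordinates), not MAKER's actual implementation. The `Hit` type, the function names, and the cutoff value are all hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment or prediction on a single sequence (1-based, inclusive)."""
    name: str
    start: int
    end: int
    score: float


def filter_hits(hits, min_score):
    """Keep only hits whose score meets the cutoff."""
    return [h for h in hits if h.score >= min_score]


def cluster_overlapping(hits):
    """Group hits into clusters of mutually chained overlapping intervals."""
    clusters = []
    cluster_end = None  # rightmost coordinate covered by the current cluster
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if clusters and hit.start <= cluster_end:
            clusters[-1].append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            clusters.append([hit])
            cluster_end = hit.end
    return clusters


if __name__ == "__main__":
    hits = [
        Hit("est1", 100, 200, 0.90),
        Hit("prot1", 150, 300, 0.80),
        Hit("est2", 500, 600, 0.95),
        Hit("weak", 120, 180, 0.10),
    ]
    for cluster in cluster_overlapping(filter_hits(hits, min_score=0.5)):
        print([h.name for h in cluster])
```

The size of each cluster gives a crude "weight of evidence": a locus supported by several independent alignments is more believable than one supported by a single marginal hit.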

  9. Inputs to MAKER
     • Genomic sequence
     • Config files
       • External executables
       • Sequence database locations
       • Compute parameters
     • Sequence database files (choice of these turns out to be extremely important)
       • Transposons file (default plus known organism-specific)
       • RepeatMasker database file (organism-specific, optional)
       • Proteins file (known proteins from related organisms you want to align to the genome)
       • ESTs/mRNAs file (the evidence)
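
As a rough illustration of how the sequence database inputs might be tied together in a config file, the Python sketch below renders a small `key=value` fragment. The option names (`genome`, `est`, `protein`, `rmlib`) are recalled from MAKER's `maker_opts.ctl` and should be treated as an assumption to check against the control files your MAKER version generates. The file names and the `render_control_file` helper are hypothetical.

```python
def render_control_file(options):
    """Render a dict of options as key=value lines, one per line."""
    lines = [f"{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file names; option names are assumed, see the note above.
    options = {
        "genome": "assembly.fasta",            # genomic sequence
        "est": "ests.fasta",                   # ESTs/mRNAs file (the evidence)
        "protein": "related_proteins.fasta",   # proteins from related organisms
        "rmlib": "organism_repeats.fasta",     # organism-specific repeat library
    }
    print(render_control_file(options), end="")
```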

  10. MAKER Output (Apollo browser)
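
Slide 8 notes that the output is a GFF3 annotation that genome browsers can import. GFF3 is a tab-separated text format with nine columns (seqid, source, type, start, end, score, strand, phase, attributes), so a single feature line can be read with a few lines of Python. This is a minimal sketch: the example line is invented for illustration, the `parse_gff3_line` helper is hypothetical, and percent-encoded attribute values are not decoded.

```python
GFF3_COLUMNS = 9


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; coordinates become ints."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != GFF3_COLUMNS:
        raise ValueError(f"expected {GFF3_COLUMNS} tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    # Column 9 holds semicolon-separated tag=value pairs.
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


if __name__ == "__main__":
    # Invented example line, not real MAKER output.
    example = "contig1\tmaker\tgene\t1000\t2500\t.\t+\t.\tID=gene1;Name=example"
    feature = parse_gff3_line(example)
    print(feature["type"], feature["start"], feature["end"], feature["attributes"]["ID"])
```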