
Introduction to genome annotation - practical information

Understanding annotation: some possibilities and some pitfalls. Henrik Lantz, BILS/SciLifeLab. Practical info: coffee breaks, lunch, dinner at Koh Phangan 18.00.


Presentation Transcript


  1. Introduction to genome annotation - practical information. Some possibilities and some pitfalls

  2. Practical info • Coffee breaks • Lunch • Dinner at Koh Phangan 18.00

  3. Understanding annotation: some possibilities and some pitfalls. Henrik Lantz, BILS/SciLifeLab

  4. Lecture synopsis • What is annotation? • Structural genome annotation • Types of data used • Transcriptome annotation • Functional annotation

  5. What is annotation? • Identification of regions of interest in sequence data

  6. From a genome…

  7. …to an annotated gene

  8. GFF file format

  9. GFF3 file format
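
A GFF3 feature line has nine tab-separated columns: seqid, source, type, start, end, score, strand, phase and attributes. The sketch below splits one such line into named fields. The record itself is invented for illustration, and the helper name `parse_gff3_line` is an assumption, not part of any tool mentioned in the lecture.

```python
from typing import NamedTuple


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line: str) -> Gff3Feature:
    """Split one GFF3 feature line into its nine tab-separated columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    # Column 9 holds semicolon-separated tag=value pairs such as ID and Parent.
    attributes = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attributes)


# Hypothetical record, invented for illustration only.
example = "scaffold_1\tmaker\texon\t1201\t1500\t.\t+\t.\tID=exon1;Parent=mRNA1"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["Parent"])
```

The `Parent` attribute is what links an exon to its mRNA and an mRNA to its gene, so it is usually the first thing to check when a GFF3 file does not load as expected.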

  10. GTF file format

  11. GTF file format

  12. Why is annotation important? Example: differential expression [Figure: mapped reads from condition 1 and from condition 2 shown against the genome]

  13. Why is annotation important? [Figure: RNA-seq reads shown against the genome]
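
To compare expression between two conditions, mapped reads have to be assigned to genes, and that is only possible once the gene coordinates are known. A minimal sketch of that counting step follows. All gene names and coordinates are hypothetical, and real counting tools also consider strand, exon structure and multi-mapping reads, which this sketch ignores.

```python
def count_reads_per_gene(genes, reads):
    """Count reads whose mapped interval overlaps each annotated gene.

    genes: dict of gene name -> (start, end), 1-based inclusive
    reads: list of (start, end) mapped read intervals on the same sequence
    """
    counts = {name: 0 for name in genes}
    for r_start, r_end in reads:
        for name, (g_start, g_end) in genes.items():
            if r_start <= g_end and r_end >= g_start:  # intervals overlap
                counts[name] += 1
    return counts


# Hypothetical annotation and mapped reads, for illustration only.
genes = {"geneA": (100, 500), "geneB": (900, 1400)}
condition_1 = [(120, 219), (300, 399), (950, 1049)]
condition_2 = [(130, 229), (1000, 1099), (1200, 1299), (1300, 1399)]

counts_1 = count_reads_per_gene(genes, condition_1)
counts_2 = count_reads_per_gene(genes, condition_2)
print(counts_1)
print(counts_2)
```

Without the `genes` dictionary, which stands in for the annotation, the reads are only pileups on the genome and there is nothing to compare between conditions.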

  14. There are two major parts of annotation • 1) Structural: Find out where the regions of interest (usually genes) are in the genome and what they look like. How many exons/introns? UTRs? Isoforms? • 2) Functional: Find out what the regions do. What do they code for?

  15. Open reading frames
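
The simplest structural signal is an open reading frame: an ATG followed in the same frame by a stop codon. The sketch below scans the three forward frames of a short, invented sequence. It is an assumption-laden toy: it ignores the reverse strand and introns, which is part of why ORF finding alone is not enough for eukaryotic genomes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) of ATG...stop ORFs on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG in this frame opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence, for illustration only.
dna = "CCATGAAATTTGGGTAACC"
print(find_orfs(dna))
```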

  16. Difficult in practice

  17. Combine data - use Maker! • External data - proteins, RNA-seq (incl. ESTs) • Ab-initio gene finders • (Lift-overs from closely related genomes) => Combined annotation

  18. Transcriptomes are different but have their own challenges • No introns, but where are the start and stop codons? • Still needs functional annotation

  19. Assembly quality • The quality of the assembly will heavily influence the quality of the annotation • SNP-errors can change start/stop-codons • Indels can cause frame-shifts • Annotation tools often have problems with incomplete loci • And of course, if a locus is completely missing from the assembly, it cannot be annotated
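
The effect of assembly errors on a coding sequence can be shown with a toy example. The sketch below translates a short, invented coding sequence with the standard genetic code, then introduces one SNP error and one single-base deletion. The sequences and the helper name `translate` are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate from the first base until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical coding sequence: ATG GCT GAA CGT TGG AAA TAA
reference = "ATGGCTGAACGTTGGAAATAA"
# SNP error: TGG (Trp) becomes TGA, a premature stop codon.
snp_error = reference[:14] + "A" + reference[15:]
# Indel error: one base deleted, shifting every downstream codon.
indel_error = reference[:4] + reference[5:]

print(translate(reference))
print(translate(snp_error))
print(translate(indel_error))
```

The SNP truncates the protein, while the deletion changes every residue after the error and loses the stop codon, so a gene finder may extend, split or drop the gene model.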

  20. Assembly validation using CEGMA/BUSCO • CEGMA now deprecated, BUSCO actively developed • Both look for core genes; CEGMA = 248 core genes, BUSCO = gene sets for phylogenetic groups, up to 3000 genes • Both report % complete genes -> extrapolated to amount of gene space assembled

  21. BUSCO output

  22. CEGMA output

                #Prots  %Completeness  -  #Total  Average  %Ortho
      Complete     233          93.95  -     265     1.14    9.87
        Group 1     60          90.91  -      66     1.10    6.67
        Group 2     52          92.86  -      58     1.12   11.54
        Group 3     59          96.72  -      71     1.20   13.56
        Group 4     62          95.38  -      70     1.13    8.06
      Partial      238          95.97  -     277     1.16   12.18
        Group 1     62          93.94  -      69     1.11    6.45
        Group 2     54          96.43  -      61     1.13   12.96
        Group 3     60          98.36  -      75     1.25   18.33
        Group 4     62          95.38  -      72     1.16   11.29

      # These results are based on the set of genes selected by Genis Parra
      # Prots = number of 248 ultra-conserved CEGs present in genome
      # %Completeness = percentage of 248 ultra-conserved CEGs present
      # Total = total number of CEGs present including putative orthologs
      # Average = average number of orthologs per CEG
      # %Ortho = percentage of detected CEGS that have more than 1 ortholog
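
The headline numbers in the CEGMA report follow directly from the column definitions in its footer: %Completeness is the number of CEGs found divided by 248, and Average is #Total divided by #Prots. A minimal sketch reproducing the "Complete" and "Partial" rows; the function name `completeness` is an assumption made for illustration.

```python
TOTAL_CEGS = 248  # size of the CEGMA core gene set, as stated in the report footer


def completeness(found, total=TOTAL_CEGS):
    """Percentage of the core gene set detected in the assembly."""
    return round(100 * found / total, 2)


complete_pct = completeness(233)           # "Complete" row
partial_pct = completeness(238)            # "Partial" row
avg_orthologs = round(265 / 233, 2)        # #Total / #Prots for "Complete"

print(complete_pct, partial_pct, avg_orthologs)
```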

  23. Data used - Proteins

  24. Data used - Proteins • Conserved in sequence => conserved annotation with little noise • Proteins from model organisms often used => bias? • Proteins can be incomplete => problems as many annotation procedures are heavily dependent on protein alignments

      >ENSTGUP00000017616 pep:novel chromosome:taeGut3.2.4:8_random:2849599:2959678:-1 gene:ENSTGUG00000017338 transcript:ENSTGUT00000018018 gene_biotype:protein_coding transcript_biotype:protein_coding
      RSPNATEYNWHHLRYPKIPERLNPPAAAGPALSTAEGWMLPWGNGQHPLLARAPGKGRER
      DGKELIKKPKTFKFTFLKKKKKKKKKTFK
      >ENSTGUP00000017615 pep:novel chromosome:taeGut3.2.4:23_random:205321:209117:1 gene:ENSTGUG00000017337 transcript:ENSTGUT00000018017 gene_biotype:protein_coding transcript_biotype:protein_coding
      PDLRELVLMFEHLHRVRNGGFRNSEVKKWPDRSPPPYHSFTPAQKSFSLAGCSGESTKMG
      IKERMRLSSSQRQGSRGRQQHLGPPLHRSPSPEDVAEATSPTKVQKSWSFNDRTRFRASL
      RLKPRIPAEGDCPPEDSGEERSSPCDLTFEDIMPAVKTLIRAVRILKFLVAKRKFKETLR
      PYDVKDVIEQYSAGHLDMLGRIKSLQTRVEQIVGRDRALPADKKVREKGEKPALEAELVD
      ELSMMGRVVKVERQVQSIEHKLDLLLGLYSRCLRKGSANSLVLAAVRVPPGEPDVTSDYQ
      SPVEHEDISTSAQSLSISRLASTNMD
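
Both proteins shown on the slide begin with a residue other than methionine, which suggests they are incomplete at the N-terminus. One simple screen before handing a protein set to an annotation pipeline is to flag such records. The sketch below does this for the first record from the slide (header shortened) and one invented complete protein; the function names are assumptions made for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of record ID -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def lacks_start_methionine(records):
    """Return IDs of proteins that do not begin with M (possibly incomplete)."""
    return [name for name, seq in records.items() if not seq.startswith("M")]


fasta_text = """\
>ENSTGUP00000017616 pep:novel
RSPNATEYNWHHLRYPKIPERLNPPAAAGPALSTAEGWMLPWGNGQHPLLARAPGKGRER
DGKELIKKPKTFKFTFLKKKKKKKKKTFK
>hypothetical_complete invented record for illustration
MKTAYIAKQR
"""

proteins = read_fasta(fasta_text)
print(lacks_start_methionine(proteins))
```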

  25. Data used - Proteins • Maker will align proteins for you: Blast -> Exonerate • Blast is not structure aware, Exonerate is (splice sites, start/stop codons) • Preferred file-format: fasta

  26. RNA-seq [Figure: DNA with exons, introns (GT...AG splice sites) and UTRs is transcribed into pre-mRNA, spliced into mRNA with a poly-A tail, and translated from the ATG start codon to a TAG, TAA or TGA stop codon]

  27. Data used - RNA-seq • Should always be included in an annotation project • From the same organism as the genomic data => unbiased • Can be very noisy (tissue/species dependent), can include pre-mRNA • PASA, or some other filtering method, often needed

  28. Spliced reads [Figure: the same DNA -> pre-mRNA -> mRNA diagram, with exons, introns (GT...AG splice sites), UTRs, ATG start codon and TAG, TAA or TGA stop codon]

  29. RNA-seq - Spliced reads
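
In a SAM/BAM alignment, a spliced read is recognisable by an `N` operation in its CIGAR string, which marks a skipped stretch of reference, that is, an intron. The sketch below recovers intron coordinates from a read's mapping position and CIGAR string. The read and its coordinates are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(pos, cigar):
    """Return 1-based inclusive reference intervals skipped by N operations.

    pos: 1-based leftmost mapping position of the read (SAM POS field)
    """
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return introns


# Hypothetical reads: one spanning an exon-exon junction, one unspliced.
print(introns_from_cigar(1001, "50M1200N50M"))
print(introns_from_cigar(1001, "100M"))
```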

  30. Pre-mRNA

  31. Pre-mRNA [Figure: the DNA -> pre-mRNA -> mRNA diagram (exons, introns, UTRs, ATG start codon, TAG, TAA or TGA stop codon), here illustrating pre-mRNA]

  32. Pre-mRNA

  33. A lot is transcribed in a cell

  34. Stranded RNA-seq

  35. Three-prime bias in polyA-selected RNA-seq

  36. How to use RNA-seq • Maker will align transcripts (ESTs), but these need to be assembled first. • Cufflinks: mapped reads -> transcripts

  37. How to use RNA-seq • Maker will align transcripts (ESTs), but these need to be assembled first. • Cufflinks: mapped reads -> transcripts • Trinity: assembles transcripts without a genome

  38. Mapped Trinity-assembled transcripts

  39. How to use RNA-seq • Maker will align transcripts (ESTs), but these need to be assembled first. • Cufflinks: mapped reads -> transcripts • Trinity: assembles transcripts without a genome • PASA can be used to improve transcript quality

  40. Ab initio gene finders are used in Maker • Commonly used programs: Augustus, Snap, Genemark-ES, FGENESH, Genscan, Glimmer-HMM,… • They use HMM models to figure out how introns, exons, UTRs etc. are structured • These HMM models need to be trained!

  41. Liftovers are very useful for orthology determination • Kraken • Align the two genomes (Satsuma) and then transfer annotations between aligned regions
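
The "transfer annotations between aligned regions" step can be pictured as a coordinate shift inside an aligned block. The sketch below is a deliberately simplified illustration, not how Kraken or Satsuma work internally: it assumes ungapped, same-strand blocks and only transfers features that lie entirely within one block. All coordinates are hypothetical.

```python
def lift_over(feature, blocks):
    """Transfer a (start, end) feature from a source to a target genome.

    blocks: list of (src_start, src_end, tgt_start) describing ungapped,
    same-strand aligned blocks, 1-based inclusive.
    Returns None if no single block contains the whole feature.
    """
    start, end = feature
    for src_start, src_end, tgt_start in blocks:
        if src_start <= start and end <= src_end:
            offset = tgt_start - src_start
            return (start + offset, end + offset)
    return None


# Hypothetical alignment blocks between two genomes.
blocks = [(1000, 5000, 21000), (8000, 9000, 30500)]

print(lift_over((1200, 1500), blocks))  # feature inside the first block
print(lift_over((4900, 5100), blocks))  # feature runs past the block end
```

Features that cross block boundaries or fall in unaligned regions are the cases where lifted annotations most need checking against other evidence.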

  42. General recommendations • Always combine different types of evidence! • One single method is not enough! • Use Maker!

  43. Transcript annotation • Here the transcript is already defined. The challenge is to find where the coding region starts and stops • Transdecoder

  44. Transdecoder

  45. Transdecoder

  46. Or get help - NBIS assembly and annotation team • Five people working with assembly and annotation • Deliver high quality annotations • Enable visualization and manual curation through a web interface • Also available for consultation • http://nbis.se/support/supportform/index.php

  47. Biosupport.se
