1 / 58

Evaluating genes and transcripts in Ensembl

Evaluating genes and transcripts in Ensembl. Sep 2006. Outline. Ensembl gene set Pseudogenes Ensembl EST genes Ab initio predictions Manual curation (Vega, CCDS) Gene models from other groups. Overview. other groups’ models. evidence. Ensembl predictions. manual curation.

didier
Télécharger la présentation

Evaluating genes and transcripts in Ensembl

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Evaluating genes and transcripts in Ensembl Sep 2006

  2. Outline • Ensembl gene set • Pseudogenes • Ensembl EST genes • Ab initio predictions • Manual curation (Vega, CCDS) • Gene models from other groups

  3. Overview other groups’ models evidence Ensembl predictions manual curation

  4. Annotation process • Automated analysis • Repeat masking • RepeatMasker (Smit), tandem, inverted • Gene prediction • Genscan (Burge), FGENESH (Solovyev)… • Database searches • initial protein and DNA matches using BLAST • refined protein matches using GeneWise • refined EST matches using EST2GENOME, spangle • Pfam annotation using GeneWise.

  5. Human Proteins Other Proteins Human cDNAs Human ESTs GeneWise Exonerate Exonerate GeneWise genes Aligned cDNAs AlignedESTs GeneWise geneswith UTRs ClusterMerge ClusterMerge Supported ab initio (optional) Genebuilder Preliminary gene set cDNA genes Gene Combiner Final set + pseudogenes Pseudogenes Core Ensembl genes Ensembl EST genes

  6. Gene Builds Are Protein-based • Simple DNA - DNA alignments do NOT lead to translatable genes. • Essential to align at the protein level allowing for frameshifts and splice sites • GeneWise* • Protein - Genome alignments • Splice site model • Penalises stop codons • Models frameshifts • *E. Birney et al. Genome Research 14:988-995 (2004)

  7. The Ensembl Gene Build • Align species-specific proteins • Align similar proteins from closely related species • Use mRNA information to add UTRs • Build transcripts using mRNA evidence • Build additional transcripts using ab initio predictors and homology evidence • Combine annotations to make genes with alternative transcripts

  8. Ideal BLAST Reality The trouble with BLAST AG GT GT AG Real gene BLAST is good for finding possible exon positions In large genomic sequences.

  9. BLAST ‘replacements’ • Exonerate* (Guy Slater) • Fast gapped DNA-DNA matcher • 10,000 x faster than BLAST • Pmatch (Richard Durbin) • Fast exact protein-dna matcher • >10,000 x faster than BLAST • *BMC Bioinformatics 6: 31 (2005)

  10. Exonerate

  11. Adding UTRs protein - GeneWise (phases, no UTRs) cDNA - exonerate (UTRs, no phases) Combined prediction protein - GeneWise (phases, no UTRs) cDNA - exonerate (UTRs, no phases) GeneWise prediction protein - GeneWise (phases, no UTRs) cDNA - exonerate (UTRs, no phases) GeneWise prediction

  12. Gene Builder • Combines results after GeneWise and eventually ab initio predictions. • Clusters transcripts into genes by genomic exon overlap. • Groups transcripts, which share exons • Rejects non-translating transcripts • Removes duplicate exons • Attaches supporting evidence • Writes genes to database

  13. Evidence Tracks in ContigView Expanded tracks Compressed tracks

  14. Pseudogenes and ncRNA and ncRNA Pseudogenes

  15. Processed Unprocessed mRNA AAAAAA Produced by gene duplication and rearrangement Reverse transcription and re-integration pseudogene AAAAAA Pseudogenes: ‘False’ Genes

  16. Spliced Elsewhere • BLASTs single exon genes against a database of the multi-exon genes. • Span of real gene > 3x span of retro gene • Finds an additional ~ 600 pseudogenes in human • False positives can occur where gene predictions join together neighbouring genes in a cluster • Questions remain over the wisdom of calling all of these genes pseudogenes as some may be functional. Single exon transcripts with frameshifts Single exon transcripts with a spliced gene model elsewhere Transcripts which introns contain more than 80% repeat sequences A gene is labeled as a pseudogene if all transcripts in that gene are labeled as pseudo-transcripts

  17. ncRNAs • Functional RNAs • Families share conserved secondary structure • Low sequence identity • Ribosome • Spliceosome • tRNAs • miRNA

  18. RFAN • Hand made alignments • Use Infernal to make Covariance Models • Scan models over subset of EMBL to build family alignments

  19. miRNA • Highly conserved across species • Precursor stem loop sequence ~ 70nt • Mature miRNA ~ 21nt • BLAST genomic v miRBase precursors • RNAfold used to test for stem loop • Mature sequence identified • (only 2 nt changes tolerated)

  20. Structures • Structures identified by Infernal / RNAfold are stored as transcript attributes • ::::::::::::::::<<-<<<<<-<<<________________>>>>>>>>-->>,,,, • 1 AuCUUUGCGCAGGGGCaaUaucguAgccAGUGAGGcUuuaCCGAggcgcgauUAuuGCUA 60 • A+CUUUGCGCAG GGCA:UAU :UAGCCA+UGAGG+UU++CCGAGGCG: AUUA:UGCUA • 181 AGCUUUGCGCAGUGGCAGUAUCAUAGCCAAUGAGGUUUAUCCGAGGCGCAAUUAUUGCUA 240 • <<<<_.________.__>>>>,,,,,<<<.<<<<<<<<<<____......__>>>>>>>> • 61 gUugA.AAACUAUU.CCcaAccgCCCgcc.aagacgacauguua......uauugucggc 111 • :UU A AAA UA AA:+G G:C ::: ::A:::+UUA U :::U::+: • 241 AUUAAuAAAUUAAAuAAUAAAAGGG-GACuCUU-UUAGUGCUUAuaaaggUUUACUAACC 298 • >>->>>,,,,,,,,,,,,<<<<____>>>> • 112 uuuggcAAUUUUUGGAAGcccuccAaaggg 141 • :: G:CAA UU +AAG ::C+AA:: • 299 ACAGACAACUU---AAAGGUAACAAACCUA 325 • Displayable on website as markup on transcript sequence

  21. Human Build Statistics • NCBI 36 assembly, • released November 2005 • ‘known’ genes 21,571 • ‘novel’ genes 2,142 • Coding transcripts: 49,043 non-coding transcripts 4,145 • Ensembl exons: 278,632 • Human input sequences: • 260,031 proteins, redundant set +---------------------+----------+ | miRNA | 606 | | miRNA_pseudogene | 22 | | misc_RNA | 1060 | | misc_RNA_pseudogene | 7 | | Mt_rRNA | 1 | | Mt_tRNA | 22 | | Mt_tRNA_pseudogene | 603 | | protein_coding | 23713 | | pseudogene | 731 | | rRNA | 334 | | rRNA_pseudogene | 393 | | scRNA | 1 | | scRNA_pseudogene | 902 | | snoRNA | 609 | | snoRNA_pseudogene | 564 | | snRNA | 1387 | | snRNA_pseudogene | 632 | | tRNA_pseudogene | 131 | +---------------------+----------+

  22. Classification of Transcripts • Ensembl Transcripts or Proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries • Known genes map to species-specific protein records (targeted build) • Novel genes do not map to species-specific protein records (similarity build)

  23. Names and Descriptions • Names are inferred from mapped proteins • Official gene symbol is assigned if available • HGNC (HUGO) symbol for human genes • Species-specific nomenclature committees • Otherwise Swiss-Prot > RefSeq > TrEMBL ID • Novel transcripts have only Ensembl identifiers • Genes named after ‘best-named’ transcript • Gene description is inferred from mapped database entries, the source is always given

  24. Supporting evidence ExonView

  25. Configuring the Gene Build Data availability Targeted build most useful in human and mouse. Similarity build more important in other species. Structural Issues ZebrafishMany similargenes near each otherGenome from different haplotypes MosquitoMany single-exon genes Genes within genes Configuration Files provide flexibility

  26. Low Coverage Genomes Low coverage genomes (~2x) come in lots of scaffolds: “classic” genebuild will result in many partial and fragmented genes Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)

  27. Evaluating Genes and Transcripts • Ensembl gene set • Pseudogenes • EnsemblEST genes • Ab initio predictions • Manual curation (Vega, CCDS) • Gene models from other groups

  28. Human Proteins Other Proteins Human cDNAs Human ESTs Genewise Exonerate Exonerate Genewise genes Aligned cDNAs AlignedESTs Genewise geneswith UTRs ClusterMerge ClusterMerge Supported ab initio (optional) Genebuilder Preliminary gene set cDNA genes Gene Combiner Final set + pseudogenes Pseudogenes Core Ensembl genes Ensembl EST genes

  29. EST Analysis Map ESTs with Exonerate (determine coverage, % identity and location in genome) Filter on % identity and depth (6 million ESTs from dbEST – we map about 1/3)

  30. Merge ESTs according to consecutive exon overlap and set splice ends Assign translation Alternative transcripts with translation and UTRs Alternative Splicing Forms ESTs

  31. EST Genes and ESTs Ensembl transcript EST transcripts Human ESTs Latest Human Build NCBI 36 assembly EST Genes: 28,639 released Nov 2005 EST Transcripts: 58,916

  32. Evaluating Genes and Transcripts • Ensembl gene set • Pseudogenes • Ensembl EST genes • Ab initiopredictions • Manual curation (Vega, CCDS) • Gene models from other groups

  33. Ab initio Predictions GENSCAN transcript

  34. Evaluating Genes and Transcripts • Ensembl gene set • Pseudogenes • Ensembl EST genes • Ab initio predictions • Manual curation (Vega, CCDS) • Gene models from other groups

  35. Manual Curation • 7 (Washington University) • 14 (Genoscope) • 18 (Broad Institute) • 16, 19 (DOE Joint Genome Institute) • Other groups will also contribute to Vega • Manualannotation of finished clones • Vega Genome Browser • http://vega.sanger.ac.uk/ • Currently only chromosomes • 1, 6, 9, 10, 13, 20, 22, X and Y (Sanger Institute)

  36. Manual Curation • Manually-curated gene sets in Ensembl • WormBase (data import) • Caenorhabditis elegans • FlyBase (data import) • Drosophila melanogaster • Génoscope (data import) • Tetraodon nigroviridis • IMCB, Singapore (data import) • Takifugu rubripes • SGD (data import) • Saccharomyces cerevisiae • Vega includes some manually-curated finished clones from • Danio rerio, Mus musculus and Canis familiaris

  37. Manually-curated Vega Genes Vega manual curation

  38. Manually-curated Vega Genes

  39. Vega Genes ContigView Vega transcripts

  40. Ensembl / Havana Merge ~12,000 full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) were added to the human Ensembl gene set in v38 Transcripts: • Ensembl: red /black • Havana: blue • Ensembl/Havana:gold Genes: • Ensembl:red/black • Havana: blue • Ensembl/Havana: gold

  41. Ensembl / Havana Merge Merged Ensembl / Havana gene Ensembl transcript Havana transcripts

  42. Ensembl / Havana Merge Transcripts: +---------------------------+--------+-------+ | logic_name | status | count | +---------------------------+--------+-------+ | ensembl | KNOWN | 31402 | | ensembl | NOVEL | 4709 | | ensembl_havana_transcript | KNOWN | 174 | | havana | KNOWN | 11153 | | havana | NOVEL | 780 | +---------------------------+--------+-------+ Genes: +---------------------+--------+-------+ | logic_name | status | count | +---------------------+--------+-------+ | ensembl | KNOWN | 15044 | | ensembl | NOVEL | 2145 | | ensembl_havana_gene | KNOWN | 6407 | | ensembl_havana_gene | NOVEL | 18 | | havana | KNOWN | 71 | | havana | NOVEL | 25 | +---------------------+--------+-------+

  43. CCDS (consensus CDS) • Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human • Long term aim is to get to a single gene set for human • The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)

  44. Comparison of CDSs to NCBI • Exact matching CDS on the genome with: • Complete CDS (ATG->stop) • No frameshifts • No phase problems • No internal stop codons NCBI Hinxton

  45. CCDS release (March 2005) • Conservative first set, so the following have been removed: • All CDSs which match XMs • CDSs with large cDNA v genomic discrepancies • CDSs with non consensus splice sites • Set contains: • 14795 different CDSs • 16085 transcripts (in Ensembl) • 13031 genes • The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)

  46. ENCODE regions • 44 regions representing 1% of the genome • 14 manually / semi manually picked regions • 30 randomly picked 0.5Mb regions at varying gene density and non exonic conservation (three band for each). • Example id: ENr123

  47. GENCODE • Evaluation of prediction accuracy of automatic annotation methods in the Encode regions • Based on comparison to manual annotation generated by Havana team • Some experimental confirmation of annotation • Divided into categories of prediction based on data and methods used: • Ab initio • Comparative data only • Protein, EST and cDNA based • Any available data

  48. GENCODE regions • For 13 of the regions annotation released prior to competition for training: • 2 manual picks • 11 random (2 level 1, 4 level 2 and 5 level 3) • For the remaining 31 regions • 12 manual picks • 19 random (8 level 1, 6 level 2 and 5 level 3)

  49. EGASP • how complete is the Vega-ENCODE annotation when compared to other existing gene data sets? • how well the programs are able to reproduce the Vega-ENCODE annotation? • how reliable are the predictions outside of the Vega-ENCODE annotation • is there anything outside the annotation and the predictions? • M.G.Reese & R. Guigó Genome Biology 7:S1 (2006)

  50. EGASP 05 Nucleotide Exon Sn Sp CC Sn Sp SnSp Acembly 0.96 0.58 0.74 0.84 0.38 0.613 ECgene 0.96 0.46 0.66 0.75 0.30 0.528 Ensembl 0.91 0.92 0.92 0.77 0.82 0.800 Exogean 0.84 0.94 0.89 0.71 0.74 0.728 AceView 0.91 0.79 0.84 0.74 0.49 0.624 Pairagon 0.87 0.93 0.90 0.67 0.78 0.732

More Related