Evaluating genes and transcripts in Ensembl

Evaluating genes and transcripts in Ensembl Sep 2006

Outline • Ensembl gene set • Pseudogenes • Ensembl EST genes • Ab initio predictions • Manual curation (Vega, CCDS) • Gene models from other groups

Overview other groups’ models evidence Ensembl predictions manual curation

Annotation process • Automated analysis • Repeat masking • RepeatMasker (Smit), tandem, inverted • Gene prediction • Genscan (Burge), FGENESH (Solovyev)… • Database searches • initial protein and DNA matches using BLAST • refined protein matches using GeneWise • refined EST matches using EST2GENOME, spangle • Pfam annotation using GeneWise.

Human Proteins Other Proteins Human cDNAs Human ESTs GeneWise Exonerate Exonerate GeneWise genes Aligned cDNAs AlignedESTs GeneWise geneswith UTRs ClusterMerge ClusterMerge Supported ab initio (optional) Genebuilder Preliminary gene set cDNA genes Gene Combiner Final set + pseudogenes Pseudogenes Core Ensembl genes Ensembl EST genes

Gene Builds Are Protein-based • Simple DNA - DNA alignments do NOT lead to translatable genes. • Essential to align at the protein level allowing for frameshifts and splice sites • GeneWise* • Protein - Genome alignments • Splice site model • Penalises stop codons • Models frameshifts • *E. Birney et al. Genome Research 14:988-995 (2004)

The Ensembl Gene Build • Align species-specific proteins • Align similar proteins from closely related species • Use mRNA information to add UTRs • Build transcripts using mRNA evidence • Build additional transcripts using ab initio predictors and homology evidence • Combine annotations to make genes with alternative transcripts

Ideal BLAST Reality The trouble with BLAST AG GT GT AG Real gene BLAST is good for finding possible exon positions In large genomic sequences.

BLAST ‘replacements’ • Exonerate* (Guy Slater) • Fast gapped DNA-DNA matcher • 10,000 x faster than BLAST • Pmatch (Richard Durbin) • Fast exact protein-dna matcher • >10,000 x faster than BLAST • *BMC Bioinformatics 6: 31 (2005)

Exonerate

Adding UTRs protein - GeneWise (phases, no UTRs) cDNA - exonerate (UTRs, no phases) Combined prediction protein - GeneWise (phases, no UTRs) cDNA - exonerate (UTRs, no phases) GeneWise prediction protein - GeneWise (phases, no UTRs) cDNA - exonerate (UTRs, no phases) GeneWise prediction

Gene Builder • Combines results after GeneWise and eventually ab initio predictions. • Clusters transcripts into genes by genomic exon overlap. • Groups transcripts, which share exons • Rejects non-translating transcripts • Removes duplicate exons • Attaches supporting evidence • Writes genes to database

Evidence Tracks in ContigView Expanded tracks Compressed tracks

Pseudogenes and ncRNA and ncRNA Pseudogenes

Processed Unprocessed mRNA AAAAAA Produced by gene duplication and rearrangement Reverse transcription and re-integration pseudogene AAAAAA Pseudogenes: ‘False’ Genes

Spliced Elsewhere • BLASTs single exon genes against a database of the multi-exon genes. • Span of real gene > 3x span of retro gene • Finds an additional ~ 600 pseudogenes in human • False positives can occur where gene predictions join together neighbouring genes in a cluster • Questions remain over the wisdom of calling all of these genes pseudogenes as some may be functional. Single exon transcripts with frameshifts Single exon transcripts with a spliced gene model elsewhere Transcripts which introns contain more than 80% repeat sequences A gene is labeled as a pseudogene if all transcripts in that gene are labeled as pseudo-transcripts

ncRNAs • Functional RNAs • Families share conserved secondary structure • Low sequence identity • Ribosome • Spliceosome • tRNAs • miRNA

RFAN • Hand made alignments • Use Infernal to make Covariance Models • Scan models over subset of EMBL to build family alignments

miRNA • Highly conserved across species • Precursor stem loop sequence ~ 70nt • Mature miRNA ~ 21nt • BLAST genomic v miRBase precursors • RNAfold used to test for stem loop • Mature sequence identified • (only 2 nt changes tolerated)

Structures • Structures identified by Infernal / RNAfold are stored as transcript attributes • ::::::::::::::::<<-<<<<<-<<<________________>>>>>>>>-->>,,,, • 1 AuCUUUGCGCAGGGGCaaUaucguAgccAGUGAGGcUuuaCCGAggcgcgauUAuuGCUA 60 • A+CUUUGCGCAG GGCA:UAU :UAGCCA+UGAGG+UU++CCGAGGCG: AUUA:UGCUA • 181 AGCUUUGCGCAGUGGCAGUAUCAUAGCCAAUGAGGUUUAUCCGAGGCGCAAUUAUUGCUA 240 • <<<<_.________.__>>>>,,,,,<<<.<<<<<<<<<<____......__>>>>>>>> • 61 gUugA.AAACUAUU.CCcaAccgCCCgcc.aagacgacauguua......uauugucggc 111 • :UU A AAA UA AA:+G G:C ::: ::A:::+UUA U :::U::+: • 241 AUUAAuAAAUUAAAuAAUAAAAGGG-GACuCUU-UUAGUGCUUAuaaaggUUUACUAACC 298 • >>->>>,,,,,,,,,,,,<<<<____>>>> • 112 uuuggcAAUUUUUGGAAGcccuccAaaggg 141 • :: G:CAA UU +AAG ::C+AA:: • 299 ACAGACAACUU---AAAGGUAACAAACCUA 325 • Displayable on website as markup on transcript sequence

Human Build Statistics • NCBI 36 assembly, • released November 2005 • ‘known’ genes 21,571 • ‘novel’ genes 2,142 • Coding transcripts: 49,043 non-coding transcripts 4,145 • Ensembl exons: 278,632 • Human input sequences: • 260,031 proteins, redundant set +---------------------+----------+ | miRNA | 606 | | miRNA_pseudogene | 22 | | misc_RNA | 1060 | | misc_RNA_pseudogene | 7 | | Mt_rRNA | 1 | | Mt_tRNA | 22 | | Mt_tRNA_pseudogene | 603 | | protein_coding | 23713 | | pseudogene | 731 | | rRNA | 334 | | rRNA_pseudogene | 393 | | scRNA | 1 | | scRNA_pseudogene | 902 | | snoRNA | 609 | | snoRNA_pseudogene | 564 | | snRNA | 1387 | | snRNA_pseudogene | 632 | | tRNA_pseudogene | 131 | +---------------------+----------+

Classification of Transcripts • Ensembl Transcripts or Proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries • Known genes map to species-specific protein records (targeted build) • Novel genes do not map to species-specific protein records (similarity build)

Names and Descriptions • Names are inferred from mapped proteins • Official gene symbol is assigned if available • HGNC (HUGO) symbol for human genes • Species-specific nomenclature committees • Otherwise Swiss-Prot > RefSeq > TrEMBL ID • Novel transcripts have only Ensembl identifiers • Genes named after ‘best-named’ transcript • Gene description is inferred from mapped database entries, the source is always given

Supporting evidence ExonView

Configuring the Gene Build Data availability Targeted build most useful in human and mouse. Similarity build more important in other species. Structural Issues ZebrafishMany similargenes near each otherGenome from different haplotypes MosquitoMany single-exon genes Genes within genes Configuration Files provide flexibility

Low Coverage Genomes Low coverage genomes (~2x) come in lots of scaffolds: “classic” genebuild will result in many partial and fragmented genes Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)

Evaluating Genes and Transcripts • Ensembl gene set • Pseudogenes • EnsemblEST genes • Ab initio predictions • Manual curation (Vega, CCDS) • Gene models from other groups

Human Proteins Other Proteins Human cDNAs Human ESTs Genewise Exonerate Exonerate Genewise genes Aligned cDNAs AlignedESTs Genewise geneswith UTRs ClusterMerge ClusterMerge Supported ab initio (optional) Genebuilder Preliminary gene set cDNA genes Gene Combiner Final set + pseudogenes Pseudogenes Core Ensembl genes Ensembl EST genes

EST Analysis Map ESTs with Exonerate (determine coverage, % identity and location in genome) Filter on % identity and depth (6 million ESTs from dbEST – we map about 1/3)

Merge ESTs according to consecutive exon overlap and set splice ends Assign translation Alternative transcripts with translation and UTRs Alternative Splicing Forms ESTs

EST Genes and ESTs Ensembl transcript EST transcripts Human ESTs Latest Human Build NCBI 36 assembly EST Genes: 28,639 released Nov 2005 EST Transcripts: 58,916

Evaluating Genes and Transcripts • Ensembl gene set • Pseudogenes • Ensembl EST genes • Ab initiopredictions • Manual curation (Vega, CCDS) • Gene models from other groups

Ab initio Predictions GENSCAN transcript

Evaluating Genes and Transcripts • Ensembl gene set • Pseudogenes • Ensembl EST genes • Ab initio predictions • Manual curation (Vega, CCDS) • Gene models from other groups

Manual Curation • 7 (Washington University) • 14 (Genoscope) • 18 (Broad Institute) • 16, 19 (DOE Joint Genome Institute) • Other groups will also contribute to Vega • Manualannotation of finished clones • Vega Genome Browser • http://vega.sanger.ac.uk/ • Currently only chromosomes • 1, 6, 9, 10, 13, 20, 22, X and Y (Sanger Institute)

Manual Curation • Manually-curated gene sets in Ensembl • WormBase (data import) • Caenorhabditis elegans • FlyBase (data import) • Drosophila melanogaster • Génoscope (data import) • Tetraodon nigroviridis • IMCB, Singapore (data import) • Takifugu rubripes • SGD (data import) • Saccharomyces cerevisiae • Vega includes some manually-curated finished clones from • Danio rerio, Mus musculus and Canis familiaris

Manually-curated Vega Genes Vega manual curation

Manually-curated Vega Genes

Vega Genes ContigView Vega transcripts

Ensembl / Havana Merge ~12,000 full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) were added to the human Ensembl gene set in v38 Transcripts: • Ensembl: red /black • Havana: blue • Ensembl/Havana:gold Genes: • Ensembl:red/black • Havana: blue • Ensembl/Havana: gold

Ensembl / Havana Merge Merged Ensembl / Havana gene Ensembl transcript Havana transcripts

CCDS (consensus CDS) • Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human • Long term aim is to get to a single gene set for human • The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)

Comparison of CDSs to NCBI • Exact matching CDS on the genome with: • Complete CDS (ATG->stop) • No frameshifts • No phase problems • No internal stop codons NCBI Hinxton

CCDS release (March 2005) • Conservative first set, so the following have been removed: • All CDSs which match XMs • CDSs with large cDNA v genomic discrepancies • CDSs with non consensus splice sites • Set contains: • 14795 different CDSs • 16085 transcripts (in Ensembl) • 13031 genes • The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)

ENCODE regions • 44 regions representing 1% of the genome • 14 manually / semi manually picked regions • 30 randomly picked 0.5Mb regions at varying gene density and non exonic conservation (three band for each). • Example id: ENr123

GENCODE • Evaluation of prediction accuracy of automatic annotation methods in the Encode regions • Based on comparison to manual annotation generated by Havana team • Some experimental confirmation of annotation • Divided into categories of prediction based on data and methods used: • Ab initio • Comparative data only • Protein, EST and cDNA based • Any available data

GENCODE regions • For 13 of the regions annotation released prior to competition for training: • 2 manual picks • 11 random (2 level 1, 4 level 2 and 5 level 3) • For the remaining 31 regions • 12 manual picks • 19 random (8 level 1, 6 level 2 and 5 level 3)

EGASP • how complete is the Vega-ENCODE annotation when compared to other existing gene data sets? • how well the programs are able to reproduce the Vega-ENCODE annotation? • how reliable are the predictions outside of the Vega-ENCODE annotation • is there anything outside the annotation and the predictions? • M.G.Reese & R. Guigó Genome Biology 7:S1 (2006)

EGASP 05 Nucleotide Exon Sn Sp CC Sn Sp SnSp Acembly 0.96 0.58 0.74 0.84 0.38 0.613 ECgene 0.96 0.46 0.66 0.75 0.30 0.528 Ensembl 0.91 0.92 0.92 0.77 0.82 0.800 Exogean 0.84 0.94 0.89 0.71 0.74 0.728 AceView 0.91 0.79 0.84 0.74 0.49 0.624 Pairagon 0.87 0.93 0.90 0.67 0.78 0.732

Evaluating genes and transcripts in Ensembl