10 likes | 160 Vues
A G T A C - C T T A C T - - - G G A G T. | | | | | | | | | | | | | | |. A G T A C A A T T A C T G C T G G A G T. A G T A C A A T T A C T G C T G G A G T. | | | | | | | | | | | | | | |. A G T A C - C T T A C T - - - G G A G T. Human. Human. BLASTZ. Chain.
E N D
A GT A C - C T T A C T - - - G G A G T | | | | | | | | | | | | | | | A GT A C A A T T A C T G C T G G A G T A GT A C A A T T A C T G C T G G A G T | | | | | | | | | | | | | | | A GT A C - C T T A C T - - - G G A G T Human Human BLASTZ Chain Filter (Net) Elephant Figure 2: dealing with missing sequence. In this example, exon 3 of the human gene is not present in the elephant genome, In construction of the elephant gene-scaffold, the blue and red scaffolds are placed together separated by a gap big enough to accommodate the missing exon. In addition, the end of exon 1 is missing. Examination of the underlying assembly of the blue scaffold reveals that the end of the alignment block co-incides with the end of a contig. The exon is therefore extended into the inter-contig gap. These “gap” components in the projected structure result in strings of Xs of the correct length in the translation. Gene scaffold and project Elephant (a) Human Elephant (b) Tenrec Armadillo Human 481 Elephant 386 633 1358 1316 418 636 9115 Figure 3: dealing with alignment insertions/deletions. The protein-coding part of the sequences is shown in green. In (a), as well as containing a codon insertion, the elephant gene also includes a one residue insertion with respect to the human genome. A frameshift-intron is inserted into the elephant structure in order to maintain the correct reading frame. In (b), a one-residue deletion is accommodated by a two-residue frameshift-intron. 3590 909 1803 519 522 466 Rabbit Elephant Annotating 2x genomes in Ensembl Kevin Howe, B. Aken, M. Caccamo, Y. Chen, L. Clarke, S. Dyer, G. Coates, T. Cox, F. Cunningham, V. Curwen, T. Cutts, R. Durbin, J. Fernandez-Banet, X.M. Fernandez-Suarez, P. Flicek, S. Gräf, M. Hammond, J. Herrero, K. Jekosch, A. Kähäri, A.Kasprzyk, D. Keefe, F. Kokocinski, E. Kulesha, D. Lawson, D. London, I. Longden, K. Megy, P. Meidl, C. Melsopp, B. Overduin, A. Parker, G. Proctor, D. Rios, M. Schuster, I. Sealy, S. Searle, G. Slater, D. Smedley, J. Smith, S. Trevanion, A. Ureta-Vidal, J. Vogel, S. White, E. Birney, T. Hubbard Wellcome Trust Sanger Institute and European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK. Ensembl (http://www.ensembl.org/) is a joint project between the Wellcome Trust Sanger Institute and the European Bioinformatics Institute which aims to add value to genome sequence data by organising it on a reference sequence and adding consistent automatic annotation. Release 40 (August 2006) contained data for 21 vertebrate (14 mammalian) and 7 invertebrate genomes. One of the principle aims of the Ensembl project is to provide complete sets of protein-coding genes for all represented genomes. A semi-automatic procedure, the gene-build, has been (and continues to be) developed to achieve this. It is applied to the majority of genomes in Ensembl, with the exception of those that have reliable gene-structure annotation produced elsewhere (for example C. elegans and D. melanogaster, which have curated gene sets from WormBase and FlyBase respectively). This poster describes a new component to the gene-build system designed to address the specific challenges presented by genomes that have been sequenced to a low level of redundancy/coverage. The 2x sequencing initiative Complete, high-quality genome sequences are now available for a variety of model organisms, including human and mouse. The task now is to identify all functional elements on these genomes. One approach is to use comparative genomics to reveal sequence elements that have been conserved during evolution. To this end, a programme to sequence sixteen mammalian genomes to a low level of coverage (2x) is well underway (ref B). This study was performed at a time when four of the sixteen genomes had been released: those of the nine-banded armadillo (Dasypus novenmcinctus); the small Madagascar hedghog, or tenrec (Echinops telfairi); the African elephant (Loxodonta africana); and the rabbit (Oryctolagus cuniculus). The challenge for gene-building A 2x genome assembly will inevitably be both incomplete and highly fragmented. For many genes therefore, some or all of the underlying genome sequence will be missing. Even genes present in their entirety will be fragmented across a number of scaffolds in many cases. The standard Ensembl gene-build is based on the alignment of complete transcribed sequences (proteins, cDNAs and ESTs) to the corresponding part of the genome (ref A). It does not perform well when large parts of this corresponding genome sequence are missing and/or fragmented. For this reason, it is unsuitable for direct application to low-coverage genome assemblies. We have developed a new component to the gene-build software, WGA2Genes, to address this issue. Figure 1: An example of the WGA2Genes procedure. A human subsequence containing two genes has given rise to two separate gene-scaffolds, the first a super-structure containing 3 scaffolds from the original assembly. WGA2Genes Central to the method is a whole genome alignment (WGA) between a low-coverage genome to annotate (the target), and a finished, annotated genome (the reference). The WGA is used for two purposes (i) to infer assemblies of target scaffolds into “gene-scaffold” super-structures that contain complete genes; and (ii) to infer gene-structure annotation by projection of the annotation from the reference onto the target. The procedure has 4 main steps (summarised in figure 1): (1) A raw alignment between the target and reference genomes generated using BLASTZ (ref C). (2) The resulting blocks of local alignments are clustered into chains. The UCSC Axt tools are used for this (ref D). (3) The chains are filtered such that no residue in the target genome is aligned to more than once place in the reference genome. The result resembles a UCSC Net (ref D). (4) The filtered alignment is used to: a. Infer arrangements of scaffolds into gene-scaffolds. b. Project protein-coding gene-structure annotation from the reference onto the gene-scaffolds It is common for part of the sequence of a gene to be missing from the target genome. Where appropriate, exons are placed in or extended into sequence gaps, so that the resulting transcript is the correct length. An example of this is shown in figure 2 (far right). Insertion/deletion event within alignment blocks often change the reading frame of the projected transcript (with respect to the source), resulting in a translation interrupted by stops. This can be an indicator of loss of function of this gene in the target (i.e. the gene is a pseudogene in the target), or sequencing error. We address this explicitly by correcting the reading frame with “frameshift-introns” (see figure 3, far right). The pseudogene-calling component of the gene-build software is applied to the projected gene set. It identifies genes structures that are likely to the be result of retro-transposition and re-classifies them as processed pseudogenes. Application to 4 2x genomes WGA2Genes has to date been applied to four genomes: armadillo, tenrec, elephant and rabbit. The resulting gene-scaffolds and gene-structure annotations are available as part of release 40 of Ensembl. The Ensembl-annotated human genome (release 39) was used as reference. Table 1 shows some properties of the resulting projected gene-sets. Note that transcript completeness was used as a quality control measure and gene models (a) with less than 50% coverage of the source transcript, or (b) containing more than 50% gap sequence, were discarded in a post-filtering step. Figure 4 shows the number of human genes that could be successfully projected in various combinations of target genomes (including none). As the method is applied to more 2x genomes, it will become possible to investigate whether there is a subset of human genes for which projection-by-alignment is not possible. Figure 4: Number of human genes that gave rise to gene-models (by projection) in four 2x genomes. The total number of human genes was 23328, 3590 of which were not projected to any of the 4 genomes. By way of example, 1358 human genes were projected to armadillo, elehpant and tenrec but not rabbit. The number projected in armadillo and rabbit but not in tenrec or elephant (390) and vice-versa (786) are omitted from the diagram for clarity. References Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M. Genome Res. 2004 May;14(5):942-50. Margulies EH, Vinson JP; NISC Comparative Sequencing Program; Miller W, Jaffe DB, Lindblad-Toh K, Chang JL, Green ED, Lander ES, Mullikin JC, Clamp M. Proc Natl Acad Sci U S A. 2005 Mar 29;102(13):4795-800. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Genome Res. 2003 Jan;13(1):103-7. Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. Table 1: properties of the resulting gene sets. This poster was brought to you by the letters J, P and the number 511.