Tools for Maximising the Value of Genomic Data

WEHI Postgraduate Seminar Series 2003: Tools for Maximising the Value of Genomic Data. Keith Satterley, Bioinformatics, The Walter & Eliza Hall Institute of Medical Research. 2 June 2003. keith@wehi.edu.au http://bioinf.wehi.edu.au/resources/presentations.html

Overview
  Genomic data: what is it, where is it
  Gene Finding: GenScan
  Comparative Genomics Gene Finding: Slam, Twinscan
  Finding Regulatory Regions: rVista, Consite, Toucan
  Programming Tools: Languages, Perl, BioPerl, BioJava, Bio???
  Slipper, a Perl program & results
  Link
  References
  Acknowledgements

1953 2003 http://www.geneticscongress2003.com/index.php

Genomic data
Whole genome data sets, according to http://www.ebi.ac.uk/genomes/ as at 28-May-03:
  Archaea: 16
  Bacteria: 107
  Organelles: 308
  Phages: 112
  Plasmids: 280
  Viroids: 40
  Viruses: 880
  TOTAL: 1743

Eukaryota (completed chromosomes) http://www.ebi.ac.uk/genomes/

gnn.tigr.org http://gnn.tigr.org/sequenced_genomes/genome_guide_p1.shtml

GOLD – Genomes Online Database http://www.genomesonline.org/

It’s a fact: counting at 1 base per second, 24 hours a day, it would take you about 95 years to count the DNA in one cell.
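The 95-year figure can be checked with a few lines of arithmetic. This is only a sketch: the genome size of 3 billion bases and the 365.25-day year are assumptions chosen because they reproduce the number on the slide; they are not stated in the talk.

```python
# Assumption: a 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_to_count(items, per_second=1.0):
    """Years needed to count `items` things at `per_second` items per second."""
    return items / (per_second * SECONDS_PER_YEAR)


# Assumption: roughly 3 billion bases (one copy of the human genome).
GENOME_BASES = 3_000_000_000

if __name__ == "__main__":
    print(round(years_to_count(GENOME_BASES)))  # 95
```

Changing `GENOME_BASES` or `per_second` shows how sensitive the estimate is to the assumed genome size and counting rate.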

3.1 million years to count to 100 trillion!
“Here in the very month of the 50th anniversary of the discovery of DNA’s double helix, I am pleased and honored, perhaps I should say exhilarated, to declare the goals of the Human Genome Project to be completed.” Francis Collins, Director, National Human Genome Research Institute, 25 April 2003.
“..the information that will matter to you about your life is a fraction of your genetic code, probably less than 1 percent.” J. Craig Venter, 25 April 2003 (Bio-IT World).
http://www.genomesonline.org/

Most Recent Genomics News (BETHESDA, Md., May 20, 2003): By June, researchers from the Whitehead/MIT Center and the Genome Sequencing Center at Washington University School of Medicine expect to complete the sequencing work (approximately four-fold coverage) necessary to create an initial working draft of the genome of the chimpanzee (Pan troglodytes). The Whitehead/MIT team expects to complete a high-quality draft of the dog genome sequence within the next 12 months. After the genome of the boxer is sequenced, researchers plan to sample and analyze DNA from 10 to 20 other dog breeds, including the beagle, to study genetic variation within the canine species. http://www.genome.gov/11007358

Gene Finding
Gene finding is about detecting coding regions and inferring gene structure. Gene finding is difficult:
  DNA sequence signals have low information content (degenerate and highly unspecific).
  It is difficult to discriminate real signals.
  Sequencing errors.
  Prokaryotes: high gene density and simple gene structure; short genes have little information; overlapping genes.
  Eukaryotes: low gene density and complex gene structure; alternative splicing; pseudogenes.

Gene Finding
A good gene finding review has been prepared by Lorenzo Cerutti of the Swiss Institute of Bioinformatics. It is an EMBnet course (September 2002) entitled “Gene Finding”, available at: http://www.ch.embnet.org/CoursEMBnet/Pages02/slides/gene_finding.pdf

Gene Finders
  GenScan: uses generalized hidden Markov models to predict complete gene structure. http://genes.mit.edu/GENSCAN.html
  MZEF: designed to predict only internal coding exons. http://www.cshl.org/genefinder
  FGENES: uses linear discriminant analysis. http://genomic.sanger.ac.uk/gf/gf.shtml
  GeneFinder: http://www.cshl.org/genefinder
  GRAIL 1, 1a, 2: http://compbio.ornl.gov
  HMMgene: designed to predict complete gene structure. http://genome.cbs.dtu.dk/services/HMMgene
  Genewise: uses HMMs; part of the Wise2 package. http://www.sanger.ac.uk/Software/Wise2
  Procrustes: predicts gene structure from homology found in proteins. http://hto-13.usc.edu/software/procrustes/index.html
  GeneMark.hmm: recently modified to predict gene structure in eukaryotes. http://opal.biology.gatech.edu/GeneMark
  Geneid: recently updated to a new and faster version. http://www1.imim.es/geneid.html

Gene Finders
  Overall performance is best for HMMgene and GENSCAN.
  The accuracy of some programs depends on G+C content; the exceptions are HMMgene and GENSCAN, which use different parameter sets for different G+C contents.
  For almost all the tested programs, “medium” exons (70-200 nucleotides long) are the most accurately predicted. Accuracy decreases for shorter and longer exons, except for HMMgene.
  Internal exons are much more likely to be correctly predicted than initial or terminal exons (start/stop codon detection is a weakness).
  Initial and terminal exons are the most likely to be missed completely.
  Only HMMgene and GENSCAN have reliable scores for exon prediction.
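To make the G+C point concrete, here is a minimal sketch of how a program might measure G+C content and choose a parameter set from it. The bin boundaries and the parameter set names are hypothetical, chosen only for illustration; they are not taken from HMMgene or GENSCAN.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / acgt


# Hypothetical bins: (upper bound on G+C fraction, parameter set name).
HYPOTHETICAL_BINS = [(0.43, "low"), (0.51, "medium"), (0.57, "high")]


def parameter_set_for(seq):
    """Pick the name of the (hypothetical) parameter set for this sequence."""
    gc = gc_fraction(seq)
    for upper_bound, name in HYPOTHETICAL_BINS:
        if gc < upper_bound:
            return name
    return "very_high"


if __name__ == "__main__":
    print(gc_fraction("ATGC"), parameter_set_for("ATGC"))
```

A program that uses a single parameter set for every sequence would be expected to do worse on sequences whose G+C content is far from that of its training data, which is consistent with the observation on the slide.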

Gene prediction limits
  Existing predictors are for protein coding regions.
  Non-coding areas are not detected (5’ and 3’ UTR).
  Non-coding RNA genes are missed.
  Predictions are for “typical” genes.
  Partial genes are often missed.
  Training sets may be biased.
  Atypical genes use other grammars.

GenScan
GENSCAN was developed by Chris Burge and Samuel Karlin, Department of Mathematics, Stanford University. Genscan is a general probabilistic model of the gene structure of human genomic sequences. It identifies complete exon/intron structures of genes in both strands of genomic DNA.
The new Genscan Web Server is at http://genes.mit.edu/GENSCAN.html
Genscan is also available for WEHI people, with a greater choice of options, at http://www.wehi.edu.au/resources/PBC/index.html
Reference: “Prediction of Complete Gene Structures in Human Genomic DNA”, J. Mol. Biol. (1997) 268, 78-94

Comparative Genomics

“If you take a sequence and just run a gene prediction program on it, the programs don’t usually do very well. But if you take human and mouse sequence, and compare them against each other, looking for similar regions, you get better predictions. And the more genomes we have, the better it will get.” Gene Myers, Professor, Dept. of Electrical Engineering & Computer Sciences, University of California, Berkeley. Quote from the 50/50 series of interviews by Bio-IT World.

“Looking at the similarity between the human genome and other species is a really powerful way to get at functional sequences and to allow us to work on them in different species.” “Several groups, including ours, have gene-finding methods for comparative genomics. This is an active area where we will see significant advances in the next few years.” Richard Durbin, Head of Informatics, Wellcome Trust Sanger Institute. Quotes from the 50/50 series of interviews by Bio-IT World.

Comparative Genomics
The assumption that underlies comparative genomics is that the two genomes had a common ancestor and that each organism is a combination of the ancestor and the action of evolution. Evolution can be broadly thought of as the combination of two processes: mutational forces that generate random mutations in the genome sequence, and selection pressures that
  1. eliminate random mutations (negative selection),
  2. have no effect on mutations (neutral selection), or
  3. increase the frequency of mutant alleles in the population as a result of a gain in fitness (positive selection).
The combined action of mutation and selection is represented generally by a RATE MATRIX of base-pair changes between the two observed genomes.
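As a rough illustration of the rate matrix idea, the sketch below counts the base-pair changes observed between two aligned sequences and normalises each row. It assumes the sequences are already aligned with "-" for gaps, and it gives only observed substitution frequencies; a real rate matrix would also account for divergence time and for multiple changes at the same site. The function names are illustrative.

```python
from collections import Counter

BASES = "ACGT"


def substitution_counts(aligned_a, aligned_b):
    """Count (base in A, base in B) pairs over ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    counts = Counter()
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        # Skip gaps and ambiguous characters.
        if x in BASES and y in BASES:
            counts[(x, y)] += 1
    return counts


def row_normalised(counts):
    """Turn the counts into per-row frequencies: P(base in B | base in A)."""
    matrix = {}
    for x in BASES:
        total = sum(counts[(x, y)] for y in BASES)
        matrix[x] = {
            y: (counts[(x, y)] / total if total else 0.0) for y in BASES
        }
    return matrix


if __name__ == "__main__":
    counts = substitution_counts("ACGTAC-A", "ACGTGCTA")
    for base, row in row_normalised(counts).items():
        print(base, row)
```

Regions under negative selection would be expected to show fewer off-diagonal changes than neutrally evolving regions, which is what comparative methods exploit.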

Comparative Genomics
[Figure: evolutionary relationship between metazoans that are sequenced, or due for sequencing, including C. elegans, human, mouse and rat. Evolutionary distances are in millions of years.]

Comparative Genomics
Comparative genomics may be defined as the derivation of genomic information following comparison of the information content of two or more species' genome sequences. There is a good article in Nature Reviews Genetics, April 2003, Vol 4 No 4, pp 251-262: “Comparative Genomics: Genome-Wide Analysis in Metazoan Eukaryotes”, A. Ureta-Vidal, Laurence Ettwiller & Ewan Birney, 2003. http://www.nature.com/cgi-taf/DynaPage.taf?file=/nrg/journal/v4/n4/full/nrg1043_fs.html

The similarity is such that human chromosomes can be cut (schematically at least) into about 150 pieces (only about 100 are large enough to appear here), then reassembled into a reasonable approximation of the mouse genome. http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/ttmousehuman.html

Comparative Genomics
…there has been an explosion in the availability of tools, which may make it difficult to decide which tool is most suitable for your research. Indeed, to interpret these resources, you must be aware of the differences between them and between their underlying assumptions.

Whole Genome Alignments
K-browser http://hanuman.math.berkeley.edu/cgi-bin/kbrowser
A multiple genome browser, currently set up for human, mouse and rat, based on the MAVID alignments, UCSC genome browser.

Comparative Gene Prediction
SLAM http://baboon.math.berkeley.edu/~syntenic/slam.html
  An example of a comparative gene finder.
  Employs a generalised pair hidden Markov model approach for predicting gene structures within syntenic genomic sequences.
  Performs gene finding and alignment of the sequences simultaneously.

SLAM
SLAM has been used for whole genome annotation projects. For the mouse/human analysis, SLAM used a human/mouse synteny map, giving segments which were further broken up into 300 kb pieces. These pieces were aligned by AVID. SLAM was then run on all syntenic pieces using the AVID alignments as guides. Coding lengths < 120 were discarded. SLAM also predicted conserved non-coding regions (CNS), the first de novo prediction of CNS in the human and mouse genomes. The results are available at http://bio.math.berkeley.edu/slam/mouse/
A similar result is available for human/rat.
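The two bookkeeping steps in this pipeline, cutting segments into pieces of at most 300 kb and discarding short coding predictions, can be sketched as follows. This is not SLAM's code: the fixed-size cuts, the 1-based inclusive coordinates and the reading of "120" as a length in nucleotides are all assumptions made for the example.

```python
def split_segment(start, end, piece_size=300_000):
    """Yield (start, end) pieces of at most piece_size bases.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if piece_size <= 0:
        raise ValueError("piece_size must be positive")
    pos = start
    while pos <= end:
        piece_end = min(pos + piece_size - 1, end)
        yield (pos, piece_end)
        pos = piece_end + 1


def keep_prediction(coding_length, min_length=120):
    """Keep a prediction only if its coding length reaches min_length."""
    return coding_length >= min_length


if __name__ == "__main__":
    for piece in split_segment(1, 650_000):
        print(piece)
    print([n for n in (58, 119, 120, 737) if keep_prediction(n)])
```

A real pipeline would probably choose cut points so that genes are not split across pieces; the fixed-size cut here only illustrates the idea.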

seq1 SLAM CDS 2421 2478 . + 2 gene_id "000001"; transcript_id "000001.1"; frame "1"; exontype "internal"
seq1 SLAM CDS 3127 3805 . + 1 gene_id "000001"; transcript_id "000001.1"; frame "1"; exontype "internal"
------------------------------------------------------------
seq2 SLAM CDS 2134 2191 . + 2 gene_id "000001"; transcript_id "000001.1"; frame "2"; exontype "internal"
seq2 SLAM CDS 2867 3545 . + 1 gene_id "000001"; transcript_id "000001.1"; frame "2"; exontype "internal"
------------------------------------------------------------
> Protein 1: (244,244) aa (incomplete protein)
Y    1 KCEAIASDCF LSGNVDIELK DHNNCISKIN VEDQKNCALS WAFASIYHLE   50
        CE IAS CF LSGNVDIE K D ++C S I   E+Q NC LS W F S  HLE
Z    1 TCERIASSCF LSGNVDIEWK DKSSCFSSIE TEEQGNCNLS WLFTSKTHLE   50
...
http://baboon.math.berkeley.edu/~syntenic/slam.html
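Feature lines like the SLAM CDS records above are easy to process in a script. The sketch below parses one such line and computes the exon length. It assumes whitespace-separated fields in the nine-column layout shown on the slide (real GTF files are tab-separated) and inclusive start/end coordinates; the function names are illustrative.

```python
def parse_cds_line(line):
    """Parse one whitespace-separated, GTF-style feature line into a dict."""
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 fields, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for item in attrs.split(";"):
        item = item.strip()
        if not item:
            continue
        # Each attribute looks like: key "value"
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


def cds_length(record):
    """Length of the feature, assuming inclusive start and end coordinates."""
    return record["end"] - record["start"] + 1


if __name__ == "__main__":
    line = ('seq1 SLAM CDS 2421 2478 . + 2 gene_id "000001"; '
            'transcript_id "000001.1"; frame "1"; exontype "internal"')
    record = parse_cds_line(line)
    print(record["attributes"]["transcript_id"], cds_length(record))
```

Under these assumptions the two seq1 exons shown are 58 and 679 bases long, and the two seq2 exons have the same lengths.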

One of the first gene predictors to substantially exceed the performance of GENSCAN on a genomic scale by using mouse–human comparison was TWINSCAN (Korf et al. 2001). TwinScan http://genes.cs.wustl.edu/query.html

Other Comparative Gene Predictors
DoubleScan: http://www.sanger.ac.uk/cgibin/doublescan/submit
  A program for comparative ab initio prediction of protein coding genes in mouse and human DNA. Generates exon candidates in both sequences.
SGP-1: http://soft.ice.mpg.de/sgp-1
  A similarity based gene prediction program. Given two genomic DNA sequences, it post-processes the pairwise local alignment to predict single or multiple gene models of protein coding genes in forward and reverse strands.

Regulatory Sequence

… Leroy Hood brought out this point in his talk at the Bio2001 meeting in San Diego (24–28 June 2001) with his statement that “The difference between man and monkey is gene regulation.”

“I think the places that we should be looking at now are the non-repetitive, unique, non-coding DNA. … If they are conserved, they must be important. There are discoveries in there.” Lincoln Stein, Associate Professor, Cold Spring Harbor Laboratory. Quote from the 50/50 series of interviews by Bio-IT World.

Finding regulatory regions
  rVISTA: http://teapot.jgi-psf.org/ovcharen/rvista/index.html
  Consite: http://forkhead.cgb.ki.se/cgi-bin/consite
  Footprinter: http://abstract.cs.washington.edu/~blanchem/FootPrinterWeb/FootprinterInput.pl
  Toucan: http://www.esat.kuleuven.ac.be/~saerts/software/toucan.php/
  Trafac: http://trafac.chmcc.org/trafac/index.jsp

VISTA is a set of tools for comparative genomics. It was designed to visualize long sequence alignments of DNA from two or more species with annotation information.
  AVID: the alignment engine behind VISTA, a program for globally aligning DNA sequences of arbitrary length.
  mVISTA (main VISTA): a program for visualizing alignments of an arbitrary number of genomic sequences from different species.
  rVISTA (regulatory VISTA): combines a transcription factor binding site database search with comparative sequence analysis.
http://teapot.jgi-psf.org/ovcharen/rvista/index.html

rVista http://teapot.jgi-psf.org/ovcharen/rvista/index.html
A program that combines transcription factor binding site (TFBS) searches with comparative sequence analysis.
  1. Human and mouse sequences are aligned using the global alignment program MAVID.
  2. Potential transcription factor binding sites are predicted by the Match™ program, based on the TRANSFAC Professional 5.3 library.
  3. The human-mouse sequence conservation of the DNA region spanning each transcription factor binding site is assessed using a novel strategy.
Human and/or mouse annotation determines the genomic location of each predicted transcription factor hit.
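To give a feel for the binding site prediction step, here is a generic position weight matrix scan. It is a toy sketch only: it does not reproduce the scoring used by Match™ or any TRANSFAC matrix, and the counts, pseudocount, uniform background and score cutoff are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix from per-position base counts.

    counts is a list with one dict per motif position, e.g. {"T": 8}.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(seq, matrix, min_score):
    """Return (position, word, score) for every window scoring >= min_score."""
    seq = seq.upper()
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        word = seq[i:i + width]
        if any(b not in BASES for b in word):
            continue  # skip windows containing gaps or ambiguous bases
        score = sum(matrix[j][b] for j, b in enumerate(word))
        if score >= min_score:
            hits.append((i, word, score))
    return hits


if __name__ == "__main__":
    # Hypothetical counts for a 4-base motif resembling TATA.
    toy_counts = [{"T": 8}, {"A": 8}, {"T": 8}, {"A": 8}]
    for hit in scan("GGTATAGG", log_odds_matrix(toy_counts), min_score=5.0):
        print(hit)
```

Scans like this produce many spurious hits on their own, which is why a tool in this category follows the prediction step with a conservation filter.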

ConSite http://forkhead.cgb.ki.se/cgi-bin/consite
“Identification of conserved regulatory elements by comparative genome analysis”, Boris Lenhard, Albin Sandelin, Luis Mendoza, Pär Engström, Niclas Jareborg and Wyeth W Wasserman. Journal of Biology (BioMed Central, open access).

ConSite: Identification of conserved regulatory elements by comparative genome analysis
Consite is a web-based tool for detecting transcription factor binding sites in genomic sequences using phylogenetic footprinting. Two orthologous genomic sequences are aligned, and transcription factor binding sites are reported only for those regions of the alignment that exceed a certain threshold of conservation.
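The conservation filter described here can be sketched with a sliding window over a pairwise alignment. This is an illustration of the general idea, not ConSite's implementation: the window size, the 0.7 threshold, the rule that a site is kept if any window fully containing it is conserved, and the site names are all assumptions.

```python
def window_identity(aln_a, aln_b, window=50):
    """Identity fraction for each window start over two aligned sequences.

    Gap columns ("-") count as mismatches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    matches = [
        1 if (x == y and x != "-") else 0
        for x, y in zip(aln_a.upper(), aln_b.upper())
    ]
    if len(matches) < window:
        return []
    total = sum(matches[:window])
    identities = [total / window]
    for i in range(window, len(matches)):
        # Slide the window one column to the right.
        total += matches[i] - matches[i - window]
        identities.append(total / window)
    return identities


def conserved_hits(hits, identities, window, threshold=0.7):
    """Keep hits (name, start, end) lying inside at least one conserved window.

    start and end are 0-based, half-open alignment columns.
    """
    kept = []
    for name, start, end in hits:
        first = max(0, end - window)
        last = min(start, len(identities) - 1)
        if any(identities[w] >= threshold for w in range(first, last + 1)):
            kept.append((name, start, end))
    return kept


if __name__ == "__main__":
    a = "ACGTACGTAC" + "AAAAAAAAAA"
    b = "ACGTACGTAC" + "CCCCCCCCCC"
    ids = window_identity(a, b, window=5)
    sites = [("siteA", 2, 6), ("siteB", 12, 16)]  # hypothetical site names
    print(conserved_hits(sites, ids, window=5))
```

Raising the threshold trades sensitivity for specificity: fewer predicted sites survive, but those that do lie in more strongly conserved regions.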

ConSite
The method is implemented as a graphical web application, ConSite, which is at: http://forkhead.cgb.ki.se/cgi-bin/consite or http://www.phylofoot.org/ Various tools are made available at phylofoot.org.

Sequence View http://www.phylofoot.org/
