Unraveling Gene Identification in Genomic Data: Too Much of a Good Thing

IDENTIFICATION OF GENES IN GENOMIC DATA: TOO MUCH OF A GOOD THING

RELEVANT NUMBERS • E. coli genome: four million base pairs, 4,000 genes • Human genome: three billion base pairs, 25,000 genes

GENE STRUCTURES • E. coli: WYSIWYG (what you see is what you get) • Human: life is complicated (exons and introns, alternative splicing, alternative starts and stops)

HUMAN GENES

MICROBIAL WORK • Not trivial, but more direct than human • Many approaches • GRAIL, a first attempt • Web-based searches • Markov models most common

One-State Markov Model

First Order Probabilities P(i | i-1)

General Markov Model • Probabilistic model • Uses adjacent base(s) to predict current base • Order of model depends on number of bases examined • Sum the probabilities for each base • High score wins (recognized as gene)
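The scoring idea on this slide can be sketched in a few lines. This is a minimal illustration, not the method of any particular gene finder: the function names, the pseudocount, and the use of summed log-probabilities (one common way to realize "sum the probabilities for each base") are all assumptions.

```python
import math

BASES = "ACGT"

def train_first_order(sequences, pseudocount=1.0):
    """Estimate P(current base | previous base) from training sequences."""
    # Start every transition at a pseudocount so no probability is zero.
    counts = {p: {c: pseudocount for c in BASES} for p in BASES}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            if prev in counts and cur in counts[prev]:
                counts[prev][cur] += 1
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {cur: n / total for cur, n in row.items()}
    return probs

def log_score(seq, probs):
    """Sum of log conditional probabilities along the sequence."""
    return sum(math.log(probs[p][c]) for p, c in zip(seq, seq[1:]))

# Toy training set (hypothetical sequences, for illustration only).
model = train_first_order(["ATGGCGGCGGCG", "ATGGCGGCCGCG"])
print(log_score("GCGGCG", model), log_score("TTTTTT", model))
```

A sequence that resembles the training data scores higher than one that does not, which is the sense in which "high score wins".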

First Order Probabilities P(i | i-1)

Second Order Probabilities P(i | i-2, i-1) • 64 possibilities • kth order needs 4^(k+1) probabilities • DNA actually needs six models (six reading frames), so six times all probabilities • Need known genes to determine real probabilities (train the model)
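The parameter counts on this slide follow from a one-line formula: a kth order model conditions on k bases and predicts one more, so each frame needs 4^(k+1) probabilities. The helper name below is hypothetical.

```python
def num_probabilities(order, frames=6):
    """Conditional probabilities needed by a k-th order DNA Markov model."""
    # 4**order possible contexts, 4 possible next bases, one table per frame.
    return 4 ** (order + 1) * frames

print(num_probabilities(2, frames=1))  # 64, the second order case
print(num_probabilities(5))            # 24576, fifth order over six frames
```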

Example • 5th order model needs 24,576 probabilities (4096 hexamers x 6 frames) • Do these occur frequently enough in identified coding regions to give good probabilities?
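One way to approach the question on this slide is to count how often each hexamer actually appears in a training set. The sketch below is an assumption-laden illustration: it handles only the three forward frames, assumes every training sequence starts in frame at position 0, and uses a hypothetical `min_count` cutoff for "frequently enough".

```python
from collections import Counter

def kmer_counts_by_frame(coding_seqs, k=6):
    """Count k-mers separately for each of the three forward codon positions."""
    counts = {frame: Counter() for frame in range(3)}
    for seq in coding_seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip ambiguous bases such as N
                counts[i % 3][kmer] += 1
    return counts

def fraction_well_sampled(counter, k=6, min_count=10):
    """Share of the 4**k possible k-mers seen at least min_count times."""
    seen = sum(1 for n in counter.values() if n >= min_count)
    return seen / 4 ** k

# Hypothetical toy input; real training sets are known genes.
counts = kmer_counts_by_frame(["ATGAAATTT"])
print(fraction_well_sampled(counts[0], min_count=1))
```

A low fraction for a given frame would suggest that the training data cannot support a full 5th order model there.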

Scan sequence and determine scores for each region based on probabilities • Scores above threshold declared genes because they are like previously identified genes • Implies a hidden relationship
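The scan-and-threshold step can be sketched as a sliding window. Several choices here are assumptions rather than anything stated on the slide: the score is a log-odds ratio of a coding model against a background model, the window and step sizes are arbitrary, and the probability tables are toy values.

```python
import math

def log_odds(window, coding, background):
    """Sum of log(P_coding / P_background) over adjacent base pairs."""
    return sum(
        math.log(coding[p][c] / background[p][c])
        for p, c in zip(window, window[1:])
    )

def scan(seq, coding, background, window=12, step=3, threshold=0.0):
    """Return (start, end, score) for every window scoring above threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        piece = seq[start:start + window]
        score = log_odds(piece, coding, background)
        if score > threshold:
            hits.append((start, start + window, score))
    return hits

# Toy first order tables (hypothetical): the "coding" model favors G<->C steps.
toy_background = {p: {c: 0.25 for c in "ACGT"} for p in "ACGT"}
toy_coding = {p: dict(row) for p, row in toy_background.items()}
toy_coding["G"] = {"A": 0.1, "C": 0.5, "G": 0.3, "T": 0.1}
toy_coding["C"] = {"A": 0.1, "C": 0.3, "G": 0.5, "T": 0.1}

hits = scan("ATATATATATAT" + "GCGCGCGCGCGC", toy_coding, toy_background)
for start, end, score in hits:
    print(start, end, round(score, 2))
```

Windows that look like the training data clear the threshold, while the purely AT-rich window at the start does not.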

GLIMMER and GLIMMERM • Use interpolated Markov models • If oligomers are available, will score up to 8th order • M version works for small eukaryotes (Plasmodium falciparum, Arabidopsis thaliana) • Adds information about splice sites
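The idea behind an interpolated Markov model is to use a long context when it was seen often enough in training and to fall back toward shorter contexts otherwise. The sketch below uses a simple count-based weight, which is an assumption made for illustration and not GLIMMER's actual weighting scheme; all names and the `min_count` value are hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def context_counts(seqs, max_order):
    """Count (context, next base) pairs for every context length 0..max_order."""
    counts = Counter()
    for seq in seqs:
        for i, base in enumerate(seq):
            for k in range(0, min(i, max_order) + 1):
                counts[(seq[i - k:i], base)] += 1
    return counts

def interpolated_prob(context, base, counts, max_order, min_count=20):
    """Blend estimates from order 0 upward, trusting each longer context
    in proportion to how often it was observed (weight capped at 1)."""
    prob = 0.25  # uniform starting point (assumption)
    for k in range(0, min(len(context), max_order) + 1):
        ctx = context[len(context) - k:]
        total = sum(counts[(ctx, b)] for b in BASES)
        if total == 0:
            break  # context never seen: keep the lower-order estimate
        weight = min(1.0, total / min_count)
        estimate = counts[(ctx, base)] / total
        prob = weight * estimate + (1 - weight) * prob
    return prob

# Hypothetical toy training data.
counts = context_counts(["ACGTACGTACGT"] * 10, max_order=2)
print(interpolated_prob("AC", "G", counts, 2))
print(interpolated_prob("TT", "A", counts, 2))  # "TT" unseen, falls back to "T"
```

With real data the same fallback lets a model reach up to high orders where oligomers are plentiful without overfitting where they are rare.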

GENSCAN • Combines multiple approaches • 5th order Markov model • Staden weight matrix to model cis sites (poly A site, TATA and INR, CAP site, translation termination sites) • Maximal Dependence Decomposition • Donor and acceptor splice site modeling
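A weight matrix scores a candidate site position by position against base frequencies observed in known sites. The sketch below is a generic log-odds version under stated assumptions (uniform 0.25 background, a pseudocount of 1, made-up TATA-like training sites); it is not GENSCAN's code.

```python
import math

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=1.0):
    """Per-position log-odds (vs. uniform background) from aligned sites."""
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        total = len(column) + 4 * pseudocount
        matrix.append({
            b: math.log(((column.count(b) + pseudocount) / total) / 0.25)
            for b in BASES
        })
    return matrix

def score_site(candidate, matrix):
    """Sum the per-position weights for a candidate of matching length."""
    return sum(row[b] for row, b in zip(matrix, candidate))

# Hypothetical aligned sites, for illustration only.
tata_like = ["TATAAA", "TATATA", "TATAAA", "TATAAG"]
wm = build_weight_matrix(tata_like)
print(score_site("TATAAA", wm), score_site("GCGCGC", wm))
```

Because each position is scored independently, a plain weight matrix cannot capture dependencies between positions, which is the gap that methods such as Maximal Dependence Decomposition aim to address.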

A Tale of Two Genomes • Private genome – Celera • Public genome – Rest of the world • DRAFT ONLY (except chromosomes 21 and 22)

Celera • Otto-RefSeq: gold standard (full-length, curated cDNAs) • Otto-Homology: ESTs, cDNAs, mouse-human genomic similarities, known proteins

Celera • De novo • GENSCAN • GRAIL • FGENESH • Manual curation

Celera results • De novo prediction of 76,400 genes (58,000 appeared to be new) • 21,350 supported by some other evidence • Otto homologies identify 17,764 other genes • Total is 39,114

Public • De novo using Ensembl • Merge predictions with predictions from GENIE • Merge results with known genes in databases • Eliminate bacterial sequences

Public Integrated Gene Index (IGI)
