Comp. Genomics

Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation

Outline • Gene finding using HMMs • Adding trees to HMMs • phyloHMM • N-SCAN • BLAST+ Gene Finding • SGP2 • Examples

Markov Sequence Models • Key: distinguish coding/non-coding statistics • Popular models: • 6-mers (5th order Markov Model) • Homogeneous/non-homogeneous (reading frame specific) • Not sensitive enough for eukaryote genes: exons too short, poor detection of splice junctions
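The coding/non-coding distinction above can be sketched as a log-odds score between two fixed-order Markov models. This is a minimal illustration, not the scoring used by any particular gene finder: the function names `train_markov` and `log_odds`, the pseudocount, and the toy training strings are all assumptions, and only the homogeneous (frame-independent) variant is shown.

```python
import math
from collections import defaultdict
from itertools import product

def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for ctx in map("".join, product("ACGT", repeat=order)):
        total = sum(counts[ctx].values()) + 4 * pseudocount
        model[ctx] = {b: (counts[ctx][b] + pseudocount) / total for b in "ACGT"}
    return model

def log_odds(seq, coding, noncoding, order=5):
    """Sum of log P_coding / P_noncoding over every base that has a full context."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, b = seq[i - order:i], seq[i]
        score += math.log(coding[ctx][b] / noncoding[ctx][b])
    return score

# Toy data (hypothetical); order=2 keeps the example small.
coding_model = train_markov(["ATGATGATGATGATG"], order=2)
noncoding_model = train_markov(["AAAAAAAAAATTTTTTTTTT"], order=2)
codon_like = log_odds("ATGATG", coding_model, noncoding_model, order=2)
poly_a = log_odds("AAAAAA", coding_model, noncoding_model, order=2)
```

A positive score means the window looks more like the coding training data. With `order=5` each step conditions on a 5-base context, which corresponds to the 6-mer statistics mentioned on the slide.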

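Length Distribution • [Figure: two-state HMM with states exon and intron; transition labels p, 1-p, q, 1-q] • Simple HMMs can only encode geometric length distributions • The length of each exon (intron) is geometrically distributed: a state that loops back to itself with probability $s$ emits exactly $k$ characters with probability $s^{k-1}(1-s)$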
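A small numeric check of why a plain HMM state implies a geometric length distribution. The names `geometric_length_pmf`, `expected_length` and the parameter `stay` (the state's self-transition probability) are illustrative assumptions, not notation from the slides.

```python
def geometric_length_pmf(k, stay):
    """P(the state is occupied for exactly k steps): k-1 self-loops, then one exit."""
    if k < 1:
        return 0.0
    return stay ** (k - 1) * (1.0 - stay)

def expected_length(stay):
    """Mean of the geometric distribution above."""
    return 1.0 / (1.0 - stay)

# With stay = 0.99 the mean length is about 100, yet the single most likely length is 1.
pmf = [geometric_length_pmf(k, 0.99) for k in range(1, 6)]
```

The probability mass always peaks at length 1 and decays from there, whatever value `stay` takes. No choice of `stay` can therefore produce a distribution that peaks at an intermediate length.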

Exon Length Distribution • The length distribution of introns is ≈ geometric • For exons, it isn’t: exon length is also affected by splicing itself: • Too short (under 50 bp): the spliceosomes have no room • Too long (over 300 bp): the ends have problems finding each other • But as usual there are exceptions • A different model is needed for exons

Generalized HMM (Burge & Karlin, J. Mol. Biol. 1997, 268:78-94) • Instead of a single character, each state emits a sequence whose length follows some length distribution

Generalized HMM (Burge & Karlin, J. Mol. Biol. 1997, 268:78-94) • Overview: • Hidden Markov states q_1, …, q_n • State q_i has output length distribution f_i • Output of each state can have a separate probabilistic model (weight matrix model, HMM, …) • Initial state probability distribution π • State transition probabilities T_ij
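The components listed above can be sketched as a generator. Each visited state first draws a segment length from its own length distribution, then emits a whole segment from its own output model. Everything concrete here is a toy assumption made for illustration: the two states, the uniform 50 to 300 exon length, the base compositions, and all function names. It is not the GenScan parameterization.

```python
import random

def geometric(stay):
    """Length sampler: keep extending with probability `stay`."""
    def draw(rng):
        k = 1
        while rng.random() < stay:
            k += 1
        return k
    return draw

def uniform(lo, hi):
    """Length sampler: any length in [lo, hi], equally likely (toy choice)."""
    return lambda rng: rng.randint(lo, hi)

def composition(weights):
    """Output model: independent bases with the given A, C, G, T weights."""
    return lambda d, rng: "".join(rng.choices("ACGT", weights=weights, k=d))

def sample_ghmm(states, init, trans, length_dist, emit, n_segments, rng):
    """Return a parse: a list of (state, emitted segment) pairs."""
    state = rng.choices(states, weights=[init[s] for s in states])[0]
    parse = []
    for _ in range(n_segments):
        d = length_dist[state](rng)                 # draw the segment length f_i
        parse.append((state, emit[state](d, rng)))  # emit d characters at once
        state = rng.choices(states, weights=[trans[state][s] for s in states])[0]
    return parse

STATES = ["exon", "intron"]
INIT = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.0, "intron": 1.0},
         "intron": {"exon": 1.0, "intron": 0.0}}
LENGTH = {"exon": uniform(50, 300), "intron": geometric(0.999)}
EMIT = {"exon": composition([1, 3, 3, 1]), "intron": composition([3, 1, 1, 3])}

parse = sample_ghmm(STATES, INIT, TRANS, LENGTH, EMIT, 6, random.Random(0))
```

Because the length is drawn explicitly, self-transitions are not needed. The exon state can use any length distribution while the intron state keeps the geometric one.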

GenScan Model Burge & Karlin JMB 97

GenScan model • States = functional units of a gene • The allowed transitions ensure the order is biologically consistent • As an intron may cut a codon, one must keep track of the reading frame, hence the three intron phases: • phase I0: introns between codons • phase I1: introns that start after the 1st base of a codon • phase I2: introns that start after the 2nd base of a codon
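The phase bookkeeping can be illustrated with a short helper. It assumes the exon lengths count coding bases only and that the first exon begins at the first base of the start codon. The name `intron_phases` is hypothetical.

```python
def intron_phases(coding_exon_lengths):
    """Phase of the intron that follows each exon (the last exon has none).

    Phase 0: the intron falls between codons.
    Phase 1: the intron starts after the 1st base of a codon.
    Phase 2: the intron starts after the 2nd base of a codon.
    """
    phases = []
    coding_bases_so_far = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases_so_far += length
        phases.append(coding_bases_so_far % 3)
    return phases

phases = intron_phases([100, 50, 30])
```

Here the first intron interrupts a codon after its first base (100 mod 3 = 1), and the second falls exactly between codons (150 mod 3 = 0).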

Phylogenetic HMMs • Due to Siepel and Haussler • A simple gene-finding HMM looks at a single Markov process: • Along the sequence: each position is dependent on the previous position • If we incorporate sequences from multiple organisms, we can look at another process: • Along the tree: each position is dependent on its ancestor

Phylogenetic HMMs • A simple HMM can be thought of as a machine that generates a sequence • Every state emits a single character • Multinomial distribution at every state • A phyloHMM generates an MSA • Every state emits a single MSA column • Phylogenetic model at every state

Phylogenetic HMMs

Phylogenetic models in phyloHMM • Defines a stochastic process of substitution • Every position is independent • The following process occurs: • A character is assigned to the root • Substitutions occur along each branch according to a substitution matrix and the branch lengths • The characters at the leaves of the tree correspond to the MSA column
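The generative process above can be sketched for one alignment column. To keep the example free of matrix exponentials it assumes the Jukes-Cantor substitution model, which has closed-form transition probabilities, and a uniform root distribution. The slides do not prescribe a substitution model. The tree encoding `(name, branch_length, children)`, the species names, and the function names are illustrative assumptions.

```python
import math
import random

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming base b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def simulate_column(tree, rng, parent_base=None):
    """Draw one MSA column; returns {leaf name: base}."""
    name, t, children = tree
    if parent_base is None:
        base = rng.choice(BASES)  # a character is assigned to the root
    else:
        weights = [jc_prob(parent_base, b, t) for b in BASES]
        base = rng.choices(BASES, weights=weights)[0]  # substitution along the branch
    if not children:
        return {name: base}       # leaves make up the observed column
    column = {}
    for child in children:
        column.update(simulate_column(child, rng, base))
    return column

TREE = ("root", 0.0, [
    ("human", 0.1, []),
    ("rodent_ancestor", 0.2, [("mouse", 0.1, []), ("rat", 0.1, [])]),
])
column = simulate_column(TREE, random.Random(1))
```

Calling `simulate_column` once per position, with every call independent of the others, yields an alignment in which each column is an independent draw from the same tree model.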

Phylogenetic models in phyloHMM • Different models for different states: • Different substitution rates • E.g., in exons, we’ll see fewer substitutions • Different patterns of substitutions • E.g., third position bias in coding sequences • Different tree topologies • E.g., following recombination

Formally • S – set of states • Ψ – phylogenetic models (instead of E in a standard HMM) • A – state transitions • b – initial probabilities

Formally • Q – substitution rate matrix (e.g., derived from PAM) • Π – background frequencies • τ – the phylogenetic tree • β – branch lengths

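Formally • $P(X_i \mid \psi_j)$: the probability of column $X_i$ being emitted by the model $\psi_j$ • Can be computed efficiently by Felsenstein’s “pruning algorithm” (recitation 6) • Joint probability of a path $\phi = (\phi_1, \ldots, \phi_L)$ in the HMM and an alignment $X$ with $L$ columns: $P(\phi, X) = b_{\phi_1} P(X_1 \mid \psi_{\phi_1}) \prod_{i=2}^{L} a_{\phi_{i-1}\phi_i} P(X_i \mid \psi_{\phi_i})$, where $a_{jk}$ are the entries of A • Viterbi, forward-backward etc. work as usual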
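A minimal sketch of the pruning computation for the probability of one column. It again assumes the Jukes-Cantor model with uniform background frequencies, a tree encoded as `(name, branch_length, children)`, and no gaps in the column. All names are illustrative, and a real phyloHMM would use its own rate matrix Q and background frequencies instead.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming base b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def pruning(tree, column):
    """Return {base: P(observed leaves below this node | this node carries base)}."""
    name, _, children = tree
    if not children:
        return {b: 1.0 if column[name] == b else 0.0 for b in BASES}
    partial = {b: 1.0 for b in BASES}
    for child in children:
        child_partial = pruning(child, column)
        t = child[1]  # length of the branch leading to this child
        for b in BASES:
            partial[b] *= sum(jc_prob(b, c, t) * child_partial[c] for c in BASES)
    return partial

def column_likelihood(tree, column):
    """Probability of the column: sum over root bases, uniform background."""
    root = pruning(tree, column)
    return sum(0.25 * root[b] for b in BASES)

TREE = ("root", 0.0, [
    ("human", 0.1, []),
    ("rodent_ancestor", 0.2, [("mouse", 0.1, []), ("rat", 0.1, [])]),
])
conserved = column_likelihood(TREE, {"human": "A", "mouse": "A", "rat": "A"})
diverged = column_likelihood(TREE, {"human": "A", "mouse": "C", "rat": "G"})
```

Each node is visited once, so the cost per column is linear in the number of tree nodes. Enumerating every assignment of bases to internal nodes would instead grow exponentially with the size of the tree.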

Simple phylo-gene-finder • If the parameters are known, Viterbi can be used to find the most probable path, i.e., a segmentation into coding and non-coding regions • [Figure: HMM state diagram; surviving labels: Non-coding, 3rd position]
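Once per-column emission probabilities are available, the segmentation step is ordinary Viterbi. The sketch below takes them as a precomputed table `log_emit[i][s]`, the log probability of column i under state s. The two states and all numbers in the example are made up for illustration, and in a phyloHMM each entry would come from the phylogenetic model of that state.

```python
import math

def viterbi(log_emit, log_init, log_trans):
    """Most probable state path; all inputs are log probabilities."""
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for col in log_emit[1:]:
        prev, score, ptr = score, [], []
        for s in range(n_states):
            # Best predecessor state for landing in s at this column.
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + col[s])
        back.append(ptr)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # trace back from the last column
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy example: state 0 = non-coding, state 1 = coding (hypothetical numbers).
L = math.log
log_emit = [[L(0.9), L(0.1)]] * 3 + [[L(0.1), L(0.9)]] * 3
log_init = [L(0.5), L(0.5)]
log_trans = [[L(0.9), L(0.1)], [L(0.1), L(0.9)]]
path = viterbi(log_emit, log_init, log_trans)
```

Working in log space avoids numerical underflow on long alignments. The returned list gives one state index per alignment column.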

Phylo-gene-finder is a good idea • Use of phylogeny is important: • Imposes structure on the substitutions • Weights different pairs differently based on the evolutionary distance

N-SCAN • Another phylogeny-HMM-gene-finder • A GHMM that emits MSA columns • Annotates one sequence at a time: the target sequence • Distinguishes between a target sequence (T) and other informative sequences (Is) that may contain gaps • States correspond to sequence types in the target sequence

N-SCAN • Bayesian network instead of a simple evolutionary model • Accounts for: • 5’ UTRs • Conserved non-coding • Highly conserved • No “coding” features

SGP-2 • Drawback of the described approaches: they require a meaningful alignment • Impossible if one of the genomes is not yet finished • An alignment is not necessarily “correct”

SGP-2 • A framework working on two genomes • Idea: • Use BLAST to identify which positions are more/less conserved • Feed the BLAST scores into the gene-finding HMM • The BLAST results serve to modify the scores of the exons.
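The idea of letting similarity hits modify exon scores can be sketched as follows. This is only an illustration of the general principle and not SGP-2's actual scoring formula: the linear combination, the `weight` parameter, the overlap weighting, and the half-open coordinate convention are all assumptions.

```python
def rescore_exon(exon, hsps, weight=0.5):
    """Add a similarity bonus to a candidate exon's score.

    exon: (start, end, base_score), half-open coordinates on the target genome.
    hsps: list of (start, end, hsp_score) for high-scoring pairs from a
          similarity search against the second genome.
    Each HSP contributes its score scaled by the fraction of the HSP that
    falls inside the exon.
    """
    start, end, base_score = exon
    bonus = 0.0
    for h_start, h_end, h_score in hsps:
        overlap = min(end, h_end) - max(start, h_start)
        if overlap > 0:
            bonus += h_score * overlap / (h_end - h_start)
    return base_score + weight * bonus

supported = rescore_exon((100, 200, 5.0), [(150, 250, 40.0)])
unsupported = rescore_exon((100, 200, 5.0), [(300, 400, 40.0)])
```

Candidate exons that coincide with conserved regions move up in the ranking, while exons with no hit keep their original score, so the gene finder can still predict them.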

SGP-2

BACH1

OLIG2

PPM1A

Summary • Different approaches for gene finding • Adding phylogeny generally helps • But: • What about genes/exons that are specific to humans? • Ape genomes are (almost) unavailable, and are too similar to be informative • Phylogenetic help is almost essential in more difficult problems: • Motif finding (promoter analysis) • Ultraconserved regions with no evident function
