1 / 46

Dynamic Programming Algorithms II

Dynamic Programming Algorithms II. Mohammed Aledhari Bioinformatics / data mining Thursday May 23 rd , 2013 . Outline. Local Sequence Alignment Alignment with Gap Penalties Multiple Alignment Gene Prediction Statistical Approaches to Gene Prediction

wilson
Télécharger la présentation

Dynamic Programming Algorithms II

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Dynamic Programming Algorithms II Mohammed Aledhari Bioinformatics/ data mining ThursdayMay 23rd, 2013

  2. Outline • Local Sequence Alignment • Alignment with Gap Penalties • Multiple Alignment • Gene Prediction • Statistical Approaches to Gene Prediction • Similarity-Based Approaches to Gene Prediction • Conclusion • References

  3. The Local Alignment Problem • Goal: Find the best local alignment between two strings • Input : Strings v, w and scoring matrix δ • Output : Alignment of substrings of v and w whose alignment score is maximum among all possible alignment of all possible substrings

  4. Local Alignments: Why? • Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. • Example: • Homeobox genes have a short region called the homeodomain that is highly conserved between species. • A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence

  5. Compute a “mini” Global Alignment to get Local Local Alignment: Example Local alignment Global alignment

  6. Local Alignment: Free Rides Yeah, a free ride! Vertex (0,0) The dashed edges represent the free rides from (0,0) to every other node.

  7. This is more likely. This is less likely. Affine Gap Penalties • In nature, a series of kindels often come as a single event rather than a series of k single nucleotide events: ATA__GC ATATTGC ATAG_GC AT_GTGC Normal scoring would give the same score for both alignments

  8. Adding “Affine Penalty” Edges to the Edit Graph There are many such edges! Adding them to the graph increases the running time of the alignment algorithm by a factor of n (where n is the number of vertices) So the complexity increases from O(n2) to O(n3)

  9. Multiple Alignment versus Pairwise Alignment • Up until now we have only tried to align two sequences. • What about more than two? And what for? • A faint similarity between two sequences becomes significant if present in many • Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

  10. Generalizing the Notion of Pairwise Alignment • Alignment of 2 sequences is represented as a 2-row matrix • In a similar way, we represent alignment of 3 sequences as a 3-row matrix A T _ G C G _ A _ C G T _ A A T C A C _ A • Score: more conserved columns, better alignment

  11. Alignment Paths • Align 3 sequences: ATGC, AATC,ATGC x coordinate y coordinate z coordinate • Resulting path in (x,y,z) space: • (0,0,0)(1,1,0)(1,2,1) (2,3,2) (3,3,3) (4,4,4)

  12. Aligning Three Sequences source • Same strategy as aligning two sequences • Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align • For global alignments, go from source to sink sink

  13. 2-D vs 3-D Alignment Grid V W 2-D edit graph 3-D edit graph

  14. 2-D cell versus 2-D Alignment Cell In 2-D, 3 edges in each unit square In 3-D, 7 edges in each unit cube

  15. Architecture of 3-D Alignment Cell (i-1,j,k-1) (i-1,j-1,k-1) (i-1,j,k) (i-1,j-1,k) (i,j,k-1) (i,j-1,k-1) (i,j,k) (i,j-1,k)

  16. Multiple Alignment: Running Time • For 3 sequences of length n, the run time is 7n3; O(n3) • For ksequences, build a k-dimensional Manhattan, with run time (2k-1)(nk); O(2knk) • Dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time

  17. Greedy Approach: Example • Consider these 4 sequences s1 GATTCA s2 GTCTGA s3 GATATT s4 GTCAGC

  18. Greedy Approach: Example (cont’d) • There are = 6 possible alignments s2GTCTGA s4GTCAGC (score = 2) s1 GAT-TCA s2 G-TCTGA (score = 1) s1 GAT-TCA s3 GATAT-T (score = 1) s1 GATTCA-- s4 G—T-CAGC(score = 0) s2G-TCTGA s3GATAT-T (score = -1) s3GAT-ATT s4G-TCAGC (score = -1)

  19. Greedy Approach: Example (cont’d) s2 and s4 are closest; combine: s2GTCTGA s4GTCAGC s2,4GTCt/aGa/cA(profile) new set of 3 sequences: s1 GATTCA s3 GATATT s2,4GTCt/aGa/c

  20. Gene Prediction • Gene Prediction is the process of detection of the location of open reading frames (ORFs) and delineation of the structures of introns as well as exons if the genes of interest are of eukaryotic origin. • The ultimate goal is to describe all the genes computationally with near 100% accuracy

  21. Gene • Genes are the functional and physical unit of heredity passed from parent to offspring. • Gene: A sequence of nucleotides coding for protein • Gene Prediction Problem: Determine the beginning and end positions of genes in a genome • Genes are pieces of DNA, and most genes contain the information for making a specific protein.

  22. DNA transcription RNA translation Protein Central Dogma: DNA -> RNA -> Protein CCTGAGCCAACTATTGATGAA CCUGAGCCAACUAUUGAUGAA PEPTIDE Central Dogma simple idea

  23. Central Dogma and Splicing intron1 intron2 exon2 exon3 exon1 transcription splicing translation exon = coding intron = non-coding Batzoglou

  24. Coding v/s Noncoding Coding region Noncoding region Noncoding regions are the parts of DNA which do not encode protein sequences. They may or may not be transcribed into RNA. E.g.: tRNA, rRNA, sRNA genes Coding regions are the parts of DNA which will give rise to a mature messenger RNA that will be translated into the specific amino acids of the protein product

  25. Two Approaches to Gene Prediction • Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns). • Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.

  26. Gene Prediction Methods • Gene Prediction represents one of the most difficult problems in the field of pattern recognition, particularly in the case of eukaryotes • The principle difficulties are: • Detection of initiation site (AUG) • Alternative start codons • Gene overlap • Undetected small proteins

  27. Gene Prediction Methods ACGTACTACGTACGTACGTACGATCGATCGATCGATCGATCGACTGATCGATCGATCGATCGTACGTAGCGACTGACTGACTGATCGACTACGTAGCTGCAGTCAGTCGACTGACTGACTA Ab-initio methods Ab-initio methods Homology based methods

  28. Ab-initio Methods • Predicts gene based on the given sequence alone. • Consists of two types of models: • Markov based models • Dynamic Programming

  29. A brief introduction of HMMs • Hidden Markov models (HMMs) are discrete Markov processes where every state generates an observation at each time step. • A hidden Markov model (HMM) is statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states.

  30. From Markov Model to HMM • HMMs are discrete Markov processes where each state also emits an observation according to some probability distribution, we need to augment our model. • Parameters • Initial state probabilities: πi • State transition probabilities: aij • Emission probabilities: ei(k)

  31. Genetic Code and Stop Codons UAA, UAG and UGA correspond to 3 Stop codons that (together with Start codon ATG) delineate Open Reading Frames

  32. Stop-start Frames in a DNA Sequence • stop codons – TAA, TAG, TGA • start codons - ATG CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG

  33. Codon Usage in Human Genome

  34. Gene Prediction Tools • GENSCAN/Genome Scan • TwinScan • Glimmer • GenMark

  35. The GENSCAN Algorithm • Algorithm is based on probabilistic model of gene structure similar to Hidden Markov Models (HMMs). • GENSCAN uses a training set in order to estimate the HMM parameters, then the algorithm returns the exon structure using maximum likelihood approach standard to many HMM algorithms (Viterbi algorithm). • Biological input: Codon bias in coding regions, gene structure (start and stop codons, typical exon and intron length, presence of promoters, presence of genes on both strands, etc) • Covers cases where input sequence contains no gene, partial gene, complete gene, multiple genes.

  36. GENSCAN Limitations • Does not use similarity search to predict genes. • Does not address alternative splicing. • Could combine two exons from consecutive genes together

  37. GenomeScan • Incorporates similarity information into GENSCAN: predicts gene structure which corresponds to maximum probability conditional on similarity information • Algorithm is a combination of two sources of information • Probabilistic models of exons-introns • Sequence similarity information

  38. TwinScan • Aligns two sequences and marks each base as gap ( - ), mismatch (:), match (|), resulting in a new alphabet of 12 letters: Σ {A-, A:, A |, C-, C:, C |, G-, G:, G |, T-, T:, T|}. • Run Viterbi algorithm using emissions ek(b) where b∊ {A-, A:, A|, …, T|}.

  39. TwinScan (cont’d) • The emission probabilities are estimated from from human/mouse gene pairs. • Ex. eI(x|) < eE(x|) since matches are favored in exons, and eI(x-) > eE(x-) since gaps (as well as mismatches) are favored in introns. • Compensates for dominant occurrence of poly-A region in introns

  40. Glimmer • Gene Locator and Interpolated Markov ModelER • Finds genes in bacterial DNA • Uses interpolated Markov Models

  41. The Glimmer Algorithm • Made of 2 programs • BuildIMM • Takes sequences as input and outputs the Interpolated Markov Models (IMMs) • Glimmer • Takes IMMs and outputs all candidate genes • Automatically resolves overlapping genes by choosing one, hence limited • Marks “suspected to truly overlap” genes for closer inspection by user

  42. GenMark • Based on non-stationary Markov chain models • Results displayed graphically with coding vs. noncoding probability dependent on position in nucleotide sequence

  43. Useful internet gene prediction resources • http://www.nslij-genetics.org/gene/ • http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html • http://genome.cs.mtu.edu/aat.html • http://bioweb.pasteur.fr/seqanal/interfaces/cds-simple.html • http://genomic.sanger.ac.uk/gf/gf.shtml • http://searchlauncher.bcm.tmc.edu:9331/seq-search/gene-search.html

  44. Conclusion Local sequence alignment, Alignment with gap penalties, Multiple alignment, Gene prediction, Statistical approaches to gene prediction, Similarity-Based approaches to gene prediction

  45. References • http://bix.ucsd.edu/bioalgorithms/ (text book website) • http://biochem218.stanford.edu/ • http://www.cs.washington.edu/education/courses/cse527/09au/ • http://stellar.mit.edu/S/course/6/fa09/6.047/index.html

  46. Questions & comments Thank you

More Related