
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU



Presentation Transcript


  1. Multiple Sequence Alignment, Motif Finding and Gene Prediction • Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU

  2. What is a Multiple Sequence Alignment? • Characterizes protein families by identifying shared regions of homology • Supports molecular evolution analysis using phylogenetic methods • Tells us something about the evolution of organisms • Homologous genes (genes with a shared evolutionary origin) have similar sequences • Uncovers changes in gene structure • Helps look for evidence of selection

  3. Motivation • Suppose we have n sequences • A new sequence (e.g. a gene or protein) comes up • We want to find its family

  4. Methods of MSA • Exact method • Heuristic methods

  5. Exact method • Sequence Alignment (two sequences) • $F(i, j) = \max \{\, F(i-1, j-1) + s(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d \,\}$ • [Figure: dynamic programming table for the sequences ACGT and AAGT]
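The recurrence on slide 5 can be sketched as a short table-filling routine. This is a minimal illustration, not the presenter's code: the substitution scores (`match=2`, `mismatch=-1`), the gap penalty `d=2`, the boundary rows, and the function name `global_alignment_score` are all assumptions chosen for the example.

```python
def global_alignment_score(x, y, match=2, mismatch=-1, d=2):
    """Fill the table F for sequences x and y; return (best score, F).

    F[i][j] is the best score for aligning x[:i] with y[:j].
    Scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # Boundary: aligning a prefix against nothing costs one gap per symbol.
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align x_i with y_j
                F[i - 1][j] - d,      # x_i against a gap
                F[i][j - 1] - d,      # y_j against a gap
            )
    return F[n][m], F


if __name__ == "__main__":
    score, table = global_alignment_score("ACGT", "AAGT")
    print(score)
```

The table has (n+1) × (m+1) cells and each cell looks at three neighbours, so the running time is proportional to n·m for two sequences.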

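  6. Exact method (Dynamic Programming) • Alignment of the three sequences VSNS, SNA and AS: V S N - S / - S N A - / - - - A S • [Figure: path through the three-dimensional dynamic programming cube, beginning at Start, with one sequence along each axis]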

  7. Dynamic Programming for Three Sequences • There are 7 ways to get to C[i,j,k], for example from C[i-1,j,k-1] or C[i-1,j-1,k-1] • Enumerate all possibilities and choose the best one • For 3 sequences of length n, time is proportional to n^3
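The seven predecessors of C[i,j,k] are exactly the non-empty subsets of {i, j, k} that are decremented. A minimal sketch of the three-sequence table follows; the slides do not state a scoring scheme, so the sum-of-pairs column score (match +1, mismatch -1, gap -1, gap against gap 0) and all function names here are assumptions.

```python
from itertools import product

# The 7 moves: every (di, dj, dk) in {0,1}^3 except (0, 0, 0).
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols; '-' is the gap symbol (assumed scheme)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(a, b, c):
    """Sum-of-pairs score of one alignment column."""
    return pair_score(a, b) + pair_score(a, c) + pair_score(b, c)


def align3_score(x, y, z):
    """Best sum-of-pairs score for aligning x, y and z."""
    neg_inf = float("-inf")
    C = [[[neg_inf] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    C[0][0][0] = 0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                for di, dj, dk in MOVES:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    # A sequence that does not advance contributes a gap.
                    a = x[i - 1] if di else "-"
                    b = y[j - 1] if dj else "-"
                    c = z[k - 1] if dk else "-"
                    cand = C[pi][pj][pk] + column_score(a, b, c)
                    if cand > C[i][j][k]:
                        C[i][j][k] = cand
    return C[-1][-1][-1]


if __name__ == "__main__":
    print(align3_score("VSNS", "SNA", "AS"))
```

The three nested loops visit about n^3 cells and each cell checks a constant 7 predecessors, which matches the cubic running time stated on the slide.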

  8. Dynamic programming cont. • More than three sequences • Four or more dimensions • No polynomial-time algorithm is known that finds the optimal solution • Optimal MSA is NP-hard • So, heuristic algorithms are used to find near-optimal solutions
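To see why the exact method stops being practical, the counts from the two- and three-sequence cases can be generalized to k sequences of length n. The expression below is a rough sketch that ignores the cost of scoring each column:

```latex
\text{cells} = (n+1)^k, \qquad
\text{predecessors per cell} = 2^k - 1, \qquad
\text{time} = O\!\left(2^k \, n^k\right)
```

For k = 3 this gives 2^3 - 1 = 7 predecessors and cubic time, as on the previous slide; each additional sequence multiplies the table size by roughly n.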

  9. Heuristics for MSA • Iterative pairwise alignment • Motif/anchor-based alignment • Divide-and-conquer algorithm • Statistical methods such as Hidden Markov Models

  10. Divide and Conquer Algorithm

  11. Iterative Pairwise Alignment • Suppose we have four strings to align: MASH, MESH, SQUASH, SQUAMISH • Align MASH with MESH: MASH / MESH • Add SQUASH: M__ASH / M__ESH / SQUASH • Add SQUAMISH: M__A__SH / M__E__SH / SQUA__SH / SQUAMISH

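  12. Iterative Pairwise Alignment cont. • The same strings aligned in a different order • Align MASH with MESH: MASH / MESH • Align SQUAMISH with SQUASH: SQUAMISH / SQUA__SH • Merge the two alignments: SQUAMISH / SQUA__SH / _M_A__SH / _M_E__SH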
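One common way to realize iterative pairwise alignment is to align each new string against the alignment built so far, treating every existing column as a unit. The sketch below does this with a sum-of-pairs column score; the scoring values (match +1, mismatch -1, gap -1), the tie-breaking, and the names `col_score`, `add_sequence` and `progressive_align` are assumptions, so the gap placement it produces may differ from the slides.

```python
def col_score(column, ch, match=1, mismatch=-1, gap=-1):
    """Sum of pair scores between ch and every symbol in an existing column."""
    total = 0
    for c in column:
        if c == "_" and ch == "_":
            continue                      # gap against gap scores 0
        if c == "_" or ch == "_":
            total += gap
        else:
            total += match if c == ch else mismatch
    return total


def add_sequence(rows, seq):
    """Align seq against an existing alignment (rows of equal length)."""
    cols = list(zip(*rows))
    n, m, k = len(cols), len(seq), len(rows)
    gap_col = ("_",) * k
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + col_score(cols[i - 1], "_")
        back[i][0] = "up"
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + col_score(gap_col, seq[j - 1])
        back[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], back[i][j] = max(
                (F[i - 1][j - 1] + col_score(cols[i - 1], seq[j - 1]), "diag"),
                (F[i - 1][j] + col_score(cols[i - 1], "_"), "up"),
                (F[i][j - 1] + col_score(gap_col, seq[j - 1]), "left"),
            )
    # Trace back from the corner to recover the merged alignment.
    new_cols, new_seq = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            new_cols.append(cols[i - 1]); new_seq.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            new_cols.append(cols[i - 1]); new_seq.append("_")
            i -= 1
        else:
            new_cols.append(gap_col); new_seq.append(seq[j - 1])
            j -= 1
    new_cols.reverse(); new_seq.reverse()
    out = ["".join(col[r] for col in new_cols) for r in range(k)]
    return out + ["".join(new_seq)]


def progressive_align(seqs):
    """Add the sequences one at a time, in the order given."""
    rows = [seqs[0]]
    for s in seqs[1:]:
        rows = add_sequence(rows, s)
    return rows


if __name__ == "__main__":
    for row in progressive_align(["MASH", "MESH", "SQUASH", "SQUAMISH"]):
        print(row)
```

Because each step freezes the gaps already placed, the result depends on the order in which the strings are added, which is the point the two slides make.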

  13. Regulatory Motifs in DNA Sequences

  14. The Immune System • Immunity genes are usually dormant • When the organism is infected, they somehow get switched on • When these genes are turned on, they produce proteins that destroy the pathogen, usually curing the infection

  15. Immune System in Fruit Flies • Fruit flies do not have as sophisticated an immune system as humans • They have a small set of immunity genes, usually dormant • But when the fly is infected, these genes somehow get switched on • For fruit flies, we would like to know which genes are switched on as an immune response

  16. Regulatory Motif • [Figure: the DNA sequence ACGTCGCGTACGTAAACGCTCGCTAAACGCTCGCTAAACGCTCGCT, with the regulatory motif marked and the upstream and downstream directions labelled] • A regulatory motif is a short sequence to which a transcription factor binds; a transcription factor is a protein that encourages RNA polymerase to transcribe the downstream genes • A regulatory motif triggers gene activation • In this example, the motifs are known as NF-κB binding sites • Immunity genes in the fruit fly genome have strings that are reminiscent of TCGGGGATTTCC

  17. The Fruit Fly Experiment • Which genes are switched on as an immune response? • Infect the fly, grind it up, and collect a set of upstream regions from the genes in the genome • Each region contains at least one NF-κB binding site • NF-κB (nuclear factor kappa-light-chain-enhancer of activated B cells) is a protein complex that controls the transcription of DNA • Suppose we know neither what the NF-κB pattern looks like nor where it is located • So, given a set of sequences from a genome, can we find short substrings that seem to occur surprisingly often?
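A first, simplified reading of "substrings that occur surprisingly often" is to count, for every substring of a fixed length k, how many of the upstream regions contain it. The sketch below does only that; it assumes exact matches, whereas real binding sites vary between genes, and the function name and toy sequences are made up for illustration.

```python
from collections import Counter


def kmer_sequence_counts(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        # A set, so a k-mer repeated inside one sequence is counted once.
        seen = {s[i:i + k] for i in range(len(s) - k + 1)}
        counts.update(seen)
    return counts


if __name__ == "__main__":
    regions = ["AATCGG", "TTCGGA", "CGGCCC"]  # toy data, not from the slides
    counts = kmer_sequence_counts(regions, 3)
    print(counts.most_common(3))
```

Handling motifs that differ by a few mutations in each region requires scoring approximate matches, which is what the profile-based formulation addresses.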

  18. Profiles

  19. Profiles

  20. Profiles

  21. Profile Matrix

  22. Motif Finding Problem
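The slide bodies for the profile matrix and the Motif Finding Problem were not preserved in this transcript, so the following is only a sketch of the usual textbook formulation: stack one candidate k-mer from each sequence, count nucleotides per column to get a profile, and score the selection by how strongly each column agrees. The function names and the example k-mers are assumptions.

```python
def profile_matrix(kmers):
    """Per-column counts of A, C, G, T for a list of equally long k-mers."""
    length = len(kmers[0])
    profile = {base: [0] * length for base in "ACGT"}
    for kmer in kmers:
        for pos, base in enumerate(kmer):
            profile[base][pos] += 1
    return profile


def consensus_and_score(profile):
    """Most frequent base per column, and the sum of those column maxima."""
    length = len(profile["A"])
    consensus = []
    score = 0
    for pos in range(length):
        best = max("ACGT", key=lambda b: profile[b][pos])
        consensus.append(best)
        score += profile[best][pos]
    return "".join(consensus), score


if __name__ == "__main__":
    chosen = ["ACGT", "ACGA", "ACCT"]  # hypothetical k-mers, one per sequence
    prof = profile_matrix(chosen)
    print(consensus_and_score(prof))
```

Under this formulation, motif finding amounts to choosing one starting position per sequence so that the resulting score is as large as possible.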

  23. Gene Prediction Problem

  24. Genome Complexities • The human genome is larger than bacterial genomes, which seems logical • But the salamander genome is ten times larger than the human genome • The salamander genome contains more junk DNA and introns

  25. cDNA Problem

  26. Similar genes Across species

  27. Genome Complexities • Does it mean intron and exon lengths are the same across species? • Jumps are inconsistent across species • A gene in an insect genome is organized differently from a related gene in a worm genome • The number of parts (exons) may be different • Information that appears in one part of the human version may be broken up into two in the mouse version, or vice versa • So, related genes can be quite different in terms of part structure

  28. Genome Complexities • Human genes constitute only 3% of the human genome • No existing in silico gene recognition algorithm provides completely reliable gene recognition • Broadly, there are two approaches to gene prediction • Statistical methods • Similarity-based approach

  29. Similarity-Based Approach: The Exon Chaining Problem • This approach uses previously sequenced genes and their protein products as a template • Find a set of potential (putative) exons by local alignment • The putative exons may overlap • The problem is to choose the best subset of non-overlapping substrings as a putative exon structure

  30. Putative Exon Model • Let (l, r, w) describe an exon that starts at position l, ends at position r, and has weight w • w may reflect the local alignment score or any other measure • Examples: (2, 3, 3) and (7, 17, 12)

  31. Putative Exon Model • Let (l, r, w) describe an exon that starts at position l, ends at position r, and has weight w • w may reflect the local alignment score or any other measure • [Figure: putative exons with weights 12, 5, 10, 7, 6, 1, 3, 4, with all position scores initialized to 0, and the score update rule] • i is the current location; j is the left end of the exon ending at the current location

  32. Putative Exon Model • Let (l, r, w) describe an exon that starts at position l, ends at position r, and has weight w • w may reflect the local alignment score or any other measure • [Figure: putative exons with weights 12, 5, 10, 6, 7, 3, 1, 4, and the score update rule] • i is the current location; j is the left end of the exon ending at the current location

  33. Exon Chaining Algorithm
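The (l, r, w) model from slides 30 to 32 lends itself to a left-to-right scan over the sorted interval endpoints, keeping the best chain weight seen so far. The sketch below is one way to write that scan and is not taken from the slides: the function name is an assumption, and so is the choice to treat two exons that share an endpoint as overlapping.

```python
def exon_chaining(exons):
    """Maximum total weight of pairwise non-overlapping exons.

    exons: list of (l, r, w) with l <= r. Two exons sharing a position
    (including a common endpoint) are treated as overlapping.
    """
    # Rank the distinct endpoints 1..p from left to right.
    points = sorted({p for l, r, _ in exons for p in (l, r)})
    index = {p: i + 1 for i, p in enumerate(points)}

    # For each right-end rank, remember (left-end rank, weight).
    ends_at = {}
    for l, r, w in exons:
        ends_at.setdefault(index[r], []).append((index[l], w))

    # s[i] = best chain weight using only exons that end at rank <= i.
    s = [0] * (len(points) + 1)
    for i in range(1, len(points) + 1):
        s[i] = s[i - 1]                      # no exon ends here, or skip it
        for j, w in ends_at.get(i, []):
            # Take this exon plus the best chain ending strictly before it.
            s[i] = max(s[i], s[j - 1] + w)
    return s[-1]


if __name__ == "__main__":
    # The two exons shown on slide 30 do not overlap, so both are kept.
    print(exon_chaining([(2, 3, 3), (7, 17, 12)]))
```

After sorting, the scan touches each endpoint and each exon once, so the work is dominated by the sort over the 2n endpoints.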

  34. Reference • Multiple Sequence Alignment: no specific reference, use web resources • Motif Finding Problem: Chapter 4.4, Introduction to Bioinformatics, by Pavel Pevzner • Gene Prediction Problem: Chapter 6.11, Introduction to Bioinformatics, by Pavel Pevzner
