
Comparative Sequence Analysis in Molecular Biology

Comparative Sequence Analysis in Molecular Biology. Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle, Washington, U.S.A. Outline. What genome data is available? What is phylogenetic footprinting?


Presentation Transcript


  1. Comparative Sequence Analysis in Molecular Biology. Martin Tompa, Computer Science & Engineering, Genome Sciences, University of Washington, Seattle, Washington, U.S.A.

  2. Outline • What genome data is available? • What is phylogenetic footprinting? • Phylogenetic footprinting by multiple sequence alignment • Which parts of multiple sequence alignments are trustworthy? • FootPrinter: phylogenetic footprinting without alignment

  3. Outline • What genome data is available? • What is phylogenetic footprinting? • Phylogenetic footprinting by multiple sequence alignment • Which parts of multiple sequence alignments are trustworthy? • FootPrinter: phylogenetic footprinting without alignment

  4. How Many Genomes Are Available? • 46 vertebrate genomes sequenced (primates to rodents to marsupials to birds to fishes) • 1766 bacterial genomes sequenced (as of 2/12/2012) • Insects, fungi, worms, plants, … • Many more will be finished very soon • Fertile ground for comparative genomics

  5. 1982-2003: number of nucleotides in GenBank doubled every 18 months. Since 2003: doubled every 3 years.

  6. Outline • What genome data is available? • What is phylogenetic footprinting? • Phylogenetic footprinting by multiple sequence alignment • Which parts of multiple sequence alignments are trustworthy? • FootPrinter: phylogenetic footprinting without alignment

  7. Phylogenetic Footprinting (Tagle et al. 1988) • Functional regions of DNA (regions under “purifying constraint”) evolve more slowly than nonfunctional ones. • Consider a set of corresponding DNA sequences from related species. • Identify unusually well conserved subsequences (i.e., ones that have not mutated much over the course of evolution): “motifs”

  8. Outline • What genome data is available? • What is phylogenetic footprinting? • Phylogenetic footprinting by multiple sequence alignment • Which parts of multiple sequence alignments are trustworthy? • FootPrinter: phylogenetic footprinting without alignment

  9. How to Find Conserved Motifs
     ACTAACCGGGAGATTTCAGA    human
     AAGTTCCGGGAGATTTCCA     chimp
     TAGTTATCCGGGAGATTAGA    mouse
     AAAACCGGTAGATTTCAGG     rat

  10. Multiple Sequence Alignment
      AC--TAACCGGGAGATTTCAGA    human
      AAGTT--CCGGGAGATTTCC-A    chimp
      TAGTTATCCGGGAGATT--AGA    mouse
      AA---AACCGGTAGATTTCAGG    rat
      (Finding the optimal alignment is NP-complete.)
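For contrast with the NP-completeness of optimal multiple alignment noted on slide 10, the two-sequence case can be solved exactly by dynamic programming in time proportional to the product of the sequence lengths. The following is a minimal sketch of global (Needleman-Wunsch style) alignment scoring; the function name and the match, mismatch, and gap values are illustrative assumptions, not parameters taken from the talk.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b.

    The scoring values are hypothetical defaults chosen for illustration.
    Only the previous row of the DP table is kept, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = cur[j - 1] + gap   # b[j-1] aligned to a gap
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # Human and chimp sequences from slide 9.
    human = "ACTAACCGGGAGATTTCAGA"
    chimp = "AAGTTCCGGGAGATTTCCA"
    print(global_alignment_score(human, chimp))
```

Extending the same table to n sequences needs one dimension per sequence, which is why exact multiple alignment becomes impractical so quickly.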

  11. Phylogenetic Footprinting • Use whole-genome multiple alignment such as provided by UCSC Genome Browser. • Search for regions of well conserved alignment. • Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe; Xie (2)] • RNA elements [Pedersen; Washietl] • General conservation & constraint [Bejerano; Boffelli; Cooper; Margulies (4); Pollard; Prabhakar; Siepel]
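The step "search for regions of well conserved alignment" can be illustrated on the toy alignment from slide 10. The sketch below scores each column by the fraction of rows carrying the most common base and reports runs of columns at or above a threshold. The function names, the identity measure, and the thresholds are assumptions made for illustration; the published methods cited on the slide use more sophisticated statistical models of conservation.

```python
from collections import Counter

# Toy alignment from slide 10 (human, chimp, mouse, rat).
ROWS = [
    "AC--TAACCGGGAGATTTCAGA",
    "AAGTT--CCGGGAGATTTCC-A",
    "TAGTTATCCGGGAGATT--AGA",
    "AA---AACCGGTAGATTTCAGG",
]


def column_identity(rows):
    """Fraction of rows that carry the most common non-gap base, per column."""
    identities = []
    for col in zip(*rows):
        bases = [c for c in col if c != "-"]
        top = Counter(bases).most_common(1)[0][1] if bases else 0
        identities.append(top / len(col))
    return identities


def conserved_runs(rows, threshold=1.0, min_len=4):
    """Half-open (start, end) column ranges whose identity stays >= threshold."""
    runs, start = [], None
    # A sentinel below any threshold closes a run that reaches the last column.
    for i, value in enumerate(column_identity(rows) + [-1.0]):
        if value >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


if __name__ == "__main__":
    for start, end in conserved_runs(ROWS):
        print(start, end, ROWS[0][start:end])
```

On this toy input the perfectly conserved runs are the CCGG and AGATT blocks on either side of the single rat substitution (G to T).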

  12. Outline • What genome data is available? • What is phylogenetic footprinting? • Phylogenetic footprinting by multiple sequence alignment • Which parts of multiple sequence alignments are trustworthy? • FootPrinter: phylogenetic footprinting without alignment

  13. Why Doubt Alignments? • Multiple sequence alignment of short sequences (proteins, promoters) is difficult (NP-complete) • Aligning whole genomes adds the complications of huge sequences and genomic rearrangements • Vertebrate alignment has 3.8 billion columns • Automatically generated

  14. Assessing 4 Genome-Size Alignments (with Xiaoyu Chen) • Alignments: MLAGAN [Brudno 2003], MAVID [Bray 2003], TBA [Blanchette 2003], Pecan [Paten 2008] • Target ENCODE regions: 30 Mbp covering 1% of the human genome (ENCODE targets) • Total input: 554 Mbp over 28 vertebrates • Rich resource for comparing and assessing genome-size alignments (Margulies et al. 2007, Genome Research)

  15. Coverage of each alignment Alignment coverage: number of human bases aligned to a given species

  16. Coverage of each alignment In noncoding regions, as species distance from human↑, coverage↓

  17. Coverage of each alignment MAVID has lowest coverage

  18. Coverage of each alignment Other 3 have comparable coverage in placental mammals

  19. Coverage of each alignment MLAGAN has highest coverage in distant species, intronic and intergenic
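The coverage measure defined on slide 15 (number of human bases aligned to a given species) can be computed directly from two rows of an alignment. This is a minimal sketch on the toy alignment from slide 10; the function name and the dictionary layout are assumptions made for illustration, and real genome-size alignments are stored in block formats rather than as single strings.

```python
# Toy alignment from slide 10; the reference row is human.
ALIGNMENT = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}


def coverage(reference_row, species_row):
    """Count columns where both the reference and the species have a base."""
    return sum(
        1
        for ref, other in zip(reference_row, species_row)
        if ref != "-" and other != "-"
    )


if __name__ == "__main__":
    human = ALIGNMENT["human"]
    for species, row in ALIGNMENT.items():
        if species != "human":
            print(species, coverage(human, row))
```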

  20. Level of agreement among alignments [Figure: Agree%, Unique%, and Disagree% for TBA (T), MAVID (V), and MLAGAN (L), in four panels: coding, UTR, intronic, and intergenic bases]

  21. Level of agreement among alignments [same figure] Agree%: Coding > UTR > Int.

  22. Level of agreement among alignments [same figure] Unique%: Coding < UTR < Int.

  23. Level of agreement among alignments [same figure] As species distance from human↑, Agree%↓ Unique%↑

  24. Level of agreement among alignments [same figure] Primates: high Agree%

  25. Level of agreement among alignments [same figure] Placental nonprimates: Agree% > 0.5

  26. Level of agreement among alignments [same figure] Distant species, Int: low Agree%, high Unique%

  27. Alignment agreement for mouse • Intronic & intergenic account for 95% of mouse bases aligned to human • Agree% in those categories: 44% to 62% • Much worse for more distant species • Building reliable MSA remains challenging [Figure: Agree%, Unique%, and Disagree% for intronic and intergenic bases]

  28. Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990) Goal: label each alignment column with a confidence measure of alignment correctness • Identify sequences that do not belong • Users forewarned about regions of interest • Genome browser designers consider realigning • Alignment tool designers get feedback for possible improvements

  29. Sample Suspicious Alignment
      Human      -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
      Chimp      -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
      Rhesus     -----------GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
      Mouse      -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
      Rat        -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
      Dog        -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
      Cow        -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
      Elephant   -----------GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
      Tenrec     -----------GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
      Opossum    -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA----------TTAACAC
      Chicken    -----------GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA----------TTAACAC
      Zebrafish  GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG--ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC

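  30. Scoring Function
      Species (leaves of the tree): 1 = Human, 2 = Chimp, 3 = Mouse, 4 = Rat, 5 = Chicken
      Pairwise: score(1,2) = \log \frac{\Pr(1,2)}{\Pr(1)\,\Pr(2)}
      Multiple: sc(12345 \mid \text{tree}) = \log \frac{\Pr(12345 \mid \text{tree})}{\Pr(125 \mid \text{tree})\,\Pr(34 \mid \text{tree})}
      (Each probability is conditioned on a phylogenetic tree that appeared as a drawing on the slide.)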
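The pairwise form of the scoring function on slide 30 is a log-odds ratio: the probability of observing two aligned bases jointly, divided by the probability of observing them independently. The sketch below evaluates it under a deliberately simple, hypothetical substitution model (uniform background frequencies, a single probability `p_same` that related bases are identical). The model, the names, and the value 0.7 are assumptions for illustration only; the multiple-sequence score on the slide is computed from tree likelihoods instead.

```python
import math

BACKGROUND = 0.25  # assumed uniform base frequency


def joint_probability(a, b, p_same=0.7):
    """Hypothetical joint probability of aligned bases a and b in related sequences."""
    if a == b:
        return BACKGROUND * p_same
    # The remaining probability mass is split evenly over the three other bases.
    return BACKGROUND * (1.0 - p_same) / 3.0


def pairwise_score(a, b, p_same=0.7):
    """score(1,2) = log( Pr(1,2) / (Pr(1) * Pr(2)) ) for one aligned column."""
    return math.log(joint_probability(a, b, p_same) / (BACKGROUND * BACKGROUND))


if __name__ == "__main__":
    print(pairwise_score("A", "A"))  # positive: more likely related than independent
    print(pairwise_score("A", "G"))  # negative: more likely independent
```

A positive score says the column is better explained by common ancestry than by chance; summing column scores gives the score of a segment.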

  31. Outline of Computation
      Input: multiple sequence alignment A
      For each branch k of the tree {
          Compute scoring function sc_k (Felsenstein)
          Find all maximally scoring segments of A using sc_k (Ruzzo & Tompa)
          Compute K, λ using sc_k (Karlin & Altschul)
          Compute p-value p_k of each segment score using K, λ (Karlin & Altschul)
      }
      Output: Discordance: max_k p_k
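The segment-finding step in the outline above can be sketched as follows. Given one score per alignment column, the function returns all maximal scoring segments in the sense of Ruzzo and Tompa: disjoint, positive-scoring runs that cannot be improved by extending or shrinking them. This version follows the structure of their algorithm but searches the candidate list naively, so it does not achieve the linear running time of the published method; the function name and the tuple layout are assumptions made for illustration.

```python
def maximal_scoring_segments(scores):
    """Return (start, end, score) for each maximal scoring segment.

    Segments are half-open index ranges over `scores`.
    """
    # Each candidate is [start, end, L, R], where L is the cumulative score
    # before the segment and R is the cumulative score through its last element.
    candidates = []
    total = 0
    for i, s in enumerate(scores):
        before = total
        total += s
        if s <= 0:
            continue  # nonpositive scores only change the running total
        current = [i, i + 1, before, total]
        while True:
            # Rightmost earlier candidate that starts from a lower cumulative score.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= current[2]:
                j -= 1
            if j < 0 or candidates[j][3] >= current[3]:
                candidates.append(current)
                break
            # Merge: extend current leftwards to the start of candidate j and
            # drop candidates j onward, which are no longer maximal.
            current = [candidates[j][0], current[1], candidates[j][2], current[3]]
            del candidates[j:]
    return [(start, end, r - l) for start, end, l, r in candidates]


if __name__ == "__main__":
    column_scores = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
    print(maximal_scoring_segments(column_scores))
```

For the sample input the segments are the single 4, the single 3, and the final run of seven scores summing to 7. Each segment score would then be converted to a p-value using K and λ, as the outline describes.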

  32. Suspicious Alignment Regions Back to four ENCODE alignments spanning 30 Mbp of human aligned to 27 other vertebrates (with Xiaoyu Chen) • Identify suspicious alignment regions: • Length ≥ 50 bp • Discordance ≥ 0.1 at each position, all with respect to the same worst species • Fewer than 50% gapped sites • Suspicious%: percentage of aligned bases in suspicious regions
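The three criteria on slide 32 can be applied as a single scan over per-column annotations. The sketch below assumes the comparison symbols lost from the slide text were "at least" (length at least 50 bp, discordance at least 0.1), and it assumes a hypothetical input layout of three parallel lists: the discordance value, the worst-scoring species, and a gapped flag for each column. None of these names come from the original software.

```python
def suspicious_regions(discordance, worst, gapped,
                       min_len=50, min_disc=0.1, max_gap_frac=0.5):
    """Return (start, end, species) for each suspicious region (half-open ranges).

    A region is a maximal run of columns with discordance >= min_disc, all with
    the same worst species, at least min_len long, and with a gapped fraction
    strictly below max_gap_frac. Thresholds are assumptions (see lead-in).
    """
    regions = []
    start = None
    n = len(discordance)
    for i in range(n + 1):
        ok = i < n and discordance[i] >= min_disc
        # Close the open run at a low-discordance column, a species change,
        # or the end of the alignment.
        if start is not None and (not ok or worst[i] != worst[start]):
            length = i - start
            gap_frac = sum(gapped[start:i]) / length
            if length >= min_len and gap_frac < max_gap_frac:
                regions.append((start, i, worst[start]))
            start = None
        if ok and start is None:
            start = i
    return regions


def suspicious_percent(regions, aligned_bases):
    """Percentage of aligned bases that fall inside suspicious regions."""
    covered = sum(end - start for start, end, _ in regions)
    return 100.0 * covered / aligned_bases if aligned_bases else 0.0
```

For example, 60 consecutive high-discordance columns attributed to baboon with no gaps form one region, while a 30-column run is discarded as too short.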

  33. Alignment accuracy Coding bases UTR bases Intronic bases Intergenic bases

  34. Alignment accuracy Coding bases UTR bases Intronic bases Intergenic bases

  35. Alignment accuracy Coding bases UTR bases Intronic bases Intergenic bases

  36. Alignment accuracy Coding bases UTR bases Intronic bases Intergenic bases

  37. Can suspicious alignments be improved? • Baboon and MLAGAN (for example): all points (x, y), where • x = human-baboon alignment score of MLAGAN region suspicious for baboon • y = human-baboon alignment score of alternative alignment for same human region but not suspicious for baboon • y = x • y - x = μ, where μ = average y - x over all points • y - x = μ ± σ, where σ = standard deviation of y - x over all points
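The reference lines described on slide 37 come from two summary statistics of the paired scores. A minimal sketch, assuming the scores are available as two equal-length lists and that σ is the population standard deviation (the slide does not say which form was used); the function name and the extra "fraction improved" value are additions for illustration.

```python
import statistics


def improvement_summary(xs, ys):
    """Summarize y - x for paired alignment scores.

    xs: scores of the suspicious regions; ys: scores of the alternative
    alignments of the same human regions. Returns (mu, sigma, fraction with y > x).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    diffs = [y - x for x, y in zip(xs, ys)]
    mu = statistics.mean(diffs)
    sigma = statistics.pstdev(diffs)  # population SD: an assumption
    improved = sum(1 for d in diffs if d > 0) / len(diffs)
    return mu, sigma, improved
```

Points above the line y = x are regions where the alternative alignment scores higher; a positive μ means the alternative is better on average.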

  38. Can suspicious alignments be improved?

  39. Summary of comparisons (all categories) Low is better High is better primates other placental mammals distant species TBA MAVID MLAGAN Pecan

  40. Conclusions • Disturbing lack of agreement among alignments: alignment still a hard problem • Performance of the aligners varies significantly by species group and region type, particularly distant species and noncoding regions

  41. Outline • What genome data is available? • What is phylogenetic footprinting? • Phylogenetic footprinting by multiple sequence alignment • Which parts of multiple sequence alignments are trustworthy? • FootPrinter: phylogenetic footprinting without alignment

  42. DNA, Genes, and Proteins • DNA: program for cell processes • Proteins: execute cell processes [Figure labels: DNA (TCCAACGGTGCTGAGGTGCAC), Gene, Protein]

  43. Regulation of Genes • What turns genes on and off? • When is a gene turned on or off? • Where (in which cells) is a gene turned on? • How many copies of the gene product are produced?

  44. Regulation of Genes Transcription Factor RNA polymerase DNA Gene Regulatory Element

  45. Regulation of Genes Transcription Factor RNA polymerase DNA Gene Regulatory Element

  46. Goal • Identify regulatory elements in DNA sequences. These are: • Binding sites for proteins • Short subsequences (5-25 nucleotides) • Up to 1000 nucleotides (or farther) from gene • Inexactly repeating patterns (“motifs”)

  47. CLUSTALW multiple sequence alignment (rbcS gene)
      Cotton     ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
      Pea        GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
      Tobacco    TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
      Ice-plant  TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
      Turnip     ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
      Wheat      TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
      Duckweed   TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
      Larch      TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

      Cotton     CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
      Pea        C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
      Tobacco    AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
      Ice-plant  ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
      Turnip     CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
      Wheat      GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
      Duckweed   ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
      Larch      TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

      Cotton     ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
      Pea        GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
      Tobacco    GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
      Ice-plant  GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
      Turnip     CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
      Wheat      CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
      Duckweed   TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
      Larch      CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

      Cotton     T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
      Pea        TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
      Tobacco    CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
      Ice-plant  TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
      Larch      TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
      Turnip     TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
      Wheat      GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
      Duckweed   CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

  48. Finding Short Motifs
      AGTCGTACGTGAC...    (Human)
      AGTAGACGTGCCG...    (Chimp)
      ACGTGAGATACGT...    (Rabbit)
      GAACGGAGTACGT...    (Mouse)
      TCGTGACGGTGAT...    (Rat)
      Size of motif sought: k = 4
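As a starting point for the alignment-free search set up on slide 48, one can enumerate every substring of length k and count how many of the unaligned sequences contain it. The sketch below does exactly that with exact matching only. It is a simplification for illustration and should not be read as FootPrinter's method, which scores candidate motifs against the phylogenetic tree and tolerates mutations; the function names and the `min_seqs` parameter are assumptions.

```python
from collections import Counter

# Sequence prefixes shown on slide 48 (the slide elides the rest of each sequence).
SEQUENCES = {
    "Human":  "AGTCGTACGTGAC",
    "Chimp":  "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse":  "GAACGGAGTACGT",
    "Rat":    "TCGTGACGGTGAT",
}


def kmers_by_support(seqs, k):
    """Map each k-mer to the number of sequences containing it at least once."""
    support = Counter()
    for seq in seqs:
        # A set ensures each sequence contributes at most once per k-mer.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


def shared_kmers(seqs, k, min_seqs=None):
    """Sorted k-mers found in at least min_seqs sequences (default: all of them)."""
    seqs = list(seqs)
    need = len(seqs) if min_seqs is None else min_seqs
    return sorted(m for m, c in kmers_by_support(seqs, k).items() if c >= need)


if __name__ == "__main__":
    # Relaxing min_seqs is a crude stand-in for tolerating mutations.
    print(shared_kmers(SEQUENCES.values(), k=4, min_seqs=4))
```

Requiring an exact match in every species is usually too strict for real regulatory elements, which is the motivation for scoring motifs by the number of mutations needed along the tree.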
