1 / 39

Framework for Sequence Cluster Merging ( Also showing importance of domain knowledge )

Framework for Sequence Cluster Merging ( Also showing importance of domain knowledge ). Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana University, Bloomington http://biokdd.informatics.indiana.edu/~agopu Email: agopu@cs.indiana.edu. Introduction.

gefen
Télécharger la présentation

Framework for Sequence Cluster Merging ( Also showing importance of domain knowledge )

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana University, Bloomington http://biokdd.informatics.indiana.edu/~agopu Email: agopu@cs.indiana.edu

  2. Introduction • Sequence Clustering very important research topic. • Bottom-up approach – basically merge elements recursively upto certain specificity • Top-down approach – split elements until desired specificity is achieved • Two important issues: selectivity and sensitivity • Sequence clustering problem is unique • No “observable” attributes unlike most clustering problems • Example: • Supermarket: Soda, Fruit juice, Frozen foods, Clothing, etc. • Demographic: Height, Race, etc. • Sequence clustering: Just a bunch of amino acid characters! (with accompanying well studied sequence comparison/alignment programs).

  3. Introduction … • Getting back to sequence clustering… • Fragmentation problem – well known in sequence clustering algorithms. • Example: BAG (Sun Kim) • 99 % accuracy (selective) but at cost of ~40-50 % fragmentation (over-sensitive) • Solution? • Bottom-Up merging back of fragmented clusters

  4. Need for framework • Suggested bottom-up approach possible using various sub-methods • Framework: Do common and unique tasks seamlessly • Insert new sub-methods easily with very little hassle • Implemented primarily in Perl with supporting C programs and Unix Shell scripts

  5. Framework Schematic Merge Suggestions from Clustering Algorithm Test Scaffold Generate Combined Profile for Two Fragment Clusters Prepare Sequence Data Post-process New Clustering Result Test Merge’bility Enhanced Clustering Result

  6. Framework – Profile Generation Merge Suggestions from Clustering Algorithm Test Scaffold GENERATE COMBINED PROFILE FOR TWO FRAGMENT CLUSTERS Prepare Sequence Data Post-process New Clustering Result Test Merge’bility Enhanced Clustering Result

  7. Profile Generation – MSA • MSA = Multiple Sequence Alignment C1 MSA (C1) Combined Profile MSA (C1, C2) C2 MSA (C2)

  8. Profile Generation – MSA • Common first step: MSA profile generation for two fragment clusters C1 and C2 (Clustalw) • MSA (C1) and MSA (C2) • Most expensive step in framework • Common second step: Combined profile generation (Clustalw) • Prof_Align [MSA (C1), MSA (C2)]

  9. Profile Generation – MSA explained.. • All of the implemented techniques depend on MSA profiles • MSA profile: align more than 2 sequences simultaneously Image from http://bioinformatics.weizmann.ac.il/~pietro/Making_and_using_protein_MA/

  10. Profile Generation – MSA explained.. Image from http://www.mscs.mu.edu/~cstruble/class/mscs230/fall2002/notes/3

  11. Framework – Merge’bility Test Merge Suggestions from Clustering Algorithm Test Scaffold Generate Combined Profile for Two Fragment Clusters Prepare Sequence Data Post-process New Clustering Result TEST MERGE’BILITY Enhanced Clustering Result

  12. Model Comparison based Merge Test

  13. Model Comparison based Merge Test • Statistics/Machine learning technique based method: • Uses Relative Entropy and Statistical measures w.r.t. Runs test • Drawbacks • Almost impossible to nail down on threshold values for z-score or any other statistical measure • Extremely dependent sample size equality – does not work well when the two fragment sizes vary

  14. Model Comparison based Merge Test • Each column in a MSA profile is a probabilistic model (details of construction beyond the scope of this talk) • Compute similarity between corresponding columns in the two fragments – Kullback Liebler distance • Need to consider gaps while matching up columns – challenging task • Also need to screen for random “good” distances – taken care off using random model in distance computation

  15. Model Comparison based Merge Test

  16. Model Comparison based Merge Test • Using column wise comparison distance scores, compute “distance vector” • Symbolic representation for “good”, “bad” and “don’t care” distances (detail abstracted) • Do standard statistical test: Runs test to check out how random distance vector is… • Nice pattern: • y | y | y | n | n | y | y | y | n | n | y | y | y • Random pattern: • y | n | y | y | n | n | n | y | n | y | n | n | y | y

  17. Model Comparison based Merge Test 4) Do Runs test

  18. Model Comparison based Merge Test • Compute mean, standard deviation and subsequently z-score • Threshold to separate “good” and “bad” merges • Drawbacks again… • Threshold will be sample specific, hard to have one threshold for entire dataset (illustrated in test results) • Failure rate is high if sample size is unequal

  19. Phylogenetic Tree based Merge Test

  20. Merge’bility Test – Techniques … • Phylogenetic tree based method: • Evolutionary Distance based method • Drawback: Too strict; many false negatives possible; Also hard to nail a threshold • Evolutionary Least Common Ancestor (LCA) based method • Improved performance in both of the previously mentioned issues

  21. Phylogenetic TreeEvolutionary Distance based Merge Test

  22. Phylogenetic Tree Distance based method • Clustalw (or other tree generation tools) provide NJ tree of a MSA profile • Sequence length normalized distance from root for each sequence • 0 < distance < 1 • Define some threshold for distance that constitutes intra/inter cluster distances

  23. Phylogenetic Tree Distance based method • Distance between sequences from… • Two clusters will be closer to: • ‘1’ if two clusters are not merge’ble – call these “bad distances” • ‘0’ if two clusters are actually part of the same super cluster • The same cluster will be obviously closer to ‘0’ – these constitute “good distances”; don’t care in our case • Count number of “bad distances” • Gives a good idea of how good a merge is

  24. Phylogenetic Tree Distance based method • Good enough? Not yet – need for normalization of the “bad distance” count. • Why? • Number of edges between vertices of same/different clusters is proportional to size of clusters!

  25. Phylogenetic Tree Distance based method • Once normalization of number of “bad distances” is done, this method churned out decent results • Normalizing factor? Contentious.. What is a good normalizer? • Method too strict for unequally sized clusters. Most merges rejected leading to appreciable number of false negatives • Inherent nature of MSA programs and unequally sized profiles (cluster sizes)

  26. Phylogenetic TreeLCA Coverage based Merge Test

  27. Phy.Tree LCA coverage based method • Clustalw, Phylip (or other tree generation tools) provide a rooted phylogenetic tree for a MSA profile • Looking at the tree, one can easily make out if a pair of clusters should be merged or not • How? • Parse tree into a usual tree data structure and look for common ancestor of sequences of each cluster • Example…

  28. Phy.Tree LCA coverage based method • Good Merge • Sequences of the two clusters (shaded blue and red) are from the same super cluster

  29. Phy.Tree LCA coverage based method • Bad Merge • Sequences of the two clusters (shaded blue and red) are from different super clusters

  30. Phy.Tree LCA coverage based method • Same LCA for both clusters? Good merge! • If not … Bad merge? • Not quite. Possible that LCAs may be different but they cover sequences from either cluster upto a considerable extent • Better to use coverage of LCAs instead • Example…

  31. Phy.Tree LCA coverage based method • Why LCA Coverage? • Second cluster has three sequences, but its LCA covers four more sequences from the other cluster

  32. Phy.Tree LCA coverage based method • Coverage test: • For clusters Ci and Ck, choose smaller cluster say Ci i.e | Ci | < | Ck | • Define Cov (LCA[Ci]) as the number of sequences LCA Ci covers. • If Cov(LCA[Ci]) > # of sequences in Ci … where | Ci | < | Ck | • i.e. { Cov (LCA[Ci]) / | Ci | } > 1 • Or {Cross Coverage (LCA[Ci])} > 0

  33. Phy.Tree LCA coverage based method • Advantages: • Sample size difference does not play a big role • Demarcating between “good” and “bad” merges is much simpler and straight forward • Shown to work really well on a variety of data sizes, difficulty levels – test results… • Possible weakness: • Bound to fail for extremely small fragments (say 2 sequences each) – hard not to have a common LCA !

  34. Test Results – 4 datasets(from COG database)

  35. Test Results – Data set 1

  36. Test Results – Data set 2

  37. Test Results – Data set 3

  38. Test Results – Data set 4

  39. Acknowledgements! • A big thank you to: • Prof. Sun Kim, advisor • My parents, brother, grand parents! • All my colleagues and friends: JH, Zhiping, Scott Martin, SR, Raj, Anshul, Pat Hayes and everyone else! • Folks at CS & Informatics: CS Systems staff, Lucy, Linda, Wendy, Cheryl, Errissa, Bob! • Profs. Marty Siegel and Gary Wiggins – GPC. • RATS folks! • Did I forget someone?! Sorry if I did…

More Related