
Motif Extraction and Grammar Induction in Language and Biology


Presentation Transcript


  1. Motif Extraction and Grammar Induction in Language and Biology David Horn Tel Aviv University http://horn.tau.ac.il

  2. Analogy between languages • Human languages are realized in streams of speech or in lines of text • All computational problems can be realized by a Turing machine • Hereditary biology makes use of chains of nucleotides in DNA and RNA, and of chains of amino acids in proteins • All of these are one-dimensional representations of reality

  3. Reality, however, is not one-dimensional. Hence one needs a set of syntactic rules to make sense of the text, and a semantic mapping into actions in the real world. The distinction between syntactic and semantic levels (Chomsky 1957) is intuitively clear in human languages. In biology, syntax refers to the structure of the sequences, whereas semantics relates the different sequence elements to the complicated processes ranging from transcription, the birth of mRNA, to translation, the birth of the protein. Examples of syntax (with semantics in parentheses): segmentation of the chromosome into genes (originators of proteins), promoters (transcription regulation) and 3' UTRs (translation regulation); segmentation of genes into exons (coding) and introns (noncoding); finding motifs on promoters (TFBSs relevant to enabling or forbidding transcription); finding motifs on proteins (relevant for protein interaction and protein functionality); etc.

  4. Can we induce the rules (and, sometimes, the words) from the texts? ADIOS is an algorithm that induces syntax, or grammar, from the text. MEX is an algorithm that extracts motifs, or patterns, from the text. Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman. Unsupervised learning of natural languages. Proc. Natl. Acad. Sci. USA, 102 (2005) 11629-11634.

  5. MEX: motif extraction algorithm • Create a graph whose vertices are words (for text) or letters (for biological sequences) • Load all strings of text onto the graph as paths over the vertices • Given the loaded graph consider trial-paths that may coincide with original strings of text • Use context sensitive statistics to define left- and right-moving probabilities that help to define motifs
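
As a concrete illustration of the loading step, here is a minimal Python sketch; the names load_paths and tokenize are hypothetical and this is not the authors' implementation. Each string becomes a path of symbols, and the distinct symbols form the vertex set of the graph.

```python
# Minimal sketch (illustrative names, not the authors' code) of loading strings
# onto a MEX-style graph: vertices are the distinct symbols (letters for biological
# sequences, words for text) and every input string becomes a path over them.

def load_paths(strings, tokenize=list):
    """Return the vertex set and the list of paths (token sequences)."""
    vertices, paths = set(), []
    for s in strings:
        path = tokenize(s)            # letters by default; use str.split for words
        vertices.update(path)
        paths.append(path)
    return vertices, paths

vertices, paths = load_paths(["alicewastired", "alicehadnopictures"])
print(sorted(vertices))               # the distinct letters appearing in the corpus
print(paths[0][:5])                   # ['a', 'l', 'i', 'c', 'e']
```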

  6. Find patterns in strings of letters: Motif EXtraction (MEX). Given a set of strings, presented as a stream of letters:
a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f h a v i n g n o t h i n g t o d o o n c e o r t w i c e s h e h a d p e e p e d i n t o t h e b o o k h e r s i s t e r w a s r e a d i n g b u t i t h a d n o p i c t u r e s o r c o n v e r s a t i o n s i n i t a n d w h a t i s t h e u s e o f a b o o k t h o u g h t a l i c e w i t h o u t p i c t u r e s o r c o n v e r s a t i o n
MEX segments the stream into motifs:
alicewas beginning toget verytiredof sitting by hersister onthebank and of having nothing todo onceortwice shehad peep ed intothe book hersister was reading butit hadno pictures or conversation s init and what is theuseof abook thoughtalice without pictures or conversation

  7. Creating the graph (directed) • Σ = {a-z} • example string: "alice was" • [Diagram: letter vertices a-z plus begin and end nodes, with path edges labeled (1,1)…(1,6) and (2,1)…(2,4), omitted]

  8. Creating the graph (cont'd): a structured graph. [Diagram: vertices a-p arranged on a circle, with path edges labeled {path; position}, e.g. {1002;1}, {1002;2}, {1003;3}…{1003;14}, omitted]

  9. Creating the graph (cont'd): a random graph, shown for comparison. [Diagram: vertices a-p with randomly placed edges, omitted]

  10. Creating the graph (cont'd). [Diagram: two Alice strings loaded as paths between the begin and end vertices: (1) "alice was beginning to get very tired of sitting by her sister on the bank and of … conversation"; (2) "when alice had been all the way down one side and up the other trying every door she walked sadly …"]

  11. Searching for patterns. [Diagram: the structured graph of slide 8 with its vertices numbered 1-16 and path edges labeled {path; position}, omitted]

  12. Searching for patterns (cont'd). [Diagram: a search path traversing numbered vertices of the graph, omitted]

  13. Searching for patterns (cont'd). [Diagram omitted]

  14. L matrix (numbers of paths). l(e_i; e_j) = number of occurrences of the sub-path (e_i, e_j), where (e_i, e_j) denotes e_i → e_{i+1} → e_{i+2} → … → e_{j-1} → e_j. From these counts, calculate conditional probabilities.

  15. From L to P: calculating conditional probabilities. P(a) = 0.08, P(l|a) = 1046/8770, P(i|al) = 486/1046, P(c|ali) = 397/486, P(e|alic) = 397/397, P(w|alice) = 48/397.
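
A short sketch of how such conditional probabilities can be read off the loaded paths; the toy corpus and the counts below are illustrative, not the Alice numbers on the slide.

```python
# Sketch: from sub-path occurrence counts l(.) to conditional probabilities
# (illustrative code, not the original implementation).

def occurrences(paths, sub):
    """Number of occurrences of the sub-path `sub` over all loaded paths."""
    sub, n = list(sub), len(sub)
    return sum(1 for p in paths
                 for k in range(len(p) - n + 1)
                 if p[k:k + n] == sub)

def cond_prob(paths, prefix, nxt):
    """P(next symbol | prefix) = l(prefix + next) / l(prefix)."""
    denom = occurrences(paths, prefix)
    return occurrences(paths, list(prefix) + [nxt]) / denom if denom else 0.0

paths = [list("alicewasreading"), list("alicewastired"), list("alligator")]
print(cond_prob(paths, "a", "l"))    # 3/7: of the seven a's, three are followed by l
print(cond_prob(paths, "ali", "c"))  # 2/2 = 1.0: inside the candidate motif the probability stays high
```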

  16. The probability Matrix

  17. The probability matrix. P_R = right-moving probability (defines the end of a pattern); P_L = left-moving probability (defines the beginning of a pattern).

  18. Detecting a significant pattern

  19. Significance test • (e_1, e_n): a potential pattern edge • m = # of paths from e_1 to e_n; r = # of paths from e_1 to e_{n+1} • Let P*_{n+1} be the "true" P_{n+1}, given (e_1, e_n) • H0: P*_{n+1} ≥ P_n·η (the probability does not diverge, i.e. (e_1, e_n) is not a pattern edge); H1: P*_{n+1} < P_n·η ((e_1, e_n) is a pattern edge) • Compute the odds of obtaining results at least as "extreme" as r out of m; if the outcome is smaller than a predetermined α, the pattern is significant.
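
The test on this slide can be read as a one-sided binomial test. The following sketch, using scipy, assumes that interpretation; the values of η and α are illustrative defaults, and the function name is hypothetical.

```python
# Hedged sketch of a binomial significance test for a candidate pattern edge.
# p_n is the observed right-moving probability up to e_n, eta the allowed decline
# ratio, m the number of paths reaching e_n, r the number continuing to e_{n+1}.
# Under H0 the "true" P_{n+1} is at least eta * p_n; a small lower-tail probability
# of seeing only r continuations rejects H0, i.e. the drop at e_n is significant.
from scipy.stats import binom

def significant_drop(m, r, p_n, eta=0.65, alpha=0.01):
    p_value = binom.cdf(r, m, eta * p_n)   # P(X <= r) for X ~ Binomial(m, eta*p_n)
    return p_value < alpha

# 1000 paths reach e_n with p_n = 0.9, but only 300 continue to e_{n+1}
print(significant_drop(m=1000, r=300, p_n=0.9))   # True: the decline is significant
```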

  20. Rewiring the graph • Once the algorithm reaches its stopping criterion (e.g. it ceases to locate new patterns), then, for each significant pattern, the sub-paths it subsumes are merged into a new vertex • The graph is rewired in descending order of length and significance
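
A minimal sketch of this rewiring step (illustrative code, not the original implementation): every occurrence of a significant motif in the loaded paths is replaced by a single new vertex, with motifs processed in descending length/weight order.

```python
# Sketch of the rewiring step: each significant pattern becomes a new vertex, and
# every sub-path it subsumes is replaced by that vertex. Patterns are processed in
# descending (length, weight) order, where a larger weight means a more significant drop.

def rewire(paths, patterns):
    """patterns: list of (motif_tokens, weight); returns the rewired paths."""
    for motif, _weight in sorted(patterns, key=lambda p: (len(p[0]), p[1]), reverse=True):
        motif, n = list(motif), len(motif)
        new_vertex = "".join(motif)          # the merged vertex, e.g. 'alice'
        for i, path in enumerate(paths):
            out, k = [], 0
            while k < len(path):
                if path[k:k + n] == motif:   # subsume the whole sub-path
                    out.append(new_vertex)
                    k += n
                else:
                    out.append(path[k])
                    k += 1
            paths[i] = out
    return paths

print(rewire([list("alicewastired")], [(list("alice"), 12.0)]))
# -> [['alice', 'w', 'a', 's', 't', 'i', 'r', 'e', 'd']]
```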

  21. ALICE motifs. Motifs are selected in order of (i) length and (ii) weight (significance of the drop). Shown here are the results of one run over a trial path and the beginning of the list of motifs extracted from it.

  22. Application to Biology • Vertices of the graph: 4 letters (nucleotides) or 20 letters (amino acids) • Paths: gene sequences, promoter sequences, protein sequences • Conditional probabilities on the graph are proportional to the numbers of paths • Trial path: testing transition probabilities along it to extract motifs

  23. Extracting Motifs from Enzymes • Each enzyme sequence corresponds to a single path • Applying MEX to oxidoreductases • 6602 enzyme sequences • MEX motifs are specific subsequences >P54233 | 1.7.1.1 LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPKIKWDEWTVEVTGLVKRSTHFTMEKLMREFPHREFPATLVCAGNRRKEHNMVKQSIGFNWGAAGGSTSVWRGVPLRHVLKRCGILARMKGAMYVSFEGAEDLPGGGGSKYGTSVKREMAMDPSRDIILAFMQNGEPLAPDHGFPVRMIIPGFIGGRMVKWLKRIVVTEHECDSHYHYKDNRVLPSHVDAELANDEGWWYKPEYIINELNINSVITTPCHEEILPINSWTTQMPYFIRGYAYSGGGRKVTRVEVTLDGGGTWQVCTLDCPEKPNKYGKYWCWCFWSVEVEVLDLLGAREIAVRAWDEALNTQPEKLIWNVMGMMNNCWFRVKTNVCRPHKGEIGIVFEHPTQPGNQSGGWMAKEKHLEKSSES V. Kunik, Z. Solan, S. Edelman, E.Ruppin and D. Horn. CSB2005

  24. Enzyme Motifs • 3165 motifs were obtained • Distribution of MEX motifs [histogram: number of motifs vs. length of motif, omitted]

  25. Enzymes Representation • Each enzyme is represented as a ‘bag of motifs’ >P54233 | 1.7.1.1 LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPKIKWDEWTVEVTGLVKRSTHFTMEKLMREFPHREFPATLVCAGNRRKEHNMVKQSIGFNWGAAGGSTSVWRGVPLRHVLKRCGILARMKGAMYVSFEGAEDLPGGGGSKYGTSVKREMAMDPSRDIILAFMQNGEPLAPDHGFPVRMIIPGFIGGRMVKWLKRIVVTEHECDSHYHYKDNRVLPSHVDAELANDEGWWYKPEYIINELNINSVITTPCHEEILPINSWTTQMPYFIRGYAYSGGGRKVTRVEVTLDGGGTWQVCTLDCPEKPNKYGKYWCWCFWSVEVEVLDLLGAREIAVRAWDEALNTQPEKLIWNVMGMMNNCWFRVKTNVCRPHKGEIGIVFEHPTQPGNQSGGWMAKEKHLEKSSES >P54233 | 1.7.1.1 RDEGTAD,TGKHPFN,LMHHGFITP,YVRNHGPVP,WTVEVTG,PDHGFPYHYKDN,KVTRVE,YGKYWCW,MGMMNNCWF • These 1222 MEX motifs cover 3739 enzymes
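
A 'bag of motifs' can be computed by simple substring matching, as in this illustrative sketch (the helper name bag_of_motifs is hypothetical); the motifs listed are taken from the slide.

```python
# Sketch of the 'bag of motifs' representation: an enzyme sequence is mapped to the
# set of extracted motifs that occur in it as exact substrings (illustrative code).

def bag_of_motifs(sequence, motifs):
    return {m for m in motifs if m in sequence}

motifs = ["RDEGTAD", "TGKHPFN", "YGKYWCW", "NOTPRESENT"]
seq = ("LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLM"
       "PEKPNKYGKYWCWCFWSVEVEV")                 # truncated enzyme sequence, for the sketch
print(bag_of_motifs(seq, motifs))
# -> {'RDEGTAD', 'TGKHPFN', 'YGKYWCW'}  (set order may vary)
```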

  26. Enzyme Function • The functionality of an enzyme is determined according to its EC number • Classification hierarchy [Webb, 1992] • EC number: n1.n2.n3.n4 (a unique identifier) • n1: class • n1.n2: sub-class / 2nd level • n1.n2.n3: sub-subclass / 3rd level • n1.n2.n3.n4: precise enzymatic activity
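
For the classification experiments described later, using this hierarchy simply means truncating the EC number to the desired level; a tiny illustrative helper:

```python
# Illustrative helper: truncate an EC number to the level used as a classification
# target (2nd level = sub-class, 3rd level = sub-subclass).

def ec_label(ec_number, level):
    return ".".join(ec_number.split(".")[:level])

print(ec_label("1.7.1.1", 2))   # -> '1.7'   (sub-class task)
print(ec_label("1.7.1.1", 3))   # -> '1.7.1' (sub-subclass task)
```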

  27. An example: EC 1.12.1.n4 • class 1: oxidoreductases • sub-class 12: hydrogen as electron donor • sub-subclass 1: NAD+ / NADP+ as electron acceptor • e.g. EC 1.12.1.2, hydrogen:NAD+ oxidoreductase, catalyzing H2 + NAD+ = H+ + NADH

  28. Current knowledge regarding enzyme classification • High sequence similarity is required to guarantee functional similarity of proteins • A recent analysis of enzymes by Tian and Skolnick (2003) suggests that 40% pairwise sequence identity can be used as a threshold for safe transferability of the first three digits of the Enzyme Commission (EC) number • The EC number, which is of the form n1.n2.n3.n4, specifies the location of the enzyme on a tree of functionalities

  29. Current knowledge regarding enzyme classification • Using pairwise sequence similarity, and combining it with the Support Vector Machine (SVM) classification approach, Liao and Noble (2003) have argued that they obtain significantly improved remote-homology detection relative to existing state-of-the-art algorithms • Cai et al. (2003, 2004) have applied SVMs to a protein description based on physico-chemical features of the amino acids, such as hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility • Ben-Hur and Brutlag (2004) used the eMotif approach and analyzed the oxidoreductases as 'bags of motifs'. With appropriate feature-selection methods they obtain success rates of over 90% for a variety of classifiers

  30. The MEX method • SVM classifier input (one row per enzyme, listing its motif features):
O17433 1148 262 463 610 7987 1627 260
P19992 124 7290 27 111 3706 18128 3432
Q01284 6652 198 1489 710 425 64 55
Q12723 693 145 7290 3712 65 543 522
P14060 455 2664 848 55 128 256 74
Q60555 7290 3712 65 543 522 6748 7159
• Classification tasks: 16 2nd-level subclasses • 32 3rd-level sub-subclasses

  31. Methods • A linear SVM is applied to evaluate the predictive power of MEX motifs • Enzyme sequences are randomly partitioned into a training set and a test set (75%/25%) • 16 2nd-level classification tasks • 32 3rd-level classification tasks • Performance measure: Jaccard score, J = tp / (tp + fp + fn) • The train/test procedure was repeated 40 times to gather sufficient statistics
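
A hedged sketch of this evaluation protocol using scikit-learn; the data here is randomly generated and the feature construction is assumed (binary motif-occurrence vectors), so only the overall workflow mirrors the slide.

```python
# Hedged sketch of the evaluation protocol: enzymes become binary motif-occurrence
# vectors, a linear SVM is trained on a random 75%/25% split, and performance is
# measured with the Jaccard score J = tp / (tp + fp + fn). The data below is fake.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def jaccard(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fp + fn) if (tp + fp + fn) else 0.0

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 1222))        # 1222 motif features, toy enzymes
y = rng.integers(0, 2, size=400)                # 1 = member of the EC sub-class

scores = []
for _ in range(40):                             # the slide repeats the split 40 times
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25)
    clf = LinearSVC().fit(X_tr, y_tr)
    scores.append(jaccard(y_te, clf.predict(X_te)))
print(np.mean(scores), np.std(scores))
```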

  32. Results • Average Jaccard scores: 2nd level: 0.88 ± 0.06; 3rd level: 0.84 ± 0.09 • [Bar chart of 2nd-level results: Jaccard score and number of sequences per EC subclass, omitted]

  33. 2nd-level classification. [Bar chart: Jaccard score per EC subclass, α < 0.01, omitted]

  34. 3rd-level classification. [Bar chart: Jaccard score per EC sub-subclass, α < 0.01, omitted]

  35. Summary so far… • MEX is an unsupervised motif extraction method that finds biologically meaningful motifs • The meaning of the motifs is established by the classification tasks • It classifies enzymes better than Smith-Waterman and outperforms SVMProt • There is a correspondence between MEX motifs and biologically significant PROSITE patterns • Specific motifs of length 6 and longer specify the functionality of enzymes

  36. ADIOS (Automatic DIstillation Of Structure) • Representation of a corpus (of sentences) as paths over a graph whose vertices are lexical elements (words) • The Motif EXtraction (MEX) procedure establishes new vertices, thus progressively redefining the graph in an unsupervised fashion • Recursive generalization

  37. ADIOS: recursive generalization • For each path: • slide a context window of size L • for each location i (0 < i ≤ L): • look for all paths with an identical prefix at positions i_p (0 < i_p < i) and an identical suffix at positions i_s (i < i_s < L) • if a 'generalized' pattern is found (and is significant), add E, the equivalence class, to the graph, and rewire • Continue with MEX and ADIOS until no significant pattern is found

  38. Loading sentences onto a graph whose vertices are words • Is that a cat? • Is that a dog? • And is that a horse? • Where is the dog? The result is the pattern P = 'is that a E ?' with an equivalence class E = {cat, horse, dog}, as in the toy sketch below.
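
A toy sketch of how such an equivalence class can be found, aligning context windows that share a prefix and a suffix (illustrative code, not the ADIOS implementation):

```python
# Align windows of size L over the slide's four sentences and collect the differing
# slot fillers that share the same context into an equivalence class E.
from collections import defaultdict

sentences = ["is that a cat ?", "is that a dog ?",
             "and is that a horse ?", "where is the dog ?"]
paths = [s.split() for s in sentences]

L = 5                                    # context window size (illustrative choice)
slots = defaultdict(set)
for path in paths:
    for start in range(len(path) - L + 1):
        window = path[start:start + L]
        for i in range(1, L - 1):        # slot strictly inside the window
            key = (tuple(window[:i]), tuple(window[i + 1:]))
            slots[key].add(window[i])

for (prefix, suffix), fillers in slots.items():
    if len(fillers) > 1:
        print(" ".join(prefix), "E", " ".join(suffix), "with E =", sorted(fillers))
# -> is that a E ? with E = ['cat', 'dog', 'horse']
```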

  39. a lexicon...

  40.–43. a lexicon... and a path induced by a string

  44. a lexicon... and a graph induced by a corpus of strings

  45. Generalization

  46. The Model: the training process. [Diagram: paths over numbered vertices, with sub-paths being merged into new pattern vertices, omitted]

  47. The Model: the training process (cont'd). [Diagram omitted]

  48. First pattern formation • Higher hierarchies: patterns (P) constructed of other Ps, equivalence classes (E) and terminals (T) • Trees are to be read from top to bottom and from left to right • Final stage: a root pattern • CFG: context-free grammar
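
Since the induced structure can be read as a context-free grammar, the small illustrative sketch below shows how a root pattern with an equivalence class generates sentences; the grammar is hypothetical, mirroring the example of slide 38.

```python
# Illustrative sketch: induced patterns (P), equivalence classes (E) and terminals
# viewed as context-free rules; reading a pattern tree top-down and left-to-right
# amounts to expanding the root symbol.
import random

grammar = {
    "P1": [["is", "that", "a", "E1", "?"]],     # pattern: one sequence of children
    "E1": [["cat"], ["dog"], ["horse"]],        # equivalence class: alternatives
}

def generate(symbol):
    if symbol not in grammar:                   # terminal word
        return [symbol]
    expansion = random.choice(grammar[symbol])  # pick one alternative
    return [tok for child in expansion for tok in generate(child)]

print(" ".join(generate("P1")))                 # e.g. 'is that a dog ?'
```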

  49. student learns from teacher • Teacher generates a corpus of sentences • Student generates syntax out of significant patterns and equivalence classes • Unseen teacher-generated patterns are checked by student (recall) • Student-generated patterns are checked by teacher (significance)
