
Motif Discovery in Protein Sequences using Messy De Bruijn Graph



Presentation Transcript


  1. Motif Discovery in Protein Sequences using Messy De Bruijn Graph • Mehmet Dalkilic and Rupali Patwardhan

  2. Goal The goal of this project is to develop an algorithm that can take advantage of the properties of De Bruijn graphs for discovering motifs in protein sequences.

  3. Outline of Presentation • Motivation and Background • Approach • Implementation • Applications • Future Work

  4. Motivation • Most of the popular motif discovery algorithms being used right now depend on statistical significance to find the motif. • This project explores computational and graph theoretic ways of doing the same thing without using statistical significance. • Such an approach could drastically reduce the time required to search for motifs.

  5. What is a De Bruijn Graph? • A De Bruijn graph is a graph whose nodes are sequences of symbols from some alphabet and whose edges indicate which sequences overlap. • The parameters are node length (n) and overlap (k). • So if n=4 and k=3, an edge ACAT → CATS represents the sequence 'ACATS'.

  6. Example • If we have a sequence ABCDEFG, • and we take node length = 4 and overlap = 3, • we can represent this same sequence by the following De Bruijn graph.

  7. ABCD → BCDE → CDEF → DEFG (sequence ABCDEFG; node length = 4, overlap = 3)
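
The construction in slides 5 to 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `de_bruijn_edges` and `edge_sequence` are hypothetical and are not taken from the authors' implementation.

```python
def de_bruijn_edges(sequence, node_length=4, overlap=3):
    """Return the (source, target) node pairs for one sequence."""
    # Each successive node is shifted by node_length - overlap residues.
    step = node_length - overlap
    nodes = [sequence[i:i + node_length]
             for i in range(0, len(sequence) - node_length + 1, step)]
    # Consecutive nodes overlap by `overlap` residues, so they share an edge.
    return list(zip(nodes, nodes[1:]))


def edge_sequence(source, target, overlap=3):
    """Spell out the sequence that an edge represents."""
    return source + target[overlap:]


edges = de_bruijn_edges("ABCDEFG")
for source, target in edges:
    print(source, "->", target, "represents", edge_sequence(source, target))
```

With node length 4 and overlap 3, each edge spells out a stretch of five residues, which is why the edge ACAT → CATS stands for ACATS.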

  8. Applying this to Identify Repeating Sub-sequences • If we have a set of sequences, we can keep adding the corresponding nodes and edges to our De Bruijn graph. • If any sub-sequence is repeated, the corresponding edge will already be present in the graph. • So we just increment the weight of that edge. • Eventually the edges corresponding to highly repeated sequences will have higher weights. • Now we can find the motif by simply following the graph along the edges with weights above a specified threshold.

  9. Example • Sequence 1: PAKARCDEKD • Sequence 2: ARCDEKHKH • Constructing the De Bruijn Graph for these sequences …

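  10. Graph for PAKARCDEKD and ARCDEKHKH, with edge weights in parentheses: PAKA → AKAR (1) • AKAR → KARC (1) • KARC → ARCD (1) • ARCD → RCDE (2) • RCDE → CDEK (2) • CDEK → DEKD (1) • CDEK → DEKH (1) • DEKH → EKHK (1) • EKHK → KHKH (1)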
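
The weighting and thresholding steps of slides 8 to 10 can be sketched as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, the threshold test is taken to be "weight >= threshold", and when a node has several heavy outgoing edges only the heaviest is followed. The slides do not specify any of these details.

```python
from collections import Counter


def edges_of(sequence, node_length=4, overlap=3):
    """Yield the (source, target) node pairs for one sequence."""
    step = node_length - overlap
    nodes = [sequence[i:i + node_length]
             for i in range(0, len(sequence) - node_length + 1, step)]
    return zip(nodes, nodes[1:])


def build_weights(sequences, node_length=4, overlap=3):
    """Count how often each edge occurs across all sequences."""
    weights = Counter()
    for seq in sequences:
        weights.update(edges_of(seq, node_length, overlap))
    return weights


def follow_heavy_edges(weights, threshold, overlap=3):
    """Chain edges with weight >= threshold into candidate motifs."""
    best = {}  # source node -> (target node, weight) of its heaviest edge
    for (src, dst), w in weights.items():
        if w >= threshold and w > best.get(src, (None, 0))[1]:
            best[src] = (dst, w)
    successor = {src: dst for src, (dst, _) in best.items()}
    # A chain starts at a heavy node that no other heavy edge points to.
    starts = [src for src in successor if src not in successor.values()]
    motifs = []
    for node in starts:
        motif, seen = node, {node}
        while node in successor and successor[node] not in seen:
            node = successor[node]
            seen.add(node)
            motif += node[overlap:]
        motifs.append(motif)
    return motifs


weights = build_weights(["PAKARCDEKD", "ARCDEKHKH"])
motifs = follow_heavy_edges(weights, threshold=2)
print(motifs)
```

For the two example sequences, only the edges ARCD → RCDE and RCDE → CDEK reach weight 2, so following them spells out the shared stretch ARCDEK.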

  11. Making them Messy • In the context of protein sequences, some amino acid residues can be substituted without affecting the function of the protein. • So a sequence could be considered 'similar' to an edge even though it is not exactly the same. • Similarity is determined in the context of a standard scoring matrix, such as BLOSUM62. • In that case, we increment the weights of all edges that represent sequences that are 'similar' to the one in question.

  12. Example • Consider the same 2 sequences as before, but with K replaced by R in one of them. • PAKARCDERD • ARCDEKHKH • As per BLOSUM62, K and R have a positive substitution score.

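  13. [Figure: messy De Bruijn graph for PAKARCDERD and ARCDEKHKH. Nodes: PAKA, AKAR, KARC, ARCD, RCDE, CDER, DERD, CDEK, DEKH, EKHK, KHKH. Edge weights shown: 2 (one edge), 1.75 (one edge), 1 (eight edges).]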
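
One way the "messy" update of slides 11 to 13 might work is sketched below. Several parts are assumptions made only for illustration: the slides do not state the similarity rule or how the fractional weight is computed, so the rule used here (every position identical or positively scoring) and the partial increment of 0.75 are hypothetical, chosen to mirror the 1.75 shown on the slide. The two-entry score table is a hand-copied excerpt of BLOSUM62 (K/R = 2, D/E = 2); a real run would load the full matrix.

```python
# Excerpt of positively scoring BLOSUM62 pairs; unlisted pairs are
# treated as dissimilar in this sketch.
POSITIVE_PAIRS = {frozenset("KR"), frozenset("DE")}


def edges_of(sequence, node_length=4, overlap=3):
    """Yield the (source, target) node pairs for one sequence."""
    step = node_length - overlap
    nodes = [sequence[i:i + node_length]
             for i in range(0, len(sequence) - node_length + 1, step)]
    return zip(nodes, nodes[1:])


def is_similar(a, b):
    """True if a and b differ, but only by positively scoring substitutions."""
    if a == b or len(a) != len(b):
        return False
    return all(x == y or frozenset((x, y)) in POSITIVE_PAIRS
               for x, y in zip(a, b))


def build_messy_weights(sequences, partial=0.75, node_length=4, overlap=3):
    """Weight edges by exact repeats plus partial credit for similar edges."""
    weights = {}  # keyed by the residue string that each edge spells out
    for seq in sequences:
        for src, dst in edges_of(seq, node_length, overlap):
            text = src + dst[overlap:]
            # Assumed rule: existing similar edges gain `partial`.
            for other in weights:
                if is_similar(text, other):
                    weights[other] += partial
            # The edge itself gains a full count, as in the exact case.
            weights[text] = weights.get(text, 0) + 1
    return weights


messy = build_messy_weights(["PAKARCDERD", "ARCDEKHKH"])
for text, weight in messy.items():
    print(text, weight)
```

Under these assumptions the run yields ten edges: the exactly shared ARCDE at weight 2, RCDER at 1.75 because RCDEK from the second sequence differs from it only by the K/R substitution, and the rest at 1.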

  14. Another Example
      > Sequence 1
      DMLKLCDKADDKMNDRLDDYLKLDD
      > Sequence 2
      EAKDKFDFKDFKLCDKADDARTYVH
      > Sequence 3
      GTYYYCPGHKLCDEADDFFHVDDTE
      > Sequence 4
      LKLCDKANDYRPYYPITDPLMMNHI
      > Sequence 5
      GTYKPGHKLCDEADDFFHENDTEKYC
      > Sequence 6
      KLCDKADDYRPYYPITDPLGATAKHI


  16. Sample output … http://biokdd.informatics.indiana.edu/rpatward/L519/project/ex1.html http://biokdd.informatics.indiana.edu/rpatward/L519/project/ttt.gif

  17. Results • When 41 sequences belonging to the PS00021 family were given as input, the best motif output was YCRNPD. • The PROSITE regular expression for this family is [FY]-C-R-N-P-[DNR]. • http://biokdd.informatics.indiana.edu/rpatward/L519/project/PS00021_op.html
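
The reported motif can be checked against the PROSITE pattern by translating the pattern into an ordinary regular expression. The helper below is a hypothetical sketch that handles only the subset of PROSITE syntax needed here (single residues, bracketed sets, and `x` for any residue); it does not cover repetition counts, exclusion sets, or anchors.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern to a Python regular expression."""
    # Elements are separated by '-'; 'x' stands for any residue.
    return "".join("." if element == "x" else element
                   for element in pattern.split("-"))


regex = prosite_to_regex("[FY]-C-R-N-P-[DNR]")
print(regex)
print(bool(re.fullmatch(regex, "YCRNPD")))
```

The motif YCRNPD is one of the six strings the pattern accepts, so the output is consistent with the family's PROSITE entry while being more specific than it.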

  18. Possible Applications • To predict whether a given protein sequence is likely to belong to a particular protein family. • To construct regular expressions for protein families. • To fine-tune the results of clustering algorithms, by helping to decide whether to merge two clusters. • To preprocess input in order to improve the performance of other motif discovery algorithms.

  19. Limitation of this Approach • The motif should have at least 3 contiguous amino acid residues. • So the program runs into trouble if the motif consists of alternating residues, for example something like AxAxCxDxAxGxC (x could be any residue). • The problem is due to the need for overlaps, which is inherent in the nature of De Bruijn graphs.
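
The limitation can be seen directly by building the exact (non-messy) graph for two sequences that both fit AxAxCxDxAxGxC but differ at every x position. The two sequences below are invented for this illustration and are not from the presentation; node length 4 and overlap 3 are carried over from the earlier examples.

```python
def edges_of(sequence, node_length=4, overlap=3):
    """Return the set of (source, target) node pairs for one sequence."""
    step = node_length - overlap
    nodes = [sequence[i:i + node_length]
             for i in range(0, len(sequence) - node_length + 1, step)]
    return set(zip(nodes, nodes[1:]))


# Both follow A.A.C.D.A.G.C, with different residues at every gap.
first = "AQAWCEDRAKGLC"
second = "ATAYCHDMAPGVC"

shared = edges_of(first) & edges_of(second)
print(len(shared))
```

Every window of four residues contains two gap positions, so no node, and therefore no edge, is common to the two sequences. No edge weight ever rises above 1, and following heavy edges cannot recover the motif.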

  20. Future Work • We would like to integrate a machine-learning aspect to dynamically change the node length and other parameters to find the optimal motif. • We also want to try to extend this approach to do clustering itself.

  21. Link to the Implementation http://biokdd.informatics.indiana.edu/rpatward/L519/project.html

  22. Acknowledgement • I would like to thank Dr. Mehmet Dalkilic for his ideas and support.
