Motif Discovery in Protein Sequences using Messy De Bruijn Graph

Motif Discovery in Protein Sequences using Messy De Bruijn Graph Mehmet Dalkilic and Rupali Patwardhan

Goal The goal of this project is to develop an algorithm that can take advantage of the properties of De Bruijn graphs for discovering motifs in protein sequences.

Outline of Presentation • Motivation and Background • Approach • Implementation • Applications • Future Work

Motivation • Most of the popular motif discovery algorithms being used right now depend on statistical significance to find the motif. • This project explores computational and graph theoretic ways of doing the same thing without using statistical significance. • Such an approach could drastically reduce the time required to search for motifs.

What is a De Bruijn Graph? • De Bruijn Graph is a graph whose nodes are sequences of symbols from some alphabet and whose edges indicate the sequences which might overlap. • The parameters are nodelength(n) and overlap(k). • So if n=4 and k=3, an edge ACAT  CATS represents the sequence 'ACATS'

Example • If we have a sequence ABCDEFG, • and we take nodelength=4 and overlap=3, • we will can represent this same sequence by the following De Bruijn Graph

CDEF BCDE ABCD DEFG ABCDEFG Node Length = 4 Overlap = 3

Applying this to Identify Repeating Sub-sequences • If we have a bunch of sequences, we can go on adding corresponding nodes and edges to our De Bruijn graph. • If any sub-sequence is repeated, the corresponding edge will already be present in that graph. • So we just increment the weight of that edge. • Eventually the edges corresponding to highly repeated sequences will have higher weights. • Now we can find the motif by simply following the graph along these edges with weights above a specified threshold .

Example • Sequence 1: PAKARCDEKD • Sequence 2: ARCDEKHKH • Constructing the De Bruijn Graph for these sequences …

PAKA EKHK KHKH DEKH CDEK DEKD KARC AKAR ARCD RCDE 1 1 1 1 1 • PAKARCDEKD • ARCDEKHKH 2 1 2 1

Making them Messy • In the context of protein sequences, some amino acid residues can be substituted without affecting the function of the protein. • So a sequence could be considered 'similar' to an edge though its not exactly same. • Similarity is determined in the context of a standard scoring matrix, such as BLOSUM62. • In that case, we increment weights of all edges that represent sequences that are ‘similar’ to the one in question.

Example • Consider the same 2 sequences as before, but with K replaced by R in one of them. • PAKARCDERD • ARCDEKHKH • As per BLOSUM62, K and R have a positive substitution score.

PAKA ARCD AKAR KARC RCDE CDER CDEK DEKH DERD KHKH EKHK 1 1 1 1 1 1 • PAKARCDERD • ARCDEKHKH 2 1 1.75 1

Another Example > Sequence 1 DMLKLCDKADDKMNDRLDDYLKLDD > Sequence 2 EAKDKFDFKDFKLCDKADDARTYVH > Sequence 3 GTYYYCPGHKLCDEADDFFHVDDTE > Sequence 4 LKLCDKANDYRPYYPITDPLMMNHI > Sequence 5 GTYKPGHKLCDEADDFFHENDTEKYC > Sequence 6 KLCDKADDYRPYYPITDPLGATAKHI

Sample output … http://biokdd.informatics.indiana.edu/rpatward/L519/project/ex1.html http://biokdd.informatics.indiana.edu/rpatward/L519/project/ttt.gif

Results • When 41 sequences belonging to PS00021 family were given as input • The best motif output was YCRNPD • The Prosite Reg Ex for this family is [FY]-C-R-N-P-[DNR]. • http://biokdd.informatics.indiana.edu/rpatward/L519/project/PS00021_op.html

Possible Applications • To predict if a given protein sequence is likely to belong to a particular protein family or not. • To construct regular expressions for protein families. • To fine-tune the results of clustering algorithms, by helping to decide whether to merge two clusters or not. • Do preprocessing to improve the performance of other motif discovery algorithms.

Limitation of this Approach • The motif should have at least 3 continuous amino acid residues. • So the program runs into trouble if the motif consists of alternate residues. For example, something like AxAxCxDxAxGxC (x could be any residue). • The problem is due to the need for overlaps, which is inherent nature of De Bruijn Graphs.

Future Work • We would like to integrate a machine-learning aspect to dynamically change the node length and other parameters to find the optimal motif. • We also want to try to extend this approach to do clustering itself.

Link to the Implementation http://biokdd.informatics.indiana.edu/rpatward/L519/project.html

Acknowledgement • I would like to thank Dr. Mehmet Dalkilic for his ideas and support.

Motif Discovery in Protein Sequences using Messy De Bruijn Graph

Motif Discovery in Protein Sequences using Messy De Bruijn Graph

Presentation Transcript

Motif Discovery in Protein Sequences using Messy de Bruijn Graph

ANALYZING PROTEIN NETWORK ROBUSTNESS USING GRAPH SPECTRUM

Motif Discovery: Algorithm and Application

DNA Motif and protein domain discovery

Convergent Dense Graph Sequences

Predictive Methods Using Protein Sequences Unit 23

Motif discovery

Motif search and motif discovery

Motif search and discovery

The AMADEUS Motif Discovery Platform

Protein Sequences

Motif Discovery in Heterogeneous Sequence Data

Exploring Protein Sequences

An Approx. Algo. For Alignment MSA using Motif Discovery

Biological Motif Discovery

Whole-genome motif discovery

Counting Graph Colourings by using Sequences of Subgraphs

Motif discovery in co-regulated genes

Motif discovery

Programmed Graph Rewriting: MoTif

Comparing Protein Sequences

De Bruijn Sequences