Motif Discovery in Protein Sequences using Messy de Bruijn Graph

Rupali Patwardhan, Capstone Presentation
Advisors: Dr. Mehmet Dalkilic, Dr. Haixu Tang

Outline of Presentation
• Goal
• Background and Motivation
• Approach
• Results
• Future Work

Goal
To develop an algorithm that takes advantage of the properties of the de Bruijn graph to discover motifs in protein sequences.

What is a motif?
• A repeating pattern
• VSKLIPKNRLMISTEWRSLGQQSPGWMHYMP
• VMLPKDIAKLVPKTHLMSTEWRNRLGVQQSQG
• SGVPRLLTASREWRNLGEPFIDQIHYSPRYAD
• YRHVMLPKAMSTEWRSLGLKNPETGTLRILQE
• GLGITQSLGWSREWRHTLGEPHILLFKREKDYQ

Why are motifs interesting?
• They represent regions that have been conserved through evolution.
• Those regions are therefore likely to be important for the function of the protein (e.g. an active site).
• Motifs can be used to classify proteins into families based on their functions, or to predict the function of a new protein.

PS00059: Zinc-containing alcohol dehydrogenases signature
G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC]
H is a zinc ligand.
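A PROSITE pattern such as the one above can be checked against a sequence by translating it into a regular expression. The following is a minimal sketch, not part of the original work: the function name `prosite_to_regex` and the test sequence are made up for illustration, and the anchors `<` and `>` of the PROSITE syntax are not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into an equivalent Python regular expression.

    Handles single residues, x (any residue), [..] (allowed residues),
    {..} (forbidden residues) and repetition suffixes (n) or (n,m).
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split off an optional repetition suffix such as "(2)" or "(2,4)".
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} lists forbidden residues
        # "[..]" and single letters are already valid regex fragments.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC]")
print(regex)  # GHE.{2}G.{5}[GA].{2}[IVSAC]

# Hypothetical sequence built to contain one occurrence of the pattern.
print(bool(re.search(regex, "MMGHEAAGAAAAAGAAIKK")))  # True
```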

Motif Discovery Algorithms
• There are two main categories:
• Stochastic algorithms, based on statistical significance, e.g. MEME, GIBBS
• Combinatorial algorithms, based on enumeration, e.g. PRATT, SPLASH

Then why one more?
• Existing algorithms:
• Are too slow or computationally expensive for massive inputs (e.g. MEME)
• Do not handle gapped motifs effectively
• Need the length or number of motifs to be specified in advance

What is a de Bruijn Graph?
• A graph whose nodes are subsequences of the same length (l-tuples) and whose edges indicate that the subsequences of the two connected nodes overlap.
• E.g. an edge ACAT → CATS represents the sequence "ACATS".

[Figure: de Bruijn graph for the sequence ABCDEFG, with nodes ABCD, BCDE, CDEF and DEFG.]
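The construction can be sketched in a few lines of Python. This is an illustrative sketch only (the presentation's own implementation is in Perl); the function names `l_tuples`, `de_bruijn_edges` and `spell_edge` are hypothetical, and l = 4 is chosen to match the examples.

```python
def l_tuples(sequence, l=4):
    """Return all overlapping subsequences of length l, in order."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]

def de_bruijn_edges(sequence, l=4):
    """Connect each l-tuple to the one that starts one position later."""
    nodes = l_tuples(sequence, l)
    return list(zip(nodes, nodes[1:]))

def spell_edge(u, v):
    """Return the (l+1)-residue sequence represented by the edge u -> v."""
    if u[1:] != v[:-1]:
        raise ValueError("nodes do not overlap")
    return u + v[-1]

print(l_tuples("ABCDEFG"))          # ['ABCD', 'BCDE', 'CDEF', 'DEFG']
print(de_bruijn_edges("ABCDEFG"))   # [('ABCD', 'BCDE'), ('BCDE', 'CDEF'), ('CDEF', 'DEFG')]
print(spell_edge("ACAT", "CATS"))   # ACATS
```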

Applying this to Identify Repeating Subsequences
• Given a set of sequences, we add the corresponding nodes and edges to the de Bruijn graph one sequence at a time.
• If a subsequence is repeated, the corresponding edge is already present in the graph.
• In that case we simply increment the weight of that edge.
• Eventually, the edges corresponding to highly repeated subsequences have higher weights.
• We can then find the motif by following the graph along the edges whose weights are above a specified threshold.

[Figure: de Bruijn graph for sequence 1, PAKARCDEKD. Nodes: PAKA, AKAR, KARC, ARCD, RCDE, CDEK, DEKD. Every edge has weight 1.]

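[Figure: de Bruijn graph for two sequences, 1. PAKARCDEKD and 2. NARCDEKHKH. Nodes: PAKA, AKAR, KARC, NARC, ARCD, RCDE, CDEK, DEKD, DEKH, EKHK, KHKH. The edges ARCD → RCDE and RCDE → CDEK, which occur in both sequences, have weight 2; all other edges have weight 1.]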
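The weighting and thresholding steps can be sketched as follows for the two example sequences. This is a simplified illustration, not the presentation's Perl implementation: the names `build_weighted_graph` and `heavy_paths` are hypothetical, each node follows only its heaviest qualifying outgoing edge, and paths that form a pure cycle are not reported.

```python
from collections import Counter

def build_weighted_graph(sequences, l=4):
    """Count how often each edge (pair of consecutive l-tuples) occurs."""
    weights = Counter()
    for seq in sequences:
        nodes = [seq[i:i + l] for i in range(len(seq) - l + 1)]
        for u, v in zip(nodes, nodes[1:]):
            weights[(u, v)] += 1
    return weights

def heavy_paths(weights, threshold):
    """Spell out paths made only of edges with weight >= threshold."""
    # For each node keep its heaviest outgoing edge that meets the threshold.
    best = {}
    for (u, v), w in weights.items():
        if w >= threshold and (u not in best or w > weights[(u, best[u])]):
            best[u] = v
    targets = set(best.values())
    motifs = []
    for start in best:
        if start in targets:
            continue  # this node is in the middle of a path, not its start
        path, node, seen = start, start, {start}
        while node in best and best[node] not in seen:
            node = best[node]
            seen.add(node)
            path += node[-1]  # each step adds one new residue
        motifs.append(path)
    return motifs

weights = build_weighted_graph(["PAKARCDEKD", "NARCDEKHKH"])
print(weights[("ARCD", "RCDE")], weights[("RCDE", "CDEK")])  # 2 2
print(heavy_paths(weights, threshold=2))                     # ['ARCDEK']
```

With a threshold of 2, only the two shared edges survive, and following them spells the common subsequence ARCDEK.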

Making them Messy
• In protein sequences, some amino acid residues can be substituted by others without affecting the function of the protein.
• So a sequence can be considered 'similar' to an edge even though it is not identical.
• Similarity between amino acid residues is determined using standard scoring matrices, such as BLOSUM62.
• In that case, we increment the weights of all edges that represent sequences 'similar' to the one in question.

Example
• Consider the same 2 sequences as before, but with K replaced by R in one of them.
• PAKARCDERD
• NARCDEKHKH
• As per BLOSUM62, the K → R substitution has a positive substitution score.
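One way to decide whether two l-tuples are 'similar' is sketched below. The slides do not spell out the exact rule, so this is an assumption: two tuples count as similar when every aligned pair of residues is either identical or has a positive substitution score. Only the K/R entry of BLOSUM62 (score 2) is included in the hypothetical `BLOSUM62_EXCERPT` table; a real run would load the full matrix, and pairs missing from the excerpt are treated here as not similar.

```python
# Excerpt of BLOSUM62: only the off-diagonal pair needed for this example.
BLOSUM62_EXCERPT = {frozenset("KR"): 2}

def is_conservative(a, b):
    """True if residues a and b are identical or have a positive score."""
    if a == b:
        return True  # every diagonal entry of BLOSUM62 is positive
    return BLOSUM62_EXCERPT.get(frozenset((a, b)), 0) > 0

def tuples_similar(s, t):
    """True if every aligned residue pair of the two tuples is conservative."""
    return len(s) == len(t) and all(is_conservative(a, b) for a, b in zip(s, t))

print(tuples_similar("CDEK", "CDER"))  # True: K and R substitute for each other
print(tuples_similar("CDEK", "CDEH"))  # False: K/H is not in the excerpt
```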

[Figure: de Bruijn graph for PAKARCDERD and NARCDEKHKH before the messy adjustment. Nodes: PAKA, AKAR, KARC, NARC, ARCD, RCDE, CDER, DERD, CDEK, DEKH, EKHK, KHKH. The edge ARCD → RCDE has weight 2; all other edges have weight 1.]

[Figure: the same graph after the messy adjustment. The two similar edges leaving RCDE (RCDE → CDER and RCDE → CDEK) now have weight 1.4 each; the edge ARCD → RCDE keeps weight 2 and all other edges keep weight 1.]

Adjusting the weights to account for messiness
• Suppose edge A is under consideration, and edges B and C, originating from the same node as A, are similar to A. Then:

$$W_A' \leftarrow W_A + W_B \cdot s(A,B) + W_C \cdot s(A,C)$$
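The adjustment can be sketched as a single pass over the edges. This is an illustrative sketch with stated assumptions: the function names are hypothetical, the adjusted weights are computed from the original (unadjusted) weights, and the similarity value 0.4 for the CDER/CDEK pair is read off the example figure (weight 1 becoming 1.4) rather than derived from a scoring matrix, since the slides do not say how s is computed.

```python
def adjust_weights(weights, similarity):
    """Add to each edge the similarity-scaled weights of its sibling edges.

    Sibling edges are the other edges leaving the same source node;
    similarity(v, v_other) should return 0 for dissimilar targets.
    """
    adjusted = {}
    for (u, v), w in weights.items():
        extra = sum(
            w_other * similarity(v, v_other)
            for (u_other, v_other), w_other in weights.items()
            if u_other == u and v_other != v
        )
        adjusted[(u, v)] = w + extra
    return adjusted

def example_similarity(a, b):
    # Assumed value taken from the example figure, not from BLOSUM62 itself.
    return 0.4 if {a, b} == {"CDER", "CDEK"} else 0.0

weights = {("ARCD", "RCDE"): 2, ("RCDE", "CDER"): 1, ("RCDE", "CDEK"): 1}
for edge, w in adjust_weights(weights, example_similarity).items():
    print(edge, round(w, 2))
```

Both edges leaving RCDE end up at about 1.4, while ARCD → RCDE has no similar sibling and stays at 2.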

Limitation of this Approach
• The motif must contain at least a few contiguous amino acid residues.
• So the method may fail if the conserved residues of the motif alternate with variable positions.
• E.g. AxAxCxDxAxGxC (x could be any residue) or AxCDxGxRGxC, since these motifs would not lead to high-weight edges in the de Bruijn graph.
• The problem is due to the need for overlaps, which is inherent to de Bruijn graphs.

Gapped Version
• For each node, we also create the nodes obtained by applying a gap mask (or "Don't care" mask) to that node.
• We currently restrict the maximal number of "Don't cares" in a node to 2.
• There are 10 such gapped masks for a node of length 4.

Gapped Version
• Let '1' represent a conserved amino acid and '0' represent a gap or "Don't care".
• Then the unmasked node (1111) and the 10 gapped masks can be represented as: 1111, 0111, 1110, 1011, 1101, 1100, 0011, 1001, 0110, 1010, 0101
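The mask list can be reproduced by enumerating every way of placing at most two "Don't care" positions in a node of length 4. A minimal sketch (the function name `gap_masks` is hypothetical):

```python
from itertools import combinations

def gap_masks(length=4, max_gaps=2):
    """Return all 0/1 masks of the given length with at most max_gaps zeros."""
    masks = []
    for gaps in range(max_gaps + 1):
        for positions in combinations(range(length), gaps):
            masks.append("".join("0" if i in positions else "1"
                                 for i in range(length)))
    return masks

masks = gap_masks()
print(len(masks))  # 11: the unmasked 1111 plus 10 gapped masks
print(masks[0])    # 1111
```

The count follows from C(4,0) + C(4,1) + C(4,2) = 1 + 4 + 6 = 11.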

Masking Example
• Suppose ANCD is the node that we are applying the mask to:
• ANCD * 1001 = AxxD
• ANCD * 1101 = ANxD
• ANCD * 1011 = AxCD
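Applying a mask amounts to keeping the residues under a '1' and replacing those under a '0' with the gap symbol. A minimal sketch (the function name `apply_mask` is hypothetical):

```python
def apply_mask(node, mask, gap_char="x"):
    """Keep residues where the mask has '1'; write gap_char where it has '0'."""
    if len(node) != len(mask):
        raise ValueError("node and mask must have the same length")
    return "".join(r if m == "1" else gap_char for r, m in zip(node, mask))

print(apply_mask("ANCD", "1001"))  # AxxD
print(apply_mask("ANCD", "1101"))  # ANxD
print(apply_mask("ANCD", "1011"))  # AxCD
```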

[Figure: ungapped graph for three sequences, 1. …ARCDM…, 2. …ANCDE…, 3. …ASCDT…. The edges ARCD → RCDM, ANCD → NCDE and ASCD → SCDT each have weight 1.]

[Figure: the same three edges after masking. Each of the three sequences (…ARCDM…, …ANCDE…, …ASCDT…) yields the edge AxCD → xCDx with weight 1.]

[Figure: the three identical masked edges from …ARCDM…, …ANCDE… and …ASCDT… merge into a single edge AxCD → xCDx with weight 3.]
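The merging of masked edges can be sketched as follows. The slides do not state how masks are paired along an edge, so this sketch assumes that the mask on the source node and the mask on the target node must agree on the three positions where the two nodes overlap (for example 1011 on ARCD with 0110 on RCDM). All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def gap_masks(length=4, max_gaps=2):
    """All 0/1 masks of the given length with at most max_gaps zeros."""
    return ["".join("0" if i in pos else "1" for i in range(length))
            for gaps in range(max_gaps + 1)
            for pos in combinations(range(length), gaps)]

def apply_mask(node, mask, gap_char="x"):
    """Replace residues under a '0' in the mask with gap_char."""
    return "".join(r if m == "1" else gap_char for r, m in zip(node, mask))

def build_gapped_graph(sequences, l=4, max_gaps=2):
    """Count masked edges; masks on both ends must agree on the overlap."""
    masks = gap_masks(l, max_gaps)
    weights = Counter()
    for seq in sequences:
        for i in range(len(seq) - l):
            u, v = seq[i:i + l], seq[i + 1:i + l + 1]
            for m1 in masks:
                for m2 in masks:
                    if m1[1:] == m2[:-1]:  # consistent on the shared positions
                        weights[(apply_mask(u, m1), apply_mask(v, m2))] += 1
    return weights

weights = build_gapped_graph(["ARCDM", "ANCDE", "ASCDT"])
print(weights[("ARCD", "RCDM")])  # 1: the unmasked edge occurs once
print(weights[("AxCD", "xCDx")])  # 3: all three sequences share this edge
```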

AxCDxxGH

Mask   Nodes
1111   ANCD  NCDE  CDEF  DEFG  EFGH
1011   AxCD  NxDE  CxEF  DxFG  ExGH
1101   ANxD  NCxE  CDxF  DExG  EFxH
1100   ANxx  NCxx  CDxx  DExx  EFxx
0011   xxCD  xxDE  xxEF  xxFG  xxGH
1001   AxxD  NxxE  CxxF  DxxG  ExxH
0110   xNCx  xCDx  xDEx  xEFx  xFGx
1010   AxCx  NxDx  CxEx  DxFx  ExGx
0101   xNxD  xCxE  xDxF  xExG  xFxH
...

Implementation
• The algorithm is implemented in Perl.
• Web interface: http://biokdd.informatics.indiana.edu/rpatward/deBruijn/project.html

Issues in Testing Motif Discovery Algorithms
• There is no benchmark dataset.
• It is difficult to compare different algorithms, since they take very different kinds of parameters.
• Some motifs are easier to find than others.

Test I
• The first 100 PROSITE patterns and their corresponding protein families were used as the test dataset to assess the accuracy of the output.
• The output of the program was compared to that of MEME and PRATT.

Results I
For MEME and PRATT, the top 3 motifs were considered.

Test II
• We also tested families corresponding to 162 PROSITE patterns that did not have any contiguous conserved amino acid residues, but had at least one occurrence of alternating conserved amino acid residues.

Results II

MEME was run on an IBM SP cluster, on 8 processors in parallel.

Future Work
• Categorizing easy and difficult motifs.
• Extending this approach to consensus-based multiple sequence alignment.
• Predicting whether a given protein sequence is likely to belong to a particular family.

Acknowledgements
• Dr. Mehmet Dalkilic
• Dr. Haixu Tang
• Dr. Sun Kim
• Bioinformatics Research Group
