Domain-SLiM mining from High Throughput Protein Interaction Data

Hugo Willy, August 19, 2010

Introduction to SLiM SLiM stands for (protein) Short Linear Motif. As the name suggests, it is a short linear stretch of a protein sequence that is recognized by another protein for binding. SLiMs average 8-12 amino acids in length, though some are as short as three amino acids. They are currently one of the specialized mechanisms by which a protein recognizes its interaction partner.

Protein Interaction in general Some proteins function only as part of a complex, and only in a certain combination. These are called obligate complexes. Their binding surfaces are usually large, providing strong chemical interactions.

Protein Interaction in general (2) On the other hand, some complexes form only “on demand”. Once the task is done, they dissociate. These are called transient complexes. The interaction surface in this type of interaction is generally smaller. SLiM-based interactions belong to the transient type.

Picture of non-linear interface (common case in obligate complexes) The interaction region is non-linear.

Picture of linear interface The bound protein chain is linear along the interface of its partner.

Protein domains recognizing SLiMs In practice, a SLiM is often recognized by a specialized protein domain. Among the best-known examples is the SH3 domain, which recognizes the P..P motif, where P is the amino acid proline and each “.” stands for any amino acid. WW domains recognize the PP.Y motif. These SLiMs, along with their functions (or domain associations), are listed in databases such as Eukaryotic Linear Motif (ELM) [1] and MiniMotif (MnM) [2].
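
To make the motif notation concrete, here is a minimal sketch that scans a sequence for the two patterns named on this slide using Python regular expressions. The function name `find_motif` and the toy sequence are hypothetical and only serve as illustration; real ELM entries are usually more elaborate than these two patterns.

```python
import re

# Patterns as written on the slide; "." matches any single amino acid.
MOTIFS = {
    "SH3": "P..P",
    "WW": "PP.Y",
}

def find_motif(sequence, pattern):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A zero-width lookahead reports every start position, so overlapping
    # occurrences are not skipped.
    lookahead = re.compile("(?=" + pattern + ")")
    return [m.start() for m in lookahead.finditer(sequence)]

# Hypothetical toy sequence, not a real protein.
toy = "MAPPLPPRPPTYGS"

if __name__ == "__main__":
    for domain, pattern in MOTIFS.items():
        print(domain, pattern, find_motif(toy, pattern))
```

Because such short patterns match by chance in many sequences, a raw match is only a candidate site, which is why the scoring approaches discussed later are needed.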

Methods of finding SLiMs in proteins • The SLiMs listed in the two databases are mostly the results of experimental procedures such as mutagenesis and phage display. • These procedures are laborious and expensive.

Computational methods to detect SLiMs • From sequence-based data (the focus of this talk) • From structural data: earlier this year, we published SLiMDiet [3], which is currently the most comprehensive SLiM listing from the PDB.

Sequence-based SLiM detection • Protein sequence based: given a set of grouped sequences, find motifs that occur in unrelated sequences. Examples: DILIMOT [4], SLiMDisc [5], SLiMFinder [6]. • Protein interaction based: find correlated motifs that are over-represented in interacting proteins. Examples: D-STAR [7], MotifCluster [8], SLIDER [9].

Protein Sequence Based Methods • Rely on occurrences in unrelated sequences. • May need to remove protein domains from the motif search space because of their similarity. • The grouping of the sequences can be manual, by selecting known sequences with a certain property. For example, proteins that are exported out of the cell can be grouped to find the motif responsible for the export mechanism. • The grouping can also be automated, using protein domain information or GO annotation.

Protein Sequence Based Methods (2) Once the grouping is done, the motif is mined using a standard motif-finding program such as MEME or TEIRESIAS. Because of MEME's speed and its rigid motif-length requirement, TEIRESIAS is usually the program of choice (it can start with a given motif length and try to combine the motifs into longer ones). TEIRESIAS uses (L,W) motifs: motifs of length L over a window of length W.

Protein Sequence Based Methods (3) • The problem with this approach is that it relies too heavily on the initial grouping. • The motif must be genuinely over-represented in the group. • All papers in this line of work compare their performance on the ELM set (a dataset of curated sequences known to contain ELM motifs).

Protein Sequence Based Methods (4) • They also found some significant motifs in groups of proteins known to interact with a protein domain that is known to take part in such SLiM interactions. • DILIMOT was published in PLoS Biology, as its authors managed some biological validation.

Interaction based methods • To be precise, none of the interaction-based methods designed to date were specifically designed to find SLiMs. • Most of them find “correlated motif pairs”. • These are pairs of motifs that occur consistently more frequently in interacting proteins than expected under some background model. • Examples: D-STAR, MotifCluster and SLIDER

Interaction based methods (2) • These methods rely solely on the density of interactions between the two sets of proteins that contain the respective motifs of the pair. • D-STAR and SLIDER use chi-square scoring, while MotifCluster uses hypergeometric scoring. • As I shall show later, they may not be suitable for finding SLiMs: by design they find interaction motifs, which may not be the binding motifs themselves.
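
The two scoring styles mentioned above can be sketched as follows. This is a generic illustration, not the exact statistic of D-STAR, SLIDER or MotifCluster. It assumes a hypothetical setup in which N is the number of candidate protein pairs, K of them interact, n of them carry the motif pair (one motif on each side), and k of those n interact. All function and parameter names are illustrative.

```python
from math import comb

def hypergeom_tail(N, K, n, k):
    """P(X >= k): chance of seeing at least k interacting pairs among the
    n motif-carrying pairs, if those n were drawn at random from N pairs
    of which K interact."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def chi_square_2x2(N, K, n, k):
    """Pearson chi-square statistic of the 2x2 table
    (has motif pair / lacks it) x (interacting / not interacting)."""
    a = k                  # motif pair, interacting
    b = n - k              # motif pair, not interacting
    c = K - k              # no motif pair, interacting
    d = N - K - n + k      # no motif pair, not interacting
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0         # degenerate table: no evidence either way
    return N * (a * d - b * c) ** 2 / denom

if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(hypergeom_tail(10, 4, 3, 2))
    print(chi_square_2x2(10, 4, 3, 2))
```

Both scores depend only on the interaction counts, which illustrates the point above: a motif pair can score well without either motif being the physical binding site.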

My current attempt - SLIMMER • I learnt that most of the time SLiMs are bound by a non-linear interface. • Thus, it is not very realistic to expect both sides of the interface to contain linear motifs. • This was pointed out by one of D-STAR’s reviewers. • So, I try to find correlated motifs where one of them is a protein domain, which is by definition non-linear (domains are distinct protein folds in 3D).

SLIMMER (2) • I basically combine the good ideas from several programs to accomplish this. • The strength of correlated motifs is that they can find motifs that seem insignificant by sequence occurrence alone, by using the fact that when they do occur, they interact intensively with the partner motif. • The correlated motif approach uses over-representation of interaction occurrence, as opposed to sequence occurrence.

SLIMMER (3) • However, the tricks of the sequence-based methods can also be applied. • They require the SLiM to occur in non-homologous sequences (which can be considered independent occurrences). • Non-homology is never considered in D-STAR, MotifCluster or SLIDER. • We should count only non-homologous interactions when counting the occurrences of a motif pair.

SLIMMER (4) • The SLiM itself must occur more often than expected at random. MotifCluster uses the binomial distribution to compute the probability of seeing a motif M k times in the sequence set (this is the threshold approach). • I also tried another approach in which I combine the binomial p-value of the motif occurrence with the hypergeometric p-value of the interaction occurrence.
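
As a rough sketch of the two ingredients above: the binomial tail gives the probability of seeing a motif in at least k of n sequences when each sequence contains it by chance with probability p. The slides do not say how the two p-values are combined, so the combination below uses Fisher's method purely as an assumption for illustration; it also assumes the two p-values are independent, which may not hold in practice. All names are hypothetical.

```python
from math import comb, log

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the motif occurs in at least
    k of n sequences when each contains it with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def fisher_combine(p_occurrence, p_interaction):
    """Combine two independent p-values with Fisher's method (an assumed
    choice, not taken from the slides). For two p-values the chi-square
    tail with 4 degrees of freedom reduces to the closed form
    q * (1 - ln q), where q is the product of the two p-values.
    Both inputs must be in (0, 1]."""
    q = p_occurrence * p_interaction
    return q * (1 - log(q))

if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    p_seq = binomial_tail(n=50, k=8, p=0.05)
    p_int = 0.01  # e.g. a hypergeometric p-value for the interaction count
    print(p_seq, fisher_combine(p_seq, p_int))
```

In contrast, the threshold approach would simply discard any motif whose binomial p-value exceeds a fixed cutoff before the interaction score is looked at.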

SLIMMER (5) • In current results, SLIMMER performs better than all available methods and is also fast. • I am still implementing a better background model to deal with low-complexity regions, using a simple 3rd- or 4th-order Markov model. • I am also in the middle of trying a motif model that allows choices like [LIVM], [FWY] or [KRH]. • Currently the only program that allows these is SLiMFinder, and it is very slow and inaccurate for now.
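
One way such a background model might look is sketched below: a k-th order Markov model estimated from the sequence set, used to score how probable a candidate motif instance is by chance. This is an illustrative sketch and not SLIMMER's implementation; the add-one smoothing, the function names and the toy training data are all assumptions.

```python
from collections import defaultdict
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_markov(sequences, order):
    """Count how often each residue follows each context of `order` residues."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def log_prob(segment, counts, order, pseudo=1.0):
    """Log-probability of segment[order:] given its first `order` residues,
    with additive smoothing (an assumed choice) for unseen transitions."""
    total = 0.0
    for i in range(order, len(segment)):
        ctx = counts.get(segment[i - order:i], {})
        num = ctx.get(segment[i], 0) + pseudo
        den = sum(ctx.values()) + pseudo * len(AMINO_ACIDS)
        total += log(num / den)
    return total

if __name__ == "__main__":
    # Hypothetical low-complexity training data; order 1 keeps the toy small,
    # whereas the slide suggests order 3 or 4 for real data.
    model = train_markov(["SSSSSSSSSS"], order=1)
    print(log_prob("SSS", model, 1), log_prob("SAS", model, 1))
```

Under such a model a serine run in a serine-rich background gets a high chance probability, so its repeated occurrence is less surprising than the same count for a motif with mixed residues.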

References • [1] P Puntervoll et al. ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res., 31(13):3625–3630, 2003. • [2] S Rajasekaran et al. Minimotif miner 2nd release: a database and web system for motif search. Nucleic Acids Res., 37(Database issue):D185–190, 2009. • [3] W Hugo et al. SLiM on Diet: finding short linear motifs on domain interaction interfaces in Protein Data Bank. Bioinformatics, 26(8):1036–1042, 2010. • [4] V Neduva et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol., 3(12):e405, 2005. • [5] N E Davey et al. SLiMDisc: short, linear motif discovery, correcting for common evolutionary descent. Nucleic Acids Res., 34(12):3546–3554, 2006. • [6] R J Edwards et al. SLiMFinder: a probabilistic method for identifying overrepresented, convergently evolved, short linear motifs in proteins. PLoS ONE, 2(10):e967, 2007.

References (2) • [7] S H Tan et al. A correlated motif approach for finding short linear motifs from protein interaction networks. BMC Bioinformatics, 7:502, 2006. • [8] H C Leung et al. Clustering-based approach for predicting motif pairs from protein interaction data. J Bioinform Comput Biol., 7(4):701–716, 2009. • [9] P Boyen et al. SLIDER: Mining correlated motifs in protein-protein interaction networks. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, pages 716–721, 2009.
