Comprehensive Assessment of Signal Peptide Prediction Methods

Justin Choo, Tin Wee Tan, Shoba Ranganathan A comprehensive assessment of N-terminal signal peptide prediction methods

Targeting of secretory proteins
• Secretory proteins reportedly represent 30% of an organism's proteome (Skach, 2007) and include functionally diverse classes of molecules such as cytokines, chemokines, hormones, digestive enzymes, antibodies, extracellular proteinases, morphogens and toxins
• Short peptides called signal peptides (SPs) direct the majority of these proteins to the secretory pathway (Gierasch, 1989; Rapoport, 1992)
  • function as address labels / postal codes

Features of SP
• Length of 13 to 36 aa (Molhoj and Degan, 2004)
• Tripartite regions
[Figure: SP regions with the P3 and P1 positions marked]

Varying length distribution (Choo and Ranganathan, 2008)

More than just targeting …
• In vitro evidence that free SPs inhibit protein translocation (Chen et al., 1987; Simon et al., 1992)
• Prevent premature folding or misfolding of secretory preproteins (Weiss and Bassford, 1990; Li et al., 1996)
• Affect translocation efficiency (Thornton et al., 2006) and modulate secretion
• Serve as ligands for opening the translocation channel (Rutkowski et al., 2001)
• Influence the regulation of protein targeting to the destination (Kurys et al., 2000)

More than just targeting … (2)
• Associated with risk of autoimmune diseases due to inefficient processing (Anjos et al., 2002)
• Post-targeting functions: immune surveillance of healthy cells (Lemberg et al., 2001)
• Signaling function
  • fragments found bound to MHC complexes on the cell surface (O’Callaghan et al., 1998)
  • cytosolic calmodulin (Martoglio et al., 1997)
• Mutations or minor alterations implicated in a host of diseases and complications, e.g. neurohypophyseal diabetes insipidus (Rittig et al., 2002) and classic Ehlers-Danlos syndrome, a connective tissue disease (Symoens et al., 2008)

Challenge
• SPs are cleaved off by type I signal peptidase (SPase I)
[Figure: P3 and P1 positions (Choo and Ranganathan, 2008)]

Existing methods
• Philius (Reynolds et al., 2008) – Bayesian networks
• Phobius (Käll et al., 2004) – HMM
• PrediSi (Hiller et al., 2004) – Position weight matrix (PWM)
• RPSP (Plewczynski et al., 2008) – ANN
• SigCleave (Rice et al., 2000) – PWM
• SigHMM (Zhang and Wood, 2003) – Profile HMM
• SignalP
  • HMM (Nielsen and Krogh, 1998)
  • ANN (Nielsen et al., 1997; Bendtsen et al., 2004)

Existing methods … (2)
• Signal-BLAST (Frank and Sippl, 2008) – Pairwise alignment using BLAST
• Signal-CF (Chou and Shen, 2007), Signal-3L (Shen and Chou, 2007) – KNN + subsite coupling
• SIG-Pred (Bradford, 2001) – PWM
• SOSUIsignal (Gomi et al., 2004) – Indices
• SPEPlip (Fariselli et al., 2003) – ANN + PROSITE pattern
• SPOCTOPUS (Viklund et al., 2008) – ANN + HMM

Objective
• Benchmark the 13 most popular recent methods to provide comparable results
• Test with 3 datasets involving thousands of sequences

Omission
• Popular datasets (Menne et al., 2000; Nielsen et al., 1997) excluded: they were derived from earlier Swiss-Prot releases (Rel. 27.0 and Rel. 38.0), so our datasets already include them
• Neural network-based approaches (Jagla and Schuchhardt, 2000; Reczko et al., 2002)
• SVM-based approaches (Mukherjee and Mukherjee, 2002; Vert, 2002; Cai et al., 2003; Sun and Wang, 2008)
• Profile HMM-based method CJ-SPHMM (Chen et al., 2003)
• Matrix-based + information theory (Liu et al., 2005)
• A BLOMAP encoding scheme (Maetschke et al., 2005)
• Hybrid: bio-basis function NNs + decision trees (Sidhu and Yang, 2006)
• Global alignment tool (Liu et al., 2007)
• Subcellular localization predictors (e.g. iPSORT, ProteinProwler)
• N-terminal targeting signal predictors (e.g. Predotar), which predict the presence of SPs but do not indicate cleavage sites
• Specialized tools, e.g. SecretomeP, which predicts non-classical SPs (i.e. signal sequences that remain uncleaved), and TargetP, since it uses SignalP for SP prediction
• SPEPlip: unavailable

General filtering criteria
a) Annotation hinting at uncertainty or lacking experimental verification (e.g. “probable”, “missing”, “by similarity”, “inferred”, “potential”, “putative” and “conflict”)
b) Lipoproteins cleaved by SPase II (“PROKAR_LIPOPROTEIN” under the “DR” field)
c) Fragment sequences
d) Organellar proteins (under the “OG” field)
e) Mollicutes, a division of bacteria that lack a cell wall (under the “OC” field)
f) Bacteria without any classification (e.g. [Swiss-Prot: SAT_RIFPS])
• Sequences with ambiguous characters or non-standard amino acid codes such as “X”, “Z” or “U” (e.g. [Swiss-Prot:KV3A6_MOUSE])
• Duplicates (redundancy reduction)
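A few of the criteria above can be sketched in code. This is a minimal illustration, not the pipeline used in the study: the record layout (a dict with the keys "id", "annotation", "sequence" and "is_fragment") and all function names are hypothetical, and the redundancy step only drops exact duplicate sequences, which may be weaker than the reduction actually applied.

```python
# Hypothetical record layout (not from the slides): each entry is a dict
# with the keys "id", "annotation", "sequence" and "is_fragment".
UNCERTAIN_TERMS = ("probable", "missing", "by similarity", "inferred",
                   "potential", "putative", "conflict")
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def has_uncertain_annotation(annotation):
    """True if the annotation text contains any uncertainty keyword."""
    text = annotation.lower()
    return any(term in text for term in UNCERTAIN_TERMS)


def has_nonstandard_residue(sequence):
    """True if the sequence contains codes such as X, Z or U."""
    return any(res not in STANDARD_RESIDUES for res in sequence.upper())


def filter_entries(entries):
    """Keep entries that pass the fragment, annotation, residue and duplicate checks."""
    kept, seen = [], set()
    for entry in entries:
        seq = entry["sequence"].upper()
        if entry["is_fragment"]:
            continue
        if has_uncertain_annotation(entry["annotation"]):
            continue
        if has_nonstandard_residue(seq):
            continue
        if seq in seen:  # exact duplicates only (assumption)
            continue
        seen.add(seq)
        kept.append(entry)
    return kept


# Made-up example entries; identifiers and annotations are illustrative only.
entries = [
    {"id": "E1", "annotation": "SIGNAL 1-21", "sequence": "MKKTAIAIAVALAGFATVAQAAPKD", "is_fragment": False},
    {"id": "E2", "annotation": "SIGNAL 1-21 (Potential)", "sequence": "MKRLLAVSLALAGSAQAADT", "is_fragment": False},
    {"id": "E3", "annotation": "SIGNAL 1-19", "sequence": "MKXLLAVSLALAGSAQAADT", "is_fragment": False},
    {"id": "E4", "annotation": "SIGNAL 1-19", "sequence": "MKSLLAVSLALAGSAQAADT", "is_fragment": True},
    {"id": "E5", "annotation": "SIGNAL 1-21", "sequence": "MKKTAIAIAVALAGFATVAQAAPKD", "is_fragment": False},
]
kept_ids = [e["id"] for e in filter_entries(entries)]
```

Criteria b), d), e) and f) depend on specific Swiss-Prot fields and are left out of the sketch.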

Benchmark datasets
[Table: positive (+ve) and negative (-ve) sets]

Evaluation

Aggregated results from all experiments
• 1st: SignalP, most accurate; ANN slightly better than HMM
• 2nd: Rapid Prediction of Signal Peptides (RPSP)

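Experiment #1
[Figure: results for Euk; labels 77.4%, 81.3%, >80% acc.]

Experiment #2 (4,704 sequences)
[Figure: results for Euk, GN, GP]

Experiment #3 (456 sequences)
[Figure: results for Euk, GN, GP]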

Discussion
• Non-linear features may be involved in the recognition of the cleavage site (Ladunga, 2000)
  • better accuracy by ML techniques?
• Alignment-based approaches (e.g. Signal-BLAST and SigHMM)
  • highly dependent on the balance between sensitivity and specificity
  • not suitable for detecting sequences sharing weak homology
• Most tools perform much better on Euk than on Bac datasets
  • Larger set -> better model
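The balance between sensitivity and specificity mentioned above can be made concrete with the standard confusion-matrix definitions. The sketch below assumes binary labels (1 = signal peptide present, 0 = absent); the function names are illustrative and are not taken from any of the benchmarked tools.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    """Fraction of true SP-containing sequences that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-secretory sequences correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def accuracy(tp, fn, tn, fp):
    """Fraction of all sequences classified correctly."""
    total = tp + fn + tn + fp
    return (tp + tn) / total if total else 0.0


# Made-up example: 10 positives (8 detected) and 10 negatives (9 rejected).
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [0] * 9 + [1] * 1
tp, fn, tn, fp = confusion_counts(y_true, y_pred)
```

A method tuned for high sensitivity tends to lose specificity and vice versa, which is why both a positive and a negative set are needed to compare tools fairly.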

Discussion … (2)
• Most tools easily distinguish secretory from non-secretory proteins; studies (Nielsen et al., 1998) involving discrimination between signal anchors and SPs led to similar conclusions
• More than 1/3 of putatively assigned cleavage sites were reported to be inaccurate (Zhang and Henzel, 2004)
• SignalP leads for all 3 organism groups across the 3 experiments
  • consistent for both ANN and HMM versions
  • more complex models and a robust method
  • various specific scoring schemes to tackle different aspects (including SP-likeness, the probability of a segment containing the cleavage site, and so on)
  • relatively wider sequence window (Euk: [-11,+2]; Gneg: [-21,+2]; Gpos: [-15,+2])
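To illustrate what the window sizes above mean in practice, the sketch below extracts the residues around a candidate cleavage site. It assumes the common numbering with no position 0, where -1 is the last SP residue and +1 is the first residue of the mature protein, so [-11,+2] covers 13 residues. The function name and the example sequence are hypothetical, and this is not SignalP's actual input encoding.

```python
# Window sizes as (residues upstream, residues downstream) of the cleavage
# site, taken from the slide; the numbering convention is an assumption.
WINDOWS = {"Euk": (11, 2), "Gneg": (21, 2), "Gpos": (15, 2)}


def cleavage_window(sequence, sp_length, group):
    """Return the residues spanning the window around the cleavage site.

    sp_length is the number of residues in the signal peptide, so cleavage
    occurs between sequence[sp_length - 1] and sequence[sp_length].
    Returns None if the window does not fit inside the sequence.
    """
    upstream, downstream = WINDOWS[group]
    start = sp_length - upstream
    end = sp_length + downstream
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


# Made-up example: a 21-residue SP followed by four mature residues.
example = "MKKTAIAIAVALAGFATVAQAAPKD"
euk_window = cleavage_window(example, 21, "Euk")
gneg_window = cleavage_window(example, 21, "Gneg")
```

A wider window lets a model see more of the region upstream of the cleavage site, at the cost of more input features to fit.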

Discussion … (3)
• The majority of the tools clearly require ‘active learning’ or regular updates to their underlying models to reflect the latest data distribution
• Canonical Ala-X-Ala motif (von Heijne, 1986): the basis for postulating the “(-3,-1) rule”
• The rule states: P1 must be a small residue (Ala, Ser, Gly, Cys, Thr or Gln), while P3 must not be aromatic (Phe, His, Tyr, Trp), charged (Asp, Glu, Lys, Arg) or large polar (Asn, Gln). Further, Pro must be absent from P3 to P1’
  • Gram+ (61.9%), Gram- (77.5%) observed in our data
• P3 and P1 are known to be critical recognition sites for SPase I (Karla et al., 2005)
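The “(-3,-1) rule” as stated above translates directly into a small check. The sketch below assumes the SP length is known, so that P1 is the last SP residue and P1’ is the first residue of the mature protein; the function name and example sequences are hypothetical.

```python
SMALL_P1 = set("ASGCTQ")        # Ala, Ser, Gly, Cys, Thr, Gln allowed at P1
FORBIDDEN_P3 = set("FHYW") | set("DEKR") | set("NQ")  # aromatic, charged, large polar


def obeys_minus3_minus1_rule(sequence, sp_length):
    """Check the (-3,-1) rule for a cleavage site after sp_length residues.

    P3 = sequence[sp_length - 3], P1 = sequence[sp_length - 1],
    P1' = sequence[sp_length]. Returns False if the site is out of range.
    """
    if sp_length < 3 or sp_length >= len(sequence):
        return False
    seq = sequence.upper()
    p3 = seq[sp_length - 3]
    p1 = seq[sp_length - 1]
    p3_to_p1_prime = seq[sp_length - 3:sp_length + 1]  # P3, P2, P1, P1'
    return (p1 in SMALL_P1
            and p3 not in FORBIDDEN_P3
            and "P" not in p3_to_p1_prime)


# Made-up examples: a site ending in Ala-Gln-Ala passes; a Pro at P1' or a
# Lys at P3 fails.
ok_site = obeys_minus3_minus1_rule("MKKTAIAIAVALAGFATVAQAAPKD", 21)
pro_at_p1_prime = obeys_minus3_minus1_rule("MKKTAIAIAVALAGFATVAQAPPKD", 21)
charged_p3 = obeys_minus3_minus1_rule("MKKTAIAIAVALAGFATVKQAAPKD", 21)
```

Applying such a check to every annotated cleavage site in a dataset gives the fraction of sites that conform to the rule.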

Conclusion
• Alternative approach
• Two-step prediction vs one
• Larger window frame
• More data needed to evaluate further

Benchmark datasets
• Dataset #1:
  • +ve: 270 secreted recombinant human proteins taken from http://share.gene.com/cleavagesite/index.html
  • -ve: the original study omitted a specificity test; 270 human non-secretory proteins from (Zhang & Henzel, 2004) -> SigHMM
• Dataset #2:
  • +ve: 2349 sequences from SPdb5.1 (Choo et al., 2005), filtered from Swiss-Prot 55.0 and used in constructing the majority of prediction methods
  • -ve: a mix of cytoplasmic and nuclear (Euk only) proteins
• Dataset #3:
  • +ve: Swiss-Prot 57.0
    • excludes entries already present in #1 and #2
    • 50% of Euk instances and > 90% of Bac sequences
    • putative SPs with high probability

Detailed results
