Razvan C. Bunescu Raymond J. Mooney

Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions • Razvan C. Bunescu, Raymond J. Mooney: Machine Learning Group, Department of Computer Sciences, University of Texas at Austin, {razvan, mooney}@cs.utexas.edu • Arun K. Ramani, Edward M. Marcotte: Institute for Cellular and Molecular Biology and Center for Computational Biology and Bioinformatics, University of Texas at Austin, {arun, marcotte}@icmb.utexas.edu

Outline • Introduction & Motivation. • Two benchmark tests of accuracy. • Framework for the extraction of interactions. • Future Work. • Conclusions.

Introduction • Large-scale protein networks facilitate a better understanding of the interactions between proteins. • Most complete for yeast. • Minimal progress for human. • Most known interactions between human proteins are reported in Medline. • Reactome, BIND, HPRD: databases with protein interactions manually curated from Medline. Example: In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.

Motivation • Many interactions from Medline are not covered by current databases. • Databases are generally biased for different classes of interactions. • Manually extracting interactions is a very laborious process. Aim: Automatically identify pairs of interacting proteins with high accuracy.

Outline • Introduction & Motivation. • Two benchmark tests of accuracy. • Functional Annotation. • Physical Interaction. • Framework for the extraction of interactions. • Future Work. • Conclusions.

Accuracy Benchmarks – Shared Functional Annotations • Accuracy of interaction datasets correlates well with the percentage of interaction partners sharing functional annotations. • Functional annotation = a pathway between the two proteins in a particular ontology: • KEGG: 55 pathways at lowest level. • GO: 1356 pathways at level 8 of biological process annotation.

Accuracy Benchmarks – Shared Known Physical Interactions • Assumption: Accurate datasets are more enriched in pairs of proteins known to participate in a physical interaction. • Reactome and BIND are more accurate than others => use them as source of known physical interactions. • Total: 11,425 interactions between 1,710 proteins.

Accuracy Benchmarks – LLR Scoring Scheme • Use the log-likelihood ratio (LLR) of protein pairs with respect to: • Sharing functional annotations. • Physically interacting. • LLR = ln [ P(D|I) / P(D|-I) ], where P(D|I) and P(D|-I) are the probabilities of observing the data D conditioned on the proteins sharing (I) or not sharing (-I) benchmark associations. • Higher values for LLR indicate higher accuracy.
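The LLR score can be sketched from raw counts as below. This is only an illustration: estimating P(D|I) and P(D|-I) as simple fractions of benchmark pairs that appear in the dataset is an assumption, and the function and parameter names are hypothetical.

```python
import math

def log_likelihood_ratio(hits_pos, total_pos, hits_neg, total_neg):
    """LLR = ln(P(D|I) / P(D|-I)), with both probabilities estimated from counts.

    hits_pos / total_pos: benchmark-positive pairs (I) found in the dataset / overall.
    hits_neg / total_neg: benchmark-negative pairs (-I) found in the dataset / overall.
    The count-based estimate is an assumption made for illustration.
    """
    if min(hits_pos, hits_neg, total_pos, total_neg) <= 0:
        raise ValueError("all counts must be positive")
    p_d_given_i = hits_pos / total_pos          # estimate of P(D|I)
    p_d_given_not_i = hits_neg / total_neg      # estimate of P(D|-I)
    return math.log(p_d_given_i / p_d_given_not_i)
```

A dataset that recovers 5% of the benchmark-positive pairs but only 0.1% of the benchmark-negative pairs would score ln(50) under this estimate.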

Outline • Introduction & Motivation. • Two benchmark tests of accuracy. • Framework for the extraction of interactions. • Future Work. • Conclusions.

Framework for Interaction Extraction • Pipeline: Medline abstract → Protein Extraction → Medline abstract (proteins tagged) → Interaction Extraction → Interactions Database. • Extensive comparative experiments in [Bunescu et al. 2005]: • Protein Extraction: Maximum Entropy tagger. • Interaction Extraction: ELCS (Extraction using Longest Common Subsequences). • Current framework aims to improve on the previous approach on a much larger scale (750K Medline abstracts).

Framework for Interaction Extraction [Protein Extraction] 1) Identify protein names using a Conditional Random Fields (CRFs) tagger [Lafferty et al. 2001] trained on a dataset of 750 Medline abstracts, manually tagged for proteins. [Interaction Extraction] 2) Keeping the most confident extractions, detect which pairs of proteins are interacting. Two methods: 2.1) Co-citation analysis (document level). 2.2) Learning of interaction extractors (sentence level).

1) A CRF tagger for protein names • Protein Extraction = a sequence tagging task, where each word is associated with a tag from: O(-utside), B(-egin), C(-ontinue), E(-nd), U(-nique). • Example (token/tag): In/O synchronized/O human/O osteosarcoma/O cells/O ,/O cyclin/B D1/E is/O induced/O in/O early/O G1/O • The input text is first preprocessed: • Tokenized • Split into sentences (Ratnaparkhi’s MXTerminator) • Tagged with part-of-speech (POS) tags (Brill’s tagger)
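The O/B/C/E/U tagging scheme can be illustrated with a small encoder. This is a minimal sketch: the function name and the (start, end) span convention with an exclusive end index are assumptions made for the example.

```python
def encode_tags(tokens, protein_spans):
    """Assign O/B/C/E/U tags to tokens.

    protein_spans: list of (start, end) token indices, end exclusive
    (a hypothetical convention chosen for this sketch).
    """
    tags = ["O"] * len(tokens)                 # everything is Outside by default
    for start, end in protein_spans:
        if end - start == 1:
            tags[start] = "U"                  # single-token protein name: Unique
        else:
            tags[start] = "B"                  # first token: Begin
            for i in range(start + 1, end - 1):
                tags[i] = "C"                  # interior tokens: Continue
            tags[end - 1] = "E"                # last token: End
    return tags

tokens = "In synchronized human osteosarcoma cells , cyclin D1 is induced in early G1".split()
tags = encode_tags(tokens, [(6, 8)])           # "cyclin D1" covers tokens 6 and 7
```

With the span (6, 8), "cyclin" receives B and "D1" receives E, matching the tag sequence on the slide.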

1) A CRF tagger for protein names • Each token position in a sentence is associated with a vector of binary features based on the (current tag, previous tag) combination, and observed values such as: • Words before, after or at the current position. • Their POS tags & capitalization patterns. • A binary flag set to true if the word is part of a protein dictionary. • Example (token/POS): In/IN synchronized/VBN human/JJ osteosarcoma/NN cells/NNS ,/, cyclin/NN D1/NNP is/VBZ induced/VBN in/IN early/JJ [Figure labels: current word, words before, words after; current POS, POS before, POS after]
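The observed part of such a feature vector might look like the sketch below. It covers only the observation features; in a CRF these would additionally be conjoined with the (current tag, previous tag) combination. The feature names, the one-token window, and the word-shape encoding are all assumptions made for illustration.

```python
def word_shape(word):
    """Capitalization pattern: uppercase -> A, lowercase -> a, digit -> 0."""
    return "".join(
        "A" if c.isupper() else "a" if c.islower() else "0" if c.isdigit() else c
        for c in word
    )

def token_features(tokens, pos_tags, i, protein_dictionary):
    """Return the set of binary observation features that fire at position i."""
    def at(seq, j):
        # Out-of-range neighbours map to a padding symbol.
        return seq[j] if 0 <= j < len(seq) else "<PAD>"

    feats = {
        f"word={tokens[i].lower()}",
        f"word_before={at(tokens, i - 1).lower()}",
        f"word_after={at(tokens, i + 1).lower()}",
        f"pos={pos_tags[i]}",
        f"pos_before={at(pos_tags, i - 1)}",
        f"pos_after={at(pos_tags, i + 1)}",
        f"shape={word_shape(tokens[i])}",
    }
    if tokens[i].lower() in protein_dictionary:
        feats.add("in_dictionary")             # binary dictionary flag
    return feats
```

For the token "D1" in the example sentence, the sketch fires features such as pos=NNP, pos_before=NN, and shape=A0.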

1) A CRF tagger for protein names • The CRF model is trained on 750 Medline abstracts manually annotated for proteins. • Experimentally, CRFs give better performance than Maximum Entropy models: they allow local tagging decisions to compete against each other in a global sentence model. • The model is used for tagging a large set (750K) of Medline abstracts citing the word ‘human’. • Each extracted protein is associated with a normalized confidence value. • For the Interaction Extraction step, we keep only proteins scoring 0.8 or better.

2.1) Interaction Extraction using Co-citation Analysis • Intuition: proteins co-occurring in a large number of abstracts tend to be interacting proteins. • Compute the probability of co-citation under a random model (hypergeometric distribution): P(k | N, n, m) = C(n, k) · C(N − n, m − k) / C(N, m), where C(a, b) is the binomial coefficient and • N = total number of abstracts (750K) • n = abstracts citing the first protein • m = abstracts citing the second protein • k = abstracts citing both proteins
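The random model above can be sketched with the standard hypergeometric probability mass function. Working in log space with the log-gamma function is a choice made here for numerical stability at N = 750K, not something the slides specify, and the function names are hypothetical.

```python
import math

def log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b)."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def cocitation_log_prob(N, n, m, k):
    """Log-probability of exactly k co-citing abstracts under the random model.

    N: total abstracts; n, m: abstracts citing each protein; k: abstracts citing both.
    """
    if k < 0 or k > min(n, m) or m - k > N - n:
        return float("-inf")                   # impossible configuration
    return log_comb(n, k) + log_comb(N - n, m - k) - log_comb(N, m)
```

Pairs can then be ranked by ascending log-probability, so that the least likely co-citation counts under the random model come first.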

2.1) Interaction Extraction using Co-citation Analysis • Protein pairs which co-occur in a large number of abstracts (high k) are assigned a low probability under the random model. • Empirically, protein pairs whose observed co-citation rate is given a low probability under the random model score high on the functional annotation benchmark. • RESULT: Close to 15K interactions extracted that score comparably to or better than HPRD on the functional annotation benchmark.

2.1) Co-citation Analysis with Bayesian Reranking • Pipeline: Medline abstract → CRF tagger → Medline abstract (proteins tagged) → Co-citation Analysis → Ranked Interactions → re-ranking with Naïve Bayes scores → Re-ranked Interactions. • Use a trained Naïve Bayes model to measure the likelihood that an abstract discusses physical protein interactions. • For a given pair of proteins, compute the average score of co-citing abstracts. • Use the average score to re-rank the 15K already extracted pairs.
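The re-ranking step can be sketched as follows, assuming the Naïve Bayes score of each abstract has already been computed. The data layout (a mapping from protein pair to co-citing abstract IDs, and from abstract ID to score) and the function name are hypothetical.

```python
def rerank_by_abstract_score(pair_to_abstracts, abstract_score):
    """Order protein pairs by the average score of their co-citing abstracts.

    pair_to_abstracts: {(protein_a, protein_b): [abstract_id, ...]}
    abstract_score:    {abstract_id: classifier score for that abstract}
    """
    averaged = {}
    for pair, abstract_ids in pair_to_abstracts.items():
        if not abstract_ids:
            continue                           # no co-citing abstracts, nothing to average
        scores = [abstract_score[a] for a in abstract_ids]
        averaged[pair] = sum(scores) / len(scores)
    # Highest average score first.
    return sorted(averaged, key=averaged.get, reverse=True)
```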

Integrating Extracted Data with Existing Databases Extracted: 6,580 interactions between 3,737 human proteins Total: 31,609 interactions between 7,748 human proteins.

2.1) Co-citation Analysis: Evaluation

2.2) Learning of Interaction Extractors • Proteins may be co-cited for reasons other than interactions. • Solution: sentence level extraction, with a binary classifier. • Given a sentence containing the two protein names, output: • Positive: if the sentence asserts an interaction between the two. • Negative: otherwise. • If the sentence contains n > 2 protein names, replicate it into (n choose 2) sentences, each with only two protein names. • Training data: AImed, a collection of Medline abstracts, manually tagged.
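The replication of a sentence with n protein names into (n choose 2) candidate instances can be sketched as below. Marking the selected pair as P1 and P2 and leaving other protein names untouched is an assumption of this sketch, as is the restriction to single-token protein names.

```python
from itertools import combinations

def candidate_instances(tokens, protein_positions):
    """Return one copy of the sentence per unordered pair of protein mentions.

    protein_positions: token indices of protein names (single-token names
    only, a simplification made for this sketch).
    """
    instances = []
    for a, b in combinations(sorted(protein_positions), 2):
        copy = list(tokens)
        copy[a], copy[b] = "P1", "P2"          # mark the candidate pair
        instances.append(copy)
    return instances

sentence = "cyclin_D1 is associated with both p34cdc2 and p33cdk2".split()
instances = candidate_instances(sentence, [0, 5, 7])
```

Each instance is then passed to the binary classifier, which labels it positive only if the sentence asserts an interaction between P1 and P2.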

AImed • Total of 225 documents (200 w/ interactions + 25 w/o interactions). • Annotations for proteins and interactions. Example: In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit. Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity … • cyclin D1 … becomes associated with p9Ckshs1 => Interaction • cyclin D1 is associated with both p34cdc2 => Interaction • cyclin D1 is associated with both p34cdc2 and p33cdk2 => Interaction

ELCS (Extraction using Longest Common Subsequences) [Bunescu et al., 2005] • A new method for inducing rules that extract interactions between previously tagged proteins. • Each rule consists of a sequence of words with allowable word gaps between them, similar to [Blaschke & Valencia, 2001, 2002]. Example rule: - (7) interactions (0) between (5) PROT (9) PROT (17) . • Any pair of proteins in a sentence forms a positive example if tagged as interacting; otherwise it forms a negative example. • Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples.
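How a rule of this form might be matched against a sentence is sketched below. Reading each parenthesized number as the maximum number of words that may be skipped before the next rule word is an assumption about the notation, and the rule representation as (max_gap, word) pairs is hypothetical. The sketch covers matching only, not rule induction.

```python
def rule_matches(rule, tokens, start=0):
    """Check whether a gapped word-sequence rule matches the token list.

    rule: list of (max_gap, word) pairs; max_gap is read here as the largest
    number of tokens that may be skipped before `word` (an assumption).
    """
    if not rule:
        return True                            # every rule word has been placed
    max_gap, word = rule[0]
    last = min(len(tokens), start + max_gap + 1)
    for i in range(start, last):
        # Try each admissible position, backtracking if the rest fails.
        if tokens[i] == word and rule_matches(rule[1:], tokens, i + 1):
            return True
    return False

rule = [(7, "interactions"), (0, "between"), (5, "PROT"), (9, "PROT"), (17, ".")]
```

Backtracking over candidate positions avoids the failure mode of a greedy left-to-right scan, which can commit to an early occurrence of a word and miss a valid later alignment.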

ERK (Extraction using a Relation Kernel) • The patterns (features) are sparse subsequences of words constrained to be anchored on the two protein names. • The feature space can be further pruned down: in almost all examples, a sentence asserts a relationship between two entities using one of the following patterns: • [FI] Fore-Inter: ‘interaction of P1 with P2’, ‘activation of P1 by P2’ • [I] Inter: ‘P1 interacts with P2’, ‘P1 is activated by P2’ • [IA] Inter-After: ‘P1 – P2 complex’, ‘P1 and P2 interact’ • Restrict the three types of patterns to use at most 4 words (besides the two protein anchors).

ERK (Extraction using a Relation Kernel) S1: In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit. S2: Experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and … • The kernel K(S1, S2) = the number of common patterns between S1 and S2, weighted by their span in the two sentences. • K(S1, S2) can be computed based on the dynamic programming procedure from [Lodhi et al., 2002]. • Train an SVM model to find a max-margin linear discriminator between positive and negative examples. • [FI] patterns: “human cells P1 associated with P2”, … • [I] patterns: “P1 associated with P2”, … • [IA] patterns: “P1 associated with P2 ,”, …
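The idea of counting common sparse word subsequences weighted by span can be illustrated with a brute-force gap-weighted subsequence kernel. This is a simplified stand-in, not the ERK kernel itself: it omits the anchoring on the two protein names and the [FI]/[I]/[IA] partition, it enumerates subsequences explicitly instead of using the dynamic programming recursion, and the decay parameter `lam` with its default value is an assumption.

```python
from itertools import combinations

def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted count of common word subsequences of length n (brute force).

    Each occurrence of a subsequence contributes lam ** span, where span is the
    number of tokens it covers; K sums the products over common subsequences.
    """
    def weighted(seq):
        # Map each length-n subsequence to the sum of lam ** span over its occurrences.
        table = {}
        for idx in combinations(range(len(seq)), n):
            u = tuple(seq[i] for i in idx)
            span = idx[-1] - idx[0] + 1
            table[u] = table.get(u, 0.0) + lam ** span
        return table

    ws, wt = weighted(s), weighted(t)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)
```

Explicit enumeration is exponential in n and only practical for the short windows used here for illustration; the dynamic programming formulation exists precisely to avoid it.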

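Evaluation: ERK vs ELCS vs Manual • Compare results using the standard measures of precision and recall: • Precision = (number of correct extracted interactions) / (total number of extracted interactions). • Recall = (number of correct extracted interactions) / (total number of interactions in the gold standard). • All three systems were tested on AImed, using gold-standard proteins.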

Evaluation: ERK vs ELCS vs Manual

Future Work & Conclusions Future Work: • Analyze the complete set of 750K abstracts using the relational kernel and integrate results into an improved composite dataset. Conclusions: • Created a large database of interacting human proteins by consolidating interactions automatically extracted from Medline abstracts with existing databases. • Final database: 31,609 interactions between 7,748 human proteins.

For Further Information • Consolidated database available online: • http://bioinformatics.icmb.utexas.edu/idserve/ • Papers available online: • http://www.cs.utexas.edu/users/ml/publication/bioinformatics.html • “Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome,” Arun K. Ramani, Razvan C. Bunescu, Raymond J. Mooney, and Edward M. Marcotte, Genome Biology, 6, 5, r40 (2005). • “Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions,” Arun Ramani, Edward Marcotte, Razvan Bunescu, and Raymond Mooney, to appear in the Proceedings of ISMB BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005. • “Collective Information Extraction with Relational Markov Networks,” Razvan Bunescu and Raymond J. Mooney, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004), pp. 439-446, Barcelona, Spain, July 2004. • “Comparative Experiments on Learning Information Extractors for Proteins and their Interactions,” Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong, Artificial Intelligence in Medicine (Special Issue on Summarization and Information Extraction from Medical Documents), 33, 2 (2005), pp. 139-155.

The End

Protein Interaction Datasets – Normalization • Need a shared convention for referencing proteins and their interactions. • Map each interacting protein to a LocusLink ID => small loss of proteins. • Consider interactions symmetric => many duplicates eliminated. • Omit self interactions – cannot be evaluated on functional annotation benchmark. Example: HPRD reduced from 12,013 to 6,054 unique symmetric, non-self interactions.
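The normalization steps above can be sketched as follows. The mapping from dataset-specific protein names to LocusLink IDs is assumed to be available as a dictionary, and all names here are hypothetical.

```python
def normalize_interactions(pairs, to_locuslink):
    """Map proteins to shared IDs, treat pairs as symmetric, and drop self interactions.

    pairs:        iterable of (protein_a, protein_b) in a dataset's own naming
    to_locuslink: {protein name: LocusLink ID} (assumed to be given)
    """
    unique = set()
    for a, b in pairs:
        id_a, id_b = to_locuslink.get(a), to_locuslink.get(b)
        if id_a is None or id_b is None:
            continue                           # unmapped protein: the pair is lost
        if id_a == id_b:
            continue                           # omit self interactions
        unique.add(tuple(sorted((id_a, id_b))))  # (a, b) and (b, a) collapse to one key
    return unique
```

Storing each pair as a sorted tuple makes the symmetric duplicates collapse automatically when they are added to the set.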

Protein Interaction Datasets – Normalization • Dataset statistics after normalization (Is = interactions, Ps = proteins):

Accuracy of manually curated interactions
