
Biomedical Information Extraction using Inductive Logic Programming


Presentation Transcript


  1. Biomedical Information Extraction using Inductive Logic Programming Mark Goadrich and Louis Oliphant Advisor: Jude Shavlik Acknowledgements to NLM training grant 5T15LM007359-02

  2. Abstract Automated methods for finding relevant information from the large amount of biomedical literature are needed. Information extraction (IE) is the process of finding facts from unstructured text such as biomedical journals and putting those facts in an organized system. Our research mines facts about a relationship (e.g. protein localization) from PubMed abstracts. We use Inductive Logic Programming (ILP) to learn a set of logical rules that explain when and where a relationship occurs in a sentence. We build rules by finding patterns in syntactic as well as semantic information for each sentence in a training corpus that has been previously marked with the relationship. These rules can then be used on unmarked text to find new instances of the relation. Some major research issues involved in this approach are handling unbalanced data, searching the enormous space of clauses, learning probabilistic logical rules, and incorporating expert background knowledge.

  3. The Central Dogma • Discoveries • protein-protein interactions • protein localizations • genetic diseases • Most knowledge stored in articles • Just Google it? (Central dogma image courtesy of the National Human Genome Research Institute)

  4. World of Publishing • Current • authors write articles in Word or LaTeX and publish in conferences, journals, etc. • humans index and extract relevant information (time- and cost-intensive) • Future? • all published articles available on the Web • semantic web – an extension of HTML for content • articles automatically annotated and indexed into searchable databases

  5. Information Extraction • Given a set of abstracts tagged with biological relationships between phrases • Do learn a theory (e.g., a set of inference rules) that accurately extracts these relations

  6. Training Data

  7. Why Use ILP? • KDD Cup 2002 • logical rules (handcrafted) did best on the IE task • Hypotheses are comprehensible • written in first-order predicate calculus (FOPC) • aim to cover only positive examples • Background knowledge easily incorporated • expert advice • linguistic knowledge of English parse trees • biomedical knowledge (e.g., MeSH)

  8. ILP Example: Family Tree • Background knowledge: mother(ann, mary), mother(ann, tom), father(tom, eve), father(tom, ian), female(ann), female(mary), female(eve), male(tom), male(ian) • Positive examples: daughter(mary, ann), daughter(eve, tom) • Negative examples: daughter(tom, ann), daughter(eve, ann), daughter(ian, tom), daughter(ian, ann), … • Possible rules: daughter(A,B) if male(A) and father(B,A); daughter(A,B) if mother(B,A); daughter(A,B) if female(A) and male(B); daughter(A,B) if female(A) and mother(B,A) (Family-tree diagram: Ann is the mother of Mary and Tom; Tom is the father of Eve and Ian)
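To make the coverage criterion concrete, here is a minimal sketch (plain Python rather than a real ILP engine) that scores the four candidate rules above by how many positive and negative daughter/2 examples each covers; the data structures and helper names are ours, chosen only for illustration.

```python
# Minimal sketch: score candidate rules for daughter(A, B) by example coverage.
# Background knowledge taken from the family-tree slide.
mother = {("ann", "mary"), ("ann", "tom")}
father = {("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}
male = {"tom", "ian"}

positives = {("mary", "ann"), ("eve", "tom")}
negatives = {("tom", "ann"), ("eve", "ann"), ("ian", "tom"), ("ian", "ann")}

# Each candidate rule body is a predicate over (A, B), mirroring the slide.
rules = {
    "male(A), father(B,A)":   lambda a, b: a in male and (b, a) in father,
    "mother(B,A)":            lambda a, b: (b, a) in mother,
    "female(A), male(B)":     lambda a, b: a in female and b in male,
    "female(A), mother(B,A)": lambda a, b: a in female and (b, a) in mother,
}

for body, covers in rules.items():
    pos = sum(covers(a, b) for a, b in positives)
    neg = sum(covers(a, b) for a, b in negatives)
    print(f"daughter(A,B) :- {body:<24} covers {pos} positive, {neg} negative")
```

An ILP learner prefers bodies that cover positive examples while excluding the negatives.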

  9. Sundance Parsing • Example sentence: "smf1 and smf2 are mitochondrial membrane_proteins", parsed by Sundance into a sentence with NP-Conj, VP, and NP segments and part-of-speech tags (unk, conj, unk, cop, unk, noun, …) • Sentence structure predicates: parent(smf1,np-conj seg), parent(np-conj seg,sentence), child(np-conj seg,smf1), child(sentence,np-conj seg), next(smf1,and), next(np-conj seg,vp seg), after(np-conj seg,np seg), … • Part-of-speech predicates: noun(membrane_proteins), verb(are), unk(smf1), noun_phrase(np seg), verb_phrase(vp seg), … • Biomedical knowledge predicates: in_med_dict(mitochondrial), go_mitochondrial_membrane(smf1), go_mitochondrion(smf1), … • Lexical word predicates: novelword(smf1), novelword(smf2), alphabetic(and), alphanumeric(smf1), …
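As an illustration of how facts like these might be generated, the sketch below (our own toy code, not the Sundance analyzer or the authors' pipeline) walks a hand-built parse of the example sentence and emits parent/child/next and part-of-speech facts.

```python
# Sketch: turn a toy parse of "smf1 and smf2 are mitochondrial membrane_proteins"
# into relational facts like those used as ILP background knowledge.
# The parse tree below is hand-built for illustration, not real Sundance output.
parse = ("sentence", [
    ("np-conj_seg", [("unk", "smf1"), ("conj", "and"), ("unk", "smf2")]),
    ("vp_seg",      [("cop", "are")]),
    ("np_seg",      [("unk", "mitochondrial"), ("noun", "membrane_proteins")]),
])

facts = []

def emit(node, parent=None):
    label, children = node
    if parent is not None:
        facts.append(f"parent({label},{parent})")
        facts.append(f"child({parent},{label})")
    for child in children:
        if isinstance(child[1], list):   # an inner segment
            emit(child, label)
        else:                            # a (part_of_speech, word) leaf
            pos, word = child
            facts.append(f"{pos}({word})")
            facts.append(f"parent({word},{label})")
            facts.append(f"child({label},{word})")

emit(parse)

# Linear order of sibling segments gives next/2 facts.
segments = [seg[0] for seg in parse[1]]
for left, right in zip(segments, segments[1:]):
    facts.append(f"next({left},{right})")

print("\n".join(facts))
```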

  10. Sample Learned Rule gene_disease(E,A) :- isa_np_segment(E), isa_np_segment(A), prev(A,B), pp_segment(B), child(A,C), next(C,D), alphabetic(D), novelword(C), child(E,F), alphanumeric(F). (Diagram: within the sentence, E and A are noun-phrase segments; B is a prepositional-phrase segment preceding A; C is a novel word inside A whose next word D is alphabetic; F is an alphanumeric word inside E)

  11. Ensembles for Rules • N heads are better than one… • learn multiple (sets of) rules with training data • aggregate the results by voting on the classification of testing data • Bagging (Breiman ’96) • each rule set gets one vote • Boosting (Freund and Schapire ’96) • each rule gets a weighted vote
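A minimal sketch of the two voting schemes, assuming each learned rule set (or rule) is represented as a function that classifies an example; the rules and weights below are toy stand-ins, not anything actually learned.

```python
# Sketch: aggregate ensemble votes for a single example.
def aggregate_bagging(rule_sets, example):
    """Bagging: every rule set casts one equal vote; majority wins."""
    votes = sum(rule_set(example) for rule_set in rule_sets)
    return votes > len(rule_sets) / 2

def aggregate_boosting(weighted_rules, example):
    """Boosting: each rule votes with its own weight; the sign of the total decides."""
    score = sum(w if rule(example) else -w for rule, w in weighted_rules)
    return score > 0

# Toy usage with dummy rules over integer "examples".
rule_sets = [lambda x: x > 3, lambda x: x % 2 == 0, lambda x: x > 10]
weighted = [(lambda x: x > 3, 0.7), (lambda x: x % 2 == 0, 0.2), (lambda x: x > 10, 0.4)]
print(aggregate_bagging(rule_sets, 8))   # True: 2 of 3 rule sets vote positive
print(aggregate_boosting(weighted, 8))   # True: 0.7 + 0.2 - 0.4 > 0
```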

  12. Drawing a PR Curve (Plot: precision versus recall)
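For reference, a precision-recall curve can be traced by sweeping a decision threshold down a ranked list of extractions; the sketch below uses made-up confidence scores and labels to show the computation, not results from the slides.

```python
# Sketch: compute precision/recall points by sweeping a threshold over ranked predictions.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40]   # hypothetical rule confidences
labels = [1,    1,    0,    1,    0,    0,    1]       # 1 = a true extraction
total_pos = sum(labels)

tp = fp = 0
for score, label in sorted(zip(scores, labels), reverse=True):
    tp += label
    fp += 1 - label
    precision = tp / (tp + fp)
    recall = tp / total_pos
    print(f"threshold={score:.2f}  recall={recall:.2f}  precision={precision:.2f}")
```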

  13. Testset Results (Plot: test-set results comparing bagging, boosting, and rule quality with the Craven group's system)

  14. Handling Large Skewed Data • 5-fold cross-validation • train: 1007 positive / 240,874 negative • test: 284 positive / 243,862 negative • With a 95% accurate rule set … • ~270 true positives • ~12,193 false positives! • recall = 270 / 284 = 95.0% • precision = 270 / 12,463 ≈ 2.2%
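The arithmetic behind those numbers, as a quick check using the slide's 95% accuracy figure for both classes:

```python
# Sketch: why a 95%-accurate rule set still gives ~2% precision on heavily skewed test data.
test_pos, test_neg = 284, 243_862
accuracy = 0.95

true_pos = accuracy * test_pos          # ~270 real relations kept
false_pos = (1 - accuracy) * test_neg   # ~12,193 negatives wrongly extracted

recall = true_pos / test_pos
precision = true_pos / (true_pos + false_pos)
print(f"recall={recall:.1%}  precision={precision:.1%}")   # ~95.0% and ~2.2%
```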

  15. Handling Large Skewed Data • Ways to handle data • assign different costs to each class • much more important to not cover negatives • under-sampling with bagging • negatives under-represented • key is to pick good negatives • filter data to restore equal ratio in testing data • use naïve Bayes to learn relational parts
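One of those strategies, under-sampling combined with bagging, could look roughly like the sketch below; the sampling ratio, bag count, and example representation are placeholders, not the settings used in this work.

```python
# Sketch: build bagged training sets that under-sample the negative class.
import random

def make_bags(positives, negatives, n_bags=10, neg_per_pos=2, seed=0):
    """Each bag keeps every positive and a fresh random subset of the negatives."""
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        n_neg = min(len(negatives), neg_per_pos * len(positives))
        bags.append(positives + rng.sample(negatives, n_neg))
    return bags

# Toy usage: 3 positive and 100 negative example IDs.
bags = make_bags([f"pos{i}" for i in range(3)], [f"neg{i}" for i in range(100)])
print(len(bags), "bags of size", len(bags[0]))   # 10 bags of size 9
```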

  16. Filters to Reduce Negatives (Flowchart: candidate examples are run through a noun-phrase filter, split into their gene and disease parts, filtered with naïve Bayes on each part, and joined back together; positive-to-negative ratios along the pipeline: 1 : 1,979, 1 : 485, and 1 : 39)
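As a rough illustration of the naïve Bayes filtering step (a toy sketch of ours; the real features, training data, and thresholds are not shown on the slide), candidate phrases can be scored by word evidence and only the higher-scoring ones passed on to the ILP learner.

```python
# Sketch: a tiny naive Bayes filter over bag-of-words features for candidate phrases.
import math
from collections import Counter

def train_nb(pos_docs, neg_docs):
    """Per-word log-likelihood ratios plus the class log-prior ratio (add-one smoothing)."""
    pos_counts, neg_counts = Counter(), Counter()
    for doc in pos_docs:
        pos_counts.update(doc)
    for doc in neg_docs:
        neg_counts.update(doc)
    vocab = set(pos_counts) | set(neg_counts)
    pos_total, neg_total = sum(pos_counts.values()), sum(neg_counts.values())
    llr = {w: math.log((pos_counts[w] + 1) / (pos_total + len(vocab)))
              - math.log((neg_counts[w] + 1) / (neg_total + len(vocab)))
           for w in vocab}
    prior = math.log(len(pos_docs) / len(neg_docs))
    return llr, prior

def nb_filter(candidates, llr, prior, threshold=0.0):
    """Keep candidates whose naive Bayes log-odds exceed the threshold."""
    return [doc for doc in candidates
            if prior + sum(llr.get(w, 0.0) for w in doc) > threshold]

# Toy usage with hypothetical phrase tokens.
pos = [["kinase", "phosphorylates"], ["receptor", "binds"]]
neg = [["the", "patient"], ["results", "show"], ["we", "report"]]
llr, prior = train_nb(pos, neg)
print(nb_filter([["kinase", "binds"], ["we", "show"]], llr, prior))   # keeps only the first
```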

  17. Probabilistic Rules • Logical rules are too strict and often overfit • Add probabilistic weight to each rule • based on accuracy on tuning set • Learn parameters • make each rule a binary feature • use any standard Machine Learning algorithm (Naïve Bayes, perceptron, logistic regression…) to learn the weights • assign probability to examples based on weights
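One way to realize this, sketched under our own assumptions (the slide leaves the learner open; here each rule is a binary feature and the weights are fit by simple logistic-regression gradient ascent on a tuning set):

```python
# Sketch: treat each learned rule as a binary feature and learn per-rule weights.
import math

def features(example, rules):
    """Bias term plus one binary feature per rule: 1 if the rule fires, else 0."""
    return [1.0] + [1.0 if rule(example) else 0.0 for rule in rules]

def train_weights(tune_x, tune_y, rules, lr=0.1, epochs=200):
    """Stochastic gradient ascent on the log-likelihood of the tuning set."""
    w = [0.0] * (len(rules) + 1)
    for _ in range(epochs):
        for x, y in zip(tune_x, tune_y):
            f = features(x, rules)
            p = 1.0 / (1.0 + math.exp(-sum(wi * fi for wi, fi in zip(w, f))))
            w = [wi + lr * (y - p) * fi for wi, fi in zip(w, f)]
    return w

def probability(example, rules, w):
    f = features(example, rules)
    return 1.0 / (1.0 + math.exp(-sum(wi * fi for wi, fi in zip(w, f))))

# Toy usage: two hypothetical rules over integer "examples" and a tiny tuning set.
rules = [lambda x: x > 5, lambda x: x % 2 == 0]
tune_x, tune_y = [2, 4, 6, 8, 7, 1], [0, 0, 1, 1, 1, 0]
w = train_weights(tune_x, tune_y, rules)
print(round(probability(9, rules, w), 2))   # high: the first rule fires on 9
```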

  18. Weighted Exponential Model P(pos | x) = exp(Σᵢ wᵢ fᵢ(x)) / (1 + exp(Σᵢ wᵢ fᵢ(x))), where wᵢ is a weight for each feature fᵢ. Taking logs we get log P(pos | x) = Σᵢ wᵢ fᵢ(x) − log(1 + exp(Σᵢ wᵢ fᵢ(x))). We need to set the weights wᵢ to maximize the log probability of the tuning set.

  19. Weighted Exponential Model

  20. Incorporating Background Knowledge • Creation of predicates that capture salient features • endsIn(word, ‘ase’) • occursInAbstractNtimes(word, 5) • Incorporation of prior knowledge into the learning system • protein(word) if endsIn(word, ‘ase’) and occursInAbstractNtimes(word, 5).
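A sketch of how predicates like those might be generated from an abstract before learning; the suffix list and the exact predicate names simply mirror the slide's endsIn / occursInAbstractNtimes and are illustrative only.

```python
# Sketch: emit background-knowledge facts (suffix and frequency features) for one abstract.
from collections import Counter

def word_facts(abstract_words, suffixes=("ase", "in", "gen")):
    """Yield endsIn/2 and occursInAbstractNtimes/2 facts for each distinct word."""
    counts = Counter(abstract_words)
    facts = []
    for word, n in counts.items():
        for suffix in suffixes:                 # suffix list is illustrative
            if word.endswith(suffix):
                facts.append(f"endsIn({word}, '{suffix}')")
        facts.append(f"occursInAbstractNtimes({word}, {n})")
    return facts

# Toy usage on a fragment of a hypothetical abstract.
words = "the kinase phosphorylates the kinase substrate".split()
print("\n".join(word_facts(words)))
```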

  21. Searching in Large Spaces • Probabilistic bottom clause • probabilistically remove least significant predicates from the “bottom clause” • Random rule generation • in place of hill-climbing, randomly select rules of a given length from the bottom clause • retain only those rules which do well on a tune set • Learn coverage of clauses • neural network, Bayesian learning, etc.
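The random rule generation idea, sketched as a toy under our own assumptions (the bottom clause is just a list of literal strings, and the scoring function is a stand-in for evaluation on a tuning set):

```python
# Sketch: randomly draw fixed-length rule bodies from a bottom clause and keep those
# that score well on a tuning set, in place of greedy hill-climbing search.
import random

def random_rules(bottom_clause, rule_length, n_candidates, score, min_score, seed=0):
    rng = random.Random(seed)
    kept = set()
    for _ in range(n_candidates):
        rule = tuple(sorted(rng.sample(bottom_clause, rule_length)))
        s = score(rule)                      # e.g., precision on the tuning set
        if s >= min_score:
            kept.add((s, rule))
    return sorted(kept, reverse=True)

# Toy usage: literals loosely modeled on the sample learned rule, with a dummy scorer.
bottom = ["isa_np_segment(A)", "prev(A,B)", "pp_segment(B)", "child(A,C)",
          "novelword(C)", "next(C,D)", "alphabetic(D)", "alphanumeric(F)"]
dummy_score = lambda rule: sum("np" in lit or "novel" in lit for lit in rule) / len(rule)
for s, rule in random_rules(bottom, rule_length=3, n_candidates=20,
                            score=dummy_score, min_score=0.5):
    print(round(s, 2), rule)
```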

  22. References
  • Nelson, Stuart J.; Powell, Tammy; Humphreys, Betsy L. The Unified Medical Language System (UMLS) Project. In: Encyclopedia of Library and Information Science. Forthcoming.
  • Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 1999. MIT Press.
  • Ellen Riloff. The Sundance Sentence Analyzer. 2002.
  • Inês de Castro Dutra et al. An Empirical Evaluation of Bagging in Inductive Logic Programming. 2002. In Proceedings of the International Conference on Inductive Logic Programming, Sydney, Australia.
  • Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. 2000. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).
  • Soumya Ray and Mark Craven. Representing Sentence Structure in Hidden Markov Models for Information Extraction. 2001. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).
  • Tina Eliassi-Rad and Jude Shavlik. A Theory-Refinement Approach to Information Extraction. 2001. In Proceedings of the 18th International Conference on Machine Learning.
  • M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. 1999. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77-86. Germany.
  • Leo Breiman. Bagging Predictors. 1996. Machine Learning, 24(2):123-140.
  • Yoav Freund and Robert E. Schapire. Experiments with a New Boosting Algorithm. 1996. In International Conference on Machine Learning, pages 148-156.
