Tools for Ontology-based Corpus Annotation

Tomoko OHTA, Yuka TATEISI, and Jun’ichi TSUJII
Department of Information Science, Graduate School of Science, University of Tokyo

Abstract

Introduction: Automatic information extraction is a key technology for helping researchers access the information contained in research papers and for extending databases on substances and biological processes. We aim to build information extraction systems [1,2] for biology papers and their abstracts available from the MEDLINE [3] database. As part of a project on information extraction from research papers in the biology domain, we are creating an expert-tagged corpus of MEDLINE abstracts, which will be used for training and testing the information extraction systems [4,5]. The markup scheme is based on a conceptual domain model (ontology) and implemented in XML [6].

Tools: The task of annotation can be regarded as identifying and classifying the terms that appear in the texts according to a pre-defined classification. For the classification to be reliable, it must be well defined and easy for the domain experts who annotate the texts to understand. To fulfill this requirement, we think that the tag set should be based on a concrete data model (ontology) of the biology domain, which serves as a standardized representation of the background knowledge of domain experts. Although XML-tagged text can be created with a text editor, semantically annotated corpora must be created by domain experts, who are not always familiar with XML tagging schemes. An easy-to-use tagging tool is therefore indispensable for efficiency and accuracy. We developed two GUI-based tools in Java: a tag definition tool, TagEdit, and a tagging tool, JTag. TagEdit supports defining a new tag set, refining tag definitions, enhancing the tag set by adding or removing tags, and enriching tags by adding or removing attributes. JTag has two frames: a tag selection frame and an annotation frame. In the tag selection frame, the ontology-based tag set defined with TagEdit appears as a concept hierarchy. Tag data, including the class of each tag, its position, and the values of its attributes, is saved as an XML document, and the annotated text can be saved in tag-embedded form.

Introduction
• Information need
  • Information retrieval (IR) & filtering (IF)
  • Information extraction (IE)
  • Document / term classification & categorization
  • Summarization, …
• Overview of GENIA system
  • Background knowledge design
    • Ontology, Data model, Markup language, …
  • Resource building
    • Corpus annotation (aid tool), Database construction, …
  • Core module
    • Information extraction, Information retrieval, …
  • Web-based integrated interface

Overview of GENIA System
[Figure: architecture diagram of the GENIA system; recoverable labels follow]
Retrieval Module
• Request enhancement
• Spawn request
• Classify documents
Information Extraction Module
• Identify & classify terms
• Identify events
Interface Module
• GUI
• HTML conversion
• System integration
Corpus Module
• Markup generation / compilation
• Annotated corpus construction
Database Module
• DB design / access / management
• DB construction
Concept Module
• BK design / construction / compilation
Other diagram labels: Corpus; Database; Ontology; Markup language; Data model; Raw (OCR) Text; Structure; Annotated; Background Knowledge; Document; Named-Entity; Event; MEDLINE; User; IR Request; Abstract; Full Paper; Security

Overview of GENIA Ontology
• We aim to construct an ontology that models bio-molecular reactions in humans.
• The ontology will be used by systems that extract biological event information from online research papers and documents.
• The ontology consists of multiple taxonomies, relations between their nodes, and the corresponding linguistic representations.
• We are implementing the ontology in LiLFeS, a Prolog-like language for manipulating typed feature structures (Makino et al., Proc. COLING-ACL '98, 807-811, 1998), in which various natural language processing programs are implemented.
• By using LiLFeS, we aim to incorporate the ontology seamlessly into natural language processing systems.

Name Ontology
[Figure: taxonomies of the name ontology and the terms attached to them]
Taxonomies
• SUBSTANCE1, SUBSTANCE2, SUBSTANCE3, SUBSTANCE4: each with attribute1, attribute2, …
• ROLE1, ROLE2, ROLE3, ROLE4: each with attribute3, attribute4, …
Terms
• AMINO ACID
• DNA
• ORGANIC COMPOUND
• PROTEIN
• AGENT
• ENZYME
• PHOSPHATASE
• TRANSCRIPTION FACTOR

Event Ontology
[Figure: taxonomy of the event ontology and the event terms attached to it]
Taxonomy
• REACTION1, REACTION2, REACTION3, REACTION4, REACTION5: each with attribute1, attribute2, …
Terms
• substance ACTIVATE substance
• substance ACTIVATE protein
• protein ACTIVATE pathway
• PHOSPHORYLATE
• INHIBIT
• REGULATE
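The typed event terms above (for example "substance ACTIVATE protein") suggest that event patterns are checked against classes of the name ontology. The following is a minimal sketch of such a check, not the GENIA implementation: the is-a links in `PARENT` and the function names are assumptions made for illustration, since the slides list the terms but not their exact hierarchy.

```python
# Illustrative is-a links (assumed); the slides do not give the exact hierarchy.
PARENT = {
    "protein": "substance",
    "dna": "substance",
    "substance": None,
    "pathway": None,
}

# Event patterns taken from the slide: (agent class, event name, target class).
PATTERNS = [
    ("substance", "ACTIVATE", "substance"),
    ("substance", "ACTIVATE", "protein"),
    ("protein", "ACTIVATE", "pathway"),
]


def ancestors(cls):
    """Return cls followed by its ancestors, most specific first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT.get(cls)
    return chain


def matching_patterns(agent, event, target):
    """Return every pattern whose argument classes subsume the given ones."""
    agent_chain = ancestors(agent)
    target_chain = ancestors(target)
    return [
        p for p in PATTERNS
        if p[1] == event and p[0] in agent_chain and p[2] in target_chain
    ]
```

Calling `matching_patterns("protein", "ACTIVATE", "pathway")` returns only the most specific pattern, while a protein activating a protein matches both of the substance-level patterns.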

Seamless Incorporation into Natural Language Processing System
[Figure: layered diagram; recoverable labels follow]
• Practical NLP applications
• Event Extraction from Biology Research Papers
• Knowledge Acquisition Module
• Grammar
• Semantics
• Sequential HPSG Parser
• Parallel HPSG Parser
• Parallel Programming Environment
• Domain Ontology
• Programming Language

Ontology and Texts
[Figure: ontology levels and text types arranged along a granularity axis]
Ontology
• Top level ontology
• Middle level ontology, e.g. Gene Ontology
• Bottom level ontology, e.g. Database model of Pathway Databases
• Concept, e.g. Fact Databases
Text
• Textbook
• Review article etc.
• Research article etc.
• Case report form etc.

Corpus Annotation
• Purpose
  • Provide a semantically annotated corpus
  • Mark up the instances of the GENIA ontology
  • Learning and testing data for information extraction programs
• Outline
  • Definition of GPML (GENIA Project Markup Language): new mechanisms for handling overlaps and complicated attribute structures
  • Target objects: named entities
    • Substance: protein, DNA, RNA, …
    • Source (location): organism, cell, tissue, …
  • Target texts: 1,000 MEDLINE abstracts

GPML (GENIA Project Markup Language)
• Text structure & information markup
• Named-entity markup
• Coreference markup
• Event markup

Text structure & info. markup
• Document structure
  • [document]
    • a document header, author names, a publication date, a title, an abstract, keywords, a body
  • [body]
    • sections with a title, captions with a title
  • [abstract/section/caption]
    • lines
• Document information
  • [document header]
    • a unique document id
    • a source, a language, a domain
    • document categories or classes
    • …

Named-entity markup
• Attributes of NE element
  • A unique ID
    • For referring to the tag element
  • A name
    • As close to the canonical form as possible
  • Zero or more classes
    • To determine the class of this named entity
  • An equivalence link
    • To a synonym or an abbreviation / full form
  • Extra information
    • An annotator name
    • The time of annotation
    • Assurance
    • …

Coreference markup
• Attributes of coreference (REFEXP) element
  • A unique ID
  • One or more links to referred objects (for both conjunction and disjunction)
    • The referred objects can be named entities and events
  • Extra information
    • An annotator name, the time of annotation, assurance, …
• Auxiliary element (REFAUX)
  • To handle complicated coordination of referred objects
  • Underlying principle: disjunction of conjunctions
    ( … and … ) or … or ( … and … )
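The "disjunction of conjunctions" principle can be pictured as a list of alternatives, each of which is a list of referent IDs that hold together. The sketch below is only an illustration under that assumption; the function names and the ID strings are hypothetical and do not come from the GPML definition.

```python
def describe(alternatives):
    """Render a disjunction of conjunctions as text.

    `alternatives` is a list of lists of referent IDs: the inner lists are
    joined with "and", the outer list with "or".
    """
    parts = []
    for conjunction in alternatives:
        if not conjunction:
            raise ValueError("each alternative needs at least one referent")
        text = " and ".join(conjunction)
        # Parenthesize only real conjunctions, as in ( ... and ... ) or ...
        parts.append(f"({text})" if len(conjunction) > 1 else text)
    return " or ".join(parts)


def referents(alternatives):
    """Return all distinct referent IDs in first-seen order."""
    seen = []
    for conjunction in alternatives:
        for ref in conjunction:
            if ref not in seen:
                seen.append(ref)
    return seen
```

For example, `describe([["ne1", "ne2"], ["ne3"]])` yields `(ne1 and ne2) or ne3`, which mirrors the shape shown on the slide.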

Event markup
• Attributes of event element
  • A unique ID
  • A class and a type
    • To determine the form of this event
  • Zero or more links to from-molecules, to-molecules, from-tissues, to-tissues, components, and enzymes
    • To describe this event
  • Zero or more effect names
    • To determine the effect of this event
  • Affirmative mode & definiteness
    • To recognize positive & negative sentences and quantity & quality
  • Extra information
    • An annotator name, the time of annotation, assurance, …
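To make the attribute list above concrete, here is a rough sketch of an event record serialized as an XML element. The element and attribute names (`EVENT`, `LINK`, `EFFECT`, `role`, `ref`) are invented for illustration; the slides do not show the actual GPML event syntax, and only a subset of the listed link kinds is modeled.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Event:
    """Hypothetical in-memory form of an event annotation."""
    id: str
    cls: str
    type: str
    from_molecules: list = field(default_factory=list)
    to_molecules: list = field(default_factory=list)
    enzymes: list = field(default_factory=list)
    effects: list = field(default_factory=list)
    affirmative: bool = True

    def to_xml(self):
        """Serialize to an XML string (element names are assumptions)."""
        el = ET.Element("EVENT", {
            "id": self.id,
            "class": self.cls,
            "type": self.type,
            "affirmative": "yes" if self.affirmative else "no",
        })
        # Links point at named-entity IDs; zero or more per role.
        for role, refs in (("from", self.from_molecules),
                           ("to", self.to_molecules),
                           ("enzyme", self.enzymes)):
            for ref in refs:
                ET.SubElement(el, "LINK", {"role": role, "ref": ref})
        for name in self.effects:
            ET.SubElement(el, "EFFECT", {"name": name})
        return ET.tostring(el, encoding="unicode")
```

Keeping links as ID references rather than nested text is one way to let several events share the same named entity.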

Example of NE Annotation
UI - 85146267
TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.
AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class" cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU-26988 </NE ti="7">(<NE ti="17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti="17">), with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">. The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.
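Because the closing tags in this sample repeat the `ti` attribute (`</NE ti="3">`), a standard XML parser would reject the text as it stands. The following is a minimal regex-based sketch for reading such annotations; it assumes non-nested NE elements and attribute values without `>` or `"`, which holds for the sample above but is not guaranteed by GPML. The function names are hypothetical.

```python
import re

# Opening tag with attributes, surface text, and a closing tag that repeats ti.
NE_RE = re.compile(r'<NE\s+([^>]*?)>(.*?)</NE\s+ti="(\d+)">', re.S)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def extract_entities(text):
    """Return one dict per non-nested NE element: its attributes plus 'text'."""
    entities = []
    for match in NE_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        if attrs.get("ti") != match.group(3):
            # Opening and closing ids disagree; skip rather than guess.
            continue
        attrs["text"] = match.group(2)
        entities.append(attrs)
    return entities


def strip_tags(text):
    """Remove NE markup, leaving the raw abstract text."""
    return re.sub(r'</?NE\b[^>]*>', "", text)
```

Grouping the result by the `class` attribute gives a quick count of proteins, cell types, and other compounds per abstract.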

TagEdit: Tag Definition Tool
• Implemented in Java
• Functions:
  • Definition of new tag sets
  • Refinement of tag definitions
  • Enhancement of tag sets
• Features:
  • Tag sets conforming to XML or GPML can be defined and modified
  • Tag definitions are saved as a file
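As a rough picture of what a tag definition tool has to manage, the sketch below models a tag set as a tree of tags with attribute names and serializes it to XML. It is an illustration only: TagEdit itself is written in Java, and the class name `TagDef`, its methods, and the output element names are assumptions rather than TagEdit's actual file format.

```python
import xml.etree.ElementTree as ET


class TagDef:
    """One tag in a tag set: a name, attribute names, and child tags."""

    def __init__(self, name, attributes=()):
        self.name = name
        self.attributes = list(attributes)
        self.children = []

    def create_child(self, name, attributes=()):
        """Add a more specific tag under this one and return it."""
        child = TagDef(name, attributes)
        self.children.append(child)
        return child

    def remove_child(self, name):
        """Drop the direct child with the given name, if present."""
        self.children = [c for c in self.children if c.name != name]

    def paths(self, prefix=""):
        """Yield the hierarchy as slash-separated paths, depth first."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path
        for child in self.children:
            yield from child.paths(path)

    def to_element(self):
        """Build a nested XML element for saving the definition to a file."""
        el = ET.Element("tag", {"name": self.name})
        for attr in self.attributes:
            ET.SubElement(el, "attribute", {"name": attr})
        for child in self.children:
            el.append(child.to_element())
        return el
```

A definition built this way can be written out with `ET.ElementTree(root.to_element()).write(path)` and displayed as a concept hierarchy via `paths()`.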

Definition of new tag
• Select a class and click the right mouse button
• Click “Create Child Tag” to create a new class

Refinement of tag
• Select a class and click the right mouse button to refine the tag

JTag: Tagging Tool
• Implemented in Java
• Functions:
  • Insertion, deletion, and editing of tags
• Features:
  • Tag data is saved as an XML document
  • Annotated text can be saved as a GPML document (tag-embedded form)
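The two output forms, tag data as a separate XML document and tag-embedded text, can be sketched as follows. This is an illustration under stated assumptions, not JTag's code or file format: the `tags`/`tag` element names and the `start`/`end` attributes are hypothetical, and overlapping spans are rejected even though GPML has mechanisms for them.

```python
import xml.etree.ElementTree as ET


def to_standoff(annotations):
    """Serialize (start, end, cls) spans as a separate XML document."""
    root = ET.Element("tags")
    for i, (start, end, cls) in enumerate(annotations, 1):
        ET.SubElement(root, "tag", {
            "id": str(i), "class": cls,
            "start": str(start), "end": str(end),
        })
    return ET.tostring(root, encoding="unicode")


def embed(text, annotations):
    """Insert GPML-style NE tags into text; spans must not overlap."""
    out, pos = [], 0
    # Number the spans first so ids agree with to_standoff, then sort by start.
    numbered = sorted(enumerate(annotations, 1), key=lambda item: item[1][0])
    for i, (start, end, cls) in numbered:
        if start < pos:
            raise ValueError("overlapping spans are not handled in this sketch")
        out.append(text[pos:start])
        out.append(f'<NE ti="{i}" class="{cls}">{text[start:end]}</NE ti="{i}">')
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

Keeping positions in the standoff form leaves the source text untouched, while the embedded form is easier to read and matches the NE annotation example.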

Screen Capture of JTag

Insertion of Tag
• Click “Insert” to insert a new tag

Editing and Deletion of Tag
• Click “Edit” to edit the tag
• Click “Delete” to delete the tag

Saving data
• Click “save” or “save as” to save the tag data as an XML document
• Click “Export” to save a GPML document

Searching for terms
• Click “Highlight Special Words” to search for a term
• Click “Position Jump” to jump to any position
