A Research Perspective on Text Mining: Tasks, Technologies and Prototype Applications

A Research Perspective on Text Mining: Tasks, Technologies and Prototype Applications Robert Gaizauskas Natural Language Processing Group Departments of Computer Science, University of Sheffield

Outline of Talk • Text Mining: Scenario, Definitions and Brief History • Text Mining Tasks + Methodologies • Text Mining Technologies • Text Mining Prototype Applications • Conclusions and Future Directions/Challenges Euromap Text Mining Seminar

Text Mining: Scenario Euromap Text Mining Seminar

Text Mining Scenario Components: Texts • Genres • Newspapers • Company reports • Web pages • Scientific papers • Legal documents • E-Formats • Word Documents (.doc, .rtf) • PDF/Postscript • HTML/SGML/XML • Languages • English … French … Greek … Russian … Chinese … Hindi … Sanskrit … Linear B • Character encodings: ASCII, ISO 8859, Unicode Euromap Text Mining Seminar

Text Mining Scenario Components: Users • User domain of interest • Business – competitor intelligence, corporate intranet/memory • Scientists – access to literature • Military/police intelligence – open source intelligence, intranet • Journalists – news archives • User level of expertise • Novice/expert • User linguistic competence • Adult/child • Native/non-native language speaker • Uni/multi-lingual Euromap Text Mining Seminar

Text Mining Scenario Components: Information Access Needs • Ad hoc searching • Specific questions: “What year did the Berlin Wall come down?” • General background/context: “Tell me about Zakopane” • Stable intelligence gathering • Scenario-related: “Build a database recording new projects in the energy sector: the players, location, energy type, start date, capitilisation” • Entity-related: “Build a database of key scientists in the pharma industry: name, employer, position, start and end dates” • Current awareness • Alerting: “Let me know when any papers are published on the crystallographic structure of any lipase” • Document selection: “Assemble articles on drug approvals” • Summarisation • Single/multi-document: “Summarise the Bulger trial” Euromap Text Mining Seminar

Text Mining Scenario Components: Tools • Information retrieval Euromap Text Mining Seminar

What is Information Extraction? • The Information Extraction (IE) task: from each text in a set ofnatural language texts extract information about predefined classesof entities and relationships and place this information into a template or database record. E.g. from financial newswire stories identify those dealing withmanagement succession events and from these extract details oforganisations and persons, the post being assumed or vacated, thereason for vacancy, etc. • IE may also be described as the activity of populating astructured information repository (database) from an unstructured,or free text, information source. Euromap Text Mining Seminar

What is Information Extraction? (cont) The resulting structured database is then used for some other purpose: • searching or analysis using conventional database queries; • data-mining; • generating a summary (perhaps in another language); • constructing indices into/within/between the source texts. Euromap Text Mining Seminar

Example: AWall Street Journal Article <DOC> <DOCID> wsj94_008.0212 </DOCID> <DOCNO> 940413-0062. </DOCNO> <HL> Who's News: @ Burns Fry Ltd. </HL> <DD> 04/13/94 </DD> <SO> WALL STREET JOURNAL (J), PAGE B10 </SO> <CO> MER </CO> <IN> SECURITIES (SCR) </IN> <TXT> <p> BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, wasnamed executive vice president and director of fixed income at thisbrokerage firm. Mr. Wright resigned as president of Merrill LynchCanada Inc., a unit of Merrill Lynch & Co., to succeed MarkKassirer, 48, who left Burns Fry last month. A Merrill Lynchspokeswoman said it hasn't named a successor to Mr. Wright, who isexpected to begin his new position by the end of the month. </p> </TXT> </DOC> Euromap Text Mining Seminar

Example: A Management Succession Event Template <TEMPLATE> := DOC_NR: "NUMBER" ^ CONTENT: <SUCCESSION_EVENT> * <SUCCESSION_EVENT> := ORGANIZATION: <ORGANIZATION> ^ POST: "POSITION TITLE" | "no title" ^ IN_AND_OUT: <IN_AND_OUT> + VACANCY_REASON: {DEPART_WORKFORCE, REASSIGNMENT, NEW_POST_CREATED, OTH_UNK} ^ <IN_AND_OUT> := PERSON: <PERSON> ^ NEW_STATUS: {IN, IN_ACTING, OUT, OUT_ACTING} ^ ON_THE_JOB: {YES, NO, UNCLEAR} OTHER_ORG: <ORGANIZATION> - REL_OTHER_ORG: {SAME_ORG, RELATED_ORG, OUTSIDE_ORG} - <ORGANIZATION> := ORG_NAME: "NAME" - ORG_ALIAS: "ALIAS" * ORG_DESCRIPTOR: "DESCRIPTOR" - ORG_TYPE: {GOVERNMENT, COMPANY, OTHER} ^ ORG_LOCALE: LOCALE_STRING {{CITY, PROVINCE, COUNTRY, REGION, UNK} * ORG_COUNTRY: NORMALIZED-COUNTRY-or-REGION | COUNTRY-or-REGION-STRING * <PERSON> := PER_NAME: "NAME" - PER_ALIAS: "ALIAS" * PER_TITLE: "TITLE" * Euromap Text Mining Seminar

Example: A (Partially) FilledManagement Succession Event Template <TEMPLATE-9404130062> := DOC_NR: "9404130062" CONTENT: <SUCCESSION_EVENT-1> <SUCCESSION_EVENT-1> := SUCCESSION_ORG: <ORGANIZATION-1> POST: "executive vice president" IN_AND_OUT:<IN_AND_OUT-1> <IN_AND_OUT-2> VACANCY_REASON: OTH_UNK <IN_AND_OUT-1> :=<IN_AND_OUT-2> := IO_PERSON: <PERSON-1>IO_PERSON: <PERSON-2> NEW_STATUS: OUTNEW_STATUS: IN ON_THE_JOB: NOON_THE_JOB: NO OTHER_ORG: <ORGANIZATION-2> REL_OTHER_ORG: OUTSIDE_ORG <ORGANIZATION-1> :=<ORGANIZATION-2> := ORG_NAME: "Burns Fry Ltd.“ORG_NAME: "Merrill Lynch Canada Inc." ORG_ALIAS: "Burns Fry“ORG_ALIAS: "Merrill Lynch" ORG_DESCRIPTOR: "this brokerage firm“ORG_DESCRIPTOR: "a unit of Merrill Lynch & Co." ORG_TYPE: COMPANYORG_TYPE: COMPANY ORG_LOCALE: Toronto CITY ORG_COUNTRY: Canada <PERSON-1> := <PERSON-2> := PER_NAME: "Mark Kassirer" PER_NAME: "Donald Wright" PER_ALIAS: "Wright" PER_TITLE: "Mr." Euromap Text Mining Seminar

Example:Uses for Templates • From the completely filled version of the preceding template a natural languagesummary can be generated: BURNS FRY Ltd. named Donald Wright as executive vice president. Donald Wright resigned as presidentof Merrill Lynch Canada Inc.. Mark Kassirer left as president ofBURNS FRY Ltd. • Or, a table can be constructed:. Euromap Text Mining Seminar

Key Features of Information Extraction • Texts are unrestricted NL, but typically short • Template is predefined and fixed • Information extracted is `literal' or `factual‘ • The precise definition of the task permits quantitativeevaluation of IE systems' performance against human generated results Euromap Text Mining Seminar

What IE is NOT: Information Retrieval • The Information Retrieval (IR) task: given a user query and a documentcollection retrievethat subset ofdocuments from the collection which are relevant to the user'squery. • E.g. given the query exonuclease gamma-delta resolvase return those abstracts in PubMed pertaining to these proteins • Once the IR system returns the documents, the user browses theselected documents in order to fulfil his or her information need. • Depending on the IR system, the user may be further assisted by • relevance ranking of retrieved documents • highlighting of search terms in the text to facilitate identifying passages ofparticular interest Euromap Text Mining Seminar

Strengths and Weaknesses of IR Strengths: • Can search huge document collections very rapidly • Insensitive to genre and domain of the texts • Can rank documents with respect to likely relevance • Searches can be iteratively refined Weaknesses: • Documents are returned not information/answers, so user must further read texts to extract information • Frequently not discriminating enough (“1563 documents match your request”) Euromap Text Mining Seminar

Strengths and Weaknesses of IE Strengths: • Extracts facts from texts, not just texts from text collections • Can feed other powerful applications (databases, indexing engines) Weaknesses: • Porting to new genres and domains is time-consuming and requires expert • Limited accuracy • Not fast enough to run over large text collections while user waits Euromap Text Mining Seminar

A Brief History of IE • The first published work on information extraction (though it was not called this at the time) was in late 1960s • A significant precursor was the psychologist Roger Schank’s work on scripts and story understanding in the 1970’s • The 1980’s saw the emergence of some commercial systems targetted at financial transactions and newswires • The big impetus to current research started in the late 1980’s when DARPA initiated a series of competitive evaluations of “Message Understanding” systems (Message Understanding Conferences – MUC) MUC ran for 10 years (1987-98) and significantly advanced the field • Currently there are a number of IE systems on the market and a large and on-going research effort in the field Euromap Text Mining Seminar

Outline of Talk • Text Mining: A Definition and Brief History • Text Mining Tasks + Methodologies • Entity Extraction • Attribute Extraction • Relation Extraction • Event Extraction • Text Mining Technologies • Text Mining Prototype Applications • Conclusions and Future Directions/Challenges Euromap Text Mining Seminar

IE Component Tasks • To fill templates IE researchers have discovered that systems must be able toperform a variety of simpler tasks • Studying and evaluating these component tasks in isolation has proved a useful way forward for IE • Component IE tasks which were specified as part of MUC: • Named Entity Recognition (persons, organisations,locations,dates) • Coreference (multiple references to same entity) • Template Elements (organisations, persons, artifacts, locations) • Template Relations (employee_of, product_of, location_of) • Scenario Template (management succession) Euromap Text Mining Seminar

MUC Scoring and Scoring Metrics • Correctanswers, called keys, are produced manually forall the MUC tasks. • Scoring of system results, called responses, against keys is done automatically. • At least some portion of the answer keys are multiply producedby different humans so that interannotator agreement figurescan be computed. Interannotator agreement figures of 95% are sought.Figures of less than 80% are interpreted as meaning the task isnot sufficiently clearly defined. • Principal metrics are: • Precision(how much of what your system returns is correct) • Recall(how much of what is correct your system returns) • F-measure (a weighted combination of precision and recall) Euromap Text Mining Seminar

State-of-the-artEvaluation Results (MUC-7) Euromap Text Mining Seminar

Outline of Talk • Text Mining: A Definition and Brief History • Text Mining Tasks + Methodologies • Entity Extraction • Attribute Extraction • Relation Extraction • Event Extraction • Text Mining Technologies • Text Mining Prototype Applications • Conclusions and Future Directions/Challenges Euromap Text Mining Seminar

Applying IE to Biological Science Journal Papers • IE is an appropriate technology when: • large volumes of text make human analysis infeasible • template-oriented information seeking is appropriate (stableinformation need, narrow domain) • conventional IR is inadequate • some error is tolerable • To date most IE applications are newswire-oriented, withthe bulk being in the financial/competitor intelligence area • Bioinformatics applications provide an interesting challengeto IE • different text types-- journal papers (SGML/PDF),abstracts (BIDS, MEDLINE) • different genre-- scientific writing • different domain -- biochemistry/molecular biology Euromap Text Mining Seminar

EMPathIE: Enzyme and Metabolic Pathways Information Extraction • Aim: Use IE techniques to create a database ofenzyme and metabolic pathway data from academic journal papers tosupport drug discovery • Partners: Depts of Computer Science and Information Studies, U. of Sheffield; Glaxo-Wellcome Research; Elsevier Science • Sponsors: Glaxo-Wellcome Research; Elsevier Science • PostDoc: Dr. Kevin Humphreys • Status: Complete. Project ran 11/97 -- 11/99 Euromap Text Mining Seminar

EMPathIE: Scenario • metabolic processes involve biochemical reactions in which enzymes play key catalytic roles • each reaction involves an enzyme, some number of inputs andresults in some number of products • sequences of such reactions form metabolic pathways • identifying pathways can suggest potential sites for theapplication of drugs to affect a particular end result • reactions are typically reported one/journal paper --identifying pathways frequently requires combining informationfrom several papers Euromap Text Mining Seminar

EMPathIE: Text Sources • Project focused on 13 journal papers from • FEMS Letters (Federation of European Microbiological Societies), and • Biochimica et Biophysica Acta from 1992-1995 • Papers supplied by Elsevier Science and marked up according to their proprietary SGML DTD • mark up reliable for bibliographical and text structureinformation • typographical markup (e.g. italics for gene names)inconsistent and hence ignored Euromap Text Mining Seminar

Sample EMPathIE Article Federation of European Microbiological Societies Isocitrate lyaseactivity in halophilicarchaea A. Oren and P. Gurevich, The Hebrew University of Jerusalem Abstract: Eight species of halophilic Archaea were tested for the presence of isocitrate lyase activity. High activities (up to 100nmol –1mg protein -1) were detected in Haloferaxmediterraneiand Haloferax volcaniiwhen grown in medium containing acetateas the principal carbon source. Little activity was found in representatives of the genera Halobacteriumand Haloarcula. Isocitrate lyase from Haloferax mediterraneirequired high potassium chloride concentrations, optimal activity being found at 1.5-3 M potassium chloride and pH 7.0. Replacement of potassium chloride by sodium chloride resulted in much lower activities. Sulfhydryl compounds (cysteine, glutathione) were not stimulatory. In other properties (stimulation by magnesium ions, sensitivity to different inhibitors) the enzyme resembled isocitrate lyases from representatives of theBacteria and Eucarya. Full Text: … Euromap Text Mining Seminar

EMPathIE Template Specification <ENZYME> := <PATHWAY> := NAME: "NAME" + NAME: "NAME" + CODE: “EC_CODE" * INTERACTION: <INTERACTION> + WEIGHT: "WEIGHT" - SUBUNITS: "SUBUNITS" * <INTERACTION> := ENZYME: <ENZYME> ^ <ORGANISM> := SOURCE: <SOURCE> - NAME: "NAME" + PARTICIPANT: <PARTICIPANT> * STRAIN: "STRAIN" * NON_PARTICIPANT:<NON_PARTICIPANT> * GENUS: "GENUS" - <PARTICIPANT> := <COMPOUND> := COMPOUND: <COMPOUND> ^ NAME: "NAME" + TYPE: {SUBSTRATE,PRODUCT, SUPPLIER: "SUPPLIER" * ACTIVATOR, COFACTOR, INHIBITOR,BUFFER} ^ <SOURCE> :=CONCENTRATION: "CONCENTRATION" - ENZYME: <ENZYME> ^ TEMPERATURE:"TEMPERATURE" ORGANISM: <ORGANISM> ^ ACIDITY: "ACIDITY" - Euromap Text Mining Seminar

Filled EMPathIE Template ENZYME-1 PATHWAY-1 Name: isocitrate lyase Name: glyoxylate cycle E.C. Code: 4.1.3.1 Interaction: INTERACTION-1 ORGANISM-1 INTERACTION-1 Name: Haloferax volcanii Enzyme: ENZYME-1 Strain: ATCC 29605 Participants: PARTICIPANT-1 Genus: halophilic Archaea PARTICIPANT-2 COMPOUND-1 PARTICIPANT-1 Name: glyoxylate phenylhydrazone Compound: COMPOUND-1 Type: Product COMPOUND-2 Temperature: 35C Name: KCl PARTICIPANT-2 SOURCE-1 Compound: COMPOUND-2 Enzyme: ENZYME-1 Type: Activator Organism: ORGANISM-1 Concentration: 1.75 M Euromap Text Mining Seminar

PASTA: Protein Active Site TemplateAcquisition • Aim: Use IE techniques to create a database ofprotein active site data from academic journal papers and abstracts to support protein structure analysis • Partners: Depts of Computer Science, Information Studies,Molecular Biology and Biotechnology, U. of Sheffield • Sponsors: BBSRC-EPSRC BioInformatics Programme • PostDoc: Dr. George Demetriou • Status: Complete. Project ran 03/98 -- 03/01 Euromap Text Mining Seminar

PASTA: Scenario • Extract information concerning the roles of amino acids inprotein molecules and create a database of protein active sites fromboth scientific journal abstracts and full articles • New protein structures are being reported at very high rates inthe literature Euromap Text Mining Seminar

PASTA: Scenario (cont) • Full evaluation of the results of protein structure comparisonsoften requires the investigation of extensive literature references • E.g. to determine whether an amino acid has been reportedas presentin a particular region of a protein • Computational methods that can extract informationdirectly from these articles would be very useful to biologists incomparison classification work and to those engaged in modellingstudies Euromap Text Mining Seminar

Sample PASTA Article (BIDS Abstract) TI: The crystal structure of a triacylglycerol lipase from Pseudomonascepacia reveals a highly open conformation in the absence of a bound inhibitor AU: Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW NA: SEOUL NATL UNIV,COLL NAT SCI,DEPT CHEM,SEOUL151742,SOUTH KOREASEOUL NATL UNIV,COLL NAT SCI,DEPT CHEM,SEOUL151742,SOUTH KOREA JN: STRUCTURE, 1997, Vol.5, No.2, pp.173-185 IS: 0969-2126 AB: Background: … Results: We have determined the crystal structure of a triacylglycerol lipase from Pseudomonas cepacia (Pet) in the absence of a bound inhibitor using X-ray crystallography. The structure shows the lipase to contain an alpha/beta-hydrolase fold and a catalytic triad comprising of residues Ser87, His286and Asp264. The enzyme shares several structural features with homologous lipases from Pseudomonas glumae (PgL) and Chromobacterium viscosum (CvL), including a calcium-binding site. The present structure of Pet reveals a highly open conformation with a solvent-accessible active site. This is in contrast to the structures of PgL and Pet in which the active site is buried under a closed or partially opened 'lid', respectively. Conclusions: … Euromap Text Mining Seminar

(Partially) Filled PASTA Template <TEMPLATE-str-1997-5-2-1>:= DOC_JR: "STRUCTURE, 1997, Vol.5, No.2, pp.173-185" DOC_AUTH: "Kim_KK, Song_HK, Shin_DH, Hwang_KY,Suh_SW" DOC_IS: "0969-2126“ <RESIDUE-str-1997-5-2-1>:= RES_TYPE: SERINE RES_NO: "87" SITE/FUNCTION: "catalytic","hydrolytic activity","interfacialactivation","stereoselectivity", "calcium-binding site", ”active-site" SEC_STRUCT: A-HELIX QUATERN_STRUCT: - REGION: 'lid' INTERACTION: - <PROTEIN>:=<SPECIES-str-1997-5-2-1>:= PRO_NAME: "Triacylglycerol lipase“SPE_NAME: "Pseudomonas cepacia" PRO_SCOP_FAM: "Lipase“SPE_NAME_TYPE: SCIENTIFIC PDB_CODE: 1LGY <IN_PROTEIN -str-1997-5-2-1 >: = <IN-SPECIES-str-1997-5-2-1 >:= RESIDUE: <RESIDUE-str-1997-5-2-1 PROTEIN: <PROTEIN-str-1997-5-2-1 PROTEIN: <PROTEIN-str-1997-5-2-1>SPECIES: <SPECIES-str-1997-5-2-1> Euromap Text Mining Seminar

Outcomes I: The PASTA System System processes texts in four principal stages: • text preprocessing performs text structure analysis andtokenisation • lexical and terminological processingperforms morphological analysis,multi-token matching against terminology lexicons, and small-scale parsing using terminology grammars • parsing and semantic interpretation splits text intosentences, tags tokens with parts-of-speech, performs partialphrasal parsing and compositional semantic interpretation intoa predicate-argument “logical form” • discourse interpretationintegrates each sentence'spredicate-argument representation into a hierarchically structured semantic netencoding the system's domain model A final stage generates template output as required. Euromap Text Mining Seminar

PASTA System: TextPreprocessing • Text structure analysis • Scientific articles typically have a rigid structure, including abstract, introduction, method and materials, results, anddiscussion sections. • Certain sections can be targeted for detailed analysis while others can be skipped completely. • Where articles are available in SGML with a DTD, an initialmodule is used to identify particular markup, specified in a configuration file, for use by subsequent modules. • Where articles are in plain text, an initial `sectioniser' module is used to identify and classifysignificant sections using sets of regular expressions. • Tokenisation • in addition to the normal white-space/punctuation delimitedtokenisation required for newswires, scientific papers require further sophistication: NaCl,Tyr152 Euromap Text Mining Seminar

PASTA System: Lexical andTerminological processing • The main information sources used for terminologyidentification inthe biochemical domain are: • case-insensitive terminology lexicons (at present approximately 25,000component terms in 52 categories -- see next slide) • morphological cues, mainly standard biochemical suffixes • hand-constructed grammar rules for each terminology class • For example, the enzyme namemannitol-1-phosphate5-dehydrogenasewould be recognised • by the classification ofmannitolas apotential compound modifier and phosphate as a compound-- both matched in the terminology lexicon • by morphological analysis suggestingdehydrogenase as a potential enzyme head, due to its suffix–ase • by domain-specific grammar rules combining the enzyme head with a known compound and modifier which can play the role ofenzyme modifier Euromap Text Mining Seminar

Biochemical Terminological Lists Principal Term Classes • protein names (trypsin, lipase, etc.) • amino acids (Glycine, Phe, etc.) • gene names • species (human, E.coli, etc.) • secondary structure (alpha helix, beta sheet, etc.) • supersecondary structure (coiled-coil alpha helix, etc.) • quaternary structure (dimer,hexamer, etc.) • regions (carboxy-terminal) and sites (glycosylation site, etc) • chains (butyl chain, catalytic chain, etc.) • interactions (hydrogen bonds, contacts) • bases (DNA, RNA) • elements (N, Ca, NZ, etc.) • non-protein entities (cofactors, substrates, etc.) • measure terms (kcal, millimeter, joule, etc.) Principal Terminology Resouces • Protein Data Bank • Enzyme classification • SCOP classification • CATH classification • IUPAC / IUBMB Nomenclature Recommendations Euromap Text Mining Seminar

PASTA System:Parsing andSemantic Interpretation • The syntactic processing modules treat any terms recognised inthe previous stage as non-decomposable units, with a syntacticrole of proper noun. As a consequence: • The sentence splitting module cannot propose sentence boundaries within a preclassified term. • The part-of-speech tagger only attempts to assign tags totokens which are not part of proposed terms. • The phrasal parser treats terms as preparsed noun phrases. • Parsing is carried out with a general phrasal (feature-basedunification) grammar of English. • The phrasal grammar includes compositional semantic rules, which are used to construct a semantic representation of the “best”, possibly partial, parse of each sentence. • This predicate logic-like representation is passed on as inputto the discourse interpretation stage. Euromap Text Mining Seminar

S Syntactic Analysis VP N Det NP V PP P PP NP This contains cleft the putative catalytic residue above the core of the beta-barrel PASTA System:Parsing andSemantic Interpretation (cont) “This cleft contains the putative catalytic residueGlu132above the core of the beta-barrel.” . Semantic Analysis contain(e1), cleft(e2), lsubj(e1,2),det(e2,this), residue(e3), lobj(e1,e3), name(e3,”Glu32”), adj(e3,putative),adj(e3,catalytic) core(e4),above(e1,e4) secondary_structure(e5),name(e5,”beta-barrel”),of(e4,e5) Euromap Text Mining Seminar

PASTA System: Discourse Interpretation • The semantic representation of each sentence is added to a predefined domain model made up of • an ontology, or concept hierarchy, and • inheritable attributes and inference rules associated with concept nodes in the hierarchy • The domain model is gradually populated with instances of concepts from the text to become a discourse model • A powerful coreference mechanism attempts to merge each newly introduced instance with an existing one, subject to various syntactic and semantic constraints. • Inference rules of particular instance types may then fire to hypothesise the existence of instances required to fill a template (e.g. an organism with a source_of relation to an enzyme). • The coreference mechanism will then attempt to resolve the hypothesised instances with actual instances from the text – making up for deficiencies in parsing. Euromap Text Mining Seminar

PASTA System: Discourse Interpretation (cont) • The three-dimensional structure of Endo H has been determined … • A shallow curved cleft runs across thesurface of the molecule from … • This cleft contains the putative catalytic residues Asp130 and Glu132 … • From 1, Endo H is identified as a protein – protein(e1),name(e1,”Endo H”) – and added to the discourse model • From 2, the cleft is identified – cleft(e23) – and the molecule – molecule(e25) • Ontology records that proteins are molecules and coreference resolves e25 and e1 • Domain model/ontology records that clefts are regions and that regions are located in proteins – a protein, say e42, is hypothesized and the relation located_in(e23,e42) • In the absence of full semantic analysis of “runs across the surface of”, coreference picks the closest protein and resolve e42 with e1/e25 – i.e. the cleft is assumed to be in Endo H. • From 3, the analysis is as before – the cleft is identified as, say e52, and the residue, e61 • coreference resolves the cleft e52 with the preceding e23 • The domain model allows reasoning from “contains” to establish the relation located_in(e61,e23) – the residue is located in the cleft • Transitiviy of located_in permits the conclusion: located_in(e61,e1) – Glu132 is in EndoH Euromap Text Mining Seminar

Outcomes II: Text Corpora • 1500 BIDS abstracts from 24 molecular biology journalsfrom 1994-98 • ASCII text ~250 words each • structured keywordfields in header • 300 full journal paperfrom Molecular Biologyand Structurefrom 1994-1998 • from publishers' websites (HTML/ASCII) Euromap Text Mining Seminar

Annotated Corpora • Annotated corpora are needed for system development and evaluation • For development, PASTA researchers at Sheffield manually prepared • terminology-tagged 52 journal article abstracts for the term classes: protein, species residue, site, region, secondary structure, super secondary structure, quaternary structure, chain, base, atom, non-protein, interactions (1376 term occurrences) • filled templates derived from 25 abstracts used for training • For final blind evaluation, independent domain experts prepared • 62 terminology-tagged abstracts for the term classes • 20 texts annotated by both annotators • interannotator agreement is low (as assessed by MUC scorer) • filled templates from 30 abstracts • 10 annotated by both annotators Euromap Text Mining Seminar

Evaluation • To evaluate system’s performance is measured against manually annotated corpora using automatic scorer developed in the DARPA MUC evaluations • On development texts • terminology evaluation results: Recall: 88% Precision: 94% P & R: 91% • In final blind evaluation • terminology evaluation results: Recall: 82% Precision: 84% P & R: 83% • Template filling evaluation results: Recall: 68% Precision: 71% P & R: 69% Euromap Text Mining Seminar

Outcomes III: Browser-based Interface • Raw templates or texts annotated with identifiers for protein and residue names are not of much use to the working biologist • Most effective delivery platform is a Web-browser • Therefore we have designed and implemented a browser-based interface to allow a user to browse the results • has added benefit that links to source texts can easily be added – can help to overcome IE system’s errors Euromap Text Mining Seminar

Outcomes III: Browser-based Interface Euromap Text Mining Seminar

A Research Perspective on Text Mining: Tasks, Technologies and Prototype Applications