Text Mining for Biomedicine: Techniques & tools

Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki, Yoshimasa Tsuruoka. School of Computer Science, National Centre for Text Mining. www.nactem.ac.uk, Sophia.Ananiadou@manchester.ac.uk


Presentation Transcript


  1. Text Mining for Biomedicine: Techniques & tools • Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki, Yoshimasa Tsuruoka • School of Computer Science, National Centre for Text Mining • www.nactem.ac.uk • Sophia.Ananiadou@manchester.ac.uk

  2. Outline • Challenges / objectives of TM in biomedicine • Terminology processing • Term extraction, term variation, named entity recognition • Resources for TM in biomedicine • Document classification • Information Extraction approaches • Levels of Text Mining Processing • Biomedical text mining services and systems @ NaCTeM • TerMine, AcroMine, Smart dictionary look up, Phenetica • Medie, InfoPubMed, KLEIO

  3. Material • Further background on TM for Biology: Ananiadou, S. & McNaught, J. (eds) (2006) Text Mining for Biology and Biomedicine. Boston, MA: Artech House • Numerous papers online from the bibliography • See BLIMP http://blimp.cs.queensu.ca/ • Biomedical Literature (and text) mining publications

  4. Text Mining in biomedicine • Why biomedicine? • Consider just MEDLINE: 16,000,000 references, 40,000 added per month • Dynamic nature of the domain: new terms (genes, proteins, chemical compounds, drugs) constantly created • Impossible to manage such an information overload

  5. From Text to Knowledge: tackling the data deluge through text mining • [Diagram labels: Unstructured text (implicit knowledge); Information Retrieval; Information Extraction; Knowledge Discovery; Semantic metadata; Structured content (explicit knowledge); Advanced Information Retrieval]

  6. Information deluge • Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of the information • Linking text to databases and ontologies • Curators struggle to process the scientific literature • Discovery of facts and events is crucial for gaining insights in biosciences: need for text mining

  7. The solution: The UK National Centre for Text Mining www.nactem.ac.uk • Location: Manchester Interdisciplinary Biocentre (MIB) www.mib.ac.uk • First publicly funded text mining centre in the world • Focus: biology, medicine, social sciences…

  8. We don’t just press a button… • TM involves • Many components (converters, analysers, miners, visualisers, ...) • Many resources (grammars, ontologies, lexicons, terminologies, thesauri, CVs) • Many combinations of components and resources for different applications • Many different user requirements and scenarios, training needs • The best solutions are customised

  9. People behind NaCTeM • Text Mining Team: 14 members • Close collaboration with University of Tokyo, Tsujii Lab http://www-tsujii.is.s.u-tokyo.ac.jp/

  10. What NaCTeM is building: • Resources: ontologies, lexicons, terminologies, thesauri, grammars, annotated corpora • BOOTStrep project http://www.nactem.ac.uk/bootstrep.php • Tools: tokenisers, taggers, chunkers, parsers, NE recognisers, semantic analysers • NaCTeM is also providing services • Our related bio-text mining projects • REFINE http://dbkgroup.org/refine/ • Representing Evidence For Interacting Network Elements • ONDEX (data integration, workflows, text mining)

  11. Individual tools for user data • Splitters, taggers, chunkers, parsers, NER, term extractors • Modes of use • Demonstrators: for small-scale online use • Batch mode: upload data, get email with link to download site when job done • Web Services • Integration into Workflows (Taverna) • Some services are compositions of tools

  12. Aims • Text mining: discover & extract knowledge hidden in unstructured text • Hearst (1999) • Text mining helps to construct hypotheses from associations derived from text • protein-protein interactions • gene–phenotype associations • functional relationships among genes

  13. Impact of text mining • Extraction of named entities (genes, proteins, metabolites, etc) • Discovery of concepts allows semantic annotation of documents • Improves information access by going beyond index terms, enabling semantic querying • Construction of concept networks from text • Allows clustering, classification of documents • Visualisation of concept maps

  14. Impact of TM • Extraction of relationships (events and facts) for knowledge discovery • Information extraction, more sophisticated annotation of texts (event annotation) • Beyond named entities: facts, events • Enables even more advanced semantic querying

  15. Hypothesis generation from literature • Swanson experiments (1986) influenced conceptual biology • rapid ‘mining’ of candidate hypotheses from the literature • migraine and magnesium deficiency (Swanson, 1988) • indomethacin and Alzheimer’s disease (Swanson and Smalheiser 1994) • Curcuma longa and retinal diseases, Crohn's disease and disorders related to the spinal cord (Srinivasan and Libbus 2004) • thalidomide for treating a series of diseases such as acute pancreatitis and chronic hepatitis C (Weeber M, Rein et al. 2003)

  16. Text mining steps • Information Retrieval yields all relevant texts • Gathers, selects, filters documents that may prove useful • Finds what is known • Information Extraction extracts facts & events of interest to user • Finds relevant concepts, facts about concepts • Finds only what we are looking for • Data Mining discovers unsuspected associations • Combines & links facts and events • Discovers new knowledge, finds new associations
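
The three steps on this slide can be sketched as a toy pipeline. This is only an illustration under loud assumptions: the three documents and the entity list are invented for the example, retrieval is a plain keyword filter, extraction is a dictionary look-up, and "mining" is reduced to counting entity pairs that co-occur in the same document. Real systems use far richer components at each stage.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-collection and entity dictionary (invented for this sketch).
DOCS = [
    "narL activates the nitrate reductase operon.",
    "The nitrate reductase operon is regulated by narL and fnr.",
    "Weather report: rain expected in Manchester.",
]
ENTITIES = ["narL", "fnr", "nitrate reductase operon"]


def retrieve(docs, keyword):
    """Information Retrieval: keep documents that mention the keyword."""
    return [d for d in docs if keyword.lower() in d.lower()]


def extract(docs, entities):
    """Information Extraction: list the known entities found in each document."""
    return [sorted(e for e in entities if e.lower() in d.lower()) for d in docs]


def mine(entity_lists):
    """Data Mining: count how often two entities occur in the same document."""
    pairs = Counter()
    for found in entity_lists:
        pairs.update(combinations(found, 2))
    return pairs


relevant = retrieve(DOCS, "nitrate")
pairs = mine(extract(relevant, ENTITIES))
for pair, count in pairs.most_common():
    print(pair, count)
```

Each stage narrows or restructures the output of the previous one: retrieval discards the unrelated document, extraction turns text into entity lists, and mining aggregates across documents.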

  17. From Text to Knowledge: NLP and Knowledge Extraction • [Diagram labels: Text; Annotation Tools; Knowledge Extraction Tools; Lexicons and ontologies; Structured Knowledge]

  18. Challenge: the resource bottleneck • Lack of large-scale, richly annotated corpora • Support training of ML algorithms • Development of computational grammars • Evaluation of text mining components • Lack of knowledge resources: lexica, terminologies, ontologies.

  19. Annotation & Information Extraction • [Diagram labels: Biomedical Literature; Annotation; IE system; Biomedical Knowledge] • Semantic annotation simulates the ideal performance of an IE system. • IE systems can be developed by reference to an annotated corpus. • The performance of IE systems can be evaluated by comparison with the annotated corpus. (Kim & Tsujii, Text Mining Workshop, Manchester, 2006)

  20. Text Annotation • Task-oriented annotation: defined by specific tasks • Specific curation tasks in specific environments • Mapping of protein names to database IDs in specific text types • Specific event types such as Protein-Protein Interaction, Disease-Gene Association of specific diseases • Task-neutral annotation: GENIA Corpus [U-Tokyo, NaCTeM], development of generic tools, defined by theories • Linguistics: tokens, POS, phrase structure, dependency structure, deep syntax (PAS) • Biology: named entities of various semantic types, events • Linguistics + Biology: co-references • [Diagram labels: Application; annotated text; User; system development; Interoperable Tools]

  21. Annotation of GENIA corpus – Term & POS • Part-of-speech annotation: 2,000 abstracts • Term (entity) annotation: 2,000 + 400 abstracts

  22. Text semantic annotation • annotation of events and the named entities involved • Example: “Regulation of Transcription events” • BOOTStrep project http://www.nactem.ac.uk/bootstrep.php • two different types of annotation levels • linguistic annotation levels • biological annotation level, which marks the biological knowledge contained in the text • Linking text with biological knowledge

  23. Events and variables • Biological events can be centred on: • verbs, e.g. activate, • nouns with verb-like meanings (nominalised verbs), e.g. transcription • Different parts of a sentence correspond to different types of variables in the event, e.g. • What caused the event • The narL gene product activates the nitrate reductase operon • What was affected by the event • Analysis of mutants… • Where the event took place • These fusions were formed on plasmid cloning vectors

  24. Verb frame example: activate • “The narL gene product activates the nitrate reductase operon” • Agent characteristics: protein • Theme characteristics: operon

  25. Example 1: activates • The agent: The narL gene product (protein) • The theme (what is acted upon): the nitrate reductase operon (operon)
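
One way to picture the verb frame from slides 24 and 25 is as a small data structure. The sketch below is an assumption made for illustration, not the BOOTStrep or GENIA annotation format: the class names `Argument` and `Event`, the `VERB_FRAMES` table and the `fits_frame` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Argument:
    role: str            # semantic role, e.g. "Agent" or "Theme"
    text: str            # the text span filling the role
    semantic_type: str   # e.g. "protein" or "operon"


@dataclass
class Event:
    trigger: str                                  # the verb (or nominalised verb)
    arguments: list = field(default_factory=list)

    def argument(self, role):
        """Return the first argument filling `role`, or None if absent."""
        return next((a for a in self.arguments if a.role == role), None)


# Hypothetical frame table: the semantic type each role of a verb expects.
VERB_FRAMES = {"activate": {"Agent": "protein", "Theme": "operon"}}


def fits_frame(event, lemma):
    """Check that every role in the frame is filled with the expected type."""
    for role, expected_type in VERB_FRAMES[lemma].items():
        arg = event.argument(role)
        if arg is None or arg.semantic_type != expected_type:
            return False
    return True


event = Event(
    trigger="activates",
    arguments=[
        Argument("Agent", "The narL gene product", "protein"),
        Argument("Theme", "the nitrate reductase operon", "operon"),
    ],
)
print(fits_frame(event, "activate"))
```

Keeping the role, the text span and the semantic type together makes it straightforward to query an annotated event for "what caused it" (the Agent) or "what was affected" (the Theme).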

  26. Linguistically Annotated Corpora • GENIA • Domain • MeSH terms: Human, Blood Cells, and Transcription Factors • Annotation: POS, named entity, parse tree • Penn BioIE • Domain • the molecular genetics of oncology • the inhibition of enzymes of the CYP450 class • Annotation: POS, named entity, parse tree • Yapex • GENETAG • a corpus of 20K MEDLINE® sentences for gene/protein NER

  27. The GENIA annotation • Linguistic annotation • Reveals linguistic structures behind the text • Part-of-speech annotation • annotates the syntactic category of each word • Syntactic Tree annotation • annotates the syntactic structure of sentences • Semantic annotation (ontology-driven annotation) • Reveals knowledge pieces delivered by the text • Term annotation • annotates domain-specific terms • Event annotation • annotates events on biological entities

  28. Annotation Tool • WordFreak http://wordfreak.sourceforge.net/ • Java-based linguistic annotation tool developed at University of Pennsylvania • Extensible to new tasks and domains • Customised visualisation and annotation specification • Allows annotation process to be made as simple as possible

  29. Resources

  30. What about existing resources? • Ontologies important for knowledge discovery • They form the link between terms in texts and biological databases • Can be used to add meaning, semantic annotation of texts

  31. Link between text and ontologies • [Diagram labels: text; Ontological resources (UMLS, GO, GENIA, KEGG); Adding new knowledge; Supporting semantics]

  32. Bridging the Gap – Integrating data, text and knowledge • [Diagram labels: Databases; text; Ontological resources (UMLS, GO, GENIA, KEGG); Mathematical Models; Semantic interpretation of data; Adding new knowledge; Supporting semantics; Semantic interpretation of models in Systems Biology]

  33. Resources for Bio-Text Mining • Lexical / terminological resources • SPECIALIST lexicon, Metathesaurus (UMLS) • Lists of terms / lexical entries (hierarchical relations) • Ontological resources • Metathesaurus, Semantic Network, GO, SNOMED CT, etc • Encode relations among entities Bodenreider, O. “Lexical, Terminological, and Ontological Resources for Biological Text Mining”, Chapter 3, Text Mining for Biology and Biomedicine, pp.43-66

  34. SPECIALIST lexicon • UMLS SPECIALIST lexicon http://SPECIALIST.nlm.nih.gov • Each lexical entry contains morphological (e.g. cauterize, cauterizes, cauterized, cauterizing), syntactic (e.g. complementation patterns for verbs, nouns, adjectives) and orthographic information (e.g. esophagus – oesophagus) • General language lexicon with many biomedical terms (over 180,000 records) • Lexical programs include variation (spelling), base form, inflection, acronyms

  35. Lexicon record • {base=Kaposi's sarcoma spelling_variant=Kaposi sarcoma entry=E0003576 cat=noun variants=uncount variants=reg variants=glreg} • Forms covered: Kaposi’s sarcoma, Kaposi’s sarcomas, Kaposi’s sarcomata, Kaposi sarcoma, Kaposi sarcomas, Kaposi sarcomata • Source: The SPECIALIST Lexicon and Lexical Tools, Allen C. Browne, Guy Divita, and Chris Lu PhD, 2002 NLM Associates Presentation, 12/03/2002, Bethesda, MD
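
As a rough illustration of how such a record could be read into a program, the sketch below parses the flattened `key=value` text shown on the slide into a dictionary. The function name `parse_lexicon_record` is hypothetical, and the parsing rule (a value runs until the next `key=`) is an assumption that fits this one-line transcript form; it is not the official SPECIALIST tooling.

```python
import re


def parse_lexicon_record(text):
    """Parse a one-line '{key=value key=value ...}' record into a dict of lists.

    Assumption: a field starts at a run of word characters followed by '=',
    and its value extends up to the next ' key=' or the end of the record.
    """
    body = text.strip().strip("{}")
    pattern = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)", re.S)
    record = {}
    for key, value in pattern.findall(body):
        # Keys such as 'variants' may repeat, so every key maps to a list.
        record.setdefault(key, []).append(value.strip())
    return record


raw = ("{base=Kaposi's sarcoma spelling_variant=Kaposi sarcoma "
       "entry=E0003576 cat=noun variants=uncount variants=reg variants=glreg}")
record = parse_lexicon_record(raw)
for key, values in record.items():
    print(key, values)
```

Storing every field as a list keeps the three `variants` codes together without special-casing repeated keys.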

  36. Normalisation (lexical tools) • Variants: Hodgkin Disease; HODGKIN DISEASE; Hodgkin’s Disease; Hodgkin’s disease; Disease, Hodgkin; ... • normalise → disease hodgkin

  37. Steps of Norm • Input: Hodgkin’s Diseases • Remove genitive → Hodgkin Diseases • Replace punctuation with spaces; remove stop words → Hodgkin Diseases • Lowercase → hodgkin diseases • Uninflect each word → hodgkin disease • Word order sort → disease hodgkin • Lexical tools of the UMLS http://lexsrv3.nlm.nih.gov/SPECIALIST/index.html
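
The six steps listed on this slide can be approximated in a few lines. This is a minimal sketch, not the UMLS `norm` program itself: the stop-word list is a small invented sample, and `uninflect` is a crude suffix rule standing in for the lexicon-based uninflection the real tool performs (it would mishandle words such as "virus").

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one shipped with the tool.
STOP_WORDS = {"of", "and", "with", "for", "the", "to", "in", "by", "on"}


def uninflect(word):
    """Crude stand-in for lexicon-based uninflection: strip a plural ending."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def norm(term):
    # 1. Remove genitive markers (straight or curly apostrophe followed by s).
    term = re.sub(r"[’']s\b", "", term)
    # 2. Replace remaining punctuation with spaces.
    term = re.sub(r"[^\w\s]", " ", term)
    # 3. Remove stop words.
    words = [w for w in term.split() if w.lower() not in STOP_WORDS]
    # 4. Lowercase.
    words = [w.lower() for w in words]
    # 5. Uninflect each word.
    words = [uninflect(w) for w in words]
    # 6. Sort the words so that word order no longer matters.
    return " ".join(sorted(words))


for variant in ["Hodgkin’s Diseases", "HODGKIN DISEASE", "Disease, Hodgkin"]:
    print(variant, "->", norm(variant))
```

All of the surface variants map to the same key, which is what lets a term occurrence in text be matched against a differently written database entry.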

  38. The Gene Ontology (GO) • Controlled vocabulary for the annotation of gene products http://www.geneontology.org/ • 19,468 terms, 95.3% with definitions • 10,391 biological_process • 1,681 cellular_component • 7,396 molecular_function

  39. Gene Ontology • GOA database (http://www.ebi.ac.uk/GOA/) assigns gene products to the Gene Ontology • GO terms follow certain conventions of creation, have synonyms such as: • ornithine cycle is an exact synonym of urea cycle • cell division is a broad synonym of cytokinesis • cytochrome bc1 complex is a related synonym of ubiquinol-cytochrome-c reductase activity

  40. GO terms, definitions and ontologies in OBO • id: GO:0000002 name: mitochondrial genome maintenance namespace: biological_process def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai] is_a: GO:0007005 ! mitochondrion organization and biogenesis
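
In an OBO file each of these tag-value pairs sits on its own line. As a hedged sketch of how the stanza above could be read, the function below (the name `parse_obo_stanza` is hypothetical) splits each line at the first colon and drops the trailing `! comment`. It is deliberately simplistic: it would also cut at a " ! " inside a quoted definition, which a full OBO parser must handle.

```python
def parse_obo_stanza(text):
    """Parse the 'tag: value' lines of one OBO stanza into a dict of lists."""
    record = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("["):   # skip blanks and '[Term]' headers
            continue
        tag, _, value = line.partition(":")     # split at the first colon only
        value = value.split(" ! ")[0].strip()   # drop a trailing '! comment'
        record.setdefault(tag.strip(), []).append(value)
    return record


stanza = """
[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
"""
term = parse_obo_stanza(stanza)
print(term["id"], term["name"], term["is_a"])
```

Values are kept as lists because tags such as `is_a` and `synonym` can occur more than once in a stanza.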

  41. Metathesaurus • organised by concept • 5M names, 1M concepts, 16M relations • built from 134 electronic versions of many different thesauri, classifications, code sets, and lists of controlled terms • "source vocabularies" • common representation

  42. Are the existing knowledge resources sufficient for TM? No! Why? • Limited lexical & terminological coverage of biological sub-domains • Resources focused on human specialists (GO, UMLS, UniProt) • Ontology concept names frequently confused with terms

  43. Naming conventions • Update and curation of resources • FlyBase gene name coverage 31% (abstracts) to 84% (full texts) • Naming conventions and representation in heterogeneous resources • Term formation guidelines from formal bodies e.g. HUGO, IPI not uniformly used • Problems with integration of resources • dystrophin used for 18 gene products • HUGO: “Dystrophin (muscular dystrophy, Duchenne and Becker types), included DXS143, DXS164, DXS206, …”

  44. Term variation • Terminological variation and complexity of names • High correlation between degree of term variation and dynamic nature of biomedicine • Variation occurs in controlled vocabularies and texts but discrepancy between the two • Exact match methods fail to associate term occurrences in texts with databases

  45. What’s in a name? Terms, named entities in biology

  46. What’s in a name? • Breast cancer 1 (BRCA1) • p53 • Ribosomal protein S27 • Heat shock protein 110 • Mitogen activated protein kinase 15 • Mitogen activated protein kinase kinase kinase 5 • (From K. Cohen, NAACL 2007)

  47. Worst gene names • sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A • (K. Cohen, NAACL 2007)