The Research Assistant for Biological Text Mining

Software for Biotech and Pharma Research
Luc Dehaspe and other members of the BioMinT Consortium

Text Mining in the biological domain
• Emerging field of research and development
• 40+ articles in “Bioinformatics 2004”
• Dedicated workshops, competitions and interest groups
• Information retrieval and extraction to deal with information overflow
• 12 million citations in Medline from 4,600 journals
• Many more resources on the web
• Essential link in the semantic integration of the numerous biological resources

Use of text mining for database annotation
Swiss-Prot:
• curated protein sequence database
• high level of annotation of proteins
• high level of integration with other databases
[Figure: Swiss-Prot Entry Creation Flowchart]

Use of database annotations for text mining
• Tools for information retrieval, filtering, classification and extraction rely on:
  • corpora of examples used by machine learning methods
  • linguistic analysis and controlled vocabularies (ontologies, thesauri, biological dictionaries)
• Databases provide semi-structured information that could be used:
  • for corpus elaboration
  • as specific vocabulary resources

• 3 year FP5 European Project, started in January 2003
• Official web site: www.biomint.org
• Interdisciplinary consortium:
  • University of Antwerp (BE)
  • Austrian Research Institute for AI
  • University of Manchester (UK)
  • PharmaDM (BE)
  • Swiss Institute of Bioinformatics
  • University of Geneva (CH)
[Diagram: partners grouped under Artificial Intelligence and Biological Sciences, with the coordinator marked]

The goals of BioMinT
• To develop a generic text mining tool that:
  • interprets different types of queries
  • retrieves relevant documents from the biological literature
  • extracts the required information
  • outputs the result as a database slot filler or as a structured report
• The tool thus provides two essential research support services:
  • Curator's Assistant: accelerate the annotation and update of databases by partially automating them
  • Researcher's Assistant: generate readable reports in response to queries from biological researchers

Curator’s Assistant for Swiss-Prot Annotation
[Figure labels: Definition, Gene name, Reference content, Reference comments, Comments, Keywords, Sequence features]

Curator’s Assistant for PRINTS annotation
• PRINTS deals with groups of proteins
• Annotation of 3 types of protein fingerprints: Family, Super-family, Domain-family
Extracted information (grouped as on the slide):
• High level function, high level structure, disease associations, subcellular location, tissue distribution, etc.
• Low level function, super-family structure, disease associations, number of subtypes, etc.
• Domain structure, domain function

The Biological Research Assistant
• Overlap with the Curator’s Assistant
• All biologists are occasionally in the curator’s seat
• Keep ahead of Swiss-Prot in the research area of interest
• Include private (confidential) document collections
[Figures: Swiss-Prot Entry Creation Flowchart; Biological Researcher’s Literature Screening Flowchart]

Information retrieval and extraction modules
• GUI
• IR: query expansion, PubMed search, document filtering/ranking, document organisation
• IE: sentence extractor, NLP tools, case frame generator

Information Retrieval
• A meta-query engine built around PubMed
• Expansion of the initial query with synonyms from a gene/protein synonym database (GPSDB)
  • the goal is to retrieve an exhaustive set of documents containing information on a protein
• Filtering and ranking of the retrieved documents
• Pre-classification according to information topics

GPSDB
• Database for synonym expansion of gene and protein names
• Populated from the main resources on model organisms
• Contains 559,294 synonyms referring to 292,472 proteins

GPSDB
[Figure: gene/protein names collected from LocusLink, HUGO, OMIM and Swiss-Prot, including TWIST1, TWIST, twist, H-twist, BPES2, BPES3, SCS, ACS3, ACSL3, FACL3, LACS3 and PRO2194]
• Cross-reference links connect database entries that refer to the same gene/protein entity, which exposes homonymy when it occurs
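The cross-reference idea can be illustrated with a small sketch: entries linked by cross-references are merged into one entity, and any name that ends up attached to more than one entity is flagged as a homonym. This is only a sketch under stated assumptions. The entry identifiers, the assignment of names to entries and the links below are illustrative (the names themselves appear in the figure), and the union-find grouping is one possible technique, not necessarily how GPSDB is built.

```python
from collections import defaultdict

# Illustrative data: which names each database entry carries (assumed, not from GPSDB).
ENTRY_NAMES = {
    "LocusLink:TWIST1": {"TWIST1", "H-twist", "ACS3", "SCS"},
    "HUGO:TWIST1": {"TWIST1", "TWIST", "BPES2", "BPES3"},
    "Swiss-Prot:TWIST1": {"TWIST1", "TWIST", "H-twist"},
    "LocusLink:ACSL3": {"ACSL3", "FACL3", "ACS3", "PRO2194"},
    "HUGO:ACSL3": {"ACSL3", "FACL3"},
    "Swiss-Prot:ACSL3": {"ACSL3", "LACS3", "ACS3"},
}

# Illustrative cross-reference links between entries (assumed).
CROSS_REFS = [
    ("LocusLink:TWIST1", "HUGO:TWIST1"),
    ("HUGO:TWIST1", "Swiss-Prot:TWIST1"),
    ("LocusLink:ACSL3", "HUGO:ACSL3"),
    ("HUGO:ACSL3", "Swiss-Prot:ACSL3"),
]


def group_entries(entries, cross_refs):
    """Group entries into entities: entries joined by cross-references share a group."""
    parent = {entry: entry for entry in entries}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in cross_refs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for entry in list(parent):
        groups[find(entry)].add(entry)
    return list(groups.values())


def find_homonyms(entry_names, cross_refs):
    """Return the names that are attached to more than one entity."""
    name_to_groups = defaultdict(set)
    for idx, group in enumerate(group_entries(entry_names, cross_refs)):
        for entry in group:
            for name in entry_names[entry]:
                name_to_groups[name].add(idx)
    return {name for name, idxs in name_to_groups.items() if len(idxs) > 1}


ENTITIES = group_entries(ENTRY_NAMES, CROSS_REFS)
HOMONYMS = find_homonyms(ENTRY_NAMES, CROSS_REFS)
```

With this toy data the six entries collapse into two entities, and ACS3 is the only name shared by both.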

GPSDB screenshot
• lap2 is a synonym of three separate protein entities: Erbin, HSP 86 and Thymopoietin

GPSDB used for query expansion
• Original user query: lap2
• Query expansion based on GPSDB [screenshot]
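A minimal sketch of synonym-based query expansion, using the lap2 example. The in-memory synonym table is a hypothetical stand-in for a GPSDB lookup, and the function names are invented for illustration. Sending the expanded query to PubMed through the NCBI E-utilities `esearch` endpoint is also an assumption, since the slides do not say how the meta-query engine talks to PubMed. The sketch only builds the URL and performs no network call.

```python
from urllib.parse import urlencode

# Hypothetical excerpt of a synonym table: lower-cased name -> one synonym set
# per protein entity, so that homonyms such as lap2 stay visible.
SYNONYMS = {
    "lap2": [{"Erbin"}, {"HSP 86"}, {"Thymopoietin"}],
}

# NCBI E-utilities search endpoint (assumed transport, not stated in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def expand_query(term, table=SYNONYMS):
    """Return the original term followed by all known synonyms, sorted."""
    synonyms = set()
    for entity_names in table.get(term.lower(), []):
        synonyms |= entity_names
    synonyms.discard(term)
    return [term] + sorted(synonyms, key=str.lower)


def to_boolean_query(names):
    """Join names into an OR query; quotes keep multi-word names together."""
    return " OR ".join(f'"{name}"' for name in names)


def esearch_url(query, retmax=100):
    """Build a PubMed search URL for the expanded query."""
    return ESEARCH_URL + "?" + urlencode({"db": "pubmed", "term": query, "retmax": retmax})


EXPANDED = expand_query("lap2")
QUERY = to_boolean_query(EXPANDED)
URL = esearch_url(QUERY)
```

Because lap2 is a homonym, the expanded query mixes three different proteins, which is one reason the filtering and ranking step that follows matters.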

Document filtering and ranking
• Interactive modules which permit a flexible selection of relevant documents for the IE process
• Algorithmic approaches:
  • Query dependent:
    • Lucene Ranker: Java-based indexing engine giving a ranked output of queried documents
  • Query independent:
    • Naive Bayes Ranker: uses a classifier pre-trained on documents relevant to specific topics

Document filtering and ranking
[Screenshot: output of query-dependent ranking]
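Query-dependent ranking can be sketched with a plain TF-IDF score. This is a simplified stand-in and not Lucene: the prototype uses the Java-based Lucene engine, whose scoring is more elaborate. The documents below are toy examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Return (doc_index, score) pairs sorted by descending TF-IDF score."""
    tokenized = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in tokenized:
        doc_freq.update(counts.keys())

    query_terms = set(tokenize(query))
    scores = []
    for idx, counts in enumerate(tokenized):
        length = sum(counts.values()) or 1
        score = 0.0
        for term in query_terms:
            if counts[term]:
                idf = math.log(1 + n_docs / doc_freq[term])
                score += (counts[term] / length) * idf
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy documents (illustrative only).
DOCS = [
    "MAN1 is retained in the inner nuclear membrane",
    "The mouse lymphoma assay identifies chemical mutagens",
    "Nuclear membrane proteins include MAN1 and thymopoietin",
]
RANKED = rank_documents("MAN1 nuclear membrane", DOCS)
```

Documents that share no term with the query receive a score of zero and sink to the bottom of the list.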

Document filtering and ranking
[Screenshot: output of query-independent ranking with respect to topic “Disease”]
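Query-independent ranking can be sketched with a small multinomial Naive Bayes model that scores each document by its log-odds of belonging to a topic such as “Disease”. The training sentences, class layout and smoothing value are assumptions made for illustration. The slides do not describe the actual features or training data of the Naive Bayes Ranker.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesRanker:
    """Two-class multinomial Naive Bayes with Laplace smoothing.

    Assumes the training data contains at least one document per class.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = Counter()
        for text, label in zip(texts, labels):
            self.counts[bool(label)].update(tokenize(text))
            self.n_docs[bool(label)] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_likelihood(self, tokens, label):
        total = sum(self.counts[label].values()) + self.alpha * len(self.vocab)
        logp = math.log(self.n_docs[label] / sum(self.n_docs.values()))
        for token in tokens:
            if token in self.vocab:  # words never seen in training are ignored
                logp += math.log((self.counts[label][token] + self.alpha) / total)
        return logp

    def score(self, text):
        """Log-odds that the text is on topic; positive means on topic."""
        tokens = tokenize(text)
        return self._log_likelihood(tokens, True) - self._log_likelihood(tokens, False)

    def rank(self, texts):
        return sorted(texts, key=self.score, reverse=True)


# Toy training set for the topic "Disease" (illustrative only).
TRAIN_TEXTS = [
    "mutations in TWIST1 cause Saethre-Chotzen syndrome",
    "the gene is associated with hereditary disease in patients",
    "the protein is retained in the inner nuclear membrane",
    "crystal structure of the binding domain was solved",
]
TRAIN_LABELS = [True, True, False, False]

RANKER = NaiveBayesRanker().fit(TRAIN_TEXTS, TRAIN_LABELS)
ON_TOPIC = "patients with the syndrome carry mutations"
OFF_TOPIC = "the domain structure of the membrane protein"
```

The score does not depend on the user query at all, only on the topic model, which is what makes this ranking query independent.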

Sentence extractor
• Goal: extract sentences with information relevant for protein annotation
• Method: machine learning from corpora with manually labeled sentences
• Data representation: bag-of-words approach
• Best results with Support Vector Machines (linear and Radial Basis Function kernels)
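The bag-of-words plus linear SVM combination can be sketched in a few lines. The training loop below is a basic sub-gradient descent on the L2-regularised hinge loss, chosen here only because it fits in pure Python. The slides name the model family but not the SVM implementation, features or hyperparameters, so all of those, along with the toy sentences, are assumptions.

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Represent a sentence as token counts, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def decision_value(features, weights, bias):
    return sum(weights[token] * count for token, count in features.items()) + bias


def train_linear_svm(examples, epochs=50, lam=0.01, lr=0.1):
    """Train on (sentence, label) pairs, label +1 (relevant) or -1 (not relevant)."""
    data = [(bag_of_words(sentence), label) for sentence, label in examples]
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for features, label in data:
            margin = label * decision_value(features, weights, bias)
            # Gradient of the L2 penalty: shrink every weight slightly.
            for token in list(weights):
                weights[token] *= 1 - lr * lam
            # Sub-gradient of the hinge loss: update only if the margin is violated.
            if margin < 1:
                for token, count in features.items():
                    weights[token] += lr * label * count
                bias += lr * label
    return weights, bias


def predict(sentence, weights, bias):
    return 1 if decision_value(bag_of_words(sentence), weights, bias) >= 0 else -1


# Toy labeled sentences (illustrative only).
TRAINING = [
    ("MAN1 is retained in the inner nuclear membrane", 1),
    ("the protein localizes to the nuclear envelope", 1),
    ("samples were centrifuged for ten minutes", -1),
    ("we thank the reviewers for helpful comments", -1),
]
WEIGHTS, BIAS = train_linear_svm(TRAINING)
```

A call such as `predict("the receptor is retained in the nuclear membrane", WEIGHTS, BIAS)` then labels an unseen sentence. An RBF kernel, also mentioned on the slide, would need a kernelised solver instead of this explicit weight vector.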

Sentence extractor: sample output
• set of sentences extracted from the top 5 ranked papers
• query terms are highlighted
• sentences classified according to topics (function, structure, disease)
• sentences linked to the PubMed abstract they originate from

Case frame generator
Sentence: “A protein containing the N-terminal domain with the first transmembrane segment of MAN1 is retained in the inner nuclear membrane.”
Case frame: TARGETED_TO {X: MAN1} {Y: inner nuclear membrane}
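A case frame is essentially a relation name with filled slots. The sketch below shows one way to represent it and to fill it from the example sentence. The regular expression is a hand-written, hypothetical stand-in for an extraction rule. In BioMinT the rules are learned and applied to shallow-parser output rather than to raw text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseFrame:
    relation: str
    x: str
    y: str

    def __str__(self):
        return f"{self.relation} {{X: {self.x}}} {{Y: {self.y}}}"


# Hypothetical hand-written rule: "<Name> is retained in / targeted to the <location>".
TARGETED_TO_RULE = re.compile(
    r"\b(?P<x>[A-Z][A-Za-z0-9-]*) is (?:retained in|targeted to) the (?P<y>[a-z ]+?)[.,;]"
)


def extract_case_frames(sentence):
    """Return one TARGETED_TO frame per match of the rule in the sentence."""
    return [
        CaseFrame("TARGETED_TO", match.group("x"), match.group("y"))
        for match in TARGETED_TO_RULE.finditer(sentence)
    ]


SENTENCE = (
    "A protein containing the N-terminal domain with the first transmembrane "
    "segment of MAN1 is retained in the inner nuclear membrane."
)
FRAMES = extract_case_frames(SENTENCE)
```

Printing `FRAMES[0]` reproduces the frame shown on the slide. A surface pattern like this is brittle, which is the motivation for rules that use syntactic and semantic information.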

Case frame generator
• Goal: automatic identification of selected types of entities, relations, or events in free text
• Methods:
  • Given a set of pre-labeled sentences, learn IE templates with Inductive Logic Programming (ILP)
  • Background knowledge:
    • Syntactic and semantic information from the shallow parser
    • Ontologies providing entities in a given domain
• Text analysis tools:
  • Shallow parser (MBSP) based on machine learning (TiMBL)
  • Shallow parser adapted to the biomedical field using the Genia corpus

Case frame generator: sample output (shallow parser)
Sentence: “The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens.”
[Figure: shallow parser output with the chunks “The mouse lymphoma assay”, “MLA”, “utilizing”, “the TK gene”, “to identify” and “chemical mutagens”, the role labels subject and object, and the entity labels Cell-line and DNA part]

Case frame generator: sample output
• Information extracted by the Case Frame Generator, which applies machine-learned IE rules to the output of the Shallow Parser

Summary
• The BioMinT prototype is a working, unified system for Biological Text Mining
• Information retrieval:
  • query expansion
  • document filtering/ranking
• Information extraction:
  • extraction of sentences on user-specified topics
  • extraction of relationships between entities (case frames)
• Based on a variety of resources, technologies and expertise:
  • Biological sciences: corpus annotation, database annotation, fingerprints, ontologies, …
  • Artificial intelligence: IR, machine learning (SVM, ILP, …), Natural Language Processing (Shallow Parser), Case Frames, …
  • Software development: databases, web server, GUI, …

Future BioMinT developments
• Integration of the BioMinT prototype in the future annotation environment of Swiss-Prot & PRINTS
• Release Q4-2005
  • Free web-based version, with restrictions on:
    • simultaneous users
    • resources per user (computing & storage)
• Customization services provided by PharmaDM:
  • Integration into the researcher’s IT environment (e-mail alerts, …)
  • Mining in-house document collections
  • Combination with DMax data analysis software
  • Incorporation of highly specialized background knowledge (ontologies, thesauri, biological dictionaries, etc.)
  • Custom reports and GUI, etc.

WWW
• BioMinT home page: http://www.biomint.org
• GPSDB synonyms database: http://biomint.oefai.at
• BioMinT prototype Quick Tour: http://biomint-server.pharmadm.com:8080/xwiki/bin/view/BioMinT/ProtopQuickTour

Acknowledgements
Melanie Hilario, Jee-Hyub Kim, Walter Daelemans, Jo Meyhi, Frederik Durant, Terri Attwood, Alex Mitchell, Paul Bradley, Kurt De Grave, Fred Lefever, Walter Luyten, Kristof Van Belleghem, Andre Vandecandelaere, Johann Petrak, Alexander Seewald, Anne-Lise Veuthey, Marc Zehnder, Violaine Pillet, Swiss-Prot Curators
[The slide groups the names under Artificial Intelligence and Biological Sciences]
Interested? Demo? Leave your card at POSTER 49
