This ISMB/ECCB 2007 presentation describes a procedure for mapping proteins to disease terminologies, in support of drug development, clinical patient care, and bioinformatics research. The aim is to index UniProtKB/Swiss-Prot with the MeSH terminology, giving clinical researchers better access to biomedical knowledge.
ISMB/ECCB 2007 – Bio-Ontologies – Vienna, July 20
UniProt to MeSH: mapping proteins to disease terminologies
Yum L. Yip, Anaïs Mottaz, Patrick Ruch, Anne-Lise Veuthey
The role of bioinformatics in biomedical research and future clinical patient care
• Basic research:
  • What is the mechanism?
  • Epidemiological studies
• Drug development
• Clinical trials
• Up-to-date knowledge and large-scale results:
  • Research direction
  • New hypothesis
• Basic research results stored in databases
• Health problem in a patient
• Bioinformatics:
  • Data storage and representation
  • Large-scale data generation
  • Large-scale data analysis
• Clinical patient care: doctor prescribes an individualized treatment plan. Treatment outcome
• Molecular-level decision-support tools:
  • Structured knowledge representations
  • ‘Filtered’ information on fundamental biological mechanisms and significant
Bio-Ontologies –ISMB 2007
Biomedical knowledge: a protein-centric view
• Disease: pathology, diagnosis/prognosis, treatment, risk factor
• Biological processes: biological pathway/network, protein-protein interaction
• Proteins: sequence, function, structure, modifications
• Genes: sequence, chromosomal location, regulation, expression
Biomedical knowledge: a protein-centric view
Disease: pathology, diagnosis/prognosis, treatment, risk factor
• Disease annotation:
  • Link to 12,603 OMIM entries
  • Link to other specialized databases
  • 32,921 variants (or polymorphisms)
  • >3,000 associated diseases
Biological processes: biological pathway/network, protein-protein interaction
• Biological process/proteomic:
  • Pathway annotation
  • Protein-protein interaction (DIP, IntAct)
  • Protein 2D gel (Swiss-2DPAGE)
Proteins: sequence, function, structure, modifications
• High-quality manual annotation: protein name, sequence, function, domain, features and references
• 16,702 human proteins
Genes: sequence, chromosomal location, regulation, expression
• Genomic data:
  • Genew, GeneCards, GenAtlas
  • Expression data (e.g. CleanEx)
  • Genome details: Ensembl
References
• Links to >100 other databases
• Over 82,420 journal references
Objective
Increase the accessibility of molecular biology resources to clinical researchers by indexing UniProtKB/Swiss-Prot with the MeSH terminology
Why UniProtKB/Swiss-Prot?
• Most comprehensive warehouse of protein sequences
• High level of annotation, and highly cross-linked with other biological databases
• Includes data on more than 30,000 variants, mostly c-SNPs (coding SNPs) or SAPs (single amino-acid polymorphisms)
• Describes more than 3,000 diseases associated with a protein (mostly genetic diseases associated with SAPs)
http://beta.uniprot.org/
Disease annotation
UniProtKB/Swiss-Prot entry P35240
Why MeSH?
• Controlled vocabulary thesaurus structured as a hierarchy of concepts
• Each concept includes a set of terms (synonyms and lexical variants)
• MeSH is part of the UMLS and is thus linked to other medical terminologies
• MeSH is used to index the biomedical literature
The structure of MeSH
Mapping procedure
• Swiss-Prot branch: UniProtKB/Swiss-Prot entry, disease comment line, extracted disease name
• OMIM branch: title/alternative titles
• Each branch is matched against MeSH: exact match, partial match
• Same descriptor
Disease extraction
• Extraction using regular expressions: ‘are the cause of’, ‘involved in’, etc.
• MeSH: ‘Neurofibromatosis 2’
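The extraction step can be illustrated with a minimal sketch. It is not the authors' implementation: only the two trigger phrases come from the slide, while the stopping rule for the disease name, the function name `extract_disease_names`, and the wording of the sample comment line are assumptions made for illustration.

```python
import re

# Trigger phrases quoted on the slide; its "etc." means the real list is longer.
TRIGGERS = ["are the cause of", "involved in"]

# Assumption: a disease name runs from the trigger phrase up to the first
# parenthesis, bracket, period, semicolon, or the end of the text.
PATTERN = re.compile(
    r"(?:" + "|".join(re.escape(t) for t in TRIGGERS) + r")\s+"
    r"(?P<name>.+?)(?=\s*(?:[(\[.;]|$))",
    re.IGNORECASE,
)


def extract_disease_names(comment):
    """Return the lower-cased disease names found after a trigger phrase."""
    return [m.group("name").strip().lower() for m in PATTERN.finditer(comment)]


# Illustrative comment line (wording assumed, not copied from the database).
sample = "Defects in NF2 are the cause of neurofibromatosis 2 (NF2) [MIM:101000]."
print(extract_disease_names(sample))
```

A pattern this simple also shows why extraction can go wrong: anything between the trigger and the first delimiter is taken as the disease name, including phrases such as "tumors such as ...".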
Term matching procedure
• Exact matches: same length, same word order, case insensitive
• Partial matches: calculation of a similarity score between terms, based on the IDF (inverse document frequency) used in information retrieval. The term with the highest score was chosen.
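The two matching modes can be sketched as follows. The scoring formula of the original slide is not preserved in this transcript, so the score used here (an IDF-weighted word overlap) is an assumption, as are the function names, the smoothing term, and the default threshold of 0.5.

```python
import math
from collections import Counter


def tokenize(term):
    """Lower-case a term and split it into words (commas are ignored)."""
    return term.lower().replace(",", " ").split()


def build_idf(vocabulary):
    """IDF of each word over the vocabulary terms (the +1 is an assumed smoothing)."""
    n = len(vocabulary)
    df = Counter(w for term in vocabulary for w in set(tokenize(term)))
    return {w: math.log(n / count) + 1.0 for w, count in df.items()}


def similarity(query, candidate, idf, default):
    """Assumed score: IDF weight of shared words over IDF weight of all words."""
    q, c = set(tokenize(query)), set(tokenize(candidate))
    weight = lambda words: sum(idf.get(w, default) for w in words)
    total = weight(q | c)
    return weight(q & c) / total if total else 0.0


def map_term(query, vocabulary, threshold=0.5):
    """Try an exact match first, then keep the best partial match above threshold."""
    if not vocabulary:
        return None, 0.0
    for term in vocabulary:
        # Exact match: same number of words, same order, case-insensitive.
        if tokenize(term) == tokenize(query):
            return term, 1.0
    idf = build_idf(vocabulary)
    default = math.log(len(vocabulary)) + 1.0  # weight for words unseen in the vocabulary
    best = max(vocabulary, key=lambda t: similarity(query, t, idf, default))
    score = similarity(query, best, idf, default)
    return (best, score) if score >= threshold else (None, score)
```

With a toy vocabulary such as `["Neurofibromatosis 2", "Neurofibromatosis 1"]`, the query `"neurofibromatosis type 2"` has no exact match but is still mapped to `"Neurofibromatosis 2"`, because the rare word "2" carries more weight than the shared word "neurofibromatosis".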
Benchmark
92 disease names from 43 Swiss-Prot entries manually mapped to MeSH terms
• Used to evaluate the procedure in terms of recall and precision
• Used to set up a score threshold
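How a benchmark of this kind can yield precision, recall, and a score threshold is sketched below. The definitions are assumptions: precision is taken as correct mappings over proposed mappings, recall as correct mappings over all benchmark names, and the threshold rule (best recall among thresholds that keep a minimum precision) is one plausible choice, not necessarily the one the authors used.

```python
def evaluate(gold, predicted):
    """gold: name -> manual MeSH term; predicted: name -> proposed term or None."""
    proposed = {name: term for name, term in predicted.items() if term is not None}
    correct = sum(1 for name, term in proposed.items() if gold.get(name) == term)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def pick_threshold(gold, scored, thresholds, min_precision):
    """scored: name -> (best term, score). Returns (threshold, precision, recall)
    with the highest recall among thresholds reaching min_precision, or None."""
    best = None
    for threshold in sorted(thresholds):
        predicted = {
            name: (term if score >= threshold else None)
            for name, (term, score) in scored.items()
        }
        precision, recall = evaluate(gold, predicted)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best
```

Raising the threshold discards low-scoring partial matches, which trades recall for precision; the benchmark makes that trade-off measurable.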
Results on the Benchmark
Analysis of the results (1/3)
• Problems in granularity difference
Disease: ‘muscle-eye-brain disease’
MeSH term, manual mapping: ‘abnormalities, multiple’
MeSH term, automatic mapping: ‘muscle liver brain eye nanism’
Analysis of the results (2/3)
• Problems in disease name extraction
Disease (extracted): ‘hematopoietic tumors such as b-cell lymphomas’
MeSH term, manual mapping: ‘hematologic neoplasms’
MeSH term, automatic mapping: ‘b-cell lymphoma’
Analysis of the results (3/3)
• Problems inherent to the resources
Disease (Swiss-Prot): ‘epidermolysis bullosa simplex, Weber-Cockayne type’
Disease (OMIM alternative title): ‘epidermolysis bullosa dystrophica, Cockayne-Touraine type’
MeSH term, manual mapping: ‘epidermolysis bullosa dystrophica’
MeSH term, automatic mapping: ‘epidermolysis bullosa simplex’
Results on all Swiss-Prot
Discussion
• The mapping system was tuned for high precision to provide a fully automated procedure.
• But we need to improve the recall by:
  • Including NLP techniques in the disease extraction and matching procedures;
  • Refining the score with other parameters (e.g. information from the hierarchical structure of MeSH);
  • Permitting a mapping to several MeSH terms;
  • Trying to map to other terminologies such as ICD-10 and SNOMED CT;
  • Using information from the literature, which is indexed with MeSH terms.
Work in progress
• Benchmark extended to 200 diseases
Work in progress
• Extract MeSH terms using full text from disease comment lines + references in Swiss-Prot + references in OMIM, and calculate their frequency
• This frequency is used to refine the score for partial matches
Preliminary results: the recall was successfully increased to 62% without losing precision.
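The idea of refining a partial-match score with term frequencies can be sketched as below. The slide does not say how the frequency enters the score, so the additive combination, the `weight` parameter, and both function names are assumptions for illustration only.

```python
from collections import Counter


def term_frequencies(annotations):
    """annotations: one list of MeSH terms per text or reference linked to an entry.
    Returns the relative frequency of each term across all of them."""
    counts = Counter(term for terms in annotations for term in terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}


def refine_scores(scores, frequencies, weight=0.5):
    """Assumed refinement: add a frequency bonus to each partial-match score."""
    return {
        term: score + weight * frequencies.get(term, 0.0)
        for term, score in scores.items()
    }
```

For example, if two candidate terms tie on lexical similarity but only one of them appears among the MeSH terms of the entry's references, the frequency bonus breaks the tie in favour of that term.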
Conclusion
• We developed a generic terminology mapping procedure which can be used to link various biomedical resources.
• Indexing UniProtKB with medical terms opens new possibilities for searching and mining data relevant to clinical research.
• These results will help improve the interoperability between medical informatics and bioinformatics.