This research paper explores the methodology for discovering biologically meaningful associations between controlled vocabulary terms in the biomedical web. It focuses on human genes, diseases, and genomics, using metrics and semantic knowledge to identify meaningful associations.
 
                
1 April 2009 Woei-Jyh (Adam) Lee, Ph.D. Center for Bioinformatics and Computational Biology University of Maryland, U.S.A. National Center for Biotechnology Information National Institutes of Health, U.S.A.
Research Experiences • New York University • Morgan Stanley • AT&T Labs, AT&T • Bell Laboratories, Lucent Technologies • University of Southern California • University of Maryland • National Institutes of Health, U.S.A. Distributed computing, parallel computing Web performance measurement Component object model, quality of service Fault tolerance, Internet protocols, policy based management Media streaming, error correction, video on demand Data management and mining, bioinformatics Protein domain parsing, genomics and genetics W.-J. Lee
Mining associations in the annotated biomedical web Woei-Jyh (Adam) Lee, Ph.D. University of Maryland Department of Computer Science; and Institute for Advanced Computer Studies; and Center for Bioinformatics and Computational Biology
LSLink - Life Science Link • http://www.cbcb.umd.edu/research/lslink/ W.-J. Lee
LSLink - Life Science Link • Background • Many Web-accessible data resources for biologists. • publications, genes, diseases, sequences, structures, … • Data records are linked. • genes linked to publications, diseases linked to genes, … • Data records are annotated with controlled vocabulary (CV) terms. • Entrez Gene with GO, PubMed with MeSH, … • Goal: to discover biologically meaningful and yet unknown associations between pairs of CV terms.
Entrez Gene
PubMed
Links from Entrez Gene to PubMed
Links from PubMed to Entrez Gene
Gene Ontology (GO)
Medical Subject Headings (MeSH)
GO Annotations in Entrez Gene
MeSH Annotations in PubMed
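Web of Entrez Gene, OMIM and PubMed [Figure: the data resources Entrez Gene, PubMed and OMIM connected by links and annotated with controlled vocabularies; labels shown include GO, Gene Nomenclature, MeSH, Lash CV, Clinical Synopsis and SNOMED CT. Legend: data resource, link, annotation, controlled vocabulary.]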
Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Metrics to Identify Meaningful Associations • Association between Terms from Two CVs (human genes) • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee
Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Methodology to Generate Datasets • Links and Termlinks • Background and User Query Datasets • Metrics to Identify Meaningful Associations • Association between Terms from Two CVs (human genes) • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee
Approach • Generate and analyze background and user query datasets. • Apply two classes of metrics (association rule mining and hypergeometric distribution) • Filter CV terms (by Major Topic, Semantic Type, etc.). • Rank association pairs of terms from two CVs. • Perform scientist evaluation. W.-J. Lee
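Methodology • Collect data records and links between Entrez Gene (E) records (e1, e2) and PubMed (P) records (p1, p2). • Extract annotations: GO (G) terms for E records and MeSH (M) terms for P records. • Generate termlink instances from a link (e1, p1) and the terms annotating its two records (g1, g2; m1, m2).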
Links versus Termlinks [Figure: GO terms g1 to g6 annotate Entrez Gene records e1 to e3; MeSH terms m1 to m4 annotate PubMed records p1 to p3; links connect Entrez Gene records to PubMed records.] • 1 link: (e1, p2) • 4 termlinks: (g1, m2, e1, p2), (g1, m3, e1, p2), (g6, m2, e1, p2), (g6, m3, e1, p2), because e1 is annotated with g1 and g6 and p2 is annotated with m2 and m3.
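The termlink construction on this slide can be sketched as a cross product of the GO terms on the gene record with the MeSH terms on the linked publication. This is a minimal illustration, not the LSLink implementation; the function name `termlinks` and the dictionary layout are assumptions, and the toy annotations follow the slide's figure (e1 carries g1 and g6, p2 carries m2 and m3).

```python
from itertools import product

def termlinks(link, go_annotations, mesh_annotations):
    """Return every (g, m, e, p) termlink generated by one (e, p) link."""
    e, p = link
    go_terms = go_annotations.get(e, [])      # GO terms annotating gene record e
    mesh_terms = mesh_annotations.get(p, [])  # MeSH terms annotating publication p
    return [(g, m, e, p) for g, m in product(go_terms, mesh_terms)]

# Toy annotations taken from the slide's figure (hypothetical data layout).
go_annotations = {"e1": ["g1", "g6"]}
mesh_annotations = {"p2": ["m2", "m3"]}

result = termlinks(("e1", "p2"), go_annotations, mesh_annotations)
print(result)
```

One link with two GO terms and two MeSH terms yields 2 × 2 = 4 termlinks, which matches the count given on the slide.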
Example Links and Termlinks • Entrez Gene GeneID: 672 (GO: DNA repair; GO: positive regulation of DNA repair) and PubMed PMID: 12242698 (MeSH: BRCA1 Protein; MeSH: BRCA2 Protein). • Entrez Gene GeneID: 675 (GO: DNA repair; GO: mitotic checkpoint) and PubMed PMID: 10749118 (MeSH: Mitosis; MeSH: Neoplasm Proteins). Legend: data resource, data record, link, controlled vocabulary term, termlink.
Human Genes Background Dataset • Retrieve all active human gene records in Entrez Gene. • Filter out records that have been replaced or discontinued. • Filter out records without GO annotations or links to PubMed. • Extract their GO annotations. • Follow all links from these records to PubMed records. • Extract MeSH annotations for the PubMed records reached in the prior step. • Use the most relevant Descriptors/Qualifiers identified as Major Topic. • Filter with selected Semantic Types.
User Query Dataset • We support multiple user scenarios for querying the background dataset. • A user query dataset is a subset of the background dataset that is of interest to scientists. • Individual human gene record: APOE, CFTR, etc. • BRCA1/BRCA2-containing complex: Early Onset Breast Cancer • Human genes and genetic disorders: Breast Cancer, Colorectal Cancer, Prostate Cancer, etc. • $(G, M, E', P') \subseteq (G, M, E, P)$ where $E' \subseteq E$ and $P' \subseteq P$
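As a rough sketch of the subset relation on this slide, a user query dataset can be obtained by keeping only the background termlinks whose gene record (and optionally publication record) falls in the set of interest. The function name `user_query_dataset` and the toy termlinks are hypothetical and not taken from LSLink.

```python
def user_query_dataset(background, genes, pubs=None):
    """Keep termlinks (g, m, e, p) with e in `genes` and, if given, p in `pubs`."""
    return [
        t for t in background
        if t[2] in genes and (pubs is None or t[3] in pubs)
    ]

# Hypothetical background dataset of (g, m, e, p) termlinks.
background = [
    ("g1", "m2", "e1", "p2"),
    ("g1", "m3", "e1", "p2"),
    ("g2", "m1", "e2", "p1"),
    ("g4", "m3", "e3", "p3"),
]

subset = user_query_dataset(background, genes={"e1", "e3"})
print(subset)
```

Because the result is produced only by filtering, it is a subset of the background dataset by construction.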
Example Human Gene User Query Datasets W.-J. Lee
Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Metrics to Identify Meaningful Associations • Two Classes of Metrics • Datasets and Subsets for Evaluation of Metrics • Distribution of Confidence Scores and P-values • Agreement and Disagreement Analysis • Association between Terms from Two CVs (human genes) • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee
Two Classes of Metrics Used to Identify Potential Meaningful Associations • Association rule mining. • Used in data mining. • Support and confidence scores [Agrawal et al. 1993]. • Hypergeometric distribution. • Used in hypothesis testing. • P-value [Sokal and Rohlf 1969]. W.-J. Lee
Definition of Probabilities • Term probability: • Link probability: • Conditional probability: W.-J. Lee
Association Rule Mining • Support score reflects the probability of an association annotated with some pair of CV terms. • Confidence score reflects the conditional probability of an association annotated with some pair of CV terms, given that associations are annotated with either of the CV terms. W.-J. Lee
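For orientation, the textbook forms of support and confidence can be sketched over a list of termlinks as follows. This assumes the standard association-rule definitions (support as the fraction of termlinks carrying the pair, confidence as that count divided by the count for the GO term alone); the exact formulas used on the slides, including the term-frequency correction and log operator, are not reproduced here, and the function name and toy data are hypothetical.

```python
from collections import Counter

def support_confidence(termlinks):
    """Textbook support and confidence (given the GO term) for each (g, m) pair."""
    n = len(termlinks)
    pair_counts = Counter((g, m) for g, m, _, _ in termlinks)
    go_counts = Counter(g for g, _, _, _ in termlinks)
    return {
        (g, m): {"support": c / n, "confidence": c / go_counts[g]}
        for (g, m), c in pair_counts.items()
    }

# Hypothetical termlinks: the pair (g1, m2) occurs twice, g1 occurs three times.
toy = [
    ("g1", "m2", "e1", "p2"),
    ("g1", "m3", "e1", "p2"),
    ("g6", "m2", "e1", "p2"),
    ("g1", "m2", "e2", "p3"),
]
scores = support_confidence(toy)
print(scores[("g1", "m2")])
```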
Support and Confidence with Correction • We incorporate term-frequency correction and apply log operator (both are novel to our research). W.-J. Lee
Hypergeometric Distribution • P-value tests the over-representation of an association in a user query dataset. W.-J. Lee
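An over-representation P-value of this kind is commonly computed as the upper tail of a hypergeometric distribution. The sketch below assumes the following setup, which is not spelled out on the slide: the background dataset holds N termlinks, K of them carry the pair of interest, the user query dataset holds n termlinks, and k of those carry the pair. The function name is hypothetical.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n)."""
    upper = min(K, n)  # the pair cannot occur more often than K or n times
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Toy numbers: 10 background termlinks, 4 with the pair; query of 3, 2 with the pair.
p = hypergeom_pvalue(N=10, K=4, n=3, k=2)
print(p)
```

A small P-value indicates that the pair occurs in the user query dataset more often than expected from its frequency in the background dataset.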
Example User Query Datasets for Evaluation of Metrics W.-J. Lee
Relationship among Subsets of Associations in a User Query Dataset User query dataset: early onset breast cancer in human W.-J. Lee
Distribution of Confidence Scores for Early Onset Breast Cancer in Human • Confidence scores of most associations in the singleton and local-non-singleton subsets are higher than 3.
Distribution of P-values for Early Onset Breast Cancer in Human • P-values in the singleton and local-non-singleton subsets appear as a step function.
Overlap between Confidence Score and P-value Ranks • For 50%, the overlap is significant, ranging from 83.8% (F5) to 92.6% (CTNNB1).
Overlap between Top-X Confidence Score and Top-K% P-value Ranks for Early Onset Breast Cancer in Human • The overlap between Top-X confidence score ranks and Top-20% P-value ranks is mostly larger than 50% (of X).
Kendall’s τ between Confidence Score and P-value Ranks • For 50%, Kendall’s τ is smaller than 0.5, ranging from 0.452 (F5) to 0.485 (FGD4).
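The two agreement measures on these slides can be illustrated with a small sketch. It assumes both rankings cover the same associations with no ties, uses the same cutoff fraction for both lists in the overlap, and computes the tau-a variant of Kendall's coefficient; the variant and cutoffs used in the study are not stated on the slides, and the function names are hypothetical.

```python
from itertools import combinations

def top_overlap(rank_a, rank_b, fraction):
    """Share of the top-`fraction` items of rank_a that are also top in rank_b."""
    k = max(1, int(len(rank_a) * fraction))
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two tie-free rankings of the same items."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):  # x is ranked before y in rank_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / pairs

# Toy rankings of four associations, best first.
by_confidence = ["a", "b", "c", "d"]
by_pvalue = ["a", "c", "b", "d"]

overlap = top_overlap(by_confidence, by_pvalue, 0.5)
tau = kendall_tau(by_confidence, by_pvalue)
print(overlap, tau)
```

The overlap looks only at membership in the top portion of each list, while Kendall's τ is sensitive to the order of every pair, which is one reason a high overlap can coexist with a τ below 0.5.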
Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Metrics to Identify Meaningful Associations • Association between Terms from Two CVs (human genes) • Discovery Tool • Human Expert Evaluation • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee
Select A Human Gene Symbol http://www.cbcb.umd.edu/research/lslink/lodgui/
Select CV Type: GO or MeSH
Select A GO or MeSH Term
View Associations as A Group
Association with the Highest Score
Association around Average Score
Associations above A Cutoff Score