Enhanced Similarity Measures for Gene Matching in Biomedicine

ATM 12917635- Oncogene (6.737) 12970738-Oncogene (6.737) 14500819-Nucleic Acids Res. (6.373) 14499692-Science (23.329) STK11 12183403 – Cancer Res (8.30) 12234250 – Biochem J (4.326) 12805220 - EMBO J. (12.459) 11853558- Biochem J (4.326) Set Similarity Measures for Gene Matching Mihail Popescu#, James Keller+, Joyce Mitchell# # Department of Health Management and Informatics;+Department of Electrical and Computer Engineering; University of Missouri-Columbia, Columbia, MO 65211 Example of Similarity Calculation for the Gene Ontology (GO) Dimension Why Similarity Measures? • For a unified clustering approach in a 4D gene space • Gene space dimensions (4D): sequence, microarray expression, literature abstracts (articles), gene ontology (GO) • Two dimensions are numeric (sequence, expression) and two symbolic • The existent symbolic measures are not adequate: • Dice, Jaccard: do not consider the weight of the elements • Maximum and average usually overestimates the or underestimates the similarity, respectively • Example: ATM (human ataxia telangiectasia mutated) and STK11 (serine/threonine kinase 11.) The geneticist assessed these two genes as quasi-similar (similarity ~0.5) because: • they both have protein serine/threonine kinase enzyme activity (they share a kinase domain) • They both cause cancers when mutated, including breast cancer. • s(ATM, STK11)=? (GO dimension) • Algorithm: • 1. Retrieve LocusLink GO annotations: • ATM={4674: “ protein serine/threonine kinase activity”, 3677: ” DNA binding”, 4428 ” inositol/phosphatidylinositol kinase activity”, 7131 : ” meiotic recombination”, 6281 : ” DNA repair”, 7165: ” signal transduction”, 5634: ” nucleus”, 16740: ” transferase activity”, 45786: ” negative regulation of cell cycle”} • STK11={5524: “ ATP binding”, 4674: ” protein serine/threonine kinase activity”, 6468: ” protein amino acid phosphorylation”, 16740: ” transferase activity”} • 2. Compute GO term densities using the Resnik formula [4], the normalized version [.] or the depth in the hierarchy (.) • Calculate the confidence of the pair g(A1, A2) =g(A1)*g(A2) and normalize using maximum value: • The pair-wise similarity values calculated using FMS are: • Similarity calculation: • Using weighted average: s(ATM, STK11)=0.37 • Using Choquet integral: s(ATM, STK11)=0.53 Conclusions • For the GO dimension, the best method of assigning densities was normalizing the information content [4] by the maximum value • The proposed fuzzy similarity measure (FMS) agrees better with our intuition of similarity: if the common elements have a high confidence, then the similarity is stronger. In addition, the non common terms have also a contribution to the similarity since the measure is computed apriori for each term set. • The Choquet similarity measure is much more general, depending only on the fuzzy measure. In addition the optimal fuzzy measure can be learned from examples. • 3. Compute the similarity: Possible similarity measures Example of Similarity Calculation for the Retrieved Abstracts Dimension Expression dimension: real measures (Euclidian measure, etc…). Sequence dimension: sequence similarity measure (Smith-Waterman, Needleman-Wunsch, etc…) GO and Abstract dimension: set similarity. Acknowledgements This research was supported by National Library of Medicine Biomedical and Health Informatics Research Training grant 2-T15-LM07089-11. Set similarity measures References • Set similarity: given two gene products, G1 and G2, we can consider them as being represented by collections of terms: Based on the two sets, the goal is to define a natural similarity between G1 and G2 and , denoted as : • Two types of set similarity: • element based (Dice, Jaccard, Cosine, fuzzy measure) • pair of elements based (Maximum, Average, OWA, Choquet) [1] C.D. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 2001. [2] R. Yager, “Criteria Aggregation Functions Using Fuzzy Measures and the Choquet Integral”, Int. Jour. of Fuzzy Systems, Vol.1, No. 2, December 1999. [3] J.J. Jiang, D.W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Ontology”, Proc. of Int. Conf. Research on Comp. Linguistics X, 1997, Taiwan. [4] P.W. Lord, R.D. Stevens, A. Brass, C.A. Goble, “Semantic similarity measure as a tool for exploring the gene ontology”, In Pacific Symposium on Biocomputing, pages 601-612, 2003. [5] M. Sugeno, Fuzzy measures and fuzzy integrals: a survey, (M.M. Gupta, G. N. Saridis, and B.R. Gaines, editors) Fuzzy Automata and Decision Processes, pp. 89-102, North-Holland, New York, 1977. [6] S. Raychaduri, R.B. Altman, “A literature-based method for assessing the functional coherence of a gene group”, Bioinformatics, 19(3), pp. 396:401, Feb. 2003. [7]. M. Grabisch, T. Murofushi, and M. Sugeno (eds.), Fuzzy Measures and Integrals: Theory and Applications, Springer-Verlag, 2000. [8]. Hvidsten TR, Komorowski J, Sandvik AK, Laegreid A. Predicting gene function from gene expressions and ontologies. Pac Symp Biocomput. 2001;:299-310. [9]. Trupti Joshi. Cellular function prediction for hypothetical proteins using high-throughput data. MS thesis, University of Tennessee, Knoxville, 2003. [10]. Keller J, Popescu M, Mitchell J. Soft Computing Tools for Gene Similarity Measures in Bioinformatics, FLINT-CIBI 2003, Berkeley, Dec 15-18, 2003. • s(ATM, STK11)=? (Abstract dimension) • Algorithm: • Retrieve PubMed abstracts for ATM, STK11 • Calculate all the pair-wise distances based on the MeSH indexing • Keep the 4 best-matching pairs • Find the impact factor for each journal: g(Ai), i=1…8

Enhanced Similarity Measures for Gene Matching in Biomedicine

Enhanced Similarity Measures for Gene Matching in Biomedicine

Presentation Transcript

Intermediate Algebra 098A

Regulation of Gene Expression

MEDICAL ANTHROPOLOGY

Statistics

Warm Up Evaluate each expression for x = 1 and y =–3. 1. x – 4 y 2. –2 x + y

Gene Expression Analysis

Regulation of Gene Expression

Chapter 18

Chapter 18

Chapter 18

4. Gene Expression Data Analysis

Chapter 3

Chapter 4 Trigonometry

From DNA to Protein: Gene Expression

CHAPTER 3 Data Description

Data Description

Welcome