
Bioinformatics, July 2003. P.W. Lord, R.D. Stevens, A. Brass and C.A. Goble, University of Manchester

Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Presented by 임 동혁, July 22, 2005.





Presentation Transcript


  1. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, July 2003. P.W. Lord, R.D. Stevens, A. Brass and C.A. Goble, University of Manchester. Presented by 임 동혁, July 22, 2005

  2. Contents • Introduction • Semantic Similarity Measures • Validating Semantic Similarity • Investigating Semantic and Sequence Similarity • Semantic Searching of GO Annotated Resources • Discussion

  3. Introduction • Bioinformatics resources • In the form of sequences, which are then annotated • Annotations are written in scientific natural language, as text • Human readable and understandable • Not easy to interpret computationally • Ontologies • Provide a mechanism for capturing a view of a domain in a shareable form • Both accessible to humans and computationally amenable • Provide a set of vocabulary terms that label concepts in the domain • “is-a” relationship between parent and child • “part-of” relationship between part and whole

  4. Gene Ontology (1/2) • GO comprises three orthogonal taxonomies, or aspects • Molecular function • Biological process • Cellular component • GO is a rapidly growing collection of about 11000 phrases, representing terms or concepts • Structured as a directed acyclic graph (DAG)

  5. Gene Ontology (2/2) • Allows improved querying of databases • Different resources can be queried with the same term • A shared understanding improves retrieval consistency across resources, as well as recall and precision • One obvious alternative way to query • Ask for proteins semantically similar to a query protein • Semantic similarity • Defined over a taxonomy of biomedical terms • Ex) Medical Subject Headings (MeSH): similar content (by words)

  6. The Gene Ontology (figure: fragment of the molecular function DAG; each node shows its GO accession and probability p, and nodes are connected by “is-a” links) • molecular function GO:0003674 p=1 • signal transducer GO:0004871 p=0.208 • receptor GO:0004872 p=0.124 • transmembrane receptor GO:0004888 p=0.0997 • photoreceptor GO:0009881 p=0.000433 • receptor-associated protein GO:0016962 p=0.00159 • receptor signaling protein GO:0005057 p=0.0281 • ligand GO:0005102 p=0.0460 • chaperone GO:0003754 p=0.0102 • Two proteins both annotated as “transmembrane receptor” (GO:0004888) have similar semantic descriptions • They are semantically less similar if one is annotated as just “receptor” (GO:0004872)

  7. Semantic Similarity Measure (1/3) • Early techniques (Rada et al., 1989) • Path distances between terms • Assume that all semantic links are of equal weight • This is a poor assumption • Ex) “photoreceptor” and “transmembrane receptor” are semantically more closely related than “chaperone” and “signal transducer”
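As a minimal sketch of the path-distance idea on the slide above, the snippet below counts edges on the shortest path between two terms. Only the receptor → signal transducer → molecular function chain is taken from the slides; the remaining “is-a” links and the function name `path_distance` are assumptions made for illustration.

```python
from collections import deque

# Toy "is-a" links (child -> parents). The receptor -> signal transducer ->
# molecular function chain follows the slides; the other links are assumed.
IS_A = {
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
    "photoreceptor": ["receptor"],
}

def path_distance(a, b, is_a=IS_A):
    """Number of edges on the shortest path between a and b, ignoring direction."""
    # Build an undirected adjacency map from the child -> parents links.
    neighbours = {}
    for child, parents in is_a.items():
        for parent in parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    # Breadth-first search: every edge has the same weight.
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two terms are not connected
```

Under these assumed links, both pairs from the slide are two edges apart, so equal-weight path counting cannot tell the closely related pair from the loosely related one.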

  8. Semantic Similarity Measure (2/3) • Edges could be weighted • The greater the distance from the root of the graph, the more specific the terms • However, GO varies widely in the distance of nodes from the root • Ex) (GO:0005300) is 14 terms deep, (GO:0008435) is only 3 terms deep • Yet the latter is not significantly less semantically precise

  9. Semantic Similarity Measure (3/3) • Usage of terms within the corpus (Resnik, 1999) • Uses the notion of “information content” • Familiar from most internet search engines • Ex) “chaperone” is a more informative term than “signal transducer” • The former is used several times, the latter thousands of times • When GO:0004872 occurs, GO:0004871 and GO:0003674 have also occurred (“is-a” links are considered)

  10. Probabilities in the Gene Ontology • Each node is annotated with its GO accession and the probability of this term occurring in the SWISS-PROT-Human database • 1. Count the number of times each concept occurs • 2. A concept occurs if the term, or any of its children, occurs • 3. The probability, p(c), for each node is this count divided by the number of times the root concept occurs (so the probability of the root node is 1)
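The three counting steps above can be sketched as follows. This is an illustrative sketch only: the “is-a” links, the annotation list and the function names `ancestors` and `term_probabilities` are made up for the example and are not taken from SWISS-PROT or from the paper's implementation.

```python
from collections import Counter

# Toy "is-a" links (child -> parents); assumed for illustration.
IS_A = {
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
}

def ancestors(term, is_a):
    """All terms reachable from `term` by following is-a links upwards."""
    found, stack = set(), [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def term_probabilities(annotations, is_a, root):
    """p(c) = occurrences of c (including its descendants) / occurrences of root."""
    counts = Counter()
    for term in annotations:
        counts[term] += 1                 # step 1: the term itself occurs
        for anc in ancestors(term, is_a):
            counts[anc] += 1              # step 2: so does every ancestor
    return {t: n / counts[root] for t, n in counts.items()}  # step 3

# Hypothetical annotation occurrences in a corpus.
annotations = ["receptor", "receptor", "chaperone",
               "transmembrane receptor", "signal transducer"]
p = term_probabilities(annotations, IS_A, "molecular function")
```

Collecting ancestors in a set means a term that reaches the same ancestor by several paths still counts it only once per occurrence, which matters because GO is a DAG rather than a tree.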

  11. Semantic Similarity between terms • Uses the simplest of the measures (Resnik, 1999) • Based on the information content of the shared parents of the two terms • S(c1, c2) is the set of parental concepts shared by both c1 and c2 • The minimum p(c) is taken because GO allows multiple paths • p_ms (probability of the minimum subsumer): p_ms(c1, c2) = min over c in S(c1, c2) of p(c) • Similarity score between two terms: sim(c1, c2) = -ln p_ms(c1, c2) • As probability increases, informativeness decreases
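A minimal sketch of the Resnik-style score, sim(c1, c2) = -ln p_ms(c1, c2), where p_ms is the smallest p(c) among the shared subsumers. The p values are the ones listed on the Gene Ontology figure slide; the “is-a” links other than the receptor chain are assumptions, as is the choice to let a term count as a subsumer of itself.

```python
import math

# p(c) values as listed on the Gene Ontology figure slide.
P = {
    "molecular function": 1.0,
    "signal transducer": 0.208,
    "receptor": 0.124,
    "transmembrane receptor": 0.0997,
    "photoreceptor": 0.000433,
    "chaperone": 0.0102,
}

# Assumed "is-a" links (child -> parents); only the receptor chain is
# stated explicitly in the slides.
IS_A = {
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
    "photoreceptor": ["receptor"],
}

def subsumers(term):
    """The term itself plus all of its ancestors (an assumption of this sketch)."""
    found, stack = {term}, [term]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def resnik_similarity(c1, c2):
    """-ln of the minimum probability among the shared subsumers."""
    shared = subsumers(c1) & subsumers(c2)   # S(c1, c2); never empty here (root)
    p_ms = min(P[c] for c in shared)         # probability of the minimum subsumer
    return -math.log(p_ms)
```

With these inputs, “transmembrane receptor” and “photoreceptor” share “receptor” (p = 0.124) and score about 2.09, while “chaperone” and “signal transducer” share only the root (p = 1) and score 0, matching the intuition on the earlier slide.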

  12. Validating Semantic Similarity • How do we validate such a measure? • A protein’s sequence relates to its function • Highly similar sequences should be highly semantically similar • Taking protein sequences in pairs and plotting sequence similarity against semantic similarity should show a relationship

  13. Adapting the Similarity Measures to GO and SWISS-PROT • “part-of” relationship • Orphan terms • Linked directly to the root • Ex) GO:0009542 • “is-a” links alone • Proteins may be annotated with more than a single term • WordNet: maximum similarity • GO: average similarity
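One way to read the “average similarity” bullet above is to score every pair of terms, one from each protein, and take the mean; the maximum is shown alongside for contrast with the WordNet-style choice. Averaging over all pairs is an assumption of this sketch, and the term names and scores in the toy table are hypothetical.

```python
from itertools import product

def protein_similarity(terms_a, terms_b, term_sim, combine="average"):
    """Combine term-to-term scores for two proteins annotated with several terms."""
    scores = [term_sim(a, b) for a, b in product(terms_a, terms_b)]
    if not scores:
        return 0.0  # at least one protein has no annotation
    if combine == "max":
        return max(scores)
    return sum(scores) / len(scores)

# Hypothetical term-to-term scores, standing in for a real similarity measure.
TOY_SCORES = {
    frozenset(["term x", "term y"]): 2.0,
    frozenset(["term x", "term z"]): 0.0,
}

def toy_term_sim(a, b):
    return TOY_SCORES[frozenset([a, b])]

average_score = protein_similarity(["term x"], ["term y", "term z"], toy_term_sim)
maximum_score = protein_similarity(["term x"], ["term y", "term z"], toy_term_sim, "max")
```

The average penalises a protein that carries an unrelated extra annotation, whereas the maximum ignores it.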

  14. Comparing Semantic Similarity Across GO Aspects • There is a good correlation between sequence similarity and semantic similarity • The correlation is greatest when measured against the “molecular function” aspect

  15. The Relationship Between Semantic Similarity and Evidence Codes • TAS (Traceable Author Statement): regarded as the highest standard of evidence • When only TAS GO annotations are considered, the correlation is much greater

  16. Effect of Using Semantic Links in Semantic Similarity • Consider only links of a single type, “is-a” or “part-of” • Little difference between all links and “is-a” alone: almost all links are of the “is-a” type (6167 / 6202) • No links drop in the middle part : proteins share similar (links are included in semantic similarity measure)

  17. Analysis (1/2) • Cases of very high semantic similarity but little sequence similarity • “Polymorphic” groups • Two or more classes of protein involved in the same process • Heterodimers or sub-families • Hypervariable protein families • arbitrary • Mis-annotations • SWISS-PROT “x-like” but in GO “x” • Spelling mistakes

  18. Analysis(2/2) - Example

  19. Semantic Searching of GO Annotated Resources • Developed a search tool • Compares a given query protein against all the others in SWISS-PROT-Human • Generates a ranked list of semantically similar proteins • Ex) “OPSR_HUMAN”
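The search described above amounts to scoring the query protein against every other protein and sorting by score. The sketch below shows only that ranking step. The annotations given for OPSR_HUMAN, the entries PROT_A and PROT_B, and the shared-term count used as the score are all hypothetical stand-ins; a real tool would plug in a semantic similarity measure instead.

```python
def rank_by_similarity(query, annotations, protein_sim):
    """Return (protein, score) pairs for all non-query proteins, best first."""
    scored = [
        (name, protein_sim(annotations[query], terms))
        for name, terms in annotations.items()
        if name != query
    ]
    # Highest score first; ties are broken alphabetically by protein name.
    return sorted(scored, key=lambda item: (-item[1], item[0]))

def shared_term_count(terms_a, terms_b):
    """Stand-in score: number of identical terms (not a semantic measure)."""
    return len(set(terms_a) & set(terms_b))

# Hypothetical annotations, for illustration only.
ANNOTATIONS = {
    "OPSR_HUMAN": ["photoreceptor", "transmembrane receptor"],
    "PROT_A": ["photoreceptor", "receptor"],
    "PROT_B": ["chaperone"],
}

ranking = rank_by_similarity("OPSR_HUMAN", ANNOTATIONS, shared_term_count)
```

Passing the similarity function as an argument keeps the ranking logic independent of which measure is used.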

  20. Discussion • Investigated semantic similarity measures • In all cases semantic similarity is correlated with sequence similarity • GO aspect: molecular function • Evidence code: “Traceable Author Statement” • Future work • Effect of the different semantic links in ontologies • Co-expression as revealed by microarray experiments • Expect that the biological process aspect would be of great use
