Bioinformatics, July 2003 P.W. Lord, R.D. Stevens, A. Brass and C.A. Goble University of Manchester

Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation Bioinformatics, July 2003 P.W. Lord, R.D. Stevens, A. Brass and C.A. Goble University of Manchester Presented by 임 동혁 July 22, 2005

Contents • Introduction • Semantic Similarity Measures • Validating Semantic Similarity • Investigating Semantic and Sequence Similarity • Semantic Searching of GO Annotated Resources • Discussion

Introduction • Bioinformatics resources • In the form of sequences, which are then annotated • Annotation is written as text in scientific natural language • Human readable and understandable • Not easy to interpret computationally • Ontologies • Provide a mechanism for capturing a view of a domain in a shareable form • Both accessible to humans and computationally amenable • Provide a set of vocabulary terms that label concepts in the domain • “is-a” relationship between parent and child • “part-of” relationship between part and whole

Gene Ontology(1/2) • GO comprises three orthogonal taxonomies, or aspects • Molecular function • Biological process • Cellular component • GO is a rapidly growing collection of about 11,000 phrases representing terms or concepts • Structured as a directed acyclic graph (DAG)

Gene Ontology(2/2) • Allows improved querying of databases • Different resources can be queried with the same term • A shared understanding improves retrieval consistency across resources, as well as recall and precision • One obvious alternative way • Ask for proteins semantically similar to a query protein • Semantic similarity • Taxonomy of biomedical terms • Ex) Medical Subject Headings (MeSH): similar content (by words)

The Gene Ontology (figure: a fragment of the “is-a” hierarchy, each node labeled with its GO accession and probability p) • molecular function GO:0003674 p=1 • signal transducer GO:0004871 p=0.208 • receptor GO:0004872 p=0.124 • transmembrane receptor GO:0004888 p=0.0997 • photoreceptor GO:0009881 p=0.000433 • receptor-associated protein GO:0016962 p=0.00159 • receptor signaling protein GO:0005057 p=0.0281 • chaperone GO:0003754 p=0.0102 • ligand GO:0005102 p=0.0460 • Two proteins both annotated as “transmembrane receptor” (GO:0004888): similar semantic description • One annotated as just “receptor” (GO:0004872): semantically less similar

Semantic Similarity Measure(1/3) • Early techniques (Rada et al., 1989) • Path distances between terms • Assume that all semantic links have equal weight • Poor assumption • Ex) “photoreceptor” and “transmembrane receptor” are semantically more closely related than “chaperone” and “signal transducer”
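
The weakness of plain path distance can be seen in a minimal sketch. The edge layout below is an assumption inferred from these slides (receptor is-a signal transducer is-a molecular function, with the two receptor subtypes placed under receptor), and `path_distance` is a hypothetical helper, not code from the paper.

```python
from collections import deque

# is-a parents for a few terms of the molecular function fragment.
# Assumption: this edge layout is inferred from the slides, not copied from GO.
PARENTS = {
    "molecular function": [],
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
    "photoreceptor": ["receptor"],
}

def path_distance(a, b, parents=PARENTS):
    """Number of edges on the shortest path between a and b, ignoring direction."""
    # Build an undirected adjacency map from the child -> parents table.
    neighbours = {term: set(ps) for term, ps in parents.items()}
    for term, ps in parents.items():
        for p in ps:
            neighbours[p].add(term)
    # Breadth-first search from a until b is reached.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbours[term] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # no path between the two terms

print(path_distance("photoreceptor", "transmembrane receptor"))
print(path_distance("chaperone", "signal transducer"))
```

Under this layout both pairs are the same number of edges apart, so an unweighted edge count cannot tell the closely related pair from the loosely related one.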

Semantic Similarity Measure(2/3) • Edges could be weighted • The greater the distance from the root of the graph, the more specific the term • However, GO varies widely in the distance of nodes from the root • Ex) (GO:0005300) is 14 terms deep, (GO:0008435) is only 3 terms deep • Yet the shallower term is not significantly less semantically precise

Semantic Similarity Measure(3/3) • Usage of terms within the corpus (Resnik, 1999) • Use the notion of “information content” • Familiar from most internet search engines • Ex) “chaperone” is a more informative term than “signal transducer” • The former is used several times, the latter thousands of times • When GO:0004872 occurs, GO:0004871 and GO:0003674 are counted as having occurred too (“is-a” links are followed) • Rarer terms are more informative

Probabilities in the Gene Ontology Each node is annotated with its GO accession and the probability of this term occurring in the SWISS-PROT-Human database 1. Count the number of times each concept occurs 2. A concept occurs if the term itself, or any of its children, occurs 3. The probability, p(c), for each node is this count divided by the number of times any concept occurs (so the probability of the root node is 1)
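
The three counting steps can be sketched as follows. This is only an illustration: the annotation counts are made up (they are not the SWISS-PROT-Human figures), the is-a layout is an assumption inferred from the slides, and `ancestors` and `term_probabilities` are hypothetical helper names.

```python
from collections import Counter

# Assumed is-a parents for part of the molecular function fragment.
PARENTS = {
    "GO:0003674": [],              # molecular function (root)
    "GO:0004871": ["GO:0003674"],  # signal transducer
    "GO:0003754": ["GO:0003674"],  # chaperone
    "GO:0004872": ["GO:0004871"],  # receptor
    "GO:0004888": ["GO:0004872"],  # transmembrane receptor
    "GO:0009881": ["GO:0004872"],  # photoreceptor
}

# Hypothetical direct annotation counts, invented for this example only.
DIRECT_COUNTS = {"GO:0004872": 3, "GO:0009881": 1, "GO:0003754": 1}

def ancestors(term, parents):
    """Return the term itself plus every term reachable through is-a links."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def term_probabilities(parents, direct_counts, root):
    """p(c) = occurrences of c or any descendant / occurrences at the root."""
    totals = Counter()
    for term, n in direct_counts.items():
        # Each annotation counts once for the term and once for each ancestor,
        # so a term reachable by several paths is not counted twice.
        for t in ancestors(term, parents):
            totals[t] += n
    return {t: totals[t] / totals[root] for t in parents}

probs = term_probabilities(PARENTS, DIRECT_COUNTS, "GO:0003674")
for term, p in sorted(probs.items()):
    print(term, p)
```

Propagating each annotation to its set of ancestors, rather than summing child totals, is a design choice that keeps the counts correct when the DAG offers more than one path to the root.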

Semantic Similarity between terms • Use the simplest measure (Resnik, 1999) • Based on the information content of the shared parents of the two terms • S(c1, c2) is the set of parental concepts shared by both c1 and c2 • Take the minimum p(c), because GO allows multiple paths • p_ms (probability of the minimum subsumer): p_ms(c1, c2) = min { p(c) : c in S(c1, c2) } • Similarity score between two terms: sim(c1, c2) = -ln p_ms(c1, c2) • As probability increases, informativeness decreases
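
A minimal sketch of a Resnik-style score, taken here as the negative natural log of the minimum-subsumer probability. The probabilities are the ones shown in the figure fragment; the is-a layout is an assumption inferred from the slides, a term is treated as subsuming itself (also an assumption), and the function names are hypothetical.

```python
import math

# Assumed is-a parents for part of the molecular function fragment.
PARENTS = {
    "GO:0003674": [],              # molecular function (root)
    "GO:0004871": ["GO:0003674"],  # signal transducer
    "GO:0003754": ["GO:0003674"],  # chaperone
    "GO:0004872": ["GO:0004871"],  # receptor
    "GO:0004888": ["GO:0004872"],  # transmembrane receptor
    "GO:0009881": ["GO:0004872"],  # photoreceptor
}

# Term probabilities as printed on the figure slide.
P = {
    "GO:0003674": 1.0,
    "GO:0004871": 0.208,
    "GO:0003754": 0.0102,
    "GO:0004872": 0.124,
    "GO:0004888": 0.0997,
    "GO:0009881": 0.000433,
}

def ancestors(term):
    """Return the term itself plus every term reachable through is-a links."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen

def p_ms(c1, c2):
    """Probability of the minimum subsumer: the smallest p(c) among shared ancestors."""
    return min(P[c] for c in ancestors(c1) & ancestors(c2))

def resnik(c1, c2):
    """Similarity as information content of the minimum subsumer."""
    return -math.log(p_ms(c1, c2))

print(resnik("GO:0009881", "GO:0004888"))  # photoreceptor vs transmembrane receptor
print(resnik("GO:0003754", "GO:0004871"))  # chaperone vs signal transducer
```

With this layout the first pair shares “receptor” as its most informative ancestor, while the second pair shares only the root, whose probability of 1 gives a score of zero.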

Validating Semantic Similarity • How do we validate such a measure? • A protein’s sequence relates to its function • Highly similar sequences should be highly semantically similar • Taking protein sequences in pairs and plotting sequence similarity against semantic similarity should reveal a relationship

Adapting the Similarity Measures to GO and SWISS-PROT • “part-of” relationship • Orphan terms • Linked directly to the root • Ex) GO:0009542 • “is-a” links alone • Proteins may be annotated with more than one term • WordNet: maximum similarity • GO: average similarity
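
The difference between the two ways of combining term scores can be sketched as below. The term-level scores are invented for illustration, and `protein_similarity` and `toy_term_sim` are hypothetical names; any term-to-term measure could be passed in.

```python
from itertools import product
from statistics import mean

# Hypothetical term-to-term scores, invented for this example only.
TOY_SCORES = {
    ("A", "A"): 3.0,
    ("A", "B"): 1.0,
    ("B", "B"): 4.0,
}

def toy_term_sim(t1, t2):
    """Symmetric lookup into the toy score table."""
    return TOY_SCORES[tuple(sorted((t1, t2)))]

def protein_similarity(terms_a, terms_b, term_sim, combine=mean):
    """Combine the scores of every pair of terms, one from each protein."""
    scores = [term_sim(a, b) for a, b in product(terms_a, terms_b)]
    return combine(scores)

# One protein carries terms A and B, the other only A.
average = protein_similarity(["A", "B"], ["A"], toy_term_sim)              # GO: average
maximum = protein_similarity(["A", "B"], ["A"], toy_term_sim, combine=max)  # WordNet: maximum
print(average, maximum)
```

Averaging penalizes a pair of proteins for every annotation they do not share, whereas the maximum only rewards their single best match.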

Comparing Semantic Similarity Across GO Aspects • There is a good correlation between sequence similarity and semantic similarity • The correlation is greatest when measured against the “molecular function” aspect

The Relationship Between Semantic Similarity and Evidence Codes • TAS (Traceable Author Statement): regarded as the highest standard of evidence • When only TAS GO annotations are considered, the correlation is much greater

Effect of Using Semantic Links in Semantic Similarity • Consider only links of a single type, “is-a” or “part-of” • Little difference between all links and “is-a” alone: almost all links are of the “is-a” type (6167 / 6202) • No links drop in the middle part : proteins share similar (links are included in semantic similarity measure)

Analysis(1/2) • Very high semantic similarity but little sequence similarity • “Polymorphic” groups • Two or more classes of protein involved in the same process • Heterodimers or sub-families • Hypervariable protein families • arbitrary • Mis-annotations • SWISS-PROT says “x-like” but GO says “x” • Spelling mistakes

Analysis(2/2) - Example

Semantic Searching of GO Annotated Resources • Develop a search tool • Compares a given query protein against all the others in SWISS-PROT-Human • Generates a ranked list of semantically similar proteins • Ex) “OPSR_HUMAN”
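
A ranked search of this kind might look like the sketch below. It is not the authors' tool: the protein names and their annotations are hypothetical, the is-a layout is an assumption inferred from the slides, and the scoring combines a Resnik-style term score with averaging over term pairs. Only the term probabilities come from the figure slide.

```python
import math
from itertools import product
from statistics import mean

# Assumed is-a parents and the probabilities printed on the figure slide.
PARENTS = {
    "GO:0003674": [],              # molecular function (root)
    "GO:0004871": ["GO:0003674"],  # signal transducer
    "GO:0003754": ["GO:0003674"],  # chaperone
    "GO:0004872": ["GO:0004871"],  # receptor
    "GO:0004888": ["GO:0004872"],  # transmembrane receptor
    "GO:0009881": ["GO:0004872"],  # photoreceptor
}
P = {"GO:0003674": 1.0, "GO:0004871": 0.208, "GO:0003754": 0.0102,
     "GO:0004872": 0.124, "GO:0004888": 0.0997, "GO:0009881": 0.000433}

# Hypothetical proteins and annotations, invented for this example only.
ANNOTATIONS = {
    "QUERY_PROT": ["GO:0009881"],
    "PROT_A": ["GO:0004888"],
    "PROT_B": ["GO:0003754"],
    "PROT_C": ["GO:0009881", "GO:0004872"],
}

def ancestors(term):
    """Return the term itself plus every term reachable through is-a links."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen

def term_sim(c1, c2):
    """Negative log probability of the least probable shared ancestor."""
    return -math.log(min(P[c] for c in ancestors(c1) & ancestors(c2)))

def protein_sim(a, b):
    """Average term similarity over all pairs of annotations."""
    return mean(term_sim(x, y) for x, y in product(ANNOTATIONS[a], ANNOTATIONS[b]))

def rank(query):
    """All other proteins, most semantically similar first."""
    others = [name for name in ANNOTATIONS if name != query]
    scored = [(name, protein_sim(query, name)) for name in others]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank("QUERY_PROT")
for name, score in ranked:
    print(name, round(score, 3))
```

Scoring every protein against the query is linear in the size of the database, which is simple but would need caching of term scores to scale to a full resource.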

Discussion • Investigated semantic similarity measures • In all cases, semantic similarity is correlated with sequence similarity • GO aspect: molecular function • Evidence code: “Traceable Author Statement” • Future work • Effect of the different semantic links in ontologies • Co-expression as revealed by microarray experiments • Expect that the biological process aspect would be of great use
