
Semantic Similarity Computation on the Web of Data






  1. Semantic Similarity Computation on the Web of Data Jin Guang Zheng Tetherless World Constellation, Computer Science Department RPI

  2. Outline • Introduction • Research Problem • Historical Review • Contribution Overview • Contribution I: Information Entropy and Weighted Similarity Model • Semantic Similarity Computation Intuitions • IEWS Model • Contribution II: Semantic Similarity based Entity Matcher • Entity Matching Problem • System • Contribution III: Semantic Similarity based Entity Linking Tool • Entity Linking Problem • System • Evaluation • Summary

  3. Background • Entity • A thing on the Web of Data that has a URL as identifier • E.g. Organization, Location, Person • http://dbpedia.org/resource/George_Washington • Triple • Subject, Predicate (Property), Object • :George_Washington dbpediaProp:birthDate 1732 • subject: :George_Washington • predicate: dbpediaProp:birthDate • object: 1732 • You can read it as: George Washington’s birth date is 1732. • The object can also be a URL, which can in turn be described by another set of triples • :George_Washington dbpediaProp:birthPlace :Virginia • :Virginia dbpediaProp:area 42774 sq mi • You can read it as: George Washington’s birth place is Virginia. • Virginia has an area of 42774 sq mi
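The triple structure above can be sketched in code. This is an illustrative aside, not part of the original slides: triples represented as plain Python tuples, with a hypothetical `describe` helper that collects an entity's description.

```python
# Illustrative sketch (not from the slides): RDF-style triples as tuples.
triples = [
    (":George_Washington", "dbpediaProp:birthDate", "1732"),
    (":George_Washington", "dbpediaProp:birthPlace", ":Virginia"),
    (":Virginia", "dbpediaProp:area", "42774 sq mi"),
]

def describe(entity, triples):
    """Return all (predicate, object) pairs whose subject is `entity`."""
    return [(p, o) for (s, p, o) in triples if s == entity]

describe(":George_Washington", triples)
# → [('dbpediaProp:birthDate', '1732'), ('dbpediaProp:birthPlace', ':Virginia')]
```

Following an object that is itself a URL (here `:Virginia`) into another `describe` call is what "described by another set of triples" means in practice.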

  4. Semantic Similarity • Semantic similarity: how likely two things are to be semantically the same, based on the likeness of their semantic content. • Car, Automobile • http://dbpedia.org/resource/New_York_City, http://data.nytimes.com/N46020133052049607171 (New York City)

  5. The Problem • Entities on the Web of Data • Entities on the Web of Data come from different sources and have heterogeneous content • Some entities are similar to each other, or refer to the same real-world object • How can we tell whether entities are similar to each other? How do we compute similarity scores among entities on the Web of Data? • Not being able to do so limits possible applications: data integration, data aggregation, data clustering

  6. The Problem • Entity Matching • How to tell if two entities refer to the same real-world object/concept and create a “same as” type of link automatically • Enables data integration, data interoperability • Entity Recognition • How to find the “correct” entity from the Web of Data to annotate entity mentions in free text • Machines can process texts in a “smart” way • Knowing that “George Washington” refers to president George Washington, not George Washington University.

  7. Historical Review • Semantic Similarity Computation • Ontology-based edge-counting method: • Similarity between words is computed by applying a function to the length of the path linking the words in an ontology. [8][2] • Information-content based method: [4][5][6][7] • Similarity between documents is computed by using a corpus to measure the amount of information they share • Hybrid method [1][3] • Similarity between documents/words is computed using a combination of the above approaches. These approaches compute similarity between documents and words, as opposed to entities.

  8. Historical Review • Semantic Similarity Based Entity Matching • Ontology Matching & Instance Matching • ASMOV computes “children”, “parent” and lexical similarity [9] • Duan [10] uses “Jaccard” and “Edit distance” similarity and performs clustering • [11] uses user-configured information as a guide and computes similarity with the information provided by the user • Rong et al. [12] extract literal information from the entities and represent this information as vectors. In contrast, our work computes information entropy and learns the importance of the properties that describe the entities in similarity computation

  9. Historical Review • Semantic Similarity Based Entity Recognition • Bagga et al. [14][15] use a Vector Space Model (VSM) to represent the context of the entity mention and use cosine similarity to suggest possible annotations for the entity mention • Minkov et al. [16] and Jiang et al. [17] suggest using graph-based algorithms to further the similarity computation • Linden [18] leverages information from Wikipedia and the taxonomy from the knowledge base to compute similarity between Wikipedia concepts and entity mentions to suggest annotations. These approaches compute similarity between entity mentions in free text and Wikipedia documents, as opposed to entity mentions in free text and entities on the Web of Data.

  10. Challenges of Computing Similarity on the Web of Data • Challenge III: Extra information is not necessarily meant to differentiate entities • http://dbpedia.org/resource/New_York_City (>100 triples), http://data.nytimes.com/N46020133052049607171 (<20 triples) • Challenge IV: The amount of Linked Open Data on the Web is already on the order of billions of entities and triples and is still increasing

  11. Advantages of Computing Similarity on the Web of Data • Advantages: • Entities on the Web of Data are well-structured • There are typed links among the entities on the Web of Data • rdf:type, foaf:name, etc.

  12. Overview of Contributions • Contribution I: Information Entropy and Weighted Similarity Model (IEWS) • We developed a new semantic similarity computation model that is better suited for similarity computation among entities on the Web of Data. • Contribution II: Semantic Similarity based Entity Matcher • We developed a new Entity Matcher based on the IEWS Model that outperforms existing systems in terms of precision and recall. • Contribution III: Semantic Similarity based Entity Linking Tool • We developed a new Entity Linking tool based on the IEWS Model.

  13. Contribution I: Information Entropy and Weighted Similarity Model

  14. Assumptions • Assumption 1: The entities are described using the same language. • Assumption 2: The descriptions of an entity are consistent. • Assumption 3: Closed-world assumption: all descriptions of an entity are provided. • Assumption 4: Entities that are similar to each other must have some literal content that is similar.

  15. Intuitions • Intuition 1: The similarity between entities A and B is related to their commonality and difference. The more commonality they share, the more similar they are. The more difference they have, the less similar they are. • Pair 1: • :Entity1 rdf:type foaf:Person ; :Entity2 rdf:type foaf:Person • :Entity1 ex:lives_in “NY” ; :Entity2 ex:lives_in “NY” • Pair 2: • :Entity1 rdf:type foaf:Person ; :Entity2 rdf:type foaf:Person • :Entity1 ex:lives_in “NY” ; :Entity2 ex:lives_in “RI” • Sim(Pair 1) > Sim(Pair 2)

  16. Intuitions • Intuition 2: The commonality and difference between entities A and B are related to the amount of information that the descriptions of A and B deliver. The more information a description delivers, the more it affects the similarity score. • Given that SSN is a unique identifier and there are many people in the dataset • Pair 1: • :Entity1 ex:SSN “123-45-6789” ; :Entity2 ex:SSN “123-45-6789” • Pair 2: • :Entity1 rdf:type foaf:Person ; :Entity2 rdf:type foaf:Person • Sim(Pair 1) > Sim(Pair 2)

  17. Intuitions • Intuition 3: The commonality and difference between entities A and B are related to the importance of their descriptions. The more important a description is, the more it affects the similarity score. • Given that people can travel to many places and gender is a disjoint property • Pair 1: • :Entity1 rdf:type foaf:Person ; :Entity2 rdf:type foaf:Person • :Entity1 ex:travel_to “UK” ; :Entity2 ex:travel_to “Canada” • Pair 2: • :Entity1 rdf:type foaf:Person ; :Entity2 rdf:type foaf:Person • :Entity1 ex:gender “female” ; :Entity2 ex:gender “male” • Sim(Pair 1) > Sim(Pair 2)

  18. Intuitions • Intuition 4: The similarity between entities A and B is in the range 0 to 1. 1 is reached when A and B are semantically the same; 0 is reached when A and B are semantically different.

  19. Semantic Similarity Between Entities • Given the semantic similarity computation intuitions, how can we compute the similarity among entities? • Entities are described by sets of triples • Similarity between entities can be computed by comparing their triples

  20. Triple-wise Similarity • Simpv computation process • Both objects are strings • Apply the Jaccard similarity computation algorithm • Object 1 is a URL, object 2 is a string • Extract the string content of object 1, and then apply a lexical similarity computation algorithm • e.g. _:A _:category _:B, _:B _:label “country”; _:C _:category “country” • Both objects are URLs • Get the semantic content of both objects and then compute similarity • Stop traversing down if (IE > 0.9 || delta(IE) < 0.05) • Use the last part of the URL and treat it as a string • Different properties describe the same information • Perform property mapping • The schema that describes the entities is available (OWL, SKOS)
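The Jaccard step for string objects can be sketched as follows. This is an illustrative aside rather than the thesis code: Jaccard similarity over word sets, |A ∩ B| / |A ∪ B|.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between two strings, over their word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # convention: two empty strings are identical
    return len(sa & sb) / len(sa | sb)

jaccard("New York City", "New York")  # → 2/3 ≈ 0.667
```

The same function covers the URL-vs-string case once the string content of the URL object has been extracted.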

  21. Information Entropy • Intuition 2: The commonality and difference between entities A and B are related to the amount of information that the descriptions of A and B deliver. The more information a description delivers, the more it affects the similarity score. • Information Theory: • Information entropy is a quantified measure of the uncertainty of information content → it quantifies the expected amount of information in a description

  22. Information Entropy

  23. Information Entropy • Joint Entropy • Given a set of triples, how much information is given by these triples • Conditional Entropy • Given a triple A, how much information will triple B provide • Chain Rule for Information Entropy • Use the chain rule to compute joint entropy • Scalability is a problem • Approximate Information Entropy • Pick only the properties that have high information entropy
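The chain rule referenced above, H(X, Y) = H(X) + H(Y|X), can be checked numerically. A minimal sketch (not the thesis implementation), estimating entropies from observed value counts:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H = -sum p(v) * log2 p(v) over observed values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# toy data: X = property values, Y = co-occurring values
xs = ["a", "a", "b", "b"]
ys = ["1", "2", "1", "1"]

joint = entropy(list(zip(xs, ys)))  # H(X, Y)
h_x = entropy(xs)                   # H(X)
# H(Y|X) = sum over x of p(x) * H(Y | X = x)
h_y_given_x = sum(
    (xs.count(x) / len(xs)) * entropy([y for xi, y in zip(xs, ys) if xi == x])
    for x in set(xs)
)
assert abs(joint - (h_x + h_y_given_x)) < 1e-9  # chain rule holds
```

The scalability issue on the slide comes from the joint distribution growing with every triple added, which is why the model falls back to an approximation over high-entropy properties.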

  24. Importance of Property • Intuition 3: The commonality and difference between entities A and B are related to the importance of their descriptions. The more important a description is, the more it affects the similarity score. • Importance is different from information entropy: • The property ex:gender is an important description even though its entropy is low compared to other properties. • If the values are different, that is a strong indication that the two entities are not the same. • We can use a “weight” to describe the importance of a property

  25. Weight Learning Problem • Weight Learning Problem (WLP): given a training set T = {(δ1,δ1’,s1), (δ2,δ2’,s2) ... (δn,δn’,sn)}, where δi and δi’ are two sets of triples describing the entities ei and ei’, and si is the similarity score between ei and ei’ • Find a vector of weights for all properties that are used to describe the entities to be compared, so that the computed similarity scores are as close to the si as possible

  26. Binary Classification Problem • Binary Classification Problem (BCP): given a training set T = {(x1,y1), (x2,y2) … (xn,yn)}, where xi ∈ Rd and yi is drawn from the set of classification labels {-1, +1} • Find an optimal separating hyperplane W*Ф(x) + b = 0 that separates the xi correctly

  27. Reduce WLP to BCP • Defining W and y • W: a vector of weights for all properties that are used to describe the entities to be compared • y: a set of classes that represents the level of similarity between entities e and e’ • y = [low (simW ≤ 0.5), high (simW > 0.5)]

  28. Reduce WLP to BCP • We need to make sure the size of Ф(x) is the same as the size of W • Ф(x): a vector of property-based similarities between entities e and e’ • During the Simpv computation process, we obtain a vector of triple-wise similarities between two entities • A property can be used multiple times to describe an entity • _:Entity1 rdf:type _:Location ; _:Entity1 rdf:type _:Place • Take the average of the triple-wise similarities to get the property-based similarity • For any property that is not used to describe entities e and e’, we assign 0
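The reduction can be sketched end to end: each training pair becomes a property-based similarity vector Ф(x) with a ±1 label, and a linear classifier's weight vector plays the role of W. This is a minimal illustration with a simple perceptron standing in for the classifier (the thesis's actual learner may differ, e.g. an SVM); the feature layout and training data are hypothetical.

```python
# Hypothetical sketch: learn property weights via binary classification.
def train_weights(samples, epochs=100, lr=0.1):
    """samples: list of (phi, y), where phi is a property-based
    similarity vector and y is +1 (similar pair) or -1 (dissimilar)."""
    d = len(samples[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for phi, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, phi)) + b
            if y * score <= 0:  # misclassified: nudge the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, phi)]
                b += lr * y
    return w, b

# phi = [sim(rdf:type), sim(ex:gender)] -- agreement on the important
# property (gender) drives the label, mirroring Intuition 3
samples = [([1.0, 1.0], +1), ([1.0, 0.0], -1),
           ([0.9, 1.0], +1), ([1.0, 0.1], -1)]
w, b = train_weights(samples)
```

After training, the learned `w` gives ex:gender a larger weight than rdf:type, which is exactly the "importance" the WLP asks for.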

  29. Information Entropy and Weighted Similarity Model

  30. Contribution II: IEWS Model based Entity Matcher

  31. Entity Matching Problem • Entity Matching • Given two sets of entities E and E’, decide if a “same as” type of link should be created between entity e in E and entity e’ in E’ • Use semantic similarity as a metric to decide whether a “same as” type of link should be created

  32. Entity Match • Types of Entity Matching • Instance Matching: • Focuses on instance-level matching • Matching instance data that refer to the same real-world object • Ontology Matching • Focuses on schema-level matching • Matching concepts and properties that are meant to describe the same idea

  33. Entity Match System Flow

  34. Blocking Algorithm • Given two large sets of entities, pairwise similarity computation becomes too expensive • Index entities / create blocks for entities that share keywords • Filter the index by removing a block if lw > lb • Consider the following four entities and their corresponding LDs: w = {A,B,C,E,K,L}, x = {C,D,E,L}, y = {B,K,E,L}, z = {A,B,L} • If lb = 2, then the prefixes and corresponding blocks are: A : {w, z}, C : {w, x}, D : {x}, K : {w, y}
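A simplified version of the keyword-based blocking above (not the exact prefix-filtering variant from the thesis) reproduces the slide's example: entities sharing a keyword fall into one block, and blocks larger than lb = 2 are dropped, since overly common keywords (like L here, shared by all four entities) do not narrow down candidate pairs.

```python
from collections import defaultdict

# the four entities and their keyword sets from the slide
entities = {"w": {"A", "B", "C", "E", "K", "L"},
            "x": {"C", "D", "E", "L"},
            "y": {"B", "K", "E", "L"},
            "z": {"A", "B", "L"}}

def build_blocks(entities, lb=2):
    """Group entities by shared keyword, then drop blocks larger than lb."""
    blocks = defaultdict(set)
    for name, keywords in entities.items():
        for kw in keywords:
            blocks[kw].add(name)
    return {kw: block for kw, block in blocks.items() if len(block) <= lb}

build_blocks(entities)
# → blocks A:{w,z}, C:{w,x}, D:{x}, K:{w,y}, matching the slide
```

Pairwise similarity is then computed only within each surviving block instead of across the full cross product.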

  35. Match Selection • Based on the matching task, the final match can be selected using different configurations • Threshold-based: th > 0.9 • For each entity, select the top-matched entities

  36. Contribution III: IEWS Model based Entity Linking Tool

  37. Entity Recognition System Flow • Entity mentions m1, m2, … mn extracted from free text and candidate entities from the knowledge base are fed to the IEWS Model, which produces a similarity matrix used to select the final match

  38. Structured Representation of Entity Mentions • Get a structured representation of entity mentions • “George Washington is the first president of the United States” • Entity1 rdfs:label “George Washington” • Entity1 ?p2 Entity2 • Entity1 ?p3 Entity3 • Entity2 rdfs:label “President” • Entity3 rdfs:label “The United States”

  39. The Knowledge Base • Entity Base • We use entities from the Billion Triple Challenge 2009 to construct our entity base • Triples in BTC 2009 describe these entities • Surface Form Base • Given a surface form of an entity, we need to know which entities are possible candidates, e.g. “Washington” • We collect a set of surface form data from BTC: rdfs:label, foaf:name, dbpedia:redirects, etc.

  40. Candidate Entities • Given an entity mention with surface form sf, which entities from the Web of Data are candidates? • We can't select all possible entities due to the large entity base • Pre-rank entities using Link Frequency Analysis (similar to PageRank) + TF-IDF computation • The top 10 candidates for each entity mention are selected for further analysis
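Candidate selection can be sketched as follows; the surface-form index, the precomputed scores (standing in for the link-frequency + TF-IDF pre-rank), and all names are hypothetical.

```python
def top_candidates(surface_form, surface_index, scores, k=10):
    """Look up candidates for a surface form and keep the k best
    according to precomputed pre-rank scores."""
    candidates = surface_index.get(surface_form, [])
    return sorted(candidates, key=lambda uri: scores.get(uri, 0.0),
                  reverse=True)[:k]

# toy surface-form index and pre-rank scores
surface_index = {"Washington": [":George_Washington", ":Washington_DC",
                                ":Washington_State"]}
scores = {":George_Washington": 0.9, ":Washington_DC": 0.7,
          ":Washington_State": 0.5}

top_candidates("Washington", surface_index, scores, k=2)
# → [':George_Washington', ':Washington_DC']
```

Only these shortlisted candidates go on to the full IEWS similarity computation.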

  41. Final Similarity Computation • The final similarity score between the entity representation constructed from free text and each candidate entity is computed using the IEWS Model • Information entropy of properties is computed using the BTC 2009 dataset • No weight learning is performed in this task • Only direct descriptions of the candidate entity are analyzed • Entities with stronger relations are selected

  42. Evaluation • Overview • Human study • Applications (Entity Matcher, Entity Linking Tool) of the IEWS Model • Evaluation of Weight Learning and Information Entropy (Intuitions 2 and 3) • Blocking algorithm • Study of the IE-based stop-traversal algorithm • System • PC with 8 Intel Xeon processors of speed 2.40 GHz and 32 GB memory. Each processor has a 12M cache.

  43. Evaluation • Human Survey • Purpose: study how close the scores computed by the IEWS Model are to the scores given by humans • Metric: a high correlation between computed scores and human-evaluated scores indicates that the similarity scores computed by the model are accurate
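The correlation metric can be computed with a standard Pearson coefficient. A minimal sketch with made-up model and human scores (the slide does not specify which correlation coefficient was used, so Pearson is an assumption here):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical scores: IEWS output vs. averaged human judgments
model = [0.9, 0.1, 0.6, 0.4]
human = [0.95, 0.05, 0.7, 0.3]
pearson(model, human)
```

A value near 1 means the model ranks pairs much like the human judges do; a value near 0 means no agreement.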

  44. Evaluation • Evaluation Dataset Design • 1. Tests all semantic similarity computation intuitions, mainly focusing on Intuitions II and III • SSN is specified as unique (tests Intuitions II and III) • gender is a disjoint property (tests Intuition II) • All data are consistent • 2. Covers various challenges of real-world datasets on the Web of Data • Different properties describe the same information (Challenge I) • The same information is structured differently (Challenge II) • Extra information is not meant to differentiate entities (Challenge III)

  45. Evaluation

  46. Evaluation • Conference Ontology Dataset • 99 training cases • 21 evaluation cases • Systems • Compared with 20 systems • Result • SEM+: 0.82, the highest among all systems • Comparing F-measure of the systems

  47. Evaluation • Instance Matching Dataset • 2839 possible matches • Manually created a training dataset by randomly pairing unmatched entities (100 pairs) and randomly selecting 100 matched entity pairs • Systems • Compared with 17 systems • Result • SEM+: 0.785

  48. Evaluation • Instance Matching Dataset • Sandbox case as training set • 120 evaluation cases • Systems • Compared with 4 systems • Result • SEM+: 0.94 • Comparing F-measure of the systems

  49. Evaluation • Dataset: OAEI NYTimes to DBpedia instance matching dataset • NYTimes: 9943 entities, 335198 triples • DBpedia: 8862 entities, 4315062 triples

  50. Evaluation • Purpose: study the impact of Weight Learning and Information Entropy • Dataset: Conference Ontology
