Information Integration Entity Resolution – 21.7

Information Integration Entity Resolution – 21.7 . Presented By: Deepti Bhardwaj Roll No: 223_103 . Contents. 21.7 Entity Resolution 21.7.1 Deciding Whether Records Represent a Common Entity 21.7.2 Merging Similar Records


Presentation Transcript


  1. Information Integration Entity Resolution – 21.7 Presented By: Deepti Bhardwaj Roll No: 223_103

  2. Contents • 21.7 Entity Resolution • 21.7.1 Deciding Whether Records Represent a Common Entity • 21.7.2 Merging Similar Records • 21.7.3 Useful Properties of Similarity and Merge Functions • 21.7.4 The R-Swoosh Algorithm for ICAR Records • 21.7.5 Other Approaches to Entity Resolution

  3. Introduction • Determining whether two records or tuples do or do not represent the same person, organization, place or other entity is called ENTITY RESOLUTION.

  4. Deciding Whether Records Represent a Common Entity • Two records represent the same individual if they have similar values for each of their corresponding fields. • It is not sufficient to require that the values of corresponding fields be identical, for the following reasons: 1. Misspellings 2. Variant names 3. Misunderstanding of names

  5. Continued: Deciding Whether Records Represent a Common Entity 4. Evolution of values 5. Abbreviations Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and use a test that measures the similarity of the records.

  6. Deciding Whether Records Represent a Common Entity - Edit Distance • The first approach to measuring the similarity of records is edit distance. • Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into the other. • Two values are declared similar if their edit distance is below a given threshold, and records whose fields are similar in this sense are taken to represent the same entity.
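
The insertion/deletion count described on this slide can be sketched with a small dynamic program. This is a minimal illustration, not the textbook's code: the function names and the default threshold of 3 are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of single-character insertions and deletions needed to turn a into b."""
    # prev[j] is the distance between the already-processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i characters into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Matching characters cost nothing.
                curr.append(prev[j - 1])
            else:
                # Either delete ca from a, or insert cb into a.
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similar_values(a: str, b: str, threshold: int = 3) -> bool:
    """Hypothetical test: values are similar if the distance is below the threshold."""
    return edit_distance(a, b) < threshold


print(edit_distance("Susan", "Suzan"))
print(similar_values("Susan", "Suzan"))
```

Because substitutions are not allowed here, replacing one character counts as one deletion plus one insertion.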

  7. Deciding Whether Records Represent a Common Entity - Normalization • Records can be normalized by replacing certain substrings with others. For instance, we can use a table of abbreviations and replace each abbreviation by what it normally stands for. • Once the values are normalized, we can use edit distance to measure the difference between the normalized values in the fields.
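
A rough sketch of abbreviation-based normalization follows. The abbreviation table is hypothetical, and lowercasing and stripping punctuation are extra assumptions beyond what the slide describes.

```python
import re

# Hypothetical abbreviation table; a real one would be specific to the data set.
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
    "apt": "apartment",
}


def normalize(value: str) -> str:
    """Lowercase the value, drop punctuation, and expand known abbreviations."""
    # Splitting into alphanumeric tokens turns "St." into the token "st".
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


print(normalize("123 Oak St."))
print(normalize("123 Oak Street"))
```

Both calls produce the same string, so the two spellings of the address no longer differ once normalized.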

  8. Merging Similar Records • Merging means replacing two records that are similar enough to merge with a single record that contains the information of both. • There are many possible merge rules: 1. Set each field in which the records disagree to the empty string. 2. (i) Merge by taking the union of the values in each field, and (ii) declare two records similar if at least two of the three fields have a nonempty intersection.

  9. Continued: Merging Similar Records

     Record    Name    Address                         Phone
     1         Susan   123 Oak St.                     818-555-1234
     2         Susan   456 Maple St.                   818-555-1234
     3         Susan   456 Maple St.                   213-555-5678

     After merging:

     Record    Name    Address                         Phone
     (1-2-3)   Susan   {123 Oak St., 456 Maple St.}    {818-555-1234, 213-555-5678}
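
The Susan example can be reproduced with the union-based merge rule and the "at least two of three fields intersect" similarity rule. The sketch below assumes each field is stored as a set of values; the helper names are illustrative.

```python
FIELDS = ("name", "address", "phone")


def make_record(name, address, phone):
    # Each field holds a set of values so that merged records keep the same shape.
    return {
        "name": frozenset([name]),
        "address": frozenset([address]),
        "phone": frozenset([phone]),
    }


def similar(r, s):
    """Similar if at least two of the three fields share a value."""
    return sum(1 for f in FIELDS if r[f] & s[f]) >= 2


def merge(r, s):
    """Merge by taking the union of the values in each field."""
    return {f: r[f] | s[f] for f in FIELDS}


r1 = make_record("Susan", "123 Oak St.", "818-555-1234")
r2 = make_record("Susan", "456 Maple St.", "818-555-1234")
r3 = make_record("Susan", "456 Maple St.", "213-555-5678")

r12 = merge(r1, r2)    # records 1 and 2 agree on name and phone
r123 = merge(r12, r3)  # the merged record agrees with record 3 on name and address

print(sorted(r123["address"]))
print(sorted(r123["phone"]))
```

Records 1 and 3 are not similar to each other directly (only the name matches), yet both end up in the same merged record through record 2.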

  10. Useful Properties of Similarity and Merge Functions The following properties say that the merge operation forms a semilattice: • Idempotence: the merge of a record with itself should be that same record. • Commutativity: if we merge two records, the order in which we list them should not matter. • Associativity: the order in which we group records for merging should not matter.

  11. Continued: Useful Properties of Similarity and Merge Functions There are some other properties that we expect the similarity relationship to have: • Idempotence for similarity: a record is always similar to itself. • Commutativity of similarity: in deciding whether two records are similar, it does not matter in which order we list them. • Representability: if r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record. Together, Idempotence, Commutativity, Associativity, and Representability are the ICAR properties.
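
As a sanity check, these properties can be tested on sample data for one particular choice of functions. The sketch assumes set-valued fields, a union merge, and a "two of three fields intersect" similarity test. Passing on a handful of records does not prove the properties in general.

```python
from itertools import permutations

FIELDS = ("name", "address", "phone")


def rec(name, address, phone):
    # Fields are sets so that a merged record has the same shape as an original one.
    return {"name": frozenset([name]), "address": frozenset([address]),
            "phone": frozenset([phone])}


def merge(r, s):
    return {f: r[f] | s[f] for f in FIELDS}


def similar(r, s):
    return sum(1 for f in FIELDS if r[f] & s[f]) >= 2


records = [
    rec("Susan", "123 Oak St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "213-555-5678"),
]

# Properties of the merge function.
idempotent = all(merge(r, r) == r for r in records)
commutative = all(merge(r, s) == merge(s, r) for r, s in permutations(records, 2))
associative = all(merge(merge(r, s), t) == merge(r, merge(s, t))
                  for r, s, t in permutations(records, 3))

# Properties of the similarity function.
self_similar = all(similar(r, r) for r in records)
symmetric = all(similar(r, s) == similar(s, r) for r, s in permutations(records, 2))
# Representability: if r is similar to s, r stays similar to the merger of s and t.
representable = all(similar(r, merge(s, t))
                    for r, s, t in permutations(records, 3) if similar(r, s))

print(idempotent, commutative, associative, self_similar, symmetric, representable)
```

Representability holds here because a union merge only ever adds values to a field, so any intersection that existed before the merge still exists afterwards.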

  12. R-Swoosh Algorithm for ICAR Records
     • Input: a set of records I, a similarity function, and a merge function.
     • Output: a set of merged records O.
     • Method:
          O := emptyset;
          WHILE I is not empty DO BEGIN
              let r be any record in I;
              find, if possible, some record s in O that is similar to r;
              IF no record s exists THEN
                  move r from I to O
              ELSE BEGIN
                  delete r from I;
                  delete s from O;
                  add the merger of r and s to I;
              END;
          END;
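
The method above translates fairly directly into Python. In this sketch the lists `pending` and `output` stand in for I and O. The record layout, the similarity and merge functions, and the "Robert" record are illustrative assumptions added to make the example runnable.

```python
def r_swoosh(records, similar, merge):
    """Follow the slide's method: move records from I to O, merging when a match exists."""
    pending = list(records)  # the input set I
    output = []              # the output set O
    while pending:
        r = pending.pop()    # take any record r out of I
        # Look for some record s in O that is similar to r.
        idx = next((i for i, s in enumerate(output) if similar(r, s)), None)
        if idx is None:
            output.append(r)             # no match: r moves to O
        else:
            s = output.pop(idx)          # match: delete s from O ...
            pending.append(merge(r, s))  # ... and put the merger back into I
    return output


# Illustrative record layout and functions (assumptions, not part of the algorithm).
FIELDS = ("name", "address", "phone")


def rec(name, address, phone):
    return {"name": frozenset([name]), "address": frozenset([address]),
            "phone": frozenset([phone])}


def similar(r, s):
    return sum(1 for f in FIELDS if r[f] & s[f]) >= 2


def merge(r, s):
    return {f: r[f] | s[f] for f in FIELDS}


data = [
    rec("Susan", "123 Oak St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "213-555-5678"),
    rec("Robert", "9 Elm Rd.", "310-555-0000"),  # made-up unrelated record
]

resolved = r_swoosh(data, similar, merge)
print(len(resolved))
```

The merged record is returned to the input rather than placed in the output because it may now be similar to records in O that neither of its parts matched on its own.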

  13. Other Approaches to Entity Resolution The other approaches to entity resolution are: • Non-ICAR datasets • Clustering • Partitioning

  14. Other Approaches to Entity Resolution - Non-ICAR Datasets Non-ICAR datasets: we can define a dominance relation r <= s, meaning that record s contains all the information contained in record r. When r <= s holds, we can eliminate record r from further consideration.
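
One possible reading of the dominance relation is sketched below. It assumes that records are dictionaries of set-valued fields and that "contains all the information" means field-by-field subset containment. Both are modeling assumptions, and the function names are hypothetical.

```python
def dominated_by(r, s):
    """True if r <= s: every value in every field of r also appears in s."""
    return all(r[f] <= s.get(f, frozenset()) for f in r)


def drop_dominated(records):
    """Keep only records that are not dominated by another record."""
    kept = []
    for i, r in enumerate(records):
        dominated = any(
            k != i
            and dominated_by(r, s)
            # For exact duplicates (mutual dominance), keep the earliest copy.
            and not (dominated_by(s, r) and k > i)
            for k, s in enumerate(records)
        )
        if not dominated:
            kept.append(r)
    return kept


a = {"name": frozenset({"Susan"}), "phone": frozenset({"818-555-1234"})}
b = {"name": frozenset({"Susan"}),
     "phone": frozenset({"818-555-1234", "213-555-5678"})}

print(dominated_by(a, b))
print(len(drop_dominated([a, b])))
```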

  15. Other Approaches to Entity Resolution - Clustering Clustering: sometimes we group the records into clusters such that members of a cluster are in some sense similar to each other and members of different clusters are not similar.

  16. Other Approaches to Entity Resolution - Partitioning Partitioning: We can group the records, perhaps several times, into groups that are likely to contain similar records and look only within each group for pairs of similar records.
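
A minimal sketch of partitioning follows. Records are grouped several times, once per key function, and only pairs that share a group are kept as candidates for the full similarity test. The choice of keys (exact phone, exact address) and the sample records are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_functions):
    """Return index pairs (i, j), i < j, that fall in the same group under some key."""
    pairs = set()
    for key in key_functions:
        groups = defaultdict(list)
        for idx, record in enumerate(records):
            groups[key(record)].append(idx)
        # Only compare records inside the same group.
        for members in groups.values():
            pairs.update(combinations(members, 2))
    return pairs


# Records as (name, address, phone) tuples; the last one is made up.
records = [
    ("Susan", "123 Oak St.", "818-555-1234"),
    ("Susan", "456 Maple St.", "818-555-1234"),
    ("Susan", "456 Maple St.", "213-555-5678"),
    ("Robert", "9 Elm Rd.", "310-555-0000"),
]

by_phone = lambda r: r[2]    # first partitioning: group by phone number
by_address = lambda r: r[1]  # second partitioning: group by address

pairs = candidate_pairs(records, [by_phone, by_address])
print(sorted(pairs))
```

With four records there are six possible pairs, but only the pairs that share a phone number or an address are examined further.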

  17. Thank You
