190 likes | 208 Vues
Understand the importance of data linkage for deduplication, merge/purge, enhancement, and householding. Learn about similarity measures, edit distance, and phonetic similarity. Explore how to determine thresholds and matching criteria for data quality applications.
E N D
Data Quality Class 10
Agenda • Review of Last week • Cleansing Applications • Guest Speaker
Review of Similarity • We use measures of similarity to establish measures and thresholds for linkage • Edit Distance • Phonetic Similarity • Ngrams
Linkage • Critical component of data quality applications • Linkage involves finding a link between a pair of records, either through an exact match or through an approximate match • Linkage is useful for • Deduplification • Merge/Purge • Enhancement • Householding
Linkage 2 • Two records are linked when they match with enough weighted similarity • Matching can be a combination of exact matching on particular fields, to approximat ematching with some degree of similarity • For example: two customer records can be linked via an account number, if account numbers are uniquely assigned to customers
Thresholding • We have functions to determine similarity • Tune these functions to return a value between 0 and 1 (i.e., 0% and 100%) • 0 = absolutely no match • 1 = absolute match
Threhsolding 2 • For any pair of values, we can apply the similarity function and get a score • For different kinds of data, we can set a minimum threshold, above which the two values are said to match • For example, for an n-gram match, we can set the threshold at 75%
Thresholding 3 • When comparing 2 records, we apply the similairty functions in a pairwise manner to each of the important attribute value pairs • We assign some weight to each attribute value pair score • Total similarity score for each pair of records is the sum of individual attribute value pairs scores, adjusted by weight • We can assign an overall threshold indicating a match
Thresholds and Matches • We actually assign two thresholds, to partition scores into three groups: • Definite matches • User review • Definite no-matches • Scores above the high threshold are definite matches • Scores between low and high thresholds are user review matches • All others are not matches
Deduplification • Duplicates are sets of records that represent the same entity • Duplicate elimination is the process of finding records that belong to the same equivalence class based on similarity above a specified threshold • When duplicates are found, one record is created to replace all the suplicates • That record is composed of the “best” data gleaned from all equivalence class members
Merge/Purge • Similar to duplicate elimination • Application used when merging two or more different database • Goal: find all the records associated with the same entity • Example: when two banks merge, find all accounts owned by the same person in both banks
Enhancement • We can enhance data by merging it with other data sets • Linkage may be based on profile information extracted from each record
Householding • We link on address as well as some permutation of the entity name • Look to establish a location match and some relation between entities at that location
Naive Algorithm • Goal: find all possible matches between any pair of records • Approach: Perform a pairwise similarity score for all record pairs • Downside: O(n2)
Improvements • Desire to reduce the number of candidates for pairwise similarity testing • We can use fast matching for fixed pivot values when merging data sets • We use a concept called blocking to reduce the search set
Fast Matching for Merging • We have looked at Bloom filters for fast matching • We can load all (recordID,attribute value) tuples from the first data set into the Bloom filter O(n) • We can then test each of the values from the second set to see if the pivot matches in the first set
Blocking • Goal: reduce number of match candidates by using some form of “compression” on the records to be linked • Example: phonetic encodings • Example: limit by fixing one of the attributes • Example: find a pivot attribute and use that for affinity
Blocking 2 • Example: if we want to perform householding on a mailing list: • Block by ZIP code, since we don’t expect to find members of the same household living in different locations
Linkage Algorithms • All linkage algorithms make use of these ideas: • A blocking mechanism • We must choose based on the data available and what makes sense for the application • Similarity functions • Every data type and data domain should have an associated similarity function, even if it is a 0/1 exact match test • Weights for the similarity functions • This requires more insight into the problem, to see how each attribute’s scores should weigh for the overall score