Data Quality

Presentation Transcript


  1. Data Quality Class 7

  2. Agenda • Record Linkage • Data Cleansing

  3. Record Linkage • Critical component of data quality applications • Linkage involves finding a link between a pair of records, either through an exact match or through an approximate match • Linkage is useful for • data cleansing • data correction • enhancement • householding

  4. Record Linkage 2 • Two records are linked when they match with enough weighted similarity • Matching can range from exact matching on particular fields to approximate matching with some degree of similarity • For example: two customer records can be linked via an account number, if account numbers are uniquely assigned to customers

  5. Record Linkage 3 • Example: • David Loshin • 633 Evergreen Terrace • Montclair, NJ • 201-765-8293 vs. • H. David Loshin • 633 Evergreen • Montclair, NJ • 201-765-8293 • In this case, we can establish a link based solely on telephone number

  6. Record Linkage 4 • Frequently, pivot attributes exist and can be used for exact matching (such as social security number, account number, telephone number, student ID) • Often, there is no pivot attribute, and therefore approximate matching techniques must be used • Approximate matching is a process of looking for similarity

  7. Deduplication • Duplicates are sets of records that represent the same entity • Duplicate elimination is the process of finding records that belong to the same equivalence class based on similarity above a specified threshold • When duplicates are found, one record is created to replace all the duplicates • That record is composed of the “best” data gleaned from all equivalence class members

  8. Merge/Purge • Similar to duplicate elimination • Application used when merging two or more different databases • Goal: find all the records associated with the same entity • Example: when two banks merge, find all accounts owned by the same person in both banks

  9. Enhancement • We can enhance data by merging it with other data sets • Linkage may be based on profile information extracted from each record

  10. Householding • We link on address as well as some permutation of the entity name • Look to establish a location match and some relation between entities at that location

  11. Similarity • We use measures of similarity to establish metrics and thresholds for linkage • The most interesting areas for similarity are in string matching

  12. Similarity and Distance • To link 2 values, we need a notion of distance and similarity • Obviously, similarity is measured as a function of distance • All items to be explored are projected into an n-dimensional space • Distance can be measured based on traditional distance functions, or specialized distance functions

  13. Distance Measures • Euclidean distance: the distance between any two points is the square root of the sum of the squares of the differences between the corresponding data points • City Block distance: this measures the distance in terms of walking along the line segments in a grid • Exact-match distance: for variables whose values are not allocated along continuous dimensions, the values assigned are either 1 for an exact match, or 0 for not a match
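
As a rough illustration, the three measures might be coded as below (a minimal Python sketch; the function names are illustrative, and the exact-match test follows the slide's 1/0 convention):

```python
import math

def euclidean_distance(p, q):
    # Square root of the sum of squared differences of corresponding values
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block_distance(p, q):
    # Sum of absolute differences: "walking along the line segments in a grid"
    return sum(abs(a - b) for a, b in zip(p, q))

def exact_match(a, b):
    # For values not on a continuous dimension: 1 for an exact match, 0 otherwise
    return 1 if a == b else 0
```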

  14. String Similarity • How do we characterize similarity between strings? • We can see it with our eyes, or hear it inside our heads: • example: Smith, Smythe • example: John and Mary Jones, John Jones • How do we transfer this to automation?

  15. Edit Distance • Edit distance operations • Insertion, where an extra character is inserted into the string • Deletion, where a character has been removed from the string • Transposition, in which two characters are reversed in their sequence • Substitution, which is an insertion followed by a deletion

  16. Edit Distance 2 • Strings with a small edit distance are likely to be similar • Edit distance is measured as the count of edit operations needed to transform one string into another • Example: internatianl → (transposition) → internatinal → (insertion) → international • So internatianl to international has an edit distance of 2

  17. Computing Edit Distance • Use dynamic programming • Given two strings, x = x1x2..xn and y = y1y2..ym • the edit distance f(i, j) is computed as the best match of the two substrings x1x2..xi and y1y2..yj, where • f(0, 0) = 0, f(i, 0) = i, f(0, j) = j • f(i, j) = min[ f(i-1, j) + 1, f(i, j-1) + 1, f(i-1, j-1) + d(xi, yj) ] • d(xi, yj) = 0 if xi = yj, and 1 otherwise
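
A direct translation of this recurrence into Python might look like the following sketch (base cases f(i, 0) = i and f(0, j) = j included; transposition is not treated as a single operation in this recurrence):

```python
def edit_distance(x, y):
    """Edit distance between strings x and y via dynamic programming."""
    n, m = len(x), len(y)
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = i                                   # i deletions
    for j in range(1, m + 1):
        f[0][j] = j                                   # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if x[i - 1] == y[j - 1] else 1      # d(xi, yj)
            f[i][j] = min(f[i - 1][j] + 1,            # deletion
                          f[i][j - 1] + 1,            # insertion
                          f[i - 1][j - 1] + d)        # match or substitution
    return f[n][m]

print(edit_distance("internatianl", "international"))  # 2, matching the earlier example
```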

  18. Phonetic Similarity • Words that sound the same may be misspelled • Phonetic similarity reduces the complexity of the strings • Effectively, it compresses the strings, with some loss, then performs a similarity test

  19. Soundex • The first character of the name string is retained, and then numbers are assigned to following characters • The numbers are assigned using this breakdown: • 1 = B P F V • 2 = C S K G J Q X Z • 3 = D T • 4 = L • 5 = M N • 6 = R • Vowels are ignored

  20. Soundex – 2 • Examples: • Fitzgerald = F322 • Smith = S530 • Smythe = S530 • Loshin = L250
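
A sketch of this simplified encoding in Python, following the slide's rules: keep the first letter, map the remaining consonants, skip anything without a code, and truncate or pad to four characters (full Soundex additionally collapses runs of the same code):

```python
SOUNDEX_CODES = {**dict.fromkeys("BPFV", "1"), **dict.fromkeys("CSKGJQXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}

def simple_soundex(name):
    name = name.upper()
    # Keep the first character, then encode the rest; letters with no code
    # (vowels, H, W, Y) are simply skipped
    digits = [SOUNDEX_CODES[c] for c in name[1:] if c in SOUNDEX_CODES]
    return (name[0] + "".join(digits) + "000")[:4]

for n in ("Fitzgerald", "Smith", "Smythe", "Loshin"):
    print(n, simple_soundex(n))   # F322, S530, S530, L250
```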

  21. Soundex – 3 • Regular soundex is flawed • geared towards English names • can’t account for incorrect first letter • longer names are truncated • Options • encode entire string, not just the first 4 consonant sounds • reverse the words, then encode both forward and backward • Use different phonetic encodings

  22. Other Phonetic Encoding • NYSIIS • Similar to soundex, but • does not use numeric encoding, instead uses mapping to smaller set of consonants • replaces all vowels with “A” • Metaphone • Tries to be more exact with multiple-letter sounds (sh, tio, th, etc.)

  23. N-gramming • Another means of representing a “compressed” form of a string • An n-gram is a chunk of text of length n • We slide a window of size n across a string to generate the set of n-grams for that string

  24. N-gram Example • INTERNATIONAL is composed of these 2-grams: • IN, NT, TE, ER, RN, NA, AT, TI, IO, ON, NA, AL • Compare this with INTERNATIANL: • IN, NT, TE, ER, RN, NA, AT, TI, IA, AN, NL • These two strings share 8 2-grams

  25. N-gram Measures • 1) Absolute overlap: the ratio of matching n-grams to the total number of n-grams, equal to (2 × |ngram(X) ∩ ngram(Y)|) / (|ngram(X)| + |ngram(Y)|) • 2) Source overlap: the number of matching n-grams divided by the number of n-grams in the source string X: |ngram(X) ∩ ngram(Y)| / |ngram(X)| • 3) Search overlap: the number of matching n-grams divided by the number of n-grams in the search string Y: |ngram(X) ∩ ngram(Y)| / |ngram(Y)|
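
These measures are straightforward to compute; a minimal Python sketch follows (n-grams are kept as a multiset so repeated 2-grams such as NA are counted, consistent with the totals above):

```python
from collections import Counter

def ngrams(s, n=2):
    # Slide a window of size n across the string
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def overlap_measures(x, y, n=2):
    gx, gy = ngrams(x, n), ngrams(y, n)
    shared = sum((gx & gy).values())                  # matching n-grams
    return {"absolute": 2 * shared / (sum(gx.values()) + sum(gy.values())),
            "source":   shared / sum(gx.values()),
            "search":   shared / sum(gy.values())}

print(overlap_measures("INTERNATIONAL", "INTERNATIANL"))
# 8 shared 2-grams: absolute ~0.70, source ~0.67, search ~0.73
```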

  26. Thresholding • We have functions to determine similarity • Tune these functions to return a value between 0 and 1 (i.e., 0% and 100%) • 0 = absolutely no match • 1 = absolute match

  27. Thresholding 2 • For any pair of values, we can apply the similarity function and get a score • For different kinds of data, we can set a minimum threshold, above which the two values are said to match • For example, for an n-gram match, we can set the threshold at 75%

  28. Thresholding 3 • When comparing two records, we apply the similarity functions in a pairwise manner to each of the important attribute value pairs • We assign some weight to each attribute value pair score • The total similarity score for a pair of records is the weighted sum of the individual attribute value pair scores • We can assign an overall threshold indicating a match

  29. Thresholds and Matches • We actually assign two thresholds, to partition scores into three groups: • Definite matches • User review • Definite no-matches • Scores above the high threshold are definite matches • Scores between low and high thresholds are user review matches • All others are not matches
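
Putting the last few slides together, a weighted scoring and two-threshold classification might be sketched as follows (the attributes, weights, and threshold values are purely illustrative and would have to be tuned for a real application):

```python
def exact(a, b):
    return 1.0 if a == b else 0.0

def bigram_sim(a, b):
    # Absolute 2-gram overlap (sets used here for brevity)
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga or gb else 1.0

# Hypothetical attributes, similarity functions, and weights
FIELDS = {"name": (bigram_sim, 0.5), "phone": (exact, 0.3), "zip": (exact, 0.2)}
LOW, HIGH = 0.6, 0.85            # example low/high thresholds

def record_score(r1, r2):
    # Weighted sum of per-attribute similarity scores, each in [0, 1]
    return sum(w * sim(r1[f], r2[f]) for f, (sim, w) in FIELDS.items())

def classify(r1, r2):
    s = record_score(r1, r2)
    return "match" if s >= HIGH else "user review" if s >= LOW else "no match"
```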

  30. Naive Algorithm • Goal: find all possible matches between any pair of records • Approach: perform a pairwise similarity score for all record pairs • Downside: O(n²)
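
In code, the naive approach is just a pairwise loop (score is any record-level similarity function, such as record_score from the sketch above):

```python
from itertools import combinations

def all_pair_scores(records, score):
    # Score every pair of records: n(n-1)/2 comparisons, i.e. O(n^2)
    return [(r1, r2, score(r1, r2)) for r1, r2 in combinations(records, 2)]
```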

  31. Improvements • We want to reduce the number of candidates for pairwise similarity testing • We can use fast matching for fixed pivot values when merging data sets • We use a concept called blocking to reduce the search set

  32. Fast Matching for Merging • We have looked at Bloom filters for fast matching • We can load all (record ID, attribute value) tuples from the first data set into the Bloom filter in O(n) time • We can then test each of the values from the second set to see if the pivot matches in the first set
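
A minimal Bloom filter sketch for this kind of pivot matching (for simplicity only the pivot values themselves are loaded; the field name and sizes are illustrative, and any hit still needs verification because Bloom filters allow false positives):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, value):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, value):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

# Load pivot values from the first data set: O(n)
first_set  = [{"account": "A-1001"}, {"account": "A-1002"}]
second_set = [{"account": "A-1002"}, {"account": "B-2001"}]
bf = BloomFilter()
for rec in first_set:
    bf.add(rec["account"])

# Probe with the second set; hits become candidates for a full comparison
candidates = [rec for rec in second_set if rec["account"] in bf]
print(candidates)   # [{'account': 'A-1002'}]
```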

  33. Blocking • Goal: reduce number of match candidates by using some form of “compression” on the records to be linked • Example: phonetic encodings • Example: limit by fixing one of the attributes • Example: find a pivot attribute and use that for affinity

  34. Blocking 2 • Example: if we want to perform householding on a mailing list: • Block by ZIP code, since we don’t expect to find members of the same household living in different locations
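
A blocking sketch along these lines (the zip field name is hypothetical; any blocking key function could be passed in):

```python
from collections import defaultdict
from itertools import combinations

def block_by(records, key):
    # Group records by a blocking key; comparisons only happen within a block
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks

def candidate_pairs(records, key):
    # Yield only pairs that share a block, instead of all O(n^2) pairs
    for block in block_by(records, key).values():
        yield from combinations(block, 2)

# e.g. pairs = candidate_pairs(mailing_list, key=lambda r: r["zip"])
```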

  35. Linkage Algorithms • All linkage algorithms make use of these ideas: • A blocking mechanism • We must choose one based on the data available and what makes sense for the application • Similarity functions • Every data type and data domain should have an associated similarity function, even if it is a 0/1 exact-match test • Weights for the similarity functions • This requires more insight into the problem, to see how each attribute's score should be weighted in the overall score
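
As a wrap-up, those three ideas combine into a generic linkage pass; the sketch below reuses candidate_pairs from the blocking sketch and takes any weighted score function (such as record_score from the thresholding sketch) plus the two thresholds:

```python
def link_records(records, blocking_key, score, low, high):
    # 1. blocking, 2. weighted similarity scoring, 3. two-threshold partitioning
    results = {"match": [], "user review": [], "no match": []}
    for r1, r2 in candidate_pairs(records, blocking_key):
        s = score(r1, r2)
        label = "match" if s >= high else "user review" if s >= low else "no match"
        results[label].append((r1, r2, s))
    return results
```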
