440 likes | 553 Vues
Deterministic Record Linking. University of North Carolina, Chapel Hill Hye-Chung Kum. Example. Exact Match. Approximate Matching I : SSN. Approximate Matching II : DOB. Approximate Matching III : Name. Deterministic Record Linking. Allow for approximate matching
E N D
Deterministic Record Linking University of North Carolina, Chapel Hill Hye-Chung Kum
Deterministic Record Linking • Allow for approximate matching • Use explicit approximate rules • Pros : can control the linkage process • Con: difficult to implement • Alternative : Probabilistic record linking • Also approximate matching • However, uses general rules specified by users • Based on total probability • Con: can not control exactly what to consider a match or not • Pros: can use specialized software
Approximate Matching : DOB • element to element match : date, month, year • Allow for one element difference • Allow for month and day transposed
Approximate Matching : Name • First name soundex match • First name is approx • one letter different • insert or replace • and/or substr • lsound equal • or lname approx • MI=FI • FI equal • Fsound & Lsound swapped
Match on ssn (ssn equal) • 1 : dob, fsound equal • dob approx • 2 : dob approx, fsound equal • 3 : dob approx, fname approx • 4 : dob approx, lsound equal, & fsound diff, but MI=FI • 5 : dob approx, lsound equal, & fsound diff, but FI equal • 6 : dob approx, lsound and fsound swapped • 7 : dob approx, lname approx & fsound diff • but MI=FI (4 with lname approx rather than equal) • or FI equal (5 with lname approx rather than equal) • dob mismatch • 8 : fname approx, lsound equal, and dob diff • 9 : fname approx, lsound approx, and dob diff
Match on ssn (ssn equal) • 1 : dob, fsound equal • dob approx • 2 : dob approx, fsound equal • 3 : dob approx, fname approx • 4 : dob approx, lsound equal, & fsound diff, but MI=FI • 5 : dob approx, lsound equal, & fsound diff, but FI equal • 6 : dob approx, lsound and fsound swapped • 7 : dob approx, lname approx & fsound diff • but MI=FI (4 with lname approx rather than equal) • or FI equal (5 with lname approx rather than equal) • dob diff • 8 : fname approx, lsound equal, and dob diff • 9 : fname approx, lsound approx, and dob diff
Approximate Matching : SSN • Digit to digit match • Allow for one digit difference • Allow for two digit difference if transposed
ssn missing 1: lname equal 2: lname approx ssn approx 3: lname equal 4: lname approx 5: lname diff but fname equal ssn different 11 : lname equal 12 : lname approx lname different 51: ssn approx 52: ssn missing Match on ndob (dob+fsound)
ssn missing 1: lname equal 2: lname approx ssn approx 3: lname equal 4: lname approx 5: lname diff but fname equal ssn different 11 : lname equal 12 : lname approx lname different 51: ssn approx 52: ssn missing Match on ndob (dob+fsound)
Match on name (fname+lname) • ssn missing & dob approx • 1: MI equal • 7: MI missing • 8: MI not equal • ssn approx • 3: dob equal • dob approx • 4: one element • 5: transpose
Match on name (fname+lname) • ssn missing & dob approx • 1: MI equal • 7: MI missing • 8: MI not equal • ssn approx • 3: dob equal • dob approx • 4: one element • 5: transpose
link • Put together all links found • Identify indirect duplicates (type2>10000) • i.e. both EISID1 & EISID2 link to identical SISID1 • Consider indirect duplicates on both EIS & SIS • Create unique link and indirect duplicate files • Keep only the first id in data file link • Create indirect duplicates files • dupeis2 & dupsis2 • TODO : explore indirect duplicates
Create unique list of EIS & SIS • Generate unique full list of each set of ids • use linkage info • Link in the duplicates (dupeis & dupsis) • TODO : link in the indirect duplicates • eis & sis
1,888,747 4,308,863 eisid.sas7bdat sisid.sas7bdat 31,461 dupeis.sas7bdat duplicates unduplicated unique records dupsis.sas7bdat 250,635 1,638,112 87% 4,277,402 99% 27% 1,173,404 72% link.sas7bdat 1270 dupeis2.sas7bdat eis.sas7bdat sis.sas7bdat 493 dupsis2.sas7bdat 4,308,863 28% 1,888,747 74% ueis.sas7bdat usis.sas7bdat Link eis to sis Data flow
Implementation details • Ndob & name must be looped • multiple matches • Too many match on name • use half of ssn • Overlap for transpose
Basic Process • Unduplicate EIS (dupeis) • Unduplicate SIS (dupsis) • Link unduplicated EIS & SIS (link) • Generate unique full list of each set of ids (list) • use linkage info • Link in the duplicates • eis & sis
Unduplication • Same as matching between different system • Except, match the database to itself • i.e. EIS to EIS, SIS to SIS • Randomly select one as Primary • TODO: for those not linked using primary ID, try with duplicate ID • TODO: explore indirect duplicate links
Conclusion • Future work : • indirect duplicates • Link using duplicates • SSN have been changed from real data
Type of id • first letter: • P : primary id with duplicates • D : duplicates (primary info given with prefix ‘l’) • X : no duplicates • second letter: link status • L: linked • X: no linked id • third letter: duplicates status of the linked id • D: duplicates exist for the linked id • X: no duplicates for the linked id
EIS & SIS Table • Unique full is of EIS (or SIS) ids • Type : type of id (XXX) – see next slide • All eis info have no prefix • All sis info have prefix ‘k’ • Prefix ‘l’ is the link id info • freqeis & freqsis : # of duplicate ids • Pindid (eis) & pkindid (sis) is the primary id • indid1-indid3 & kindid1-kindid8
Link type • sdiff : # digits different in ssn • -1 : one or both ssn is missing • 2 : two digits are transposed • 10 : two digits are different but not transposed • ddiff : diff in dob • -1 : one or both dob is missing • 2 : date and month is transposed • 3 : date, month and year are different • 4 : date and month are different • Fdiff (ldiff) : difference in first (last) name • -1 : one or both are missing • 1 : one letter difference (INDEL or REPL) • 100 : one is a substring of the other • 101 : one letter diff & substring
Duplicate type • If duplicate id • Primary id info is given with prefix “l” • Duplicate type • Lsdiff, lddiff, lfdiff, & lldiff • If primary id • # of duplicates : freqeis & freqsis • Duplicate ids • Indid1-indid3 (eis) & kindid1-kindid8 (sis)
Other tables • Link • Linkage between the primary eis & sis ids • dupeis & dupsis • List of duplicates with primary id
Data flow • eisid: 4,308,863 • ueis (4,277,402)+dupeis (31,461) : 99% • sisid: 1,888,747 • usis (1,638,112)+dupsis (250,635) : 87% • Link : 1,173,404 (eis: 27%, sis: 72%) • dupeis2 (1,270) + dupsis2(493) • EIS: 4,308,863 (28%) • SIS: 1,888,747 (74%)