A Century Of Progress On Information Integration: A Mid-Term Report

William W. Cohen, Center for Automated Learning and Discovery (CALD), Carnegie Mellon University

Linkage Queries • Querying integrated information sources (e.g. queries to views, execution of web-based queries, …) • Data mining & analyzing integrated information (e.g., collaborative filtering/classification learning using extracted data, …) • Discovering information sources (e.g. deep web modeling, schema learning, …) • Gathering data (e.g., wrapper learning & information extraction, federated search, …) • Cleaning data (e.g., de-duping and linking records) to form a single [virtual] database Information Integration

[Science 1959] Record linkage: bringing together of two or more separately recorded pieces of information concerning a particular individual or family (Dunn, 1946; Marshall, 1947).

… … Very much like the inverse document frequency (IDF) rule used in information retrieval. …

Motivations for Record Linkage c. 1959 Record linkage is motivated by certain problems faced by a small number of scientists doing data analysis for obscure reasons.

In 1954, Popular Mechanics showed its readers what a home computer might look like in 2004 …

Information integration in 1959 • Many of the basic principles of modern integration work are recognizable. • Fellegi and Sunter, "A theory for record linkage", Journal of the American Statistical Association, 1969 • Manual engineering of distance features (e.g., last names as Soundex codes) that are then matched probabilistically. • DB1 + DB2 DB12+ Pr(matches) + elbowGrease DB12 • Applied to records from pairs of datasets • “Smallest possible scale” for integration (on one dimension) • Computationally expensive • Relative to ordinary database operations • Narrowly used • Only for scientists in certain narrow areas (e.g., public health) • Where have we come to now, in 2005? • [Hector’s heckling: “how do we know when we’re finished?”]

Ted Kennedy's “Airport Adventure” [2004] Washington -- Sen. Edward "Ted" Kennedy said Thursday that he was stopped and questioned at airports on the East Coast five times in March because his name appeared on the government's secret "no-fly" list…Kennedy was stopped because the name "T. Kennedy" has been used as an alias by someone on the list of terrorist suspects. “…privately they [FAA officials] acknowledged being embarrassed that it took the senator and his staff more than three weeks to get his name removed.”

Florida Felon List [2000,2004] The purge of felons from voter rolls has been a thorny issue since the 2000 presidential election. A private company hired to identify ineligible voters before the election produced a list with scores of errors, and elections supervisors used it to remove voters without verifying its accuracy… The new list … contained few people identified as Hispanic; of the nearly 48,000 people on the list created by the Florida Department of Law Enforcement, only 61 were classified as Hispanics. Gov. Bush said the mistake occurred because two databases that were merged to form the disputed list were incompatible. … when voters register in Florida, they can identify themselves as Hispanic. But the potential felons database has no Hispanic category… The glitch in a state that President Bush won by just 537 votes could have been significant — because of the state's sizable Cuban population, Hispanics in Florida have tended to vote Republican… The list had about 28,000 Democrats and around 9,500 Republicans…

Information dealing with such matters as violent crime, organized crime, fraud and other white-collar crime may take days to be shared throughout the law enforcement community, according to an FBI official. The new software program was supposed to allow agents to pass along intelligence and criminal information in real time…. In a response contained in the inspector general's report, the FBI pointed to its Investigative Data Warehouse…that provides … access to 47 sources of counterterrorism data, including information from FBI files, other government agencies and open-source news feeds.

..counter asymmetric threats by achieving total information awareness…

Fishing in a sea of information • Suppose you discover a pattern of events between a group of three people such that Pr(group is terroristCell | pattern) = 0.99999 • If you apply it to all three-person groups in the US, how many false positives will there be? And how many true positives? (250,000,000 * 50 * 49) * (1 – 0.99999) = 6,125,000
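The slide's expression can be evaluated directly as a sanity check. The sketch below takes the group count exactly as written on the slide (250,000,000 * 50 * 49) and, as the slide does, treats 1 – 0.99999 as the fraction of groups flagged in error; the variable names are illustrative only.

```python
from fractions import Fraction

# Group count as written on the slide: 250,000,000 * 50 * 49.
population = 250_000_000
candidate_groups = population * 50 * 49

# 1 - 0.99999, kept exact to avoid floating-point noise.
error_rate = 1 - Fraction(99999, 100000)

false_positives = candidate_groups * error_rate
print(f"{candidate_groups:,} groups, {int(false_positives):,} false positives")
```

Under these inputs the product is 6,125,000, which is the point of the slide: even a very accurate pattern produces millions of false alarms when applied to every possible group.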

Chinese Embassy Bombing [1999] • May 7, 1999: NATO bombs the Chinese Embassy in Belgrade with five precision-guided bombs, sent to the wrong address, killing three. “The Chinese embassy was mistaken for the intended target…located just 200 yards from the embassy. Reliance on an outdated map, aerial photos, and the extrapolation of the address of the federal directorate from number patterns on surrounding streets were cited … as causing the tragic error…despite the elaborate system of checks built into the targeting protocol, the coordinates did not trigger an alarm because the three databases used in the process all had the old address.” [US-China Policy Foundation summary of the investigation] “BEIJING, June 17 –– China today publicly rejected the U.S. explanation … [and] said the U.S. report ‘does not hold water.’” [Washington Post] “The Chinese embassy was clearly marked on tourist maps that are on sale internationally, including in the English language. … Its address is listed in the Belgrade telephone directory…. For the CIA to have made such an elementary blunder is simply not plausible.” [World Socialist Web Site] “Many observers believe that the bombing was deliberate…if you believe that the bombing was an accident, you already believe in the far-fetched” [disinfo.com, July 2002].

Information integration in 2005 • Apparently, we still have work to do. • We fail to integrate information correctly • “Ted Kennedy (D-MA)” ≠ “T. Kennedy, (T-IRA)” • Crucial decisions are affected by these errors • Who can/can’t vote (felon list) • Where bombs are sent (Chinese embassy) • Storing, linking, and analyzing information is a double-edged sword: • Loss of privacy and “fishing expeditions”

Information integration, 2005-2059 • Understanding sources of uncertainty, and propagating uncertainty to the end user. • “Soft” information integration • Driven by user’s goal and user’s queries • Information integration for the great unwashed masses • Personal information, small stores of scientific information, … • Using more information in linkage: • Text, images, multiple interacting “hard” sources • Anonymous secure linkage; non-technical limitations of how information can be combined, as well as distributed.

Information integration, 2005-2059 • Understanding sources of uncertainty, and propagating uncertainty to the end user. • “Soft” information integration • Driven by user’s goal and user’s queries • When does a particular user believe that “X is the same thing as Y”? Does “the same thing” always mean the same thing? • Is “X is the same entity as Y” always transitive?

When are two entities the same? • Bell Labs • Bell Telephone Labs • AT&T Bell Labs • A&T Labs • AT&T Labs—Research • AT&T Labs Research, Shannon Laboratory • Shannon Labs • Bell Labs Innovations • Lucent Technologies/Bell Labs Innovations • [1925] History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers…. [www.research.att.com] • Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925… [bell-labs.com]

[Figure: entities related by = and ≠, including Bell Telephone Labs]

When are two entities the same? “Buddhism rejects the key element in folk psychology: the idea of a self (a unified personal identity that is continuous through time)… King Milinda and Nagasena (the Buddhist sage) discuss … personal identity… Milinda gradually realizes that "Nagasena" (the word) does not stand for anything he can point to: … not … the hairs on Nagasena's head, nor the hairs of the body, nor the "nails, teeth, skin, muscles, sinews, bones, marrow, kidneys, ..." etc… Milinda concludes that "Nagasena" doesn't stand for anything… If we can't say what a person is, then how do we know a person is the same person through time? … There's really no you, and if there's no you, there are no beliefs or desires for you to have… The folk psychology picture is profoundly misleading and believing it will make you miserable.” -S. LaFave

Passing linkage decisions along to the user • Usual goal: link records and create a single highly accurate database for users to query. • Equality is often uncertain, given available information about an entity • “name: T. Kennedy occupation: terrorist” • The interpretation of “equality” may change from user to user and application to application • Does “Boston Market” = “McDonalds”? • Alternate goal: wait for a query, then answer it, propagating uncertainty about linkage decisions on that query to the end user

Traditional approach: uncertainty about what to link must be resolved by the integration system, not the end user. [Diagram labels: Linkage, Queries]

WHIRL vision: link items as needed by query Q • Query Q: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a=S.a and S.b=T.b • Strongest links: those agreeable to most users • Weaker links: those agreeable to some users • Even weaker links…

WHIRL vision: DB1 + DB2 ≠ DB • Query Q: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b (~ means TFIDF-similar) • Link items as needed by Q • Incrementally produce a ranked list of possible links, with “best matches” first. • User (or downstream process) decides how much of the list to generate and examine.

WHIRL queries • Assume two relations: review(movieTitle,reviewText): archive of reviews listing(theatre, movieTitle, showTimes, …): now showing

WHIRL queries • “Find reviews of sci-fi comedies” [movie domain]: FROM review as r SELECT * WHERE r.text~’sci fi comedy’ (like standard ranked retrieval of “sci-fi comedy”) • “Where is [that sci-fi comedy] playing?”: FROM review as r, listing as s SELECT * WHERE r.title~s.title and r.text~’sci fi comedy’ (best answers: titles are similar to each other, e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005”, and the review text is similar to “sci-fi comedy”)
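The title join in the second query can be imitated on toy data. The following is a minimal sketch, not WHIRL itself: it scores every pair by cosine similarity of TFIDF vectors and returns the pairs ranked best-first. The titles, the function names, and the particular TFIDF weighting (log(1 + tf) times log(1 + N/df)) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Unit-length TFIDF vectors (token -> weight) for a list of strings."""
    bags = [Counter(tokenize(d)) for d in docs]
    df = Counter(t for bag in bags for t in bag)  # document frequency
    n = len(docs)
    vecs = []
    for bag in bags:
        v = {t: math.log(1 + tf) * math.log(1 + n / df[t]) for t, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def cosine(u, v):
    return sum(w * v[t] for t, w in u.items() if t in v)

def soft_join(left, right):
    """Return (score, left_item, right_item) tuples, best matches first."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), l, r)
             for a, l in zip(lv, left) for b, r in zip(rv, right)]
    return sorted((p for p in pairs if p[0] > 0), reverse=True)

# Hypothetical toy relations standing in for review.title and listing.title.
review_titles = ["Hitchhiker's Guide to the Galaxy", "Kingdom of Heaven",
                 "Star Wars Episode III"]
listing_titles = ["The Hitchhiker's Guide to the Galaxy, 2005",
                  "Kingdom of Heaven (2005)",
                  "Star Wars: Episode III - Revenge of the Sith 2005"]

for score, review, listing in soft_join(review_titles, listing_titles)[:3]:
    print(f"{score:.2f}  {review!r} ~ {listing!r}")
```

Slicing the result (here `[:3]`) plays the role of the user deciding how much of the ranked list to examine. This sketch scores all pairs up front, whereas the slides describe producing the list incrementally.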

WHIRL queries • Similarity is based on TFIDF: rare words are most important. • Search for high-ranking answers uses inverted indices….

WHIRL queries • Similarity is based on TFIDF: rare words are most important. (Years are common in the review archive, so they have low weight.) • Search for high-ranking answers uses inverted indices…. - It is easy to find the (few) items that match on “important” terms - Search for strong matches can prune “unimportant terms”
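The pruning idea can be illustrated with a small inverted index. This is a simplified sketch rather than WHIRL's actual search procedure: it looks up only the rarest query terms to collect candidate matches and never touches common terms such as the year. The toy listings, the function names, and the `max_terms` cutoff are assumptions.

```python
import math
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for tok in set(tokenize(doc)):
            index[tok].add(doc_id)
    return index

def candidates(query, index, n_docs, max_terms=2):
    """Collect candidates using only the rarest (highest-IDF) query terms."""
    terms = [t for t in set(tokenize(query)) if t in index]
    # Highest IDF first; ties broken alphabetically for determinism.
    terms.sort(key=lambda t: (-math.log(n_docs / len(index[t])), t))
    found = set()
    for t in terms[:max_terms]:
        found |= index[t]
    return found

# Hypothetical listing titles; "2005" occurs in all of them, so its IDF is 0.
listings = ["The Hitchhiker's Guide to the Galaxy, 2005",
            "Kingdom of Heaven, 2005",
            "Star Wars Episode III, 2005",
            "The Interpreter, 2005"]
index = build_index(listings)
print(candidates("Hitchhiker's Guide to the Galaxy 2005", index, len(listings)))
```

Only the posting lists of the two rarest terms are read, so the single matching listing is found without scanning the long posting list for "2005".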

WHIRL results • This sort of worked: • Interactive speeds (<0.3s/q) with a few hundred thousand tuples. • For 2-way joins, average precision (sort of like area under precision-recall curve) from 85% to 100% on 13 problems in 6 domains. • Average precision better than 90% on 5-way joins
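For readers unfamiliar with the measure, average precision can be computed from a ranked list of correct/incorrect judgments. The sketch below uses the usual non-interpolated definition (the mean of the precision values at each rank where a correct link appears); the slide does not say which variant was used, so this choice and the function name are assumptions.

```python
def average_precision(ranked_is_correct, total_correct=None):
    """Mean of precision@rank over the ranks holding a correct item.

    total_correct defaults to the number of correct items in the list,
    i.e. it assumes every correct link appears somewhere in the ranking.
    """
    hits = 0
    precision_sum = 0.0
    for rank, ok in enumerate(ranked_is_correct, start=1):
        if ok:
            hits += 1
            precision_sum += hits / rank
    denom = total_correct if total_correct is not None else hits
    return precision_sum / denom if denom else 0.0

# Correct link at ranks 1 and 3: (1/1 + 2/3) / 2
print(average_precision([True, False, True]))
```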

WHIRL and soft integration • WHIRL worked for a number of web-based demo applications, e.g., integrating data from 30-50 smallish web DBs with <1 FTE labor • WHIRL could link many data types reasonably well, without engineering • WHIRL generated numerous papers (Sigmod98, KDD98, Agents99, AAAI99, TOIS2000, AIJ2000, ICML2000, JAIR2001) • WHIRL was relational - But see ELIXIR (SIGIR2001) • WHIRL users need to know schema of source DBs • WHIRL’s query-time linkage worked only for TFIDF, token-based distance metrics → text fields with few misspellings • WHIRL was memory-based: all data must be centrally stored, no federated data → small datasets only

WHIRL vision: very radical, everything was inter-dependent • Query Q: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b (~ means TFIDF-similar) • Link items as needed by Q • Incrementally produce a ranked list of possible links, with “best matches” first. • User (or downstream process) decides how much of the list to generate and examine.

Information integration, 2005-2059 • Understanding sources of uncertainty, and propagating uncertainty to the end user. • “Soft” information integration • Driven by user’s goal and user’s queries • Information integration for the great unwashed masses • Personal information, small stores of scientific information, … • Needed: • Robust distance metrics that work “out of the box” • Methods to tune and combine these metrics

Robust distance metrics for strings • Kinds of distances between s and t: • Edit-distance based (Levenshtein, Smith-Waterman, …): distance is cost of cheapest sequence of edits that transform s to t. • Term-based (TFIDF, Jaccard, DICE, …): distance based on set of words in s and t, usually weighting “important” words • Which methods work best when?
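One representative of each family can be written in a few lines. This is a minimal sketch with unit edit costs and unweighted tokens; the tuned variants discussed in the talk (Smith-Waterman, TFIDF weighting) are not shown. Note that Jaccard as written is a similarity (higher is closer), while Levenshtein is a distance.

```python
def levenshtein(s, t):
    """Cost of the cheapest sequence of unit-cost edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or copy
        prev = curr
    return prev[-1]

def jaccard(s, t):
    """Overlap of the word sets of s and t, in [0, 1]."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

print(levenshtein("Cohen", "Cohm"))          # character-level view
print(jaccard("AT&T Bell Labs", "Bell Labs"))  # word-level view
print(jaccard("Bell Labs", "Labs Bell"))       # word order is ignored
```

The two families fail in different ways: a term-based measure ignores word order but treats a misspelled word as a complete mismatch, while an edit-distance measure tolerates misspellings but penalizes reordering.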

Robust distance metrics for strings SecondString (Cohen, Ravikumar, Fienberg, IIWeb 2003): • Java toolkit of string-matching methods from AI, Statistics, IR and DB communities • Tools for evaluating performance on test data • Used to experimentally compare a number of metrics

Results: Edit-distance variants • Monge-Elkan (a carefully-tuned Smith-Waterman variant) is the best on average across the benchmark datasets… [Figure: 11-pt interpolated recall/precision curves averaged across 11 benchmark problems]
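An 11-point interpolated curve of the kind shown in the figure can be computed from a ranked list as follows. This sketch uses the standard information-retrieval definition (at each recall level 0.0, 0.1, …, 1.0, take the best precision achieved at that recall or higher); the function name and the toy ranking are assumptions.

```python
def eleven_point_interpolated(ranked_is_correct, total_correct):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = []  # (recall, precision) after each rank
    hits = 0
    for rank, ok in enumerate(ranked_is_correct, start=1):
        hits += bool(ok)
        points.append((hits / total_correct, hits / rank))
    curve = []
    for k in range(11):
        level = k / 10
        # Best precision at any point whose recall reaches this level.
        attainable = [p for r, p in points if r >= level]
        curve.append(max(attainable, default=0.0))
    return curve

print(eleven_point_interpolated([True, False, True], total_correct=2))
```

Averaging such curves level by level across benchmark problems gives a single summary curve per method.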

Results: Edit-distance variants • But Monge-Elkan is sometimes outperformed on specific datasets [Figure: precision-recall for Monge-Elkan and one other method (Levenshtein) on a specific benchmark]

SoftTFIDF: A robust distance metric • We also compared edit-distance based and term-based methods, and evaluated a new “hybrid” method: • SoftTFIDF, for token sets S and T: • Extends TFIDF by including pairs of words in S and T that “almost” match, i.e., that are highly similar according to a second distance metric (the Jaro-Winkler metric, an edit-distance-like metric).
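The hybrid idea can be sketched in code. This is an illustration of the description above, not the SecondString implementation: each word in S is paired with its closest word in T, and the pair contributes the product of the two TFIDF weights and the word-level similarity whenever that similarity exceeds a threshold. Two loudly labeled assumptions: `difflib.SequenceMatcher.ratio()` from the Python standard library is swapped in for Jaro-Winkler, and the threshold `theta=0.9`, the TFIDF weighting, and the toy corpus are illustrative choices.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def token_sim(a, b):
    # ASSUMPTION: stand-in for Jaro-Winkler; returns a value in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def weights(tokens, df, n_docs):
    """Unit-length TFIDF weights for one token list (unseen tokens: df = 1)."""
    tf = Counter(tokens)
    v = {w: math.log(1 + c) * math.log(1 + n_docs / df.get(w, 1))
         for w, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {w: x / norm for w, x in v.items()}

def soft_tfidf(s, t, df, n_docs, theta=0.9):
    vs = weights(tokenize(s), df, n_docs)
    vt = weights(tokenize(t), df, n_docs)
    total = 0.0
    for w, ws in vs.items():
        # Closest token in T under the secondary metric.
        best, score = max(((v, token_sim(w, v)) for v in vt),
                          key=lambda p: p[1], default=(None, 0.0))
        if score > theta:
            total += ws * vt[best] * score
    return total

# Hypothetical corpus used only to estimate document frequencies.
corpus = ["William Cohen", "Wiliam Cohen", "William Smith", "John Smith"]
df = Counter(t for doc in corpus for t in set(tokenize(doc)))
print(soft_tfidf("William Cohen", "Wiliam Cohen", df, len(corpus)))
```

With a threshold so strict that only exact token matches count, the misspelled first name contributes nothing; with the softer threshold it contributes most of its weight, which is the behavior the slide describes.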

Comparing token-based, edit-distance, and hybrid distance metrics SFS is a vanilla IDF weight on each token (circa 1959!)

SoftTFIDF is a Robust Distance Metric

Tuning and combining distance metrics using learning • Why use a single distance metric? • Can you learn how to combine several distance metrics, either distances across several fields (name, address) or several ways of measuring distance on the same field (edit distance, TFIDF)? • [Bilenko & Mooney, KDD2003; Ravikumar, Cohen & Fienberg, UAI 2004] • Can you learn the (many) parameters of an edit distance metric (e.g., what is the cost of replacing “M” with “N” vs “M” with “V”)? • [Ristad and Yianilos, PAMI’98; Bilenko & Mooney, KDD 2003]: learning edit distances using pair HMMs

Tuning and combining distance metrics using learning • Pair HMM: a probabilistic automaton that randomly emits a pair of letters at each clock tick. • A sequence of letter pairs corresponds to a pair of strings: COHEN, COH_M = COHEN, COHM • The parameters of the HMM can be tuned with extensions of the standard learning methods for HMMs • Training data is pairs of “true matches”, e.g. Cohen/Cohm [Figure: a 1-state pair HMM emitting (C,C),(O,O),(H,H),(E,_),(N,M),…]
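A 1-state pair HMM is small enough to write out. The sketch below scores the slide's alignment and also sums over all alignments of two strings with the forward algorithm. The emission probabilities are hypothetical hand-picked values (chosen to sum to 1 over a 26-letter alphabet together with the stop probability), not learned ones; the EM training step described in the cited papers is not shown.

```python
import math

ALPHABET = 26   # assumed alphabet size
P_STOP = 0.05   # hypothetical probability of halting at a clock tick

def emit(a, b):
    """Probability that one clock tick emits the pair (a, b); None is a gap."""
    if a is None or b is None:
        return 0.10 / (2 * ALPHABET)              # insertion or deletion
    if a == b:
        return 0.80 / ALPHABET                    # identical letters
    return 0.05 / (ALPHABET * (ALPHABET - 1))     # substitution

def alignment_prob(pairs):
    """Probability of emitting exactly this sequence of pairs, then stopping."""
    return math.prod(emit(a, b) for a, b in pairs) * P_STOP

def pair_prob(s, t):
    """Forward algorithm: total probability over every alignment of s and t."""
    f = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    f[0][0] = 1.0
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if i > 0:
                f[i][j] += f[i - 1][j] * emit(s[i - 1], None)
            if j > 0:
                f[i][j] += f[i][j - 1] * emit(None, t[j - 1])
            if i > 0 and j > 0:
                f[i][j] += f[i - 1][j - 1] * emit(s[i - 1], t[j - 1])
    return f[len(s)][len(t)] * P_STOP

slide_alignment = [("C", "C"), ("O", "O"), ("H", "H"), ("E", None), ("N", "M")]
print(alignment_prob(slide_alignment), pair_prob("COHEN", "COHM"))
```

Taking the negative log of `pair_prob` gives an edit-distance-like score. Learning would replace the three hand-set probabilities with per-letter-pair parameters estimated from true matches such as Cohen/Cohm.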

Tuning and combining distance metrics using learning • Ristad & Yianilos, 98: Levenshtein-like edit distance • Bilenko & Mooney, 2003: Smith-Waterman-like “affine gap” edit distance

Tuning and combining distance metrics using learning • Traditionally, multiple similarity measurements are combined using learning methods: • E.g., by clustering using a latent binary “Match?” variable • Independence assumptions are usually inappropriate (e.g., same-address ⇒ same-last-name) • How to model dependencies? • Structural EM on a limited model (Ravikumar, Cohen, Fienberg, UAI2004) [Figure labels: M; F1 F2 F3 F4] • F1: JaroWinkler(name1,name2) • F2: Levenshtein(addr1,addr2) • … • Fk: SoftTFIDF(name1,name2)

Tuning and combining distance metrics using learning • Traditionally, multiple similarity measurements are combined using learning methods: • E.g., by clustering using a latent binary “Match?” variable • Independence assumptions are usually inappropriate (e.g., same-address ⇒ same-last-name) • How to model dependencies? • Structural EM on a limited model (Ravikumar, Cohen, Fienberg, UAI2004) [Figure labels: latent binary variables; M; fixed relation; X1 X2 X3 X4; F1 F2 F3 F4; monotone dependencies allowed] • F1: JaroWinkler(name1,name2) • F2: Levenshtein(addr1,addr2) • … • Fk: SoftTFIDF(name1,name2)
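The traditional baseline mentioned in the first bullets (a latent binary “Match?” variable with features assumed independent given it) can be fit with plain EM. The sketch below is that baseline only, not the structural EM method of Ravikumar et al. It assumes each similarity score has already been thresholded into a binary agree/disagree feature; the initial parameter values and the toy data are hypothetical.

```python
def clip(p, eps=1e-6):
    return min(max(p, eps), 1 - eps)

def em_match_model(vectors, iters=50):
    """Fit P(M) and P(F_k = 1 | M) by EM; features independent given M."""
    n, k = len(vectors), len(vectors[0])
    p_match = 0.5
    # Initial guess: matches tend to agree, non-matches tend to disagree.
    agree = {1: [0.9] * k, 0: [0.1] * k}
    post = []
    for _ in range(iters):
        # E-step: posterior probability that each pair is a match.
        post = []
        for v in vectors:
            like = {}
            for m in (0, 1):
                p = p_match if m else 1 - p_match
                for x, a in zip(v, agree[m]):
                    p *= a if x else 1 - a
                like[m] = p
            post.append(like[1] / (like[0] + like[1]))
        # M-step: re-estimate parameters from the soft assignments.
        total = sum(post)
        p_match = clip(total / n)
        for j in range(k):
            on = sum(w for w, v in zip(post, vectors) if v[j])
            off = sum(1 - w for w, v in zip(post, vectors) if v[j])
            agree[1][j] = clip(on / total)
            agree[0][j] = clip(off / (n - total))
    return p_match, agree, post

# Hypothetical agreement vectors (name, address, phone) for 14 record pairs.
pairs = [(1, 1, 1)] * 4 + [(0, 0, 0)] * 8 + [(1, 1, 0), (0, 0, 1)]
p_match, agree, post = em_match_model(pairs)
print(round(p_match, 3), [round(w, 3) for w in post])
```

Because each feature is modeled separately, this baseline cannot express a dependency such as same-address implying same-last-name, which is the limitation the dependency-modeling bullets address.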

Tuning and combining distance metrics using unsupervised structural EM

Robust distance metrics, learnable using generative models (semi-/unsupervised learning) • Summary: • None of these methods has been evaluated on as many integration problems as one would like • The Ravikumar et al. structural EM method works well, but is computationally expensive • The pair-HMM methods of Ristad & Yianilos and Bilenko & Mooney work well, but require “true matched pairs” • Claim: these could be combined by using pair-HMMs in the inner loop of structural EM • Practical well before 2059 [Figure labels: latent binary variables; M; fixed relation; X1 X2 X3 X4; F1 F2 F3 F4; dependencies allowed]

Information integration, 2005-2059 • Understanding sources of uncertainty, and propagating uncertainty to the end user. • “Soft” information integration • Driven by user’s goal and user’s queries • Information integration for the great unwashed masses • Personal information, small stores of scientific information, … • Needed: • Robust distance metrics that work “out of the box” • Methods to tune and combine these metrics • Ways to rapidly integrate new information sources with unknown schemata
