Quality-driven Integration of Heterogeneous Information Systems by Felix Naumann, et al. (VLDB 1999)

Presented by Heasoo Hwang, 17 Feb 2006

Introduction
• Motivation
  • Observation
    • The main user criterion for selecting sources by hand
      – NOT just response time, BUT the expected quality of the data
    • The sources have varying information quality
      • Results become outdated quickly
      • The intrinsic imprecision of many experimental techniques
• Contribution
  • Integration of
    • classical query planning
    • the assessment and consideration of information quality (IQ)

Correctness and Completeness
• For a given user query, UQ, against the mediator schema:
  • “Correct plan”
    • A combination of QCAs (query correspondence assertions) that is semantically contained in the UQ
    • Correct plans compute only correct results
  • “Complete answer” to a UQ w.r.t. the given QCAs
    • Union over the answers of all correct plans
• Problem
  • Too many correct plans!

Example (1/2)
• Global tables
  • sequence and gene
• A user query
  • The sequence of a specific gene
• The mediator detects from the QCAs that
  • S5 and two other sources can be used for the gene part
  • S1, S2, and S3 can be used for the sequence part
• We can generate 9 correct plans
• Question
  • DO WE HAVE TO EXECUTE ALL 9 CORRECT PLANS?
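
The count of 9 follows from pairing each of the three gene sources with each of the three sequence sources. A minimal sketch of that enumeration is shown here; the names `G_a` and `G_b` are placeholders for the two gene sources the slide leaves unnamed, and a real mediator would derive the candidate lists from the QCAs rather than hard-code them.

```python
from itertools import product

# Sources the mediator found usable for each part of the query.
# "S5" and S1..S3 come from the example; "G_a" and "G_b" are placeholder
# names (assumptions) for the two gene sources the slide does not name.
gene_sources = ["S5", "G_a", "G_b"]
sequence_sources = ["S1", "S2", "S3"]

# Each correct plan pairs one gene source with one sequence source.
plans = list(product(gene_sources, sequence_sources))

print(len(plans))  # 9
```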

Example (2/2)
• Assuming that IQ scores are available
  • Sequence data on S1
    • S1 copies infrequently from other sites, sometimes introducing parsing errors
  • Sequence data on S3
    • Highly up-to-date, but few annotations are provided
• Reducing the number of correct plans to be executed
  • We may consider 3 correct plans instead of 9
  • Case 1: If the user is particularly interested in complete annotation
    • Plans using S3 are not very promising
  • Case 2: If highly up-to-date data is required
    • S1 could probably be ignored

“Completeness of integrated information sources” by Felix Naumann, et al. (Information Systems 2004)
• Implicit assumption of most information integration projects
  • “The mediator should always compute the complete answer”
• In many cases, this assumption is wrong!
  • “Computing the complete answer is not always necessary”
    • For example, a meta-search engine does not need to download all hits from all the search engines it uses; taking the top ten hits usually suffices
  • “Computing the complete answer may be too expensive or may take too long”
• Another assumption they make
  • “The most complete response to the user is the best, given some cost limits”

IQ classification
• Source-specific criteria
  • Determine the overall quality of a data source
  • E.g., reputation
• QCA-specific criteria
  • Determine the quality aspects of the specific queries that a source can compute
  • E.g., response time
• Attribute-specific criteria
  • Assess the quality of a source in terms of its ability to provide the attributes of a specific user query
  • E.g., the completeness of the annotation attribute on a source
• Depending on the application domain and the structure of the available sources, the classification may vary
• Problem
  • Assigning IQ scores in an objective manner is difficult
  • Some IQ criteria are highly subjective (e.g., reputation)
    → Use user profiles: sets of IQ scores for all subjective criteria

Source-specific criteria
• Ease of understanding
  • User ranking
• Reputation
  • User ranking
• Reliability
  • Ranking of experimental method (intrinsic error rate)
• Timeliness
  • Average age of data

QCA-specific criteria
• Availability
  • Percentage of time the source is accessible
• Price
  • Monetary price of a query
• Representational consistency
  • Wrapper workload
  • E.g., a wrapper with a relational export schema is always consistent with the global schema
• Response time
  • Average waiting time for a response
• Accuracy
  • Percentage of objects without errors
  • Errors are usually introduced during data input
• Relevancy
  • Percentage of real-world objects represented
  • Usually highly user-dependent

Attribute-specific criteria
• Completeness
  • Fullness of the relation in each attribute (horizontal fitness)
  • E.g., an attribute with 90% null values has low completeness
• Amount
  • Number of unwanted attributes (vertical fitness)
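
As a small illustration of the completeness criterion, the sketch below scores one attribute column by its share of non-null values. This assumes completeness is measured as the fraction of non-null entries, which the 90% null-value example suggests but the slide does not define formally; the function name and sample data are hypothetical.

```python
from typing import Optional, Sequence

def attribute_completeness(values: Sequence[Optional[object]]) -> float:
    """Fraction of non-null values in one attribute column (assumed measure)."""
    if not values:
        return 0.0
    non_null = sum(1 for v in values if v is not None)
    return non_null / len(values)

# Hypothetical annotation column: 9 of 10 values are null (90% null values).
annotation = ["kinase", None, None, None, None, None, None, None, None, None]
print(attribute_completeness(annotation))  # 0.1
```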

Algorithm (Three phases)
• Input
  • User query
  • Sources with QCAs, IQ scores
• Phase 1
  • Source selection with source-specific criteria
  • Result: best sources
• Phase 2
  • Planning with QCAs
  • Result: all correct plans
• Phase 3
  • Plan selection with QCA- and attribute-specific criteria
  • Result: best plans

Phase 1: Source selection
• Goal
  • Use the source-specific IQ criteria to “weed out” sources that are qualitatively not as good as others (non-good sources)
  • Non-good sources are completely disregarded in further planning
• Method used
  • Data Envelopment Analysis (DEA), developed by Charnes et al.
    • A general method to classify a population of observations
    • Avoids the problems of scaling and weighting
• Exceptions: do not remove a source S with low IQ
  • If S is the only source providing a certain attribute of the global schema
  • If S exclusively provides certain extensions of an attribute
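
The selection step can be sketched roughly as follows. Note that this is not DEA: as a much simpler stand-in it drops sources that are dominated on every criterion by some other source, and then applies the first exception above (a sole provider of an attribute is kept). The second exception (exclusive extensions) is not modeled. All scores, source names, and the `provides` map are invented for illustration.

```python
from typing import Dict, Set

# Hypothetical source-specific scores, all scaled so that higher is better.
scores: Dict[str, Dict[str, float]] = {
    "S1": {"reputation": 0.4, "timeliness": 0.3},
    "S2": {"reputation": 0.8, "timeliness": 0.6},
    "S3": {"reputation": 0.5, "timeliness": 0.9},
    "S4": {"reputation": 0.2, "timeliness": 0.1},
}
# Hypothetical map: which global-schema attributes each source provides.
provides: Dict[str, Set[str]] = {
    "S1": {"sequence"},
    "S2": {"sequence"},
    "S3": {"sequence"},
    "S4": {"gene"},
}

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def select_sources(scores, provides) -> Set[str]:
    kept = set()
    for s in scores:
        dominated = any(dominates(scores[o], scores[s]) for o in scores if o != s)
        # Attributes offered by any other source.
        others = set().union(*(provides[o] for o in scores if o != s))
        sole_provider = bool(provides[s] - others)
        # Keep non-dominated sources, plus sole providers regardless of quality.
        if not dominated or sole_provider:
            kept.add(s)
    return kept

print(sorted(select_sources(scores, provides)))  # ['S2', 'S3', 'S4']
```

Here S1 is dropped because S2 is better on both criteria, while S4 survives despite the lowest scores because no other source provides the gene attribute.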

Phase 2: Plan creation
• Input: UQ with the user weightings for each attribute
• Output: plans, each possibly producing a different set of correct tuples for UQ

Phase 3: Plan selection
• Goal
  • Qualitatively rank the plans of the previous phase
  • Restrict plan execution to meet stop conditions
    • Stop condition 1: execute some best percentage of plans
    • Stop condition 2: execute as many plans as necessary to meet certain cost or quality criteria
• Three steps
  • a) QCA quality
    • The IQ scores of the QCAs are determined
  • b) Plan quality
    • b1) The quality model (tree-structured) aggregates these scores along tree paths
    • b2) An overall score is obtained at the root of the tree, which forms the score of the entire plan
  • c) Plan ranking
    • Rank all plans by their IQ scores
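
The two stop conditions can be illustrated with a short sketch over an already ranked plan list. The plan names, scores, and costs are invented, and the budget rule is only one possible reading of "cost criteria"; the slides do not specify how the cutoff is computed.

```python
import math
from typing import List, Tuple

Plan = Tuple[str, float, float]  # (name, IQ score, cost); all values hypothetical

# Plans ranked by IQ score, best first.
ranked: List[Plan] = sorted(
    [("P1", 0.62, 4.0), ("P2", 0.81, 6.0), ("P3", 0.47, 2.0), ("P4", 0.75, 5.0)],
    key=lambda p: p[1],
    reverse=True,
)

def top_fraction(plans: List[Plan], fraction: float) -> List[Plan]:
    """Stop condition 1: keep the best `fraction` of the ranked plans."""
    k = math.ceil(len(plans) * fraction)
    return plans[:k]

def within_budget(plans: List[Plan], budget: float) -> List[Plan]:
    """Stop condition 2 (cost variant, assumed): take plans in rank order
    until the next plan would exceed the budget."""
    chosen, spent = [], 0.0
    for plan in plans:
        if spent + plan[2] > budget:
            break
        chosen.append(plan)
        spent += plan[2]
    return chosen

print([p[0] for p in top_fraction(ranked, 0.5)])    # ['P2', 'P4']
print([p[0] for p in within_budget(ranked, 12.0)])  # ['P2', 'P4']
```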

3a) QCA quality – determine IQ vectors for the QCAs
[Figure: the general IQ vector for QCAs]
[Figure: the IQ vectors for the QCAs participating in the six correct plans]

3b) Plan Quality
[Figure: the IQ vector for an inner join node; merging IQ vectors in join nodes]
• Up to this point, the scores are neither scaled nor weighted, making a comparison or ranking of plans impossible
[Figure: the aggregated IQ vectors of the six plans]
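
To make the idea of merging IQ vectors at a join node concrete, here is a minimal sketch over three criteria. The merge functions are assumptions chosen for illustration, not taken from the slides: availability is multiplied (both inputs must be reachable), price is summed (both queries are paid for), and response time takes the maximum (assuming the inputs are queried in parallel). The `IQ` class and all values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class IQ:
    availability: float   # probability-like value in [0, 1]
    price: float          # monetary units
    response_time: float  # seconds

def merge_join(left: IQ, right: IQ) -> IQ:
    """Combine the IQ vectors of a join node's two children.

    The per-criterion merge functions below are illustrative assumptions.
    """
    return IQ(
        availability=left.availability * right.availability,
        price=left.price + right.price,
        response_time=max(left.response_time, right.response_time),
    )

q1 = IQ(availability=0.5, price=1.0, response_time=2.0)
q2 = IQ(availability=0.5, price=0.5, response_time=0.5)
merged = merge_join(q1, q2)
print(merged)  # IQ(availability=0.25, price=1.5, response_time=2.0)
```

Applying `merge_join` bottom-up over a plan tree would yield one aggregated vector at the root, which is still unscaled and unweighted.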

3c) Plan ranking
• Method used
  • The Simple Additive Weighting (SAW) method
• Scaling
  • Positive criteria
    • Availability, accuracy, relevancy, completeness
  • Negative criteria
    • Price, representational consistency, response time, amount
• Computing the weighted sum
  • Needs a user-specific weight vector
    • Reflects the importance of the individual criteria to the user
    • Stored in the user profile
[Figure: IQ scores of plans obtained with the indifferent weight vector (each weight value is 1/8)]
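
The ranking step can be sketched as min-max scaling followed by a weighted sum. This is a simplified illustration: it uses only four of the eight criteria, the plan vectors are invented, and the min-max scaling formula is an assumption (the slide says only that positive and negative criteria are scaled differently). With four criteria, the indifferent weight vector becomes 1/4 per criterion instead of 1/8.

```python
from typing import Dict, List, Tuple

POSITIVE = {"availability", "accuracy"}  # higher raw value is better
# Every other criterion (here: price, response_time) is treated as negative.

# Hypothetical aggregated IQ vectors for three plans (values invented).
plans: Dict[str, Dict[str, float]] = {
    "P1": {"availability": 0.9, "accuracy": 0.8, "price": 2.0, "response_time": 4.0},
    "P2": {"availability": 0.6, "accuracy": 0.9, "price": 1.0, "response_time": 2.0},
    "P3": {"availability": 0.8, "accuracy": 0.7, "price": 3.0, "response_time": 6.0},
}

def scale(plans: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Min-max scale each criterion to [0, 1] so that 1 is always best."""
    criteria = list(next(iter(plans.values())).keys())
    scaled: Dict[str, Dict[str, float]] = {name: {} for name in plans}
    for c in criteria:
        column = [v[c] for v in plans.values()]
        lo, hi = min(column), max(column)
        span = hi - lo
        for name, v in plans.items():
            if span == 0:
                scaled[name][c] = 1.0  # all plans tie on this criterion
            elif c in POSITIVE:
                scaled[name][c] = (v[c] - lo) / span
            else:
                scaled[name][c] = (hi - v[c]) / span
    return scaled

def saw_rank(plans, weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Weighted sum of scaled scores, sorted best first."""
    scaled = scale(plans)
    totals = {n: sum(weights[c] * s[c] for c in s) for n, s in scaled.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

indifferent = {c: 1 / 4 for c in plans["P1"]}
ranking = saw_rank(plans, indifferent)
print([name for name, _ in ranking])  # ['P2', 'P1', 'P3']
```

A user profile that weights availability more heavily would change the weight vector and could reorder the plans without touching the scaling step.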
