
Adaptive Information Integration


Presentation Transcript


  1. Adaptive Information Integration 3 ET-I Subbarao Kambhampati http://rakaposhi.eas.asu.edu/i3 Thanks to Zaiqing Nie, Ullas Nambiar & Thomas Hernandez Talk at USC/Information Sciences Institute; November 5th 2004.

  2. Yochan Research Group. Plan-Yochan (Automated Planning): Temporal planning; Multi-objective optimization; Partial satisfaction planning; Conditional/Conformant/Stochastic planning; Heuristics using labeled planning graphs; OR approaches to planning; Applications to autonomic computing, web service composition, workflows. Db-Yochan (Information Integration): Adaptive information integration; Learning source profiles; Learning user interests; Applications to bio-informatics, anthropological sources, service and sensor integration.

  3. Our focus: Query Processing. [Architecture diagram: a multi-objective, anytime query planner that handles services and sensors (streaming data) consults a source catalog (ontologies; statistics learned via probing queries) over web pages, structured data, services, and sensors; it produces an annotated query plan and source calls for an executor/monitor, which returns answers, updates the learned statistics, and issues replanning requests guided by a utility metric.]

  4. Adaptive Information Integration • Query processing in information integration needs to be adaptive to: • Source characteristics • How is the data spread among the sources? • User needs • Multi-objective queries (trade off coverage for cost) • Imprecise queries • To be adaptive, we need profiles (meta-data) about sources as well as users • Challenge: Profiles are not going to be provided. • Autonomous sources may not export meta-data about data spread! • Lay users may not be able to articulate the source of their imprecision! Need approaches that gather (learn) the meta-data they need

  5. Three contributions to Adaptive Information Integration • BibFinder/StatMiner: Learns and uses source coverage and overlap statistics to support multi-objective query processing [VLDB 2003; ICDE 2004; TKDE 2005] • COSCO: Adapts the coverage/overlap statistics to text collection selection • A third system (Part III): Supports imprecise queries by automatically learning approximate structural relations among data tuples [WebDB 2004; WWW 2004]. Although we focus on avoiding retrieval of duplicates, coverage/overlap statistics can also be used to look for duplicates.

  6. Adaptive Integration of Heterogeneous Power Point Slides • Different template “schemas” • Different Font Styles • Naïve “concatenation” approaches don’t work!

  7. Part I: BibFinder. BibFinder: A popular CS bibliographic mediator integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect, Network Bibliography, CSB, CiteSeer. More than 58,000 real user queries collected. Mediated schema relation in BibFinder: paper(title, author, conference/journal, year). Primary key: title+author+year. Focus on selection queries, e.g. Q(title, author, year) :- paper(title, author, conference/journal, year), conference=SIGMOD

  8. Background & Motivation • Sources are incomplete and partially overlapping • Calling every possible source is inefficient and impolite • Need coverage and overlap statistics to figure out which sources are most relevant for every possible query! • We introduce a frequency-based approach for mining these statistics

  9. Challenges of gathering coverage and overlap statistics • It's impractical to assume that the sources will export such statistics, because the sources are autonomous. • It's impractical to learn and store all the statistics for every query: this necessitates roughly N_Q × 2^{N_S} different statistics, where N_Q is the number of possible queries and N_S is the number of sources. • It's impractical to assume knowledge of the entire query population a priori. • We introduce StatMiner • A threshold-based hierarchical mining approach • Store statistics w.r.t. query classes • Keep more accurate statistics for more frequently asked queries • Handle the efficiency and accuracy tradeoffs by adjusting the thresholds
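To make the threshold idea concrete, here is a minimal Python sketch of keeping statistics only for query classes whose frequency in the query log exceeds a minfreq threshold. The names and data layout are hypothetical, not BibFinder's actual code.

    from collections import Counter

    def frequent_query_classes(query_log, classify, minfreq):
        """Keep statistics only for query classes that are asked often enough.

        query_log : list of past queries (e.g. dicts of bound attribute values)
        classify  : function mapping a query to its query class (an AV-hierarchy node)
        minfreq   : minimum fraction of the log a class must account for
        """
        counts = Counter(classify(q) for q in query_log)
        total = len(query_log)
        # Only classes above the frequency threshold get (and keep) statistics.
        return {cls for cls, n in counts.items() if total and n / total >= minfreq}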

  10. BibFinder/StatMiner

  11. Query List & Raw Statistics Given the query list, we can compute the raw statistics for each query: P(S1..Sk|q)
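As an illustration of how such raw statistics might be computed, the sketch below estimates P(S1..Sk|q) for a single query from the answer sets each source returned; the data layout is an assumption, not the StatMiner implementation.

    from itertools import combinations

    def raw_overlap_stats(answers_by_source):
        """Estimate coverage/overlap statistics for one query.

        answers_by_source : dict mapping source name -> set of answer tuples
        Returns a dict mapping each non-empty source subset to the fraction of
        all distinct answers that every source in the subset returned.
        """
        all_answers = set().union(*answers_by_source.values())
        sources = list(answers_by_source)
        stats = {}
        for k in range(1, len(sources) + 1):
            for subset in combinations(sources, k):
                common = set.intersection(*(answers_by_source[s] for s in subset))
                # Single-source subsets give coverage; larger subsets give overlap.
                stats[subset] = len(common) / len(all_answers) if all_answers else 0.0
        return stats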

  12. AV Hierarchies and Query Classes

  13. StatMiner Raw Stats

  14. Using Coverage and Overlap Statistics to Rank Sources
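A hedged sketch of how such statistics can rank sources: greedily pick the source with the highest estimated residual coverage. The frozenset-keyed layout and the pairwise-overlap approximation are simplifications of the actual method, which uses the learned query-class statistics.

    def rank_sources_greedy(stats, sources, k):
        """Greedily pick k sources with the highest estimated residual coverage.

        stats   : dict mapping frozensets of source names to P(sources | q), e.g.
                  stats[frozenset({'DBLP'})] is DBLP's coverage for the query and
                  stats[frozenset({'DBLP', 'CSB'})] is their overlap.
        """
        selected, remaining = [], set(sources)
        for _ in range(min(k, len(sources))):
            def residual(s):
                coverage = stats.get(frozenset({s}), 0.0)
                # Crude residual estimate: subtract pairwise overlaps with already
                # selected sources (the full method uses inclusion-exclusion).
                overlap = sum(stats.get(frozenset({s, t}), 0.0) for t in selected)
                return max(coverage - overlap, 0.0)
            best = max(remaining, key=residual)
            selected.append(best)
            remaining.remove(best)
        return selected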

  15. BibFinder/StatMiner Evaluation • Experimental setup with BibFinder: • Mediator relation: Paper(title, author, conference/journal, year) • 25,000 real user queries are used. Among them, 4,500 queries are randomly chosen as test queries. • AV hierarchies for all four attributes are learned automatically. • 8,000 distinct values in author, 1,200 frequently asked keyword itemsets in title, 600 distinct values in conference/journal, and 95 distinct values in year.

  16. Learned Conference Hierarchy

  17. Plan Precision (fraction of true top-K sources called) • Here we observe the average precision of the top-2 source plans • The plans using our learned statistics have high precision compared to random source selection, and precision decreases very slowly as we vary the minfreq and minoverlap thresholds.

  18. Number of Distinct Results • Here we observe the average number of distinct results of the top-2 source plans. • Our method gets on average 50 distinct answers, while random selection gets only about 30 answers.

  19. Plan Precision on Controlled Sources. We observe the plan precision of top-5 source plans (25 simulated sources in total). Using greedy source selection does produce better plans. See Sections 3.8 and 3.9 for detailed information.

  20. Towards Multi-Objective Query Optimization (or: what good is a high-coverage source that is off-line?) • Sources vary significantly in terms of their response times • The response time depends both on the source itself and on the query that is asked of it • Specifically, which fields are bound in the selection query can make a difference • It is hard enough to get a high-coverage or a low-response-time plan; now we have to combine them… • Challenges: • How do we gather response time statistics? • How do we define an optimal plan in the context of both coverage/overlap and response time requirements? [Chart: response times of BibFinder sources (tuples retrieved over time).]

  21. Response time can depend on the query type. [Charts: range queries on year; effect of binding the author field.] Response times can also depend on the time of day and the day of the week [Raschid et al. 2002].

  22. Multi-objective Query Optimization • Need to optimize queries jointly for both high coverage and low response time • Staged optimization won't quite work. • An idea: make the source selection depend on both (residual) coverage and response time • Some possible utility functions we experimented with: [formulas omitted; see CIKM 2001]
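As one illustrative possibility (not necessarily one of the utility functions from the talk), a source's utility can be a weighted combination of its residual coverage and a normalized response-time penalty:

    def joint_utility(residual_coverage, response_time, alpha=0.7, max_time=10.0):
        """Trade off coverage against response time with a single weight alpha.

        residual_coverage : expected fraction of new answers from calling the source
        response_time     : expected response time in seconds
        alpha             : relative weight on coverage (1.0 means coverage only)
        max_time          : response time treated as the worst case, for normalization
        """
        time_penalty = min(response_time / max_time, 1.0)   # normalize to [0, 1]
        return alpha * residual_coverage - (1 - alpha) * time_penalty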

  23. Results on BibFinder

  24. Part II: Text Collection Selection with COSCO

  25. Selecting among overlapping collections. [Diagram: a meta-search pipeline with collection selection, query execution over collections such as WSJ, WP, CNN, NYT, and FT, and result merging into a single ranked list.] • Collections overlap (news meta-searchers, bibliography search engines, etc.) • Objectives: • Retrieve a variety of results • Avoid collections with irrelevant or redundant results • Example: for the query "bank mergers", select collections such as FT and CNN • Existing work (e.g. CORI) assumes collections are disjoint!

  26. The Approach: COSCO ("COllection Selection with Coverage and Overlap Statistics"). Queries are keyword sets; query classes are frequent keyword subsets. [Diagram: the offline component gathers coverage and overlap information for past queries, identifies frequent item sets among those queries, and computes statistics for the frequent item sets; the online component maps a new user query to frequent item sets, computes statistics for the query from the mapped item sets, and determines the collection order.]

  27. Challenge: Defining & Computing Overlap. [Diagrams: (A) two collections C1 and C2 with partially shared result lists; (B) three collections C1, C2, and C3 with pairwise-shared results.] • Collection overlap may be non-symmetric, or "directional" (A). • Document overlap may be non-transitive (B).

  28. Gathering Overlap Statistics • Solution: • Consider the query result set of a particular collection as a single bag of words • Approximate overlap as the intersection between the result-set bags • Approximate overlap between 3+ collections using only pairwise overlaps (a sketch follows below)
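A minimal sketch of the bag-of-words idea, assuming each collection's top results for a query are available as plain-text documents; the normalization used here is an assumption, not necessarily COSCO's.

    from collections import Counter

    def result_bag(documents):
        """Collapse a collection's top results for a query into one bag of words."""
        bag = Counter()
        for doc in documents:
            bag.update(doc.lower().split())
        return bag

    def bag_overlap(bag_a, bag_b):
        """Approximate overlap as the size of the multiset intersection of the
        two result-set bags, normalized by the size of the smaller bag."""
        common = sum((bag_a & bag_b).values())   # multiset (min-count) intersection
        smaller = min(sum(bag_a.values()), sum(bag_b.values()))
        return common / smaller if smaller else 0.0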

  29. Controlling Statistics • Objectives: • Limit the number of statistics stored • Improve the chances of having statistics for new queries • Solution: • Identify frequent item sets among queries (Apriori algorithm) • Store statistics only with respect to these frequent item sets

  30. The Online Component. [Diagram: the same architecture as slide 26, with the online path highlighted.] • Purpose: determine the collection order for a user query • 1. Map the query to stored item sets • 2. Compute statistics for the query using the mapped item sets • 3. Determine the collection order
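One simplified way the three online steps could fit together; the data layout and the naive averaging of item-set statistics are assumptions, not COSCO's actual combination rule.

    def collection_order(query_keywords, itemset_stats, collections):
        """Order collections for a new query using stored item-set statistics.

        itemset_stats : dict mapping frozensets of keywords to per-collection
                        coverage estimates, e.g.
                        {frozenset({'neural', 'network'}): {'CSB': 0.08, ...}, ...}
        """
        query = set(query_keywords)
        mapped = [s for s in itemset_stats if s <= query]   # step 1: map query to item sets
        if not mapped:
            return list(collections)                        # no statistics: keep default order
        scores = {c: 0.0 for c in collections}
        for s in mapped:                                    # step 2: combine statistics
            for c, cov in itemset_stats[s].items():
                scores[c] = scores.get(c, 0.0) + cov / len(mapped)
        # step 3: rank collections by their combined score
        return sorted(collections, key=lambda c: scores.get(c, 0.0), reverse=True)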

  31. Creating the Collection Test Bed • 6 real collections were probed: • ACM Digital Library, Compendex, CSB, etc. • Documents: authors + title + year + conference + abstract • Top-20 documents from each collection • 9 artificial collections were created: • 6 were proper subsets of each of the 6 real collections • 2 were unions of two subset collections from above • 1 was the union of 15% of each real collection • In total: 15 overlapping, searchable collections

  32. Training our System • Training set: 90% of the query list • Gathering statistics for training queries: • Probing of the 15 collections • Identifying frequent item sets: • Support threshold used: 0.05% (i.e. 9 queries) • 681 frequent item sets found • Computing statistics for item sets: • Statistics fit in a 1.28 MB file • Sample entry (for the item set {network, neural}): coverage and overlap values for individual collections and collection pairs such as MIX15, CI, SC, AG, and AD.

  33. Performance Evaluation • Measuring the number of new and duplicate results: • Duplicate result: has cosine similarity > 0.95 with at least one already retrieved result • New result: has no duplicate • Oracular approach: • Knows which collection has the most new results • Retrieves a large portion of the new results early
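A small sketch of this duplicate test over term-frequency vectors; the tokenization is an assumption, and only the 0.95 cosine threshold comes from the slide.

    import math
    from collections import Counter

    def cosine_sim(text_a, text_b):
        """Cosine similarity between two documents' term-frequency vectors."""
        va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
        norm = math.sqrt(sum(c * c for c in va.values())) * \
               math.sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    def is_new_result(doc, already_retrieved, threshold=0.95):
        """A result is new if no earlier result is more than 0.95 cosine-similar."""
        return all(cosine_sim(doc, seen) <= threshold for seen in already_retrieved)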

  34. Comparison with other approaches

  35. Comparison of COSCO against CORI. [Charts: number of results, duplicates, new results, and cumulative new results vs. collection rank (1-15), using CORI and using coverage and overlap statistics.] • CORI: constant rate of change, as many new results as duplicates, more total results retrieved early • COSCO: globally descending trend of new results, sharp difference between the numbers of new results and duplicates, fewer total results at first

  36. Summary of Experimental Results • COSCO… • displays Oracular-like behavior. • consistently outperforms CORI. • retrieves up to 30% more results than CORI when test queries reflect training queries. • can map at least 50% of queries to some item sets, even with worst-case training queries. • is a step towards Oracular-like performance, but there is still room for improvement.

  37. Part III: Answering Imprecise Queries [WebDB 2004; WWW 2004]

  38. Why Imprecise Queries? • A feasible query: Make = "Toyota", Model = "Camry", Price ≤ $7000 • Sample answers: (Toyota, Camry, $7000, 1999); (Toyota, Camry, $7000, 2001); (Toyota, Camry, $6700, 2000); (Toyota, Camry, $6500, 1998); … • But the user really wants a 'sedan' priced around $7000. What about the price of a Honda Accord? Is there a Camry for $7100? • Solution: support imprecise queries.

  39. Dichotomy in Query Processing • Databases: • User knows what she wants • User query completely expresses the need • Answers exactly matching query constraints • IR Systems: • User has an idea of what she wants • User query captures the need to some degree • Answers ranked by degree of relevance

  40. Existing Approaches • Similarity search over vector space: data must be stored as vectors of text (WHIRL, W. Cohen, 1998) • Enhanced database model: add a 'similar-to' operator to SQL, with distances provided by an expert/system designer (VAGUE, A. Motro, 1988); support similarity search and query refinement over abstract data types (Binderberger et al., 2003) • User guidance: users provide information about the objects required and their possible neighborhood (Proximity Search, Goldman et al., 1998) • Limitations: • User/expert must provide similarity measures • New operators needed to use distance measures • Not applicable over autonomous databases • Our objectives: • Minimal user input • Database internals not affected • Domain-independent & applicable to Web databases

  41. AFD-based Query Relaxation

  42. An Example • Relation: CarDB(Make, Model, Price, Year) • Imprecise query: Q :− CarDB(Model like "Camry", Price like "10k") • Base query: Qpr :− CarDB(Model = "Camry", Price = "10k") • Base set Abs: • Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000" • Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"

  43. Obtaining the Extended Set • Problem: Given the base set, find tuples from the database similar to the tuples in the base set. • Solution: • Consider each tuple in the base set as a selection query, e.g. Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000" • Relax each such query to obtain "similar" precise queries, e.g. Make = "Toyota", Model = "Camry", Price = "", Year = "2000" • Execute and determine tuples having similarity above some threshold. • Challenge: Which attribute should be relaxed first? Make? Model? Price? Year? • Solution: Relax the least important attribute first (a sketch follows below).
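A minimal sketch of generating progressively relaxed queries from one base-set tuple, given an already computed relaxation order; the names and the dict representation are illustrative, not the system's code.

    def relaxations(base_tuple, relax_order):
        """Yield selection queries with one more attribute unbound at each step.

        base_tuple  : dict of attribute -> value, e.g.
                      {'Make': 'Toyota', 'Model': 'Camry', 'Price': '10k', 'Year': '2000'}
        relax_order : attributes sorted least-important first, e.g.
                      ['Price', 'Model', 'Year', 'Make']
        """
        query = dict(base_tuple)
        for attr in relax_order[:-1]:     # keep at least one attribute bound
            query = dict(query)
            query.pop(attr)               # unbind the least important remaining attribute
            yield query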

  44. Least Important Attribute • Definition: An attribute whose binding value when changed has minimal effect on values binding other attributes. • Does not decide values of other attributes • Value may depend on other attributes • E.g. Changing/relaxing Price will usually not affect other attributes • but changing Model usually affects Price • Dependence between attributes useful to decide relative importance • Approximate Functional Dependencies & Approximate Keys • Approximate in the sense that they are obeyed by a large percentage (but not all) of tuples in the database • Can use TANE, an algorithm by Huhtala et al [1999]

  45. Attribute Ordering • Given a relation R: • Determine the AFDs and approximate keys • Pick the key with the highest support, say Kbest • Partition the attributes of R into • key attributes, i.e. belonging to Kbest • non-key attributes, i.e. not belonging to Kbest • Sort the subsets using influence weights [weight formula omitted], where Ai ∈ A' ⊆ R, j ≠ i and j = 1 to |Attributes(R)| • Attribute relaxation order is all non-keys first, then keys • Multi-attribute relaxation: independence assumption. Example: CarDB(Make, Model, Year, Price); key attributes: Make, Year; non-key: Model, Price; order: Price, Model, Year, Make; 1-attribute relaxations: {Price, Model, Year, Make}; 2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}

  46. Tuple Similarity • Tuples obtained after relaxation are ranked according to their similarity to the corresponding tuples in the base set: Sim(t, tb) = ∑ Wi × VSim(t.Ai, tb.Ai), where Wi are the normalized influence weights, ∑ Wi = 1, i = 1 to |Attributes(R)| • Value similarity (VSim): • Euclidean for numerical attributes, e.g. Price, Year • Concept similarity for categorical attributes, e.g. Make, Model
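The weighted sum above can be written directly as a small helper; this is a sketch in which the per-attribute value-similarity functions are passed in rather than reproduced.

    def tuple_similarity(tuple_a, tuple_b, weights, value_sim):
        """Weighted similarity between an answer tuple and a base-set tuple.

        weights   : dict attribute -> normalized influence weight (summing to 1)
        value_sim : function (attribute, v1, v2) -> similarity in [0, 1], e.g. a
                    scaled Euclidean measure for Price/Year and concept (Jaccard)
                    similarity for Make/Model
        """
        return sum(w * value_sim(a, tuple_a[a], tuple_b[a]) for a, w in weights.items())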

  47. Concept (Value) Similarity • Concept: any distinct attribute-value pair, e.g. Make=Toyota • Visualized as a selection query binding a single attribute • Represented as a supertuple (e.g. the supertuple for the concept Make=Toyota) • Concept similarity: estimated as the percentage of correlated values common to two given concepts, where v1, v2 ∈ Aj, i ≠ j and Ai, Aj ∈ R • Measured as the Jaccard similarity among the supertuples representing the concepts: JaccardSim(A, B) = |A ∩ B| / |A ∪ B|
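For illustration, if each supertuple is represented as a set of (attribute, value) keywords (an assumption about the representation), the Jaccard measure is:

    def jaccard_sim(supertuple_a, supertuple_b):
        """Jaccard similarity |A ∩ B| / |A ∪ B| between two supertuples,
        each given here as a set of (attribute, value) pairs."""
        if not supertuple_a and not supertuple_b:
            return 0.0
        return len(supertuple_a & supertuple_b) / len(supertuple_a | supertuple_b)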

  48. Concept (Value) Similarity Graph. [Graph: nodes are car makes (Dodge, Nissan, BMW, Honda, Ford, Chevrolet, Toyota); edges are weighted by concept similarity, with values such as 0.25, 0.22, 0.16, 0.15, 0.12, and 0.11.]
