
Issues in Bridging DB & IR




Presentation Transcript


  1. 4/29 Issues in Bridging DB & IR. Announcements: • Next class: Interactive Review (come prepared) • Homework III solutions online • Demos tomorrow (instructions will be mailed by the end of the class)

  2. First, some discussion of BibFinder: how queries are mapped, etc.

  3. CEAS Online Evaluations • You can do them at https://intraweb.eas.asu.edu/eval • Will be available until the end of day May 5th (so the exam is unfettered by what you might think about it) • Instructors get to see it only after the grades have been given (so you don’t need to feel compelled to be particularly nice) • Your feedback would be appreciated (especially the written comments) • Last semester I got 2,196 words of comments; let us see if we can break the record ;-)

  4. The popularity of the Web brings two broad challenges to databases:
  • Integration of autonomous data sources (data/information integration)
    – Technically this has to handle heterogeneous data too
    – But we will sort of assume that the sources are “quasi-relational”
  • Supporting heterogeneous data (combining DB/IR)
    – This can be tackled in the presence of a single database
    – The issues are:
      – How to do effective querying in the presence of structured and text data (e.g. the Stuff I Have Seen project)
      – How to support IR-style querying on a DB, because users seem to know IR/keyword-style querying better (notice the irony here: we said structure is good because it supports structured querying)
      – How to support imprecise queries

  5. DB vs. IR
  • DBs allow structured querying
    – Queries and results (tuples) are different objects
    – Soundness & completeness expected
    – User is expected to know what she is doing
  • IR only supports unstructured querying
    – Queries and results are both documents!
    – High precision & recall is hoped for
    – User is expected to be a dunderhead.

  6. Some specific problems • How to handle textual attributes? • How to support keyword-based querying? • How to handle imprecise queries? (Ullas Nambiar’s work)

  7. 1. Handling text fields in data tuples • Often you have database relations some of whose fields are “textual” • E.g. a movie database which has, in addition to year, director, etc., a column called “Review” containing unstructured text • Normal DB operations ignore this unstructured stuff (can’t join over it) • SQL sometimes supports a “Contains” constraint (e.g. give me movies that contain “Rotten” in the review)

  8. Soft Joins… WHIRL [Cohen] • We can extend the notion of joins to “similarity joins”, where similarity is measured in terms of vector similarity over the text attributes. The join tuples are then output in ranked form, with the rank proportional to the similarity • Neat idea… but it does have some implementation difficulties • Most tuples in the cross-product will have non-zero similarities, so we need query processing that will somehow produce only the highly ranked tuples
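
To make the soft-join idea concrete, here is a minimal sketch (not Cohen's actual WHIRL implementation; the relations, attribute names, and data below are made up): the text attributes are turned into TF-IDF vectors, and tuple pairs from the cross-product are ranked by cosine similarity. A real system would use cleverer query processing instead of scoring the full cross-product.

```python
# Minimal sketch of a WHIRL-style "soft join": rank tuple pairs by the
# TF-IDF cosine similarity of their text attributes. Illustrative only.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tfidf_vectors(texts):
    """Build one normalized TF-IDF vector (a dict) per input string."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    vecs = []
    for d in docs:
        v = {w: tf * math.log(n / df[w]) for w, tf in d.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({w: x / norm for w, x in v.items()})
    return vecs

def soft_join(rel1, rel2, attr1, attr2, k=5):
    """Return the top-k tuple pairs, ranked by text similarity."""
    vecs = tfidf_vectors([t[attr1] for t in rel1] + [t[attr2] for t in rel2])
    v1, v2 = vecs[:len(rel1)], vecs[len(rel1):]
    scored = [(sum(a[w] * b.get(w, 0.0) for w in a), t1, t2)
              for a, t1 in zip(v1, rel1) for b, t2 in zip(v2, rel2)]
    scored = [s for s in scored if s[0] > 0]   # drop completely dissimilar pairs
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]

# Hypothetical relations with a textual "title"/"film" attribute.
movies  = [{"title": "Star Wars: A New Hope"}, {"title": "The Matrix"}]
reviews = [{"film": "Star Wars Episode IV - A New Hope", "score": 9},
           {"film": "Matrix, The", "score": 8}]
for sim, m, r in soft_join(movies, reviews, "title", "film"):
    print(f"{sim:.2f}  {m['title']}  ~  {r['film']}")
```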

  9. 2. Supporting keyword search on databases How do we answer a query like “Soumen Sunita”? Issues: --the schema is normalized (not everything in one table) --How to rank multiple tuples which contain the keywords?

  10. What BANKS Does • The whole DB is seen as a directed graph (edges correspond to foreign keys) • Answers are subgraphs • Ranked by edge weights
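
A rough sketch of that graph view, using a hand-built toy bibliography graph (all node names and contents below are hypothetical). Real BANKS uses a backward expanding search and combines edge weights with node prestige; this simplified version just ranks candidate answer roots by the total BFS distance needed to reach one matching tuple per keyword, echoing the “Soumen Sunita” query above.

```python
# Sketch: the database as a graph of tuple nodes connected by foreign-key
# edges; an answer to a keyword query is a small subtree connecting one
# matching tuple per keyword, ranked here by total path length.
from collections import deque

def bfs_distances(graph, start):
    """Hop distances from `start`, treating every edge as weight 1."""
    dist, q = {start: 0}, deque([start])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def keyword_search(graph, node_text, keywords, k=3):
    """Rank candidate root nodes by the cost of connecting all keywords."""
    hits = {kw: [n for n, txt in node_text.items() if kw.lower() in txt.lower()]
            for kw in keywords}
    answers = []
    for root in graph:
        dist = bfs_distances(graph, root)
        total, leaves = 0, []
        reachable_all = True
        for kw in keywords:
            candidates = [(dist[n], n) for n in hits[kw] if n in dist]
            if not candidates:
                reachable_all = False
                break
            d, n = min(candidates)   # nearest tuple containing this keyword
            total += d
            leaves.append(n)
        if reachable_all:
            answers.append((total, root, leaves))
    return sorted(answers)[:k]

# Hypothetical bibliography graph: papers linked to authors via foreign keys
# (edges listed in both directions so BFS can traverse them either way).
graph = {"paper1": ["author1", "author2"], "author1": ["paper1"],
         "author2": ["paper1", "paper2"], "paper2": ["author2", "author3"],
         "author3": ["paper2"]}
node_text = {"author1": "Soumen Chakrabarti", "author2": "S. Sudarshan",
             "author3": "Sunita Sarawagi", "paper1": "Keyword search in databases",
             "paper2": "Mining interesting patterns"}
print(keyword_search(graph, node_text, ["Soumen", "Sunita"]))
```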

  11. BANKS: Keyword Search in DB

  12. 3. Supporting Imprecise Queries (“Want cars priced ‘around’ $7000”)
  • Increasing number of Web-accessible databases
    – E.g. bibliographies, reservation systems, department catalogs, etc.
  • Support for precise queries only – exactly matching tuples
  • Difficulty in extracting desired information
    – Limited query capabilities provided by form-based query interfaces
    – Lack of schema/domain information
    – Increasing complexity of types of data, e.g. hypertext, images, etc.
  • Oftentimes the user wants ‘about the same’ instead of ‘exact’
    – Bibliography search: find similar publications
  • Solution: provide answers closely matching the query constraints

  13. Relaxing queries…
  • It is obvious how to relax certain types of attribute values
    – E.g. price=7000 is approximately the same as price=7020
  • But how do we relax categorical attributes? How should we relax Make=Honda?
  • Two possible approaches:
    – Assume that domain-specific information about the similarity of values is available (difficult to satisfy in practice)
    – Attempt to derive the similarity between attribute values directly from the data
  • Qn: How do we compute the similarity between “Make=Honda” and “Make=Chevrolet”?
    – Idea: Compare the set of all tuples where Make=Honda to the set of all tuples where Make=Chevrolet
    – Consider each set of tuples as a vector of bags (where the bags correspond to the individual attributes)
    – Use IR similarity techniques to compare the vectors
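
A small sketch of the data-driven idea under simplifying assumptions: the tuples sharing each Make value are collapsed into one bag of values per remaining attribute, and corresponding bags are compared with a bag-Jaccard measure combined by attribute weights. The car data, the weights, and the choice of Jaccard (rather than, say, cosine) are illustrative, not the exact method from the papers.

```python
# Sketch: derive the similarity of two categorical values (Make=Honda vs.
# Make=Chevrolet) from the bags of values that co-occur with them.
from collections import Counter

def bag_jaccard(b1, b2):
    """Jaccard similarity with bag (multiset) semantics."""
    union = sum((b1 | b2).values())
    return sum((b1 & b2).values()) / union if union else 0.0

def value_similarity(tuples, attr, v1, v2, weights):
    """Weighted per-attribute bag similarity of attr=v1 vs. attr=v2."""
    rows1 = [t for t in tuples if t[attr] == v1]
    rows2 = [t for t in tuples if t[attr] == v2]
    sim = 0.0
    for a, w in weights.items():          # every attribute except `attr`
        bag1 = Counter(str(t[a]) for t in rows1)
        bag2 = Counter(str(t[a]) for t in rows2)
        sim += w * bag_jaccard(bag1, bag2)
    return sim

# Hypothetical car tuples and attribute weights.
cars = [
    {"Make": "Honda", "Model": "Accord", "Body": "sedan", "Price": "7000-8000"},
    {"Make": "Honda", "Model": "Civic", "Body": "sedan", "Price": "6000-7000"},
    {"Make": "Chevrolet", "Model": "Malibu", "Body": "sedan", "Price": "6000-7000"},
    {"Make": "Toyota", "Model": "Celica", "Body": "coupe", "Price": "8000-9000"},
]
weights = {"Model": 0.3, "Body": 0.4, "Price": 0.3}
print(value_similarity(cars, "Make", "Honda", "Chevrolet", weights))   # 0.35 on this sample
```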

  14. Finding similarities between attribute values

  15. 5/4

  16. Challenges in answering Imprecise Queries
  • Challenges:
    – Extracting additional tuples with minimal domain knowledge
    – Estimating similarity with minimal user input
  • We introduce IQE (Imprecise Query Engine):
    – Uses the query workload to identify other precise queries
    – Extracts additional tuples satisfying a query by issuing similar precise queries
    – Measures distance between queries using answerset similarity

  17. Answerset Similarity
  • Answerset A(Q): set of all answer tuples of query Q over relation R
  • Query similarity: Sim(Q1, Q2) := Sim(A(Q1), A(Q2))
  • Measuring answerset similarity:
    – Relational model: exact match between tuples; captures complete overlap
    – Vector space model: match keywords; also detects partial overlaps
  • Problem: a vector-space representation for answersets
  • Answer: the SuperTuple
  (Figure on slide: example supertuples for the answersets of Q(Author=Widom) and Q(Author=Ullman))

  18. Similarity Measures
  • Jaccard similarity metric with bag semantics: SimJ(Q1, Q2) = |Q1 ∩ Q2| / |Q1 ∪ Q2|
  • Doc-Doc similarity
    – Equal importance to all attributes; the supertuple is considered a “single bag” of keywords
    – Simdoc-doc(Q1, Q2) = SimJ(STQ1, STQ2)
  • Weighted-Attribute similarity
    – Weights assigned to attributes signify their importance to the user
    – Simwatr(Q1, Q2) = Σi wi × SimJ(STQ1(Ai), STQ2(Ai))
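
Roughly, the two measures could be coded as follows, assuming a supertuple is built by collapsing a query's answerset into one keyword bag per attribute; the bibliography answersets and attribute weights below are made up for illustration.

```python
# Sketch of the answerset-similarity measures: bag Jaccard over supertuples,
# either as a single combined bag (doc-doc) or per attribute with weights.
from collections import Counter

def supertuple(answers, attributes):
    """Collapse an answerset into one keyword bag per attribute."""
    return {a: Counter(w for t in answers for w in str(t[a]).lower().split())
            for a in attributes}

def sim_jaccard(bag1, bag2):
    union = sum((bag1 | bag2).values())
    return sum((bag1 & bag2).values()) / union if union else 0.0

def sim_doc_doc(st1, st2):
    """All attributes treated equally: supertuple as one big bag of keywords."""
    return sim_jaccard(sum(st1.values(), Counter()), sum(st2.values(), Counter()))

def sim_weighted_attr(st1, st2, weights):
    """Per-attribute Jaccard combined with user-supplied attribute weights."""
    return sum(w * sim_jaccard(st1[a], st2[a]) for a, w in weights.items())

attrs = ["Author", "Title", "Conference"]
ans_widom  = [{"Author": "Widom", "Title": "Lore a database for semistructured data", "Conference": "VLDB"}]
ans_ullman = [{"Author": "Ullman", "Title": "Principles of database systems", "Conference": "PODS"}]
st_w, st_u = supertuple(ans_widom, attrs), supertuple(ans_ullman, attrs)
print(sim_doc_doc(st_w, st_u))
print(sim_weighted_attr(st_w, st_u, {"Author": 0.2, "Title": 0.5, "Conference": 0.3}))
```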

  19. Empirical Evaluation
  • Goal: evaluate the efficiency and effectiveness of our approach
  • Setup:
    – A database system extending the bibliography mediator BibFinder, projecting the relation Publications(Author, Title, Conference, Journal, Year)
    – Query log consisting of 10K precise queries
  • User study: 3 graduate students; 90 test queries – 30 chosen by each student
  • Platform: Java 2 on a Linux server – Intel Celeron 2.2 GHz, 512 MB

  20. Answering an Imprecise Query
  • Estimating query similarity (offline):
    – For each q ∈ Qlog, compute Sim(q, q′) for all q′ ∈ Qlog
    – Simdoc-doc(q, q′) = SimJ(STq, STq′)
    – Simwatr(q, q′) = Σi wi × SimJ(STq(Ai), STq′(Ai))
  • Extracting similar answers (online): given a query Q
    – Map Q to a query q ∈ Qlog
    – Identify the ‘k’ queries most similar to q
    – Execute the ‘k’ new queries
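
A sketch of how that online phase might be wired together, with each query represented as a frozenset of (attribute, value) pairs; the query log, the precomputed pairwise-similarity table, and the execute() callback are hypothetical stand-ins rather than actual IQE interfaces.

```python
# Sketch: answer an imprecise query by mapping it to the closest logged
# precise query, then executing the k most similar logged queries and
# ranking their tuples by the similarity of the query that produced them.
def answer_imprecise(query, query_log, pairwise_sim, execute, k=3):
    # 1. Map the imprecise query to the logged query sharing most constraints.
    base = max(query_log, key=lambda q: len(query & q))

    # 2. Pick the k logged queries most similar to the base query, using
    #    answerset similarities (Sim_doc-doc or Sim_watr) estimated offline.
    neighbors = sorted((q for q in query_log if q != base),
                       key=lambda q: pairwise_sim.get((base, q), 0.0),
                       reverse=True)[:k]

    # 3. Execute the base query and its neighbors; tag each answer tuple with
    #    the similarity of the query that produced it, and rank by that.
    results = [(1.0, t) for t in execute(base)]
    for q in neighbors:
        s = pairwise_sim.get((base, q), 0.0)
        results += [(s, t) for t in execute(q)]
    return sorted(results, key=lambda r: r[0], reverse=True)

# Hypothetical usage over a tiny bibliography source.
qlog = [frozenset({("Author", "Widom")}), frozenset({("Author", "Ullman")}),
        frozenset({("Author", "Garcia-Molina")})]
sims = {(qlog[0], qlog[1]): 0.4, (qlog[0], qlog[2]): 0.7}
fake_db = {qlog[0]: ["paper-w1"], qlog[1]: ["paper-u1"], qlog[2]: ["paper-g1"]}
print(answer_imprecise(frozenset({("Author", "Widom"), ("Year", "2000")}),
                       qlog, sims, lambda q: fake_db.get(q, []), k=2))
```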

  21. Some Results

  22. Relevance of Suggested Answers • Are the results precise? • Average error in relevance estimation is around 25%

  23. User Study – Summary
  • Precision for the top-10 related queries is above 75%
  • The Doc-Doc similarity measure dominates Weighted-Attribute similarity
  • Lessons:
    – Queries with popular keywords are difficult
    – Efficiently and effectively capturing user interest is difficult
    – A solution requiring less user input is more acceptable

  24. What’s Next?
  • Open issues:
    – The most similar query may not be present in the workload
    – Answers to a similar query will have varying similarity depending on the affected attributes
  • Solution:
    – Given an imprecise query, generate the most similar query
    – Use attribute importance and value-value similarity to order tuples
  • Challenges:
    – Estimating attribute importance
    – Estimating value-value similarity

  25. Learning the Semantics of the Data
  • Estimate value-value similarity
    – Similarity between values of a categorical attribute A:
      Sim(v11, v12) = Σi wi × Sim(Co-related_value(Ai, v11), Co-related_value(Ai, v12)), where Ai ∈ Attributes(R), Ai ≠ A
    – Euclidean distance for numerical attributes
  • Use the model of the database (AFDs, keys, value correlations) to:
    – Identify an implicit structure for the tuple
    – Show other tuples that least break that structure
  • Example: CarDb(Make, Model, Year, Price, Mileage, Location, Color)
    – Approximate keys:
      Model, Mileage, Location – uniquely decides 90% of the cars in the DB
      Model, Mileage, Color – uniquely decides 84% of the cars in the DB
    – Approximate Functional Dependencies (AFDs):
      Model → Make; Year → Price; Mileage → Year
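
The approximate keys and AFDs named on the slide can be mined by simple counting; below is a sketch over a hypothetical CarDb sample, where a key's “quality” is the fraction of tuples the attribute combination identifies uniquely, and an AFD's confidence is the fraction of tuples that agree with the majority mapping of its left-hand side.

```python
# Sketch: mine approximate keys and approximate functional dependencies
# (AFDs) from a table by counting value combinations. Data is hypothetical.
from collections import Counter, defaultdict

def approx_key_quality(tuples, attrs):
    """Fraction of tuples uniquely identified by this attribute combination."""
    combos = Counter(tuple(t[a] for a in attrs) for t in tuples)
    return sum(c for c in combos.values() if c == 1) / len(tuples)

def afd_confidence(tuples, lhs, rhs):
    """Confidence of lhs -> rhs: for each lhs value keep the majority rhs
    value, and return the fraction of tuples agreeing with that mapping."""
    groups = defaultdict(Counter)
    for t in tuples:
        groups[tuple(t[a] for a in lhs)][t[rhs]] += 1
    return sum(max(c.values()) for c in groups.values()) / len(tuples)

cars = [
    {"Make": "Honda", "Model": "Civic", "Year": 1999, "Mileage": 80000, "Location": "Tempe"},
    {"Make": "Honda", "Model": "Civic", "Year": 2001, "Mileage": 50000, "Location": "Mesa"},
    {"Make": "Honda", "Model": "Accord", "Year": 2001, "Mileage": 50000, "Location": "Mesa"},
    {"Make": "Chevrolet", "Model": "Malibu", "Year": 1999, "Mileage": 80000, "Location": "Tempe"},
]
print(approx_key_quality(cars, ["Model", "Mileage", "Location"]))  # 1.0 on this sample
print(afd_confidence(cars, ["Model"], "Make"))     # 1.0: Model -> Make holds here
print(afd_confidence(cars, ["Mileage"], "Year"))   # 1.0: Mileage -> Year holds here
```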

  26. Query relaxation

  27. Finding similarities between attribute values
