This paper discusses the problem of keyword proximity search in complex data graphs and proposes a system and algorithm for extracting meaningful parts of data without knowing the schema. The paper also addresses the challenges of redundancy and repeated information in highly cyclic data graphs.
Keyword Proximity Search in Complex Data Graphs • Konstantin Golenberg • Benny Kimelfeld • Yehoshua Sagiv • The Selim and Rachel Benin School of Engineering and Computer Science
Schema-Free Extraction of Data Nowadays… • Exposure to many databases • Different types (relational, XML, RDF…) • Different schemas • Traditional query paradigms (e.g., SQL, XQuery, SPARQL) are not easy to use and, moreover, require a thorough understanding of the schema • Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema • The natural (and popular) option: Keyword Search • Problem: Inherently different from standard IR
Keyword Proximity Search (KPS) The Goal: Extract meaningful parts of data w.r.t. the keywords • Data have varying degrees of structure • Relational (w/ foreign keys), XML (w/ id-references) • Natural representation by a graph • Usually, data-centric rather than document-centric • A query is a set of keywords • No structural constraints • Agrawal et al., ICDE’02 • Hristidis et al., VLDB’02,03, ICDE’03 • Bhalotia et al., VLDB’05 • Kacholia et al., VLDB’06 • Ding et al., ICDE’07 • Liu et al., SIGMOD’06 • Wang et al., VLDB’06 • Luo et al., SIGMOD’07 …
Example: Search in RDB • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships
• One answer: Brussels is the capital city of Belgium
• Another answer: Brussels hosts the EU, and Belgium is a member
Example: Search in XML • Query: Yannakakis, Approximation
• One answer: Yannakakis wrote a paper about Approximation
• Another answer: Yannakakis is cited by a paper about Approximation
Data Graphs • Structural and keyword nodes • Edges and nodes may have weights – Weak relationships are penalized by large weights • Each keyword has one occurrence in the data graph (a technical assumption)
Queries • Queries are sets of keywords from the data graph • Example: Q = {Summers, Cohen, coffee}
An Answer is a Reduced Subtree • In this paper, an answer is a subtree of the data graph that • Contains all keywords of the query • Has no redundant edges (and nodes) • 3 variants: directed, undirected, strong (undirected, keywords are leaves)
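The reducedness condition for the undirected variant can be illustrated with a minimal sketch. It assumes (this is not stated on the slide) that the input edge list already forms a tree with at least one edge, and it relies on the observation that such a tree has no redundant part exactly when every leaf is a query keyword. The function name `is_reduced_undirected` is hypothetical.

```python
from collections import defaultdict

def is_reduced_undirected(tree_edges, query_keywords):
    """Check the undirected-variant answer condition on a tree.

    Assumption: tree_edges is a list of (u, v) pairs that already forms
    a tree (connected, acyclic, at least one edge); this is not verified.
    """
    keywords = set(query_keywords)
    degree = defaultdict(int)
    for u, v in tree_edges:
        degree[u] += 1
        degree[v] += 1
    # The tree must contain every keyword of the query.
    if not keywords <= set(degree):
        return False
    # A non-keyword leaf could be removed without losing any keyword,
    # so its edge would be redundant.
    leaves = {node for node, d in degree.items() if d == 1}
    return leaves <= keywords
```

For the query Q = {Summers, Cohen, coffee}, a star whose center is a structural node and whose three leaves are the keywords passes the check; attaching one extra non-keyword leaf makes it fail.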
Previous Solutions • Lack of guarantees • Highly relevant answers might be missed, and / or • Inefficient algorithms • Rather simple data sets – a (very) small number of relevant answers • They considered data that are essentially collections of entities, namely, DBLP, IMDB, Lyrics, etc. • An answer is usually within the scope of an entity → e.g., the keywords appear in a single movie • Crucial problems ignored • In particular, the “repeated information” problem • Especially pervasive in complex data graphs
Contributions • A system for keyword proximity search • An algorithm for generating answers with guarantees • Does not miss (valuable) answers • Efficient (polynomial delay) • Answers generated in a 2-approximate order by height • A ranking technique that is aware of the repeated-information problem • Gives preference to answers with low similarity to earlier ones • Experimentation over a highly cyclic data graph • The Mondial database • Many “meaningful” connections among keywords
The MONDIAL Database • Institute for Informatics • Georg-August-Universität Göttingen http://www.dbis.informatik.uni-goettingen.de/Mondial/
Challenges • Huge number of answers; not instantiated! • Not simple to generate all relevant answers, even if ranking is ignored • For practical ranking functions, enumerating the answers in ranked order is probably impossible to do efficiently • For example, finding the smallest answer is the intractable Steiner-tree problem • Redundancy / repeated information • Many answers are very similar (altogether, they provide little information) • Crucial in complex (highly cyclic) data graphs • We employ a two-phase architecture
Architecture: Generator + Ranker • Simplified ranking at first [Bhalotia et al., ICDE’02, VLDB’05] • Answer Generator: generates the next M·k answers (simplified ranking function) • Ranker: ranks all answers generated up to now (minus the printed ones) and outputs the top-k answers (relative to those that have already been printed) • User requests: search(keywords), next k answers
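The two-phase loop can be sketched as follows. This is only an illustration under stated assumptions: the names `search`, `rank_key`, `k`, and `m` are hypothetical, the generator is any iterator that yields answers in its simplified order, and the ranker is reduced to a static sort key (lower is better), whereas the ranker described in the talk also takes similarity to earlier answers into account.

```python
from itertools import islice

def search(generator, rank_key, k, m):
    """Yield successive batches of up to k answers.

    generator: iterator over answers (simplified ranking order).
    rank_key:  hypothetical final ranking function, lower is better.
    k, m:      batch size and the multiplier M from the slide.
    """
    pool = []               # generated answers that were not printed yet
    source = iter(generator)
    while True:
        # Phase 1: ask the generator for the next M*k answers.
        pool.extend(islice(source, m * k))
        if not pool:
            return
        # Phase 2: rank everything generated so far, minus printed answers.
        pool.sort(key=rank_key)
        batch, pool = pool[:k], pool[k:]
        yield batch
```

Each iteration corresponds to one "next k answers" request; answers that lose in one round stay in the pool and can still be printed later.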
Generating the Top Answers: Not Trivial! To demonstrate the difficulty of generating the “good” (top) answers, let’s see how existing approaches operate on a simple example:
The BANKS Approach • Answers are directed subtrees [Bhalotia et al., ICDE’02, VLDB’05] • ∀ nodes v (in a “good” order) and keyword occurrences: generate the min-height subtree emanating from v • What about this answer? Never generated!
The NUITS Approach • Answers are undirected subtrees [Ding et al., ICDE’07] • ∀ nodes v (in a “good” order): generate the min-weight subtree that includes v • This node is redundant: the result is actually the previous answer! • Again, the previous answer! • Severe limit on # of generated answers (≤ one per node) • What about this answer? Never generated!
The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al., ICDE’02] • ∀ possible queries Q (from the schema) in increasing size: evaluate Q over the database • Easy to implement! DBMS queries, no in-memory graph algorithms • All answers are generated in ranked order! • But inefficient! Worst case: exponential in the data, and many queries do not generate any answer at all • Limited ranking: by the weight of the query (rather than of the answer)
We Need Generators w/ Guarantees! • All answers are generated • In particular, each of the “relevant” answers is produced at some point (100% recall is achievable) • Controlled order of answers • For instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order • Efficiency • The top-k answers should be generated efficiently • Bound on time between successive answers
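Order by Increasing Weight / Height • If answer $a$ is generated before answer $a'$, then $\mathrm{weight}(a) \le \mathrm{weight}(a')$ (similarly for height) • The first k answers generated are the top-k answers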
Approximate and Heuristic Orders • Heuristic order: intuitively, expected to be close to the optimal order, but there is no guarantee • Approximate order: there is a provable bound on the extent to which the actual order can deviate from the optimal one
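C-Approximate Order (inc. Weight / Height) • If answer $a$ is generated before answer $a'$, then $\mathrm{weight}(a) \le C \cdot \mathrm{weight}(a')$ (similarly for height) • The first k answers generated are a C-approximation of the top-k answers [Fagin et al., PODS’01]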
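One common reading of a C-approximate order is that whenever an answer is printed before another one, its weight (or height) is at most C times that of the later answer. The sketch below checks this property for a sequence of scores; the function name `is_c_approximate_order` is hypothetical, and the pairwise condition is an assumption about the intended definition rather than a quote from the slides.

```python
def is_c_approximate_order(scores, c):
    """Return True if scores[i] <= c * scores[j] for every i < j.

    scores: weights (or heights) of the answers in the order printed.
    c:      the approximation ratio, c >= 1.
    """
    max_so_far = float("-inf")
    for score in scores:
        # It suffices to compare against the largest earlier score.
        if max_so_far > c * score:
            return False
        max_so_far = max(max_so_far, score)
    return True
```

With c = 1 the check degenerates to testing for an exact nondecreasing order, which matches the "order by increasing weight / height" case.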
Our Approach • PODS’06: Enum. by (exact / approx) inc. weight • Problem: Repeated application of Steiner-tree alg’s • “Heavy” – hard to implement efficiently • Here: Follow the basic approach of PODS’06 • But, we adopt the BANKS idea of using height (≠ weight) for the enumeration order • Recall: BANKS might miss highly relevant answers • Thus, we bypass Steiner trees and obtain a much faster algorithm • Our alg. has all 3 guarantees: answers are not missed, approximate order, poly. delay
An Overview of the Algorithm • Task: Enumerate by (2-approx.) increasing height → Lawler / Yen method • Types of constraints: – Inclusion: “include edge e” – Exclusion: “exclude edge e” • Task: Find (a 2-approx. of) the shortest answer under constraints → the intricate part … • Task: Find the shortest answer (w/o constraints) → backward-search (Dijkstra) iterators (~ BANKS)
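The way a Lawler-style enumeration turns one answer into new constrained subproblems can be sketched as below. This shows only the standard partitioning step, under the assumption that an answer is represented by an ordered list of its edges; the function name `lawler_partition` and the data representation are hypothetical, and the priority queue that would pick the best subproblem next is omitted.

```python
def lawler_partition(answer_edges, included, excluded):
    """Split a constrained search space after its best answer was printed.

    answer_edges: edges of the printed answer, in a fixed order.
    included:     edges every answer of the current space must contain.
    excluded:     edges no answer of the current space may contain.
    Returns a list of (included, excluded) pairs describing disjoint
    subspaces, none of which contains the printed answer.
    """
    free = [e for e in answer_edges if e not in included]
    subspaces = []
    inc = set(included)
    for edge in free:
        # Keep the earlier free edges, forbid this one.
        subspaces.append((frozenset(inc), frozenset(excluded) | {edge}))
        inc.add(edge)
    return subspaces
```

Every subspace forbids at least one edge of the printed answer, so that answer cannot reappear, while any other answer of the original space falls into exactly one subspace (the one for the first free edge it lacks).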
Finding an Answer under Constraints • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Handling exclusion constraints is easy: simply remove the excluded edges from the graph
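A rough sketch of the easy case: drop the excluded edges, run a backward Dijkstra search from each keyword, and pick the root whose farthest keyword is closest. It assumes a directed graph given as hypothetical `(source, target, weight)` triples with nonnegative weights and a nonempty keyword list, and it returns only the root and the height, not the subtree; it is not the authors' iterator-based implementation.

```python
import heapq

def backward_distances(edges, target):
    """Dijkstra over reversed edges: distance from each node to target."""
    reverse = {}
    for u, v, w in edges:
        reverse.setdefault(v, []).append((u, w))
    dist = {target: 0.0}
    heap = [(0.0, target)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for prev, w in reverse.get(node, []):
            nd = d + w
            if nd < dist.get(prev, float("inf")):
                dist[prev] = nd
                heapq.heappush(heap, (nd, prev))
    return dist

def best_root(edges, keywords, excluded=frozenset()):
    """Return (height, root) of a min-height candidate, or None."""
    # Exclusion constraints: simply remove the excluded edges.
    kept = [(u, v, w) for u, v, w in edges if (u, v) not in excluded]
    per_keyword = [backward_distances(kept, k) for k in keywords]
    # A root must be able to reach every keyword.
    candidates = set(per_keyword[0])
    for dist in per_keyword[1:]:
        candidates &= set(dist)
    if not candidates:
        return None
    return min((max(d[r] for d in per_keyword), r) for r in candidates)
```

Inclusion constraints are not handled here; as the following slides explain, they are the intricate part.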
Inclusion Constraints are the Problem • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Naive attempt: take the shortest subtree that contains the keywords and satisfies the constraints • But it is not an answer! – Not reduced (has redundancy, e.g., a redundant edge) – Moreover, includes a previously printed answer • Sometimes, no answer at all!
The Correct Answer • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Technique: • 1. Generate a min-height subtree (as in the wrong solution) • 2. Not an answer? → modify • Intricate to guarantee 2-approx. • Details in the proceedings
Running Times Each entry is an avg. of 4 queries
Alg. Order vs. Weight Order How many answers are generated in order to obtain the top-k (among 1000) according to weight? Each entry is an avg. of 4 queries
Effective Approx. Ratio: Height • [Plots: effective approx. ratio, worst / best (among first k), in %, against k (answers); panels for 2, 3, 4, and 5 keywords]
Effective Approx. Ratio: Weight • [Plots: effective approx. ratio, worst / best (among first k), in %, against k (answers); panels for 2, 3, 4, and 5 keywords]
The Basic Ranking Function • $\mathrm{weight}(a) = \sum_{\text{node} \in a} \mathrm{weight}(\text{node}) + \sum_{\text{edge} \in a} \mathrm{weight}(\text{edge})$
Determining the Weight of an Edge • An organization enters many countries → weak connection (large weight) • Many organizations enter a country → weak connection (large weight) • Strong connection (small weight) • Strongest!
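The Basic Ranking Function (cont’d) • $\mathrm{weight}(a) = \sum_{\text{node} \in a} \mathrm{weight}(\text{node}) + \sum_{\text{edge} \in a} \mathrm{weight}(\text{edge})$ • $\mathrm{weight}(\text{node})$ = fixed (1) • $\mathrm{weight}(\text{edge}) = \log\bigl(1 + \alpha \cdot \mathrm{out}(v_1 \to t_2) + (1 - \alpha) \cdot \mathrm{in}(t_1 \to v_2)\bigr)$, where $\text{edge} = (v_1, v_2)$ and $\mathrm{tag}(v_i) = t_i$ • $\mathrm{out}(v_1 \to t_2)$ = # $t_2$ nodes with edges from $v_1$ • $\mathrm{in}(t_1 \to v_2)$ = # $t_1$ nodes with edges to $v_2$ • Relevant answers but …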
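The edge-weight formula can be tried out with a small sketch. Several details are assumptions rather than facts from the slides: the natural logarithm (the base is not given), the default α = 0.5, the representation of the graph as a list of directed `(source, target)` pairs plus a `tag` dictionary, and the function names.

```python
import math

def edge_weight(edges, tag, v1, v2, alpha=0.5):
    """weight(edge) = log(1 + alpha*out(v1->t2) + (1-alpha)*in(t1->v2)).

    edges: list of directed (source, target) pairs.
    tag:   dict mapping each node to its tag (e.g. element or table name).
    alpha: mixing parameter; 0.5 is an arbitrary illustrative default.
    """
    t1, t2 = tag[v1], tag[v2]
    # Number of distinct t2-tagged nodes with an edge from v1.
    out_count = len({v for u, v in edges if u == v1 and tag[v] == t2})
    # Number of distinct t1-tagged nodes with an edge to v2.
    in_count = len({u for u, v in edges if v == v2 and tag[u] == t1})
    return math.log(1 + alpha * out_count + (1 - alpha) * in_count)

def answer_weight(nodes, edge_weights):
    """Basic ranking function: every node weighs 1, plus the edge weights."""
    return len(nodes) + sum(edge_weights)
```

An organization connected to many countries, or a country connected to many organizations, yields larger counts and hence a larger (weaker) edge weight, which is the behavior described on the preceding slide.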