Keyword Proximity Search in Complex Data Graphs: Schema-Free Data Extraction

Keyword Proximity Search in Complex Data Graphs • Konstantin Golenberg • Benny Kimelfeld • Yehoshua Sagiv • The Selim and Rachel Benin School of Engineering and Computer Science

Schema-Free Extraction of Data • Nowadays… • Exposure to many databases • Different types (relational, XML, RDF…) • Different schemas • Not easy to use traditional paradigms of querying (e.g., SQL, XQuery, SPARQL); moreover, they require a thorough understanding of the schema • Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema • The natural (and popular) option: Keyword Search • Problem: Inherently different from standard IR

Keyword Proximity Search (KPS) • The Goal: Extract meaningful parts of data w.r.t. the keywords • Data have varying degrees of structure • Relational (w/ foreign keys), XML (w/ id-references) • Natural representation by a graph • Usually, data-centric rather than document-centric • A query is a set of keywords • No structural constraints • Agrawal et al., ICDE’02 • Hristidis et al., VLDB’02,03, ICDE’03 • Bhalotia et al., VLDB’05 • Kacholia et al., VLDB’06 • Ding et al., ICDE’07 • Liu et al., SIGMOD’06 • Wang et al., VLDB’06 • Luo et al., SIGMOD’07 …

Example: Search in RDB • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships

Answer: Brussels is the capital city of Belgium • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships

Answer: Brussels hosts EU and Belgium is a member • Query: Belgium, Brussels • Tables: Cities, Organizations, Countries, Memberships

Example: Search in XML • Query: Yannakakis, Approximation

Answer: Yannakakis wrote a paper about Approximation • Query: Yannakakis, Approximation

Answer: Yannakakis is cited by a paper about Approximation • Query: Yannakakis, Approximation

Data Graphs • Structural and keyword nodes • Edges and nodes may have weights • Weak relationships are penalized by large weights • Each keyword has one occurrence in the data graph (technical)

Queries • Queries are sets of keywords from the data graph • Q = {Summers, Cohen, coffee}

An Answer is a Reduced Subtree • An answer is a subtree of the data graph • Contains all keywords of the query • Has no redundant edges (and nodes) • 3 variants: directed, undirected, strong (undirected, kw’s are leaves) • This paper considers all three variants
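The definition of an answer can be made concrete for the undirected variant. The following is a minimal sketch, assuming a candidate subtree is given as a node set plus an undirected edge list, and assuming that "no redundant edges" amounts to every leaf being a query keyword (otherwise the leaf and its edge could be dropped). The function name `is_undirected_answer` is hypothetical and does not come from the slides.

```python
from collections import defaultdict


def is_undirected_answer(nodes, edges, query):
    """Check whether (nodes, edges) is a reduced undirected subtree for `query`.

    Assumption: a tree is reduced exactly when each of its leaves is a
    keyword of the query.
    """
    nodes, query = set(nodes), set(query)
    if not nodes or not query <= nodes:
        return False  # must contain all keywords of the query
    if len(edges) != len(nodes) - 1:
        return False  # a tree on n nodes has exactly n - 1 edges

    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Connectivity check by depth-first search.
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    if seen != nodes:
        return False

    # Reduced: every leaf (degree <= 1) must be a query keyword.
    return all(n in query for n in nodes if len(adj[n]) <= 1)
```

For the query Q = {Summers, Cohen, coffee}, a star whose centre is a structural node and whose leaves are the three keywords passes the check; attaching one more structural leaf makes the tree non-reduced.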

Previous Solutions • Lack of guarantees • Highly relevant answers might be missed, and / or • Inefficient algorithms • Rather simple data sets – a (very) small number of relevant answers • They considered data that are essentially collections of entities, namely, DBLP, IMDB, Lyrics, etc. • An answer is usually within the scope of an entity → e.g., the keywords appear in a single movie • Crucial problems ignored • In particular, the “repeated information” problem • Especially pervasive in complex data graphs

Contributions • A system for keyword proximity search • An algorithm for generating answers with guarantees • Does not miss (valuable) answers • Efficient (polynomial delay) • Answers generated in a 2-approximate order by height • A ranking technique that is aware of the repeated-information problem • Gives preference to answers with low similarity to earlier ones • Experimentation over a highly-cyclic data graph • The Mondial database • Many “meaningful” connections among keywords

The MONDIAL Database • Institute for Informatics • Georg-August-Universität Göttingen http://www.dbis.informatik.uni-goettingen.de/Mondial/

Challenges • Huge no. of answers; not instantiated! • Not simple to generate all relevant answers, even if ranking is ignored • For practical ranking functions, enumerating the answers in ranked order is probably impossible • For example, finding the smallest answer is the Steiner-tree problem, which is intractable • Redundancy / repeated information • Many answers are very similar (altogether, they provide a low amount of information) • Crucial in complex (highly cyclic) data graphs • We employ a two-phase architecture

Architecture: Generator + Ranker • User requests: search(keywords), next k answers • Answer Generator: generates the next M·k answers (simplified ranking function) • Simplified ranking at first [Bhalotia et al., ICDE’02, VLDB’05] • Ranker: ranks all answers generated up to now (minus the printed ones) • Output: top-k answers (relative to those that have already been printed)
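The interplay of the two phases can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the system's implementation: `answer_stream` stands in for the answer generator, `score(answer, printed)` is a hypothetical ranking function (lower is better) that may look at the answers already printed, and `k` and `m` play the roles of k and M above.

```python
from itertools import islice


def paged_top_k(answer_stream, score, k, m):
    """Yield pages of up to k answers.

    For each page, pull the next m*k answers from the generator into a pool,
    then greedily pick the k best pooled answers, scoring each candidate
    relative to what has already been printed.
    """
    stream = iter(answer_stream)
    pool, printed = [], []
    while True:
        pool.extend(islice(stream, m * k))  # generator phase
        if not pool:
            return
        page = []
        for _ in range(min(k, len(pool))):  # ranker phase
            best = min(pool, key=lambda a: score(a, printed))
            pool.remove(best)
            printed.append(best)
            page.append(best)
        yield page
```

Answers that are generated but not printed stay in the pool, so they can still surface on a later page.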

Generating the Top Answers: Not Trivial! To demonstrate the difficulty of generating the “good” (top) answers, let’s see how existing approaches operate on a simple example:

Find the Answers in this Example!

The BANKS Approach • Answers are directed subtrees [Bhalotia et al., ICDE’02, VLDB’05] • ∀ nodes v (in a “good” order) and keyword occurrences: • Generate the min-height subtree emanating from v

The BANKS Approach • Answers are directed subtrees [Bhalotia et al., ICDE’02, VLDB’05] • ∀ nodes v (in a “good” order) and keyword occurrences: • Generate the min-height subtree emanating from v • What about this answer? Never generated!
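The "min-height subtree emanating from v" step can be illustrated with backward Dijkstra searches, one per keyword. This is a rough sketch of the idea only, not the BANKS implementation: the graph is assumed to be a dict mapping each node to a dict of `{successor: edge_weight}`, the height of a root is taken to be its largest shortest-path distance to any keyword, and both function names are hypothetical.

```python
import heapq


def backward_distances(graph, target):
    """Dijkstra over reversed edges: shortest distance from each node to target."""
    rev = {}
    for u, nbrs in graph.items():
        for v, w in nbrs.items():
            rev.setdefault(v, {})[u] = w
    dist = {target: 0}
    heap = [(0, target)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue  # stale heap entry
        for y, w in rev.get(x, {}).items():
            nd = d + w
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return dist


def roots_by_height(graph, keywords):
    """List (root, height) pairs for roots that reach every keyword, by height."""
    dists = [backward_distances(graph, kw) for kw in keywords]
    common = set(dists[0]).intersection(*dists[1:])
    heights = {v: max(d[v] for d in dists) for v in common}
    return sorted(heights.items(), key=lambda item: (item[1], item[0]))
```

Because only one (minimum-height) tree is built per root, a second relevant answer with the same root is never produced, which is the limitation pointed out above.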

The NUITS Approach • Answers are undirected subtrees [Ding et al., ICDE’07] • ∀ nodes v (in a “good” order): • Generate the min-weight subtree that includes v

The NUITS Approach • Answers are undirected subtrees [Ding et al., ICDE’07] • ∀ nodes v (in a “good” order): • Generate the min-weight subtree that includes v • This node is redundant: it is actually the previous answer!

The NUITS Approach • Answers are undirected subtrees [Ding et al., ICDE’07] • ∀ nodes v (in a “good” order): • Generate the min-weight subtree that includes v • This node is redundant: again, the previous answer!

The NUITS Approach • Answers are undirected subtrees [Ding et al., ICDE’07] • ∀ nodes v (in a “good” order): • Generate the min-weight subtree that includes v • What about this answer? Never generated! • Severe limit on # of generated answers! (≤ one per node)

The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al., ICDE’02] • ∀ possible queries Q (from the schema) in inc. size: • Evaluate Q over the database • Easy to implement! DBMS queries; no in-memory graph algorithms • All answers are generated in ranked order!

The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al., ICDE’02] • ∀ possible queries Q (from the schema) in inc. size: • Evaluate Q over the database • Inefficient! Worst case: exponential in the data • But many queries do not generate any answer at all! • Limited ranking! By the query (rather than the answer) weight

We Need Generators w/ Guarantees! • All answers are generated • In particular, each of the “relevant” answers is produced at some point (100% recall is achievable) • Controlled order of answers • For instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order • Efficiency • The top-k answers should be generated efficiently • Bound on time between successive answers

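Order by Increasing Weight / Height • If answer a is printed before answer b, then score(a) ≤ score(b), where score is the weight or the height • Hence, the first k printed answers are the Top-k Answers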

Approximate and Heuristic Orders • Heuristic order: Intuitively, expected to be close to the optimal order, but there is no guarantee • Approximate order: There is a provable bound on the extent to which the actual order can deviate from the optimal one

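C-Approximate Order (inc. Weight / Height) • If answer a is printed before answer b, then score(a) ≤ C · score(b), where score is the weight or the height • Hence, the first k printed answers are a C-Approximation of the Top-k Answers [Fagin et al., PODS’01]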
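Whether a printed sequence respects a C-approximate order can be checked mechanically. The sketch below assumes the usual reading of the definition (whenever a is printed before b, score(a) ≤ C · score(b)) and takes the scores, i.e., weights or heights, in output order. The function name is hypothetical.

```python
def is_c_approximate_order(scores, c):
    """Return True if every earlier score is at most c times every later score.

    It suffices to compare each score against the largest score seen so far.
    """
    largest_so_far = None
    for s in scores:
        if largest_so_far is not None and largest_so_far > c * s:
            return False
        if largest_so_far is None or s > largest_so_far:
            largest_so_far = s
    return True
```

With C = 1 the check reduces to the exact (non-decreasing) order; a sequence of heights such as 2, 4, 3, 6 is not sorted, yet it is a valid 2-approximate order.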

Our Approach • PODS’06: Enum. by (exact / approx) inc. weight • Problem: Repeated application of Steiner-tree alg’s • “Heavy” – hard to implement efficiently • Here: Follow the basic approach of PODS’06 • But, we adopt the BANKS idea of using height (≠ weight) for the enumeration order • Recall: BANKS might miss highly relevant answers • Thus, we bypass Steiner trees and obtain a much faster algorithm • Our alg. has all 3 guarantees: answers are not missed, approximate order, poly. delay

An Overview of the Algorithm • Task: Enum. by (2-approx.) increasing height • Lawler / Yen method • Types of Constraints: • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Task: Find (a 2-approx. of) the shortest answer under constraints • The intricate part … • Task: Find the shortest answer (w/o constraints) • Backward-search (Dijkstra) iterators (~ BANKS)
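The Lawler / Yen step can be illustrated by the textbook partition of a search space after its best answer has been printed. This is a sketch of the general method, assuming a search space is described by a pair (inclusion constraints, exclusion constraints) over edges. The slides do not spell out the exact partition used, and the function name is hypothetical.

```python
def lawler_partition(include, exclude, answer_edges):
    """Split a constrained search space around its printed answer.

    For the answer's edges e1..en that are not already forced, subspace i
    forces e1..e(i-1) to be included and ei to be excluded. The subspaces are
    pairwise disjoint and none of them contains the printed answer itself.
    """
    include, exclude = frozenset(include), frozenset(exclude)
    free_edges = [e for e in answer_edges if e not in include]
    subspaces = []
    forced = set(include)
    for e in free_edges:
        subspaces.append((frozenset(forced), exclude | {e}))
        forced.add(e)
    return subspaces
```

Each subspace is then solved by the "shortest answer under constraints" task, and the best of the resulting answers is printed next.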

Finding an Answer under Constraints • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Handling exclusion constraints is easy: simply remove the excluded edges from the graph
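Removing the excluded edges is a one-line graph transformation. A minimal sketch, assuming a directed graph stored as a dict of `{node: {successor: weight}}` and exclusion constraints given as a set of `(u, v)` pairs; the representation and the function name are assumptions made for illustration.

```python
def without_excluded_edges(graph, excluded):
    """Return a copy of the graph in which every excluded edge (u, v) is dropped."""
    return {
        u: {v: w for v, w in nbrs.items() if (u, v) not in excluded}
        for u, nbrs in graph.items()
    }
```

Any shortest-answer search run on the returned graph satisfies the exclusion constraints by construction, and the original graph is left unchanged.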

Inclusion Constraints are the Problem • Inclusion: “include edge e” • Exclusion: “exclude edge e” • The shortest subtree that contains the kw’s and satisfies the const’s has a redundant edge • But it is not an answer! • Not reduced (has redundancy) • Moreover, includes a previously printed answer • Sometimes, no answer at all!

The Correct Answer • Inclusion: “include edge e” • Exclusion: “exclude edge e” • Technique: • 1. Generate a min-height subtree (as in the wrong solution) • 2. Not an answer? → modify • Intricate to guarantee 2-approx. • Details in the proceedings

Running Times Each entry is an avg. of 4 queries

Alg. Order vs. Weight Order How many answers are generated in order to obtain the top-k (among 1000) according to weight? Each entry is an avg. of 4 queries

Effective Approx. Ratio: Height ↑ • [Plots: 2 keywords, 3 keywords; axes: % and k (answers); effective approx. ratio = worst / best (among first k)]

Effective Approx. Ratio: Height ↑ • [Plots: 4 keywords, 5 keywords; axes: % and k (answers); effective approx. ratio = worst / best (among first k)]

Effective Approx. Ratio: Weight ↑ • [Plots: 2 keywords, 3 keywords; axes: % and k (answers); effective approx. ratio = worst / best (among first k)]

Effective Approx. Ratio: Weight ↑ • [Plots: 4 keywords, 5 keywords; axes: % and k (answers); effective approx. ratio = worst / best (among first k)]

The Basic Ranking Function • weight(a) = Σ_{node ∊ a} weight(node) + Σ_{edge ∊ a} weight(edge)

Determining the Weight of an Edge • An org. enters many countries → weak connection (large weight) • Many org’s enter a country → weak connection (large weight) • Strong connection (small weight) • Strongest!

The Basic Ranking Function (cont’d) • weight(a) = Σ_{node ∊ a} weight(node) + Σ_{edge ∊ a} weight(edge) • weight(node) = fixed (1) • weight(edge) = log(1 + α·out(v1→t2) + (1 − α)·in(t1→v2)), where edge = (v1, v2) and tag(vi) = ti • out(v1→t2) = # t2 nodes with edges from v1 • in(t1→v2) = # t1 nodes with edges to v2 • Relevant answers but …
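The edge-weight formula can be evaluated directly from the edge set and the node tags. The sketch below makes several assumptions that the slides do not fix: the data graph is a set of directed `(u, v)` pairs, `tag` maps each node to its tag, α defaults to 0.5, and the logarithm is natural. The function names and the sample data are hypothetical.

```python
import math


def edge_weight(edges, tag, v1, v2, alpha=0.5):
    """weight(v1, v2) = log(1 + alpha*out(v1->t2) + (1 - alpha)*in(t1->v2))."""
    t1, t2 = tag[v1], tag[v2]
    # out(v1->t2): number of t2-tagged nodes with an edge from v1
    out_count = sum(1 for (u, v) in edges if u == v1 and tag[v] == t2)
    # in(t1->v2): number of t1-tagged nodes with an edge to v2
    in_count = sum(1 for (u, v) in edges if v == v2 and tag[u] == t1)
    return math.log(1 + alpha * out_count + (1 - alpha) * in_count)


def answer_weight(edges, tag, answer_nodes, answer_edges, alpha=0.5):
    """weight(a) = sum of node weights (fixed to 1) + sum of edge weights."""
    return len(answer_nodes) + sum(
        edge_weight(edges, tag, u, v, alpha) for (u, v) in answer_edges
    )
```

In a toy graph where an organization points to two countries and a country is pointed to by two organizations, the organization-country edge gets weight log 3, while a one-to-one country-city edge gets the smaller weight log 2, matching the intuition that fan-out and fan-in weaken a connection.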

Answers with High Similarity
