Integrating Database and Information Retrieval Techniques for Enhanced Knowledge Discovery

Databases & Information Retrieval
Maya Ramanath
Further Reading:
• Combining Database and Information-Retrieval Techniques for Knowledge Discovery. G. Weikum, G. Kasneci, M. Ramanath and F. M. Suchanek, CACM, April 2009
• DB & IR: Both Sides Now. G. Weikum, Keynote at SIGMOD 2007

DB and IR: Different Motivations • Both deal with large amounts of information, but…

Why Combine Now? • The applications drive the need • The need to manage both structured and unstructured data in an integrated manner • Healthcare example • Find young patients in central Europe who have been reported, in the last two weeks, to have symptoms of tropical virus diseases and an indication of anomalies. • Newspaper archives, product catalogues, etc.

Integrating DB & IR
[Figure: a grid with query type on one axis (structured queries / Boolean-match results, e.g. SQL, versus unstructured queries / ranked results, e.g. keywords and top-k) and data type on the other (structured, relational data versus unstructured text). Labels in the figure: DB Systems; IR Systems; top-k processing, keyword search on graphs; query processing for text search, effective query interfaces, ranking for structured data; extracting entities and relationships, ranking for entities.]

Modules • Top-k processing • Query Processing and Interfaces • Keyword Search on Graphs • Entity and Relationship Extraction • Ranking and Structured Data

1. Top-k Processing (1/2) • Structured data, with scores in multiple dimensions • Return the top-k “objects”
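
As an illustration of top-k processing over several score dimensions, here is a minimal sketch in the spirit of the threshold algorithm. The slide does not name a specific algorithm, so this choice is an assumption, as are the function name `threshold_topk`, the sum aggregate, the non-negative scores, and the toy house data.

```python
import heapq

def threshold_topk(lists, k):
    """Return the k objects with the highest summed score.

    lists: one dict per score dimension, mapping object id -> score.
    Assumes non-negative scores; a missing entry counts as 0.0.
    """
    # Sorted access: each dimension is read from its best score downwards.
    sorted_lists = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    seen = {}
    max_depth = max(len(s) for s in sorted_lists)
    for depth in range(max_depth):
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                obj, score = s[depth]
                threshold += score
                if obj not in seen:
                    # Random access: look up the object in every dimension.
                    seen[obj] = sum(d.get(obj, 0.0) for d in lists)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen object can score above the threshold, so stop early.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical scores for four houses in two dimensions.
price_score = {"h1": 0.9, "h2": 0.8, "h3": 0.3, "h4": 0.1}
location_score = {"h1": 0.2, "h2": 0.7, "h3": 0.9, "h4": 0.4}

result = threshold_topk([price_score, location_score], k=2)
print([obj for obj, _ in result])
```

The point of the early stop is that the full lists need not be read: once the k-th best seen object is at least as good as the best score any unseen object could still reach, the answer is final.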

1. Top-k Processing (2/2) • Top-k Joins • Example: Return the best house-school pair
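
The house-school example can be sketched as a naive baseline: join on a shared attribute, score each pair, and keep the best k. Everything here is assumed for illustration (the `area` join attribute, the sum of scores as the pair score, and the data); a real top-k join operator would try to avoid enumerating every pair.

```python
import heapq
from itertools import product

# Hypothetical (name, area, score) tuples.
houses = [("house_a", "north", 0.9), ("house_b", "south", 0.6), ("house_c", "north", 0.5)]
schools = [("school_x", "north", 0.7), ("school_y", "south", 0.95), ("school_z", "east", 0.99)]

def topk_join(houses, schools, k):
    """Return the k best (score, house, school) pairs located in the same area."""
    pairs = (
        (h_score + s_score, h_name, s_name)
        for (h_name, h_area, h_score), (s_name, s_area, s_score) in product(houses, schools)
        if h_area == s_area  # join condition
    )
    return heapq.nlargest(k, pairs)

best = topk_join(houses, schools, k=1)
print(best[0][1], best[0][2])
```

Note that the best school overall (`school_z`) does not appear in the result because no house joins with it, which is why a top-k join cannot simply combine the top entries of each input.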

2. Query Processing and Interfaces (1/3) • Given: Database of text documents and a text-centric task. • Extract information about disease outbreaks • Strategies • Scan all documents – very expensive • Filter promising documents – affects recall • Develop cost models and execution strategies appropriate for this setting
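
The trade-off between the two strategies can be made concrete with a small sketch. The documents, the `extract` stand-in for an expensive extractor, and the `"outbreak"` filter keyword are all hypothetical; the sketch only counts how many documents each strategy has to process and what it finds.

```python
# Hypothetical document collection.
docs = [
    "Cholera outbreak reported in the coastal region.",
    "Officials confirmed a surge of dengue cases this week.",
    "The football final drew a record crowd.",
    "New outbreak of measles among school children.",
]

def extract(doc):
    """Stand-in for an expensive extractor: returns a disease name or None."""
    for disease in ("cholera", "dengue", "measles"):
        if disease in doc.lower():
            return disease
    return None

def scan_all(docs):
    """Run the extractor on every document (full recall, highest cost)."""
    found = [extract(d) for d in docs]
    return [f for f in found if f], len(docs)

def filter_then_extract(docs, keyword="outbreak"):
    """Run the extractor only on documents that pass a cheap keyword filter."""
    promising = [d for d in docs if keyword in d.lower()]
    found = [extract(d) for d in promising]
    return [f for f in found if f], len(promising)

scan_facts, scan_cost = scan_all(docs)
filter_facts, filter_cost = filter_then_extract(docs)
recall = len(filter_facts) / len(scan_facts)
print(scan_cost, filter_cost, recall)
```

In this toy collection the filter halves the number of extractor calls but misses the dengue report, which never uses the word "outbreak". A cost model would weigh exactly this kind of saving against the loss in recall.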

2. Query Processing and Interfaces (2/3) Querying with “typed” keywords • Keyword querying: easy to use, e.g. “german has won nobel award” • Structured queries: precise, e.g. q(x) :- GERMAN(x), hasWonPrize(x,y), NOBEL_PRIZE(y) • Find the middle ground with typed keywords: “german, has won (nobel award)”

2. Query Processing and Interfaces (3/3) • Does the output have to be a boring list of ranked results? • No!

3. Keyword Search on Graphs (1/3) • Lots of graphs around • Relational DB (tuples+foreign keys) • XML data (elements/sub-elements/id/idrefs) • RDF (graph-structured knowledge-bases) • Easy to query with keywords, instead of SQL/XQuery/SPARQL • Results are the top-k interconnections between the keywords

3. Keyword Search on Graphs (2/3)

3. Keyword Search on Graphs (3/3) Query: “Einstein”, “Bohr”
[Figure: a knowledge graph with the nodes Tom Cruise, vegetarian, Einstein, 1962, Nobel Prize and Bohr, and edges labelled isa, bornIn, won and diedIn.]
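
To show what an "interconnection between the keywords" can look like, the sketch below finds one shortest connection between two keyword nodes with a breadth-first search. The edge list is a hypothetical toy graph that reuses labels from the slide; it is not a reconstruction of the original figure. A real system would match keywords against node text and return the top-k connections under some scoring function, whereas this sketch returns a single path.

```python
from collections import deque

# Hypothetical (subject, label, object) edges.
edges = [
    ("Einstein", "won", "Nobel Prize"),
    ("Bohr", "won", "Nobel Prize"),
    ("Bohr", "diedIn", "1962"),
    ("Tom Cruise", "bornIn", "1962"),
    ("Einstein", "isa", "vegetarian"),
]

def build_adjacency(edges):
    """Treat edges as undirected so connections can run in either direction."""
    adj = {}
    for s, label, t in edges:
        adj.setdefault(s, []).append((label, t))
        adj.setdefault(t, []).append((label, s))
    return adj

def shortest_connection(edges, source, target):
    """Return the node sequence of one shortest path, or None if unconnected."""
    adj = build_adjacency(edges)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for _label, nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

path = shortest_connection(edges, "Einstein", "Bohr")
print(path)
```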

4. Entity and Relationship Extraction (1/2) Information Extraction (or Knowledge Harvesting) • Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne. • Infosys was founded on 2 July 1981 by seven entrepreneurs: N. R. Narayana Murthy, Nandan Nilekani, … • Bill Gates was the founder of Microsoft and later its CEO.

4. Entity and Relationship Extraction (2/2) • How to build a knowledge-base of facts? • Add structure to Wikipedia • Construct rules for extraction • How do I acquire all the facts in the world? • Extract “everything” • Don’t stop extracting
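
A rule for extraction can be as simple as a text pattern. The sketch below applies two hand-written regular expressions to sentences like the examples in this section and emits (subject, relation, object) triples. The pattern names and the relation names `foundedOn` and `founderOf` are assumptions made for illustration, and the Infosys sentence is shortened.

```python
import re

# Rule 1: "<Org> was established|founded on <date>"
FOUNDED_ON = re.compile(
    r"(?P<org>[A-Z]\w+) was (?:established|founded) on (?P<date>\w+ \w+,? \d{4})"
)
# Rule 2: "<Person> was the founder of <Org>"
FOUNDER_OF = re.compile(
    r"(?P<person>[A-Z]\w+(?: [A-Z]\w+)*) was the founder of (?P<org>[A-Z]\w+)"
)

def extract_facts(text):
    """Return (subject, relation, object) triples matched by the two rules."""
    facts = []
    for m in FOUNDED_ON.finditer(text):
        facts.append((m.group("org"), "foundedOn", m.group("date")))
    for m in FOUNDER_OF.finditer(text):
        facts.append((m.group("person"), "founderOf", m.group("org")))
    return facts

text = (
    "Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, "
    "and Ronald Wayne. Infosys was founded on 2 July 1981 by seven entrepreneurs. "
    "Bill Gates was the founder of Microsoft and later its CEO."
)

facts = extract_facts(text)
for fact in facts:
    print(fact)
```

Neither rule captures that Steve Jobs founded Apple, even though the first sentence says so. Hand-written rules tend to be precise but cover only the phrasings their author anticipated.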

5. Ranking and Structured Data • Not the same as top-k processing • Given: Data with structure in it • Relational tables (flat) • XML (trees/graphs) • Text documents consisting of entities • Task: Rank the query results • SQL/XQuery/“typed” keywords

Questions?
