This paper explores the convergence of database (DB) and information retrieval (IR) methodologies, focusing on their application in handling both structured and unstructured data. Highlighting the significance of combining these disciplines, particularly in fields like healthcare, the work outlines strategies for effective query processing, top-k processing, and entity relationship extraction. It discusses practical examples, such as identifying young patients in central Europe with recent tropical virus symptoms, and emphasizes the necessity for integrated systems to improve information management and access efficiency.
Databases & Information Retrieval • Maya Ramanath • Further Reading: (1) Combining Database and Information-Retrieval Techniques for Knowledge Discovery. G. Weikum, G. Kasneci, M. Ramanath and F.M. Suchanek, CACM, April 2009. (2) DB & IR: Both Sides Now. G. Weikum, Keynote at SIGMOD 2007.
DB and IR: Different Motivations • Both deal with large amounts of information, but…
Why Combine Now? • The applications drive the need • The need to manage both structured and unstructured data in an integrated manner • Healthcare example • Find young patients in central Europe who have been reported, in the last two weeks, to have symptoms of tropical virus diseases and an indication of anomalies. • Newspaper archives, product catalogues, etc.
Integrating DB & IR • Data axis: Structured data (relational) vs. Unstructured data (text) • Query axis: Structured queries / boolean match results (SQL) vs. Unstructured queries / ranked results (keywords/top-k) • Established corners: DB Systems, IR Systems • Integration topics: top-k processing, keyword search on graphs; query processing for text search, effective query interfaces, ranking for structured data; extracting entities and relationships, ranking for entities
Modules • Top-k processing • Query Processing and Interfaces • Keyword Search on Graphs • Entity and Relationship Extraction • Ranking and Structured Data
1. Top-k Processing (1/2) • Structured data, with scores in multiple dimensions • Return the top-k “objects”
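The slide does not name an algorithm. As a minimal sketch, the following follows the general idea of Fagin's threshold algorithm: read each score list in descending order, look up the full score of every newly seen object, and stop once the k-th best score reaches the threshold. The function name `threshold_topk`, the sum aggregation, the treatment of a missing score as 0, and the sample data are all assumptions made for illustration.

```python
import heapq

def threshold_topk(lists, k):
    """Return the k objects with the highest summed score.

    lists: one dict per dimension, mapping object -> non-negative score.
    Assumption: an object missing from a dimension scores 0 there.
    """
    # Sorted access: each dimension ordered by descending score.
    sorted_lists = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    seen = {}  # object -> aggregated score
    max_depth = max(len(s) for s in sorted_lists)

    for depth in range(max_depth):
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                obj, score = s[depth]
                threshold += score
                if obj not in seen:
                    # Random access: fetch the object's score in every dimension.
                    seen[obj] = sum(d.get(obj, 0.0) for d in lists)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen object can score above the threshold, so we may stop early.
        if len(top) >= k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical scores in two dimensions.
price_score = {"a": 0.9, "b": 0.8, "c": 0.1}
quality_score = {"a": 0.2, "b": 0.7, "c": 0.9}
result = threshold_topk([price_score, quality_score], k=2)
top_objects = [obj for obj, _ in result]
```

On data this small the early stop saves nothing; the point of the threshold test is that, on long lists, the scan can end well before every object has been read.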
1. Top-k Processing (2/2) • Top-k Joins • Example: Return the best house-school pair
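To make the house-school example concrete, here is a naive sketch that joins the two inputs and keeps the best-scoring pairs. The join condition (same area), the additive scoring, and all field names and data are hypothetical; the slide does not specify them. The sketch enumerates every joinable pair, which is exactly the cost a dedicated top-k join method would try to avoid.

```python
import heapq
from itertools import product

def topk_join(houses, schools, k):
    """Return the k best (score, house, school) pairs.

    Assumptions: a pair is valid when both lie in the same area,
    and the pair score is the sum of the two individual scores.
    """
    pairs = (
        (h["score"] + s["score"], h["name"], s["name"])
        for h, s in product(houses, schools)
        if h["area"] == s["area"]
    )
    return heapq.nlargest(k, pairs)

# Hypothetical input data.
houses = [
    {"name": "H1", "area": "north", "score": 5},
    {"name": "H2", "area": "south", "score": 9},
]
schools = [
    {"name": "S1", "area": "north", "score": 8},
    {"name": "S2", "area": "south", "score": 2},
]
best_pair = topk_join(houses, schools, k=1)
```

Note that the best house on its own (H2) is not part of the best pair, which is why top-k joins cannot simply pick the top object from each input.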
2. Query Processing and Interfaces (1/3) • Given: Database of text documents and a text-centric task. • Extract information about disease outbreaks • Strategies • Scan all documents – very expensive • Filter promising documents – affects recall • Develop cost models and execution strategies appropriate for this setting
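The trade-off between the two strategies can be sketched on a toy collection. The extraction rule (a regular expression for "outbreak of X"), the filter keyword, and the documents are all made up for illustration; a real extractor would be far more expensive per document, which is what makes filtering attractive.

```python
import re

# Hypothetical extraction rule: "outbreak of <disease>".
OUTBREAK = re.compile(r"outbreak of (\w+)", re.IGNORECASE)

def extract(doc):
    """Stand-in for an expensive information-extraction step."""
    return {name.lower() for name in OUTBREAK.findall(doc)}

def scan_all(docs):
    """Strategy 1: run the extractor on every document."""
    facts = set()
    for doc in docs:
        facts |= extract(doc)
    return facts, len(docs)

def filter_then_extract(docs, keyword):
    """Strategy 2: run the extractor only on documents that pass a cheap filter."""
    promising = [d for d in docs if keyword in d.lower()]
    facts = set()
    for doc in promising:
        facts |= extract(doc)
    return facts, len(promising)

DOCS = [
    "An outbreak of cholera was reported in the port city.",
    "Health officials confirmed an outbreak of dengue; the virus spread quickly.",
    "The football season opened on Saturday.",
]

all_facts, scanned = scan_all(DOCS)
some_facts, processed = filter_then_extract(DOCS, "virus")
recall = len(some_facts & all_facts) / len(all_facts)
```

Here the filter cuts the number of processed documents from 3 to 1 but misses the cholera report, so recall drops to 0.5. A cost model would weigh that loss against the saved extraction work.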
2. Query Processing and Interfaces (2/3) Querying with “typed” keywords • Keyword querying: Easy to use • Structured queries: Precise • Find the middle ground… • Instead of “german has won nobel award” or q(x) :- GERMAN(x), hasWonPrize(x,y), NOBEL_PRIZE(y) • “german, has won (nobel award)”
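One possible reading of the typed-keyword example is sketched below: the comma separates the subject from the relation, and the parentheses mark the object. This syntax interpretation, the `LEXICON` table, and both function names are assumptions for illustration and are not taken from the slides.

```python
import re

# Hypothetical lexicon mapping keyword phrases to typed predicates.
LEXICON = {
    "german": "GERMAN",
    "has won": "hasWonPrize",
    "nobel award": "NOBEL_PRIZE",
}

def parse_typed_keywords(query):
    """Split 'subject, relation (object)' into its three parts."""
    subject, rest = [part.strip() for part in query.split(",", 1)]
    match = re.fullmatch(r"(.+?)\s*\((.+)\)", rest)
    if match is None:
        raise ValueError("expected the form: subject, relation (object)")
    return subject, match.group(1), match.group(2)

def to_structured(query):
    """Translate a typed-keyword query into a conjunctive query string."""
    subject, relation, obj = parse_typed_keywords(query)
    return "q(x) :- {}(x), {}(x,y), {}(y)".format(
        LEXICON[subject], LEXICON[relation], LEXICON[obj]
    )

structured = to_structured("german, has won (nobel award)")
```

The light punctuation gives the system enough typing information to build the precise query, while the user still only writes keywords.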
2. Query Processing and Interfaces (3/3) • Does the output have to be a boring list of ranked results? • Nope!
3. Keyword Search on Graphs (1/3) • Lots of graphs around • Relational DB (tuples+foreign keys) • XML data (elements/sub-elements/id/idrefs) • RDF (graph-structured knowledge-bases) • Easy to query with keywords, instead of SQL/XQuery/SPARQL • Results are the top-k interconnections between the keywords
3. Keyword Search on Graphs (3/3) • Query: “Einstein”, “Bohr” • [Figure: knowledge graph with nodes Tom Cruise, vegetarian, Einstein, Bohr, Nobel Prize, 1962 and edge labels isa, bornIn, won, diedIn]
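For a two-keyword query such as “Einstein”, “Bohr”, the "top-k interconnections" idea can be sketched by enumerating the simple paths between the two matching nodes and ranking shorter paths first. The toy triples, the path-length ranking, and the function names are assumptions; the graph is not a reconstruction of the slide's figure, and queries with more than two keywords would need tree-shaped answers rather than paths.

```python
from collections import defaultdict

def build_graph(triples):
    """Adjacency lists; edges are treated as undirected for connectivity."""
    adj = defaultdict(list)
    for subj, pred, obj in triples:
        adj[subj].append((pred, obj))
        adj[obj].append((pred, subj))
    return adj

def top_k_connections(adj, source, target, k, max_edges=4):
    """Return the k shortest simple paths (nodes and edge labels interleaved)."""
    found = []

    def dfs(node, path, visited, edges):
        if node == target:
            found.append((edges, path))
            return
        if edges == max_edges:
            return
        for label, nxt in adj[node]:
            if nxt not in visited:
                dfs(nxt, path + [label, nxt], visited | {nxt}, edges + 1)

    dfs(source, [source], {source}, 0)
    # Assumed ranking: fewer edges first, ties broken alphabetically.
    found.sort(key=lambda item: (item[0], item[1]))
    return [path for _, path in found[:k]]

# Toy knowledge graph (illustrative only).
TRIPLES = [
    ("Einstein", "won", "Nobel Prize"),
    ("Bohr", "won", "Nobel Prize"),
    ("Einstein", "isa", "physicist"),
    ("Bohr", "isa", "physicist"),
    ("Einstein", "bornIn", "Ulm"),
    ("Bohr", "bornIn", "Copenhagen"),
    ("Ulm", "locatedIn", "Europe"),
    ("Copenhagen", "locatedIn", "Europe"),
]
graph = build_graph(TRIPLES)
connections = top_k_connections(graph, "Einstein", "Bohr", k=3)
```

The two direct connections (shared profession, shared prize) rank ahead of the longer chain through the birthplaces.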
4. Entity and Relationship Extraction (1/2) Information Extraction (or Knowledge Harvesting) • Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne. • Infosys was founded on 2 July 1981 by seven entrepreneurs: N. R. Narayana Murthy, Nandan Nilekani, … • Bill Gates was the founder of Microsoft and later its CEO.
4. Entity and Relationship Extraction (2/2) • How to build a knowledge-base of facts? • Structurize Wikipedia • Construct rules for extraction • How do I acquire all the facts in the world? • Extract “everything” • Don’t stop extracting
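"Construct rules for extraction" can be illustrated with two hand-written patterns applied to sentences like the ones on the previous slide. The patterns, the relation names `foundedOn` and `foundedBy`, and the function name are assumptions; rules this simple break on small variations in wording, which is part of why extraction at scale is hard.

```python
import re

# Rule 1 (hypothetical): "<Org> was established/founded on <date> by <people>."
FOUNDED_BY = re.compile(
    r"(?P<org>[A-Z]\w+) was (?:established|founded) on (?P<date>.+?) "
    r"by (?P<founders>[^.:]+)\."
)
# Rule 2 (hypothetical): "<Person> was the founder of <Org>"
FOUNDER_OF = re.compile(
    r"(?P<person>[A-Z]\w+(?: [A-Z]\w+)*) was the founder of (?P<org>[A-Z]\w+)"
)

def extract_facts(sentence):
    """Return a list of (subject, relation, object) triples found in a sentence."""
    facts = []
    m = FOUNDED_BY.search(sentence)
    if m:
        facts.append((m.group("org"), "foundedOn", m.group("date")))
        # Split "A, B, and C" into individual names.
        names = re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group("founders"))
        for name in names:
            facts.append((m.group("org"), "foundedBy", name.strip()))
    m = FOUNDER_OF.search(sentence)
    if m:
        facts.append((m.group("org"), "foundedBy", m.group("person")))
    return facts

SENTENCES = [
    "Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.",
    "Bill Gates was the founder of Microsoft and later its CEO.",
]
knowledge_base = [fact for s in SENTENCES for fact in extract_facts(s)]
```

Each sentence yields subject-relation-object triples that could be loaded into a knowledge base of facts.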
5. Ranking and Structured Data • Not the same as top-k processing • Given: Data with structure in it • Relational tables (flat) • XML (trees/graphs) • Text documents consisting of entities • Task: Rank the query results • SQL/XQuery/“typed” keywords
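The slide leaves the ranking function open. As one illustrative possibility, the sketch below ranks relational tuples for a keyword query with an IDF-style weight, so that rarer terms count more. The scoring formula, the function name `rank_rows`, and the sample table are assumptions, not the method proposed in the slides.

```python
import math

def rank_rows(rows, query_terms):
    """Rank rows (dicts of text attributes) by summed IDF of matched query terms."""
    n = len(rows)
    # Tokenize every row once: the set of words across all its attribute values.
    row_words = [set(" ".join(row.values()).lower().split()) for row in rows]

    def idf(term):
        df = sum(1 for words in row_words if term in words)
        return math.log(n / df) if df else 0.0

    weights = {t: idf(t) for t in query_terms}
    scored = [
        (sum(w for t, w in weights.items() if t in words), row)
        for words, row in zip(row_words, rows)
    ]
    # Stable sort: rows with equal scores keep their original order.
    scored.sort(key=lambda pair: -pair[0])
    return [row for score, row in scored if score > 0]

# Hypothetical relational table.
ROWS = [
    {"title": "database systems", "topic": "query processing"},
    {"title": "information retrieval", "topic": "text search"},
    {"title": "database ranking", "topic": "top-k query processing"},
]
ranked = rank_rows(ROWS, ["database", "ranking"])
```

A boolean SQL match would return the matching rows in no particular order; here the tuple matching both terms comes first and the non-matching tuple is dropped.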