Schema Free Querying of Semantic Data Lushan Han Advisor: Dr. Tim Finin May 23, 2014
Road Map • Introduction • Related Work • SFQ Interface • Schema Network and Association Models • Query Interpretation • Evaluation • Conclusion
Semantic Data • A network of entities, which are annotated with types and interlinked with properties. • Increasing amount of Semantic Data • Examples: • RDF semantic data • LOD • DBpedia • Freebase
Objectives • Develop schema-free query interfaces • Work with “semantic data” in many forms, e.g., RDF, Freebase, RDBMS • Allow casual users to query semantic data freely without learning its schema • Queries should stay in the user’s conceptual world • Two existing interfaces: • Natural Language Interface (NLI) • Keyword Interface • Three hard problems
P1. No Practical Interface • Natural language interface • NLP techniques are not yet reliable enough to extract the full relational structure from natural language questions • Keyword interface • Ambiguity and limited expressiveness • (e.g. “president children spouse”) • (e.g. Who was the author of the Adventures of Tom Sawyer and where was he born?)
SFQ Interface • Still in the user’s conceptual world • Make the implicit structure of NL questions explicit • Who was the author of the Adventures of Tom Sawyer and where was he born?
P2. Semantic Heterogeneity Problem • Many different ways to express (model) the same meaning • Vocabulary and structure mismatches between the user’s query and the machine’s representation • Existing methods: • Labor-intensive and ad-hoc methods • Domain-specific syntactic or semantic grammars • Mapping Lexicons (Mapping rules) • Templates • Thesaurus (e.g. WordNet) is insufficient
A purely computational approach • Lexical semantic similarity measures • Capture flexible semantics • Statistical association measures • Carry out disambiguation • A novel “overall semantic similarity” or fitness metric that combines • Lexical semantic similarity measures • Statistical association measures • Structure features • Context-sensitive mapping algorithms
P3. Heterogeneous or unknown schema • Hard to reach consensus on a schema for the world • Open-domain semantic data has a heterogeneous or even unknown schema (e.g. Semantic Web data, DBpedia) • Traditional NLI systems are difficult to apply • Some modern systems • Do not produce formal queries (e.g. SQL or SPARQL) • Search the entity network directly for matches • Are computationally expensive and ad hoc in nature
The schema network • Learn a schema statistically from the entity network by exploiting co-occurrences • The schema itself is also represented as a network • Map the user’s query onto the schema network instead of the entity network • Much more scalable • Produces formal queries • Enables joint disambiguation and context-sensitive mapping algorithms
Thesis Statement We can develop an effective and efficient algorithm to map a casual user's schema-free query into a formal knowledge base query language that overcomes vocabulary and structure mismatch problems by exploiting lexical semantic similarity measures, association degree measures and structural features.
Contributions • An intuitive SFQ interface that avoids the problem of extracting relational structure from NL queries • Novel algorithms for mapping SFQ queries to KB queries that address both vocabulary and structure mismatches • A novel approach to handling heterogeneous or unknown schemas by building a schema from an entity network • A definition of the probability of observing a path in a schema network, and two novel statistical association models • An improved PMI metric and new semantic text similarity measures and algorithms
Natural Language Interface to Database (NLIDB) Systems • Early systems in the 1970s (e.g. LUNAR and LADDER) • Domain-specific syntactic or semantic grammars • Heavily customized to a particular application • Later systems in the 1980s and 1990s (e.g. TEAM, ASK, MASQUE) • More general parsers • Require human-crafted lexicons, mapping rules and domain knowledge to interpret the parse tree • Allow knowledge engineers or end users to enrich lexicons and add new mapping rules through an interactive interface • More portable than early systems
SFQ Examples • Where was the author of the Adventures of Tom Sawyer born? • Give me authors in the CIKM conference • A more complicated one
Default Relations • The relation name can be left out • A stop word list for filtering relation names, with words like in, of, has, from, belong, part of, locate, etc.
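The stop-word filtering described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function name `is_default_relation` and the rule "a name made up only of stop words counts as a default relation" are assumptions, and the word list contains only the examples given on the slide.

```python
# Hypothetical stop-word list built from the examples on the slide.
RELATION_STOP_WORDS = {"in", "of", "has", "from", "belong", "part", "locate"}


def is_default_relation(relation_name: str) -> bool:
    """Return True when the name is empty or contains only stop words.

    Assumption: such a name carries no usable meaning, so the relation
    is treated as if the user had left the name out.
    """
    words = relation_name.lower().split()
    return all(word in RELATION_STOP_WORDS for word in words)
```

Under this assumption, an empty name and a name such as "part of" are both treated as default relations, while "author of" is kept because "author" is a content word.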
Instance Data (ABox) • Two datasets • The relation dataset (all relations between instances) • The type dataset (all type definitions for instances) • Integrate all RDF data types into five types that are familiar to users • ˆNumber, ˆDate, ˆYear, ˆText and ˆLiteral • ˆLiteral is the super type of the other four • We use DBpedia for examples in the following slides
Automatically enrich the set of types • Automatically deduce types from relations • Infer attribute types from data type properties • e.g. <Beijing>, population, “20693000” => ˆPopulation • Infer classes from object properties • e.g. <Zelig>, director, <Woody Allen> => ˜Director
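The two inference rules above can be illustrated with a short sketch. It assumes that the inferred type is attached to the object of the triple and that the type name is the capitalized property name, as the two examples suggest; `infer_object_type` and the prefix constants are hypothetical names.

```python
ATTRIBUTE_PREFIX = "ˆ"  # marks attribute types, as in the slide's ˆPopulation
CLASS_PREFIX = "˜"      # marks inferred classes, as in the slide's ˜Director


def infer_object_type(prop: str, object_is_literal: bool) -> str:
    """Derive a type name for the object of a triple from its property.

    Assumption: data type properties (literal objects) yield attribute
    types, object properties (entity objects) yield inferred classes.
    """
    label = prop[:1].upper() + prop[1:]
    prefix = ATTRIBUTE_PREFIX if object_is_literal else CLASS_PREFIX
    return prefix + label


# <Beijing>, population, "20693000": the literal gets an attribute type.
population_type = infer_object_type("population", object_is_literal=True)
# <Zelig>, director, <Woody Allen>: the entity gets an inferred class.
director_type = infer_object_type("director", object_is_literal=False)
```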
The Schema Network • A statistical meta description of the underlying entity network, which is a network itself.
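One plausible way to derive such a meta description from the relation and type datasets is to count, for every relation between two instances, the co-occurring pairs of their types. The sketch below assumes exactly that; the function name, the data layout, and the toy DBpedia-like triples are illustrative and not taken from the slides.

```python
from collections import Counter


def build_schema_network(relations, types):
    """Count (subject type, property, object type) co-occurrences.

    relations: iterable of (subject, property, object) instance triples.
    types: dict mapping an instance to the list of its types.
    Returns a Counter whose keys are schema-level edges.
    """
    edge_counts = Counter()
    for subject, prop, obj in relations:
        for subject_type in types.get(subject, ()):
            for object_type in types.get(obj, ()):
                edge_counts[(subject_type, prop, object_type)] += 1
    return edge_counts


# Toy data, invented for illustration only.
toy_relations = [
    ("Tom_Sawyer", "author", "Mark_Twain"),
    ("Huckleberry_Finn", "author", "Mark_Twain"),
    ("Zelig", "director", "Woody_Allen"),
]
toy_types = {
    "Tom_Sawyer": ["Book"],
    "Huckleberry_Finn": ["Book"],
    "Zelig": ["Film"],
    "Mark_Twain": ["Writer", "Person"],
    "Woody_Allen": ["Person"],
}
toy_schema = build_schema_network(toy_relations, toy_types)
```

Because the result is keyed by types rather than instances, its size depends on the number of classes and properties, not on the number of entities.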
The Schema Path • A path on the schema network is called a schema path • A schema path P represents a composite relation • Example 1 (figure) • Example 2 (figure)
The Schema Path Probability • Measure the reasonableness of a path • The probability of “observing” a path on the schema network • (A1) we select the starting node c0 of the path randomly from all the nodes in the schema network • (A2) observe the path in a random walk starting with c0
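Assumptions (A1) and (A2) suggest computing a path probability as the probability of the starting node multiplied by the transition probability of each step. The sketch below assumes that a transition probability is an edge count divided by the total count of edges leaving the current node, and it takes the starting distribution as an input because the slides do not preserve how it is defined. All names and numbers are illustrative.

```python
def transition_probability(network, source, edge):
    """Probability of following `edge` = (property, target) out of `source`.

    Assumption: proportional to the edge count among all outgoing edges.
    """
    outgoing = network.get(source, {})
    total = sum(outgoing.values())
    if total == 0:
        return 0.0
    return outgoing.get(edge, 0) / total


def path_probability(network, start_prob, start, steps):
    """Probability of observing a path: start node, then a random walk."""
    prob = start_prob[start]          # (A1) pick the starting node c0
    current = start
    for prop, target in steps:        # (A2) random walk from c0
        prob *= transition_probability(network, current, (prop, target))
        current = target
    return prob


# Toy schema network: node -> {(property, target): count}.
toy_network = {
    "Book": {("author", "Person"): 3, ("publisher", "Company"): 1},
    "Person": {("birthPlace", "Place"): 2, ("spouse", "Person"): 2},
}
# Uniform starting distribution over four nodes (an assumption).
toy_start = {"Book": 0.25, "Person": 0.25, "Company": 0.25, "Place": 0.25}
toy_path = [("author", "Person"), ("birthPlace", "Place")]
toy_probability = path_probability(toy_network, toy_start, "Book", toy_path)
```

In the toy network the path Book, author, Person, birthPlace, Place has probability 0.25 × 0.75 × 0.5.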
Compute Transition Probability • Every transition probability lies in the interval [0, 1]
A Property about Schema Path • A schema path P and its return path P’ represent the same relation. • Given a schema path P and its return path P’, we have P(P) = P(P’).
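One setting in which this symmetry holds is a random walk on a network with symmetric edge weights, where the starting node is chosen with probability proportional to its total edge weight. Whether the schema network is defined exactly this way is an assumption here, not something the slides state; the sketch only demonstrates that the property follows under that setup. Edge labels are omitted for brevity.

```python
import math


def walk_probability(weights, path):
    """Probability of a node sequence under a degree-weighted start.

    weights: dict node -> {neighbor: weight}, with weights[a][b] == weights[b][a].
    Assumption: the start node is drawn with probability degree / total degree.
    """
    degree = {node: sum(nbrs.values()) for node, nbrs in weights.items()}
    total = sum(degree.values())
    prob = degree[path[0]] / total
    for current, nxt in zip(path, path[1:]):
        prob *= weights[current][nxt] / degree[current]
    return prob


# Toy symmetric network, invented for illustration.
toy_weights = {
    "Book": {"Person": 3, "Company": 1},
    "Person": {"Book": 3, "Place": 2},
    "Place": {"Person": 2},
    "Company": {"Book": 1},
}
forward = walk_probability(toy_weights, ["Book", "Person", "Place"])
backward = walk_probability(toy_weights, ["Place", "Person", "Book"])
paths_match = math.isclose(forward, backward)
```

The degree of the starting node cancels against the first transition's denominator, which is why the forward and return products end up with the same factors.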
Schema Path Model • Stores and indexes all schema paths whose length does not exceed a given threshold, together with their probabilities • The only supported function returns all the schema paths, with their probabilities, between two given classes • Kept in memory for fast computation
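A model of this kind can be sketched as a dictionary keyed by (start class, end class), filled by a depth-limited traversal. The sketch assumes count-proportional transition probabilities and a given starting distribution; `build_schema_path_model` and `paths_between` are hypothetical names, and the toy network is invented.

```python
from collections import defaultdict


def build_schema_path_model(network, start_prob, max_length):
    """Index every path of length 1..max_length by its two end classes.

    network: node -> {(property, target): count}.
    Returns a dict (start, end) -> list of (path steps, probability).
    """
    model = defaultdict(list)

    def extend(start, current, steps, prob):
        if steps:
            model[(start, current)].append((tuple(steps), prob))
        if len(steps) == max_length:
            return
        outgoing = network.get(current, {})
        total = sum(outgoing.values())
        for (prop, target), count in outgoing.items():
            steps.append((prop, target))
            extend(start, target, steps, prob * count / total)
            steps.pop()

    for node, prob in start_prob.items():
        extend(node, node, [], prob)
    return model


def paths_between(model, source, target):
    """The single lookup the model supports: all paths between two classes."""
    return model.get((source, target), [])


toy_network = {
    "Book": {("author", "Person"): 3, ("publisher", "Company"): 1},
    "Person": {("birthPlace", "Place"): 2, ("spouse", "Person"): 2},
}
toy_start = {"Book": 0.25, "Person": 0.25, "Company": 0.25, "Place": 0.25}
toy_model = build_schema_path_model(toy_network, toy_start, max_length=2)
book_to_place = paths_between(toy_model, "Book", "Place")
```

Precomputing the index trades memory for lookup speed, which matches the slide's note that the model is kept in memory.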
Concept Path • Group all the edges with the same direction between two nodes into a single edge • By analogy to schema path, we have concept path probability • Concept path frequency
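The grouping step can be illustrated as follows. The sketch assumes that the weight of a grouped edge is the sum of the counts of the property edges it replaces, which the slides do not state; the function name and toy counts are illustrative.

```python
from collections import Counter


def to_concept_edges(schema_edges):
    """Merge all same-direction property edges between two nodes.

    schema_edges: dict (source, property, target) -> count.
    Returns a Counter (source, target) -> summed count (an assumption).
    """
    concept_edges = Counter()
    for (source, _prop, target), count in schema_edges.items():
        concept_edges[(source, target)] += count
    return concept_edges


toy_schema_edges = {
    ("Book", "author", "Person"): 3,
    ("Book", "illustrator", "Person"): 1,
    ("Person", "birthPlace", "Place"): 2,
}
toy_concept_edges = to_concept_edges(toy_schema_edges)
```

Direction is preserved: an edge from Book to Person is never merged with an edge from Person to Book.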
Concept Association Knowledge (CAK) model • Pairwise associations • (i) direct association between classes and properties • (ii) indirect association between two classes • PMI measure • Our improved PMI measure
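For reference, standard pointwise mutual information compares how often two items co-occur with how often they would co-occur if they were independent: PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The sketch below computes it from raw counts. It covers only the standard measure; the improved variant mentioned on the slide is not defined in this text and is not reproduced here. The function name and the base-2 logarithm are choices made for illustration.

```python
import math


def pmi(joint_count, count_x, count_y, total):
    """Standard PMI (base 2) from co-occurrence counts.

    joint_count: observations containing both x and y.
    count_x, count_y: observations containing x, respectively y.
    total: total number of observations.
    """
    if joint_count == 0:
        return float("-inf")  # never observed together
    return math.log2((joint_count * total) / (count_x * count_y))
```

A value of 0 means the pair co-occurs exactly as often as independence would predict; positive values indicate association.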
Concept Association Knowledge (CAK) model • Direct association between a directed class and a property p • Indirect association between two directed classes
PMI* vs PMI • The most associated properties for “Person” in DBpedia, ranked by PMI* and by PMI
Time Complexity of Concept Mapping Algorithm • A straightforward concept mapping algorithm • After exploiting locality – the optimal mapping choice of a property can be determined locally when the two classes it links are fixed
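The effect of locality can be illustrated with a toy comparison. The sketch assumes that the quality of a mapping is a sum of per-relation scores, each depending only on the chosen property and the two classes it links; the scoring function, candidate lists, and function names are all invented, and the complexity expressions from the slide are not reproduced.

```python
from itertools import product


def brute_force(class_cands, prop_cands, relations, score):
    """Score every joint choice of classes and properties."""
    best_total, calls = float("-inf"), 0
    for classes in product(*class_cands):
        for props in product(*prop_cands):
            total = 0.0
            for prop, (i, j) in zip(props, relations):
                total += score(prop, classes[i], classes[j])
                calls += 1
            best_total = max(best_total, total)
    return best_total, calls


def with_locality(class_cands, prop_cands, relations, score):
    """Fix the classes, then pick each property independently."""
    best_total, calls = float("-inf"), 0
    for classes in product(*class_cands):
        total = 0.0
        for cands, (i, j) in zip(prop_cands, relations):
            total += max(score(p, classes[i], classes[j]) for p in cands)
            calls += len(cands)
        best_total = max(best_total, total)
    return best_total, calls


def toy_score(prop, source, target):
    """Arbitrary deterministic score, for illustration only."""
    return ((len(prop) * 7 + len(source) * 3 + len(target)) % 11) / 10


toy_class_cands = [["Book", "Work"], ["Writer", "Person"], ["City", "Place"]]
toy_prop_cands = [["author", "writer", "creator"],
                  ["birthPlace", "hometown", "residence"]]
toy_relations = [(0, 1), (1, 2)]  # each relation links two class slots

brute_best, brute_calls = brute_force(
    toy_class_cands, toy_prop_cands, toy_relations, toy_score)
local_best, local_calls = with_locality(
    toy_class_cands, toy_prop_cands, toy_relations, toy_score)
```

Both searches reach the same best score, but the local version multiplies the class combinations by the sum of the property candidate counts rather than by their product.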
Relation Mapping Optimization Problem • H* : the set of top k3 concept mapping hypotheses • The reduced mapping space for the SFQ • The optimization problem
Computing the fitness of a mapping σ on a relation r • Let P be the schema path that σ assigns to r • Two features and one parameter β • Joint lexical semantic similarity between r and P • The schema path frequency of P • The parameter β adjusts the relative importance of the two features
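The formula that combines the two features is not preserved in this text. Purely to illustrate how a single parameter can trade off two features, the sketch below uses a hypothetical form, similarity multiplied by frequency raised to the power β. This form is an assumption and should not be read as the metric used in the thesis.

```python
import math


def fitness(similarity, path_frequency, beta):
    """Hypothetical combination: similarity * path_frequency ** beta.

    beta = 0 ignores the path frequency entirely; larger beta gives
    the frequency more influence over the ranking.
    """
    return similarity * (path_frequency ** beta)


# Two invented candidates: A is lexically closer, B is far more frequent.
candidate_a = {"similarity": 0.9, "path_frequency": 0.01}
candidate_b = {"similarity": 0.6, "path_frequency": 0.5}


def best_candidate(beta):
    """Return "A" or "B", whichever has the higher fitness for this beta."""
    score_a = fitness(candidate_a["similarity"], candidate_a["path_frequency"], beta)
    score_b = fitness(candidate_b["similarity"], candidate_b["path_frequency"], beta)
    return "A" if score_a > score_b else "B"
```

With β = 0 the lexically closer candidate A wins, while with β = 1 the more frequent candidate B wins, which shows the kind of trade-off the parameter controls.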