XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Yifei Lu¹, Wei Wang¹, Jianxin Li² and Chengfei Liu² ¹ University of New South Wales ² Swinburne University of Technology

XML Keyword Search • User: I want to find data mining papers coauthored by Jiawei Han • Query: jiawi minning [Figure: DBLP XML tree of paper and book elements; their title and author nodes hold text such as “Mining concept”, “link mining”, “Jiawei Han”, “Jian Pei”, “Eric” and “Manning”] 2

Challenges • Must offer highly plausible suggestions • Suggested queries should have non-empty results • Must be highly efficient 3

Poor Suggestion 4

Empty Result 5

Empty Result • Query: jiawi minning • Pu and Yu [PVLDB08] will suggest “jian manning” • Worse than “jiawei mining” • No meaningful connection between the two words in the data [Figure: the same DBLP XML tree of paper and book elements with title and author nodes] 6

Problem Definition • Data • A set of XML document trees • Form a single tree by adding a virtual root node • Query • Q = {jiawi, minning} • Candidate Query Space • One word chosen from the Confusion Set of each query keyword, e.g. jiawi → {jiawei, jian}, minning → {mining, manning} • Confusion Set: valid words in the vocabulary with edit distance ≤ threshold • Query Cleaning • Find the top-k queries in the Candidate Query Space • Rank by the probability of each candidate given the query and the data 7

Ranking Candidate Queries • How to model the probability of a candidate query • By Bayes’ Theorem • Rank by the product of two factors: the Error Model and the Query Likelihood Model 8
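The Bayes step can be written out as below. The formula on the slide did not survive extraction, so the notation is an assumption: Q is the observed (possibly misspelled) query, C a candidate query and T the XML data. The second step further assumes that typing errors do not depend on the data, and drops the denominator because it is the same for every candidate.

```latex
P(C \mid Q, T)
  = \frac{P(Q \mid C, T)\, P(C \mid T)}{P(Q \mid T)}
  \;\propto\;
  \underbrace{P(Q \mid C)}_{\text{Error Model}}
  \cdot
  \underbrace{P(C \mid T)}_{\text{Query Likelihood Model}}
```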

Error Model • Modeling Typographical Errors • The more similar, the more likely • Similarity measured by Edit Distance • Independence Assumption across query keywords [Figure: words around “minning” by Edit Distance: mining and manning at ed=1; binding, running, linking and finding at ed=2] 9
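A minimal sketch of an edit-distance-based error model follows. The slides only say that more similar words are more likely; the exponential decay `exp(-beta * ed)`, the parameter `beta` and the function names are assumptions made for illustration, and the weights are unnormalized.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def confusion_set(keyword, vocabulary, threshold=2):
    """Valid vocabulary words within the edit-distance threshold."""
    distances = {w: edit_distance(keyword, w) for w in vocabulary}
    return {w: d for w, d in distances.items() if d <= threshold}


def error_weight(query, candidate, beta=1.0):
    """Unnormalized error-model weight (assumed form: exp(-beta * ed)).

    Keywords are treated independently, so the weights multiply.
    """
    return math.prod(math.exp(-beta * edit_distance(q, c))
                     for q, c in zip(query, candidate))


vocabulary = ["mining", "manning", "binding", "running", "linking", "finding"]
close_words = confusion_set("minning", vocabulary, threshold=1)
```

With a threshold of 1, only "mining" and "manning" remain in the confusion set of "minning".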

Query Likelihood Model • Modeling Query Generation Probability • A good query finds good results • The data is treated as a set of disjoint entities (sub-trees) • Measure the query likelihood on each entity • Aggregate through all entities, weighted by the Entity Prior (assume uniform) [Figure: DBLP tree with three entities r1, r2 and r3, the paper and book sub-trees] 10
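One way to write the aggregation over entities is shown below. The symbols are assumptions, since the slide's own formula was lost: C is a candidate query, T the XML data and R the set of disjoint entities. The uniform prior is the one the slide mentions.

```latex
P(C \mid T) = \sum_{r \in R} P(C \mid r)\, P(r \mid T),
\qquad
P(r \mid T) = \frac{1}{|R|} \quad \text{(uniform Entity Prior)}
```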

Language Modeling • Modeling query likelihood on entities • Extract text in the sub-tree • Build a Language Model • Smoothing is used to avoid zero probability [Figure: entity r1, a paper under DBLP with booktitle “Data mining and knowledge discovery”, author “Jiawei Han” and title “Mining concept drifting data”] 11
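The sketch below illustrates a smoothed unigram language model over the text of one entity. The slides do not say which smoothing method is used; Jelinek-Mercer interpolation with the whole collection, the weight `lam` and the function names are assumptions chosen for illustration.

```python
from collections import Counter


def unigram_counts(text):
    """Lower-cased whitespace tokens and their frequencies."""
    return Counter(text.lower().split())


def query_likelihood(candidate, entity_text, collection_text, lam=0.8):
    """Likelihood of the candidate words under a smoothed unigram model.

    Assumed smoothing: lam * P(word | entity) + (1 - lam) * P(word | collection).
    """
    entity = unigram_counts(entity_text)
    collection = unigram_counts(collection_text)
    entity_len = sum(entity.values())
    collection_len = sum(collection.values())
    likelihood = 1.0
    for word in candidate:
        p_entity = entity[word] / entity_len if entity_len else 0.0
        p_collection = collection[word] / collection_len if collection_len else 0.0
        likelihood *= lam * p_entity + (1 - lam) * p_collection
    return likelihood


# Text extracted from the sub-tree of entity r1 on the slide.
r1_text = ("Data mining and knowledge discovery "
           "Jiawei Han Mining concept drifting data")
# Hypothetical collection text: r1 plus a few words from other entities.
collection_text = r1_text + " jian manning link mining"

good = query_likelihood(["jiawei", "mining"], r1_text, collection_text)
poor = query_likelihood(["jian", "manning"], r1_text, collection_text)
```

Because of smoothing, `poor` is small but not zero even though neither word occurs in r1.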

Finding the entities • How to find the entities • Each entity is a potential search result • Different semantics can be applied • SLCA, ELCA, etc. • Specific Return Type • One for each query • Popular type • But not too deep [Figure: DBLP tree; with return type p = /DBLP/paper, each paper sub-tree is an entity] 12

Summary: Ranking Framework • Score of a candidate query = Error Model × sum over all entities of (Query likelihood on each entity × Entity Prior) 13

Algorithm • Naïve Algorithm • Enumerate all possible candidate queries • Find the entities and compute the score for each candidate query • Problems: • Multiple passes of data • Not all candidates are needed • Candidates in the example: 1. Jiawei mining 2. Jian mining 3. Jiawei Manning 4. Jian Manning [Figure: DBLP tree of paper and book elements whose author and title nodes contain Jiawei, Jian, mining and Manning] 14
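The naïve algorithm above can be sketched as follows. This is an illustration only: the entity word sets mirror the running example, and `contains_all` is a hypothetical stand-in for the real scoring function (it returns 1.0 when every candidate word occurs in the entity, otherwise 0.0).

```python
from itertools import product


def contains_all(candidate, entity_words):
    """Hypothetical score: 1.0 if the entity contains every candidate word."""
    return 1.0 if all(word in entity_words for word in candidate) else 0.0


def naive_clean(confusion_sets, entities, score_in_entity, k=2):
    """Enumerate every candidate query and score it against every entity."""
    scored = []
    for candidate in product(*confusion_sets):
        # One full pass over the data per candidate: the first problem
        # listed on the slide.
        total = sum(score_in_entity(candidate, entity) for entity in entities)
        if total > 0:  # keep only candidates with non-empty results
            scored.append((total, candidate))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [candidate for _, candidate in scored[:k]]


# Words found under the three entities of the running example.
entities = [{"jiawei", "jian"}, {"mining", "jiawei"}, {"jian", "manning"}]
confusion_sets = [["jiawei", "jian"], ["mining", "manning"]]
top = naive_clean(confusion_sets, entities, contains_all, k=2)
```

All four candidates are enumerated, yet only two of them have any result, which is the second problem the slide points out.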

XClean Example • Query: jiawi minning • Inverted lists of the variants (Dewey labels) • p1 jiawei: 1.1.1.1.1, 1.2.2.1 • p2 jian: 1.1.1.2.1, 1.3.1.1.1 • p3 mining: 1.2.1.1 • p4 manning: 1.3.2.1.1 [Figure: DBLP tree rooted at node 1 with paper 1.1, paper 1.2 and book 1.3; cursors p1 to p4 point into the four lists] 15

XClean Example • Query: jiawi minning • “Jiawei mining” is generated • “Jian mining” is skipped • Inverted lists: p1 jiawei: 1.1.1.1.1, 1.2.2.1 • p2 jian: 1.1.1.2.1, 1.3.1.1.1 • p3 mining: 1.2.1.1 • p4 manning: 1.3.2.1.1 [Figure: the same DBLP tree with the cursors p1 to p4 advanced] 16

XClean Example • Query: jiawi minning • “jian manning” is generated • Inverted lists: p1 jiawei: 1.1.1.1.1, 1.2.2.1 • p2 jian: 1.1.1.2.1, 1.3.1.1.1 • p3 mining: 1.2.1.1 • p4 manning: 1.3.2.1.1 [Figure: the same DBLP tree with the remaining cursors p1, p2 and p4] 17

Experiment Settings • Algorithms • XClean • PY08: Pu and Yu [PVLDB08] • SE1: Search Engine 1 • SE2: Search Engine 2 • Measures • Mean Reciprocal Rank • Precision@N • Time 18

Experiment Settings • Datasets • Queries • Clean: original clean queries • INEX: 285 • DBLP: 49 • Random: random edit operations on each keyword • Rule: replace each word with a common misspelling 19

Experiment Results • Mean Reciprocal Rank (MRR) 20

Experiment Results • Precision@N • Percentage of queries for which the correct suggestion is in top-N suggestions 21
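The two quality measures can be computed as in the sketch below. The function names and the toy data are illustrative assumptions, not the experiment's actual queries or results.

```python
def mean_reciprocal_rank(suggestions, truths):
    """Average of 1 / rank of the correct suggestion (0 if it is missing)."""
    total = 0.0
    for ranked, truth in zip(suggestions, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)


def precision_at_n(suggestions, truths, n):
    """Fraction of queries whose correct suggestion is in the top n."""
    hits = sum(1 for ranked, truth in zip(suggestions, truths)
               if truth in ranked[:n])
    return hits / len(truths)


# Toy data: ranked suggestion lists for three queries and the correct answers.
suggestions = [["jiawei mining", "jian manning"],
               ["jian manning", "jiawei mining"],
               ["jian manning"]]
truths = ["jiawei mining", "jiawei mining", "jiawei mining"]

mrr = mean_reciprocal_rank(suggestions, truths)
p_at_1 = precision_at_n(suggestions, truths, 1)
p_at_2 = precision_at_n(suggestions, truths, 2)
```

In this toy data the correct query is ranked first once, second once and is missing once, so the reciprocal ranks are 1, 1/2 and 0.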

Experiment Results • Time • Query processing time 22

Conclusion • Contributions • A probabilistic framework for keyword query cleaning on XML databases • An Error Model based on edit distance • A Query Likelihood Model that exploits XML tree structures and keyword search semantics • Future work • Concatenation/Splitting of words • Cognitive Errors 23

Thank you! Questions? 24

XClean Algorithm • Find variants for each query keyword, and compute the error probability of each variant • Retrieve the XML nodes containing each variant through an inverted index • The nodes of all variants of a query keyword form a virtual list • Find the entity nodes that have at least one child node from each virtual list • Compute the score for each candidate query found in each entity • Accumulate the scores in a global hash table • Output top-k candidate queries 25
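The steps above can be sketched on the inverted lists of the running example. This is a simplified illustration under several assumptions: entities are taken to be the sub-trees at Dewey depth 2, nodes are grouped per entity with hash maps rather than merged in document order, the error weights use an assumed `exp(-ed)` form, and the per-entity query likelihood is replaced by a placeholder value of 1.

```python
import heapq
import math
from collections import defaultdict
from itertools import product


def entity_of(dewey, depth=2):
    """Entity root of a node: its Dewey prefix of the given depth."""
    return ".".join(dewey.split(".")[:depth])


def xclean_sketch(variants, inverted_index, k=2, depth=2):
    """variants: one dict per query keyword, mapping variant -> error weight."""
    # entity -> one set per query keyword with the variants seen in it
    seen = defaultdict(lambda: [set() for _ in variants])
    for i, keyword_variants in enumerate(variants):
        for variant in keyword_variants:  # virtual list of keyword i
            for dewey in inverted_index.get(variant, []):
                seen[entity_of(dewey, depth)][i].add(variant)

    scores = defaultdict(float)  # global hash table of candidate scores
    for per_keyword in seen.values():
        if not all(per_keyword):  # entity misses some virtual list
            continue
        for candidate in product(*per_keyword):
            weight = 1.0
            for i, word in enumerate(candidate):
                weight *= variants[i][word]
            scores[candidate] += weight  # placeholder likelihood of 1
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


inverted_index = {
    "jiawei": ["1.1.1.1.1", "1.2.2.1"],
    "jian": ["1.1.1.2.1", "1.3.1.1.1"],
    "mining": ["1.2.1.1"],
    "manning": ["1.3.2.1.1"],
}
# Assumed weights exp(-ed) for the query keywords "jiawi" and "minning".
variants = [
    {"jiawei": math.exp(-1), "jian": math.exp(-2)},
    {"mining": math.exp(-1), "manning": math.exp(-1)},
]
result = xclean_sketch(variants, inverted_index, k=2)
```

Only candidates that co-occur inside some entity are ever created, so "jian mining" and "jiawei manning" never enter the hash table, in line with the example slides.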
