Effective XML Keyword Search with Relevance Oriented Ranking. Presentation by Volker Rehberg. Paper by Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu. Agenda. I) Motivation and Background II) Inferring Keyword Search Intention III) Relevance Oriented Ranking
Effective XML Keyword Search with Relevance Oriented Ranking. Presentation by Volker Rehberg. Paper by Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu.
Agenda I) Motivation and Background II) Inferring Keyword Search Intention III) Relevance Oriented Ranking IV) Algorithms V) Experimental Evaluation VI) Conclusion
Motivation and Background What is "Effective XML Keyword Search with Relevance Oriented Ranking" all about? • Keyword search Issue 1: identify the search for node Issue 2: identify the search via node Issue 3: rank each query result
Motivation and Background Ambiguities in interpreting the search for node and the search via node: Ambiguity 1: A keyword can appear as an XML tag name and as a text value of some other nodes.
Motivation and Background Ambiguities in interpreting the search for node and the search via node: Ambiguity 2: A keyword can appear as the text value of different types of XML nodes and carry different meanings.
Motivation and Background Keyword query: Customer interest art. SLCA returns 5 results without any ranking; only the customer with ID C4 is desired and should be top ranked.
Motivation and Background Problems of SLCA: • does not consider the semantics of the query and the XML data • keyword ambiguity problem • no relevance oriented ranking: answers irrelevant to the user's search intention • answers not meaningful and informative enough
Motivation and Background TF*IDF (Term Frequency * Inverse Document Frequency) • Rule 1: Inverse Document Frequency • Rule 2: Term Frequency • Rule 3: Normalization
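Motivation and Background TF*IDF similarity between a query and a flat document (formulas not preserved in this transcript). Legend: query; flat document; keyword. Normalized document/term frequency: number of documents; occurrences of k in document d; documents containing k. Weights of query q and document d.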
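The three rules can be sketched in a few lines of Python. This is only an illustration under assumptions: the slide's formula images did not survive, so the weights used here (ln(N / (f_k + 1)) for the query side and 1 + ln(f_d,k) for the document side, combined as a normalized dot product) follow one common TF*IDF formulation, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_similarity(query, doc, corpus):
    """Normalized TF*IDF similarity between a keyword query and one flat document.

    query, doc: lists of words; corpus: list of documents (lists of words).
    The weight formulas are an assumed, common formulation.
    """
    n_docs = len(corpus)
    doc_tf = Counter(doc)

    def query_weight(k):
        # Rule 1 (inverse document frequency): rarer keywords weigh more.
        containing = sum(1 for d in corpus if k in d)
        return math.log(n_docs / (containing + 1))

    def doc_weight(count):
        # Rule 2 (term frequency): more occurrences weigh more, with damping.
        return 1 + math.log(count)

    keywords = set(query)
    dot = sum(query_weight(k) * doc_weight(doc_tf[k])
              for k in keywords if doc_tf[k] > 0)

    # Rule 3 (normalization): divide by the weights of query and document.
    w_q = math.sqrt(sum(query_weight(k) ** 2 for k in keywords))
    w_d = math.sqrt(sum(doc_weight(c) ** 2 for c in doc_tf.values()))
    return dot / (w_q * w_d) if w_q and w_d else 0.0

# Hypothetical toy corpus of five flat documents.
corpus = [
    ["art", "art", "music"],
    ["art", "sport"],
    ["sport", "music"],
    ["sport", "travel"],
    ["music", "travel"],
]
scores = [tfidf_similarity(["art"], d, corpus) for d in corpus]
```

In this toy corpus the document that mentions "art" twice scores highest, and documents without the keyword score zero.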
Inferring Keyword Search Intention Talking about "Art": • Intuition: element of the "interest" node, because many people are interested in art • statistics of the underlying database
Inferring Keyword Search Intention A node type T is a search for node if: 1: T is intuitively related to every query keyword in q. 2: T is informative enough to contain enough relevant information. 3: T does not contain too much irrelevant information.
Inferring Keyword Search Intention Confidence of a node type T to be the desired search for node (formula not preserved in this transcript). Legend: the number of T-typed nodes that contain k as either values or tag names in their subtrees; k: a keyword in query q; the reduction factor (range 0-1), normally chosen to be 0.8. Confidence of a node type T to be the desired search via node (formula not preserved in this transcript).
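A small Python sketch may make the two confidence measures more concrete. The exact formulas are an assumption here, since the slide images are missing: the search for confidence is taken as ln(1 + product of the per-keyword frequencies) times r^depth(T), and the search via confidence as ln(1 + sum of the per-keyword frequencies). The function names and the sample numbers are hypothetical.

```python
import math

def confidence_search_for(freqs, depth, r=0.8):
    """Assumed form: C_for(T, q) = ln(1 + prod_k f_k^T) * r ** depth(T).

    freqs: for each keyword k in q, the number of T-typed nodes that contain k
           as a value or tag name in their subtrees.
    depth: depth of node type T in the document tree.
    r:     reduction factor in (0, 1], 0.8 as suggested on the slide.
    """
    # The product is zero as soon as one keyword never occurs under T,
    # so T must be related to every query keyword.
    return math.log(1 + math.prod(freqs)) * (r ** depth)

def confidence_search_via(freqs):
    """Assumed form: C_via(T, q) = ln(1 + sum_k f_k^T)."""
    # A sum is used because a search via node only needs some of the keywords.
    return math.log(1 + sum(freqs))

# Hypothetical statistics for query "customer interest art".
c_customer = confidence_search_for([10, 10, 6], depth=1)
c_interest = confidence_search_for([0, 10, 6], depth=2)  # "customer" never occurs below interest
```

With these made-up numbers the customer type wins, because the interest type is unrelated to one of the keywords and therefore gets confidence zero.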
Inferring Keyword Search Intention Keyword query: Customer name rock interest art • "art" should be in interest and "rock" should be searched for in name • the order of keywords in the query is important
Inferring Keyword Search Intention Value Type Distance (Dist): Dist(q, v, kt, k) = Max(Distq(q, v, kt, k), Dists(q, v, kt, k)). In-Query Distance (IQD, Distq): position distance between kt and k in q, if kt appears before k in the query. Structural Distance (Dists): depth distance between v and the nearest kt-typed ancestor node of v. Legend: v: node; k: keyword that matches in v; kt: keyword that matches the type of an ancestor node of v.
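The distance definitions above can be sketched as follows. The slide does not say what happens when kt does not precede k in the query or when no kt-typed ancestor exists, so this sketch assumes a large constant penalty in both cases; the path representation (a list of tag names from the root down to v) and all names are hypothetical.

```python
LONG_DISTANCE = 10**6  # assumed penalty when a distance is undefined

def in_query_distance(query, kt, k):
    """Position distance between kt and k in the query, if kt appears before k."""
    if kt in query and k in query:
        i, j = query.index(kt), query.index(k)
        if i < j:
            return j - i
    return LONG_DISTANCE

def structural_distance(path_to_v, kt):
    """Depth distance between v and its nearest kt-typed ancestor.

    path_to_v: tag names from the root down to v, with v as the last entry.
    """
    ancestors = path_to_v[:-1]
    for steps, tag in enumerate(reversed(ancestors), start=1):
        if tag == kt:
            return steps
    return LONG_DISTANCE

def value_type_distance(query, path_to_v, kt, k):
    """Dist = max(in-query distance, structural distance)."""
    return max(in_query_distance(query, kt, k),
               structural_distance(path_to_v, kt))

query = ["customer", "name", "rock", "interest", "art"]
name_value = ["customers", "customer", "name", "#value"]          # v contains "rock"
interest_value = ["customers", "customer", "interest", "#value"]  # v contains "art"
```

For the example query, "rock" is close to "name" and "art" is close to "interest", while the crossed pairings receive the penalty distance.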
Inferring Keyword Search Intention Keyword query: Customer name rock interest art
Relevance Oriented Ranking Ranking Principles Searching for a customer via the street node with keyword query: Art Street. Principle 1: only search via nodes affect relevance
Relevance Oriented Ranking Ranking Principles Searching for customers interested in art using query: "art". Principle 1: only search via nodes affect relevance. Principle 2: a search via node should contain the keyword
Relevance Oriented Ranking Ranking Principles Keyword query: Customer name rock interest art. Principle 1: only search via nodes affect relevance. Principle 2: a search via node should contain the keyword. Principle 3: the order of keywords in the query is important
Relevance Oriented Ranking Capture XML's hierarchical structure to compute XML TF*IDF similarity: (a) a is a value node (base case) (b) a is an internal node (recursive case). Legend: a: node; q: query; similarity value between q and a. Base case: similarity between a leaf node and the query. Recursive case: recursive similarity between an internal node and the query.
Relevance Oriented Ranking Capture XML's hierarchical structure to compute XML TF*IDF similarity: (a) a is a value node (base case) (b) a is an internal node (recursive case). Legend: a: node; q: query; similarity value between q and a. The base case is similar to classic TF*IDF (query, flat document, keyword).
Relevance Oriented Ranking Capture XML's hierarchical structure to compute XML TF*IDF similarity: (a) a is a value node (base case) (b) a is an internal node (recursive case). Legend: a: node; q: query; similarity value between q and a; confidence of Tc to be a search via node; c: child node of a; similarity between c and q (computed recursively); overall weight of a for the given query q. Intuition: a is relevant if its children have high confidence to be search via nodes and are relevant to q. Intuition: more relevant children increase the relevance of the node.
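The base case and the recursive case can be illustrated with a short recursive function. This is a simplified sketch, not the formula from the paper: the leaf similarity is a stand-in normalized term match, the overall weight of an internal node is assumed to be the Euclidean norm of its children's search via confidences, and the `Node` class, function names and confidence values are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str
    terms: list = field(default_factory=list)     # words of a value node
    children: list = field(default_factory=list)  # children of an internal node

def leaf_similarity(query, terms):
    """Base case: stand-in TF-style match between a value node and the query."""
    tf = Counter(terms)
    keywords = set(query)
    dot = sum(1 + math.log(tf[k]) for k in keywords if tf[k] > 0)
    w_q = math.sqrt(len(keywords))
    w_a = math.sqrt(sum((1 + math.log(c)) ** 2 for c in tf.values()))
    return dot / (w_q * w_a) if w_q and w_a else 0.0

def similarity(query, node, c_via):
    """Recursive case: children contribute in proportion to the confidence
    of their node type to be a search via node (c_via: type -> confidence)."""
    if not node.children:
        return leaf_similarity(query, node.terms)
    weighted = [(similarity(query, c, c_via), c_via.get(c.node_type, 0.0))
                for c in node.children]
    # Assumed normalizer standing in for the overall weight of the node.
    w_a = math.sqrt(sum(conf ** 2 for _, conf in weighted))
    if w_a == 0:
        return 0.0
    return sum(sim * conf for sim, conf in weighted) / w_a

# Hypothetical data: for query "art", interest is the likely search via node.
c_via = {"interest": 1.0, "street": 0.1}
customer_a = Node("customer", children=[Node("interest", ["art"]),
                                        Node("street", ["main", "street"])])
customer_b = Node("customer", children=[Node("interest", ["fashion"]),
                                        Node("street", ["art", "street"])])
```

Customer A, whose interest is art, ranks above customer B, who merely lives on Art Street, because the street type has low confidence to be a search via node.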
Algorithms Parsing the input XML document, for each node n visited: (1) Assign a DeweyID to n (2) Store the prefix path prefixPath of n in a hash table
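The parsing step can be sketched with Python's standard `xml.etree.ElementTree` module. The numbering convention (root labelled "1", children numbered from 1) and the function name are assumptions for illustration; the hash table is a plain dictionary from DeweyID to prefix path.

```python
import xml.etree.ElementTree as ET

def assign_dewey_ids(xml_text):
    """Walk the document and return {DeweyID: prefixPath} for every element."""
    labels = {}

    def visit(elem, dewey, parent_path):
        prefix_path = f"{parent_path}/{elem.tag}"
        labels[dewey] = prefix_path  # (2) store the prefix path of n
        for position, child in enumerate(elem, start=1):
            # (1) a child's DeweyID extends its parent's ID by its position
            visit(child, f"{dewey}.{position}", prefix_path)

    visit(ET.fromstring(xml_text), "1", "")
    return labels

# Hypothetical sample document.
sample = ("<customers>"
          "<customer><name>Art Smith</name><interest>art</interest></customer>"
          "<customer><name>Bo</name></customer>"
          "</customers>")
labels = assign_dewey_ids(sample)
```

Because a DeweyID encodes the path from the root, the ancestor relationship between two nodes can be checked by comparing ID prefixes, without touching the document again.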
Algorithms Build 2 indices: 1. Keyword inverted list: (1) Dup: DeweyID and XML TF*IDF (fa,k) (2) DupType: Dup + node type (prefix path) (3) DupTypeNorm: DupType + normalization factor Wa. "Node" tuple: <DeweyID, prefixPath, fa,k, Wa> 2. Frequency table: stores the frequency of k in node type T
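As a rough sketch, both indices can be built in one traversal. The tuple layout here omits the normalization factor Wa, keywords are simply lowercased whitespace tokens, and the frequency table counts, per keyword and node type, how many nodes of that type contain the keyword as a value or tag name in their subtree; these simplifications and all names are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

def build_indices(xml_text):
    """Return (inverted_list, frequency_table) for an XML document.

    inverted_list:   keyword -> [(DeweyID, prefixPath, term frequency), ...]
    frequency_table: (keyword, prefixPath) -> number of nodes of that type
                     containing the keyword in their subtree
    """
    inverted = defaultdict(list)
    frequency = Counter()

    def visit(elem, dewey, parent_path):
        prefix_path = f"{parent_path}/{elem.tag}"
        own_terms = Counter((elem.text or "").lower().split())
        for keyword, tf in own_terms.items():
            inverted[keyword].append((dewey, prefix_path, tf))
        # Keywords seen in this subtree: own values, own tag name, descendants.
        subtree_terms = set(own_terms) | {elem.tag.lower()}
        for position, child in enumerate(elem, start=1):
            subtree_terms |= visit(child, f"{dewey}.{position}", prefix_path)
        for keyword in subtree_terms:
            frequency[(keyword, prefix_path)] += 1
        return subtree_terms

    visit(ET.fromstring(xml_text), "1", "")
    return inverted, frequency

# Hypothetical sample document.
sample = ("<customers>"
          "<customer><name>Art Smith</name><interest>art</interest></customer>"
          "<customer><name>Bo</name><interest>rock art</interest></customer>"
          "</customers>")
inverted, frequency = build_indices(sample)
```

The frequency table is what the confidence computation reads, while the inverted list supplies the candidate nodes and term frequencies for ranking.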
Algorithms The Algorithm: 1. Input: keywords of the query, inverted list, frequency table 2. Identify the search intention and the search for node type 3. Rank by computing the XML TF*IDF similarity between n and the given query 4. Return the ranked list
Experimental Evaluation XReal vs. SLCA vs. XSeek. Aims of testing: • Search effectiveness • Ranking effectiveness. Datasets: • real datasets (Washington XML Data Repository, DBLP) • synthetic datasets (XMark benchmark)
Conclusion • Identify the search intention and rank results with statistics • Confidence level to be a search for/via node with XML TF*IDF • XML TF*IDF similarity ranking scheme • the approach tries to solve the ambiguity problem • Prototype: XReal