Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Summary
VLDB 2005, Trondheim, Norway

Example query #1
Query: Which professors from Saarbrücken do research on XML?
• Different terminology in query and Web pages
• Example page text: "Director of Department 5 DBS & IS", "Professor at Saarland University"
• Needed: abstraction awareness

Example query #2
Query: Conferences about XML in Norway 2005
• Information is not present on a single page, but distributed across linked pages
• Example: one page reads "VLDB Conference 2005, Trondheim, Norway", a linked page reads "Call for Papers … XML …"
• Needed: context awareness

Example query #3
Query: What are the publications of Max Planck?
• Max Planck should be an instance of the concept person, not of the concept institute
• Needed: concept awareness

SphereSearch Concepts
Goal: Increase recall & precision for hard queries on linked and heterogeneous data
• Unified search for unstructured, semistructured, and structured data from heterogeneous sources
• Graph-based model, including links
• Annotation engines from NLP to recognize classes of named entities (persons, locations, dates, …) for concept-aware queries
• Flexible yet simple abstraction-aware query language with context-aware scoring
• Compactness-based scores

Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work

Unifying Search on Heterogeneous Data
• Sources: Web, Intranet, Enterprise Information Systems, Databases, …
• Heuristics and type-specific transformations
• Result: XML

Heuristic Transformation of HTML
Goal: Transform layout tags to semantic annotations
• Headlines: <h1>Experiments</h1><h2>Settings</h2>We evaluated...<h2>Results</h2>Our system... becomes <Experiments><Settings>...</Settings><Results>...</Results></Experiments>
• Patterns: <b>Topic:</b>XML becomes <Topic>XML</Topic>
• Rules for tables, lists, …

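The two heuristics on this slide can be illustrated with a small sketch. It is not the SphereSearch implementation: the function names `headings_to_elements` and `label_patterns` and both regular expressions are assumptions made for illustration, and the sketch only handles plain `<h1>` to `<h6>` headings and single-word `<b>Label:</b>value` patterns.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical patterns, chosen only to cover the slide's two examples.
HEADING = re.compile(r"<h([1-6])>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)
LABEL = re.compile(r"<b>\s*(\w+)\s*:\s*</b>\s*(\w+)", re.IGNORECASE)


def headings_to_elements(html):
    """Turn a flat run of headings and text into nested elements."""
    out = []
    stack = []  # open elements as (heading level, tag name)
    pos = 0
    for match in HEADING.finditer(html):
        text = html[pos:match.start()].strip()
        if text:
            out.append(escape(text))
        level = int(match.group(1))
        # Simplification: non-word characters in the heading become "_".
        tag = re.sub(r"\W+", "_", match.group(2).strip())
        # A heading closes every open section of the same or a deeper level.
        while stack and stack[-1][0] >= level:
            out.append(f"</{stack.pop()[1]}>")
        out.append(f"<{tag}>")
        stack.append((level, tag))
        pos = match.end()
    tail = html[pos:].strip()
    if tail:
        out.append(escape(tail))
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "".join(out)


def label_patterns(html):
    """Rewrite <b>Label:</b>value into <Label>value</Label>."""
    return LABEL.sub(lambda m: f"<{m.group(1)}>{m.group(2)}</{m.group(1)}>", html)


page = "<h1>Experiments</h1><h2>Settings</h2>We evaluated...<h2>Results</h2>Our system..."
print(headings_to_elements(page))
print(label_patterns("<b>Topic:</b>XML"))
```

A heading that starts with a digit would yield an invalid XML tag name here; a real transformation would need a rule for that case.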
Generic XML Data Model
<Professor> Gerhard Weikum <Course> IR </Course> Saarbrücken <Research> XML </Research> </Professor>
• Tags annotate content with the corresponding concept
• Element nodes (numbered 1 to 3 in the figure):
  docid=1 tag="Professor" content="Gerhard Weikum Saarbrücken"
  docid=1 tag="Course" content="IR"
  docid=1 tag="Research" content="XML"
• Figure labels: person, location
• Automatic annotation of important concepts (persons, locations, dates, money amounts) with tools from Information Extraction

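As a rough illustration of this data model, the following sketch maps an XML document to one node per element, where the content of a node is the text directly inside the element without the text of its children. The function name `element_nodes` and the dictionary layout are assumptions made for illustration; the slide only shows the docid, tag and content fields.

```python
import xml.etree.ElementTree as ET


def element_nodes(xml_text, docid):
    """Return one {docid, tag, content} record per element, in document order."""
    root = ET.fromstring(xml_text)
    nodes = []
    for elem in root.iter():
        # Own text = text before the first child plus the text after each child.
        parts = [elem.text or ""] + [child.tail or "" for child in elem]
        content = " ".join(" ".join(parts).split())  # normalize whitespace
        nodes.append({"docid": docid, "tag": elem.tag, "content": content})
    return nodes


doc = ("<Professor> Gerhard Weikum<Course> IR </Course> "
       "Saarbrücken<Research> XML </Research></Professor>")
for node in element_nodes(doc, docid=1):
    print(node)
```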
Information Extraction (IE)
• Named Entity Recognition (NER)
• Named entity ~ abstract datatype, concept (location, person, …, IP address)
• Mature (out-of-the-box products, e.g. GATE/ANNIE)
• Extensible
Input: The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7pm.
Output: The <company>Pelican Hotel</company> in <location>Salvador</location>, operated by <person>Roberto Cardoso</person>, offers comfortable rooms starting at <price>$100</price> a night, including breakfast. Please check in before <time>7pm</time>.

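To make the input and output of the annotation step concrete, here is a toy annotator that reproduces the hotel example. It is a stand-in, not GATE/ANNIE and not its API: the `RULES` table with its fixed names and regular expressions is a hypothetical gazetteer written for this one sentence.

```python
import re

# Hypothetical rule table: (concept, pattern). A real NER tool uses
# gazetteers, grammars and statistics instead of a hand-written list.
RULES = [
    ("company", re.compile(r"Pelican Hotel")),
    ("location", re.compile(r"Salvador")),
    ("person", re.compile(r"Roberto Cardoso")),
    ("price", re.compile(r"\$\d+(?:\.\d{2})?")),
    ("time", re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b")),
]


def annotate(text):
    """Wrap every rule match in a tag named after its concept."""
    spans = []
    for concept, pattern in RULES:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), concept))
    spans.sort()
    out, pos = [], 0
    for start, end, concept in spans:
        if start < pos:
            continue  # skip a match that overlaps an earlier one
        out.append(text[pos:start])
        out.append(f"<{concept}>{text[start:end]}</{concept}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)


sentence = ("The Pelican Hotel in Salvador, operated by Roberto Cardoso, "
            "offers comfortable rooms starting at $100 a night, including "
            "breakfast. Please check in before 7pm.")
print(annotate(sentence))
```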
Unifying Search on Heterogeneous Data
• Sources: Web, Intranet, Enterprise Information Systems, Databases, …
• Heuristics and type-specific transformations
• Result: XML
• Annotation of named entities with IE tools (e.g., GATE)
• Result: annotated XML

Annotation-Aware Data Model
<Professor> Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research> </Professor>
• Annotation with GATE: "Saarbrücken" of type "location"
• Nodes before annotation (numbered 1 to 3 in the figure):
  docid=1 tag="Professor" content="Gerhard Weikum Saarbrücken"
  docid=1 tag="Course" content="IR"
  docid=1 tag="Research" content="XML"
• Nodes after annotation (numbered 1 to 4 in the figure):
  docid=1 tag="Professor" content="Gerhard Weikum"
  docid=1 tag="location" content="Saarbrücken"
  docid=1 tag="Course" content="IR"
  docid=1 tag="Research" content="XML"
• Annotation introduces new tags

Data Model for Linked Documents

Architecture
• Sources: SIGIR Website, Hotel Website, Tourist Guide (XML), Flight Schedule, Graupmann Homepage
• Adapters: XML Adapter, EMail Adapter, Web Portal Adapter, Web Adapter
• IE Processor with annotators: Annotation Module PRICE, Annotation Module DATE, Annotation Module LOCATION, …
• Index with extracted annotations, e.g. FROM=SIGIR, SUBJECT=Notification, Date=15-18 August, Event=SIGIR, Location=Frankfurt, Location=Salvador, Time=13:15, Price=89 $, Person=Schenkel, …
• Search Engine

SphereSearch Queries
• Extended keyword queries:
  • similarity conditions: ~professor, ~Saarbrücken
  • concept-based conditions: person=Max Planck, location=Trondheim
  • grouping
  • join conditions
• Ranked results with context-aware scoring

Score Aggregation: SphereScore
• Local score $s_L(e)$ for each element e (tf/idf, BM25, …)
• Sphere score: weighted aggregation of local scores in the environment of an element
• Rewards proximity of terms and compactness of term distribution
• Context awareness
[Figure: example graph illustrating the sphere score s(1) of element 1 for the terms research and XML]

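The slide does not spell out the aggregation formula. One possible reading, sketched here, sums the local scores of all elements within a maximum distance of the start element and damps each contribution by a factor raised to the power of its distance. The names `sphere_score`, `max_dist` and `alpha`, the default values, and the damping scheme itself are assumptions made for illustration.

```python
from collections import deque


def sphere_score(graph, local, start, max_dist=2, alpha=0.5):
    """Sum local scores around `start`, weighted by alpha ** distance.

    graph: adjacency lists of the (undirected) element graph
    local: local score per element for one query condition
    """
    dist = {start: 0}
    queue = deque([start])
    total = 0.0
    while queue:  # breadth-first search up to max_dist
        node = queue.popleft()
        d = dist[node]
        total += (alpha ** d) * local.get(node, 0.0)
        if d < max_dist:
            for neighbor in graph.get(node, ()):
                if neighbor not in dist:
                    dist[neighbor] = d + 1
                    queue.append(neighbor)
    return total


# Element 1 has no match itself, but matching elements are close by.
graph = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
local = {1: 0.0, 2: 1.0, 3: 0.5, 4: 2.0}
print(sphere_score(graph, local, start=1))  # 0.5*1.0 + 0.5*0.5 + 0.25*2.0 = 1.25
```

With this weighting, an element whose neighbors match the query terms scores higher than one whose matches are several links away, which is the proximity effect the slide describes.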
Similarity Conditions
• Thesaurus/Ontology: concepts, relationships, glosses from WordNet, Gazetteers, Web forms & tables, Wikipedia
• Relationships quantified by statistical co-occurrence measures
• Similarity conditions like ~professor, ~Saarbrücken
• Disambiguation
• Query expansion: $\delta\text{-exp}(x) = \{\, w \mid \mathrm{sim}(x, w) > \delta \,\}$
• Local score, weighted max over all expansion terms: $s_L(e, {\sim}\text{professor}) = \max_{t \in \delta\text{-exp}(\text{professor})} \{\, \mathrm{sim}(\text{professor}, t) \cdot s_L(e, t) \,\}$
• Abstraction awareness
[Figure: thesaurus graph around professor with the terms alchemist, primadonna, director, artist, wizard, investigator, intellectual, researcher, educator, scientist, scholar, lecturer, mentor, teacher, and academic / academician / faculty member; edges carry a relationship type and weight, e.g. HYPONYM (0.7)]

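The expansion and the weighted maximum can be written down directly. In this sketch the thesaurus is a plain dictionary of similarity values, and the query term itself is added to its own expansion with similarity 1.0; both choices, like the names `expand` and `similarity_score`, are assumptions made for illustration.

```python
def expand(term, sim, delta):
    """delta-exp(term): all thesaurus terms with similarity above delta."""
    terms = {w: s for w, s in sim.get(term, {}).items() if s > delta}
    terms[term] = 1.0  # assumption: a term is fully similar to itself
    return terms


def similarity_score(local_scores, term, sim, delta):
    """Local score for ~term: weighted max over all expansion terms."""
    return max(weight * local_scores.get(t, 0.0)
               for t, weight in expand(term, sim, delta).items())


# Hypothetical similarity values and local scores of one element.
sim = {"professor": {"lecturer": 0.75, "scholar": 0.5, "wizard": 0.25}}
local_scores = {"lecturer": 0.5, "wizard": 4.0}

print(sorted(expand("professor", sim, delta=0.4)))
print(similarity_score(local_scores, "professor", sim, delta=0.4))
```

Here "wizard" falls below the threshold, so its high local score does not count; the element is scored through "lecturer" instead.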
Concept-based conditions
Goal: Exploit explicit (tags) and automatic annotations in documents
• Example: the condition location=Trondheim (concept = location, value = Trondheim) matches element e with docid=1 tag="location" content="Trondheim"
• $s_L(e, c{=}v)$ = score for concept-tag match + score for value-content match (concept-specific)
• Allows similarity and range queries (for annotated concepts) like location~Trondheim or 1970<date<1980, with concept-specific distance measures
• Concept awareness

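A minimal sketch of this sum follows. The slide gives only the structure of the score, so the concrete scoring functions are assumptions: `tag_score` and `exact_value` return 1.0 or 0.0, and `year_in_range` is a hypothetical concept-specific matcher for range conditions on dates.

```python
def tag_score(tag, concept):
    """1.0 if the element's tag names the concept, else 0.0 (assumed scale)."""
    return 1.0 if tag.lower() == concept.lower() else 0.0


def exact_value(content, value):
    """Default value match: case-insensitive equality."""
    return 1.0 if content.strip().lower() == value.strip().lower() else 0.0


def year_in_range(low, high):
    """Build a value matcher for range conditions such as 1970<date<1980."""
    def score(content, _value=None):
        try:
            year = int(content.strip())
        except ValueError:
            return 0.0
        return 1.0 if low < year < high else 0.0
    return score


def concept_score(element, concept, value, value_score=exact_value):
    """Score for concept-tag match plus score for value-content match."""
    return tag_score(element["tag"], concept) + value_score(element["content"], value)


place = {"docid": 1, "tag": "location", "content": "Trondheim"}
born = {"docid": 2, "tag": "date", "content": "1975"}

print(concept_score(place, "location", "Trondheim"))
print(concept_score(place, "location", "Oslo"))
print(concept_score(born, "date", None, value_score=year_in_range(1970, 1980)))
```

Passing the value matcher as a parameter is one way to keep the distance measure concept-specific, for example a geographic distance for locations.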
Query Groups
Goal: Related terms should occur in the same context
• Group conditions that relate to the same "entity"
• Example: professor teaching IR research XML becomes professor T(teaching IR) R(research XML)
• SphereScore computed for each group
• Find compact sets with one result for each group

Scores for Query Results
• Query result R: one result per query group
• compactness ~ 1/size of a minimal spanning tree
• Context awareness
[Figure: example graphs connecting results of the groups A, B and X with weighted edges]

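The compactness measure can be sketched with Prim's algorithm over the result elements, one per query group. The distance function between two results, the names `mst_size` and `compactness`, and the handling of a single-element result are assumptions made for illustration; the slide only states that compactness is proportional to the inverse size of a minimal spanning tree.

```python
def mst_size(nodes, dist):
    """Total edge weight of a minimal spanning tree (Prim's algorithm)."""
    nodes = list(nodes)
    if not nodes:
        return 0.0
    # Cheapest known connection of every node outside the tree.
    best = {n: dist(nodes[0], n) for n in nodes[1:]}
    total = 0.0
    while best:
        nearest = min(best, key=best.get)
        total += best.pop(nearest)
        for n in best:
            best[n] = min(best[n], dist(nearest, n))
    return total


def compactness(nodes, dist):
    """Inverse MST size; a single element counts as maximally compact."""
    size = mst_size(nodes, dist)
    return 1.0 / size if size > 0 else float("inf")


# Hypothetical distances between the results of groups A, B and X.
weights = {
    frozenset(("a", "b")): 1.0,
    frozenset(("b", "x")): 2.0,
    frozenset(("a", "x")): 4.0,
}


def distance(u, v):
    return weights[frozenset((u, v))]


print(mst_size(["a", "b", "x"], distance))  # edges a-b and b-x: 3.0
print(compactness(["a", "b", "x"], distance))
```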
Join conditions
Goal: Connect results of different query groups
• Example: A(research, XML) B(VLDB 2005 paper) A.person=B.person
• Dependent on database size and application, join links are
  • precomputed, or
  • computed during query execution
• Join conditions do not change the score for a node
• Join conditions create a new link with a specific weight
[Figure: results of group A (research, XML) and group B (VLDB 2005) connected through the person elements "Ralf Schenkel" and "R.Schenkel"; edge weights 1.0 and 1.7; further labels 2004, 2005]

Score for Join Conditions
• Join condition A.T=B.S: for all nodes n1 with type T and n2 with type S, add an edge (n1,n2) with weight 1/sim(n1,n2)
• sim(n1,n2): content-based similarity
[Figure: example graphs in which the results of groups A, B and X are connected by weighted edges, including a join edge]

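The edge construction can be sketched as follows. The slide does not define sim, so `content_sim` uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a content-based similarity; the threshold `min_sim`, which also avoids a division by zero, and the node layout are further assumptions.

```python
from difflib import SequenceMatcher


def content_sim(a, b):
    """Stand-in content similarity in [0, 1] based on matching substrings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def join_edges(nodes_a, nodes_b, type_t, type_s, sim=content_sim, min_sim=0.5):
    """Edges (n1, n2, 1/sim) for the join condition A.T=B.S.

    nodes_a, nodes_b: elements of groups A and B
    min_sim: hypothetical cut-off below which no edge is added
    """
    edges = []
    for n1 in nodes_a:
        if n1["tag"] != type_t:
            continue
        for n2 in nodes_b:
            if n2["tag"] != type_s:
                continue
            s = sim(n1["content"], n2["content"])
            if s >= min_sim:
                edges.append((n1["id"], n2["id"], 1.0 / s))
    return edges


group_a = [{"id": 1, "tag": "person", "content": "Ralf Schenkel"}]
group_b = [{"id": 2, "tag": "person", "content": "Ralf Schenkel"},
           {"id": 3, "tag": "title", "content": "VLDB 2005 paper"}]

print(join_edges(group_a, group_b, "person", "person"))
```

Identical contents give the smallest possible weight of 1.0, and less similar contents give longer edges, so a join over a weak match makes the result less compact.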
Setup for Experiments
• Three corpora:
  • Wikipedia (~400,000 docs)
  • extended Wikipedia with links to IMDB (~500,000 docs, ~12,000,000 links, ~28,500,000 elements)
  • extended DBLP corpus with links to homepages (~1,000,000 docs, ~3,000,000 links, ~9,500,000 elements)
• 50 queries like
  • A(actor birthday 1970<date<1980) western
  • G(California,governor) M(movie)
  • A(Madonna,husband) B(director) A.person=B.director
• Opponent: keyword queries with standard TF/IDF-based score ("simplified Google")
• No existing benchmark (INEX, TREC, …) fits

Incremental Language Levels
• SSE-Join (join conditions)
• SSE-QG (query groups)
• SSE-CV (concept-based conditions)
• SSE-basic (keywords, SphereScores)

Experimental Results on Wikipedia

Experimental Results on Wiki++ and DBLP++
• SphereScores better than local scores
• New SSE features nearly double precision

Current and Future Work
• Improve graphical user interface
• Refined type-specific similarity measures (like geographic distances) [SIGIR-WS 2005]
• Deep Web search through automatic portal queries
• Parameter tuning with relevance feedback
• Efficiency of query evaluation through precomputation and integrated top-k (TopX talk this afternoon)

Some Related Work
• Web query languages, e.g., W3QS [VLDB95], WebOQL [ICDE95], …
• Web IR with thesauri, e.g., Qiu et al. [SIGIR93], Liu et al. [SIGIR04], …
• XML IR, e.g., XXL [WebDB00], XIRQL [SIGIR01], XSearch [VLDB03], XRank [SIGMOD03], …
• Information extraction, e.g., Lixto, KnowItAll, …
• Advanced graph IR, e.g., BANKS [ICDE02], Hristidis et al. [VLDB03], …

Thank you!

Integrating TopX and SphereSearch
[Figure: one distance-based aggregation top-k operator per query group (G1, .., Gn), each producing top-k results; a compactness-based top-k operator on top maintains the current top-k with [score, bestscore] intervals and returns the top-k results]

XML-IR: History and Related Work
1995
• Web query languages: W3QS (Technion Haifa), Araneus (U Roma), Lorel (Stanford U), WebSQL (U Toronto)
• IR on structured docs (SGML): OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)
2000 to 2005
• XML query languages and IR on XML: XIRQL (U Dortmund), XML-QL (AT&T Labs), XXL & TopX (U Saarland / MPI), XPath 1.0 (W3C), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), INEX benchmark, NEXI, XPath & XQuery Full-Text, PowerDB-IR (ETH Zurich), JuruXML (IBM Haifa), XPath 2.0 (W3C), XSearch (Hebrew U), Timber (U Michigan), XQuery (W3C), XRank & Quark (Cornell U), FleXPath (AT&T Labs), TeXQuery (AT&T Labs)
• Commercial software (MarkLogic, Verity?, IBM?, Oracle?, Google?, ...)