






  1. Outline: Where existing search engines fail; SphereSearch Concepts; Transformation and Annotation; Query Language and Scoring; Experimental Evaluation; Summary. VLDB 2005, Trondheim, Norway.

  2. Example query #1: Which professors from Saarbrücken do research on XML? The query and the Web pages use different terminology: a relevant page may say "Director of Department 5 DBS & IS" or "Professor at Saarland University". Needed: Abstraction Awareness.

  3. Example query #2: Conferences about XML in Norway in 2005? The information is not present on a single page but distributed across linked pages, e.g., "VLDB Conference 2005, Trondheim, Norway" on one page and "Call for Papers ... XML ..." on a linked page. Needed: Context Awareness.

  4. Example query #3: What are the publications of Max Planck? Here "Max Planck" should be an instance of the concept person, not of the concept institute. Needed: Concept Awareness.

  5. SphereSearch Concepts. Goal: increase recall & precision for hard queries on linked and heterogeneous data. Unified search for unstructured, semistructured, and structured data from heterogeneous sources; graph-based model, including links; annotation engines from NLP that recognize classes of named entities (persons, locations, dates, ...) for concept-aware queries; a flexible yet simple abstraction-aware query language with context-aware scoring; compactness-based scores.

  6. Outline: Where existing search engines fail; SphereSearch Concepts; Transformation and Annotation; Query Language and Scoring; Experimental Evaluation; Current and Future Work.

  7. Unifying Search on Heterogeneous Data: Web, XML, Intranet, Enterprise Information Systems, Databases, ... are unified via heuristics and type-specific transformations.

  8. Heuristic Transformation of HTML. Goal: transform layout tags into semantic annotations. Headlines: <h1>Experiments</h1> <h2>Settings</h2> We evaluated... <h2>Results</h2> Our system... becomes <Experiments><Settings>...</Settings><Results>...</Results></Experiments>. Patterns: <b>Topic:</b> XML becomes <Topic>XML</Topic>. Further rules handle tables, lists, ...
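The headline heuristic on slide 8 can be sketched as follows. This is a minimal illustration (not the paper's implementation): each headline becomes an element named after its text, and deeper headline levels nest inside shallower ones.

```python
import re

def headings_to_xml(html: str) -> str:
    """Nest <h1>/<h2> headline sections into semantic XML elements."""
    # re.split with capturing groups yields [pre, level, title, body, level, title, body, ...]
    tokens = re.split(r"<h([12])>(.*?)</h\1>", html)
    out, stack = [], []  # stack holds (level, tag) of currently open sections
    i = 1
    while i < len(tokens):
        level, title, body = int(tokens[i]), tokens[i + 1], tokens[i + 2]
        while stack and stack[-1][0] >= level:   # close deeper or equal sections
            out.append(f"</{stack.pop()[1]}>")
        tag = re.sub(r"\W", "", title)           # element name from headline text
        out.append(f"<{tag}>{body}")
        stack.append((level, tag))
        i += 3
    while stack:                                  # close any sections still open
        out.append(f"</{stack.pop()[1]}>")
    return "".join(out)
```

On the slide's example, the flat headline sequence becomes the nested `<Experiments><Settings>...</Settings><Results>...</Results></Experiments>` structure.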

  9. Generic XML Data Model. Tags annotate content with the corresponding concept. The document <Professor>Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research></Professor> yields the elements 1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"; 2: docid=1, tag="Course", content="IR"; 3: docid=1, tag="Research", content="XML". Important concepts (persons, locations, dates, money amounts) are annotated automatically with tools from Information Extraction.

  10. Named Entity Recognition (NER). A named entity is like an abstract datatype or concept (location, person, ..., IP address). NER is mature (out-of-the-box products, e.g., GATE/ANNIE) and extensible towards Information Extraction (IE). Example: "The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7pm." becomes "The <company>Pelican Hotel</company> in <location>Salvador</location>, operated by <person>Roberto Cardoso</person>, offers comfortable rooms starting at <price>$100</price> a night, including breakfast. Please check in before <time>7pm</time>."
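A toy gazetteer-style annotator in the spirit of GATE/ANNIE can illustrate this step. The gazetteer entries below are made up for the example; a real system combines large gazetteers, rules, and learned models.

```python
import re

GAZETTEER = {  # hypothetical entries; a real gazetteer holds thousands
    "company":  ["Pelican Hotel"],
    "location": ["Salvador"],
    "person":   ["Roberto Cardoso"],
    "price":    [r"\$\d+"],
    "time":     [r"\d{1,2}pm"],
}

def annotate(text: str) -> str:
    """Wrap each recognized entity in a concept tag like <location>...</location>."""
    for concept, patterns in GAZETTEER.items():
        for pat in patterns:
            text = re.sub(f"({pat})", rf"<{concept}>\1</{concept}>", text)
    return text
```

Applied to the slide's hotel sentence, this produces concept tags such as `<location>Salvador</location>` and `<price>$100</price>` that the annotation-aware data model can index.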

  11. Unifying Search on Heterogeneous Data (continued): Web, XML, Intranet, Enterprise Information Systems, Databases, ... are unified via heuristics and type-specific transformations, then annotated with named entities using IE tools (e.g., GATE), yielding annotated XML.

  12. Annotation-Aware Data Model. Annotation with GATE recognizes "Saarbrücken" as being of type "location". Annotation introduces new tags: before annotation, the document <Professor>Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research></Professor> yields 1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"; 2: docid=1, tag="Course", content="IR"; 3: docid=1, tag="Research", content="XML". After annotation, "Saarbrücken" becomes its own element: 1: docid=1, tag="Professor", content="Gerhard Weikum"; 2: docid=1, tag="Course", content="IR"; 3: docid=1, tag="Research", content="XML"; 4: docid=1, tag="location", content="Saarbrücken".

  13. Data Model for Linked Documents

  14. Architecture. Sources (SIGIR website, hotel website, tourist guide (XML), flight schedule, Graupmann homepage) feed type-specific adapters (Web adapter, XML adapter, e-mail adapter, Web portal adapter). An IE processor with pluggable annotation modules (LOCATION, DATE, PRICE, ...) extracts concept annotations such as From=SIGIR, Subject=Notification, Date=15-18 August, Event=SIGIR, Location=Salvador, Time=13:15, Price=89 $, Person=Schenkel. The annotated data is stored in the search engine's index.

  15. Outline: Where existing search engines fail; SphereSearch Concepts; Transformation and Annotation; Query Language and Scoring; Experimental Evaluation; Current and Future Work.

  16. SphereSearch Queries. Extended keyword queries with: similarity conditions (~professor, ~Saarbrücken); concept-based conditions (person=Max Planck, location=Trondheim); grouping; join conditions. Results are ranked with context-aware scoring.

  17. Score Aggregation: SphereScore. Each element e has a local score sL(e) (tf/idf, BM25, ...). The sphere score of an element is a weighted aggregation of the local scores in its environment. Context awareness: this rewards proximity of terms and compactness of the term distribution.
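The aggregation idea can be sketched as follows. This is a minimal illustration, not the paper's exact formula: local scores of elements at graph distance d from a node are damped by a geometric weight `decay ** d` (the decay parameter and radius cutoff are assumptions for the sketch).

```python
from collections import deque

def sphere_score(node, edges, local_scores, radius=2, decay=0.5):
    """BFS out to `radius` hops; sum decay**d * local score at distance d."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:          # do not expand beyond the sphere radius
            continue
        for v in edges.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(decay ** d * local_scores.get(n, 0.0) for n, d in dist.items())
```

A node whose relevant terms sit in nearby elements thus scores higher than one whose matches are scattered far apart, which is exactly the compactness intuition of the slide.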

  18. Similarity Conditions. A thesaurus/ontology provides concepts, relationships, and glosses from WordNet, gazetteers, Web forms & tables, and Wikipedia; relationships are quantified by statistical co-occurrence measures (e.g., professor relates via HYPONYM edges with weights like 0.7 to scientist, researcher, lecturer, scholar, academic). Similarity conditions like ~professor, ~Saarbrücken trigger disambiguation and query expansion: δ-exp(x) = {w | sim(x,w) > δ}. The local score is a weighted max over all expansion terms: sL(e, ~professor) = max over t in δ-exp(professor) of sim(professor, t) * sL(e, t). Abstraction awareness.
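The δ-expansion and weighted-max scoring can be sketched directly from the slide's formulas. The similarity table here is a made-up toy ontology, not WordNet data.

```python
SIM = {  # hypothetical sim(x, w) values
    ("professor", "professor"): 1.0,
    ("professor", "scientist"): 0.7,
    ("professor", "lecturer"):  0.6,
    ("professor", "artist"):    0.1,
}

def delta_exp(term, delta):
    """delta-exp(x) = {w | sim(x, w) > delta}"""
    return {w for (x, w), s in SIM.items() if x == term and s > delta}

def local_sim_score(local_score, term, delta=0.5):
    """sL(e, ~term) = max over t in delta-exp(term) of sim(term, t) * sL(e, t)."""
    return max(
        (SIM[(term, t)] * local_score.get(t, 0.0) for t in delta_exp(term, delta)),
        default=0.0,
    )
```

An element containing only "scientist" can thus still match the condition ~professor, discounted by the similarity weight 0.7.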

  19. Concept-based Conditions. Goal: exploit explicit (tag) and automatic annotations in documents. A condition like location=Trondheim matches an element e (docid=1, tag="location", content="Trondheim") with sL(e, c=v) = score for the concept-tag match + score for the value-content match, both concept-specific. This allows similarity and range queries (for annotated concepts) like location~Trondheim or 1970<date<1980 with concept-specific distance measures. Concept awareness.

  20. Query Groups. Goal: related terms should occur in the same context. Group conditions relate to the same "entity": the query "professor teaching IR, research XML" is written as professor T(teaching IR) R(research XML). The SphereScore is computed for each group; the engine finds compact sets with one result for each group.

  21. Scores for Query Results. A query result R contains one result per query group; its compactness is ~ 1/size of a minimal spanning tree connecting the group results. Context awareness.
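The compactness measure can be sketched as follows: connect one matched node per query group by a spanning tree over shortest-path distances and score by 1/(tree weight). This follows the slide only loosely (the exact tree definition and normalization are not given there).

```python
from collections import deque

def bfs_dist(edges, src):
    """Unweighted shortest-path distances from src."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in edges.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def compactness(edges, matched_nodes):
    """1 / weight of a Prim-style MST over pairwise shortest-path distances."""
    dists = {n: bfs_dist(edges, n) for n in matched_nodes}
    in_tree, total = {matched_nodes[0]}, 0
    while len(in_tree) < len(matched_nodes):
        u, v = min(
            ((a, b) for a in in_tree for b in matched_nodes if b not in in_tree),
            key=lambda p: dists[p[0]][p[1]],
        )
        total += dists[u][v]
        in_tree.add(v)
    return 1 / total if total else 1.0
```

On a chain 1-2-3-4-5 with matches at 1, 3, and 5, the tree weighs 4, so the result scores 0.25; matches packed closer together would score higher.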

  22. Join Conditions. Goal: connect results of different query groups, e.g., A(research, XML) B(VLDB 2005 paper) A.person=B.person. Join conditions do not change the score of a node; they create a new link with a specific weight (e.g., connecting "Ralf Schenkel" and "R. Schenkel" with weight 1.7). Joins can be precomputed or computed during query execution, depending on database size and application.

  23. Score for Join Conditions. A join condition A.T=B.S means: for all nodes n1 with type T and n2 with type S, add an edge (n1, n2) with weight 1/sim(n1, n2), where sim(n1, n2) is a content-based similarity.
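Materializing such join edges can be sketched as below. The Jaccard similarity over content words is an illustrative stand-in for the paper's unspecified content-based similarity, and the threshold parameter is an assumption.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap as a simple content-based similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def join_edges(nodes, type_a, type_b, min_sim=0.1):
    """nodes: {node_id: (type, content)}; returns edges (n1, n2, 1/sim)."""
    out = []
    for n1, (t1, c1) in nodes.items():
        for n2, (t2, c2) in nodes.items():
            if n1 != n2 and t1 == type_a and t2 == type_b:
                s = jaccard(c1, c2)
                if s > min_sim:
                    out.append((n1, n2, 1 / s))
    return out
```

Highly similar contents (like the two Schenkel spellings on slide 22) yield a short, light edge, so joined results stay compact; dissimilar contents produce no edge at all.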

  24. Outline: Where existing search engines fail; SphereSearch Concepts; Transformation and Annotation; Query Language and Scoring; Experimental Evaluation; Current and Future Work.

  25. Setup for Experiments. No existing benchmark (INEX, TREC, ...) fits. Three corpora: Wikipedia (~400,000 docs); extended Wikipedia with links to IMDB (~500,000 docs, ~12,000,000 links, ~28,500,000 elements); extended DBLP corpus with links to homepages (~1,000,000 docs, ~3,000,000 links, ~9,500,000 elements). 50 queries like A(actor birthday 1970<date<1980) western, G(California, governor) M(movie), A(Madonna, husband) B(director) A.person=B.director. Opponent: keyword queries with a standard TF/IDF-based score, i.e., a "simplified Google".

  26. Incremental Language Levels: SSE-basic (keywords, SphereScores) ⊂ SSE-CV (concept-based conditions) ⊂ SSE-QG (query groups) ⊂ SSE-Join (join conditions).

  27. Experimental Results on Wikipedia

  28. Experimental Results on Wiki++ and DBLP++: SphereScores are better than local scores, and the new SSE features nearly double precision.

  29. Current and Future Work: improve the graphical user interface; refined type-specific similarity measures (like geographic distances) [SIGIR-WS 2005]; Deep Web search through automatic portal queries; parameter tuning with relevance feedback; efficiency of query evaluation through precomputation and integrated top-k (TopX talk this afternoon).

  30. Some Related Work. Web query languages, e.g., W3QS [VLDB95], WebOQL [ICDE95], ...; Web IR with thesauri, e.g., Qiu et al. [SIGIR93], Liu et al. [SIGIR04], ...; XML IR, e.g., XXL [WebDB00], XIRQL [SIGIR01], XSearch [VLDB03], XRank [SIGMOD03], ...; information extraction, e.g., Lixto, KnowItAll, ...; advanced graph IR, e.g., BANKS [ICDE02], Hristidis et al. [VLDB03], ...

  31. Thank you!

  32. Integrating TopX and SphereSearch. The current top-k algorithm with [score, bestscore] intervals is extended: for the query groups (G1, ..., Gn), per-group distance-based aggregation top-k operators feed a compactness-based top-k operator that produces the final top-k results.

  33. XML-IR: History and Related Work.
  Around 1995: Web query languages: W3QS (Technion Haifa), Araneus (U Roma), WebSQL (U Toronto), Lorel (Stanford U); IR on structured docs (SGML): OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU).
  Around 2000: XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XQuery (W3C), XPath 2.0 (W3C), XPath & XQuery Full-Text; IR on XML: XIRQL (U Dortmund), XXL & TopX (U Saarland / MPI), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), PowerDB-IR (ETH Zurich), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), TeXQuery (AT&T Labs); INEX benchmark, NEXI.
  By 2005: commercial software (MarkLogic, Verity?, IBM?, Oracle?, Google?, ...).
