
Type-enabled Keyword Searches with Uncertain Schema




Presentation Transcript


  1. Type-enabled Keyword Searches with Uncertain Schema
  Soumen Chakrabarti, IIT Bombay
  www.cse.iitb.ac.in/~soumen

  2. Evolution of Web search
  • The first decade of Web search
    • Crawling and indexing at massive scale
    • Macroscopic whole-page connectivity analysis
    • Very limited expression of information need
  • Exploiting entities and relations: a clear trend
    • Maintaining large type systems and ontologies
    • Discovering mentions of entities and relations
    • Deduplicating and canonicalizing mentions
    • Forming uncertain, probabilistic E-R graphs
    • Enhancing keyword or schema-aware queries

  3. [Figure: system architecture. WordNet, Wikipedia, FrameNet, and KnowItAll feed a uniform lexical network provider (1). A raw corpus passes through disambiguation, named-entity tagging, and relation tagging to produce an annotated corpus; an indexer (4) builds a text index and an annotation index from it, guided by past query workload statistics. An answer-type predictor (2) and a keyword-match predictor analyze the question, and a ranking engine (3) produces response snippets.]

  4. Populating entity and relation tables
  • Hearst patterns (Hearst 1992): "T such as x", "x and other T", "x is a T" (a regex sketch follows below)
  • DIPRE (Brin 1998)
  • Snowball (Agichtein+ 2000): [left] entity1 [middle] entity2 [right]
  • PMI-IR (Turney 2001): recognize synonyms using Web stats
  • KnowItAll (Etzioni+ 2004)
  • C-PANKOW (Cimiano+ 2005): is-a relations from Hearst patterns, lists, PMI
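To make the pattern bullet concrete, here is a minimal regex sketch of the three Hearst patterns; the single-word captures and the `extract_isa` helper are illustrative simplifications, since real extractors match over POS-tagged noun chunks rather than raw tokens.

```python
import re

# Illustrative sketch (not the talk's code) of the three Hearst patterns
# listed above: "T such as x", "x and other T", "x is a T".
HEARST_PATTERNS = [
    (re.compile(r"\b(\w+) such as (\w+)"), lambda m: (m.group(2), m.group(1))),
    (re.compile(r"\b(\w+) and other (\w+)"), lambda m: (m.group(1), m.group(2))),
    (re.compile(r"\b(\w+) is an? (\w+)"), lambda m: (m.group(1), m.group(2))),
]

def extract_isa(text):
    """Yield (instance, type) candidates suggested by Hearst patterns."""
    for pattern, to_pair in HEARST_PATTERNS:
        for m in pattern.finditer(text):
            yield to_pair(m)

print(list(extract_isa("cities such as Karachi, and Lahore and other cities")))
# [('Karachi', 'cities'), ('Lahore', 'cities')]
```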

  5. DIPRE and Snowball
  [Figure: the bootstrap loop. Seed tuples → generate extraction patterns → tag mentions in free text → locate new tuples → augmented table, and around again.]
  • Pattern context is encoded as bag-of-words vectors ℓ (left), m (middle), r (right)
  • Example: "… the Irving-based Exxon Corporation …" yields location = Irving, organization = Exxon Corporation
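A toy rendering of the bootstrap loop above; entity recognition is reduced to "capitalized token" and a pattern to the literal middle string, so this is a sketch of DIPRE's control flow rather than either system's actual extractor.

```python
import re

def bootstrap(corpus, seeds, rounds=3):
    """Toy DIPRE-style bootstrap. Snowball's bag-of-words contexts and
    confidence scores are omitted for brevity."""
    table = set(seeds)
    for _ in range(rounds):
        # 1. Generate extraction patterns (middle strings) from known tuples.
        patterns = set()
        for e1, e2 in table:
            for sent in corpus:
                m = re.search(re.escape(e1) + r"(.{1,30}?)" + re.escape(e2), sent)
                if m:
                    patterns.add(m.group(1))
        # 2. Locate new tuples by matching each pattern against the corpus.
        new = set()
        for mid in patterns:
            for sent in corpus:
                for m in re.finditer(r"([A-Z]\w+)" + re.escape(mid) + r"([A-Z]\w+)", sent):
                    new.add((m.group(1), m.group(2)))
        if new <= table:
            break                      # converged: no new tuples found
        table |= new                   # augment the table and iterate
    return table

corpus = ["Exxon is based in Irving.", "Microsoft is based in Redmond."]
print(bootstrap(corpus, {("Exxon", "Irving")}))
# {('Exxon', 'Irving'), ('Microsoft', 'Redmond')}
```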

  6. Scoring patterns and tuples
  • Pattern confidence = m+ / (m+ + m−), measured over validation tuples
  • Soft-or tuple confidence: conf(t) = 1 − ∏i (1 − conf(Pi)), taken over the patterns Pi that extracted tuple t (worked example below)
  • DIPRE uses a 5-part pattern encoding; Snowball uses the bag-of-words encoding above
  • Recent improvements: urn model (Etzioni+ 2005)
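The soft-or combination can be checked in a few lines; the confidences below are made-up inputs.

```python
def tuple_confidence(pattern_confs):
    """Snowball's soft-or: a tuple extracted by patterns with confidences
    c1..cn gets confidence 1 - (1-c1)*...*(1-cn). Under this
    independent-evidence assumption, each extra supporting pattern can
    only raise the tuple's confidence."""
    conf = 1.0
    for c in pattern_confs:
        conf *= (1.0 - c)
    return 1.0 - conf

# A tuple seen by two mediocre patterns beats one seen by either alone:
print(tuple_confidence([0.6]))        # 0.6
print(tuple_confidence([0.6, 0.5]))   # 0.8
```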

  7. KnowItAll and C-PANKOW
  • A "propose-validate" approach (sketched below)
    • Using existing patterns, generate queries
    • For each web page w returned, extract a potential fact e and assign it a confidence score
    • Add the fact to the database if it has a high enough score
  • Patterns use chunk info
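A minimal sketch of the propose-validate loop; `web_search` is a stub standing in for a real search API, and the validator passed as `score` is a placeholder for a PMI-style check like the one on the next slide.

```python
import re

def web_search(query):
    """Stub standing in for a web search API: returns text snippets."""
    return ["cities such as Karachi are on the coast",
            "cities such as wall clocks"]          # a noisy hit

def propose_validate(atype, score, threshold=0.5):
    """KnowItAll-style sketch: generate a pattern query for the target
    atype, extract candidate facts from returned text, and keep those
    whose validation score clears the threshold."""
    facts = set()
    for snippet in web_search(f'"{atype} such as"'):
        m = re.search(rf"{atype} such as (\w+)", snippet)
        if m and score(atype, m.group(1)) >= threshold:
            facts.add((m.group(1), atype))
    return facts

# Toy validator: pretend only 'Karachi' survives PMI-style checks.
print(propose_validate("cities", lambda t, c: 1.0 if c == "Karachi" else 0.0))
# {('Karachi', 'cities')}
```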

  8. Exploiting answer types with PMI
  • From two-word queries to two text boxes: one for the answer type, one for the keywords to match
    • author; "Harry Potter"
    • person; "Eiffel Tower"
    • director; Swades movie
    • city; India Pakistan cricket
  • Keywords → search engine → snippets
  • Every token/chunk in a snippet is a candidate answer
  • (Elimination hacks that we won't discuss)
  • Fire Hearst-pattern queries pairing the desired answer type with each candidate token/chunk (a PMI sketch follows below)
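A sketch of that PMI-style validation step; `hit_count` is an assumed stand-in for a search-engine hit-count API, and the counts below are invented so the example runs offline.

```python
def pmi_score(atype, candidate, hit_count):
    """PMI-IR-style validation (illustrative): the fraction of the
    candidate's occurrences that fall inside a type-revealing Hearst
    pattern. hit_count(query) -> estimated number of matching pages."""
    probe = f'"{atype} such as {candidate}"'   # e.g. "cities such as Karachi"
    denom = hit_count(f'"{candidate}"')
    return hit_count(probe) / denom if denom else 0.0

# Toy hit counts standing in for real search-engine statistics:
counts = {'"cities such as Karachi"': 900, '"Karachi"': 100000,
          '"cities such as Garth"': 0, '"Garth"': 50000}
hc = lambda q: counts.get(q, 0)
print(pmi_score("cities", "Karachi", hc))  # 0.009 -> plausible city
print(pmi_score("cities", "Garth", hc))    # 0.0   -> rejected
```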

  9. Information carnivores at work
  [Example snippet: "KO :: India Pakistan Cricket Series. A web site by Khalid Omar, sort of live from Karachi, Pakistan."]
  • Probe queries: "cities such as [probe]", "[probe] and other cities", "[probe] is a city", etc.
  • Failure modes from truncated phrases:
    • "Garth Brooks is a country" [singer], "gift such as wall" [clock]
    • "person like Paris" [Hilton], "researchers like Michael Jordan" (which one?)

  10. Sample output
  • author; "Harry Potter" → J K Rowling, Ron
  • person; "Eiffel Tower" → Gustave, (Eiffel), Paris
  • director; Swades movie → Ashutosh Gowariker, Ashutosh Gowarikar
  • Symptoms: ambiguity and extremely skewed Web popularity
  • What can search engines do to help?
    • Cluster mentions and assign IDs
    • Allow queries for IDs (expensive!)
    • "Harry Potter" context in "Ron is an author"

  11. [Roadmap: repeats the system architecture figure from slide 3; the following slides address the answer-type predictor (2).]

  12. Answer type (atype) prediction
  • A standard sub-problem in question answering
  • Increasingly important (but more difficult) for grammar-free Web queries (Broder 2002)
  • Current approaches
    • Pattern matching, e.g. head of the noun phrase adjacent to what or which; map when, who, where directly to the classes time, person, place
    • Coupled perceptrons (Li and Roth, 2002)
    • Linear SVM on a bag of 2-grams (Hacioglu 2002)
    • SVM with a tree kernel on the parse (Zhang and Lee, 2004): slim gains
  • Surely a parse tree holds more usable info

  13. Informer span
  • A short, contiguous span of question tokens reveals the anticipated answer type (atype)
  • Except in multi-function questions, one informer span is dominant and sufficient
    • What is the weight of a rhino?
    • How much does a rhino weigh?
    • How much does a rhino cost?
    • Who is the CEO of IBM?
  • Question → parse → informer span tagger
  • Learn the atype label from informer + question

  14. [Figure: parse tree of "What is the capital city of Japan" (POS tags WP VBZ DT NN NN IN NNP; constituents WHNP, VP, NP, PP, SQ, SBARQ across levels 0–6), segmented into three states: pre ("What is the"), informer ("capital city"), post ("of Japan").]
  • A pre-in-post Markov process produces the question
  • Train a CRF with features derived from the parse tree
    • POS, attachments to neighboring chunks, multiple levels
    • First noun chunk? Adjacent to second verb?
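A sketch of per-token feature extraction that such a CRF could consume; the POS tags are supplied by hand here, and the feature names are illustrative assumptions rather than the talk's exact feature set.

```python
def token_features(tokens, pos_tags, i):
    """Features for token i, in the spirit of the slide: the word, its
    POS, and neighboring context. A real system adds constituent
    attachment features from multiple parse-tree levels."""
    word, pos = tokens[i], pos_tags[i]
    return {
        "word": word.lower(),
        "pos": pos,
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
        "next_pos": pos_tags[i + 1] if i + 1 < len(tokens) else "EOS",
        "is_wh": pos in ("WP", "WRB", "WDT"),
        "is_noun": pos.startswith("NN"),
    }

tokens = "What is the capital city of Japan".split()
pos = ["WP", "VBZ", "DT", "NN", "NN", "IN", "NNP"]
feats = [token_features(tokens, pos, i) for i in range(len(tokens))]
# Paired with pre/informer/post labels, these per-token feature dicts are
# what a CRF toolkit would train on; "capital city" should come out
# labeled as the informer span.
print(feats[3])
```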

  15. Atype guessing accuracy
  [Figure: prediction pipeline. The question passes through the trained CRF and a filter; an informer feature generator and an ordinary feature generator are merged into one feature vector, which a linear SVM maps to the atype.]

  16. [Roadmap: repeats the system architecture figure from slide 3; the following slides address the ranking engine (3) and indexing.]

  17. Scoring function for typed search
  • An instance of the atype "near" keyword matches (a scoring sketch follows below)
  • IR systems: "hard" proximity predicates
  • Search engines: unknown reward for proximity
  • XML+IR, XRank: "hard" word containment in a subtree
  [Figure: for the question "Who invented the television?", the atype is person#n#1 and the selectors are invent*, television. In a matching sentence, the candidate "John Baird" (an IS-A instance of person#n#1) is scored by proximity to the selector matches, up to some maximum window; the closest candidate token is not necessarily the answer.]
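A minimal sketch of a proximity score in this spirit; the inverse-distance decay and the window size are illustrative assumptions (the next slide treats them as learnable).

```python
def proximity_score(candidate_pos, selector_positions, window=20):
    """Score a candidate answer token by its distance to selector
    (keyword) matches in the same passage. Each selector within the
    window contributes a decayed reward; the decay form and window
    size are tunable assumptions, not fixed by the talk."""
    score = 0.0
    for pos in selector_positions:
        gap = abs(candidate_pos - pos)
        if 0 < gap <= window:
            score += 1.0 / gap        # simple inverse-distance decay
    return score

# Token offsets in "Television inventor John Baird was born ...":
#   television=0, inventor=1, John=2, Baird=3, ...
print(proximity_score(candidate_pos=2, selector_positions=[0, 1]))  # 1.5
```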

  18. Learning a scoring function
  • Assume a parametric form for a ranking classifier
    • Form of IDF, window size, …
    • Can also choose among decay function forms
  • Question-answer pairs give partial orders (Joachims 2004); a pairwise-training sketch follows below
  • Evaluation: recall in the top 50, mean reciprocal rank
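A sketch of pairwise rank learning via the standard difference-vector reduction; the 2-d feature vectors are invented for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Ranking-SVM-style training in the spirit of Joachims' pairwise
# approach: each question contributes partial orders saying a correct
# passage should outrank an incorrect one. Features here are made-up
# 2-d examples: (proximity score, IDF-weighted keyword match).
pairs = [
    (np.array([1.5, 2.0]), np.array([0.2, 1.8])),   # (better, worse)
    (np.array([0.9, 2.5]), np.array([0.8, 0.4])),
    (np.array([1.1, 1.0]), np.array([0.1, 0.9])),
]
# Reduce ranking to binary classification on difference vectors.
X = np.array([b - w for b, w in pairs] + [w - b for b, w in pairs])
y = np.array([1] * len(pairs) + [-1] * len(pairs))
model = LinearSVC(C=1.0).fit(X, y)
print(model.coef_)   # learned weights of the linear scoring function
```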

  19. Indexing issues
  • Standard IR posting: word → {(doc, offsets)}
  • "word1 near word2" is standard
  • Needed here: instance-of(atype) near {word1, word2, …} (an index sketch follows below)
  • WordNet has 80,000 atype nodes, 17,000 of them internal, with depth > 10
  • "horse" is also indexed as mammal, animal, sports equipment, chess piece, …
  • Sizes: original corpus 4 GB, gzipped corpus 1.3 GB, IR index 0.9 GB, full atype index 4.3 GB
  • XML structure indices are not designed for fine-grain, word-as-element-node use
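A toy illustration of why the full atype index blows up: every token is posted under each of its hypernyms (ancestor atypes). The `HYPERNYMS` map is a tiny stand-in for WordNet.

```python
from collections import defaultdict

# Illustrative hypernym map; WordNet supplies ~80,000 such atype nodes.
HYPERNYMS = {
    "horse": ["mammal", "animal", "chess piece", "sports equipment"],
    "baird": ["inventor", "person"],
}

def build_indexes(docs):
    text_index = defaultdict(list)    # word  -> [(doc, offset)]
    atype_index = defaultdict(list)   # atype -> [(doc, offset)]
    for doc_id, text in docs.items():
        for offset, token in enumerate(text.lower().split()):
            text_index[token].append((doc_id, offset))
            for atype in HYPERNYMS.get(token, []):
                atype_index[atype].append((doc_id, offset))
    return text_index, atype_index

ti, ai = build_indexes({"d1": "Baird rode a horse"})
print(ai["person"])   # [('d1', 0)] -- one posting per ancestor per token
```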

  20. Exploit skew in query atypes?
  • Index only a small registered set of atypes R
  • Relax the query atype a to a generalization g in R
  • Test each response for reachability from a, then retain or discard (see the sketch below)
  • How to pick R? What is a good objective?
    • The relaxed query and the discarding step cost extra time
  • Rare atypes appear in what, which, and name questions: a long-tailed distribution
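A sketch of the relax-then-filter scheme with a toy taxonomy; `PARENT`, `relax`, and `is_a` are illustrative helpers, not the talk's data structures.

```python
# Toy taxonomy: each atype maps to its parent.
PARENT = {"chess piece": "artifact", "artifact": "entity",
          "inventor": "person", "person": "entity"}

def relax(a, registered):
    """Walk up the taxonomy from a to the closest atype in R."""
    g = a
    while g is not None and g not in registered:
        g = PARENT.get(g)
    return g

def is_a(candidate_type, a):
    """Reachability test: does candidate_type lie under atype a?"""
    t = candidate_type
    while t is not None:
        if t == a:
            return True
        t = PARENT.get(t)
    return False

R = {"person", "entity"}
g = relax("inventor", R)                        # 'person' is indexed
hits = [("d1", "inventor"), ("d2", "person")]   # (doc, candidate type)
print([d for d, t in hits if is_a(t, "inventor")])  # ['d1'] after filtering
```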

  21. Approximate objective and approach
  • Index space is approximated in closed form [formula lost in transcription]
  • Expected query-time bloat is likewise approximated [formula lost in transcription]
  • Minimize the approximate index space subject to an upper bound on bloat (hard, as expected)
  • Sparseness: queryProb(a) is observed to be zero for most atypes a in a large taxonomy
  • Smooth using similarity between atypes

  22. Sample results
  • The index space approximation is reasonable
  • Reasonable average query-time bloat with small index-space overheads
  [Chart: per-query runtime using the relaxed atype g versus the original atype a, across queries.]

  23. Summary
  • Entity and relation annotators
    • Maturing technology
    • Unlikely to be perfect for open-domain sources
  • The future: query paradigms that combine text and annotations
    • End-user friendly selection and aggregation
    • Allow uncertainty, exploit redundancy
  • Can we scale to terabytes of text?
    • Will centralized search engines be feasible?
    • How to federate annotation management?
