
Discovery and Reconciliation of Entity Type Taxonomies



  1. Discovery and Reconciliation of Entity Type Taxonomies Soumen Chakrabarti IIT Bombay www.cse.iitb.ac.in/~soumen

  2. Searching with types and entities • Answer types • How far is it from Rome to Paris? • type=distance#n#1 near words={Rome, Paris} • Restrictions on match conditions • How many movies did 20th Century Fox release in 1923? • …by 1950, the forest had only 1923 foxes left… • type={number#n#1,hasDigit} NEAR … year=1923 organization=“20th Century Fox” • Corpus = set of docs, doc = token sequence, tokens connected to lexical networks

  3. Searching personal info networks • No clean schema, data changes rapidly • Lots of generic “graph proximity” information • [Figure: a small personal information graph — Email1 (“…your ECML submission titled XYZ has been accepted…”) and Email2 (“Could I get a preprint of your recent ECML paper?”) connect via EmailTo, EmailDate and Time edges to ECML, A. U. Thor, and the canonical node for the PDF file XYZ (A. Thor, Abstract: …, LastMod); we want to quickly find this file]

  4. Building blocks • Structured “seeds” of type info • WordNet, Wikipedia, OpenCYC, … • Semi-structured sources • List of faculty members in a department • Catalog of products at an ecommerce site • Unstructured open-domain text • Email bodies, text of papers, blog text, Web pages, … • Discovery and extension of type attachments • Hearst patterns, list extraction, NE taggers • Reconciling federated type systems • Schema and data integration • Query execution engines using type catalogs

  5. Hearst patterns and enhancements • Hearst, 1992; KnowItAll (Etzioni+ 2004) • T such as x, x and other Ts, x or other Ts, T x, x is a T, x is the only T, … • C-PANKOW (Cimiano and Staab 2005) • Suitable for unformatted natural language • Generally high precision, low recall • If few possible Ts, use a named entity tagger
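A minimal regex sketch of the first two Hearst patterns above (“T such as x, …” and “x and other Ts”); the pattern list, the helper name extract_is_a, and the example sentences are illustrative assumptions, and extracted types come back in surface (often plural) form, still needing lemmatization:

```python
import re

HEARST_PATTERNS = [
    # "cities such as Paris, Rome and Vienna" -> (Paris, cities), ...
    re.compile(r"(?P<type>\w+) such as (?P<insts>[^.;]+)"),
    # "Paris and other cities" -> (Paris, cities)
    re.compile(r"(?P<insts>[A-Z]\w+) and other (?P<type>\w+)"),
]

def extract_is_a(sentence):
    """Return (instance, type) candidate pairs suggested by Hearst patterns."""
    pairs = []
    for pat in HEARST_PATTERNS:
        for m in pat.finditer(sentence):
            for inst in re.split(r",| and ", m.group("insts")):
                if inst.strip():
                    pairs.append((inst.strip(), m.group("type")))
    return pairs

print(extract_is_a("He visited cities such as Paris, Rome and Vienna."))
print(extract_is_a("Paris and other cities saw heavy rain."))
```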

  6. Set extraction • Each node of the graph is a word • An edge connects words wi and wj if they occur together in more than k docs • Use Apriori-style searches to enumerate edges • Edge weight depends on #docs • Given a set of query words Q, set up PageRank • Random surfer on the word graph • W.p. d, jump to some element of Q • W.p. 1−d, walk to a neighbor • Present the nodes (words) with the largest PageRank
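A small power-iteration sketch of the random surfer just described; the edge dictionary, the function name personalized_pagerank, the jump probability d = 0.15, and the iteration count are assumptions for illustration:

```python
from collections import defaultdict

def personalized_pagerank(edges, query, d=0.15, iters=50):
    """Random surfer on the co-occurrence word graph: with probability d jump
    back to a query word, otherwise walk to a weighted neighbor.
    'edges' maps (wi, wj) -> number of docs in which the pair co-occurs."""
    nbrs = defaultdict(dict)
    for (wi, wj), w in edges.items():
        nbrs[wi][wj] = w
        nbrs[wj][wi] = w
    nodes = list(nbrs)
    score = {v: 1.0 / len(nodes) for v in nodes}
    jump = {v: (1.0 / len(query) if v in query else 0.0) for v in nodes}
    for _ in range(iters):
        nxt = {v: d * jump[v] for v in nodes}
        for v in nodes:
            total = sum(nbrs[v].values())
            for u, w in nbrs[v].items():
                nxt[u] += (1 - d) * score[v] * w / total
        score = nxt
    return sorted(score, key=score.get, reverse=True)

edges = {("apple", "banana"): 5, ("banana", "cherry"): 4,
         ("apple", "cherry"): 3, ("cherry", "lion"): 1}
print(personalized_pagerank(edges, query={"apple", "banana"})[:3])
```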

  7. Example

  8. List extraction • Given a current set of candidate Ts • Limit to candidates having high confidence • Select a random subset of k=4 candidates • Generate a query from the selected candidates • Download the response documents • Look for lists containing candidate mentions • Extract more instances from the lists found • Boosts extraction rate 2–5 fold
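A toy sketch of the query-generation step above (a random subset of k=4 high-confidence candidates turned into a web query); the helper name, the confidence threshold, and the quoting scheme are assumptions:

```python
import random

def make_probe_query(candidates, confidence, k=4, threshold=0.8):
    """Pick k random high-confidence instances and join them into a query
    string intended for a web search engine."""
    confident = [c for c in candidates if confidence.get(c, 0.0) >= threshold]
    sample = random.sample(confident, min(k, len(confident)))
    return " ".join(f'"{c}"' for c in sample)

print(make_probe_query(
    ["Paris", "Rome", "Vienna", "Oslo", "Lima"],
    {"Paris": 0.95, "Rome": 0.9, "Vienna": 0.85, "Oslo": 0.82, "Lima": 0.4}))
```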

  9. Wrapping, scraping, tagging • HTML formatting clues • Help extract records and fields • Extensive work in the DB, KDD, ML communities • [Figure: extracted record P167 — is-a Paper, has-author Gerhard Pass, has-author Gregor Heinrich, has-title “Investigating word correlation…”]

  10. Reconciling type systems • WordNet: small and precise • Wikipedia: much larger, less controlled • Collect into a common is-a database

  11. Mapping between taxonomies • Each type has a set of instances • Assoc Prof: K. Burn, R. Cook • Synset: lemmas from leaf instances • Wikipedia concept: list of instances • Yahoo topic: set of example Web pages • Goal: establish connections between types • Connections could be “soft” or probabilistic

  12. Cross-training • Set of labels or types B, partly related but not identical to type set A • A=Dmoz topics, B=Yahoo topics • A=Personal bookmark topics, B=Yahoo topics • Training docs now come in two flavors • Fully labeled with both A and B labels (rare) • Half-labeled with either an A or a B label (document pools DA and DB) • Can B make classification for A more accurate (and vice versa)? • Inductive transfer, multi-task learning (Sarawagi+ 2003)

  13. Motivation • Symmetric taxonomy mapping • Ecommerce catalogs: A=distributor, B=retailer • Web directories: A=Dmoz, B=Yahoo • Incomplete taxonomies, small training sets • Bookmark taxonomy vs. Yahoo • Cartesian label spaces • [Figure: Top splits into Topic (Sports: Baseball, Cricket, …) and Regional (UK, USA, …); each (topic, region) label pair has its own label-pair-conditioned term distribution]

  14. Labels as features • A-label known, estimate B-label • Suppose we have an A+B labeled training set • Add the A-label as a discrete-valued “label column” in the feature vector • Most text classifiers cannot balance the importance of very heterogeneous features • We do not have fully-labeled data • Must guess the label column (use soft scores instead of 0/1) • [Figure: term feature values + label column → augmented feature vector → target label]
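A sketch, under assumptions, of the augmented feature vector: term features concatenated with soft A-label scores, with an ad hoc scale factor standing in for the balancing problem noted above (the scaling rule is an assumption for illustration, not the talk’s method):

```python
import numpy as np

def augment_with_label_scores(term_vec, a_scores):
    """Append (possibly soft) A-label scores to the term feature vector,
    rescaled so they are not drowned out by the term features."""
    scale = np.linalg.norm(term_vec) or 1.0   # assumed balancing heuristic
    return np.concatenate([term_vec, scale * np.asarray(a_scores)])

x = np.array([0.0, 1.0, 2.0])        # term features
soft_a = [0.7, 0.2, 0.1]             # guessed Pr(A-label | d) when unknown
print(augment_with_label_scores(x, soft_a))
```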

  15. SVM-CT: Cross-trained SVM • [Figure: docs with only A-labels (DA–DB) train a one-vs-rest SVM ensemble S(A,0) for A, which returns |A| scores (signed distances from the separators) for each test doc • docs with only B-labels (DB–DA) are passed through S(A,0); their term features are augmented with these scores, and a test case with known A-label is coded as a ±1 vector (–1,…,–1,+1,–1,…) • the augmented docs train a one-vs-rest SVM ensemble S(B,1) for B (the target label set) • the process can be iterated: S(A,1), S(B,2), S(A,2), …]

  16. SVM-CT anecdotes • Discriminant reveals relations between A and B • One-to-one, many-to-one, related, antagonistic • However, accuracy gains are meager • [Table: sample positive and negative discriminant weights]

  17. EM1D: Info from unlabeled docs • EM1D: expectation maximization with one label set, say B (Nigam et al.) • Use the training docs to induce an initial classifier for taxonomy B • Repeat until the classifier is satisfactory • Estimate Pr(β|d) for each unlabeled doc d and each β ∈ B • Reweigh d by factor Pr(β|d) and add it to the training set for label β • Retrain the classifier • Ignores labels from the other taxonomy A
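A compact sketch of EM1D in the Nigam et al. style using scikit-learn’s multinomial naïve Bayes over dense term-count matrices; the function name and the toy data are assumptions:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em1d(X_lab, y_lab, X_unlab, n_iter=10):
    """Seed naive Bayes on labeled docs, then repeatedly add each unlabeled
    doc to every class, weighted by the current Pr(class | doc)."""
    clf = MultinomialNB().fit(X_lab, y_lab)
    for _ in range(n_iter):
        post = clf.predict_proba(X_unlab)              # E-step: Pr(beta | d)
        classes = clf.classes_
        # M-step: retrain on labeled docs plus unlabeled docs replicated
        # once per class, each copy weighted by its posterior.
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] +
                               [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate([np.ones(len(X_lab))] +
                               [post[:, j] for j in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf

# Toy usage: classes 0/1 over three "terms"
X_lab = np.array([[3, 0, 1], [0, 4, 2]])
y_lab = np.array([0, 1])
X_unlab = np.array([[2, 1, 0], [0, 3, 1]])
print(em1d(X_lab, y_lab, X_unlab).predict(np.array([[1, 0, 0]])))
```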

  18. Stratified EM1D • Target labels = B • B-labeled docs are labeled training instances • Consider the A-labeled docs with label α • These are unlabeled for taxonomy B • Run EM1D separately for each row α • A test instance has its α known • Invoke the semi-supervised model for row α to classify into B-topics • [Figure: docs in DB–DA (with B-labels) are shared across rows; docs in DA–DB are split by their A topic into rows α, α′, …]

  19. EM2D: Cartesian product EM • Initialize with fully labeled docs, which go to a specific (α,β) cell • Smear each half-labeled training doc across its label row or column • Uniform smear could be bad • Use a naïve Bayes classifier to seed • Parameters extended from EM1D • π(α,β): prior probability for label pair (α,β) • θ(α,β,t): multinomial term probability for (α,β) • [Figure: an A-labeled doc is smeared across one row and a B-labeled doc across one column of the “labels in A” × “labels in B” grid]

  20. EM2D updates • E-step for an A-labeled document • M-step: updated class-pair priors π(α,β) and class-pair-conditioned term stats θ(α,β,t) • [The slide’s equations were images; see the sketch below]
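The slide’s own E-step and M-step formulas did not survive extraction; the following is a hedged LaTeX reconstruction consistent with the parameters π(α,β) and θ(α,β,t) defined on slide 19 (multinomial naïve Bayes EM over the label grid), not the verbatim equations from the talk:

```latex
% E-step for an A-labeled document d with observed label \alpha(d):
% spread d over the cells of its row in proportion to the current model.
\Pr\bigl(\alpha(d),\beta \mid d\bigr) =
  \frac{\pi_{\alpha(d),\beta}\,\prod_t \theta_{\alpha(d),\beta,t}^{\,n(d,t)}}
       {\sum_{\beta'} \pi_{\alpha(d),\beta'}\,\prod_t \theta_{\alpha(d),\beta',t}^{\,n(d,t)}}

% M-step: re-estimate class-pair priors and class-pair-conditioned term
% statistics from the soft assignments (Laplace-smoothed term counts).
\pi_{\alpha,\beta} \propto \sum_d \Pr(\alpha,\beta \mid d),
\qquad
\theta_{\alpha,\beta,t} \propto 1 + \sum_d \Pr(\alpha,\beta \mid d)\, n(d,t)
```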

  21. Applying EM2D to a test doc • Mapping a B-labeled test doc d to an A label (e-commerce catalogs) • Given β, find argmaxα Pr(α,β|d) • Classifying a document d with no labels to an A label • Aggregation • For each α compute Σβ Pr(α,β|d), pick the best α • Guessing (EM2D-G) • Guess the best β* using a B-classifier • Find argmaxα Pr(α,β*|d) • EM pitfalls: damping factor, early stopping
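A small sketch of the three ways just listed to pick an A label from the EM2D posterior grid Pr(α,β|d): known β, aggregation over β, and guessing β* with a B-classifier. The array layout and function name are assumptions:

```python
import numpy as np

def map_to_A(P, beta=None, b_classifier_guess=None):
    """Pick an A label index from P[alpha, beta] = Pr(alpha, beta | d)."""
    if beta is not None:                      # B-label of d is known
        return int(np.argmax(P[:, beta]))
    if b_classifier_guess is not None:        # EM2D-G: use the guessed beta*
        return int(np.argmax(P[:, b_classifier_guess]))
    return int(np.argmax(P.sum(axis=1)))      # aggregation over beta

P = np.array([[0.05, 0.30],                   # rows: A labels, cols: B labels
              [0.40, 0.25]])
print(map_to_A(P, beta=1), map_to_A(P), map_to_A(P, b_classifier_guess=0))
```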

  22. Experiments • Selected 5 Dmoz and Yahoo subtree pairs • Compare EM2D against • Naïve Bayes, best #features and smoothing • EM1D: ignore labels from other taxonomy, consider as unlabeled docs • Stratified EM1D • Mapping test doc with A-label to B-label or vice versa • Classifying zero-labeled test doc • Accuracy = fraction with correct labels

  23. Accuracy benefits in mapping • Improvement over NB: 30% best, 10% average • EM1D and NB are close, because the training set sizes for each taxonomy are not too small • EM2D > Stratified EM1D > NB • 2d transfer of model info seems important

  24. Asymmetric setting • Few (only 300) bookmarked URLs (taxonomy B, target) • Many Yahoo URLs, larger number of classes (taxonomy A) • Need to control the damping factor (relative importance of labeled vs. unlabeled docs) to tackle population skew

  25. Zero-labeled test documents • EM1D improves accuracy only for 12 train docs • EM2D with guessing improves beyond EM1D • In fact, better than aggregating scores to 1d • Choice of unlabeled:labeled damping ratio L may be important to get benefits

  26. Robustness to initialization • Seeding choices: hard (best class), NB scores, uniform • Smear a fraction uniformly, the rest by NB scores • EM2D is robust over a wide range of smear fractions • Fully uniform smearing can fail (local optima) • [Figure: naïve Bayes smear vs. uniform smear]

  27. Handling constraints • Type systems are often hierarchical • Neighborhood constraints • Two nodes match if their children also match • Two nodes match if their parents match and some of their descendants also match • If all children of node X match node Y, then X also matches Y • If a node in the neighborhood of node X matches ASSOCIATE- PROFESSOR, then the chance that X matches PROFESSOR is increased • Three-stage approach (Doan+ 2002) …

  28. Three-stage reconciliation • Use EM2D to build a distribution estimator • For every type pair A, B, find, over the domain of entities, Pr(A,B), Pr(A,!B), Pr(!A,B), Pr(!A,!B) • From these, compute a local similarity measure • E.g. Jaccard: Pr(A,B) / (Pr(A,B) + Pr(A,!B) + Pr(!A,B)) • Perform relaxation labeling to find a good mapping (Chakrabarti+ 1998, Doan+ 2002)
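A one-function sketch of the local similarity step, computing Jaccard from the estimated pair probabilities; the function name and the example numbers are assumptions:

```python
def jaccard_from_pair_probs(p_ab, p_a_notb, p_nota_b):
    """Jaccard similarity of types A and B over the entity domain,
    from the joint probabilities the distribution estimator provides."""
    denom = p_ab + p_a_notb + p_nota_b
    return p_ab / denom if denom else 0.0

# e.g. Pr(A,B)=0.30, Pr(A,!B)=0.10, Pr(!A,B)=0.05  ->  0.30/0.45 ≈ 0.667
print(jaccard_from_pair_probs(0.30, 0.10, 0.05))
```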

  29. Relaxation labeling • Markovian assumption: Pr(node X of O1 gets a given label in O2 | everything known) ≈ Pr(label | features of the immediate neighborhood of X) • Initialize some reasonable mapping f and reassign labels via Bayes rule until f stabilizes
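A generic sketch of the relaxation-labeling loop under the Markovian assumption; local_score is an assumed stand-in for the Bayes-rule scoring of a label given the neighborhood, and the toy example is hypothetical:

```python
def relaxation_labeling(nodes, neighbors, labels, local_score, n_iter=20):
    """Iteratively reassign each O1 node's O2 label given its neighbors'
    current labels, until the mapping f stabilizes."""
    f = {v: max(labels, key=lambda c: local_score(v, c, {})) for v in nodes}
    for _ in range(n_iter):
        new_f = {}
        for v in nodes:
            nbr_labels = {u: f[u] for u in neighbors.get(v, [])}
            new_f[v] = max(labels, key=lambda c: local_score(v, c, nbr_labels))
        if new_f == f:
            break
        f = new_f
    return f

# Toy usage: two O1 nodes, two O2 labels; the assumed score slightly
# favors agreeing with a neighbor's current label.
nodes = ["x1", "x2"]
neighbors = {"x1": ["x2"], "x2": ["x1"]}
labels = ["PROFESSOR", "ASSOCIATE-PROFESSOR"]
def local_score(v, c, nbr_labels):
    base = 1.0 if (v, c) == ("x1", "PROFESSOR") else 0.5
    return base + 0.2 * sum(1 for l in nbr_labels.values() if l == c)
print(relaxation_labeling(nodes, neighbors, labels, local_score))
```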

  30. Sample results

  31. Summary To realize the Semantic Web vision we must • Assemble type schemas from diverse sources • Mine type instances automatically • Annotate large corpora efficiently • Build indices integrating text and annotations • Support schema-agnostic query languages • When did you last type XQuery into Google? • Design high-performance query execution engines • New family of ranking functions
