
Information Extraction


Presentation Transcript


  1. Information Extraction Sunita Sarawagi IIT Bombay http://www.it.iitb.ac.in/~sunita

  2. Information Extraction (IE) & Integration The Extraction task: Given • E: a set of structured elements • S: an unstructured source extract all instances of E from S • Many versions involving many source types • Actively researched in varied communities • Several tools and techniques • Several commercial applications

  3. IE from free format text • Classical Named Entity Recognition • Extract person, location, and organization names, e.g.: "According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms" • Several applications • News tracking • Monitor events • Bio-informatics • Protein and gene names from publications • Customer care • Part number, problem description from emails in help centers

  4. Problem definition • Source: concatenation of structured elements with limited reordering and some missing fields • Example: addresses, with elements House number, Building, Road, Area, City, Zip: 156 Hillside ctype Scenic drive Powai Mumbai 400076 • Example: bib records, with elements Author, Year, Title, Journal, Volume, Page: P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.

  5. Relation Extraction: Disease Outbreaks • Extract structured relations from text, e.g.: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…" → a Disease Outbreaks relation extracted from The New York Times by an information extraction system (e.g., NYU's Proteus)

  6. Information Extraction on the web

  7. Personal Information Systems • Automatically add a bibtex entry for a paper I download • Integrate a resume in email with the candidate database [Figure: a personal information space linking Papers, Files, People, Emails, Web, Projects, Resumes.]

  8. Hand-Coded Methods • Easy to construct in many cases • e.g., to recognize prices, phone numbers, zip codes, conference names, etc. • Easier to debug & maintain • Especially if written in a "high-level" language (as is usually the case), e.g. [from Avatar]: ContactPattern ← RegularExpression(Email.body, "can be reached at"); PersonPhone ← Precedes(Person, Precedes(ContactPattern, Phone, D), D) • Easier to incorporate / reuse domain knowledge • Can be quite labor intensive to write
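A minimal sketch of what these two rules might look like as plain regular expressions (Python here; the phone-number shape and the flattening of the Precedes(...) combinators into one pattern are my assumptions, not Avatar's actual syntax):

import re

# Hypothetical re-creation of the Avatar-style rules above.
# ContactPattern: the literal cue phrase inside an email body.
CONTACT_PATTERN = re.compile(r"can be reached at")

# PersonPhone: a capitalized name, then the cue phrase, then a phone
# number -- a crude stand-in for Precedes(Person, Precedes(ContactPattern, Phone, D), D).
PERSON_PHONE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+)\s+"
    r"can be reached at\s+"
    r"(?P<phone>\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4})"   # assumed US-style phone shape
)

m = PERSON_PHONE.search("For scheduling, John Smith can be reached at 212-555-0182.")
if m:
    print(m.group("person"), "->", m.group("phone"))   # John Smith -> 212-555-0182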

  9. Example of Hand-Coded Entity Tagger [G. Ramakrishnan, 2005; slides from Doan et al., SIGMOD 2006] • Rule 1: finds person names with a salutation (e.g. Dr. Laura Haas) followed by two capitalized words: <token>INITIAL</token> <token>DOT</token> <token>CAPSWORD</token> <token>CAPSWORD</token> • Rule 2: finds person names where two capitalized words are present in a Person dictionary: <token>PERSONDICT, CAPSWORD</token> <token>PERSONDICT, CAPSWORD</token> • CAPSWORD: a word starting with an uppercase letter whose second letter is lowercase, i.e. \p{Upper}\p{Lower}[\p{Alpha}]{1,25} (e.g., DeWitt satisfies it; DEWITT does not) • DOT: the character '.'
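A small Python approximation of Rule 1 (the salutation list behind INITIAL is a guess; the slide only defines CAPSWORD and DOT):

import re

# CAPSWORD from the slide: uppercase first letter, lowercase second,
# then 1-25 more letters (so DeWitt matches, DEWITT does not).
CAPSWORD = r"[A-Z][a-z][A-Za-z]{1,25}"

# Rule 1: INITIAL DOT CAPSWORD CAPSWORD, with INITIAL assumed to be a salutation.
RULE1 = re.compile(rf"\b(?:Dr|Mr|Mrs|Ms|Prof)\.\s+{CAPSWORD}\s+{CAPSWORD}\b")

print(RULE1.findall("Please ask Dr. Laura Haas, not DEWITT."))   # ['Dr. Laura Haas']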

  10. Hand-Coded Rule Example: Conference Name

# These are subordinate patterns
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"

# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";

# Given a <dbworldMessage>, look for the conference pattern
lookForPattern($dbworldMessage, $fullNamePattern);

# In a given <file>, look for occurrences of <pattern>
# <pattern> is a regular expression
sub lookForPattern {
    my ($file,$pattern) = @_;

  11. Some Hand Coded Entity Taggers • FRUMP [DeJong 82] • CIRCUS / AutoSlog [Riloff 93] • SRI FASTUS [Appelt, 1996] • MITRE Alembic (available for use) • Alias-I LingPipe (available for use) • OSMX [Embley, 2005] • DBLife [Doan et al, 2006] • Avatar [Jayram et al, 2006]

  12. Learning models for extraction • Rule-based extractors • For each label, build two classifiers, one for each of its two boundaries (start and end). • Each classifier: a sequence of rules • Each rule: a conjunction of predicates • E.g.: if the previous token is a last name, the current token is ".", and the next token is an article → start of title. • Examples: Rapier, GATE, LP2 & several more • Critique of rule-based approaches • Cannot output meaningful uncertainty values • Brittle • Limited flexibility in the clues that can be exploited • Not good at combining several weak clues • (Pro) Somewhat easier to tune.

  13. Statistical models of IE • Generative models like HMMs • Intuitive • Very restricted feature sets → lower accuracy • Output probabilities are highly skewed (like their counterpart, naïve Bayes) • Conditional discriminative models • Local models: maximum entropy models • Global models: conditional random fields • Conditional models output meaningful probabilities, are flexible, generalize well, and are getting increasingly popular • State-of-the-art!

  14. IE with Hidden Markov Models • Probabilistic models for IE [Figure: an HMM over states such as Title, Author, Journal, Year, annotated with transition probabilities between states and per-state emission probabilities over words.]

  15. HMM Structure • Naïve Model: One state per element • Nested model • Each element another HMM

  16. HMM Dictionary • For each word (= feature), associate the probability of emitting that word • Multinomial model • More advanced models use overlapping features of a word, e.g.: • part of speech • capitalized or not • type: number, letter, word, etc. • Maximum entropy models (McCallum 2000)

  17. Learning model parameters • When training data defines a unique path through the HMM: • Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions out of state i) • Emission probabilities: probability of emitting symbol k from state i = (number of times k is generated from i) / (number of transitions from i) • When training data defines multiple paths: a more general EM-like algorithm (Baum-Welch) (see the counting sketch below)
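A counting sketch of these estimates for the unique-path case (the toy address labels and data layout are my assumptions):

from collections import Counter, defaultdict

def train_hmm(paths):
    """Maximum-likelihood estimates when each training sequence defines a
    unique path through the HMM. `paths` is a list of [(word, state), ...]."""
    trans = defaultdict(Counter)   # state i -> Counter of next states j
    emit = defaultdict(Counter)    # state i -> Counter of emitted words k
    for path in paths:
        for word, state in path:
            emit[state][word] += 1
        for (_, s_i), (_, s_j) in zip(path, path[1:]):
            trans[s_i][s_j] += 1
    # Transition: P(j|i) = #(i -> j) / total transitions out of i
    A = {i: {j: n / sum(c.values()) for j, n in c.items()} for i, c in trans.items()}
    # Emission: P(k|i) = #(i emits k) / total emissions from i
    B = {i: {k: n / sum(c.values()) for k, n in c.items()} for i, c in emit.items()}
    return A, B

# Toy address example (hypothetical labels):
paths = [[("115", "House"), ("Grant", "Road"), ("street", "Road"),
          ("Mumbai", "City"), ("400070", "Pin")]]
A, B = train_hmm(paths)
print(A["Road"])   # {'Road': 0.5, 'City': 0.5}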

  18. Using the HMM to segment • Find the highest-probability path through the HMM • Viterbi: a dynamic programming algorithm, quadratic in the number of states [Figure: the address "115 Grant street Mumbai 400070" decoded over a trellis of states House, Road, City, Pin.]
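A compact Viterbi sketch, reusing A and B from the counting sketch above (log-space, with a small floor `eps` for unseen events -- an assumption, since the slides don't discuss smoothing); the inner double loop over states is where the quadratic cost comes from:

import math

def viterbi(words, states, A, B, pi, eps=1e-9):
    """Highest-probability state path. A[i][j], B[i][w], pi[i] are
    transition, emission, and start probabilities."""
    def lp(p):
        return math.log(p if p > 0 else eps)
    # V[t][s] = best log-probability of any path ending in state s at position t
    V = [{s: lp(pi.get(s, 0)) + lp(B.get(s, {}).get(words[0], 0)) for s in states}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in states:
            prev, score = max(
                ((r, V[-1][r] + lp(A.get(r, {}).get(s, 0))) for r in states),
                key=lambda t: t[1])
            row[s] = score + lp(B.get(s, {}).get(w, 0))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    s = max(V[-1], key=V[-1].get)
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return path[::-1]

states = ["House", "Road", "City", "Pin"]
pi = {"House": 1.0}   # assumed: addresses start with a house number
print(viterbi("115 Grant street Mumbai 400070".split(), states, A, B, pi))
# ['House', 'Road', 'Road', 'City', 'Pin']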

  19. Comparative Evaluation • Naïve model – one state per element in the HMM • Independent HMM – one HMM per element • Rule Learning Method – Rapier • Nested Model – each state in the Naïve model replaced by an HMM

  20. Results: Comparative Evaluation The Nested model does best in all three cases (from Borkar 2001)

  21. HMM approach: summary • Inter-element sequencing → outer HMM transitions • Intra-element sequencing → inner HMMs • Element length → multi-state inner HMMs • Characteristic words → dictionary • Non-overlapping tags → global optimization

  22. Statistical models of IE • Generative models like HMMs • Intuitive • Very restricted feature sets → lower accuracy • Output probabilities are highly skewed (like their counterpart, naïve Bayes) • Conditional discriminative models • Local models: maximum entropy models • Global models: conditional random fields • Conditional models output meaningful probabilities, are flexible, generalize well, and are getting increasingly popular • State-of-the-art!

  23. Basic chain model for extraction [Figure: tokens x1…x9 of "My review of Fermat's last theorem by S. Singh", each with a label y1…y9; an independent model, one prediction per position.]

  24. Features • The word as-is • Orthographic word properties • Capitalized? Digit? Ends with a dot? • Part of speech • Noun? • Match in a dictionary • Appears in a dictionary of person names? • Appears in a list of stop-words? • Fire each of these for each label, on • the token itself, • tokens up to W positions to the left or right, or • a concatenation of tokens (see the sketch below)
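A minimal sketch of such a feature extractor (the feature-string naming scheme is my assumption; dictionary and part-of-speech features are omitted):

def token_features(tokens, i, window=1):
    """Feature strings for position i: the word itself plus orthographic
    tests, fired on the token and on neighbors within `window` positions."""
    feats = []
    for d in range(-window, window + 1):
        j = i + d
        if not 0 <= j < len(tokens):
            continue
        w = tokens[j]
        feats.append(f"word@{d}={w.lower()}")
        if w[0].isupper():
            feats.append(f"cap@{d}")
        if w.isdigit():
            feats.append(f"digit@{d}")
        if w.endswith("."):
            feats.append(f"ends-dot@{d}")
    return feats

tokens = "My review of Fermat's last theorem by S. Singh".split()
print(token_features(tokens, 7))
# ['word@-1=by', 'word@0=s.', 'cap@0', 'ends-dot@0', 'word@1=singh', 'cap@1']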

  25. Basic chain model for extraction [Figure: the same sentence and labels y1…y9, now with a global conditional model over Pr(y1, y2, …, y9 | x).]

  26. Features • A feature vector f for each position, computed from the word at i & its neighbors, the i-th label, and the previous label (the features are user provided) • Parameters: a weight for each feature, i.e. a weight vector w (machine learnt)
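In symbols, this is the standard linear-chain conditional model (CRF) with user-provided features f and learnt weights w:

\Pr(y_1,\dots,y_n \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big( \sum_{i=1}^{n} \mathbf{w} \cdot \mathbf{f}(y_i, y_{i-1}, x, i) \Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\Big( \sum_{i=1}^{n} \mathbf{w} \cdot \mathbf{f}(y'_i, y'_{i-1}, x, i) \Big)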

  27. Transforming real-world extraction • Partition each label into different parts? • Independent extraction per label? • Per-token tags: Begin, Continue, End, Unique, Other (see the encoding sketch below)
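A small sketch of this Begin/Continue/End/Unique/Other encoding (the span representation and tag spellings are my assumptions):

def bceu_encode(tokens, segments):
    """Encode segment annotations as per-token tags. `segments` maps
    inclusive (start, end) spans to element labels."""
    tags = ["Other"] * len(tokens)
    for (s, e), label in segments.items():
        if s == e:
            tags[s] = f"Unique-{label}"          # one-token element
        else:
            tags[s] = f"Begin-{label}"
            for i in range(s + 1, e):
                tags[i] = f"Continue-{label}"
            tags[e] = f"End-{label}"
    return tags

tokens = "P . P . Wangikar 1993".split()
print(bceu_encode(tokens, {(0, 4): "Author", (5, 5): "Year"}))
# ['Begin-Author', 'Continue-Author', 'Continue-Author', 'Continue-Author',
#  'End-Author', 'Unique-Year']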

  28. Examples: features with weights (publications) [Table: a large number of learned features with their weights.]

  29. Typical numbers • Seminar announcements (CMU): • speaker, location, timings • SVMs for start/end boundaries • 250 training examples • F1: 85% speaker & location, 92% timings (Finn & Kushmerick '04) • Job postings in newsgroups: • 17 fields: title, location, company, language, etc. • 150 training examples • F1: 84% overall (LP2) (Lavelli et al. '04)
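For reference, the F1 figures quoted here and on the next slide are the usual harmonic mean of extraction precision P and recall R:

F_1 = \frac{2PR}{P + R},
\qquad
P = \frac{\#\,\text{correctly extracted fields}}{\#\,\text{extracted fields}},
\qquad
R = \frac{\#\,\text{correctly extracted fields}}{\#\,\text{true fields}}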

  30. Publications • Cora dataset • Paper headers: extract title, author, affiliation, address, email, abstract • 94% F1 with CRFs • 76% F1 with HMMs • Paper citations: extract title, author, date, editor, booktitle, pages, institution • 91% F1 with CRFs • 78% F1 with HMMs (Peng & McCallum 2004)
