
Information Extraction



  1. Information Extraction • Extract meaningful information from text • Without fully understanding everything! • Basic idea: • Define domain-specific templates • Simple and reliable linguistic processing • Recognize known types of entities and relations • Fill templates with recognized information

  2. Example • Input text: "4 Apr. Dallas - Early last evening, a tornado swept through northwest Dallas. The twister occurred without warning at about 7:15 pm and destroyed two mobile homes. The Texaco station at 102 Main St. was also severely damaged, but no injuries were reported." • Filled template:
  Event: tornado
  Date: 4/3/97
  Time: 19:15
  Location: "northwest Dallas" : Texas : USA
  Damage: "mobile homes" (2), "Texaco station" (1)
  Injuries: none
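The filled template is just structured data. A minimal sketch in Python (the nesting of the Location slot and the (object, count) encoding of Damage are illustrative assumptions, not something the slide specifies):

```python
# Filled template from the tornado example; field names follow the slide.
# The nested Location dict and the (object, count) Damage pairs are
# illustrative assumptions about how the slots might be encoded.
tornado_template = {
    "Event": "tornado",
    "Date": "4/3/97",
    "Time": "19:15",
    "Location": {"area": "northwest Dallas", "state": "Texas", "country": "USA"},
    "Damage": [("mobile homes", 2), ("Texaco station", 1)],
    "Injuries": None,
}
```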

  3. The Extraction Pipeline (example) • Input: 4 Apr. Dallas - Early last evening, a tornado swept through northwest....
  Tokenization & Tagging: Early/ADV last/ADJ evening/NN:time ,/, a/DT tornado/NN:weather swept/VBD ...
  Sentence Analysis: Early last evening: adv-phrase:time | a tornado: noun-group:subject | swept: verb-group ...
  Pattern Extraction: tornado swept → Event: tornado | through northwest Dallas → Loc: "northwest Dallas" | causing extensive damage → Damage
  Template Generation & Merging: Event: tornado | Date: 4/3/97 | Time: 19:15 | Location: "northwest Dallas" : Texas : USA ...
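A control-flow sketch of these stages, with stubs standing in for the real tagger, chunker, and pattern matcher (everything here except the stage names is a placeholder):

```python
# Hypothetical skeleton of the extraction pipeline; each stage is a stub.

def tokenize_and_tag(text):
    """Split into tokens; a real tagger would attach POS + semantic tags."""
    return text.split()

def analyze_sentence(tokens):
    """Shallow-parse tokens into phrases (noun groups, verb groups, ...)."""
    return [{"type": "noun-group", "tokens": tokens}]   # placeholder

def extract_patterns(phrases):
    """Match domain-specific trigger patterns against the phrases."""
    return [{"slot": "Event", "value": "tornado"}]      # placeholder

def generate_template(slot_fills):
    """Assemble slot fills into a partial template."""
    return {fill["slot"]: fill["value"] for fill in slot_fills}

def merge(templates):
    """Unify partial templates that describe the same event."""
    merged = {}
    for t in templates:
        merged.update(t)
    return merged

text = "Early last evening, a tornado swept through northwest Dallas."
print(merge([generate_template(extract_patterns(
    analyze_sentence(tokenize_and_tag(text))))]))   # {'Event': 'tornado'}
```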

  4. MUC: Message Understanding Conference • “Competitive” conference with predefined tasks for research groups to address • Tasks (MUC-7): • Named Entities: Extract typed entities from text • Equivalence Classes: Solving coreference • Attributes: Fill in attributes of entities • Facts: Extract logical relations between entities • Events: Extract descriptions of events from text

  5. Tokenization & Tagging • Tokenization & POS tagging • Also lexical-semantic information, such as "time", "location", "weather", "person", etc.
  Sentence Analysis • Shallow parsing for phrase types • Use tagging & semantics to tag phrases • Note phrase heads

  6. Pattern Extraction • Find domain-specific relations between text units • Typically use lexical triggers and relation-specific patterns to recognize relations • Example:
  Concept: Damaged-Object
  Trigger: destroyed
  Position: direct-object
  Constraints: physical-thing
  ... and [ destroyed ] [ two mobile homes ] → Damaged-Object = "two mobile homes"
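A minimal sketch of firing this pattern, assuming pre-tokenized text and treating everything after the trigger as the direct object (a real system would take the object from a shallow parse, and the semantic-class check is a stand-in):

```python
# Toy application of the Damaged-Object pattern from the slide.
PATTERN = {
    "concept": "Damaged-Object",
    "trigger": "destroyed",
    "position": "direct-object",
    "constraints": "physical-thing",
}

def apply_pattern(tokens, pattern, semantic_class):
    """Return (concept, fill) if the trigger fires and the constraint holds."""
    if pattern["trigger"] not in tokens:
        return None
    i = tokens.index(pattern["trigger"])
    candidate = " ".join(tokens[i + 1:])   # toy stand-in for the direct object
    if semantic_class(candidate) == pattern["constraints"]:
        return (pattern["concept"], candidate)
    return None

tokens = "and destroyed two mobile homes".split()
print(apply_pattern(tokens, PATTERN,
                    semantic_class=lambda np: "physical-thing"))
# ('Damaged-Object', 'two mobile homes')
```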

  7. Learning Extraction Patterns • Very difficult to predefine extraction patterns • Must be redone for each new domain • Hence, corpus-based approaches are indicated • Some methods: • AutoSlog (1992) – “syntactic” learning • PALKA (1995) – “conceptual” learning • CRYSTAL (1995) – covering algorithm

  8. AutoSlog (Lehnert 1992) • Patterns based on recognizing “concepts” • Concept: what concept to recognize • Trigger: a word indicating an occurrence • Position: what syntactic role the concept will take in the sentence • Constraints: what type of entity to allow • Enabling conditions: constraints on the linguistic context

  9. Concept: Event-Time • Trigger: "at" • Position: prep-phrase-object • Constraints: time • Enabling conditions: post-verb • "The twister occurred without warning at about 7:15 pm and destroyed two mobile homes." → Event-Time = 19:15

  10. Learning Patterns • Supervised: training data is text annotated with the patterns to be extracted • Knowledge: 13 general syntactic patterns • Algorithm: • Find a sentence containing the target noun phrase ("two mobile homes") • Partially parse the sentence to find syntactic relations • Try all linguistic patterns to find a match • Generate a concept pattern from the match

  11. Linguistic Patterns • Identify domain-specific thematic roles based on syntactic structure • Example: active-voice-verb followed by target = direct object →
  Concept = target concept
  Trigger = verb of active-voice-verb
  Position = direct-object
  Constraints = semantic-class of target
  Enabling conditions = active-voice
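A sketch of the generation step from slides 10-11: given a parse in which the target NP is the direct object of an active-voice verb, instantiate this linguistic pattern into a concept pattern. The flat parse dictionary is an assumed toy format, not AutoSlog's actual representation:

```python
# AutoSlog-style pattern generation for the 'active-verb + direct-object'
# linguistic pattern; the parse representation is invented for illustration.

def generate_concept_pattern(parse, target, concept, semantic_class):
    if parse.get("voice") == "active" and parse.get("direct_object") == target:
        return {
            "concept": concept,
            "trigger": parse["verb"],          # trigger = the active verb
            "position": "direct-object",
            "constraints": semantic_class,     # semantic class of the target
            "enabling": "active-voice",
        }
    return None

parse = {"voice": "active", "verb": "destroyed",
         "subject": "a tornado", "direct_object": "two mobile homes"}
print(generate_concept_pattern(parse, "two mobile homes",
                               "Damaged-Object", "physical-thing"))
```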

  12. More Examples • victim was murdered • perpetrator bombed • perpetrator attempted to kill • was aimed at target • Some bad extraction patterns are generated (e.g., "is" as a trigger), so a human review process filters them out

  13. CRYSTAL • Complex syntactic patterns • Use “covering” algorithm: • Generate most specific possible patterns for all occurrences of targets in corpus • Loop: • Find most specific unifier of the most similar patterns C & C’, generating new pattern P • If P has less than ε error on corpus, replace C and C’ with P • Continue until no new patterns can be added
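The control flow of this covering loop can be sketched as below; pattern similarity, unification, and the corpus error measure are passed in as stubs, since the slide does not specify them:

```python
# Skeleton of CRYSTAL's covering loop; only the control flow follows the
# slide, the similarity/unify/error functions are caller-supplied stubs.

def crystal(patterns, corpus_error, unify, similarity, eps):
    patterns = list(patterns)
    while len(patterns) > 1:
        # Find the most similar pair of patterns C, C'.
        pairs = [(similarity(a, b), a, b)
                 for i, a in enumerate(patterns)
                 for b in patterns[i + 1:]]
        _, c, c2 = max(pairs, key=lambda t: t[0])
        p = unify(c, c2)                     # most specific unifier of C, C'
        if p is None or corpus_error(p) >= eps:
            break                            # no acceptable generalization left
        patterns.remove(c)                   # replace C and C' with P
        patterns.remove(c2)
        patterns.append(p)
    return patterns
```

With stub functions this is a control-flow sketch only; a real implementation would compare and unify pattern definitions structurally.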

  14. Merging Motor Vehicles International Corp. announced a major management shake-up ... MVI said the CEO has resigned ... The Big 10 auto maker is attempting to regain market share ... It will announce losses ... A company spokesman said they are moving their operations ... MVI, the first company to announce such a move since the passage of the new international trade agreement, is facing increasing demands from unionized workers...

  15. Coreference Resolution • Many different kinds of linguistic phenomena: • Proper names, • Aliases (MVI), • Definite NPs (the Big 10 auto maker), • Pronouns (it, they), • Appositives (, the first company to ...) • Errors of previous phases may be amplified

  16. Learning to Merge • Treat coreference as a classification task: should this pair of entities be linked? • Methodology: • Training corpus: manually link all coreferential expressions • Each possible pair of expressions is a training example: positive if they are linked, negative otherwise • Create a feature vector for each example • Use your favorite learning algorithm
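A sketch of the pair-generation step, assuming mentions come annotated with a chain id from the manual linking (the single overlap feature is a toy stand-in for a real feature vector):

```python
# Turn a manually linked corpus into pairwise classification examples:
# each pair of mentions is positive iff they share a coreference chain.
from itertools import combinations

def make_pairs(mentions, features):
    examples = []
    for (m1, c1), (m2, c2) in combinations(mentions, 2):
        examples.append((features(m1, m2), c1 == c2))
    return examples

mentions = [("MVI", 0), ("the Big 10 auto maker", 0),
            ("a company spokesman", 1), ("it", 0)]
data = make_pairs(mentions,
                  features=lambda a, b: {"overlap": a.lower() in b.lower()})
print(len(data), "pairs;", sum(label for _, label in data), "positive")
# 6 pairs; 3 positive
```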

  17. MLR (1995) • 66 features were used, in 4 categories: • Lexical features of each phrase, e.g., do they overlap? • Grammatical role of each phrase, e.g., subject, direct-object • Semantic classes of each phrase, e.g., physical-thing, company • Relative positions of the phrases, e.g., X one sentence after Y • Decision-tree learning (C4.5)

  18. C4.5 • Incrementally build a decision tree from labeled training examples • At each stage choose the "best" attribute to split the dataset, e.g., by comparing features with information gain • After building the complete tree, prune the leaves to prevent overfitting • Use statistical tests to determine whether enough examples fall in each leaf; if not, prune!
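For concreteness, here is the information-gain criterion mentioned above: entropy of the class labels minus the weighted entropy after splitting on an attribute (a standard formulation, not code from the slides):

```python
# Information gain, the attribute-selection criterion from the slide.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """examples: list of (feature_dict, class_label) pairs."""
    labels = [label for _, label in examples]
    by_value = {}
    for feats, label in examples:
        by_value.setdefault(feats[attr], []).append(label)
    remainder = sum(len(ls) / len(examples) * entropy(ls)
                    for ls in by_value.values())
    return entropy(labels) - remainder

data = [({"f1": 1}, "C1"), ({"f1": 1}, "C1"), ({"f1": 0}, "C2")]
print(info_gain(data, "f1"))   # ~0.918: f1 separates the classes perfectly
```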

  19. [Figure: example C4.5 decision tree. The root splits 40 training examples on f1 into subsets of 25 and 15; these split on f2 and f3 into leaves of 18, 7, 2, and 13 examples, labeled C1, C2, C2, and C1.]

  20. RESOLVE (1995) • C4.5 with 8 complex features: • NAME-{1,2}: does reference include a name? • JV-CHILD-{1,2}: does reference refer to part of a joint venture? • ALIAS: does one reference contain an alias for the other? • BOTH-JV-CHILD: do both refer to part of a joint venture? • COMMON-NP: do both contain a common NP? • SAME-SENTENCE: are both in the same sentence?

  21. Decision Tree [Figure: the decision tree learned by RESOLVE]

  22. RESOLVE Results • 50 texts, leave-one-out cross-validation [results table not shown]

  23. Full System: FASTUS (1996) • Pipeline: Input Text → Pattern Recognition → Partial Templates → Coreference Resolution → Template Merger → Output Template

  24. Pattern Recognition • Multiple passes of finite-state methods • Example: "John Smith, 47, was named president of ABC Corp." [Figure: layered analysis; word-level tags Pers-Name, Num, Aux, V, N, P, Org-Name; group-level tags Poss-N-Group, V-Group; top-level Domain-Event]
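A toy illustration of the cascade idea, using Python regexes as stand-ins for finite-state transducers (the tag names and patterns are illustrative; FASTUS's actual grammars are far richer):

```python
# Cascaded finite-state passes, FASTUS-style, with regexes as transducers.
import re

# Input is pre-tokenized (spaces around punctuation) to keep patterns simple.
sentence = "John Smith , 47 , was named president of ABC Corp."

# Pass 1: tag basic entities (toy literal patterns stand in for recognizers).
stage = re.sub(r"John Smith", "<PERSON>", sentence)
stage = re.sub(r"ABC Corp\.", "<ORG>", stage)

# Pass 2: group words into phrases over the entity-tagged stream.
stage = re.sub(r"was named", "<V-GROUP>", stage)

# Pass 3: match a domain-event pattern over the groups.
m = re.search(r"<PERSON> , \d+ , <V-GROUP> (\w+) of <ORG>", stage)
if m:
    print({"Person": "John Smith", "Pos": m.group(1), "Org": "ABC Corp."})
    # {'Person': 'John Smith', 'Pos': 'president', 'Org': 'ABC Corp.'}
```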

  25. Partially-Instantiated Templates • Domain-dependent! • From the sentence above, two partial templates:
  Person: _______ | Pos: President | Org: ABC Corp.
  Person: John Smith | Pos: President | Org: ABC Corp. | Start: | End:

  26. The Next Sentence... • "He replaces Mike Jones." • Coreference analysis: He = John Smith • New partial templates:
  Person: Mike Jones | Pos: ________ | Org: ________
  Person: John Smith | Pos: ________ | Org: ________ | Start: | End:

  27. Unification • Unify new templates with preceding template(s), if possible...
  Person: Mike Jones | Pos: President | Org: ABC Corp.
  Person: John Smith | Pos: President | Org: ABC Corp. | Start: | End:
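A minimal sketch of the unification step: slots are merged when compatible, and a conflict on any filled slot blocks the merge:

```python
# Unify two partial templates: empty (None) slots take the other template's
# value; differing non-empty fills mean the templates are incompatible.

def unify_templates(t1, t2):
    merged = dict(t1)
    for slot, value in t2.items():
        if merged.get(slot) is None:
            merged[slot] = value
        elif value is not None and merged[slot] != value:
            return None   # conflicting fills: do not merge
    return merged

new = {"Person": "Mike Jones", "Pos": None, "Org": None}
old = {"Person": "Mike Jones", "Pos": "President", "Org": "ABC Corp."}
print(unify_templates(new, old))
# {'Person': 'Mike Jones', 'Pos': 'President', 'Org': 'ABC Corp.'}
```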

  28. Principle of Least Commitment • Idea: maintain options as long as possible • E.g., parsing: maintain a lattice structure over "The committee heads announced that..." [Figure: lattice with word arcs DT NN1 NN2 VBD CSub, keeping VBZ as an alternative reading of "heads"; an N-GRP arc over "the committee heads"; and an Event arc with Event: Announce, Actor: committee heads]

  29. Principle of Least Commitment • Same lattice, different continuation: "The committee heads ABC's recruitment effort." [Figure: here "heads" resolves to VBZ between two N-GRP arcs (DT NN1 and NNpos NN NN2), and the Event arc reads Head: Committee, Effort: ABC's recruitment]

  30. More Least Commitment • Maintain multiple coreference hypotheses: • Disambiguate later, when creating domain-events and more information is available • Too many possibilities? Use a beam search: maintain the k "best" hypotheses at every stage
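A sketch of the beam idea for coreference: each hypothesis is a partition of the mentions seen so far, and only the k best-scoring partitions survive each step. The scorer here is a deliberately crude placeholder:

```python
# Beam search over coreference hypotheses (partitions of mentions).

def beam_coref(mentions, score, k=3):
    beam = [[]]                               # start with one empty partition
    for m in mentions:
        candidates = []
        for partition in beam:
            candidates.append(partition + [[m]])   # m starts a new entity
            for i in range(len(partition)):        # or m joins entity i
                new = [list(e) for e in partition]
                new[i].append(m)
                candidates.append(new)
        beam = sorted(candidates, key=score, reverse=True)[:k]  # keep k best
    return beam[0]

def toy_score(partition):
    return -len(partition)   # crude: prefer fewer entities

print(beam_coref(["MVI", "it", "the company"], toy_score, k=2))
# [['MVI', 'it', 'the company']]
```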
