
Information Extraction






Presentation Transcript


  1. Information Extraction October 13, 2006

  2. What is Information Extraction? • Input: • Specification: types of entities to find, types of relations to find, templates to fill • Corpus of text: possibly formatted, possibly annotated for linguistic structure • Output: • Text + annotation: entities tagged with type and coreference info; relations between entities tagged • Filled templates: instances of templates found in the text

  3. MUC: Genesis of IE • DARPA funded significant efforts in IE in the early to mid 1990’s. • Message Understanding Conference (MUC) was an annual event/competition where results were presented. • Focused on extracting information from news articles: • Terrorist events • Industrial joint ventures • Company management changes • Information extraction of particular interest to the intelligence community (CIA, NSA). (Note: early ’90’s)

  4. MUC • Named entity • Person, Organization, Location • Co-reference • “Clinton” ↔ “President Bill Clinton” • Template element • Perpetrator, Target • Template relation • Incident • Multilingual

  5. Named entities and events San Salvador, 19 Apr 89 (ACAN-EFE) -- [TEXT] Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … Vice President-elect Francisco Merino said that when the attorney general's car stopped at a light on a street in downtown San Salvador, an individual placed a bomb on the roof of the armored vehicle. …

  6. Coreference links San Salvador, 19 Apr 89 (ACAN-EFE) -- [TEXT] Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … Vice President-elect Francisco Merino said that when the attorney general's car stopped at a light on a street in downtown San Salvador, an individual placed a bomb on the roof of the armored vehicle. …

  7. (Partial) Scenario template
     Incident: Date                        19 Apr 89
     Incident: Location                    El Salvador: San Salvador (CITY)
     Incident: Type                        Bombing
     Perpetrator: Individual ID            “urban guerrillas”
     Perpetrator: Organization ID          “FMLN”
     Perpetrator: Organization Confidence  Suspected or Accused by Authorities: “FMLN”
     Physical Target: Description          “vehicle”
     Physical Target: Effect               Some Damage: “vehicle”
     Human Target: Name                    “Roberto Garcia Alvarado”
     Human Target: Description             “attorney general”: “Roberto Garcia Alvarado”
     Human Target: Effect                  Death: “Roberto Garcia Alvarado”

  8. MUC Typical Text Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month.


  10. MUC Templates • Relationship • tie-up • Entities: • Bridgestone Sports Co., a local concern, a Japanese trading house • Joint venture company • Bridgestone Sports Taiwan Co. • Activity • ACTIVITY-1 • Amount • NT$20,000,000

  11. MUC Templates • ACTIVITY-1 • Activity • Production • Company • Bridgestone Sports Taiwan Co. • Product • Iron and “metal wood” clubs • Start Date • January 1990

  12. Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
     TIE-UP-1
       Relationship: TIE-UP
       Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
       Joint Venture Company: “Bridgestone Sports Taiwan Co.”
       Activity: ACTIVITY-1
       Amount: NT$20,000,000
     Example from FASTUS (1993)

  13. Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
     TIE-UP-1
       Relationship: TIE-UP
       Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
       Joint Venture Company: “Bridgestone Sports Taiwan Co.”
       Activity: ACTIVITY-1
       Amount: NT$20,000,000
     ACTIVITY-1
       Activity: PRODUCTION
       Company: “Bridgestone Sports Taiwan Co.”
       Product: “iron and ‘metal wood’ clubs”
       Start Date: DURING January 1990
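Filled templates like these are just linked records; a minimal sketch renders the two templates above as plain Python dicts, with ACTIVITY-1 linked into TIE-UP-1 by reference (field names follow the slide):

```python
# The ACTIVITY-1 template from the FASTUS example, as a plain dict.
activity_1 = {
    "id": "ACTIVITY-1",
    "activity": "PRODUCTION",
    "company": "Bridgestone Sports Taiwan Co.",
    "product": "iron and 'metal wood' clubs",
    "start_date": ("DURING", "January 1990"),
}

# The TIE-UP-1 template, pointing at ACTIVITY-1 through its "activity" slot.
tie_up_1 = {
    "id": "TIE-UP-1",
    "relationship": "TIE-UP",
    "entities": ["Bridgestone Sports Co.", "a local concern",
                 "a Japanese trading house"],
    "joint_venture_company": "Bridgestone Sports Taiwan Co.",
    "activity": activity_1,
    "amount": "NT$20,000,000",
}

# Following the link from the tie-up to its activity:
print(tie_up_1["activity"]["company"])  # Bridgestone Sports Taiwan Co.
```

Representing the cross-template link as an object reference (rather than copying fields) mirrors how the merging stage later unifies partial templates about the same event.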


  15. Automated Content Extraction • Objectives: • Extract information from texts of varying quality • Detect unique entities, events, and relations: • Find all entity mentions • Link mentions by entity • Track entities within and across documents • Output XML for downstream processes

  16. ACE entity and mention types

  17. ACE relation and event types

  18. Applications • Information gathering (intelligence tasks) • Question answering • Answer extraction from retrieved documents • Ontology induction • Improving indexing for IR

  19. IE task breakdown • Entities: • Identification: finding entity mentions • Classification: determining entity type • Normalization: standardizing entity mentions (e.g., identifying co-referring entity mentions) • Relations: • Association: identifying related entities and their relations

  20. Two approaches to IE • Knowledge-engineering approach • Grammar rules built by hand • Human expert generates domain-specific patterns through introspection and corpus work • Iterative process: build, test, evaluate errors, repeat • Data-driven approach • Use statistical methods • Learn recognizers and classifiers from annotated data where available • Leverage unannotated corpora, if possible, by bootstrapping

  21. Knowledge engineering • Advantages: • Conceptually straightforward • Best-performing systems still hand-built • Disadvantages: • Lots of human effort required • Human expertise also required • Not readily portable to new domains or languages

  22. Data-driven approach • Advantages: • Porting to new domains straightforward • Domain expertise not necessary • Good coverage is ensured • Disadvantages: • Training data may not exist or may be difficult to acquire • Changes in specification may require re-annotation of training data

  23. Which approach to use? • Use the hand-built, rule-based approach when: resources (esp. lexicons) are available; rule writers are available; training data is unavailable or hard to get; extraction specifications are subject to change; the highest possible performance is needed. • Use the data-driven approach when: resources are unavailable; rule writers are unavailable; training data is cheap and plentiful; extraction specifications are stable; good performance is good enough.

  24. Typical NLP tasks for IE • Tokenization • Finding word boundaries • Lexical lookup • Using domain lexicons w/type information, e.g., first-name lists, place-name lists, etc. • Part-of-speech tagging • POS tags provide generalization for later processes • Can be hand-built or machine-learned • Shallow parsing • Coreference resolution

  25. Shallow parsing: cascaded finite-state transducers • Limited linguistic analysis: • Grammar divided into levels (chunks and clauses) • Pipeline of finite-state recognizers/transducers • Robust: • Local decisions, no global optimization • Easy-first parsing • High-precision decisions • Attachment decisions can be indefinitely delayed • Time and space efficient • Deterministic search

  26. Natural Language Processing-based Information Extraction • If extracting from automatically generated web pages, simple regex patterns usually work. • If extracting from more natural, unstructured, human-written text, some NLP may help. • Part-of-speech (POS) tagging • Mark each word as a noun, verb, preposition, etc. • Syntactic parsing • Identify phrases: NP, VP, PP • Semantic word categories (e.g. from WordNet) • KILL: kill, murder, assassinate, strangle, suffocate • Extraction patterns can use POS or phrase tags. • Crime victim: • Prefiller: [POS: V, Hypernym: KILL] • Filler: [Phrase: NP]
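The "crime victim" pattern above (a KILL-class verb followed by a noun-phrase filler) can be sketched over pre-tagged tokens. The word list and the tiny tag set below are illustrative assumptions, not any particular system's lexicon:

```python
# Hand-rolled sketch of the "crime victim" extraction pattern: the pre-filler
# is a verb in the KILL semantic class, the filler is the following noun phrase.
KILL_VERBS = {"kill", "killed", "murder", "murdered", "assassinate",
              "assassinated", "strangle", "strangled"}

def crime_victims(tokens):
    """Return noun-phrase fillers that directly follow a KILL-class verb.

    tokens: list of (word, pos) pairs with toy tags V, N, Det, Adj.
    """
    victims = []
    for i, (word, pos) in enumerate(tokens):
        if pos == "V" and word.lower() in KILL_VERBS:
            # Collect the following noun phrase: determiners/adjectives/nouns.
            j, phrase = i + 1, []
            while j < len(tokens) and tokens[j][1] in ("Det", "Adj", "N"):
                phrase.append(tokens[j][0])
                j += 1
            if phrase:
                victims.append(" ".join(phrase))
    return victims

# Pre-tagged toy sentence: "Guerrillas killed the attorney general."
tagged = [("Guerrillas", "N"), ("killed", "V"),
          ("the", "Det"), ("attorney", "N"), ("general", "N")]
print(crime_victims(tagged))  # ['the attorney general']
```

A real system would get the tags from a POS tagger and chunker and the KILL class from a resource like WordNet; here both are hard-coded to keep the pattern-matching step visible.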




  30. FASTUS • Based on cascaded finite-state automata (FSA) transductions: • 1. Complex Words: recognition of multi-words and proper names (e.g., “set up”, “new Taiwan dollars”) • 2. Basic Phrases: simple noun groups, verb groups, and particles (e.g., “a Japanese trading house”, “had set up”) • 3. Complex Phrases: complex noun groups and verb groups (e.g., “production of 20,000 iron and metal wood clubs”) • 4. Domain Events: patterns for events of interest to the application (e.g., [company] [set up] [Joint-Venture] with [company]); basic templates are built • 5. Merging Structures: templates from different parts of the text are merged if they provide information about the same entity or event
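The cascade idea can be sketched as a pipeline of regex transducers, each rewriting the output of the previous stage. The two stages and their patterns below are toy assumptions standing in for FASTUS's hand-built rule sets:

```python
import re

# Toy sketch of a FASTUS-style cascade: stage 1 glues multi-words into single
# tokens; stage 2 brackets basic noun groups (NG) and verb groups (VG).
STAGES = [
    ("complex_words", [(r"\bset up\b", "set_up"),
                       (r"\bnew Taiwan dollars\b", "new_Taiwan_dollars")]),
    ("basic_phrases", [(r"\ba Japanese trading house\b",
                        "[NG a Japanese trading house]"),
                       (r"\bhad set_up\b", "[VG had set_up]")]),
]

def run_cascade(text):
    """Apply each stage's rewrite rules in order, feeding output forward."""
    for name, rules in STAGES:
        for pattern, replacement in rules:
            text = re.sub(pattern, replacement, text)
    return text

out = run_cascade(
    "Bridgestone had set up a joint venture with a Japanese trading house")
print(out)
# Bridgestone [VG had set_up] a joint venture with [NG a Japanese trading house]
```

Note how stage 2's verb-group rule can only fire because stage 1 already rewrote "set up" to "set_up"; that feeding of one level into the next is the essence of the cascade.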

  31. Grep++ = cascaded grepping • [Figure: finite automaton for noun groups, with PN/Art/ADJ/N/’s/P transitions, recognizing e.g. “John’s interesting book with a nice cover”.]

  32. Rule-based Extraction Examples Determining which person holds what office in what organization • [person] , [office] of [org] • Vuk Draskovic, leader of the Serbian Renewal Movement • [org] (named, appointed, etc.) [person] P [office] • NATO appointed Wesley Clark as Commander in Chief Determining where an organization is located • [org] in [loc] • NATO headquarters in Brussels • [org] [loc] (division, branch, headquarters, etc.) • KFOR Kosovo headquarters
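The first pattern above, "[person] , [office] of [org]", can be roughly rendered as a single regex. Real systems match over typed entity spans rather than raw text, so the capitalization heuristics below are simplifying assumptions:

```python
import re

# Rough regex rendering of the "[person] , [office] of [org]" pattern:
# a capitalized name, a comma, a lowercase office title, "of", an organization.
PERSON_OFFICE_ORG = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+), "          # person name
    r"(?P<office>[a-z]+(?: [a-z]+)*) of "                  # office title
    r"(?P<org>(?:the )?[A-Z][A-Za-z]+(?: [A-Z][a-z]+)*)"   # organization
)

m = PERSON_OFFICE_ORG.search(
    "Vuk Draskovic, leader of the Serbian Renewal Movement, spoke today.")
print(m.group("person"), "|", m.group("office"), "|", m.group("org"))
# Vuk Draskovic | leader | the Serbian Renewal Movement
```

Named capture groups make the template slots explicit; swapping in entity-tagged input (so "person" and "org" match typed spans instead of capitalization patterns) is what distinguishes a production rule system from this sketch.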

  33. IE with Hidden Markov Models

  34. Hidden Markov Models • HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, … • [Figure: graphical model with state transitions S(t-1) → S(t) → S(t+1) and emitted observations O(t-1), O(t), O(t+1).] • Generates: a state sequence and an observation sequence o1 o2 … o8 • Parameters, for all states S = {s1, s2, …}: • Start state probabilities: P(s1) • Transition probabilities: P(st | st-1) • Observation (emission) probabilities: P(ot | st), usually a multinomial over an atomic, fixed alphabet • Training: maximize the probability of the training observations (with a prior)
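The parameter set listed above (start, transition, and emission probabilities) can be written down directly and used to generate a state sequence plus an observation sequence. The two-state weather model and its probabilities are illustrative assumptions:

```python
import random

# Minimal HMM with the three parameter tables named on the slide.
random.seed(0)  # reproducible sampling
states = ["rain", "sun"]
start = {"rain": 0.5, "sun": 0.5}
trans = {"rain": {"rain": 0.7, "sun": 0.3},
         "sun":  {"rain": 0.2, "sun": 0.8}}
emit = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
        "sun":  {"umbrella": 0.1, "no_umbrella": 0.9}}

def sample(dist):
    """Draw one outcome from a {value: probability} table."""
    return random.choices(list(dist), weights=dist.values())[0]

def generate(length):
    """Generate (state sequence, observation sequence), as the slide describes."""
    s = sample(start)
    path, obs = [s], [sample(emit[s])]
    for _ in range(length - 1):
        s = sample(trans[s])          # P(st | st-1)
        path.append(s)
        obs.append(sample(emit[s]))   # P(ot | st)
    return path, obs

path, obs = generate(5)
print(path, obs)
```

The generative reading shown here (states emit observations) is exactly what the conditional models later in the deck give up in exchange for richer features.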

  35. Markov Property • The state of a system at time t+1, qt+1, is conditionally independent of {qt-1, qt-2, …, q1, q0} given qt. • In other words, the current state determines the probability distribution for the next state. • [Figure: three-state weather model, S1: rain, S2: cloud, S3: sun, with transition probabilities.]

  36. Markov Property • State-transition probabilities A for the three-state weather model (S1: rain, S2: cloud, S3: sun). • [Figure: transition diagram with matrix A.] • Q: given that today is sunny (i.e., q1 = 3), what is the probability of “sun-cloud” under the model?
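The "sun-cloud" question reduces to a product of transition probabilities: P(q2 = sun, q3 = cloud | q1 = sun) = A[sun][sun] × A[sun][cloud]. The matrix values below are assumed for illustration, since the slide's diagram did not survive the transcript:

```python
# Assumed state-transition matrix A for the three-state weather model.
A = {
    "rain":  {"rain": 0.0, "cloud": 1.0, "sun": 0.0},
    "cloud": {"rain": 1/3, "cloud": 0.0, "sun": 2/3},
    "sun":   {"rain": 0.0, "cloud": 1/2, "sun": 1/2},
}

# By the Markov property, the answer factors into one-step transitions:
# P(sun -> sun) * P(sun -> cloud).
p = A["sun"]["sun"] * A["sun"]["cloud"]
print(p)  # 0.25
```

The factorization itself, not the particular numbers, is the point: the Markov property lets any path probability be computed as a chain of one-step transitions.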

  37. Hidden Markov Model • [Figure: hidden state sequence over S1: rain, S2: cloud, S3: sun with transition probabilities, emitting the observation sequence O1–O5 with emission probabilities.]

  38. IE with Hidden Markov Models • Given a sequence of observations: “SI/EECS 767 is held weekly at SIN2.” • and a trained HMM with states such as course name, location name, and background, • find the most likely state sequence (Viterbi). • Any words said to be generated by the designated “course name” state are extracted as a course name: Course name: SI/EECS 767
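The decoding step described above can be sketched with a log-space Viterbi over a toy three-state model. The states, probabilities, and mini-vocabulary below are illustrative assumptions, not values from any trained system:

```python
import math

# Toy HMM: background text vs. course-name vs. location-name states.
states = ["background", "course", "location"]
start = {"background": 0.8, "course": 0.1, "location": 0.1}
trans = {"background": {"background": 0.7, "course": 0.2, "location": 0.1},
         "course":     {"background": 0.6, "course": 0.3, "location": 0.1},
         "location":   {"background": 0.8, "course": 0.1, "location": 0.1}}
emissions = {"background": {"is": 0.2, "held": 0.2, "weekly": 0.2, "at": 0.2},
             "course":     {"SI/EECS": 0.4, "767": 0.4},
             "location":   {"SIN2": 0.8}}

def emit(state, word):
    return emissions[state].get(word, 0.01)  # smoothed unseen-word probability

def viterbi(words):
    """Most likely hidden-state sequence, computed in log space."""
    V = [{s: math.log(start[s] * emit(s, words[0])) for s in states}]
    back = []
    for word in words[1:]:
        row, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][best] + math.log(trans[best][s] * emit(s, word))
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

words = "SI/EECS 767 is held weekly at SIN2".split()
tags = viterbi(words)
print("Course name:", " ".join(w for w, t in zip(words, tags) if t == "course"))
```

With these toy parameters the decoder tags "SI/EECS 767" with the course-name state and "SIN2" with the location state, matching the slide's intended extraction.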

  39. Named Entity Extraction [Bikel et al., 1998] • Hidden states: Person, Org, five other name classes, and Other, plus start-of-sentence and end-of-sentence states.

  40. Named Entity Extraction • Transition probabilities: P(st | st-1, ot-1) • Observation probabilities: P(ot | st, st-1) or P(ot | st, ot-1) • Word generation: (1) generating the first word of a name class, (2) generating the rest of the words in the name class, (3) generating “+end+” in a name class

  41. HMM: Experimental Results • Train on ~500k words of newswire text. • Results: [table not preserved in the transcript]

  42. Learning HMM for IE [Seymore, 1999] Consider labeled, unlabeled, and distantly-labeled data

  43. Some Issues with HMM • Need to enumerate all possible observation sequences • Not practical to represent multiple interacting features or long-range dependencies of the observations • Very strict independence assumptions on the observations

  44. We Want More than an Atomic View of Words • Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.: • identity of word • ends in “-ski” • is capitalized • is part of a noun phrase • is in a list of city names • is under node X in WordNet • is in bold font • is indented • is in hyperlink anchor • last person name was female • next two words are “and Associates” • [Figure: HMM states S(t-1), S(t), S(t+1); the observation at time t carries features such as: is “Wisniewski”, part of noun phrase, ends in “-ski”.]
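Several of the overlapping features listed above can be sketched as a simple feature-extractor function; the feature names and the choice of features are illustrative:

```python
# Extract a few of the slide's overlapping word features at position t.
def word_features(words, t):
    w = words[t]
    feats = {
        "identity=" + w.lower(): 1,                 # identity of word
        "ends_in_ski": int(w.endswith("ski")),      # ends in "-ski"
        "is_capitalized": int(w[:1].isupper()),     # is capitalized
    }
    if t > 0:
        feats["prev=" + words[t - 1].lower()] = 1   # context feature
    return feats

feats = word_features(["Mr.", "Wisniewski", "spoke"], 1)
print(sorted(k for k, v in feats.items() if v))
# ['ends_in_ski', 'identity=wisniewski', 'is_capitalized', 'prev=mr.']
```

The point the slide is making is visible here: "identity=wisniewski", "ends_in_ski", and "is_capitalized" all fire on the same token, so they are overlapping and dependent, which is exactly what a generative HMM emission model handles poorly.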

  45. Maximum Entropy Markov Models [Lafferty, 2001] • Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations. • [Figure: the same state sequence, with each state conditioned on overlapping observation features: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, ….] • Courtesy of William W. Cohen

  46. Problems with a Richer Representation and a Generative Model • These arbitrary features are not independent: • Multiple levels of granularity (chars, words, phrases) • Multiple dependent modalities (words, formatting, layout) • Past & future • Two choices: • Ignore the dependencies. This causes “over-counting” of evidence (a la naive Bayes). Big problem when combining evidence, as in Viterbi! • Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!

  47. MEMM • Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history. • [Figure: state sequence with arrows into each state from both the previous state and the overlapping observation features (identity of word, ends in “-ski”, is capitalized, etc.).] • Courtesy of William W. Cohen

  48. HMM vs. MEMM • [Figure: the two graphical models side by side. HMM: St-1 → St → St+1, with each state St generating its observation Ot. MEMM: St-1 → St → St+1, with each observation Ot conditioning its state St.]

  49. Conditional Sequence Models • We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o): • Can examine features, but not responsible for generating them. • Don’t have to explicitly model their dependencies. • Don’t “waste modeling effort” trying to generate what we are given at test time anyway.
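The conditional idea can be sketched at a single position: score each state as a weighted sum of observation features and normalize with a softmax, so P(s|o) ∝ exp(w_s · f(o)). The weights and features below are illustrative assumptions, not trained values:

```python
import math

# Per-state feature weights for a tiny maxent-style classifier.
weights = {
    "person":     {"is_capitalized": 2.0, "ends_in_ski": 1.5},
    "background": {"is_capitalized": -0.5, "ends_in_ski": -1.0},
}

def p_state_given_obs(features):
    """P(s | o) via softmax over linear feature scores: exp(w_s . f) / Z."""
    scores = {s: sum(w.get(f, 0.0) * v for f, v in features.items())
              for s, w in weights.items()}
    z = sum(math.exp(x) for x in scores.values())
    return {s: math.exp(x) / z for s, x in scores.items()}

probs = p_state_given_obs({"is_capitalized": 1, "ends_in_ski": 1})
print(probs["person"] > probs["background"])  # True
```

Nothing here generates the observation; the features are simply examined and weighted, which is exactly the "not responsible for generating them" property the slide highlights. A MEMM extends this by adding the previous state as one more feature; a CRF normalizes over whole sequences instead of per position.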

  50. Conditional Markov Models (CMMs) vs. HMMs • [Figure: the two graphical models, as on the HMM vs. MEMM slide.] • Lots of ML ways to estimate Pr(y | x).
