
NATURAL LANGUAGE PROCESSING

NATURAL LANGUAGE PROCESSING. SEMANTICS: NAMED ENTITIES, RELATIONS. MODERN SEMANTICS. Its basic subtasks: Classification of entities: NAMED ENTITY RECOGNITION (and classification). Recognition of predicates and their arguments: RELATION EXTRACTION.

zyta


Presentation Transcript


  1. NATURAL LANGUAGE PROCESSING. SEMANTICS: NAMED ENTITIES, RELATIONS

  2. MODERN SEMANTICS • Its basic subtasks • Classification of entities: NAMED ENTITY RECOGNITION (and classification) • Recognition of predicates and their arguments: RELATION EXTRACTION

  3. Named Entity Recognition (NER) • Input: Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne. • Output: Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne.

  4. Named Entity Recognition (NER) • Locate and classify atomic elements in text into predefined categories (persons, organizations, locations, temporal expressions, quantities, percentages, monetary values, …) • Input: a block of text • Jim bought 300 shares of Acme Corp. in 2006. • Output: annotated block of text • <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX> • ENAMEX tags (MUC in the 1990s)
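The annotated output on this slide can be reproduced mechanically once the mentions and their types are known. The following is a minimal sketch, not a recogniser: the mention list is written by hand, and the function name `annotate` and the `Mention` tuple layout are assumptions made for this illustration.

```python
from typing import List, Tuple

# (surface string, element name, TYPE value); hand-written for this example.
Mention = Tuple[str, str, str]

def annotate(text: str, mentions: List[Mention]) -> str:
    """Wrap the first occurrence of each mention in a MUC-style tag."""
    spans = []
    for surface, element, etype in mentions:
        start = text.index(surface)  # raises ValueError if the mention is absent
        spans.append((start, start + len(surface), element, etype))
    # Insert tags right to left so that earlier offsets stay valid.
    for start, end, element, etype in sorted(spans, reverse=True):
        tagged = f'<{element} TYPE="{etype}">{text[start:end]}</{element}>'
        text = text[:start] + tagged + text[end:]
    return text

text = "Jim bought 300 shares of Acme Corp. in 2006."
mentions = [
    ("Jim", "ENAMEX", "PERSON"),
    ("300", "NUMEX", "QUANTITY"),
    ("Acme Corp.", "ENAMEX", "ORGANIZATION"),
    ("2006", "TIMEX", "DATE"),
]
print(annotate(text, mentions))
```

The hard part of NER is producing the mention list itself; the tagging step shown here is only the output format.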

  5. THE STANDARD NEWS DOMAIN • Most work on NER focuses on • NEWS • Variants of the repertoire of entity types first studied in MUC and then in ACE: • PERSON • ORGANIZATION • GPE • LOCATION • TEMPORAL ENTITY • NUMBER

  6. HOW • Two tasks: • Identifying the part of the text that mentions an entity (RECOGNITION) • Classifying it (CLASSIFICATION) • The two tasks are reduced to a standard classification task by having the system classify WORDS

  7. Basic Problems in NER • Variation of NEs – e.g. John Smith, Mr Smith, John. • Ambiguity of NE types • John Smith (company vs. person) • May (person vs. month) • Washington (person vs. location) • 1945 (date vs. time) • Ambiguity with common words, e.g. “may”

  8. Problems in NER • Category definitions are intuitively quite clear, but there are many grey areas. • Many of these grey areas are caused by metonymy. • Organisation vs. Location: “England won the World Cup” vs. “The World Cup took place in England” • Company vs. Artefact: “shares in MTV” vs. “watching MTV” • Location vs. Organisation: “she met him at Heathrow” vs. “the Heathrow authorities”

  9. Approaches to NER: List Lookup • System that recognises only entities stored in its lists (GAZETTEERS). • Advantages - Simple, fast, language independent, easy to retarget • Disadvantages – collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
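A minimal sketch of list lookup follows. The gazetteer entries, the type labels and the function name `lookup` are illustrative assumptions, not taken from a real resource; matching is greedy and prefers the longest entry.

```python
from typing import Dict, List, Tuple

# Toy gazetteer keyed by token tuples; entries are illustrative only.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Steve", "Jobs"): "PERSON",
    ("Cupertino",): "LOCATION",
    ("Apple", "Inc."): "ORGANIZATION",
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)

def lookup(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Greedy longest-match lookup; returns (start, end, type) token spans."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            etype = GAZETTEER.get(tuple(tokens[i:i + n]))
            if etype:
                found.append((i, i + n, etype))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found

tokens = "Apple Inc. was founded by Steve Jobs in Cupertino".split()
print(lookup(tokens))
```

The listed disadvantages show up directly: a variant such as "Mr Jobs" is not in the list and is missed, and an ambiguous entry can only ever receive the single type stored for it.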

  10. Approaches to NER: Shallow Parsing • Names often have internal structure. These components can be either stored or guessed. • Location: CapWord + {City, Forest, Center}, e.g. Sherwood Forest • CapWord + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
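The internal-structure patterns on this slide can be approximated with a regular expression. This is a sketch under a simplifying assumption: CapWord is taken to be one capitalised alphabetic word, which is narrower than what a real system would accept. The names `KEYWORDS` and `find_locations` are hypothetical.

```python
import re

# Trigger words copied from the slide.
KEYWORDS = ["City", "Forest", "Center",
            "Street", "Boulevard", "Avenue", "Crescent", "Road"]

# CapWord followed by one trigger word.
LOCATION = re.compile(r"\b[A-Z][a-z]+ (?:" + "|".join(KEYWORDS) + r")\b")

def find_locations(text: str) -> list:
    """Return every 'CapWord Keyword' match in reading order."""
    return LOCATION.findall(text)

print(find_locations("They walked from Sherwood Forest to Portobello Street."))
```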

  11. Shallow Parsing Approach (e.g., Mikheev et al. 1998) • External evidence: names are often used in very predictive local contexts • Location: “to the” COMPASS “of” CapWord, e.g. to the south of Loitokitok • “based in” CapWord, e.g. based in Loitokitok • CapWord “is a” (ADJ)? GeoWord, e.g. Loitokitok is a friendly city
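The three context patterns can be sketched as regular expressions. Several pieces are assumptions made for this illustration: the COMPASS and GeoWord word lists, the treatment of the optional ADJ as "any single word" (there is no part-of-speech tagger here), and the function name `locations_from_context`.

```python
import re

CAP = r"([A-Z][a-z]+)"                     # CapWord, captured
COMPASS = r"(?:north|south|east|west)"     # assumed COMPASS list
GEO = r"(?:city|town|village|river)"       # assumed GeoWord list

PATTERNS = [
    re.compile(rf"\bto the {COMPASS} of {CAP}"),
    re.compile(rf"\bbased in {CAP}"),
    # (ADJ)? is approximated by one optional word of any kind.
    re.compile(rf"\b{CAP} is a (?:\w+ )?{GEO}\b"),
]

def locations_from_context(text: str) -> list:
    """Collect the CapWord captured by each context pattern."""
    hits = []
    for pattern in PATTERNS:
        hits.extend(pattern.findall(text))
    return hits

for example in ["to the south of Loitokitok",
                "based in Loitokitok",
                "Loitokitok is a friendly city"]:
    print(example, "->", locations_from_context(example))
```

Unlike list lookup, these patterns can label a name the system has never seen, because the evidence comes from the surrounding words rather than from the name itself.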

  12. Machine learning approaches to NER • NER as classification: the IOB representation • Supervised methods • Support Vector Machines • Logistic regression (aka Maximum Entropy) • Sequence pattern learning • Hidden Markov Models • Conditional Random Fields • Distant learning • Semi-supervised methods

  13. THE ML APPROACH TO NE: THE IOB REPRESENTATION
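The body of this slide did not survive in the transcript, so the following is a sketch of the general idea rather than the slide's own example. Each token receives a tag: B-TYPE at the beginning of an entity, I-TYPE inside one, O outside. This is the variant often called IOB2; the CoNLL 2003 sample on slide 22 uses I- tags at entity starts, so the exact convention differs between datasets. The function name `to_iob` and the span format are assumptions.

```python
from typing import List, Tuple

def to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, type) token spans (end exclusive) to per-token tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
spans = [(0, 1, "PER"), (5, 7, "ORG")]
for token, tag in zip(tokens, to_iob(tokens, spans)):
    print(token, tag)
```

With this encoding, recognition and classification collapse into one decision per word, which is what lets standard classifiers and sequence models be applied.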

  14. THE ML APPROACH TO NE: FEATURES

  15. FEATURES

  16. FEATURES

  17. Supervised ML for NER • Methods already seen • Decision trees • Support Vector Machines • Sequence pattern learning (also supervised) • Hidden Markov Models • Maximum Entropy Models • Conditional Random Fields

  18. EVALUATION
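The content of this slide is missing from the transcript. A common scheme, and the one used in the CoNLL 2003 shared task, counts a predicted entity as correct only if both its boundaries and its type match the gold annotation exactly. A minimal sketch, with the function name and span format assumed for illustration:

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, type), end exclusive

def precision_recall(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float]:
    """Exact-match scoring: a span counts only if boundaries and type agree."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold = {(0, 1, "PER"), (5, 7, "ORG"), (8, 9, "DATE")}
predicted = {(0, 1, "PER"), (5, 6, "ORG")}  # second span has a wrong boundary
print(precision_recall(gold, predicted))
```

In the example, one of two predictions is right (precision 1/2) and one of three gold entities is found (recall 1/3); the partially overlapping ORG span earns no credit.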

  19. TYPICAL PERFORMANCE

  20. NER Evaluation Campaigns • English NER: CoNLL 2003 (PER/ORG/LOC/MISC) • Training set: 203,621 tokens • Development set: 51,362 tokens • Test set: 46,435 tokens • Italian NER: Evalita 2009 (PER/ORG/LOC/GPE) • Development set: 223,706 tokens • Test set: 90,556 tokens • Mention Detection: ACE 2005 • 599 documents

  21. CoNLL2003 shared task (1) • English and German languages • 4 types of NEs: • LOC Location • MISC Names of miscellaneous entities • ORG Organization • PER Person • Training Set for developing the system • Test Data for the final evaluation

  22. CoNLL2003 shared task (2) • Data • Columns separated by a single space • A word for each line • An empty line after each sentence • Tags in IOB format • An example:
      Milan NNP B-NP I-ORG
      's POS B-NP O
      player NN I-NP O
      George NNP I-NP I-PER
      Weah NNP I-NP I-PER
      meet VBP B-VP O
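A minimal reader for the four-column format described on this slide might look as follows. The function names `read_conll` and `entities` are assumptions for this sketch; the sample is the slide's own example, and entity grouping follows the tagging it shows (consecutive tokens with the same type form one entity, and a B- tag starts a new one).

```python
from typing import List, Tuple

Row = Tuple[str, str, str, str]  # word, POS tag, chunk tag, NE tag

def read_conll(text: str) -> List[List[Row]]:
    """Split the text into sentences on empty lines, one row per word."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split(" ")
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences

def entities(sentence: List[Row]) -> List[Tuple[str, str]]:
    """Group consecutive same-type tokens into (text, type) entities."""
    found, words, etype = [], [], None
    for word, _pos, _chunk, ner in sentence + [("", "", "", "O")]:  # sentinel flushes the last entity
        tag_type = ner[2:] if ner != "O" else None
        if words and (tag_type != etype or ner.startswith("B-")):
            found.append((" ".join(words), etype))
            words = []
        if tag_type:
            words.append(word)
        etype = tag_type
    return found

sample = """Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O

"""
for sentence in read_conll(sample):
    print(entities(sentence))
```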

  23. CoNLL2003 shared task (3) • Results for English:
      System      Precision   Recall    F
      [FIJZ03]    88.99%      88.54%    88.76%
      [CN03]      88.12%      88.51%    88.31%
      [KSNM03]    85.93%      86.21%    86.07%
      [ZJ03]      86.13%      84.88%    85.50%
      ---------------------------------------------------
      [Ham03]     69.09%      53.26%    60.15%
      baseline    71.91%      50.90%    59.61%
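The F column in these results is the harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes it from the reported precision and recall; the function name `f_score` is an assumption for this sketch, and the figures are copied from the slide.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F) in percent, as listed on the slide.
results = {
    "FIJZ03": (88.99, 88.54, 88.76),
    "CN03": (88.12, 88.51, 88.31),
    "KSNM03": (85.93, 86.21, 86.07),
    "ZJ03": (86.13, 84.88, 85.50),
    "Ham03": (69.09, 53.26, 60.15),
    "baseline": (71.91, 50.90, 59.61),
}
for system, (p, r, reported) in results.items():
    print(f"{system}: computed {f_score(p, r):.2f}, reported {reported:.2f}")
```

Because the harmonic mean is pulled towards the lower of the two values, the baseline's high precision does little to offset its low recall.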

  24. CURRENT RESEARCH ON NER • New domains • New approaches: • Semi-supervised • Distant • Handling many NE types • Integration with Machine Translation • Handling difficult linguistic phenomena such as metonymy

  25. NEW DOMAINS • BIOMEDICAL • CHEMISTRY • HUMANITIES: MORE FINE GRAINED TYPES

  26. Bioinformatics Named Entities • Protein • DNA • RNA • Cell line • Cell type • Drug • Chemical

  27. NER IN THE HUMANITIES SITE LOC CULTURE

  28. SEMANTIC INTERPRETATION 2: FROM SENTENCES TO PROPOSITIONS • Surface variants: Powell met Zhu Rongji / Powell and Zhu Rongji met / Powell met with Zhu Rongji / Powell and Zhu Rongji had a meeting • Proposition: meet(Powell, Zhu Rongji) • General pattern: meet(Somebody1, Somebody2) • Other verbs shown: battle, wrestle, join, debate, consult . . . • When Powell met Zhu Rongji on Thursday they discussed the return of the spy plane. • meet(Powell, Zhu) • discuss([Powell, Zhu], return(X, plane))

  29. OTHER ASPECTS OF SEMANTIC INTERPRETATION • Identification of RELATIONS between entities mentioned • Focus of interest in modern CL since 1993 or so • Identification of TEMPORAL RELATIONS • From about 2003 on • QUALIFICATION of such relations (modality, epistemicity) • From about 2010 on

  30. TYPES OF RELATIONS • Predicate-argument structure (verbs and nouns) • John kicked the ball • Nominal relations • The red ball • Relations between events / temporal relations • John kicked the ball and scored a goal

  31. PREDICATE-ARGUMENT STRUCTURE • Linguistic Theories • Case Frames – Fillmore → FrameNet • Lexical Conceptual Structure – Jackendoff → LCS • Proto-Roles – Dowty → PropBank • English verb classes (diathesis alternations) – Levin → VerbNet • Talmy, Levin and Rappaport

  32. Fillmore’s Case Theory • Sentences have a DEEP STRUCTURE with CASE RELATIONS • A sentence is a verb + one or more NPs • Each NP has a deep-structure case • A(gentive) • I(nstrumental) • D(ative) • F(actitive) • L(ocative) • O(bjective) • Subject is no more important than Object • Subject/Object are surface structure

  33. THEMATIC ROLES • Following on from Fillmore’s original work, many theories of predicate-argument structure / thematic roles were proposed, the best known of which are perhaps • Jackendoff’s LEXICAL CONCEPTUAL SEMANTICS • Dowty’s PROTO-ROLES theory

  34. Dowty’s PROTO-ROLES • Event-dependent • Prototypes based on shared entailments • Grammatical relations such as subject related to observed (empirical) classification of participants • Typology of grammatical relations • Proto-Agent • Proto-Patient

  35. Proto-Agent • Properties • Volitional involvement in event or state • Sentience (and/or perception) • Causing an event or change of state in another participant • Movement (relative to position of another participant) • (exists independently of event named) *may be discourse pragmatic

  36. Proto-Patient • Properties: • Undergoes change of state • Incremental theme • Causally affected by another participant • Stationary relative to movement of another participant • (does not exist independently of the event, or at all) *may be discourse pragmatic
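In Dowty's account, the participant with the most Proto-Agent entailments tends to be realised as subject, and the one with the most Proto-Patient entailments as direct object. The sketch below illustrates that counting idea only. The property identifiers are shorthand for the properties listed on the two slides, and the property sets given to each participant are hand-assigned assumptions for the example sentence "Jan broke the LCD projector".

```python
PROTO_AGENT = {"volitional", "sentient", "causer", "moving", "independent"}
PROTO_PATIENT = {"changes_state", "incremental_theme", "causally_affected",
                 "stationary", "dependent"}

def select_arguments(participants: dict) -> tuple:
    """Pick (subject, object) by counting proto-role properties."""
    def agent_score(name):
        return len(participants[name] & PROTO_AGENT)

    def patient_score(name):
        return len(participants[name] & PROTO_PATIENT)

    subject = max(participants, key=agent_score)
    obj = max((n for n in participants if n != subject), key=patient_score)
    return subject, obj

# Hand-assigned entailments for "Jan broke the LCD projector".
participants = {
    "Jan": {"volitional", "sentient", "causer", "independent"},
    "the LCD projector": {"changes_state", "causally_affected", "independent"},
}
print(select_arguments(participants))
```

Ties are broken arbitrarily here (by dictionary order); the theory itself allows either realisation in that case.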

  37. Semantic role labels: Jan broke the LCD projector. • Fillmore, 68: break(agent(Jan), patient(LCD-projector)) • Jackendoff, 72: cause(agent(Jan), change-of-state(LCD-projector)) (broken(LCD-projector)) • Dowty, 91: agent(A) -> intentional(A), sentient(A), causer(A), affector(A); patient(P) -> affected(P), change(P),…

  38. VERBNET AND PROPBANK • Dowty’s theory of proto-roles was the basis for the development of PROPBANK, the first corpus annotated with information about predicate-argument structure

  39. PROPBANK REPRESENTATION • a GM-Jaguar pact that would give the U.S. car maker an eventual 30% stake in the British company. • Predicate: that would give • Arg0: *T*-1 (the trace standing for “a GM-Jaguar pact”) • Arg1: an eventual 30% stake in the British company • Arg2: the US car maker • give(GM-J pact, US car maker, 30% stake)

  40. ARGUMENTS IN PROPBANK • Arg0 = agent • Arg1 = direct object / theme / patient • Arg2 = indirect object / benefactive / instrument / attribute / end state • Arg3 = start point / benefactive / instrument / attribute • Arg4 = end point • Per word vs frame level – more general?
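A PropBank-style proposition can be held in a small record that maps numbered arguments to text spans. This is an illustrative data structure, not PropBank's file format: the class name `Proposition` and the `ROLE_GLOSS` table are assumptions, with the glosses copied from the list above and the example taken from the GM-Jaguar sentence.

```python
from dataclasses import dataclass
from typing import Dict

# Typical readings of the numbered arguments, as listed on the slide.
ROLE_GLOSS = {
    "Arg0": "agent",
    "Arg1": "direct object / theme / patient",
    "Arg2": "indirect object / benefactive / instrument / attribute / end state",
    "Arg3": "start point / benefactive / instrument / attribute",
    "Arg4": "end point",
}

@dataclass
class Proposition:
    predicate: str
    args: Dict[str, str]

    def render(self) -> str:
        """Show the arguments in Arg0, Arg1, ... order."""
        parts = [f"{label}={self.args[label]}" for label in sorted(self.args)]
        return f"{self.predicate}({', '.join(parts)})"

give = Proposition("give", {
    "Arg0": "a GM-Jaguar pact",
    "Arg2": "the US car maker",
    "Arg1": "an eventual 30% stake in the British company",
})
print(give.render())
for label in sorted(give.args):
    print(label, "~", ROLE_GLOSS[label])
```

The numbered labels are defined per verb, which is the point behind the slide's closing question about word level versus frame level.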

  41. FROM PREDICATES TO FRAMES In one of its senses, the verb observe evokes a frame called Compliance: this frame concerns people’s responses to norms, rules or practices. The following sentences illustrate the use of the verb in the intended sense: • Our family observes the Jewish dietary laws. • You have to observe the rules or you’ll be penalized. • How do you observe Easter? • Please observe the illuminated signs.

  42. FrameNet • FrameNet records information about English words in the general vocabulary in terms of • the frames (e.g. Compliance) that they evoke, • the frame elements (semantic roles) that make up the components of the frames (in Compliance, Norm is one such frame element), and • each word’s valence possibilities, the ways in which information about the frames is provided in the linguistic structures connected to them (with observe, Norm is typically the direct object).
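The three kinds of information FrameNet records (frames, frame elements, and the words that evoke a frame) can be pictured as a small data model. The classes below are a hypothetical sketch and do not reflect FrameNet's actual data format or API; only the frame name Compliance, the frame element Norm, the verb observe and the example sentence come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Frame:
    name: str
    elements: List[str]                       # frame elements (semantic roles)
    lexical_units: List[str] = field(default_factory=list)

@dataclass
class Annotation:
    sentence: str
    target: str                               # the frame-evoking word form
    frame: Frame
    roles: Dict[str, str]                     # frame element -> text span

    def is_consistent(self) -> bool:
        """Every role must belong to the frame and every span must occur in the sentence."""
        return (all(role in self.frame.elements for role in self.roles)
                and all(span in self.sentence for span in self.roles.values()))

compliance = Frame("Compliance", elements=["Norm"], lexical_units=["observe"])
example = Annotation(
    sentence="Our family observes the Jewish dietary laws.",
    target="observes",
    frame=compliance,
    roles={"Norm": "the Jewish dietary laws"},  # Norm is the direct object here
)
print(example.is_consistent())
```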

  43. NOMINAL RELATIONS

  44. CLASSIFICATION SCHEMES FOR NOMINAL RELATIONS

  45. ONE EXAMPLE (Barker et al. 1998, Nastase & Szpakowicz 2003)

  46. THE TWO-LEVEL TAXONOMY OF RELATIONS, 2

  47. THE SEMEVAL-2007 CLASSIFICATION OF RELATIONS • Cause-Effect: laugh wrinkles • Instrument-Agency: laser printer • Product-Producer: honey bee • Origin-Entity: message from outer-space • Theme-Tool: news conference • Part-Whole: car door • Content-Container: the air in the jar

  48. THE MUC AND ACE TASKS • Modern research in relation extraction was also kicked off by the Message Understanding Conference (MUC) campaigns and continued through the Automatic Content Extraction (ACE) and Machine Reading follow-ups • MUC: NE, coreference, TEMPLATE FILLING • ACE: NE, coreference, relations

  49. TEMPLATE-FILLING

  50. EXAMPLE MUC: JOB POSTING
