Feiyu Xu & Hans Uszkoreit & Hong Li

Automatic Event and Relation Detection with Seeds of Varying ComplexityDFKI, Saarbrücken, Germany Feiyu Xu & Hans Uszkoreit & Hong Li

Overview • Motivation • Approach and Realization • Seed construction • Learning algorithm • Rule compression and validation • Evaluation • Conclusion and Future Work

General Motivation • Development of generic strategy for extracting relations/events from large collections of open-domain free texts • Extracted patterns/rules can be used for • online information extraction • online question answering tasks • Question analysis • Answer extraction • Extracted relations/events can be used • to fill fact-DBs for simplefact/list questions • as annotations of free texts

Our Task • Find a generalunsupervised learning approach to relation and event extraction • Investigation of seed construction • Investigation of rule presentation of event extraction • Automatic detection of event extents, event triggers and event arguments • A uniform framework for relation and event extraction

Challenges for Event Extraction • Events are more complex than binary relations • Their slot fillers are distributed often in more than one sentence • A single linguistic pattern is often not sufficient to extract an event • Methods are needed for merging of the elementary extraction rules to one complex event extraction rule • Methods are needed for merging of the extracted fragments to one event

Challenges for Automatic Learning • Lack of annotated data • Uncertainty about needed number and complexity of seeds • Uncertainty about required depth in NLP • Rule induction and validation • Evaluation: precision, completeness and recall

Basic Idea • Systematic n-ary Seed Construction • Bootstrapping Approach • Rule Induction, Ranking, Validation

Input • A corpus of unclassified and unannotated documents • Seed of event instances

NLP as Preprocessing • Named Entitiy Recognition • SProUT (Drozdzyski et al, 2004) • Robust Dependency Structure Parsing • Minipar (Lin, 1998)

Named Entity and Relation Recognition • SProUT (Drozdzyski et al, 2004) • sprout.dfki.de • a platform for development of multilingual shallow text processing and information extraction systems • combines finite-state techniques and unification-based formalisms • integrated grammar development and testing environment • Unary Relation (named entity) recognition prize name, prize area, person, organization, time • Binary or even 3-ary relations in a phrase with keyword “Nobel” • the 1996 Nobel Physic Prize, Nobel poet, Nobel chemist, Nobel Peace Prize

Algorithm • Given • A large corpus of un-annotated andun-classified documents • A trusted set of event instances,initially chosen ad hoc by the user, the seed, normally, one or two. • Partition • Apply seeds to the documents and dividethem into relevant and irrelevant documentsA document is relevant, if its text fragments contain a minimal number of event arguments of a seed ‘and the distance among the arguments does not exceed three sentences • Pattern extraction and rule induction/compression • Annotate the relevant document fragments with named entities and dependency structures • Extract patterns • Rule induction/compression • Apply induced rules to the same document set • Rank rules, documents and new seeds • Stop if no new rules and seeds can be found, else repeat 2-6

Workflow irrelevant documents documents partition/classifier pattern extraction rule induction ranking relevant documents seeds new seeds Named Entity Recognition Dependency Parser

Duality/Density Principle (DD Principle) • Density: • Relevant documents contain more relevant patterns and relations and events • Duality: • documents that are relevant to the scenario are strong indicators of good patterns and good seeds • good patterns are indicators of relevant documents • good seeds are indicators of relevant documents based on (Yangarber, 2001)

Relevance Ranking based on DD Principle • Rule ranking: rank(r) = dd(r) * log sum(r) * rank(seed) • New seed ranking: rank(seed) = max{rank(r) | r R(seed)} R(seed) is the set of accepted rules that extracted the seed. The initial seed has the rank 1. • Document ranking: rank(d) = 1 (1  dd(r)) rK(d) K(d) is the set of accepted rules that successfully applied in d. In the first loop, rank(d) is either 0 or 1. 1 dd(r) = sum(r) =  rank(d) sum(r) |D| dD(r) D=D(r) are documents where the rule r matched.

Experiment Domain • Prize winning events • Nobel Prize • Background • Nice redundancy of event distribution in news texts • Evaluation data for precision and completeness available • Completeness of Nobel Prize winner list online available • Large amount of news texts available online • Adaptation to other prize winning events

Domain • Training Domain (currently) NOBEL PRIZE • Planned Test Domain other PRIZES

Learning Data

Domain Modeling • Application/specific domain can be modelled as • a semantic model containing • Entities • Attributes • Relations • Events • realized as a specific ontology, which can be integrated or embedded in a general ontology, e.g., SUMO • Nobel Prize Domain • Entities: person, location, area, date time, achievement, country, … • Attributes: nationality, gender, surname, birthday, … • Relations: person-position, person-affiliation, … • Events: receive-Nobel-Prize, announce-Nobel-Prize, nominate-Nobel-Prize

Relations as Concepts in an Ontology Receive-Nobel-Prize reason : achievement (accomplishment, service, skills, ...) prize : nobel prize area : nobel_prize_area (medicine, physics, literature, peace, ...) year: year recipient: cognitive-agent location: place recipient: person recipient:organization

From Query Analysis to Database Query Who won the Nobel Prize in Chemistry in 2001? award: [Recipient: ? Year: 2001 Area: chemistry] select person from person_prize_winner where year=„2001“ and area=„chemistry“ select organization from organization_prize_winner where year=„2001“ and area=„chemistry“

Modelling Ontology with SUMO, WordNet

N-ary to binary Relations

Research Issues • Evaluation of the interaction between seed construction and performance • Merging/fusion of segments of an event (binary relation instances) to a complete event instance

Two Approaches to Seed Construction • Pattern-oriented (e.g., ExDisco (Yangarber 2001)) • too closely bound to the linguistic representation of the seed, e.g., subject(company) v(“appoint”) object(person) • An event can be expressed by more than one patterns • Relation and event instances as seeds (e.g., DIPRE (Brin 1998) and Snowball (Agichtein and Gravano 2000)) • domain independence: it can be applied to all relation and event instances • flexibility of the relation and event complexity: it allows n-ary relations and events • processing independence: the seeds can lead to patterns in different processing modules, thus also supporting hybrid systems, voting approaches etc. • Not limited to a sentence as an extraction unit

Functionalities of Seed Relations and Events • detection of relevant sentences, which describe the seed events. These relevant sentences can be used as potential event extent • detection of relevant linguistic expressions, which can be used as event/relation triggers • learning patterns and their interaction rules for event extraction

Event Example (NYT16) NEW YORK -- Oct. 13, 1998 -- SCI-NOBEL-PHYSICS-CHEMISTRY, 10-13 – The Nobel Prizes in Physics and Chemistry were announcedTuesday by the Royal Swedish Academy of Sciences. Dr. Horst Stoermer, 49, a German-born professor who works at both Columbia University in New York and at Bell Laboratories in Murray Hill, N.J., is one of the three winners of the physicsprize. (Suzanne DeChillo/New York Times Photo) < receiving_nobel_prize,Dr. Horst Stoermer, Nobel, Physics, 1998>

Seed Construction • Assume, we wanted to ask questions about paintings • Assume, we wanted to ask for instance for dates of creation of paintings • Paul Cezanne, Bathing, 1892-94 • Paul Cezanne, Card Players, 1892-93 • Paul Cezanne, Two Card Players, 1892-93

More Examples • Paul Gauguin, Poplars, 1883 • Paul Gauguin, Sea Coast, 1887 • Paul Gauguin, Woman with a Red Dress, 1891

Seed Selection • Should the seeds be the names of paintings together with the year? Probably not! • Painter and year? No! • the seeds could be either <painter, painting, years> • or just painter and painting

Seed Selection • frequency of mentionings in relevant and irrelevant documents • <painter> • <painter, painting> • <painter, year> • <painting, year> • <painter, painting, year> • independence hypothesis: the choice of paraphrases for expressing the relation and event does not depend on the particular instances of the relation

frequency relation instances • Zipf Assumption: We assume that the frequency distributions of both relation instances and mention paraphrases follow Zipf's law

{ <nobel, physics>, <nobel, Laughlin>, <nobel, Störmer>, <nobel, Tsui>, <physics, Laughlin>, <Laughlin, 1998>, <nobel, physics>, . . . } Seed Construction for Pattern Learning receiving_nobel_prize event : event award: nobel_prize area : physics year: 1998 recipient: Laughlin co-recipients: Störmer, Tsui

{ <nobel, physics, Laughlin,1998> <nobel, physics, {Laughlin, Störmer, Tsui}, 1998>, . . . } Seed Construction for Pattern Learning receiving_nobel_prize event : event award: nobel_prize area : physics year: 1998 recipient: Laughlin co-recipients: Störmer, Tsui N-ary Complex Event seed

Event Instance as Seed Here a relation-seed is a 4-ary tuple of 4 entity types Examples in xml <seedid="1"> <prizename="Nobel"/> <year>1999</year> <areaname="chemistry"/> <winner> <person> <name>Ahmed H. Zewail</name> <surname>Zewail</surname> <gname>Ahmed</gname> <gname>H</gname> </person> </winner> </seed> Prize Name : string PrizeArea : string event & Winner List :list of person Year:4 digit* Surname : string person & Given Name : list of string Complete Name : string

Retrieval of Relevant Documents • Relevant documents are documents containing seeds • Using IR for document retrieval: “lucene” as our IR tool seed: (Nobel, chemistry, [Ahmed H. Zewail], 1999) IR query: Nobel AND chemistry ANDAhmed AND Zewail seed: (Nobel, chemistry, [Walter Kohn ; John A. Pople], 1998) IR query: Nobel AND chemistry AND ((walter AND kohn) OR (john AND pople))

Identification of Potential Event Extents • Analyzing all sentences of relevant documents • Find the sentences which contain seed entities • String match • Entity match • Find all entities in paragraphs with SProUT • Fuzzy entity match to get the equal entities (unification-based) e.g.Ahmed H. Zewail = = Ahmed Zwail

Seed Complexity and Sentence Extent • Which kind of sentences could represent an event? • Sentences from Year 1998-1999 Table 1. distribution of the seed complexity

Distribution of Entity Combination Table 2. distribution of entity combinations

What we can learn from these figures • Table 1 tells us that the more event arguments a sentence contains, the higher is the probability that the sentence is an event extent. • Table 2 shows the difference between different entity-class combinations with respect to the event identification. We can potentially regard these values as additional validation criteria of event extraction rules. • Table 1 helps us to pre-estimate the contribution of the arity classes for successful event extraction, • Table 2 shows us which types of incomplete seeds might be most useful. • Both distributions, especially the second one will be very much dependent on the kind of relations to extract • Such seed analyses could be used to better characterize a given relation-extraction task.

Sentence Analysis and Pattern Identification Seed: (Nobel, chemistry, [Ahmed H. Zewail], 1999) Sentence: Mr. Zewail won the Nobel Prize for chemistry Tuesday. fin Parse Tree (sprout + minipar) won(V) Zewail(N) subj (person) Prize(N) obj (prize_name) Tuesday mod *it is a time entity, but not an entity mentioned in seed Mr title the det Nobel lexmod for(Prep) mod chemistry(N) pcmpn (area)

Example 2 Seed: (Nobel, chemistry, [Ahmed H. Zewail], 1999) Sentence: Egyptian-born scientist Ahmed Zewail has been awarded the 1999 Nobel Prize for Chemistry. fin award(V) Zewail(N) subj (person) Prize(N) obj (binary_relation: [year],[prize]) has have be(be) be born mod scientist nn Ahmed lexmod the det 1999 num Nobel lexmod for mod Egyptian lexmod Chemistry(N) pcompn (area)

Rule Example rule_id: 26 rule_name: event_trigger_win rule_body: {head(“win”, v), subject([person]), object(relation_trigger_prize)} rule_name: relation_trigger_prize rule_body:{head(“prize”, n), [prize_name]|binary_relation([year], [prize_name]), mod(relation_trigger_prep)} rule_name: relation_trigger_prep_area rule_body: {head(“in”|”for”, prep), pcom-n([area])}

Rule Clustering and Compression • Cluster event and relation rules with same head categories • Verbs with the same lemma • Nouns with the same lemma • Entities with the same entity class • and same number of subrules additionally • No subsumption check done yet

Ranked Rules 1144 rules top ranked rules: • Event_trigger_win • Event_trigger_award • Event_trigger_share • Event_trigger_accept • Event_trigger_person: apposition construction • Event_trigger_be_cowinner • Event_trigger_be_winner • Event_trigger_be_prize_winner

Example of Extraction Known as the "Saint of the Gutters" for her unending work and compassion for the poor, Mother Teresa was awarded the Nobel Peace Prize in 1979. rule_id = 32 arity= 3 [nobel, peace, 1979, ] rule_id = 66 arity= 3 [nobel, peace, , (Teresa, mother )] rule_id = 69 arity= 3 [nobel, peace, , (Teresa, mother )] rule_id = 123 arity= 2 [nobel, peace, , ] rule_id = 124 arity= 2 [nobel, peace, , ] rule_id = 125 arity= 2 [nobel, peace, , ]

Evaluation • Data: 18 MB, from 1981 to 2006 • Seed: one event instance, one winner from the Nobel Prize winner database in year 1997 Nobel, chemistry, [Ahmed H. Zewail], 1999 • Number of iteration: 4 • Precision: 143/174=0.72 • Completeness (since 1981): 143/237=0.60 • 143 Nobel Prize winners since 1981 have been found

Conclusion • Semantic-oriented event seed construction helps to identify textual location of events • Event rule presentation is a new way for merging event arguments • Rule presentation is general enough for any n-ary relations

Next steps of the ongoing project • More sophisticated rule induction • Subsumption check • Extension to paragraph level • Integration of alternative NLP processing • Comparison with seed of binary relations • Extension to new prizes and new domains

Thank you for your attention

Feiyu Xu & Hans Uszkoreit & Hong Li