Automatic event extraction from text on the base of linguistic and semantic annotation

Thierry Declerck, DFKI – Language Technology Lab
JRC 2005/05/10

Events …
• Involve entities and relations between them
• Imply a change of state
• Example: The striker of Liverpool scored a wonderful goal in the 87th minute.
  • 1 event (goal-shot)
  • 2 entities (person and team)
  • 1 change of state (the scoring)
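The example above can be written down as a small data structure. The following Python sketch is only an illustration: the class and field names (`Entity`, `Event`, `state_change`) are assumptions made here and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    name: str
    kind: str  # e.g. "person" or "team"


@dataclass
class Event:
    event_type: str
    entities: List[Entity] = field(default_factory=list)
    state_change: str = ""


# The goal example from the slide: 1 event, 2 entities, 1 change of state.
goal = Event(
    event_type="goal-shot",
    entities=[Entity("the striker", "person"), Entity("Liverpool", "team")],
    state_change="the scoring",
)
```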

Events in textual documents
• Various types of text
  • Structured: Example and Example_2
    • Processing requires pattern matching techniques; very little linguistic knowledge is needed
  • Semi-structured: Example
    • Requires a mixture of pattern matching and more linguistic knowledge
  • Unstructured: Example
    • Requires a mixture of layout analysis and linguistic knowledge
• All types of text require a domain-specific knowledge base (ontology) for event extraction

Domain Knowledge
• Domain knowledge can be organised in terminologies, thesauri, taxonomies or ontologies. Example of a (non-formal) multilingual ontology for the soccer domain.
• More on ontology engineering in the talk by Borislav

Automatic Event Extraction from Text
• Is a combination of human language technology (HLT) and semantic web technologies (ontologies)
• Can also be done by purely statistical means (with minimal linguistic knowledge), but we concentrate here on the HLT-based approach

What is Human Language Technology?

Linguistic Analysis
Language technology tools are needed to support the upgrade of the current web to the Semantic Web (SW) by providing an automatic analysis of the linguistic structure of textual documents. Free-text documents that undergo linguistic analysis become available as semi-structured documents, from which meaningful units can be extracted automatically (information extraction) and organized through clustering or classification (text mining). Here we focus on the following linguistic analysis steps that underlie the extraction tasks: tokenisation, morphological analysis, part-of-speech tagging, chunking, dependency structure analysis, semantic tagging.

Tokenisation
Tokenisation deals with the detection of the word units in a text and with the detection of sentence boundaries.
The markets acknowledge the measures taken on the 24th of September by the CEO of XYZ Corp.
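A minimal Python sketch of both tasks, assuming a regular-expression tokeniser and a tiny hand-made abbreviation list (`ABBREVIATIONS` and the function names are illustrative, not from the talk). It shows why "Corp." is hard: the period may mark an abbreviation, a sentence end, or both.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"Corp.", "Inc.", "Dr."}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(text):
    """Split at . ! ? followed by whitespace, unless the word is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s)", text):
        last_word = text[start:match.end()].split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # limitation: an abbreviation that also ends a sentence is missed
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


slide_sentence = ("The markets acknowledge the measures taken on the 24th of "
                  "September by the CEO of XYZ Corp.")
tokens = tokenize(slide_sentence)
sentences = split_sentences("XYZ Corp. announced measures. The markets reacted.")
```

Real tokenisers use much larger abbreviation lexicons or statistical boundary detection; the sketch only makes the ambiguity visible.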

Morphological Analysis
Morphological analysis is concerned with the inflectional, derivational, and compounding processes in word formation in order to determine properties such as stem and inflectional information. Together with part-of-speech (PoS) information this process delivers the morpho-syntactic properties of a word. For the German word Häusern (houses), the analysis should yield the following morphological information:
[PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]
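The feature bundle above maps naturally onto a key-value structure. The sketch below assumes a one-entry toy lexicon (`TOY_LEXICON` is hypothetical); a real analyser would derive these features from stems and inflection rules rather than look them up.

```python
# Hypothetical full-form lexicon with a single entry taken from the slide.
TOY_LEXICON = {
    "Häusern": {"PoS": "N", "NUM": "PL", "CASE": "DAT", "GEN": "NEUT", "STEM": "HAUS"},
}


def analyse(word):
    """Return the morpho-syntactic features of a word, or None if unknown."""
    return TOY_LEXICON.get(word)


def format_features(features):
    """Render features in the bracket notation used on the slide."""
    return "[" + " ".join(f"{key}={value}" for key, value in features.items()) + "]"


analysis = format_features(analyse("Häusern"))
```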

Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of determining the correct syntactic class (a part of speech, e.g. noun, verb, etc.) for a particular word given its current context. The word “works” in the following sentences will be either a verb or a noun:
He works[N,V] the whole day for nothing.
His works[N,V] have all been sold abroad.
PoS tagging involves disambiguating between multiple part-of-speech tags, as well as guessing the correct part-of-speech tag for unknown words on the basis of context information.
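The disambiguation of “works” can be sketched with two context rules. This is a toy rule-based tagger under stated assumptions: the lexicon, the tag names and the two rules are invented for illustration, and unknown words simply receive the placeholder tag `UNK` instead of a context-based guess.

```python
# Hypothetical mini-lexicon: each word lists its possible tags.
LEXICON = {
    "he": ["PRON"],
    "his": ["DET"],
    "the": ["DET"],
    "works": ["N", "V"],
}


def tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tags = []
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["UNK"])
        previous = tags[-1] if tags else None
        if len(candidates) == 1:
            tags.append(candidates[0])
        elif previous == "PRON" and "V" in candidates:
            tags.append("V")  # a verb typically follows a subject pronoun
        elif previous == "DET" and "N" in candidates:
            tags.append("N")  # a noun typically follows a determiner
        else:
            tags.append(candidates[0])
    return tags


verb_reading = tag("He works the whole day for nothing".split())
noun_reading = tag("His works have all been sold abroad".split())
```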

Chunking
Chunks are sequences of words which are grouped on the basis of linguistic properties, such as nominal, prepositional, adjectival and adverbial phrases and verb groups.
[NP His works] [VG have] [NP all] [VG been sold] [AdvP abroad].
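As a rough sketch, chunking can be approximated by grouping adjacent tokens whose PoS tags belong to the same phrase type. The tag set and the `CHUNK_OF` mapping below are assumptions for this one example; grouping purely by adjacency would wrongly merge two neighbouring noun phrases, so it is not a general chunker.

```python
from itertools import groupby

# Hypothetical mapping from PoS tag to chunk type.
CHUNK_OF = {"DET": "NP", "N": "NP", "PRON": "NP", "AUX": "VG", "V": "VG", "ADV": "AdvP"}


def chunk(tagged_tokens):
    """Group consecutive (word, tag) pairs that map to the same chunk type."""
    chunks = []
    for label, group in groupby(tagged_tokens, key=lambda pair: CHUNK_OF.get(pair[1], "O")):
        chunks.append((label, [word for word, _ in group]))
    return chunks


def render(chunks):
    """Render chunks in the bracket notation used on the slide."""
    return " ".join(f"[{label} {' '.join(words)}]" for label, words in chunks)


tagged = [("His", "DET"), ("works", "N"), ("have", "AUX"), ("all", "PRON"),
          ("been", "AUX"), ("sold", "V"), ("abroad", "ADV")]
bracketed = render(chunk(tagged))
```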

Named Entity Detection
Related to chunking is the recognition of so-called named entities (names of institutions and companies, date expressions, etc.). The extraction of named entities is mostly based on a strategy that combines look-up in gazetteers (lists of companies, cities, etc.) with the definition of regular expression patterns. Named entity recognition can be included as part of the linguistic chunking procedure. The following sentence fragment:
“…the secretary-general of the United Nations, Kofi Annan,…”
will be annotated as a nominal phrase that includes two named entities: United Nations with named entity class organization, and Kofi Annan with named entity class person.
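The combination of gazetteer look-up and regular expressions can be sketched as follows. The gazetteer content and the date pattern are assumptions chosen to cover the examples in these slides; they are not the resources used by the author.

```python
import re

# Hypothetical two-entry gazetteer.
GAZETTEER = {"United Nations": "organization", "Kofi Annan": "person"}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Matches date expressions such as "24th of September".
DATE_PATTERN = re.compile(r"\b\d{1,2}(?:st|nd|rd|th)? of (?:" + MONTHS + r")\b")


def find_entities(text):
    """Return (surface string, entity class) pairs in order of appearance."""
    found = []
    for name, entity_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), entity_class))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return [(surface, entity_class) for _, surface, entity_class in sorted(found)]


entities = find_entities("the secretary-general of the United Nations, Kofi Annan,")
dates = find_entities("the measures taken on the 24th of September")
```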

Dependency Structure Analysis
A dependency structure consists of two or more linguistic units that are linked by immediate dominance in a syntax tree. The detection of such structures is generally not provided by chunking but builds on top of it. There are two main types of dependencies that are relevant for our purposes: on the one hand, the internal dependency structure of phrasal units or chunks, and on the other hand, the so-called grammatical functions (like subject and direct object).

Internal Dependency Structure
Linguistic analysis describes the internal structure of a phrase with the terms head, complements and modifiers, where the head is the dominating node in the syntax tree of a phrase (chunk), complements are necessary qualifiers thereof, and modifiers are optional qualifiers. Consider the following example:
“The shot by Christian Ziege goes over the goal.”
The prepositional phrase “by Christian Ziege” (containing the named entity Christian Ziege) depends on (and modifies) the head noun “shot”.

Grammatical Functions
Grammatical functions determine the role (function) of each linguistic chunk in the sentence and make it possible to identify the actors involved in certain events. For example, in the following sentence the syntactic (and also the semantic) subject is the NP constituent “The shot by Christian Ziege”:
“The shot by Christian Ziege goes over the goal.”
This nominal phrase depends on (and complements) the verb “goes”, whereas the noun “shot” is the head of the NP (it is the shot going over the goal, and not Christian Ziege!).
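One possible way to hold such an analysis in code is a small tree of nodes labelled with their relation to the governing word. The `Node` class, the relation labels and the hand-built tree are assumptions for illustration; in particular, the label `pp` for “over the goal” is a neutral placeholder, since the slides do not classify that phrase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    word: str
    relation: str = "root"  # relation to the governing node
    dependents: List["Node"] = field(default_factory=list)


def find(node: Node, relation: str) -> Optional[Node]:
    """Return the first direct dependent with the given relation, if any."""
    for dependent in node.dependents:
        if dependent.relation == relation:
            return dependent
    return None


# Hand-built analysis of "The shot by Christian Ziege goes over the goal."
sentence = Node("goes", "root", [
    Node("shot", "subject", [
        Node("The", "determiner"),
        Node("by Christian Ziege", "modifier"),
    ]),
    Node("over the goal", "pp"),
])

subject = find(sentence, "subject")
modifier = find(subject, "modifier")
```

Reading off `subject.word` gives "shot", which matches the point made above: the head of the subject NP is the shot, not Christian Ziege.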

Semantic Tagging
Automatic semantic annotation has developed within language technology in recent years in connection with more integrated tasks like information extraction, which require a certain level of semantic analysis. Semantic tagging consists of annotating each content word in a document with a semantic category. Semantic categories are assigned on the basis of a semantic resource such as WordNet for English or EuroWordNet, which links words between many European languages through a common inter-lingua of concepts.

Semantic Resources
• Semantic resources are captured in dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine.
• They can be roughly divided into the following three groups:
  • Thesauri: Semantic resources that group together similar words or terms according to a standard set of relations, including broader term, narrower term, sibling, etc. (like Roget)
  • Semantic Lexicons: Semantic resources that group together words (or more complex lexical items) according to lexical semantic relations like synonymy, hyponymy, meronymy, and antonymy (like WordNet)
  • Semantic Networks: Semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (like UMLS in the medical domain)

The MeSH Thesaurus
MeSH (Medical Subject Headings) is a thesaurus for indexing articles and books in the medical domain, which may then be used for searching MeSH-indexed databases. MeSH provides for each term a number of term variants that refer to the same concept. It currently includes a vocabulary of over 250,000 terms. The following is a sample entry for the term gene library (MH is the term itself, ENTRY are term variants):
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
etc.
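Term variants are useful because they let different surface forms be mapped to one concept. The sketch below builds such a look-up from the sample entry only; the data structure and function names are assumptions and do not reflect the actual MeSH file format or any MeSH API.

```python
# The sample entry from the slide: main heading (MH) and its ENTRY variants.
MESH_ENTRIES = {
    "Gene Library": ["Bank, Gene", "Banks, Gene", "DNA Libraries", "Gene Bank"],
}


def build_index(entries):
    """Map every lower-cased heading and variant to its main heading."""
    index = {}
    for heading, variants in entries.items():
        for term in [heading, *variants]:
            index[term.lower()] = heading
    return index


INDEX = build_index(MESH_ENTRIES)


def main_heading(term):
    """Return the main heading for a term or variant, or None if unknown."""
    return INDEX.get(term.strip().lower())
```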

The WordNet Semantic Lexicon
WordNet was designed primarily as a computational account of the human capacity for linguistic categorization and covers an extensive set of semantic classes (called synsets). Synsets are collections of synonyms, grouping together lexical items according to meaning similarity. Strictly speaking, synsets are not made up of lexical items but of lexical meanings (i.e. senses).

WordNet: An Example
The word 'tree' has two meanings that roughly correspond to the classes of plants and of diagrams, each with its own hierarchy of classes that are included in more general super-classes:
09396070 tree 0
  09395329 woody_plant 0 ligneous_plant 0
    09378438 vascular_plant 0 tracheophyte 0
      00008864 plant 0 flora 0 plant_life 0
        00002086 life_form 0 organism 0 being 0 living_thing 0
          00001740 entity 0 something 0
10025462 tree 0 tree_diagram 0
  09987563 plane_figure 0 two-dimensional_figure 0
    09987377 figure 0
      00015185 shape 0 form 0
        00018604 attribute 0
          00013018 abstraction 0
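The two hierarchies can be walked programmatically. The sketch below hard-codes the synsets listed above in a plain dictionary (synset id, member words, parent id); this table and the function names are assumptions for illustration and do not use the WordNet database or any WordNet library.

```python
# synset id -> (member words, id of the more general synset or None)
SYNSETS = {
    "09396070": (["tree"], "09395329"),
    "09395329": (["woody_plant", "ligneous_plant"], "09378438"),
    "09378438": (["vascular_plant", "tracheophyte"], "00008864"),
    "00008864": (["plant", "flora", "plant_life"], "00002086"),
    "00002086": (["life_form", "organism", "being", "living_thing"], "00001740"),
    "00001740": (["entity", "something"], None),
    "10025462": (["tree", "tree_diagram"], "09987563"),
    "09987563": (["plane_figure", "two-dimensional_figure"], "09987377"),
    "09987377": (["figure"], "00015185"),
    "00015185": (["shape", "form"], "00018604"),
    "00018604": (["attribute"], "00013018"),
    "00013018": (["abstraction"], None),
}


def senses(word):
    """Return the ids of all synsets that contain the word."""
    return [sid for sid, (words, _) in SYNSETS.items() if word in words]


def hypernym_path(synset_id):
    """Follow parent links up to the top class, returning the first word of each synset."""
    path = []
    while synset_id is not None:
        words, parent = SYNSETS[synset_id]
        path.append(words[0])
        synset_id = parent
    return path


tree_senses = senses("tree")
plant_path = hypernym_path("09396070")
diagram_path = hypernym_path("10025462")
```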

What is the Semantic Web?
• “The Semantic Web is a new initiative to transform the web into a structure that supports more intelligent querying and browsing, both by machines and by humans. This transformation is to be supported through the generation and use of metadata constructed via web annotation tools using user-defined ontologies that can be related to one another.” (Somewhere on the web)

[Figure: Semantic Web architecture. Components: End User, Community Portal, Agents, Ontology Articulation Toolkit, Ontology Construction Tool, Ontologies, Inference Engine, Web-Page Annotation Tool, Annotated Web Pages, Metadata Repository. Based on www.semanticweb.org]

Extracting Events from Structured Documents
• Detecting Metadata in our Example:
  • Type of game: N/A
  • Teams involved: England - Deutschland
  • Players: Deutschland: Kahn (2) - Matthaeus (3) - Babbel (3,5),
  • Final (and intermediate) score: 1:0 (0:0)
  • Referee: Schiedsrichter: Collina, Pierluigi (Viareggio)
  • Date: N/A
  • Etc…

Extracting Events from Structured Documents (2)
• Detecting Events in our Example:
  • Substitution: Eingewechselt: 61. Gerrard fuer Owen,
  • Goal: Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)
  • Cards: Gelbe Karten: Beckham - Babbel, Jeremies
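Because the lines above follow a fixed layout, plain pattern matching is enough to turn them into event records. The regular expressions and field names below are assumptions written for exactly these two sample lines (goal and substitution); they are not the patterns used in the system described in the talk.

```python
import re

# "1:0 Shearer (53., Kopfball, Vorarbeit Beckham)": score, scorer, minute,
# optional manner ("Kopfball" = header) and optional assist ("Vorarbeit").
GOAL = re.compile(
    r"(?P<home>\d+):(?P<away>\d+) (?P<scorer>\w+) "
    r"\((?P<minute>\d+)\.(?:, (?!Vorarbeit)(?P<how>[^,)]+))?"
    r"(?:, Vorarbeit (?P<assist>[^,)]+))?\)"
)
# "61. Gerrard fuer Owen": minute, player coming on, player going off.
SUBSTITUTION = re.compile(r"(?P<minute>\d+)\. (?P<player_in>\w+) fuer (?P<player_out>\w+)")


def extract_events(line):
    """Turn one labelled line of the match report into a list of event dicts."""
    label, _, body = line.partition(":")
    body = body.strip()
    if label == "Tore":
        return [dict(type="goal", **m.groupdict()) for m in GOAL.finditer(body)]
    if label == "Eingewechselt":
        return [dict(type="substitution", **m.groupdict())
                for m in SUBSTITUTION.finditer(body)]
    return []


goal = extract_events("Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)")[0]
substitution = extract_events("Eingewechselt: 61. Gerrard fuer Owen,")[0]
```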

Results in XML
• Events (and entities and relations) are extracted automatically from structured text, on the basis of patterns (DTD) of typical expressions and the soccer ontology. Example and Example_2
• Since the various results are available in XML files, they can be merged automatically, guided by the ontology. Example. This supports incremental and dynamic extraction.
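A rough sketch of such a merge with Python's standard `xml.etree.ElementTree` module follows. The element and attribute names (`events`, `event`, `type`, `minute`, ...) are hypothetical, and the merge key (event type plus minute) is a simple stand-in for the ontology-guided merging mentioned above.

```python
import xml.etree.ElementTree as ET


def merge_events(xml_a, xml_b):
    """Merge <event> elements from two documents; same type and minute means same event."""
    merged = {}
    for source in (xml_a, xml_b):
        for event in ET.fromstring(source).iter("event"):
            key = (event.get("type"), event.get("minute"))
            merged.setdefault(key, {}).update(event.attrib)
    root = ET.Element("events")
    for attributes in merged.values():
        ET.SubElement(root, "event", attributes)
    return root


report_a = '<events><event type="goal" minute="53" scorer="Shearer"/></events>'
report_b = ('<events><event type="goal" minute="53" assist="Beckham"/>'
            '<event type="substitution" minute="61" player_in="Gerrard" player_out="Owen"/>'
            '</events>')
merged_root = merge_events(report_a, report_b)
merged_events = list(merged_root)
```

Calling `ET.tostring(merged_root, encoding="unicode")` serialises the merged result back to XML.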

Extracting Events from Semi-Structured Documents
• Linguistic processing is needed to provide a basic structure of the document, which enables the domain-specific annotation. Example.

Extracting Events from Semi-Structured Documents (2)
• The results of the semantic annotation of the structured documents are used as well, which supports incremental extraction: Example.

Current Development
• Extracting information from multilingual balance sheets (WINS eTen project), extending this to unstructured text and extracting relations and events from annexes to balance sheets (upcoming project MUSING).
• Detecting positive/negative mentions of entities in news documents (project Direct-Info on Media Monitoring). Example.

Further Challenge for HLT
• Use HLT not only for the semantic annotation of web pages (or other documents), but also for supporting ontology extraction/learning from the web (or other documents)

Example of semantic relation extraction in bio-medicine
• [Rheumatoid arthritis] [is characterized] [by progressive synovial inflammation and joint destruction] [.]
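A minimal sketch of how relations could be read off this sentence, assuming a single hand-written lexical pattern ("X is characterized by Y and Z"). The relation name `characterized_by` and the naive splitting of the coordination at "and" are assumptions; the split ignores, for instance, whether "progressive" also applies to "joint destruction".

```python
import re

PATTERN = re.compile(r"^(?P<subject>.+?) is characterized by (?P<objects>.+?)\.?$")


def extract_relations(sentence):
    """Return (subject, relation, object) triples for the one supported pattern."""
    match = PATTERN.match(sentence.strip())
    if not match:
        return []
    objects = re.split(r"\s+and\s+|,\s*", match.group("objects"))
    return [(match.group("subject"), "characterized_by", obj) for obj in objects if obj]


relations = extract_relations(
    "Rheumatoid arthritis is characterized by progressive synovial inflammation "
    "and joint destruction."
)
```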

Open issues for HLT and SW
• Achieving better coordination in order to improve semantic annotation results
• Development and use of standards for interrelated linguistic and semantic annotation (see eContent Project LIRICS for standards for language resources)

Interoperable Standards?

Thank you!
