Information Extraction, Language Technology and the Semantic Web

Thierry Declerck & Paul Buitelaar (DFKI GmbH)

We present collaborative research on combining language technology (LT) with technologies for encoding (domain) knowledge in ontologies, in support of the emerging Semantic Web (SW) or, perhaps more appropriately, Semantic Webs.

Semantic Web Applications of LT • Supporting accurate ontology-based semantic annotation of multilingual web documents (Knowledge Markup) • Supporting Ontology Learning/Construction from linguistically/semantically annotated multilingual text (Knowledge Extraction)

Knowledge Markup and Knowledge Extraction [Figure: diagram. Labels: Text/Speech; Linguistic Analysis (Morpho-Syntactic Analysis and Tagging, Semantic Class Tagging, Term/NE Recognition, Grammatical Function Tagging, Dependency Structure Analysis); Linguistic and Semantic Annotations; Text/Speech Mining; Concepts, Relations, Events]

Knowledge Markup and Knowledge Extraction (2) [Figure: diagram. Labels: Text/Speech/Image-Video; Linguistic and Media Analysis; Linguistic, Low-level Image and Semantic Annotations; Text/Speech/Media Mining; Concepts, Relations, Events]

Integration of Language Technology and Domain Knowledge

Linguistic Analysis Language technology tools are needed to support the upgrade of the current web to the Semantic Web (SW) by providing an automatic analysis of the linguistic structure of textual documents. Free-text documents that undergo linguistic analysis become available as semi-structured documents, from which meaningful units can be extracted automatically (information extraction) and organized through clustering or classification (text mining). Here we focus on the linguistic analysis steps that underlie the extraction tasks: morphological analysis, part-of-speech tagging, chunking, dependency structure analysis, and semantic tagging.

Morphological Analysis Morphological analysis is concerned with the inflectional, derivational, and compounding processes in word formation, and determines properties such as the stem and inflectional information. Together with part-of-speech (PoS) information, this process delivers the morpho-syntactic properties of a word. For the German word Häusern (dative plural of Haus, "house"), the analysis should yield: [PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]
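The lookup step can be sketched with a toy full-form lexicon. This is only an illustration: the lexicon, its single entry, and the `analyse` helper are assumptions made for this example, not the analyser used in the work presented here.

```python
# Toy full-form lexicon: lower-cased surface form -> list of readings.
# The single entry mirrors the feature bundle shown for "Häusern";
# both the lexicon and analyse() are hypothetical.
FULL_FORM_LEXICON = {
    "häusern": [
        {"PoS": "N", "NUM": "PL", "CASE": "DAT", "GEN": "NEUT", "STEM": "HAUS"},
    ],
}


def analyse(word):
    """Return all stored readings for a word form, or [] if it is unknown."""
    return FULL_FORM_LEXICON.get(word.lower(), [])


readings = analyse("Häusern")
for reading in readings:
    print(" ".join(f"{key}={value}" for key, value in reading.items()))
```

A real analyser would compute readings from stems and inflection rules rather than listing every form, which matters for German compounds.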

Part-of-Speech Tagging Part-of-speech (PoS) tagging is the process of determining the correct syntactic class (a part of speech, e.g. noun, verb, etc.) for a particular word given its current context. The word “works” in the following sentences is either a verb or a noun: He works [N,V] the whole day for nothing. His works [N,V] have all been sold abroad. PoS tagging involves disambiguating between multiple part-of-speech tags, as well as guessing the correct part-of-speech tag for unknown words on the basis of context information.
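A minimal sketch of context-based disambiguation for the “works” example follows. The mini-lexicon, the tag names PRON and POSS, and the two left-context rules are assumptions chosen for illustration; practical taggers typically learn such preferences from annotated corpora.

```python
# Hypothetical mini-lexicon: word form -> candidate tags.
LEXICON = {"he": ["PRON"], "his": ["POSS"], "works": ["N", "V"]}


def disambiguate(tokens):
    """Pick one tag per token using the tag of the preceding token."""
    tags = []
    for i, token in enumerate(tokens):
        candidates = LEXICON.get(token.lower(), ["UNK"])
        previous = tags[i - 1] if i > 0 else None
        if len(candidates) == 1:
            tags.append(candidates[0])
        elif previous == "PRON" and "V" in candidates:
            tags.append("V")  # a personal pronoun is followed by a verb
        elif previous == "POSS" and "N" in candidates:
            tags.append("N")  # a possessive is followed by a noun
        else:
            tags.append(candidates[0])  # fall back to the first candidate
    return tags


print(disambiguate(["He", "works"]))
print(disambiguate(["His", "works"]))
```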

Chunking Following Abney, chunks are the non-recursive parts of core phrases, such as nominal, prepositional, adjectival and adverbial phrases and verb groups. Chunk parsing is an important step towards making natural language processing robust, since its goal is not to deliver a full analysis of sentences, but to extract just the linguistic fragments that can be identified reliably. Even though this strategy does not produce an analysis of the whole sentence, the partial linguistic information it delivers is still useful for many applications, such as information extraction and text mining.
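The idea of extracting only reliably identifiable fragments can be sketched as a scan over PoS-tagged tokens. The NP pattern used here (optional determiner, any adjectives, one or more nouns) and the Penn-style tag names are simplifying assumptions, not the chunk grammar of any particular system.

```python
def np_chunks(tagged):
    """Return (start, end) index pairs (end exclusive) of simple NP chunks.

    Assumed pattern: optional DT, any number of JJ, one or more NN/NNS.
    Tokens that fit no chunk are skipped, so the analysis stays partial.
    """
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":
            j += 1
        while j < n and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < n and tagged[k][1] in ("NN", "NNS"):
            k += 1
        if k > j:  # at least one noun was found: emit a chunk
            chunks.append((i, k))
            i = k
        else:      # no noun here: move on without failing
            i += 1
    return chunks


sentence = [("a", "DT"), ("combination", "NN"), ("of", "IN"), ("symptoms", "NNS")]
for start, end in np_chunks(sentence):
    print(" ".join(word for word, _ in sentence[start:end]))
```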

Named Entity Detection Related to chunking is the recognition of so-called named entities (names of institutions and companies, date expressions, etc.). The extraction of named entities is mostly based on a strategy that combines look-up in gazetteers (lists of companies, cities, etc.) with the definition of regular expression patterns. Named entity recognition can be included as part of the linguistic chunking procedure. The sentence fragment “…the secretary-general of the United Nations, Kofi Annan,…” will then be annotated as a nominal phrase that includes two named entities: United Nations, with named entity class organization, and Kofi Annan, with named entity class person.
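The combination of gazetteer look-up and regular expression patterns can be sketched as follows. The two-entry gazetteer and the date pattern are assumptions for illustration only; real gazetteers are far larger and the patterns more varied.

```python
import re

# Hypothetical gazetteer: entity name -> named entity class.
GAZETTEER = {"United Nations": "organization", "Kofi Annan": "person"}

# Assumed pattern for dates such as "12 March 2001".
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_PATTERN = re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")


def find_entities(text):
    """Return (offset, string, class) triples sorted by position in the text."""
    entities = []
    for name, ne_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.start(), match.group(), ne_class))
    for match in DATE_PATTERN.finditer(text):
        entities.append((match.start(), match.group(), "date"))
    return sorted(entities)


fragment = "the secretary-general of the United Nations, Kofi Annan,"
for _, string, ne_class in find_entities(fragment):
    print(f"{string}: {ne_class}")
```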

Dependency Structure Analysis A dependency structure consists of two or more linguistic units, one of which immediately dominates the others in a syntax tree. The detection of such structures is generally not provided by chunking but builds on top of it. Two main types of dependencies are relevant for our purposes: on the one hand, the internal dependency structure of phrasal units or chunks, and on the other hand, the so-called grammatical functions (like subject and direct object).

Internal Dependency Structure To describe the internal structure of a phrase, linguistic analysis uses the terms head, complements and modifiers: the head is the dominating node in the syntax tree of a phrase (chunk), complements are necessary qualifiers of the head, and modifiers are optional qualifiers. Consider the following example: “The shot by Christian Ziege goes over the goal.” The prepositional phrase “by Christian Ziege” (containing the named entity Christian Ziege) depends on (and modifies) the head noun “shot”.

Grammatical Functions Grammatical functions determine the role (function) of each linguistic chunk in the sentence and make it possible to identify the actors involved in certain events. For example, in the following sentence the syntactic (and also the semantic) subject is the NP constituent “The shot by Christian Ziege”: “The shot by Christian Ziege goes over the goal.” This nominal phrase depends on (and complements) the verb “goes”, whereas the noun “shot” is the head of the NP (it is the shot that goes over the goal, not Christian Ziege!).

Semantic Tagging Automatic semantic annotation has developed within language technology in recent years in connection with more integrated tasks like information extraction, which require a certain level of semantic analysis. Semantic tagging consists of annotating each content word in a document with a semantic category. Semantic categories are assigned on the basis of a semantic resource such as WordNet for English, or EuroWordNet, which links words across many European languages through a common interlingua of concepts.

Semantic Resources • Semantic resources are captured in dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine. • They can be roughly distinguished into the following three groups: • Thesauri: Semantic resources that group together similar words or terms according to a standard set of relations, including broader term, narrower term, sibling, etc. (like Roget) • Semantic Lexicons: Semantic resources that group together words (or more complex lexical items) according to lexical semantic relations like synonymy, hyponymy, meronymy, and antonymy (like WordNet) • Semantic Networks: Semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (like UMLS in the medical domain)

The MeSH Thesaurus MeSH (Medical Subject Headings) is a thesaurus for indexing articles and books in the medical domain, which may then be used for searching MeSH-indexed databases. MeSH provides for each term a number of term variants that refer to the same concept. It currently includes a vocabulary of over 250,000 terms. The following is a sample entry for the term gene library (MH is the term itself, ENTRY are term variants): MH = Gene Library ENTRY = Bank, Gene ENTRY = Banks, Gene ENTRY = DNA Libraries ENTRY = Gene Bank etc.
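Because all term variants refer to the same concept, a searcher can be led from any variant to the main heading. The sketch below assumes records in the `MH = ...` / `ENTRY = ...` layout of the sample entry, one field per line; the function names are hypothetical and the real MeSH distribution format has many more fields.

```python
SAMPLE_RECORD = """
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
"""


def parse_mesh_record(record):
    """Split one record into its main heading and its list of variants."""
    heading, variants = None, []
    for line in record.strip().splitlines():
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "MH":
            heading = value
        elif key == "ENTRY":
            variants.append(value)
    return heading, variants


def build_variant_index(records):
    """Map every lower-cased term or variant to its main heading."""
    index = {}
    for record in records:
        heading, variants = parse_mesh_record(record)
        if heading is None:
            continue  # skip records without a main heading
        for term in [heading] + variants:
            index[term.lower()] = heading
    return index


index = build_variant_index([SAMPLE_RECORD])
print(index["gene bank"])
```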

The WordNet Semantic Lexicon WordNet has primarily been designed as a computational account of the human capacity for linguistic categorization and covers an extensive set of semantic classes (called synsets). Synsets are collections of synonyms, grouping together lexical items according to meaning similarity. Strictly speaking, synsets are not made up of lexical items, but of lexical meanings (i.e. senses).

WordNet: An example The word 'tree' has two meanings that roughly correspond to the classes of plants and of diagrams, each with its own hierarchy of classes that are included in more general super-classes:
09396070 tree 0
09395329 woody_plant 0 ligneous_plant 0
09378438 vascular_plant 0 tracheophyte 0
00008864 plant 0 flora 0 plant_life 0
00002086 life_form 0 organism 0 being 0 living_thing 0
00001740 entity 0 something 0
10025462 tree 0 tree_diagram 0
09987563 plane_figure 0 two-dimensional_figure 0
09987377 figure 0
00015185 shape 0 form 0
00018604 attribute 0
00013018 abstraction 0
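The two hierarchies for 'tree' can be walked programmatically. The sketch below hard-codes the offsets and words from the example and assumes that each synset's super-class is the one listed directly after it; the dictionary layout and function names are illustrative, not the WordNet database format.

```python
# offset -> (synonyms, offset of the super-class or None at the top).
SYNSETS = {
    "09396070": (["tree"], "09395329"),
    "09395329": (["woody_plant", "ligneous_plant"], "09378438"),
    "09378438": (["vascular_plant", "tracheophyte"], "00008864"),
    "00008864": (["plant", "flora", "plant_life"], "00002086"),
    "00002086": (["life_form", "organism", "being", "living_thing"], "00001740"),
    "00001740": (["entity", "something"], None),
    "10025462": (["tree", "tree_diagram"], "09987563"),
    "09987563": (["plane_figure", "two-dimensional_figure"], "09987377"),
    "09987377": (["figure"], "00015185"),
    "00015185": (["shape", "form"], "00018604"),
    "00018604": (["attribute"], "00013018"),
    "00013018": (["abstraction"], None),
}


def senses_of(word):
    """Return the offsets of all synsets that contain the word."""
    return [offset for offset, (words, _) in SYNSETS.items() if word in words]


def hypernym_chain(offset):
    """Follow super-class links upwards, returning the first word of each synset."""
    chain = []
    while offset is not None:
        words, offset = SYNSETS[offset]
        chain.append(words[0])
    return chain


for sense in senses_of("tree"):
    print(" < ".join(hypernym_chain(sense)))
```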

CYC: A Semantic Network CYC is a semantic network of over 1,000,000 manually defined rules that cover a large part of common sense knowledge about the world. For example, CYC knows that trees are usually outdoors, or that people who have died stop buying things. Each concept in this semantic network is defined as a constant, which can represent a collection (e.g. the set of all people), an individual object (e.g. a particular person), a word (e.g. the English word person), a quantifier (e.g. there exist), or a relation (e.g. a predicate, function, slot, attribute). The entry for the predicate #$mother is: #$mother : (#$mother ANIM FEM) isa: #$FamilyRelationSlot #$BinaryPredicate This says that the predicate #$mother takes two arguments, the first of which must be an element of the collection #$Animal, and the second of which must be an element of the collection #$FemaleAnimal.

Word Sense Disambiguation Most words have more than one interpretation, or sense. If natural language were completely unambiguous, there would be a one-to-one relationship between words and senses. In fact, things are much more complicated, because for most words not even a fixed number of senses can be given. Therefore, only in certain circumstances, and depending on what exactly we mean by sense, can we give restricted solutions to the problem of Word Sense Disambiguation (WSD).

A simplified example of a domain ontology and its instances
Ontology_1: Movies (schema)
Title: String
Date: mm/dd/yyyy
Duration: minutes
Type: (action, drama, ...)
Director: String
Main Actors: Name_1: Role: Name_2: Role: Name_3: Role: … …
Ontology_1: Movies (instance)
Title: Lord of the Rings
Date:
Duration:
Type:
Director: Peter Jackson
Main Actors: Name_1: Role: Name_2: Role: Name_3: Role: … …

Example of RDF Schema for the Movie Ontology
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'
         xmlns:NS0='http://webode.dia.fi.upm.es/RDFS/MovieOntology#' >
  <rdf:Description rdf:about='http://webode.dia.fi.upm.es/RDFS/MovieOntology#SpecialEffectsCompanyActing'>
    <rdf:type rdf:resource='http://www.w3.org/2000/01/rdf-schema#Class'/>
    <rdfs:comment>Details of company that created special effects in this movie</rdfs:comment>
    <rdfs:subClassOf rdf:resource='http://webode.dia.fi.upm.es/RDFS/MovieOntology#CompanyActing'/>
  </rdf:Description>
  <rdf:Description rdf:about='http://webode.dia.fi.upm.es/RDFS/MovieOntology#Police'>
    <rdf:type rdf:resource='http://www.w3.org/2000/01/rdf-schema#Class'/>
    <rdfs:comment>Films that deal solely with police activity</rdfs:comment>
    <rdfs:subClassOf rdf:resource='http://webode.dia.fi.upm.es/RDFS/MovieOntology#Crime'/>
  </rdf:Description>
etc…
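The class hierarchy encoded in such a schema can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on the two class descriptions from the example; the closing `</rdf:RDF>` tag is added here so that the fragment parses, and `subclass_pairs` is a hypothetical helper. A dedicated RDF library would be the more robust choice for real schemas.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
BASE = "http://webode.dia.fi.upm.es/RDFS/MovieOntology#"

# The two descriptions from the example, closed so that the XML is well-formed.
SCHEMA = f"""<rdf:RDF xmlns:rdf='{RDF}' xmlns:rdfs='{RDFS}'>
  <rdf:Description rdf:about='{BASE}SpecialEffectsCompanyActing'>
    <rdf:type rdf:resource='{RDFS}Class'/>
    <rdfs:subClassOf rdf:resource='{BASE}CompanyActing'/>
  </rdf:Description>
  <rdf:Description rdf:about='{BASE}Police'>
    <rdf:type rdf:resource='{RDFS}Class'/>
    <rdfs:subClassOf rdf:resource='{BASE}Crime'/>
  </rdf:Description>
</rdf:RDF>"""


def subclass_pairs(rdf_xml):
    """Return (class, super-class) local names for every rdfs:subClassOf link."""
    root = ET.fromstring(rdf_xml)
    pairs = []
    for description in root.findall(f"{{{RDF}}}Description"):
        child = description.get(f"{{{RDF}}}about").split("#")[-1]
        for link in description.findall(f"{{{RDFS}}}subClassOf"):
            parent = link.get(f"{{{RDF}}}resource").split("#")[-1]
            pairs.append((child, parent))
    return pairs


for child, parent in subclass_pairs(SCHEMA):
    print(f"{child} is a subclass of {parent}")
```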

Multilingual terminological lexicon, attached to a domain ontology (MUMIS)
<lex-element id="ID" concept="Shot-on-goal">
  <... lang="DE" type="main">Torschuss</term>
  <... lang="EN" type="main">shot on goal</term>
  <... lang="NL" type="main">schot op doel</term>
  <definition>ein Angriffsspieler kickt den Ball zu den gegnerischen Tor</definition>
  <... lang="DE" type="synonym">Distanzschuss</term>
  <... lang="DE" type="synonym">Nachschuss</term>
  <... lang="DE" type="synonym">Schuss</term>
  <... lang="DE" type="synonym">abzieh</term>
</lex-element>

Extension and formalization of the multilingual terminological lexicon, including syncategorematic information, in support of WSD.
<lex-element id="ID" concept="Shot-on-goal">
  <... lang="DE" type="main" pos="N" mod={"von" concept="Player" | concept="player" gender="gen" | pos="posspron"}>Torschuss</term>
  <... lang="DE" type="synonym" pos="V" comp={"SUBJ" concept="Player"}>abzieh</term>
  <definition>URL: DFB home page/glossary</definition>
</lex-element>

Integrating Syntactic and Domain Knowledge Including Syntactic Analysis for a more accurate tagging of domain specific semantic annotation

Abstraction over Syntactic Annotation
Ontology_1: NP Head: N Mod: {Adj*, PP?, GenNP} Spec: {Det? PossPron?} Type: {RefNP, ProNP, DateNP, etc.}
Ontology_2: PP Head: Prep Comp: NP Type: {LocPP, DatePP, etc.}
Ontology_3: Dependencies: Head, Comp, Mod, Spec
Ontology_4: Grammatical Functions: Subject, Object, Ind. Object, NP Adjunct, PP Adjunct, etc.

Merging of Syntactic and Domain Knowledge
Example of a possible rule for conceptual annotation: If (Head of Subj_NP of Verb[type=soccer::shot-on-goal] is a person) => { annotate head of NP with semantic class “soccer::player”; … }
Example of a rule for instance filling: If (term annotated with concept “soccer::player”) => { try to find information about relations “Team”, “Age”, etc. } (this corresponds to template filling in Information Extraction).
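The conceptual annotation rule can be sketched as a function over an analysed clause. The dictionary layout of the clause (`verb`, `subject_np`, `head`, `is_person`) is an assumption made for this example and does not reproduce the MUMIS data structures.

```python
def annotate_player(clause):
    """Apply the rule: if the head of the subject NP of a shot-on-goal verb
    is a person, annotate that head with the class "soccer::player"."""
    verb = clause["verb"]
    head = clause["subject_np"]["head"]
    if verb.get("type") == "soccer::shot-on-goal" and head.get("is_person"):
        head["semantic_class"] = "soccer::player"
    return clause


# Hypothetical analysis of a clause whose verb was mapped to shot-on-goal.
clause = {
    "verb": {"stem": "abzieh", "type": "soccer::shot-on-goal"},
    "subject_np": {"head": {"string": "Ziege", "is_person": True}},
}
annotate_player(clause)
print(clause["subject_np"]["head"]["semantic_class"])
```

Instance filling would then start from every head carrying this class and look for related values such as the team.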

NLP-based knowledge markup

MuchMore: DTD for Annotation [Figure: DTD tree diagram. Elements shown: document, sentence, text, token, chunks, chunk, gramrels, gramrel, ewnterms, ewnterm, sense, semrels, semrel, umlsterms, umlsterm, xrceterms, xrceterm, msh. Attributes shown: id, from, to, type, pos, lemma, offset, term1, term2, pref, cui, tui, code]

MuchMore: Linguistic Annotation (Lemmatization, POS, Basic Chunking)
Balint syndrom is a combination of symptoms including simultanagnosia, a disorder of spatial and object-based attention, disturbed spatial perception and representation, and optic ataxia resulting from bilateral parieto-occipital lesions.
<text>
  <token id="w1" pos="NN">Balint</token>
  <token id="w2" pos="NN">syndrom</token>
  <token id="w3" pos="VBZ" lemma="be">is</token>
  <token id="w4" pos="DT" lemma="a">a</token>
  <token id="w5" pos="NN" lemma="combination">combination</token>
  <token id="w6" pos="IN" lemma="of">of</token>
  <token id="w7" pos="NNS" lemma="symptom">symptoms</token>
  ...
  <token id="w20" pos="JJ" lemma="spatial">spatial</token>
  <token id="w21" pos="NN" lemma="perception">perception</token>
  <token id="w22" pos="CC" lemma="and">and</token>
  <token id="w23" pos="NN" lemma="representation">representation</token>
  ...
</text>
<chunks>
  <chunk id="c1" from="w1" to="w2" type="NP"/>
  <chunk id="c7" from="w20" to="w23" type="NP"/>
</chunks>
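Since chunks refer to tokens only by identifier, the chunk text has to be recovered by resolving the `from` and `to` attributes. The sketch below does this for chunk c7 with `xml.etree.ElementTree`; wrapping the fragment in a `<sentence>` root and the `chunk_texts` helper are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Tokens w20 to w23 and chunk c7 from the example, wrapped in an assumed root.
ANNOTATION = """<sentence>
  <text>
    <token id="w20" pos="JJ" lemma="spatial">spatial</token>
    <token id="w21" pos="NN" lemma="perception">perception</token>
    <token id="w22" pos="CC" lemma="and">and</token>
    <token id="w23" pos="NN" lemma="representation">representation</token>
  </text>
  <chunks>
    <chunk id="c7" from="w20" to="w23" type="NP"/>
  </chunks>
</sentence>"""


def chunk_texts(xml_string):
    """Map each chunk id to (chunk type, text spanned by the chunk)."""
    root = ET.fromstring(xml_string)
    tokens = root.findall("./text/token")
    ids = [token.get("id") for token in tokens]
    result = {}
    for chunk in root.findall("./chunks/chunk"):
        start = ids.index(chunk.get("from"))
        end = ids.index(chunk.get("to"))
        words = [token.text for token in tokens[start:end + 1]]
        result[chunk.get("id")] = (chunk.get("type"), " ".join(words))
    return result


print(chunk_texts(ANNOTATION))
```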

MuchMore: Semantic Annotation (UMLS, EuroWordNet)
Balint syndrom is a combination of symptoms including simultanagnosia, a disorder of spatial and object-based attention, disturbed spatial perception and representation, and optic ataxia resulting from bilateral parieto-occipital lesions.
<umlsterm id="t7" from="w20" to="w21">
  <concept id="t7.1" cui="C0037744" preferred="Space Perception" tui="T041">
    <msh code="F2.463.593.778"/>
    <msh code="F2.463.593.932.869"/>
  </concept>
</umlsterm>
<umlsterm id="t8" from="w26" to="w26">
  <concept id="t8.1" cui="C0029144" preferred="Optics" tui="T090">
    <msh code="H1.671.606"/>
  </concept>
</umlsterm>
<semrel id="r7" term1="t7.1" term2="t8.1" reltype="issue_in"/>
<ewnterm id="e2" from="w21" to="w21">
  <sense offset="0487490"/>
  <sense offset="3955418"/>
  <sense offset="4002483"/>
</ewnterm>

MUMIS: DTD for Linguistic Annotation [Figure: DTD diagram. Elements shown: Document, Paragraph, Sentence, Subord-Clause, NE, NP, PP, VG, AP, AdvP]

MUMIS: DTD for Linguistic Annotation [Figure: DTD diagram for AP. Names shown: AP, TYPE, STRUK, AP_AGR, STRING, W, AP_HEAD]

MUMIS: DTD for Linguistic Annotation [Figure: DTD diagram for VG. Names shown: VG, TYPE, VG_TYPE, VG_SUBCAT_STEM, VG_AGR, STRING, SENT_STRING, KLAMMER, STRUK, VG_STRG, W, VG_HEAD, ...]

MUMIS: DTD for Linguistic Annotation [Figure: DTD diagram for CLAUSE. Names shown: CLAUSE, CLAUSE_TYPE, CLAUSE_SUBJ, CLAUSE_PP_ADJUNKT, CLAUSE_PRED_SUBCAT, CLAUSE_PRED_STRG, CLAUSE_PRED_AGR, CLAUSE_VG_LIST, CLAUSE_PP_LIST, CLAUSE_NP_LIST, SENT_STRING, W, STEM, INFL, POS, TC, TYPE, STRING, ...]

MUMIS: Linguistic Annotation (Lemmatization … Dependency Structure)
Industrie, Handel und Dienstleistungen werden in der ersten Liste aufgeführt, wobei die in Klammern gesetzten Zahlen auf die Mutterfirmen hinweisen. (Industry, trade and services are mentioned in the first list, in which numbers within brackets point to parent companies.)
<chunks>
  <chunk id="c1" from="w1" to="w5" type="NP" head="w1,w3,w5"/>
  <chunk id="c2" from="w6" to="w6" type="VG"/>
  <chunk id="c3" from="w7" to="w10" type="PP" head="w7" complement="w8,w9,w10"/>
  <chunk id="c4" from="w11" to="w11" type="VG"/>
  …
</chunks>
<clauses>
  <clause id="cl1" from="c1" to="c4" pred_struct="c2 c4" GF_Subj="c1"/>
  <clause id="cl2" from="c6" to="c9" pred_struct="c9" GF_Subj="c6"/>
</clauses>

MUMIS: Semantic Annotation (Events)
7. Ein Freistoss von Christian Ziege aus 25 Metern geht über das Tor. (7th minute: a free kick by Christian Ziege from 25 metres goes over the goal.)
<chunks>
  <chunk id="c1" from="w1" to="w5" type="NP" head="w2" pp modifier="w3 w4 w5"/>
  <chunk id="c2" from="w6" to="w8" type="PP" head="w6" complement="w7 w8"/>
  <chunk id="c3" from="w9" to="w9" type="VG"/>
  <chunk id="c4" from="w10" to="w12" type="PP" head="w10" complement="w11 w12"/>
</chunks>
<clauses>
  <clause id="cls1" from="c1" to="c4" pred_struct="c3" GF_Subj="c1"/>
</clauses>
<events>
  <event id="e1" clause="cls1" event-name="free-kick">
    <arguments>
      <argument id="arg1" name="player" value="w4, w5"/>
      <argument id="arg2" name="location" value="25-meter"/>
      <argument id="arg3" name="time" value="07:00"/>
    </arguments>
  </event>
  <event id="e2" clause="cls1" event-name="goal-scene-fail">
    <arguments>
      <argument id="arg1" name="player" value="w4, w5"/>
      <argument id="arg2" name="location" value="25-meter"/>
      <argument id="arg3" name="time" value="07:00"/>
    </arguments>
  </event>
</events>
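Event arguments mix token references (the player value points to w4 and w5) with normalized values (location, time). A consumer of the annotation therefore has to resolve the references. The sketch below assumes the tokens and the free-kick event are already available as Python dictionaries; this layout and the `resolve` helper are illustrative only.

```python
# Token ids for the example sentence, numbered word by word.
TOKENS = {
    "w1": "Ein", "w2": "Freistoss", "w3": "von", "w4": "Christian",
    "w5": "Ziege", "w6": "aus", "w7": "25", "w8": "Metern",
    "w9": "geht", "w10": "über", "w11": "das", "w12": "Tor",
}

EVENT = {
    "event-name": "free-kick",
    "arguments": {"player": "w4, w5", "location": "25-meter", "time": "07:00"},
}


def resolve(value, tokens):
    """Replace a comma-separated list of token ids by the token strings.

    Values that are not made up entirely of known token ids are returned as-is.
    """
    ids = [part.strip() for part in value.split(",")]
    if all(token_id in tokens for token_id in ids):
        return " ".join(tokens[token_id] for token_id in ids)
    return value


resolved = {name: resolve(value, TOKENS)
            for name, value in EVENT["arguments"].items()}
print(EVENT["event-name"], resolved)
```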

Conceptual Annotations for Multimedia Indexing and Retrieval: A multilingual cross-document and incremental IE approach (MUMIS) • Technology development to automatically index (with formal annotations) lengthy multimedia recordings (off-line process): Find and annotate relevant entities, relations and events • Technology development to exploit indexed multimedia archives (on-line process): Search for interesting scenes and play them via Internet • Test Domain: Soccer Games / UEFA Tournament 2000

Indexing by... Off-line Task • Automatic Speech Recognition (Radio/TV Broadcasts) Automatically transforms the speech signals into texts (for 3 languages — Dutch, English and German) • Natural Language Processing (Information Extraction) Analyse all available textual documents (newspapers, speech transcripts, tickers, formal texts ...), identify and extract interesting entities, relations and events • Merging all the annotations produced so far • Create a database with formal annotations • Use video processing to adjust time marks

Information Extraction • Information Extraction (IE) is the task of identifying, collecting and normalizing relevant information for a specific application or user. • The relevant information is typically represented in the form of predefined “templates”, which are filled by means of Natural Language (NL) analysis. • IE combines pattern matching mechanisms, (shallow) NLP and domain knowledge (terminology and ontology).

Information Extraction (2) IE is generally subdivided into the following tasks: - Named Entity task (NE) - Template Element task (TE) - Template Relation task (TR) - Scenario Template task (ST) - Co-reference task (CO)

Subtasks of IE • Named Entity task (NE): Mark in the text each string that represents a person, organization, or location name, a date or time, or a currency or percentage figure. • Template Element task (TE): Extract basic information related to organization, person, and artifact entities, drawing evidence from anywhere in the text.

Subtasks of IE (2) • Template Relation task (TR): Extract relational information on employee_of, manufacture_of, location_of relations etc. (TR expresses domain-independent relationships). • Scenario Template task (ST): Extract pre-specified event information and relate the event information to particular organization, person, or artifact entities (ST identifies domain- and task-specific entities and relations). • Co-reference task (CO): Capture information on co-referring expressions, i.e. all mentions of a given entity, including those marked in NE and TE.

IE applied to soccer Terms as descriptors for the NE task
Team: Titelverteidiger Brasilien, den respektlosen Außenseiter Schottland
Player: Superstar Ronaldo, von Bewacher Calderwood noch von Abwehrchef Hendry, von Jackson als drittem Stürmer, Torschütze Cesar, von Roberto Carlos (16.)
Referee: vom spanischen Schiedsrichter Garcia Aranda
Trainer: Schottlands Trainer Brown, Kapitän Hendry seinen Keeper Leighton
Location: im Stade de France von St. Denis (a more fine-grained location detection would be: Stadion: im Stade de France and City: von St. Denis)
Attendance: Vor 80000 Zuschauern

IE applied to soccer (2) Terms for the NE task
Time: in der 73. Minute, nach gerade einmal 3:50 Minuten, von Roberto Carlos (16.), nach einer knappen halben Stunde, scheiterte Rivaldo (49./52.) jeweils nur knapp, das vor der Pause Versäumte versuchten die Brasilianer nach Wiederbeginn, ...
Date: am Mittwoch, der Turnierstart (?), im WM-Eröffnungsspiel (?)
Score/Result: Brasilien besiegt Schottland 2:1, einen 2:1 (1:1)-Sieg, der zwischenzeitliche Ausgleich, in der 4. Minute in Führung gebracht, köpfte zum 1:0 ein

IE applied to soccer (3) Relations for the TR task
Opponents: Brasilien besiegt Schottland, feierte der Top-Favorit ... einen glücklichen 2:1 (1:1)-Sieg über den respektlosen Außenseiter Schottland
Player_of: hatte Cesar Sampaio den vierfachen Weltmeister ... in Führung gebracht, Collins gelang ... der zwischenzeitliche Ausgleich für die Schotten, der Keeper des FC Aberdeen, Brasiliens Keeper Taffarel
Trainer_of: Schottlands Trainer Brown ...

IE applied to soccer (4) Events for the ST task
Goal: in der 4. Minute in Führung gebracht, das schnellste Tor ... markiert, Cesar Sampaio köpfte zum 1:0 ein, Collins (38.) verwandelte den Strafstoß, hätte Kapitän Hendry seinen Keeper Leighton um ein Haar zum zweiten Mal bezwungen, von dem der Ball ins Tor prallte
Foul: als er den durchlaufenden Gallacher im Strafraum allzu energisch am Trikot zog
Substitution: und mußte in der 59. Minute für Crespo Platz machen...
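How pattern matching fills a template can be sketched for the Opponents relation and the score, using the phrase “Brasilien besiegt Schottland 2:1” from the examples. The single regular expression and the template field names are assumptions for illustration; an actual IE system relies on linguistic analysis and many more patterns.

```python
import re

# Assumed pattern: "<team> besiegt <team> <goals>:<goals>" ("besiegt" = "defeats").
RESULT_PATTERN = re.compile(
    r"(?P<winner>\w+) besiegt (?P<loser>\w+) (?P<won>\d+):(?P<lost>\d+)"
)


def fill_result_template(text):
    """Return a filled result template, or None if the pattern does not match."""
    match = RESULT_PATTERN.search(text)
    if match is None:
        return None
    return {
        "opponents": (match.group("winner"), match.group("loser")),
        "winner": match.group("winner"),
        "score": (int(match.group("won")), int(match.group("lost"))),
    }


print(fill_result_template("Brasilien besiegt Schottland 2:1"))
```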

Conceptual Annotations for Multimedia Indexing and Retrieval: MUMIS [Figure: system architecture diagram. Components: Automatic Speech Recognition (ASR), Domain Modeling (DM), Ontology, Domain Lexicon, Information Extraction (IE), Merging, Query, User Interface (UI). Data shown: Speech Signals, Transcripts, Free Text, Formal Text, Soccer Texts (EN, NL, DE), Annotations, Merged Annotated Formal Text]

The first user interface of MUMIS
