Ops Review Parsing & Semantics

Ops Review Parsing & Semantics, November 22, 2007

General research context

Highlights of the last 6 months
• Continued research and development of semantic modules (e.g. temporal expressions, semantic resources, disambiguation of named entities, co-reference, events)
• Progress in solving the issue of the linguistic resources
• Progress in knowledge representation (e.g. beginning of White Paper, research connections, implementation)
• External recognition (evaluation campaigns, program committees and scientific expertise, European and national projects, research connections)

[Figure: component diagram. Labels: FactSpotter, Co-reference, Chunking, Semantic disambiguation, Syntactic Tagger, Named entities, Morphological analyser, Normalisation, Temporal expressions, Tokenizer, location, XIP lexicons, Dependencies, Inference, On-the-fly rule compiler, interface, Disamb, Concept matching, NTM, XIP grammar layers]

Research

Multilingual Language Resources
• Advanced linguistic technology relies on language resources, which must be adapted, extended and improved for new fields.
• Morphological analyzers:
  • new FST for French (based on AlethDic/Genelex)
  • new FST for English (based on the SPECIALIST NLP Tools)
• XIP grammars:
  • integration of new morphology into the French grammar [finished]
  • creation of a basic, general-purpose grammar for Dutch [finished]
  • extension of the German grammar (named entities) [ongoing]

XIPTRANS
• An in-house transducer implementation which is part of the new XIP engine
• It still uses the previous FST files
• It allows lexicons to be compiled on the fly
• A version of XIP is now available which no longer embeds the XFST library

Named Entities/Semantic disambiguation
• Named Entity Normalization: detect in a text multiple occurrences of different surface forms of a named entity and link them to the single corresponding referent (J. Chirac, Chirac, Jacques Chirac, Chichi, M. Chirac, “le Chi”…)
  • Implementation of part of the norm in English and French
• Named Entity Detection for French: work with INA and Vecsys on NE detection in TV show notices and in speech-to-text transcriptions of TV shows (Infomagic SP5): improvement of the system, integration of events and work-of-art titles
• Scenario on Tourism for the Nutch Search Engine: development of event NE extraction (festivals etc.) in French (Infomagic SP1)
• Maud’s PhD on Named Entities: in the process of being written
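The normalization step described above can be illustrated with a minimal sketch. It assumes a hand-written alias table (`ALIASES`, hypothetical) that maps surface forms to one canonical referent; the actual XIP-based system presumably derives these links from grammar rules and context rather than from a fixed dictionary.

```python
from collections import defaultdict

# Hypothetical alias table: normalized surface form -> canonical referent.
# This only illustrates the target behaviour, not how the links are found.
ALIASES = {
    "j. chirac": "Jacques Chirac",
    "chirac": "Jacques Chirac",
    "jacques chirac": "Jacques Chirac",
    "chichi": "Jacques Chirac",
    "m. chirac": "Jacques Chirac",
    "le chi": "Jacques Chirac",
}


def normalize_mentions(mentions):
    """Group detected surface forms under their canonical referent."""
    groups = defaultdict(list)
    for mention in mentions:
        # Case- and whitespace-insensitive lookup key.
        key = " ".join(mention.lower().split())
        # Unknown surface forms are kept as their own referent.
        referent = ALIASES.get(key, mention)
        groups[referent].append(mention)
    return dict(groups)


if __name__ == "__main__":
    print(normalize_mentions(["J. Chirac", "Chichi", "M. Chirac", "Paris"]))
```

All surface forms found in the table end up under the single referent "Jacques Chirac", while a form that is not in the table ("Paris" here) stays on its own.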

Co-reference • Evaluation (internal): finalizing a set of tools to evaluate against the ACE reference corpus • Coreference enhanced with resources: ongoing integration of WordNet to experiment on anaphoric Definite NPs

Temporal Processing • Work on verbal tenses : Determining tense using verbal morphology (continuation of the work between April and September) • Temporal ordering : Temporally ordering events appearing in text (continuation of the work between April and September).

UIMA
• Four XIP-based UIMA components now available
• InfoM@gic
  • morphological analysis and named entities in French
  • morphological analysis and named entities in English
• SAPIR
  • lemmatization and named entities in English
  • summarizer (demo version for integration testing, not fully functional)

Knowledge Representation
• Data and their extraction: data are very small pieces of knowledge extracted from different media.
• Representation and storage: all these atoms of knowledge need to be represented and stored in a knowledge base. This knowledge base will also serve as a basis for further reasoning each time more knowledge is extracted from new documents and added to it.
• Access to knowledge: the other important point is how users will be able to access this knowledge.

Knowledge representation
• Need to consider many aspects, among them:
  • how to extract this knowledge?
  • how to access this knowledge?
  • how to properly represent this knowledge?
  • how to reach a general level of representation (keeping in mind that customisation will ALWAYS be needed)?
  • how to reason on the extracted knowledge (inference etc.)?
  • how to store this knowledge?

Design - The big picture with complete practical processing

Implementation – Representation language
• First choice: OWL (Web Ontology Language)
• Because:
  • W3C Recommendation since 2004 and widely used
  • Based on XML as basic raw data exchange format
  • Based on RDF as reference data exchange
  • Different levels of expressive power: Lite, DL, Full
  • DL: description logic (introduces negation, cardinality and complex restrictions)

Implementation – Java API using Jena Framework
• Import the description of the domain using “com.hp.hpl.jena.ontology.OntModel”
• Bind different reasoners
  • Default reasoner for OWL-DL (instance-based reasoning)
  • External reasoners
• Choice of reasoners: Pellet (not RACER, maybe FaCT)
• Implementation of a test set for trying different frameworks
• Implementation of JenaKB and run of a simple trial

Implementation – Input the description of Domain

Knowledge Representation
• Added Conceptual Graph handling within the core of XIP
• Representation of the information throughout a document in the form of one or many graphs
• Full API to handle and manipulate these graphs from the XIP formalism
• Full API to benefit from these graphs
• Mechanism already protected by a patent
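To make the idea of representing document information as a graph more concrete, here is a minimal, generic sketch of a conceptual graph with typed concept nodes and named relations. It is not the XIP API, which is not described in these slides; `Concept`, `ConceptualGraph` and `build_example` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    type: str            # concept type, e.g. "Person"
    referent: str = "*"  # "*" marks a generic (unnamed) individual


@dataclass
class ConceptualGraph:
    # Each relation is stored as (relation name, source concept, target concept).
    relations: list = field(default_factory=list)

    def add(self, name, source, target):
        self.relations.append((name, source, target))

    def concepts(self):
        """Return every concept that takes part in at least one relation."""
        found = set()
        for _, source, target in self.relations:
            found.update((source, target))
        return found

    def neighbours(self, concept, name):
        """Concepts reached from `concept` through the relation `name`."""
        return [t for n, s, t in self.relations if s == concept and n == name]


def build_example():
    """Graph for the invented sentence 'Jacques Chirac visited Paris'."""
    visit = Concept("Visit")
    graph = ConceptualGraph()
    graph.add("agent", visit, Concept("Person", "Jacques Chirac"))
    graph.add("location", visit, Concept("City", "Paris"))
    return graph, visit
```

Calling `graph.neighbours(visit, "agent")` on the example returns the `Person` concept for Jacques Chirac, which is the kind of query a graph-manipulation API would need to support.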

Frame Semantics
• Starting integration of FrameNet data (from the Berkeley FrameNet project) into XIP linguistic processing
• Use of both static frame descriptions and annotated data from real texts provided by FrameNet
• This task implies:
  • Adapting grammar results to the FrameNet valency descriptions
  • Exploring annotated data in order to detect the most probable Frame Element associated with a given lexical unit
• Expected benefits:
  • Exploitation of this work in the risk detection application
  • A more fine-grained normalized representation
  • Exploration of side effects of frame annotation for semantic disambiguation, thesaurus building etc.
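One simple way to detect the most probable Frame Element from annotated data is relative-frequency counting, sketched below. The input format is an assumption: a flat list of (lexical unit, dependency, frame element) triples, where the dependency stands for whatever the grammar outputs. The function name and the toy data are invented for illustration and do not come from FrameNet files.

```python
from collections import Counter, defaultdict


def most_probable_frame_elements(annotations):
    """Map each (lexical unit, dependency) pair to its most frequent frame element.

    `annotations` is an iterable of (lexical_unit, dependency, frame_element)
    triples; the result maps each pair to (frame_element, relative_frequency).
    """
    counts = defaultdict(Counter)
    for lexical_unit, dependency, frame_element in annotations:
        counts[(lexical_unit, dependency)][frame_element] += 1

    best = {}
    for key, counter in counts.items():
        element, freq = counter.most_common(1)[0]
        best[key] = (element, freq / sum(counter.values()))
    return best


# Invented toy annotations for the lexical unit "buy.v".
TOY_ANNOTATIONS = [
    ("buy.v", "SUBJ", "Buyer"),
    ("buy.v", "SUBJ", "Buyer"),
    ("buy.v", "SUBJ", "Buyer"),
    ("buy.v", "SUBJ", "Seller"),
    ("buy.v", "OBJ", "Goods"),
    ("buy.v", "OBJ", "Goods"),
]

if __name__ == "__main__":
    print(most_probable_frame_elements(TOY_ANNOTATIONS))
```

On the toy data, the subject of "buy.v" maps to Buyer with relative frequency 0.75 and the object to Goods with 1.0. A real mapping would likely need smoothing and richer syntactic context than a single dependency label.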

Event extraction
• Source (factual, counter-factual, not factual)
• Event: who did what, when, where
  • who: persons, organizations
  • when: temporal expressions
  • where: location expressions
• Event types

EC & National projects

EC/national proposals status
• SAPIR: ongoing
• InfoM@gic: ongoing
• PASSAGE: ongoing
• EERQI, CACAO: accepted
• RADARS, VUs: shortlisted. Resubmission of RADARS envisioned in 2008 (encouraged by the EC)
• ATHOS (resubmission to ANR envisioned in 2008)
• Discussion on a new proposal with IBM

SAPIR
• Delivery of D3-1 (coordinated by XRCE) in July 2007
• Achievement of the first milestone (coordinated by XRCE) at the end of September 2007: implementation of the first feature extractors for each medium using UIMA
• June deliverable (coordinated by Xerox): MPEG-7 based annotation scheme for multimedia documents
• Continuation of the technical work, which aims to build a UIMA annotator extracting MPEG-7 features from text (either original text or text produced by an STT system)

Infom@gic: use case risk 2nd phase • Progress since April 2007: • Demo of event extraction: EventSpotter text to data • Evaluation: • Text to event types: input to automatic risk assessment system • Communication: common article with partners at the Language Technology Conference in Poland

Infom@gic: SP2.11 phase 2 • Description: Linguistics methods for named Entity Extraction • Demo of “double” annotation of NE at the 3rd July presentations to DGE • Integration of XIP NE extractor within UIMA • French integrated • English under construction • To be delivered and integrated to the infom@gic platform in November • Start reflections on named entity relationships (paper deliverable) • Start definition of a use case for phase 3 about Named Entity disambiguation

PASSAGE: Syntactic Parsing and Annotated Corpora
• Type: French national project (ANR, Masses de Données / Connaissances Ambiantes)
• Aim:
  • Evaluation of syntactic parsers
  • Building (semi-automatically) a large-scale corpus (millions of words) annotated with syntactic structures, using the outputs of the project participants’ parsers
• Status: started in Q1 2007
• Our role:
  • Contributing to the definition of the annotation reference
  • Participating (with XIP) in the PASSAGE evaluation campaigns
  • Contributing to the elaboration of methods for building the final annotated corpus
• 10K euros, 36 months

CACAO: Cross-language Access to Catalogues And On-line libraries
• Type: STREP project, part of eContentPlus (IST-FP6)
• Aim: providing a sound and maintainable infrastructure for multilingual access to the content of digital libraries and on-line library catalogues
• Status: accepted; grant expected to be signed with the EC for a start on December 1, 2007
• Our role: coordinator, and
  • Multilingual search (French, German, English)
  • Semantic indexing (French, German, English)
  • Named entities (French, German, English)
• 340K euros, 24 months

EERQI: European Research Quality Index
• Type: collaborative project (Socio-economic Sciences and the Humanities, FP7)
• Aim:
  • content aggregation
  • work out new methods for research quality evaluation
• Status: accepted, before negotiations
• Our role: coordinate 3 work packages
  • Multilingual search
  • Semantic analysis of citation types
  • Named entity extraction
• $180K, 36 months

Scientific excellence

IDs
• C. Brun & C. Hagege “Semantic Compatibility Checking for Automatic Correction and Discovery of Named Entities” (rated 3)
• Claude Roux and Xavier Tannier “Detection of dates and named entities within documents and mails to update personal calendars” (rated 3)
• Robert Lofthus, Kristine German, Frederique Segond, Tracy King “Smart Document methods to teach/improve reading skills” (rated 3)
• S. Aït-Mokhtar, Á. Sándor “A Method and System to Search and to Retrieve Information Sources and to Tag the Information with respect to its Factuality according to the Sources”
• Roulland, Castellani, Kaplan, Grasso, O'Neill, Selin “Real-time query suggestion in a troubleshooting context”
• Anne Schiller, Frédérique Segond “Finding and validating email addresses”

Publications
Conferences:
• Aude Rebotier, Agnes Sandor, Stavroula Voyatzi, Takuya Nakamura, Claude Martineau, Thomas Delevallade, Philippe Capet, Julien Jacquelinet: Intelligent awareness: event extraction, information evaluation & risk assessment, L&TC, 3rd Language & Technology Conference, Poznan, Poland, 5-7 October 2007.
• C. Hagege, X. Tannier: XRCE-T: XIP Temporal Module for TempEval Campaign, in Proceedings of SemEval 2007, Prague.
• Brun, Ehrmann, Jacquet: “A Hybrid System for Named Entity Metonymy Resolution”, in Proceedings of LTC’07, Poznan, Poland, October 2007.
• C. Hagege & X. Tannier: XTM: A Robust Temporal Text Processor. Submitted to CICLING 2008.
• Claude Roux: A Calendar Interface in French: XIPAgenda. Submitted to the Intelligent Interface Conference 2008.
Journals:
• Luca Dini, Frederique Segond: “La linguistique informatique au service des sentiments”, Revue de l’électricité et de l’électronique, October 2007.
Books:
• Anne Schiller: "Coding inflectional morphology in computational dictionaries”. Submitted to "Dictionaries. An International Encyclopedia of Lexicography".

External scientific activities
• Program committees & expertise:
  • Frederique:
    • reviewer for RANLP 2007, and chair as well
    • invited tutor at RANLP 2007
    • expert for the FP7 evaluation in Robotics and Interfaces in Brussels (June 11-15)
    • reviewer for the LTC 2007 conference
    • expert for the FP7 evaluation of eContentPlus in Luxembourg (November 14-21)
    • expert for INRIA (Conseil Scientifique PRST MISN) (November 26, Nancy)
    • expert for INRIA (projet Alpage)
    • reviewer for the publisher Lavoisier
    • participant in a round table at Paris Innovation tour (Thales, Vecsys, FT, INA, EXALEAD)
  • ParSem presented the prize for the best paper in semantics at RANLP 2007
  • Salah:
    • acted as a member of the Intern Day jury
    • reviewer for the NLE (Natural Language Engineering) journal
  • Salah and Frederique acted as reviewers for the ANR
• Teaching:
  • Agnes: lecture on semantics at Stendhal
  • Guillaume, Caroline H. and Caroline B.: lectures on semantics at Grenoble III
  • Presentation at the Loria NLP seminar (Nancy) on named entity disambiguation (Maud)
  • Caroline B.: invited to the University of Luminy for a seminar about NE and a course about XIP
  • Salah: Master 2 Ingénierie de la cognition, Université Grenoble II

Internal scientific activities
• Participation in the SEP day (two prizes won):
  • Paper category:
    • C. Brun, G. Jacquet & Maud Ehrmann "A Hybrid System for Named Entity Metonymy Resolution" (awarded at the paper-id session)
    • Caroline Hagege, Xavier Tannier: presentation of work on temporal processing
  • Demo category:
    • Claude Roux, Xavier Tannier: presentation of the work on the agenda (awarded at the demo session)
• Aaron: member of UAC

Some of the research connections (outside EC projects)
• France
  • Université de Lyon (Stéphane Duchâteau): the goal was to connect the French XIP grammar to the Dragon voice recognition system to spot named entities in art descriptions (supports the application of NLP to speech-to-text; see also InfoM@gic)
  • Université de Luminy (Cedric Tarcitano: internship on events, InfoM@gic)
  • Université de Lille (Pierre Marquis): support on knowledge representation in general
• Italy
  • IRST (Bernardo Magnini): supports research around fact extraction with another approach (entailment), as well as recognition in the field and participation in international evaluation campaigns
• Hungary
  • Hungarian Academy of Sciences (Heja Eniko: internship on semantic disambiguation)
• Portugal
  • Lab at INESC-ID in Lisbon, for a joint participation in the second NER evaluation campaign for Portuguese
• Norway
  • Participation in the HAREM guidelines with a proposal for the annotation of temporal expressions (organized by SINTEF-Linguateca)
• Great Britain
  • Cardiff University: possible candidate to work on the customization of FactSpotter for litigation (interview of Nikolaos Lagos on December 3)
• USA
  • MIT (Rob Speer): discussion about building a French version of Open Mind. Interest expressed in an internship (supports research around knowledge representation as well as international recognition)
  • Cornell University (Serena Crivellaro: internship on FrameNet this summer)

Evaluations/Competitions

TempEval
• Wider context: SemEval 2007, Prague
• 3 tasks
  • task A: for each given event E in a text, link E to every given temporal expression of the same sentence
  • task B: for each given event E in a text, link E to the Document Creation Time
  • task C: for each pair of consecutive events E1 and E2, find the temporal relation holding between E1 and E2
• Set of temporal relations: BEFORE, AFTER, OVERLAP, BEFORE_OR_OVERLAP, AFTER_OR_OVERLAP
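As an illustration of the relation set, the sketch below assigns one of the five labels to a pair of time intervals. It assumes each event or temporal expression has already been anchored to a (start, end) interval on a numeric timeline, with `None` for an unknown end point; this is a simplification for illustration and not the method used by the XIP temporal module. The names `Relation` and `relate` are hypothetical.

```python
from enum import Enum


class Relation(Enum):
    BEFORE = "BEFORE"
    AFTER = "AFTER"
    OVERLAP = "OVERLAP"
    BEFORE_OR_OVERLAP = "BEFORE_OR_OVERLAP"
    AFTER_OR_OVERLAP = "AFTER_OR_OVERLAP"


def relate(first, second):
    """Relation of interval `first` to interval `second`.

    Each interval is (start, end); start is always known, end may be None
    when the text does not say when the event stops.
    """
    f_start, f_end = first
    s_start, s_end = second
    if f_end is not None and f_end < s_start:
        return Relation.BEFORE
    if s_end is not None and f_start > s_end:
        return Relation.AFTER
    # Open-ended intervals: only a disjunctive label can be justified.
    if f_end is None and f_start < s_start:
        return Relation.BEFORE_OR_OVERLAP
    if s_end is None and f_start > s_start:
        return Relation.AFTER_OR_OVERLAP
    return Relation.OVERLAP


if __name__ == "__main__":
    print(relate((1, 2), (3, 4)))     # Relation.BEFORE
    print(relate((1, None), (3, 4)))  # Relation.BEFORE_OR_OVERLAP
```

The disjunctive labels are produced only when an end point is unknown, which mirrors their role as a fallback when the text underdetermines the exact relation.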

TempEval
• Results
  • Tasks A and B: precision 0.79 (A) and 0.82 (B); recall 0.50 (A) and 0.60 (B)
    • Highest precision but low recall
  • Task C: precision 0.58, recall 0.58
    • Second-best score
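To compare the three tasks on a single number, precision and recall can be combined into the usual balanced F-measure, F1 = 2PR / (P + R). The sketch below applies it to the figures on this slide. The resulting values are plain arithmetic on those figures, not scores reported by the campaign, and `f1` and `f1_by_task` are hypothetical helper names.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per task, copied from the results slide.
RESULTS = {"A": (0.79, 0.50), "B": (0.82, 0.60), "C": (0.58, 0.58)}


def f1_by_task(results=RESULTS):
    """Return the F1 value for each task, rounded to three decimals."""
    return {task: round(f1(p, r), 3) for task, (p, r) in results.items()}


if __name__ == "__main__":
    for task, score in f1_by_task().items():
        print(f"Task {task}: F1 = {score}")
```

With these inputs the sketch gives roughly 0.61 for task A, 0.69 for task B and 0.58 for task C, which makes visible how the low recall on tasks A and B pulls the combined score below the precision.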

SemEval NE resolution
• Detect and categorize NE sense shifts for location and organization names:
  • England lost the semi-final: loc-for-people(England)
  • Vietnam haunted him: loc-for-event(Vietnam)
  • IBM announced that …: org-for-members(IBM)
  • BMW slipped 4p to 3p.: org-for-index(BMW)

Results (1)

                    Prec.   Coverage   Prec. baseline   Coverage baseline
Location (coarse)   0.851   1          0.794            1
Location (medium)   0.848   1          0.794            1
Location (fine)     0.841   1          0.794            1
Org (coarse)        0.731   1          0.617            1
Org (medium)        0.711   1          0.617            1
Org (fine)          0.700   1          0.617            1

• Coarse: distinction between literal and non-literal readings
• Medium: distinction between literal, metonymic and mixed readings
• Fine: distinction between all types of metonymy

Results (2) LOCATION-fine ORGANIZATION-fine

XGS and other customers

Cannelle • Semi-automatic process+tool for generating IR query expansion rules • In-house evaluation showed significant potential for improving IR results • Interest from XNA, transfer to be contracted for 2008 • (more details in WPT review)

XGS / Bouygues Telecom
• « termination contract » mail: linguistic analysis
• Information extraction with XIP (easier to harder):
  • customer's number; customer's phone number
  • customer's name; customer's address
  • mail type (classical termination, termination owing to bereavement, etc.); termination date
• Evaluation on complex information (on 200 documents):

                                         Precision   Recall
customer's name                          94.2%       89.7%
customer's phone number (with disamb.)   97.8%       74.0%
Termination date (with disamb.)          100%        80.0%
Specific termination                     100%        89.0%

Other events
• Claude gave a presentation to Loyds
• Aaron participated in a voice-of-the-customer meeting with HSBC on May 16
• Press event on June 20 around innovation and services, with the announcement of FactSpotter
• Frederique gave many interviews to different media with Irene
• Press event (October 3-4)
• Frederique visited VIDAL (June 27)
• Frederique visited ARISEM to lay the ground for a direct collaboration (interested in XIP)
• Thales visited ParSem about using XIP
• Discussion with the European Commission: interest in licensing the term extractor
• Frederique and Denys: discussion with VIDAL (interest in licensing MDA and XIP)
• Drafting of a document describing some of our linguistic tools, in answer to a request from a US Air Force contact (Caroline B.)
• First contact for RFPs (Tracy Mendez, Thomas Hurysz)
• XLS (demo of FactSpotter to Mc Cusheon and J Schneider)
