
Robust Semantic Processing for Information Extraction






Presentation Transcript


  1. Robust Semantic Processing for Information Extraction Ann Copestake Computer Laboratory, University of Cambridge aac@cl.cam.ac.uk

  2. Outline • Information Extraction • Combining deep and shallow processing • RMRS • MRS • basic ideas of RMRS • RASP-RMRS • RMRS and IE in Deep Thought • SciBorg project

  3. Acknowledgements • Deep Thought (EU funded, 2002-2004) • Computer Lab: Ann Copestake, Anna Ritchie, Ben Waldron • Sussex, Saarland, DFKI, Xtramind, CELI, NTNU • SciBorg (EPSRC, 2005-2009) • Computer Lab: Ann Copestake, Simone Teufel, CJ Rupp, Advaith Siddharthan • Chemistry: Peter Murray-Rust, Peter Corbett • CeSC: Mark Hayes, Andy Parker • DELPH-IN (informal ongoing collaboration) • Boeing funding to Computer Lab: Ben Waldron • especially Dan Flickinger, Alex Lascarides, Stephan Oepen, John Carroll, Anette Frank

  4. Information extraction • Classic IE: MUC-style template filling, gene/protein interactions • IE in general: acquiring specific types of knowledge from text via language processing: e.g., • organic chemistry syntheses • ontological relationships • relationships between texts (for search) • IR, QA, I2E

  5. IE from Chemistry texts • "To a solution of aldimine 1 (1.5 mmol) in THF (5 mL) was added LDA (1 mL, 1.6 M in THF) at 0 °C under argon, the resulting mixture was stirred for 2 h, then was cooled to -78 °C ..." → recipe expressed in CML • "... alkaloids and other complex polycyclic azacycles ..." → <owl:Class rdf:ID="Alkaloid"> <rdfs:subClassOf rdf:resource="#Azacycle" /> • "Enamines have been used widely ... (citation Y), however, ... did not provide the desired products." → X cites Y (contrast)

  6. Standard IE architecture • Preprocessing of markup etc (specific to text type) • Tokenisation (not domain-specific) • Named Entity Recognition (domain-specific ontologies, domain-specific patterns) • Chunking: detection of noun and verb groups (not domain-specific) • Anaphora resolution (domain-specific ontologies) • Relationship detection via patterns over chunks (domain- and task-specific) • DB instantiation (task-specific)

  7. State of the art in IE • Several options for whole IE systems and individual components, especially for English • Increasing integration of ontologies • Commercial systems for some applications • But, many IE-style tasks still done manually: • IE performance (especially when high precision required) • IE robustness to different text types • IE porting requirement (especially NER and relation patterns) • Performance of standard architecture may be reaching a plateau • More advanced IE tasks are not generally attempted • e.g., organic synthesis example could be done with adaptation of standard architecture, but would take substantial effort by highly trained people. • Skill set: substantial domain skills plus substantial NLP

  8. Objectives • Integrate and adapt tools for language processing in general • Eventual use by non-NLP people: black box for language processing • Incorporate deeper processing (DELPH-IN technology): aim to get above plateau • Integration with XML, semantic web • Methodology: • Combine statistical and symbolic processing, machine learning and hand-crafting • Open Source where possible, collaborative development • No toy systems, no artificial evaluations • Multilingual via collaboration

  9. Deep processing in IE • Some early IE systems attempted to use deep processing: SRI (and also NYU) • FASTUS was originally a shallow preprocessor for TACITUS, but TACITUS was dropped: much too slow, not sufficiently robust • Often claimed: deep processing failed for IE, but: • only two serious attempts(?), both under time pressure, limited types of IE task • deep processing has improved since the early 1990s: • speed • empirical coverage (note that hand-built deep grammars do scale, unlike traditional AI knowledge bases) • integration of statistical techniques into deep processing • if the existing IE architecture is approaching a plateau, we have to try something else – i.e., combined deep and shallow processing (DFKI Whiteboard project)

  10. Integrating processing • No single system can do everything: deep and shallow processing have inherent strengths and weaknesses • shallow: speed and robustness: e.g., POS tagging, chunking • deep: detail, precision, potential for bidirectional processing: e.g., HPSG-based parsers and generators (DELPH-IN technology) • also intermediate: RASP (Robust accurate statistical parser): relatively detailed but no lexicon. • Domain-dependent and domain-independent processing must be linked • Desirable to have a common representation language for processing above sentence level (e.g., anaphora) • Long-term solutions ...

  11. Compositional semantics for component integration • Need a common representation language for systems: pairwise compatibility between systems is too limiting • Syntax is theory-specific and unnecessarily language-specific • Eventual goal of sentence analysis should be semantics • Core idea: shallow processing gives underspecified semantic representation, so deep and shallow systems can be integrated • Full interlingua / common lexical semantics is too difficult (certainly currently), but can link predicates to ontologies, etc.

  12. Integration via underspecified semantics • Integrated parsing: • shallow parsed phrases incorporated into deep parsed structures • deep parsing invoked incrementally in response to information needs • Knowledge sources expressed via semantics can be used by multiple components: e.g., • NER, IE templates, anaphora resolution • Advantages over ad-hoc representation approaches: • Ability to link with detailed lexical semantics as it becomes available • Language generation from semantic representation • Explicit logic: formal properties clearer, representations more generally usable • Deep semantics taken as normative: extensibility

  13. Robust Minimal Recursion Semantics • Minimal Recursion Semantics: MRS. Compositional semantics for deep processing: • Copestake, Flickinger, Sag and Pollard (1999, in press) • adopted for DELPH-IN and other HPSG work • also compatible with LFG etc • logically well-defined • flat semantics (easier to process, allows information to be ignored) • underspecification of quantifier scope (avoid ambiguity) • novel approach to composition (monostratal) • Robust MRS: adaptation of MRS allowing processing without a subcategorization lexicon

  14. RMRS: Extreme underspecification • Goal is to split up semantic representation into minimal components (cf Verbmobil VITs) • Scope underspecification (MRS) • Splitting up predicate argument structure • Explicit equalities • Hierarchies for predicates and sorts • Compatibility with deep grammars: • Sorts and (some) closed class word information in SEM-I (API for grammar, more later) • No lexicon for shallow processing (apart from POS tags and possibly closed class words)

  15. Semantics from POS tagging • every_AT1 cat_NN1 chase_VVD some_AT1 dog_NN1 • _every_q(x1), _cat_n(x2sg), _chase_v(epast), _some_q(x3), _dog_n(x4sg) • Tag lexicon: AT1 → _lemma_q(x), NN1 → _lemma_n(xsg), VVD → _lemma_v(epast)
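As a concrete illustration of the tag-lexicon idea, here is a minimal Python sketch; the function names, the variable-numbering scheme and the exact predicate strings are illustrative, not the Deep Thought / DELPH-IN code.

```python
# Sketch of a POS-tag lexicon producing underspecified RMRS predications.
# Names and data structures are illustrative only.
TAG_LEXICON = {
    "AT1": lambda lemma, i: f"_{lemma}_q(x{i})",       # singular determiner -> quantifier relation
    "NN1": lambda lemma, i: f"_{lemma}_n(x{i}sg)",     # singular noun, sg-sorted variable
    "VVD": lambda lemma, i: f"_{lemma}_v(e{i}past)",   # past-tense verb, past event variable
}

def rmrs_from_tags(tagged):
    """Map lemma_TAG tokens to a bag of underspecified RMRS predications."""
    eps = []
    for i, token in enumerate(tagged.split(), start=1):
        lemma, tag = token.rsplit("_", 1)
        if tag in TAG_LEXICON:
            eps.append(TAG_LEXICON[tag](lemma, i))
    return eps

print(rmrs_from_tags("every_AT1 cat_NN1 chase_VVD some_AT1 dog_NN1"))
# ['_every_q(x1)', '_cat_n(x2sg)', '_chase_v(e3past)', '_some_q(x4)', '_dog_n(x5sg)']
```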

  16. Deep parser output • Conventional semantic representation: Every cat chased some dog every(xsg, cat(xsg), some(ysg, dog1(ysg), chase(esp, xsg, ysg))) some(ysg, dog1(ysg), every(xsg, cat(xsg), chase(esp, xsg, ysg))) • Compositional: reflects morphology and syntax • Scope ambiguity is explicit • May be awkward to process if you don't care about quantifier scope

  17. Modifying syntax of deep grammar semantics: overview • Underspecification of quantifier scope: Minimal Recursion Semantics (MRS) – next 6 slides ... • Robust MRS • Separating arguments • Explicit equalities • Conventions for predicate names and sense distinctions • Hierarchy of sorts on variables

  18. PC trees • Predicate-calculus trees for the two scoped readings of "Every cat chased some dog", built from every, cat, some, dog1 and chase [tree diagrams not reproduced]

  19. PC trees share structure • The two scoped readings share the subtrees for cat, dog1 and chase(e,x,y); they differ only in the relative scope of every and some [tree diagrams not reproduced]

  20. Bits of trees • The trees are split into pieces: every(x, cat(x), ...), some(y, dog1(y), ...), chase(e,x,y) • Reconstruction conditions: tree-ness, variable binding [tree fragments not reproduced]

  21. Label nodes and holes • Each predicate node gets a label (lb1:every, lb2:cat, lb3:chase, lb4:some, lb5:dog1) and each scopal argument position a hole (h6, h7, ...) • h0 – hole corresponding to the top of the tree • Valid solutions: equate holes and labels [tree diagram not reproduced]

  22. Maximize splitting • The quantifiers' restrictions are also detached and linked through holes h8 and h9 • Constraints: h8=lb5, h9=lb2 [tree diagram not reproduced]

  23. MRS: flat representation • Elementary predications: lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y) • Scope constraints: h9=lb2, h8=lb5 (actually qeqs) • Easy to ignore quantification when not relevant for the application: cat(x), dog1(y), chase(e,x,y)
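A small sketch of why the flat representation is convenient to work with, using an illustrative Python encoding of the MRS above (not the actual MRS tooling):

```python
# Illustrative encoding of the flat MRS: a bag of elementary predications
# plus a separate list of scope (qeq) constraints.  Ignoring quantification
# is then just filtering the bag.
eps = [
    ("lb1", "every", ("x", "h9", "h6")),
    ("lb2", "cat",   ("x",)),
    ("lb5", "dog1",  ("y",)),
    ("lb4", "some",  ("y", "h8", "h7")),
    ("lb3", "chase", ("e", "x", "y")),
]
qeqs = [("h9", "lb2"), ("h8", "lb5")]

non_quantifiers = [ep for ep in eps if ep[1] not in ("every", "some")]
print(non_quantifiers)   # cat(x), dog1(y), chase(e, x, y) -- quantification ignored
```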

  24. RMRS: Separating arguments lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y), h9=lb2,h8=lb5 goes to: lb1:every(x), RSTR(lb1,h9), BODY(lb1,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y), RSTR(lb4,h8), BODY(lb4,h7), lb3:chase(e),ARG1(lb3,x),ARG2(lb3,y), h9=lb2,h8=lb5
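The argument-splitting step can be pictured as a simple transformation over such a bag of predications. The sketch below is illustrative only; in particular, real RMRS identifies quantifiers through the predicate hierarchy rather than a hard-coded set.

```python
# Sketch of the MRS -> RMRS argument-splitting step: each n-ary elementary
# predication becomes a unary relation plus separate binary argument
# relations (RSTR/BODY for quantifiers, ARG1, ARG2, ... otherwise).
QUANTIFIERS = {"every", "some"}

def split_arguments(mrs_eps):
    rmrs = []
    for label, pred, args in mrs_eps:
        rmrs.append(f"{label}:{pred}({args[0]})")      # keep only the distinguished variable
        names = ("RSTR", "BODY") if pred in QUANTIFIERS else \
                tuple(f"ARG{n}" for n in range(1, len(args)))
        rmrs += [f"{name}({label},{arg})" for name, arg in zip(names, args[1:])]
    return rmrs

mrs = [("lb1", "every", ("x", "h9", "h6")), ("lb2", "cat", ("x",)),
       ("lb5", "dog1", ("y",)), ("lb4", "some", ("y", "h8", "h7")),
       ("lb3", "chase", ("e", "x", "y"))]
print(split_arguments(mrs))
# ['lb1:every(x)', 'RSTR(lb1,h9)', 'BODY(lb1,h6)', 'lb2:cat(x)', 'lb5:dog1(y)',
#  'lb4:some(y)', 'RSTR(lb4,h8)', 'BODY(lb4,h7)', 'lb3:chase(e)', 'ARG1(lb3,x)', 'ARG2(lb3,y)']
```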

  25. Naming conventions: predicate names without a lexicon lb1:_every_q(x1sg),RSTR(lb1,h9),BODY(lb1,h6), lb2:_cat_n(x2sg), lb5:_dog_n_1(x4sg), lb4:_some_q(x3sg),RSTR(lb4,h8),BODY(lb4,h7), lb3:_chase_v(esp),ARG1(lb3,x2sg),ARG2(lb3,x4sg) h9=lb2,h8=lb5, x1sg=x2sg,x3sg=x4sg note also explicit equalities

  26. POS output as underspecification DEEP – lb1:_every_q(x1sg), RSTR(lb1,h9), BODY(lb1,h6), lb2:_cat_n(x2sg), lb5:_dog_n_1(x4sg), lb4:_some_q(x3sg), RSTR(lb4,h8), BODY(lb4,h7),lb3:_chase_v(esp), ARG1(lb3,x2sg),ARG2(lb3,x4sg), h9=lb2,h8=lb5, x1sg=x2sg,x3sg=x4sg POS – lb1:_every_q(x1), lb2:_cat_n(x2sg), lb3:_chase_v(epast), lb4:_some_q(x3), lb5:_dog_n(x4sg)

  27. POS output as underspecification DEEP – lb1:_every_q(x1sg), RSTR(lb1,h9),BODY(lb1,h6), lb2:_cat_n(x2sg), lb5:_dog_n_1(x4sg), lb4:_some_q(x3sg), RSTR(lb4,h8), BODY(lb4,h7),lb3:_chase_v(esp), ARG1(lb3,x2sg),ARG2(lb3,x3sg), h9=lb2,h8=lb5, x1sg=x2sg,x3sg=x4sg POS – lb1:_every_q(x1), lb2:_cat_n(x2sg), lb3:_chase_v(epast), lb4:_some_q(x3), lb5:_dog_n(x4sg)
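Reading the two slides above operationally: every predication produced by the tagger should be compatible with some predication in the deep RMRS. Below is a hedged sketch of such a check, assuming the flattened epast/esp annotations are sorted event variables and standing in a toy prefix test and sort table for the real predicate and sort hierarchies.

```python
# Illustrative check that a shallow (POS-level) RMRS underspecifies the deep
# one: every shallow predication must be compatible with some deep one.
SORT_HIERARCHY = {("past", "sp"): True}      # toy fragment: simple past is a subtype of past

def sort_compatible(shallow_sort, deep_sort):
    return (shallow_sort is None or shallow_sort == deep_sort
            or SORT_HIERARCHY.get((shallow_sort, deep_sort), False))

def pred_compatible(shallow_pred, deep_pred):
    # "_dog_n" from the tagger is compatible with the deep "_dog_n_1" (extra sense field)
    return deep_pred == shallow_pred or deep_pred.startswith(shallow_pred + "_")

shallow = [("_every_q", None), ("_cat_n", "sg"), ("_chase_v", "past"),
           ("_some_q", None), ("_dog_n", "sg")]
deep    = [("_every_q", "sg"), ("_cat_n", "sg"), ("_chase_v", "sp"),
           ("_some_q", "sg"), ("_dog_n_1", "sg")]

assert all(any(pred_compatible(sp, dp) and sort_compatible(ss, ds)
               for dp, ds in deep)
           for sp, ss in shallow)
print("POS-level RMRS is an underspecification of the deep RMRS")
```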

  28. RMRS principles • Split up information content as much as possible • Accumulate information monotonically by simple operations • Don’t represent what you don’t know but preserve everything you do know • Use a flat representation to allow pieces to be accessed individually

  29. Semantics from RASP • RASP: robust, domain-independent, statistical parsing (Briscoe and Carroll) • can't produce conventional semantics because there is no subcategorization • can often identify arguments: S -> NP VP (the NP supplies ARG1 for the V) • potential for partial identification: VP -> V NP, S -> NP S (the NP might be ARG2 or ARG3)
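A sketch of what rule-driven argument assignment might look like without a subcategorization lexicon; the rule names, the table and the "ARG2/ARG3" label are illustrative (RMRS itself expresses this kind of underspecification through its hierarchy of argument relations).

```python
# Illustrative mapping from RASP-style rules to RMRS argument relations.
# Without a subcategorization lexicon, the grammar rule decides which ARG a
# daughter's index fills; where the rule cannot decide, an underspecified
# argument name is recorded instead of guessing.
RULE_SEMANTICS = {
    "S -> NP VP": [("ARG1", "NP")],         # subject NP supplies ARG1 of the verb
    "VP -> V NP": [("ARG2/ARG3", "NP")],    # object NP: ARG2 or ARG3, left underspecified
}

def apply_rule(rule, verb_anchor, daughter_index):
    return [f"{argname}({verb_anchor},{daughter_index})"
            for argname, _daughter in RULE_SEMANTICS.get(rule, [])]

print(apply_rule("S -> NP VP", "lb3", "x2"))   # ['ARG1(lb3,x2)']
print(apply_rule("VP -> V NP", "lb3", "x4"))   # ['ARG2/ARG3(lb3,x4)']
```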

  30. RMRS construction • deep grammars: MRS <-> RMRS converter. • POS-RMRS: tag lexicon. • RASP-RMRS: tag lexicon plus semantic rules associated with RASP rules. • no lexical subcategorization, so rely on grammar rules to provide the ARGs • output aims to match deep grammar (ERG) • developed on basis of ERG semantic test suite • default composition principles when no rule RMRS specified • Composition algebra: • MRS composition assumes a lexicalized approach: algebra defined in Copestake, Lascarides and Flickinger (2001) • RMRS with non-lexicalised grammars has similar basic algebra • All approaches have common composition principles, so there is compatibility at a phrasal level.

  31. Some cat sleeps (in RASP) • sleeps: [h3,e], <h3>, {h3:_sleep(e)} • some cat: [h,x], <h1>, {h1:_some(x), RSTR(h1,h2), h2:_cat(x)} • rule S -> NP VP: Head=VP, ARG1(<VP anchor>,<NP hook.index>) • some cat sleeps: [h3,e], <h3>, {h3:_sleep(e), ARG1(h3,x), h1:_some(x), RSTR(h1,h2), h2:_cat(x)}
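The same worked example as a runnable sketch, with each constituent carrying a hook, an anchor and a bag of relations; the dictionary layout is illustrative rather than the DELPH-IN algebra implementation.

```python
# One composition step for "some cat sleeps": the S -> NP VP rule takes the
# VP as head, unions the relation bags, and adds ARG1(<VP anchor>, <NP index>).
sleeps   = {"hook": ("h3", "e"), "anchor": "h3",
            "rels": ["h3:_sleep(e)"]}
some_cat = {"hook": ("h", "x"), "anchor": "h1",
            "rels": ["h1:_some(x)", "RSTR(h1,h2)", "h2:_cat(x)"]}

def s_np_vp(np, vp):
    """S -> NP VP: head is the VP; the NP's index fills ARG1 at the VP's anchor."""
    return {"hook": vp["hook"], "anchor": vp["anchor"],
            "rels": vp["rels"] + [f"ARG1({vp['anchor']},{np['hook'][1]})"] + np["rels"]}

print(s_np_vp(some_cat, sleeps)["rels"])
# ['h3:_sleep(e)', 'ARG1(h3,x)', 'h1:_some(x)', 'RSTR(h1,h2)', 'h2:_cat(x)']
```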

  32. ERG-RMRS / RASP-RMRS [comparison example not reproduced]

  33. Inchoative [example not reproduced]

  34. Infinitival subject (unbound in RASP-RMRS) [example not reproduced]

  35. Mismatch: Expletive it [example not reproduced]

  36. SEM-I: semantic interface • Meta-level: manually specified `grammar’ relations (constructions and closed-class) • Object-level: linked to lexical database for deep grammars • Object-level SEM-I auto-generated from expanded lexical entries in deep grammars (because type can contribute relations) • Validation of other lexicons • Need closed class items for RMRS construction from shallow processing

  37. Alignment and XML • Comparing RMRSs for the same text efficiently requires `characterization’ • labels parts of an RMRS according to their source position in the text • currently character offsets, but also XPath plus character offsets • RMRS-XML • RMRS seen as levels of mark-up: standoff annotation
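A sketch of what characterization buys: if every predication records the span of text (or XPath) it came from, RMRSs produced by different components over the same text can be aligned span-by-span. The record layout below is illustrative, not the actual RMRS-XML DTD.

```python
# Illustrative standoff-style records: each elementary predication keeps the
# character span of the text it came from (an XPath plus offsets would work
# the same way), so analyses from RASP, the ERG, NER etc. can be merged.
def align(eps_a, eps_b):
    """Pair up predications from two analyses that cover the same span."""
    return [(a, b) for a in eps_a for b in eps_b
            if (a["cfrom"], a["cto"]) == (b["cfrom"], b["cto"])]

rasp_eps = [{"pred": "_cat_n",   "var": "x2", "cfrom": 6, "cto": 9}]
erg_eps  = [{"pred": "_cat_n_1", "var": "x4", "cfrom": 6, "cto": 9}]
print(align(rasp_eps, erg_eps))   # the two analyses of "cat" are paired by span
```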

  38. RMRS approach: current and planned applications • Question answering: • Cambridge CSTIT: deep parse questions, shallow parse answers • QA from structured knowledge: Frank et al (QUETAL project) • Information extraction: • emails (Deep Thought) • Chemistry texts (SciBorg) • Dictionary definition parsing for Japanese and English (Bond and Flickinger) • Rhetorical structure, multi-document summarization ... • also LOGON: semantic transfer. MRSs from LFG used in HPSG generator.

  39. RMRS in Deep Thought • Different systems integrated via the HoG (Heart of Gold): • Invoke shallow or deep parsing, full or partial results, all expressed in RMRS. • Also shallow parsing as a precursor to deep parsing: NER, unknown words. • Preliminary test on an email response application (Xtramind Mailminder): • email categorized, then category-specific templates built from RMRS • increase in precision of automatically instantiated templates (up to 29%) with the addition of the deep parser to the system

  40. IE architecture using deeper processing and RMRS • Preprocessing of markup etc • Tokenisation • Named Entity Recognition: delivers RMRS • Shallow processing (including chunking): delivers RMRS • Deep parsing: uses shallow processing and NER, delivers RMRS • Word sense disambiguation: uses RMRS from best available source, further instantiates RMRS according to ontology • Anaphora resolution: uses RMRS from best available source, further instantiates RMRS • Relationship detection via patterns over deepest possible RMRSs • DB instantiation
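A skeleton of the control flow this architecture implies, with stub functions standing in for the real components: everything communicates via RMRS, and later stages consume the deepest RMRS that could be obtained.

```python
# Illustrative pipeline skeleton; all component functions are stubs.
def tokenise(text):
    return text.split()

def ner(tokens):
    return {"tokens": tokens, "entities": [], "eps": []}   # NER delivers RMRS fragments

def shallow_parse(pre):
    return dict(pre, depth="shallow")                      # chunking etc., delivers RMRS

def deep_parse(pre, fallback):
    # A real system would call the deep parser (ERG/PET) here, using the NER and
    # shallow results; on failure the shallow RMRS is used instead.
    return fallback

def wsd(rmrs):
    return rmrs        # instantiate senses against an ontology

def resolve_anaphora(rmrs):
    return rmrs        # add coreference equalities between variables

def detect_relations(rmrs):
    return []          # patterns over the deepest available RMRS -> DB rows

def process(sentence):
    pre = ner(tokenise(sentence))
    rmrs = deep_parse(pre, fallback=shallow_parse(pre))
    return detect_relations(resolve_anaphora(wsd(rmrs)))

print(process("To a solution of aldimine 1 in THF was added LDA ..."))   # -> []
```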

  41. SciBorg: Chemistry texts • eScience project started in October 2005 at Cambridge • Computer Laboratory, Chemistry, CeSC • Partners: Nature Publishing, Royal Society of Chemistry, International Union of Crystallography (supplying papers and publishing expertise) • Aims: • Develop an NL markup language which will act as a platform for extraction of information. Link to semantic web languages. • Develop IE technology and core ontologies for use by publishers, researchers, readers, vendors and regulatory organisations. • Model scientific argumentation and citation purpose in order to support novel modes of information access. • Demonstrate the applicability of this infrastructure in a real-world eScience environment.

  42. Outline architecture • Inputs: RSC papers, Nature papers, IUCr papers, and Biology and CL papers (pdf), converted to a base XML • Processing: sentence splitting, POS tagging, NER, RASP, ERG/PET, RMRS merge, WSD, anaphora, rhetorical analysis, tasks • Results stored as standoff annotation [architecture diagram not reproduced]

  43. Research markup • Chemistry: The primary aims of the present study are (i) the synthesis of an amino acid derivative that can be incorporated into proteins via standard solid-phase synthesis methods, and (ii) a test of the ability of the derivative to function as a photoswitch in a biological environment. • Computational Linguistics: The goal of the work reported here is to develop a method that can automatically refine the Hidden Markov Models to produce a more accurate language model.

  44. RMRS and research markup • Specify cues in RMRS: e.g., • l1:objective(x), ARG1(l1,y), l2:research(y) • The concept objective generalises the predicates for aim, goal etc and research generalises study, work etc. Ontology for rhetorical structure. • Deep process possible cue phrases to get RMRSs: • feasible because domain-independent • more general and reliable than shallow techniques • allows for complex interrelationships e.g., our goal is not to ... but to ... • Use zones for advanced citation maps (e.g., X cites Y (contrast)) and other enhancements to repositories
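A sketch of how such a cue could be matched with ontology generalisation, using the objective/research cue above; the toy ontology fragment and the matcher are illustrative.

```python
# Illustrative matcher: a cue is a small RMRS pattern over concepts; a text
# predication matches if its predicate maps to a concept at or below the
# cue's concept in a hand-built rhetorical ontology.
ONTOLOGY = {                       # predicate -> concept (toy fragment)
    "_aim_n": "objective", "_goal_n": "objective", "_objective_n": "objective",
    "_study_n": "research", "_work_n": "research", "_research_n": "research",
}

def concept(pred):
    return ONTOLOGY.get(pred)

def matches_cue(eps, args):
    """Cue from the slide: l1:objective(x), ARG1(l1,y), l2:research(y)."""
    for l1, p1, _ in eps:
        if concept(p1) != "objective":
            continue
        for _, p2, y in eps:
            if concept(p2) == "research" and ("ARG1", l1, y) in args:
                return True
    return False

eps  = [("l1", "_goal_n", "x"), ("l2", "_work_n", "y")]
args = [("ARG1", "l1", "y")]
print(matches_cue(eps, args))   # True: "the goal of the work ..." instantiates the cue
```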

  45. Conclusions • Information Extraction is more than company mergers or gene-protein interactions! • Combined deep-shallow processing techniques have potential for IE • RMRS is a representation language that allows for deep-shallow compatibility via extreme underspecification • various systems adapted to output RMRS and further work ongoing • RMRS offers detailed compatibility at a phrasal level • RMRS processing can be integrated with ontologies in various ways • RMRS tools are distributed as Open Source via DELPH-IN • SciBorg will further develop this approach for eScience applications using a generic standoff architecture

  46. Further work on RASP-RMRS • Fast enough (time not significant compared to RASP processing time because no ambiguity) • Too many RASP rules! Need to generalise over classes. • Requires SEM-I: i.e., API for MRS/RMRS from deep grammar • RASP and ERG may change: • compatible test suites – semi-automatic rule update? • alternative technique for composition? • Parse selection – need to generalise over RMRSs • weighted intersections of RMRSs (cf RASP grammatical relations)
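One way the weighted-intersection idea could be realised, by analogy with RASP's weighted grammatical-relation output; the scoring scheme below is an illustrative sketch, not an implemented design.

```python
# Illustrative weighted intersection of the RMRSs from several candidate
# parses: each predication or argument relation is scored by the total
# probability of the parses containing it, so downstream components can
# threshold instead of committing to a single parse.
from collections import defaultdict

def weighted_intersection(parses):
    """parses: list of (probability, list_of_rmrs_pieces)."""
    scores = defaultdict(float)
    for prob, pieces in parses:
        for piece in pieces:
            scores[piece] += prob
    return dict(scores)

parses = [
    (0.6, ["lb3:_chase_v(e)", "ARG1(lb3,x)", "ARG2(lb3,y)"]),
    (0.4, ["lb3:_chase_v(e)", "ARG1(lb3,y)", "ARG2(lb3,x)"]),
]
print(weighted_intersection(parses))
# {'lb3:_chase_v(e)': 1.0, 'ARG1(lb3,x)': 0.6, 'ARG2(lb3,y)': 0.6,
#  'ARG1(lb3,y)': 0.4, 'ARG2(lb3,x)': 0.4}
```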
