
SProUT Shallow Processing with Unification and Typed Feature Structures

Jakub Piskorski, Language Technology Lab, DFKI GmbH


Presentation Transcript


  1. SProUT: Shallow Processing with Unification and Typed Feature Structures. Jakub Piskorski, Language Technology Lab, DFKI GmbH

  2. Shallow Text Processing (diagram)
     - Input: text documents
     - Shallow text processing components: tokens, word stems, phrases, clause structure, named entities
     - Tasks and applications: document indexing/retrieval, information extraction, text mining, Q/A systems, text classification, automatic database construction, template generation, term association extraction, building ontologies
     - Further labels: concept indices, more accurate queries; domain-specific patterns; semi-structured data; fine-grained concept matching
     - Application areas: executive information systems, data warehousing, e-commerce, workflow management, multi-agents

  3. Finite-State-Based Approaches
     - SPPC: pure finite-state-based STP, small number of basic predicates
     - SMES: predicates inspect arbitrary properties of the input tokens/fragments
     - FASTUS: uses CPSL (Common Pattern Specification Language)
     - GATE: uses JAPE (Java Annotation Patterns Engine)

  4. Motivation for SProUT
     - One system for multilingual and domain-adaptive shallow text processing
     - Trade-off between efficiency and expressiveness
     - Modularity
     - Flexible integration of different processing modules
     - Portability
     - Industrial standards

  5. SProUT is joint work by: Markus Becker, Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu

  6. SProUT Architecture (diagram)
     - Two areas: grammar development environment and online processing
     - Labels: input data, stream of text items, lexical resources, linguistic processing resources, finite-state machine toolkit, regular compiler, XTDL grammar, extended optimized finite-state network, XTDL interpreter, JTFS, structured output data

  7. Core Components: FSM Toolkit
     - Finite-state machine toolkit for building, combining, and optimizing finite-state devices
     - Finite-state machine models: FSA, WFSA, FST, WFST
     - Arbitrary real-valued semirings
     - Some new, crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
     - Functionality similar to the AT&T tools
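The role of the semiring in a weighted acceptor (WFSA) can be illustrated with a small Python sketch. This is not the toolkit's API: the `Semiring` tuple, the arc format, and `string_weight` are all hypothetical names chosen for this example. The same traversal yields a cheapest-path cost or a summed probability depending only on which semiring is plugged in.

```python
from collections import namedtuple

# A semiring bundles the two operations and their neutral elements.
Semiring = namedtuple("Semiring", "plus times zero one")

# Tropical semiring: "plus" picks the cheaper path, "times" adds costs.
TROPICAL = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)
# Probability semiring: "plus" sums alternatives, "times" multiplies along a path.
PROBABILITY = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)


def string_weight(arcs, start, finals, word, sr):
    """Weight assigned to `word` by a WFSA under semiring `sr`.

    arcs:   list of (source, symbol, target, weight) tuples (assumed format)
    finals: dict mapping each final state to its final weight
    """
    current = {start: sr.one}  # state -> accumulated weight so far
    for symbol in word:
        following = {}
        for src, sym, dst, weight in arcs:
            if sym == symbol and src in current:
                candidate = sr.times(current[src], weight)
                following[dst] = sr.plus(following.get(dst, sr.zero), candidate)
        current = following
    total = sr.zero
    for state, weight in current.items():
        if state in finals:
            total = sr.plus(total, sr.times(weight, finals[state]))
    return total
```

A word that the automaton does not accept receives the semiring's zero (infinity in the tropical case).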

  8. Core Components: Regular Compiler
     - Definition and configuration via XML
     - Unicode compatible
     - Extensible set of circa 20 operations
     - Scanner definitions vs. general regular expressions
     - Biasing the optimization process
     - Various ways of handling ambiguities
     - Direct database connection for flexible pattern-based transformation of linguistic resources into an optimized FS representation
     - Regular expressions over TFSs (SProUT), with restrictions

  9. Core Components: Typed Feature Structure Package
     - Java implementation of TFSs
     - Efficient unification operations
     - Dynamic extension of the type hierarchy
     - Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
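As a rough illustration of what unification and subsumption checking over typed feature structures involve, here is a minimal Python sketch. It is not the Java package described on the slide: the tuple representation, the function names, and the toy hierarchy in `PARENT` are assumptions, the hierarchy is restricted to a tree, and coreferences are not modelled.

```python
# Hypothetical toy type hierarchy (child -> parent), loosely following the
# type names that appear on the XTDL slides.
PARENT = {
    "atom": "*top*", "*avm*": "*top*",
    "sign": "*avm*", "infl": "*avm*",
    "token": "sign", "morph": "sign",
    "Noun": "atom", "Adj": "atom",
}


def ancestors(t):
    """Return t followed by all of its supertypes up to the root."""
    chain = [t]
    while t in PARENT:
        t = PARENT[t]
        chain.append(t)
    return chain


def unify_types(a, b):
    """In a tree-shaped hierarchy the unifier is the more specific type, if any."""
    if a in ancestors(b):
        return b
    if b in ancestors(a):
        return a
    return None


def unify(fs1, fs2):
    """Unify two structures given as (type, {feature: structure}) pairs.

    Returns a new structure, or None if the two are incompatible.
    """
    t = unify_types(fs1[0], fs2[0])
    if t is None:
        return None
    feats = dict(fs1[1])
    for name, value in fs2[1].items():
        if name in feats:
            merged = unify(feats[name], value)
            if merged is None:
                return None
            feats[name] = merged
        else:
            feats[name] = value
    return (t, feats)


def subsumes(general, specific):
    """`general` subsumes `specific` if unifying them adds nothing to `specific`."""
    return unify(general, specific) == specific
```

Unifying a `sign` carrying a POS value with a `morph` carrying a STEM value yields a `morph` with both features, while `token` and `morph` fail to unify because neither is a subtype of the other.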

  10. XTDL Formalism
      - Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application
      - XTDL grammar rules: production part on the LHS, output description on the RHS
      - TDL is used to establish a type hierarchy of linguistic entities
      - Type hierarchy (diagram), node labels: *top*, atom, *avm*, *rule*, tense, sign, infl, index-avm, present, token, morph, lang, tokentype, de, en, separator, url
      - Example type definition:
          morph := sign & [POS atom, STEM atom, INFL infl]

  11. XTDL Formalism
      - A couple of standard regular operators: concatenation, optionality (?), disjunction (|), Kleene star (*), Kleene plus (+), n-fold repetition ({n}), m-n span repetition ({m,n})
      - Unidirectional coreference under Kleene star (and restricted iteration):
          [POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]

  12. XTDL Formalism
      loc-pp :> morph & [POS Prep & #preposition,
                         INFL [CASE #1, NUMBER #2, GENDER #3]]
                morph & [POS Determiner,
                         INFL [CASE #1, NUMBER #2, GENDER #3]] ?
                morph & [POS Adjective,
                         INFL [CASE #1, NUMBER #2, GENDER #3]] *
                gazetteer & [TYPE general-location, SURFACE #location]
             -> [CAT location-pp, PREP #preposition, LOCATION #location].

  13. XTDL Interpreter
      - Processing steps:
        1. Matching of regular patterns using unifiability (LHS)
        2. LHS pattern instance creation
        3. Unification of the rule instance and the matched input
      - Longest-match strategy
      - Ambiguities allowed
      - Interpreter generates TFSs as output (cascaded architecture)
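The first step, matching a regular pattern by unifiability with a longest-match strategy, can be sketched in Python as follows. This is a simplification and not the interpreter itself: flat dictionaries stand in for TFSs, the `KIND` feature is a hypothetical stand-in for the `morph`/`gazetteer` types, coreferences are ignored, and only the operators "exactly one" (empty string), `?`, `*`, and `+` are handled.

```python
def unifiable(constraint, item):
    """Flat feature maps are unifiable if they agree on every shared feature.

    A feature missing from the input item counts as unspecified, so it does
    not block the match; this is what separates unifiability from subsumption.
    """
    return all(item.get(name, value) == value for name, value in constraint.items())


def match_ends(pattern, items, start):
    """All end positions at which `pattern` can match `items` from `start`.

    pattern: list of (constraint, quantifier) pairs, quantifier in "", "?", "*", "+"
    """
    positions = {start}
    for constraint, quant in pattern:
        following = set()
        for pos in positions:
            if quant in ("?", "*"):
                following.add(pos)  # zero occurrences are allowed
            p = pos
            while p < len(items) and unifiable(constraint, items[p]):
                p += 1
                following.add(p)
                if quant not in ("*", "+"):
                    break  # "" and "?" consume at most one item
        positions = following
        if not positions:
            break
    return positions


def longest_match(pattern, items, start=0):
    """End position of the longest match from `start`, or None if nothing matches."""
    ends = match_ends(pattern, items, start)
    return max(ends) if ends else None


# A flattened version of the loc-pp pattern and the input "im sonnigen Rom".
LOC_PP = [
    ({"KIND": "morph", "POS": "Prep"}, ""),
    ({"KIND": "morph", "POS": "Determiner"}, "?"),
    ({"KIND": "morph", "POS": "Adjective"}, "*"),
    ({"KIND": "gazetteer", "TYPE": "general-location"}, ""),
]
IM_SONNIGEN_ROM = [
    {"KIND": "morph", "POS": "Prep", "SURFACE": "im"},
    {"KIND": "morph", "POS": "Adjective", "SURFACE": "sonnigen"},
    {"KIND": "gazetteer", "TYPE": "general-location", "SURFACE": "Rom"},
]
```

Calling `longest_match(LOC_PP, IM_SONNIGEN_ROM)` consumes all three items. A full interpreter would additionally instantiate the rule with the matched items and unify the two, which is where coreferences such as `#location` receive their values.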

  14. XTDL Interpreter: matched input sequence “im sonnigen Rom” (in sunny Rome)

  15. XTDL Interpreter: rule with an instantiated pattern on the LHS

  16. XTDL Formalism: unified result

  17. Linguistic Processing Resources
      - Tokenization with fine-grained token classification
      - Gazetteer (static named-entity lexica)
      - Morphology: full-form lexica obtained from ‘compactified’ MMORPH
        - English: 200,000 entries
        - German: 830,000 entries, plus shallow compound recognition
        - French: 225,000 entries
        - Spanish: 570,000 entries
        - Italian: 330,000 entries
      - Asian languages: Chinese (Shanxi), Japanese (ChaSen)
      - Other: Czech (HMM-based part-of-speech tagging plus morphology)

  18. System Description Language
      - A concrete system instance is constructed by defining a regular expression over module specifications
      - All linguistic modules must implement a specific Java interface
      - The system description is compiled automatically into a single Java class

  19. System Description Language
      (M1 M2)(input)
          M1.clearState();
          M1.setInput(input);
          M1.setOutput(M1.computeOutput(M1.getInput()));
          M2.clearState();
          M2.setInput(mediateSeq(M1,M2));
          M2.setOutput(M2.computeOutput(M2.getInput()));
          return M2.getOutput();
      (M*)(input)
          M.clearState();
          M.setInput(input);
          M.setOutput(mediateFix(M));
          return M.getOutput();
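The two compiled forms above can be mimicked in a few lines of Python to show the control flow. This is a sketch under assumptions, not the generated Java: the `Module` class is a hypothetical counterpart of the module interface, `mediateSeq` is read as simply handing M1's output to M2, and `mediateFix` is read as repeating M until its output stops changing; the slides do not spell out either mediator.

```python
class Module:
    """Hypothetical stand-in for a module implementing the Java interface."""

    def __init__(self, fn):
        self.fn = fn
        self.clear_state()

    def clear_state(self):
        self.input = None
        self.output = None

    def compute_output(self, data):
        return self.fn(data)


def run_seq(m1, m2, data):
    """(M1 M2)(input): run M1, then feed its output to M2."""
    m1.clear_state()
    m1.input = data
    m1.output = m1.compute_output(m1.input)
    m2.clear_state()
    m2.input = m1.output  # assumed behaviour of mediateSeq(M1, M2)
    m2.output = m2.compute_output(m2.input)
    return m2.output


def run_star(m, data, max_rounds=100):
    """(M*)(input): assumed reading of mediateFix as iteration to a fixpoint."""
    m.clear_state()
    m.input = data
    current = data
    for _ in range(max_rounds):  # guard against modules that never converge
        following = m.compute_output(current)
        if following == current:
            break
        current = following
    m.output = current
    return m.output
```

For example, with `lower = Module(str.lower)` and `squeeze = Module(lambda s: s.replace("  ", " "))`, `run_seq(lower, squeeze, "A  B")` lowercases and then collapses the double space, while `run_star(squeeze, "a    b")` keeps applying `squeeze` until only single spaces remain.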

  20. Future Work
      - Optimization of grammar interpretation
      - Various search strategies
      - Additional linguistic processing resources
      - Real data testing: large grammars and real-world texts
