Introduction to UIMA. Dr. Judith Eckle-Kohler, Richard Eckart de Castilho, Roland Kluge, Dr. Torsten Zesch. Part 1: UIMA. UIMA – Unstructured Information Management Architecture. Major goal: transform unstructured information to structured information
Introduction to UIMA
Dr. Judith Eckle-Kohler, Richard Eckart de Castilho, Roland Kluge, Dr. Torsten Zesch
Part 1: UIMA 29.05.2013 | Dr. J. Eckle-Kohler, R. Eckart de Castilho, R. Kluge, Dr. T. Zesch
UIMA – Unstructured Information Management Architecture
• Major goal: transform unstructured information into structured information in order to discover knowledge that is relevant to an end user
• Component-based architecture for the analysis of unstructured content such as text, video, and audio
• How it works: think of UIMA components as machines in an assembly line
A Short History of UIMA
• Unstructured Information Management Architecture
• Originally developed at IBM, today an Apache project
• Used in commercial as well as educational contexts
  • LanguageWare, Watson (IBM)
  • uimaFIT (TU Darmstadt, University of Colorado)
  • DKPro Core (!) (TU Darmstadt)
  • many more...
• Java and C++ implementations
Learning to read is difficult for computers …
[Figure: a passage of running text rendered in unreadable symbols, captioned "Unstructured text"]
Analysis Levels in Text Processing
unstructured → Segmentation → Morphology → Syntax → Semantics → structured
UIMA Pipeline Example
[Figure: Collection Reader → CAS → Analysis Engine 1 → CAS → … → Analysis Engine n → CAS → CAS Consumer; the analysis engines cover the levels Segmentation, Morphology, Syntax, Semantics]
UIMA Example Pipeline for Text Processing
[Figure: Collection Reader → CAS → Segmenter → CAS → POS Tagger → CAS → Named Entity Recognizer → CAS → CAS Consumer, shown alongside the analysis levels Segmentation, Morphology, Syntax, Semantics]
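The assembly-line idea behind these pipelines can be sketched in plain Java without any UIMA dependency. Everything in the sketch (MiniCas, Engine, the two example engines) is a hypothetical stand-in chosen for illustration, not the UIMA API: one shared container travels from stage to stage, and each stage adds its results to it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of a UIMA-style pipeline; all names are hypothetical, not UIMA API. */
final class PipelineSketch {

    /** Stand-in for the CAS: the document text plus the annotations added so far. */
    static final class MiniCas {
        final String text;
        final List<String> annotations = new ArrayList<>();

        MiniCas(String text) { this.text = text; }
    }

    /** Stand-in for an analysis engine: reads the container and adds to it. */
    interface Engine {
        void process(MiniCas cas);
    }

    /** Segmentation stage: one "Token" entry per whitespace-separated word. */
    static final class WhitespaceSegmenter implements Engine {
        @Override
        public void process(MiniCas cas) {
            for (String word : cas.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    cas.annotations.add("Token:" + word);
                }
            }
        }
    }

    /** Later stage that builds on the tokens: marks capitalized tokens as name candidates. */
    static final class CapitalizedWordTagger implements Engine {
        @Override
        public void process(MiniCas cas) {
            // Iterate over a copy because entries are added while reading.
            for (String a : new ArrayList<>(cas.annotations)) {
                if (a.startsWith("Token:") && a.length() > 6 && Character.isUpperCase(a.charAt(6))) {
                    cas.annotations.add("NameCandidate:" + a.substring(6));
                }
            }
        }
    }

    /** Runs the stages in order over one document, like machines on an assembly line. */
    static MiniCas run(String documentText, List<Engine> engines) {
        MiniCas cas = new MiniCas(documentText); // the reader's job: fill the container
        for (Engine engine : engines) {
            engine.process(cas);
        }
        return cas;
    }

    public static void main(String[] args) {
        MiniCas cas = run("Hi old Tom",
                Arrays.asList(new WhitespaceSegmenter(), new CapitalizedWordTagger()));
        for (String a : cas.annotations) { // the consumer's job: write the results out
            System.out.println(a);
        }
    }
}
```

The second engine only works because the first one has already run, which is why the order of stages in a pipeline matters.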
UIMA Concepts I
Pipeline stages/components:
• Collection Reader: start of the pipeline, abstraction of the input files
• Analysis Engine: performs analysis (tokenization, segmentation, etc.)
• CAS Consumer: e.g. for writing out results (XML, text, console)
UIMA Concepts II
Data structures:
• Common Analysis System (CAS; called Common Analysis Structure in the Apache UIMA documentation): "data transfer object"
• Type System: representation of annotations, contractual interface between components
• Indexes: accessing annotations
• Views: e.g. raw HTML view, cleaned text view
• Subject of Analysis (SofA): e.g., the document text of the current view
Common Analysis System (CAS)
• High-density data structure that functions like an in-memory database
• Provides access to
  • primary data (the document/artifact under consideration)
  • secondary data (metadata/annotations)
Type System
A UIMA type system specifies the types of data that can be manipulated by annotator components.
• UIMA provides an "object-oriented" type system
• A type system defines two kinds of objects:
  • Types (a type corresponds to a class)
  • Features (a feature corresponds to a class member; a feature structure corresponds to an instance)
Type System
• Single inheritance
• Sub-type polymorphism
• Primitive types: integer, float, boolean, String
• Built-in complex types: arrays, lists, Annotation
• The type system is part of the communication contract between components
Example Type System
[Figure: example type system]
Type System Editor (Eclipse)
[Screenshot: type system editor showing the file src/main/resources/desc/types/TypeSystem.xml and the Java package name of the generated classes]
• JCasGen generates Java classes from XML
Java + CAS = JCas
• JCas maps CAS types into the Java type system
• JCasGen generates Java classes from the XML type system descriptor
  • Token.java – feature structure wrapper with getters and setters
  • Token_Type.java – type wrapper (cf. Java 'Class' class)
• Do not edit these automatically generated Java classes manually!
• JCas wrappers cannot be used stand-alone
  • XML type system descriptors are still needed to initialize the underlying CAS
• Java code example:
    JCas jCas = …;
    Token token = new Token(jCas); // new allocates memory in the CAS!
    token.addToIndexes();          // never forget this!
Indexes
• Recap: feature structures (FS) are stored on the heap
• Components cannot access FS directly, but only via indexes
• Feature structures are only accessible when added to an index
• Feature structures can only be removed from an index, never from the CAS
• Properties of an index (excerpt):
  • Type to be indexed (the index implicitly contains all sub-types)
  • Kind: bag, set, sorted (see next slide)
• Example: built-in Annotation index
  • Type: Annotation
  • Kind: sorted
  • Keys: begin (standard order), end (reverse order)
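Two of these points, that a feature structure is invisible until it is added to an index and that an index for a type also contains its sub-types, can be illustrated with a small plain-Java model. MiniIndex and the two classes below are hypothetical stand-ins written for this sketch, not the UIMA classes of the same names.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of "only indexed feature structures are visible"; not the UIMA API. */
final class IndexVisibilitySketch {

    /** Stand-in for the built-in Annotation type: a span of the document text. */
    static class Annotation {
        final int begin;
        final int end;

        Annotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Stand-in for a user-defined sub-type of Annotation. */
    static final class Token extends Annotation {
        Token(int begin, int end) { super(begin, end); }
    }

    /** Stand-in for an index: components can only see what was added here. */
    static final class MiniIndex {
        private final List<Annotation> entries = new ArrayList<>();

        void addToIndexes(Annotation a) { entries.add(a); }

        void removeFromIndexes(Annotation a) { entries.remove(a); }

        /** Returns all indexed entries of the given type, including its sub-types. */
        <T extends Annotation> List<T> select(Class<T> type) {
            List<T> result = new ArrayList<>();
            for (Annotation a : entries) {
                if (type.isInstance(a)) {
                    result.add(type.cast(a));
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        MiniIndex index = new MiniIndex();
        Token indexed = new Token(0, 2);
        Token forgotten = new Token(3, 6); // created but never added to the index
        index.addToIndexes(indexed);

        // Only the indexed token can be found; the forgotten one still exists but is unreachable.
        System.out.println(index.select(Token.class).size());
        // A query for the super-type also finds the Token.
        System.out.println(index.select(Annotation.class).size());
    }
}
```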
Figure: Indexes
• Bag (duplicates allowed, unordered, no keys): (0,2 v = "Hi"), (7,10 v = "Tom"), (3,6 v = "old"), (7,10 v = "Tim"), (0,2 v = "Ho"), (3,6 v = "red")
• Set (no duplicates, unordered, keys only test equality): (0,2 v = "Hi"), (7,10 v = "Tom"), (3,6 v = "old")
• Sorted (duplicates allowed, ordered): (0,2 v = "Hi"), (0,2 v = "Ho"), (3,6 v = "old"), (3,6 v = "red"), (7,10 v = "Tim"), (7,10 v = "Tom")
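The three index kinds can be imitated with standard Java collections. The sketch below is a plain-Java model, not the UIMA index implementation. It rests on two assumptions made to reproduce the figure: in the set, the first entry with a given (begin, end) key wins, and in the sorted index the value v is used as a final tie-breaker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of bag, set and sorted indexes; not the UIMA implementation. */
final class IndexKindsSketch {

    static final class Entry {
        final int begin;
        final int end;
        final String v;

        Entry(int begin, int end, String v) {
            this.begin = begin;
            this.end = end;
            this.v = v;
        }

        @Override
        public String toString() { return "(" + begin + "," + end + " v=" + v + ")"; }
    }

    /** The six entries from the figure, in the order in which they are added. */
    static List<Entry> figureEntries() {
        return Arrays.asList(
                new Entry(0, 2, "Hi"), new Entry(7, 10, "Tom"), new Entry(3, 6, "old"),
                new Entry(7, 10, "Tim"), new Entry(0, 2, "Ho"), new Entry(3, 6, "red"));
    }

    /** Bag: keeps every entry; no keys, and the order carries no meaning. */
    static List<Entry> bag(List<Entry> input) {
        return new ArrayList<>(input);
    }

    /** Set: the keys (begin, end) only test equality; an entry with already-known keys is skipped. */
    static List<Entry> set(List<Entry> input) {
        Map<String, Entry> byKey = new LinkedHashMap<>();
        for (Entry e : input) {
            byKey.putIfAbsent(e.begin + ":" + e.end, e); // assumption: first entry wins
        }
        return new ArrayList<>(byKey.values());
    }

    /** Sorted: keeps every entry, ordered by begin ascending, then end descending. */
    static List<Entry> sorted(List<Entry> input) {
        List<Entry> result = new ArrayList<>(input);
        result.sort((a, b) -> {
            if (a.begin != b.begin) return Integer.compare(a.begin, b.begin);
            if (a.end != b.end) return Integer.compare(b.end, a.end);
            return a.v.compareTo(b.v); // assumption: tie-breaker to match the figure
        });
        return result;
    }

    public static void main(String[] args) {
        List<Entry> entries = figureEntries();
        System.out.println("bag:    " + bag(entries));
        System.out.println("set:    " + set(entries));
        System.out.println("sorted: " + sorted(entries));
    }
}
```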
Indexes – all you need to know
• Normally nobody needs to define indexes
• Indexes are the only way for UIMA annotators to access annotations in the CAS
• Index entries are not created automatically: every annotation must be added to the indexes explicitly with addToIndexes()
Views and SofAs – Conceptual
• A CAS represents the analysis of a single artifact (a document)
• Each view contains
  • a copy of the artifact, referred to as the Subject of Analysis (SofA): the primary data associated with a view (as returned by getDocumentText()),
  • and a set of indexes, the FSIndexRepository, that UIMA annotators use to access data in the CAS
• Usual setting: a view is one representation of the artifact, e.g.
  • Translation scenario: original text, translated text
  • Transformation scenario: original text, transformed text
  • Multi-modal scenario: video frames, closed captions
Figure: Views and SofAs
• SofA-unaware component: receives the default view in process(CAS) and simply calls getDocumentText(); the base CAS holds the view "_InitialView" (logical) backed by the SofA FS "_InitialView" (physical)
• SofA-aware component: receives the base CAS in process(CAS) and needs to call getView(viewName); the base CAS holds e.g. the views "Text" and "HTML" (logical), each backed by its own SofA FS (physical)
Views and SofAs – Use Cases
• A CAS represents the analysis of a single artifact (a document)
  • This is true in most applications
• But: views can also be used to compare different artifacts (mostly pairs)
  • this requires a customized reader that reads several artifacts into a single CAS
  • and then stores each artifact in a separate view
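A rough plain-Java model may help to picture how views relate to one CAS. MultiViewCas, View and stripTags are hypothetical names invented for this sketch, and the tag stripping is deliberately naive; the real UIMA API differs. The point is only that each named view carries its own text and its own annotations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of one CAS with several named views; not the UIMA API. */
final class ViewsSketch {

    static final String INITIAL_VIEW = "_InitialView";

    /** One view: its own subject of analysis (the text) and its own annotations. */
    static final class View {
        final String name;
        final List<String> annotations = new ArrayList<>();
        private String sofaText;

        View(String name) { this.name = name; }

        void setDocumentText(String text) { this.sofaText = text; }

        String getDocumentText() { return sofaText; }
    }

    /** The container: a default view plus any number of additional named views. */
    static final class MultiViewCas {
        private final Map<String, View> views = new LinkedHashMap<>();

        MultiViewCas() { views.put(INITIAL_VIEW, new View(INITIAL_VIEW)); }

        View createView(String name) {
            if (views.containsKey(name)) {
                throw new IllegalArgumentException("View already exists: " + name);
            }
            View view = new View(name);
            views.put(name, view);
            return view;
        }

        View getView(String name) {
            View view = views.get(name);
            if (view == null) {
                throw new IllegalArgumentException("No such view: " + name);
            }
            return view;
        }
    }

    /** Very naive tag removal, a placeholder for a real HTML cleaner. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    /** Transformation scenario: raw HTML in one view, cleaned text in another. */
    static MultiViewCas demo() {
        MultiViewCas cas = new MultiViewCas();
        cas.createView("HTML").setDocumentText("<p>Hi <b>Tom</b></p>");
        cas.createView("Text").setDocumentText(stripTags(cas.getView("HTML").getDocumentText()));
        cas.getView("Text").annotations.add("Token[0,2]"); // offsets refer to the "Text" view only
        return cas;
    }

    public static void main(String[] args) {
        MultiViewCas cas = demo();
        System.out.println(cas.getView("HTML").getDocumentText());
        System.out.println(cas.getView("Text").getDocumentText());
    }
}
```

Because annotation offsets refer to the text of one view, an annotation made on the cleaned text says nothing about positions in the raw HTML.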
Part 2: uimaFIT
uimaFIT
• "add-on" for UIMA that simplifies typical development tasks, for instance:
  • consistency with XML descriptor files
  • component configuration
  • (shared) resource management
  • CAS/JCas access
• Component base classes
• @ConfigurationParameter annotation
• Factories
• http://code.google.com/p/uimafit/
Steps of Implementing a Collection Reader
• Subclass the uimaFIT component JCasCollectionReader_ImplBase
• Methods to be implemented:
  • void getNext(JCas): store the next document in the given output parameter
  • boolean hasNext()
  • Progress[] getProgress(): returns progress information
    • common implementation: new Progress[] { new ProgressImpl(completed, total, Progress.ENTITIES) }
  • void close(): free resources
• Optional:
  • void initialize(UimaContext): may be used for opening files etc.
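The contract between these methods is easiest to see from the caller's side. The following plain-Java sketch uses a hypothetical InMemoryReader with the same method names but simplified signatures (a StringBuilder instead of a JCas, an int pair instead of Progress[]); it is not the UIMA or uimaFIT API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the collection reader contract; not the UIMA or uimaFIT API. */
final class ReaderLoopSketch {

    /** Hypothetical reader over an in-memory list of documents. */
    static final class InMemoryReader {
        private final List<String> documents;
        private int completed = 0;
        private boolean closed = false;

        InMemoryReader(List<String> documents) { this.documents = documents; }

        boolean hasNext() { return completed < documents.size(); }

        /** Stores the next document in the given output parameter. */
        void getNext(StringBuilder out) {
            out.setLength(0); // the container is reused, so clear it first
            out.append(documents.get(completed++));
        }

        /** Progress as {completed, total}, counted in documents. */
        int[] getProgress() { return new int[] { completed, documents.size() }; }

        void close() { closed = true; }

        boolean isClosed() { return closed; }
    }

    /** Drives the reader the way a pipeline runner would. */
    static List<String> readAll(InMemoryReader reader) {
        List<String> log = new ArrayList<>();
        StringBuilder container = new StringBuilder();
        try {
            while (reader.hasNext()) {
                reader.getNext(container);
                int[] progress = reader.getProgress();
                log.add(progress[0] + "/" + progress[1] + " " + container);
            }
        } finally {
            reader.close(); // free resources even if a document fails
        }
        return log;
    }

    public static void main(String[] args) {
        InMemoryReader reader = new InMemoryReader(Arrays.asList("first text", "second text"));
        for (String line : readAll(reader)) {
            System.out.println(line);
        }
    }
}
```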
Steps of Implementing an Annotator
• Subclass the uimaFIT component JCasAnnotator_ImplBase
• void process(JCas) performs the actual analysis
• Optional:
  • void initialize(UimaContext): may be used for opening files etc.
  • always call super.initialize(context);
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.jcas.JCas;
    import org.uimafit.component.JCasAnnotator_ImplBase;

    public class NameAnnotator extends JCasAnnotator_ImplBase {
        @Override
        public void process(JCas aJCas) throws AnalysisEngineProcessException {
            // the analysis logic goes here
        }
    }
Steps of Implementing a CAS Consumer I
• uimaFIT does not distinguish between CAS Consumer and Analysis Engine (as UIMA does):
  • Both are initialized almost identically in uimaFIT
  • See the implementations of JCasConsumer_ImplBase and JCasAnnotator_ImplBase
• The only difference between the initialization of a CAS Consumer and an Analysis Engine in uimaFIT is the ability to run multi-threaded
  • multi-threading is allowed for Analysis Engines by default, but it is not allowed for CAS Consumers
Steps of Implementing a CAS Consumer II
• Subclass the uimaFIT component JCasConsumer_ImplBase
• void process(JCas) extracts data from the CAS
• Optional:
  • void initialize(UimaContext): may be used for opening files etc.
  • always call super.initialize(context);
  • void collectionProcessComplete(): is called when all CASes have been processed
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.jcas.JCas;
    import org.uimafit.component.JCasConsumer_ImplBase;

    public class AnnotationFrequencyConsumer extends JCasConsumer_ImplBase {
        @Override
        public void process(JCas aJCas) throws AnalysisEngineProcessException {
            // the extraction logic goes here
        }
    }
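The division of labour between process and collectionProcessComplete can be sketched without UIMA. In the plain-Java model below, AnnotationFrequencyCounter is a hypothetical class whose process method receives a list of annotation type names instead of a JCas: per-document work happens in process, and the totals are only reported once everything has been seen.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of a consumer that accumulates over all documents; not the uimaFIT API. */
final class FrequencyConsumerSketch {

    static final class AnnotationFrequencyCounter {
        private final Map<String, Integer> totals = new TreeMap<>();
        private int documents = 0;

        /** Called once per document with the type names of its annotations. */
        void process(List<String> annotationTypes) {
            documents++;
            for (String type : annotationTypes) {
                totals.merge(type, 1, Integer::sum);
            }
        }

        /** Called once after the last document; the place to report totals. */
        String collectionProcessComplete() {
            return documents + " documents, counts by type: " + totals;
        }
    }

    /** Feeds two small documents through the counter and returns the final report. */
    static String demo() {
        AnnotationFrequencyCounter counter = new AnnotationFrequencyCounter();
        counter.process(Arrays.asList("Token", "Token", "Sentence"));
        counter.process(Arrays.asList("Token"));
        return counter.collectionProcessComplete();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```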
Create and Configure Your Component – @ConfigurationParameter
• uimaFIT provides a powerful annotation-based configuration mechanism
  • declare the property as a field (any primitive type + classes with a String-only constructor, Locale, Pattern, …)
  • add the annotation @ConfigurationParameter
• Attributes (excerpt):
  • name: referred to when configuring the component
  • mandatory: fail if missing/null
  • defaultValue: string
    public static final String PARAM_DICTIONARY_FILE = "dictionaryFile";
    @ConfigurationParameter(name = PARAM_DICTIONARY_FILE, mandatory = true)
    private File dictionaryFile;
Create and Configure Your Component: @ConfigurationParameter – Best Practices
• Best practice: use the field name as the value of the string constant
• Example:
    public static final String PARAM_DICTIONARY_FILE = "dictionaryFile";
    @ConfigurationParameter(name = PARAM_DICTIONARY_FILE, mandatory = true)
    private File dictionaryFile;
Create and Configure Your Component: @ConfigurationParameter – Best Practices
• Best practice: attributes mandatory and defaultValue
  • If possible, set a default value, even if mandatory = true
  • Why: components are not able to handle a null value
  • In the example, it is not possible to set a default value, as a meaningful value can only be set by a user
    public static final String PARAM_DICTIONARY_FILE = "dictionaryFile";
    @ConfigurationParameter(name = PARAM_DICTIONARY_FILE, mandatory = true)
    private File dictionaryFile;
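To make the mechanism less magical, here is a minimal sketch of how annotation-based configuration can be implemented with reflection in plain Java. The @Param annotation, the configure method and the minLength parameter are hypothetical and exist only in this sketch; this is not how uimaFIT is implemented internally, it only imitates the name, mandatory and defaultValue attributes described above.

```java
import java.io.File;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of annotation-driven configuration; not uimaFIT's implementation. */
final class ConfigInjectionSketch {

    /** Hypothetical stand-in for @ConfigurationParameter. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Param {
        String name();
        boolean mandatory() default false;
        String defaultValue() default "";
    }

    /** Example component: the constants carry the field names, as recommended above. */
    static final class NameAnnotatorConfig {
        static final String PARAM_DICTIONARY_FILE = "dictionaryFile";
        @Param(name = PARAM_DICTIONARY_FILE, mandatory = true)
        private File dictionaryFile;

        static final String PARAM_MIN_LENGTH = "minLength"; // hypothetical second parameter
        @Param(name = PARAM_MIN_LENGTH, defaultValue = "2")
        private int minLength;

        File getDictionaryFile() { return dictionaryFile; }

        int getMinLength() { return minLength; }
    }

    /** Copies settings into the annotated fields; fails if a mandatory value is missing. */
    static void configure(Object component, Map<String, String> settings) {
        for (Field field : component.getClass().getDeclaredFields()) {
            Param param = field.getAnnotation(Param.class);
            if (param == null) {
                continue; // not a configuration field
            }
            String raw = settings.containsKey(param.name())
                    ? settings.get(param.name()) : param.defaultValue();
            if (raw == null || raw.isEmpty()) {
                if (param.mandatory()) {
                    throw new IllegalArgumentException("Missing mandatory parameter: " + param.name());
                }
                continue;
            }
            try {
                field.setAccessible(true);
                field.set(component, convert(raw, field.getType()));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot set field " + field.getName(), e);
            }
        }
    }

    /** Converts the string setting to the field type; only a few types are supported here. */
    static Object convert(String raw, Class<?> type) {
        if (type == String.class) return raw;
        if (type == int.class || type == Integer.class) return Integer.valueOf(raw);
        if (type == boolean.class || type == Boolean.class) return Boolean.valueOf(raw);
        if (type == File.class) return new File(raw);
        throw new IllegalArgumentException("Unsupported field type: " + type);
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put(NameAnnotatorConfig.PARAM_DICTIONARY_FILE, "names.txt");
        NameAnnotatorConfig config = new NameAnnotatorConfig();
        configure(config, settings);
        System.out.println(config.getDictionaryFile() + ", minLength=" + config.getMinLength());
    }
}
```

The sketch also shows why a default value is worth setting where one exists: minLength is usable even when the caller passes nothing, whereas the dictionary file has no sensible default and therefore has to be mandatory.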
Create and Configure Your Component II
• uimaFIT instantiates the components for you!
  • AnalysisEngineFactory.createPrimitiveDescription for analysis engines and CAS consumers
  • CollectionReaderFactory.createDescription for collection readers
    CollectionReaderDescription reader = createDescription(
        TextReader.class,
        TextReader.PARAM_PATH, "src/test/resources/txt",
        TextReader.PARAM_PATTERNS, new String[] {"[+]*.txt"},
        TextReader.PARAM_LANGUAGE, "de");
    AnalysisEngineDescription segmenter = createPrimitiveDescription(BreakIteratorSegmenter.class);
    AnalysisEngineDescription consumer = createPrimitiveDescription(FrequencyConsumer.class);
    SimplePipeline.runPipeline(reader, segmenter, consumer);
Using and Exploring Annotations
• The utility class org.uimafit.util.JCasUtil provides convenient access to the annotations.
• selectCovered is the preferred way to retrieve annotations from the CAS that lie within another annotation (e.g. the tokens of a sentence)
    int i = 0;
    for (Sentence sentence : JCasUtil.select(jCas, Sentence.class)) {
        System.out.println("Tokens of sentence " + (i++) + ":");
        for (Token token : JCasUtil.selectCovered(jCas, Token.class, sentence)) {
            System.out.println(token.getCoveredText());
        }
    }
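What "covered" means can be spelled out in a few lines of plain Java. The sketch below assumes that an annotation is covered when it lies completely inside the covering span; Span and this selectCovered are hypothetical stand-ins written for illustration, not the JCasUtil implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the "covered by" relation between annotations; not JCasUtil. */
final class SelectCoveredSketch {

    /** A span of the document text, with an exclusive end offset. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        String coveredText(String document) { return document.substring(begin, end); }
    }

    /** Returns the candidates that lie completely inside the covering span. */
    static List<Span> selectCovered(List<Span> candidates, Span cover) {
        List<Span> result = new ArrayList<>();
        for (Span c : candidates) {
            if (c != cover && c.begin >= cover.begin && c.end <= cover.end) {
                result.add(c);
            }
        }
        return result;
    }

    /** Groups the token texts by the sentence that covers them. */
    static List<List<String>> tokensPerSentence(String document, List<Span> sentences, List<Span> tokens) {
        List<List<String>> result = new ArrayList<>();
        for (Span sentence : sentences) {
            List<String> texts = new ArrayList<>();
            for (Span token : selectCovered(tokens, sentence)) {
                texts.add(token.coveredText(document));
            }
            result.add(texts);
        }
        return result;
    }

    /** Two sentences with three tokens each. */
    static List<List<String>> demo() {
        String document = "Hi Tom. Bye Tim.";
        List<Span> sentences = Arrays.asList(new Span(0, 7), new Span(8, 16));
        List<Span> tokens = Arrays.asList(
                new Span(0, 2), new Span(3, 6), new Span(6, 7),
                new Span(8, 11), new Span(12, 15), new Span(15, 16));
        return tokensPerSentence(document, sentences, tokens);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```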
Manually Creating JCas Instances
    JCas jCas = JCasFactory.createJCas();
    jCas.setDocumentText("some text");
    jCas.setDocumentLanguage("en"); // IMPORTANT!
    AnalysisEngineDescription tokenizer = createPrimitiveDescription(MyTokenizer.class);
    runPipeline(jCas, tokenizer);
    for (Token token : JCasUtil.select(jCas, Token.class)) {
        System.out.println(token.getCoveredText());
    }
uimaFIT Best Practices – descriptions
• If available, use the factory methods of uimaFIT to create component descriptions, e.g., CollectionReaderFactory.createDescription
• Why: a reader created this way can be used multiple times in different pipelines (see the example pipeline in de.tudarmstadt.kdsl.teaching.dkprocore.intro)
uimaFIT Best Practices – CollectionReader
• Always set the parameter PARAM_LANGUAGE; this is required by many Analysis Engines
• Always set the parameters PARAM_PATH and PARAM_PATTERNS in combination
• PARAM_PATTERNS is specified by ANT-style patterns, i.e., you have to set a pattern that specifies the files to be included, see http://ant.apache.org/manual/dirtasks.html#patterns
• Good to know: CollectionReader instances can also process compressed files (zip format)
Type System Auto-Discovery
• uimaFIT needs to know the XML type system descriptor's location at runtime, see http://code.google.com/p/uimafit/wiki/TypeDescriptorDetection
• Either
  • create the file META-INF/org.uimafit/types.txt in a resources folder (e.g. src/main/resources/META-INF/org.uimafit/types.txt)
  • add the path to your XML file in the following manner: classpath*:desc/types/*.xml
  • (uimaFIT will take into account any XML file in desc/types)
• Or
  • add a VM option to the launch configuration: -Dorg.apache.uima.fit.type.import_pattern=classpath*:desc/types/*.xml
  • (this property name belongs to Apache uimaFIT 2.x; with the org.uimafit 1.x packages used in these slides the property is org.uimafit.type.import_pattern)
• For more information see Chapter 7 of the uimaFIT Guide at http://code.google.com/p/uimafit
Can you answer these questions?
• What is UIMA?
• What is uimaFIT?
• What is the benefit of using uimaFIT component descriptions?
• What is the basic structure of UIMA-based projects?
• What is an Annotation?
• How do you create a new annotation type?
• How do you add annotations to a JCas?
• Why do you need to call addToIndexes()?
• When do you need different views of an artifact?
• How do you implement a Collection Reader (Annotator, CAS Consumer)?
Exercises (I)
• Take a look at uimaFIT's JCasUtil
• UIMA Basics
  • Explore the project
    • Pipeline, CR, AE, Consumer
    • Type system descriptor:
      • src/test/resources/desc/types/TypeSystem.xml
      • src/test/resources/META-INF/org.uimafit/types.txt
    • Micro-corpus
      • src/test/resources/txt
  • (optional) explore the structure of the multi-module project
    • pom.xml
    • aggregator's pom.xml
Exercises (II)
• UIMA Exploring
  • Objective: write your own pipeline and analyze the results
  • Get the uimaexploring.exercise project
  • Write your own NameAnnotator which looks up each token in a name list (src/main/resources/dictionaries/names.txt)
    • read in the dictionary in the initialize(UimaContext) method
  • Write a NamePrintConsumer which nicely prints out your name annotations; output how many name annotations you have assigned (for each document and for all documents in total)
    • Hint: collectionProcessComplete() may be helpful
  • Serialize your CASes to XML and use the GUI tools to examine their contents
References
• T. Götz, O. Suhre, 2004: Design and implementation of the UIMA Common Analysis System, IBM Systems Journal, Vol. 43, No. 3, pp. 476-489
• http://uima.apache.org/doc-uima-why.html
• http://uimafit.googlecode.com/svn/tags/uimafit-parent-1.4.0/apidocs/index.html