Machine Learning in GATE
Valentin Tablan
Machine Learning in GATE
• Uses classification: [Attr1, Attr2, Attr3, …, Attrn] → Class
• Classifies annotations. (Documents can be classified as well using a simple trick.)
• Annotations of a particular type are selected as instances.
• Attributes refer to instance annotations.
• Attributes have a position relative to the instance annotation they refer to.
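The idea of attributes positioned relative to an instance can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: instances are modelled as a plain list of POS tags rather than GATE annotations, and `ContextWindow`, `valueAt` and `attributeVector` are hypothetical names, not GATE API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of GATE: each list element stands for one
// instance annotation (a Token) and holds the value of one of its features.
final class ContextWindow {

    // Value found at (instanceIndex + position), or null when that position
    // falls outside the document.
    static String valueAt(List<String> tags, int instanceIndex, int position) {
        int i = instanceIndex + position;
        return (i >= 0 && i < tags.size()) ? tags.get(i) : null;
    }

    // Builds one row [Attr1, ..., Attrn] for the instance at instanceIndex,
    // one attribute per requested relative position.
    static List<String> attributeVector(List<String> tags, int instanceIndex, int[] positions) {
        List<String> row = new ArrayList<>();
        for (int p : positions) {
            row.add(valueAt(tags, instanceIndex, p));
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("DT", "JJ", "NN", "VBZ");
        // Attributes at positions -2, -1 and +1 around the third token.
        System.out.println(attributeVector(tags, 2, new int[] {-2, -1, 1}));
    }
}
```

Position 0 refers to the instance itself, negative positions to preceding instances and positive positions to following ones. How a real engine should treat positions that fall outside the document is left open here; the sketch simply returns null.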
Attributes
Attributes can be:
• Boolean: the presence (or absence) of an annotation of a particular type that overlaps, at least partially, the referred instance annotation.
• Nominal: the value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a priori.
• Numeric: the numeric value (converted from String) of a particular feature of the referred instance annotation.
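As a rough illustration of the three attribute kinds, the sketch below computes one value of each. It assumes annotations are reduced to start/end offsets and feature values to strings; `AttributeValues` and its methods are hypothetical names, and the handling of undeclared nominal values (returning -1) is an assumption rather than documented GATE behaviour.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of GATE.
final class AttributeValues {

    // Boolean attribute: true when span [bStart, bEnd) overlaps the instance
    // span [aStart, aEnd), even if only partially.
    static boolean overlaps(long aStart, long aEnd, long bStart, long bEnd) {
        return aStart < bEnd && bStart < aEnd;
    }

    // Nominal attribute: index of the feature value in the list of values
    // declared up front, or -1 when the value was not declared.
    static int nominalIndex(List<String> declaredValues, String featureValue) {
        return declaredValues.indexOf(featureValue);
    }

    // Numeric attribute: the feature value is stored as a String and parsed.
    static double numericValue(String featureValue) {
        return Double.parseDouble(featureValue.trim());
    }

    public static void main(String[] args) {
        List<String> declared = Arrays.asList("NN", "NNP", "NNPS");
        System.out.println(overlaps(0, 5, 3, 8));          // partial overlap
        System.out.println(nominalIndex(declared, "NNP")); // declared value
        System.out.println(numericValue("42"));            // String to number
    }
}
```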
Implementation
Machine Learning PR in GATE. It has two operating modes:
• training
• application
It uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> … </DATASET>
  <ENGINE> … </ENGINE>
</ML-CONFIG>

<DATASET>
<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    <NAME>POS_category(0)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
    <VALUES>
      <VALUE>NN</VALUE>
      <VALUE>NNP</VALUE>
      <VALUE>NNPS</VALUE>
      …
    </VALUES>
    [<CLASS/>]
  </ATTRIBUTE>
  …
</DATASET>

<ENGINE>
<ENGINE>
  <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
  <OPTIONS>
    <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
    <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
    <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
  </OPTIONS>
</ENGINE>
Attributes Position
[Figure: attribute positions. Instance type: Token]
Machine Learning PR
• Can save a learnt model to an external file for later use. This saves both the actual model and the collected dataset.
• Can export the collected dataset in .arff format.
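For readers unfamiliar with .arff, the sketch below writes a small dataset in the general ARFF layout used by WEKA: a relation name, one `@attribute` line per attribute, then the data rows. It is not the exporter used by the ML PR; `ArffSketch` and `toArff` are hypothetical names, all attributes are assumed to be nominal, and attribute names are quoted so that names such as POS_category(0) stay unambiguous.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of GATE or WEKA.
final class ArffSketch {

    // names.get(i) is the i-th attribute name and values.get(i) its declared
    // nominal values; each row holds one value per attribute.
    static String toArff(String relation, List<String> names,
                         List<List<String>> values, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (int i = 0; i < names.size(); i++) {
            sb.append("@attribute '").append(names.get(i)).append("' {")
              .append(String.join(",", values.get(i))).append("}\n");
        }
        sb.append("\n@data\n");
        for (List<String> row : rows) {
            sb.append(String.join(",", row)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> pos = Arrays.asList("DT", "NN", "VBZ");
        System.out.print(toArff(
            "pos_context",
            Arrays.asList("POS_category(-1)", "POS_category(0)"),
            Arrays.asList(pos, pos),
            Arrays.asList(Arrays.asList("DT", "NN"), Arrays.asList("NN", "VBZ"))));
    }
}
```

A file in this shape can be opened in the WEKA interface to experiment with attribute sets and algorithms.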
Standard Use Scenario
Application
• Prepare data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
• [ Load the previously saved model. ]
• Run the ML PR in application mode.
• [ Save the learnt model. ]
Training
• Prepare training data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
• Run the ML PR in training mode.
• Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.
• Update the configuration file accordingly.
• Run the ML PR again to collect the actual data.
• [ Save the learnt model. ]
An Example
Learn POS category from POS context.
Using Other ML Libraries
The MLEngine Interface
Method Summary
• void addTrainingInstance(List attributes): Adds a new training instance to the dataset.
• Object classifyInstance(List attributes): Classifies a new instance.
• void init(): This method will be called after an engine is created and has its dataset and options set.
• void setDatasetDefinition(DatasetDefintion definition): Sets the definition for the dataset used.
• void setOptions(org.jdom.Element options): Sets the options from an XML JDom element.
• void setOwnerPR(ProcessingResource pr): Registers the PR using the engine with the engine.
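To give a feel for what wrapping another learner involves, here is a minimal sketch of an engine that always predicts the most frequent training class. To stay self-contained it does not implement the real MLEngine interface: `SketchEngine` is a local stand-in that mirrors only init, addTrainingInstance and classifyInstance, and the assumption that the class value sits at a fixed index of the attribute list is made for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local stand-in for three of the MLEngine methods; NOT the GATE interface.
interface SketchEngine {
    void init();
    void addTrainingInstance(List<Object> attributes);
    Object classifyInstance(List<Object> attributes);
}

// Baseline learner: remembers how often each class was seen in training
// and always answers with the most frequent one.
final class MajorityClassEngine implements SketchEngine {
    private final int classIndex; // assumed position of the class attribute
    private final Map<Object, Integer> counts = new HashMap<>();

    MajorityClassEngine(int classIndex) {
        this.classIndex = classIndex;
    }

    @Override
    public void init() {
        counts.clear();
    }

    @Override
    public void addTrainingInstance(List<Object> attributes) {
        counts.merge(attributes.get(classIndex), 1, Integer::sum);
    }

    @Override
    public Object classifyInstance(List<Object> attributes) {
        Object best = null;
        int bestCount = 0;
        for (Map.Entry<Object, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best; // null when nothing has been learnt yet
    }

    public static void main(String[] args) {
        SketchEngine engine = new MajorityClassEngine(1);
        engine.init();
        engine.addTrainingInstance(Arrays.<Object>asList("DT", "NN"));
        engine.addTrainingInstance(Arrays.<Object>asList("JJ", "NN"));
        engine.addTrainingInstance(Arrays.<Object>asList("NN", "VBZ"));
        System.out.println(engine.classifyInstance(Arrays.<Object>asList("DT", null)));
    }
}
```

A real wrapper would additionally accept the dataset definition, the options element and the owner PR through the corresponding setters listed above.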