
Improving Morphosyntactic Tagging of Slovene by Tagger Combination



  1. Improving Morphosyntactic Tagging of Slovene by Tagger Combination. Jan Rupnik, Miha Grčar, Tomaž Erjavec (Jožef Stefan Institute)

  2. Outline • Introduction • Motivation • Tagger combination • Experiments

  3. POS tagging. Part-of-speech (POS) tagging: assigning morphosyntactic categories to words. Example (one tag per word):
N V V S A N C A A N
Veža je smrdela po kuhanem zelju in starih, cunjastih predpražnikih.
('The hallway smelt of boiled cabbage and old rag mats.')

  4. Slovenian POS
• multilingual MULTEXT-East specification
• almost 2,000 tags (morphosyntactic descriptions, MSDs) for Slovene
• tags: positionally coded attributes
• example: the MSD Agufpa decodes as Category = Adjective, Type = general, Degree = undefined, Gender = feminine, Number = plural, Case = accusative (decoded programmatically in the sketch below)
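To make the positional coding concrete, here is a minimal Python sketch that decodes an adjective MSD such as Agufpa. The attribute order and value tables are abbreviated assumptions for illustration only; the real MULTEXT-East specification defines many more categories and attribute values.

```python
# Minimal sketch: decoding a positionally coded adjective MSD.
# The tables below are abbreviated for illustration, not the full spec.
ADJ_ATTRS = ["Type", "Degree", "Gender", "Number", "Case"]
ADJ_VALUES = {
    "Type":   {"g": "general"},
    "Degree": {"u": "undefined"},
    "Gender": {"f": "feminine", "m": "masculine", "n": "neuter"},
    "Number": {"s": "singular", "p": "plural", "d": "dual"},
    "Case":   {"n": "nominative", "g": "genitive", "a": "accusative"},
}

def decode_msd(msd):
    """Position 0 is the category; the remaining positions are attribute codes."""
    if msd[0] != "A":
        raise ValueError("this sketch only covers adjectives")
    attrs = {"Category": "Adjective"}
    for attr, code in zip(ADJ_ATTRS, msd[1:]):
        attrs[attr] = ADJ_VALUES[attr].get(code, code)
    return attrs

print(decode_msd("Agufpa"))
# {'Category': 'Adjective', 'Type': 'general', 'Degree': 'undefined',
#  'Gender': 'feminine', 'Number': 'plural', 'Case': 'accusative'}
```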

  5. State of the art: two taggers
• Amebis d.o.o. proprietary tagger: based on handcrafted rules
• TnT tagger: based on statistical modelling of sentences and their POS tags; a Hidden Markov Model trigram tagger trained on a large corpus of annotated sentences

  6. Statistics: motivation. The two taggers produce different tagging outcomes on the JOS corpus of 100k words (pie chart):
• green: both taggers correct
• yellow: both predicted the same, incorrect tag
• blue: both predicted incorrect but different tags
• cyan: Amebis correct, TnT incorrect
• purple: TnT correct, Amebis incorrect
(the outcome classes can be computed as in the sketch below)
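These five outcome classes follow directly from parallel lists of gold and predicted tags; a sketch with hypothetical variable names:

```python
# Sketch: classify each token into one of the five outcome classes.
from collections import Counter

def outcome(gold, tnt, amebis):
    if tnt == gold and amebis == gold:
        return "both correct"                 # green
    if tnt == amebis:
        return "same incorrect tag"           # yellow
    if tnt != gold and amebis != gold:
        return "both incorrect, different"    # blue
    return "Amebis correct" if amebis == gold else "TnT correct"  # cyan / purple

def proportions(gold_tags, tnt_tags, amebis_tags):
    counts = Counter(outcome(g, t, a)
                     for g, t, a in zip(gold_tags, tnt_tags, amebis_tags))
    n = sum(counts.values())
    return {cls: cnt / n for cls, cnt in counts.items()}
```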

  7. Example (table comparing the true tag with the TnT and Amebis predictions on individual words)

  8. Combining the taggers (diagram): the text, e.g. 'Veža je smrdela po kuhanem zelju in starih, cunjastih predpražnikih.', flows into both taggers in parallel; TnT outputs TagTnT and Amebis outputs TagAmb, and a meta-tagger combines the two into the final tag.

  9. Combining the taggers: the meta-tagger is a binary classifier; the two classes are TnT and Amebis. Diagram: for '… prepričati italijanske pravosodne oblasti …' ('… to convince the Italian judicial authorities …'), TnT predicts Agufpn and Amebis predicts Agufpa for the focus word; a feature vector built from both predictions is fed to the meta-tagger, which here outputs Agufpa. (A sketch of the combination step follows.)
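A minimal sketch of the combination step, assuming a trained binary classifier `meta` with an sklearn-style `predict` and a `features` dict built as described on the next slide (both names are illustrative assumptions):

```python
# Sketch: resolve one token's tag with the meta-tagger.
def combine(tnt_tag, amebis_tag, features, meta):
    if tnt_tag == amebis_tag:
        return tnt_tag                    # taggers agree: nothing to resolve
    winner = meta.predict([features])[0]  # binary choice: "TnT" or "Amebis"
    return tnt_tag if winner == "TnT" else amebis_tag
```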

  10. Feature vector construction (for 'pravosodne' in '… prepričati italijanske pravosodne oblasti …'; TnT predicts Agufpn, Amebis predicts Agufpa)
• Agreement features: POS_A=T=yes, Type_A=T=yes, …, Number_A=T=yes, Case_A=T=no, Animacy_A=T=yes, …, Owner_Gender_A=T=yes
• TnT features: POS_T=Adjective, Type_T=general, Gender_T=feminine, Number_T=plural, Case_T=nominative, Animacy_T=n/a, Aspect_T=n/a, Form_T=n/a, Person_T=n/a, Negative_T=n/a, Degree_T=undefined, Definiteness_T=n/a, Participle_T=n/a, Owner_Number_T=n/a, Owner_Gender_T=n/a
• Amebis features: POS_A=Adjective, Type_A=general, Gender_A=feminine, Number_A=plural, Case_A=accusative, Animacy_A=n/a, Aspect_A=n/a, Form_A=n/a, Person_A=n/a, Negative_A=n/a, Degree_A=undefined, Definiteness_A=n/a, Participle_A=n/a, Owner_Number_A=n/a, Owner_Gender_A=n/a
• Since Agufpa is the correct tag, the training label is Amebis.
(the construction is sketched after this list)
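A sketch of the per-token feature construction, assuming decoded attribute dicts for both predictions (e.g. from the `decode_msd` sketch above); the real feature set covers all the JOS attributes shown on the slide:

```python
# Sketch: build TnT, Amebis and agreement features for one token.
def make_features(tnt_attrs, amebis_attrs):
    feats = {}
    for attr in sorted(set(tnt_attrs) | set(amebis_attrs)):
        t = tnt_attrs.get(attr, "n/a")
        a = amebis_attrs.get(attr, "n/a")
        feats[attr + "_T"] = t                            # TnT feature
        feats[attr + "_A"] = a                            # Amebis feature
        feats[attr + "_A=T"] = "yes" if t == a else "no"  # agreement feature
    return feats

feats = make_features(decode_msd("Agufpn"), decode_msd("Agufpa"))
# feats["Case_T"] == "nominative", feats["Case_A"] == "accusative",
# feats["Case_A=T"] == "no"
```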

  11. Context (for '… italijanske pravosodne oblasti …')
(a) No context: the vector for 'pravosodne' contains only its own TnT, Amebis and agreement features.
(b) Context: the vector additionally contains the TnT and Amebis features of the neighbouring words 'italijanske' and 'oblasti' (sketched below).
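A sketch of the context variant, appending the neighbours' TnT and Amebis features (but not their agreement features) to the focus token's vector; the window of one token on each side and the naming scheme are simplifying assumptions:

```python
# Sketch: extend the focus token's features with its neighbours'.
def with_context(token_feats, i):
    feats = dict(token_feats[i])               # focus token: all features
    for offset, side in ((-1, "prev"), (1, "next")):
        j = i + offset
        if 0 <= j < len(token_feats):
            for name, value in token_feats[j].items():
                if not name.endswith("_A=T"):  # neighbours: no agreement feats
                    feats[side + "_" + name] = value
    return feats
```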

  12. Classifiers
• Naive Bayes: probabilistic classifier; assumes strong independence of features; a black-box classifier
• CN2 rules: if-then rule induction; covering algorithm; both the model and its decisions are interpretable
• C4.5 decision tree: based on information entropy; splitting algorithm; both the model and its decisions are interpretable
(a rough scikit-learn approximation is sketched below)
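The slide does not name implementations; as a rough sketch, comparable classifiers can be set up in scikit-learn. Note that CN2 has no scikit-learn counterpart (it exists in Orange), and a decision tree with the entropy criterion only approximates C4.5:

```python
# Sketch: comparable classifiers in scikit-learn (an approximation).
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

naive_bayes = make_pipeline(DictVectorizer(), BernoulliNB())
c45_like = make_pipeline(DictVectorizer(),
                         DecisionTreeClassifier(criterion="entropy"))

# X: a list of feature dicts as built above; y: "TnT"/"Amebis" labels
# naive_bayes.fit(X, y); c45_like.fit(X, y)
```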

  13. Experiments: dataset
• JOS corpus: approximately 250 texts (100k words; 120k tokens including punctuation), sampled from the larger FidaPLUS corpus
• TnT predictions for the meta-tagger experiments were produced with 10-fold cross-validation: each time TnT was trained on 9 folds and used to tag the remaining fold, so every prediction comes from a model that never saw that sentence (see the sketch below)
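A sketch of this cross-validation protocol, with hypothetical `train_tnt` and `tag_with` helpers standing in for calls to the external TnT tool:

```python
# Sketch: 10-fold CV so every sentence is tagged by a TnT model
# that never saw it during training.
from sklearn.model_selection import KFold

def tnt_cv_predictions(sentences, train_tnt, tag_with, n_splits=10):
    predictions = [None] * len(sentences)
    for train_idx, test_idx in KFold(n_splits=n_splits).split(sentences):
        model = train_tnt([sentences[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = tag_with(model, sentences[i])
    return predictions
```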

  14. Experiments
• Baseline 1: majority classifier (always predict what TnT predicts); accuracy: 53%
• Baseline 2: Naive Bayes with one feature only (the full Amebis MSD); accuracy: 71%

  15. Baseline 2
• With a single feature (the full Amebis MSD), the Naive Bayes classifier reduces to counting the occurrences of two events for every MSD f:
• n_c(f): the number of cases where Amebis predicted tag f and was correct
• n_w(f): the number of cases where Amebis predicted tag f and was incorrect
• Given a pair of predictions MSD_A (Amebis) and MSD_T (TnT), NB yields the rule: if n_c(MSD_A) < n_w(MSD_A), predict MSD_T; otherwise predict MSD_A (sketched below).
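This counting rule is small enough to write out in full; a sketch, assuming hypothetical parallel lists of gold tags and Amebis predictions for fitting:

```python
# Sketch: Baseline 2 reduces Naive Bayes to two counters per MSD.
from collections import Counter

def fit_counts(gold_tags, amebis_tags):
    n_c, n_w = Counter(), Counter()    # correct / wrong counts per tag
    for g, a in zip(gold_tags, amebis_tags):
        (n_c if a == g else n_w)[a] += 1
    return n_c, n_w

def baseline2(msd_t, msd_a, n_c, n_w):
    # Trust Amebis unless its prediction msd_a was wrong more often than right.
    return msd_t if n_c[msd_a] < n_w[msd_a] else msd_a
```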

  16. Experiments: different classifiers and feature sets
• Classifiers: NB, CN2, C4.5
• Feature sets: full MSD; decomposed MSD plus agreement features; basic features (a subset of the decomposed-MSD set: Category, Type, Number, Gender, Case); the union of all features considered (full MSD plus decompositions)
• Scenarios: no context; context ignoring punctuation; context with punctuation

  17. Results (three charts: no context, context without punctuation, context with punctuation)
• context helps
• punctuation slightly improves classification
• C4.5 with basic features works best

  18. Overall error rate

  19. Thank you!
