
totale Multilingual Tokenisation, Tagging and Lemmatisation

totale Multilingual Tokenisation, Tagging and Lemmatisation. Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia. JRC Workshop, 26-27 September 2005. Overview of the talk: Introduction; The totale pipeline; Training totale; Annotating JRC-ACQUIS-sl.


Presentation Transcript


  1. totale Multilingual Tokenisation, Tagging and Lemmatisation
     Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
     JRC Workshop, 26-27 September 2005

  2. Overview of the talk
     • Introduction
     • The totale pipeline
     • Training totale
     • Annotating JRC-ACQUIS-sl
     • Conclusions
     Tomaž Erjavec: Multilingual Tokenisation, Tagging & Lemmatisation

  3. Introduction
     • Hypothesis: to exploit JRC-ACQUIS efficiently, its texts need to be linguistically pre-processed
     • Pre-processing normalizes (reduces) the data and gives other tools more features to work with

  4. Example
     2. (a) Where an exporter has declared goods packaged using automatic systems for bagging, canning, bottling, etc.,

     TOKEN      TYPE      LEMMA      MSD
     --------------------------------------
     2.         TOK_ENUM  2.         Rmp
     (a)        TOK_ENUM  (a)        Rmp
     Where      TOK       where      Cs
     an         TOK       a          Di
     exporter   TOK       exporter   Ncns
     has        TOK       have       Vaip3s
     declared   TOK       declare    Vmps
     goods      TOK       good       Ncnp
     packaged   TOK       package    Vmis
     using      TOK       use        Vmpp
     automatic  TOK       automatic  Afp
     systems    TOK       system     Ncnp
     for        TOK       for        Sp
     bagging    TOK       bag        Vmpp
     ,          PUN
     canning    TOK       can        Vmpp
     ,          PUN
     bottling   TOK       bottle     Vmpp
     ,          PUN
     etc.       TOK_ABBR  etc.       Rmp

     MSD and LEMMA are context dependent.
     MSD is useful for any syntactically oriented further processing (PoS filtering).
     LEMMA is useful for reducing the lexical space (easier searches).
     The task is much harder for inflectionally rich (or agglutinative) languages than for English or most ‘old’ EU languages!
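
The two uses named on this slide, PoS filtering by MSD and reducing the lexical space with lemmas, can be sketched in a few lines of Perl. This is only an illustration: it assumes whitespace-separated rows in the TOKEN TYPE LEMMA MSD layout shown above, and the helper names `parse_row` and `lemmas_with_msd_prefix` are hypothetical, not part of totale.

```perl
use strict;
use warnings;

# Parse one tabular row into a record.
# Assumed layout: whitespace-separated TOKEN TYPE [LEMMA MSD];
# punctuation rows carry only TOKEN and TYPE.
sub parse_row {
    my ($line) = @_;
    my ($token, $type, $lemma, $msd) = split ' ', $line;
    return { token => $token, type => $type, lemma => $lemma, msd => $msd };
}

# Return the lemmas of all tokens whose MSD starts with the given prefix
# (e.g. "Nc" for common nouns): PoS filtering plus lemma normalisation.
sub lemmas_with_msd_prefix {
    my ($prefix, @lines) = @_;
    my @lemmas;
    for my $line (@lines) {
        my $row = parse_row($line);
        next unless defined $row->{msd};    # skip punctuation rows
        push @lemmas, $row->{lemma} if index($row->{msd}, $prefix) == 0;
    }
    return @lemmas;
}

# Rows copied from the example above.
my @rows = (
    'exporter TOK exporter Ncns',
    'has TOK have Vaip3s',
    'declared TOK declare Vmps',
    'goods TOK good Ncnp',
    ', PUN',
    'systems TOK system Ncnp',
);
print join(' ', lemmas_with_msd_prefix('Nc', @rows)), "\n";
```

Filtering on an MSD prefix rather than the full tag keeps the sketch usable with large tagsets, where one part of speech spans many MSDs.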

  5. Nagging doubts
     • Normalization loses information
     • Annotation introduces errors and bias
     • Evaluation for IE non-conclusive
     • Unsupervised methods!
     Still…

  6. Wanted
     A tool that would take text in any language and
     • tokenise,
     • PoS tag and
     • lemmatise it.
     It should be simple to install and use, robust, fast, and adaptable to new languages, preferably with a large number of already available models (and work under Linux!)

  7. What is out there
     • Component software: tokenisers, taggers, (stemmers)
     • FS/RE environments: INTEX, CLARK
     • Various LT workbenches, the best known being GATE
     • Alas: Java, time investment, history

  8. Linguistic annotation with totale
     • Multilingual tokenisation, tagging and lemmatisation
     • Perl program with a simple pipeline architecture
     • Input is plain UTF-8 text
     • Output is a list of annotated tokens
     • Several output formats (tabular, XML)
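
A simple pipeline architecture of this kind can be sketched as a list of stages, each taking the previous stage's output. The code below is a toy under stated assumptions, not totale's source: `run_pipeline` and the three stage closures are hypothetical, and the tagging and lemmatisation logic is placeholder only.

```perl
use strict;
use warnings;

# Thread the data through each stage in order.
sub run_pipeline {
    my ($data, @stages) = @_;
    $data = $_->($data) for @stages;
    return $data;
}

# Toy stages standing in for the real tokeniser, tagger and lemmatiser.
my $tokenise = sub {
    my ($text) = @_;
    return [ map { +{ token => $_, type => 'TOK' } } split ' ', $text ];
};
my $tag = sub {
    my ($tokens) = @_;
    return [ map { +{ %$_, msd => 'X' } } @$tokens ];    # dummy tag
};
my $lemmatise = sub {
    my ($tokens) = @_;
    return [ map { +{ %$_, lemma => lc($_->{token}) } } @$tokens ];
};

# Plain text in, list of annotated tokens out (tabular format).
my $annotated = run_pipeline('Doctor can you help', $tokenise, $tag, $lemmatise);
print join("\t", @{$_}{qw(token type lemma msd)}), "\n" for @$annotated;
```

Keeping each stage a function from token list to token list is what makes it cheap to swap in a different tagger or add an output formatter at the end.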

  9. Example use
     $ totale -l en
     Doctor, can you help?
     ^D
     <TEXT>
     Doctor  TOK       doctor  Ncfs
     ,       PUN
     can     TOK       can     Voip
     you     TOK       you     Pp2
     help    TOK       help    Vmn
     ?       PUN_TERM
     <S/>
     </TEXT>

  10. Totale building blocks
      [Diagram labels: Perl; mlToken, TnT, CLOG; “Multilingual resources” (three times)]

  11. Tokenisation in totale
      • Perl module mlToken.pm (Camelia Ignat, JRC)
      • Multilingual, with resource files for supported languages (also default rules)
      • Splits text into tokens, marks token type
      • Marks paragraph and sentence boundaries
      • Modelled on mtSeg

  12. Tagging in totale
      • Annotating words in the text with their context-disambiguated morphosyntactic descriptions (MSDs)
      • Used the tri-gram tagger TnT
      • TnT is trainable and fast, has an unknown-word guessing module, and can accommodate the large morphosyntactic tagsets of various EU languages
      • Uses (and induces from an annotated corpus) a lexicon with ambiguity classes and a tri-gram file
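
The last point, inducing a lexicon with ambiguity classes and tri-gram counts from an annotated corpus, can be illustrated with a toy. This is a sketch of the general idea only: it does not reproduce TnT's file formats or its probability estimation, the function names are hypothetical, lower-casing the word-forms is an assumption, and the two training sentences are made up from tags seen elsewhere in the talk.

```perl
use strict;
use warnings;

# Induce a toy lexicon and tag tri-gram counts from tagged sentences.
# Each sentence is an array ref of [word, tag] pairs.
sub induce_model {
    my (@sentences) = @_;
    my (%lexicon, %trigrams);
    for my $sent (@sentences) {
        my @tags = map { $_->[1] } @$sent;
        # Lexicon: how often each word-form was seen with each tag.
        $lexicon{ lc($_->[0]) }{ $_->[1] }++ for @$sent;
        # Tri-grams: counts of three consecutive tags.
        for my $i (0 .. $#tags - 2) {
            $trigrams{ join ' ', @tags[ $i .. $i + 2 ] }++;
        }
    }
    return (\%lexicon, \%trigrams);
}

# The ambiguity class of a word is the sorted set of tags seen with it.
sub ambiguity_class {
    my ($lexicon, $word) = @_;
    return join ' ', sort keys %{ $lexicon->{ lc($word) } || {} };
}

my ($lexicon, $trigrams) = induce_model(
    [ [ 'can', 'Voip' ], [ 'you', 'Pp2' ], [ 'help', 'Vmn' ] ],
    [ [ 'a', 'Di' ], [ 'help', 'Ncns' ], [ 'has', 'Vaip3s' ] ],
);
print ambiguity_class($lexicon, 'help'), "\n";
print join(', ', sort keys %$trigrams), "\n";
```

A tagger then only has to choose among the tags in a word's ambiguity class, using the tri-gram counts to score the alternatives in context.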

  13. Lemmatisation in totale
      • Used CLOG, which learns first-order decision lists (+ list of exceptions)
      • Learns lemmatisation rules for each MSD
      • CLOG produces Prolog programs, but these are converted into Perl
      Tomaž Erjavec and Sašo Džeroski: Machine Learning of Morphosyntactic Structure: Lemmatising Unknown Slovene Words. Applied Artificial Intelligence 18(1), pp. 17-40, 2004.

  14. Example CLOG rule
      sub SUB_afcfda {
          my $w = $_[0];
          my $lem;
          if    ($w=~/^(.*)svetlejši$/) {$lem=$1."svetel"}
          elsif ($w=~/^(.*)polnejši$/)  {$lem=$1."poln"}
          elsif ($w=~/^(.*)bši$/)       {$lem=$1."b"}
          elsif ($w=~/^(.*)elejši$/)    {$lem=$1."el"}
          elsif ($w=~/^(.*)ivejši$/)    {$lem=$1."iv"}
          elsif ($w=~/^(.*)anejši$/)    {$lem=$1."an"}
          elsif ($w=~/^(.*)kejši$/)     {$lem=$1."ek"}
          elsif ($w=~/^(.*)tejši$/)     {$lem=$1."t"}
          elsif ($w=~/^(.*)ižji$/)      {$lem=$1."izek"}
          elsif ($w=~/^(.*)enejši$/)    {$lem=$1."en"}
          elsif ($w=~/^(.*)rejši$/)     {$lem=$1."er"}
          elsif ($w=~/^(.*)nejši$/)     {$lem=$1."en"}
          else                          {$lem="???"}
          return $lem;
      }
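
The generated rule is an ordered decision list: the first matching ending wins, so specific endings must come before general ones. The same behaviour can be sketched with a data-driven table instead of one generated sub per MSD. This is an alternative layout for illustration, not what CLOG emits; the `%RULES` table and `lemmatise` are hypothetical names, and only four of the slide's rules are copied in.

```perl
use strict;
use warnings;
use utf8;    # the rule table contains UTF-8 literals (š, ž)

# Ordered suffix rewrite rules per MSD: [word-form ending, lemma ending].
# Rule pairs are taken from the slide; the table layout is an assumption.
my %RULES = (
    Afcfda => [
        [ 'svetlejši', 'svetel' ],
        [ 'polnejši',  'poln' ],
        [ 'ižji',      'izek' ],
        [ 'nejši',     'en' ],
    ],
);

# Apply the first matching rule for the given MSD.
# Returns undef when no rule applies (the slide's code returns "???").
sub lemmatise {
    my ($word, $msd) = @_;
    for my $rule (@{ $RULES{$msd} || [] }) {
        my ($from, $to) = @$rule;
        return $1 . $to if $word =~ /^(.*)\Q$from\E$/;
    }
    return;
}

print lemmatise('nižji', 'Afcfda'), "\n";       # ending "ižji" is rewritten to "izek"
print lemmatise('polnejši', 'Afcfda'), "\n";    # matched by the specific rule first
```

Because the rules are keyed by MSD, the lemmatiser depends on the tagger's output: a wrong MSD selects the wrong rule list.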

  15. Training totale with MULTEXT-East resources
      • Learning totale tagging and lemmatisation models
      • MULTEXT-East language resources V3, a standardised multilingual dataset for language engineering R&D
      • Covers mainly Central and Eastern European languages
      • Freely available for research use from http://nl.ijs.si/ME/V3/
      • Used the MSD-tagged “1984” corpus (100k words) for tagger training
      • Used MSD lexica (15k lemmas) for lemmatiser training

  16. Currently supported languages
      • English
      • Slovene
      • Czech
      • Romanian
      • Serbian
      • Estonian
      • Hungarian

  17. Processing JRC’s ACQUIS-sl with totale
      • sl.tar.gz 03-Sep-2005 03:51 34.4M; sl/slcelex_*.xml = 144M, 7772 files
      • Wrapper Perl program: for each file
        • extract text (all <P>s except first)
        • | totale -l sl -f XML |
        • substitute contents of original <P>s with annotated ones
        • validate against DTD
      • 72 hrs on asterix, but with 10 s startup time per file: 7772 × 10 s = 77,720 s ≈ 21.6 hrs
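
The extract-and-substitute step of the wrapper can be sketched as follows. This is not the original wrapper: `annotate_paragraphs` is a hypothetical name, the regex treatment of `<P>` elements is a simplification that assumes they are not nested, and the annotator is passed in as a code ref so that a stub can stand in for the pipe through `totale -l sl -f XML`. DTD validation is left out.

```perl
use strict;
use warnings;

# Replace the contents of every <P> except the first with annotated text.
# $annotate is a code ref standing in for the call to the annotation tool.
sub annotate_paragraphs {
    my ($xml, $annotate) = @_;
    my $seen = 0;
    $xml =~ s{(<P\b[^>]*>)(.*?)(</P>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        # The first <P> (the title) is passed through unchanged.
        $open . ($seen++ ? $annotate->($body) : $body) . $close;
    }gse;
    return $xml;
}

# Stub annotator for demonstration: wraps the text in a marker element.
my $stub = sub { '<ANN>' . $_[0] . '</ANN>' };
print annotate_paragraphs('<DOC><P>Naslov</P><P>prvi odstavek</P></DOC>', $stub), "\n";
```

Since the startup cost is paid once per file (7772 × 10 s by the slide's own figures), a wrapper that started the annotator once and streamed all files through it would presumably save most of those 21.6 hours.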

  18. The problem of titles
      • Dual role of titles: as text and name of document
      • Should they contain P at all?
      • Many titles untranslated – experiment with TextCat:
          4,964  sl
          1,663  en “Ni na razpolago v slovenskem jeziku” (“Not available in the Slovene language”)
          1,074  en
             59  sl or en
             12  en or sl
      • Also cases like “ODLOCBA t. 1346/2001/ES …”
      • So, did not process them.

  19. Quantitative results: elements

  20. Lexical analysis
      Extracted the MULTEXT lexicon from corpus:
      …
       8  rafinacija    rafinacija    Ncfsn
       2  rafinacije    rafinacija    Ncfpa
      40  rafinacije    rafinacija    Ncfsg
       2  rafinacije15  rafinacije15  Mc---d
      26  rafinaciji    rafinacij     Npmpn
       9  rafinaciji    rafinacija    Ncfsl
      17  rafinacijo    rafinacija    Ncfsa
      …
      Number of lexical entries: 381,068
      Different word-forms: 221,876
      Different lemmas: 154,241
      Different MSDs: 970
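
Summary counts like these can be derived from the extracted lexicon lines in one pass. The sketch below assumes the whitespace-separated COUNT WORD-FORM LEMMA MSD layout shown on the slide and treats a lexical entry as a distinct (word-form, lemma, MSD) triple; that definition and the name `lexicon_stats` are assumptions, not taken from the talk.

```perl
use strict;
use warnings;

# Count distinct entries, word-forms, lemmas and MSDs in lexicon lines
# of the form "COUNT WORDFORM LEMMA MSD".
sub lexicon_stats {
    my (@lines) = @_;
    my (%entries, %forms, %lemmas, %msds);
    for my $line (@lines) {
        my ($count, $form, $lemma, $msd) = split ' ', $line;
        next unless defined $msd;    # skip malformed or elided lines
        $entries{"$form\t$lemma\t$msd"} = 1;
        $forms{$form}   = 1;
        $lemmas{$lemma} = 1;
        $msds{$msd}     = 1;
    }
    return {
        entries => scalar(keys %entries),
        forms   => scalar(keys %forms),
        lemmas  => scalar(keys %lemmas),
        msds    => scalar(keys %msds),
    };
}

# The seven lines shown on the slide.
my $stats = lexicon_stats(
    '8 rafinacija rafinacija Ncfsn',
    '2 rafinacije rafinacija Ncfpa',
    '40 rafinacije rafinacija Ncfsg',
    '2 rafinacije15 rafinacije15 Mc---d',
    '26 rafinaciji rafinacij Npmpn',
    '9 rafinaciji rafinacija Ncfsl',
    '17 rafinacijo rafinacija Ncfsa',
);
printf "%s: %d\n", $_, $stats->{$_} for qw(entries forms lemmas msds);
```

Even this seven-line excerpt shows how tagging and tokenisation errors inflate the lemma count: the proper-noun reading yields the spurious lemma "rafinacij", and the fused token "rafinacije15" becomes its own lemma.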

  21. Some problems
      • Complex tokenisation – over 15% “weird” words:
          priloge.opomba     priloge.opomba     Ncfsn
          who/fsf/fos/97.7   who/fsf/fos/97.7   Rgp
          zavarovalnica(-e)  zavarovalnica(-e)  Ncmsi
      • Weak tagging model (likes verbs!):
          3  anion   anion     Ncmsa--n
          4  anion   anion     Ncmsn
          1  anion   anion     Npmsn
          3  anion   anion     Vmp--smp
          6  aniona  anion     Ncmsg
          8  anione  anion     Ncmpa
          1  anioni  anioen    Afpmsny
          1  anioni  anion     Ncmpn
          1  anioni  anioni    Vmp--pmp
          1  anioni  anioniti  Vmip3s--n

  22. Conclusions
      • Presented processing with totale on ACQUIS-sl and a quick evaluation
      • Further work:
        • methodology of semi-manual annotation (model tweaking)
        • “lexical priming” in totale
      • Translations and collocates
