
TAGSET REDUCTION BASED ON THE FEATURES NEEDED FOR THE SLOVENE SKETCH GRAMMAR



Presentation Transcript


  1. TAGSET REDUCTION BASED ON THE FEATURES NEEDED FOR THE SLOVENE SKETCH GRAMMAR
     Simon Krek
     Amebis, d.o.o., Kamnik, Slovenia
     Jožef Stefan Institute, Slovenia

  2. We'll talk about...
     • Slovene morphology
     • Tagsets for Slovene
     • MTE-JOS tagset
     • Slovene corpora and lexical database
     • Sketch grammar for Slovene lexical database
     • MTE-JOS tagset reduction
     • Tagging results with reduced tagsets
     • Tagset reduction: conclusions

  3. Slovene morphology
     • The extreme case of the adjective:
     • inflectional categories
       • case: 6
       • gender: 3
       • number: 3 (dual!!)
       • definiteness: 2
         • only nominative & accusative masculine singular
       • degree: 3
     • forms
       • 56 for positive degree
       • 164 for all three degrees
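
The two form counts on this slide can be reproduced from the category counts. The sketch below is one possible reading, not necessarily how the author derived them: it assumes the definiteness contrast adds exactly two extra forms (nominative and accusative masculine singular) and only in the positive degree. The constant and function names are illustrative.

```python
# Inflectional categories of the Slovene adjective, as listed on the slide.
CASES = 6
GENDERS = 3
NUMBERS = 3   # singular, dual, plural
DEGREES = 3   # positive, comparative, superlative

# Assumption: definiteness adds one extra form each for the nominative and
# accusative masculine singular, and only in the positive degree.
EXTRA_DEFINITE_FORMS = 2


def adjective_form_counts():
    """Return (forms in the positive degree, forms across all degrees)."""
    per_degree = CASES * GENDERS * NUMBERS            # 6 * 3 * 3 = 54
    positive = per_degree + EXTRA_DEFINITE_FORMS      # 54 + 2 = 56
    all_degrees = positive + (DEGREES - 1) * per_degree  # 56 + 2 * 54 = 164
    return positive, all_degrees


print(adjective_form_counts())
```

Under that assumption the result matches the slide: 56 forms for the positive degree and 164 for all three degrees.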

  4. Adjective "disgusting"

  5. 12 combinations for -ih
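
The table behind this slide is not preserved in the transcript. One way to arrive at 12 readings, assuming the adjectival ending -ih marks the genitive and the locative in both the dual and the plural for all three genders, is to enumerate the combinations. The names below are illustrative.

```python
from itertools import product

# Assumption: -ih is the adjectival ending for these cases and numbers,
# identical across all three genders.
CASES = ("genitive", "locative")
NUMBERS = ("dual", "plural")
GENDERS = ("masculine", "feminine", "neuter")


def ih_readings():
    """List every (case, number, gender) reading a form in -ih can have."""
    return list(product(CASES, NUMBERS, GENDERS))


for reading in ih_readings():
    print(reading)
```

A tagger seeing a single form in -ih has to choose among all of these readings, which is the kind of ambiguity a reduced tagset can remove.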

  6. Tagset standardization
     • Eagles (1993-1996)
     • Multext (1994-1996)
     • Multext-East (1995-1997)
       • 1997 - ver. 1 MTE: Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene, English (7 languages)
       • 2010 - ver. 4 Mondilex: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, Ukrainian (16 languages)

  7. From MTE to JOS
     • JOS project (2007-2009)
     • MTE principles: positional attributes etc.
     • From syntax to morphology
     • Reordering of attributes to make tags shorter
     • Non-trivial mapping
     • 12 categories, 15 attributes, 56 values
     • 1,902 tags
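
"Positional attributes" means that each character position in a tag encodes one attribute. A minimal sketch for noun tags only, assuming the attribute order type, gender, number, case and the usual one-letter English codes (so that `Ncmsn` reads as noun, common, masculine, singular, nominative). The table and function names are illustrative and cover only a fragment of the tagset.

```python
# Assumed positional layout for noun tags: N + type + gender + number + case.
NOUN_ATTRIBUTES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_tag(tag):
    """Turn a positional noun tag such as 'Ncmsn' into a feature dict."""
    if not tag.startswith("N"):
        raise ValueError(f"not a noun tag: {tag!r}")
    features = {"category": "noun"}
    # zip() stops at the shorter sequence, so omitted trailing positions
    # are simply left out of the result.
    for (name, values), code in zip(NOUN_ATTRIBUTES, tag[1:]):
        if code != "-":  # '-' marks a position that does not apply
            features[name] = values[code]
    return features


print(decode_noun_tag("Ncmsn"))
```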

  8. Slovene corpora
     • FIDA (2000)
       • 100 million
       • MTE tagset, rule-based tagger
     • FidaPLUS (2006)
       • 620 million
       • MTE tagset (2006) – rule-based tagger (Sketch Engine)
       • JOS tagset (2010), rule-based & statistical + metatagger
     • Gigafida (2011)
       • 1.1 billion
       • JOS tagset (2011) – statistical tagger (Sketch Engine – before summer?)

  9. Slovene lexical database
     • levels: semantics, syntax, collocations, examples
     • elements: semantic indicator, semantic frame, syntactic pattern & structure, syntactic combination, collocation, extended collocation, example, phraseology

  10. I. LEMMA
     • headword: svitati se (to dawn)
     • part-of-speech: verb
     II. SENSE
     • indicator: 1. daniti se (day); 2. dojemati (understand)
     • semantic frame:
       • 1. ko se svita DAN, začne vzhajati sonce
       • 2. če se ČLOVEKU začne svitati o nekem DOGAJANJU, začne dojemati, kar prej ni vedel, ali pa je bilo to pred njim skrito
     [side labels: unary relations & constructions; gramrels]
     III. SYNTAX
     • restriction: only in 3rd pers.
     • structure: gbz Inf-GBZ / rbz GBZ
     • pattern:
       • kaj se svita (sth is dawning)
       • komu se svita o čem (sth is dawning to sb about sth)
     • synt. combin.
     [side label: word sketches]
     IV. COLLOC.
     • collocation:
       • [začeti, pričeti] se svitati
       • [počasi, malo, malce] se svita
     [side label: GDEX]
     V. EXAMPLES
     • example:
       • Preden se začne zjutraj svitati, je najtemnejša noč.
       • Na vzhodu se je že svital dan, ko sta se poslovila.
       • Počasi se mi je začelo svitati, zakaj Jasni oči tako žarijo.
       • Petru se pričenja svitati o nekdanji zvezi med Chadom in Heather.
     • multi-word unit
     VI. PHRASEOLOGY
     • phraseological units

  11. Sketch Grammar for Slovene
     • ver. 15: syntactic patterns for SLB
       • 32 gramrels
         • 18 DUAL
         • 5 TRINARY
         • 1 UNARY
         • 1 SYMMETRIC
         • 7 "regular"
     • ver. 16: in progress
       • new directives
         • *SEPARATEPAGE
         • *CONSTRUCTION
       • together with the switch to Gigafida?
         • before summer 2011
       • work on constructions
       • info from the new dependency parser

  12. Tagset reduction – ver. 15
     • 12 categories: 64 tags
     • verb
       • type: main, auxiliary
       • form: infinitive, supine, participle, present, future, conditional, imperative
       • person: 1st, 2nd, 3rd
       • negation: yes, no
     • noun
       • type: common, proper
       • case: nominative, genitive, dative, accusative, locative, instrumental
     • adjective
       • type: general, participle
       • case: nominative, genitive, dative, accusative, locative, instrumental
     • pronoun
       • type: personal, possessive, demonstrative, relative, reflexive, general, interrogative, indefinite, negative
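
Read as a procedure, the ver. 15 reduction keeps only the attributes listed above for each category and drops the rest. A minimal sketch of that projection, assuming tags have already been decoded into feature dictionaries. `KEPT_V15` and `reduce_tag` are hypothetical names, and categories not listed on the slide are assumed to reduce to the bare category label.

```python
# Attributes retained per category, copied from the ver. 15 slide.
# Only four of the 12 categories are spelled out in the transcript.
KEPT_V15 = {
    "verb": ("type", "form", "person", "negation"),
    "noun": ("type", "case"),
    "adjective": ("type", "case"),
    "pronoun": ("type",),
}


def reduce_tag(category, features, kept=KEPT_V15):
    """Project a full feature dict onto the attributes kept for its category."""
    names = kept.get(category, ())  # assumption: other categories keep nothing
    values = [features[name] for name in names if name in features]
    return (category, *values)


full = {"type": "common", "gender": "masculine",
        "number": "singular", "case": "nominative"}
print(reduce_tag("noun", full))
```

Gender and number are dropped for the noun here, so forms that differ only in those attributes share one reduced tag.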

  13. Tagset reduction – ver. 16
     • 10 categories: 154 tags
     • number (3/2): noun, verb, adjective
     • degree (3): adjective, adverb
     • type: numeral (3), adverb (3), conjunction (2)
     • case (6): preposition
     • number: adjective = no dual -> more than 1
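
The last bullet can be read as a value merge: for adjectives, dual and plural collapse into a single "more than 1" value, while nouns and verbs keep all three numbers. A small sketch under that reading. The function name and the value strings are illustrative.

```python
def reduce_number(category, number):
    """Collapse dual and plural for adjectives; leave other categories as-is."""
    # Assumption: only adjectives use the two-way number contrast.
    if category == "adjective" and number in ("dual", "plural"):
        return "more than 1"
    return number


print(reduce_number("adjective", "dual"))
print(reduce_number("noun", "dual"))
```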

  14. Statistical tagger – three tagsets
     • 500,000-word training corpus

  15. Conclusions
     • 2.94% = 32,340,000 tokens (1.1 billion corpus)
     • 0.43% = 4,730,000 tokens (1.1 billion corpus)
     • significant: unknown words
     • not significant: categories
     • Reduced tagset: automatic extraction of data from the corpus
     • Full tagset: manual corpus analysis in Sketch Engine & other concordancers
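
The token counts in the first two bullets follow directly from the percentages and the 1.1-billion-token corpus size. A short check of that arithmetic; the function name is illustrative.

```python
CORPUS_TOKENS = 1_100_000_000  # corpus size stated on the slides


def tokens_for_share(percent, corpus_tokens=CORPUS_TOKENS):
    """Convert a percentage of the corpus into an absolute token count."""
    return round(percent / 100 * corpus_tokens)


print(tokens_for_share(2.94))
print(tokens_for_share(0.43))
```

At this corpus size even a fraction of a percentage point corresponds to millions of tokens.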
