The Sketch Engine for Dutch with the ANW corpus Carole Tiberius

Presentation Transcript


  1. The Sketch Engine for Dutch with the ANW corpus
     Carole Tiberius

  2. Outline
     • The Algemeen Nederlands Woordenboek
     • Main features
     • The ANW corpus
     • The Sketch Engine
     • Background
     • Word Sketches for Dutch

  3. The ANW dictionary
     • Online scholarly dictionary
     • Contemporary standard Dutch in the Netherlands and Flanders
     • General (mainly written) language
     • Period: 1970-2018
     • Size: 70,000 main entries and 250,000 subentries
     • Users: from laypeople to professionals
     • Not a clone of an existing printed dictionary
     • Semasiological and onomasiological
     • Modular editing and publication
     • Corpus-based

  4. ANW: main content features (example entry: sportveld)
     • spelling
     • morphology and compounding
     • grammar
     • combinations; collocations
     • multimedia
     • meaning

  5. ANW Corpus
     Compiled from:
     • Electronic texts already available at the INL
     • Internet
     • Scanning
     Subcorpora:
     • Corpus of domains: 32 million tokens
     • Corpus of literary texts: 20 million tokens
     • Newspaper corpus: 40 million tokens
     • Corpus of neologisms: 5.5 million tokens
     • Pluscorpus: 5 million tokens
     Total: 102.5 million tokens

  6. Corpus preparation
     • Conversion to vertical format: word-form, tag, lempos
     • Inclusion of <g> tag for punctuation
     • Removal of duplicate texts
     • Conversion to UTF-8
     • More uniform document headers: subcorpus, ID, variant, dates, etc.
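
The conversion step on this slide can be illustrated with a minimal Python sketch. It writes one token per line with tab-separated word form, tag and lempos columns, and puts a glue element before punctuation. Everything specific here is an assumption made for illustration: the tag strings, the `POS_SUFFIX` mapping used to build lempos values, the `<g/>` spelling of the glue element, the `<doc id="...">` header, and the function names are hypothetical and not taken from the ANW pipeline.

```python
# Hypothetical mapping from the first letter of a tag to a lempos suffix.
POS_SUFFIX = {"N": "n", "V": "v", "A": "a"}


def lempos(lemma, tag):
    """Build a lempos value: the lemma plus a one-letter part-of-speech suffix."""
    return f"{lemma}-{POS_SUFFIX.get(tag[:1], 'x')}"


def to_vertical(doc_id, tokens):
    """Convert (word, tag, lemma) triples to vertical text, one token per line."""
    lines = [f'<doc id="{doc_id}">']
    for word, tag, lemma in tokens:
        if tag == "PUNCT":
            # Assumption: every punctuation token attaches to the previous token.
            lines.append("<g/>")
        lines.append(f"{word}\t{tag}\t{lempos(lemma, tag)}")
    lines.append("</doc>")
    return "\n".join(lines)


sample = [
    ("Hij", "P.pers.nom", "hij"),
    ("at", "V.mai", "eten"),
    ("een", "T", "een"),
    ("boterham", "N", "boterham"),
    (".", "PUNCT", "."),
]
vertical = to_vertical("demo-1", sample)
print(vertical)
```

A real conversion would also have to handle opening brackets and quotes, where the glue belongs after the punctuation token rather than before it.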

  7. Changes to the editor
     The ANW editor was adapted so that lexicographers can automatically copy examples, together with their source information, from the Sketch Engine into the editor.

  8. ANW grammatical relations for nouns
     • object-of
     • with ‘dat’ (that)-compl
     • subject-of
     • with wh-compl
     • with auxiliary
     • with ‘of’ (whether)-compl
     • premodifying adjective
     • with ‘alsof’ (as if)-compl
     • premodifying present participle
     • with demonstrative pronoun
     • premodifying past participle
     • with possessive pronoun
     • with infinitive plus ‘om te’
     • with PP
     • in PP
     • with indefinite pronoun
     • with personal pronoun
     • premodifying noun
     • premodifying genitive
     • postmodifying noun
     • postmodifying genitive
     • premodifying numeral
     • with proper noun
     • postmodifying numeral
     • with article
     • postmodifying adjective
     • with coordinated noun
     • with infinitive plus ‘te’
     • other

  9. Dutch Sketch Grammar
     • Geared completely towards the ANW requirements
     • Covers about 50 of the 70 relations
     • Types of relations:
       • Symmetric (e.g. and/or)
       • Trinary (e.g. headword + pp + noun)
       • Dual (e.g. adj + headword)
       • Unary (e.g. +relative clause – dat)

  10. Specific problems for Dutch
     • Verb-subject and verb-object relations are hard to identify, because word order is not a reliable cue, e.g.

         BOONEN zou Voigt in de sprint geklopt hebben
         Boonen would Voigt in the sprint beaten have
         ‘Boonen would have beaten Voigt in the sprint.’

         VOIGT zou Boonen in de sprint geklopt hebben
         Voigt would Boonen in the sprint beaten have
         ‘Voigt would have beaten Boonen in the sprint.’

         (Bouma 2008:20)

  11. Sketch Grammar rules

     *DUAL
     =object/object_of
     # hij ziet de man / hij heeft de man gezien (he sees the man / he has seen the man)
     "P.*pers.*nom.*" 1:"V.*mai.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" [tag!="N.*" & tag!="S.pre.*"]
     "P.*pers.*nom.*" "V.*aux.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" 1:"V.*mai.*"
     # gisteren zag Piet Jan (yesterday Piet saw Jan)
     [word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] 1:"V.*mai.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*"
     [word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] [tag="V.*aux.*"] [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" 1:"V.*mai.*"
     # omdat Piet Jan ziet (because Piet sees Jan)
     "C.*sub.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" 1:"V.*mai.*"

     *DUAL
     =subject/subject_of
     # gisteren zag Piet Jan (yesterday Piet saw Jan)
     [word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] 1:"V.*mai.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*"
     [word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] [tag="V.*aux.*"] [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*" 1:"V.*mai.*"
     # omdat Piet Jan ziet (because Piet sees Jan)
     [word="omdat" | word="dat" & tag="C.*sub.*"] [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*" 1:"V.*mai.*"
     # gepleegd door de moordenaar (committed by the murderer)
     1:"V.*mai.*part.*past.*" [word="door"] [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*"

  12. Specific problems for Dutch
     • Separable verbs, e.g.

         Hij at een hele boterham op (from ‘opeten’)
         He ate a whole sandwich up
         ‘He ate a whole sandwich’

         omdat hij een hele boterham op heeft gegeten
         because he a whole sandwich up has eaten
         ‘because he has eaten a whole sandwich’

  13. Sketch Grammar rules

     =bijw+WW
     # separable verbs
     "N.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
     "N.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
     "A.*partpast.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
     "N.*partpast.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
     "V.*mai.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
     "V.*mai.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
     "N.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
     "N.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
     "A.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
     "A.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
     "V.*mai.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
     "V.*mai.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
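
To make the separable-verb problem concrete, here is a small Python sketch, not part of the original slides, that re-attaches a separated particle to the lemma of the main verb in the two example clauses with 'opeten'. It assumes, purely for illustration, that the particle carries a tag beginning with `S.nonpre`, that main verbs carry a tag beginning with `V.mai`, and that each input holds a single clause. The tag strings and the function name `separable_lemma` are hypothetical.

```python
def separable_lemma(tokens):
    """Return the main-verb lemma, prefixed with a separated particle if present.

    tokens is a list of (word, tag, lemma) triples for one clause.
    Returns None unless the clause contains exactly one main verb.
    """
    particles = [lemma for _, tag, lemma in tokens if tag.startswith("S.nonpre")]
    verbs = [lemma for _, tag, lemma in tokens if tag.startswith("V.mai")]
    if len(verbs) != 1:
        return None  # ambiguous or verbless clause: do not guess
    if len(particles) != 1:
        return verbs[0]  # no single particle to attach
    return particles[0] + verbs[0]


main_clause = [
    ("Hij", "P.pers.nom", "hij"),
    ("at", "V.mai", "eten"),
    ("een", "T", "een"),
    ("hele", "A", "heel"),
    ("boterham", "N", "boterham"),
    ("op", "S.nonpre", "op"),
]
sub_clause = [
    ("omdat", "C.sub", "omdat"),
    ("hij", "P.pers.nom", "hij"),
    ("een", "T", "een"),
    ("hele", "A", "heel"),
    ("boterham", "N", "boterham"),
    ("op", "S.nonpre", "op"),
    ("heeft", "V.aux", "hebben"),
    ("gegeten", "V.mai.part.past", "eten"),
]
print(separable_lemma(main_clause))
print(separable_lemma(sub_clause))
```

The sketch ignores the distance between particle and verb and does not check whether the combined form is an existing verb, both of which a real grammar or lexicon lookup would need to take into account.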

  14. Subcorpora
     Within the ANW corpus, seven subcorpora were defined:
     • Belgian Dutch
     • Dutch Dutch
     • Corpus Literary Texts
     • Domain-dependent Texts
     • Newspaper Texts
     • Neologisms
     • Pluscorpus

  15. Language variety: Belgian Dutch

  16. Language variety: Dutch Dutch

  17. Wish list / Questions
     • Fixed order of display
     • Efficient dealing with different tag sets
     • Correct display of unary relations
     • Possible formats of dates in document headers
     • Use of morphological information in Sketch Engine

  18. http://anw.inl.nl
