
Probabilistic Tagging of a Corpus of Mennonite Low German: A Case Study Using Qtag



  1. Probabilistic Tagging of a Corpus of Mennonite Low German: A Case Study Using Qtag Christopher Cox, University of Alberta christopher.cox@ualberta.ca AACL - March 15, 2008

  2. Introduction • Presentation considers the application of probabilistic part-of-speech (POS) tagging methods to minority language data • Methods applied profitably in large-scale corpus construction, where time, linguistic data, technical expertise, and financial resources are often comparatively abundant • Perhaps less documented: challenges of applying similar techniques when resources are limited

  3. Introduction • Both linguistic and technical-financial challenges in minority language corpus development: • Existing computational techniques may not suit the linguistic structure of a given language • Lack of standardization (e.g. in POS tags, spelling) • Limited resources (cf. McEnery & Ostler 2000) • Important to understand which factors produce acceptable results and minimize investment of effort, given set goals and resources for tagging

  4. Introduction • Presentation offers case study in probabilistic tagging of minority language data: • Applies Qtag to a small (~120,000-token) corpus of written Mennonite Low German (Plautdietsch) • Opportunity to consider problems faced in tagging minority language data in concrete detail • Chance to evaluate tagging procedure adopted, consider alternatives which may have produced results of comparable quality

  5. “Qtag?” • Qtag: language-independent, “pure” probabilistic tagger designed by Oliver Mason • Freely available for non-commercial use • Well-documented Java API • Unicode support • Not alone among pure probabilistic taggers; arguably presents a reasonable point of departure into probabilistic tagging

  6. “Plautdietsch?” • Plautdietsch: Mennonite Low German • Variety of Eastern Low German, once spoken near Gdansk, Poland • Approx. 400,000 speakers, predominantly descendants of Dutch-Russian Mennonites (Anabaptist Christians) • Sizeable Plautdietsch speech communities on four continents and in no fewer than a dozen countries (cf. Epp 1993: 103-4)

  7. A Corpus of Plautdietsch • Corpus intended primarily for research into syntax of verbal complementation in Plautdietsch: • Adequate tagging for verbal-inflectional features important (e.g. tense, person, number, etc.) • Dialectal variation potentially relevant in analysis • Technical resources furnished largely by the Text Analysis Portal for Research (TAPoR) at University of Alberta; time expenditure should be minimized

  8. Challenges • Plautdietsch poses challenges common in minority language corpus construction: • No single orthographic standard. Systems vary between authors and individual published works • No corpora published to date. No tagsets proposed; little consensus on POS classes • Dialectal variation. Substantial variation between and within national varieties.

  9. Corpus Construction • Three-stage corpus construction procedure: • Spelling normalization. Created versions of all corpus source texts normalized according to a published orthographic standard (Epp 1996) • Tagset selection. Adapted a tagset proposed for Standard German (Münster Tagset for German, MT/D; Steiner 2003) to Plautdietsch

  10. Example: Corpus Preparation
      <?xml version="1.0" encoding="utf-8"?>
      <document doc_id="1">
        <word wd_id="31">Goon</word>
        <word wd_id="32">dach</word>
        <word wd_id="32">,</word>
        <word wd_id="33">kompt</word>
        <word wd_id="34">ennen</word>
        <word wd_id="34">,</word>
        <word wd_id="35">sat</word>
        <word wd_id="36">junt</word>
        <word wd_id="37">dol</word>
        <word wd_id="37">.</word>
        . . .
      </document>

  11. Example: Corpus Preparation
      <?xml version="1.0" encoding="utf-8"?>
      <document doc_id="1">
        <word wd_id="31">Go'n</word>
        <word wd_id="32">Dag</word>
        <word wd_id="32">,</word>
        <word wd_id="33">komt</word>
        <word wd_id="34">'enenn</word>
        <word wd_id="34">,</word>
        <word wd_id="35">sat</word>
        <word wd_id="36">Junt</word>
        <word wd_id="37">dol</word>
        <word wd_id="37">.</word>
        . . .
      </document>
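Slides 10 and 11 show the same passage before and after spelling normalization. A minimal sketch of how a file in this format might be read for further processing, assuming only the element structure shown above (the file name is illustrative):

      # Read tokens from a corpus file in the format shown above.
      # The file name is illustrative; only the element structure is assumed.
      import xml.etree.ElementTree as ET

      tree = ET.parse("corpus_doc1.xml")
      for word in tree.getroot().iter("word"):
          print(word.get("wd_id"), word.text)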

  12. Corpus Construction • Three-stage corpus construction procedure: • Tagging. Normalized texts then tagged gradually with the adopted tagset, in an iterative, interactive process:

  13.-18. Iterative Interactive Tagging [diagram: a document segmented into chunks n, n+1, n+2, ..., c] • Segment the document into c “chunks” of n tokens. • Manually assign tags to the first chunk. • Train Qtag on all correct tags and have it tag the next chunk. • Manually correct the tags assigned to the last chunk, adding them to the training data. • Repeat training and correction until all c chunks are tagged.
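The procedure lends itself to a simple driver loop. A minimal sketch, assuming a hypothetical tagger wrapper with train() and tag() methods and a manual-correction callback (Qtag itself is driven through its Java API; these names are illustrative, not Qtag's):

      # Sketch of iterative, interactive tagging. The tagger interface
      # (train/tag) and correct_manually callback are hypothetical.

      def chunks(tokens, n):
          """Split the token list into consecutive chunks of n tokens."""
          for i in range(0, len(tokens), n):
              yield tokens[i:i + n]

      def tag_corpus(tokens, n, tagger, correct_manually):
          training = []                    # (token, tag) pairs verified so far
          for i, chunk in enumerate(chunks(tokens, n)):
              if i == 0:
                  # First chunk: tags assigned entirely by hand.
                  tagged = correct_manually([(tok, None) for tok in chunk])
              else:
                  tagger.train(training)   # retrain on all corrected tags
                  tagged = correct_manually(tagger.tag(chunk))
              training.extend(tagged)      # corrected chunk joins training data
          return training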

  19. The Road(s) Not Taken • Iterative, interactive process successful, albeit time-consuming and labour-intensive. What could have been done to reduce the burden of corpus construction without lessening the quality of the resulting data? • Necessary to normalize spelling in advance? • Should greater numbers of tokens have been tagged at each stage? • Should the tagset have been less elaborate?

  20. Simulating Iterative Tagging • Simulations of different models of iterative, interactive tagging conducted using the corrected data • Parameters of the tagging models simulated: • Normalization. Normalized, unnormalized data • Chunk size. 100, 200, 300, 400, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000, 7500, 10000 tokens tagged per round • Tagset selection. 99 tags, 50 tags, 13 tags
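These three parameters cross to give 2 x 14 x 3 = 84 simulated configurations. A minimal sketch of enumerating the grid (variable names are illustrative):

      # Enumerate the simulated tagging configurations (names illustrative).
      from itertools import product

      normalization = ["normalized", "unnormalized"]
      chunk_sizes = [100, 200, 300, 400, 500, 750, 1000, 1500,
                     2000, 3000, 4000, 5000, 7500, 10000]
      tagsets = ["POS-99", "POS-50", "POS-13"]

      for norm, n, t in product(normalization, chunk_sizes, tagsets):
          ...  # run one simulation of iterative tagging with these settings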

  21. Evaluating Tagging Models • Evaluation of each model by rate of accuracy development and estimated time requirement • Estimated time requirement as a function of time requirements for initial manual tagging and subsequent tag correction (at various error rates) of c chunks of n tokens using tagset t:
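The formula itself appeared as a slide graphic. One plausible formalization consistent with that description (the symbols here are reconstructions, not the original notation):

      T(c, n, t) = n \, m(t) + \sum_{i=2}^{c} n \, \bigl( v(t) + e_i \, r(t) \bigr)

where m(t) is the per-token cost of manual tagging with tagset t, v(t) the per-token cost of verifying an automatically assigned tag, r(t) the per-token cost of correcting a wrong tag, and e_i the tagger's error rate on chunk i.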

  22. Evaluating Normalization • Does orthographic normalization matter, either for the rate of accuracy development or estimated overall time expenditure? • Holding tagset and chunk size constant, compare simulations of tagging normalized and unnormalized data:

  23.-24. Evaluating Normalization [charts: simulated accuracy development and estimated time requirement for normalized vs. unnormalized data]

  25. Evaluating Normalization • Does orthographic normalization matter, either for the rate of accuracy development or estimated overall time requirement? • Rate of accuracy development: on average 20% lower for unnormalized data across all tagsets • Estimated time requirement: on average 26 hours longer for POS-99, 15 hours longer for POS-50, and 11 hours longer for POS-13 with unnormalized data

  26. Evaluating Training Data • Does chunk size matter, either for the rate of accuracy development or estimated overall time expenditure? • Holding tagset and normalization constant, compare simulations of tagging for different chunk sizes:

  27.-28. Evaluating Training Data [charts: simulated accuracy development and estimated time requirement across chunk sizes]

  29. Evaluating Training Data • Does chunk size matter, either for the rate of accuracy development or estimated overall time expenditure? • Rate of accuracy development: no substantial difference in accuracy development for chunk sizes <= 5000 • Estimated time requirement: considerable differences, with smaller chunk sizes (< 2000) taking less time • Smaller chunks minimize the time spent tagging the first chunk manually, without the aid of automatically assigned tags

  30. Evaluating Tagsets • Does tagset detail matter, either to the rate of accuracy development or estimated overall time requirement? • Holding chunk size and normalization constant, compare all three tagsets:

  31.-32. Evaluating Tagsets [charts: simulated accuracy development and estimated time requirement for the three tagsets]

  33. Evaluating Tagsets • Does tagset detail matter, either for the rate of accuracy development or estimated overall time requirement? • Rate of accuracy development: an average 15% increase in mean accuracy for the minimal tagset over the full tagset, regardless of normalization • Estimated time requirement: time requirement for the full tagset (80.5 hours) more than double that of the minimal tagset (36.5 hours)

  34. Evaluation: Summary • In the present case, the following guidelines would appear relevant to ‘successful’ tagging: • Normalization: Accuracy gains (here, 20%) may be substantial; however, gains must be weighed against the cost of normalization itself • Chunk size: Favour smaller chunk sizes; choose tag correction over manual tag assignment • Tagset: Minimize tagset complexity (wherever corpus goals permit)

  35. Evaluation and Planning • Determining interaction of all such factors in their relation to accuracy likely impossible during corpus planning • Nevertheless, planning and evaluation might profitably enter into corpus construction: • Consideration of general guidelines, such as those proposed in this case study, during corpus design • Periodic evaluation as additional part of iterative tagging process

  36. Tagging and Minority Language Data • Such suggestions must be measured against the requirements, resources, and stated goals of the corpus project: • In the present case, detailed verbal coding was needed; the cost of the tagset was mitigated through normalization • Sociolinguistic situation may require preservation (in some form) of original orthographies or other “distinctive” features of source data

  37. Tagging and Minority Language Data • Selection of pure probabilistic methods over others in part determined by typological features and available sources of data: • Highly fusional or polysynthetic languages may benefit from morphological parsing, rather than probabilistic POS assignment alone; • Integration of tagged documents with other linguistic data (e.g. dictionaries, word lists) may encourage use of hybrid tools permitting concurrent lemmatization

  38. Conclusion • Computer-assisted part-of-speech assignment a complex problem, one profitably viewed in the larger context of minority language corpus construction: • Computational methods, probabilistic or otherwise, of clear importance, but not sole object of inquiry • Rather, consideration required of resources, requirements, and (socio-)linguistic conditions which bear upon minority language corpus construction as a whole

  39. Conclusion • Case studies of minority language corpus design might contribute to an understanding of such problems in context: • Present direction for further quantitative study of corpus and tagset design • Offer assessment of the challenges facing corpus-based language documentation, providing guidelines from which similar projects might benefit

  40. Thanks!

  41. Acknowledgements • Text Analysis Portal for Research (TAPoR), University of Alberta • Social Sciences and Humanities Research Council of Canada (SSHRC) • Members of the Department of Linguistics, University of Alberta • Oliver Mason (for Qtag)

  42. References • Epp, Reuben. 1993. The History of Low German and Plautdietsch: Tracing a language across the globe. Hillsboro, Kansas: The Reader’s Press. • Epp, Reuben. 1996. The Spelling of Low German and Plautdietsch. Hillsboro, Kansas: The Reader’s Press. • McEnery, Tony and Nick Ostler. 2000. A New Agenda for Corpus Linguistics - Working with all of the World’s Languages. Literary and Linguistic Computing 15.403-49.

  43. References • Tufis, Dan and Oliver Mason. 1998. “Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger.” Proceedings of the First International Conference on Language Resources & Evaluation (LREC), Granada (Spain), 28-30 May 1998, 589-596. • Steiner, Petra. 2003. Das revidierte Münsteraner Tagset Deutsch (MT/D). Beschreibung, Anwendung, Beispiele und Problemfälle [The revised Münster Tagset for German (MT/D). Description, Application, Examples and Problematic Cases]. Online: http://xlex.uni-muenster.de/Portal/MTPD/tagsetDescriptionDE.ps

  44. Qtag Algorithm • Read in the next token. • Retrieve all tags observed for this token (if none are available, guess possible tags). • For each possible tag: • Calculate Pw = P(tag|token), the probability that the token has this tag • Calculate Pc = P(tag|t1, t2), the probability that the tag follows tags t1 and t2 • Calculate Pw,c = Pw * Pc • Repeat this calculation for the other two tags in the window (except with Pc = P(t1 precedes t2, tag) and Pc = P(t2 between t1, tag))
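A schematic rendering of the scoring step above, for the forward-context case only, with illustrative data structures (a sketch of the described calculation, not Qtag's own code):

      # Sketch of the tag-scoring step described above (not Qtag's code).
      # lexicon:  token -> {tag: P(tag|token)}
      # trigrams: (t1, t2) -> {tag: P(tag|t1, t2)}

      def score_tags(token, t1, t2, lexicon, trigrams, all_tags):
          candidates = lexicon.get(token)       # tags observed for this token
          if candidates is None:                # unseen token: guess over all tags
              candidates = {tag: 1.0 / len(all_tags) for tag in all_tags}
          scores = {}
          for tag, p_w in candidates.items():                  # Pw = P(tag|token)
              p_c = trigrams.get((t1, t2), {}).get(tag, 0.0)   # Pc = P(tag|t1,t2)
              scores[tag] = p_w * p_c                          # Pw,c = Pw * Pc
          return scores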

  45. “Qtag” http://www.english.bham.ac.uk/staff/omason/software/qtag.html

  46. Example: Corpus Preparation
      <?xml version="1.0" encoding="utf-8"?>
      <document doc_id="1">
        <word wd_id="31">Goon</word>
        <word wd_id="32">dach</word>
        <word wd_id="32">,</word>
        <word wd_id="33">kompt</word>
        <word wd_id="34">ennen</word>
        <word wd_id="34">,</word>
        <word wd_id="35">sat</word>
        <word wd_id="36">junt</word>
        <word wd_id="37">dol</word>
        <word wd_id="37">.</word>
        . . .
      </document>

  47. Example: Corpus Preparation
      <?xml version="1.0" encoding="utf-8"?>
      <document doc_id="1">
        <word wd_id="31">Go'n</word>
        <word wd_id="32">Dag</word>
        <word wd_id="32">,</word>
        <word wd_id="33">komt</word>
        <word wd_id="34">'enenn</word>
        <word wd_id="34">,</word>
        <word wd_id="35">sat</word>
        <word wd_id="36">Junt</word>
        <word wd_id="37">dol</word>
        <word wd_id="37">.</word>
        . . .
      </document>

  48. Example: Corpus Preparation
      <?xml version="1.0" encoding="utf-8"?>
      <document doc_id="1">
        <word wd_id="31" pos99a="Aa">Go'n</word>
        <word wd_id="32" pos99a="Ngns">Dag</word>
        <word wd_id="32" pos99a="Fi">,</word>
        <word wd_id="33" pos99a="Vfvca2p">komt</word>
        <word wd_id="34" pos99a="Qv">'enenn</word>
        <word wd_id="34" pos99a="Fi">,</word>
        <word wd_id="35" pos99a="Vfvca2p">sat</word>
        <word wd_id="36" pos99a="Rs">Junt</word>
        <word wd_id="37" pos99a="Bg">dol</word>
        <word wd_id="37" pos99a="Fs">.</word>
        . . .
      </document>

  49. Example: Corpus Preparation
      <?xml version="1.0" encoding="utf-8"?>
      <document doc_id="1">
        <word wd_id="31" pos99a="Aa" pos99c="Aa">Go'n</word>
        <word wd_id="32" pos99a="Ngns" pos99c="Ngas">Dag</word>
        <word wd_id="32" pos99a="Fi" pos99c="Fi">,</word>
        <word wd_id="33" pos99a="Vfvca2p" pos99c="Vfvca2p">komt</word>
        <word wd_id="34" pos99a="Qv" pos99c="Qv">'enenn</word>
        <word wd_id="34" pos99a="Fi" pos99c="Fi">,</word>
        <word wd_id="35" pos99a="Vfvca2p" pos99c="Vfvca2p">sat</word>
        <word wd_id="36" pos99a="Rs" pos99c="Rs">Junt</word>
        <word wd_id="37" pos99a="Bg" pos99c="Qv">dol</word>
        <word wd_id="37" pos99a="Fs" pos99c="Fs">.</word>
        . . .
      </document>
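With both automatic (pos99a) and corrected (pos99c) tags stored on each token, per-chunk accuracy falls out of a simple comparison. A minimal sketch, assuming a file in the format above (the file name is illustrative):

      # Accuracy of automatic tags (pos99a) against corrected tags (pos99c)
      # in a file following the format above. File name is illustrative.
      import xml.etree.ElementTree as ET

      words = list(ET.parse("corpus_doc1_tagged.xml").getroot().iter("word"))
      correct = sum(w.get("pos99a") == w.get("pos99c") for w in words)
      print(f"accuracy: {correct / len(words):.3f} over {len(words)} tokens")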
