
Treebanking a Blackfoot Corpus

Treebanking a Blackfoot Corpus. Joel Dunham UBC. Overview. Blackfoot language Online Linguistic Database (OLD) Blackfoot OLD (BOLD) BOLD Annotation/treebanking. Blackfoot language. Algonquian (Plains): Alberta & Montana Endangered: < 5000 speakers Fieldwork: UBC, UCalgary, UMontana.


Presentation Transcript


  1. Treebanking a Blackfoot Corpus • Joel Dunham • UBC

  2. Overview • Blackfoot language • Online Linguistic Database (OLD) • Blackfoot OLD (BOLD) • BOLD Annotation/treebanking

  3. Blackfoot language • Algonquian (Plains): Alberta & Montana • Endangered: < 5000 speakers • Fieldwork: UBC, UCalgary, UMontana

  4. Blackfoot language • Salient properties: • Direct-inverse system • Grammatical animacy • Agglutinative

  5. Blackfoot language • Agglutinative: • kimaaksawohpokooyimasi • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • ‘Why don’t you eat with her?’

  6. OLD • Online Linguistic Database • www.onlinelinguisticdatabase.org • Web application for documenting and analyzing languages

  7. OLD • Open source (GPL): Python (Pylons), MySQL, HTML/JS • Powerful search capability: regex, boolean • Multi-user, web-based, collaborative • Multi-media: audio, video, images, text • Auto-linking of morphemes

  8. Blackfoot OLD • OLD web application for Blackfoot (BLAOLD; funded by SSHRC) • http://blaold.webfactional.com/ • Other OLD web apps: • Okanagan OLD (OKAOLD) • Plains Cree OLD (CRKOLD) • etc.

  9. BLAOLD

  10. BLAOLD • Forms (morphemes & sentences): 21,788 (2011-07-25) • morphemes: 5,094 • sentences: 3,193 • unclassified: 13,501 • (word tokens: 20,577)

  11. BLAOLD • Sources: • textual: 16,209 forms • field work: 5,569 forms (and growing...)

  12. BLAOLD • Collections • texts created by ordered references to forms • 135 Collections at present • E.g., Creation Story: • http://blaold.webfactional.com/creationstory

  13. BLAOLD • ... Collection (text) created by referencing Forms entered into the BLAOLD.
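A minimal sketch of the idea on slides 12 and 13, that a Collection is an ordered list of references to Forms rather than a copy of their text. The class names, field names and id numbers are hypothetical illustrations and are not taken from the OLD code base; the two transcriptions come from slides 5 and 22.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Form:
    # Hypothetical fields; the real OLD Form model is not shown on the slides.
    form_id: int
    transcription: str
    translation: str = ""  # left empty where the slides give no translation


@dataclass
class Collection:
    title: str
    form_ids: List[int]  # ordered references to Forms, not copies of them


def render(collection: Collection, forms: Dict[int, Form]) -> List[str]:
    """Resolve each reference, in order, to the current transcription."""
    return [forms[i].transcription for i in collection.form_ids]


forms = {
    1: Form(1, "kimaaksawohpokooyimasi", "Why don't you eat with her?"),
    2: Form(2, "oma aakííwa iihpóma ónnikii"),
}
demo = Collection("Demo text", [2, 1])
lines = render(demo, forms)
```

Because the Collection stores only ids, a correction made to a Form would show up in every text that references it.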

  14. BLAOLD • Files: • Associate Forms, Collections & Files • 2,159 files (2011-07-25) • 1,744 audio • 259 image • 148 text • 4 video

  15. Form with morphemic analysis • Morpheme segmentation and morpheme gloss lines: blue text indicates links to morphemic Form entries found by the system • POS string auto-generated: “prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num” • Associated WAV file (tagged as an object language utterance) • Associated JPG (used as a stimulus in elicitation)

  16. BLAOLD: Goal • Improve efficiency of data collection, dissemination & analysis • automate subtasks & improve search • morphological parsing • treebanking?

  17. Morphological Parser • ‘A morphological parser for Blackfoot’ (Dunham, 2010; WAIL) • input = transcription: • kimaaksawohpokooyimasi • output = <segmentation, morph glosses, POSes>: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb
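The three output tiers on slide 17 line up morpheme by morpheme. The sketch below, which is only an illustration and not the parser itself, splits each tier on hyphens and pairs the pieces up; the function name `align` is an assumption introduced here.

```python
def align(segmentation, glosses, categories):
    """Pair up the hyphen-separated items of the three tiers."""
    tiers = [tier.split("-") for tier in (segmentation, glosses, categories)]
    if len({len(tier) for tier in tiers}) != 1:
        raise ValueError("tiers differ in morpheme count")
    return list(zip(*tiers))


triples = align(
    "k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
    "2-why-NEG-with-eat-TA-DIR-3SG-CONJ",
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
)
for morpheme, gloss, category in triples:
    print(f"{morpheme:8} {gloss:6} {category}")
```

The length check is a cheap way to catch a tier that was segmented inconsistently during data entry.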

  18. Morphological Parser • Input: kimaaksawohpokooyimasi • FST: Phonology + Morphotactics (lexicon) • Phonology (from a grammar) hand-coded into FST • Morphotactics & lexicon extracted programmatically from the BLAOLD • POS/morphemic N-grams used to select most probable parse • Output: k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb-agrb • Accuracy: ca. 70% • Challenges: variations in transcription; no hard and fast spelling rules; researchers differ in the extent to which they use the standard phonemic orthography to capture phonetic detail
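One way the N-gram selection step could work is sketched below: an add-one-smoothed bigram model over POS tags scores each candidate parse, and the highest-scoring one is kept. This is a toy under stated assumptions, not the parser's actual model. The training words are POS strings quoted on the slides (including the attested parse itself, so the comparison is circular by design), and the second candidate is a hypothetical competitor invented for the illustration.

```python
import math
from collections import Counter


def bigrams(tags):
    """Tag bigrams of one word, padded with start and end markers."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    return list(zip(padded, padded[1:]))


def train(pos_words):
    """Count tag bigrams over hyphen-separated POS strings."""
    pair_counts, context_counts, vocab = Counter(), Counter(), set()
    for word in pos_words:
        for prev, nxt in bigrams(word.split("-")):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1
            vocab.add(nxt)
    return pair_counts, context_counts, len(vocab)


def log_prob(pos_word, model):
    """Add-one-smoothed bigram log probability of one POS string."""
    pair_counts, context_counts, vocab_size = model
    return sum(
        math.log((pair_counts[(prev, nxt)] + 1)
                 / (context_counts[prev] + vocab_size))
        for prev, nxt in bigrams(pos_word.split("-"))
    )


# POS strings quoted on the slides, used here as toy training data.
TRAINING = [
    "prev-asp-vta", "drt-num", "nan", "agra-nan", "adt-asp-vai-oth-num",
    "adt-asp-fin-fin-thm", "nan-nin", "und",
    "adt-adt-asp-fin-fin-thm-agrb-oth",
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
]
model = train(TRAINING)

candidates = [
    "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",  # parse shown on the slide
    "agra-nan-oth-adt-vai-fin-thm-agrb-agrb",  # hypothetical competitor
]
best = max(candidates, key=lambda c: log_prob(c, model))
```

With this toy data the competitor loses because the tag sequence nan-oth never occurs in the training strings, whereas adt-oth does.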

  19. Morphological Parser • Benefits of a morphological parse(r): • parse online in real time (i.e., during data entry): save researcher time • create more data to improve searching

  20. Morphological Parser • Search example: find all sentences with an overt subject and an overt object • Regex on POS string for 2 nominal roots: • /n[ai][nr].*n[ai][nr].*/
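The search on slide 20 can be tried directly on the POS strings quoted elsewhere in these slides. The sketch below applies the same regex with Python's `re` module; the dictionary keys and the helper `nominal_words` are assumptions added for illustration.

```python
import re

# Same pattern as on the slide: two nominal roots anywhere in the POS string.
NOMINAL_PAIR = re.compile(r"n[ai][nr].*n[ai][nr].*")
NOMINAL_ROOT = re.compile(r"n[ai][nr]")

pos_strings = {
    "slide 15": "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num",
    "slide 17": "agra-adt-oth-adt-vai-fin-thm-agrb-agrb",
    "slide 24": "drt-num adt-asp-fin-fin-thm drt-num nan-nin und "
                "adt-adt-asp-fin-fin-thm-agrb-oth",
}

hits = [name for name, pos in pos_strings.items() if NOMINAL_PAIR.search(pos)]


def nominal_words(pos_string):
    """Count whitespace-separated words that contain a nominal root tag."""
    return sum(1 for word in pos_string.split() if NOMINAL_ROOT.search(word))


counts = {name: nominal_words(pos) for name, pos in pos_strings.items()}
```

The slide 24 string matches only because a single word is tagged nan-nin, so the pattern can fire on one word with two nominal roots rather than on two separate nominal words. Counting words with a nominal root, as `nominal_words` does, is one possible way to tell the two cases apart.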

  21. Morphological Parser • /n[ai][nr].*n[ai][nr].*/ • [Figure: search results, labelled Good and Bad]

  22. Treebank • (S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii))) • TGrep: ‘S < (NP $. (VP < NP))’ • [Tree diagram with nodes S, NP, VP, NP, DT, NP, VBD]
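To make the TGrep query on slide 22 concrete, the sketch below parses the bracketed tree with a small hand-written reader and checks the same configuration: an S that immediately dominates an NP whose next sister is a VP that immediately dominates an NP. It uses only the standard library and does not call TGrep; the function names are assumptions, and the second, object-less tree is a hypothetical variant made by deleting the object NP.

```python
def parse(text):
    """Read a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def label(tree):
    return tree[0] if isinstance(tree, tuple) else None


def subtrees(tree):
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1]:
            yield from subtrees(child)


def matches(tree):
    """Hand-coded equivalent of the query S < (NP $. (VP < NP))."""
    if label(tree) != "S":
        return False
    kids = tree[1]
    return any(
        label(left) == "NP" and label(right) == "VP"
        and any(label(c) == "NP" for c in right[1])
        for left, right in zip(kids, kids[1:])
    )


transitive = parse("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))")
no_object = parse("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma)))")  # hypothetical

found = [t for t in subtrees(transitive) if matches(t)]
```

Unlike the POS-string regex, this check looks at phrase structure, so it only succeeds when the second NP really sits inside the VP.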

  23. Treebank • Assuming a flat morphological structure, the syntactic phrase structure parsing of Blackfoot may actually be easy relative to English • one of the longest sentences in the BLAOLD by character count (69 characters) has only 5 words

  24. Treebank • [Tree diagram; phrase nodes: S S S VP VP NP NP] • Word tags: DEM VBZ DEM NN CC VBZ • POS strings: drt-num adt-asp-fin-fin-thm drt-num nan-nin und adt-adt-asp-fin-fin-thm-agrb-oth • ann-wa á'p-á-istot-i-m om-yi náápi-moyis ki saaki-á'p-á-istot-i-m-wa-áyi • ‘He is building that house and he is still building it.’

  25. Treebank • Worth it to treebank Blackfoot?

  26. Nitsííkoohtaahsi’taki
