Analyzing Afrikaans Morphological Constructions: Automatic Morphological Analysis Overview
This presentation explores the application of Embodied Construction Grammar (ECG) to Afrikaans morphological structures, in the context of a project on automatic morphological analysis. Based in South Africa, the project aims to advance human language technology (HLT) by developing efficient, reusable modules for Afrikaans. Inflection and derivation are the morphological processes in focus, illustrated by the plural and nominalising constructions. The presentation also outlines the requirements set for a formalism and the interdisciplinary nature of the project team.
Presentation Transcript
Applying Embodied Construction Grammar: a description of some Afrikaans morphological constructions Gerhard B van Huyssteen Potchefstroom University for CHE South Africa Acknowledgement: Sulené Pilon ICLC 2003
Overview
• HLT and CL in South Africa
• Project: Automatic Morphological Analysis of Afrikaans
• Requirements of a Formalism
• Two Afrikaans Constructions
  • Plural Construction
  • Nominalising Construction
• Concluding remarks

HLT in South Africa
• CL and NLP:
  • well-established research fields in USA, Europe, and other parts of the world
  • unexplored territory in South Africa
  • no comprehensive HLT projects for many years
• Since 2000:
  • awareness of importance of HLT
  • governmental level – advisory committee of DACST (2002)
  • academic level – new projects & programmes

CL at the PUCHE
• Since 2001: prioritised CL as strategically important
  • establish research focus area “Language and Technology”
  • establish first complete graduate study programme in CL in South Africa
  • set up dedicated HLT laboratory
  • acquire text and speech corpora for:
    • Afrikaans
    • South African English
    • Setswana
• Two related Afrikaans projects:
  • Spelling Checker project (funded by University)
  • Automatic Morphological Analysis of Afrikaans project (funded by NRF)

AMAA project
• Aim: to develop efficient, reusable modules for the automatic morphological analysis of Afrikaans
  • tokeniser
  • hyphenator
  • word segmenter
  • POS tagger
  • compound analyser
  • stemmer
• Project team includes 4 linguists, 1 computational linguist (from University of Tilburg, Netherlands), 2 computer scientists
• Problem: communication between:
  • different disciplines
  • different languages

In Search of a Formalism
• A formalism is a set of features used to precisely and rigorously interpret linguistic analysis (i.e. rules, principles, conditions, etc.) in logical or mathematical terms, in order to develop a calculus (cf. Crystal, 1997: 156)
• Looking for:
  • a formal rule system (i.e. formal grammar or formalism)
    • for declarative purposes
    • not for more procedural purposes (like parsing and generation)
  • to represent Afrikaans morphological structure
    • not particularly interested in syntax, semantics, pragmatics

Requirements: Formalisms
• Accessibility
  • Transparent
  • Supported by literature
• Efficiency
  • Linguistically efficient
    • Must be able to capture all linguistic phenomena accurately
  • Computationally efficient
    • To be implemented in a computer environment
• Flexibility
  • Describe language structure with ease
  • Represent the underlying linguistic theory
• Reusability
  • apply in different environments and applications

Some specific requirements
• Must represent regexps
  • developing a rule-based stemmer, using Perl
• Must rank the rules
  • exceptions (i.e. low-level instantiations) are ranked higher than rules (i.e. schemas)
  • “longer” rules are ranked higher than “shorter” rules
    • DIM construction: -tjie is removed before -jie (e.g. paaltjie, hondjie)
• Must be compatible with CG

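The "longer before shorter" ranking can be illustrated with a minimal sketch. It is written in Python rather than the Perl used in the project, and the rule table, the function names, and the use of pattern length as a stand-in for rank are all assumptions made for illustration; the slides do not show the stemmer's actual rules.

```python
import re

# Hypothetical rule table for the DIM construction: (suffix pattern, replacement).
# Deliberately listed in the "wrong" order to show that ranking fixes it.
DIM_RULES = [
    (r"jie$", ""),
    (r"tjie$", ""),
]

def rank_longest_first(rules):
    """Order rules so that longer patterns are tried before shorter ones."""
    return sorted(rules, key=lambda rule: len(rule[0]), reverse=True)

def stem(word, rules):
    """Apply the first (highest-ranked) rule whose pattern matches."""
    for pattern, replacement in rank_longest_first(rules):
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word  # no rule applies

print(stem("paaltjie", DIM_RULES))  # -tjie is tried first
print(stem("hondjie", DIM_RULES))   # only -jie matches
```

Without the ranking step, the -jie rule would fire on paaltjie and leave the incorrect stem "paalt".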
Procedure
• Identify main morphological processes
  • Inflection
  • Derivation
  • Compounding
• Identify constructions
  • PLURAL construction
  • PAST construction
  • NOMINALISING construction
  • REDUPLICATION construction
• Draw categorisation networks
• Translate into ECG
• Implement in stemmer

Afrikaans Plural Construction
• Inflectional process, realised by means of suffixation
• 2 prototypical constructions:
  • -e: hond – honde [dogs]; bal – balle [balls]
  • -s: venster – vensters [windows]; tafel – tafels [tables]
• Elaborations of the general schema
  • ’e: 3 – 3’e [3’s]
  • ’s: ma – ma’s [mothers]
• Extensions of the general schema
  • -a: datum – data

[Figure: Categorisation Network. GB van Huyssteen (PUCHE)]

PLURAL construction I
construction SUFFIXATION
  subclass of AFFIXATION
  constructional
    constituents
      root
      suffix
    constraints
      constituency : [rootm/rootf] [[suffixm/suffixf]]
  form
    constraints
      rootf meets suffixf
      suffixf.dependency dependent
      rootf.dependency autonomous | dependent
  meaning
    constraints
      profile-det suffix

PLURAL construction II
construction PLURAL
  subclass of SUFFIXATION
  constructional
    evokes INFLECTION
    constituents
      root : NOUN-SG; LET; NUM; ABBR
      suffix : PLURAL-SUF
    constraints
      rootm.scope-of-pred BOUNDED-REGION
      suffixm.scope-of-pred UNBOUNDED-REGION
  form
  meaning
    constraints
      scope-of-pred UNBOUNDED-REGION

PLURAL construction III
construction PLURAL-s
  subclass of PLURAL
  constructional
    constituents
      root : NOUN-SG-CN
      suffix : s
    constraints
      rootf : /^($C)?$V($C)$V[a-z]*$/
      suffixf : /s/
      rootm.profile THING
      ranking : 16
  form
    constraints
      s/^($C)?$V($C)$V[a-z]*$/^($C)?$V($C)$V[a-z]*s$/
  meaning
    constraints
      profile THING

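The form constraint of PLURAL-s relies on the macros $C and $V, which the slides use without defining. The following Python sketch shows how such a pattern could be expanded and checked. The macro bodies (a consonant cluster for $C, a single vowel letter for $V), and the names MACROS, expand, and pluralise_s, are assumptions for illustration, not the project's definitions.

```python
import re

# Assumed macro definitions; the slides do not spell these out.
MACROS = {
    "$C": "[bcdfghjklmnpqrstvwxz]+",  # assumption: one or more consonant letters
    "$V": "[aeiouy]",                 # assumption: a single vowel letter
}

def expand(pattern):
    """Replace each macro name in the pattern with its regular-expression body."""
    for name, body in MACROS.items():
        pattern = pattern.replace(name, body)
    return pattern

# Root form constraint copied from the PLURAL-s slide.
ROOT_F = re.compile(expand(r"^($C)?$V($C)$V[a-z]*$"))

def pluralise_s(root):
    """Return root + 's' if the root satisfies the form constraint, else None."""
    return root + "s" if ROOT_F.match(root) else None

for noun in ("tafel", "venster", "hond"):
    print(noun, "->", pluralise_s(noun))
```

Under these assumed macros, tafel and venster satisfy the constraint, while hond (which takes -e according to the plural slide) does not.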
PLURAL construction IV
construction PLURAL-’s
  subclass of PLURAL-s
  constructional
    constituents
      root : NOUN-SG-PROPER; NOUN-SG-CN; LETT; NUM; ABBR
      suffix : ’s
    constraints
      rootf : /%PROPN($V)$/
              /%CN([iouá])$/
              /^([a-z][^lmnrsxz])$/
              /^([1-9]+[^123456])$/
              /^%ABBR($V)$/
      rootm.profile THING | SAR
      suffixf : /’s/
      ranking : 13
  form
    constraints
      s/%PROPN($V)$/%PROPN($V)’s$/
      s/%CN($V)$/%CN($V)’s$/
      s/^([a-z][^lmnrsxz])$/^([a-z][^lmnrsxz]’s)$/
      s/^([1-9]+[^123456])$/^([1-9]+[^123456])’s$/
      s/%ABBR($V)$/%ABBR($V)’s$/
  meaning
    constraints
      profile THING

PLURAL construction V
construction PLURAL-specified
  subclass of PLURAL
  constructional
    constituents
      root : pad sambreel hemp seun bod Aardklop (l|spr)eeu man (m)?eeu vrou voël kasteel bal oom
      suffix : PLURAL-SUF
    constraints
      ranking : 1
  form
    constraints
      s/pad/paaie/
      s/sambreel/sambrele/
      s/hemp/hemde/
      s/seun/seuns/
      s/bod/botte/
      s/Aardklop/(Aardkloppe|Aardklops)/
      s/(l|spr)eeu/(l|spr)eeus/
      s/man/(manne|mans)/
      s/(m)?eeu/(m)?eeue/
      s/vrou/(vroue|vrouens)/
      s/voël/(voëls|voële)/
      s/kasteel/kastele/
      s/bal/(balle|ballas)/
      s/oom/ooms/
  meaning
    constraints
      profile THING

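How a lexically specified construction with ranking 1 takes precedence over a schema with ranking 16 can be sketched in Python as follows. The reading that a lower ranking number means higher priority is an inference from the slides (exceptions are said to be ranked higher than schemas, and PLURAL-specified carries ranking 1). The table holds only a subset of the listed roots, and plural_s_naive is a hypothetical stand-in that ignores the form constraint of PLURAL-s.

```python
# Subset of the lexically specified plurals from the PLURAL-specified slide.
SPECIFIED_PLURALS = {
    "pad": ["paaie"],
    "hemp": ["hemde"],
    "man": ["manne", "mans"],
    "vrou": ["vroue", "vrouens"],
    "kasteel": ["kastele"],
}

def plural_specified(root):
    """Exception lookup: returns the listed plural variants, or None."""
    return SPECIFIED_PLURALS.get(root)

def plural_s_naive(root):
    """Hypothetical stand-in for the PLURAL-s schema; skips its form constraint."""
    return [root + "s"]

# (ranking, construction) pairs; assumption: a lower number is tried first.
RANKED = sorted(
    [(16, plural_s_naive), (1, plural_specified)],
    key=lambda pair: pair[0],
)

def pluralise(root):
    """Return the plural forms from the highest-ranked applicable construction."""
    for _ranking, construction in RANKED:
        forms = construction(root)
        if forms:
            return forms
    return []

print(pluralise("man"))    # handled by the exception table
print(pluralise("tafel"))  # falls through to the schema
```

Returning a list keeps variant plurals such as manne and mans together, mirroring the alternations written as (manne|mans) on the slide.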
[Figure: Categorisation Network. GB van Huyssteen (PUCHE)]

NOMINALISING construction I
construction NOMINALISING
  subclass of AFFIXATION
  constructional
    evokes DERIVATION
    constituents
      root : VERB|ADJ|ADV
      affix : NOM-PREFIX|NOM-SUFFIX|NOM-CIRCUMFIX
    constraints
      rootm.profile PROCESS|SAR|CAR
      affixm.profile THING
  form
  meaning
    constraints
      profile THING

NOMINALISING construction II
construction NOMINALISING-ge()[+$C]ery
  subclass of NOMINALISING-ge()ery
  constructional
    constituents
      root : VERB
      circumfix : ge()ery
    constraints
      rootf : /%VERB([áéíóú]$C$/
      rootm.profile PROCESS
      circumfixf : /ge()[+$C]ery/
      ranking : 1
  form
    constraints
      s/%VERB([áéíóú]$C$/ge(%VERB)([áéíóú]$C$Cery$/
  meaning
    constraints

NOMINALISING construction III
construction NOMINALISING-[-$V]$Cing
  subclass of NOMINALISING-ing
  constructional
    constituents
      root : VERB
      suffix : ing
    constraints
      rootf : /%VERB($V$V$C)/
      rootm.profile PROCESS
      suffixf : /[-$V]$Cing/
      ranking : 10
  form
    constraints
      s/%VERB($V$V$C)/%VERB($V$C)ing/
  meaning
    constraints

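The form constraint of this construction drops one vowel letter before adding -ing. A minimal Python sketch of that substitution follows, assuming that $V$V denotes a doubled identical vowel letter and $C a single consonant letter. The %VERB part-of-speech check is omitted, and the regular expression and function name are hypothetical.

```python
import re

# Assumption: "$V$V$C" means a doubled vowel letter followed by one consonant
# letter at the end of the verb root. The %VERB lookup is not modelled here.
LONG_VOWEL_ROOT = re.compile(r"^(.*?)([aeou])\2([bcdfghjklmnpqrstvwxz])$")

def nominalise_ing(verb):
    """Reduce the doubled vowel and add -ing, or return None if the form
    constraint is not met."""
    match = LONG_VOWEL_ROOT.match(verb)
    if match is None:
        return None
    onset, vowel, consonant = match.groups()
    return f"{onset}{vowel}{consonant}ing"

for verb in ("verklaar", "beloon", "werk"):
    print(verb, "->", nominalise_ing(verb))
```

With these assumptions verklaar yields verklaring and beloon yields beloning, while werk does not satisfy the constraint and is left to other constructions.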
NOMINALISING construction IV
construction NOMINALISING-er
  subclass of NOMINALISING-SUF
  constructional
    constituents
      root : VERB
      suffix : er
    constraints
      rootf : /^(%VERB)$/
      rootm.profile PROCESS
      suffixf : /er/
      ranking : 12
  form
    constraints
      s/^(%VERB)$/^(%VERB)er$/
  meaning
    constraints
      attr +HUMAN

Summary of adaptations
• Our adaptations provided for our needs
  • added regexps as form constraints
  • added ranking as constructional constraints
  • added attributes as meaning constraints
  • added more CG concepts/constructs:
    • profile
    • valence factors:
      • profile determinacy
      • conceptual and phonological autonomy and dependency
      • constituency
      • correspondence (?)
• This makes ECG more accessible for us

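The three additions (regexps as form constraints, ranking as a constructional constraint, attributes as meaning constraints) can be pictured as fields of a small record. The Python sketch below is hypothetical: the Construction class and its field names are not part of ECG or of the project, the values are taken from the NOMINALISING-er slide, and the %VERB check is replaced by the assumption that any lowercase word counts as a verb root.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Construction:
    """Hypothetical record holding the three adaptations described above."""
    name: str
    subclass_of: str
    ranking: int            # constructional constraint
    form_pattern: str       # form constraint (regexp), match side
    form_replacement: str   # form constraint (regexp), replacement side
    attrs: dict = field(default_factory=dict)  # meaning constraint attributes

    def apply(self, root):
        """Return the derived form, or None if the form constraint fails."""
        if re.fullmatch(self.form_pattern, root) is None:
            return None
        return re.sub(self.form_pattern, self.form_replacement, root)

# Values from the NOMINALISING-er slide; the pattern stands in for /^(%VERB)$/.
NOMINALISING_ER = Construction(
    name="NOMINALISING-er",
    subclass_of="NOMINALISING-SUF",
    ranking=12,
    form_pattern=r"^([a-z]+)$",  # assumption: any lowercase word is a verb root
    form_replacement=r"\1er",
    attrs={"HUMAN": True},
)

print(NOMINALISING_ER.apply("werk"))
```

Keeping the ranking and the regexp on the same record is one way to let a stemmer sort constructions before applying them; whether the project stored them this way is not stated in the slides.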
Evaluation: ECG as a Declarative Formalism
• Accessible?
  • very little ECG material (specifically on morphology) available
  • isolated – “do whatever we want to do…”
• Efficient
  • Linguistically efficient?
    • handled our data beautifully
  • Computationally efficient?
    • not our primary concern
    • improved communication with computational linguist and computer scientists
• Flexibility
  • represents essence of Cognitive Linguistics beautifully
  • easy to add features/adaptations
• Reusable?
  • not our primary concern
• Main Advantage:
  • compatibility with Cognitive Grammar

Conclusion
• Your conclusion:
  • What are we doing wrong?
  • What are we missing?
  • Are we “abusing” ECG?