SALAMA – Swahili Language Manager


  1. SALAMA – Swahili Language Manager Arvi Hurskainen University of Helsinki

  2. Short history • Morphological description of Swahili started in 1985 • - Two-level model using finite state automata • Morphological description ready 1989 • - To market in 1999 through Lingsoft • - Now integrated into MS Office 2007 • Disambiguation ‘ready’ 1996 • - Constraint Grammar Parser CG-2 (Connexor) • Language translator ‘ready’ 2003 • Dictionary Compiler ready 2007

  3. Morphological analysis • *serikali • "serikali" N CAP 9/10-SG { the } { government } PERS • "serikali" N CAP 9/10-PL { the } { government } PERS • ya • "ya" GEN-CON 3/4-PL { of } • "ya" GEN-CON 9/10-SG { of } • "ya" GEN-CON 5/6-PL { of } • "ya" GEN-CON 6-PLSG { of } • *tanzania • "*tanzania" N PROPNAME SG { *tanzania } • imefanya • "fanya" V 3/4-PL-SP VFIN { they } PERF:me z [fanya] { do } SVO • "fanya" V 9/10-SG-SP VFIN { it } PERF:me z [fanya] { do } SVO • uteuzi • "uteuzi" N 11-SG { the } DER:verb DER:zi { appointment } @OBJ
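The reading lines above follow a regular pattern: a quoted base form, then tags, with English glosses in curly braces. A minimal Python sketch of reading that format into a structure is given here. The function name `parse_reading` and the dictionary layout are illustrative assumptions, not part of SALAMA.

```python
import re

# Matches a reading line: a quoted base form followed by tags and glosses.
READING = re.compile(r'^"(?P<lemma>[^"]+)"\s*(?P<rest>.*)$')

def parse_reading(line):
    """Split one reading line into base form, tags and brace-enclosed glosses.

    Hypothetical helper for illustration; nested braces are not handled.
    """
    match = READING.match(line.strip())
    if match is None:
        raise ValueError(f"not a reading line: {line!r}")
    rest = match.group("rest")
    # Glosses are the pieces of text inside { ... }.
    glosses = [g.strip() for g in re.findall(r"\{([^{}]*)\}", rest)]
    # Everything outside the braces is treated as a tag.
    tags = re.sub(r"\{[^{}]*\}", " ", rest).split()
    return {"lemma": match.group("lemma"), "tags": tags, "glosses": glosses}

reading = parse_reading('"serikali" N CAP 9/10-SG { the } { government } PERS')
print(reading)
```

Keeping tags and glosses apart in this way makes it easier to express later steps, such as disambiguation, as operations on tag lists.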

  4. Disambiguation • *serikali • "serikali" N 9/10-SG { the } { government } PERS • ya • "ya" GEN-CON 9/10-SG { of } • *tanzania • "*tanzania" N PROPNAME SG { *tanzania } • imefanya • "fanya" V 9/10-SG-SP VFIN { it } PERF:me z [fanya] { do } SVO • uteuzi • "uteuzi" N 11-SG { the } DER:zi { appointment }

  5. Syntactic mapping • *serikali • "serikali" N 9/10-SG { the } { government } PERS @SUBJ • ya • "ya" GEN-CON 9/10-SG { of } @GCON • *tanzania • "*tanzania" N PROPNAME SG { *tanzania } @<GN • imefanya • "fanya" V 9/10-SG-SP VFIN { it } PERF:me z [fanya] { do } SVO @FMAINVtr+OBJ> • uteuzi • "uteuzi" N 11-SG { the } DER:zi { appointment } @OBJ

  6. How and where to describe MWEs? • Two categories of multiword expressions (MWE): • - frozen clusters of words • kwa_ajili_ya PREP { because of } • - clusters of words, the members of which may inflect • aliyenipigia picha { he/she who photographed for me } • atakayenipigia picha { he/she who will photograph for me } • aliyekuwa amekwishanipigia picha { he/she who already had photographed for me } • atakayekuwa amekwishanipigia picha { he/she who will have had photographed for me }

  7. How and where to describe MWEs? • Frozen clusters of words • - may be described in the tokenizer and analyzed as a single unit • kwa ajili ya > kwa_ajili_ya
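Joining a frozen cluster in the tokenizer can be sketched in a few lines of Python. This is only an illustration of the idea; the list `FROZEN_MWES` and the function `join_frozen_mwes` are hypothetical names, and only kwa ajili ya is taken from the slides.

```python
# Hypothetical list of frozen clusters; only this entry comes from the slides.
FROZEN_MWES = [("kwa", "ajili", "ya")]

def join_frozen_mwes(tokens, mwes=FROZEN_MWES):
    """Replace each frozen word sequence with one underscore-joined token."""
    # Longer expressions are tried first so the longest match wins.
    ordered = sorted(mwes, key=len, reverse=True)
    out = []
    i = 0
    while i < len(tokens):
        for mwe in ordered:
            n = len(mwe)
            if tuple(t.lower() for t in tokens[i:i + n]) == mwe:
                out.append("_".join(mwe))
                i += n
                break
        else:
            # No frozen cluster starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

print(join_frozen_mwes(["kwa", "ajili", "ya", "serikali"]))
```

Because the match is on surface strings only, this approach works for frozen clusters but not for clusters whose members inflect.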

  8. How and where to describe MWEs? • Inflecting clusters of words • - cannot be described in the tokenizer • - they must be described after analysis • when all necessary word-level linguistic information is available

  9. How and where to describe MWEs? • One possible solution: • - describe frozen MWEs in the tokenizer • - describe inflecting MWEs after morphological analysis • This was the earlier solution in Swahili Language Manager (SALAMA).

  10. How and where to describe MWEs? • Another solution: • - describe all MWEs after morphological analysis • - exceptions are a few fully lexicalized structures that are written as separate words • This solution is applied in current SALAMA .

  11. How and where to describe MWEs? • In describing inflecting MWEs, the following requirements apply: • - each member of the MWE must be described • - the relative location of each member must be described • - other words and punctuation marks in between members must be allowed • - manipulation of the linguistic information (i.e. tags) must be possible, because the whole cluster will be re-described • - it must be possible to isolate the newly described cluster and treat it as a single lexical unit

  12. CG in describing MWEs • Phase 1. • Analyze and disambiguate text: • ameikubali "kubali" V 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] { accept } SVO AR • shingo "shingo" N 9/10-0-SG { a/the } { neck } • upande "upande" ADV { aside }

  13. CG in describing MWEs • Phase 2. • Identify the MWE and describe its structure: • ameikubali "kubali" V 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] { accept } SVO AR • shingo "shingo" N 9/10-0-SG { a/the } { neck } • upande "upande" <<IDIOM { accept unwillingly } • Note: Only the last member is reanalyzed, and the new lexical gloss is attached to it. Two words before it are part of the idiom (<<).

  14. CG in describing MWEs • Phase 3. • Modify the other members of the MWE: • ameikubali "kubali" V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } [kubali] IDIOM-V>> SVO AR • shingo "shingo" IDIOM<> • upande "upande" <<IDIOM { accept unwillingly } • Note: In the verb, gloss in English is removed, but necessary linguistic information is retained. In ‘shingo’, the gloss is removed.

  15. CG in describing MWEs • Phase 4. • Isolate the MWE as a single lexical unit: • ("kubali_shingo_upande" V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ { it } SVO AR IDIOM-V>> { accept unwillingly } )
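The effect of Phases 2 to 4 can be imitated in Python as a rough sketch. It assumes each analyzed word is a dictionary with a base form, a tag list and a lexical gloss; this layout and the function `isolate_idiom` are assumptions made for illustration, not SALAMA's internal representation. Unlike the CG rules, the sketch requires the members to be adjacent.

```python
def isolate_idiom(cohorts, lemmas, gloss):
    """Merge an adjacent sequence of base forms into one lexical unit.

    The head (first) member keeps its grammatical tags, its own lexical
    gloss is dropped, and the idiom gloss is attached to the merged unit.
    """
    n = len(lemmas)
    for i in range(len(cohorts) - n + 1):
        if [c["lemma"] for c in cohorts[i:i + n]] == list(lemmas):
            head = cohorts[i]
            unit = {
                "lemma": "_".join(lemmas),
                # One '>' per following member, as in IDIOM-V>> on the slide.
                "tags": head["tags"] + ["IDIOM-V" + ">" * (n - 1)],
                "gloss": gloss,
            }
            return cohorts[:i] + [unit] + cohorts[i + n:]
    return list(cohorts)

sentence = [
    {"lemma": "kubali",
     "tags": ["V", "1/2-SG3-SP", "VFIN", "{ he/she }", "PERF:me",
              "9/10-SG-OBJ", "{ it }", "SVO", "AR"],
     "gloss": "accept"},
    {"lemma": "shingo", "tags": ["N", "9/10-0-SG", "{ a/the }"], "gloss": "neck"},
    {"lemma": "upande", "tags": ["ADV"], "gloss": "aside"},
]
merged = isolate_idiom(sentence, ("kubali", "shingo", "upande"), "accept unwillingly")
print(merged[0]["lemma"], merged[0]["gloss"])
```

A fuller version would also allow other words and punctuation between the members, as the requirements for inflecting MWEs demand.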

  16. CG in describing MWEs • Phase 5. • Re-order the constituents: • (V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ SVO AR IDIOM-V>> { accept { it } unwillingly } ) • Note: The order of words, and their inclusion/exclusion is controlled by re-ordering rules.

  17. CG in describing MWEs • Phase 6. • Produce surface form in English: • (V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ SVO AR IDIOM-V>> { has accepted { it } unwillingly } ) • Note: Surface form is constructed using linguistic information inherited from Swahili.

  18. Phase 7. • Final translated form: • he/she has accepted it unwillingly
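The last step, going from the Phase 6 structure to the final string, amounts to keeping only the words inside curly braces, in order. A minimal Python sketch follows; the function name `surface_from_reading` is an assumption for illustration.

```python
def surface_from_reading(reading):
    """Collect the words inside { ... } in order, ignoring all tags.

    Nested braces such as '{ accept { it } unwillingly }' are handled by
    tracking the brace depth.
    """
    depth = 0
    words = []
    for token in reading.split():
        if token == "{":
            depth += 1
        elif token == "}":
            depth -= 1
        elif depth > 0:
            # Inside at least one pair of braces: this is English output.
            words.append(token)
    return " ".join(words)

phase6 = ("(V CAP 1/2-SG3-SP VFIN { he/she } PERF:me 9/10-SG-OBJ SVO AR "
          "IDIOM-V>> { has accepted { it } unwillingly } )")
print(surface_from_reading(phase6))
```

The sketch relies on braces being separated from their content by spaces, as they are in the output shown on the slides.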

  19. Types of MWEs • Multiword expressions fall into various part-of-speech categories: • verbs • nouns • adjectives • adverbs • prepositions • multiword names • proverbs

  20. Adverb • After analysis: • kwa • "kwa" PREP { for } • "kwa" PREP { at } • "kwa" PREP { to } • "kwa" PREP { by } • "kwa" PREP { with } • "kwa" PREP { in } • "kwa" GEN-CON-KWA 15-SG { of } • "kwa" GEN-CON-KWA 17-SG { of } • kiasi • "asi" V SBJN VFIN 7/8-SG-OBJ OBJ { it } z [asi] { rebel } SVO AR • "asi" V SBJN 7/8-SG-SP VFIN { it } z [asi] { rebel } SVO AR • "kiasi" N 7/8-SG { the } { quantity } AR • "asi" ADJ A-INFL 7/8-SG { apostate } AR • "kiasi" ADV { reasonably } AR • "kiasi" AD-ADJ AR { amount } • "asi" ADV ADV:ki 9/10-SG { the } { rebel } AR • "asi" ADV ADV:ki 9/10-PL { the } { rebel } AR • kikubwa • "kubwa" ADJ A-INFL 7/8-SG { big }

  21. Adverb • After isolation: • kwa • "kwa" MW>> • kiasi • "kiasi" MW<> • kikubwa • "kubwa" ADV <<MW { to a large extent }

  22. Adverb • Adverbial expressions with genitive structure: • - number of forms limited • kwa bahati mbaya • Rule: • REPLACE ( ADV <<MW { unfortunately } ) TARGET ("baya") • (-2 ("kwa")) (-1 ("bahati")) ; • Modified result: • kwa "kwa_bahati_baya" MW>> bahati mbaya ADV { unfortunately }
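The way such a rule tests its context can be imitated in Python. The sketch below is not the CG parser; it only mimics the effect of a REPLACE rule with relative-position conditions. The function `apply_replace` and the input readings for kwa, bahati and baya are illustrative assumptions.

```python
def apply_replace(sentence, target, context, new_reading):
    """Replace the reading of `target` when all context conditions hold.

    sentence:    list of (base_form, reading) pairs
    context:     dict mapping a relative offset (e.g. -2) to a base form
    new_reading: the reading given to the target when the rule fires
    """
    result = list(sentence)
    for i, (lemma, _) in enumerate(sentence):
        if lemma != target:
            continue
        fires = True
        for offset, wanted in context.items():
            j = i + offset
            if j < 0 or j >= len(sentence) or sentence[j][0] != wanted:
                fires = False
                break
        if fires:
            result[i] = (lemma, new_reading)
    return result

# Illustrative input readings (assumed, not copied from the slides).
sentence = [
    ("kwa", "PREP { by }"),
    ("bahati", "N 9/10-SG { the } { luck }"),
    ("baya", "ADJ A-INFL 9/10-SG { bad }"),
]
rewritten = apply_replace(
    sentence,
    target="baya",
    context={-2: "kwa", -1: "bahati"},
    new_reading="ADV <<MW { unfortunately }",
)
print(rewritten[2])
```

Matching on base forms rather than surface forms is what lets one rule cover mbaya and any other inflected form of baya.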

  23. Adjective • mapambano • "pambano" N 5/6-PL { the } DER:verb DER:o { contest } • ya • "ya" GEN-CON 3/4-PL { of } • "ya" GEN-CON 9/10-SG { of } • "ya" GEN-CON 5/6-PL { of } • "ya" GEN-CON 6-PLSG { of } • kweli • "kweli" N 9/10-SG { the } { truth } • "kweli" N 9/10-PL { the } { truth } • "kweli" ADV { indeed }

  24. Adjective • Rule: • REPLACE ( ADJ <MW { genuine , serious , unaffected , undoubted , unfeigned , virtual } ) TARGET ("kweli") (-1 GEN-CON) ; • Result: • mapambano • "pambano" N 5/6-PL { the } DER:o { contest } • ya • "ya" MW> • kweli • "kweli" ADJ <MW { genuine , serious , unaffected , undoubted , unfeigned , virtual }

  25. Adjectives • Adjectival expressions with relative structure: • - number of forms limited by the number of noun classes • mtu mwenye akili • Rule: • REPLACE (ADJ <MW { clever , cute }) TARGET ("akili") • (-1 ("enye")) (NOT 0 MW); • Modified result: • mtu "mtu" N 1/2-SG { the } { man } • mwenye "enye_akili" MW> akili ADJ { clever , cute }

  26. Adjectives • Adjectival expressions with relative structure: • - number of forms limited by the number of noun classes • - is often embedded in the verb structure • tendo lililohitimishwa vibaya • Rule: • REPLACE (ADJ <MW { illegitimate }) TARGET ("vibaya") • (-1 ("hitimishwa") + REL) (NOT 0 MW); • Modified result: • tendo "tendo" N 5/6-SG { the } { act } • lililohitimishwa "hitimishwa_vibaya" MW> vibaya ADJ { illegitimate }

  27. Verb • kupambana • "pambana" V INF { to } z [pamba] { contest } PREFR SVO REC • "pambana" V INF { to } z [pamba] { adorn } SVO EXT: REC { each other } :EXT • "pambana" V INF NO-TO z [pamba] { contest } PREFR SVO REC • "pambana" V INF NO-TO z [pamba] { adorn } SVO EXT: REC { each other } :EXT • na • "na" CC { and } • "na" AG-PART { by } • "na" PREP { with } • "na" NA-POSS { of } • "na" ADV NOART { past }

  28. Verb • kupambana • "pambana" V INF { to } z PREFR SVO REC IDIOM-V> • na • "na" <IDIOM { fight with } • One-line format with multiword lexical form: • kupambana "pambana_na" V INF { to } z PREFR SVO REC IDIOM-V> • na <IDIOM { fight with }

  29. Verb • Rule: • REPLACE (<IDIOM { play piano }) TARGET ("kinanda") • (-1 ([piga])) ; • alipiga • "piga" V 1/2-SG3-SP VFIN { he/she } PAST [piga] { hit } SVO ACT • kinanda • "kinanda" N 7/8-SG { the } { piano } • One-line format with multiword lexical form: • alipiga "piga_kinanda" V 1/2-SG3-SP VFIN { he/she } PAST SVO ACT IDIOM-V kinanda { play piano }

  30. Noun • kisomo • "kisomo" N 7/8-SG { the } DER:o { small lesson } • "somo" ADV ADV:ki 5/6-SG { the } DER:verb DER:o { :teaching subject } AR • "somo" ADV ADV:ki 9/10-SG { the } DER:o { namesake } HUM • "somo" ADV ADV:ki 9/10-PL { the } DER:o { namesake } HUM • cha • "cha" GEN-CON 7/8-SG { of } • watu_wazima • "mtu_mzima" N 1/2-PL { the } { :mature persons } • "mtu_mzima" N HUM 1/2-PL { the } { mature person } • Note that part of the MWE already fixed in tokenizer: (mtu_mzima).

  31. Noun • kisomo • "kisomo" N 7/8-SG { the } MW-N>> • cha • "cha" MW<> • watu_wazima • "mtu_mzima" <<MW { :adult education } • Note that part of the MWE already fixed in tokenizer: (mtu_mzima).

  32. Types of MWEs • Nouns with genitive structure: • - number of forms limited, often sg and pl • suala la jinsia • masuala ya jinsia • Rule: • REPLACE (<<MW { :gender issue }) TARGET ("jinsia") • (-2 ("suala")) (-1 GEN-CON); • Modified result: • suala "suala la jinsia" N 5/6-SG { the } AR MW-N la jinsia { :gender issue } • masuala "suala la jinsia" N 5/6-PL { the } AR MW-N ya jinsia { :gender issue }

  33. Proper names • Proper names with multiple members: • - fixed form • Wizara ya Mawasiliano na Uchukuzi • REPLACE (<<<<MW { *ministry of *communication et *transport }) TARGET ("uchukuzi") • (-4 ("wizara")) (-3 ("ya")) (-2 ("mawasiliano")) (-1 ("na")) ; • *wizara "wizara ya mawasiliano na uchukuzi" N 9/10-SG { the } AR MW-N ya *mawasiliano na *uchukuzi { *ministry of *communication et *transport }

  34. Proverbs • - ‘fixed’ form • - one rule for different variants • Baada ya dhiki faragha. • Baada ya dhiki faraja. • Baada ya dhiki faraji. • REPLACE (<<PROVERB { *after trouble there is relief } ) TARGET ("faragha") OR ("faraja") OR ("faraji") • (-2 ("baada_ya")) (-1 ("dhiki")) ;

  35. Proverbs • - ‘fixed’ form • "*baada_ya_dhiki_faragha" PROVERB>> { *after trouble there is relief } • "*baada_ya_dhiki_faraja" PROVERB>> { *after trouble there is relief } • "*baada_ya_dhiki_faraji" PROVERB>> { *after trouble there is relief }

  36. Serial verbs • Swahili uses serial verb constructions, where only the first verb inflects and the subsequent verbs are in the infinitive.

  37. Serial verb construction analyzed • *mtu • "mtu" N CAP 1/2-SG { the } { man } • aliyepata • "pata" V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z [pata] { get } SVO • taarifa • "taarifa" N 9/10-SG { the } { report } AR • "taarifa" N 9/10-PL { the } { report } AR • alipiga • "piga" V 1/2-SG3-SP VFIN { he/she } PAST z [piga] { hit } SVO ACT • "piga" V 1/2-SG3-SP VFIN { he/she } PR:a 5/6-SG-OBJ OBJ { it } z [piga] { hit } SVO ACT • simu • "simu" N 9/10-SG { the } { telephone } • "simu" N 9/10-SG { the } { type of sardine or sprat } AN • "simu" N 9/10-PL { the } { telephone } • "simu" N 9/10-PL { the } { type of sardine or sprat } AN • , • "," COMMA { , } • kukaa • "kaa" V INF { to } z [kaa] { sit } SV SVO • "kaa" V INF NO-TO z [kaa] { sit } SV SVO • na • "na" CC { and } • "na" AG-PART { by } • "na" PREP { with } • "na" NA-POSS { of } • "na" ADV NOART { past } • kungoja • "ngoja" V INF { to } z [ngoja] { wait } SV • "ngoja" V INF NO-TO z [ngoja] { wait } SV

  38. Serial verb construction disambiguated • *mtu • "mtu" N 1/2-SG { the } { man } @SUBJ • aliyepata • "pata" V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z [pata] { get } SVO @FMAINVtr+OBJ> • taarifa • "taarifa" N 9/10-SG { the } { report } AR @OBJ • alipiga • "piga" V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> • simu • "simu" <IDIOM { call } • , • "," COMMA { , } • kukaa • "kaa" V INF { to } z [kaa] { sit } SV SVO @-FMAINV-n • "kaa" V INF NO-TO z [kaa] { sit } SV SVO @-FMAINV-n • na • "na" CC { and } @CC • kungoja • "ngoja" V INF { to } z [ngoja] { wait } SV SVO @-FMAINV-n • "ngoja" V INF NO-TO z [ngoja] { wait } SV SVO @-FMAINV-n

  39. The sentence contains an idiom. • Idiom isolated: • ( N 1/2-SG { the } { man } @SUBJ ) ( V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) ( N 9/10-SG { the } { report } @OBJ ) (V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call }) ( COMMA { , } ) ( V INF { to } z { sit } SV SVO @-FMAINV-n ) ( CC { and } @CC ) ( V INF { to } z { wait } SV @-FMAINV-n )

  40. Idiom isolated, a word-per-line format: • ( N 1/2-SG { the } { man } @SUBJ ) • ( V 1/2-SG3-SP VFIN { he/she } PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) • ( N 9/10-SG { the } { report } @OBJ ) • (V 1/2-SG3-SP VFIN { he/she } PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } ) • ( COMMA { , } ) • (V INF { to } z { sit } SV SVO @-FMAINV-n ) • ( CC { and } @CC ) • (V INF { to } z { wait } SV @-FMAINV-n )

  41. Linguistic information copied to other members of the verb series: • ( N 1/2-SG { the } { man } @SUBJ ) • ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { get } SVO @FMAINVtr+OBJ> ) • ( N 9/10-SG { the } { report } @OBJ ) • (V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { call } ) • ( COMMA { , } ) • (V 1/2-SG3-SP VFIN PAST z { sit } SV SVO @FMAINV-n ) • ( CC { and } @CC ) • (V 1/2-SG3-SP VFIN PAST z { wait } SV SVO @FMAINV-n )
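Copying the inflection of the finite verb to the following infinitives can be sketched in Python as below. The unit layout (`pos`, `infl`, `gloss`) and the function `copy_finite_inflection` are assumptions for illustration. The sketch copies from the nearest preceding finite verb, which is the behaviour seen in this example, and does not model the removal of the { to } gloss.

```python
def copy_finite_inflection(units):
    """Give each infinitive the inflection tags of the nearest preceding
    finite verb, so that all verbs in the series can be rendered alike."""
    result = []
    last_finite = None
    for unit in units:
        unit = dict(unit)  # work on a copy, leave the input untouched
        if unit["pos"] == "V":
            if "VFIN" in unit["infl"]:
                last_finite = unit["infl"]
            elif "INF" in unit["infl"] and last_finite is not None:
                unit["infl"] = list(last_finite)
        result.append(unit)
    return result

series = [
    {"pos": "N", "infl": ["1/2-SG"], "gloss": "man"},
    {"pos": "V", "infl": ["1/2-SG3-SP", "VFIN", "PAST", "1/2-SG-REL"], "gloss": "get"},
    {"pos": "N", "infl": ["9/10-SG"], "gloss": "report"},
    {"pos": "V", "infl": ["1/2-SG3-SP", "VFIN", "PAST"], "gloss": "call"},
    {"pos": "COMMA", "infl": [], "gloss": ","},
    {"pos": "V", "infl": ["INF"], "gloss": "sit"},
    {"pos": "CC", "infl": [], "gloss": "and"},
    {"pos": "V", "infl": ["INF"], "gloss": "wait"},
]
converted = copy_finite_inflection(series)
print([u["infl"] for u in converted if u["pos"] == "V"])
```

With the tense information in place, the English generator can produce sat and waited instead of to sit and to wait.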

  42. The surface form in English converted: • ( N 1/2-SG { the } { man } @SUBJ ) • ( V 1/2-SG3-SP VFIN PAST 1/2-SG-REL { who } z { :got } SVO @FMAINVtr+OBJ> ) • ( N 9/10-SG { the } { report } @OBJ ) • (V 1/2-SG3-SP VFIN PAST z SVO ACT IDIOM-V> @FMAINVtr-OBJ> <IDIOM { :called } ) • ( COMMA { , } ) • ( V 1/2-SG3-SP VFIN PAST z { :sat } SV SVO @-FMAINV-n ) • ( CC { and } @CC ) • (V 1/2-SG3-SP VFIN PAST z { :waited } SV @-FMAINV-n ) • the man who got the report called, sat and waited

  43. Problems in identifying MWEs • A construction that looks like an MWE may also be an ordinary sequence of words.

  44. Problematic cases • Original analysis: • amechukua "chukua" V 1/2-SG3-SP VFIN { he/she } PERF:me [chukua] { take } SVO • hatua "hatua" N 9/10-0-PL { step } AR • tatu "tatu" NUM 9/10-PL CARD { three } • Marking the idiom (wrong): • amechukua "chukua" V 1/2-SG3-SP VFIN { he/she } PERF:me SVO IDIOM-V> • hatua "hatua" <IDIOM { take action } • tatu "tatu" NUM 9/10-PL CARD { three }

  45. Safe cases • Safe case: • amepiga "piga" V 1/2-SG3-SP VFIN { he/she } PERF:me [piga] { hit } SVO • hatua "hatua" N 9/10-0-SG { a/the } { step } AR • amepiga "piga" V 1/2-SG3-SP VFIN { he/she } PERF:me SVO IDIOM-V> • hatua "hatua" <IDIOM { advance } • he/she has advanced
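The slides do not say how the wrong marking is prevented. One possible guard, shown here purely as an assumption, is a negative context condition that blocks the idiom reading when the noun is followed by a numeral, so that chukua hatua tatu keeps its literal reading. The function `can_mark_idiom` and the `pos` field are hypothetical.

```python
def can_mark_idiom(cohorts, i, verb, noun, blocked_next_pos=("NUM",)):
    """Return True if cohorts[i] and cohorts[i + 1] form the verb + noun
    idiom and the following word does not block the idiom reading.

    Assumed guard: a numeral right after the noun signals a literal phrase.
    """
    if i + 1 >= len(cohorts):
        return False
    if cohorts[i]["lemma"] != verb or cohorts[i + 1]["lemma"] != noun:
        return False
    if i + 2 < len(cohorts) and cohorts[i + 2]["pos"] in blocked_next_pos:
        return False
    return True

literal = [
    {"lemma": "chukua", "pos": "V"},
    {"lemma": "hatua", "pos": "N"},
    {"lemma": "tatu", "pos": "NUM"},
]
idiomatic = [
    {"lemma": "piga", "pos": "V"},
    {"lemma": "hatua", "pos": "N"},
]
print(can_mark_idiom(literal, 0, "chukua", "hatua"))
print(can_mark_idiom(idiomatic, 0, "piga", "hatua"))
```

Which contexts actually need to be blocked is an empirical question for each idiom; the numeral test is only one candidate.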

  46. MWEs in dictionary compilation • MWEs as separate dictionary entries: • {tia} V [tia] { put into, pour into, bring about, cause } 296 • {tia_akili} V IDIOM-V { take note of } 1 • [akili] taz. [tia_akili] V IDIOM-V { take note of } 1 • When sorted, the entries are located correctly in the dictionary.
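The sorting step can be illustrated with a short Python sketch that orders entries by the headword found in the leading braces or brackets. The helper `headword` is a hypothetical name, and the real Dictionary Compiler may sort differently.

```python
import re

def headword(entry):
    """Return the text inside the leading {...} or [...] of an entry."""
    match = re.match(r"[\[{]([^\]}]+)[\]}]", entry)
    return match.group(1) if match else entry

entries = [
    "{tia_akili} V IDIOM-V { take note of } 1",
    "[akili] taz. [tia_akili] V IDIOM-V { take note of } 1",
    "{tia} V [tia] { put into, pour into, bring about, cause } 296",
]

# The cross-reference lands under 'akili'; the idiom follows its verb 'tia'.
for entry in sorted(entries, key=headword):
    print(entry)
```

Sorting on the headword alone is what places the cross-reference entry under akili while the idiom itself stays next to its verb.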

  47. MWEs in dictionary compilation • MWEs as separate dictionary entries: • {afya} N 9/10 { health, sound condition } AR 1226 • [afya] taz. [bwana_afya] MW> N 9/6 { health officer } 10 • [afya] taz. [enye_afya] MW> ADJ { bonny } 17 • [afya] taz. [enye_nguvu_na_afya] MW>>> ADJ { hale } 1

  48. MWEs in dictionary compilation • MWEs with use examples in dictionary: • {piga} V (piga) { hit, beat } 647 • {piga_picha} V IDIOM-V { photograph } 40 • [piga_picha] <ALA> Ikulu kunywa chai na kupiga [piga_picha] picha na Rais Mkapa (the State House to drink tea and to photograph and President Mkapa) • [piga_picha] <ALA> wapige [piga_picha] picha, alionekana kugoma (they should photograph, he/she was seen to boycott) • [piga_picha] <DWE> Au kumpiga [piga_picha] picha au hata kupeana naye (Or to photograph or even to give each other with him/her) • [piga_picha] <DWE> kutoka Ujerumani, walijitahidi kupiga [piga_picha] picha za ukumbusho na kiongozi wao (from Germany, they made an effort to photograph the commemoration and their leader)

  49. MWEs in dictionary compilation • MWEs with use examples in dictionary: • {piga_ramli} V IDIOM-V { divine } 4 • [piga_ramli] <KIO> anakwenda kwa mganga ili kupiga [piga_ramli] ramli na kuongeza imani za ushirikina (he/she goes to the medical person in order to divine and to increase the faith in superstition) • [piga_ramli] <KIO> ikambidi amtume mtaalam wa kupiga [piga_ramli] ramli kuhusu nyota hiyo (he/she was obliged to send to him/her the expert of divining concerning this star) • [piga_ramli] <KIO> kwenda kwa mganga wa kupiga [piga_ramli] ramli, hujui kuwa imani ya (going to the medical person of divining, you do not know that the faith of) • [piga_ramli] <RAI> kuachana na mtindo wa kupiga [piga_ramli] ramli (to leave with the style of divining)

  50. Conclusion • Detailed description of MWEs is necessary in at least two applications • - machine translation • - automatic dictionary compilation
