Progress in Developing Spoken Language Corpus of Indigenous Languages in South Africa

Mtholeni N. Ngcobo and Nozibele Nomdebevana • University of South Africa • ngcobmn@unisa.ac.za • nomden@unisa.ac.za

Outline • Introduction • The importance of the spoken corpus approach • The description of SLCP • Progress and Problems in the development of Spoken Language corpora of indigenous languages • Recommended solutions

INTRODUCTION • Multilingual language policy – 11 official languages – 2 developed – 9 underdeveloped • There is an urgent need to develop spoken language corpora for the indigenous languages, as explained by Allwood and Hendrikse (2003) • A corpus supports empirical research, as opposed to the Chomskyan intuitive approach • Written language corpora already exist for some languages, namely Zulu and Sepedi (Northern Sotho, NS) • A good start has been made on spoken language corpora – the Spoken Language Corpus Project (SLCP) – an open-ended corpus project • Started in 2000

Introd… • Collaboration – UNISA and Gothenburg University • Initially funded by NRF and Sida • UNISA is the host institution • UNISA has now approved funds for the project as it falls under the strategic projects • Goal: 1M words/tokens per language

AIMS • Establish a corpus research centre • Adapt and develop computational linguistic software suitable for the agglutinating languages of South Africa • Develop the indigenous languages of South Africa • Understand the role of language and communication in real-life situations

The Importance of Spoken Corpus • Allwood and Hagman (1994) - Spoken language • Fundamental trait of the human species • Integrated with the human brain and human society • Knowledge of spoken language is limited compared with knowledge of written language

The importance… • The corpus linguistics approach allows the use of statistical performance measures and the observation of language use in real life. • This approach contrasts with earlier Chomskyan linguistics, which focused on ideal written language. • Allwood and Hagman (1994:1): “the progress in audio, video and computer technology enables us to record and analyse spoken language without having to rely on either memory or written language.”

Contrast at a glance • Chomskyan linguistics focused on language competence (comparable to Saussure's langue) while corpus linguistics also considers language performance (comparable to parole) as important • Chomskyan linguistics is unable to cope with many areas of linguistic study, since the emphasis is put on the ideal speaker/hearer to the exclusion of complexity/variation • Chomskyan linguistics views language as an innate mental faculty while corpus linguistics views language as a social phenomenon • Chomskyan linguistics relies on intuitive evidence whereas corpus linguistics relies on empirical evidence • Corpus linguistics looks at differences between languages while Chomskyan linguistics concentrates on universals • The focus of Chomskyan linguistics is on grammar (form) while corpus linguistics also focuses on meaning (semantics).

The Description of SLCP • First task: compilation of a body of texts (a corpus) • Computer: stores large quantities of data and allows statistical performance measures • Research potential: covers linguistic, social, cultural, educational, technological, inter-lingual and inter-communicational aspects

Descript… • SLCP has chosen video recordings. Why? • Allwood and Hendrikse (2003, 191) give the following reason: “…face-to-face spoken language is interactive, multimodal and context-dependent.” So we want to capture all the dynamics of language in communication • Compilation is a process with four major phases: • Collecting video-recorded spoken language activities • Transcribing the video recordings • Quality control (checking and editing) • Annotation of raw data

Process diagram

1. Recording phase • Biber et al. (1998:246) state that “...a corpus is not simply a collection of texts. Rather, a corpus seeks to represent a language or some part of a language. The appropriate design for corpus therefore depends upon what it is meant to represent.” • Parameters: representativity of the corpus, control of variables in language varieties, recording, volume or size of the corpus and length of each sample • In SLCP we use socio-economic activities as a representativity measure, e.g. meetings, sermons, interviews, etc.

2. Transcription phase • Most crucial phase • Allwood and Hendrikse (2003, 195): without transcriptions there will be no computer-readable corpus • Two components of a transcript: Header and Body

The Header
Example:
@ Recorded activity Identity code (ID): U-ZV-01-01-01
@ Name of recorder: Magda Altman, Brenda Gonzales
@ Duration of recorded activity: 3 hours
@ Recorded activity date: 2006-08-04
@ Recorded activity type: Interview with Traditional Healers
@ Recorded activity title: Interview with Traditional Healers
@ Short name: TH Interview 1
@ Recorded activity location: Queen Ntuli’s home, Folweni, Umbumbulu
@ Activity mode: Face to face Interview
@ Participant: B=F1 (Makhosi Queen Ntuli)
@ Participant: M=F8 (Philisiwe Mkhize)
@ Participant: K=F3 (Jabu Eunice Ncikazi)
@ Participant: J=F2 (Thokozile Shezi)
@ Participant: G=F4 (Thembeni Roge Magubane)
@
@ Participant: H=GR (All participants)
@ Tape ID code: U-ZV-01-01
@ Transcription name: U-ZV-01-01-T1
@ Transcriber: Mtholeni
@ Transcription date:
@ Transcription system:
@ Electronic checking
@ Editor:
@ Checker:
@ Checking dates:
@ Section:
@ Section:
@ Time coding
@ Comment(s):
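The header fields above follow an "@ key: value" pattern, so they can be read into a dictionary. The following is a minimal sketch only: the function name `parse_header` and the choice to store repeated keys (such as Participant) as lists are assumptions made for illustration, not part of the SLCP tooling.

```python
def parse_header(lines):
    """Collect '@ key: value' header lines into a dict of lists.

    Hypothetical helper: repeated keys (e.g. 'Participant') are kept
    in order of appearance; empty fields are stored as ''.
    """
    header = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("@"):
            continue  # not a header line
        content = line[1:].strip()
        if not content:
            continue  # a bare '@' carries no field
        key, _sep, value = content.partition(":")
        header.setdefault(key.strip(), []).append(value.strip())
    return header


sample = [
    "@ Recorded activity date: 2006-08-04",
    "@ Participant: B=F1 (Makhosi Queen Ntuli)",
    "@ Participant: M=F8 (Philisiwe Mkhize)",
    "@ Transcriber: Mtholeni",
    "@ Transcription date:",
]
header = parse_header(sample)
print(header["Participant"])
```

Keeping every value in a list avoids silently overwriting the earlier participants when the same key occurs several times.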

The Body • 3 types of lines in the transcription body: • § - section line (the topic of discussion) • $ - contribution (interlocutor’s speech) • @ - information line (comment)
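Because each body line announces its type with its first character, a reader program can sort lines with a simple lookup. This is a sketch under assumptions: the names `LINE_TYPES` and `classify_line` are hypothetical, and the speaker label is assumed to be the text before the first colon of a contribution line.

```python
# Prefix character -> line type, as listed on the slide above.
LINE_TYPES = {"§": "section", "$": "contribution", "@": "information"}


def classify_line(line):
    """Return (line_type, speaker, text) for one transcription body line."""
    line = line.strip()
    if not line or line[0] not in LINE_TYPES:
        return ("unknown", None, line)
    kind = LINE_TYPES[line[0]]
    rest = line[1:].strip()
    if kind == "contribution":
        # Assumed layout: '$<speaker>: <speech>'
        speaker, sep, text = rest.partition(":")
        if sep:
            return (kind, speaker.strip(), text.strip())
    return (kind, None, rest)


for example in ["§ Religion", "$A: uyakhonza konje", "@ <name: person>"]:
    print(classify_line(example))
```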

The Body… • Standardised orthography is used, but without capital letters or punctuation marks • Plain text format is used to make the transcription machine-readable • Own communication management (e.g. hesitations) and interactive communication management (e.g. feedback) are indicated

The Body... • Certain symbols are used to transcribe the following: • Elisions: { } (curly brackets) • Overlaps: [ ] (square brackets) • Comments: < > (angle brackets) • Pauses: / or // or /// (slashes) • Lengthening: : (colon) • Unclear speech: (. . .) (three bracketed dots)
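The symbols above are regular enough to be located with pattern matching. The sketch below counts each symbol type in the text of one contribution (with the speaker label already removed). The pattern set and the name `count_symbols` are illustrative assumptions; in particular, lengthening is approximated as "a word character followed by a colon", which would also match colons inside information lines.

```python
import re

# One assumed pattern per transcription symbol.
PATTERNS = {
    "elision": re.compile(r"\{[^{}]*\}"),            # { }
    "overlap": re.compile(r"\[[^\[\]]*\]"),          # [ ]
    "comment": re.compile(r"<[^<>]*>\d*"),           # < > with optional index
    "pause": re.compile(r"/+"),                      # /, // or ///
    "lengthening": re.compile(r"\w:"),               # e.g. e:
    "unclear": re.compile(r"\(\s*\.\s*\.\s*\.\s*\)"),  # (. . .)
}


def count_symbols(text):
    """Count occurrences of each transcription symbol in a contribution."""
    return {name: len(p.findall(text)) for name, p in PATTERNS.items()}


print(count_symbols("[nanso_ke <1 sisi>1 // e: e:]"))
```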

The Body
Example: elisions, overlaps, comments, pauses, lengthening
§ Religion
$A: uyakhonza konje
$B: ngiyakhonza ngiyamthand{a} <1 unkulunkulu>1 [ ] ngiyamthanda angisoze ngimlahle
@ <name: person>
$A: [nanso_ke <1 sisi>1 // e: e:]
@ <adoptive: English: sister>
Example: unclear speech and code-switching
$Z: sekuphoqelekile ukuba (. . .) <1 neclaim>1 futhi (. . .) <2 that is why>2 <3 ngiclaimile>3
@ <1 code-mix: English>
@ <2 code-switch: English>
@ <3 code-mix: English>

3. The checking phase • The transcription is manually checked by someone other than the transcriber to ensure quality control and reliability • The transcription is also checked electronically for correctness of format before it is inserted into the corpus • We currently use a GTS checking tool to monitor compliance with the transcription standards.
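To give a feel for what an electronic format check can involve, here is a small sketch. It is not the GTS checking tool and does not reproduce its rules; the two checks (a valid line prefix and balanced brackets within a line) and the name `check_line` are assumptions chosen for illustration.

```python
PAIRS = {"{": "}", "[": "]", "(": ")"}
CLOSERS = set(PAIRS.values())
VALID_PREFIXES = ("§", "$", "@")


def check_line(line):
    """Return a list of format problems for one body line (empty if none)."""
    problems = []
    stripped = line.strip()
    if not stripped:
        return problems
    if not stripped.startswith(VALID_PREFIXES):
        problems.append("line does not start with §, $ or @")
    # Bracket check: every opener must be closed in the right order.
    stack = []
    balanced = True
    for ch in stripped:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                balanced = False
                break
    if not balanced or stack:
        problems.append("unbalanced brackets")
    return problems


print(check_line("$A: [nanso_ke // e: e:"))
```

A real checker would need more rules (for example, matching comment indices to their information lines), but the same line-by-line structure would apply.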

4. The tagging phase • Tagging is the process of annotating the corpus with various tags, enriching a raw corpus with grammatical information. • Example: abantwana - a«prepref»ba«pref»ntwa«nstem»ana«dimsuf» • Corpus-driven approach (information retrieved from raw data) vs. corpus-based approach (information retrieved from an annotated corpus)
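A tagged word in the format shown above can be split back into morpheme and tag pairs, which is the kind of information a corpus-based query would draw on. This is a minimal sketch: the function name `split_tagged` is hypothetical, and it assumes every morpheme is immediately followed by exactly one tag in « » brackets.

```python
import re

# A morpheme is any run of characters up to its «tag».
MORPH_TAG = re.compile(r"([^«»]+)«([^«»]+)»")


def split_tagged(word):
    """Return a list of (morpheme, tag) pairs from a tagged word."""
    return MORPH_TAG.findall(word)


pairs = split_tagged("a«prepref»ba«pref»ntwa«nstem»ana«dimsuf»")
for morpheme, tag in pairs:
    print(morpheme, tag)
```

Note that joining the morphemes does not necessarily give back the written word (here the stem and suffix vowels merge in abantwana), so the surface form should be stored separately.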

4. Tagging … • Allwood and Hendrikse (2003, 199) argue that while the corpus-driven approach works well with isolating languages, in agglutinating languages the corpus-based approach may be used. They also note that Leech (1991) has warned against the danger of bias underlying any form of annotation. However, they argue that the tagging of corpora is now fairly general practice (Allwood and Hendrikse 2003, 199). The tag set for the agglutinating languages has been discussed in detail by Allwood et al. (2003).

Progress and success in SLCP • Of the nine languages, only Xhosa has shown substantial progress • Why? It was used for piloting the project and has had a consistent transcriber • Zulu follows with almost 20 000 transcribed tokens so far


Progress…
Recordings          Audio     Video
N. of recordings       33       112
Hours                  38       128
Un-transcribed          2        17
Transcribed            31        68
Checked                31        54
Tokens             45 723   201 292
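The figures above can be combined into a few simple progress indicators. The sketch below is illustrative only: it assumes the table describes a single language, so that the token total can be compared with the 1M-token goal stated earlier, and the variable names are hypothetical.

```python
# Figures copied from the progress table above.
progress = {
    "audio": {"recordings": 33, "hours": 38, "untranscribed": 2,
              "transcribed": 31, "checked": 31, "tokens": 45_723},
    "video": {"recordings": 112, "hours": 128, "untranscribed": 17,
              "transcribed": 68, "checked": 54, "tokens": 201_292},
}

# Goal per language as stated in the introduction (assumed to apply here).
TARGET_TOKENS = 1_000_000

total_tokens = sum(row["tokens"] for row in progress.values())

# Share of transcribed recordings that have also been checked.
checked_share = {
    mode: row["checked"] / row["transcribed"] for mode, row in progress.items()
}

# Fraction of the token goal reached so far, under the assumption above.
target_share = total_tokens / TARGET_TOKENS

print(total_tokens)
print({mode: round(share, 3) for mode, share in checked_share.items()})
print(round(target_share, 3))
```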

Progress… • We also have some un-transcribed recordings for Tsonga, especially of children's speech • People behind the progress - the corpus group - share information on progress, motivate one another and present on key research aspects of the project

Problems • Little or nothing is currently happening in the development of corpora for the remaining official languages • Lack of appropriate monitoring - some of the video recordings have been damaged and some of the digitized recordings have been lost • Poor quality of recordings • Lost data • Uncoordinated individual activities • Insufficient tools and insufficient financial and human resources • …etc.

Recommendations and solutions: hope for the future • Ultimately: A fully developed spoken corpus resource centre – • the establishment of a resource centre will not only be a sign of growth and prosperity, but also a sign of an investment in the future of languages and their speakers.

Recommend… • To this end, the following recommendations need to be considered: • More recordings and more trained transcribers are required in order to expedite the process. • Recorders and transcribers for all the languages should be remunerated to encourage them to do more work. • A network with other institutions, such as universities, should be created. • Short- and medium-term corpus development targets should be set. • A server dedicated to the corpus must be established as a matter of urgency. • A properly structured corpus archive must be set up and maintained by a webmaster. • All the various corpus-related projects should be re-organised under one corpus management structure. • Corpus maintenance, tagging and mining tools designed for the agglutinating languages and for other specific searches (e.g. communicative gestures) must be developed. • Preliminary corpus mining should begin for the benefit of the tool development enterprise and to encourage the use of corpora for language research and development. • This will lead to the establishment of a dedicated corpus publication series for the indigenous languages of South Africa.

References • Allwood J, Grönqvist L, and Hendrikse AP. 2003. Developing a tag set and tagger for the African Languages of South Africa with special reference to Xhosa. Southern African Linguistics and Applied Language Studies 21 (4) 221-235. • Allwood J and Hagman J. 1994. Some simple automatic measures of spoken interaction. Proc. of the 14th Scand. Conf. of Linguistics & 8th Conf. of Nordic and Gen. Linguistics, Vol. 72, Univ. of Göteborg. • Allwood J and Hendrikse AP. 2003. Spoken language corpora for the nine official languages of South Africa. Southern African Linguistics and Applied Language Studies 21 (4) 187-199. • Biber D, Conrad S, and Reppen R. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.