
6 Language corpora



  1. 6 Language corpora Liang Maocheng

  2. Chapter outline • 7.1 Introduction • 7.2 Empiricism, corpus linguistics, and electronic corpora • 7.3 Applications of corpora in applied linguistics • 7.3.1 An overview of the use of corpora in applied linguistics • 7.3.2 The Lexical Syllabus: a corpus-based approach to syllabus design • 7.3.3 Data-Driven Learning (DDL) • 7.3.4 Corpora in language testing • 7.3.5 Corpus-based interlanguage analysis • 7.4 Future trends

  3. 7.1 Introduction • There have been debates about whether corpus linguistics should be defined as a new discipline, a theory, or a methodology. • Some linguists working with corpora hold that corpus linguistics goes well beyond a purely methodological role by re-uniting the activities of data gathering and theorizing, thus leading to a qualitative change in our understanding of language (Halliday, 1993:24).

  4. More researchers (e.g., Botley & McEnery, 2000; Leech, 1992; McEnery & Wilson, 2001:2; Meyer, 2002:28) seem to share the view that corpus linguistics is a methodology contributing to many hyphenated disciplines in linguistics. This methodological nature has brought about the convergence of corpus linguistics with many disciplines, of which applied linguistics probably benefits the most from what corpus linguistics can offer.

  5. 7.2 Empiricism, corpus linguistics, and electronic corpora • The 1950s witnessed the peak of empiricism in linguistics, with Behaviorists dominating American linguistics and Firth, a leading figure of British linguistics at the time, forcefully advocating his data-based approach with his oft-cited belief that “You shall know a word by the company it keeps” (Firth, 1957). • However, with the rapid rise of generative linguistics in the late 1950s, empiricism gave way to the Chomskyan approach, which is based on the intuition of ‘ideal speakers’ rather than on collected performance data. It was not until the 1990s that interest in the empiricist approach resurged (Church & Mercer, 1993).

  6. Corpus linguistics is empirical. Its object is real language data (Biber et al., 1998; Teubert, 2005). In fact, there is nothing new about working with real language data. As early as the 1960s, when generative linguistics was at its peak, Francis and Kucera created their one-million-word corpus of written American English at Brown University (Francis and Kucera, 1982). • Then, in the 1970s, Johansson and Leech, at the University of Oslo and the University of Lancaster, compiled the Lancaster-Oslo/Bergen Corpus (LOB), the British counterpart of the American Brown Corpus.

  7. Since then, there has been a boom in corpus compilation. Spoken corpora (such as the London-Lund Corpus, see Svartvik, 1990) and written corpora, diachronic corpora (such as the Helsinki Corpus) and synchronic corpora, large general corpora (such as the BNC and the Bank of English) and specialized corpora, native-speaker corpora and learner corpora, and many other types of corpora were created one after another. • These English corpora are constantly providing data to meet the various needs of applied linguistics.

  8. After a few decades of worldwide corpus-related work, several tendencies now seem apparent in corpus building in the new empiricist age. • First, modern corpora are becoming ever larger and more balanced, and are therefore often claimed to be more representative of the language concerned.

  9. Second, there is often the need for rationalism and empiricism to work together. • Third, many types of specialized language corpora are being built to serve specific purposes. • Fourth, many researchers in applied linguistics have realized the usefulness of corpora for the analysis of learner language.

  10. Finally, as corpus annotation is believed to bring “added value” (Leech, 2005) to a corpus, there is a tendency that corpus annotations are becoming more refined, providing detailed information about lexical, phraseological, phonetic, prosodic, syntactic, semantic, and discourse aspects of language.

  11. 7.3 Applications of corpora in applied linguistics • 7.3.1 An overview of the use of corpora in applied linguistics • 7.3.2 The Lexical Syllabus: a corpus-based approach to syllabus design • 7.3.3 Data-Driven Learning (DDL) • 7.3.4 Corpora in language testing • 7.3.5 Corpus-based interlanguage analysis

  12. This section first presents an overview of the extensive use of corpora in applied linguistics. • Following the overview, our discussion will focus on a few major areas of applied linguistics where corpora are playing increasingly important roles, namely, syllabus design, data-driven learning, language testing, and interlanguage analysis.

  13. 7.3.1 An overview of the use of corpora in applied linguistics • Leech (1997) summarizes the applications of corpora in language teaching with three concentric circles.

  14. [Figure 1: The use of corpora in language teaching (from Leech, 1997) --- three concentric circles: direct use of corpora in teaching (innermost); use of corpora indirectly applied to teaching; further teaching-oriented corpus development (outermost)]

  15. Drawing on Fligelstone (1993), Leech (1997) claims that ‘the direct use of corpora in teaching’ (the innermost circle) involves teaching about [corpora], teaching to exploit [corpora], and exploiting [corpora] to teach. According to Leech (1997), teaching about corpora refers to the courses in corpus linguistics itself; teaching to exploit corpora refers to the courses which teach students to exploit corpora with concordance programs, and learn to use the target language from real-language data (hence data-driven learning, to be discussed in more detail later in this section). Finally, exploiting corpora to teach means making selective use of corpora in the teaching of language or linguistic courses which would traditionally be taught with non-corpus methods.

  16. The more peripheral circle, ‘the use of corpora indirectly applied to teaching’, according to Leech (1997), involves the use of corpora for reference publishing, materials development, and language testing. In reference publishing, Collins (now HarperCollins), Longman, Cambridge University Press, Oxford University Press and many other publishers have been actively involved in the publication of corpus-based dictionaries, electronic corpora, and other language reference resources, especially those in electronic form.

  17. As Hunston states, “corpora have so revolutionised the writing of dictionaries and grammar books for language learners that it is by now virtually unheard-of for a large publishing company to produce a learner’s dictionary or grammar reference book that does not claim to be based on a corpus” (Hunston, 2002:96) (for more detailed accounts of the use of corpora in dictionary writing, see Sinclair, 1987; Summers, 1996). In materials development, increasing attention is paid to the use of corpora in the compilation of syllabuses (to be discussed later in this section) and other teaching materials. Also in the second of the three concentric circles is language testing (see Section 7.3.4 for more detail), which “benefits from the conjunction of computers and corpora in offering an automated, learner-centred, open-ended and tailored confrontation with the wealth and variety of real-language data” (Leech, 1997).

  18. Finally, the outermost circle, ‘further teaching-oriented corpus development’, involves the creation of specialized corpora for specific pedagogical purposes. To illustrate the need to build such specialized corpora, Leech (1997) mentions LSP (Language for Specific Purposes) corpora, L1 and L2 developmental corpora (also an important data source for Second Language Acquisition research), and bilingual/ multilingual corpora. Leech (1997) believes these are important resources that language teaching can benefit from.

  19. Of course, Leech’s (1997) discussion focuses on the applications of corpora in language teaching; Second Language Acquisition research, one of our major concerns in this book, receives less attention there. In fact, corpus-based approaches to SLA studies, particularly to interlanguage analysis, have become one of the most prevalent methodologies for the analysis of learner language. This issue will be discussed in more detail later in the chapter.

  20. 7.3.2 The Lexical Syllabus: a corpus-based approach to syllabus design • The notion of a ‘lexical syllabus’ was originally proposed by Sinclair and Renouf (1988) and further developed by Willis (1990), who points out a contradiction between syllabus and methodology in language teaching:

  21. The process of syllabus design involves itemising language to identify what is to be learned. Communicative methodology involves exposure to natural language use to enable learners to apply their innate faculties to recreate language systems. There is an obvious contradiction between the two. An approach which itemises language seems to imply that items can be learned discretely, and that the language can be built up from an accretion of these items. Communicative methodology is holistic in that it relies on the ability of learners to abstract from the language to which they are exposed, in order to recreate a picture of the target language. (Willis, 1990: viii)

  22. The lexical syllabus attempts to reconcile the contradiction. Rather than relying on a clear distinction between grammar and lexis, the lexical syllabus blurs the distinction and builds itself on the notion of phraseology.

  23. Several decades ago, some researchers (e.g., Nattinger and DeCarrico, 1989, 1992; Pawley & Syder, 1983) came to realize that phrases (also called ‘lexical phrases’, ‘lexical bundles’, ‘lexical chunks’, ‘formulae’, ‘formulaic language’, ‘prefabricated routines’, etc.) are frequently used and are therefore important for the teaching and learning of English. • Later researchers such as Cowie (1998), Skehan (1998) and Wray (2002) are also convinced that a better command of such phrases can improve the fluency, accuracy and complexity of second language production.

  24. Unfortunately, most syllabuses for ELT are based on traditional descriptive systems of English, which consist of grammar and lexis (Hunston, 2002). Such descriptive systems, according to Sinclair (1991), fail to give a satisfactory account of how language is used. • Besides, combining the building blocks (lexis) with grammar does not result in a good account of the phrases in a language. • Against this background, Sinclair (1991) proposes a new descriptive system of language, completely denying the distinction between grammar and lexis and putting phraseology at the heart of language description.

  25. The lexical syllabus is not a syllabus consisting of vocabulary items. It comprises phraseology, which “encompasses all aspects of preferred sequencing as well as the occurrence of so-called ‘fixed’ phrases” (Hunston, 2002:138). • Such a syllabus, according to Hunston (2002), differs from a conventional syllabus only in that the central concept of organization is lexis. To put it simply, as a relatively small number of words in English account for a very high proportion of English text, it makes sense to teach learners the most frequent words in the target language. • However, as learning words out of context does not seem to be a good idea, a syllabus should not merely specify the vocabulary items to be learned. Rather than the lexis of the target language, it is phraseology (words with their most frequently used patterns) that a lexical syllabus specifies in detail.

  26. According to Sinclair & Renouf (1988), ‘the main focus of study should be on • (a) the commonest word forms in the language; • (b) the central patterns of usage; • (c) the combinations they usually form’ (1988:148). • This is exactly what is covered in Willis’ (1990) syllabus.

  27. Willis (1990) argues that the lexical syllabus effectively addresses “the main focus of study” mentioned in Sinclair & Renouf (1988): (a) Willis’ (1990) syllabus consists of three different levels. • Level I covers the most frequent 700 words in English, which, according to Willis (1990), make up around 70% of all English text. • Level II covers the most frequent 1,500 words, and • Level III covers the most frequent 2,500 words in English. • (b) The lexical syllabus illustrates the central patterns of usage of the most frequent words in English. Such patterns of usage were later developed into a system of grammar called “pattern grammar” (Hunston & Francis, 2000). • (c) Typical combinations of word forms (collocations) from authentic language are highlighted to provide information about the phraseology of the words concerned.
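The frequency claims behind these levels are easy to check against any corpus. The Python sketch below shows the computation involved: rank word forms by corpus frequency and measure what share of all running tokens the top-n forms cover. The function name and the toy data are my own illustration, not part of Willis’ work.

```python
# Sketch: lexical coverage of the n most frequent word forms in a corpus.
from collections import Counter

def coverage(tokens, n):
    """Proportion of running text accounted for by the n commonest word forms."""
    counts = Counter(t.lower() for t in tokens)
    covered = sum(c for _, c in counts.most_common(n))
    return covered / len(tokens)
```

Run over a real corpus, `coverage(tokens, 700)` would test the “around 70% of all English text” claim for Level I directly.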

  28. It can be seen that the lexical syllabus is not a syllabus which lists all the vocabulary items learners are required to learn. • To Willis (1990), “English is a lexical language”, meaning that “many of the concepts we traditionally think of as belonging to ‘grammar’ can be better handled as aspects of vocabulary” (Hunston, 2002:190). Conditionals, for example, could be handled by looking at the hypothetical meaning of the word ‘would’. • The most productive way to interpret grammar, therefore, is as lexical patterning, rather than as rules about the sequence of words that often do not work.

  29. However, the lexical syllabus, as proposed in Willis (1990), is not without challenges. • First, frequency is a useful factor to take into consideration when a syllabus is designed. However, language learning is not that simple. Native language influence, cultural difference, learnability, usefulness, and many other factors can all bring about some difficulty for language learning. • As Cook (1998:62) notes, “an item may be frequent but limited in range, or infrequent but useful in a wide range of contexts”. Proponents of the Lexical Syllabus seem to have ignored this fact. They believe their syllabus will work for learners of all linguistic and ethnic backgrounds. To many researchers and practitioners of applied linguistics, such a belief is too simplistic to account for complicated processes such as language learning.

  30. Second, it is not an easy task to select a manageable yet meaningful number of items from the entire vocabulary of a language for inclusion in a lexical syllabus. • While it may be true that the 2,500 words in Willis’ syllabus can account for 80% of all English text, knowing 80% of the words in a text does not guarantee good comprehension. In fact, most of the 80% are functional words or delexicalized words, which do not carry much content. In other words, the role of the commonest words in language comprehension and production may not be as significant as their high frequency counts suggest. Also, while it may be relatively easy to include a content word as an entry in a lexical syllabus, functional words can be a big problem: such an entry may run to tens of pages in the lexical syllabus. If the most frequent words are to be accounted for in detail, the syllabus is very likely to become a comprehensive guide to the usage of functional words.

  31. Finally, the size of the syllabus may also be a problem. Willis (1990), when creating his “elementary syllabus”, had to work with huge amounts of data. He complained that “The data sheets for Level I alone ran to hundreds of pages which we had to distil into fifteen units” (Willis, 1990:130). • We just cannot help wondering how many thousands of data sheets would have to be created if a syllabus had to be written for advanced learners.

  32. 7.3.3 Data-Driven Learning (DDL) • The use of corpora in the language teaching classroom owes much to Tim Johns (Leech, 1997), who began to make use of concordancers in the language classroom as early as the 1980s. He then wrote a concordance program himself, and later developed the concept of Data-Driven Learning (DDL) (see Johns, 1991).

  33. DDL is an approach to language learning by which the learner gains insights into the language that she is learning by using concordance programs to locate authentic examples of language in use. The use of authentic linguistic examples is claimed to better help those learning English as a second or foreign language than invented or artificial examples (Johns, 1994). • These authentic examples are believed to be far better than the examples the teachers make up themselves, which unavoidably lack authenticity. In DDL, the learning process is no longer based solely on what the teacher expects the learners to learn, but on the learners’ own discovery of rules, principles and patterns of usage in the foreign language.

  34. Drawing on Georges Clemenceau’s (1841-1929) claim that ‘War is much too serious a thing to be left to the military’ (quoted in Leech, 1997), Johns believes that ‘research is too serious to be left to the researchers’ (Johns, 1991:2), and that research is a natural extension of learning. • The theory behind DDL is that students can act as ‘language detectives’ (Johns, 1997:101), exploring a wealth of authentic language data and discovering facts about the language they are learning. • In other words, learning is motivated by active involvement and driven by authentic language data, and learners become active explorers rather than passive recipients. To a great extent, this is very much like researchers working in a scientific laboratory, and the advantage lies in the direct confrontation with data (Leech, 1997).

  35. Murison-Bowie, the author of the MicroConcord Manual, gives some very persuasive reasons for using a concordancer: • … any search using a concordancer is given a clearer focus if one starts out with a problem in mind, and some, however provisional, answer to it. You may decide that your answer was basically right, and that none of the exceptions is interesting enough to warrant a re-formulation of your answer. On the other hand, you may decide to tag on a bit to the answer, or abandon the answer completely and to take a closer look. Whichever you decide, it will frequently be the case that you will want to formulate another question, which will start you off down a winding road to who knows where. (Murison-Bowie, 1993:46, cited in Rezeau, 2001:153)

  36. It can be seen that in DDL, language learning is a hypothesis-testing process: whenever the learner has a question, she goes to the concordancer for help. If her hypothesis coincides with the authentic language data shown in the concordance lines, it is confirmed and her knowledge of the language is reinforced; conversely, if the concordance data contradict her hypothesis, she modifies it and comes closer to how the language is actually used.
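The tool at the centre of this cycle is the concordancer. A minimal KWIC (key word in context) concordancer can be sketched in a few lines of Python; the function name and the plain-string corpus format are illustrative assumptions, not any particular published program such as Johns’ own.

```python
def concordance(sentences, keyword, width=30):
    """Return KWIC lines: left context, the keyword, right context, aligned."""
    lines = []
    for sent in sentences:
        tokens = sent.split()
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword.lower():
                left = " ".join(tokens[:i])[-width:]          # trim left context
                right = " ".join(tokens[i + 1:])[:width]      # trim right context
                lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines
```

A learner wondering whether “depend” is followed by “on” or “of” would run `concordance(corpus, "depend")` and inspect the right-hand contexts for the pattern.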

  37. Higgins (1991) mentions that classroom concordancing tends to have two characteristic objectives: • “using searches for function words as a way of helping learners discover grammatical rules, and • searching for pairs of near synonyms in order to give learners some of the evidence needed for distinguishing their use” (p.92).

  38. What the foregoing example shows is that in data-driven learning, the learner often has a question in mind. She then goes to explore the data for an answer. The whole process, as mentioned before, involves testing a hypothesis with hands-on activities. In so doing, the learner is more likely to be motivated.

  39. It is worth mentioning that many of the features of DDL fit exactly into the constructivist approach to language teaching, in which: • Learners are more likely to be motivated and actively involved; • The process is learner-centered; • Learning proceeds through experimentation; • Hands-on activities and experiential learning are emphasized; • Teachers work as facilitators.

  40. As pointed out by Gavioli (2005:29), DDL raises several pedagogic questions, which we have to answer when we encourage our students to engage in data-driven learning: • 1. If learners are to behave as data analysts, what should be the role of the teacher? • 2. The work of language learners is similar to that of language researchers insofar as “effective language learning is itself a form of linguistic research” (Johns, 1991:2). So, should we ask learners to perform linguistic research exactly like researchers?

  41. 3. Provided that learners adopt the appropriate instruments and methodology to actually be able to perform language research, are the results worth the effort? • To this I would like to add Barnett’s (1993) note: 4. “The use of computer applications in the classroom can easily fall into the trap of leaving learners too much alone, overwhelmed by information and resources.” • What can we do to improve this situation in data-driven learning?

  42. 7.3.4 Corpora in language testing • It is only recently that language corpora have begun to be used for language testing purposes (Hunston, 2002), though language testing could potentially benefit from the conjunction of computers and corpora in offering an “automated, learner-centred, open-ended and tailored confrontation with the wealth and variety of real-language data” (Leech, 1997).

  43. In fact, the use of corpora in language testing is such a new field of study that Leech (1997) only mentions the advantages of corpus-based language testing and predicts that “automatic scoring in terms of ‘correct’ and ‘incorrect’ responses is feasible”. • Hunston (2002:205) also states that the work reported in her book is “mostly in its early stages”. • Alderson (1996), in a paper entitled “Do corpora have a role in language assessment?”, could only “concentrate on exploring future possibilities for the application of computer-based language corpora in language assessment” (Alderson, 1996:248).

  44. To the best of my knowledge, the use of corpora in language testing roughly falls into two categories. One is the use of corpora and corpus-related techniques in the automated evaluation of responses to test questions (particularly subjective questions like essay questions or short-response questions), and • the other is the use of corpora to make up test questions (mostly objective questions). • The reason behind this is simple: making up subjective questions does not take a lot of effort but evaluating them does; similarly, evaluating objective questions does not take a lot of effort but making them up may take a lot of time.

  45. The earliest attempt to automate essay scoring was made several decades ago, when Ellis Page and his collaborators devised a system called PEG (Project Essay Grade) to assign scores to essays written by students (Page, 1968). • From a collection of human-rated essays, Page extracted 28 text features (such as the number of word types, the number of word tokens, average word length, etc.) that correlate with human-assigned essay scores. • He then conducted a multiple regression, with the human-assigned scores as the dependent variable and the extracted text features as independent variables. • The regression yielded an estimated equation, which was then used to predict scores for new essays.
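Page’s procedure amounts to ordinary least-squares regression over surface text features. The sketch below is a simplified illustration of that approach with three invented features, not a reconstruction of Page’s actual 28-feature system.

```python
# Sketch of PEG-style essay scoring: regress human scores on surface
# text features, then apply the estimated equation to new essays.
import numpy as np

def extract_features(essay):
    """Three PEG-like surface features: tokens, types, average word length."""
    tokens = essay.lower().split()
    return [len(tokens), len(set(tokens)), sum(map(len, tokens)) / len(tokens)]

def fit(essays, human_scores):
    """Ordinary least squares: human scores ~ intercept + features."""
    X = np.array([[1.0, *extract_features(e)] for e in essays])
    y = np.array(human_scores, dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

def predict(coefs, essay):
    """Apply the estimated equation to an unseen essay."""
    return float(np.dot([1.0, *extract_features(essay)], coefs))
```

With a sizeable set of human-rated essays as training data, `fit` plays the role of Page’s regression and `predict` that of his scoring equation.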

  46. Following Page’s methodology, ETS produced another automated essay scoring system called the E-rater, which has been used to score millions of essays written in the GMAT, GRE and TOEFL tests (Burstein, 2003). • Liang (2005) also made an exploratory study, in which he extracted 13 essay features to predict scores for new essays. • All the above-mentioned studies involved the use of corpora and corpus-related techniques. They reported good correlations between human-assigned essay scores and computer-assigned essay scores. • For more information about automated essay scoring, see Shermis & Burstein (2003).
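The “good correlations” these studies report are typically Pearson correlations between human-assigned and computer-assigned scores. For completeness, the arithmetic of that check can be sketched directly (a plain implementation, not any system’s actual evaluation code):

```python
def pearson_r(xs, ys):
    """Pearson correlation between paired human and machine scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```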

  47. The direct use of corpora in the automatic generation of exercises (or test questions) is also a relatively new field of study. To the best of my knowledge, two studies have been reported in the literature. • Wojnowska (2002) reported a corpus-based test-maker called TestBuilder. The Windows-based system extracts information from a tagged corpus and can help teachers prepare two types of single-sentence test items --- gap-filling (cloze) and hangman. While the system may be useful for English teachers, the fact that the questions it generates can only be saved in plain-text format greatly reduces its practicality.

  48. In other words, the test generated allows no interaction with the test-taker, and the teacher or the test-taker herself has to judge the test performance and calculate the scores manually. Besides, gapfilling and hangman are very similar types of exercises, both involving the test-taker filling gaps with letters or combinations of letters (words). • Obviously, this system has not made full use of the capacity of the computer and the corpora. Wilson (1997) reported her study on the automatic generation of CALL exercises from corpora. Unfortunately, the study has not resulted in a computer program, and the exercises generated are not interactive either.
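The missing interactivity is straightforward to supply once item generation and answer checking live in the same program. The sketch below gaps a target word out of a corpus sentence and judges the test-taker’s response automatically; the function names and design are mine, not TestBuilder’s or Wilson’s.

```python
# Sketch: generate a cloze item from a corpus sentence and score a response.
import random

def make_cloze(sentence, target):
    """Blank out one occurrence of `target`; return the item and its key."""
    tokens = sentence.split()
    positions = [i for i, t in enumerate(tokens) if t.lower() == target.lower()]
    if not positions:
        raise ValueError(f"{target!r} does not occur in the sentence")
    i = random.choice(positions)          # pick one occurrence to gap
    answer = tokens[i]
    tokens[i] = "_____"
    return " ".join(tokens), answer

def score(response, answer):
    """Automatic judging of the test-taker's response."""
    return response.strip().lower() == answer.lower()
```

Looping `make_cloze` over sentences drawn from a tagged corpus (filtered by part of speech, say) and feeding responses to `score` gives the interactive, self-scoring behaviour the plain-text output of TestBuilder could not.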

  49. To take better advantage of corpora for language testing purposes, we started a project of our own. • The project has produced a Windows-based program (called CoolTomatoes) which is capable of automatically generating interactive test questions or exercises from any user-defined corpus.
