
Corpus 01


amena

Presentation Transcript


  1. Corpus 01 Introduction Historical Review

  2. Corpus Linguistics • Linguists need evidence for theories. Evidence can come from intuition or introspection, from experimentation or elicitation, or from observation of spoken or written texts • Focus on performance rather than on competence, on moving from observation to theory rather than from theory to observation • Scope: text as the domain of study and as the source of evidence for linguistic description and argumentation • Methodologies: quantification of linguistic description

  3. Difference between corpus linguistics and other linguistics • Richness of the evidence • Confidence in generalizability • Validity and reliability

  4. Corpus Linguistic Activities 1. design and compilation of corpora: collection of texts, preparation and storage for later analysis 2. development of tools for the analysis of corpora: computational linguistics 3. use of computerized corpora to describe the lexicon and grammar of languages: the probabilistic aspect of corpus-based description, i.e. the study of how often a particular form is used 4. applications in language learning and teaching, natural language processing
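
A minimal sketch of the "how often a particular form is used" idea, in Python. The tokenizer, the function names and the toy sentence are assumptions made for illustration only; real corpus tools apply far more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(text, per=1_000_000):
    """Return each word form's count normalised to occurrences per `per` tokens."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: n * per / total for word, n in counts.items()}

# Toy text, invented for this example (not taken from any corpus).
sample = "The cat sat on the mat. The dog sat too."
counts = Counter(tokenize(sample))
print(counts.most_common(2))   # the two most frequent word forms with raw counts
```

Normalising to a rate per million tokens is what makes counts comparable across corpora of different sizes.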

  5. Function of corpus linguistics • Not that it is a faster way of describing language, but that it may reveal facts we might never have thought of seeking, e.g. Altenberg’s study of amplifier collocation in English (1991a): frequent maximizers such as quite tend to collocate with non-scalar words (quite obviously), while absolutely has a greater tendency than other maximizers to collocate with negatives (absolutely not) • Statistical distribution of linguistic items
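
A rough sketch of how collocates of an amplifier might be counted, in Python. This is not Altenberg's method: it only counts the word immediately to the right of a node word, ignores sentence boundaries, and runs on an invented toy text. The function name `right_collocates` is an assumption for illustration.

```python
import re
from collections import Counter

def right_collocates(text, node):
    """Count the word forms that immediately follow `node` in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Pair every token with its successor and keep the successors of `node`.
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == node)

# Toy text, invented for this example (not corpus data).
toy = ("That is quite obviously wrong. Absolutely not! "
       "It was quite obviously late, and absolutely not my fault.")
print(right_collocates(toy, "quite"))
print(right_collocates(toy, "absolutely"))
```

On a real corpus the raw counts would normally be compared against each word's overall frequency before claiming a collocational tendency.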

  6. Topics of Corpus linguistics • Annotating corpora • Tagging of parts of speech and of the senses of polysemous word forms • Improved automatic parsing • Identification of collocations • Phraseological units and discourse structure • Text categorization • Research methodology • Application in lexicography, syntactic description, translation, speech and handwriting recognition, language teaching

  7. Pre-electronic corpora • Biblical and literary studies • Lexicography • Dialect • Language education • Grammatical

  8. Biblical and literary studies • Alexander Cruden (1736): Concordance of the Authorized Version of the Bible (Fig. 2.1) • Similar works on Shakespeare

  9. Lexicography • Samuel Johnson (1755): Dictionary of the English Language. Corpus of sentences from writers of the first reputation. • James Murray: OED (1928), corpus of the canon of literary written English. • Noah Webster (1828): An American Dictionary of the English Language

  10. Dialect • Wright (1898-1905): The English Dialect Dictionary • Ellis (1889): The Existing Phonology of English Dialects

  11. Language Education • Thorndike (1921): word frequency list based on a corpus of 4.5 million words from 41 different sources

  12. Grammatical • Jespersen (1909-49) • Kruisinga (1931-32) • Poutsma (1926-29) • Fries (1940): American English Grammar. Corpus of letters to the US Government by persons of different educational and social backgrounds. Describes social class differences in usage.

  13. Grammatical • Fries: The Structure of English (1952): 250,000-word corpus of recorded telephone conversations. • Randolph Quirk: Survey of English Usage (1968). • 200 samples × 5,000 words = 1,000,000-word corpus • Representative of spoken and written English, used to describe the grammar and usage of educated adult native speakers of British English.

  14. Types of electronic corpora • General corpora (core corpora) • A text base for linguistic analysis to seek answers to particular questions about vocabulary, grammar, discourse structure. • Balanced, containing texts from different genres, and domains in speaking and writing, private and public

  15. Types of electronic corpora • Specialized corpora • designed with particular projects in mind, e.g. a corpus for compiling a modern dictionary • Carterette & Jones (1974): child language development • Zhu (1989): English used in petroleum geology exploration, drilling and refining. • People disagreeing with each other in radio interviews • Teachers’ directives in high school classrooms

  16. Types of electronic corpora • Leech (1992): training corpora and test corpora for language models and language processing • Dialect corpora • Regional corpora • Non-standard corpora • Learners’ corpora

  17. Types of electronic corpora • full text corpus: complete texts • stylistic or discourse studies: 200 word samples may not be able to capture the internal structural characteristics of full texts. • raw corpus: tagging, parsing, concordance

  18. Major electronic corpora: first generation • Brown Corpus (1961): Brown University Standard Corpus of Present-Day American English • Significance: 1. the first computer corpus 2. compiled in the face of massive indifference • The prevailing view held that linguistic research is not to record but to describe, whereas corpora are statistically based, with a probabilistic model of competence derived from linguistic performance. • Structure (Table 2.2, p. 24)

  19. Major electronic corpora: first generation • Features: widely selected categories of written English: both formal and informal written English are taken into account • Selected by a method that makes it reasonably representative of current printed American English • Established coding conventions: abbreviations, formulae, quotations, punctuation. • Number of characters per line: 70 • Grammatically tagged: each word assigned one of over 80 tags.

  20. Major electronic corpora: first generation • Lancaster-Oslo/Bergen Corpus (1970): LOB Corpus • A British counterpart to the Brown Corpus • 500 texts of about 2,000 words each, published in 1961 • Same categories as the Brown Corpus • Differences from the Brown Corpus • coding: sentence-initial markers

  21. Major electronic corpora: first generation • abbreviations • partly analyzed version • versions for more platforms • more grammatical tags • Key Word in Context (KWIC) concordance
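
For readers unfamiliar with the KWIC format, the following Python sketch shows the general idea: each occurrence of a keyword is printed with a fixed window of context on either side. It is an illustration only, not the software used for the LOB concordance; the function name `kwic`, the window size and the toy sentence are assumptions.

```python
def kwic(tokens, keyword, width=3):
    """Return one formatted line per occurrence of `keyword`,
    with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines

# Toy sentence, invented for this example.
text = "the cat sat on the mat and the dog sat on the rug"
for line in kwic(text.split(), "sat"):
    print(line)
```

Aligning the keyword in a central column is what lets recurring patterns to its left and right be spotted by eye.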

  22. Major electronic corpora: first generation • Other first generation corpora • Indian English: the Kolhapur Corpus of Indian English (1988) • Collected materials of 1978 • New Zealand and Australia: the Wellington Corpus of Written New Zealand English and the Australian Corpus of English (1986) • Collected texts in 1961

  23. Problems • 1. A size of one million words is prohibitive and yet too small. • 2. It is difficult to find interesting differences between regional varieties: the differences are sometimes not in a structure itself but in how frequently the structure is used. • 3. Additional words: each sample ends at the first sentence ending after 2,000 words. Thus the Brown Corpus actually contains 1,014,312 words and LOB 1,006,825. By the word count of the LOB concordance, the size is 1,123,380.
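
The overshoot in point 3 follows directly from the sampling rule. The Python sketch below is an assumed reading of that rule (cut at the first sentence end at or after the target length), applied to whitespace-separated tokens with punctuation still attached; the function name `take_sample` and the toy text are hypothetical.

```python
def take_sample(tokens, target=2000):
    """Return the tokens up to the first sentence end at or after `target` words."""
    for i, tok in enumerate(tokens, start=1):
        if i >= target and tok.endswith((".", "!", "?")):
            return tokens[:i]
    return tokens  # the text ran out before a sentence end was reached

# Toy text, invented for this example; a target of 5 stands in for 2,000.
tokens = "One two three. Four five six seven. Eight nine.".split()
sample = take_sample(tokens, target=5)
print(len(sample), sample[-1])

# Average overshoot per sample implied by the Brown Corpus figures above.
print((1_014_312 - 500 * 2000) / 500)
```

With 500 samples, the Brown Corpus total of 1,014,312 words implies that each sample runs on average about 28.6 words past the 2,000-word mark.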

  24. London-Lund Corpus (LLC) • Spoken part of the SEU, which is half written and half spoken • Original SEU spoken material: 87 texts making up 435,000 words, plus 13 more texts. The total is 100 texts of 5,000 words each, i.e. 500,000 words. • Features: less detailed prosodic analysis

  25. London-Lund Corpus (LLC) • Tone units • Onset • Location of nuclei • Direction of nuclear tones

  26. London-Lund Corpus (LLC) • Boosters • Degree of pause • Degree of stress • Speaker identity • Simultaneous talk • Contextual comment • Incomprehensible words

  27. Corpora for special purposes • Algeo (1988): a corpus of 5 million words from the 18th century to the present for studying Briticisms in the English language. • American Heritage Intermediate Corpus (1969) • 5.09 million words in 10,043 samples, each 500 words long, from publications widely read among American schoolchildren aged 7 to 15.

  28. Corpora for special purposes • Categories: reading, English and grammar, composition, literature, mathematics, social studies, spelling, science, music, art, religion, home economics, library fiction, non-fiction, reference and magazines. • Words are not lemmatized • One of the first computer-based databases for lexicographical purposes.

  29. Other special corpora • The Nijmegen Corpus • Early 70s • Goal: grammatical description of British English • Size: 132,000 words

  30. Other special corpora • Composition: 20,000-word extracts from each of 6 authors = 120,000 words • 12,000 words of transcribed sports commentary • Categories: written, mainly literary English; sports commentary • Sample span: 1962-1968 • Analysis: a large set of labeled trees or phrase markers.

  31. Other special corpora • TOSCA (Tools for Syntactic Corpus Analysis) Corpus • Later than the Nijmegen Corpus • Size: 75 samples × 20,000 words = 1,500,000 words

  32. Other special corpora • Categories: various fiction and nonfiction genres in written British English • Span: 1976-1986 • Composition: 45 samples from 21 nonfiction genres, e.g. (auto)biography, history, literary criticism, politics, women’s studies, chemistry, economics, physics • 30 samples from 9 fiction genres, e.g. horror, humor, love and romance, general fiction

  33. Other special corpora • Hong Kong University of Science and Technology • Size: 1,000,000 words of computer science English • 2,000-word samples from 166 English-language textbooks used in computer science courses in the early 90s • Goal: assist the teaching of English to computer science students.

  34. Other special corpora • Jiao Tong University Corpus for English in Science and Technology (JTEST) • 1980s • 1,000,000 words from written English texts in the physical sciences, engineering and technology • Goal: facilitate lexical analysis of particular registers, e.g. counts of high-frequency words

  35. Other special corpora • Guangzhou Petroleum English Corpus (GPEC) • 411,000 words from 700 texts from the petroleum industry from written American and British English of the mid 1980s. • goal: the same as JTEST

  36. Second generation mega-corpora • COBUILD • Collins Birmingham University International Language Database • 25% from spoken texts • reflect broadly general rather than technical language

  37. Second generation mega-corpora • current usage from 1960 on • naturally occurring texts • Prose included but not poetry • Contributions: commercial research and development project for dictionaries, grammars and language teaching courses.

  38. Longman Corpus Network • Three major corpora • LLELC: the Longman/Lancaster English Language Corpus • LSC: Longman Spoken Corpus • LCLE: Longman Corpus of Learners’ English

  39. British National Corpus (BNC) • 100 million words of contemporary spoken and written British English. • Structure: Table 2.3 p.51 • Automatic word-class tagging with CLAWS

  40. Issues in corpus design and compilation • Static or dynamic • Representativeness and balance • Size • Written and spoken

  41. Issues in corpus design and compilation • Extralinguistic variables: text origin, participants, medium, genre, style, factuality, topic, date of publication, authorship (age, gender, nationality), audience • Storage • Text capture: keyboarding, CD-ROM or electronic version, scanning (software, quality of printing) • Spoken text: transcribing (conventions for transcribing prosodic phenomena: ICE project) • Markup: marks for tagging (Standard Generalized Markup Language, SGML), p. 84

  42. Organizations and professional associations • Descriptive linguistics: the International Computer Archive of Modern English (ICAME) • ICAME CD-ROM: the Brown, LOB, Kolhapur, London-Lund and Helsinki corpora • Software: WordCruncher, TACT and Free Text Browser

  43. Organizations and professional associations • Bibliographic overview: Humanities Computing Yearbook • Computational linguistics: the Association for Computational Linguistics (ACL) • Literary studies: the Association for Computers and the Humanities (ACH) • Association for Literary and Linguistic Computing (ALLC)
