Compiling a corpus II

stella

Presentation Transcript


  1. Compiling a corpus II • “It’s a capital mistake to theorize before one has data” (A. Conan Doyle, Sherlock Holmes - A Scandal in Bohemia)

  2. Discourse(s) • ‘a group of statements which provide a language for talking about - a way of representing the knowledge about - a particular topic at a particular historical moment. [...] Discourse is about the production of knowledge through language’ (Hall, 1992: 291). • ‘It is the continuous reinforcement, through massive repetition and consistency in discourse, which is required to construct and maintain reality’ (Stubbs 1996: 92). • the incremental effect of discourse (Baker 2006)

  3. Corpus • A finite-size, non-random collection of naturally occurring language, in a computer-readable form. • Non-random = representative of a language or text type and compiled for an intended functional purpose. • (see McEnery et al. 2006)

  4. Quantitative breadth • coverage (i.e. representative, generalisable results) • statistical relevance • descriptive power • reliability (NOT objectivity) • the principle of total accountability • replicability

  5. Qualitative depth • contextualisation • socio-cultural relevance • explanatory power

  6. Cumulative power • Qualitative change cannot be understood, let alone achieved, without noting the accumulation of quantities • (Gerbner, 1983: 361)

  7. CADS • recognition and quantification of patterns • systematic analysis of serendipitous discoveries • ‘[W]here one examines the boiled-down extract, the list of words, the concordance. It is here that something not far different from the sometimes-scorned “intuition” comes in. This is imagination. Insight. Human beings are unable to see shapes, lists, displays, or sets without insight, without seeing in them “patterns”. It seems to be a characteristic of the homo sapiens mind that it is often unable to see things “as they are” but imposes on them a tendency, a trend, a pattern.’ (Scott and Tribble 2006: 6)

  8. Select a phenomenon for investigation • Collect a relevant data set • Look inside the data set for systematic patterns • Formalize significant patterns as rules describing natural events

  9. The research process • finding a research question • designing the appropriate corpus to answer it • compiling the dataset • analysing the corpus • fine-tuning the RQ / coming up with more questions • finding answers (?)

  10. concordances • ‘A concordance is a collection of the occurrences of a word form, each in its own textual environment.’ (Sinclair 1991:32) • A concordance brings together a series of fragments of text displaced from their original sequence. By juxtaposing them vertically, one after the other, it makes repetition visible and countable and lets patterns emerge, while the individual texts are eclipsed.
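
The idea of juxtaposing occurrences vertically can be illustrated with a minimal keyword-in-context (KWIC) sketch. The function name `kwic`, the `width` parameter and the sample text are assumptions made for illustration; dedicated concordancers offer far more (sorting, tokenisation, metadata).

```python
import re

def kwic(text, node, width=30):
    """Return one aligned keyword-in-context line per occurrence of `node`."""
    lines = []
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Take up to `width` characters of context on each side of the node.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so every node word lines up vertically.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, not taken from any real corpus.
sample = ("The corpus was compiled in 2006. A corpus is a finite collection "
          "of texts. Each corpus serves a purpose.")
concordance = kwic(sample, "corpus", width=20)
for line in concordance:
    print(line)
```

Because every line is padded to the same width, the node word forms a single column, which is what makes repetition on either side of it easy to see and count.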

  11. collocation • The idea behind collocation is that a word is defined by the relationships it establishes with other words. • ‘You shall know a word by the company it keeps’ (Firth 1957)

  12. The collocation principle • The ubiquity of collocation challenges current theories of language because it demands explanation, and the only explanation that seems to account for the existence of collocation is that each lexical item is primed for collocational use. By primed, I mean that as the word is learnt through encounters with it in speech and writing, it is loaded with the cumulative effects of those encounters such that it is part of our knowledge of the word that it co-occurs with other words. (Hoey 2002)

  13. Collocates and statistics • A collocate is an ‘item that appears with greater than random probability in its (textual) context’ (Hoey 1991: 7). • Measures of statistical significance (e.g. log-likelihood, z-score, MI score ...)
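
Before any significance measure can be applied, co-occurrences have to be counted. The sketch below counts the words found within a fixed span on either side of a node word. The function name `collocate_counts`, the `span` parameter and the toy token list are illustrative assumptions, and real tools also handle tokenisation and sentence boundaries.

```python
from collections import Counter

def collocate_counts(tokens, node, span=4):
    """Count words occurring within `span` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Tokens to the left and right of this occurrence, excluding the node itself.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

# Hypothetical toy data, whitespace-tokenised.
tokens = "strong tea and strong coffee but powerful computers and strong tea".split()
counts = collocate_counts(tokens, "strong", span=1)
print(counts.most_common())
```

These raw counts are the observed frequencies that measures such as MI, t-score, z-score or log-likelihood then compare with what chance alone would predict.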

  14. T-score and MI • two measures of relative statistical significance • T-score measures certainty of collocation, whereas MI score measures strength of collocation (Hunston 2002:73; McEnery & Wilson 2001:86). • T-score directs our attention to high-frequency collocates such as grammatical words (and is thus likely to be more useful to the grammarian or lexicographer than to the sociolinguist or discourse analyst), whereas MI score highlights lexical items that are relatively infrequent by themselves but have a higher-than-random probability of co-occurring with the node word (Clear 1993:281). • The two scores are useful, above all, in ranking collocations (Manning & Schütze 1999:166).
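
The contrast between the two scores can be made concrete with a small sketch. It assumes one common formulation, in which the expected frequency is E = f(node) × f(collocate) × window / N, MI = log2(O / E) and t = (O − E) / √O. Tools differ in how they adjust for the span, so the `window` parameter and all frequencies below are hypothetical.

```python
import math

def expected(f_node, f_coll, n_tokens, window=1):
    """Co-occurrences expected by chance (assumed formulation, see lead-in)."""
    return f_node * f_coll * window / n_tokens

def mi_score(observed, f_node, f_coll, n_tokens, window=1):
    """Mutual information: log2 of observed over expected co-occurrences."""
    return math.log2(observed / expected(f_node, f_coll, n_tokens, window))

def t_score(observed, f_node, f_coll, n_tokens, window=1):
    """T-score: observed minus expected, scaled by the square root of observed."""
    return (observed - expected(f_node, f_coll, n_tokens, window)) / math.sqrt(observed)

# Hypothetical figures: a 1,000,000-token corpus and a node word occurring 1,000 times.
N, F_NODE = 1_000_000, 1000
# A frequent grammatical collocate: 60,000 occurrences, 150 of them with the node.
mi_frequent = mi_score(150, F_NODE, 60_000, N)
t_frequent = t_score(150, F_NODE, 60_000, N)
# A rare lexical collocate: 50 occurrences, 20 of them with the node.
mi_rare = mi_score(20, F_NODE, 50, N)
t_rare = t_score(20, F_NODE, 50, N)

print(f"frequent collocate: MI={mi_frequent:.2f}, t={t_frequent:.2f}")
print(f"rare collocate:     MI={mi_rare:.2f}, t={t_rare:.2f}")
```

With these made-up figures the rare collocate receives the higher MI score and the frequent collocate the higher t-score, which is the ranking behaviour the slide describes.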

  15. Measure of statistical significance • z-score: the number of standard deviations from the mean frequency; it compares the observed frequency with the frequency expected if only chance were affecting the distribution. • It does not measure the strength of the relationship, but its significance.
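
As a rough illustration of the z-score, the sketch below uses a simplified binomial approximation: the chance probability of the collocate is p = f(collocate) / N, the expected frequency is E = p × f(node) × window, and the standard deviation is √(E × (1 − p)). This is an assumption made for the example; concordancing tools may compute p and the span adjustment differently, and the frequencies are hypothetical.

```python
import math

def z_score(observed, f_node, f_coll, n_tokens, window=1):
    """Standard deviations separating observed from chance-expected co-occurrence."""
    p = f_coll / n_tokens                # chance probability of the collocate at any position
    expected = p * f_node * window       # co-occurrences expected if only chance operates
    sd = math.sqrt(expected * (1 - p))   # binomial standard deviation (simplified)
    return (observed - expected) / sd

# Hypothetical figures: 1,000,000-token corpus, node word occurring 1,000 times.
z_rare = z_score(20, 1000, 50, 1_000_000)         # observed far above expectation
z_chance = z_score(60, 1000, 60_000, 1_000_000)   # observed equals expectation

print(f"z (rare collocate) = {z_rare:.1f}")
print(f"z (chance-level collocate) = {z_chance:.1f}")
```

A collocate that co-occurs exactly as often as chance predicts scores close to zero, however frequent it is, while a large positive score indicates that the co-occurrence is unlikely to be accidental.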

  16. Quantitative indicators highlight particularly promising entry points into the data. • identifying key leads worth pursuing qualitatively, according to the tried and tested principle of corpus linguistics, “Decide on the ‘strongest’ pattern and start there” (Sinclair 2003:xvi).

  17. What to do with collocates • recurrent lexical patterns • classification of collocates (semantic grouping) • recurrent semantic patterns • recurrent evaluative patterns • concordancing of co-occurrences and 2nd level collocation • analysis

  18. Collocation and prosody • The node’s property of being associated with a ‘semantically consistent set of collocates’ (Bublitz, 1996: 9). • Semantic/evaluative (Morley and Partington 2009) prosody is an expression of evaluation (good/bad; desirable/undesirable; beneficial/dangerous; favourable/unfavourable ...). • It can also be about control vs. lack of control.

  19. keywords • ‘A key word may be defined as a word which occurs with unusual frequency in a given text. This does not mean high frequency but unusual frequency, by comparison with a reference corpus of some kind’ (Scott, 1997: 236).
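
Unusual frequency relative to a reference corpus can be sketched as follows. The example uses the log-likelihood statistic as the keyness measure, which is one common choice and an assumption here, since the slide does not name a measure. The function names, the 3.84 cut-off (the chi-square critical value for p < 0.05 with one degree of freedom) and the toy corpora are likewise illustrative.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word: a, b = its frequency in study / reference corpus;
    c, d = total tokens in study / reference corpus."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, ref_tokens, min_ll=3.84):
    """Words relatively more frequent in the study corpus, ranked by keyness."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    c, d = len(study_tokens), len(ref_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:                      # keep positive keywords only
            ll = log_likelihood(a, b, c, d)
            if ll >= min_ll:
                scored.append((word, ll))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy corpora (100 and 1,000 tokens).
study_corpus = ("migrants " * 30 + "the " * 50 + "policy " * 20).split()
reference_corpus = ("the " * 500 + "policy " * 20 + "weather " * 480).split()
key_list = keywords(study_corpus, reference_corpus)
for word, score in key_list:
    print(f"{word}\t{score:.2f}")
```

In this toy data "the" is the most frequent word in the study corpus but is not key, because its relative frequency matches the reference corpus. This mirrors the distinction the definition draws between high frequency and unusual frequency.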

  20. What you can do with keywords • Identify the specificity, trends and the aboutness of the study corpus compared to a reference corpus. • Keywords are a very good source of insights and help identify potentially interesting items for closer observation, but they must be treated with caution.

  21. Working with keywords • Keyword lists do not account for the textual position of words, do not allow a distinction to be made between polysemous meanings, and are independent of the context. For these reasons keyword analysis does not reveal discourses, but it directs the researcher’s attention by highlighting patterns of difference that could otherwise go undetected. • As with collocation analysis, the software makes the pattern visible, the human works on it.
