SEL3053: Analyzing Geordie Lecture 1. Introduction

This introductory lecture states the aim of the module and gives an overview of its content.

1. Module aim

Linguistics is a science, and the aim of this linguistics module is stated in scientific terms. The purpose of science is to understand the reality around us. Philosophy of science is devoted to explicating the nature of science and its relationship to reality, and, perhaps predictably, both are controversial. In practice, though, most scientists explicitly or implicitly assume a view of scientific methodology based on the philosophy of Karl Popper, which is centred on the concept of the falsifiable hypothesis. A research question is asked about some domain of interest, a hypothesis is proposed in answer to the question, and the hypothesis is tested to see whether its claims and implications are compatible with observation of the domain. If they are, the hypothesis is taken to be confirmed as a valid statement about the domain; if not, it is taken to be falsified and must then either be modified so as to make it compatible with observation, or abandoned.

Because the falsifiable hypothesis is central in contemporary science, it is natural to ask how hypotheses are generated. The consensus in philosophy of science is that hypothesis generation is non-algorithmic, that is, not reducible to a formula, but is rather driven by human intellectual creativity in response to a research question. In principle any one of us, whatever our background, could suddenly articulate an utterly novel and brilliant hypothesis that, say, unifies quantum mechanics and Einsteinian relativity, but this kind of inspiration is highly unlikely and must be exceedingly rare.

In practice, hypothesis generation is a matter of becoming familiar with the domain of interest by observation of it, reading the associated research literature, formulating a research question which, if convincingly answered, will enhance scientific understanding, abstracting data from the domain and drawing inferences from it, and on the basis of these inferences formulating a hypothesis that interestingly answers the research question.

Until now, this latter approach to hypothesis generation has served the linguistics research community well, but the appearance and rapid proliferation of digital electronic text since the second half of the twentieth century is undermining its usefulness for two main reasons.

On the one hand, data abstraction has traditionally been a matter of the researcher listening to or reading through some collection of speech or text, henceforth referred to as a corpus (pl. 'corpora'), noting features of interest, and then formulating a hypothesis. The advent of information technology in general and of digital representation of text in particular in the past few decades has made this often-onerous process much easier via a range of computational tools, but, as the amount of digitally represented language available to linguists has grown, a new problem has emerged: text overload.

Actual and potential corpora are growing ever larger, and even now they are often at the limit of what the individual researcher can work through efficiently in the traditional way. Moreover, as we shall see, data abstracted from such corpora can be so complex as to be impenetrable to understanding.

On the other hand, though linguistics in all its subdisciplines has studied a wide range of world languages, any survey of the literature will show that the main focus has been on western European languages and on English in particular. Because these latter languages have been well studied, and because most of the world's linguists have until recently been and probably still are native speakers of them, interesting hypotheses about these languages are relatively easy to formulate, being supported on the one hand by an extensive literature and on the other by native speaker intuition. The advent of information technology is, however, also generating large amounts of digital text in world languages that have been less well studied, and electronic collections of dialectal and historical language varieties as well as of endangered languages are now appearing. In the absence of extensive research literatures and native speaker intuitions, how easy is it to formulate interesting hypotheses about these?

One approach is to stick with what one knows, that is, to deal only with language collections of tractable size in languages whose characteristics are well known. But ignoring evidence is not scientifically respectable. The other is to exploit the rich new source of data about the world's present and past languages and dialects that electronic speech and text offer, and to formulate hypotheses based on that data. The question is: how?

The answer is to look at what is done in other sciences. Information technology has generated not just huge volumes of text but also vast amounts of digital data of all kinds across a wide range of science and engineering disciplines, and, because these disciplines have historically been quantitatively oriented, they have developed mathematically and statistically based computational technologies for data interpretation. The general solution to the problem of how to deal with large and diverse digital electronic corpora in linguistics is to adapt these technologies to analysis of data derived from them.

The aim of this module is to show how sociolinguistic hypotheses can be generated by abstraction of data from a corpus and analysis of that data using one particular mathematically based computational technology, cluster analysis.

2. Module outline

This section outlines the module content. Each of the topics mentioned will be dealt with in greater detail in subsequent lectures.

2.1 Historical context

A historical sketch of north-east England and of the local dialect provides a context for understanding the linguistic material covered in the module.

2.1.1 Historical sketch of north-east England
2.1.2 Linguistic history of north-east England
   a) Language, language variation, dialect
   b) The North-East dialect

2.2 Electronic corpora

Because the module is based on analysis of data abstracted from a digital electronic corpus, some knowledge of the nature of such corpora is required.

2.2.1 Digital electronic representation of language
2.2.2 The development and current state of corpus linguistics

2.3 Data creation

There is a fundamental distinction between the natural world on the one hand, and how we conceptualise it for the purpose of scientific study on the other. This part of the discussion deals with the nature of data and how it can be represented digitally for computational analysis.

2.4 Data transformation

Once created, data may have characteristics that render it unsuitable for analysis in its raw state. Two such characteristics are identified and methods for correcting them are presented:

2.4.1 Normalization for variation in document length
2.4.2 Dimensionality reduction
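To give a rough idea of what normalization for document length (2.4.1) can involve, here is a minimal Python sketch. It assumes one simple and common scheme, rescaling every speaker's frequency counts so that all speakers share the mean total count; the module may well use a different method. The function name `normalize_by_length` and the numbers are hypothetical and are not taken from any corpus.

```python
def normalize_by_length(rows):
    """Rescale each row of counts so that every row sums to the mean row total.

    This is one possible length normalization (an assumption of this sketch):
    value * (mean total over all rows) / (total of this row).
    """
    totals = [sum(row) for row in rows]
    if any(t == 0 for t in totals):
        raise ValueError("cannot normalize a row whose counts sum to zero")
    mean_total = sum(totals) / len(totals)
    return [[v * mean_total / t for v in row] for row, t in zip(rows, totals)]


# Hypothetical counts of three phonetic segments for two speakers.
counts = [
    [10, 30, 60],  # long interview: 100 segment tokens in total
    [1, 3, 6],     # short interview: 10 tokens, same proportions
]

normalized = normalize_by_length(counts)
print(normalized)  # both rows become [5.5, 16.5, 33.0]
```

Before normalization the two speakers look very different simply because one interview is ten times longer; afterwards their identical proportional usage is visible.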

2.5 Cluster analysis

Cluster analysis is a family of computational methods for identification and graphical display of structure in data when the data is too large, in terms of the number of variables, the number of objects described, or both, for it to be readily interpretable by direct inspection. Details of how it works are presented in due course, but to give some idea of what's involved, let's select, say, a dozen Tyneside speakers and describe each one in terms of how many times he or she uses each of a set of phonetic segments, as shown in the following table:

Visual examination of the table shows that the speakers differ from one another in terms of their pattern of usage of these phonetic segments. But is this variation purely random, or is there some linguistically interesting structure in the variation? Direct examination makes this very difficult if not impossible to determine.

Cluster analysis gives the answer: The relationship among speakers is shown as a hierarchical tree in which the lengths of the horizontal lines represent degrees of similarity of phonetic usage. It's immediately clear that there are two main groups or clusters of speakers, shown as A and B in the diagram, that cluster A subclusters into C and D, and so on. The variation in phonetic usage among speakers is, in other words, far from random and is in fact highly structured.
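How such a tree can be built is sketched below in plain Python. This is only an illustration under stated assumptions: it uses Euclidean distance and average linkage, neither of which is specified in this lecture, and the four speaker profiles are invented toy data, not rows from the Tyneside table. The names `euclidean`, `average_linkage` and `agglomerate` are hypothetical helpers.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, data):
    """Mean distance over all pairs of members drawn from two clusters."""
    dists = [euclidean(data[i], data[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def agglomerate(data):
    """Repeatedly merge the two closest clusters.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of row indices into `data`.
    """
    clusters = [(i,) for i in range(len(data))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], data)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented profiles: counts of three phonetic segments for four speakers.
speakers = ["s1", "s2", "s3", "s4"]
profiles = [[10, 1, 0], [9, 2, 0], [1, 8, 7], [0, 9, 8]]

merges = agglomerate(profiles)
for a, b, dist in merges:
    names_a = [speakers[i] for i in a]
    names_b = [speakers[i] for i in b]
    print(names_a, "+", names_b, "at distance", round(dist, 2))
```

The merge distances play the role of the horizontal line lengths in the tree: s1 and s2 join early, s3 and s4 join early, and the two pairs join only at a much larger distance, which corresponds to two main clusters.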

2.6 Hypothesis generation

The cluster tree is analyzed to determine whether its structure correlates with social variables associated with the speakers in a sociolinguistically interesting way. It is found that there is a systematic correlation between phonetic usage and social characteristics of the speakers; the sociolinguistic interpretation of the cluster tree is shown below.

The most obvious correlation is between the place of residence of the speakers and the cluster structure: nectetlsn01 and nectetlsn02 are from Newcastle and all the others are from Gateshead. In the sample selected, therefore, Newcastle and Gateshead speakers are very strongly distinguished in terms of their phonetic usage.

Gender has the most obvious correlation with cluster structure for the Gateshead speakers: all the trees agree in clustering the male speakers (nectetlsg02, nectetlsg24, nectetlsg13) against the remaining seven female speakers.

Age shows no obvious correlation among the males (given the small number of them this is unsurprising), but there is a correlation in all the trees for the females: the older ones (nectetlsg01, nectetlsg40, nectetlsg03) cluster against the younger ones (nectetlsg10, nectetlsg08, nectetlsg05, nectetlsg09).

Education shows a moderate correlation: most of the speakers have minimum education, but the two females with day-release level (nectetlsg05, nectetlsg09) cluster together in all three trees.

Employment also shows a moderate correlation: most of the speakers have unskilled or skilled manual employment except for the two females (nectetlsg05, nectetlsg09) with administrative jobs.
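The kind of comparison made in the preceding paragraphs, setting cluster membership against a social variable, can be sketched as a simple cross-tabulation. The sketch below is illustrative only: the speaker labels, cluster labels and gender values are invented placeholders rather than the corpus speakers discussed above, and `cross_tabulate` is a hypothetical helper name.

```python
from collections import Counter


def cross_tabulate(cluster_of, attribute_of):
    """Count speakers for every (cluster, attribute value) combination."""
    table = Counter()
    for speaker, cluster in cluster_of.items():
        table[(cluster, attribute_of[speaker])] += 1
    return table


# Invented example: which cluster each speaker fell into, and a social variable.
cluster_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B", "s5": "B"}
gender_of = {"s1": "male", "s2": "male", "s3": "female", "s4": "female", "s5": "male"}

table = cross_tabulate(cluster_of, gender_of)
for (cluster, gender), n in sorted(table.items()):
    print(cluster, gender, n)
```

If a cluster is made up almost entirely of one value of the variable, as cluster A is here, that suggests a correlation worth stating as a hypothesis; a mixed cluster such as B suggests a weaker one or none.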

The hypothesis generated from this analysis is, therefore, that:

1. The pattern of phonetic variation among Tyneside speakers is structured.
2. Newcastle speakers differ strongly from Gateshead speakers.
3. Gender is the primary determinant of variation among the Gateshead speakers.
4. Education and employment are moderate factors in variation among Gateshead speakers.
5. Age is not a factor in variation among Gateshead speakers.

Reading

Philosophy of Science
Wikipedia. Philosophy of Science: http://en.wikipedia.org/wiki/Philosophy_of_science
Wikipedia. Scientific Method: http://en.wikipedia.org/wiki/Scientific_method
The Nature and Philosophy of Science: http://www.angelfire.com/mn2/tisthammerw/science.html
Intute. History and Philosophy of Science: http://www.intute.ac.uk/hps/
A. Chalmers, What is this thing called science?, 3rd ed., Open University Press, 1999

Karl Popper
Wikipedia. Karl Popper: http://en.wikipedia.org/wiki/Karl_Popper
Stanford Encyclopedia of Philosophy. Karl Popper: http://plato.stanford.edu/entries/popper/
The Karl Popper Web: http://www.tkpw.net/
