SEL3053: Analyzing Geordie Lecture 8. Creation of the DECTE phonetic data matrix

This lecture applies the principles of data creation described in Lecture 7 to the creation of a data matrix extracted from the DECTE phonetic transcriptions. This DECTE matrix will be the basis for subsequent analysis.

1. Formulate a research question. For the remainder of the module, this will be: Is there systematic phonetic variation in the Tyneside speech community as represented by DECTE, and, if so, does that variation correlate systematically with social variables?

2. A set of variables is defined to describe the phonetic usage of the TLS speakers. The variables used to do this are the 156 PDV symbols of the TLS transcription scheme.

3. The type of value assigned to the variables needs to be decided. What is appropriate in the present case? When it did its analyses, TLS described the phonetic usage of each speaker by counting the number of times that the speaker used each of the PDV phonetic segments, which resulted in a set of phonetic usage profiles, one per speaker. We shall do the same.

4. Each speaker profile is represented as a 156-element vector in which every element represents a different PDV variable, and the value in any given vector element is the frequency with which the speaker uses the associated PDV variable. The following figure shows such a vector, including phonetic segment symbols and the corresponding PDV codes. Speaker dectetlsg01 uses phonetic segment 0244 31 times, 0112 28 times, and so on.

5. The set of speaker vectors is assembled into a matrix M in which the rows i (for i = 1..n, where n is the number of speakers) represent the speakers, the columns j (for j = 1..156) represent the PDV variables, and the value at Mi,j is the number of times speaker i uses phonetic segment j. In the lectures we shall only be looking at a 12-speaker selection from the full set of 64 speakers; analysis of the full set constitutes the project component of the module assessment. A fragment of this 12 x 156 matrix M is shown below.
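Steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration and not the software provided with the module: the function name build_matrix, the three-code inventory, and the transcription fragments are all hypothetical stand-ins for the real 156 PDV codes and the real DECTE data.

```python
from collections import Counter

def build_matrix(transcriptions, inventory):
    """Return (speakers, M), where M[i][j] is how often speakers[i] uses inventory[j].

    Note that Python indexes from 0, whereas the lecture indexes rows and
    columns from 1.
    """
    speakers = sorted(transcriptions)
    matrix = []
    for speaker in speakers:
        counts = Counter(transcriptions[speaker])   # code -> frequency
        matrix.append([counts[code] for code in inventory])
    return speakers, matrix

# Hypothetical miniature inventory; the real scheme has 156 PDV codes.
inventory = ["0244", "0112", "0230"]

# Hypothetical transcription fragments, not real DECTE data.
transcriptions = {
    "dectetlsg01": ["0244", "0112", "0244", "0230"],
    "dectetlsg02": ["0112", "0112", "0244"],
}

speakers, M = build_matrix(transcriptions, inventory)
for speaker, row in zip(speakers, M):
    print(speaker, row)
```

Each row of M is one speaker profile, and a code that a speaker never uses simply receives a count of 0.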

This matrix M will be the basis for analysis in subsequent lectures, with three possible outcomes:

If all the profiles are the same or very similar, the research question is answered in one way: there is little or no phonetic variation among speakers.

If the profiles are generally different, but randomly so, then the research question is answered in a second way: there is phonetic variation among speakers, but not systematic variation.

If the profiles differ, but the differences allow speakers to be grouped such that the phonetic differences within a group are small and the differences among groups large, then the research question is answered in a third way: there is phonetic variation among speakers, and it is systematic.

Normalization

Once a data matrix has been constructed, it may have characteristics that can adversely affect the validity of any analysis based on it. In the remainder of this lecture and in next week's lectures we look at two such characteristics and how to overcome them. The one considered in this lecture is the effect of variation in interview length.

1. The nature of the problem

A common characteristic of many document collections is that the documents vary in length. Where the data is based on counting the frequency of occurrence of some number of variables, it is self-evident that a longer document will, in general, contain more instances of those variables than a shorter one. A newspaper will, for example, contain many more instances of, say, 'the' than an average-length email, and a novel many more instances of 'he' or 'she' than a short story.

If frequency profiles for varying-length documents are constructed, as described in previous lectures, then the profiles for the longer documents will, in general, have relatively high values and those for the shorter documents relatively low ones. The result is that, when the profiles are grouped according to their relative frequencies, the grouping will be in accordance with document length.

To show this, we will use an edited version of M, our data matrix, in which the number of codes in each transcription has been modified.

There is a large disparity of lengths here, from 10 codes to 2500 codes, and the sizes jump in regular steps. This matrix was cluster analyzed, and the result is shown below.

Reference to the numerical values at the left of the tree, which represent the number of phonetic codes in the corresponding transcription, shows that the transcriptions have been grouped by relative length, as expected. This tells us nothing we didn't already know, and the analysis is consequently useless.
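A small numerical sketch may help to show why length dominates the grouping. It assumes that the similarity of two profiles is measured by Euclidean distance, which is a common choice for cluster analysis but is not stated in this lecture, and it uses hypothetical three-variable profiles rather than DECTE data.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical profiles over three variables.
short_a = [5, 3, 2]          # 10 codes
long_a = [1250, 750, 500]    # 2500 codes, same proportions as short_a
short_b = [2, 3, 5]          # 10 codes, quite different proportions

d_same_usage = euclidean(short_a, long_a)     # same usage pattern, different length
d_same_length = euclidean(short_a, short_b)   # different usage pattern, same length

print(d_same_usage, d_same_length)
```

Although short_a and long_a describe exactly the same pattern of usage, the distance between them is far greater than the distance between short_a and short_b, so any grouping based on these raw distances would place the two short profiles together.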

2. Variation in the length of the TLS / DECTE phonetic transcriptions

The TLS / DECTE phonetic transcriptions vary in length. This can be seen in the table below, which lists in ascending order of magnitude the number of codes in the transcriptions of the 12 speakers selected for analysis.

The differences in length are not as extreme as in the earlier example, and the effect on clustering can therefore be expected to be more subtle, but there will be an effect. Because grouping by relative document length obscures any interesting similarity structures among the documents, its effect must be mitigated or eliminated if the analysis is to be useful. The remainder of this lecture presents a way of doing this.

3. Elimination of variation in interview length

The solution to the problem of clustering in accordance with document length is to transform or 'normalize' the values in the data matrix in such a way as to mitigate or eliminate the effect of the variation. Such normalization is an important issue in Information Retrieval because, without it, longer documents in general have a higher probability of retrieval than shorter ones relative to any given query. The associated literature consequently contains various proposals for how such normalization should be done. Normalization by mean document length is used as the basis for discussion in what follows because of its intuitive simplicity, though the choice of method from among those currently available is not critical for present purposes.

Mean document length normalization involves transformation of the row vectors of the data matrix in relation to the average length of documents in the corpus being used, and, in the present case, transformation of the row vectors of M in relation to the average length of the 12 TLS / DECTE phonetic transcriptions.

Here Mi is the matrix row representing the frequency profile of transcription Ti, length(Ti) is the total number of phonetic codes in Ti, and μ is the mean number of codes across all the transcriptions in T.
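The formula to which these definitions belong is not reproduced in this transcript. A reconstruction that is consistent with the symbols just defined, and with the behaviour described in this lecture (rows for shorter-than-average transcriptions are increased, rows for longer ones decreased), would be the following; the prime on the left-hand side is an added notation to distinguish the normalized row from the original one.

```latex
M_i' = M_i \left( \frac{\mu}{\operatorname{length}(T_i)} \right),
\qquad
\mu = \frac{1}{n} \sum_{k=1}^{n} \operatorname{length}(T_k)
```

Every value in row i is multiplied by the same factor, so the relative proportions within a speaker's profile are unchanged.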

Applying this to the present case, the table below shows the number of codes in each of the selected interviews together with their sum and mean.

The values in each row of the data matrix M are now normalized using formula 1 above:

Note that dectetlsg01 and dectetlsg02 are both slightly shorter than the mean interview length, so in both cases their frequencies are increased only slightly to compensate. By contrast, dectetlsn02 is much shorter than the mean, and the normalization formula increases its frequencies much more. In general, mean document length normalization decreases the values in the row vectors representing documents longer than average, and increases the values in row vectors that represent documents shorter than average.
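The normalization can be sketched in Python as follows. This is a minimal illustration, not the module software: the function name is hypothetical, the three-row matrix is invented for the example, and length(Ti) is assumed to equal the sum of row i, which holds only if every code in a transcription corresponds to one of the matrix columns.

```python
def normalize_by_mean_length(matrix):
    """Multiply each row by (mean row length / that row's length)."""
    lengths = [sum(row) for row in matrix]     # assumed to equal length(Ti)
    if any(length == 0 for length in lengths):
        raise ValueError("cannot normalize an empty transcription")
    mu = sum(lengths) / len(lengths)           # mean number of codes
    return [
        [value * mu / length for value in row]
        for row, length in zip(matrix, lengths)
    ]

# Hypothetical 3-speaker, 3-variable matrix; row lengths are 10, 20 and 30.
M = [
    [5, 3, 2],     # shorter than the mean of 20: values are increased
    [10, 6, 4],    # exactly the mean length: values are unchanged
    [6, 9, 15],    # longer than the mean: values are decreased
]

normalized = normalize_by_mean_length(M)
for row in normalized:
    print(row)
```

After normalization every row sums to the mean length, and the first two rows, which have the same proportions but different lengths, become identical.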

The following figure shows cluster analyses of the unnormalized and normalized versions of M.

The normalized tree differs from the unnormalized one primarily in two ways:

The placement of dectetlsg09. In (a) it is off on its own at the bottom of the tree, far removed from the usage of any other speaker, but in (b) this extreme difference has disappeared. Figure 4 shows that dectetlsg09 is much longer than any other interview; normalization has removed the effect of this and made it much more similar to the other interviews.

The degree to which dectetlsn01 and dectetlsn02 differ from the other speakers. In (a) they are strongly distinguished from the other interviews, and in (b) the distinction is even stronger. Figure 4 shows that these two interviews are substantially shorter than the others, and normalization has accentuated their difference from the others and from one another.

In both cases, failure to normalize would have given a false impression of how the interviews relate to one another.

Practicalities

Because counting the phonetic segments for each speaker manually, and doing so accurately, would be extremely time-consuming and tedious, software that constructs a data matrix from a given set of phonetic transcriptions is provided, and can be accessed from the Materials section on the main page of this website. This software will be required for the assessed project, so its use will be exemplified in the seminars.
