Rationale for the Experiment

Speaker Identification: Function Words and Beyond • Peter W.H. Smith and Gea de Jong • City University, London, UK • 2 July 2005 • Presentation at IAFL2005, Cardiff, UK.

Rationale for the Experiment • Large police operations using audio surveillance. • Recent developments in the UK: transcripts of both reference and disputed material more readily available. • Complementary approach to forensic phonetics.

Stylometric Analysis • A set of methods for author identification. • Largely used on volume data. • Employs statistical techniques on features within the texts. • Used in cases of disputed or unknown authorship.

Stylometric Analysis - Methods • Methods include: • Simple methods – word length distributions, sentence lengths, letter collocations. • Markov Models based on letter collocations (Khmelev and Tweedie 2001). • Non-linear Neural Network models (Tweedie, Singh and Holmes 1996) • Function word methods, using multivariate statistics, e.g. principal components analysis or discriminant analysis (W. Smith 1986, Binongo 1997, Smith and Smart 2001)

Function Word Approach • Typical Method • Select texts • Break texts into blocks • Count specific function words within each block • Perform a multivariate analysis on the resultant data. • Present results
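The block-and-count steps of the typical method can be sketched in a few lines of Python. This is only an illustration: the word list, the block size and the per-1,000-word rate are assumptions, not values taken from the talk. The resulting matrix (one row per block, one column per function word) is what a principal components or discriminant analysis would take as input.

```python
from collections import Counter
import re

# Hypothetical short list; the slides do not say which words were counted here.
FUNCTION_WORDS = ["the", "and", "to", "of", "in", "that", "but", "so"]

def split_into_blocks(text, block_size):
    """Tokenise on letters/apostrophes and cut into full blocks of block_size words."""
    words = re.findall(r"[a-z']+", text.lower())
    n_full = len(words) // block_size  # a trailing partial block is dropped
    return [words[i * block_size:(i + 1) * block_size] for i in range(n_full)]

def function_word_rates(block):
    """Rate per 1,000 words of each function word in one block."""
    counts = Counter(block)
    return [1000 * counts[w] / len(block) for w in FUNCTION_WORDS]

def feature_matrix(text, block_size=500):
    """One row per block, one column per function word."""
    return [function_word_rates(b) for b in split_into_blocks(text, block_size)]

# Toy input, invented for the example (160 words, so 4 blocks of 40).
sample = "the cat sat in the hat and went to the shop so that it could rest " * 10
rows = feature_matrix(sample, block_size=40)
print(len(rows), "blocks x", len(rows[0]), "function words")
```

Using rates rather than raw counts keeps blocks comparable if the block size is changed between studies.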

Function Word Approach to Stylometric Analysis • Some very successful results, e.g. • Civil War Letters (Holmes, Gordon and Wilson 2001) • De Doctrina Christiana and Milton (Tweedie et al. 1998) • Consolatio – falsely attributed to Cicero (Forsyth et al. 1999) • The function word approach appears to work, i.e. it can distinguish authors with a claim to some reliability.

Recent Study using Function Words • 7 texts by author A, 4 texts by author B and one disputed text. • A and B are 18th-century American political writers – large volume of available written material. • Compare A with B • A1,A2,B1 • A1,A2,B2 • …A1,A3,B1 (separates A from B in 69/84 cases) • Compare B with A • A1,B1,B2 • A2,B1,B2 • … (separates A from B in 40/42 cases)
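The denominators 84 and 42 are consistent with one reading of the listing above: each trial is a triple of texts, two from one author and one from the other. The sketch below enumerates the triples under that assumption; the labels A1 to A7 and B1 to B4 simply follow the slide's naming.

```python
from itertools import combinations

a_texts = [f"A{i}" for i in range(1, 8)]  # 7 texts by author A
b_texts = [f"B{i}" for i in range(1, 5)]  # 4 texts by author B

# "Compare A with B": two A texts plus one B text.
a_vs_b = [(a1, a2, b) for a1, a2 in combinations(a_texts, 2) for b in b_texts]

# "Compare B with A": one A text plus two B texts.
b_vs_a = [(a, b1, b2) for a in a_texts for b1, b2 in combinations(b_texts, 2)]

print(len(a_vs_b), len(b_vs_a))  # C(7,2)*4 and 7*C(4,2)
```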

Function Words Appear to Separate out Author in Written Texts? • In this test the results are significant statistically. • Other studies find similar results. • Questions: • Why do function words appear to work? • Can they be used in transcripts of spontaneous speech (forensic cases)? • Do they work in lower volume studies? • Do they work across different registers? • Function words have multiple uses in language, e.g. • “He said that that that that he said was not that that that he said earlier.”
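The slides do not say which test lies behind the significance claim at the top of this slide. As a rough plausibility check only, the sketch below computes a one-sided binomial tail for the reported 69/84 and 40/42 separations. Two assumptions are made here that the slides do not state: a chance separation rate of 0.5, and independence between trials (which cannot hold exactly, since the same texts recur across triples).

```python
from math import comb

def binomial_upper_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Separation counts reported in the study; p = 0.5 is an assumed chance rate.
for successes, trials in [(69, 84), (40, 42)]:
    tail = binomial_upper_tail(trials, successes)
    print(f"{successes}/{trials}: tail probability {tail:.3g}")
```

Because of the dependence between trials, the resulting figures should be read as indicative, not as exact p-values.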

Possible Problems with Texts • Transcription errors. • False starts, stutters, disfluencies etc. • “yeah erm i wouldn’t know you know it would be that year it would be say the i don’t know i would only be guessing but it would be prior to that date” • Speaker/speaker influence

Other Differences in Texts • Differences in register (Biber et al. 2000). • Differences between formal interviews and telephone calls, for example. • Regional Variation – what is it, how might it affect our study? • e.g. “With prepositional phrases, begin our sentences, we must.”

Speaker Variation • What is it and how do we measure it? • Are grammatical features and their use clustered or evenly distributed amongst populations? • It works for written text; can it work for spoken transcriptions?

Speaker Identification – a joint Phonetic/grammatical approach • Phonetic analyses of (spoken) forensic data: • Accent features, e.g. “widespread spirantisation of oral stops, affrication of plosives in prosodically strong positions…” • Word quality, e.g. “a very close monophthongal quality in the vowel of words such as mate, brake…” • Other features, e.g. “the sustained pause syllable ‘ehm’ was selected for detailed acoustic analysis with equivocal results.” • Conclusion and caveat, e.g. “phonetic and linguistic comparisons make it possible to draw conclusions as to the likelihood of two speech samples having come from the same source.”

Other Grammar Based Approaches • (Baayen, Van Halteren and Tweedie 1996) Using syntactic annotation to enhance authorship attribution. • (Lancashire 1992) Phrasal repetends in literary stylistics. • (Chaski 2001) Empirical studies of language-based author identification techniques.

Sample texts • 7 texts were available. • 3 transcripts of interviewers (1716, 3301, 1808 words). • 3 transcripts of police interviewees (926, 4700, 2653 words). • 1 transcript of recorded telephone conversations (967 words).

Function Word Usage • Each text is analysed by function word usage for a total of 58 function words. • Words are chosen according to frequency of occurrence and multiple grammatical function. • Usage of the 58 function words is then sub-divided into over 700 categories (Quirk et al. 1989). • Comparisons can then be made with corpus-based spoken text studies (e.g. Biber et al. 2000).
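To make the sub-division concrete, the sketch below turns hand-annotated occurrences of one function word into a per-text distribution over grammatical categories and compares two texts. Everything in it is illustrative: the category labels are hypothetical stand-ins (not the Quirk et al. scheme), the annotations are invented, and total variation distance is one possible comparison measure, not one named in the talk.

```python
from collections import Counter

# Hypothetical hand-annotated occurrences of "to": (text id, category label).
annotations = [
    ("text1", "infinitive marker"), ("text1", "infinitive marker"),
    ("text1", "preposition: direction"), ("text1", "preposition: recipient"),
    ("text2", "infinitive marker"), ("text2", "preposition: direction"),
    ("text2", "preposition: direction"), ("text2", "infinitive marker"),
]

def category_distribution(annotations, text_id):
    """Relative frequency of each category within one text."""
    counts = Counter(cat for tid, cat in annotations if tid == text_id)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def total_variation(p, q):
    """Half the summed absolute difference: 0 = identical, 1 = disjoint."""
    cats = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats)

dist1 = category_distribution(annotations, "text1")
dist2 = category_distribution(annotations, "text2")
print(dist1)
print(dist2)
print("distance:", total_variation(dist1, dist2))
```

The same distribution can be set against a corpus baseline by passing corpus-derived proportions as the second argument.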

Grammatical Categories for “to”

Distribution of “to”

Distribution of “in”

Distribution of “so” – influence of register

Report on Texts 6 and 7 • Comparison of Texts 6 and 7 (consulting a corpus-based study – Biber et al. 2000) • Consistent use of the attitudinal disjunct "to be honest". Fairly high and consistent use of non-finite clauses without subject. Slightly unusual and consistent use of that-clauses as a subject complement. What-interrogative clauses are also frequently and consistently used across the two sample texts as a subject complement. The use of "and" is well above average for the spoken register and may indicate nervousness, but both texts demonstrate a high degree of multiple clausal connection using "and". Other uses of "and" which are more consistent with the register are co-ordination with the second phrase being chronologically sequent or a comment on the first. The occurrence of these is slightly above average. • The two texts show a consistent use of the preposition "in" as a metaphorical or abstract place preposition. There was also an unusually high frequency of occurrence of the preposition of direction "into", though this may be partially explained by the subject matter. The "signature" of usage of the preposition "in" was consistent for both texts. "So" was used consistently in both texts as a conjunct of result. The correlative subordinator "as…as" was used consistently in both texts, with an abnormally high frequency. "Or" was used consistently in both texts as a pro-form for a noun phrase.

Report • There was an unusual frequency of occurrence of the use of “who” in both interrogative clauses as subject and subject complement. There was a consistent and frequent usage of “there” used in an existential sentence formulated as SVOC. “But” is also used with well above average frequency and its use as a sentence/clause connector and for the linking of subordinate clauses is consistent across texts. • “Like” was used consistently in both texts in a common colloquial form and we also noted an unusual frequency of occurrence of its use as a pro-form for an adverbial in text A, but not in text B. • The most unusual feature that both texts had in common was the frequent use of that, what and who clauses as a subject complement. Overall, the grammatical features that appear in texts A and B suggest quite strongly that they could have been spoken by the same person.

Report on Texts 1 and 2 • Comparison of Texts 1 and 2 • (Text 1 – reference) • (Text 2 – telephone calls) • Both Text 1 and Text 2 show an unusually high frequency of to-infinitive verbs with noun phrase as indirect object. This is significantly higher than in a corpus of transcribed spoken text. Both texts have a similar frequency of usage of empty it subjects, though this is not out of line with the average. The usage of “in” was consistent in both texts, including the absence of use of “in” as a preposition of dimension and a complete absence of usage of “in” as an attitudinal disjunct (e.g. in fact), or a focusing adjunct. The usage of “so” was largely consistent despite differences in register between Text 1 and Text 2. “Then” was used consistently in both texts as a focusing adjunct in post-position. Both texts demonstrated a well-below average usage of “or” and a complete absence of its use as a co-ordinator, except in the case of co-ordination of simple noun phrases. Both texts had an unusually low frequency of occurrence of “but” and indeed in 2 instances “and” was used in places where the use of “but” was clearly indicated. “For” is also used with a well-below average frequency. The texts were similarly lacking in their use of the function word “if”. “Which” only appeared once in both texts. • Overall, the usage and distribution of usage of function words appears broadly consistent in both texts.

Discussion • Problems with Approach • Large number of categories (approx. 700 for 58 function words) – intolerably sensitive, makes a precise statistical test problematical. • What grammatical features should we choose? • The processing of textual data is highly labour intensive (lack of automated tools).
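One way to read the first problem is as a multiple-comparisons issue: with roughly 700 categories each tested separately, some will look unusual by chance alone. The arithmetic below is a sketch under assumptions not made in the talk, namely a per-category significance level of 0.05 and a Bonferroni correction as the remedy.

```python
n_categories = 700  # approximate figure from the slide
alpha = 0.05        # assumed per-category significance level

# Expected number of categories flagged by chance if none truly differ.
expected_false_positives = n_categories * alpha

# Per-category threshold that keeps the overall error rate at alpha.
bonferroni_threshold = alpha / n_categories

print(f"expected chance hits: {expected_false_positives:.0f}")
print(f"Bonferroni per-category threshold: {bonferroni_threshold:.2e}")
```

With texts of only a few thousand words, most categories will have very small counts, so a threshold this strict would be hard to meet for any single feature.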

Grammatical features • Word-level features. • Phrase level features. • Clause level features. • Grammar-based features. • Semantic features. • - Problems with class overlap • - Problems with categorisation

Further Work • Move away from purely function-word based approach. • Dealing with compound subordinators (e.g. “in order that”). • Examination of, for example, common clause construction in conjunction with common verbs and nouns. • Choice of features.

Further Work -2 • More corpus-based studies. • Studies on differences in register (e.g. interviews, telephone calls etc.) • Feature clustering (how are these distributed amongst a population of language users?)

Summary • The approach demonstrates that it is possible to discern patterns of usage from spoken (forensic) texts. • Joint approach with a phonetic study makes sense for spoken data. • Written texts are easier. • Refinement of features necessary.

Acknowledgements • We would like to thank DC Jon for allowing us access to the textual data and unknown for a (largely accurate) transcription.
