Homing in on the Text-Initial Cluster Mike Scott School of English University of Liverpool Aston Corpus Symposium Friday May 4th 2007 This presentation is at www.lexically.net/downloads/corpus_linguistics
Starting Questions • Are clusters like “Once upon a time” and “lived happily ever after” oddities in marking text position? • Or do many n-grams characterise the beginnings, middles or ends of certain kinds of text? • If so, are there any common patterns in text-initial clusters?
Context • Textual Priming Project, University of Liverpool • Michael Hoey • Michaela Mahlberg • Matthew O’Donnell • Mike Scott
Textual Priming Project: Aims • to investigate how many (and what types of) lexical items are primed to appear in text-initial or paragraph-initial position • to identify lexico-grammatical patterns and see how these patterns can be functionally interpreted in the textual contexts • to relate these lexical and corpus-driven facts to current textual descriptions of (hard) news stories that might provide explanations for the positive primings of relevant lexis (from O’Donnell et al. 2007)
Hard News Corpus • “Home News” sections of the Guardian and Observer • 1998 to 2004 • 115,654 articles • divided thus: • headline & lead • 1st sentence of 1st paragraph (TISC) • all other sentences • TISC contains 3.2 million tokens • The rest: 51.2 million tokens • About 470 words per article
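The "about 470 words per article" figure can be checked from the other numbers on this slide. The sketch below assumes that "the rest" (51.2 million tokens) covers everything outside the TISC sections, which the slide does not state explicitly.

```python
# Figures quoted on the slide.
ARTICLES = 115_654
TISC_TOKENS = 3_200_000
OTHER_TOKENS = 51_200_000  # assumption: all tokens outside the TISC sections

total_tokens = TISC_TOKENS + OTHER_TOKENS
words_per_article = total_tokens / ARTICLES
tisc_share = TISC_TOKENS / total_tokens

print(f"{words_per_article:.0f} words per article")
print(f"TISC share of all tokens: {tisc_share:.1%}")
```

On these figures the TISC sections make up a little under 6% of the corpus, which is why the whole corpus can serve as a reference corpus for them.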
Research Questions Using the hard news corpus, • How many 3-5 word clusters are found to be key in TISC sections? • How many are positively and how many are negatively key? • What recurrent patterns can be found in the two types of key cluster?
Methods (1) • Format the corpus in XML and separate out all TISC sections (done by Matt O’Donnell) • Use WordSmith’s WordList tool to compute wordlist indexes of • all the text • all the TISC sections • Using WordList, compute 3-5 word clusters for each index, save as .lst
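The cluster step can be illustrated outside WordSmith. The following is a minimal sketch of counting every contiguous 3-5 word sequence in a token list; it is not WordSmith's implementation, and the function name `clusters` and the toy sentence are assumptions made for illustration only.

```python
from collections import Counter

def clusters(tokens, n_min=3, n_max=5):
    """Count every contiguous n-gram of n_min..n_max tokens."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical toy sentence, upper-cased like the cluster lists on the slides.
sample = "IT EMERGED LAST NIGHT".split()
sample_counts = clusters(sample)

for cluster, freq in sample_counts.items():
    print(freq, cluster)
```

In practice the function would be applied sentence by sentence, so that no cluster straddles a sentence boundary; whether WordSmith applies the same restriction depends on its settings.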
Top clusters, all sections • GUARDIAN CO UK • ONE OF THE • A HREF HTTP, WWW GUARDIAN CO and similar web links • THE PRIME MINISTER • THE END OF • AS WELL AS • THE NUMBER OF • THERE IS A • SOME OF THE • THERE IS NO
Top clusters, TISC • ONE OF THE • ACCORDING TO A • LAST NIGHT AFTER • FOR THE FIRST • THE FIRST TIME • IS TO BE • FOR THE FIRST TIME • THE MURDER OF • ARE TO BE • THE DEATH OF • OF THE MOST • THE HOME SECRETARY • WAS LAST NIGHT • IT EMERGED YESTERDAY • AS PART OF • AN ATTEMPT TO • THE UNITED STATES • THE NUMBER OF • ONE OF THE MOST • ACCORDING TO THE
Methods (2) • Use KeyWords tool to compute KWs for the TISC 3-5 word clusters using all the text as a reference corpus • Identify patterns in the KW clusters
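The keyness comparison can be sketched with a log likelihood score. The version below uses the common two-term form of the statistic (observed against expected frequency in the study and reference corpora); the exact formula used by the KeyWords tool may differ, and all frequencies in the example are hypothetical.

```python
import math

def log_likelihood(a, b, c, d):
    """Two-term log likelihood.

    a, b: frequency of the cluster in the study and reference corpus
    c, d: size of the study and reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keyness(a, b, c, d):
    """Signed score: positive if the cluster is relatively more frequent in the study corpus."""
    ll = log_likelihood(a, b, c, d)
    return ll if a / c >= b / d else -ll

# Hypothetical frequencies; corpus sizes assume "all the text" = 3.2m + 51.2m tokens.
score = keyness(a=120, b=300, c=3_200_000, d=54_400_000)
print(f"keyness: {score:.1f}")
```

A positive score corresponds to a positively key cluster (unusually frequent in TISC), a negative score to a negatively key one (unusually infrequent).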
TISC key clusters • WERE LAST NIGHT • YESTERDAY AFTER A • TONY BLAIR YESTERDAY • COURT HEARD YESTERDAY • WAS TOLD YESTERDAY • WAS JAILED FOR • THE DEATH OF • YEAR OLD BOY • YESTERDAY WHEN THE • WITH THE MURDER OF • ACCORDING TO A • LAST NIGHT AFTER • IT EMERGED YESTERDAY • WAS LAST NIGHT • ARE TO BE • THE MURDER OF • LAST NIGHT WHEN • THE GOVERNMENT YESTERDAY • LAST NIGHT AS • IS TO BE
RQs 1 & 2: Numbers of KW clusters • using a p value of 0.0000001, a minimum frequency of 3 and the log likelihood statistic: • 8,132 key clusters altogether (in 3.2 million words of text) • of which 7,631 were positively key • and 501 negatively key • though there is repetition, as these are 3-5 word n-grams
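To give a sense of how strict a p value of 0.0000001 is, the sketch below converts it into a log likelihood cut-off. It assumes the usual treatment of the statistic as chi-square distributed with one degree of freedom; the function names are illustrative and nothing here is taken from WordSmith itself.

```python
import math

def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

def critical_value(p, lo=0.0, hi=200.0):
    """Find by bisection the score whose upper-tail probability equals p."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_sf_1df(mid) > p:
            lo = mid  # tail still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2

P_THRESHOLD = 0.0000001  # the p value quoted on the slide
ll_cutoff = critical_value(P_THRESHOLD)
print(f"log likelihood cut-off: {ll_cutoff:.2f}")
```

For comparison, the conventional p = 0.05 level corresponds to a cut-off of about 3.84, so the threshold used here is far more demanding.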
Repetition • YESTERDAY FOUND GUILTY • YESTERDAY FOUND GUILTY OF • YESTERDAY FROM A • YESTERDAY FROM THE • YESTERDAY GAVE A • YESTERDAY GAVE HIS • YESTERDAY GAVE THE • YESTERDAY GIVEN A • YESTERDAY GIVEN THE • YESTERDAY GIVEN THE GO • YESTERDAY GIVEN THE GO AHEAD
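One way to gauge this repetition is to drop every cluster that is wholly contained in a longer one on the same list. This is a sketch of one possible approach, not a step reported in the study; the function name `drop_subsumed` is an assumption, and the list is the one shown on this slide, read as eleven clusters each beginning with YESTERDAY.

```python
def drop_subsumed(clusters):
    """Keep only clusters that are not a word-level substring of a longer cluster."""
    padded = [f" {c} " for c in clusters]  # padding enforces whole-word matches
    kept = []
    for i, current in enumerate(padded):
        contained = any(
            i != j and current != other and current in other
            for j, other in enumerate(padded)
        )
        if not contained:
            kept.append(clusters[i])
    return kept

slide_clusters = [
    "YESTERDAY FOUND GUILTY", "YESTERDAY FOUND GUILTY OF",
    "YESTERDAY FROM A", "YESTERDAY FROM THE",
    "YESTERDAY GAVE A", "YESTERDAY GAVE HIS", "YESTERDAY GAVE THE",
    "YESTERDAY GIVEN A", "YESTERDAY GIVEN THE",
    "YESTERDAY GIVEN THE GO", "YESTERDAY GIVEN THE GO AHEAD",
]
distinct = drop_subsumed(slide_clusters)
print(len(slide_clusters), "clusters,", len(distinct), "after collapsing")
```

Collapsing in this way ignores frequency: a shorter cluster may occur far more often than the longer one that contains it, so the reduced list is a rough guide only.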
Negatively key: • SPOKESMAN FOR THE • PER CENT OF • WE HAVE TO • SAID THAT THE • BUT IT IS • AT A TIME • A SPOKESMAN FOR THE • SAID HE WAS • IT IS NOT • THERE WAS NO • A LOT OF • A SPOKESMAN FOR • THERE IS NO • HE SAID THE • SAID IT WAS • THERE IS A • THIS IS A • THE FACT THAT • AS WELL AS • IT WOULD BE
RQ 1: Numbers of KW clusters • Is 8 thousand a large number of distinct key text-initial clusters? • In the same amount of text there are 84 thousand 3-5 word clusters of frequency at least 5 altogether… • so about 1 in 10 is associated with text-initial position at the .0000001 level of significance
RQ 1, continued • … is 1 in 10 a large proportion to be key? • In the case of SISC (sentences from paragraphs containing only one sentence), we get • 507 thousand clusters, of which • 2,192 are key (1,747 positively and 445 negatively) • which is about 1 in 230
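The two proportions can be reproduced from the counts given on the slides. Note that "84 thousand" and "507 thousand" are rounded figures, so the ratios below are approximate.

```python
# Counts quoted on the slides (cluster totals are rounded there).
TISC_CLUSTERS, TISC_KEY = 84_000, 8_132
SISC_CLUSTERS, SISC_KEY = 507_000, 2_192

# The positive and negative counts add up to the totals reported.
assert 7_631 + 501 == TISC_KEY
assert 1_747 + 445 == SISC_KEY

tisc_ratio = TISC_CLUSTERS / TISC_KEY
sisc_ratio = SISC_CLUSTERS / SISC_KEY

print(f"TISC: about 1 in {tisc_ratio:.0f} clusters is key")
print(f"SISC: about 1 in {sisc_ratio:.0f} clusters is key")
```

On these figures a cluster in the TISC sections is roughly twenty times more likely to be key than one in the SISC sections.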
RQ 3: patterns • recency: • in the top 200, seventy express time, generally using yesterday or last night
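A count like "seventy in the top 200" can be approximated by flagging clusters that contain a time expression. The sketch below is an illustration only: the marker list is limited to the two expressions named on the slide, the function name is an assumption, and the five sample clusters are taken from the lists elsewhere in this presentation rather than from the actual top 200.

```python
TIME_MARKERS = ("YESTERDAY", "LAST NIGHT")

def expresses_recency(cluster, markers=TIME_MARKERS):
    """True if the cluster contains one of the markers as whole words."""
    padded = f" {cluster} "
    return any(f" {marker} " in padded for marker in markers)

sample = [
    "IT EMERGED LAST NIGHT",
    "ONE OF THE MOST",
    "WAS TOLD YESTERDAY",
    "THE MURDER OF",
    "ACCORDING TO A",
]
recent = [c for c in sample if expresses_recency(c)]
print(f"{len(recent)} of {len(sample)} sample clusters express recency")
```

A manual classification, as presumably used for the slide, would also catch time expressions that a fixed marker list misses.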
Recency clusters • YESTERDAY IN A • IT EMERGED LAST NIGHT • A COURT HEARD YESTERDAY • YESTERDAY WHEN A • YESTERDAY AFTER THE • EMERGED LAST NIGHT • LAST NIGHT TO • YESTERDAY AS THE • YESTERDAY WHEN THE • WAS TOLD YESTERDAY • COURT HEARD YESTERDAY • TONY BLAIR YESTERDAY • YESTERDAY AFTER A • WERE LAST NIGHT • LAST NIGHT AS • THE GOVERNMENT YESTERDAY • LAST NIGHT WHEN • WAS LAST NIGHT • IT EMERGED YESTERDAY • LAST NIGHT AFTER
Superlatives • ONE OF BRITAIN'S MOST • ONE OF THE MOST • OF THE WORLD'S • THE FIRST TIME • OF BRITAIN'S MOST • FOR THE FIRST • FOR THE FIRST TIME
Research, Report etc. • ACCORDING TO A REPORT • A COURT HEARD (YESTERDAY) • ACCORDING TO RESEARCH • TO A SURVEY • IT EMERGED LAST NIGHT • IT WAS ANNOUNCED YESTERDAY • IT WAS REVEALED YESTERDAY • A REPORT PUBLISHED • ACCORDING TO A STUDY • TO RESEARCH PUBLISHED
Attention-grabbers • IT EMERGED THAT • OBSERVER CAN REVEAL • THE OBSERVER CAN REVEAL
Indefinite articles positively key…. • A LABOUR MP • A LANDMARK RULING • A LAST DITCH ATTEMPT TO • A LAST MINUTE • A LEADING BRITISH • A LEADING SCIENTIST • A LEGAL BATTLE • A LEGAL CHALLENGE • A BABY GIRL • A BAN ON • A BEACH IN • A BID TO • A BITTER ROW • A BLACK MAN • A BLISTERING ATTACK ON • A JURY WAS TOLD YESTERDAY
Indefinite articles negatively key • A KIND OF • A COUPLE OF • A GREAT DEAL • A KIND OF • A LOT MORE
IT + reporting verb – positively key • IT WAS ANNOUNCED LAST NIGHT • IT WAS CLAIMED LAST NIGHT • IT WAS CONFIRMED LAST NIGHT • IT IS REVEALED TODAY
IT otherwise negatively key: • IT IS A • IT IS ABOUT • IT IS EXPECTED • IT IS GOING • IT IS ONLY • IT IS POSSIBLE • IT SEEMS TO
SAID YESTERDAY – positively key • SAID YESTERDAY AFTER • SAID YESTERDAY THAT HE • SAID YESTERDAY THEY HAD
SAID without time – negatively key • SAID AT THE • SAID HE HAD • SAID HE WOULD • SAID THE GOVERNMENT • SAID THERE WAS NO
Conclusions • The “once upon a time” syndrome seems to be much more common than might be thought. • In text-initial sections of 115 thousand hard news stories (3.2 million words), out of 84 thousand 3-5 word clusters, about 1 in 10 (some 8 thousand) had text-initial significance • whereas in the SISC sections (sentences from single-sentence paragraphs) only about 1 in 230 was key
Other patterns • recency • superlatives • research, report • attention-grabbers • indefinite articles • IT + reporting verb; SAID + time
References • O’Donnell, Matthew, Mike Scott, Michaela Mahlberg & Michael Hoey (forthcoming) ‘When the text counts’ Exploring the Implications of ‘text’ as unit in corpus linguistics. Paper presented at PALC, Łódź, April 2007.