Annotation of corpora


  1. Annotation of corpora • A. Part-of-speech tagging • B. Syntactic annotation • C. Semantic annotation • D. Discourse annotation • E. Pragmatic annotation

  2. Annotation of corpora • perfectly plain text: produced by scanning; no information about the text (usually not even the edition) • marked up for formatting attributes, e.g. page breaks, paragraphs, font sizes, italics, etc. • annotated with identifying information, e.g. edition, date, author, genre, register, etc. • annotated for part of speech, syntactic structure, discourse information, etc.

  3. A. Part-of-speech tagging • LOB sample with POS tagging:

    A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
    A01 3 ^ by_IN Trevor_NP Williams_NP ._.
    A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
    A01 4 nominating_VBG any_DTI more_AP labour_NN
    A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
    A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
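
The word_TAG convention above is easy to process mechanically. A minimal sketch in Python (not part of the original slides; the function name is my own) that splits such a line back into (word, tag) pairs:

    def parse_tagged_line(line):
        """Split LOB-style word_TAG tokens into (word, tag) pairs."""
        pairs = []
        for token in line.split():
            word, sep, tag = token.rpartition("_")  # split on the LAST underscore
            if sep:  # keep only tokens that actually carry a tag
                pairs.append((word, tag))
        return pairs

    line = r"a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN"
    print(parse_tagged_line(line))
    # [('a', 'AT'), ('move', 'NN'), ('to', 'TO'), ('stop', 'VB'), ...]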

  4. A. Part-of-speech tagging • Main steps: • Divide the text into word tokens (tokenization) • Select a set of tags • Apply the tag set to the tokens • Tokenization: does the orthographic word coincide with the morpho-syntactic unit? • multiwords, e.g., in spite of: label as in_PREP31 spite_PREP32 of_PREP33 • mergers, e.g., clitics as in hasn’t, je t’aime, vendetelo: label as vendete_VERB lo_PRON • compounds, e.g., tag set: label as tagset_NOUN or tag_NOUN set_NOUN?
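
A minimal Python sketch of these tokenization decisions; the clitic table and the multiword rule are my own illustrative assumptions, while the ditto tags PREP31/32/33 follow the slide:

    # toy clitic table: mergers are split back into their parts
    CLITICS = {"hasn't": ["has", "n't"], "isn't": ["is", "n't"]}

    def tokenize(text):
        tokens = []
        for word in text.lower().split():
            tokens.extend(CLITICS.get(word, [word]))  # split mergers/clitics
        # mark multiword units with "ditto" tags so they stay linked
        out, i = [], 0
        while i < len(tokens):
            if tokens[i:i + 3] == ["in", "spite", "of"]:
                out += ["in_PREP31", "spite_PREP32", "of_PREP33"]
                i += 3
            else:
                out.append(tokens[i])
                i += 1
        return out

    print(tokenize("He hasn't left in spite of the rain"))
    # ['he', 'has', "n't", 'left', 'in_PREP31', 'spite_PREP32', 'of_PREP33', 'the', 'rain']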

  5. A. Part-of-speech tagging • Choice of tag set • a sophisticated, linguistically well-grounded set of tags… • BUT: not automatically applicable without loss of accuracy • example: come covers the present plural indicative, the imperative and the subjunctive; the Lancaster corpus additionally distinguishes these from the to-infinitive, while the LOB and Brown corpora don’t

  6. A. Part-of-speech tagging • tag = word class • label = a string of alphanumeric characters • examples:

    word class            possible labels
    preposition           preposition, prep, IN
    singular proper noun  NOUN:prop:sing, N-p-sg, NP1

  • logically organized (taxonomy), e.g., in Lancaster, BNC, C7 • presentation: horizontal or vertical

  7. A. Part-of-speech tagging • encoding of tags • TEI (SGML), e.g., BNC: <w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>, <w AV0>just <w CJS>as <w PNP>they <w VVB>’re <w VVG>passing <w PNP>you <c PUN>. (Garside et al., 1997)
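
Such SGML markup can be read back into (tag, word) pairs in a few lines of Python; this sketch uses a regular expression and is only meant for the fragment above (real BNC processing would use a proper SGML/XML parser):

    import re

    sample = "<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>,"
    pairs = re.findall(r"<[wc] ([A-Z0-9]+)>([^<]+)", sample)
    print([(tag, word.strip()) for tag, word in pairs])
    # [('AV0', 'Even'), ('AT0', 'the'), ('AJ0', 'old'), ('NN2', 'women'), ...]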

  8. A. Part-of-speech tagging • Applying tags to words • a tagging scheme should include a procedure for how to assign tags to words (both for humans and machines) • need a lexicon: it says which tags are assignable to which words • again: ambiguity is a problem
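
A minimal sketch of the lexicon lookup in Python; the toy lexicon and the default tag are my own assumptions, and a real tagger would add a disambiguation step (e.g., contextual rules or n-gram probabilities):

    LEXICON = {
        "stop": ["VB", "NN"],     # ambiguous: verb or noun
        "life": ["NN"],
        "peers": ["NNS", "VBZ"],  # ambiguous: plural noun or 3sg verb
    }

    def candidate_tags(token):
        return LEXICON.get(token.lower(), ["NN"])  # unknown words: guess noun

    for token in "stop electing life peers".split():
        print(token, candidate_tags(token))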

  9. B. Syntactic annotation • syntactic annotation = parsed corpora • purposes: • training automatic parsers (computational linguistics, e.g. probabilistic parsers: inductive training through extraction of frequency counts) • extracting information (linguistics, e.g., building a lexicon, investigating subcategorization frames, collocations or other linguistic phenomena, describing sublanguages)

  10. B. Syntactic annotation • a parsing scheme needs (cf. POS tagging): • a list of symbols • definitions of the symbols • a description of how to apply the symbols to text • syntactically annotated corpora: treebanks • examples of treebanks: Penn Treebank, Nijmegen Treebank, Susanne Corpus, Helsinki Constraint Grammar (ENGCG), Lancaster/IBM SEC treebank

  11. B. Syntactic annotation • Parsing • the (automatic) analysis of texts (sentences) in terms of syntactic categories • [parse tree shown on the slide: “Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov 29”, bracketed into S, NP, VP, ADJP and PP nodes]

  12. B. Syntactic annotation • Penn Treebank • skeleton parsing: partial parse, leaving out the “hard” things (such as PP-attachment) • phrase structure model (Garside et al., 1997, p. 42):

    ((S (NP (NP Pierre Vinken) ,
            (ADJP (NP 61 years) old ,))
        will
        (VP join
            (NP the board)
            (PP as (NP a nonexecutive director))
            (NP Nov 29)))
     .)
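
If you have NLTK installed, bracketed notation of this kind can be loaded directly; the sketch below uses a lightly normalized version of the parse (a single S top node) so that it is well formed:

    from nltk import Tree  # pip install nltk

    parse = ("(S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) "
             "will (VP join (NP the board) "
             "(PP as (NP a nonexecutive director)) (NP Nov 29)) .)")
    tree = Tree.fromstring(parse)
    tree.pretty_print()      # draw the tree as ASCII art
    print(tree.leaves())     # the original token sequence
    print(sorted({t.label() for t in tree.subtrees()}))  # ['ADJP', 'NP', 'PP', 'S', 'VP']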

  13. B. Syntactic annotation • Penn Treebank • available through LDC • size: 3,300,000 words (Feb 97) • Brown corpus, Wall Street Journal • in the current phase: • add function labels (Subj, Obj etc.) • add null constituents or traces (e.g., It’s easy [t] to eat) • add indices for coreferences (e.g., Mary[i] saw herself[i] in the mirror) • discontinuous constituents • add semantic roles (Agent, Goal etc) • may get too complex for large-scale reliable analysis…

  14. B. Syntactic annotation • Susanne Corpus • part of the Brown corpus, 128,000 words • result of manual analysis • parsing scheme specified in great detail • available from Oxford Text Archive: • sable.ox.ac.uk/ota (http) • ota.ox.ac.uk/pub/ota/public (ftp)

  15. A./B. Demo • TIGER • NEGRA

  16. C. Semantic annotation • problem (1): more than one way of referring to a concept, e.g., • text analysis: the choice of expression may reflect ideologies in the text or relationships between participants in a conversation, for example in doctor-patient interaction: abdomen --- tummy • information retrieval: a historian of fashion seeks information about trousers: trousers --- slacks, shorts, leggings, breeches --> cf. RECALL in IR

  17. C. Semantic annotation • problem (2): one single word can refer to different concepts, e.g., • information retrieval: a historian of fashion wants to know about boots: boot may refer to footwear, a computer start-up, a kick, or part of a car --> cf. PRECISION in IR • so: • need to identify related words (problem 1) • need to identify the different senses of a word (problem 2)
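
The two problems map directly onto the standard IR measures; a worked micro-example in Python, with invented counts for the trousers query:

    # invented counts for a query on the literal string "trousers"
    relevant = 100            # documents actually about trousers
    retrieved = 40            # documents the query returned
    relevant_retrieved = 30   # correct hits among those returned

    recall = relevant_retrieved / relevant      # hurt by problem 1 (synonyms missed)
    precision = relevant_retrieved / retrieved  # hurt by problem 2 (wrong senses returned)
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")
    # recall = 0.30, precision = 0.75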

  18. C. Semantic annotation • labeling words according to semantic field (word senses) so that you can • … extract all the related words by querying on the semantic field • … extract only those instances of ambiguous words with the specific senses you want by querying on the combination of word and semantic field
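
A minimal sketch of both query types over a toy semantically annotated word list (the field labels GARMENT and VEHICLE are my own illustrative assumptions):

    corpus = [
        ("trousers", "GARMENT"), ("slacks", "GARMENT"), ("breeches", "GARMENT"),
        ("boot", "GARMENT"), ("boot", "VEHICLE"),  # one word, two senses
    ]

    # query 1: all garment words, whatever the surface form (recall)
    print([w for w, field in corpus if field == "GARMENT"])

    # query 2: only 'boot' in the footwear sense (precision)
    print([(w, f) for w, f in corpus if w == "boot" and f == "GARMENT"])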

  19. C. Semantic annotation • semantic fields: sense relations and other kinds of relations (e.g., part-of, related-to, etc.) • annotation (cf. PoS tagging): • definition of the tagging scheme (labels and their meanings) • guidelines for applying the tagging scheme • in semantics, this is not as easy and straightforward as for PoS tagging! • requirements: • should make linguistic/psycholinguistic sense • should be able to account for the vocabulary in the corpus exhaustively • should be suitable for texts from different periods and registers (comprehensiveness) • should preferably have a hierarchical structure

  20. C. Semantic annotation • multiple membership, e.g., deepened: color and change/remain • multiword units, e.g., stubbed out: encoded as two separate words, but belonging together • one recent ambitious attempt at a taxonomy of such semantic relations (sense relations, thesaurus-type relations, semantic fields etc.): WORDNET at www.cogsci.princeton.edu/~wn/ • you can try it online: www.cogsci.princeton.edu/~wn/online/
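
WordNet can also be queried programmatically, e.g., through NLTK's interface (a sketch assuming NLTK is installed; nltk.download('wordnet') is a one-time setup step):

    from nltk.corpus import wordnet as wn  # pip install nltk; nltk.download('wordnet')

    for synset in wn.synsets("boot"):
        # prints one line per sense: footwear, car boot, kick, booting a computer, ...
        print(synset.name(), "-", synset.definition())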

  21. C. Semantic annotation • How to do it? • manually • computer-assisted (need at least a computer-readable lexicon and a disambiguation process - similar to PoS tagging) • fully automatic (not really feasible): • semantic analysis is even harder than syntactic parsing • no integrated ‘parse’ of meaning possible at the present time

  22. D. Discourse annotation • discourse features: what are they? • Typically: cohesion and coherence • coherence: what makes a text hang together in terms of content • cohesion: the means of making a text hang together • reference, substitution, ellipsis, conjunctive relations (cause, result, effect etc.), thematic development • Halliday & Hasan, 76

  23. D. Discourse annotation • example: anaphoric relations in the IBM/Lancaster corpus (UCREL) • try to build up sth. like an ‘anaphoric treebank’ • what are anaphoric relations? • links between a proform and an antecedent • example: The married couple said that they were happy with their lot. (the proforms they and their point back to the antecedent the married couple)

  24. D. Discourse annotation • anaphoric annotation in UCREL: categories used are based on Halliday & Hasan, 1976 • example of annotation: (1 Feodor Baumenk 1), a former Nazi death camp guard, has asked the U.S. Supreme Court to allow <REF=1 him to retain <REF=1 his American citizenship. (2 The Hartford Courant 2) said… • symbols: (1), (2)… = antecedent; < = anaphoric (> = cataphoric); REF = central pronoun
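
Once the symbols are fixed, the links can be recovered mechanically; a minimal Python sketch over the example above (the regular expressions cover only the symbols shown on this slide):

    import re

    text = ("(1 Feodor Baumenk 1), a former Nazi death camp guard, has asked "
            "the U.S. Supreme Court to allow <REF=1 him to retain <REF=1 his "
            "American citizenship.")

    antecedents = dict(re.findall(r"\((\d+) (.+?) \1\)", text))
    anaphors = re.findall(r"<REF=(\d+) (\w+)", text)
    for index, proform in anaphors:
        print(proform, "->", antecedents[index])
    # him -> Feodor Baumenk
    # his -> Feodor Baumenk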

  25. D. Discourse annotation • few corpora annotated for discourse features… • how to do it? • manually • computer-assisted: either interactive hand annotation, using some kind of specialized editor or automatic annotation with the possibility of hand correction or disambiguation • a tool supporting annotation of anaphora: XANADU in Lancaster

  26. E. Pragmatic annotation • anything beyond sentences and discourse: contexts of situation and culture • examples of things people look at in pragmatics: • carry-on signals in conversation (e.g., Stenstroem, 1987): which functions do carry-on signals such as “well”, “you know” etc. have in conversation? • speech acts (e.g., Stiles, 1992): speech act types in conversation, e.g., in doctor-patient interactions:

    PATIENT: I have the headaches to the point that I have to vomit (D)
    DOCTOR: Mm-hm (K)
    PATIENT: Then I have to go to bed and I sleep for a while (E)

    D = Disclosure, K = Acknowledgment, E = Edification

  27. E. Pragmatic annotation • how to do it? • manually • computer-assisted: ? • fully automatic: not feasible • You have to use your imagination! • Stenstroem example: can be done with a concordance program because it’s essentially word-based (see the sketch below) • Stiles example: would probably have to be done manually (then use a concordance program on the annotated texts?)
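
For the word-based Stenstroem case, even a few lines of Python suffice as a KWIC-style concordance (the sample utterance is invented):

    def concordance(tokens, keyword, width=3):
        """Print each occurrence of keyword with `width` words of context."""
        for i, token in enumerate(tokens):
            if token.lower() == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                print(f"{left:>25} [{token}] {right}")

    tokens = "well I mean you know it was well worth the trip you know".split()
    concordance(tokens, "well")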

  28. Higher-level annotation: tools • Tools that support specialized analysis, such as • specialized editors, e.g., Xanadu for anaphoric relations • tools specialized in terms of linguistic models, • e.g., Sys-Tools for Systemic Functional Grammar (minerva.ling.mq.edu.au/) (http://cirrus.dai.ed.ac.uk:8000/Coder/index.html) • e.g., RSTTool for rhetorical relations analysis (www.dai.ed.ac.uk/daidb/people/homes/micko/RSTTool/index.html) • Tools that support various kinds of analysis (but not quite everything you might want to do): • TATOE (www.darmstadt.gmd.de/~rostek/tatoe.htm)

  29. References • Fellbaum, C. (ed.), 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press. • Garside, R., G. Leech & A. McEnery (eds.), 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. • Halliday, M.A.K. & R. Hasan, 1976. Cohesion in English. London: Longman. • Mindt, D., 1991. Syntactic evidence for semantic distinctions in English. In Aijmer, K. & B. Altenberg (eds.), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman. • Stenstroem, A.-B., 1987. Carry-on signals in English conversation. In Meijs, W. (ed.), Corpus Linguistics and Beyond. Amsterdam: Rodopi. • Stiles, W.B., 1992. Describing Talk: A Taxonomy of Verbal Response Modes. Beverly Hills: Sage.
