
Experimental Design for Linguists






  1. Experimental Design for Linguists Charles Clifton, Jr. University of Massachusetts Amherst Slides available at http://people.umass.edu/cec/teaching.html

  2. Goals of Course • Why should linguists do experiments? • How should linguists do experiments? • Part 1: General principles of experimental design • How should linguists do experiments? • Part 2: Specific techniques for (psycho)linguistic experiments Schütze, C. (1996). The empirical basis of linguistics. Chicago: University of Chicago Press. Cowart, W. (1997). Experimental syntax: Applying objective methods to sentence judgments. Thousand Oaks, CA: Sage Publications Inc. Myers, J. L., & Well, A. D. (in preparation). Research design and statistical analysis (3d ed.). Mahwah, NJ: Erlbaum.

  3. 1. Acceptability judgments • Check theorists’ intuitions about acceptability of sentences • Acceptability, grammaticality, naturalness, comprehensibility, felicity, appropriateness… • Aren’t theorists’ intuitions solid?

  4. Example of acceptability judgment: Cowart, 1997 • Subject extraction: I wonder who you think (that) likes John. • Object extraction: I wonder who you think (that) John likes.

  5. Stability of ratings (Cowart,1997)

  6. 2. Sometimes linguists are wrong… • Superiority effects • I’d like to know who hid it where. • *I’d like to know where who hid it. • Ameliorated by a third wh-phrase? • ?I’d like to know where who hid it when.

  7. …maybe. Paired-comparison preference judgments • a. I’d like to know who hid it where. 86% • b. (*)I’d like to know where who hid it. 14% | 76% • c. (*)I’d like to know where who hid it when. 24% | 49% • d. I’d like to know who hid it where when. 51% • a-b: basic superiority violation • b-c: head-on comparison; the extra wh “when” hurts, doesn’t help • c-d: the “ameliorated” superiority violation, c, seems good when compared to its non-superiority-violating counterpart Clifton, C., Jr., Fanselow, G., & Frazier, L. (2006). Amnestying superiority violations: Processing multiple questions. Linguistic Inquiry, 37, 51-68.

  8. Another instance… • Question: is the antecedent of an ellipsis a syntactic or a semantic object? Why is (a) good and (b) bad? • a. The problem was to have been looked into, but obviously nobody did. • b. #The problem was looked into by John, and Bob did too. • Andrew Kehler’s suggestion: semantic objects for cause-effect discourse relations, syntactic objects for resemblance relations. Corpus data bear his suggestion out.

  9. But an experimental approach… Kim looked into the problem even though Lee did. (causal, syntactic parallel) Kim looked into the problem just like Lee did. (resemblance) The problem was looked into by Kim even though Lee did. (causal, nonparallel) The problem was looked into by Kim just like Lee did. (resemblance) Frazier, L., & Clifton, C. J. (2006). Ellipsis and discourse coherence. Linguistics and Philosophy, 29, 315-346.

  10. Context effects • Linguists: think of minimal pairs • The contrast between a pair may affect judgments • Hirotani: Production of Japanese sentences • The experimental context in which sentences are produced affects their prosody

  11. Hirotani experiment • a. Embedded wh-question (ka associated with na’ni-o) (# = Major phrase boundary) • Mi’nako-san-wa Ya’tabe-kun-ga na’ni-o moyasita’ka (#) gumon-sita’-nokai? • Minako-Ms.-TOP Yatabe-Mr.-NOM what-ACC burned-Q stupid question-did-Q (-wh) • ‘Did Minako ask stupidly what Yatabe burned?’ (‘Yes, it seems (she) asked such a question.’) • b. Matrix wh-question (ndai associated with na’ni-o) • Mi’nako-san-wa Ya’tabe-kun-ga na’ni-o moyasita’ka (#) gumon-sita’-ndai? • Minako-Ms.-TOP Yatabe-Mr.-NOM what-ACC burned-Q stupid question-did-Q (+wh) • ‘What did Minako ask stupidly whether Yatabe burned?’ (‘The letters (he) received from (his) ex-girlfriend.’)

  12. Hirotani results Percentage of insertion of MaP before phrase with question particle Hirotani, Mako. (submitted). Prosodic phrasing of wh-questions in Japanese

  13. 3. Unacceptable grammaticality • Old multiple self-embedding sentence experiments • Miller & Isard 1964: sentence recall, right-branching vs. self-embedded (1-4) • She liked the man that visited the jeweler that made the ring that won the prize that was given at the fair. • The prize that the ring that the jeweler that the man that she liked visited made won was given at the fair. • Median trial of first perfect recall: 2.25 vs. never • Stolz 1967, clausal paraphrases: subjects never understood the self-embedded sentences anyway Miller, G. A., & Isard, S. (1964). Free recall of self-embedded English sentences. Information and Control, 7, 292-303. Stolz, W. (1967). A study of the ability to decode grammatically novel sentences. Journal of Verbal Learning and Verbal Behavior, 6, 867-873.

  14. 3’. Acceptable ungrammaticality • Speeded acceptability judgment and acceptability rating • A. OK: None of the astronomers saw the comet, but John did. (83% OK, rating 4.36) • B. Embedded VP: Seeing the comet was nearly impossible, but John did. (66% OK, rating 3.71) • C. VP w/ trace: The comet was nearly impossible to see, / but John did. (44% OK, rating 3.27) • D. Neg adj: The comet was nearly unseeable, / but John did. (17% OK, rating 2.21) Arregui, A., Clifton, C. J., Frazier, L., & Moulton, K. (2006). Processing elided verb phrases with flawed antecedents: The recycling hypothesis. Journal of Memory and Language, 55, 232-246.

  15. 4. Provide additional evidence about linguistic structure • A direct experimental reflex of structure would be nice • But we don’t have one • Are traces real? • Filled gap effect: reading slowed at us in My brother wanted to know who Ruth will bring (t) us home to at Christmas. • Compared to My brother wanted to know if Ruth will bring us home to at Christmas. Stowe, L. (1986). Parsing wh-constructions: Evidence for on-line gap location. Language and Cognitive Processes, 1, 227-246.

  16. Are traces real, cont. • Pickering and Barry: “No.” • Possible evidence • That’s the pistol with which the heartless killer shot the hapless man yesterday afternoon t. • That’s the garage with which the heartless killer shot the hapless man yesterday afternoon t. • Reading disrupted at shot in the second example, far before the trace position • But who’s to say that the parser has to wait to project the trace? Pickering, M., & Barry, G. (1991). Sentence processing without empty categories. Language and Cognitive Processes, 6, 229-259. Traxler, M. J., & Pickering, M. J. (1996). Plausibility and the processing of unbounded dependencies: An eye-tracking study. Journal of Memory and Language, 35, 454-475.

  17. 5. Is grammatical knowledge used? • Serious question early on • “psychological reality” experiments • Direct experimental attack did not succeed • Derivational theory of complexity • Indirect experimental attack has succeeded • Build experimentally-based theory of processing

  18. 6. Test theories of how grammatical knowledge is used • Moving beyond the modularity debate – more articulated questions about real-time use of grammar • Phillips: parasitic gaps, self-paced reading • The superintendent learned which schools/students the plan to expand _ … overburdened _. (slowed at expand after students – plausibility effect) • The superintendent learned which schools/students the plan that expanded _ … overburdened _. (no differential slowing at expand – no plausibility effect) Phillips, C. (2006). The real-time status of island phenomena. Language, 82, 795-823.

  19. II: How to do experiments. Part 1, General design principles • Dictum 1: Formulate your question clearly • Dictum 2: Keep everything constant that you don’t want to vary • Dictum 3: Know how to deal with unavoidable extraneous variability • Dictum 4: Have enough power in your experiment • Dictum 5: Pay attention to your data, not just your statistical tests

  20. Dictum 1: Formulate your question clearly • Independent variable: variation controlled by the experimenter, not by what the subject does • Dependent variable: variation observed in the subject’s behavior, perhaps dependent on the IV • Operationalization of variables

  21. Formulate your question • Question: Do you identify a focused word faster than a non-focused word? • Must clarify: Syntactic focus? Prosodic focus? Semantic focus? • Must operationalize • Syntactic focus – Clefting? Fronting? Other device? • Prosodic focus – Natural speech? Manipulated speech? Synthetic speech? Target word or context?

  22. Formulate your question • Question: does discourse context guide or filter parsing decisions? • Clarify question: does discourse satisfy reference? establish plausibility? set up pragmatic implications? create syntactic structure biases? • Operationalize IV: Lots of choices here • But also have to worry about dependent variable…

  23. Choose appropriate task, DV • Question about focus: need a measure of speed of word identification • Conventional possibilities: lexical decision, naming, phoneme detection, reading time • Question about “guide vs. filter”: probably need an explicit theory of your task • Tanenhaus: linking hypothesis • E.g., eye movements in reading: tempting to think that “guide” implicates “early measures,” “filter” implicates “late measures.” • But what’s early, what’s late? Need a model of eye movement control in parsing.

  24. Subdictum A: Never leave your subjects to their own devices • It may not matter a lot • Cowart example: 5-point acceptability rating • A. “….base your responses solely on your gut reaction” • B. “…would you expect the professor to accept this sentence [for a term paper in an advanced English course]?” • But sometimes it does matter…

  25. Cowart 1997

  26. Dictum 2: Try to keep everything constant except what you want to vary • Try to hold extraneous variables constant through norms, pretests, corpora… • When you can’t hold them constant, make sure they are not associated (confounded) with your IV

  27. An example: Staub, in press Eyetracking: does the reader honor intransitivity? Compare unaccusative (a), unergative (b), and optionally transitive (c): a. When the dog arrived the vet₁ and his new assistant took off the muzzle₂. b. When the dog struggled the vet₁ and his new assistant took off the muzzle₂. c. When the dog scratched the vet₁ and his new assistant took off the muzzle₂. Critical regions: held constant (the vet…; took off the muzzle). Manipulated variable (verb): conditions equated on average length and average word frequency of occurrence. Better: match on additional factors (number of stressed syllables, concreteness, plausibility as intransitive, …). Better: don’t just have an overall match, but match the items in each triple. Staub, A. (in press). The parser doesn't ignore intransitivity, after all. Journal of Experimental Psychology: Learning, Memory and Cognition.
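The matching check described on this slide can be sketched as a quick norming computation. The verbs, lengths, and frequency counts below are invented for illustration (they are not Staub's actual materials); the point is only the shape of the check:

```python
# Sketch: verify that manipulated verbs are matched across conditions on
# length and frequency. All numbers here are hypothetical illustrations.
from statistics import mean

# Each condition maps to (verb, length in letters, frequency count) triples.
verbs = {
    "unaccusative": [("arrived", 7, 120), ("fell", 4, 150)],
    "unergative":   [("struggled", 9, 110), ("ran", 3, 160)],
    "transitive":   [("scratched", 9, 115), ("hit", 3, 155)],
}

# Mean length and mean frequency per condition.
summary = {
    cond: (mean(n for _, n, _ in items), mean(f for _, _, f in items))
    for cond, items in verbs.items()
}
# Roughly equal means across conditions suggest length and frequency are
# not confounded with the verb-type manipulation.
```

As the slide notes, an overall match like this is the weakest version; matching within each item triple is stronger still.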

  28. Another example: NP vs. S-comp bias • Kennison (2001), eyetracking during reading of sentences like: • The athlete admitted/revealed (that) his problem worried his parents…. • The athlete admitted/revealed his problem because his parents worried… • Conflicting results from previous research (Ferreira & Henderson, 1990; Trueswell, Tanenhaus, & Kello, 1993): does a bias toward use as an S-complement (admit) reduce the disruption at the disambiguating word worried? • Problem in previous research: plausibility of the direct-object analysis was not controlled (e.g., in Trueswell et al., the ambiguous NP his problem was rated implausible as the direct object of the S-biased verb) • Kennison normed the materials, equating plausibility of the subject-verb-object fragment for NP- and S-comp-biased verbs; she found reading disrupted equally at the disambiguating verb worried for both types of verbs. Kennison, S. M. (2001). Limitations on the use of verb information during sentence comprehension. Psychonomic Bulletin & Review, 8, 132-137.

  29. What happens when there is unavoidable variation? • Subdictum B: When in doubt, randomize • Random assignment of subjects to conditions • Questionnaire: order of presentation of items? • Single randomization: problems • Different randomization for each subject • Constrained randomizations • Equate confounds by balancing and counterbalancing • Alternative to random assignment of subject to conditions: match squads of subjects

  30. Counterbalancing of materials • Counterbalancing • Ensure that each item is tested equally often in each condition. • Ensure that each subject receives an equal number of items in each condition. • Why is it necessary? • Since items and subjects may differ in ways that affect your DV, you can’t have some items (or subjects) contribute more to one level of your IV than another level.

  31. Sometimes you don’t have to counterbalance • If you can test each subject on each item in each condition, life is sweet • E.g., Ganong effect (identification of consonant in context) • Vary VOT in 8 5-ms steps • /dais/ - /tais/ • /daip/ - /taip/ • Classify initial segment as /d/ or /t/ • Present each of the 80 items to each subject 10 times • Ganong effect: biased toward /t/ in “type,” /d/ in “dice” Connine, C. M., & Clifton, C., Jr. (1987). Interactive use of information in speech perception. Journal of Experimental Psychology: Human Perception and Performance, 13, 291-299.

  32. If you have to counterbalance… • Simple example • Questionnaire, 2 conditions, N items • Need 2 versions, each with N items, N/2 in condition 1, remaining half in condition 2 • Versions 1 and 2, opposite assignment of items to conditions • More general version • M conditions, need some multiple of M items, and need M different versions • Embarrassing if you have 15 items, 4 conditions… • That means that some subjects contributed more to some conditions than others did; bad, if there are true differences among subjects
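The version-construction scheme on this slide can be sketched in a few lines. Item and condition indices here are illustrative; the scheme simply rotates the item-to-condition assignment once per version:

```python
# Sketch: build M counterbalanced lists ("versions") for N items in M
# conditions. Each version contains every item once; each version has N/M
# items per condition; across versions, each item appears in every
# condition. Requires N to be a multiple of M.

def counterbalance(n_items, n_conditions):
    if n_items % n_conditions:
        raise ValueError("number of items must be a multiple of number of conditions")
    versions = []
    for v in range(n_conditions):
        # Rotate the item -> condition assignment by one step per version.
        version = [(item, (item + v) % n_conditions) for item in range(n_items)]
        versions.append(version)
    return versions

versions = counterbalance(n_items=4, n_conditions=2)
# Version 0: items 0, 2 in condition 0; items 1, 3 in condition 1.
# Version 1: the opposite assignment, as the slide describes.
```

With 15 items and 4 conditions the `ValueError` fires, which is exactly the "embarrassing" case the slide warns about.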

  33. Counterbalancing things besides items • Order of testing • Don’t test all Ss in one condition, then the next condition… • At least, cycle once through all conditions before testing a second subject • Fancier: Latin square • Avoids the minor confound of always testing condition 1 before condition 2, etc. • An n × n square, sequence × squad, containing condition numbers, such that each condition occurs once in each column and once in each order • Location of testing • E.g., 2 experiment stations
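A cyclic square of the kind described on this slide is easy to generate; condition numbers 0..n-1 are used for illustration:

```python
# Sketch: a cyclic n x n Latin square for counterbalancing testing order.
# Row = squad of subjects, column = position in the testing sequence,
# cell = condition number. Each condition occurs exactly once in every
# row and exactly once in every column, so no condition is systematically
# tested earlier than another.

def latin_square(n):
    return [[(row + col) % n for col in range(n)] for row in range(n)]

square = latin_square(4)
# Squad 0 runs conditions 0, 1, 2, 3; squad 1 runs 1, 2, 3, 0; and so on.
```

One design note: in a cyclic square each condition is always followed by the same successor, so immediate carryover is not balanced; a balanced (Williams) square addresses that as well.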

  34. Experimental Design for Linguists Charles Clifton, Jr. University of Massachusetts Amherst Slides available at http://people.umass.edu/cec/teaching.html and at http://coursework.stanford.edu

  35. Goals of Course • Why should linguists do experiments? • How should linguists do experiments? • Part 1: General principles of experimental design • How should linguists do experiments? • Part 2: Specific techniques for (psycho)linguistic experiments Schütze, C. (1996). The empirical basis of linguistics. Chicago: University of Chicago Press. Cowart, W. (1997). Experimental syntax: Applying objective methods to sentence judgments. Thousand Oaks, CA: Sage Publications Inc. Myers, J. L., & Well, A. D. (in preparation). Research design and statistical analysis (3d ed.). Mahwah, NJ: Erlbaum.

  36. II: How to do experiments. Part 1, General design principles • Dictum 1: Formulate your question clearly • Dictum 2: Keep everything constant that you don’t want to vary • Dictum 3: Know how to deal with unavoidable extraneous variability • Dictum 4: Have enough power in your experiment • Dictum 5: Pay attention to your data, not just your statistical tests

  37. So how do you randomize? • E-mail me (cec@psych.umass.edu) and I’ll send you a powerful program • But for most purposes, check out http://www-users.york.ac.uk/~mb55/guide/randsery.htm Or http://www.randomizer.org/index.htm
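Besides the linked tools, a per-subject constrained randomization can be sketched in a few lines. The particular constraint here (no more than two same-condition items in a row) and the rejection-sampling approach are illustrative assumptions, not a specific package's method:

```python
# Sketch: a fresh randomization of item order for each subject, with a
# simple constraint: no more than `max_run` items from the same condition
# in a row. Uses rejection sampling, so it assumes a constraint loose
# enough that valid orders are common.
import random

def constrained_shuffle(items, condition_of, max_run=2, rng=random):
    while True:
        order = items[:]
        rng.shuffle(order)  # a different random order on every call
        # Accept only if every window of max_run + 1 items mixes conditions.
        ok = all(
            len({condition_of[i] for i in order[k:k + max_run + 1]}) > 1
            for k in range(len(order) - max_run)
        )
        if ok:
            return order
```

Calling it once per subject (with a per-subject seed if you want reproducibility) gives each subject a different constrained randomization, per Subdictum B.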

  38. Factor out confounds • Factorial design • An example, discussed earlier: Arregui et al., 2006 • Initial experiment contained a confound; corrected in second experiment by adding a second factor

  39. Arregui et al., rating study • Acceptability rating (with ellipsis / clause 1 alone) • A. OK: None of the astronomers saw the comet, but John did. 4.36 / 4.53 • B. Embedded VP: Seeing the comet was nearly impossible, but John did. 3.71 / 4.41 • C. VP w/ trace: The comet was nearly impossible to see, but John did. 3.27 / 4.81 • D. Neg adj: The comet was nearly unseeable, but John did. 2.21 / 4.39 Arregui, A., Clifton, C. J., Frazier, L., & Moulton, K. (2006). Processing elided verb phrases with flawed antecedents: The recycling hypothesis. Journal of Memory and Language, 55, 232-246.

  40. Factorial Design Factor 1: syntactic form of initial clause (4 levels) Factor 2: presence or absence of ellipsis (2 levels)

  41. An interaction Interaction: The size of the effect of one factor differs among the different levels of the other factor.
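The definition on this slide, an interaction as a "difference of differences," can be made concrete with the Arregui et al. mean ratings from slide 39:

```python
# Sketch: an interaction computed by hand from the Arregui et al. (2006)
# mean ratings (4 antecedent types x with-ellipsis vs. clause 1 alone).
ratings = {
    "A": {"ellipsis": 4.36, "clause1": 4.53},
    "B": {"ellipsis": 3.71, "clause1": 4.41},
    "C": {"ellipsis": 3.27, "clause1": 4.81},
    "D": {"ellipsis": 2.21, "clause1": 4.39},
}

# Effect of adding the ellipsis clause, at each level of antecedent type:
effects = {k: round(v["ellipsis"] - v["clause1"], 2) for k, v in ratings.items()}
# If these effects were (roughly) equal, there would be no interaction.
# Here the ellipsis penalty grows from A (-0.17) to D (-2.18): the size of
# the effect of one factor differs across levels of the other factor.
```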

  42. Factorial Designs in Hypothesis Testing • Cowart (1997), that-trace effect • Question: is it bad to extract a subject over that • ?I wonder who you think (that) t likes John. • Acceptability judgment: worse with that • But: underlying theory talks just about extracting a subject. • Does acceptability suffer with extraction of object over that? • I wonder who you think (that) John likes t. • Need to do factorial experiment • Factor 1: presence vs. absence of that • Factor 2: subject vs. object extraction

  43. The results (from before) A clear interaction.

  44. A worry about scales • Interactions of the form “the effect of Factor A is bigger at Level 1 than at Level 2 of Factor B.” • Cowart: effect of that bigger at subject than at object extraction • Types of scales • Ratio: true zero, equal intervals, can talk about ratios (time, distance, weight) • Interval: equal intervals, but no true zero (temperature, dates on a calendar) • Ordinal: only more or less (ratings on a rating scale, measures of acceptability, measures of difficulty)

  45. Is there really an interaction?

  46. Disordinal and crossover interactions

  47. An example of an important but problematic experiment: Frazier & Rayner, 1982 Closure: LC: Since Jay always jogs a mile and a half / this seems like a short distance to him. (40 / 40 ms/ch) EC: Since Jay always jogs a mile and a half / seems like a very short distance to him. (35 / 54 ms/ch) Attachment: MA: The lawyers think his second wife will claim the entire family inheritance. (36 ms/ch) NMA: The second wife will claim the entire family inheritance / belongs to her. (37 / 51 ms/ch) Data shown: ms/character first pass times for the colored regions. Problems??? Frazier, L., & Rayner, K. (1982). Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14, 178-210.

  48. Dictum 3: Know how to deal with unavoidable extraneous variability • i.e., know some statistics • Measures of central tendency (“typical”) • Mean (average, sum/N) • Median (middle value) • Mode (most frequent value) • Measures of variability • Variance (Average squared deviation from mean) • Average deviation (Average absolute deviation from median)

  49. Computation of Variance
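The measures listed under Dictum 3 can be computed directly with the standard library; the five data points below are an invented sample for illustration:

```python
# Sketch: measures of central tendency and variability from the slides,
# computed over a small invented sample of ratings.
from statistics import mean, median, mode, pvariance

data = [2, 3, 3, 4, 5]

m = mean(data)      # sum/N
md = median(data)   # middle value
mo = mode(data)     # most frequent value

# Variance: average squared deviation from the mean.
var = pvariance(data)

# Average deviation: average absolute deviation from the median.
avg_dev = mean(abs(x - md) for x in data)
```

Note the squaring: variance weights large deviations more heavily than the average deviation does, which is why the two can disagree about which of two samples is "more variable."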

  50. Variance in an experiment • Systematic variance: variability due to manipulation of IV and other variables you can identify • Random variance: variability whose origin you’re ignorant of • Point of inferential statistics: is there really variability associated with IV, on top of other variability? • Is there a signal in the noise?
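The "signal in the noise" question on this slide is what an F ratio asks: an estimate of systematic (between-condition) variance divided by an estimate of random (within-condition) variance. A hand-computed sketch, on invented reading-time data for two conditions:

```python
# Sketch: a one-way between-groups F ratio computed by hand. The
# reading-time numbers (ms) are invented for illustration.
from statistics import mean

def f_ratio(groups):
    grand = mean(x for g in groups for x in g)
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Systematic variance estimate: spread of group means around the grand
    # mean, weighted by group size.
    ms_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / (k - 1)
    # Random variance estimate: spread of scores around their own group mean.
    ms_within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n_total - k)
    return ms_between / ms_within

f = f_ratio([[420, 450, 430], [480, 510, 490]])
# A large F means the variability associated with the IV stands out
# against the random variability; F near 1 means no detectable signal.
```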
