Statistics for Linguistics Students

Statistics for Linguistics Students Michaelmas 2004 Week 4 Bettina Braun www.phon.ox.ac.uk/~bettina

Overview • Discussion of last assignment • z-distribution vs. t-distribution • Between-subjects design vs. Within-subjects design • t-tests • for independent samples • for dependent samples

Exercise z-scores • 1) The mean pause duration in a read text is 200ms with a standard deviation of 50ms. For the calculations please specify how you reached your conclusion! • a) Is this a statistic or a parameter? • If we are interested in describing this particular read text, then it’s a parameter. If we use this text to draw inferences about pause duration in any text, then it’s a statistic. • b) What proportion of the data is above 70ms? z = (70 - 200)/50 = -2.6; 0.47% of the data lie below 70ms, so 99.53% of the data lie above 70ms • c) What proportion of the data falls between 100ms and 300ms? z = ±2; 2.28% lie below 100ms and 2.28% lie above 300ms, so 95.44% lie between 100ms and 300ms
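The z-score answers can be checked without a table. This is a minimal sketch that assumes pause durations are normally distributed; the function name proportion_between is made up for illustration.

```python
from statistics import NormalDist

# Pause durations from the exercise: mean 200 ms, standard deviation 50 ms.
# Assumption: the durations follow a normal distribution.
pauses = NormalDist(mu=200, sigma=50)

def proportion_between(dist, low, high):
    """Proportion of the distribution lying between low and high."""
    return dist.cdf(high) - dist.cdf(low)

z_70 = (70 - 200) / 50              # z-score of 70 ms
above_70 = 1 - pauses.cdf(70)       # proportion above 70 ms
between = proportion_between(pauses, 100, 300)

print(f"z = {z_70:.1f}")
print(f"above 70 ms: {above_70:.2%}")
print(f"between 100 and 300 ms: {between:.2%}")
```

The exact value for c) differs from the table-based 95.44% in the last digit only, because the two tail areas were rounded to 2.28% before subtracting.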

Exercise sampling distribution 2) If we have a sample size of 50, what does the sampling distribution of the means look like if the population is • U-shaped • skewed-left, and • normally distributed? Because of the central limit theorem, the sampling distribution of the mean will be approximately normally distributed for a sample of this size, irrespective of the form of the parent distribution

Exercise central limit theorem, standard error 3) What happens to the following statistics if the sample size increases? Does the • estimated mean increase, decrease, or stay approximately the same? Why? It stays approximately the same, as the sample mean is an adequate (unbiased) estimate of the population mean; a larger sample only makes the estimate more precise • standard error increase, decrease, or stay approximately the same? Why? The standard error decreases, since it is the standard deviation divided by the square root of the sample size (see formula for standard error)
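A short sketch of the standard error formula (standard deviation divided by the square root of n), using the 50ms standard deviation from the pause exercise. The sample sizes are arbitrary illustrations.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Quadrupling the sample size halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error(50, n))
```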

What are frequency data? • Number of subjects/events in a given category • You can then test whether the observed frequencies deviate from your expected frequencies • E.g. in an election with two candidates, there is an a priori chance of 50-50 for each candidate.

χ²-test • Null-hypothesis: there is no difference between expected and observed frequency • Data • Calculation

χ²-test • Limitations: • All raw data for χ² must be frequencies • Each subject or event is counted only once (if we wish to find out whether boys or girls are more likely to pass or fail a test, we might observe the performance of 100 children on a test. We may not observe the performance of 25 children on 4 tests, however) • The total number of observations should be greater than 20 • The expected frequency in any cell should be greater than 5

Looking up the p-value • Degrees of freedom: • If there is one independent variable with a categories: df = (a - 1) • If there are two independent variables with a and b categories: df = (a - 1)(b - 1)
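A minimal sketch of the calculation for one independent variable, using the standard formula: the sum over categories of (observed - expected)² / expected. The vote counts are invented for illustration; the critical value 3.841 is the standard table value for df = 1 at α = 0.05.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical election data: 100 voters, a priori 50-50 expectation.
observed = [60, 40]
expected = [50, 50]

statistic = chi_square(observed, expected)
df = len(observed) - 1          # one independent variable: df = a - 1

CRITICAL_05_DF1 = 3.841         # table value for alpha = 0.05, df = 1
significant = statistic > CRITICAL_05_DF1

print(statistic, df, significant)
```

Both expected frequencies are above 5 and the total is above 20, so the limitations listed earlier are respected in this example.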

Exercise dependent and independent variables • Generally, in hypothesis testing, the independent variable is hardly ever interval. Mostly it is nominal, or ordinal • Differentiate between • Number of independent variables (e.g. gender and exam year for score example => 2) • Levels of an independent variable are the number of values it can take (e.g. gender: generally 2) • The null-hypothesis is formulated to deny a relation between dependent and independent variable

Exercise dependent and independent variables Imagine you have a text-to-speech synthesis system. You are interested to find out whether the acceptability (from 1 to 5) is increased if you model short pauses at syntactic phrases. • dependent variable: acceptability (ordinal data) • independent variable: TTS with/without pause model (2 levels) • Null-Hypothesis: The pause model does not influence the acceptability rating

Exercise dependent and independent variables Subjects learned 20 nonsense-words presented visually. 30 minutes later they were tested for retention. The next day, the same subjects learned another 20 nonsense-words, this time in a combined visual and auditory presentation. Again, after 30 minutes they were tested for retention. The researcher measured the number of correct nonsense-words. • dependent variable: number of correct responses (interval data) • independent variable: kind of presentation (2 levels) • null hypothesis: The number of correct responses will be the same in the two conditions

Further influencing factors • Besides the independent variable, there might be further factors that influence your dependent variable. • Other factors might be confounded with our independent variable (e.g. in the nonword retention task, the audio-visual presentation was on a different day than the visual presentation. Presentation kind can thus be confounded with presentation time) • Systematic error

Counterbalancing • To avoid confounding variables, the conditions have to be counterbalanced. Example: • Half the subjects do the visual presentation first and the audio-visual presentation second • Half the subjects do the task in the opposite order • We often have a group of subjects to perform the task (not just one subject) • Also, in linguistic research, we often use multiple repetitions or different lexicalisations for a given condition (e.g. different words that all have a CVCV structure)
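One way to counterbalance in practice is to shuffle the subjects and alternate the condition orders. This is only a sketch: the subject labels, the function name counterbalance and the seed are all made up for illustration.

```python
import random

def counterbalance(subjects, orders, seed=None):
    """Randomly assign subjects to condition orders in equal numbers."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    # Alternate through the orders so each is used equally often.
    return {s: orders[i % len(orders)] for i, s in enumerate(shuffled)}

subjects = [f"S{i:02d}" for i in range(1, 21)]       # 20 hypothetical subjects
orders = [("visual", "audio-visual"), ("audio-visual", "visual")]

assignment = counterbalance(subjects, orders, seed=1)
for subject in sorted(assignment):
    print(subject, assignment[subject])
```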

Exercise drawing error-bars • Variables need to have the correct type! • Error bars show the 95% confidence interval for the mean (i.e. the sample mean and the range around it that is expected to contain the population mean with 95% confidence; it is not the range in which 95% of the data fall) • One independent variable • Simple error bar for groups of variables • Two independent variables • Clustered error bar for groups of variables
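A rough sketch of what such an error bar represents. It assumes a sample large enough to use the normal value of about 1.96; for small samples the corresponding critical t-value would be used instead. The scores are invented.

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval_95(scores):
    """Approximate 95% confidence interval for the mean (large-sample version)."""
    n = len(scores)
    se = stdev(scores) / math.sqrt(n)      # standard error of the mean
    z = NormalDist().inv_cdf(0.975)        # about 1.96
    m = mean(scores)
    return m - z * se, m + z * se

# Hypothetical scores for one condition (n = 40).
scores = [180, 190, 200, 210, 220] * 8
lower, upper = confidence_interval_95(scores)
print(f"mean = {mean(scores)}, 95% CI = [{lower:.1f}, {upper:.1f}]")
```

The interval is much narrower than the spread of the raw scores, which is why it must not be read as the range containing 95% of the data.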

Exercise drawing error-bars Clustered error bars for two independent variables

Example: testing if a sample is drawn from a given population • A lecturer at Oxford University expects that students at this university have a higher IQ-score than the average British population. • Since records are kept, he knows that the mean IQ-score in Britain is 200 with a standard deviation of 32

Experimental Procedure • The Null-hypothesis H0 is that the IQ of Oxford students is no different from that of the general public. • He randomly selects 40 students and gives them the standard IQ test. • This results in a mean IQ-score of 210 • Questions: • Can he conclude that Oxford students have a higher IQ? • Can he compare his sample to the population?

Comparison to population • The sample mean cannot directly be compared to the whole population, but to the sampling distribution of the sample mean (with samples of size n=40). • The sampling distribution has the same mean as the population (200) and a standard error of σ/√n = 32/√40 ≈ 5.06

Calculating z-score • Since the sampling distribution will be approximately normally distributed (for n > 30), we can calculate the z-score to see how likely a mean of 210 is if the null-hypothesis were true: z = (210 - 200)/5.06 ≈ 1.98 • If the null-hypothesis were true, there would be a chance of only 2.4% of obtaining a sample mean of 210 or higher
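The IQ example can be recomputed step by step. This sketch only uses the numbers given on the slides (population mean 200, standard deviation 32, n = 40, sample mean 210); the variable names are chosen for illustration.

```python
import math
from statistics import NormalDist

population_mean = 200
population_sd = 32
n = 40
sample_mean = 210

# Standard error of the sampling distribution of the mean.
standard_error = population_sd / math.sqrt(n)

# z-score of the observed sample mean under the null-hypothesis.
z = (sample_mean - population_mean) / standard_error

# Probability of a sample mean this high or higher if H0 were true.
p_upper = 1 - NormalDist().cdf(z)

print(f"SE = {standard_error:.2f}, z = {z:.2f}, p = {p_upper:.3f}")
```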

What if the population is unknown? • Often, we compare two different samples and we do not know the population parameters (e.g. are exam scores of the year 1990 and 2000 from the same distribution?) • Independent variable (# levels?): exam year (2 levels) • Dependent variable (type?): exam score (interval data)

Hypothesis • Null-hypothesis: The scores in the 2 exam years were drawn from the same distribution • Comparison of the means of the two populations (estimated from two representative samples) • What statistical test do we have to perform?

Between-subjects design (completely randomised) • All comparisons between the different conditions are based on comparisons between different (groups of) subjects • Each subject provides data for only one research condition • Example: You want to test whether the pitch of children under the age of 10 is dependent on their gender (a given child is either male or female!)

Within-subjects design (repeated measures) • All comparisons between different conditions are based on comparisons within the same group of subjects • Each subject provides data for all experimental conditions (as many scores as experimental conditions) • Example: You want to test whether the number of reading errors is higher when a subject is slightly drunk than when the same subject is sober.

Why is this difference important? • On average, two scores from the same person (both from P1, or both from P2) will be more alike than two scores from different persons (one from P1 and one from P2) • Scores from one person on the same task will be correlated; this is taken into account by within-subjects tests. • If a between-subjects test is used for a within-subjects design, we may fail to find an effect (type II error) • If a within-subjects test is used for a between-subjects design, we might find an effect that is actually not there (type I error)

Example • You want to test whether the precontext has an effect on the prosodic realisation of sentence-initial accents. • You construct 20 sentences, which can appear in two different contexts, say contrastive and non-contrastive. • Then you ask 20 subjects to read the 40 short paragraphs and measure the pitch height of the initial accent and the duration of the initial word. • You want to know if accents are realised differently in contrastive and non-contrastive context.

Difficult cases • Different classes of dependent variables • If you are interested in articulatory precision at two different speech rates, you might measure the formant values of the vowels and the number of sound elisions • These two dependent variables are taken from the same speaker but this is not a within-subjects design

Difficult cases • More than one measurement per subject, combined to give one score • You are interested in the formant values of male and female /a/. You have a list of 20 words, containing an /a/. Each group of 10 speakers reads the 20 words and you measure the formant values. Then you calculate the mean formant value of /a/ for every speaker • Since the analysis is performed on only one score per subject, this is not a within-subjects design

Which statistical test when you’ve score data (parametric tests)?
Between, within, mixed? / Number of independent variables? / Significance test
• Between
  • One: Indep. t-Test (2 levels), One-way ANOVA
  • More than one: Two-/Three-way ANOVA
• Within
  • One: Paired t-Test (2 levels), a x s ANOVA
  • More than one: a x b (x c) x s ANOVA
• Mixed

Assumptions for statistical tests on score data (parametric tests) • The scores must be from an interval scale • The scores must be normally distributed in the population • The variances in the conditions must be homogeneous Note: You can perform parametric tests only if these assumptions are met!

T-Test • Student’s t-test • How likely is it that two samples are taken from the same population? • The t-test looks at the ratio of the difference in group means to the variability within the groups (the standard error of the difference) • [Figure: distributions of Sample 1 and Sample 2, taken from http://esa21.kennesaw.edu/modules/basics/exercise3/3-8.htm]

T-Tests • Calculating t-statistic • Comparable to z-statistic, but dependent on the degrees of freedom (df) • Degrees of freedom (df) • Independent t-test: N1+N2-2 • Paired t-test: N-1 • The critical t-value for α = 0.05 (5% risk of finding an effect that is not actually there) is dependent on df
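A sketch of the independent t-test done by hand, using the standard pooled-variance form of Student's t-test with df = N1+N2-2. The reaction times are invented, and the critical value 2.306 is the standard table value for df = 8 (two-tailed, α = 0.05).

```python
import math
from statistics import mean, variance

def independent_t(sample1, sample2):
    """Student's t for two independent samples (pooled variance)."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    # Standard error of the difference between the two means.
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(sample1) - mean(sample2)) / se_diff
    return t, df

# Hypothetical reaction times (ms) from two different groups of subjects.
cond_a = [520, 540, 560, 580, 600]
cond_b = [400, 420, 440, 460, 480]

t, df = independent_t(cond_a, cond_b)
CRITICAL_T_DF8 = 2.306      # two-tailed, alpha = 0.05, df = 8 (from a t-table)
print(f"t = {t:.2f}, df = {df}, significant: {abs(t) > CRITICAL_T_DF8}")
```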

T-distribution • The more degrees of freedom, the closer the t-distribution is to the normal distribution

T-Table

One-tailed vs. two-tailed predictions • If we predict a direction of the difference, we are making a one-tailed prediction • If we predict that there is a difference (irrespective of direction), we are making a two-tailed prediction • If there is no strong reason to predict a particular direction in advance, a two-tailed test is the safe choice.
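The practical consequence can be sketched with the normal distribution as a large-sample stand-in for the t-distribution. The z-value of 1.98 is taken from the IQ example; the function name p_values is made up.

```python
from statistics import NormalDist

def p_values(z):
    """One-tailed and two-tailed p-values for a z-score.

    The one-tailed value only applies if the observed difference
    lies in the predicted direction.
    """
    one_tailed = 1 - NormalDist().cdf(abs(z))
    two_tailed = 2 * one_tailed
    return one_tailed, two_tailed

one_tailed, two_tailed = p_values(1.98)
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

The two-tailed p-value is always twice the one-tailed one, so the same result gets closer to the 0.05 threshold when no direction was predicted.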

Example • Hypothesis: reaction time in cond a is significantly different from cond b • Null-hypothesis: the reaction times are not different in conditions a and b

Independent t-test in SPSS Organise independent and dependent variables in separate columns!

Independent t-test in SPSS • Dependent variable(s): Test variable(s) • Independent variable: Grouping variable You have to specify the levels of the independent variable (can only have two!)

How to interpret the output? • Descriptive statistics • If p > 0.05 (Levene’s test), variances are homogeneous • There is an effect of condition on rt

How to interpret the output? • Group statistics (descriptive statistics for the conditions) • Independent samples test • Levene’s test for equality of variances (if p > 0.05, then variances are homogeneous) • t-test for equality of means • t-value • df (N-2) • Significance level (2-tailed) • mean difference (difference between the means)

What do we report? • There is a significant effect of condition on reaction time. The average reaction time in condition a was 238.7ms longer than in condition b (t = 6.12, df = 62, p < 0.001). • Interpretation?

Paired t-test in SPSS • Variables of different conditions have to be in parallel columns. • Click on variables to compare and then

How to interpret the output? • Paired samples statistics (descriptive statistics) • Paired samples correlation (naturally, there should be a rather strong correlation: subjects with a slow rt will tend to be slow in both conditions) • Paired samples t-test (t, df (N-1), significance level)
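The paired t-test works on the difference scores of each subject, using the standard formula: mean difference divided by its standard error, with df = N-1. A sketch with invented reaction times follows; 2.776 is the standard table value for df = 4 (two-tailed, α = 0.05).

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """t for paired samples: mean difference divided by its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)    # standard error of the mean difference
    return mean(diffs) / se, n - 1

# Hypothetical reaction times (ms): each position is the same subject
# measured in condition a and in condition b.
cond_a = [520, 540, 560, 580, 600]
cond_b = [510, 520, 550, 560, 590]

t, df = paired_t(cond_a, cond_b)
CRITICAL_T_DF4 = 2.776      # two-tailed, alpha = 0.05, df = 4 (from a t-table)
print(f"t = {t:.2f}, df = {df}, significant: {abs(t) > CRITICAL_T_DF4}")
```

Because each subject's overall speed cancels out in the difference score, a small but consistent difference between the conditions can still be detected.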

What if the basic assumptions are not met? • For example • if the distributions are very skewed • if you have ordinal data instead of interval data • You have to use non-parametric tests • There is a whole range of non-parametric tests; I’ll only show the most common ones

Non-parametric statistical tests (for one independent variable only)
Between, within, mixed? / Number of levels of independent variable? / Significance test
• Between
  • Two: Mann-Whitney Test
  • More than two: Kruskal-Wallis Test
• Within
  • Two: Wilcoxon Signed Ranks Test
  • More than two: Friedman Test
