
BME STATS WORKSHOP


yates


  1. BME STATS WORKSHOP Introduction to Statistics

  2. Part 1 of workshop

  3. The way to think about inferential statistics • They are tools that allow us to make black-and-white statements even though the data do not clearly provide answers. • This is to say that we will use probabilities, which speak in shades of grey, but will make statements with respect to rejecting or failing to reject some null hypothesis.

  4. Inference from data analysis • As scientists we have the unique privilege of using ingenious tools and methods that help us make informed decisions. • One of those tools is statistical analysis. It allows us to more accurately determine what our data really show. • This workshop should help you draw better conclusions from your data by using simple but effective statistical tools to cut through the levels of grey often encountered in research.

  5. The Essence of Inferential Statistics • We compare a statistic obtained from acquired data to a theoretical distribution of that statistic. Thus, comparison is central to statistics. • You will surely have conducted t-tests in the past to compare measures from a control group with an experimental group. • That t value is evaluated against a distribution of ts. • In statistics, size does matter. Large t values increase the likelihood that the investigator can state that the results are significant.

  6. Essence cont’d • Signal-to-noise ratio. • Most statistics used in this workshop, such as the t statistic, are made up of differences due to treatment and differences due to individuals (also called error). Error is simply random variation.

  7. Essence cont’d • Rare events • This is related directly to point one. • In order for a treatment to be declared successful, the obtained statistic has to be sufficiently rare. • We will find out that large statistical values are considered rare. • For a better understanding of these points we will describe a Monte Carlo experiment.

  8. The Plan! • Constructing a distribution. • How to apply a statistic obtained from an experiment. • Interpretation of a result. • What does a significant result mean?

  9. Constructing a Distribution:Some Definitions • Sample distribution: • A distribution of values from some measurement. • This measurement can be of anything, such as height, weight or age to name a few. • Sampling distribution: • A distribution of a statistic obtained from a sample distribution. • This statistic can be a mean, mode, median, variance or anything else that is a calculation from individual measures. • As we will see, the t statistic can be used to construct a sampling distribution.

  10. Distributions • Sample distributions are often bell-shaped or normal, but this is not guaranteed. On occasion exponential, rectangular or odd-shaped distributions are observed. • Sampling distributions, on the other hand, are almost always normally shaped. This is true even if the measurements used to calculate the statistic are from non-normal distributions.

  11. How to construct a sampling distribution of the t statistic. An example under the null hypothesis of equal means • We first need a sample distribution of some measure from a population with specific parameters, such as 25-year-old women. The measurement of interest could be height. • We then randomly sample from this distribution to make up two groups of individuals of a specified sample size. • Ex. Two groups of ten individuals. • From these two groups a t value is calculated. This t value is then plotted. After this calculation, the individuals are returned to the sample distribution. • This process of “sampling with replacement” is repeated as many times as possible. Using computers you might opt for 1000 or more samplings. Thus, you would have a sampling distribution of 1000 ts.
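The sampling procedure above can be sketched in a few lines of Python (an illustrative simulation, not part of the original workshop; the population mean of 165 cm and SD of 7 cm for heights are invented for the example):

```python
import random

random.seed(1)  # make the simulation reproducible

def t_two_sample(a, b):
    """Pooled two-sample t statistic (equal-variance form)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (ma - mb) / se

# Hypothetical population: heights of 25-year-old women, mean 165 cm, SD 7 cm.
# Both groups are drawn from the SAME population, so the null hypothesis is true.
ts = []
for _ in range(1000):
    g1 = [random.gauss(165, 7) for _ in range(10)]
    g2 = [random.gauss(165, 7) for _ in range(10)]
    ts.append(t_two_sample(g1, g2))

# Two-tailed 5% cutoffs: the values beyond which only 2.5% of ts fall in each tail
ts.sort()
lower, upper = ts[24], ts[974]
```

With two groups of 10 the dfs are 18, so the theoretical two-tailed 5% cutoffs are about ±2.10; the simulated `lower` and `upper` land near those values.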

  12. How to use a sampling distribution of ts • In any sampling distribution there are a number of values that are extreme. This is normal and we will use this concept to make decisions about our experiments. • Traditionally, we determine the t value at which point all values greater make up 5% of all values in that distribution. If we are concerned about both tails of that distribution we will find the value at which point all values greater make up 2.5% of all values on the positive tail and 2.5% on the negative tail.

  13. How to use cont’d. • We then conduct an experiment in which we have a control and an experimental group. • We calculate a t statistic from this experiment. • This t value is evaluated against the sampling distribution of ts we have constructed. • If our obtained value is greater than the value from the distribution that marks the 5% cutoff, we state that the experiment produced a significant result. In other words, the control was significantly different from the experimental group. [Figure: t distribution with the two tail rejection regions labelled “Sig.” and the centre labelled “Not Sig.”]

  14. Some specifics about using a t distribution. What does stating significance really mean? • First of all, when we find a t value that is outside of the critical values in a distribution, we should really start by saying, “the obtained value is rare if calculated from two groups obtained from the same population.” • We would then follow up that statement with, “Since that value is rare and was obtained from an experiment, it is reasonable to conclude that the groups do not come from the same population.” • This is indeed saying that the treatment was effective. Thus, we have a significant result.

  15. Monte Carlo How will building a distribution help us understand statistics

  16. Monte Carlo Building a t distribution

  17. Distributions: ts How do you build distributions of a statistic? In this case, t. 1) You start with a population of interest. 2) Calculate means from two samples, each with a specific number of individuals. 3) Calculate the t statistic using those two samples. 4) Do this again and again, possibly 1000 times or more, repeating the process as often as you can. Remember that these distributions are built under the null hypothesis.

  18. Family of ts The larger the sample size used, the less variability in the results. As we can see here, the greater the degrees of freedom (df), the less extreme are the obtained values, resulting in a tighter distribution. Note: degrees of freedom for the t-test are calculated as n1 + n2 − 2. Thus, for a sample size of 10 per group the dfs are 18.

  19. Theoretical Distribution of ts. We use this table to determine the critical values. The computer uses the density functions.

  20. Variables • Independent variable: • That variable you manipulate. • Subjects are allocated to groups • Dependent variable • That variable which depends on the manipulation. • Measures such as weight or height or some other variable that varies depending on treatment

  21. Cause and effect • Cause can only be inferred when subjects are randomly allocated to groups. • Random allocation ensures that all characteristics are evenly distributed across all groups. • This way, differences between groups cannot be due to biases in the subject selection, a very important element of experimental design.

  22. An example of data analysis

  23. Comparing Reaction time Following Alcohol Consumption. • University males were recruited to participate in an experiment in which they consumed a specific amount of alcohol. • The males were randomly separated into two groups. One group consumed the alcohol and the other some non-alcoholic drink. • Ten minutes after the second drink was consumed the subjects were asked to push a button on a box the moment they heard a buzzer. • When the button was pushed the buzzer stopped. The investigator recorded the amount of time the buzzer sounded in milliseconds.

  24. Hypotheses We state hypotheses in terms of populations. This is to say that we are making statements about what we think exists in the real world. From our sample we will reject or fail to reject the null hypothesis. Here we have a situation in which we are predicting differences only. This is a non-directional hypothesis. H0: μc = μa H1: μc ≠ μa

  25. The data (Time in ms) • Control | Alcohol group • 150 | 200 • 110 | 250 • 200 | 220 • 135 | 225 • 90 | 250 • 111 | 234
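For readers who want to check the numbers outside SPSS, the pooled two-sample t for these data can be computed directly (a plain-Python sketch, not part of the workshop):

```python
import math

control = [150, 110, 200, 135, 90, 111]
alcohol = [200, 250, 220, 225, 250, 234]

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal-variance t-test)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)   # df = n1 + n2 - 2 = 10
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (ma - mb) / se

t = pooled_t(control, alcohol)  # about -5.465, the value reported on a later slide
```

The negative sign simply reflects which group was entered first; the magnitude is what is compared to the critical value.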

  26. Results from an output provided by SPSS The probability of a Type 1 error is provided inside the red box, which was added by myself (not SPSS). Commonly, investigators call this the significance level. It should be noted that statisticians would not label that value as such.

  27. Critical Values • A critical value is the value on a theoretical distribution that marks the point beyond which less than a specific percentage of values can be found. • We typically use 5%. • In our example we have 12 scores from 12 individuals, thus 10 degrees of freedom. • From the distribution of all ts we can determine how large a calculated t from our experiment must be for us to reject the null hypothesis of equal means. • That value (see table previously shown) is 2.228. • Our obtained t (−5.465) is larger in magnitude than the critical value. We reject the null hypothesis in favour of the alternate. • You will notice that the t value is negative for our experiment. What is important is the magnitude, not the direction. If we were to reverse the groups in our calculations the value would have been positive.

  28. Interpretation of the results • Alcohol increases the amount of time needed to turn off the buzzer suggesting that the subjects are impaired in their reactions. • We are able to make this statement because the t value obtained here would be rare if the samples came from the same population. Due to this situation, we give ourselves permission to reject the null hypothesis of equal means in the population.

  29. Some Important Concepts

  30. The standard deviation • The concept of variance and standard deviation (SD) is everything in statistics. • It is used to determine whether individuals or samples fall inside or outside of the normal range. • Anyone who is more than 1.96 SD away from the population mean on some measure is said not to belong to that population. However, this is only true when we have population parameters (more on this later).
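The 1.96 SD rule amounts to a z-score check (an illustrative sketch; the population mean of 170 cm and SD of 10 cm are invented for the example):

```python
def z_score(x, mu, sigma):
    """Standardize a measurement against known population parameters."""
    return (x - mu) / sigma

# Hypothetical population: mean height 170 cm, SD 10 cm
z = z_score(195, 170, 10)          # z = 2.5
outside_normal = abs(z) > 1.96     # True: more than 1.96 SD from the mean
```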

  31. A few formulas to help us along. • Variance: s² = Σ(x − x̄)² / (n − 1) • Standard Deviation (SD): s = √s² • Standard error of the mean (SEM): SEM = s / √n
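These three formulas can be sketched in a few lines of Python (illustrative, not part of the workshop; sample versions with n − 1 in the variance denominator):

```python
import math

def variance(xs):
    """Sample variance: sum of squared deviations over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sd(xs):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(xs))

def sem(xs):
    """Standard error of the mean: SD divided by sqrt(n)."""
    return sd(xs) / math.sqrt(len(xs))

data = [1, 2, 3, 4, 5]
# variance(data) = 2.5, sd(data) ≈ 1.581, sem(data) ≈ 0.707
```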

  32. Variability is Important • The greater the variability the greater the noise. Note here that with greater variability in the data, more overlap of the sample distributions is observed. • This will result in smaller signal to noise ratios. Thus, when we have more variability we will need larger sample sizes to detect mean differences (more on this later). Keep this in mind when reviewing the upcoming slides.

  33. T-Test • Two Sample t-test • Comparing two sample means: t = (x̄1 − x̄2) / √[s²pooled × (1/n1 + 1/n2)], where s²pooled is the variance pooled across both groups. It is evident from the formula that the smaller the variability, the larger the t value.

  34. Hypothesis Testing revisited. • We always determine whether or not a statistic is rare given the null hypothesis, never the alternate hypothesis. You might remember this from the Monte Carlo studies. • Thus we have to deal with the concepts of the Type 1 and the Type 2 error.

  35. Type 1 error • The probability of being wrong when stating that samples are from different populations. • This is the p<.05 that we use to reject the null hypothesis of equal means in the population. • If we have a p of .02, it means that the probability of being wrong when stating that two samples come from different populations is .02. • The .05 is a cutoff that is said to be acceptable.
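The meaning of the 5% cutoff can be checked by simulation (an illustrative sketch, not part of the workshop): when the null hypothesis is true, roughly 5% of experiments still produce a t beyond the critical value (2.228 for two groups of 6, df = 10).

```python
import random

random.seed(42)  # reproducible simulation

def pooled_t(a, b):
    """Pooled two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    se = (ss / (na + nb - 2) * (1 / na + 1 / nb)) ** 0.5
    return (ma - mb) / se

# Simulate 2000 null experiments: both groups drawn from the same population
false_positives = 0
for _ in range(2000):
    g1 = [random.gauss(0, 1) for _ in range(6)]
    g2 = [random.gauss(0, 1) for _ in range(6)]
    if abs(pooled_t(g1, g2)) > 2.228:  # critical t for df = 10, alpha = .05
        false_positives += 1

type1_rate = false_positives / 2000   # close to 0.05 by construction
```

The observed rate hovers near .05, which is exactly the Type 1 error rate we accept when we use the 5% cutoff.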

  36. Type 2 error. • The probability of failing to reject the null hypothesis when the null is not true. • In truth, the samples are most likely from different populations. Often, we simply don’t have enough power or the tools are not sensitive enough to detect these differences.

  37. Assumptions of a Distribution What are they and why are they important?

  38. Assumptions are rules • They are the rules by which distributions are constructed. • These rules must be followed in order for a statistic obtained from an experiment to be compared to the theoretical distribution. • If your experiment breaks these rules, it is possible that you will be either too conservative or too liberal when making a statement about the reality of the population.

  39. Assumptions • Samples come from a normally distributed population • Both samples have equal variances (homogeneity of variance) • Samples are made up of randomly selected individuals • Both samples should be of equal sample size.

  40. What to do when we violate assumptions • 1. We can transform the data so that the sample can have the characteristics desired. • 2. We can use distribution free statistics. • These statistics are insensitive to violations of assumptions. • However, they do have limitations (more in later sessions).

  41. Part 2 of workshop

  42. Starting out with PASW (formerly SPSS but now SPSS again) An introduction

  43. What is SPSS • It stands for “Statistical Package for the Social Sciences”. • It started life as a text-driven program (SPSSx), migrated to the PC as line code and finally made it to the Windows environment. This is the version we enjoy today.

  44. Do you need the latest version? • No. • With each new version there are graphical changes and on occasion additional statistical tools. • However, the basics do not change. An analysis of variance conducted with version 10 will produce the same results as those with version 19 (the latest at the time of this workshop).

  45. Latest version cont’d • One problem is with the output of different versions. • Older versions of SPSS cannot read the output of newer versions. Thus, the outputs are not backward compatible. • One way to get around this issue is to use the export function in the newer versions to save the outputs as PDF, DOC, or PPT so that the results can be read.

  46. Getting started • If you’ve used Excel in the past, then you have a base from which to work. • SPSS uses a worksheet that is similar but not identical to Excel. • However, the similarities end there.

  47. Learning Curve • If you use SPSS on a regular basis, you should be somewhat proficient in a week or two. • Developing an expertise will take somewhat longer, depending on your interest and statistics knowledge. • Let’s get started!

  48. This is what you see when you start the program. In front of you is the worksheet in the “data view”. You enter all your data in the worksheet.

  49. You also have the option of “variable view” by clicking on the tab below or clicking on the column heading “var”.

  50. The variable view is where you write down the name of your variable (variable name). Also in this view you have the option of providing variable labels and other descriptors that can help you recognize your data. Name your variable.
