EM Stats Reading Group

Housekeeping • The text: Field, Discovering Statistics Using SPSS, 3rd Edition • The software: SPSS, version [xxx], available through MCIT • Session 1 (today): • Chapters 1 and 2 • Homework • Chapters 3 and 4 • Installing SPSS and making graphs • Session 2 (to be scheduled) • Chapter 5 • Testing assumptions

Today • Chapter One • Nuts and bolts • Types of variables • Levels of measurement • Measurement error • Validity and reliability • Observation and experiment • Frequency distributions • Chapter Two • Population and sample • Summary statistics • The Central Limit Theorem • Test statistics

Variables, Parameters, and Statistics • Variable: • Usually, a feature of some subject that you measure • E.g., the speed of a car • Sometimes, something you infer from indirect measurements • E.g., your intelligence • Synonymous with attribute • Parameter • A number that relates two variables • E.g., the displacement of an accelerator and the speed of a car • E.g., your annual income and your intelligence

Variables, Parameters, and Statistics • Statistic • A measure of some feature of a sample of a variable • E.g., • Collect the height of everyone in this room • Calculate the mean height • The mean is a summary statistic and an estimate of the height of the population in general • E.g., • Collect the height of everyone downstairs • Compare to the height of everyone in this room with a t test • The t value is a test statistic • A statistic does not necessarily reflect any measurable feature of an individual • E.g., • You do not have a mean height

Variables, Parameters, and Statistics • Independent and dependent variables • Independent variable is one that you can adjust • E.g., the thermostat setting in this room • Dependent variable is one (potentially) driven by the independent variable • E.g., the temperature in this room (likely related) • E.g., the price of gold this morning (likely not related) • Sometimes defined in the opposite way • Not useful terminology -- probably best to just drop them • Predictor and outcome variables • Same idea, more clearly stated • Careful to note that this is not a statement of causality

Levels of Measurement • Categorical Variables • Variable may take only some finite (and usually small) number of values • E.g., • Binary variable: Only two possible states: Heads/Tails, Yes/No, True/False • Nominal variable: More than two states: Red/Violet/Green/Yellow • Ordinal variable: Multiple states, ordered: • Annual income: < $10,000; $10,001 - $30,000; $30,001 - $50,000; > $50,000 • Red/Yellow/Green/Violet…? • Continuous Variables • E.g., • Interval variable: Like an ordinal variable, but distance between any two states is assumed to be equal • Rate your pain on a scale of 1:10 • Ratio variable: ratios along the states also must ‘hold up’ • Middle ‘C’ on a piano is ½ the frequency of one octave above Middle ‘C’

Levels of Measurement • The distinctions can get blurry • Some ‘two state’ systems may have more states • E.g., Heads; Tails; Coin rolled under Fridge • Most ‘ratio data’ aren’t really • No experimental instrument can provide infinitely resolute measurements • E.g., ED oral thermometers only have 3-4 significant figures • While your temperature can be 39.0304327… °C, your measured temp can only be 39.0 or 39.1 • Many attributes of a patient, sprocket, etc., can be measured using different scales • When possible, go with the ratio method • Most test statistics are more ‘efficient’ for ratio data • They detect differences with fewer observations

Measurement Error • For now we see through a glass, darkly… • The things you want to measure cannot be absolutely measured • In the course of statistical analysis, there are many sources of ‘error’ • Error does not imply a mistake • Measurement error is the discrepancy between the value of the feature you’re evaluating and the measurement you take of it. • All measurement is associated with error. • In research, an important goal is to make the measurement error as small as economically feasible, or at least much, much smaller than the natural variability in the feature you are measuring.

Validity and Reliability • Validity • Is your measuring scheme capturing the feature you’re interested in? • Does the measurement change in some predictable way with a change in the feature? • Does the measurement not change when things other than the feature are changed? • Many different types of validity have been described (e.g., content validity, etc.) • Reliability • Does the measuring scheme produce the same results when applied repeatedly to the same experimental condition?

Study Design: Observation vs. Experiment • Observational studies aim to take measurements without influencing the system under examination • Typically easier than experiments • Sometimes the only means feasible for studying a problem • Some study designs based on observation are referred to as quasiexperimental methods • Establishing causality in observational studies is not possible. • Experimental studies take measures on a system that the investigator is intentionally perturbing • When well-designed, these methods are typically more powerful than observations • May get you closer to establishing causality

Frequency Distributions • A frequency distribution is a tabulation of values taken by a sample of some variable under study (if graphed, it’s called a histogram) • In statistical analysis, it is usually the case that you will approximate your empiric distribution with a probability distribution • E.g., Normal (Gaussian), t, Beta, Exponential, Gamma distributions • Swapping your actual distribution with one of these allows you to use statistical tools which are well behaved and thoroughly understood • Always remember that when you assume your data follow some distribution, you must do some homework to make sure that assumption is reasonable! (See Chapter 5)

Frequency Distributions • An attractive feature of defined probability distributions as a means of describing a frequency distribution is that they can usually be fully described by just a few parameters. • Mean (μ) = 6.96 • Standard Deviation (σ) = 2.35 • These are NOT your data – they are a model of your data!

Frequency Distributions • Normal (Gaussian) distribution • Common, but not universal • Described with 2 parameters • Mean (μ) • Standard deviation (σ) • Turns out it’s really easy to calculate these two parameters given a set of data

Frequency Distributions • Finding the mean and standard deviation for a normal distribution Mean Variance Standard Deviation
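The usual sample estimates of these three quantities can be written as follows. This is a sketch that assumes the N − 1 form of the sample variance; the symbols x_i (an individual observation), \bar{x} (the sample mean), s (the sample standard deviation), and N (the sample size) are notation chosen here for illustration.

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
s^2 = \frac{1}{N-1}\sum_{i=1}^{N} \left(x_i - \bar{x}\right)^2
\qquad
s = \sqrt{s^2}
```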

Frequency Distributions • A counter example: The time to arrival of the next patient in triage • A sample from the UM ED was taken at brisk activity – 10 new patients an hour • The question is asked – given a patient arriving at t = 0, what’s the range of likely times until the next patient arrives? E.g., how many minutes do I have to get this patient triaged?

Frequency Distributions • Given 10 arrivals per hour, you know the average time to the next patient should be around 6 minutes. • As a first approximation, a normal distribution is chosen. • Mean = 9.9 minutes • SD = 9.3 minutes • The normal distribution does a very poor job at estimating the distribution of arrival times.

Frequency Distributions • An alternative distribution • The exponential distribution • Describes the data very nicely • The mean and standard deviation have different forms and are derived from the intensity of the process – the number of cases per hour, usually abbreviated with λ

Frequency Distributions • Finding the mean and standard deviation for an exponential distribution • Weird: The mean and the standard deviation are the same (both equal 1/λ)… • With the exponential distribution, you only need one number (λ) to describe the whole thing.
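A quick simulation can make the "mean equals standard deviation" property concrete. The sketch below assumes arrivals follow an exponential distribution with the 10-per-hour rate from the triage example; the inter-arrival times are simulated, not the UM ED sample, and the names `RATE_PER_MIN` and `gaps` are illustrative.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

RATE_PER_MIN = 10 / 60  # lambda: 10 arrivals per hour, expressed per minute

# Simulated (not real) times between consecutive arrivals, in minutes.
gaps = [random.expovariate(RATE_PER_MIN) for _ in range(100_000)]

mean_gap = statistics.fmean(gaps)
sd_gap = statistics.stdev(gaps)

# For an exponential distribution, theory says mean = SD = 1 / lambda.
print(f"theory: mean = SD = {1 / RATE_PER_MIN:.1f} min")
print(f"simulated mean = {mean_gap:.2f} min, SD = {sd_gap:.2f} min")
```

Both simulated values should land close to 1/λ, which is why a single number is enough to describe the whole distribution.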

Frequency Distributions • For Discrete Variables (e.g., count data) • Bernoulli • Rademacher • Binomial • Hypergeometric • Poisson • For Continuous Variables • Beta • Von Mises-Fisher • Chi squared • Exponential • Gamma • Log-normal • Normal • Weibull • Many, many more. • Quantum mechanics is based on probability distributions over complex numbers (with real and imaginary parts)

Frequency Distributions • Take home point: • The frequency distribution of your data is just a tabulation of your observations • A probability distribution function is a mathematical tool that you can use to reduce a lot of observations into a small number of summary statistic values to describe your data and perform analysis • The mean and standard deviation are two common summary statistics • They will be calculated differently depending on which distribution you choose • Many probability functions exist • Your choice will depend on the problem at hand and available software

Populations and Samples • A population is the entire set of individuals about which you want to learn something and to which you may want to generalize. • In general, it is assumed that a true census – an observation of every member of a population – is not possible • Thus, the actual features of a population are essentially unknowable. • A sample is a subset of a population selected for measurement • A random sample is one in which every member of the population has some chance of being selected • A uniform random sample is one in which every member of the population has an equal probability of being selected • Central themes in statistics include: • Collecting representative samples • Choosing appropriate summary measures to describe those samples • Quantifying the likely discrepancy between one’s summary measures and the true values in the population (e.g., quantifying confidence)

Summary Statistics • Central tendency • Median: The 50%ile value – half of observations are greater than, half are less than, the median value • Mode: The most frequent single observation • Mean: The calculated summary statistic based on the probability distribution under consideration • Median: 8 • Mode: 8 • Mean: 8

Summary Statistics • Outlier • An observation that appears to deviate greatly from other members of the sample • Effect of outlier on central tendency summary statistics: • Median – relatively resistant to outliers • Mode – relatively resistant to outliers, but a sample may have more than one mode • Mean – resistance to outliers is proportional to the sample size – small samples may be swayed dramatically by outliers • Median: 8 • Mode: 8 • Mean: 10

Summary Statistics • Dispersion • Interquartile range: A pair of numbers – the 25th and 75th%ile of the sample • Standard Deviation: Calculated value based on the distribution under consideration • Confidence intervals: Variation on the standard deviation IQR: 6,9 SD: 2.3

Summary Statistics • Effect of outliers on dispersion summary statistics • IQR: Relatively resistant to outliers • SD, CI: Potentially very sensitive to outliers. As with the mean, the extent is a function of the sample size (small samples are very susceptible). IQR: 7,10 SD: 6.4
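The effect of one outlier on each summary statistic can be checked directly. The sketch below uses a small hypothetical sample invented for illustration (it is not the data behind the slide values), and the helper `summarize` is an illustrative name; quartiles come from Python's `statistics.quantiles`, whose default method may differ slightly from SPSS.

```python
import statistics

def summarize(values):
    """Return the mean, median, SD, and the (25th, 75th) percentile pair."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": (q1, q3),
    }

# Hypothetical sample, invented for illustration (not the data on the slides).
clean = [5, 6, 7, 7, 8, 8, 8, 9, 9, 10]
with_outlier = clean + [40]  # one extreme observation added

before = summarize(clean)
after = summarize(with_outlier)
for name in ("mean", "median", "sd", "iqr"):
    print(f"{name:>6}: {before[name]} -> {after[name]}")
```

In this toy sample the median and quartiles barely move, while the mean and especially the SD shift substantially, which is the pattern the slides describe.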

Summary Statistics • [Figure: normal distribution fitted to the sample, excluding the outlier vs. including the outlier]

Choosing Summary Statistics to Use • Distribution-Based Statistics • Pros: • Commonly used • Provide a natural link to statistical tests and methods to be developed later • Cons: • Are only as good as the choice of probability distribution • May significantly misrepresent the data when outliers are present • Distribution-Free (i.e., Non-parametric) Statistics • Pros: • Commonly used • Avoid (potentially wrong) assumptions associated with probability distributions • Cons • Don’t benefit from the powerful toolkit in parametric statistical methods • Many real-world phenomena likely follow theoretical distributions – avoiding them may suggest ‘pathologic’ data

The Central Limit Theorem • An Example: • ~ 750,000 cases of sepsis each year in US • Suppose you could measure everyone’s WBC on admission (i.e., you knew the population frequency distribution of WBC) • Find some summary statistics of that distribution • Mean = 14.3 • SD = 6.5 • IQR = 8.9, 19.4

The Central Limit Theorem • Now consider this experiment: • You enroll 50 patients in a sepsis study at UM • You calculate your summary statistics: • Mean = 14.6 (v. 14.3) • SD = 7.1 (v. 6.5) • IQR = 7.9, 20.2 (v. 8.9, 19.4) • A reasonable question to ask is: • If my sample is representative of the whole population, how close are my summary statistics likely to be to ‘the truth?’

The Central Limit Theorem • To figure it out, do your experiment over, enrolling 50 new patients • Now, repeat your experiment 1,000 times…

The Central Limit Theorem • Sampling Distribution of the Mean • Approaches a Normal Distribution in the Limit (Wow!) • The SD of this curve is the Standard Error of the Mean • The SEM is estimated by: $s_m = s / \sqrt{N}$ • where $s_m$ = SEM, $s$ = standard deviation of the original sample, and $N$ = sample size

The Central Limit Theorem • The distribution of the sampled mean value will, when resampled many, many times, approach a normal distribution with a mean equal to the population mean and a standard deviation equal to the standard deviation / square root of the sample size. • To double the precision of your estimate of the mean, you need to increase your sample size 4-fold • Other statistics of the distribution (e.g., the standard deviation, IQR, etc.) will also approach fixed distributions • These distributions are not necessarily normal (e.g., because the SD by definition is a positive real number, it obeys a distribution that avoids ‘0’) • These distributions are closely tied to the idea of Bayesian statistical methods
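The repeated-experiment idea can be simulated in a few lines. The sketch below assumes a hypothetical skewed (gamma-distributed) population standing in for admission WBC counts; the shape and scale values are illustrative choices, not the real sepsis data. It repeats a 50-patient study 1,000 times and compares the spread of the sample means with SD / √N.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population; parameters are illustrative, not from the slides.
population = [random.gammavariate(5.0, 3.0) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

N = 50          # patients enrolled per study
REPEATS = 1000  # number of times the study is repeated

# Each repeat draws a fresh uniform random sample and records its mean.
sample_means = [
    statistics.fmean(random.sample(population, N)) for _ in range(REPEATS)
]

mean_of_means = statistics.fmean(sample_means)
observed_sem = statistics.stdev(sample_means)   # SD of the sampling distribution
predicted_sem = pop_sd / math.sqrt(N)           # SD / sqrt(N)

print(f"population mean {pop_mean:.2f}, mean of sample means {mean_of_means:.2f}")
print(f"observed SD of sample means {observed_sem:.3f}, predicted SEM {predicted_sem:.3f}")
```

Changing `N` from 50 to 200 should roughly halve both SEM values, which is the 4-fold rule stated above.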

The Standard Deviation and the Standard Error of the Mean • Standard deviation describes the likely distance between any measurement in your sample and the mean of the sample (i.e., the variation or dispersion in your sample) • The SEM describes the likely distance between your sample mean and the (unknown) mean of the population • For any sample of more than one observation, the SEM is by definition smaller than the SD; many will report this number because it ‘looks better.’ • I prefer SD as it more naturally implies the spread of the measurements actually taken and does not refer to a population the details of which aren’t really knowable.

Confidence Intervals • CIs are a simple extension of standard deviation • They proceed from a probability distribution • Easily calculated by numerous software packages • An ‘n’% confidence interval is a range computed from a sample such that, if new experimental samples were drawn from the population many times and an interval computed from each, n% of those intervals would contain the true value of the summary statistic (e.g., the mean) • This is an odd definition that stems from details of the probability logic involved • A close, but not identical definition is the range in which there is an n% probability of finding the true value of the summary statistic (e.g., the mean) • See text for calculation details
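One way to get a feel for the repeated-sampling logic behind a confidence interval is to simulate it. The sketch below assumes normally distributed draws using the population mean and SD quoted in the sepsis example, and uses the normal-approximation multiplier 1.96 rather than the t-based calculation in the text; it counts how often a 95% interval built from a 50-patient sample contains the true mean.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

TRUE_MEAN, TRUE_SD = 14.3, 6.5  # population values quoted in the sepsis example
N, REPEATS = 50, 2000
Z_95 = 1.96  # normal-approximation multiplier; a t multiplier would be slightly larger

covered = 0
for _ in range(REPEATS):
    # Assumption: draws are normal, purely for the sake of the simulation.
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    sem = statistics.stdev(sample) / math.sqrt(N)
    low, high = mean - Z_95 * sem, mean + Z_95 * sem
    if low <= TRUE_MEAN <= high:
        covered += 1

coverage = covered / REPEATS
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The observed share should sit near 95%, typically a little under because of the normal approximation at this sample size.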

Test Statistics • The Foundation of All Statistical Testing: • Don’t freak out, and keep in mind: • The test statistic is a number being used to describe some feature of a sample • Because it is derived from a random sample, the value of a test statistic will vary from one sample to another • Statistical significance is a comment on the probability of finding the value of a test statistic higher than what you’d expect if Effect < Error • A statistical ‘test’ is a comparison of a sample’s test statistic versus the distribution of that test statistic seen when Effect < Error • Accordingly, we’re going to need to rely on someone having figured out what the sampling distribution of a test statistic is when Effect < Error
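The comparison described here can be sketched by simulation instead of relying on a published sampling distribution. The example below assumes that "no effect" means both groups are drawn from the same normal population, uses a two-sample t statistic with unpooled variances, and invents one hypothetical experiment with a real difference; `welch_t` and `draw` are illustrative helper names.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

def welch_t(a, b):
    """Two-sample t statistic: difference in means divided by its standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

def draw(n, mean=0.0, sd=1.0):
    """Draw n simulated observations from a normal population."""
    return [random.gauss(mean, sd) for _ in range(n)]

# Sampling distribution of |t| when there is no effect:
# both groups come from the same population.
null_t = [abs(welch_t(draw(20), draw(20))) for _ in range(5000)]

# One hypothetical experiment in which the groups really do differ.
observed = abs(welch_t(draw(20, mean=1.0), draw(20)))

# Share of no-effect |t| values at least as large as the observed one.
p_value = sum(t >= observed for t in null_t) / len(null_t)
print(f"|t| = {observed:.2f}, simulated p = {p_value:.4f}")
```

In practice the no-effect distribution is taken from theory (here, the t distribution) so that nobody has to rerun thousands of imaginary experiments.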

Test Statistics • Much, much more to come in later chapters
