A BRIEF INTRODUCTION TO STATISTICS WITH R

A BRIEF INTRODUCTION TO STATISTICS WITH R. Workshop Notes. http://www.cs.utsa.edu/~jroy/workshop Data is from the University of York project on variation in British liquids. JK Local, Alan Wrench, Paul Carter. References.


Presentation Transcript


  1. A BRIEF INTRODUCTION TO STATISTICS WITH R

  2. Workshop Notes • http://www.cs.utsa.edu/~jroy/workshop • Data is from the University of York project on variation in British liquids. • JK Local, Alan Wrench, Paul Carter

  3. References • Woods, Anthony, Paul Fletcher and Arthur Hughes. 1986. Statistics in Language Studies. Cambridge: Cambridge University Press. • Rietveld, Toni and Roeland van Hout. 2005. Statistics in Language Research: Analysis of Variance. New York: Mouton de Gruyter. • Dalgaard, Peter. 2002. Introductory Statistics with R. New York: Springer. • Venables, W.N. and B.D. Ripley. 2002. Modern Applied Statistics with S. New York: Springer.

  4. Modeling Data • Statistical Modeling • Randomness • Main Focus: Account for the randomness so that significant patterns can be seen. • Other types: Deterministic (mathematical), Information Theoretic.

  5. Types of Data • Continuous [Interval] • Formant Frequency • Continuous in theory [not necessarily in practice] • Discrete • Counts (e.g. # of students) • Binary (e.g. /t,d/ or Ø) • Rates (e.g. 3 per 2 hours) • Non-numerical • Ordinal (some implied order) • High, Medium, Low • A+, A, A-, B+, B, C+, C, D, F • Non-ordinal (no intrinsic order) • Sex

  6. Probability • [0,1] • Probabilities cannot be larger than 1 (or 100%) or less than 0 (or 0%) • If there are only three possible, mutually exclusive events (E1, E2 and E3), their probabilities sum to one. • P(E1) + P(E2) + P(E3) = 1

  7. Probability Functions • Mathematical functions of probability • For discrete variables probability is fairly straightforward to calculate, but for continuous variables you have to use calculus. • Continuous variables often have lookup tables that make this easier.

  8. Binomial • If we have n events that can be classified as either success or failure and success has the same chance for each event, then we can use a binomial probability distribution as the model. • Examples of binomial variables: • Tossing a coin {Heads or Tails} • -t,d deletion {/t,d/, Ø}

  9. Binomial Distribution • $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$ • p(x) = probability of x successes • p = probability of a success • n = number of trials • x = number of successes.

  10. Example of Binomial • Flipping a quarter 10 times. Each flip has a .5 probability of success. • X (the number of heads) is unknown. • What is the probability of 5 heads? P(X=5) = 10!/(5!5!) * .5^5 * (1-.5)^(10-5) = 252/1024 ≈ .246
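
The probability of 5 heads in 10 fair flips can also be computed in R. The following is a minimal sketch using base R's `choose` and `dbinom`; the variable names are illustrative and do not come from the workshop notes.

```r
# Probability of exactly 5 heads in 10 fair coin flips.
n <- 10   # number of trials
x <- 5    # number of successes
p <- 0.5  # probability of success on each trial

# Binomial formula written out: choose(n, x) * p^x * (1 - p)^(n - x)
by_hand <- choose(n, x) * p^x * (1 - p)^(n - x)

# The same quantity from R's built-in binomial density
built_in <- dbinom(x, size = n, prob = p)

print(by_hand)   # equals 252 / 1024
print(built_in)
```

Both values agree, which is a quick way to check a hand calculation against the built-in distribution functions.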

  11. Normal Distribution • [Bell Curve]: Probability is symmetric about a mean. • Examples of variables assumed to be normal: • Grades • IQ • Income • Two population parameters: Mean (µ) and Variance (σ²) [we usually write X~N(µ, σ²), read as “X is distributed normally with mean mu and variance sigma squared”] • Continuity creates a problem for measuring probability (e.g. P(IQ = 100) is 0, but P(70<IQ<120)=.8) • σ is the scale parameter and represents the width of the distribution. • µ is the location parameter and represents the center of the distribution.

  12. Standard Normal • Usually, we try to reduce a normal variable to its standardized form. • This standardized form is usually referred to with the variable z and is distributed N(0,1).
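
As a rough illustration of standardizing, the sketch below converts an interval on a normal variable to z-scores and looks up the probability with `pnorm`. The mean of 100 and standard deviation of 15 for IQ are assumptions made for this example only; the slides do not state them, so the result need not match any rounded figure quoted elsewhere.

```r
# ASSUMPTION (not from the slides): IQ ~ N(mean = 100, sd = 15).
mu <- 100
sigma <- 15

# Standardize the interval endpoints: z = (x - mu) / sigma
z_low <- (70 - mu) / sigma
z_high <- (120 - mu) / sigma

# P(70 < IQ < 120) from the standard normal CDF, N(0, 1)
p_std <- pnorm(z_high) - pnorm(z_low)

# The same probability, letting pnorm do the standardizing
p_direct <- pnorm(120, mean = mu, sd = sigma) - pnorm(70, mean = mu, sd = sigma)

print(c(z_low = z_low, z_high = z_high))
print(p_std)
print(p_direct)
```

Working on the z scale is what makes a single lookup table usable for every normal variable.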

  13. Normal Distributions

  14. Normal Distribution Table P(0 ≤ z ≤ a)

  15. Measures of Central Tendency • Mean -- Intuitively, the average value of a variable. (The average GPA of the undergrad students is 7.0) • This can be skewed (if 4 people have a GPA of 9.5 and one person has a 5.0, the mean is 8.6) • Median -- The value of x that divides the total probability into .5 on both sides.

  16. Measures of Spread • Range: The distance between the highest and the lowest values. • Variance: The average squared distance of all x from the mean, weighted by p(x). • Standard Deviation: The square root of the variance.
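
The measures of central tendency and spread can be tried out on the five-person GPA example (four values of 9.5 and one of 5.0). This is a small sketch with base R functions; note that R's `var` and `sd` compute the sample versions, dividing by n - 1.

```r
# GPA example: four people at 9.5 and one at 5.0
gpa <- c(9.5, 9.5, 9.5, 9.5, 5.0)

gpa_mean   <- mean(gpa)         # pulled down by the single low value
gpa_median <- median(gpa)       # middle value, unaffected by the low value
gpa_range  <- diff(range(gpa))  # highest minus lowest
gpa_var    <- var(gpa)          # sample variance (divides by n - 1)
gpa_sd     <- sd(gpa)           # square root of the sample variance

print(c(mean = gpa_mean, median = gpa_median, range = gpa_range,
        var = gpa_var, sd = gpa_sd))
```

The gap between the mean (8.6) and the median (9.5) shows how one extreme value skews the mean.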

  17. Population Parameters vs Sample Statistics • Each of the previous measures exists for probability distributions. These are usually referred to as population parameters. • Sample statistics are calculated from random samples of the population.

  18. Sample Statistics • Descriptive Statistics • Sample Mean • Sample Median • Sample Variance • Sample Standard Deviation

  19. Sample Statistics • Inferential Statistics • Testing hypotheses about the population from which the sample(s) originated. • Forming intervals that describe the possible values of population parameters based on a sample • Provides a framework for interpreting samples in a consistent methodical manner.

  20. Hypothesis Testing • Formulate the hypothesis as a null and an alternative hypothesis. • Suppose I want to test whether the population mean for men of the first formant of /l/ at the first measure differs from 455. • H0: µ=455 • HA: µ≠455 • Select an alpha value • Alpha is the probability of rejecting the null hypothesis when the null hypothesis is true. • Usually this is .05 (depending on what you're doing, alpha values of up to .10 may not be unreasonable)

  21. Errors

  22. Hypothesis Testing • From your alpha value, you can select your critical region (or rejection region). • This is usually done from a look-up table (or computer program). • Calculate the test statistic, $T = (\bar{x} - \mu_0) / (\sigma / \sqrt{n})$. We are going to assume we know the population variance. Since we have more than 30 measurements, we can use this test.

  23. Hypothesis Testing • T=1.78 • The rejection region for α=.05 (since we are doing a two-sided test) is bounded by Z(.975) = 1.96 and Z(.025) = -1.96. • .05/2 = .025; 1-.025 = .975 and 0+.025 = .025 • We reject the null hypothesis if T>Z(.975) or T<Z(.025). • Neither is true, so we fail to reject the null hypothesis.

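  24. P-values • A p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one actually observed. Low p-values indicate greater statistical significance. • For our data, the area in one tail beyond T=1.78 is .0375, so the two-sided p-value is about .075.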
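
The critical values and tail areas for the two-sided test with T = 1.78 can be reproduced with `qnorm` and `pnorm` instead of a look-up table. The following is a minimal sketch; the variable names are illustrative.

```r
alpha <- 0.05
stat <- 1.78  # test statistic reported on the slides

# Two-sided test: split alpha across both tails
crit_upper <- qnorm(1 - alpha / 2)  # Z(.975), about 1.96
crit_lower <- qnorm(alpha / 2)      # Z(.025), about -1.96

# Reject only if the statistic falls outside the critical values
reject <- stat > crit_upper || stat < crit_lower

# Area in the upper tail beyond the statistic, and the two-sided p-value
one_tail <- 1 - pnorm(stat)
p_two_sided <- 2 * one_tail

print(c(lower = crit_lower, upper = crit_upper))
print(reject)
print(c(one_tail = one_tail, two_sided = p_two_sided))
```

Because the two-sided p-value exceeds alpha, this agrees with the decision not to reject the null hypothesis.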

  25. Hypothesis Testing
      One Sample t-test
      data: york.male$F1.0
      t = 1.7859, df = 159, p-value = 0.07602
      alternative hypothesis: true mean is not equal to 455
      95 percent confidence interval: 452.3576 507.5638
      sample estimates: mean of x 479.9607
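
Output of this kind comes from R's `t.test`. The York measurements are not reproduced in these notes, so the sketch below runs the same test on simulated values; the vector `f1`, its size, and its mean and standard deviation are hypothetical stand-ins, and the numbers will not match the slide.

```r
# HYPOTHETICAL data: 160 simulated values standing in for york.male$F1.0.
set.seed(1)
f1 <- rnorm(160, mean = 480, sd = 175)

# One-sample t-test of H0: mu = 455 (two-sided by default)
result <- t.test(f1, mu = 455)
print(result)

# Individual pieces of the result can be extracted by name
t_stat  <- unname(result$statistic)  # t value
deg_fr  <- unname(result$parameter)  # degrees of freedom, n - 1
p_value <- result$p.value
ci      <- result$conf.int           # 95% confidence interval for the mean
```

The t statistic is the sample mean minus 455, divided by the estimated standard error `sd(f1) / sqrt(length(f1))`.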

  26. T-tests • Suppose I want to test the hypothesis that men and women are different in their production of /l/ for the first measurement of the first formant. • What is the null? • What test should I use? [I have a large sample and I assume equal variances] • df = n1 + n2 − 2

  27. T-tests
      Two Sample t-test
      data: york.data$F1.0 by york.data$Sex
      t = 3.9047, df = 318, p-value = 0.0001152
      alternative hypothesis: true difference in means is not equal to 0
      95 percent confidence interval: 48.68323 147.56921
      sample estimates: mean in group female 578.0869, mean in group male 479.9607
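
A pooled-variance two-sample test like this one can be requested from `t.test` with `var.equal = TRUE`. The sketch below uses a simulated data frame, `york_sim`, as a hypothetical stand-in for the York data; only the column names `F1.0` and `Sex` are taken from the output above, and the group means and standard deviations are assumptions.

```r
# HYPOTHETICAL data frame standing in for york.data (160 speakers per group).
set.seed(2)
york_sim <- data.frame(
  Sex  = rep(c("female", "male"), each = 160),
  F1.0 = c(rnorm(160, mean = 580, sd = 260),
           rnorm(160, mean = 480, sd = 175))
)

# Two-sample t-test assuming equal variances (pooled variance estimate)
two_sample <- t.test(F1.0 ~ Sex, data = york_sim, var.equal = TRUE)
print(two_sample)

# Degrees of freedom are n1 + n2 - 2
deg_fr <- unname(two_sample$parameter)
print(deg_fr)
```

Leaving out `var.equal = TRUE` makes R fall back to its default, the Welch test, which does not assume equal variances and usually reports fractional degrees of freedom.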

  28. Confidence Interval • We estimate CIs for population parameters based on the data and a set of assumptions.

  29. Confidence Interval • For Men, we get that a 95% CI for the mean value of F1.0 is (452.3576, 507.5638)

  30. Interpretation of 95% CI • Many people want to say that a 95% confidence interval means that there is a 95% chance that the confidence interval contains the population mean. But any particular confidence interval either contains the population mean, or it doesn’t. The confidence interval shouldn’t be interpreted as a probability. • If samples of the same size are drawn repeatedly from a population, and a confidence interval is calculated from each sample, then 95% of these intervals should contain the population mean.
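
The repeated-sampling interpretation can be checked by simulation. The sketch below draws many samples from a population with a known mean, computes a 95% interval from each, and counts how often the interval contains the true mean. The population values, sample size and number of repetitions are arbitrary choices for illustration.

```r
# ASSUMED population and simulation settings, chosen only for illustration.
set.seed(3)
true_mean <- 480
true_sd   <- 175
n         <- 160   # size of each sample
reps      <- 2000  # number of repeated samples

# For each sample, record whether its 95% CI contains the true mean
covers <- replicate(reps, {
  s  <- rnorm(n, mean = true_mean, sd = true_sd)
  ci <- t.test(s)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that contain the true mean
coverage <- mean(covers)
print(coverage)
```

The proportion should land close to 0.95, while each individual interval either contains the true mean or does not.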

  31. Assumption of Equal Variance • We can test for equal variances in the same manner we test for equal means.
      F test to compare two variances
      data: york.data$F1.0 by york.data$Sex
      F = 2.2331, num df = 159, denom df = 159, p-value = 5.986e-07
      alternative hypothesis: true ratio of variances is not equal to 1
      95 percent confidence interval: 1.634604 3.050717
      sample estimates: ratio of variances 2.233096
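
An F test of this form is available in R as `var.test`. The sketch below applies it to a simulated data frame; `york_sim` and its group means and standard deviations are hypothetical stand-ins for the York data, so the numbers will differ from the output above.

```r
# HYPOTHETICAL data frame standing in for york.data (160 speakers per group).
set.seed(5)
york_sim <- data.frame(
  Sex  = rep(c("female", "male"), each = 160),
  F1.0 = c(rnorm(160, mean = 580, sd = 260),
           rnorm(160, mean = 480, sd = 175))
)

# F test of H0: the ratio of the two group variances equals 1
f_result <- var.test(F1.0 ~ Sex, data = york_sim)
print(f_result)

# The F statistic is the ratio of the two sample variances
female_var <- var(york_sim$F1.0[york_sim$Sex == "female"])
male_var   <- var(york_sim$F1.0[york_sim$Sex == "male"])
print(female_var / male_var)
```

When this test rejects equal variances, one option is the Welch t-test, which is what `t.test` runs by default when `var.equal` is not set.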

  32. Correlation • Many times we are not interested in the differences between two groups, but instead the relationship between two variables on the same set of subjects. • Ex: Are post-graduate salary and gpa related? • Ex: Is the F1.0 measurement related to the F1.1 measurement? • Correlation is a measurement of LINEAR dependence. Non-linear dependencies have to be modeled in a separate manner.

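  33. Correlation • There is a theoretical correlation, usually represented by ρX,Y • We can calculate the sample correlation between two variables (x,y) with the Pearson coefficient, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$ • This will vary between -1.0 and 1.0: the sign indicates the direction of the relationship and the magnitude indicates its strength.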

  34. Correlation
      Pearson's product-moment correlation
      data: york.data$F1.0 and york.data$F1.1
      t = 45.9262, df = 318, p-value < 2.2e-16
      alternative hypothesis: true correlation is not equal to 0
      95 percent confidence interval: 0.9161942 0.9452264
      sample estimates: cor 0.932194
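
The corresponding R function is `cor.test`. The sketch below runs it on two simulated, strongly related measurements; the vectors `f1_0` and `f1_1` and the parameters used to generate them are hypothetical, so the values will not match the output above.

```r
# HYPOTHETICAL data: two related measurements on the same 320 subjects.
set.seed(4)
f1_0 <- rnorm(320, mean = 530, sd = 220)
f1_1 <- f1_0 + rnorm(320, mean = 0, sd = 85)  # second measure = first + noise

# Pearson correlation with a test of H0: true correlation is 0
ct <- cor.test(f1_0, f1_1)
print(ct)

r      <- unname(ct$estimate)   # sample correlation
deg_fr <- unname(ct$parameter)  # degrees of freedom, n - 2

# The reported t statistic follows from r and the degrees of freedom
t_by_hand <- r * sqrt(deg_fr) / sqrt(1 - r^2)
print(c(r = r, t = t_by_hand))
```

The same relation, t = r * sqrt(df) / sqrt(1 - r^2), connects the `cor` and `t` values in any Pearson `cor.test` output.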

  35. Now to R
