Statistics for Linguistics Students

Syllabus

Assessment • 10% participation • 90% final exam

Course Books • Woods, A., Fletcher, P., and Hughes, A. 1986. Statistics in Language Studies. Cambridge: Cambridge University Press. • Butler, C. 1985. Statistics in Linguistics. New York: Basil Blackwell. • Li, Shaoshan (李绍山). 2001. 《语言研究中的统计学》 [Statistics in Language Research]. Xi'an: Xi'an Jiaotong University Press.

References • Brown, J. D. 1988. Understanding Research in Second Language Learning. Cambridge: Cambridge University Press. • Brown, J. D. 2002. Doing Second Language Research. Oxford: Oxford University Press. • Hatch, E. and Lazaraton, A. 1991. The Research Manual: Design and Statistics for Applied Linguistics. New York: Newbury House Publishers. • Hinton, P. 2004. Statistics Explained. London: Routledge. • Muijs, D. 2004. Doing Quantitative Research in Education with SPSS. London: Sage Publications.

1. Introduction Overview of research methodology

The State of the Art • In terms of research domains, 35% of the studies dealt with foreign language teaching, 44% with various branches of linguistics, and 21% with theoretical studies. • In terms of research methods, 54% of the studies were argumentative, or speculative with a few examples; 20% were descriptive; 13% were introductory; 10% were theoretical; and 3% were experimental.

The State of the Art • In terms of data types, 80% of the studies were not data-driven, 12% enumerated data without statistical measures, and only 7% used data analyzed by statistical means. • A simple conclusion: teachers of foreign languages in China do not seem to have adequate methodology for studying their areas of interest. They are least familiar with quantitative and statistical methods.

Statistical Software • Excel • SPSS (Statistical Package for the Social Sciences) • Statistica • NCSS

Statistics and Linguistic Studies

2. Describing Variables

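Data • Data types • Nominal data • Naming, not measurement • Mother tongue, sex, social status • Ordinal data • Ordering; the size of the difference between ranks is unknown • Class rank: the gap between ranks 1 and 2 need not equal the gap between ranks 8 and 9 • Interval data • The intervals between values are equal • Exam scores: 95 − 90 = 70 − 65 • Ratio data • Has an absolute zero point • Height, time, distance / temperature • Rarely used in linguistics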

Nominal Data • The question of interest is whether the pass probabilities for CAI (computer-assisted instruction) and Tutoring are the same. • H0: the two variables are independent. H1: they are not independent. • Explanatory (independent) variable: teaching method. Response (dependent) variable: pass or fail.
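The slide's table of counts did not survive, and the slide does not name a test. As a rough sketch only, the following assumes a Pearson chi-square test of independence and uses invented counts (the dictionary `observed` and the function `chi_square_independence` are hypothetical names for illustration). It shows how the expected counts under H0 follow from the row and column totals.

```python
# Hypothetical 2x2 table: the counts below are invented for illustration only.
# Rows = teaching method (explanatory), columns = outcome (response).
observed = {
    "CAI": {"pass": 30, "fail": 20},
    "Tutoring": {"pass": 40, "fail": 10},
}


def chi_square_independence(table):
    """Return (chi2, expected) for a two-way table of counts."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_totals.values())
    # Under H0 (independence): expected = row total * column total / N.
    expected = {
        r: {c: row_totals[r] * col_totals[c] / n for c in cols} for r in rows
    }
    chi2 = sum(
        (table[r][c] - expected[r][c]) ** 2 / expected[r][c]
        for r in rows
        for c in cols
    )
    return chi2, expected


chi2, expected = chi_square_independence(observed)
print(f"chi-square = {chi2:.2f}")
```

With one degree of freedom, the statistic would be compared against the critical value 3.84 at the 0.05 level; a larger value counts as evidence against independence.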

Ordinal Data The question of interest is how these two ordinal variables are related to each other.

Data • Transformations of English final exam score data: • Test scores (interval) → student ranks (ordinal) → high group / low group, i.e., students grouped by score (nominal)
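The chain from test scores to ranks to high and low groups can be sketched in a few lines. The student names and scores here are invented, and splitting the ranked list in half is just one possible grouping rule, assumed for illustration.

```python
# Hypothetical final exam scores (interval data); names and values are invented.
scores = {"An": 92, "Bo": 85, "Chen": 78, "Dai": 66, "Fang": 59, "Gao": 71}

# Interval -> ordinal: rank students from the highest score (rank 1) down.
ordered = sorted(scores, key=scores.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

# Ordinal -> nominal: split the ranked list into a high and a low group.
half = len(ordered) // 2
groups = {name: ("high" if ranks[name] <= half else "low") for name in ordered}

for name in ordered:
    print(name, scores[name], ranks[name], groups[name])
```

Each step discards information: ranks no longer record how far apart two scores are, and group labels no longer record the order within a group.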

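Data • Continuous and discrete data • Continuous: can take any value within a range, with no limit on precision, e.g., test scores, the time taken to read a sentence aloud • Discrete: can take only certain values, with no finer subdivision between units, e.g., the number of letters in a word • Frequency data and score data • Frequency: how many times a variable occurs (nominal) • Score: how much of a variable there is (ordinal, interval)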

Data

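Variables • Dependent variable (DV) • The main variable measured in the study • Independent variable (IV) • A variable that affects the dependent variable • Moderator variable • A secondary independent variable • Control variable • A variable held constant in the study; it is not under investigation but could affect the results • Intervening variable • Similar to a moderator variable, except that it is abstract and cannot be precisely identified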

Variables

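Variables • Studying the relationship between hours of instruction and language proficiency: • Independent variable: hours of instruction • Dependent variable: language proficiency • Moderator variable: age • Control variable: sex • Intervening variable: learning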

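Variables • Under Tuckman's supervision, Orefice tried four different methods with students of two personality types (abstract and concrete): (1) self-study with audiotapes and a manual; (2) programmed instruction in the classroom; (3) programmed instruction plus classroom lectures; (4) traditional classroom lectures plus discussion. His hypothesis was that abstract students would take more readily to methods (1) and (2), spending less time and achieving better results, while concrete students would take more readily to methods (3) and (4). In the experiment as a whole: • Independent variable: teaching method, with four levels. • Moderator variable: personality type, with two levels. • Control variables: not mentioned in the hypothesis; possibly the teaching content, class size, and the students' age and sex. Reading level may also be an important control variable. • Intervening variable: the relationship between the classroom organization of the four teaching methods and the different types of students. • Dependent variables: student achievement, time spent, and how much the students liked the teaching method.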

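Populations, Samples, and Random Sampling • Population, sampling frame, random sampling • Population • Any collection of individuals, or target group, that is the object of study • Sample • A subset of individuals drawn from the population for study • Random sampling • Drawing lots • Random number tables • Computer-generated random numbers • Stratified random sampling • Multistage sampling • Population parameter vs. sample statistic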
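Computer-generated random numbers make both simple and stratified random sampling easy to try out. The sketch below uses Python's standard `random` module. The sampling frame (60 undergraduates and 40 postgraduates), the 10% sampling fraction, and the helper `stratified_sample` are all assumptions made up for illustration.

```python
import random

# Hypothetical sampling frame: 60 undergraduates (UG) and 40 postgraduates (PG).
frame = [("UG", i) for i in range(60)] + [("PG", i) for i in range(40)]

rng = random.Random(42)  # fixed seed so the draw can be reproduced

# Simple random sampling: every individual has the same chance of selection.
simple_sample = rng.sample(frame, 10)


def stratified_sample(frame, fraction, rng):
    """Draw the same fraction at random from each stratum of the frame."""
    strata = {}
    for stratum, ident in frame:
        strata.setdefault(stratum, []).append((stratum, ident))
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


stratified = stratified_sample(frame, 0.10, rng)
print(len(simple_sample), len(stratified))
```

The stratified draw always contains 6 undergraduates and 4 postgraduates, whereas the simple random sample matches those proportions only on average.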

Populations, Samples, and Random Sampling

Descriptive and inferential statistics • Descriptive statistics (samples) • Summarize or describe the observed results • Inferential statistics (populations) • Use the observed results to estimate or predict, drawing inferences about situations that were not observed • 1. On average, I read at the speed of about 100 words per minute. • 2. We can expect a lot of rain at this time of year. • 3. The earlier you start revising, the better you are likely to do in the exam.

Descriptive and inferential statistics • [Diagram: Population (parameter) → sampling → Sample (statistic). Descriptive statistics: calculation on the sample. Inferential statistics: estimation and inference from the sample back to the population.]

Homework 1. Identify the function of each variable: • The following example is drawn from unpublished research projects of ESL teachers. • Lennon (1986) wanted to develop some ESL materials to help university students learn how to express uncertainty in seminar discussions. However, he could find no descriptions that said how native speakers carry out this specific speech function. In his research project, then, he tape-recorded seminars and abstracted all the uncertainty expressions uttered by teachers and students. He categorized these expressions into five major types plus one "other" category. First, the subject variable is status with two levels. The variable type is nominal. If he assigned numbers to the levels of a variable, these numbers would have no arithmetic value. Assume that a 1 was arbitrarily assigned to one level and a 2 to the other. If he added all the 1s for this variable, he would obtain the frequency for the number of students in the study. If he added the 2s, this would be the frequency for the number of teachers in the study. • The second variable in the research is uncertainty expressions. It, too, is a nominal variable. If Lennon found six hedge types in the data, he could assign numbers to identify each of the six hedge types. The numbers would represent the six levels. The total number of instances of each would be tallied from the data.

2. To be certain that the distinction between frequency counts and interval/ordinal scores is clear, work through each of the following situations. • Brusasco (1984) wanted to compare how much information translators could give under two different conditions: when they could stop the tape they were translating by using the pause button and when they could not. There were a number of information units in the taped text. What are the variables? If the number of information units is totaled under each condition, are these frequencies or scores (e.g., how often or how much)? • Li (1986) wondered whether technical vocabulary proved to be a source of difficulty for Chinese students reading science texts in English. He asked 60 Chinese students to read a text on cybernetics and underline any word they weren't sure of. Each underlined word was then categorized as ±technical. Each S, then, had a percent of total problem words which were technical in nature. Among the 60 students, 20 were in engineering, 20 in physics, and 20 in geology. What are the variables? Are the data frequencies or interval/ordinal scores? If Li had compared the number (rather than the percentage) of ±technical words, would your answers change? • The Second Language Research Forum is a conference traditionally run by graduate students of applied linguistics. Papers for the conference are selected by the graduate students. The chair of the conference wondered if M.A. and Ph.D. students rated abstracts for the conference in the same way. Papers were rated on five different criteria using a 5-point scale (with 5 being high). Each paper, then, had the possibility of 0 to 25 points. What are the variables? Are the data frequencies or interval/ordinal scores?

3. Describing data

Frequency distribution • An organized tabulation of the number of individual scores located in each category on the scale of measurement. • Structured either as a table or as a graph, both present the same two elements: • The set of categories that make up the original measurement scale • A record of the frequency, or number of individuals in each category

Frequency distribution tables • 8, 9, 8, 7, 10, 9, 6, 4, 9, 8, 7, 8, 10, 9, 8, 6, 9, 7, 8, 8 • Σfx = 158
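The table for these 20 scores can be rebuilt with a short script. This is only a sketch: it reads the entry between 8 and 7 near the end of the list as two scores, 6 and 9, which is the reading consistent with Σfx = 158. The column names X, f, and fX are the usual ones, assumed here.

```python
from collections import Counter

scores = [8, 9, 8, 7, 10, 9, 6, 4, 9, 8, 7, 8, 10, 9, 8, 6, 9, 7, 8, 8]

freq = Counter(scores)  # maps each score X to its frequency f

print(" X   f   fX")
# List every category from the highest to the lowest score, including
# categories with a frequency of zero (here X = 5).
for x in range(max(scores), min(scores) - 1, -1):
    print(f"{x:2d}  {freq[x]:2d}  {freq[x] * x:3d}")

n = sum(freq.values())                         # N = sum of f
sum_fx = sum(x * f for x, f in freq.items())   # sum of fX
print("N =", n, " sum fX =", sum_fx)
```

The sum of the f column gives N = 20, and the sum of the fX column reproduces the total of 158 without adding up the raw scores one by one.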

Frequency distribution tables • Proportions and percentages
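A proportion is the frequency divided by the number of scores, p = f/N, and the percentage is p × 100. As a small sketch, the following adds both columns to a frequency table, using the 20 scores from the earlier frequency table slide (the "6.9" in that list is read as 6, 9); the variable names are illustrative.

```python
from collections import Counter

scores = [8, 9, 8, 7, 10, 9, 6, 4, 9, 8, 7, 8, 10, 9, 8, 6, 9, 7, 8, 8]

n = len(scores)
freq = Counter(scores)

table = []  # rows of (X, f, p, percentage)
print(" X   f     p     %")
for x in sorted(freq, reverse=True):
    p = freq[x] / n            # proportion: p = f / N
    table.append((x, freq[x], p, 100 * p))
    print(f"{x:2d}  {freq[x]:2d}  {p:.2f}  {100 * p:3.0f}%")
```

The proportions always sum to 1 and the percentages to 100, which is a quick check on the table.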

Frequency distribution tables • Grouped frequency distribution tables • Four rules: • 1. Use about 10 class intervals, so that the data are easy to see and understand. • 2. The width of each class interval should be a relatively simple number, e.g., 2, 5, 10, or 20. • 3. Each interval should start with a score that is a multiple of the width, e.g., with a width of 10: 30, 40, 50. • 4. All intervals should be the same width, and together they should cover the range of scores completely, with no gaps and no overlaps.

Frequency distribution tables • 82, 75, 88, 93, 53, 84, 87, 58, 72, 94, 69, 84, 61, 91, 64, 87, 84, 70, 76, 89, 75, 80, 73, 78, 60 • 1. Range of scores: 53 to 94

Frequency distribution tables • 2. Interval width: 5
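Applying the four rules to the 25 scores gives the grouped table below. This sketch takes the 5 on the slide to be the interval width (an assumption) and starts the lowest interval at 50, the nearest multiple of 5 at or below the minimum score of 53.

```python
scores = [82, 75, 88, 93, 53, 84, 87, 58, 72, 94, 69, 84, 61,
          91, 64, 87, 84, 70, 76, 89, 75, 80, 73, 78, 60]

width = 5  # assumed interval width

# Rule 3: each interval starts at a multiple of the width.
lowest_start = (min(scores) // width) * width    # 53 -> 50
highest_start = (max(scores) // width) * width   # 94 -> 90

grouped = []  # rows of (lower limit, upper limit, frequency), highest first
for lower in range(highest_start, lowest_start - 1, -width):
    upper = lower + width - 1
    f = sum(lower <= s <= upper for s in scores)
    grouped.append((lower, upper, f))
    print(f"{lower}-{upper}  {f}")
```

This yields nine intervals, from 90-94 down to 50-54, which is close to the ten suggested by rule 1, and the frequencies sum to 25.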

Frequency distribution tables • Continuous variables and real limits • Real limits are the boundaries of intervals for scores that are represented on a continuous number line. The real limit separating two adjacent scores is located exactly halfway between the scores. Each score has two real limits. The upper real limit is at the top of the interval, and the lower real limit is at the bottom.
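For scores measured to the nearest whole number, the real limits lie half a unit below and above the score, and the same holds for the stated limits of a class interval. A minimal sketch, with `real_limits` as a hypothetical helper name:

```python
def real_limits(lower, upper, unit=1):
    """Return the real limits of a score or class interval.

    `lower` and `upper` are the apparent limits (equal for a single score);
    `unit` is the measurement unit, assumed to be 1 by default.
    """
    half = unit / 2
    return lower - half, upper + half


print(real_limits(8, 8))     # a single score, X = 8
print(real_limits(50, 54))   # the class interval 50-54
```

The score 8 occupies the interval from 7.5 to 8.5, and the class interval 50-54 runs from 49.5 to 54.5, so its width is 5 even though 54 − 50 = 4.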

Frequency distribution tables

Frequency distribution graphs • Histograms and bar graphs • Histograms • The height of the bar corresponds to the frequency • The width of the bar extends to the real limits of the score • Bar graphs • The height of the bar corresponds to the frequency • There is a space separating each bar from the next.

Frequency distribution graphs

Frequency distribution graphs • Polygons • The dot is centered above the score. • The vertical location (height) of the dot corresponds to the frequency.

Frequency distribution graphs

Homework
