This course provides an introduction to common ways of presenting statistics, such as tables, charts, and graphs. It also covers measures of central tendency and measures of dispersion for both discrete and continuous variables.
Introduction to Quantitative Data Analysis (continued) Reading on Quantitative Data Analysis: Baxter and Babbie, 2004, Chapter 11. Course website: http://www.sfu.ca/cmns/faculty/marontate_j/260/07-spring/ Audio recordings of Thursday lectures available on-line (for students registered in the course) at www.sfu.ca/lectures
Last Day: Beginning of Quantitative Data Analysis • Introduction to Common Ways of Presenting Statistics & Importance for Analysis (descriptive statistics) • Tables • Charts • Graphs • Univariate Statistics • Measures of Central Tendency • Measures of Dispersion
Discrete & Continuous Variables • Continuous • Variable can take infinite (or large) number of values within range • Ex. Age measured by exact date of birth • Discrete • Attributes of variable that are distinct but not necessarily continuous • Ex. Age measured by age groups (Note: techniques exist for making assumptions about discrete variables in order to use techniques developed for continuous variables)
Core Notions in Basic Univariate Statistics • Ways of describing data about one variable (“uni”=one) • Measures of central tendency • Summarize information about one variable (“averages”) • Measures of dispersion • Variations or “spread”
Mode • most common or most frequently occurring category or value (for all types of data) Babbie (1995: 378)
Bimodal • When there are two “most common” values that are almost the same (or the same)
Median • middle point of rank-ordered list of all values (only for ordinal, interval or ratio data) Babbie (1995: 378)
Mean (arithmetic mean) • Arithmetic “average” = sum of values divided by number of cases (only for ratio and interval data) Babbie (1995: 378)
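The three measures above can be sketched with Python's standard `statistics` module. The ages below are invented for illustration and are not from the course materials; they were chosen so that two values tie as the most common (a bimodal case).

```python
import statistics

# Hypothetical ages, invented for illustration (not from the slides).
ages = [18, 19, 19, 20, 22, 22, 41]

# Mode: the most common value(s); multimode returns every tied value.
mode_values = statistics.multimode(ages)

# Median: middle point of the rank-ordered list.
median_age = statistics.median(ages)

# Mean: sum of values divided by number of cases.
mean_age = statistics.mean(ages)

print("mode(s):", mode_values)
print("median:", median_age)
print("mean:", mean_age)
```

Here the single high value (41) pulls the mean above the median, which is one reason the choice of measure matters.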
Another Diagram of Normal Curve (Showing Ideal Random Sampling Distribution, Standard Deviation & Z-scores)
Normal Distribution & Measures of Central Tendency • Symmetric • Also called the “Bell Curve” Neuman (2000: 319)
Skewed Distributions & Measures of Central Tendency • Skewed to the left • Skewed to the right Neuman (2000: 319)
Why Measures of Central Tendency are not enough to describe distributions • 7 people at bus stop in front of bar aged 25,26,27,30,33,34,35 • median= 30, mean= 30 • 7 people in front of ice-cream parlour aged 5,10,20,30,40,50,55 • median= 30, mean= 30 • BUT issue of “spread” socially significant
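The comparison above can be checked with a short sketch. The two age lists are the ones on the slide; the helper name `summarize` is an assumption made for this example, and the range (largest minus smallest value) is used here as the simplest indicator of spread.

```python
import statistics

# Ages taken from the two groups described on the slide.
bar = [25, 26, 27, 30, 33, 34, 35]
ice_cream = [5, 10, 20, 30, 40, 50, 55]

def summarize(ages):
    """Return the mean, median and range of a list of ages."""
    return {
        "mean": statistics.mean(ages),
        "median": statistics.median(ages),
        "range": max(ages) - min(ages),
    }

bar_summary = summarize(bar)
ice_cream_summary = summarize(ice_cream)

print("bar:", bar_summary)
print("ice-cream parlour:", ice_cream_summary)
```

Both groups share the same mean and median, so only the range distinguishes them.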
Measures of Variation or Dispersion • range: distance between largest and smallest scores • standard deviation: for comparing distributions • percentiles: % up to and including the number (from below) • z-scores: for comparing individual scores taking into account the context of different distributions
Range & Interquartile range • distance between largest and smallest scores • what does a short distance between the scores tell us about the sample? • But problems of “outliers” or extreme values may occur
Interquartile range (IQR) • distance between the 75th percentile and the 25th percentile • range of the middle 50% (approximately) of the data • Eliminates problem of outliers or extreme values • Example from StatCan website (11 in sample) • Data set: 6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36 • Ordered data set: 6, 7, 15, 36, 39, 41, 41, 43, 43, 47, 49 • Median: 41 • Upper quartile: 43 (median of the five values above the median) • Lower quartile: 15 (median of the five values below the median) • IQR = 43 − 15 = 28
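A minimal sketch of the quartile calculation for the StatCan data set, assuming the "median of each half" method (the overall median is excluded from both halves when the number of cases is odd). Other quartile conventions exist and can give slightly different values; the function name `quartiles_by_halves` is hypothetical.

```python
import statistics

# Data set from the slide (11 cases).
data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]

def quartiles_by_halves(values):
    """Lower quartile, median, upper quartile via the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd number of cases, skip the middle value itself.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

q1, median, q3 = quartiles_by_halves(data)
iqr = q3 - q1

print("lower quartile:", q1)
print("median:", median)
print("upper quartile:", q3)
print("IQR:", iqr)
```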
Standard Deviation and Variance • Interquartile range eliminates problem of outliers BUT ignores half the data • Solution? Measure variability from the center of the distribution. • Standard deviation & variance measure how far, on average, scores deviate or differ from the mean.
Calculation of Standard Deviation (diagram with eight numbered steps) Neuman (2000: 321)
Calculation of Standard Deviation Neuman (2000: 321)
Standard Deviation Formula Neuman (2000: 321)
Details on the Calculation of Standard Deviation Neuman (2000: 321)
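The calculation can be sketched step by step as follows. This is an illustration rather than a reproduction of Neuman's worked example: it uses the ice-cream parlour ages from the earlier slide and assumes the population form of the formula (dividing by the number of cases N rather than N − 1).

```python
import math

# Ages of the seven people in front of the ice-cream parlour.
ages = [5, 10, 20, 30, 40, 50, 55]

# Step 1: compute the mean.
mean = sum(ages) / len(ages)

# Step 2: subtract the mean from each score to get the deviations.
deviations = [age - mean for age in ages]

# Step 3: square each deviation.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: sum the squared deviations and divide by the number of cases.
# Assumption: population formula (divide by N, not N - 1).
variance = sum(squared_deviations) / len(ages)

# Step 5: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print("mean:", mean)
print("variance:", round(variance, 1))
print("standard deviation:", round(std_dev, 1))
```

Dividing by N − 1 instead (the sample formula) would give a somewhat larger value for a group this small.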
Discussion of Preceding Diagram • “Many biological, psychological and social phenomena occur in the population in the distribution we call the bell curve (Portney & Watkins, 2000).” link to source • Preceding picture • “a symmetrical bell curve, • average score [i.e., the mean] in the middle, where the ‘bell’ shape [is] tallest. • Most of the people [i.e., 68% of them, or 34% + 34%] have performance within 1 segment [i.e., a standard deviation] of the average score.”
Interpreting Standard Deviation • amount of variation from mean • Illustration: high & low standard deviation • meaning depends on exact case
Recall: Central Tendency & Dispersion (description of distributions) • 7 people at bus stop in front of bar aged 25,26,27,30,33,34,35 • median= 30, mean= 30 • Range= 10, standard deviation=3.8 • 7 people in front of ice-cream parlour aged 5,10,20,30,40,50,55 • median= 30, mean= 30 • Range= 50, standard deviation=17.9
Other ways of characterizing dispersion or spread • Techniques for understanding position of a case (or group of cases) in the context of all cases • Percentiles • Standard Scores • z-scores
Percentile • First rank the scores, then choose a score and work out the percentage of scores equal to or less than it • Link to more complex definition of percentile • % up to and including the number (from below) • “A percentile rank is typically defined as the proportion of scores in a distribution that a specific score is greater than or equal to. For instance, if you received a score of 95 on a math test and this score was greater than or equal to the scores of 88% of the students taking the test, then your percentile rank would be 88. You would be in the 88th percentile” • Also used in other ways (for example to eliminate cases)
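A minimal sketch of the percentile rank as defined in the quotation above: the percentage of scores that a given score is greater than or equal to. The test scores and the function name `percentile_rank` are invented for illustration.

```python
def percentile_rank(scores, score):
    """Percentage of scores that are less than or equal to `score`."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical test scores for ten students (not from the slides).
scores = [55, 60, 62, 70, 71, 75, 80, 88, 95, 99]

rank_of_95 = percentile_rank(scores, 95)
print("percentile rank of a score of 95:", rank_of_95)
```

In this made-up class, nine of the ten scores are at or below 95, so that score sits at the 90th percentile.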
z-scores • For understanding how a score is positioned in the data set • to enable comparisons with other scores from other data sets • (comparing individual scores in different distributions) • example of two students from different schools with different GPAs • comparing sample distributions to population. How representative is sample to population under study? (Link to more complete discussion of use of z-scores to understand sampling distribution)
Calculating Z-Scores • z-score=(score – sample mean)/standard deviation of set • Link to formula • Link to z-score calculator
Using Z-scores to compare two students from different schools: A • Susan with GPA of 3.62 and Jorge with GPA of 3.64 • Susan from College A • Susan’s Grade Point Average = 3.62 • Mean GPA = 2.62 • SD = .50 • Susan’s z-score = (3.62 − 2.62)/.50 = 1.00/.50 = 2 • Susan’s grade is two standard deviations above the mean at her school
Using Z-scores to compare two students from different schools: B • Jorge from College B • Jorge’s GPA = 3.64 • Mean GPA = 3.24 • SD = .40 • Jorge’s z-score = (3.64 − 3.24)/.40 = .40/.40 = 1 • Jorge’s grade is one standard deviation above the mean at his school • Susan’s absolute grade is lower but her position relative to other students at her school is much higher than Jorge’s position at his school
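The comparison of the two students can be reproduced with a few lines of code. The GPAs, means and standard deviations are the ones given on the slides; the function name `z_score` is an assumption for this sketch.

```python
def z_score(score, mean, sd):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / sd

# Susan, College A: GPA 3.62, school mean 2.62, SD 0.50.
susan_z = z_score(3.62, 2.62, 0.50)

# Jorge, College B: GPA 3.64, school mean 3.24, SD 0.40.
jorge_z = z_score(3.64, 3.24, 0.40)

# Rounded because floating-point subtraction is not exact.
print("Susan:", round(susan_z, 2))
print("Jorge:", round(jorge_z, 2))
```

Jorge has the higher raw GPA, but Susan has the higher z-score, which is the point of standardizing before comparing across schools.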
Another Diagram of Normal Curve with Standard Deviation & Z-scores
Discussion of Previous Case • Relationship of sampling distribution to population (use mean of sample to estimate mean of population)
Recall: Results with two Variables-- Bivariate Statistics • Statistical relationships between two variables • Covariation (vary together): a type of association • Not necessarily causal • Independence (Null hypothesis): no relationship between the two variables • Cases with values in one variable do not have any particular value on the other variable
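Standard Error (recall tutorial task about average ages in family) • Calculate the mean for every possible sample • Sum the sample means and divide by the number of samples (the mean of the sampling distribution) • Standard error measures the variability of the sample means around that value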
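The idea can be illustrated on a very small population, where listing every possible sample is feasible. The family ages below are invented (they are not the tutorial data), the samples are assumed to be of size 2 drawn without replacement, and the standard error is taken as the standard deviation of the resulting sample means.

```python
import itertools
import statistics

# Hypothetical ages in a family of four (not the tutorial data).
population = [4, 8, 36, 40]
sample_size = 2  # assumption: samples of two, drawn without replacement

# Every possible sample of the chosen size.
samples = list(itertools.combinations(population, sample_size))

# The mean of each sample.
sample_means = [statistics.mean(sample) for sample in samples]

# Sum of the sample means divided by the number of samples.
mean_of_means = statistics.mean(sample_means)

# Variability of the sample means: their standard deviation.
standard_error = statistics.pstdev(sample_means)

print("sample means:", sample_means)
print("mean of sample means:", mean_of_means)
print("population mean:", statistics.mean(population))
print("standard error:", round(standard_error, 2))
```

The mean of the sample means equals the population mean, while individual samples can miss it by a wide margin; the standard error summarizes how wide.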
Recall: Results with two Variables-- Bivariate Tables (Cross Tabulations) Singleton, R., Straits, B. & Straits, M. (1993) Approaches to social research. Toronto: Oxford
Interpretation issues (Bivariate Tables) • Calculate percentages within categories of attributes of independent variable • In example: • Independent variable: gender • Dependent variable: fear of walking alone at night • Women more afraid than men
Other Ways of Presenting Same Data • Link to other tables
Calculating Expected Outcomes • If variables (gender & fear) are not related, then the distribution within each subgroup of the independent variable (male & female) should be the same as in the group overall (therefore men and women should express fear in the same proportions) • Used in techniques for studying relationships (Chi-square) • Descriptive dimension (strength of relationship) • Inferential (probability that the association is due to chance)
Expected outcomes (Null Hypothesis) Singleton, R., Straits, B. & Straits, M. (1993) Approaches to social research. Toronto: Oxford
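Expected outcomes under the null hypothesis can be computed from the row and column totals of a cross tabulation. The counts below are hypothetical (they are not the figures from Singleton, Straits & Straits) and serve only to show the arithmetic: expected cell count = row total × column total / grand total.

```python
# Hypothetical observed counts, invented for illustration.
observed = {
    "men": {"afraid": 20, "not afraid": 80},
    "women": {"afraid": 60, "not afraid": 40},
}

# Totals for each row (gender) and each column (answer).
row_totals = {group: sum(cells.values()) for group, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for answer, count in cells.items():
        col_totals[answer] = col_totals.get(answer, 0) + count
grand_total = sum(row_totals.values())

# Expected counts if gender and fear were unrelated.
expected = {
    group: {
        answer: row_totals[group] * col_totals[answer] / grand_total
        for answer in col_totals
    }
    for group in observed
}

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (observed[g][a] - expected[g][a]) ** 2 / expected[g][a]
    for g in observed
    for a in col_totals
)

print("expected:", expected)
print("chi-square:", round(chi_square, 2))
```

With these made-up counts, independence would put 40 of every 100 men and 40 of every 100 women in the "afraid" column; the gap between observed and expected counts is what the chi-square statistic summarizes.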
Control variables: Trivariate Tables (Men/Women Drivers)
In Say It with Figures, Hans Zeisel presents the following data:

Automobile Accidents by Sex and Distance Driven
----------------------------------------------------------------------------
Distance     Under 10,000 km             Over 10,000 km
             Per Cent Accident Free      Per Cent Accident Free
Women        75% (5,035)                 48% (1,915)
Men          75% (2,070)                 48% (5,010)
----------------------------------------------------------------------------

Automobile Accidents by Sex
------------------------------------------
             Per Cent Accident Free
Women        68% (6,950)
Men          56% (7,080)
------------------------------------------

Women have fewer accidents than men because women tend to drive less than men do, and people who drive less tend to have fewer accidents.
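The link between the two tables can be checked numerically. The sketch below assumes the figures in parentheses are the numbers of drivers in each cell; the overall accident-free rate for each sex is then the average of the two distance-group rates, weighted by those counts. The name `overall_rate` is hypothetical.

```python
# (accident-free rate, number of drivers) for under / over 10,000 km.
# Assumption: the parenthesized figures in the table are driver counts.
groups = {
    "women": [(0.75, 5035), (0.48, 1915)],
    "men": [(0.75, 2070), (0.48, 5010)],
}

def overall_rate(strata):
    """Accident-free rate across distance groups, weighted by group size."""
    total_drivers = sum(n for _, n in strata)
    accident_free = sum(rate * n for rate, n in strata)
    return accident_free / total_drivers

women_rate = overall_rate(groups["women"])
men_rate = overall_rate(groups["men"])

print("women:", round(100 * women_rate), "% accident free")
print("men:", round(100 * men_rate), "% accident free")
```

Within each distance group the rates for men and women are identical; the overall difference arises only because most women fall in the lower-distance group and most men in the higher-distance group.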