An Introduction to Statistical Thinking

January 15, 2014

The Omnipresence of Variability • Individuals vary • Repeated measurements on the same individual vary • Statistics is a set of ideas and tools that account for variability when dealing with data.

Stat Thinking vs. Math Thinking • In pure math, the focus is on abstract patterns. Context is an irrelevant detail. Example: 1, 2, 3, 5, 8, 13, 21, 34, 55, … is an interesting pattern of numbers without context. • In statistics, whether a pattern is meaningful or interesting depends on context. Example: 3, 5, 23, 37, 6, 8, 20, 22, 1, 3 has seemingly no meaning or interest.

The Role of Context Source: George W. Cobb and David S. Moore, Mathematics, Statistics, and Teaching, The American Mathematical Monthly, Vol. 104, No. 9 p. 802

What Stat is All About Insight from data in context!

The Content of Statistics • Design: where data comes from and how it is gathered • Exploratory data analysis: informal conclusions about data drawn by direct observation • Statistical inference: formal conclusions about unknown parameters drawn indirectly from collected data

Break for personal data collection

Design • Data are produced in two main ways • Randomized comparative experiment: Subjects are randomly assigned to two groups. One group is treated; the other is not (“control”). The responses of the two groups are compared. • Observational study: Researchers observe subjects in their natural setting and record variables of interest. Comparisons are made among small, homogeneous groups to prevent confounding.
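The random assignment step of a randomized comparative experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `randomize`, the subject labels, and the equal-sized split are hypothetical choices, not something the slides specify.

```python
import random

def randomize(subjects, seed=None):
    """Randomly split subjects into a treatment group and a control group.

    Hypothetical helper: an equal split is assumed here for simplicity.
    """
    rng = random.Random(seed)       # a seed makes the assignment reproducible
    shuffled = list(subjects)       # copy so the caller's list is left alone
    rng.shuffle(shuffled)           # chance, not the researcher, decides the order
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Made-up subject labels, for illustration only.
subjects = [f"subject_{i}" for i in range(1, 11)]
treatment, control = randomize(subjects, seed=0)
print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who is treated, the two groups should be similar on average in every respect except the treatment.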

Design Examples • Math 109 Personal Info Survey • The Salk Vaccine Field Trial • Best and Walker’s smoking and health study

Problematic Designs • To study effectiveness of a certain surgery, eligible patients are split into two groups. Those who are too sick to benefit from surgery are put in the control group. The treatment group and control group are compared.

Problematic Designs • To determine the effectiveness of a new treatment, a group of patients receives the treatment and is compared to patients treated in other ways in the past (“historical controls”).

Quick Quiz • Is the comparison described below from an experiment, an observational study, or neither? Of 8,341 middle-aged men with heart trouble, 5,552 were chosen at random to receive one of five drugs for preventing heart attacks and the rest were assigned to the control group. Subjects who took more than 80% of their prescribed medicine were called “adherers.” For the group assigned to the drug clofibrate, the 5-year mortality rate among adherers was 15%, compared to 25% among non-adherers.

Design • Data are produced in two main ways • Randomized comparative experiment • Observational study • Experiments often allow causal conclusions; observational studies do not, even when carried out during an ongoing experiment. • The mathematical models used in statistics are identical for both. Thus context is crucial.

Dealing with Data

Variables • Characteristics which change from individual to individual • Two types • Quantitative: numerical characteristics Examples: Age, family size, income, height • Qualitative/Categorical: non-numerical descriptors Examples: Sex, major, birthplace, marital status

Quantitative Variables • Can be discrete or continuous (or both!) • Discrete: values differ by fixed amounts Examples: family size, cars owned • Continuous: difference in values can be arbitrarily small Examples: age, height, weight • Discrete variables with large range and small minimum difference may be treated as continuous. Example: income

Visualizing Quantitative Data Dot Plot EPA mileage ratings for 100 new cars

Visualizing Quantitative Data • Dot plot • Groups values that are the same • Useful for a single discrete variable • Can see individual data points and distribution • In SPSS: Graphs => Legacy Dialogs => Scatter/Dot, choose Simple Dot, then choose the X-Axis Variable
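For readers without SPSS, the idea of a dot plot (stack one mark per observation above each value) can be sketched as plain text in Python. The function name `dot_plot` and the mileage numbers are made up for illustration; they are not the EPA data from the slide.

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot: one row per integer value, one '*' per data point."""
    counts = Counter(values)        # groups values that are the same
    lines = []
    for value in range(min(counts), max(counts) + 1):
        # Counter returns 0 for values that never occur, giving an empty row.
        lines.append(f"{value:>3} | " + "*" * counts[value])
    return "\n".join(lines)

# Hypothetical mileage ratings (miles per gallon).
mpg = [30, 32, 32, 33, 33, 33, 35]
print(dot_plot(mpg))
```

Each row shows both the individual data points (the marks) and the overall distribution (the row lengths).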

Histogram

Visualizing Quantitative Data • Histogram • Divides data into class intervals • Bars represent proportion of data in each interval • Useful for single continuous or discrete variable • Can see distribution, but not individual data

Creating a histogram • Determine class intervals (The widths need not be uniform.)

Class Interval Recommendations For discrete quantitative variables, break intervals between data values.

Creating a histogram • Determine class intervals • Determine % of data in each interval (Data falling on the boundary of two intervals go in the higher interval.) • Determine height of block: Height of block = (% of data in the interval) / (width of the interval) • Vertical axis units are “% per (horizontal axis unit)”. This scale is called the density scale.
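The three steps can be sketched in Python. This sketch assumes the block height is the percentage of data in the interval divided by the interval's width, which is what the stated units of "% per horizontal axis unit" imply. The function name `density_heights`, the sample data, and the choice to place a value equal to the topmost edge in the last interval are illustrative assumptions.

```python
def density_heights(data, edges):
    """Return (interval, percent, height) for each class interval.

    Boundary values go in the higher interval, i.e. intervals are [lo, hi).
    Assumption: a value equal to the topmost edge is kept in the last interval.
    """
    n = len(data)
    rows = []
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        last = i == len(edges) - 2
        count = sum(1 for x in data if lo <= x < hi or (last and x == hi))
        percent = 100 * count / n       # % of data in this interval
        height = percent / (hi - lo)    # % per horizontal-axis unit
        rows.append(((lo, hi), percent, height))
    return rows

# Made-up data with unequal class-interval widths.
data = [1, 2, 2, 3, 5, 5, 6, 8, 9, 10]
edges = [0, 2, 5, 10]
for interval, percent, height in density_heights(data, edges):
    print(interval, f"{percent:.0f}%", f"height {height:.1f}")
```

On the density scale, the area of each block (height times width) equals the percentage of data in that interval, so the areas add up to 100% even when the widths differ.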

Measures of Center Quantitative descriptions of data

Mean (average) • Sum of the data values divided by number of data values. • “Balances” a histogram made from the data • May not be a good notion of “middle” if there are a few extreme values • Mean of histogram data is estimated by weighted average.

Median • The middle data value when data are ordered smallest to largest. • If there are an even number of data values, the median is the average of the middle two. • Greater than or equal to half of the data, less than or equal to the other half. • Useful when extreme values have reduced importance.

Mode • The data value(s) appearing most frequently • Can be more than one (bimodal distribution) • Shows where data tend to concentrate • The only measure of center we’ll discuss that makes sense for a qualitative variable.
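The three measures of center can be compared on a small example using Python's standard `statistics` module. The data values and variable names below are made up for illustration; the one extreme value shows how the mean and median can disagree.

```python
import statistics

# Made-up data with one extreme value (40).
data = [2, 3, 3, 5, 7, 10, 40]

mean = statistics.mean(data)         # sum of values divided by number of values
median = statistics.median(data)     # middle value of the sorted data
modes = statistics.multimode(data)   # list of most frequent value(s)

print("mean:", mean, "median:", median, "mode(s):", modes)

# The mode also makes sense for a qualitative variable.
majors = ["math", "biology", "math", "history"]
print("most common major:", statistics.multimode(majors))
```

Here the single value 40 pulls the mean well above the median, which is the situation the slides describe as the mean not being a good notion of "middle".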

In SPSS • Analyze => Descriptive Statistics => Frequencies => Statistics

The Shape of a Distribution • Skew Right • Long right-hand tail • Mean is larger than median • Skew Left • Long left-hand tail • Mean is smaller than median • Symmetric • Uniform
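The relationship between skew and the mean-versus-median comparison can be checked on two small made-up data sets. The helper name `mean_minus_median` and both data sets are hypothetical, and the pattern is a common tendency rather than a guarantee for every data set.

```python
import statistics

def mean_minus_median(data):
    """Positive when the mean exceeds the median, negative when it falls below."""
    return statistics.mean(data) - statistics.median(data)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]        # long right-hand tail
left_skewed = [1, 17, 18, 18, 19, 19, 19, 20]   # long left-hand tail

print("right skew:", mean_minus_median(right_skewed))
print("left skew: ", mean_minus_median(left_skewed))
```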

Measures of Variability

Range • Difference between largest and smallest data value • Easy to compute, but depends too much on extreme values

Interquartile Range (IQR) • The range for the middle 50% of the data • Is not affected by extreme values • To compute IQR of a set of N data values: • Find first quartile (Q1): the 25th percentile data point. • Find third quartile (Q3): the 75th percentile data point. • The IQR = Q3-Q1
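A minimal sketch of the IQR computation in Python follows. Textbooks and software use slightly different conventions for locating percentiles, so the choice of `method="inclusive"` in `statistics.quantiles` is an assumption here, and other conventions can give slightly different quartiles. The function name `iqr` and the data are made up for illustration.

```python
import statistics

def iqr(data):
    """Interquartile range: Q3 minus Q1 (using the 'inclusive' quartile convention)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]   # same data, one extreme value

for values in (data, with_outlier):
    print("range:", max(values) - min(values), "IQR:", iqr(values))
```

The extreme value changes the range dramatically but leaves the IQR unchanged, which is the sense in which the IQR is not affected by extreme values.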

Standard Deviation (SD) • Measures how far a typical data point is from the mean • Most values (often about 68%) are within one SD of the mean • Very few values (often about 5%) are more than two SDs away from the mean. • The SD has the same units as the data • WARNING: Software usually computes SD+, which divides the sum of squared deviations by one less than the number of data values and is therefore slightly larger than the SD.

Calculating SD • Compute the deviation for each data value: deviation = data value – mean • Compute the root mean square (RMS) of the deviations: square each deviation, find the mean of the squared deviations, then take the square root of the result • SD of histogram data is a weighted RMS
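The RMS recipe translates directly into Python. The function name `sd` and the data are made up for illustration. The comparison with `statistics.stdev`, which divides by one less than the number of values, assumes that this is the quantity the slides call SD+.

```python
import math
import statistics

def sd(data):
    """Standard deviation as the root mean square of deviations from the mean."""
    mean = sum(data) / len(data)
    deviations = [x - mean for x in data]                        # deviation = data value - mean
    mean_square = sum(d * d for d in deviations) / len(deviations)
    return math.sqrt(mean_square)                                # root of the mean of the squares

data = [2, 4, 4, 4, 5, 5, 7, 9]
print("SD (RMS of deviations):", sd(data))
print("statistics.pstdev:     ", statistics.pstdev(data))  # divides by n, same as sd()
print("statistics.stdev:      ", statistics.stdev(data))   # divides by n - 1 (assumed SD+)
```

The two library functions differ only in the divisor, so the n − 1 version is always slightly larger, and the gap shrinks as the number of data values grows.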

Why not just take average deviation? • Average absolute deviation = average of |data value – mean| • If the mean is replaced by another number, avg. abs. deviation could be smaller. This is not true of SD. • We will see later in the course that • Error in RMS calculations is easier to handle • SD fits best with the theory (Central Limit Theorem)
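The claim that the average absolute deviation can shrink when deviations are measured from a number other than the mean, while the RMS deviation cannot, can be checked numerically. The function names `avg_abs_deviation` and `rms_deviation`, the data, and the use of the median as the alternative center are illustrative choices.

```python
import math
import statistics

def avg_abs_deviation(data, center):
    """Average of |data value - center|."""
    return sum(abs(x - center) for x in data) / len(data)

def rms_deviation(data, center):
    """Root mean square of (data value - center)."""
    return math.sqrt(sum((x - center) ** 2 for x in data) / len(data))

data = [1, 2, 3, 4, 20]
mean = statistics.mean(data)      # 6
median = statistics.median(data)  # 3

print("avg abs deviation about mean:  ", avg_abs_deviation(data, mean))
print("avg abs deviation about median:", avg_abs_deviation(data, median))
print("RMS deviation about mean:      ", rms_deviation(data, mean))
print("RMS deviation about median:    ", rms_deviation(data, median))
```

For this data set, the average absolute deviation is smaller about the median than about the mean, while the RMS deviation is smallest about the mean.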

The Normal Approximation
