Understanding Statistics: Correlation and Outliers

Mar. 15
Statistics for the day:
Highest temp ever recorded in State College: 102 degrees (July 9, 1936 and July 17, 1988)
Lowest temp ever recorded in State College: -18 degrees (January 19-20, 1994)
Source: http://pasc.met.psu.edu
Review Exam Friday, March 19
Chapters 10, 11, 12, 15, 16, 17
These slides were created by Tom Hettmansperger and in some cases modified by David Hunter

Best fitting line through the data: called the REGRESSION LINE
Strength of relationship: measured by CORRELATION

calories = -10 + 60(serving size in oz)
For example, if you have a 6 oz sandwich, on the average you expect to get about: -10 + 60(6) = -10 + 360 = 350 calories
For a 10 oz sandwich: -10 + 60(10) = -10 + 600 = 590 calories

calories = -10 + 60(serving size in oz) • -10 is called the intercept • 60 is called the slope • One way to interpret slope: For every extra oz of serving you get an increase of 60 calories
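The prediction rule on this slide can be sketched in a few lines of Python. The function name predict_calories is an illustrative choice, not something from the slides; the intercept and slope are the slide's values.

```python
def predict_calories(serving_oz, intercept=-10, slope=60):
    """Predicted calories from the regression line: intercept + slope * serving size."""
    return intercept + slope * serving_oz


print(predict_calories(6))   # 350
print(predict_calories(10))  # 590

# The slope is the change in predicted calories for one extra ounce.
print(predict_calories(7) - predict_calories(6))  # 60
```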

Facts about correlation, measured for two quantitative variables • +1 means perfect increasing linear relationship • -1 means perfect decreasing linear relationship • 0 means no linear relationship • + means one increases as the other increases • - means one increases as the other decreases

Outliers Outliers are data that are not compatible with the bulk of the data. They show up in graphical displays as detached or stray points. Sometimes they indicate errors in data input. Experts estimate that roughly 5% of all data entered is in error. Sometimes they are the most important data points.

Example

A bad outlier:

Another bad outlier:

The Moral: There can be good outliers: Election fraud. We use them to identify important parts of the data. Or in analyzing put options for extreme cases. More often the outliers are bad. They can depress the correlation and make you think the relationship is weaker than it really is. They can increase the correlation and make it appear that the relationship is stronger than it really is. IMPORTANT: Always look at a scatter plot as well as compute the correlation.
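A minimal sketch of how one stray point can depress the correlation. The data below are made up for illustration (they are not from the slides), and pearson_r is a hand-written helper rather than a library call.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: five points exactly on a line, then one outlier added.
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 4, 6, 8, 10]

r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_clean + [6], y_clean + [0])

print(round(r_clean, 3))    # 1.0
print(round(r_outlier, 3))  # 0.143
```

A scatter plot of the six points would show the outlier immediately, which is why the slide recommends plotting as well as computing.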

Another problem: Sometimes we see a strong relationship in absurd examples. Two seemingly unrelated variables have a high correlation. This often signals the presence of a third variable that is highly correlated with the other two. (Confounding or interaction)

A third variable: vocabulary vs shoe size

How can we have such high correlation between shoe size and vocabulary? Easy: Both increase with age and hence age is a hidden variable. Age is positively correlated with both shoe size and with vocabulary.

Two categorical variables:
Explanatory variable: Gender
Response variable: Body Pierced or Not
Survey question: Have you pierced any other part of your body? (Except for ears)
Research question: Is there a significant difference between women and men in terms of body pierces?

Response Data: Pierced? Explanatory Gender? From Stat 100.2, spring 2004 (missing responses omitted)

Percentages
62.22% = 84/135    96.97% = 96/99

Response: body pierced?
            no       yes      All
female      62.22    37.78    100.00
male        96.97     3.03    100.00
All         76.92    23.08    100.00

Research question: Is there a significant difference between women and men? (i.e., between 62.22% and 96.97%)

The Debate: The research advocate claims that there is a significant difference. The skeptic claims there is no real difference. The data differences simply happen by chance.

The strategy for determining statistical significance: • First, figure out what you expect to see if there is no difference between females and males • Second, figure out how far the data is from what is expected. • Third, decide if the distance in the second step is large. • Fourth, if large then claim there is a statistically significant difference.

Research Advocate: OK. Suppose there is really no difference in the population as you, the Skeptic, claim. We will compare what you, the Skeptic, expect to see and what you actually do see in the data.
Skeptic: How do we figure out what we expect to see?

Rows: gender    Columns: body pierces
Top number in each cell is observed; bottom number is expected (by the skeptic)

            no        yes       All
female      84        51        135
            103.85    31.15     135.00
male        96        3         99
            76.15     22.85     99.00
All         180       54        234
            180.00    54.00     234.00

How to measure the distance between what the research advocate observes in the table and what the skeptic expects: Add up the following for each cell: (observed - expected)^2 / expected. The total is called the chi-squared statistic. For the table above it is 38.85.
Now how do we decide if 38.85 is large or not? If it is large enough, the skeptic concedes to the research advocate and agrees there is a statistically significant difference. How large is enough?

Chi-squared distribution with 1 degree of freedom: If chi-squared statistic is larger than 3.84, it is declared large and the research advocate wins. But our chi-squared is 38.85 so the research advocate easily wins! There is a statistically significant difference between men and women.
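The four-step strategy can be sketched in Python for the piercing table. This is an illustrative helper, not the software used to produce the slides: the function name chi_squared is an assumption, and the expected count for each cell is taken as (row total)(column total)/(grand total), which reproduces the expected counts shown in the table.

```python
def chi_squared(table):
    """Chi-squared statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # What the skeptic expects if there is no difference between rows.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Observed counts: rows are female, male; columns are no, yes.
piercing = [[84, 51], [96, 3]]
stat = chi_squared(piercing)

print(round(stat, 2))  # 38.85
print(stat > 3.84)     # True: larger than the cutoff for 1 degree of freedom
```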

Why 1 degree of freedom? Note that the black box is the ONLY one we can fill arbitrarily. Once that box is filled, all others are determined by the margins!

How many degrees of freedom? Degrees of freedom (df) always equal (Number of rows – 1) times (Number of columns – 1)

Health studies and risk
Research question: Do strong electromagnetic fields cause cancer?
50 dogs randomly split into two groups: no field, yes field
The response is whether they get lymphoma.

Rows: mag field    Columns: cancer
            no      yes     All
no          20      5       25
yes         10      15      25
All         30      20      50

Rows: mag field    Columns: cancer
Observed counts are printed above the expected counts

            no       yes      All
no          20       5        25
            15.00    10.00    25.00
yes         10       15       25
            15.00    10.00    25.00
All         30       20       50
            30.00    20.00    50.00

Chi-Square = 8.333 (compare to 3.84)
Research advocate wins!

Terminology and jargon: • Identify the ‘bad’ response category: yes cancer • Risk for categories of explanatory variable • Identify treatment category • Identify baseline (control) category • Treatment risk: 15/25 or .60 or 60% • Baseline risk: 5/25 or .20 or 20% • Relative risk: Treatment risk over Baseline risk = .60/.20=3 • So risk due to mag field is 3 times higher than baseline risk. • One more on the next page:

Increased risk (percentage change in risk): (treatment risk - baseline risk) / baseline risk = (.60 - .20)/.20 = 2.00
So the percentage change is 200%: a 200% increase in treatment risk over baseline risk for getting cancer.
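The risk statements for the magnetic field study can be checked with a short Python sketch. The variable names are illustrative; the counts come from the table above, and increased risk is computed as (treatment risk - baseline risk) / baseline risk.

```python
# Counts from the magnetic field table: cancer cases out of 25 dogs per group.
treatment_risk = 15 / 25  # dogs exposed to the field
baseline_risk = 5 / 25    # control group, no field

relative_risk = treatment_risk / baseline_risk
increased_risk_pct = (treatment_risk - baseline_risk) / baseline_risk * 100

print(round(relative_risk, 2))    # 3.0
print(round(increased_risk_pct))  # 200
```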

Final note: When the chi-squared test is statistically significant then it makes sense to compute the various risk statements. If there is no statistical significance then the skeptic wins. There is no evidence in the data for differences in risk for the categories of the explanatory variable.

Research question: Is ghost sighting related to age? Do young and old people differ in ghost sighting? The skeptic responds by saying he doesn’t believe that there is any difference between the age groups. We need to see the data to resolve the debate. Then we can consider assessing the risk. Exercise 9, p219 of the text.

Expected counts are printed below observed

            yes      no        Total
young       212      1313      1525
            174.9    1350.1
old         465      3913      4378
            502.1    3875.9
Total       677      5226      5903

Chi-Sq = 7.870 + 1.020 + 2.742 + 0.355 = 11.987
The research advocate wins and the skeptic loses. There is evidence in the data that there are differences in the population.

The percent of young who saw a ghost: 212/1525 = .139 Answer: 13.9% The proportion of old who saw a ghost: 465/4378 = .106 Answer: .106 The risk of young seeing ghost: Answer: 212/1525 or .139 or 13.9% Odds ratio?

Odds • The odds of something happening are given by a ratio: (probability it happens) / (probability it does not happen). • For example, if you flip a fair coin, the odds of heads are .5/.5 = 1 (or sometimes “1 to 1”). • An odds ratio is the ratio of two odds!

The odds that a young person saw a ghost: 212/1313 = .161
The odds that an older person saw a ghost: 465/3913 = .119
The odds ratio: Answer: .161/.119 = 1.35

Relative risk of young person seeing a ghost compared to older person: Answer: .139/.106 = 1.31 We would say that the risk that a younger person sees a ghost is 1.31 times higher than the risk that an older person sees a ghost. The increased risk that a young person sees a ghost over that of an older person: Answer: (.139 - .106)/.106 = .31 Hence we would say that young people have a 31% higher risk of seeing a ghost than older people.
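A minimal Python sketch of the odds, odds ratio, relative risk, and increased risk for the ghost data. The variable names are illustrative; the counts are taken from the observed table. Because the code does not round intermediate values, the odds ratio comes out near 1.36, slightly different from the 1.35 obtained from odds rounded to three decimals.

```python
# Observed counts from the ghost sighting table.
young_yes, young_no = 212, 1313
old_yes, old_no = 465, 3913

# Odds: saw a ghost versus did not see a ghost.
odds_young = young_yes / young_no
odds_old = old_yes / old_no
odds_ratio = odds_young / odds_old

# Risk: saw a ghost out of everyone in the age group.
risk_young = young_yes / (young_yes + young_no)
risk_old = old_yes / (old_yes + old_no)
relative_risk = risk_young / risk_old
increased_risk_pct = (risk_young - risk_old) / risk_old * 100

print(round(odds_young, 3), round(odds_old, 3))  # 0.161 0.119
print(round(odds_ratio, 2))                      # 1.36
print(round(relative_risk, 2))                   # 1.31
print(round(increased_risk_pct))                 # 31
```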

Statistical significance
Statistical significance is related to the size of the sample. But that makes sense. More data, more information, more precise inference.
So statistical significance is related to two things:
1. The size of the difference between the percentages. Big differences are more likely to show statistical significance.
2. The size of the sample. Bigger samples are more likely to show statistical significance irrespective of the size of the difference in percentages.

Practical significance Even if the difference in percentages is uninteresting and of no practical interest, the difference may be statistically significant because we have a large sample. Hence, in the interpretation of statistical significance, we must also address the issue of practical significance. In other words, you must answer the skeptic’s second question: WHO CARES?

[Diagram: Probability, split into two interpretations, Relative Frequency and Personal Opinion. Diagram labels: Experiment, Repeated Sampling, Experience, Non-repeatable Event, Physical World Assumptions, Estimate Probability, Check by Repeated Sampling]

Rules for combining probabilities
0 ≤ Probability ≤ 1
1. If there are only two possible outcomes, then their probabilities must sum to 1.
2. If two events cannot happen at the same time, they are called mutually exclusive. The probability of at least one happening (one or the other) is the sum of their probabilities. [Rule 1 is a special case of this.]
3. If two events do not influence each other, they are called independent. The probability that they happen at the same time is the product of their probabilities.
4. If the occurrence of one event forces the occurrence of another event, then the probability of the second event is always at least as large as the probability of the first event.

Are mutually exclusive events independent or dependent?
Remember the tests:
• Two events are mutually exclusive if they cannot happen at the same time.
• Two events are independent if the occurrence of one does not alter the probability of the other occurring.
• Put another way, two events are dependent if the probability of one changes when we find out whether the other occurred.
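One way to explore the question is to apply both tests to a concrete case. The sketch below uses a single roll of a fair die, which is an illustrative example not taken from the slides, and the helper names are hypothetical. It checks independence with the product rule: if two events are independent, the probability that both happen equals the product of their probabilities.

```python
from fractions import Fraction

OUTCOMES = range(1, 7)  # one roll of a fair six-sided die


def pr(event):
    """Probability of an event, given as a function from outcome to True/False."""
    return Fraction(sum(1 for outcome in OUTCOMES if event(outcome)), len(OUTCOMES))


def is_even(outcome):
    return outcome % 2 == 0


def is_one(outcome):
    return outcome == 1


p_even = pr(is_even)  # 1/2
p_one = pr(is_one)    # 1/6
p_both = pr(lambda outcome: is_even(outcome) and is_one(outcome))  # 0

mutually_exclusive = p_both == 0
independent = p_both == p_even * p_one

print(mutually_exclusive, independent)  # True False
```

In this example the two events cannot happen together, yet learning that the roll is even drops the probability of a 1 from 1/6 to 0, so they are dependent.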

New Rule: Suppose we are considering a series of events. The probability of at least one of the events occurring is: Pr( at least one ) = 1 – Pr( none ) This follows directly from Rule 1 since ‘at least one’ or ‘none’ has to occur.

Long Run Behavior
We CANNOT predict individual outcomes. BUT we CAN predict long run behavior quite accurately.
Standard example: We cannot predict the outcome of a single toss of a coin very precisely: Pr(head) = .50
But in the long run we expect about 50% heads and 50% tails.

Two laws (only one of them valid): • Law of large numbers: Over the long haul, we expect about 50% heads (this is true). • “Law of small numbers”: If we’ve seen a lot of tails in a row, we’re more likely to see heads on the next flip (this is completely bogus). Remember: The law of large numbers OVERWHELMS; it does not COMPENSATE.
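The law of large numbers can be illustrated with a small simulation. This is a sketch under stated assumptions: the coin is simulated with Python's random module, and the function name heads_fraction and the fixed seed are illustrative choices so the run is repeatable.

```python
import random


def heads_fraction(n_flips, seed=0):
    """Fraction of heads in n_flips simulated tosses of a fair coin."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# The fraction of heads settles toward .50 as the number of flips grows.
for n in (10, 1000, 100000):
    print(n, heads_fraction(n))
```

Each simulated flip ignores all earlier flips, so a run of tails never makes heads more likely. Early streaks are simply swamped by the later flips.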

When will it happen? (p264 text)
Odd Man
Consider the odd man game. Three people toss a coin. The odd man has to pay for the drinks. You are the odd man if you get a head and the other two have tails or if you get a tail and the other two have heads.
Pr(no odd man) = Pr(HHH or TTT)
= Pr(HHH) + Pr(TTT)    (Rule 2)
= (1/2)^3 + (1/2)^3    (Rule 3)
= 1/8 + 1/8 = 1/4 = .25
Pr(odd man) = 1 – Pr(no odd man) = 1 – .25 = .75    (Rule 1)

Pr(odd man occurs on the third try) = Pr(miss, miss, hit)
= Pr(miss) Pr(miss) Pr(hit)    (Rule 3)
= [Pr(miss)]^2 Pr(hit)
= [.25]^2 (.75) = .047
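The odd man probabilities can be checked by listing all eight equally likely outcomes of three fair coin tosses. This is a sketch; the helper name has_odd_man is illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely outcomes of three fair coin tosses.
outcomes = list(product("HT", repeat=3))


def has_odd_man(tosses):
    """True when exactly one toss differs from the other two."""
    return tosses.count("H") in (1, 2)


p_hit = Fraction(sum(has_odd_man(t) for t in outcomes), len(outcomes))
p_miss = 1 - p_hit

# First odd man on the third try: miss, miss, hit (independent rounds).
p_third_try = p_miss ** 2 * p_hit

print(p_hit)               # 3/4
print(p_third_try)         # 3/64
print(float(p_third_try))  # 0.046875
```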

Expectation
Insurance example (Example 14, p267, extended): Suppose my insurance company has 10,000 policy holders and they are all skateboarders. I collect a $500 premium each year. I pay off $1500 for a claim of a skateboard accident. From past experience I know 10% (i.e., 1000) will file a claim. How much do I expect to make per customer?
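One way to work the question is to weight each possible result by its probability. The sketch below assumes each customer files at most one claim per year and that the premium is collected whether or not a claim is filed; both assumptions are readings of the slide rather than statements in it, and the variable names are illustrative.

```python
from fractions import Fraction

premium = 500                # collected from every customer
payout = 1500                # paid when a customer files a claim
p_claim = Fraction(10, 100)  # 10% of customers file a claim
customers = 10_000

# Net result per customer: +500 with no claim, 500 - 1500 = -1000 with a claim.
expected_per_customer = (1 - p_claim) * premium + p_claim * (premium - payout)

print(expected_per_customer)              # 350
print(expected_per_customer * customers)  # 3500000
```

Under these assumptions the company expects to make $350 per customer. The same figure comes from the totals: 10,000 x $500 in premiums minus 1000 x $1500 in claims, divided by 10,000 customers.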
