Understanding Contingency Tables in Statistics | Significance, Application & Analysis

STATISTICS WORKSHOP - 2: Contingency tables, Correlation, Analysis of variance

Why relations between variables are important • The ultimate goal of every research or scientific analysis is finding relations between variables. • The philosophy of science teaches us that there is no other way of representing “meaning” except in terms of relations between some quantities or qualities; either way involves relations between variables. • The advancement of science must always involve finding new relations between variables.

Qualitative Data (Contingency Table)

                          Variable 2
Variable 1     Factor 1   Factor 2   Factor 3
Factor A       n11        n12        n13
Factor B       n21        n22        n23

Example: This test would be the one to use if we have, say, different classes of patients (e.g., six types of cancers) and, for a set of 1000 markers, we can have either presence or absence of each marker in each patient (this would yield 1000 contingency tables of dimensions 6x2, each marker by each cancer type).

Contingency Table Question Is there evidence in the data for association between the categorical variables? For cross-classified data, the Pearson chi-square test for independence and Fisher's exact test can be used to test the null hypothesis that the row and column classification variables of the data's two-way contingency table are independent.
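
The Pearson chi-square test for independence can be sketched in a few lines. This is a minimal illustration, not the workshop's own code: the counts are hypothetical, the function names are made up for this example, and the p-value helper only covers the 1-degree-of-freedom case (a 2 x 2 table) so that it can stay within the standard library.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical 2 x 2 counts, not taken from the slides
observed = [[10, 20],
            [20, 10]]
stat, df = chi_square_independence(observed)
p = p_value_df1(stat)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p:.4f}")
```

For tables with more than 1 degree of freedom, or with small expected counts where Fisher's exact test is preferable, a statistics library would normally be used instead of this hand-rolled version.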

Chi-Square test (2 × 2 table)
For a 2 × 2 table with cells a, b in the first row and c, d in the second row:
Odds Ratio (OR) = (ad)/(bc)
Relative risk = a(c+d) / [c(a+b)]

Contingency Table Chi-Square Test Example: 3500 people were observed to see whether or not they snore. Is there an association between snoring and gender?

Contingency Table Example - Is there an association between snoring and gender?

Contingency Table Odds ratio = 1.58 95% CI = 1.39 to 1.81
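
The slide does not say how its confidence interval was obtained. One common choice is Woolf's logit method, sketched below as an assumption rather than as the method actually used. The counts are hypothetical (the snoring table is not reproduced in this transcript), and `odds_ratio_ci` is a name invented for this example.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio of a 2 x 2 table [[a, b], [c, d]] with a Woolf (logit) interval.

    z = 1.96 gives an approximate 95% confidence interval.
    All four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, not the snoring data from the slides
odds_ratio, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {odds_ratio:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

An interval that excludes 1, as the slide's 1.39 to 1.81 does, indicates an association at roughly the 5% level.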

Contingency Table Is there evidence of differences in smoking pattern between the sexes?

Contingency Table

Measuring treatment differences with Y/N response • For outcomes such as reduction in blood pressure there are obvious summaries of treatment effect, such as the difference between the averages of the groups • For yes/no outcomes like death or cure the choice of summary is not so obvious

            Dead: Y   Dead: N   % dead
aspirin     804       7783      9.4%
placebo     1016      7584      11.8%
TOTAL       1820      15367

Relative Risk or Risk Ratio • Relative risk or risk ratio: risk of death in aspirin group divided by risk in placebo group: Relative Risk = 9.4% / 11.8% = 0.80 “mortality is reduced by 20%” • Relative risk estimates are likely to generalise well from one population to another

Absolute Risk Difference • Absolute risk difference is the proportion of deaths in the aspirin group minus the proportion in the placebo group: risk difference = 9.4% - 11.8% = -2.4% "about 2.4 lives saved for each 100 patients treated" • Risk difference has a more direct clinical interpretation, especially when considering cost-effectiveness

Odds Ratio • Odds ratio: the odds of death in the aspirin group divided by the odds in the placebo group: Odds Ratio = (804/7783) / (1016/7584) = 0.77 "reduction of 23% in the odds of death" • The odds ratio has some purely mathematical advantages. It is not much used in randomised studies
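
The three summaries can be computed side by side from the aspirin/placebo counts in the table. This is a small sketch; the function name and dictionary keys are chosen for this example only. Because it works from the unrounded counts, the results differ slightly from figures derived from the rounded percentages.

```python
def risk_measures(events_treated, non_events_treated,
                  events_control, non_events_control):
    """Relative risk, risk difference and odds ratio for a 2 x 2 trial table."""
    risk_treated = events_treated / (events_treated + non_events_treated)
    risk_control = events_control / (events_control + non_events_control)
    odds_treated = events_treated / non_events_treated
    odds_control = events_control / non_events_control
    return {
        "risk_treated": risk_treated,
        "risk_control": risk_control,
        "relative_risk": risk_treated / risk_control,
        "risk_difference": risk_treated - risk_control,
        "odds_ratio": odds_treated / odds_control,
    }

# Counts from the slide: aspirin 804 dead / 7783 alive, placebo 1016 / 7584
result = risk_measures(804, 7783, 1016, 7584)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With a risk of death around 10%, the odds ratio and the relative risk are close; they drift apart as the outcome becomes more common.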

Berkson’s Fallacy • It is a treatment-seeking bias so called because Berkson indicated that individuals with more than one disorder are more likely to seek clinical services than are those with only one disorder. • This leads to an erroneously higher estimate of the prevalence of the association between these disorders than would be the case if each single disorder independently led the patient to seek care.

Berkson’s Fallacy • 2784 individuals were surveyed to determine whether each subject suffered from disease A, disease B, or both. Of the 2784 individuals, 257 were hospitalised for the condition. • P < 0.025: There is some association between having disease A and having disease B • P > 0.1: There is no association between having disease A and having disease B

Gene Association Studies Typically Wrong • Evolution of the strength of an association as more information is accumulated. The strength of the association is shown as an estimate of the odds ratio (OR) without confidence intervals. • a, Eight topics in which the results of the first study or studies differed beyond chance (P<0.05) when compared with the results of the subsequent studies. • b, Eight topics in which the first study or studies did not claim formal statistical significance for the genetic association but formal significance was reached by the end of the meta-analysis. • Each trajectory starts at the OR of the first study or studies. Updated cumulative OR estimates are obtained at the end of each subsequent year, summarizing all information to that time. (Adapted from J.P.Ioannidis et al., Nature Genetics 29:306-9, 2001)

Studies of disease association • Given the number of potentially identifiable genetic markers and the multitude of clinical outcomes to which these may be linked, the testing and validation of statistical hypotheses in genetic epidemiology is a task of unprecedented scale.

Testing for equality of two proportions
Example: Two groups of genes
1. genes for transcription and translation
2. genes in the immune system
Question: Do they have similar purine-pyrimidine compositions?
The question is asking whether the percentage of purines (or pyrimidines) in group 1 is the same as the percentage of purines (or pyrimidines) in group 2. To form the null and alternative hypotheses we can say:
G1 = the percentage of purines in group 1
G2 = the percentage of purines in group 2
H0: G1 = G2
H1: G1 > G2 or G2 > G1 (that is, G1 ≠ G2)
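
One standard way to test H0: G1 = G2 against the two-sided alternative is the pooled two-proportion z-test, sketched below. The slides do not state which test they use, so this is an assumption; the base counts are hypothetical and `two_proportion_z_test` is a name invented for this example.

```python
import math

def two_proportion_z_test(successes1, n1, successes2, n2):
    """Pooled z statistic and two-sided p-value for H0: p1 = p2."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    # Under H0 both groups share one proportion, estimated by pooling
    pooled = (successes1 + successes2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: purine counts out of total bases sampled in each gene group
z, p_value = two_proportion_z_test(520, 1000, 480, 1000)
print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```

The normal approximation is reasonable when both groups are large; for small counts an exact test on the 2 x 2 table would be the safer choice.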

Correlation • Correlation can be used to summarise the amount of linear association between two continuous variables x and y. • Let (x1, y1), (x2, y2), ..., (xn, yn) denote the data points. • A scatter plot gives a "cloud" of points
[Figure: three scatter plots of Y against X, showing positive correlation, negative correlation and no correlation]

Positive and Negative Association • If the points are nearly in a straight line then knowing the value of one variable helps you to predict the value of the other. • If there is little or no association, the "cloud" is more spread out and information about one variable doesn't tell you much about the other.

A simple correlation formula • Suppose there are n points altogether and that n(A) is the number in region A, and similarly for n(B), n(C) and n(D) • Give a value of 1/n to every point in A or C and -1/n to every point in B or D • Define cor = [n(A) + n(C) - n(B) - n(D)] / n • What are the properties of cor?
[Figure: the scatter plot divided into four regions, D (upper left), A (upper right), C (lower left), B (lower right)]
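
The crude measure is easy to try out. The sketch below assumes the four regions are formed by splitting the plot at the mean of x and the mean of y (A upper right, C lower left, B and D the other two), which the transcript does not state explicitly. Points lying exactly on a dividing line are counted as 0 here, another assumption.

```python
def crude_cor(xs, ys):
    """Quadrant-count correlation: [n(A) + n(C) - n(B) - n(D)] / n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    score = 0
    for x, y in zip(xs, ys):
        product = (x - mean_x) * (y - mean_y)
        if product > 0:      # regions A and C: both deviations share a sign
            score += 1
        elif product < 0:    # regions B and D: deviations have opposite signs
            score -= 1
    return score / n

# Hypothetical data sets
perfect_line = crude_cor([1, 2, 3, 4], [1, 2, 3, 4])
loose_cloud = crude_cor([1, 2, 3, 4], [2, 1, 4, 3])
reversed_line = crude_cor([1, 2, 3, 4], [4, 3, 2, 1])
print(perfect_line, loose_cloud, reversed_line)
```

The value always lies between -1 and 1, but a perfect line and a looser cloud can both score 1, because only the region of each point is used and not its distance from the centre.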

The formula for cor works, but it is rather crude. For example, both the diagrams below would give cor = 1.
The Pearson product moment correlation coefficient • $(x_i - \bar{x})$ and $(y_i - \bar{y})$ are positive or negative in the different regions and so is the product $(x_i - \bar{x})(y_i - \bar{y})$ • The sum $\sum_i (x_i - \bar{x})(y_i - \bar{y})$ will not lie between -1 and 1. It depends on: • The scale of x and y • The number of points

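Correlation formula
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ are the sample means of x and y.
Partial correlation: Correlation between 2 variables that controls for the effects of one or more other variables.
Rank Correlation: $r_s = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the two ranks of case i.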

Pearson Correlation Coefficient • A measure of linear association between two variables, denoted as r. • Values of the correlation coefficient range from -1 to 1. • The sign of the coefficient indicates the direction of the relationship, and its absolute value indicates the strength, with larger absolute values indicating stronger relationships.
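
A direct implementation of the product-moment formula shows the properties listed above. This is a minimal sketch with hypothetical data; `pearson_r` is a name chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    # Dividing by both spreads removes the dependence on scale and on n
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
r_rescaled = pearson_r([10 * x for x in xs], ys)   # change the units of x
r_flipped = pearson_r(xs, [-y for y in ys])        # reverse the direction of y
print(f"r = {r:.4f}, rescaled = {r_rescaled:.4f}, flipped = {r_flipped:.4f}")
```

Rescaling x leaves r unchanged, and reversing y changes only its sign, which is why r is read as direction (sign) plus strength (absolute value).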

Interpretation of correlation • r measures the extent of linear association between two continuous variables. • Association does not imply causation: both variables may be affected by a third variable. • If r = 0, there is no linear association between X and Y; a non-linear relationship may still exist. • r does not indicate the extent of non-linear associations. • The value of r can be affected by outliers.
Correlations Do Not Establish Causality • Example: When a gene is isolated that has some positive correlation with cancer, the claim usually made is that it enhances susceptibility to the disease, not that it causes it.

Some misconceptions • When the value of the correlation coefficient is large (small), the relation between the two variates is close to linear; thus, when r = 0.9 or 0.95 the relation is nearly linear • When the value of the correlation coefficient is zero or near zero, the two variates have no or almost no functional relation • When the value of the correlation coefficient is positive (negative), the value of Y becomes larger (smaller) as a whole as the value of X becomes larger
Example: Let (X,Y) take the values (1,-1), (2,-2), (3,-3), (4,-4), (5,20), each with probability 1/5. Then Cor(X,Y) = 0.62. Over the first four points Y decreases as X increases. This example shows that even when the correlation coefficient between X and Y is positive, Y does not always increase as a whole as X increases.
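
The five-point example can be checked numerically. The sketch below computes the correlation of a discrete distribution from its expectations; the helper name `correlation` is made up for this illustration.

```python
import math

def correlation(points, probs):
    """Correlation of a discrete distribution given its points and probabilities."""
    ex = sum(p * x for (x, _), p in zip(points, probs))
    ey = sum(p * y for (_, y), p in zip(points, probs))
    exy = sum(p * x * y for (x, y), p in zip(points, probs))
    var_x = sum(p * x * x for (x, _), p in zip(points, probs)) - ex ** 2
    var_y = sum(p * y * y for (_, y), p in zip(points, probs)) - ey ** 2
    return (exy - ex * ey) / math.sqrt(var_x * var_y)

points = [(1, -1), (2, -2), (3, -3), (4, -4), (5, 20)]

# All five points, each with probability 1/5
cor_all = correlation(points, [1 / 5] * 5)

# Only the first four points, each with probability 1/4
cor_first_four = correlation(points[:4], [1 / 4] * 4)

print(f"all five points: {cor_all:.2f}")
print(f"first four points: {cor_first_four:.2f}")
```

The single point (5, 20) is enough to turn a perfectly decreasing relation among the other four points into a positive overall coefficient, which also illustrates how strongly an outlier can affect r.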

Examples • Eg1. In Australia total alcohol consumption and the number of ministers of religion have both increased over time and would be positively correlated but the increase in one has not caused the increase in the other (both are related to the total population size) • Eg2. In Japanese schoolchildren shoe size was reported to be correlated (positively) with scores on a test of mathematical ability. • Eg3. Extracting informative genes with negative correlation for accurate cancer classification

Effectiveness of the first Cold-War arms agreement • "Most important, the negative correlation between the mutation rate and the parental year of birth [among those born between 1950 and 1956] provides experimental evidence for change in human germline mutation rate with declining exposure to ionizing radiation and therefore shows that the Moscow treaty banning nuclear weapon tests in the atmosphere (August 1963) has been effective in reducing genetic risk to the affected population."

Example - Heights and weights of 6 female students • The table below shows the heights and weights of 6 female students. How closely related are the heights and the weights? • The correlation coefficient = 0.904

Spearman Correlation Coefficient • Commonly used nonparametric measure of correlation between two ordinal variables. For all of the cases, the values of each of the variables are ranked from smallest to largest, and the Pearson correlation coefficient is computed on the ranks.

10 students, arranged in alphabetical order, were ranked according to their achievements in both the laboratory and lecture sections of a biology course. Find the coefficient of rank correlation. Rank Correlation Rank correlation = 0.8545
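
The students' rank table is not reproduced in this transcript, so the sketch below uses illustrative ranks chosen so that the sum of squared rank differences is 24, which gives the quoted 0.8545. It computes the coefficient in two ways: with the rank-difference shortcut (valid when there are no ties) and as the Pearson coefficient of the ranks. The function names are invented for this example.

```python
import math

def spearman_from_differences(ranks1, ranks2):
    """Rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); assumes no ties."""
    n = len(ranks1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks1, ranks2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Illustrative ranks for 10 students (assumed, not the slide's own table)
laboratory = [8, 3, 9, 2, 7, 10, 4, 6, 1, 5]
lecture = [9, 5, 10, 1, 8, 7, 3, 4, 2, 6]

rs_shortcut = spearman_from_differences(laboratory, lecture)
rs_pearson = pearson_r(laboratory, lecture)
print(f"shortcut: {rs_shortcut:.4f}, Pearson on ranks: {rs_pearson:.4f}")
```

The two results agree because, without ties, the shortcut is an algebraic simplification of the Pearson formula applied to ranks.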

Thoughts…
Patterns often emerge before the reasons for them become apparent. - Vasant Dhar
If you do not expect, you cannot find the unexpected. - Heraclitus
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. - R.A. Fisher
