240 likes | 441 Vues
Analysing correlated data. Maureen Meadows Senior Lecturer in Management, Open University Business School. Aims of today’s session. Discuss different forms of dependency and correlation that we might find in our datasets
E N D
Analysing correlated data Maureen Meadows Senior Lecturer in Management, Open University Business School
Aims of today’s session • Discuss different forms of dependency and correlation that we might find in our datasets • Explore why it might sometimes be problematic to get a good measure of correlation • Look at examples of data analysis where correlation is an important issue • Discuss why it is important to deal with correlations, and not to forget about them!
What is dependence? • Dependence is any statistical relationship between two random variables, or two sets of data • In other words, they are not independent of each other • E.g. the relationship between the height of parents and the height of children, or the relationship between the demand for a product and its price
What is correlation? • Two variables are said to be correlated if changes in one variable are associated with changes in the other variable • So, if we know how one variable is changing, we have a good idea how the other variable is likely to be changing too • Hence it is widely used in many forms of forecasting etc.
What is the correlation coefficient? • Pearson’s product moment correlation coefficient is also known as r, R, or Pearson's r • It is a measure of the strength and direction of the association - in particular, the linear relationship- between two (metric) variables • It is defined as the covariance of the variables divided by the product of their standard deviations
Pearson’s correlation coefficient • It is a measure of the linear dependence or correlation between the two variables • The sign (+ or -) indicates the direction of the relationship • The value can range from -1 to +1, with +1 indicating a perfect positive relationship, 0 indicating no relationship and -1 indicating a perfect negative or reverse relationship
Other correlation coefficients • Arank correlation is any of several statistics that measure the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the labels "first", "second", "third", etc. to different observations of a particular variable • Arank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them
Rank correlation coefficients • Spearman’s rank correlation coefficient is a measure of how well the relationship between two variables can be described by a monotonic function • Kendall tau rank correlation coefficient is a measure of the portion of ranks that match between two data sets • Goodman and Kruskal'sgamma is a measure of the strength of association of the cross tabulated data when both variables are measured at the ordinal level
Correlation and sample size • Is a correlation coefficient significantly different from zero or not? • Example: three different correlation coefficients: 0.50, 0.35, and 0.17. • Assume that we want to test whether there is no significant relationship between the two variables at hand. The null hypothesis (H0) to be tested is that these r values are not statistically different from zero (rho = 0).
Sample size example (continued) • For rho = 0, H0 can be tested using a two tailed t-test at a given confidence level, usually at a 95% level • If tcalculated ≥ ttable, H0 is rejected • If tcalculated < ttable H0 is not rejected and there is no significant correlation between variables • Here tcalculated is computed as r/SEr = r*SQRT[((n – 2)/(1 – r2))] while ttablevalues are obtained from the literature
Sample size example (continued) • For n = 14, all three r values (0.50, 0.35, and 0.17) are not statistically different from zero • For n = 30, r = 0.50 is statistically different from zero while r = 0.35 and r = 0.17 are not • Conversely, r = 0.50 is not statistically different from zero when n is equal to or less than 14 while r = 0.35 is not different from zero when n is equal to or less than 30 • Finally, r = 0.17 is not statistically different from zero at any of the sample sizes tested
What is a correlation matrix? • It is a table showing the correlations between a set of variables • Often inspected before multivariate methods are applied, e.g. regression analysis or factor analysis • Examples: Kim et al (2011), Mohammed (2013)
The correlation coefficient and least squares regression analysis • The square of the correlation coefficient, typically denoted r2, is called the coefficient of determination • It estimates the fraction of the variance in Y that is explained by X in a simple linear regression • Example: McDaniel (1981)
What is collinearity? • An expression of the relationship between two variables (collinearity), or more than two variables (multicollinearity) • Two variables exhibit complete collinearity if their correlation coefficient is 1, and complete lack of collinearity if their correlation coefficient is 0 • Multicollinearity occurs when a variable is highly correlated with a set of other variables
When can multicollinearity be a problem? • Multicollinearity is the extent to which a variable can be ‘explained’ by the other variables in the analysis • As multicollinearity increases, it complicates the interpretation of the variate(the linear combination of variables formed in a technique such as regression) • It becomes more difficult to ascertain the effect of any single variables, because of their inter-relationships
The Variance Inflation Factor (VIF) • An indicator of the effect that the other independent variables have on the standard error of a regression coefficient • Large VIF values indicate a high degree of collinearity or multicollinearity among the independent variables • The VIF is directly related to the tolerance value (VIF = 1/tol) • Example: Zhou et al (2013)
Tolerance • Another commonly used measure of collinearity and multicollinearity • The tolerance of a variable is 1- r2, where r2is the coefficient of determination for the prediction of that variable by the other independent variables • As the tolerance grows smaller, the variable is more highly predicted by the other independent variables (collinearity)
Correlations and cluster analysis • Variables that are multicollinear are implictly weighted more heavily • E.g. if we cluster on 10 equally weighted variables, and they form two dimensions (one of 8 variables and the other of the remaining 2), then the first dimension will have four times as many chances to affect the similarity measure
Correlations and factor analysis • Some degree of multicollinearity is desirable, because the objective is to identify sets of variables that are interrelated • Examples: Kim et al (2011), Mohammed (2013)
References • Hair, Anderson, Tatham and Black, Multivariate Data Analysis, 7th edition (Prentice Hall, 2009) • Kim, JY, Shim, JP and Ahn, KM (2011) ‘Social Networking Service: Motivation, Pleasure, and Behavioral Intention to Use’, Journal of Computer Information Systems, Summer, 92-101. • McDaniel, SW (1981) ‘Multicollinearity in advertising-related data’, Journal of Advertising Research, 21:3, 59-63. • Mohammed, S. (2013) ‘Factors Affecting E-Banking Usage in India: an Empirical Analysis’, Economic Insights – Trends and Challenges, Vol. II (LXV) No. 1, 17-25. • Zhou, X, Han, Y and Wang, R (2013) ‘An Empirical Investigation on Firms’ Proactive and Passive Motivation for Bribery in China’, Journal of Business Ethics, 118:461-472.