
CORRELATION



  1. CORRELATION analysis

  2. SECTION 1 INTRODUCTION

  3. CORRELATION The correlation coefficient. Correlation is the method of analysis to use when studying the possible association between two continuous variables. If we want to measure the degree of association, this can be done by calculating the correlation coefficient. A relationship exists when changes in one variable tend to be accompanied by consistent and predictable changes in the other variable.

  4. INTRODUCTION
  • Correlation: the consideration of whether there is any relationship or association between two variables.
  • Correlation model: applies when both Y and X are random variables; under the correlation model, sample observations are obtained by selecting a random sample of the units of association and taking on each a measurement of X and a measurement of Y.
  • Correlation analysis: a statistical tool used to study the closeness of the relationship between two or more variables.
  • Correlation matrix: presents correlation coefficients among a group of variables; an investigator portrays all possible bivariate combinations of a set of variables in order to ascertain patterns of interesting associations for further study.

  5. INTRODUCTION
  • The correlation coefficient is the index which defines the strength of association between two variables.
  • If there is a relationship between any two variables, it may be possible to predict the value of one of the variables from the value of the other.
  • To establish whether there is a relationship between two variables, an appropriate random sample must be taken and a measurement recorded of each of the two variables.
  • Such data are said to be bivariate data, since they consist of two variables. Data may be written as ordered pairs, expressed in a specific order for each individual, i.e. (first variable value, second variable value).
  • To predict the value of one variable from the value of the other (if a relationship exists), it is useful to label the variables as follows: the dependent variable is the one whose value is to be predicted, usually denoted by the letter y; the independent variable is the one whose value is used to make the prediction, usually denoted by the letter x.
  • In this form the ordered pairs resemble points on a graph, making it possible to infer what the actual relationship between the variables may be, even before any calculations are undertaken.

  6. INTRODUCTION Overview: which technique is appropriate depends on the measurement levels of the dependent and independent variables.
  • Nominal dependent variable, nominal independent variable: Lambda. Considers the distribution of one variable across the categories of another variable.
  • Nominal dependent variable, interval independent variable: Logistic Regression. Considers how a change in a variable affects a discrete outcome.
  • Interval dependent variable, nominal independent variable: Confidence Intervals / T-Test. Considers the difference between the mean of one group on a variable and that of another group.
  • Interval dependent variable, interval independent variable: Regression / Correlation. Considers the degree to which a change in one variable results in a change in another.

  7. SECTION 2 PEARSON PRODUCT-MOMENT CORRELATION COEFFICIENT

  8. THE SCATTER DIAGRAM Overview: A scatter diagram portrays the relationship between two variables, where the relationship occurs in a sample of ordered (X, Y) pairs. One constructs such a diagram by plotting, on Cartesian coordinates, the X and Y measurements for each subject. Example: as an example of two highly correlated measures, consider systolic and diastolic blood pressure. Remember that when your blood pressure is measured, you are given two values (e.g., 120/70). Across a sample of subjects, these two values are known to be highly correlated and are said to form a linear (straight line) relationship.

  9. SCATTER DIAGRAM Example Systolic and Diastolic Blood Pressure Values for a Sample of 48 Elderly Men

  10. EXAMPLE Scatter diagram of systolic and diastolic blood pressure. The horizontal axis, or abscissa (the X axis), denotes the independent variable that we identified in our analytic model. The vertical axis, or ordinate (the Y axis), identifies the dependent, or outcome, variable. For the blood pressure data, the choice of the X or Y axis is arbitrary, for there is no independent or dependent variable. We then plot the variable pairs on the graph.
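  To make the plotting step concrete, here is a minimal matplotlib sketch; the blood pressure values below are invented for illustration, since the slide's sample of 48 elderly men is not reproduced in the transcript.

```python
# Sketch of a scatter diagram for paired blood pressure readings.
import matplotlib.pyplot as plt

systolic = [120, 135, 118, 150, 142, 128, 160, 138]   # hypothetical values
diastolic = [70, 85, 72, 95, 88, 78, 100, 84]          # hypothetical values

plt.scatter(systolic, diastolic)
plt.xlabel("Systolic blood pressure (mm Hg)")   # abscissa (X axis)
plt.ylabel("Diastolic blood pressure (mm Hg)")  # ordinate (Y axis)
plt.title("Scatter diagram of systolic vs. diastolic blood pressure")
plt.show()
```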

  11. PEARSON PRODUCT-MOMENT CORRELATION The Pearson correlation coefficient (ρ) is a population parameter that measures the degree of association between two variables. It is a natural parameter for a distribution called the bivariate normal distribution. Briefly, the bivariate normal distribution is a probability distribution for X and Y that has normal distributions for both X and Y and a special form for the density function for the variable pairs. This form allows for positive or negative dependence between X and Y. The Pearson correlation coefficient is used for assessing the linear (straight line) association between an X and a Y variable, and requires interval or ratio measurement. The symbol for the sample correlation coefficient is r, which is the sample estimate of ρ that can be obtained from a sample of pairs (X, Y) of values for X and Y.

  12. ASSUMPTIONS When we choose to analyse data using Pearson's correlation, part of the process involves checking that the data can actually be analysed this way. We need to do this because it is only appropriate to use Pearson's correlation if the data "pass" four assumptions that are required for Pearson's correlation to give a valid result. In practice, checking these four assumptions just adds a little more time to the analysis, requiring us to click a few more buttons in SPSS Statistics and to think a little more about our data, but it is not a difficult task. Do not be surprised if, when analysing our own data using SPSS Statistics, one or more of these assumptions is violated (i.e., is not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show how to carry out Pearson's correlation when everything goes well. However, even when data fail certain assumptions, there is often a solution to overcome this. First, let's take a look at these four assumptions:

  13. ASSUMPTIONS Assumption #1: The two variables should be measured at the interval or ratio level (i.e., they are continuous). Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. Assumption #2: There is a linear relationship between the two variables. Whilst there are a number of ways to check whether a linear relationship exists between two variables, we suggest creating a scatterplot using SPSS Statistics, plotting one variable against the other, and then visually inspecting the scatterplot to check for linearity.

  14. ASSUMPTIONS Assumption #3: There should be no significant outliers. Outliers are single data points within the data that do not follow the usual pattern (e.g., in a study of 100 students' IQ scores, where the mean score was 108 with only small variation between students, one student had a score of 156, which is very unusual and may even put her in the top 1% of IQ scores globally). Pearson's correlation coefficient, r, is sensitive to outliers, which can have a very large effect on the line of best fit and on the correlation coefficient itself. Including outliers in an analysis can therefore lead to misleading results, so it is best if there are no outliers or they are kept to a minimum. Fortunately, when using SPSS Statistics to run Pearson's correlation, you can easily include procedures to screen for outliers: (a) detect outliers using a scatterplot, which is a simple process; and (b) consider the options available for dealing with them.
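  As a small illustration of this sensitivity (not from the original slides), the following sketch adds a single extreme point to otherwise well-behaved synthetic data and shows how r shifts:

```python
# Sketch of an outlier's effect on Pearson's r, using invented data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(100, 10, 50)
y = x + rng.normal(0, 5, 50)           # roughly linear relationship

r_clean, _ = pearsonr(x, y)

x_out = np.append(x, 160)              # one extreme point...
y_out = np.append(y, 40)               # ...that breaks the pattern

r_outlier, _ = pearsonr(x_out, y_out)
print(f"r without outlier: {r_clean:.2f}, with outlier: {r_outlier:.2f}")
```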

  15. ASSUMPTIONS Assumption #4: The variables should be approximately normally distributed. In order to assess the statistical significance of the Pearson correlation, you need bivariate normality, but this assumption is difficult to assess, so a simpler method is more commonly used: determining the normality of each variable separately. To test for normality we can use the Shapiro-Wilk test, which is easy to run in SPSS Statistics. We can check assumptions #2, #3 and #4 using SPSS Statistics. Remember that if we do not test these assumptions correctly, the results we get when running a Pearson's correlation might not be valid.
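  A minimal sketch of this per-variable check, using scipy's Shapiro-Wilk test in place of the SPSS procedure; the variable names and sample values are invented:

```python
# Per-variable normality check via the Shapiro-Wilk test.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
revision_time = rng.normal(20, 5, 40)   # hypothetical variable
exam_score = rng.normal(65, 10, 40)     # hypothetical variable

for name, data in [("revision_time", revision_time), ("exam_score", exam_score)]:
    stat, p = shapiro(data)
    # p > 0.05: no evidence against normality at the 5% level
    print(f"{name}: W = {stat:.3f}, p = {p:.3f}")
```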

  16. BIVARIATE NORMAL DISTRIBUTION Under the correlation model, X and Y are assumed to vary together in what is called a joint distribution. If this joint distribution is a normal distribution, it is referred to as a bivariate normal distribution. Inferences regarding this population may be made based on the results of samples properly drawn from it. If, on the other hand, the form of the joint distribution is known to be non-normal, or if the form is unknown and there is no justification for assuming normality, inferential procedures are invalid, although descriptive measures may still be computed. The bivariate normal distribution has five parameters: σ_X, σ_Y, μ_X, μ_Y, and ρ. The first four are, respectively, the standard deviations and means associated with the individual distributions of X and Y. The other parameter, ρ, is called the population correlation coefficient and measures the strength of the linear relationship between X and Y.
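  To make the five parameters concrete, here is a small sketch (not from the slides) that draws (X, Y) pairs from a bivariate normal with assumed parameter values and confirms the sample correlation lands near ρ:

```python
# Sampling from a bivariate normal defined by mu_x, mu_y, sigma_x, sigma_y, rho.
import numpy as np

mu_x, mu_y = 120.0, 80.0          # assumed means
sigma_x, sigma_y = 15.0, 10.0     # assumed standard deviations
rho = 0.7                         # assumed population correlation

mean = [mu_x, mu_y]
cov = [[sigma_x**2,              rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]

rng = np.random.default_rng(2)
pairs = rng.multivariate_normal(mean, cov, size=1000)   # (X, Y) pairs
print(np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1])      # close to rho = 0.7
```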

  17. ILLUSTRATION OF A BIVARIATE NORMAL DISTRIBUTION (a) A bivariate normal distribution. (b) A cutaway showing the normally distributed subpopulation of Y for a given X. (c) A cutaway showing the normally distributed subpopulation of X for a given Y. In this illustration we see that if we slice the mound parallel to Y at some value of X, the cutaway reveals the corresponding normal distribution of Y. Similarly, a slice through the mound parallel to X at some value of Y reveals the corresponding normally distributed subpopulation of X.

  18. ASSUMPTIONS The following assumptions must hold for inferences about the population to be valid when sampling is from a bivariate distribution.
  1. For each value of X there is a normally distributed subpopulation of Y values.
  2. For each value of Y there is a normally distributed subpopulation of X values.
  3. The joint distribution of X and Y is a normal distribution called the bivariate normal distribution.
  4. The subpopulations of Y values all have the same variance.
  5. The subpopulations of X values all have the same variance.

  19. EXAMPLES OF BIVARIATE ASSOCIATIONS The Pearson correlation coefficient is used for assessing the linear (straight line) association between an X and a Y variable, and requires interval or ratio measurement. The symbol for the sample correlation coefficient is r, which is the sample estimate of ρ that can be obtained from a sample of pairs (X, Y) of values for X and Y. The correlation varies from negative one to positive one (-1 ≤ r ≤ +1). A correlation of +1 or -1 refers to a perfect positive or negative X, Y relationship, respectively (panels A and B of the original figure). Data falling exactly on a straight line indicate that |r| = 1.

  20. PEARSON PRODUCT-MOMENT CORRELATION
  1. Correlation coefficients merely indicate association between X and Y, and not causation.
  2. If |r| = 1, then all the sample data fall exactly on a straight line.
  3. This one-to-one association observed for the sample data does not necessarily mean that |ρ| = 1.
  4. But if the number of pairs is large, a high value of r suggests that the correlation between the variable pairs in the population is high.

  21. PEARSON PRODUCT-MOMENT CORRELATION Overview: The numerical measure of the degree of association between two variables is given by the product-moment correlation coefficient. This index provides a quantitative measure of the extent to which the two variables are associated. The value of the correlation coefficient is calculated from the bivariate data by means of a formula that involves the values of the data points. The value calculated from a sample is denoted by the letter r; the value calculated from a population is denoted by the Greek letter ρ.

  22. PEARSON PRODUCT-MOMENT CORRELATION SAMPLE ESTIMATE The calculation formula for r is:

  $$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{[\,n \sum X^2 - (\sum X)^2\,][\,n \sum Y^2 - (\sum Y)^2\,]}}$$

  The deviation score formula for r is:

  $$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}$$

  23. PEARSON PRODUCT-MOMENT CORRELATION The deviation score formula: We will apply these formulae to a small sample (n = 10) of weight and height measurements. The first calculation uses the deviation score formula (i.e., the difference between each observation for a variable and the mean of the variable). Substituting the ten pairs into the formula gives r = 0.29.

  24. PEARSON PRODUCT-MOMENT CORRELATION The calculation formula: When using the calculation formula on the same sample, we do not need to create deviation scores, making the calculations a bit easier to perform. As expected, it gives the same result, r = 0.29.
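  Since the slide's ten weight/height pairs are not preserved in the transcript, here is a sketch with invented data showing that the two formulas agree (the invented data will not reproduce the slide's r = 0.29):

```python
# Pearson's r via the deviation score formula and the calculation formula.
import numpy as np

weight = np.array([60, 72, 65, 80, 55, 90, 68, 75, 62, 70], dtype=float)
height = np.array([165, 178, 170, 182, 160, 185, 172, 176, 168, 174], dtype=float)
n = len(weight)

# Deviation score formula
dx, dy = weight - weight.mean(), height - height.mean()
r_dev = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Calculation formula (no deviation scores needed)
num = n * (weight * height).sum() - weight.sum() * height.sum()
den = np.sqrt((n * (weight**2).sum() - weight.sum()**2) *
              (n * (height**2).sum() - height.sum()**2))
r_calc = num / den

print(round(r_dev, 4), round(r_calc, 4))  # identical values
```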

  25. CONFIDENCE INTERVAL FOR PEARSON'S CORRELATION Fisher's r-to-z transformation: Fisher developed a transformation of r that tends to become normal quickly as N increases. It is called the r-to-z transformation:

  $$z = \tfrac{1}{2}\ln\frac{1 + r}{1 - r}$$

  We use it to conduct tests of the correlation coefficient and to calculate the CI. z is approximately normally distributed, with an expectation equal to $\tfrac{1}{2}\ln\frac{1 + \rho}{1 - \rho}$, where ρ is the population correlation of which r is an estimate, and a standard deviation of $1/\sqrt{N - 3}$. Therefore, having computed an obtained z from the obtained r (using the table for the r-to-z transformation), a CI can easily be constructed in z-space:

  $$z \pm z_{crit}\,\frac{1}{\sqrt{N - 3}}$$

  where the criterion z corresponds to the desired confidence level (e.g., 1.96 in the case of a 95% confidence interval). The upper and lower z limits of this CI can then be transformed back to upper and lower r limits.

  26. CONFIDENCE INTERVAL FOR PEARSON'S CORRELATION Fisher's r-to-z transformation [formula slide; the transformation and its standard error are given above].

  27.-30. CONFIDENCE INTERVAL FOR PEARSON'S CORRELATION EXAMPLE [These four slides presented a worked example in five numbered steps; the formulas and numerical values were not preserved in the transcript.]
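  In place of the lost worked example, here is a minimal sketch of the CI computation described above, assuming the r = 0.29 and n = 10 from slides 23-24 (numpy's arctanh/tanh implement the r-to-z transformation and its inverse):

```python
# 95% CI for rho via Fisher's r-to-z transformation.
import numpy as np

r, n = 0.29, 10                    # assumed from the earlier example
z = np.arctanh(r)                  # r-to-z: z = 0.5 * ln((1 + r) / (1 - r))
se = 1.0 / np.sqrt(n - 3)          # standard error in z-space
z_lo, z_hi = z - 1.96 * se, z + 1.96 * se
r_lo, r_hi = np.tanh(z_lo), np.tanh(z_hi)   # transform back to r-space
print(f"95% CI for rho: ({r_lo:.2f}, {r_hi:.2f})")
```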

  31. SECTION 3 PHI-COEFFICIENT (φ) or YULE'S Q

  32. PHI-COEFFICIENT The phi-coefficient is actually a product-moment coefficient of correlation and is a variation of Pearson's definition of r for the case where the two states of each variable are given values of 0 and 1 respectively. The phi-coefficient was designed for the comparison of truly dichotomous distributions, i.e., distributions that have only two points on their scale which indicate some unmeasurable attribute. Attributes such as living or dead, black or white, accept or reject, and success or failure are examples. It is also sometimes known as the Yule φ. If certain allowances are made for continuity, the technique may be applied to observations grouped into two arbitrary, but clearly defined, divisions. It is clear that in many such divisions point attributes are achieved by a binary decision process employing demarcation lines through "grey" or "ill-defined" regions.

  33. PHI-COEFFICIENT The phi-coefficient relates to the 2 × 2 table:

                Y = 1   Y = 0
        X = 1     a       b
        X = 0     c       d

  If a, b, c, and d represent the frequencies of observation, then φ is determined by the relationship

  $$\varphi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}$$

  The phi-coefficient is particularly used in psychological and educational testing, where the imposing of a dichotomy on a continuous variable is a frequent occurrence; variables where pass/fail categories are obtained in relation to a threshold score are typical.

  34. PHI-COEFFICIENT It bears a relationship to chi-square, where

  $$\chi^2 = N\varphi^2$$

  The significance of φ may be tested by determining the value of χ² from the above relationship and testing in the usual way. An alternative significance test (rarely used) may be performed by considering the standard error of φ. Calculation of this is laborious, but if N is not too small, 1/√N approximates it.
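  A short sketch with invented cell frequencies, computing φ from the 2 × 2 table above and the chi-square value it implies:

```python
# Phi-coefficient from a 2x2 table of frequencies a, b, c, d.
import numpy as np

a, b, c, d = 30, 10, 15, 45        # hypothetical cell frequencies
N = a + b + c + d

phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
chi_square = N * phi**2            # chi-square = N * phi^2
print(f"phi = {phi:.3f}, chi-square = {chi_square:.2f}")
```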

  35. YULE'S Q Yule's Q is used to analyse the strength and direction of association between two dichotomous variables (e.g., "gender", which has two categories: "males" and "females"). Assumptions:
  1. The Yule's Q coefficient is a distribution-free statistic.
  2. One major assumption of Yule's Q is that the data must be dichotomized.

  36. YULE'S Q
  1. Yule's Q is a nominal-level measure of association that can be used to determine the association or relationship between variables.
  2. Yule originated this measure of association for variables which have two and only two values. It is used with 2 × 2 tables, each variable being expressed as a dichotomy.
  3. These dichotomies may be male-female, yes-no, true-false, for-against, agree-disagree, graduate-non-graduate, tall-short, high-low, and so on.
  4. The Yule's Q coefficient is a distribution-free statistic.
  5. For variables taking many values, the researcher may want to collapse these variables into dichotomies.
  6. The process involves the construction of dichotomies, which affects the value of the Yule's Q depending on how the original categories were collapsed.
  7. Yule's Q is, therefore, the ratio of the difference between the products of the diagonal cell frequencies to the sum of those products: Q = (ad - bc) / (ad + bc).
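  A sketch of the ratio in point 7, again with invented cell frequencies for a 2 × 2 table:

```python
# Yule's Q from a 2x2 table of frequencies a, b, c, d.
a, b, c, d = 30, 10, 15, 45             # hypothetical cell frequencies

Q = (a * d - b * c) / (a * d + b * c)   # ratio of diagonal products
print(f"Yule's Q = {Q:.3f}")
```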

  37. YULE'S Q Advantages:
  1. No correction needs to be made for it.
  2. It can be computed from a 2 × 2 table without first computing the chi-square test.
  3. It is most meaningfully applied where data are in dichotomies.
  4. There are no stringent assumptions for its application.
  5. It is quickly and easily computed.
  6. It is a measure of the proportional reduction in error associated with predicting one variable from the other.

  38. YULE'S Q Disadvantages:
  1. It is limited to 2 × 2 tables. When the researcher's data fit into larger tables, for example a 2 × 3, 3 × 3, 3 × 4 or 3 × 6 table, Yule's Q cannot be computed unless the data are "collapsed" into a 2 × 2 table.
  2. Collapsing data into fewer categories may lead to the loss of vital information; hence, it is better to avoid collapsing data when using Yule's Q.

  39. SECTION 4 SPEARMAN’S RANK-ORDER CORRELATION COEFFICIENT

  40. SPEARMAN'S RANK-ORDER CORRELATION COEFFICIENT Now we will measure correlation in a more general way that satisfies two conditions:
  1. X and Y are allowed to have any joint distribution, not necessarily the bivariate normal distribution.
  2. If Y increases (or decreases) as X increases, then the correlation measure will be +1 (or -1).
  Statisticians have derived nonparametric measures of correlation that exhibit these two properties. Two examples are Spearman's rho (ρ_s), attributed to Spearman (1904), and Kendall's tau (τ), introduced in Kendall (1938). Both of these measures have been shown to satisfy conditions 1 and 2.

  41. SPEARMAN'S RANK-ORDER CORRELATION COEFFICIENT An alternative measure of the degree of association between two variables is the rank correlation coefficient r_s, a nonparametric version of the Pearson product-moment correlation. The subscript s stands for 'Spearman' and is used to distinguish it from the Pearson product-moment correlation coefficient. This coefficient does not strictly measure the degree of association between the actual observations, but rather the association between the ranks of the observations.

  42. SPEARMAN'S RANK-ORDER CORRELATION COEFFICIENT The Spearman rank-order correlation coefficient (Spearman's correlation, for short) is a nonparametric measure of the strength and direction of association that exists between two variables measured on at least an ordinal scale. It is denoted by the symbol r_s (or the Greek letter ρ, pronounced rho). The test is used for either ordinal variables or for continuous data that have failed the assumptions necessary for conducting the Pearson product-moment correlation. For example, we could use a Spearman's correlation to understand whether there is an association between exam performance and time spent revising; whether there is an association between depression and length of unemployment; and so forth. Possible alternative tests to Spearman's correlation are Kendall's tau-b or Goodman and Kruskal's gamma.

  43. SPEARMAN'S RANK-ORDER CORRELATION COEFFICIENT Assumptions: When we choose to analyse data using Spearman's correlation, part of the process involves checking that the data can actually be analysed this way: it is only appropriate to use a Spearman's correlation if the data "pass" two assumptions that are required for it to give a valid result. In practice, checking these two assumptions just adds a little more time to the analysis, requiring a few more clicks in SPSS Statistics and a little more thought about the data, but it is not a difficult task. These two assumptions are: Assumption #1: The two variables should be measured on an ordinal, interval or ratio scale. Examples of ordinal variables include Likert scales (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much", to "It is OK", to "Yes, a lot"). Examples of interval/ratio variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.

  44. SPEARMAN'S RANK-ORDER CORRELATION COEFFICIENT Assumption #2: There is a monotonic relationship between the two variables. A monotonic relationship exists when either the variables increase in value together, or as one variable value increases, the other variable value decreases. Whilst there are a number of ways to check whether a monotonic relationship exists between two variables, we suggest creating a scatterplot using SPSS Statistics, plotting one variable against the other, and then visually inspecting the scatterplot to check for monotonicity.

  45. SPEARMAN’S RANK-ORDER CORRELATION COEFFICIENT Note: Spearman's correlation determines the degree to which a relationship is monotonic. Put another way, it determines whether there is a monotonic component of association between two continuous or ordinal variables. As such, monotonicity is not actually an assumption of Spearman's correlation. However, we would not normally want to pursue a Spearman's correlation to determine the strength and direction of a monotonic relationship when we already know the relationship between our two variables is not monotonic. Instead, the relationship between our two variables might be better described by another statistical measure of association. For this reason, it is not uncommon to view the relationship between our two variables in a scatterplot to see if running a Spearman's correlation is the best choice as a measure of association or whether another measure would be better.

  46. SPEARMAN'S RANK-ORDER CORRELATION COEFFICIENT Spearman's rho is derived as follows:
  1. Separately rank the measurements for the X's and the Y's in increasing order.
  2. Replace the (X, Y) pair for each subject with its rank pair (i.e., if X has rank 4 and Y has rank 7, the transformation replaces the pair with the rank pair (4, 7)).
  3. Apply the formula for Pearson's product-moment correlation to the rank pairs instead of to the original pairs. The result is Spearman's rho.

  47. SPEARMAN'S RANK-ORDER CORRELATION COEFFICIENT Spearman's rho enjoys the property that all of its values lie between -1 and 1. This result obtains because rho is the Pearson correlation formula applied to ranks. If Y is a monotonically increasing function of X (i.e., as X increases, Y increases), then the rank of each Y will match the rank of its X. This relationship means that the ranked pairs will be (1, 1), (2, 2), (3, 3), ..., (n, n).

  48. SPEARMAN'S RANK-ORDER CORRELATION COEFFICIENT The computational formula applies Pearson's r to the rank pairs:

  $$r_s = \frac{\sum (R(X_i) - \bar{R}_X)(R(Y_i) - \bar{R}_Y)}{\sqrt{\sum (R(X_i) - \bar{R}_X)^2 \, \sum (R(Y_i) - \bar{R}_Y)^2}}$$

  where R(X_i) is the rank of X_i and R(Y_i) is the rank of Y_i. When there are no ties, the formula simplifies to the following equation:

  $$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

  where n is the number of ranked pairs and d_i = R(X_i) - R(Y_i).
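  A sketch with invented, tie-free data confirming that ranking followed by Pearson's formula, the no-ties shortcut, and scipy.stats.spearmanr all agree:

```python
# Spearman's rho three ways: Pearson on ranks, no-ties shortcut, scipy.
import numpy as np
from scipy.stats import spearmanr, rankdata, pearsonr

x = np.array([12, 7, 30, 22, 18, 5, 26, 15], dtype=float)
y = np.array([3, 1, 10, 7, 9, 2, 8, 4], dtype=float)

rx, ry = rankdata(x), rankdata(y)      # step 1: rank each variable
r_ranks, _ = pearsonr(rx, ry)          # step 2: Pearson on the rank pairs

d = rx - ry                            # rank differences (no ties here)
n = len(x)
r_short = 1 - 6 * (d**2).sum() / (n * (n**2 - 1))

print(round(r_ranks, 4), round(r_short, 4), round(spearmanr(x, y)[0], 4))
```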

  49. SPEARMAN'S RANK-ORDER CORRELATION COEFFICIENT It is also worth noting that a Spearman's correlation can be used when the two variables are not normally distributed. It is also not very sensitive to outliers (observations within the data that do not follow the usual pattern), which means we can still obtain a valid result when there are outliers in the data.

  50. THANK YOU SERVER.RAREDIS.ORG/EDU
