Data Transformations

Presentation Transcript


    1. Data Transformations

    3. R estimates the true population parameter, based on a sample of data.

    5. The need for transformations. We should always check the assumptions that the data follow a Normal distribution with uniform variance. If the data meet the assumptions, we can analyze the raw data as described. If they are not met, we have two possible strategies:

    6. 1. We can use a method which does not require these assumptions, such as a rank-based method. 2. We can transform the data mathematically to make them fit the assumptions more closely before analysis.

    7. There are three commonly used transformations for quantitative data: the logarithm, the square root, and the reciprocal. We call these transformations variance-stabilizing because their purpose is to make variances the same. For most data encountered in healthcare research, the first or third of the situations listed on the next slide applies.

    8. If we have several groups of subjects and calculate the mean and variance for each group, we can plot variability against the mean. We might have one of these situations:
    - Variability and mean are unrelated. We do not usually have a problem and can treat the variances as uniform. We do not need a transformation.
    - Variance is proportional to the mean. A square root transformation should remove the relationship between variability and mean.
    - Standard deviation is proportional to the mean. A logarithmic transformation should remove the relationship between variability and mean.
    - Standard deviation is proportional to the square of the mean. A reciprocal transformation should remove the relationship between variability and mean.
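    A minimal sketch of this diagnostic in Python (standard library only; the group values are hypothetical, not from the slides): tabulate variability against the mean for each group, then apply the matching variance-stabilizing transformation.

```python
import math

groups = {  # hypothetical measurements for three groups of subjects
    "A": [1.2, 1.9, 2.4, 1.6, 2.1],
    "B": [3.8, 5.2, 4.1, 6.0, 4.9],
    "C": [9.5, 13.2, 11.8, 15.1, 10.4],
}

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

for name, xs in groups.items():
    m, s = mean(xs), sd(xs)
    # Plot (or tabulate) s against m; the observed pattern picks the transform:
    #   variance (s^2) proportional to m  -> square root
    #   s proportional to m               -> logarithm
    #   s proportional to m^2             -> reciprocal
    print(f"group {name}: mean={m:.2f} SD={s:.2f} SD/mean={s / m:.2f}")

# If SD/mean is roughly constant across groups, SD is proportional to the
# mean, and a log transformation should stabilize the variance:
log_groups = {name: [math.log(x) for x in xs] for name, xs in groups.items()}
```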

    9. Variance-stabilizing transformations also tend to make distributions Normal. There is a mathematical reason for this, as for so much in statistics. It can be shown that if we take several samples from the same population, the means and variances of these samples will be independent if and only if the distribution is Normal. This means that uniform variance tends to go with a Normal distribution. A transformation which makes variance uniform will often also make data follow a Normal distribution, and vice versa.

    10. Logarithmic transformation. The most frequently used transformation is the logarithm. It is particularly useful for concentrations of substances in blood. The reason is that blood is very dynamic, with reactions happening continuously. Many of the substances we measure are part of a metabolic chain, both being synthesized and metabolized to something else. The rates at which these reactions happen depend on the amounts of other substances in the blood, and the consequence is that the various factors which determine the concentration of the substance are multiplied together.

    11. Multiplying and dividing tend to produce skew distributions. If we take the logarithm of several numbers multiplied together, we get the sum of their logarithms. So a log transformation produces something where the various influences are added together, and addition tends to produce a Normal distribution.
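    As a small worked identity (standard algebra, not from the slides): the log converts the multiplicative structure into an additive one, and by the central limit theorem the sum of many independent influences tends toward a Normal distribution.

```latex
\log(a \times b \times c) = \log a + \log b + \log c
```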

    12. For example, the following figure shows serum cholesterol in stroke patients.

    14. As we have seen, the log transformation gives the serum cholesterol data for stroke patients a good fit to the Normal. What happens if we analyze the logarithm of serum cholesterol and then try to transform back to the natural scale? For the raw data, serum cholesterol: mean = 6.34, SD = 1.40. For log (base e) serum cholesterol: mean = 1.82, SD = 0.22. If we take the mean on the transformed scale and back-transform by taking the antilog, we get exp(1.82) = 6.17. This is less than the mean for the raw data: the antilog of the mean log is not the same as the untransformed arithmetic mean.

    15. The geometric mean is found by multiplying all n observations together and taking the nth root. For example, the geometric mean of 4 and 9 is 6, found by multiplying 4 by 9 to give 36 and taking the square (or second) root. The geometric mean is usually smaller than the arithmetic mean, which for 4 and 9 is (4 + 9)/2 = 6.5. Thus the mean of the logs is the log of the geometric mean.
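    A minimal sketch in Python (standard library only) of the 4-and-9 example from this slide, checking that the antilog of the mean log is the geometric mean:

```python
import math

x = [4.0, 9.0]
n = len(x)

geometric_mean = math.prod(x) ** (1 / n)   # 36 ** 0.5 = 6.0
arithmetic_mean = sum(x) / n               # (4 + 9) / 2 = 6.5

# The mean of the logs is the log of the geometric mean, so
# back-transforming (taking the antilog) the mean log recovers it:
mean_log = sum(math.log(v) for v in x) / n
assert abs(math.exp(mean_log) - geometric_mean) < 1e-12

print(geometric_mean, arithmetic_mean)  # 6.0 6.5 (geometric <= arithmetic)
```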

    16. What about the units for the geometric mean? If cholesterol is measured in mmol/L, the log of a single observation is the log of a measurement in mmol/L, and the antilog is back in the original units, mmol/L.

    17. Even if a transformation does not produce a really good fit to the Normal distribution, it may still make the data much more amenable to analysis.

    18. The following figure shows a histogram and Normal plot for the area of venous ulcer at recruitment.

    19. The raw data have a very skew distribution, and the small number of very large ulcers might lead to problems in analysis. Although the log-transformed data are still skew, the skewness is much less and the data are much easier to analyze.

    20. Making a distribution more like the Normal is not the only reason for using a transformation. The following figure shows prostate specific antigen (PSA) for three groups of prostate patients: with benign conditions, with prostatitis, and with prostate cancer.

    21. A log transformation of the PSA gives a much clearer picture. The variability is now much more similar in the three groups.

    22. The square root. The square root is best for fairly weak relationships between variability and magnitude, i.e. variance proportional to the mean or standard deviation proportional to the square root of the mean. The square root can be used for variables which are greater than or equal to zero; the log and the reciprocal can only be used for variables which are strictly greater than zero, because neither the logarithm nor the reciprocal of zero is defined.
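    A quick Python illustration (standard library only) of these domain restrictions:

```python
import math

print(math.sqrt(0.0))  # fine: the square root is defined at zero

try:
    math.log(0.0)      # the log of zero is not defined
except ValueError as err:
    print("log(0):", err)       # math domain error

try:
    1.0 / 0.0          # the reciprocal of zero is not defined
except ZeroDivisionError as err:
    print("1/0:", err)          # float division by zero
```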

    23. Arm lymphatic flow in rheumatoid arthritis with oedema

    24. The distribution is positively skew, and the variability is clearly greater in the groups with greater lymphatic activity. A square root transformation has the effect of making the data less skew and the variation more uniform. In these data, a log transformation proved to have too great an effect, making the distribution negatively skew, so the square root of the data was used in the analysis.

    25. Reciprocal transformation. The reciprocal removes the relationship between variability and mean; it is best for very strong relationships, where the standard deviation is proportional to the square of the mean.

    26. The reciprocal can only be used for variables which are strictly greater than zero. If the square root removes the least amount of skewness, the reciprocal removes the most.

    27. Can all data be transformed? Not all data can be transformed successfully. 1. Sometimes we have very long tails at both ends of the distribution, which makes transformation by log, square root, or reciprocal ineffective.

    28. For example, the distribution of blood sodium in ITU patients: this is fairly symmetrical, but has longer tails than a Normal distribution. The shape of the Normal plot is first convex, then concave.

    29. 2. Sometimes we have a bimodal distribution, which makes transformation by log, square root, or reciprocal ineffective. For example, systolic blood pressure in a sample of ITU patients.

    30. 3. Sometimes we have a large number of identical observations, which will all transform to the same value whatever transformation we use. These are often at one extreme of the distribution, usually at zero.

    31. For example, the distribution of coronary artery calcium in a large group of patients: more than half of these observations were equal at zero. Any transformation would leave half the observations with the same value, at the extreme of the distribution. It is impossible to transform these data to a Normal distribution.

    32. 4. Sometimes different transformations lead to different p-values. So, what can we do if we cannot transform data? It is usually safer to use methods that do not require such assumptions. These include the non-parametric methods.

    33. Hypothesis Testing Procedures

    34. Types of data and analysis: nominal, ordinal, discrete, continuous.

    35. Types of Data:
    - Nominal: no numerical value
    - Ordinal: order or rank
    - Discrete: counts
    - Continuous: interval, ratio

    36. Parametric Test Procedures:
    1. Involve population parameters. Example: the population mean.
    2. Require an interval or ratio scale (whole numbers or fractions). Example: height in inches (72, 60.5, 54.7).
    3. Have stringent assumptions. Example: a Normal distribution.

    37. Nonparametric Test Procedures. A nonparametric test is a hypothesis test that does not require any specific conditions about the shape of the populations or the values of any population parameters. These tests are often called "distribution-free" tests.

    38. Why non-parametric statistics?
    - Need to analyse 'crude' data (nominal, ordinal)
    - Data derived from small samples
    - Data that do not follow a normal distribution
    - Data of unknown distribution

    39. Advantages of Nonparametric Tests:
    1. Used with all scales.
    2. Easier to compute.
    3. Make fewer assumptions.
    4. Suitable for small sample sizes.
    5. Can be used when the analysis involves outlier values.
    6. No need for population parameters.
    7. Results may be as exact as parametric procedures.

    40. Disadvantages of Nonparametric Tests:
    1. May waste information if the data permit parametric procedures. Example: converting data from a ratio to an ordinal scale.
    2. Difficult to compute by hand for large samples.
    3. Tables not widely available.

    41. What is a parameter and why should I care? Most statistical tests, like the t test, assume some kind of underlying distribution, like the Normal distribution. If you know the mean and the standard deviation of a Normal distribution, then you know how to calculate probabilities. Means and standard deviations are called parameters; all theoretical distributions have parameters. Statistical tests that assume a distribution and use parameters are called parametric tests. Statistical tests that don't assume a distribution or use parameters are called nonparametric tests.

    42. Ranks. Many nonparametric procedures are based on ranks. Data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size. Ties are resolved by assigning tied values the mean of the ranks they would have received if there were no ties. Example: 117, 119, 119, 125, 128 becomes 1, 2.5, 2.5, 4, 5. If the two 119s were not tied, they would have been assigned the ranks 2 and 3; the mean of 2 and 3 is 2.5. Procedure: replace the original data with the ranks across subjects and then perform the parametric test on the ranks.
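    A minimal sketch of this ranking rule in Python (standard library only; the function name rank_with_ties is mine, not from the slides), reproducing the example above:

```python
def rank_with_ties(values):
    """Rank values from lowest to highest, 1-based; tied values share
    the mean of the ranks they would otherwise have received."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j + 2) / 2  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

print(rank_with_ties([117, 119, 119, 125, 128]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```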

    43. For large samples, many nonparametric techniques can be viewed as the usual normal-theory-based procedures applied to ranks.

    44. Wilcoxon signed rank test: used to test for a difference between paired data.

    45. STEP 1:
    - Exclude any differences which are zero.
    - Put the remaining differences in ascending order, ignoring their signs.
    - Assign them ranks.
    - If any differences are equal, average their ranks.

    46. STEP 2:
    - Sum the ranks of the positive differences as T+.
    - Sum the ranks of the negative differences as T-.

    47. STEP 3: If there is no difference between drug (T+) and placebo (T-), then T+ and T- would be similar. If there were a difference, one sum would be much smaller and the other much larger than expected. The smaller sum is denoted T: T = smaller of T+ and T-.

    48. STEP 4: Compare the value obtained with the critical values (5%, 2% and 1%) in the table. N is the number of differences that were ranked (not the total number of differences), so the zero differences are excluded. A code sketch of these four steps follows.
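    A minimal sketch of Steps 1-3 in Python (standard library only; the paired drug/placebo values and the helper avg_ranks are hypothetical, not from the slides), leaving T and N ready for the table lookup in Step 4:

```python
def avg_ranks(values):
    """1-based ranks, lowest first; ties get the mean of their ranks."""
    s = sorted(values)
    def rank(v):
        first = s.index(v)                    # first 0-based position of v
        last = len(s) - 1 - s[::-1].index(v)  # last 0-based position of v
        return (first + last + 2) / 2         # mean of the 1-based positions
    return [rank(v) for v in values]

def wilcoxon_t(pairs):
    # Step 1: differences, excluding zeros, ranked by absolute size
    diffs = [a - b for a, b in pairs if a != b]
    ranks = avg_ranks([abs(d) for d in diffs])
    # Step 2: sum the ranks of the positive and negative differences
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    # Step 3: T is the smaller sum; N counts the non-zero differences
    return min(t_plus, t_minus), len(diffs)

drug = [12.1, 9.8, 11.5, 10.0, 13.2, 9.9]
placebo = [10.4, 9.8, 10.1, 10.6, 11.0, 9.1]
t, n = wilcoxon_t(list(zip(drug, placebo)))
print(f"T = {t}, N = {n}")  # Step 4: compare T with the tabulated critical value
```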

    52. Because the calculated T corresponds to a p-value less than 0.05 in the tables, the difference is significant, and we can reject H0.

    53. Signed Rank Test Computation Table for paired data.

    55. Signed Rank Test Computation Table

    56. Wilcoxon Signed Rank Table (Portion)

    57. There are two ways of making the comparison using tables for the Wilcoxon signed rank test. 1. Looking at critical values: the calculated T (the smaller sum) is compared with the tabulated value at the given N and p. The difference is significant (the null hypothesis is rejected) if the calculated T is less than or equal to the tabulated T.

    58. 2. Comparing p-values: find the p-value at the given N that matches the calculated T. If this p-value is greater than the specified level (0.05, for example), H0 cannot be rejected, i.e. the result is not significant. It is significant only if that p-value is less than or equal to the assumed level. For a two-tailed test, p = 2 × the one-tailed p.
