Chapter 9

Chapter 9 Analyzing Data

Learning Objectives • Process raw data into variables • Gain a preliminary understanding of various statistics computer packages • Examine and transform variables into analyzable units • Describe the data with appropriate descriptive statistics • Answer the research question using appropriate inferential statistics

“We tend to regard statistics as though they are magical, as though they are more than mere numbers. We treat them as powerful representations of the truth; we act as though they distill the complexity and confusion of reality into simple facts.” (Best, 2001, p. 160)

Misleading or Even Dangerous • Data that are not analyzed correctly • Results that are not interpreted appropriately

From Raw Data to Variables • Usually newly collected data needs to be “processed” or modified to a format that can be analyzed

Processing Hand-Recorded Data • Develop a customized data entry screen that includes entry fields for each measure, specific to its level (e.g., categorical) and type (e.g., open-ended) • Entry clerk enters data from written forms into the entry screen • Entry screen can be in a raw form (e.g., ASCII) or directly in a statistical package (e.g., SAS, SPSS) • Ideally, a second clerk re-enters the same data and any inconsistencies are resolved by referring to the written form

Processing Direct Computer-Entered Data • Directly entered data usually needs to be converted to the statistical package format of choice (e.g., SAS, SPSS) • Direct entry formats may be on-line surveys such as Survey Monkey or CASI (computer-assisted self-administered interview) and so on

Data Cleaning • Occurs before and/or after any conversion to SAS or SPSS • Inconsistent responses • Out-of-range responses • Missing responses

Inconsistent Responses • Compare responses for similar measures to check for illogical responses • Have you ever used marijuana? YES • How many times have you ever used marijuana? 0 • If hand-entered, the hard-copy should be checked for clues about the discrepancy • With sufficient evidence and reasonable assumptions, the data can be corrected
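A consistency check of this kind could be sketched as follows. This is a minimal illustration only: the field names `ever_used` and `times_used` and the sample records are hypothetical, and the code flags cases for review rather than correcting them.

```python
# Hypothetical records: one dict per respondent.
records = [
    {"id": 1, "ever_used": "YES", "times_used": 12},
    {"id": 2, "ever_used": "YES", "times_used": 0},   # illogical pair
    {"id": 3, "ever_used": "NO", "times_used": 0},
    {"id": 4, "ever_used": "NO", "times_used": 3},    # illogical pair
]


def is_inconsistent(rec):
    """Return True when the two marijuana measures contradict each other."""
    said_yes = rec["ever_used"] == "YES"
    has_count = rec["times_used"] > 0
    return said_yes != has_count


# Flag cases for manual review against the hard copy; nothing is changed here.
flagged = [rec["id"] for rec in records if is_inconsistent(rec)]
print(flagged)
```

The flagged IDs would then be looked up on the written forms before any correction is made.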

Out-of-Range Responses • Numeric responses are impossibly or implausibly high or low • Usually preventable with computer-administered data collection (an error message is programmed to appear if a response outside the pre-determined range is entered) • A problem with self-administered paper-and-pencil surveys • For example, a study of illicit drug users identifies a respondent who self-reported having had 1200 sex partners in the past year • Check to see if respondent is a sex worker • Change to a reasonable number for the target population (e.g., 1 partner/day; 3 partners/week)
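A range check like the one a computer-administered survey would enforce can also be run after the fact on paper-collected data. The sketch below assumes a plausible range of 0 to 365 partners per year; that range, the respondent IDs, and the values are all hypothetical.

```python
# Hypothetical plausible range for "sex partners in the past year".
LOW, HIGH = 0, 365

# Hypothetical responses keyed by respondent ID.
responses = {101: 2, 102: 1200, 103: 0, 104: -1, 105: 40}


def out_of_range(value, low=LOW, high=HIGH):
    """Return True when a numeric response falls outside the allowed range."""
    return not (low <= value <= high)


# Collect the suspect responses so each one can be investigated.
flagged = {rid: v for rid, v in responses.items() if out_of_range(v)}
print(flagged)
```

The choice of range is a substantive decision for the target population, not something the code can determine.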

Missing Responses • Responses are not available because the respondent skipped or refused to answer • Questions at the end of a self-administered survey, especially if the survey is rather long, tend to have the highest frequencies of missing data • If a case is missing for one variable, that case will be excluded from all multivariable analyses that include that variable • Considerable gain in statistical power if missing data can be resolved

Imputation • When possible and appropriate, effort can be devoted to changing missing responses to justifiable responses • Examination of the missing case’s responses on other variables may indicate the logically correct answer • Any imputations should be documented in the explanation of the data and should be logically justifiable

Another Comment about Missing Data • Variables with relatively high frequencies of missing data (say more than 10% of the total sample) should be examined for potential bias • Other variables should be examined in comparing subjects with and without missing data on a variable • For example, are women more likely than men to have missing data on this measure? • If systematic differences are found, the variable should be excluded from analyses or caveats about potential bias should be included with the results of analyses that include it
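One simple way to compare subjects with and without missing data is to compute the missing-data rate within each group. The sketch below uses hypothetical cases (sex plus one measure) and only describes the rates; a formal comparison, such as a chi-square test, would be a separate step.

```python
# Hypothetical cases: (sex, response), with None marking a missing response.
cases = [
    ("F", 3), ("F", None), ("F", None), ("F", 2), ("F", None),
    ("M", 1), ("M", 4), ("M", None), ("M", 2), ("M", 3),
]


def missing_rate_by_group(rows):
    """Return {group: proportion of rows whose response is missing}."""
    totals, missing = {}, {}
    for group, value in rows:
        totals[group] = totals.get(group, 0) + 1
        if value is None:
            missing[group] = missing.get(group, 0) + 1
    return {g: missing.get(g, 0) / totals[g] for g in totals}


rates = missing_rate_by_group(cases)
print(rates)
```

In this made-up sample, women are missing on the measure more often than men, which is the kind of systematic difference that would call for a caveat.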

Variables • Raw data are summed within categories to create variables
Raw Data → Variable
Variable Frequency
Sex       f
Male      5
Female    5

Variables • The most important characteristic of variables is that they vary • There should be a reasonable distribution of cases across the categories of the variable • If all or nearly all cases are in only one (or two…) categories, then the variable does not vary – there is insufficient variation to support analysis
Variable Frequency
Sex       f
Male      4
Female    1
Variable Frequency
Sex       f
Male      5
Female    5

Getting to Know (and Like) the Variables • Examine frequency distributions • Recode the variables as appropriate and necessary • Combine categories • Reorder categories • Create dummy variables • Create Scales

Recoding Variables • For conceptual or statistical reasons, the categories of variables may need to be modified or recoded • Conceptual – maybe there is not enough conceptual distinction across 7 categories of the variable (e.g., a classic Likert scale) and it is more meaningful to combine them into 3 categories (e.g., disagree, neutral, agree) • Statistical – there is not enough variation across the variable to support the planned analysis (e.g., 80% of the distribution agrees or strongly agrees), or the planned analysis requires a dichotomous rather than a categorical variable with 3 categories

Combining Categories
Variable Frequency (Original)
RACE                    f
(1) Caucasian           5
(2) African-American    4
(3) Asian               1
Variable Frequency (Recode)
RACERC                  f
Caucasian/Asian         6
African-American        4
Example Recode
If RACE=3 Then RACERC=1
If RACE=Else Then RACERC=Same
Note: Computer packages usually assign numbers to categories in consecutive order
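The recode logic above might look like this in Python. It is a sketch rather than SAS or SPSS syntax; the list `race` simply reproduces the slide's frequencies (5, 4, 1), and the function name `recode_race` is made up for illustration.

```python
from collections import Counter

# RACE codes from the slide: 1 = Caucasian, 2 = African-American, 3 = Asian.
race = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3]


def recode_race(code):
    """Fold Asian (3) into category 1; leave every other code unchanged."""
    return 1 if code == 3 else code


# Write the recode to a new variable so the original RACE is preserved.
racerc = [recode_race(code) for code in race]

print(Counter(race))
print(Counter(racerc))
```

Comparing the two frequency counts before and after is a quick check that the recode did what was intended.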

Reordering Categories
Variable Frequency (Original)
RACE                    f
(1) Caucasian           5
(2) African-American    4
(3) Asian               1
Variable Frequency (Recoded)
RACERO                  f
Asian                   1
African-American        4
Caucasian               5
Example Recode
If RACE=3 Then RACERO=1
If RACE=1 Then RACERO=3
If RACE=Else Then RACERO=Same

Assigning Values to Missing Data • Consider a variable measuring the last time the subject visited a physician (1) more than one year ago (2) within the past year (.) missing • One could make the argument that subjects who do not respond to this measure probably have not been to the doctor for some time • Consequently, the missing values (.) can be recoded into category “1” • Recoding missing data must be based on sound logic and any available evidence and should be documented
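As a rough sketch of this recode, missing values can be written into category 1 in a new variable while counting how many cases were changed, so the imputation can be documented. The data and the variable names `last_visit` and `last_visit_rc` are hypothetical.

```python
# Physician-visit codes: 1 = more than one year ago, 2 = within the past
# year, None = missing (shown as "." on the slide).
last_visit = [1, 2, None, 2, None, 1]

# Keep the original untouched and write the recode to a new variable so the
# change stays documented and reversible.
last_visit_rc = [1 if v is None else v for v in last_visit]

# Count the recoded cases for the data documentation.
n_recoded = sum(v is None for v in last_visit)

print(last_visit_rc)
print(n_recoded)
```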

Creating Dummy Variables • Dichotomous variables coded “0” and “1” • Gives somewhat meaningful numeric values to the response options of categorical variables • Allows analysts to interpret categorical variables as continuous, to a point • With a dummy-coded exposure measure, the interpretation is the change in the outcome with a movement from 0 to 1 on the dummy variables

Dummy Variables (cont.) • For example, the difference in systolic blood pressure comparing males (coded 1) and females (coded 0) is 6.2 • Dummy variable is called MALE because that is the category coded 1 • Interpretation: males’ systolic blood pressure is 6.2 units higher than that of females

Dummy Variables (cont.) • Dummy variables can be created for categorical measures with more than 2 categories • For example, a measure of race with 3 categories is recoded into 2 dummy variables • BLACK: Caucasian (0); African-American (1) • ASIAN: Caucasian (0); Asian (1) • Results indicate the difference between African-Americans and Caucasians and between Asians and Caucasians • In this example, Caucasian is the reference category
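The two dummy variables described above could be built as follows. This is an illustrative sketch with five hypothetical subjects; the helper `make_dummies` is not a function from any statistics package.

```python
# Hypothetical race values for five subjects.
race = ["Caucasian", "African-American", "Asian", "Caucasian", "Asian"]


def make_dummies(values, reference):
    """Return {category: 0/1 list} for every category except the reference."""
    categories = sorted(set(values) - {reference})
    return {cat: [int(v == cat) for v in values] for cat in categories}


dummies = make_dummies(race, reference="Caucasian")
black = dummies["African-American"]   # corresponds to the slide's BLACK
asian = dummies["Asian"]              # corresponds to the slide's ASIAN

print(black)
print(asian)
```

Subjects in the reference category (Caucasian) are coded 0 on both dummy variables, which is why a 3-category measure needs only 2 dummies.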

Creating Scales • A scale, as defined here, is a complex measure created by combining 2 or more individual measures in a meaningful way • A common method of combining measures is the Guttman or additive scale, for which the values of individual measures are added together

Creating Scales (cont.) • A vital preliminary step is to examine the frequencies of each response option for the individual measures to be included in the scale: how many subjects are in each category and how are the categories ordered? • Individual measures with a larger number of categories than the others may need to be recoded (categories combined) to be consistent • Categories of some measures may need to be reordered so that all measures range in the same direction (e.g., from low to high)

Example Scale • Say we have 6 variables we want to combine into a scale measuring social support (SOCSUPP) • Each is coded: Strongly Disagree(1) Disagree(2) Agree(3) Strongly Agree(4) Missing (.) • Create the scale: SOCSUPP = var1 + var2 + var3 + var4 + var5 + var6 • Resulting SOCSUPP scale would range from 6 to 24 if there are no missing data in any of the variables
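The additive scale above can be sketched in a few lines. The three subjects are hypothetical, and returning a missing score when any item is missing is one possible convention, chosen here because the 6 to 24 range only holds for complete data.

```python
# Hypothetical responses to var1..var6, coded 1-4; None marks missing (".").
subjects = [
    [1, 1, 1, 1, 1, 1],
    [4, 4, 4, 4, 4, 4],
    [3, 2, None, 4, 3, 2],
]


def socsupp(items):
    """Sum the six items; return None if any item is missing."""
    if any(v is None for v in items):
        return None
    return sum(items)


scores = [socsupp(items) for items in subjects]
print(scores)
```

The first two subjects show the minimum and maximum possible scores; the third has no score because one item is missing.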

Variables’ Levels of Measurement • Vital consideration in choosing the appropriate analytic technique • Levels • Categorical • Dichotomous • Ordinal • Continuous • Interval • Ratio

Describing the Data • Descriptive statistics • Sample size • Frequencies and relative frequencies • Range • Measures of central tendency (average) • Measures of dispersion (how widely or narrowly the data cluster around the measure of central tendency)

Sample Size and Frequency Measures • Sample size (n) • Compare each variable’s n to the total sample n to determine the proportion of missing data • Indicates the general level of statistical power • Frequency and relative frequency • Frequency (f) indicates the number of subjects in each category of a measure • Relative frequency (%) indicates the proportion of subjects in each category relative to the total n for the variable
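These quantities are straightforward to compute. The sketch below reuses the race frequencies from the earlier examples (5, 4, 1) and adds two missing responses, which are a hypothetical addition made so that the variable's n differs from the total sample n.

```python
from collections import Counter

# Race responses; the two None values are hypothetical missing responses.
values = ["Caucasian"] * 5 + ["African-American"] * 4 + ["Asian"] + [None] * 2

total_n = len(values)                        # total sample n
valid = [v for v in values if v is not None]
n = len(valid)                               # n for this variable
prop_missing = (total_n - n) / total_n       # proportion of missing data

freq = Counter(valid)                                     # frequency (f)
rel_freq = {cat: 100 * f / n for cat, f in freq.items()}  # relative freq (%)

for cat, f in freq.most_common():
    print(f"{cat:<18}{f:>3}{rel_freq[cat]:>7.1f}%")
print(f"n = {n} of {total_n} ({prop_missing:.1%} missing)")
```

Relative frequencies here are computed on the variable's own n, so they sum to 100% regardless of how much data is missing.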

Tabular Presentation of Frequency Distribution

Graphic Presentation of a Frequency Distribution Bar Graph for Categorical Variables

Graphic Presentation of a Frequency Distribution (cont.) Histogram for Ordinal Variables

Graphic Presentation of a Frequency Distribution (cont.) Line Graph for Continuous Variables

Note about Scale in Graphic Presentations [Figure: two graphs, each with a percentage (%) axis and values 1 through 5]

The Importance of Scale • The two graphs show the same data • If a numerically small difference is important given the topic at hand, then the more restricted scale would be appropriate • The scale of the graph axes should reflect the conceptual significance of the comparison

Variable Range • Used for continuous measures • Helps put the measures of central tendency and dispersion in the context of the range of responses • Is a mean of 60 in the middle or at one end of the range of values? Range = Highest Response Value – Lowest Response Value

Measures of Central Tendency • Indicate the average, typical, central response in the distribution of responses • Mode – the response(s) with the highest frequency of subjects • Can be used for all levels of measurement • Median – the response value that separates the distribution into halves (2 groups, each with 50% of the distribution) • Can be used for ordinal and continuous levels

Central Tendency (cont.) • Mean – the statistical average in the distribution • Sum of all variable values (x) for all subjects divided by the number of subjects (n) • Can be used only for the continuous level of measurement

Mode
Variable Frequency
Race                f
Caucasian           5   ← Mode
African-American    4
Asian               1

Median
Even number of values: 135 135 145 150 150 180 180 200 200 225 → Median = (150+180)/2 = 165
Odd number of values: 135 135 145 150 150 180 180 200 200 225 325 360 410 → Median = 180

Mean Mean=∑ xi / n (135+135+145+150+150+180+180+200+200+225)/10 or 1700/10=170
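The mean and median examples can be checked with Python's standard `statistics` module. This is just a verification sketch using the same ten values; the variable names are arbitrary.

```python
import statistics

# The ten values from the median and mean examples.
values = [135, 135, 145, 150, 150, 180, 180, 200, 200, 225]
# The thirteen-value series from the second median example.
longer = values + [325, 360, 410]

mean = statistics.mean(values)
median_even = statistics.median(values)   # averages the two middle values
median_odd = statistics.median(longer)    # takes the single middle value
modes = statistics.multimode(values)      # every value tied for top frequency

print(mean, median_even, median_odd, modes)
```

Note that this particular series has several tied modes, which is why the mode is defined as the response(s) with the highest frequency.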

Mean or Median? • Bell-shaped distribution: Mean = Median = Mode • Distribution skewed to the left: the median and mode stay near the peak, while the mean is deflated due to extreme values (outliers) on the low end of the distribution

Measures of Dispersion • Measures of dispersion or variability indicate how “spread out” the subjects are across the range of variable categories • Dispersion coupled with the measure of central tendency gives a very good idea of the frequency distribution of a variable • Variance and standard deviation used with mean • Interquartile range used with median

Variance • Used for continuous variables • Variance is the average of the squared deviations of each data point (xi) around the mean (x̄) • Deviations are squared for mathematical reasons – if not squared, the sum of deviations will always be zero (0) Variance = ∑ (xi – x̄)² / (n – 1)

Standard Deviation • Square root of the variance • Need to “undo” the squaring of differences so the deviation makes sense within the scale of the variable (X) Standard Deviation = √[ ∑ (xi – x̄)² / (n – 1) ]
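To make the arithmetic concrete, the sketch below computes the sample variance and standard deviation step by step for the ten values used in the mean example (mean = 170), and cross-checks the result against the standard library. Variable names are illustrative.

```python
import math
import statistics

values = [135, 135, 145, 150, 150, 180, 180, 200, 200, 225]
n = len(values)
mean = sum(values) / n

# Raw deviations around the mean cancel out, which is why they are squared.
deviations = [x - mean for x in values]
sum_of_deviations = sum(deviations)

# Sample variance: sum of squared deviations divided by n - 1.
sum_sq = sum(d ** 2 for d in deviations)
variance = sum_sq / (n - 1)

# Standard deviation: back on the original scale of the variable.
sd = math.sqrt(variance)

print(sum_of_deviations, sum_sq, round(variance, 2), round(sd, 2))
# Cross-check against the library's sample variance.
print(math.isclose(variance, statistics.variance(values)))
```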

Example Output Descriptive Statistics

Interquartile Range • Measure of dispersion used with median • Ordinal variables • Continuous variables with outliers • Interquartile Range – difference between the response values that separate the lower 25% and the upper 25% of the distribution of the variable • Middle 50% of the distribution • 25% of the distribution on either side of the median
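The interquartile range can be sketched with `statistics.quantiles` from the standard library, here applied to the ten values from the earlier median example. Statistical packages use several different conventions for locating quartiles, so the exact cut points (and therefore the IQR) may differ slightly from one package to another; this is one convention, not the only one.

```python
import statistics

values = [135, 135, 145, 150, 150, 180, 180, 200, 200, 225]

# Three cut points that split the distribution into four equal parts.
q1, q2, q3 = statistics.quantiles(values, n=4)

# IQR: the spread of the middle 50% of the distribution.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

The middle cut point `q2` is the median, so the IQR describes the 25% of the distribution on either side of it.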

Box-Whisker Plot
Highest Possible Value: 130
Highest Actual Value: 126
Upper 75%: 108
Median: 76
Lower 25%: 54
Lowest Actual Value: 42
Lowest Possible Value: 30

Testing Inferences, Effects, and Relationships • Inferential statistics • Make estimates about the theoretical population • Test hypotheses • Associations • Causality • Simple or complex models • Calculate confidence intervals or test statistics to make inferences about the degree of “truth” of an estimate or association

Estimates and Confidence Intervals • What is the prevalence of malignant meningiomas among adults in upstate NY in 2010? • Sample of oncology practices in upstate NY • Calculate the prevalence p = x/n, where x is the number of cases and n is the total population size of upstate NY • How precise or accurate is this prevalence? 95% C.I. = p ± Z × SE(p), where SE(p) is the standard error of the prevalence estimate and Z is the critical value associated with 95% confidence (a 5% chance of a Type I error) • We can be 95% confident that the true population prevalence is within this interval
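A confidence interval for a prevalence could be computed as sketched below. This assumes the common normal-approximation (Wald) interval for a proportion, which may or may not be the exact method the slide intends, and the counts `x = 30` and `n = 2000` are invented purely for illustration.

```python
import math


def prevalence_ci(x, n, z=1.96):
    """Return (prevalence, lower, upper) using the normal approximation.

    z = 1.96 is the critical value for a 95% confidence level.
    """
    p = x / n
    se = math.sqrt(p * (1 - p) / n)   # standard error of a proportion
    return p, p - z * se, p + z * se


# Hypothetical counts, not real surveillance data.
p, lower, upper = prevalence_ci(x=30, n=2000)
print(f"prevalence = {p:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

For very rare conditions or small samples the normal approximation can behave poorly, so an analyst might prefer an exact or score-based interval in those cases.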
