
Class 3: Methods of Developing New Measures and How to Select Measures for Your Study
October 8, 2009
Anita L. Stewart, Institute for Health & Aging, University of California, San Francisco





Presentation Transcript


  1. Class 3 Methods of Developing New Measures and How to Select Measures for Your Study October 8, 2009 Anita L. Stewart Institute for Health & Aging University of California, San Francisco

  2. Overview of Class 3 • Overview: sequence of developing new measures • Rationale for multi-item measures • Scale construction methods • Steps in choosing appropriate measures for your study

  3. Typical Sequence of Developing New Self-Report Measures Develop/define concept Create item pool Pretest/revise Field survey Psychometric analyses Final measures

  4. Sequence: Develop Item Pool • Generate a large set of items that reflect the concept definition • For multidimensional concepts, items for each dimension • Item sources: • Other measures of similar concepts • Qualitative research such as focus groups • Researchers’ ideas about concept

  5. Considerations in Writing Item Pool • Items from various sources will have different formats, response choices, and instructions • Have to determine consistent approach

  6. Reduce Item Pool to Manageable Number • Review items against concept until “best” ones remain for pretesting • Judgment of investigators • Expert panels • Achieve good representation of all dimensions • Have more items than final goal

  7. Revised Interpersonal Processes of Care Concepts and Item Pool • IPC Version I framework in Milbank Quarterly • Draft IPC II conceptual framework • 19 focus groups: African American, Latino, and White adults • Literature review of quality of care in diverse groups

  8. IPC Item Pool • Original IPC • 19 focus groups • Literature review • Draft IPC II conceptual framework • IPC II item pool (1,006 items) • 160 items selected for pre-testing

  9. Sample Item These questions are about your experiences talking with your doctors at ___ over the past 12 months 1. How often did doctors use words that were hard to understand? -- never -- rarely -- sometimes -- usually -- always

  10. Sequence: Pretest/Revise • Pretest, pretest, pretest • Numerous methods • For new measures, pretesting essential • Obtain reactions and comments of individuals targeted for study • Results in revisions of items and response choices

  11. Pretest in Target Population • Pretesting essential for measures being applied to any new population group • Especially priority measures (e.g., outcomes) • Pretest is to identify: • problems with procedures • method of administration, respondent burden • problems with questions • Item stems, response choices, and instructions

  12. Problems with Questions or Response Choices • Are all words/phrases understood as intended? • Are questions interpreted similarly by all respondents? • Are some questions not answered? • Are any questions offensive or irrelevant? • Does each closed-ended question have an answer that applies to each respondent? • Are the response choices adequate?

  13. Types of Pretests • General debriefing pretest (N=10) • In-depth cognitive interviewing pretest (N=5-10 each group)

  14. Sequence: Field Survey/Questionnaire • Administer survey to large enough sample to test psychometric characteristics • Two approaches • Preliminary field test (N= about 100) • Administer in main study – conduct psychometric studies on study data • Some items may not be used in final scales

  15. Sequence: Psychometric Analyses • Evaluate items - variability, % missing • Create multi-item scales according to scale construction criteria • Evaluate scale characteristics • Variability, reliability • Validity
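The first step named above, evaluating item variability and percent missing, is simple enough to sketch in a few lines. The function name and data layout below are illustrative assumptions, not from the slides:

```python
# Illustrative sketch: basic item evaluation (% missing, range, mean).
# The function name and data layout are assumptions for this example.

def item_summary(responses):
    """responses: one item's values across respondents; None = missing."""
    answered = [r for r in responses if r is not None]
    pct_missing = 100 * (len(responses) - len(answered)) / len(responses)
    return pct_missing, min(answered), max(answered), sum(answered) / len(answered)

# One respondent in five skipped the item; answers span the 1-5 range.
print(item_summary([1, 2, None, 4, 5]))  # (20.0, 1, 5, 3.0)
```

Items with high missingness or little variability (nearly everyone choosing the same response) are candidates for revision or removal before scale construction.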

  16. Sequence: “Final Measures” • When to publish: depends on • authors’ standards • sample size of psychometric analyses • Many measures published with very little iterative work • Single sample testing

  17. Overview of Class 3 • Overview: sequence of developing new measures • Rationale for multi-item measures • Scale construction methods • Steps in choosing appropriate measures for your study

  18. Single-item Measures - Usually Ordinal • Advantages • Response choices interpretable • Disadvantages • Impossible to assess complex concept • Very limited variability, often skewed • Reliability usually low

  19. Multi-Item Measures or Scales Multi-item scales are created by combining two or more items into an overall measure or scale score Sometimes called summated ratings scales

  20. Advantages of Multi-item Measures (Over Single Items) • More scale values (improves score distribution) • Reduces # of scores to measure a concept • Improves reliability (reduces random error) • Reduces % with missing data (can estimate score if items are missing) • More likely to reflect concept (content validity)
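The fourth advantage, estimating a score when some items are missing, is often implemented with a "half scale" rule: score the scale as the mean of the answered items, provided the respondent answered at least half of them. That rule is a common convention, an assumption here rather than something stated in the slides:

```python
# Common "half scale" scoring rule (an assumption for this sketch):
# mean of the answered items, if at least half the items were answered.

def scale_score(responses, min_fraction=0.5):
    """responses: one respondent's item values; None = missing."""
    answered = [r for r in responses if r is not None]
    if len(answered) < min_fraction * len(responses):
        return None  # too little information to estimate a score
    return sum(answered) / len(answered)

print(scale_score([4, 5, None, 3]))        # 4.0 -- scored from 3 of 4 items
print(scale_score([4, None, None, None]))  # None -- too few items answered
```

A single-item measure has no analogous fallback: one skipped question means a missing score.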

  21. One Major Exception: Self-rated Health

  22. Review of 27 Studies of Self-rated Health and Mortality • Independently predicted mortality in nearly all studies • Despite controlling for numerous specific health indicators and other predictors of mortality Idler EL et al. J Health Soc Behav, 1997;38:21-37

  23. Overview of Class 3 • Overview: sequence of developing new measures • Rationale for multi-item measures • Scale construction methods • Steps in choosing appropriate measures for your study

  24. Methods for Creating Multi-item Scales • Two basic scale construction approaches, both grounded in classical test theory • Multitrait scaling • Factor analysis

  25. How much of the time .... tired? 1 - All of the time 2 - Most of the time 3 - Some of the time 4 - A little of the time 5 - None of the time How much of the time …. full of energy? 1 - All of the time 2 - Most of the time 3 - Some of the time 4 - A little of the time 5 - None of the time Example of a 2-item Summated Ratings Scale

  26. How much of the time .... tired? 1 - All of the time 2 - Most of the time 3 - Some of the time 4 - A little of the time 5 - None of the time How much of the time …. full of energy? 1=5 All of the time 2=4 Most of the time 3=3 Some of the time 4=2 A little of the time 5=1 None of the time Step 1: Reverse One Item So They Are in the Same Direction Reverse “energy” item so high score = more energy

  27. How much of the time .... tired? 1 - All of the time 2 - Most of the time 3 - Some of the time 4 - A little of the time 5 - None of the time How much of the time …. full of energy? 5 - All of the time 4 - Most of the time 3 - Some of the time 2 - A little of the time 1 - None of the time Step 2: Sum the Items Lowest = 2 (tired all of the time, full of energy none of the time) Highest = 10 (tired none of the time, full of energy all of the time)

  28. How much of the time .... tired? 1 - All of the time 2 - Most of the time 3 - Some of the time 4 - A little of the time 5 - None of the time How much of the time …. full of energy? 5 - All of the time 4 - Most of the time 3 - Some of the time 2 - A little of the time 1 - None of the time Step 2: Can Also Average the Two Items Lowest = 1.0 (tired all of the time, full of energy none of the time) Highest = 5.0 (tired none of the time, full of energy all of the time)

  29. Summed or Averaged: Increases Number of Levels from 5 (per item) to 9
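The reverse-coding and summing/averaging steps on the last few slides can be sketched directly. The respondent values below are hypothetical:

```python
# Step 1: reverse-code the "energy" item so a high score always means
# more energy / less fatigue; Step 2: sum (or average) the two items.

def reverse_code(response, max_choice=5, min_choice=1):
    """Flip a 1-5 Likert response: 1 <-> 5, 2 <-> 4, 3 stays 3."""
    return max_choice + min_choice - response

def score_scale(tired, energy):
    """Return (summed, averaged) scale score for one respondent."""
    items = [tired, reverse_code(energy)]
    return sum(items), sum(items) / len(items)

# Best possible health: tired "none of the time" (5) and full of energy
# "all of the time" (1 before reversal).
print(score_scale(tired=5, energy=1))  # (10, 5.0)
# Worst possible: tired all of the time, energy none of the time.
print(score_scale(tired=1, energy=5))  # (2, 1.0)
```

Averaging keeps the score on the familiar 1-5 metric of the items, which is why it is often preferred over summing.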

  30. Summated Rating Scales: Scaling Analyses • To create a summated rating scale, set of items need to meet several criteria • Need to test whether the items hypothesized to measure a concept can be combined • i.e., that items form a single concept

  31. Five Criteria to Qualify as a Summated Ratings Scale • Item convergence • Item discrimination • No unhypothesized dimensions • Items contribute similar proportion of information to score • Items have equal variances

  32. First Criterion: Item Convergence • Each item correlates substantially with the total score of all items • with the item taken out or “corrected for overlap” • Typical criterion is > .30 • for well-developed scales, often > .40

  33. Example: Analyzing Item Convergence for Adaptive Coping Scale
  Item-scale correlations, Adaptive coping (alpha = .70):
  5  Get emotional support from others  .49
  11 See it in a different light        .62
  18 Accept the reality of it           .25
  20 Find comfort in religion           .58
  13 Get comfort from someone           .45
  21 Learn to live with it              .21
  23 Pray or meditate                   .39
  Moody-Ayers SY et al. J Amer Geriatr Soc, 2005;53:2202-08.

  34. Example: Analyzing Item Convergence for Adaptive Coping Scale
  Item-scale correlations, Adaptive coping (alpha = .70):
  5  Get emotional support from others  .49
  11 See it in a different light        .62
  18 Accept the reality of it           .25  (<.30)
  20 Find comfort in religion           .58
  13 Get comfort from someone           .45
  21 Learn to live with it              .21  (<.30)
  23 Pray or meditate                   .39

  35. Example: Split Into Two Scales
  Item-scale correlations:
  Adaptive coping (alpha = .76)
  5  Get emotional support from others  .45
  11 See it in a different light        .59
  20 Find comfort in religion           .73
  13 Get comfort from someone           .45
  23 Pray or meditate                   .51
  Acceptance (alpha = .67)
  21 Learn to live with it              .50
  18 Accept the reality of it           .50

  36. Can Examine Item Convergence Using Any Statistical Software • Programs to calculate internal consistency reliability • Provide estimated coefficient alpha • Produce item-scale correlations corrected for overlap
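As one concrete illustration of what such programs compute, here is a self-contained sketch of coefficient alpha and item-scale correlations corrected for overlap. The toy data are invented for the example, not output from any study cited here:

```python
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def cronbach_alpha(data):
    """data: list of respondents, each a list of item scores."""
    k = len(data[0])
    item_vars = [statistics.variance([r[i] for r in data]) for i in range(k)]
    total_var = statistics.variance([sum(r) for r in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def corrected_item_total(data):
    """Each item's correlation with the sum of the *other* items,
    i.e., the item-scale correlation corrected for overlap."""
    k = len(data[0])
    return [pearson([r[i] for r in data],
                    [sum(r) - r[i] for r in data]) for i in range(k)]
```

Items whose corrected correlation falls below .30, like items 18 and 21 in the adaptive coping example above, would be flagged for removal or reassignment.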

  37. Second Criterion: Item Discrimination • Each item correlates significantly higher with the construct it is hypothesized to measure than with other constructs • Item discrimination • Statistical significance is determined by standard error of the correlation • Determined by sample size

  38. Example: Two Subscales Being Developed Using Multitrait Scaling • Depression and Anxiety subscales of MOS Psychological Distress measure

  39. Example of Multitrait Scaling Matrix: Hypothesized Scales
                          ANXIETY  DEPRESSION
  ANXIETY
    Nervous person          .80      .65
    Tense, high strung      .83      .70
    Anxious, worried        .78      .78
    Restless, fidgety       .76      .68
  DEPRESSION
    Low spirits             .75      .89
    Downhearted             .74      .88
    Depressed               .76      .90
    Moody                   .77      .82

  40. Example of Multitrait Scaling Matrix: Item Convergence
                          ANXIETY  DEPRESSION
  ANXIETY
    Nervous person          .80*     .65
    Tense, high strung      .83*     .70
    Anxious, worried        .78*     .78
    Restless, fidgety       .76*     .68
  DEPRESSION
    Low spirits             .75      .89*
    Downhearted             .74      .88*
    Depressed               .76      .90*
    Moody                   .77      .82*

  41. Example of Multitrait Scaling Matrix: Item Discrimination
                          ANXIETY  DEPRESSION
  ANXIETY
    Nervous person          .80*     .65
    Tense, high strung      .83*     .70
    Anxious, worried        .78*     .78
    Restless, fidgety       .76*     .68
  DEPRESSION
    Low spirits             .75      .89*
    Downhearted             .74      .88*
    Depressed               .76      .90*
    Moody                   .77      .82*
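A matrix like this can be generated for any hypothesized grouping of items. The sketch below, with hypothetical item names and toy data rather than the MOS items, correlates each item with every scale total, removing the item from its own scale so the own-scale column is corrected for overlap:

```python
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def multitrait_matrix(data, scales):
    """data: {item_name: scores across respondents};
    scales: {scale_name: [item_names]}.
    Returns {item: {scale: r}}; an item never counts toward its own total,
    so the own-scale correlation is corrected for overlap."""
    n = len(next(iter(data.values())))
    matrix = {}
    for items in scales.values():
        for item in items:
            matrix[item] = {}
            for scale_name, scale_items in scales.items():
                others = [i for i in scale_items if i != item]
                totals = [sum(data[i][r] for i in others) for r in range(n)]
                matrix[item][scale_name] = pearson(data[item], totals)
    return matrix
```

Item discrimination holds when each row's own-scale correlation (the starred entries in the slide) exceeds its correlations with the other scales by more than sampling error allows.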

  42. Multitrait Scaling to Develop New “Expectations of Aging” Measure • Pretested initial 94-item version (N=58) • Eliminated items with • Missing data • Poor distributions • Low item-scale correlations • Field tested 56-item version (N=588) • Eliminated more items • Low item-scale correlations • Weak item discriminant validity • Field tested again (N=429) • 38 items, final scales Sarkisian CA et al. Gerontologist 2002;42:534-542

  43. Multitrait Scaling - An Approach to Constructing Summated Rating Scales • Confirms whether hypothesized item groupings can be summed into a scale score • Examines extent to which all five criteria are met • Reports characteristics of resulting scales • A confirmatory method • Requires strong conceptual basis for hypothesized scales • Typically used for scales well along in testing

  44. Multitrait Scaling Methods • Used at RAND in all health measurement development (e.g., MOS measures) • Method described in reading #1 for class 3 • Stewart and Ware, 1992, pp 67-80

  45. Multitrait Scaling Analysis Described by Ron Hays (UCLA/RAND) • Hays RD & Wang E. (1992, April). Multitrait Scaling Program: MULTI. Proceedings of the Seventeenth Annual SAS Users Group International Conference, 1151-1156. • Hays RD et al. Behavior Research Methods, Instruments, and Computers, 1990;22:167-175

  46. SAS Macro Available • Ron Hays also makes available a SAS macro for conducting multitrait scaling • You don’t have to purchase software • http://gim.med.ucla.edu/FacultyPages/Hays/util.htm • Go to MULTI; sample program including macro call: MULTI.sas, and its output: MULTI.out

  47. Using Factor Analysis to Develop Multi-Item Scales • For new measures in early developmental stages • Exploratory factor analysis of items can identify possible dimensions • Useful when starting with item pool with uncertainty about subdimensions

  48. Patient Satisfaction with Pharmacy Services • No measures – started from scratch • Phase 1: pretested 44 items (N=30) • Revised items • Phase 2: field tested 45 items (N=313) • Exploratory factor analysis - 7 factors • Revised items MacKeigan LD et al. Med Care 1989;27:522

  49. Patient Satisfaction with Pharmacy Services • Phase 3: field tested 44 items (N=389) • Exploratory factor analysis - 8 factors (56% of variance) • Items retained with factor loadings >0.40 MacKeigan LD et al. Med Care 1989;27:522
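As a rough sketch of this kind of analysis, the loadings can be obtained from the principal components of the item correlation matrix, a common starting point for exploratory factor analysis (a full analysis would typically add extraction and rotation choices), and then screened against the 0.40 cutoff. The data, function name, and two-factor toy structure below are all illustrative:

```python
import numpy as np

def principal_loadings(X, n_factors):
    """X: respondents x items. Loadings of each item on the first
    n_factors principal components of the item correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :n_factors] * np.sqrt(np.maximum(eigvals[:n_factors], 0))

# Toy data: items 1-2 move together, items 3-4 move together.
X = np.array([[1, 1, 5, 5],
              [2, 2, 1, 1],
              [3, 3, 4, 4],
              [4, 4, 2, 2],
              [5, 5, 3, 3]], dtype=float)
loadings = principal_loadings(X, 2)
retained = np.abs(loadings) > 0.40  # the cutoff used by MacKeigan et al.
```

With real item pools, the pattern of which items load above the cutoff on which factor is what suggests the subscale structure.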

  50. Item Reduction by Analysis
