Scale development
Doctoral seminar
Prof. Dr. Nicky Dries
Faculty of Business and Economics
Athens University of Economics and Business
Monday April 7th, 2014
Scales: adopt, adapt, or develop?
• Only develop a new scale when (a) there are really no measures available (new construct); (b) existing measures are seriously flawed; or (c) the construct is context-specific, so that you need to adapt it to the specific context you want to study (e.g. sector, culture, language…) or develop a completely new scale (e.g. guanxi in Chinese culture is likely not captured by a US-developed scale).
• When you adapt or translate a scale, make sure to do EFA/CFA and validation analyses! (Item generation can be skipped.)
• Translation-back translation:
  • 5C project (subjective career success): 1. my colleague translated the items to Dutch; 2. I translated them back to English; 3. we discussed “bad” translations and made the final Dutch version; 4. Jon (a native English speaker) looked at our final back translation; 5. we made final adjustments based on Jon’s comments.
  • Career Adapt-Abilities Scale (CAAS) project: 1. I translated the items to Belgian Dutch; 2. a Dutch colleague translated them to Netherlands Dutch; 3. we commented on each other’s versions via e-mail; 4. we met in person (with 4 people) to negotiate the most neutral translation.
Item pool generation
• Example (deductive): MacKenzie, Podsakoff, & Fetter (1991)
  • Organizational citizenship
  • Five construct domains, taken from the concept paper by Organ (1988)
  • Theoretically derived items generated for each of the five “factors”
  • Content validity assessed by 10 faculty members & PhDs, who were asked to classify the items into the 5 factors + “other”
  • Items assigned to the correct a priori category more than 80% of the time were retained
  • Scale development, survey administration, CFA, correlations…
• Example (inductive): Butler (1991) → only when little theory is available!
  • Conditions of trust
  • Semi-structured interviews with managers
  • 280 critical incidents that led to trust, 174 that led to mistrust
  • Classified consistently (.78) by PhD students into 10 categories (categories were not determined a priori!)
  • 10 conditions defined; 4 items written for each
(Hinkin, 1995)
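
The retention rule from the deductive example can be sketched in a few lines. This is only an illustration under assumptions: the item names, factor labels, and judge votes are made up, and the cutoff is read as "strictly more than 80% agreement", which is one possible reading of the slide.

```python
from collections import Counter


def retained_items(classifications, intended, threshold=0.80):
    """Keep items whose share of judges picking the intended (a priori)
    category is strictly above `threshold`.

    classifications: dict item -> list of category labels, one per judge
    intended:        dict item -> a priori category
    Returns a dict item -> agreement share for the retained items.
    """
    kept = {}
    for item, votes in classifications.items():
        agreement = Counter(votes)[intended[item]] / len(votes)
        if agreement > threshold:
            kept[item] = agreement
    return kept


# Hypothetical data: 10 judges sort each item into a factor or "other".
intended = {"item1": "altruism", "item2": "courtesy", "item3": "civic virtue"}
classifications = {
    "item1": ["altruism"] * 9 + ["other"],           # 9 of 10 judges agree
    "item2": ["courtesy"] * 8 + ["altruism"] * 2,    # exactly 80%, not above it
    "item3": ["civic virtue"] * 6 + ["other"] * 4,   # 6 of 10 judges agree
}
kept = retained_items(classifications, intended)
print(kept)
```

Changing `threshold` or switching to "at least 80%" is a one-line change; the point is only that the rule is applied per item, across judges.
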
Inductive sorting: multidimensional scaling
• (Hinkin, 1995): “Because sorting is a cognitive task that requires intellectual ability rather than work experience, the use of students [as SMEs] in this stage of scale development is appropriate” (p. 971)
• A way to analyze qualitative (Q-sort) data quantitatively (“distance” matrix)
• MDS can easily be done in SPSS (PROXSCAL algorithm recommended)
• Requires sorting of N ≥ 30
• Plot of so-called “stress values” → “elbow”
• Determine number of dimensions
• More than 3 dimensions = difficult
• Visual inspection of “distances”
• Delineate “regions of meaning”
• (Additive tree analysis)
• Content analysis
(Dries, Pepermans, & Carlier, 2008)
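
To make the “distance” matrix concrete, here is a minimal sketch of how sorting data could be turned into dissimilarities before running MDS. It assumes (this is not stated on the slide) that the distance between two items is the share of sorters who did not put them in the same pile; the items and piles are invented.

```python
from itertools import combinations


def dissimilarity_matrix(sorts, items):
    """Build an items x items matrix of dissimilarities from pile sorts.

    sorts: list with one entry per sorter; each entry is a list of piles,
           and each pile is a list of item labels.
    The dissimilarity of two items is the share of sorters who placed
    them in different piles (0.0 = always together, 1.0 = never together).
    """
    index = {item: i for i, item in enumerate(items)}
    n = len(items)
    together = [[0] * n for _ in range(n)]
    for piles in sorts:
        for pile in piles:
            for a, b in combinations(pile, 2):
                together[index[a]][index[b]] += 1
                together[index[b]][index[a]] += 1
    k = len(sorts)
    return [
        [0.0 if i == j else 1 - together[i][j] / k for j in range(n)]
        for i in range(n)
    ]


# Hypothetical data: three sorters, four items.
items = ["A", "B", "C", "D"]
sorts = [
    [["A", "B"], ["C", "D"]],
    [["A", "B", "C"], ["D"]],
    [["A"], ["B"], ["C", "D"]],
]
D = dissimilarity_matrix(sorts, items)
for row in D:
    print([round(x, 2) for x in row])
```

A matrix like `D` is the kind of input an MDS routine expects; the stress plot and the choice of dimensions then happen in the MDS software itself.
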
Inductive sorting: islands analysis
• An island is a maximal constellation of [career success meaning] items connected directly or indirectly by co-occurrence levels which are higher than those outside (De Nooy et al., 2011).
• Islands are visualized with Pajek (Batagelj & Mrvar, 1998) using the Kamada & Kawai (1989) algorithm.
Item wording
• KISS: Keep It Short & Simple
• No technical terms; short, specific (unambiguous), and simple
• No double negatives! (“Do you disagree with the law not being applied?”)
• No double-barreled items (“If you are a smoker, does it bother you when others smoke on a restaurant terrace?”)
• No extreme or suggestive statements
• What do you want to know? (Whether people bought a certain magazine, really read it, bought it for one article, flipped through it…?)
• Ask questions that can differentiate between different opinions/people
• Careful with negatively worded (reverse-scored) items!
  • They have been shown to reduce the validity of questionnaire responses, and may introduce systematic error into a scale.
  • → Artifactual response factors consisting of all reverse items;
  • → Consistently lower item loadings for reverse-scored items than for positively worded items that load on the same factor.
(Hinkin, 1995)
Number of items
• As a rule of thumb, scales should consist of at least 3 items (in case of a multifactorial scale, 3 items or more per factor). As some items are likely to be deleted in the scale evaluation/CFA/construct validation stage, allow a margin… (Hinkin, 1995)
• If you really have no other choice, 2 items can be acceptable if they are correlated > .70 and relatively uncorrelated with other variables (Worthington & Whittaker, 2006)
• Careful with single-item measures!
  • Editors and reviewers can even see this as grounds for rejection (especially when the single-item measure is the dependent variable).
  • IF a construct is unambiguous and IF a global/holistic impression is informative (e.g., mood, quality of life, life satisfaction), it can be justified.
  • But be aware: never use fewer than 3 items without a strong rationale! (Wanous & Reichers, 1996)
Rating scale anchoring
• Likert scale anchoring: Coefficient alpha reliability has been shown to increase up to the use of five points, but then it levels off.
• Self-anchored rating scales (SARS). For example:
  • “Think of the worst anger you have ever felt, or can imagine that you would ever feel. Give this the number 100.”
  • “Now, think of not being angry at all and call this zero.”
  • “Now, think of a feeling of anger halfway between not feeling angry at all and feeling as angry as you possibly could. Call this 50.”
  • “Now you have a scale of anger. On this scale, how do you rate yourself right now?”
  • (see https://www.msu.edu/course/sw/850/stocks/pack/slfanch.pdf)
• + Avoid social desirability responding by (a) pretending socially undesirable responses are normal (“everyone does X once in a while…”); (b) assuming the behavior (“how often do you…”); (c) mentioning authoritative figures (“doctors agree that drinking wine is healthy…”); or detect it using special scales (e.g. Marlowe-Crowne, 33 items; knowledge questions on non-existing items).
Software
• Qualtrics
• SurveyMonkey
Pilot-testing your scale
• Always, always, always test your scale/entire survey before you really administer it (even using colleagues, friends, or family is better than not testing at all!) → at least 10 people
• Ask them (ideally using “think aloud” protocols) to report:
  • How long the survey took to complete
  • The clarity of your instructions
  • Items that are unclear or ambiguous
  • Items that are uncomfortable to answer
  • Major topic omissions (missing items)
  • Clarity and attractiveness of lay-out
  • Technical issues (different screens, browsers, does data load?)
  • General comments
(Saunders, Lewis, & Thornhill, 2003)
Sampling and sample size (i)
• The sample [for the scale development] should reflect the population that the researcher will be studying in the future and to which results will be generalized. In other words, do not run your scale development study on another type of population!
• Some questions to guide your sampling strategy:
  • Can the full population be “known” to you? YES: all people following government-sponsored career counseling in Flanders; all long-term unemployed people on benefits. NO: all ‘high potentials’ in Flanders; everyone who is thinking about quitting their employer. In order to engage in representative sampling, you need to know the size and characteristics of the population you want to generalize to!
• Representative sampling is not always necessary; as a reviewer I have a (subjective) preference for purposive/case-control sampling, for instance (not: “100 Austrian employees…”)
Sampling and sample size (ii)
• It is not necessary to closely represent any clearly identified population as long as those who would score high on your scale and those who would score low are well represented (i.e. variance, heterogeneity)
• Do avoid systematic variance in sampling (e.g. phone survey during working hours; only high-paid occupational groups…) (Worthington & Whittaker, 2006)
• Probability sampling = everyone has a chance of being sampled
• Non-probability sampling = some are excluded from selection
• Sampling strategies:
  • Random sampling = all population members have an equal chance
  • Systematic sampling = every kth person (e.g. from a phone list)
  • Stratified sampling = random within subpopulations (e.g. ages)
  • Panel sampling = longitudinal, with random sample
  • Snowball sampling = (convenience or not) + forwarding
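
The first three sampling strategies listed above can be sketched with Python's standard `random` module. The population, the `age_group` field, and the function names are hypothetical and only serve the illustration.

```python
import random


def simple_random_sample(population, n, rng):
    """Random sampling: every member has an equal chance of selection."""
    return rng.sample(population, n)


def systematic_sample(population, k, rng):
    """Systematic sampling: every kth person, starting at a random offset."""
    start = rng.randrange(k)
    return population[start::k]


def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Stratified sampling: random sampling within each subpopulation."""
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample


# Hypothetical population of 100 people in two age groups.
rng = random.Random(42)  # fixed seed so the draw can be reproduced
population = [
    {"id": i, "age_group": "under 40" if i % 2 else "40 and over"}
    for i in range(100)
]
srs = simple_random_sample(population, 10, rng)
every_tenth = systematic_sample(population, 10, rng)
strat = stratified_sample(population, lambda p: p["age_group"], 5, rng)
print(len(srs), len(every_tenth), len(strat))
```

Panel and snowball sampling depend on follow-up waves and on respondents forwarding the survey, so they are not something a single draw like this can capture.
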
Sampling and sample size (iii)
• Sample size calculators (internet)… For example: population = 3000, margin of error = 3%, confidence level = 95%, estimated response rate = 40% → survey 1970 people to obtain N = 788.
• Margin of error (confidence interval) = random sampling error. E.g. the “true” population value of “agree very much” = 50% ± 3%
• Confidence level = how often the true % of the population who would pick an answer lies within the confidence interval
• According to these calculators, as the population you target goes beyond 10000, the required sample size stabilizes (if sampling is completely random & according to the rules of probability):
  • 5% margin of error: 400 respondents is the ‘ceiling’
  • 3% margin of error: 1000 respondents is the ‘ceiling’
• Another rule of thumb for sample size (specifically for factor analysis) is 5-20 respondents per variable (subject-to-variable ratio).
• → If you want to be 100% sure: N > 1000; 20 respondents per variable
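
The online calculators mentioned above typically rely on the standard formula for estimating a proportion, with a finite population correction. The sketch below assumes that formula and the most conservative proportion of 50%; the function names are made up for this example. With the slide's inputs it gives the same figures (788 completes, 1970 invitations).

```python
import math
from statistics import NormalDist


def required_sample_size(population, margin_of_error, confidence=0.95,
                         proportion=0.5):
    """Completed responses needed to estimate a proportion.

    Uses n0 = z^2 * p * (1 - p) / e^2, then the finite population
    correction n = n0 / (1 + (n0 - 1) / N). proportion=0.5 is the
    most conservative (largest-sample) assumption.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def invitations_needed(completes, response_rate):
    """People to survey so that the expected number of completes is reached."""
    # The tiny tolerance guards against floating-point noise in the division.
    return math.ceil(completes / response_rate - 1e-9)


needed = required_sample_size(population=3000, margin_of_error=0.03)
to_survey = invitations_needed(needed, response_rate=0.40)
print(needed, to_survey)

# For very large populations the requirement levels off (the 'ceiling').
ceiling_5pct = required_sample_size(population=1_000_000, margin_of_error=0.05)
print(ceiling_5pct)
```

The levelling-off follows directly from the correction term: once the population is much larger than n0, the denominator is close to 1 and the population size hardly matters.
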
Response rate
• Be aware that response rate is very important to (top) journals. Sampling is already a strong reduction of the population, and a low response rate increases the risk of sampling bias even further. Editors/reviewers want to see > 30-50% and preferably also non-respondent analyses or, in case of a longitudinal study, attrition analyses.
• Total response rate = N / (surveyed people minus ineligible)
• Active response rate = N / (surveyed people minus ineligible, also minus unreachable)
• Also take into account drop-out during your survey! (incomplete responses)
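
As a small worked example of the two response rates defined above (the counts are invented for illustration):

```python
def total_response_rate(responses, surveyed, ineligible):
    """Responses divided by the surveyed people minus the ineligible ones."""
    return responses / (surveyed - ineligible)


def active_response_rate(responses, surveyed, ineligible, unreachable):
    """As above, but the unreachable people are also removed from the base."""
    return responses / (surveyed - ineligible - unreachable)


# Hypothetical survey: 1000 people contacted, 50 turned out to be
# ineligible, 150 could not be reached, 400 responded.
total = total_response_rate(400, 1000, 50)
active = active_response_rate(400, 1000, 50, 150)
print(f"total: {total:.1%}, active: {active:.1%}")
```

The active rate is never lower than the total rate, since its denominator is smaller; reporting both makes clear how many non-responses were simply never reached.
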
Techniques for increasing response rate
• Best time to ask = Tuesday at 11 am
• Easy questions first, hard ones in the middle, sociodemos at end
• + Explain study (do not just “take”) + three follow-up e-mails + debriefing + feedback reports + ethics disclaimers + make yourself available for questions!
(Saunders, Lewis, & Thornhill, 2003)
How to avoid being “spam”
• Netiquette, kids!
• E-mails to dozens of respondents at the same time (unless from a corporate or other “safe” list) can be marked as spam
• E-mails from a “no reply” e-mail address can be marked as spam (+ I think it is impolite to be unreachable to respondents…)
• E-mails containing a URL to a commercial site (such as a web survey site) can be marked as spam
• So… this is painstaking but can avoid major response issues:
  • Send e-mails in groups of 30 or, even better, individually and using people’s names!
  • Send them from your corporate e-mail address (e.g. not from within Qualtrics; but then you have no automated response data etc.!)
  • Instead of a “clickable” URL, remove the hyperlink and tell people to go to a certain website or copy-paste the URL into their address bar
Factor analysis: EFA
• EFA best practices
  • Always run EFA before CFA (CFA = confirmation, not development);
  • In the first run, use oblique rotation to test for correlations between factors (if not correlated, continue subsequent runs with orthogonal rotation such as varimax);
  • Common-factors analysis (FA) like principal-axis factoring or maximum likelihood fits better with the notion of scale development than principal component analysis (PCA)!
    • Purpose of PCA is to reduce the number of items while retaining as much of the original item variance as possible
    • Purpose of FA is to understand the latent factors or constructs that account for the shared variance among items
  • You can run the EFA “freely” first, then force items into N factors.
  • Fit indices: eigenvalues > 1, scree plot “elbow” (% of variance explained), factor loadings > .40 (+ twice as strong as on any other factor), no cross-loadings.
(Worthington & Whittaker, 2006)
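
The loading criteria on this slide can be applied mechanically once the EFA software has produced a loading matrix. The sketch below assumes the rule is "primary loading above .40 and at least twice as strong as the loading on any other factor"; the loadings and item names are invented, and the function is not part of any statistics package.

```python
def screen_items(loadings, min_loading=0.40, ratio=2.0):
    """Return {item: index of its primary factor} for items that pass.

    loadings: dict item -> list of loadings, one per factor.
    An item passes when its strongest absolute loading exceeds
    `min_loading` and is at least `ratio` times every other absolute
    loading (i.e. no substantial cross-loading).
    """
    kept = {}
    for item, row in loadings.items():
        strengths = [abs(x) for x in row]
        primary = max(range(len(row)), key=lambda j: strengths[j])
        others = [s for j, s in enumerate(strengths) if j != primary]
        strong_enough = strengths[primary] > min_loading
        clean = all(strengths[primary] >= ratio * s for s in others)
        if strong_enough and clean:
            kept[item] = primary
    return kept


# Hypothetical rotated loadings for four items on three factors.
loadings = {
    "q1": [0.72, 0.10, 0.05],  # clear primary loading on factor 0
    "q2": [0.45, 0.30, 0.02],  # cross-loads: 0.45 is not twice 0.30
    "q3": [0.35, 0.12, 0.08],  # primary loading too weak
    "q4": [0.08, 0.66, 0.21],  # clear primary loading on factor 1
}
kept = screen_items(loadings)
print(kept)
```

Items that fail the screen are candidates for deletion, not automatic deletions; whether to drop them remains a judgment call that also depends on content.
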
Factor analysis: CFA
• Best practices
  • CFA is done using SEM: measurement model & structural model
  • Fit indices: chi-square (with df and sig. level), RMSEA (90% CI), comparative fit index (CFI) & SRMR (Kline, 2005)
  • Recent discussions about letting error variances covary: avoid?
  • Model comparison: comparing the fit of the hypothesized model to alternative models is further evidence of construct validity.
  • You can use multi-group SEM to demonstrate concurrent (criterion) validity (see further), e.g. men-women, age groups…
  • Do not hold on to a preconceived factor structure and keep defending it when the data do not support it: try to meaningfully interpret your data and/or return to the item generation stage.
  • When items are added to or deleted from a measure, the “new” scale should be administered again to another independent sample (but this can be in a non-scale development study)!
(Worthington & Whittaker, 2006; Hinkin, 1995)
Split-sample approach
• To complicate things further: the literature recommends that EFA and CFA are done on different samples (for external validity, generalizability to population).
• In addition, the literature suggests that administering your new scale together with validation items can “contaminate” your scale.
• → I suggest a split-sample approach where you split your sample in half:
  • In sub-sample 1, administer only the new scale and do EFA
  • In sub-sample 2, add validation scales and do CFA
  • You can run both surveys simultaneously!
• For EFA at least N = 100 (and/or 10 per item), for CFA 300.
(Worthington & Whittaker, 2006)
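
Splitting the sample in half should be done at random, so that the two sub-samples do not differ systematically. A minimal sketch, assuming you have a list of respondent IDs before the surveys go out (the IDs and the function name are hypothetical):

```python
import random


def split_sample(respondent_ids, seed=None):
    """Randomly split respondents into two halves.

    Returns (efa_ids, cfa_ids): the first half receives only the new
    scale (EFA), the second half the scale plus validation items (CFA).
    """
    rng = random.Random(seed)
    shuffled = list(respondent_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical list of 600 respondent IDs.
efa_ids, cfa_ids = split_sample(range(1, 601), seed=2014)
print(len(efa_ids), len(cfa_ids))
```

Fixing the seed makes the assignment reproducible, which helps when you need to document who received which version of the survey.
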
Reliability
• = the degree to which an assessment tool produces stable and consistent results.
• Internal consistency vs. test-retest reliability
  • Internal consistency = consistency of items within a measure (Cronbach’s alpha; .70 is the minimally acceptable level)
  • Test-retest reliability = stability of the measure over time (only when the attribute being measured is not expected to change over time!)
• Reliability is a necessary precondition to validity!
• But internal consistency does not guarantee construct validity, i.e., demonstration of a nomological network of relationships with other variables
(Hinkin, 1995)
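
Cronbach’s alpha is normally reported by your statistics package, but the formula is short enough to sketch: alpha = k/(k-1) × (1 - sum of item variances / variance of the total score), with k the number of items. The response data below are invented, and the sketch assumes complete data with all items scored in the same direction.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents' item scores.

    item_scores: list of rows, one per respondent, each row a list of
    k item scores (no missing values, reverse-scored items already
    recoded).
    """
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical answers of 5 respondents to a 3-item scale (1-5 Likert).
scores = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 2, 3],
    [4, 4, 4],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 2))
```

With real data you would compare the result against the .70 level mentioned above, keeping in mind that alpha also rises with the number of items.
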
Validity
• = how well a test measures what it is intended to measure.
• While reliability is necessary, it alone is not sufficient. For example, if your scale is off by 5 kilos, it reads your weight every day with an excess of 5 kilos. The scale is reliable because it consistently reports the same weight every day, but it is not valid because it adds 5 kilos to your true weight. (It is not a valid measure of your weight.)
• Convergent validity
• Discriminant validity
(Hinkin, 1995)
Content validity
• = Checking the operationalization against the relevant content domain for the construct (i.e., the adequacy with which a measure assesses the domain of interest).
• E.g. Does our measure correspond better to the construct of subjective career success than existing measures? Does it avoid the weaknesses of the existing measures?
• SCS example: “What we do that other measures don’t do: (a) also assess whether people find an item important or not before judging their success level on that item (which brings us closer to a “truly subjective” measure of success); (b) we end up with a multidimensional score rather than a single (unidimensional) score, thus acknowledging the inherently multidimensional nature of the construct. We are missing the self-referent/other-referent discussion, though. Is that an issue?”
Construct validity: Convergent validity
• = The degree to which the operationalization is similar to (converges on) other operationalizations that it theoretically should be similar to.
• SCS example: “Demonstrated by correlation patterns. The most logical candidate is the Greenhaus, Parasuraman, & Wormley (1990) Career Satisfaction Scale (5 items) + (a) For the “importance” response format: Work Values (e.g. scale by Sean Lyons, 32 items); (b) For the “fulfilment” response format: Job Satisfaction, Life Satisfaction (both around 7 items)”
Construct validity: Discriminant validity
• = The degree to which the operationalization is not similar to (diverges from) other operationalizations that it theoretically should not be similar to.
• SCS example: “Demonstrated by correlation patterns. I would include different measures of objective career success here. The goal is to demonstrate that the different factors in our scale correlate lower with objective career success measures than with the “convergent” measures outlined above. Suggestions: net monthly salary, gross yearly salary, functional level, number of promotions throughout entire career, number of promotions within current organization, status of occupation in country/culture of residence (each 1 item)”
Criterion validity: Predictive validity
• = The degree to which the operationalization is able to predict something it theoretically should be able to predict.
• SCS example: “Demonstrated by regression analysis. I would try to achieve this by using two “generic”, one-item measures, i.e. “All things considered, how successful do you personally feel your career has been?” and “All things considered, how satisfied are you with your career to date?”. My guess would be that our measure should be better at predicting the first item, and Greenhaus’ the second item. That way, we could argue that career satisfaction and subjective career success are two different constructs and that our measure is qualitatively different from his”
Criterion validity: Concurrent validity
• = The degree to which the operationalization is able to distinguish between groups that it should theoretically be able to distinguish between.
• SCS example: “What are groups that SUBJECTIVE career success would distinguish between? (Be careful, we are not talking about objective success!) Demonstrated by difference tests (e.g. t-tests, ANOVAs). Suggestions: (a) For the “importance” response format: gender, age, generation, culture, socio-demographic background, work values, occupational group…; (b) For the “fulfilment” response format: gender, objective career success, personality, peer comparison…”
So… what would be “ideal”?
• Scale development
  • Do not develop a scale when good scales are available.
  • Develop items based on in-depth literature review.
  • If necessary, complement with critical incidents (from qualitative research).
  • Check items with SMEs and/or let them “sort” them into factors.
  • You need 4-6 items per factor, so develop twice as many items.
  • Avoid reverse-coded items and single-item measures if you can.
  • Always pilot-test your scale, if only on 10 people.
• Scale testing/evaluation
  • Purposive sampling is always better than convenience sampling.
  • Representative sampling is not always necessary or possible.
  • Sample size can be calculated beforehand (sure = 1000; 20:1).
  • E-mail respondents on Tuesday at 11 am & offer them a “win-win”.
  • Treat respondents as you like to be treated.
  • Split samples: Sample 1 = scale, EFA; Sample 2 = scale, validation items, CFA
Questions?
Contact me: nicky.dries@econ.kuleuven.be
+32.16.37.37.19