Lecture 8: Issues of Bias, Fairness, and Diversity

  1. Lecture 8: Issues of Bias, Fairness, and Diversity PSY 605

  2. Overview of the issues • When using tests to make decisions in organizational settings (or any settings) about individuals who come from different ‘groups’, we want to ensure: • The test scores statistically predict your criterion of interest in the same way across groups (the test is not biased) • The test is perceived as fair by individuals from all groups (the test is fair) • Group differences on test scores indicate true differences on the construct and nothing else (the test meets conditions of ‘measurement invariance’)

  3. Test Bias

  4. What is test bias? • Test bias is a technical psychometric issue (i.e., testable with statistical analysis) focused on statistical prediction from test scores (i.e., criterion-related validity) (Schultz & Whitney, 2005) • Test bias refers to any construct-irrelevant source of variance that results in systematic differences in test scores or test score – criterion prediction (The Standards, 1999) • When this irrelevant variance affects scores on a given variable, we have ‘measurement bias’ • When this irrelevant variance affects the relationship between test scores and a criterion of interest, we have ‘predictive bias’

  5. Testing for test bias • Testing for bias in measurement… • Very difficult to do, as it would technically require comparison of observed test scores to known true scores (and 99.9% of the time, we don’t know true scores) • Using Item Response Theory (IRT), some methods of differential item functioning (DIF) analysis can indicate measurement bias • More commonly, measurement bias is NOT tested, or it is subjectively tested via an item sensitivity review
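To make the DIF idea concrete, here is a minimal sketch of one common DIF screen, the Mantel-Haenszel procedure, for a single dichotomous item; the function name and the simulated 0/1 data are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def mh_dif_chi2(item, group, total):
    """Continuity-corrected Mantel-Haenszel chi-square for one 0/1 item.
    item:  0/1 responses to the studied item
    group: 0 = reference group, 1 = focal group
    total: total test score, used to form matched strata
    """
    item, group, total = map(np.asarray, (item, group, total))
    A = EA = VarA = 0.0
    for k in np.unique(total):                  # one 2x2 table per score level
        m = total == k
        n_ref, n_foc = (group[m] == 0).sum(), (group[m] == 1).sum()
        n, r1 = n_ref + n_foc, item[m].sum()    # stratum size, total correct
        if n_ref == 0 or n_foc == 0 or n < 2:
            continue                            # stratum carries no information
        A += item[m][group[m] == 0].sum()       # observed reference-group correct
        EA += n_ref * r1 / n                    # expected count if no DIF
        VarA += n_ref * n_foc * r1 * (n - r1) / (n ** 2 * (n - 1))
    return (abs(A - EA) - 0.5) ** 2 / VarA      # compare to chi-square, 1 df

# Hypothetical check: random responses should show no DIF on average.
rng = np.random.default_rng(0)
resp = rng.integers(0, 2, size=(200, 10))       # 200 examinees, 10 items
group = rng.integers(0, 2, 200)
print(mh_dif_chi2(resp[:, 0], group, resp.sum(axis=1)))
```

The statistic compares how often the reference and focal groups answer the item correctly after matching examinees on total score; values above 3.84 (chi-square with 1 df at alpha = .05) flag the item for review.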

  6. Testing for test bias • Testing for bias in prediction… • Recall that bias in prediction requires that your test scores be used to predict some criterion • For example: Integrity Test scores → Organizational Citizenship Behaviors (OCBs) • When we looked at this prediction before, we examined the correlation coefficient between test scores & criterion; the correlation coefficient (r) was our criterion-related validity coefficient • When examining test bias, we need to use regression analysis instead of correlation

  7. Bias in prediction Example: Integrity Test scores → OCBs • Regression provides us with a prediction equation – this allows us to estimate someone’s score on a criterion based on their score on our test: Y = b0 + b1X, where Y = criterion score, X = score on test (the predictor), b0 = the intercept (estimated value of Y when X is 0), b1 = the slope (predicted change in units of Y when X increases by 1 unit)

  8. Bias in prediction Example: Integrity Test scores → OCBs • For our example: OCBs = b0 + b1(integrity), where Y = OCBs score, X = score on integrity test (the predictor), b0 = the intercept (estimated value of OCBs when integrity test score is 0), b1 = the slope (predicted change in units of OCBs score when integrity test score increases by 1 unit)
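As a concrete illustration of estimating this prediction equation, here is a minimal sketch using ordinary least squares in Python; the scores below are made up for illustration and are not the lecture’s data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical scores for 8 employees (illustrative only).
integrity = np.array([2, 4, 5, 6, 7, 8, 9, 10])
ocbs = np.array([14, 16, 17, 19, 20, 21, 23, 24])

X = sm.add_constant(integrity)   # adds the column of 1s that estimates b0
fit = sm.OLS(ocbs, X).fit()
b0, b1 = fit.params              # intercept and slope
print(f"OCBs = {b0:.3f} + {b1:.3f}(integrity)")   # the prediction equation
```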

  9. Bias in prediction Example: Integrity Test scores → OCBs • For our example: OCBs = b0 + b1(integrity) • We have bias in prediction when either the intercept (b0) or the slope (b1) is significantly different across groups. When the intercept differs: intercept bias. When the slope differs: slope bias.

  10. Bias in prediction Example: Integrity Test scores → OCBs • Ignoring gender: OCBs = 11.563 + 1.144 (integrity) • Just the males: OCBs = 13.374 + 1.314 (integrity) • Just the females: OCBs = 37.121 + .210 (integrity) There definitely appear to be differences in both slope and intercept… but are they statistically significant? We need to use multiple linear regression with gender as a moderator variable to test…

  11. Bias in prediction Example: Integrity Test scores → OCBs • To test whether the differences in prediction equations for males & females are statistically significant, we regress OCBs on integrity, gender, and the interaction between integrity & gender • This tests the ‘moderation effect’ of gender, i.e., whether the relationship between integrity and OCBs differs based on gender (a sketch of this analysis follows below)
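Here is a minimal sketch of that moderated regression in Python, using simulated data whose group intercepts and slopes roughly mirror the example equations from slide 10 (the coding of gender, the sample size, and the noise level are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 200
gender = rng.integers(0, 2, n)      # hypothetical coding: 0 = male, 1 = female
integrity = rng.normal(20, 4, n)
ocbs = np.where(gender == 0,
                13.374 + 1.314 * integrity,   # male equation from slide 10
                37.121 + 0.210 * integrity)   # female equation from slide 10
ocbs = ocbs + rng.normal(0, 5, n)             # add random error
df = pd.DataFrame({"integrity": integrity, "gender": gender, "ocbs": ocbs})

# 'integrity * gender' expands to both main effects plus their interaction.
fit = smf.ols("ocbs ~ integrity * gender", data=df).fit()
print(fit.summary())
# A significant 'gender' coefficient suggests intercept bias;
# a significant 'integrity:gender' coefficient suggests slope bias.
```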

  12. Although the analysis indicates the effect is NOT significant, let’s graph it anyway… [Figure: fitted regression lines of OCBs on integrity for males and females]

  13. Test bias – concluding thoughts • It is important to look at statistical test results AND trends in the data – a statistical test may not be significant due to small sample size or a relatively weak bias effect, but the bias could still be impacting your decision-making • Tests for predictive bias assume you are predicting an unbiased criterion • Subgroup differences and test bias are independent issues – it is interesting to look at both; however, test bias has greater legal and decision-outcome implications

  14. Test Fairness

  15. What is test fairness? • Test fairness is a sociopolitical issue (i.e., not testable with statistical analysis) focused on perceptions of the test, the testing process, and test outcomes • Unlike bias, fairness is a subjective judgment • Parallel to the concept of organizational justice (work-related fairness perceptions), test takers can form fairness judgments about 1) the test outcome, 2) the process of testing, and 3) interpersonal communications with, and information provided by, test users/administrators

  16. To promote test fairness… • In outcomes of testing: • Ensure test scores are valid indicators of true scores on construct (yet again, the power of validity!) • Ensure test scores are not measuring group-related variance that is irrelevant to the construct (i.e., measurement bias) • DO NOT focus on equality of group outcomes, as this may compromise validity of the test and decision-making based on test scores

  17. To promote test fairness… • In the testing process: • Ensure all test takers are exposed to the same testing conditions, access to practice materials, feedback, and re-test opportunities • Provide reasonable accommodations for test takers with disabilities when appropriate* • In test-related communications: • Ensure test takers are well-informed regarding the test’s purpose and the testing process • Ensure all communication with test takers is open and transparent when possible • Ensure all test takers are treated with dignity and respect *The Americans with Disabilities Act (ADA) requires that ‘reasonable accommodations’ be available for individuals with disabilities in any employment testing (and on the job as well).

  18. Diversity Issues: Measurement Invariance

  19. Administering a test to highly diverse groups • Key Questions: Does the test ‘work’ just as well for each group? Are test score comparisons across groups meaningful? • Example: Does your employee engagement measure capture ‘employee engagement’ in the same way for your Colorado-based division and your Singapore-based division? • Example: Do differences in observed scores on your measure of motivation for professional growth between ‘Baby Boomers’ and ‘Millennials’ represent true differences in motivation for professional growth?

  20. Reasons why a test might not work so well with different groups

  21. How to ensure x-group equivalence: The Basics • When groups speak different languages, at a bare minimum the test should be translated & back-translated to ensure equivalence in meaning • Lonner’s (1990) 4 Types of Equivalence to consider: • Content equivalence: ensure items are relevant to all groups • Conceptual equivalence: ensure the same meaning is attached to terms on the test for all groups • Functional equivalence: ensure the purpose of the test is the same (and interpreted the same) in all groups • Scalar equivalence: ensure all groups have the same interpretation of response options

  22. How to ensure x-group equivalence: The Advanced Methods • A more rigorous test of x-group equivalence: Measurement Invariance • MI assesses the extent to which a test is measuring the same construct in the same way for different groups and the extent to which we can confidently assume score differences = true differences at the construct level • MI is based on comparing results of factor analysis across groups using ‘multiple groups confirmatory factor analysis’ • Little (1997) defines MI as the mathematical equality of measurement parameters for a factorially defined construct across groups (i.e., the equality of a measure’s factor structure – judged by several different factor structure parameters – across groups)

  23. Refresher: Factor Structures • Motivation to Learn: a 2-factor CFA example [Figure: path diagram with two factors, ‘Confidence in Ability to Learn’ (Items 1–4) and ‘Utility of new knowledge/skills’ (Items 5–8)]

  24. 4 Components of Factor Models • Factor structure (configuration): 2 factors, with items 1–4 loading on ‘Confidence’ and items 5–8 on ‘Utility’ [Figure: path diagram of the 2-factor configuration]

  25. 4 Components of Factor Models • Factor loadings: the strength of the relationship between items and their corresponding factors (in a regression equation relating factor and item, the loading is the slope) [Figure: path diagram with sample loadings ranging from .5 to .9]

  26. 4 Components of Factor Models • Item intercepts: the intercept of the regression equation relating an item to its factor: item = intercept + factor loading × (construct) [Figure: path diagram; e.g., item 4 = 5 + .9 × (confidence)]

  27. 4 Components of Factor Models • Item uniqueness (error variance): the ‘unique variance’ associated with each item, not captured by the corresponding factor for that item [Figure: path diagram with a sample uniqueness of .03]

  28. Measurement Invariance: 4 Types • Vandenberg & Lance (2000) provide definitions and recommendations for assessing these 4 types of MI, which parallel the four factor-model components just reviewed: configural invariance (same factor structure/configuration across groups), metric invariance (equal factor loadings), scalar invariance (equal item intercepts), and strict invariance (equal item uniquenesses) • A sketch of the multiple-groups CFA logic follows below
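As a rough illustration of the multiple-groups CFA logic, here is a sketch in Python using the semopy package (the package choice, the simulated data, and the group names are assumptions; the lecture does not name software). It fits the lecture’s 2-factor model separately in two groups so their loadings can be compared, which approximates a configural/metric check; a formal invariance test would instead constrain parameters equal across groups and compare nested model fit:

```python
import numpy as np
import pandas as pd
from semopy import Model

# Lavaan-style description of the lecture's 2-factor model.
DESC = """
confidence =~ item1 + item2 + item3 + item4
utility =~ item5 + item6 + item7 + item8
"""

def simulate(n, seed):
    """Hypothetical item responses driven by two correlated factors."""
    rng = np.random.default_rng(seed)
    factors = rng.multivariate_normal([0, 0], [[1, 0.4], [0.4, 1]], n)
    loadings = [0.8, 0.7, 0.9, 0.6, 0.7, 0.8, 0.6, 0.9]
    return pd.DataFrame({
        f"item{i + 1}": loadings[i] * factors[:, i // 4] + rng.normal(0, 0.5, n)
        for i in range(8)
    })

for group_name, seed in [("Colorado", 1), ("Singapore", 2)]:
    model = Model(DESC)
    model.fit(simulate(500, seed))
    print(group_name)
    print(model.inspect().head(10))   # compare the loadings across groups
```

In practice the full sequence (configural, then metric, then scalar, then strict) is typically tested with software that supports cross-group equality constraints; this sketch only eyeballs whether the same configuration and similar loadings hold in each group.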

  29. Wrap-up • Once you have established that your measure represents the construct in the same way for all groups (measurement invariance), that your measure predicts criteria in the same way for all groups (the test is not biased), and that your test is perceived as fair by all test takers, you are ready to use your test for decision-making in multi-national, multi-cultural, multi-(any other grouping variable) groups!
