Comprehensive Guide to Factor Analysis in Statistics and Psychometrics

PSY6010: Statistics, Psychometrics and Research Design
FACTOR ANALYSIS, CLUSTER ANALYSIS and SEGMENTATIONS
Professor Leora Lawton
Spring 2006, Wednesdays 7-10 PM, Room 204

1. Purpose of Factor Analysis
Factor Analysis – a ‘data reduction’ technique
• Technique for dealing with multicollinearity.
• Used to transform Likert scales into factor scores as an alternative to a linear additive scale.
• Groups variables into sets of attitudes that respondents tend to share (explains variables in terms of their underlying dimensions).
• Facilitates interpretation of a large number of variables.
• Factor scores (the grouped attitudes) can then be used as independent variables.

2. Steps to conducting FA
• When creating a questionnaire, you may often want to include a number of attitudinal questions around certain issues.
• When analyzing the data with all these variables, you start by selecting those attitudes that you think describe some overall category, for example ‘Taste in Music’.
• These attitudinal variables ideally should be of the same metric (e.g., 1,2,3,4,5). Some say the variables should have 7 values, but 5 works fine. Don’t use dichotomous variables.
• Begin by computing a correlation matrix of all the variables in question. There should be some significant correlations, both positive and negative.
• There should be a 4:1 ratio of cases to variables (e.g., 100 cases for 25 variables minimum), and a sample size of at least 50.
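The two sample-size rules of thumb above can be written down as a quick check. This is only a minimal Python sketch; the function name and its defaults are illustrative and are not part of SPSS.

```python
def enough_cases(n_cases, n_variables, ratio=4, minimum=50):
    """Return True when both rules of thumb from the slide are met."""
    return n_cases >= minimum and n_cases >= ratio * n_variables


# 25 variables need at least 4 * 25 = 100 cases.
print(enough_cases(100, 25))
print(enough_cases(80, 25))
```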

Correlation matrix of musical tastes
• Research issue: You’ve been asked by a music store owner to assist in increasing sales by making sure the placement of music genres in the store is optimal.
• Using GSS93 subset.sav, run a set of frequencies to check that the variables fit the requirements.
• Then run a correlation matrix of all the music questions.
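For readers who want to see what the correlation matrix contains, here is a rough Python sketch of pairwise Pearson correlations. The ratings are made-up numbers on a 1-5 scale, not GSS data, and the sketch assumes there are no missing values (SPSS would drop incomplete cases listwise).

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of bivariate correlations for every pair of variables."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical 1-5 ratings from six respondents (illustrative only).
ratings = {
    "jazz": [1, 2, 2, 4, 5, 3],
    "blues": [1, 3, 2, 4, 4, 3],
    "hvymetal": [5, 4, 4, 2, 1, 3],
}
matrix = correlation_matrix(ratings)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in ratings])
```

What you are looking for is a mix of sizeable positive and negative entries off the diagonal.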

Correlation Matrix

Evaluating Appropriateness of FA
• Check the correlation matrix, which examines only relationships between pairs of variables (i.e., bivariate, not multivariate, correlation).
• If the correlations look adequate, select these variables for the FA.
• Analyze – Data Reduction – Factor
• Move all 11 music variables to the Variables window.
• Under Descriptives, click on the option for KMO and Bartlett’s test of sphericity.
• Use Bartlett’s Test of Sphericity to examine the entire matrix, where you want to reject the null hypothesis that the matrix is an identity matrix (i.e., the test should be significant). An identity matrix is one in which all the correlations are 0 except for, of course, the correlation between a variable and itself (=1). (Note that our text says not to place much value on this test in most cases.)
• KMO stands for Kaiser-Meyer-Olkin Measure, and it compares the magnitude of the observed correlation coefficients to the partial correlation coefficients (that is, what is unique to a pair of attributes once the other variables are controlled for). Here you want a number closer to 1. Less than .5 indicates that FA may not be appropriate. Ours is .748.
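To make the two diagnostics less of a black box, the sketch below computes them from a correlation matrix using the standard textbook formulas: Bartlett's chi-square is -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom, and KMO is the sum of squared correlations divided by the sum of squared correlations plus squared partial correlations. The 3 x 3 matrix and the sample size of 100 are hypothetical, and SPSS's own output may differ in rounding.

```python
import math


def inverse_and_logdet(matrix):
    """Gauss-Jordan inverse and log-determinant.

    Assumes a positive-definite correlation matrix, so every pivot is
    positive and no row swaps are needed.
    """
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    logdet = 0.0
    for col in range(n):
        pivot = aug[col][col]
        logdet += math.log(pivot)
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], logdet


def bartlett_sphericity(corr, n_cases):
    """Chi-square statistic and degrees of freedom for Bartlett's test."""
    p = len(corr)
    _, logdet = inverse_and_logdet(corr)
    chi_square = -(n_cases - 1 - (2 * p + 5) / 6) * logdet
    return chi_square, p * (p - 1) // 2


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    p = len(corr)
    inv, _ = inverse_and_logdet(corr)
    r2 = a2 = 0.0
    for i in range(p):
        for j in range(p):
            if i != j:
                # Partial correlation of i and j, controlling for the rest.
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                r2 += corr[i][j] ** 2
                a2 += partial ** 2
    return r2 / (r2 + a2)


# Hypothetical correlation matrix for three variables, 100 cases.
corr = [[1.0, 0.5, 0.5],
        [0.5, 1.0, 0.5],
        [0.5, 0.5, 1.0]]
chi_square, df = bartlett_sphericity(corr, n_cases=100)
print(round(chi_square, 2), df, round(kmo(corr), 3))
```

The chi-square value would then be compared against a chi-square table with the given degrees of freedom; the p-value lookup is left out of the sketch.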

SPSS for PCA/FA
• Analyze – Data Reduction – Factor
• Under Extraction, choose the options for Principal Components, Eigenvalues over 1, Display unrotated solution, and Scree plot.
• Note that there is an option for Number of Factors. There are times you may want to impose a number rather than letting SPSS decide for you (it decides based on the eigenvalues in the extraction).
• For Rotation, choose Varimax (variance maximization; it’s the most commonly used), and Display Rotated Solution.
• For Scores, you will want to select Save as Variables/Regression when you find your solution, but not while in the exploration phase.

SPSS for PCA/FA
FACTOR
  /VARIABLES bigband blugrass country blues musicals classicl folk jazz opera rap hvymetal
  /MISSING LISTWISE
  /ANALYSIS bigband blugrass country blues musicals classicl folk jazz opera rap hvymetal
  /PRINT INITIAL KMO EXTRACTION ROTATION
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PC
  /CRITERIA ITERATE(25)
  /ROTATION VARIMAX
  /METHOD=CORRELATION .

Interpreting SPSS results
• Under the chart ‘Total Variance Explained’ you will see that four factors have been identified, based on having eigenvalues > 1.
• The scree plot shows you a pictorial view of the eigenvalues. We have four factors; some might want to try a fifth, because that’s where the slope of the eigenvalues changes, or similarly, try only two. The most important thing is that the solution is interpretable, that it makes sense, that the factors provide insight into your overall concept.
• An eigenvalue is the amount of variance in the correlation matrix that a factor accounts for; it is calculated from the factor loading matrix that is used to describe the factors. The factor with the largest eigenvalue explains the most variance, and the more variance the factors explain, the greater the distance of one factor from another (i.e., the factors are distinguishable).
• The unrotated matrix doesn’t tell you too much, so go directly to the rotated matrix: here’s where the ‘rotated view’ can give you a better picture of the distinctiveness of each factor. Rotation pushes the high loadings higher and the low loadings lower in the matrix used to calculate the factors, which makes the factors more distinguishable to the ‘naked eye.’
• In the rotated matrix, you then select the variables (attributes) with the highest coefficients. This one works out pretty well; sometimes you have to go back to the drawing board to redefine.
• Try it by limiting the result to just two factors. What underlying issue might be explaining this result compared to the four-factor solution?
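The ‘Total Variance Explained’ table can be rebuilt by hand from the eigenvalues, which may help when reading it. In the sketch below the eigenvalues are invented for illustration (they are not the GSS output); the only facts relied on are that the eigenvalues of a correlation matrix sum to the number of variables, and that the default rule retains eigenvalues over 1.

```python
def variance_explained(eigenvalues):
    """Rows of (component, eigenvalue, % variance, cumulative %, retained?)."""
    p = len(eigenvalues)  # total variance of a correlation matrix equals p
    table = []
    cumulative = 0.0
    for number, value in enumerate(eigenvalues, start=1):
        percent = 100.0 * value / p
        cumulative += percent
        table.append((number, value, percent, cumulative, value > 1.0))
    return table


# Hypothetical eigenvalues for 11 variables (they sum to 11).
eigenvalues = [3.2, 1.9, 1.3, 1.1, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
table = variance_explained(eigenvalues)
for number, value, percent, cumulative, keep in table:
    flag = "retain" if keep else ""
    print(f"{number:2d}  {value:4.1f}  {percent:5.1f}%  {cumulative:5.1f}%  {flag}")
retained = sum(1 for row in table if row[4])
```

With 11 variables, an eigenvalue of exactly 1 corresponds to one variable's worth of variance, which is why 1 is used as the cutoff.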

Interpreting SPSS results
You want to find components where the coefficients are at least above .3 and where there is a clear demarcation between the highest coefficients per component. Note that folk music is high for both components 1 and 3. It is therefore sometimes worthwhile to set the number of components to one more, and one less, than the default number based on the eigenvalue cutoff you’ve selected.
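Reading a rotated matrix by eye can be mimicked in a few lines. The sketch below assigns each variable to its highest-loading component and flags variables that load above .3 on more than one component. The loadings are hypothetical numbers chosen so that folk cross-loads on components 1 and 3, as in the class example; they are not the actual SPSS output.

```python
def assign_variables(loadings, threshold=0.3):
    """Map each variable to its strongest component and flag cross-loadings."""
    result = {}
    for variable, row in loadings.items():
        strong = [i + 1 for i, value in enumerate(row) if abs(value) > threshold]
        best = max(range(len(row)), key=lambda i: abs(row[i])) + 1
        result[variable] = {
            "component": best,
            "strong_on": strong,
            "cross_loads": len(strong) > 1,
        }
    return result


# Hypothetical rotated loadings on three components (illustrative only).
loadings = {
    "classicl": [0.78, 0.10, 0.05],
    "opera": [0.74, 0.02, 0.12],
    "folk": [0.52, 0.08, 0.49],
    "country": [0.04, 0.11, 0.80],
    "rap": [-0.06, 0.71, -0.15],
}
result = assign_variables(loadings)
for variable, info in result.items():
    print(variable, info)
```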

Scree Plot: Number of Components

Interpreting SPSS results

Project Recommendations

Homework #8
• Using our own employee dataset (or if you wish, use your SDA data set and select your own variables), take the attitudinal variables below to understand how people define “quality of work.”
• V11 I have the necessary resources (e.g., computers, databases) to do my work comfortably and efficiently.
• V13 The work I'm responsible for is appropriate for my level of capability.
• V16 I'm challenged and interested in my work.
• V17 My immediate manager recognizes and acknowledges my contributions.
• V22 I have responsibility with the required authority.
• V24 I am satisfied with communications between management and employees.
• v41r Your total compensation (salary, bonuses)
• v42r 401(k), retirement and/or pension
• v43r Availability of PTO (vacation) days
• v44r The office itself (lighting, space, decor)
• v45r Performance awards and bonuses

Homework #8
• Run a frequencies test to make sure the variables are appropriate. Are they? Explain.
• Run a correlations table. Is this appropriate for PCA/FA? Explain.
• On this same selection of variables, conduct tests for KMO and Bartlett. Are we still on track for PCA/FA? Explain.
• Now conduct a factor analysis using these variables, setting the defaults as in the class example. Are you happy with this result? Then try setting the number of components differently, adding one or more, or subtracting, from the first result. Are you happy with this result? Explain.
• What can you say about components of Quality of Work?

Using Factor Scores
• Rarely are factor analyses conducted just for themselves. Rather, they are used as attitudinal measures to predict or be associated with other behavior or statuses.
• One could use factor scores as predictors in regression analyses.
• Or, as will be seen in segmentation later this semester, one can use factor scores to cluster with other characteristics to create typologies, or segments, of subgroups in a population.
• Today we’ll go back and use our music taste factors as predictors of other behaviors.

Review of Factor Analysis
First, let’s not twist our brains into pretzels, so begin by doing an automatic recode on all musical variables. Give them a consistent new name, e.g., prefix or suffix each with an ‘r’, so that BIGBAND becomes RBIGBAND.
/VARIABLES bigband blugrass country blues musicals classicl folk jazz opera rap hvymetal

Saving the Factor Score
• Analyze – Data Reduction – Factor
• Descriptives (check KMO and Bartlett’s)
• Extraction (uncheck unrotated matrix, check Scree Plot, and select method = principal components)
• Rotation (select Varimax)
• Scores (select Save as Variables)
• Run. Now look at your Variable View, and then at the Data View.
• Now run Descriptive Statistics – Descriptives (Mean, Std Dev, Min, Max).

Using Factor Scores in a Regression
• Now, let’s predict TV viewing.
• First, run a frequencies of the variable TV hours watched per week.
• Recode it so that 8 hours and above = 8.
• Create a conceptual model: TV viewing = a + musical taste + education + sex + age. Run your regression with these variables.
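The recode and the regression can be sketched outside SPSS as follows. This is a simplified illustration, not the class procedure: it top-codes TV hours at 8 and fits ordinary least squares through the normal equations, using only one factor score and education as predictors (sex and age are left out for brevity). All data values and function names are made up.

```python
def top_code(hours, cap=8):
    """Recode so that values at or above the cap become the cap."""
    return min(hours, cap)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot_row] = m[pivot_row], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical respondents: (factor score, years of education) and raw TV hours.
predictors = [(-1.2, 12), (0.3, 16), (0.8, 14), (-0.5, 10), (1.1, 18), (-0.4, 13)]
tv_raw = [10, 2, 3, 12, 1, 5]
tv = [top_code(h) for h in tv_raw]
coefficients = ols(predictors, tv)
print([round(c, 3) for c in coefficients])
```

SPSS additionally reports standard errors, significance tests, and R-squared, which this sketch does not attempt.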

Homework #9
• Using the same factor analysis you ran last week with the employee data (see slide #14), run this factor analysis and save the factor score variables.
• Now run a regression:
• Overall satisfaction = a + (factor scores) + male + hours worked (hourswk) + whether there was a layoff (v32)
• Explain why this model makes theoretical sense. Now explain the results. If you were an HR manager, what areas would you either try to improve, or make sure they stay as good?

Segmentation Using Factor Analysis and Cluster Analysis
• As you learned last week, segmentation analysis is used to create typologies or categorical groups of constituents, such as customers, patrons, etc.
• Often segmentations employ factor score results as well.
• In a segmentation, one first develops any necessary factor scores and saves them as output variables (you will see them added to your data set).
• Then, because the purpose of the segmentation is to create groups that can be reached through some sort of marketing (social or commercial), or for some other actionable purpose, use demographics that can be employed to target the groups.
• Then, with the factor scores and the sociodemographic variables identified as being logical, use a clustering technique to create the groups.
• We will use cluster analysis, but other techniques include discriminant analysis (also in SPSS), CHAID and CART (separate software packages), and, most adventurous of all, latent class models (also separate software, such as AMOS).

Cluster Analysis - 1
• We’ll use GSS93 subset.sav.
• You will remember our musical factors (go back to slide #12 for results).
• First create names for your factor scores. I’ve labeled them: bigclass, bluejazz, cwgrass, heavyrap. Clients like meaningful labels, plus it helps you when reading the output.
• Then, consider possible demographic factors that might relate to musical taste, e.g., sex, age, race, region, education, income.
• Because this kind of analysis tends to be exploratory, you don’t need to specify the logic behind the relationships, but you should have some a priori idea about why these factors might be important in distinguishing the possible groups, in this case, musical taste.
• Cluster analysis doesn’t require recoding of IVs the way the other methods do; specify each as a categorical variable, or a covariate, as is appropriate.

Cluster Analysis - 2
• Analyze – Classify – TwoStep Cluster – select factors (categorical variables, e.g., sex) and covariates (ratio, interval or continuous variables).
• In our first round, do not specify the number of clusters.
• Because segmentations are part art, part science, you need to experiment until you find one that ‘works’ for you, so let’s try it with a different number of clusters.

Syntax for Cluster Analysis
TWOSTEP CLUSTER
  /CATEGORICAL VARIABLES = sex politics
  /CONTINUOUS VARIABLES = bigclass bluejazz cwgrass heavyrap age educ
  /DISTANCE LIKELIHOOD
  /NUMCLUSTERS FIXED = 4
  /HANDLENOISE 0
  /MEMALLOCATE 64
  /CRITERIA INITHRESHOLD (0) MXBRANCH (8) MXLEVEL (3)
  /PLOT BARFREQ PIEFREQ
  /PRINT COUNT SUMMARY
  /SAVE VARIABLE=TSC_4337 .
AIM TSC_4337
  /CATEGORICAL sex politics
  /CONTINUOUS bigclass bluejazz cwgrass heavyrap age educ
  /PLOT ERRORBAR CATEGORY CLUSTER (TYPE=PIE) .
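The TwoStep algorithm itself is not reproduced here. As a much simpler stand-in, the sketch below runs plain k-means on hypothetical standardized factor scores, just to show the idea of grouping respondents by their scores and comparing solutions with different numbers of clusters. It handles continuous variables only (no categorical variables such as sex), and it starts from the first k points so that the result is repeatable; all numbers are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain k-means with a deterministic start (the first k points)."""
    centers = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def within_ss(points, labels, centers):
    """Total within-cluster sum of squares; smaller means tighter clusters."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


# Hypothetical (bluejazz, heavyrap) factor scores for six respondents.
scores = [(-1.1, 0.9), (1.2, -0.8), (-0.9, 1.1),
          (1.0, -1.0), (-1.3, 0.7), (0.8, -1.2)]
for k in (2, 3):
    labels, centers = kmeans(scores, k)
    print(k, labels, round(within_ss(scores, labels, centers), 3))
```

As with the SPSS procedure, the numbers alone do not pick the solution; the segments still have to be believable and actionable.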

Segmentation Homework
• Use the same data set, but this time use the variables for TV viewing and attendance at sports events and art museums for your factors.
• Label the factors, then cluster them with age, sex, and political views.
• Try it with 3, 4, and 5 clusters. Which do you find, if any, to be believable? Why?
