An Introduction to Group-Based Trajectory Modeling and PROC TRAJ
Richard Charnigo
Professor of Statistics and Biostatistics
Director of Statistics and Psychometrics Core, CDART
RJCharn2@aol.com

Objectives
First ~80 minutes:
1. Be able to describe a group-based trajectory model and, in particular, distinguish it from a conventional regression model.
2. Be able to interpret results obtained from fitting a group-based trajectory model via PROC TRAJ.
Last ~40 minutes:
3. Be able to fit a group-based trajectory model via PROC TRAJ.

Motivating example
The Excel file at www.richardcharnigo.net/traj contains a simulated data set. Five hundred college freshmen (“ID”) were asked to estimate how many times per month they consumed marijuana during their freshman (“Y1”), sophomore (“Y2”), junior (“Y3”), and senior (“Y4”) years of high school. Later they were asked to estimate their marijuana use during freshman year of college (“Y5”). They were also assessed on reward seeking; for ease of interpretation, we standardize this variable (“X”).

Motivating example
Two possible “research questions” are:
• What are prototypical trajectories of marijuana use within the population of college students from which this sample was drawn?
• Is the trajectory that best describes the experience of a particular student associated with that student’s level of reward seeking?
We can develop more complicated and realistic scenarios (e.g., with additional personality variables and/or interventions), but this simple scenario will help us begin to understand group-based trajectory modeling and PROC TRAJ.

Exploratory data analysis
Before pursuing group-based trajectory (or any other statistical) modeling, we are well-advised to perform exploratory data analysis. This can alert us to gross mistakes in the data set, heretofore undetected, which may otherwise threaten the validity of our results. This can also suggest an appropriate probability distribution to use with the group-based trajectory model and help us to anticipate what the results may be.

Exploratory data analysis

Exploratory data analysis
The preceding slides show descriptive statistics for Y1 and Y5. (We can similarly examine descriptive statistics for Y2, Y3, and Y4.) Here are a few observations:
• As anticipated, the possible values of Y1 and Y5 are nonnegative, and they appear to have been recorded (or rounded) to the nearest integer.
• The distributions of Y1 and Y5 are right-skewed, and there are lots of 0’s.
• Both the mean and the variance for Y5 are greater than the corresponding quantities for Y1.

Exploratory data analysis
Our observations suggest the following:
• Because there are lots of 0’s, there is no transformation that will bring Y1 or Y5 to approximate normality.
• However, because Y1 and Y5 are integer-valued, a Poisson (or similar) probability distribution may be applicable.
• Since Y5 has greater mean and variance than Y1, we anticipate some divergence between trajectories over time and at least one trajectory showing increasing marijuana use over time.
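The checks described above (mean, variance, and share of 0’s) can be sketched in a few lines of Python. This is only an illustration: the counts below are made up for the example and are not taken from the simulated data set, and the helper name `describe` is hypothetical.

```python
from statistics import mean, variance

# Hypothetical monthly-use counts for ten students (NOT the real data).
y1 = [0, 0, 0, 1, 0, 3, 0, 0, 7, 0]

def describe(counts):
    """Return the mean, sample variance, and proportion of zeros."""
    return {
        "mean": mean(counts),
        "variance": variance(counts),
        "prop_zero": sum(1 for c in counts if c == 0) / len(counts),
    }

summary = describe(y1)
print(summary)
```

A single Poisson distribution has equal mean and variance, so a variance well above the mean, together with many 0’s, hints that one Poisson distribution for the whole sample would be too simple.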

A first trajectory model
Let t denote time in years. If we set time 0 to be high school graduation, then we have t = -3, -2, -1, 0, and 1 corresponding to Y1 through Y5. Suppose for now (the viability of this supposition can be assessed later) that there are three subpopulations whose mean levels of marijuana use over time (called “trajectories”) are defined by exponentials of linear functions f1(t) = exp(a1 + b1 t), f2(t) = exp(a2 + b2 t), and f3(t) = exp(a3 + b3 t). The exponentials are needed because f1(t), f2(t), and f3(t) must be nonnegative.
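To make the log-linear trajectories concrete, here is a minimal Python sketch that evaluates exp(a + b t) at the five time points. The coefficient values are hypothetical, chosen only to produce a flat, a slowly rising, and a steeply rising curve; they are not estimates from PROC TRAJ.

```python
import math

def trajectory(a, b, t):
    """Mean use at time t for a group with trajectory exp(a + b*t)."""
    return math.exp(a + b * t)

times = [-3, -2, -1, 0, 1]  # correspond to Y1 through Y5

# Hypothetical (a_j, b_j) pairs for the three subpopulations.
groups = {1: (-2.0, 0.0), 2: (0.5, 0.1), 3: (1.5, 0.4)}

curves = {j: [trajectory(a, b, t) for t in times]
          for j, (a, b) in groups.items()}

for j, curve in curves.items():
    print(j, [round(v, 2) for v in curve])
```

Whatever the signs of a and b, every value is positive, which is the reason for the exponential.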

A first trajectory model
Suppose that the distribution of Yk (1 ≤ k ≤ 5) in the first subpopulation is Poisson with mean f1(k-4), in the second is Poisson with mean f2(k-4), and in the third is Poisson with mean f3(k-4). Finally, suppose that the probability of belonging to subpopulation j (2 ≤ j ≤ 3) divided by the probability of belonging to subpopulation 1 is of the form exp(cj + dj X). If dj > 0, then higher levels of reward seeking increase the above ratio; if dj < 0, then they decrease the above ratio.
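Because the three membership probabilities must sum to 1, the ratios exp(cj + dj X) determine each probability. The Python sketch below shows that conversion, treating subpopulation 1 as the reference with c1 = d1 = 0. The coefficient values and the function name `membership_probs` are hypothetical.

```python
import math

def membership_probs(x, coefs):
    """Membership probabilities given standardized reward seeking x.

    coefs maps group j to (c_j, d_j); group 1 is the reference, (0, 0).
    """
    weights = {j: math.exp(c + d * x) for j, (c, d) in coefs.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Hypothetical coefficients; d_2 and d_3 are positive here.
coefs = {1: (0.0, 0.0), 2: (-1.0, 0.5), 3: (-1.5, 1.0)}

low = membership_probs(-1.0, coefs)   # one SD below average reward seeking
high = membership_probs(1.0, coefs)   # one SD above average reward seeking
print(low)
print(high)
```

With positive d3, the ratio of the group 3 probability to the group 1 probability is larger for the student with higher reward seeking, as the slide describes.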

A first trajectory model
A group-based trajectory model is thus distinguished from a conventional regression model in that a latent variable (namely, the subpopulation to which one belongs) is intermediate between what might be thought of as the independent variable (here, reward seeking) and the dependent variable (here, marijuana use). Consequently, and importantly, the difference between two trajectories is typically much greater than the difference between mean levels among persons “high” on the independent variable versus persons “low” on the independent variable.
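The role of the latent variable can be written out as a finite mixture. The expression below is a sketch under one added assumption, standard for this type of model but not stated on the slides: given subpopulation membership, the five responses are independent. The symbol \pi_j(x), for the probability of belonging to subpopulation j when X = x, is introduced here for convenience.

```latex
\[
P(Y_1 = y_1, \ldots, Y_5 = y_5 \mid X = x)
  = \sum_{j=1}^{3} \pi_j(x) \prod_{k=1}^{5}
    \frac{e^{-f_j(k-4)} \, f_j(k-4)^{y_k}}{y_k!},
\qquad
\pi_j(x) = \frac{\exp(c_j + d_j x)}{\sum_{l=1}^{3} \exp(c_l + d_l x)},
\quad c_1 = d_1 = 0 .
\]
```

Reward seeking enters only through \pi_j(x), while marijuana use depends only on which trajectory f_j applies; that is the sense in which the latent variable sits between the two.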

A first trajectory model

A first trajectory model
The preceding figure shows results from fitting the group-based trajectory model via PROC TRAJ. Approximately 65.3% of persons belong to a subpopulation that is essentially abstinent from marijuana, about 19.4% to a subpopulation whose marijuana use increases and then decreases, and about 15.3% to a subpopulation whose marijuana use continually increases. Dashed lines represent estimates of f1(t), f2(t), and f3(t) when they are assumed to be exponentials of linear functions; solid lines represent estimates without such a constraint.

A first trajectory model

A first trajectory model
The preceding tables display additional results. The first table shows variable values for six subjects, along with the estimated probabilities that the subjects belong to the three subpopulations. The second and third tables present estimates of a1, b1, a2, b2, a3, b3, c2, d2, c3, and d3. Companion output, which is displayed by PROC TRAJ on screen only, provides accompanying p-values. The fourth table provides indices of model fit, and the fifth table specifies the numbers used to construct the figure displayed earlier.
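The estimated membership probabilities for individual subjects follow from Bayes’ rule: each subpopulation’s prior probability is multiplied by the Poisson likelihood of the subject’s five responses and the products are rescaled to sum to 1. The Python sketch below illustrates the calculation. The group means and the two response patterns are hypothetical, and for simplicity the priors are the overall proportions quoted earlier (65.3%, 19.4%, 15.3%) rather than probabilities that depend on X.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of observing count y when the mean is mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def posterior_probs(ys, group_means, priors):
    """Posterior membership probabilities for one subject via Bayes' rule."""
    joint = {}
    for j, means in group_means.items():
        lik = priors[j]
        for y, mu in zip(ys, means):
            lik *= poisson_pmf(y, mu)
        joint[j] = lik
    total = sum(joint.values())
    return {j: v / total for j, v in joint.items()}

# Hypothetical trajectory means at t = -3, ..., 1 for each subpopulation.
group_means = {
    1: [0.1, 0.1, 0.1, 0.1, 0.1],    # essentially abstinent
    2: [1.0, 3.0, 5.0, 3.0, 1.0],    # increases, then decreases
    3: [1.0, 2.0, 4.0, 7.0, 12.0],   # continually increases
}
priors = {1: 0.653, 2: 0.194, 3: 0.153}

riser = posterior_probs([0, 2, 4, 8, 11], group_means, priors)
abstainer = posterior_probs([0, 0, 0, 0, 0], group_means, priors)
print(riser)
print(abstainer)
```

A subject is usually assigned to the subpopulation with the largest posterior probability, which is how a “guess” about group membership arises.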

A first trajectory model
Visually, the estimate of f2(t) appears somewhat unsatisfactory. There are corresponding discrepancies between the “AVG2” and “PRED2” columns in the fifth table. Therefore, let us consider a second group-based trajectory model in which the trajectories are defined by exponentials of quadratic functions f1(t) = exp(a1 + b1 t + g1 t^2), f2(t) = exp(a2 + b2 t + g2 t^2), and f3(t) = exp(a3 + b3 t + g3 t^2).

A second trajectory model

A second trajectory model
Some comments are in order:
• The estimate of f2(t) looks much better now.
• The guess about which subpopulation subject 6 belongs to has changed (and appears more reasonable now).
• The BIC1, BIC2, and AIC have increased by approximately 66, 64, and 73 points respectively. (PROC TRAJ reports these indices on a scale where larger, that is, less negative, values indicate better fit.) These are overwhelming changes, suggesting that the second group-based trajectory model provides a much better fit to the data than the first group-based trajectory model.
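The following Python sketch shows how fit indices of this kind respond to a gain in log-likelihood and to extra parameters. It assumes the larger-is-better forms AIC = log L - k and BIC = log L - 0.5 k log(N), with N taken either as the number of subjects (500) or as the number of observations (2500). The log-likelihood values are hypothetical, and the sketch does not say which sample size corresponds to BIC1 or BIC2 in the PROC TRAJ output.

```python
import math

def fit_indices(loglik, n_params, n_subjects, n_obs):
    """Larger-is-better AIC and BIC (an assumed scaling, see the lead-in)."""
    return {
        "AIC": loglik - n_params,
        "BIC_subjects": loglik - 0.5 * n_params * math.log(n_subjects),
        "BIC_observations": loglik - 0.5 * n_params * math.log(n_obs),
    }

# Hypothetical log-likelihoods. The linear model has 10 parameters
# (a1..b3, c2, d2, c3, d3); the quadratic model adds g1, g2, g3.
linear = fit_indices(-3000.0, 10, n_subjects=500, n_obs=2500)
quadratic = fit_indices(-2924.0, 13, n_subjects=500, n_obs=2500)

changes = {name: quadratic[name] - linear[name] for name in linear}
print({name: round(v, 1) for name, v in changes.items()})
```

Each index rewards the log-likelihood gain and charges for the three extra parameters, with BIC charging more than AIC, so the BIC changes are smaller than the AIC change.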

Is that the best we can do?
Besides moving from linear functions to quadratic functions, other modifications are possible. One, for which I provide SAS code at www.richardcharnigo.net/traj, entails replacing the ordinary Poisson probability distribution by the zero-inflated Poisson probability distribution. The idea is that, especially in the first subpopulation, there may be too many 0’s to be compatible with the ordinary Poisson probability distribution. Accounting for this zero inflation may provide a better fit to the data.
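For readers unfamiliar with the zero-inflated Poisson distribution, here is a minimal Python sketch of its probability mass function. With probability pi0 the response is a “structural” 0; otherwise it is drawn from an ordinary Poisson distribution with mean mu. The names `zip_pmf`, `mu`, and `pi0` and the numeric values are illustrative assumptions.

```python
import math

def poisson_pmf(y, mu):
    """Ordinary Poisson probability of count y with mean mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def zip_pmf(y, mu, pi0):
    """Zero-inflated Poisson probability of count y.

    pi0 is the probability of a structural zero; with probability
    1 - pi0 the count comes from a Poisson distribution with mean mu.
    """
    extra_zero = pi0 if y == 0 else 0.0
    return extra_zero + (1.0 - pi0) * poisson_pmf(y, mu)

mu, pi0 = 2.0, 0.3  # hypothetical values
p_zero_poisson = poisson_pmf(0, mu)
p_zero_zip = zip_pmf(0, mu, pi0)
zip_mean = sum(y * zip_pmf(y, mu, pi0) for y in range(60))
print(round(p_zero_poisson, 3), round(p_zero_zip, 3), round(zip_mean, 3))
```

The zero-inflated version puts more probability on 0 than the ordinary Poisson distribution with the same mu, and its mean shrinks to (1 - pi0) times mu.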

Is that the best we can do?
Another possible modification is to change the quadratic functions to cubic or even quartic functions. (With only five time points, we cannot go beyond polynomials of degree four.) In fact, the polynomial degree need not be the same for each subpopulation. For instance, a linear function may suffice for the first and third subpopulations, while (at least) a quadratic function appears necessary for the second subpopulation.

Is that the best we can do?
We face the practical problem, though, of deciding which modifications to make. Rather than consider dozens (or hundreds) of possible competing models, a more feasible approach may be to start with the most complicated model that one is willing to entertain (for example, with quartic polynomials for each subpopulation) and then perform “backward elimination”.

Is that the best we can do?
To do this, remove whichever model feature has the largest p-value, while respecting the hierarchical principle that simpler features cannot be removed before more complicated features. Thus, for example, the linear term cannot be removed from a quadratic polynomial. Once all remaining model features have p-values less than 0.05 (or are ineligible for removal), stop and create a table of model fit indices corresponding to the various steps of the backward elimination.
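One step of this backward elimination can be sketched in Python as follows. The sketch covers polynomial terms only (not zero-inflation or risk-factor terms), assumes that intercepts are always retained, and uses made-up p-values; in practice the model would be refitted in PROC TRAJ after each removal to obtain new p-values. The function name `next_term_to_drop` is hypothetical.

```python
def next_term_to_drop(orders, pvalues, alpha=0.05):
    """Pick the polynomial term to remove next, or None to stop.

    orders[j] is the current polynomial degree for group j.
    pvalues[(j, d)] is the p-value of the degree-d term in group j.
    Only the highest-degree term of each group is eligible (hierarchy
    principle); intercepts (degree 0) are never removed.
    """
    eligible = {(j, d): pvalues[(j, d)] for j, d in orders.items() if d >= 1}
    if not eligible:
        return None
    term, p = max(eligible.items(), key=lambda item: item[1])
    return term if p >= alpha else None

# Hypothetical p-values from a fit with quadratic trajectories in all groups.
orders = {1: 2, 2: 2, 3: 2}
pvalues = {
    (1, 1): 0.60, (1, 2): 0.40,
    (2, 1): 0.01, (2, 2): 0.001,
    (3, 1): 0.001, (3, 2): 0.30,
}

choice = next_term_to_drop(orders, pvalues)
print(choice)
```

Here the linear term in group 1 has the largest p-value overall, but it is ineligible while the quadratic term remains, so the quadratic term in group 1 is the one selected.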

Is that the best we can do?
The step in the backward elimination at which the model fit indices are optimized can be used to select a final model. (Matters become a bit more complicated, though, if the model fit indices are not in agreement about this.) Also, if we are unsure whether three is the best number of groups, then the above process can be repeated with, say, two groups and four groups. Model fit indices can then be used to choose among the final two-group model, the final three-group model, and the final four-group model.

Other capabilities of PROC TRAJ
Worth mentioning here, though not illustrated in this presentation or in the SAS code at www.richardcharnigo.net/traj, are three additional capabilities of PROC TRAJ:
• The dependent variable need not have the (zero-inflated) Poisson probability distribution; the normal and Bernoulli probability distributions can be accommodated as well.
• Multiple independent variables can be accommodated.

Other capabilities of PROC TRAJ
• Multiple, related dependent variables can be accommodated. If there are two (for instance, marijuana use and alcohol use), then PROC TRAJ provides one latent variable defining subpopulations on the first dependent variable and a separate latent variable defining subpopulations on the second. Part of the output from PROC TRAJ then estimates the probabilities of membership in the subpopulations defined by the second latent variable given membership in a subpopulation defined by the first. If there are more than two, then PROC TRAJ provides a single latent variable defining subpopulations on all dependent variables simultaneously.

Trying out PROC TRAJ
With this background, let us open SAS and work our way through at least some of the SAS code at www.richardcharnigo.net/traj. This is also an opportunity to experiment and make some changes to the SAS code. For instance, you can see what PROC TRAJ does when a quadratic function is replaced by a cubic function or when a quadratic function is retained for only one of the three subpopulations.
