Data-pipeline using ALSPAC data


Contents
• Introduction to ALSPAC
• Description of the measures
• Preparing my data for the pipeline
• The pipeline (in Stata)
• Summarize / Codebook
• Polychoric correlations
• Polychoric PCA
• Loevinger’s H
• Mokken Scale Procedure
• Options for SPSS users

Day 2: Contents
• Introduction to Psychometrics:
• Item Response Theory in Stata
• Non-parametric procedures:
• Mokken, Description of the measures
• Parametric models
• Single-parameter logistic model (Rasch)
• Two-parameter logistic model (Lord-Birnbaum)
• An R – 2 – Detour (a detour to R)
• Running psychometric analyses from Stata
• From Data to PAPER
• Automated IRT analyses that yield publication-quality graphics
• Connections from Stata

What is ALSPAC?
• “Avon Longitudinal Study of Parents and Children”, also known as Children of the Nineties
• Cohort study of ~14,000 children and their parents, based in south-west England
• Eligibility criteria: mothers had to be resident in Avon and have an expected date of delivery between 1 April 1991 and 31 December 1992
• Population-based prospective cohort study

Where’s Avon to, my luvver? trans: Where is Avon?

The county of Avon
• 1) A nice short name
• 2) Known for its “ladies”
• 3) Replaced in 1996 with
  - Bristol
  - North Somerset
  - Bath and North East Somerset
  - South Gloucestershire
• Collectively known as “CUBA” (Counties which Used to Be Avon)

What data does ALSPAC have?
• Self-completion questionnaires
  - Mothers, Partners, Children, Teachers
• Hands-on assessments
  - 10% sample tested regularly since birth
  - Yearly clinics for all since age 7
• Data from external sources
  - SATs from LEA, Child Health database
• Biological samples
  - DNA / cell lines

Today’s Measures 1 - MFQ
• Moods and Feelings Questionnaire
• Angold and Costello (1987). Mood and feelings questionnaire (MFQ). Durham: Duke University, Developmental Epidemiology Program.
• Short version, 13 items
• Parental response at 13 years [Questionnaire]
• Child response at 14 years [Clinic, computer]

Today’s Measures 2 - EAS
• EAS Temperament Survey (Parental Ratings)
• Buss and Plomin (1984). A temperament theory of personality development. New York: John Wiley.
• 20 questions
• 4 subscales: Emotionality, Activity, Shyness & Sociability
• Parental response at 4 years [Questionnaire]

Rename variables for clarity and consistency

    * Mother-reported MFQ items: copy each original variable to a systematic name
    gen mum01_012 = ta5020
    gen mum02_012 = ta5021
    gen mum03_012 = ta5022
    gen mum04_012 = ta5023
    gen mum05_012 = ta5024
    gen mum06_012 = ta5025
    gen mum07_012 = ta5026
    gen mum08_012 = ta5027
    gen mum09_012 = ta5028
    gen mum10_012 = ta5029
    gen mum11_012 = ta5030
    gen mum12_012 = ta5031
    gen mum13_012 = ta5032

    * Child-reported MFQ items: same item numbering as the mother-reported set
    gen kid01_012 = fg6410
    gen kid02_012 = fg6412
    gen kid03_012 = fg6413
    gen kid04_012 = fg6414
    gen kid05_012 = fg6415
    gen kid06_012 = fg6416
    gen kid07_012 = fg6418
    gen kid08_012 = fg6419
    gen kid09_012 = fg6421
    gen kid10_012 = fg6422
    gen kid11_012 = fg6423
    gen kid12_012 = fg6424
    gen kid13_012 = fg6425

With the original names it is hard to see which mother and child items correspond (ta5020 ~ fg6410, or ta5027 ~ fg6419?). With the new names the pairing is obvious: mum01_012 ~ kid01_012 and mum08_012 ~ kid08_012.

Derive binary variables

    * Reverse-code the original 1/2/3 responses to 2/1/0
    recode *_012 (3=0)(2=1)(1=2)

    foreach x in "mum01" "mum02" "mum03" "mum04" "mum05" "mum06" ///
        "mum07" "mum08" "mum09" "mum10" "mum11" "mum12" "mum13" ///
        "kid01" "kid02" "kid03" "kid04" "kid05" "kid06" "kid07" "kid08" ///
        "kid09" "kid10" "kid11" "kid12" "kid13" {
        * _001: only the top category counts as endorsed
        gen `x'_001 = `x'_012
        recode `x'_001 (0=0)(1=0)(2=1)
        * _011: the top two categories count as endorsed
        gen `x'_011 = `x'_012
        recode `x'_011 (0=0)(1=1)(2=1)
    }

Each item now exists in three codings, e.g. mum01_012, mum01_001 and mum01_011.
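A quick check of the recodes above can catch a collapsed or reversed category early. The sketch below is illustrative only: it uses a small made-up dataset (not ALSPAC data) with one item already on the 0/1/2 coding, repeats the derivation for that item, and then cross-tabulates and asserts the expected mapping.

```stata
* Toy data standing in for one reverse-coded MFQ item (hypothetical values)
clear
input mum01_012
0
1
2
2
.
end

* Same derivation as in the loop, for a single item
gen mum01_001 = mum01_012
recode mum01_001 (0=0)(1=0)(2=1)
gen mum01_011 = mum01_012
recode mum01_011 (0=0)(1=1)(2=1)

* Cross-tabulate source against each derived variable, keeping missing values visible
tabulate mum01_012 mum01_001, missing
tabulate mum01_012 mum01_011, missing

* Stop with an error if any category was collapsed incorrectly
assert mum01_001 == (mum01_012 == 2) if !missing(mum01_012)
assert mum01_011 == (mum01_012 >= 1) if !missing(mum01_012)
assert missing(mum01_001) & missing(mum01_011) if missing(mum01_012)
```

On the real data the same tabulate and assert lines could be placed inside the foreach loop, using `x' in place of mum01.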

Variable labels

    foreach var of varlist *01_* {
        label variable `var' "Felt miserable/unhappy [`var']"
    }
    foreach var of varlist *02_* {
        label variable `var' "Didn't enjoy anything at all [`var']"
    }
    foreach var of varlist *03_* {
        label variable `var' "Felt so tired they just sat around & did nothing [`var']"
    }
    foreach var of varlist *04_* {
        label variable `var' "Was restless [`var']"
    }

Etc.

Value Labels

    foreach var of varlist *_012 {
        label define `var'_lab 0 "Not true" 1 "Sometimes true" 2 "True"
        label values `var' `var'_lab
    }
    foreach var of varlist *_011 {
        label define `var'_lab 0 "Not true" 1 "Sometimes true / True"
        label values `var' `var'_lab
    }
    foreach var of varlist *_001 {
        label define `var'_lab 0 "Not true / Sometimes true" 1 "True"
        label values `var' `var'_lab
    }
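Because every variable with the same suffix shares the same response options, one value label per coding would also work. The following is a sketch of that alternative, not the approach used on the slide: the label names mfq_012, mfq_011 and mfq_001 are made up for illustration, and the toy data are not ALSPAC data.

```stata
* Toy data: one mother item in all three codings plus one child item (hypothetical values)
clear
input mum01_012 mum01_011 mum01_001 kid01_012
0 0 0 2
1 1 0 0
2 1 1 1
end

* Define each value label once (label names are illustrative)
label define mfq_012 0 "Not true" 1 "Sometimes true" 2 "True"
label define mfq_011 0 "Not true" 1 "Sometimes true / True"
label define mfq_001 0 "Not true / Sometimes true" 1 "True"

* Attach each label to every variable with the matching suffix
label values *_012 mfq_012
label values *_011 mfq_011
label values *_001 mfq_001

* Inspect the result
describe
tabulate mum01_012
```

The trade-off is that a later change to one shared label affects all variables using it, whereas the per-variable labels on the slide can be edited independently.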

Contents
• Introduction to ALSPAC
• Description of the measures
• Preparing my data for the pipeline
• The pipeline (in Stata)
• Summarize / Codebook
• Polychoric correlations
• Polychoric PCA
• Loevinger’s H
• Mokken Scale Procedure
• Options for SPSS users

Typical data-pipeline syntax

    log using "mfq_dataprep.log", replace
    foreach x in "mum" "kid" {
        su `x'*_012
        codebook `x'*_012
        loevH `x'*_012
        polychoric `x'*_012
        polychoricpca `x'*_012
        msp `x'*_012
    }
    log close

Repeat with *_011 and *_001

summarize / codebook

su emo_*_01234

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
    emo_l_02_0~4 |      9467    1.564276     .806012          0          4
    emo_l_06_0~4 |      9445      1.7081    .8448107          0          4
    emo_l_11_0~4 |      9448    1.274238    .8241389          0          4
    emo_l_15_0~4 |      9431    1.613933    .8029195          0          4
    emo_l_19_0~4 |      9342    1.594198    1.008401          0          4

codebook emo_*_01234

    -----------------------------------------------------------------------------------------------
    emo_l_02_01234                                             Child cries easily [emo_l_02_01234]
    -----------------------------------------------------------------------------------------------
                      type:  numeric (float)
                     label:  emo_l_02_01234_lab
                     range:  [0,4]                        units:  1
             unique values:  5                        missing .:  5196/14663
                tabulation:  Freq.   Numeric  Label
                               761         0  E-Like
                              3620         1  Q-like
                              4202         2  S-like
                               751         3  NM-Like
                               133         4  NAA-Like
                              5196         .

    -----------------------------------------------------------------------------------------------
    emo_l_06_01234                           Child tends to be somewhat emotional [emo_l_06_01234]
    -----------------------------------------------------------------------------------------------
                      type:  numeric (float)
                     label:  emo_l_06_01234_lab
                     range:  [0,4]                        units:  1
             unique values:  5                        missing .:  5218/14663
                tabulation:  Freq.   Numeric  Label
                               632         0  E-Like
                              3018         1  Q-like
                              4507         2  S-like
                              1051         3  NM-Like
                               237         4  NAA-Like
                              5218         .

codebook emo_*_01234 (continued)

    -----------------------------------------------------------------------------------------------
    emo_l_11_01234                                   Child often fusses and cries [emo_l_11_01234]
    -----------------------------------------------------------------------------------------------
                      type:  numeric (float)
                     label:  emo_l_11_01234_lab
                     range:  [0,4]                        units:  1
             unique values:  5                        missing .:  5215/14663
                tabulation:  Freq.   Numeric  Label
                              1538         0  E-Like
                              4420         1  Q-like
                              2942         2  S-like
                               457         3  NM-Like
                                91         4  NAA-Like
                              5215         .

    -----------------------------------------------------------------------------------------------
    emo_l_15_01234                                        Child gets upset easily [emo_l_15_01234]
    -----------------------------------------------------------------------------------------------
                      type:  numeric (float)
                     label:  emo_l_15_01234_lab
                     range:  [0,4]                        units:  1
             unique values:  5                        missing .:  5232/14663
                tabulation:  Freq.   Numeric  Label
                               559         0  E-Like
                              3689         1  Q-like
                              4214         2  S-like
                               772         3  NM-Like
                               197         4  NAA-Like
                              5232         .

codebook emo_*_01234 (continued)

    -----------------------------------------------------------------------------------------------
    emo_l_19_01234                              Child reacts intensely when upset [emo_l_19_01234]
    -----------------------------------------------------------------------------------------------
                      type:  numeric (float)
                     label:  emo_l_19_01234_lab
                     range:  [0,4]                        units:  1
             unique values:  5                        missing .:  5321/14663
                tabulation:  Freq.   Numeric  Label
                              1329         0  E-Like
                              3038         1  Q-like
                              3459         2  S-like
                              1127         3  NM-Like
                               389         4  NAA-Like
                              5321         .

Multihist

    pause on
    foreach x in "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" {
        multihist *`x'_012
        pause
    }
    pause off

• Compare responses to the same questions at different times
• Big differences would suggest an error in previous code
  - reversal of responses
  - change to order of questions asked
  - change to response options (aargh!)

Multihist for first item of MFQ (6 repeat measures)

Polychoric Correlations

Correlation -v- regression coefficient
• Correlation coefficient: the interdependence between pairs of variables, i.e. the extent to which the values of the two variables change together
• It measures the strength and direction of the linear relationship
• For a regression line of a given gradient, a fatter ellipse of points means more scatter about the line and therefore a lower correlation

Polychoric Correlation - Assumptions
• A binary or categorical variable is the observed (or manifest) part of an underlying (or latent) continuous variable
• Here we’ll also assume that the latent variables are normally distributed
• A THRESHOLD relates the manifest variable to the latent variable
• Uebersax link: http://ourworld.compuserve.com/homepages/jsuebersax/tetra.htm

Thresholds Figure from Uebersax webpage

… this is what we assume is going on Figure from Uebersax webpage

What we are really interested in is the correlation (r) between the continuous latent variables. A computer algorithm searches for the correlation r and the thresholds t1 and t2 that best reproduce the cell counts of the 2x2 table.
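The search described above can be written as a maximum-likelihood problem. The following is a sketch under the stated assumptions (standard bivariate normal latent variables Z1 and Z2 with correlation r, and an item scored 1 when its latent variable exceeds its threshold). The notation p_ab and n_ab for the cell probabilities and observed cell counts is introduced here for illustration and does not appear on the slides.

```latex
% Cell probabilities implied by r, t_1, t_2
% (\phi_2: standard bivariate normal density, \Phi: standard normal CDF)
\begin{align*}
p_{11} &= \int_{t_1}^{\infty}\!\int_{t_2}^{\infty} \phi_2(z_1, z_2; r)\, dz_2\, dz_1 \\
p_{10} &= \bigl(1 - \Phi(t_1)\bigr) - p_{11} \\
p_{01} &= \bigl(1 - \Phi(t_2)\bigr) - p_{11} \\
p_{00} &= 1 - p_{11} - p_{10} - p_{01}
\end{align*}

% Log-likelihood of the observed 2x2 table with counts n_{ab}
\[
\log L(r, t_1, t_2) = \sum_{a=0}^{1} \sum_{b=0}^{1} n_{ab} \log p_{ab}
\]
```

The estimates of r, t1 and t2 are the values that maximise log L. With three free parameters and three free cell proportions in a 2x2 table, the fitted probabilities can usually match the observed proportions exactly.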

Poly / tetra
• Tetrachoric
  - Special case where both variables are binary
• Polychoric
  - More general (any ordered categorical variable)
• Bi/Polyserial
  - One continuous and one categorical variable

Poly versus standard correlations

    * Collapse each 5-category EAS item in six different ways
    foreach x in "emo_l_02" "emo_l_06" "emo_l_11" "emo_l_15" "emo_l_19" {
        gen `x'_00001 = `x'_01234
        recode `x'_00001 (0=0)(1=0)(2=0)(3=0)(4=1)
        gen `x'_00011 = `x'_01234
        recode `x'_00011 (0=0)(1=0)(2=0)(3=1)(4=1)
        gen `x'_00111 = `x'_01234
        recode `x'_00111 (0=0)(1=0)(2=1)(3=1)(4=1)
        gen `x'_01111 = `x'_01234
        recode `x'_01111 (0=0)(1=1)(2=1)(3=1)(4=1)
        gen `x'_01122 = `x'_01234
        recode `x'_01122 (0=0)(1=1)(2=1)(3=2)(4=2)
        gen `x'_00123 = `x'_01234
        recode `x'_00123 (0=0)(1=0)(2=1)(3=2)(4=3)
    }

    log using "eas_dataprep_poly_corr.log", replace
    * Pearson and polychoric correlations for each coding of the five items
    foreach x in "emo_*_00001" "emo_*_00011" "emo_*_00111" ///
        "emo_*_01111" "emo_*_01122" "emo_*_00123" "emo_*_01234" {
        corr `x'
        polychoric `x'
    }
    log close

Polychoric Correlation Matrix (01234) Standard Correlation Matrix (01234)

Poly versus standard correlations
• Polychoric correlations are typically larger in magnitude than Pearson correlations computed on the same ordinal items, because coarse categories attenuate the Pearson correlation
• Polychoric correlations are more robust to changes in the number of categories
• For polychoric in Stata, a variable with more than 10 categories is treated as if continuous, so the correlation of two variables that each have more than 10 categories is simply the usual Pearson moment correlation found through correlate
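The attenuation of the Pearson correlation can be illustrated with simulated data. The sketch below is not part of the original analysis. It draws two latent normal variables with a known correlation of 0.6, dichotomises them at arbitrary thresholds, and compares the Pearson correlation of the binary variables with the tetrachoric correlation. It uses Stata's built-in tetrachoric command (the binary special case) so that no user-written command is needed. The seed, sample size, thresholds and variable names are all arbitrary choices.

```stata
* Simulate two latent standard normal variables with correlation 0.6
clear
set seed 12345
set obs 5000
matrix C = (1, 0.6 \ 0.6, 1)
drawnorm z1 z2, corr(C)

* Dichotomise at different thresholds, as if the items differed in difficulty
gen byte x1 = z1 > 0.5
gen byte x2 = z2 > -0.5

* Pearson correlation of the latent variables (close to 0.6 by construction)
corr z1 z2

* Pearson correlation of the binary items (attenuated)
corr x1 x2

* Tetrachoric correlation of the binary items (estimates the latent correlation)
tetrachoric x1 x2
```

Changing the two thresholds and rerunning shows how strongly the Pearson correlation of the binary items depends on where the cut is made, while the tetrachoric estimate should stay near the latent value.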

Polychoric PCA

Polychoric PCA
• Performs PCA on the polychoric correlation matrix
• Produces eigenvectors, eigenvalues, and the correlation matrix, as with standard PCA

PCA v PolychoricPCA, mum MFQ

PCA v PolychoricPCA, EAS

Assumptions of PCA/FA
• Items can be regarded as parallel (same frequency distribution)
• PCA/FA is not always appropriate when items differ in their frequency distribution, for example when items have differing levels of difficulty
• Alternative methods may be more appropriate... find out tomorrow

Loevinger’s H Coefficient of Homogeneity
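For reference, the usual definition of the coefficient is sketched below. It is the standard textbook form from Mokken scale analysis rather than something taken from these slides, and the symbols (X_i for item scores, F_ij and E_ij for observed and expected Guttman errors) are introduced here for illustration.

```latex
% Pairwise, item-level and scale-level homogeneity coefficients.
% Cov_max is the largest covariance attainable given the two items' marginal distributions.
\begin{align*}
H_{ij} &= \frac{\operatorname{Cov}(X_i, X_j)}{\operatorname{Cov}_{\max}(X_i, X_j)} \\[4pt]
H_i &= \frac{\sum_{j \neq i} \operatorname{Cov}(X_i, X_j)}{\sum_{j \neq i} \operatorname{Cov}_{\max}(X_i, X_j)} \\[4pt]
H &= \frac{\sum_{i < j} \operatorname{Cov}(X_i, X_j)}{\sum_{i < j} \operatorname{Cov}_{\max}(X_i, X_j)}
\end{align*}

% Equivalent form for a pair of dichotomous items:
% F_{ij} = observed number of Guttman errors,
% E_{ij} = number expected if the two items were independent.
\[
H_{ij} = 1 - \frac{F_{ij}}{E_{ij}}
\]
```

H equals 1 for a perfect Guttman scale and 0 when items are unrelated. A commonly quoted rule of thumb treats H of at least 0.3 as a weak scale, 0.4 as medium and 0.5 as strong.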

Item Response Function
[Figure: item response curve. Vertical axis: increasing probability of endorsing item. Horizontal axis: increasing level of latent trait.]
