Accuracy, Reliability, and Validity of FreeSurfer Measurements

David H. Salat salat@nmr.mgh.harvard.edu

Why Talk About This?
• This is not meant to imply that everything is perfect in FreeSurfer processing; it is a sample of the types of procedures that we and others have used to provide information about what works and what doesn’t, and to enhance confidence in our results.
• The information here should be used as a guide for how to assess the data in your own projects.

What is Accuracy?
• Accuracy: the degree of closeness of a measured or calculated quantity to its actual (true) value (e.g. a physical property such as length or thickness)
• MRI measures are indirect. We may be able to measure morphometry accurately given the contrast of the MR image; however, measurements based on this contrast may differ from measurements of the actual tissue.

What is Reliability?
• Measures obtained for the same individual on two different days, close together in time to avoid a biological influence on the reliability measure
• Reliability of a labeling procedure in the same scan
• Reliability of the labeling procedure on two different scans
• Reliability of the labeling procedure on two different scans collected on two different scanners
• The reliability of an overall effect can be assessed by replication of the experiment in an independent sample.
• This is a general theory that applies to all types of data (structural, functional, cognitive, etc.).
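
As a rough illustration of the first bullet, test-retest reliability for two scans acquired close together in time could be summarized as the absolute difference between sessions, expressed as a percentage of their mean. This is a minimal sketch, not FreeSurfer code: the thickness values and the function name `percent_abs_difference` are hypothetical.

```python
from statistics import mean

def percent_abs_difference(scan1, scan2):
    """Per-subject absolute difference as a percentage of the two-scan mean."""
    return [abs(a - b) / ((a + b) / 2) * 100 for a, b in zip(scan1, scan2)]

# Hypothetical mean cortical thickness (mm) for four subjects,
# each scanned on two different days close together in time.
session1 = [2.50, 2.40, 2.60, 2.30]
session2 = [2.45, 2.44, 2.58, 2.36]

variability = percent_abs_difference(session1, session2)
mean_variability = mean(variability)
print(f"mean test-retest variability: {mean_variability:.2f}%")
```

Lower values indicate better agreement between sessions. The same summary could be applied to scans from two different scanners to probe cross-scanner reliability.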

What is Validity?
• Validity: the extent to which an indirect measurement is representative of what it is supposed to measure.
• For example, in fMRI we use blood flow as an indirect measure of neural activity. Is this a valid measure of neural activity?

Validity Examples
• Internal validity: What is the strength of the overall experimental design, study sample size, analysis procedures, etc.?
• External validity: Would the effect measured generalize to another sample? (replication)
• Ecological validity: Can the results be applied in the real world outside of the experimental setting? (clinical application)
• Construct validity: Does the totality of evidence support the validity of a single measure? (do the data fit with what is known?)
• Face validity: Does the measure seem to be a good measure?
• Convergent validity: How well does the measure correlate with other types of measures that it should theoretically be correlated with? (do the data correlate with ‘gold standards’)
• Discriminant validity: Is the measure not correlated with measures it should not be correlated with? (ICV/age)
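
Convergent and discriminant validity can both be checked with a simple correlation: high against a measure that should agree, near zero against one that should not. The sketch below assumes made-up values throughout (automated and manual thickness, plus an unrelated variable), and `pearson_r` is a hypothetical helper written out by hand.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for five subjects (illustrative values only).
automated = [2.1, 2.4, 2.6, 2.9, 3.1]   # automated thickness (mm)
manual = [2.0, 2.5, 2.5, 3.0, 3.2]      # manual 'gold standard' thickness (mm)
unrelated = [5, 1, 3, 1, 5]             # a variable expected to be unrelated

r_convergent = pearson_r(automated, manual)       # should be high
r_discriminant = pearson_r(automated, unrelated)  # should be near zero
print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
```

With real data, a sample this small would not support any conclusion; the point is only the pattern of a strong correlation where one is expected and a weak one where it is not.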

One does not necessarily ensure the other
• A measure can be perfectly reliable (e.g. you get the same exact measure every time) but not accurate or valid.
• We can measure morphometry very precisely, but the validity of this measure depends on the quality of the input data.
• If an experiment is not reliable, then it is likely inaccurate and invalid.
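
The distinction between reliable and accurate can be made concrete by separating the spread of repeated measurements from their bias relative to a known true value. This is a minimal sketch: the true thickness and both sets of repeated measurements are hypothetical.

```python
from statistics import mean, stdev

TRUE_THICKNESS_MM = 2.5  # hypothetical ground-truth value

def bias_and_spread(measurements, true_value):
    """Return (mean error relative to the true value, standard deviation)."""
    errors = [m - true_value for m in measurements]
    return mean(errors), stdev(measurements)

# Hypothetical repeated measurements of the same structure.
reliable_but_biased = [3.00, 3.01, 2.99, 3.00]  # tight spread, wrong value
accurate_but_noisy = [2.2, 2.8, 2.4, 2.6]       # centered on truth, wide spread

bias1, spread1 = bias_and_spread(reliable_but_biased, TRUE_THICKNESS_MM)
bias2, spread2 = bias_and_spread(accurate_but_noisy, TRUE_THICKNESS_MM)
print(f"reliable but biased: bias={bias1:.2f} mm, spread={spread1:.3f} mm")
print(f"accurate but noisy:  bias={bias2:.2f} mm, spread={spread2:.3f} mm")
```

The first set repeats almost exactly but is about half a millimeter off; the second averages to the true value but varies widely from one measurement to the next.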

Types of Error
• Random error: Unknown and unpredictable changes in the measurement
• Should be unbiased
• Accuracy, reliability, and validity all limited by error
• Systematic error: Predictable offset or scaling of data
• Typically comes from some aspect of the data acquisition/analysis
• Can be identified and corrected by analyzing standards that closely match the real sample (e.g. do you get the same values at 1.5T as at 3T?)
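
One way to identify a systematic offset or scaling is to measure standards with known values and fit a line to the result. The sketch below simulates this under stated assumptions: the reference values, the scaling of 1.1, and the offset of 0.2 are invented for illustration, and `fit_line` is a hypothetical hand-written least-squares helper.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical known values of a standard (e.g. a phantom), in mm.
reference = [2.0, 2.5, 3.0, 3.5]
# Simulated measurements with a systematic scaling (1.1) and offset (0.2).
measured = [1.1 * r + 0.2 for r in reference]

slope, intercept = fit_line(reference, measured)
# Invert the fitted relationship to correct the measurements.
corrected = [(m - intercept) / slope for m in measured]
print(f"estimated scaling={slope:.2f}, offset={intercept:.2f}")
```

Random error cannot be removed this way; it would show up as scatter around the fitted line rather than as a change in slope or intercept.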

How do poor reliability and validity affect your studies?
• Poor reliability increases variance across individuals and across timepoints.
• Validity is directly tied to interpretation. You may have a valid measure of ‘cortical thickness’, but ‘cortical thickness’ might not be a valid measure of degeneration
• E.g. normal variation, hydration
• Many studies would benefit from the ability to measure minute changes across time.

Accuracy and Validity of Spherical Averaging for Labeling Structural and Functional Anatomy Fischl et al., 1999

Anatomical Labeling Fischl et al., 1999

Functional Labeling Fischl et al., 1999

Enhanced Statistical Power Fischl et al., 1999

Face Validity: Results fall within Expected Range
• Consistent with published findings:
• Crowns of gyri are thicker than the fundi of sulci
• Sensory areas are among the thinnest in the cortex
Fischl et al., 1999

Validate against manual measurements of imaging data from another study Fischl et al., 1999

Automated measures are similar in size and region to manual measures, and predict who will develop AD Fischl et al., 2002

Comparison with Postmortem Measures Rosas et al., 2002

Manual Measurements
• Can only be done in regions where folds are appropriate
• Calcarine also consistent across studies
[Figure panels: Calcarine, Orbitofrontal] Salat et al., 2004; Kuperberg et al., 2003

Compared to Manually Labeled Data
• 1 volume and 2 surface based labeling schemes
• Percent of subjects labeled correctly at each location across the surface.
[Figure panels: Volume Atlas, Surface Atlas, Surface Atlas 2] Fischl et al., 2004; Desikan et al., 2006
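
The "percent of subjects labeled correctly at each location" summary can be sketched as a per-location agreement count between automated and manual labels. This is not the procedure from the cited papers, only an illustration: the label names, the three locations, and the four subjects are all hypothetical.

```python
def percent_correct_per_location(auto_labels, manual_labels):
    """Percent of subjects whose automated label matches the manual label,
    computed separately at each location (e.g. each surface vertex)."""
    n_subjects = len(auto_labels)
    n_locations = len(auto_labels[0])
    return [
        100.0 * sum(auto_labels[s][v] == manual_labels[s][v]
                    for s in range(n_subjects)) / n_subjects
        for v in range(n_locations)
    ]

# Hypothetical labels: one row per subject, one column per location.
manual = [["V1", "V2", "MT"]] * 4
automated = [
    ["V1", "V2", "MT"],
    ["V1", "V2", "MT"],
    ["V1", "V1", "MT"],  # disagreement at the second location
    ["V1", "V2", "V2"],  # disagreement at the third location
]

accuracy_map = percent_correct_per_location(automated, manual)
print(accuracy_map)
```

Here all four subjects agree at the first location and three of four agree at each of the other two. Plotted across the surface, such a map shows where an atlas tends to fail.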

Replication of Result: Split Sample
• Concordant results are likely not due to statistical error
• Current study with 5 samples used in prior literature
Salat et al., 2004
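
A split-sample check can be sketched by estimating the same effect independently in two halves of a sample and asking whether the results agree. The sketch below makes several assumptions: the ages and thickness values are invented, the effect is a simple age slope, and the split alternates subjects for reproducibility (a real analysis would more likely split at random).

```python
def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical ages (years) and mean cortical thickness (mm) for 8 subjects.
ages = [25, 32, 41, 48, 55, 63, 70, 78]
thickness = [2.70, 2.66, 2.60, 2.58, 2.50, 2.47, 2.40, 2.35]

# Deterministic split: alternate subjects between the two halves.
half_a, half_b = slice(0, None, 2), slice(1, None, 2)
slope_a = slope(ages[half_a], thickness[half_a])
slope_b = slope(ages[half_b], thickness[half_b])

# The effect "replicates" here if both halves show the same direction.
replicated = (slope_a < 0) == (slope_b < 0)
print(f"half A: {slope_a:.4f} mm/yr, half B: {slope_b:.4f} mm/yr, replicated={replicated}")
```

Agreement in sign is a deliberately loose criterion; comparing effect sizes or confidence intervals between halves would be a stricter test.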

Cross Sequence Parameters Fischl et al., 2004

Comparison across time, scanner, field strength, number of scans, sequence type, scanner upgrade, and scanner manufacturer Han et al., 2006

Effects of Pulse Sequence, Voxel Geometry, and Parallel Imaging Wonderlick et al., 2008

Replication of Effects in Same Participants Across Scanning Conditions Dickerson et al., 2008

WMPARC: same subjects scanned at different times (test-retest) Salat et al., 2008

Replicable results across sex and hemisphere [Figure panels: Men, Women] Salat et al., 2008

Consistent Findings Across 5 Samples Used To Identify Regions with Predictive Validity
• Regional measures predict who will progress to AD.
Dickerson et al., 2008

Conclusions
• Any tool used for MR analysis should be rigorously tested for accuracy, reliability, and validity
• Most of the measures from FreeSurfer have good accuracy, reliability, and validity across a range of conditions
• These results are dependent on optimal input data and correct implementation
• These data provide confidence, but do not substitute for using similar procedures to check data from each new study
