Data Quality (a.k.a. “Data Heterogeneity”)

Data Quality (a.k.a. “Data Heterogeneity”) Kent Bailey, Susan Rea Welch, Lacey Hart, Kevin Bruce, Susan Fenton

Objectives • Assess data variability within and across institutions • Assess impact of this variability on secondary use of EMR data • Generate specifications for widgets • “Warning label” for suspect data categories • Data quality audits with logs • Batch data correction / removal

Current Research: Effects of Variation on Diabetes Phenotyping Algorithm • Purpose: Compare data relevant to the Type 2 DM eMERGE phenotyping algorithm between Intermountain and Mayo • Methods: 1. Identify adult subjects with evidence in any semantic category of the algorithm: • ICD-9-CM codes for Diabetes Mellitus • Abnormal glucose or HbA1c • Antihyperglycemic medications • Capillary glucose (glucometer) procedures
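The inclusion step above (adult, with evidence in any one of the four semantic categories) can be sketched as a simple predicate. This is only an illustration: the category names, the `Subject` record, and the adult age cutoff of 18 are assumptions, not taken from the eMERGE algorithm or from either institution's data model.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the four semantic categories listed on the slide.
CATEGORIES = (
    "dm_icd9_code",
    "abnormal_glucose_or_hba1c",
    "antihyperglycemic_medication",
    "capillary_glucose_procedure",
)


@dataclass
class Subject:
    subject_id: str
    age: int
    # Categories in which this subject has at least one piece of evidence.
    evidence: set = field(default_factory=set)


def in_candidate_pool(subject, min_adult_age=18):
    """Return True for an adult with evidence in ANY semantic category.

    min_adult_age=18 is an assumed cutoff; the slides only say "adult".
    """
    is_adult = subject.age >= min_adult_age
    has_evidence = any(cat in subject.evidence for cat in CATEGORIES)
    return is_adult and has_evidence
```

Because the rule is an "any category" match, the resulting pool is deliberately broad; it collects candidates for comparison rather than confirmed Type 2 DM cases.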

Methods • Collect relevant data on these subjects • ICD-9-CM codes • Procedure codes • Demographic data • Smoking status • Body Mass index • Specialty of provider • Geographic info • Frequency of health care encounters • Describe variation between institutions

Analysis • Compare (between institutions) frequencies of data elements • ICD-9 codes: overall and specific codes • Compare lab values: number and values • Compare medications • Control for: • Provider specialty • Geographic variables • Demographic variables
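One way to read "compare frequencies while controlling for a variable" is to compute the frequency of a data element within each stratum at each site, so that like is compared with like. The sketch below assumes a hypothetical record layout (`site`, a stratum field, and a set of `elements`); it is not the analysis actually run in the study.

```python
from collections import defaultdict


def stratified_frequency(records, element, stratum_key):
    """Proportion of subjects carrying `element`, per (site, stratum).

    `records` is a list of dicts with assumed keys "site", `stratum_key`,
    and "elements" (a set of codes, labs, or medications for the subject).
    """
    counts = defaultdict(lambda: [0, 0])  # (site, stratum) -> [with element, total]
    for rec in records:
        key = (rec["site"], rec[stratum_key])
        counts[key][1] += 1
        if element in rec["elements"]:
            counts[key][0] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


# Toy records, invented purely for illustration.
toy = [
    {"site": "A", "specialty": "endocrinology", "elements": {"250.00"}},
    {"site": "A", "specialty": "endocrinology", "elements": set()},
    {"site": "B", "specialty": "endocrinology", "elements": {"250.00"}},
]
freq = stratified_frequency(toy, "250.00", "specialty")
```

Comparing `freq[("A", s)]` with `freq[("B", s)]` for each stratum `s` separates institutional differences from differences in case mix.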

Interpretation • Assess impact of data heterogeneity on phenotyping at different institutions • Recommendations for • High throughput Phenotyping • High throughput screening for clinical trials • Generalization to other phenotypes • Hypothesis generation

Preliminary Mayo Results • Mayo Data: (ICD or abnormal labs or capillary glucose, limited to Olmsted and surrounding counties) • 13,754 subjects • 89% Caucasian • 2.5% African-American • 2.0% Asian • 6.5% Native American, Pacific Islander, other, unknown, refused • Mean current age 64, range 20 to 104 • Sex: 53% male, 47% female

Preliminary Mayo Results (N=13,754) • Smoking (n=11,626) • Current 66%, past 16%, never 13%, unknown 6% • BMI (limited to < 60) (n=6,338) • Mean 32.6 +/- 7.2 • Median 31.6, quartiles (27.5, 36.6)
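The BMI summary above (values limited to below 60, then mean, spread, median, and quartiles) could be produced along these lines. The function name and the use of the sample standard deviation are assumptions, and the quartile interpolation method used for the slide is unknown, so results may differ slightly from other tools.

```python
import statistics


def summarize_bmi(values, upper_limit=60):
    """Summarize BMI after dropping values at or above `upper_limit`.

    Needs at least two retained values, since both the sample standard
    deviation and the quartiles require two or more data points.
    """
    kept = [v for v in values if v < upper_limit]
    q1, _, q3 = statistics.quantiles(kept, n=4)  # three cut points: Q1, Q2, Q3
    return {
        "n": len(kept),
        "mean": statistics.mean(kept),
        "sd": statistics.stdev(kept),
        "median": statistics.median(kept),
        "q1": q1,
        "q3": q3,
    }


# Invented values; 75 is excluded by the < 60 limit.
summary = summarize_bmi([20, 25, 30, 35, 40, 75])
```

Applying the plausibility limit before summarizing is itself a data quality decision, so the number of dropped values is worth logging alongside the summary.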

Preliminary Results: ICD-9 codes • Complications • None 6743 (250.0) • Ketoacidosis 1 (250.1) • Hyperosmolality 2 (250.2) • Renal 398 (250.4) • Ophthalmic 1385 (250.5) • Neuro 586 (250.6) • Peripheral Circ. 25 (250.7) • “Other specified” 312 (250.8) • Unspecified 336 (250.9)

Preliminary Results: ICD-9 codes • 250.X0 Type 2 or unspecified, controlled or not specified as uncontrolled • 250.X1 Type 1, controlled or not specified as uncontrolled • 250.X2 Type 2 or unspecified, uncontrolled • 250.X3 Type 1, uncontrolled
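The fourth and fifth digits of a 250.xx code carry the complication and the type / control status, respectively. A minimal decoding sketch follows; the function name and output layout are assumptions, and the labels follow the slide wording except for 250.3, which does not appear on the slides and is taken from ICD-9-CM.

```python
import re

# Fourth digit: complication category.
COMPLICATION = {
    "0": "None",
    "1": "Ketoacidosis",
    "2": "Hyperosmolality",
    "3": "Other coma",  # from ICD-9-CM; not listed on the slide
    "4": "Renal",
    "5": "Ophthalmic",
    "6": "Neuro",
    "7": "Peripheral circulatory",
    "8": "Other specified",
    "9": "Unspecified",
}

# Fifth digit: (diabetes type, stated as uncontrolled?)
FIFTH_DIGIT = {
    "0": ("Type 2 or unspecified", False),
    "1": ("Type 1", False),
    "2": ("Type 2 or unspecified", True),
    "3": ("Type 1", True),
}


def decode_250(code):
    """Decode a five-digit ICD-9-CM 250.xx code, or return None if it is not one."""
    match = re.fullmatch(r"250\.(\d)(\d)", code.strip())
    if match is None:
        return None
    fourth, fifth = match.groups()
    if fifth not in FIFTH_DIGIT:
        return None
    dm_type, uncontrolled = FIFTH_DIGIT[fifth]
    return {
        "complication": COMPLICATION[fourth],
        "dm_type": dm_type,
        "uncontrolled": uncontrolled,
    }
```

For example, `decode_250("250.02")` reports no complication, Type 2 or unspecified, uncontrolled. Note that `False` here means "not stated as uncontrolled", which is weaker than "controlled".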

Type 2/U vs. Type 1 DM codes • Mayo Data: n=13,707

Intermountain peek (sic) • Disclaimer: don’t assume data are ready to compare between sites at this point

Back to Mayo Summary • Sample Lab data

Future Directions • Carry out inter-institution comparison • Study effects of geography, race, etc. • Implement chart review (on random sample) for “gold standard” definition of Type 2 DM • Use of lab values /meds for definition of continuous phenotype (DM-ness) • Extrapolation / generalization to other diseases /phenotypes

Data Quality (a.k.a. “Data Heterogeneity”) Susan Rea Welch

Conclusions: PhD Research, Cohort Amplification • Knowledge Discovery from Databases (KDD) • Associative Classification Methods • Classification Rules for Diabetes and Asthma • Comparably accurate • Concise • Consistent with domain knowledge • Contributed new knowledge • Attributes for cohort identification • Unanticipated comorbidity associations

Consistency and Novelty: Diabetes • Elevated quantitative lab glucose assays • Frequency 19%, Likelihood 87% • Less predictive than glucose by glucometer or urine microalbumin • Abnormal HbA1c test • Equivalent predictive power of HbA1c test order • Antihyperglycemic medications • Variable predictive strength: Metformin, Insulin, Insulin Release Stimulators, Insulin Response Enhancers

Consistency and Novelty: Asthma • Medications were most predictive • High Likelihood: Salmeterol, Leukotriene receptor antagonist • Albuterol / Glucocorticoid combined • Pulmonary Procedures (CPT hierarchy) • Female gender • Abnormal CBC • Unexpected comorbidity associations • Suggests discovery of shared pathways

Associative Classification – What? • Pattern discovery in transaction database • Independent of domain expertise • Deductive, global associations in data • Induce a general & accurate classifier
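Associative classification builds on class association rules mined from a transaction database, each scored by how often it holds and how reliably it predicts the class. The sketch below computes the standard support and confidence of one rule. Treating the slides' "Frequency" and "Likelihood" as support and confidence is an assumption, and the record layout and toy transactions are invented for illustration.

```python
def rule_support_confidence(transactions, antecedent, label):
    """Support and confidence of the rule `antecedent -> label`.

    support    = fraction of all transactions that contain the antecedent
                 and carry the label
    confidence = fraction of transactions containing the antecedent
                 that carry the label
    """
    antecedent = frozenset(antecedent)
    covered = [t for t in transactions if antecedent <= t["items"]]
    hits = sum(1 for t in covered if t["label"] == label)
    support = hits / len(transactions) if transactions else 0.0
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence


# Toy transactions: one per patient, items are observed attributes.
toy_db = [
    {"items": frozenset({"metformin", "abnormal_hba1c"}), "label": "diabetes"},
    {"items": frozenset({"metformin"}), "label": "diabetes"},
    {"items": frozenset({"albuterol"}), "label": "other"},
    {"items": frozenset({"abnormal_hba1c"}), "label": "other"},
]
support, confidence = rule_support_confidence(toy_db, {"abnormal_hba1c"}, "diabetes")
```

A classifier is then assembled from the rules that pass minimum support and confidence thresholds, which is why the resulting rules stay readable to clinicians.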

Associative Classification – Why? • Attribute selection requires no domain expertise • Not affected by missing data • Proven accuracy • Understandable rules • Independent rules

Core Candidate Attributes One Dimensional • Diagnosis codes • Provider specialty • Lab observations • Procedure codes • ‘Abnormal’ lab obs. • Imaging procedures • Medication list • Age groups • Female gender

SHARPn Y2 Research Aims • Associations reliable across EHRs? • Improve algorithms’ sensitivity / specificity? • AC attribute selection + other classifiers • Two Dimensional Data • Three Dimensional Data
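Sensitivity and specificity of a phenotyping algorithm can be measured against a gold standard such as the chart review proposed under Future Directions. A minimal sketch, assuming binary labels (1 = has phenotype) in two aligned lists; the function name and the toy labels are illustrative only.

```python
def sensitivity_specificity(predicted, gold):
    """Return (sensitivity, specificity) for aligned binary label lists.

    sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    A rate is None when its denominator is zero.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    tn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 0)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Invented labels: algorithm output vs. chart-review result.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

Computing both rates separately per institution is one direct way to test whether an algorithm tuned on one EHR degrades on another.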
