High-Throughput Phenotyping and Cohort Identification from Electronic Health Records for Clinical and Translational Res

High-Throughput Phenotyping and Cohort Identification from Electronic Health Records for Clinical and Translational Research Jyotishman Pathak, PhD Assistant Professor of Biomedical Informatics Health Sciences Research Grand Rounds April 23, 2012

Background – The Problem • Patient recruitment is a major bottleneck in conducting successful clinical research studies • 50% of time is spent in recruitment • Low participation rates (~5%); studies are underpowered • Clinicians: lack resources to help patients find appropriate studies and trials • Patients: have difficulty finding appropriate studies that are locally available

Background – Use Cases • Large-scale genomics research • Linking biospecimens and genetic data to personal health data via biorepositories • Need large sample sizes for study design • Population-based epidemiological studies in understanding disease etiology • Often limited in scope or population diversity • Quality metrics and HITECH Act • Pay-for-Performance and quality-based incentives • Population management and cohort identification is non-trivial

Electronic health records (EHRs) driven phenotyping – The Proposed Solution • EHRs are becoming more and more prevalent within the U.S. healthcare system • Meaningful Use is one of the major drivers • Overarching goal • To develop techniques and algorithms that operate on normalized EHR data to identify cohorts of potentially eligible subjects on the basis of disease, symptoms, or related findings

Advantages: EHR-derived phenotyping • There is a LOT of information about subjects • Demographics, labs, meds, procedures, clinical notes… • Identification of otherwise latent population differences • Minimal costs for case ascertainment, no study-specific recruitment • Records are “retrospectively longitudinal” • Records are real world and contain many different phenotypes • Transportability and reuse of phenotype definitions across EHR enabled sites = power for clinical and research studies

Challenges: EHR-derived phenotyping • There is a LOT of information about subjects… • Non-standardized, heterogeneous, unstructured data (compared to protocol-based structured data collection) • Measured (e.g., demographics) vs. un-measured (e.g., socio-economic status) population differences • Hospital specialization and coding practices • Population/regional market landscape

The challenges can be addressed…if we • Develop techniques for standardization and normalization of clinical data and phenotypes • Develop techniques for transforming and managing unstructured clinical text into structured representations • Develop techniques for transportability of EHR-driven phenotyping algorithms • Develop a scalable, robust and flexible framework for demonstrating all of the above in a “real-world setting”

http://gwas.org

Funded by the NHGRI/NIGMS • Goal: to assess utility of EHRs as resources for genome science • Each site includes a biorepository linked to EHRs • Each project includes informatics, biostatistics, community engagement, ELSI, genetics experts • Initial proposals included identifying a primary phenotype of interest in 3,000 subjects and conduct of a genome-wide association study at each center: Σ=18,000 • eMERGE Phase II has a target of developing ~40 phenotype algorithms by the end of 2014 • Algorithm transportability an integral component

EHR-based Phenotyping Algorithms • Typical components • Billing and diagnoses codes • Procedure codes • Labs • Medications • Phenotype-specific co-variates (e.g., Demographics, Vitals, Smoking Status, CASI scores) • Pathology • Imaging? • Organized into inclusion and exclusion criteria
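The idea of organizing components into inclusion and exclusion criteria can be sketched as follows. This is a minimal illustration, not an eMERGE algorithm: the `PatientRecord` fields, the `PhenotypeDefinition` class, and the codes `DX-001`, `LAB-A`, and `DRUG-X` are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class PatientRecord:
    """Hypothetical, already-normalized view of one patient's EHR data."""
    patient_id: str
    diagnosis_codes: Set[str] = field(default_factory=set)
    medications: Set[str] = field(default_factory=set)
    abnormal_labs: Set[str] = field(default_factory=set)


Criterion = Callable[[PatientRecord], bool]


@dataclass
class PhenotypeDefinition:
    name: str
    inclusion: List[Criterion]
    exclusion: List[Criterion]

    def matches(self, patient: PatientRecord) -> bool:
        # A patient qualifies only if every inclusion criterion holds
        # and no exclusion criterion holds.
        return (all(c(patient) for c in self.inclusion)
                and not any(c(patient) for c in self.exclusion))


# Placeholder codes and drug names, not a validated clinical definition.
example = PhenotypeDefinition(
    name="example-phenotype",
    inclusion=[
        lambda p: "DX-001" in p.diagnosis_codes,
        lambda p: "LAB-A" in p.abnormal_labs,
    ],
    exclusion=[
        lambda p: "DRUG-X" in p.medications,
    ],
)

patients = [
    PatientRecord("p1", {"DX-001"}, set(), {"LAB-A"}),
    PatientRecord("p2", {"DX-001"}, {"DRUG-X"}, {"LAB-A"}),
    PatientRecord("p3", set(), set(), {"LAB-A"}),
]
cohort = [p.patient_id for p in patients if example.matches(p)]
```

Keeping each criterion as a separate predicate makes it easy to add or drop one during iterative refinement without touching the rest of the definition.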

EHR-based Phenotyping Algorithms • Iteratively refine case definitions through partial manual review to achieve ~PPV ≥ 95% • For controls, exclude all potentially overlapping syndromes and possible matches; iteratively refine such that ~NPV ≥ 98%
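The refinement targets above can be checked against manual chart-review counts. The sketch below assumes a reviewer has labeled a sample of algorithm-selected cases and controls. The function names and the example counts are illustrative, and only the 95% and 98% thresholds come from the slide.

```python
def ppv(true_pos: int, false_pos: int) -> float:
    """Positive predictive value: confirmed cases / all algorithm-flagged cases."""
    flagged = true_pos + false_pos
    if flagged == 0:
        raise ValueError("no flagged cases were reviewed")
    return true_pos / flagged


def npv(true_neg: int, false_neg: int) -> float:
    """Negative predictive value: confirmed controls / all algorithm-selected controls."""
    selected = true_neg + false_neg
    if selected == 0:
        raise ValueError("no controls were reviewed")
    return true_neg / selected


def meets_targets(case_ppv: float, control_npv: float,
                  ppv_target: float = 0.95, npv_target: float = 0.98) -> bool:
    # Thresholds taken from the slide; everything else is illustrative.
    return case_ppv >= ppv_target and control_npv >= npv_target


# Made-up review counts: 100 reviewed cases (96 confirmed),
# 100 reviewed controls (99 confirmed).
case_ppv = ppv(true_pos=96, false_pos=4)
control_npv = npv(true_neg=99, false_neg=1)
ready = meets_targets(case_ppv, control_npv)
```

If either value falls short, the case or control definition is tightened and a new sample is reviewed.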

Algorithm Development Process [Diagram labels: Data, Transform, Mappings, NLP, SQL, Phenotype Algorithm, Rules, Evaluation, Visualization] [eMERGE Network]

Hypothyroidism: Initial Algorithm • No thyroid-altering medications (e.g., Phenytoin, Lithium) • 2+ non-acute visits in 3 yrs • ICD-9s for Hypothyroidism • Abnormal TSH/FT4 • Antibodies for TTG or TPO (anti-thyroglobulin, anti-thyroperoxidase) • No ICD-9s for Hypothyroidism • No Abnormal TSH/FT4 • Thyroid replace. meds • No thyroid replace. meds • No Antibodies for TTG/TPO • No secondary causes (e.g., pregnancy, ablation) • No hx of myasthenia gravis • Case 1 • Case 2 • Control [Denny et al., 2012]

Hypothyroidism: Initial Algorithm [Conway et al. 2011]

Hypothyroidism: Algorithm Refinement • No thyroid-altering medications (e.g., Phenytoin, Lithium) • 2+ non-acute visits in 3 yrs • ICD-9s for Hypothyroidism • Abnormal TSH/FT4 • Antibodies for TTG or TPO (anti-thyroglobulin, anti-thyroperoxidase) • No ICD-9s for Hypothyroidism • No Abnormal TSH/FT4 • Thyroid replace. meds • No thyroid replace. meds • No Antibodies for TTG/TPO • No secondary causes (e.g., pregnancy, ablation) • No hx of myasthenia gravis • Case 1 • Case 2 • Control [Denny et al., 2012]

New Hypothyroidism Algorithm: Validation [Denny et al., 2012]

[eMERGE Network]

Genotype-Phenotype Association Results [Forest plot of published vs. observed odds ratios by disease, marker, and gene/region; x-axis: Odds Ratio, ticks at 0.5, 1.0, 2.0, 5.0] • Atrial fibrillation: rs2200733 (Chr. 4q25), rs10033464 (Chr. 4q25) • Crohn's disease: rs11805303 (IL23R), rs17234657 (Chr. 5), rs1000113 (Chr. 5), rs17221417 (NOD2), rs2542151 (PTPN22) • Multiple sclerosis: rs3135388 (DRB1*1501), rs2104286 (IL2RA), rs6897932 (IL7RA) • Rheumatoid arthritis: rs6457617 (Chr. 6), rs6679677 (RSBN1), rs2476601 (PTPN22) • Type 2 diabetes: rs4506565 (TCF7L2), rs12255372 (TCF7L2), rs12243326 (TCF7L2), rs10811661 (CDKN2B), rs8050136 (FTO), rs5219 (KCNJ11), rs5215 (KCNJ11), rs4402960 (IGF2BP2) [Ritchie et al. 2010]

Key lessons learned from eMERGE • Algorithm design and transportability • Non-trivial; requires significant expert involvement • Highly iterative process • Time-consuming manual chart reviews • Representation of “phenotype logic” for transportability is critical • Standardized data access and representation • Importance of unified vocabularies, data elements, and value sets • Questionable reliability of ICD & CPT codes (e.g., billing the wrong code since it is easier to find) • Natural Language Processing (NLP) needs

Algorithm Development Process - Modified [Diagram labels: Data, Transform, Mappings, NLP, SQL, Phenotype Algorithm, Rules, Semi-Automatic Execution, Evaluation, Visualization] [eMERGE Network]

Strategic Health IT Advance Research Projects (SHARPn): Secondary Uses of EHR Data • Mission: To enable the use of EHR data for secondary purposes, such as clinical research and public health. Leveraging clinical and health informatics to: • generate new knowledge • improve care • address population needs http://sharpn.org [Chute et al. 2011]

SHARPn: Secondary Use of EHR Data: A $15M National Consortium • Agilex Technologies • CDISC (Clinical Data Interchange Standards Consortium) • Centerphase Solutions • Deloitte • Group Health, Seattle • IBM Watson Research Labs • University of Utah • University of Pittsburgh • Harvard University • Intermountain Healthcare • Mayo Clinic • Mirth Corporation, Inc. • MIT • MITRE Corp. • Regenstrief Institute, Inc. • SUNY, Buffalo • University of Colorado

Cross-integrated suite of projects

Algorithm Development Process - Modified • Standardized and structured representation of phenotype definition criteria • Use the NQF Quality Data Model (QDM) • Conversion of structured phenotype criteria into executable queries • Use JBoss® Drools (DRLs) • Standardized representation of clinical data • Create new and re-use existing clinical element models (CEMs) [Diagram labels: Data, Transform, Mappings, NLP, SQL, Phenotype Algorithm, Rules, Semi-Automatic Execution, Evaluation, Visualization] [Welch et al. 2012] [Thompson et al., submitted 2012] [Li et al., submitted 2012]

The SHARPn “phenotyping funnel” Intermountain EHR Mayo Clinic EHR [Welch et al. 2012] [Thompson et al., submitted 2012] [Li et al., submitted 2012]

Clinical data normalization • Data Normalization • Clinical data comes in all different forms even for the same kind of information • Comparable and consistent data is foundational to secondary use • Clinical Element Models (CEMs) • Basis for retaining computable meaning when data is exchanged between heterogeneous computer systems • Basis for shared computable meaning when clinical data is referenced in decision support logic
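To make "different forms for the same kind of information" concrete, the sketch below maps two differently shaped body-weight records onto one normalized structure. It is only an illustration of the normalization idea: `NormalizedWeight`, the two site layouts, and the field names are hypothetical, and real CEMs carry far more structure than this.

```python
from dataclasses import dataclass

LB_TO_KG = 0.45359237  # exact definition of the international pound in kilograms


@dataclass(frozen=True)
class NormalizedWeight:
    """Hypothetical normalized target; real CEMs carry far more structure."""
    patient_id: str
    value_kg: float


def from_site_a(row: dict) -> NormalizedWeight:
    # Assumed site A layout: weight stored in pounds under "wt_lb".
    return NormalizedWeight(row["mrn"], round(row["wt_lb"] * LB_TO_KG, 1))


def from_site_b(row: dict) -> NormalizedWeight:
    # Assumed site B layout: string value plus an explicit unit field.
    if row["unit"] != "kg":
        raise ValueError(f"unsupported unit: {row['unit']}")
    return NormalizedWeight(row["patient"], round(float(row["weight"]), 1))


a = from_site_a({"mrn": "A-17", "wt_lb": 154.0})
b = from_site_b({"patient": "B-42", "weight": "69.9", "unit": "kg"})
```

Once both records share one shape and unit, downstream phenotype logic can compare them without knowing which site produced them.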

Clinical Element Models: Higher-Order Structured Representations [Stan Huff, IHC]

Pre- and Post-Coordination [Stan Huff, IHC]

[Stan Huff, IHC]

Data element harmonization • Stan Huff (Intermountain Healthcare) • Clinical Information Model Initiative (CIMI) • NHS Clinical Statement • CEN TC251/OpenEHR Archetypes • HL7 Templates • ISO TC215 Detailed Clinical Models • CDISC Common Clinical Elements • Intermountain/GE CEMs

SHARPn data normalization flow - I

SHARPn data normalization flow - II CEM MySQL database with normalized patient information [Welch et al. 2012]
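Once patient data sits in a normalized relational store, candidate selection reduces to a query. The sketch below uses an in-memory SQLite database as a stand-in for the MySQL database named on the slide. The table names, columns, and codes are hypothetical and do not reflect the actual CEM schema.

```python
import sqlite3

# In-memory SQLite stands in for the normalized database; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnosis (patient_id TEXT, code TEXT);
CREATE TABLE lab_result (patient_id TEXT, test TEXT, abnormal INTEGER);
INSERT INTO diagnosis VALUES ('p1', 'DX-001'), ('p2', 'DX-001'), ('p3', 'DX-999');
INSERT INTO lab_result VALUES ('p1', 'LAB-A', 1), ('p2', 'LAB-A', 0), ('p3', 'LAB-A', 1);
""")

# Patients with the placeholder diagnosis AND an abnormal placeholder lab.
rows = conn.execute(
    """
    SELECT DISTINCT d.patient_id
    FROM diagnosis d
    JOIN lab_result l ON l.patient_id = d.patient_id
    WHERE d.code = ? AND l.test = ? AND l.abnormal = 1
    ORDER BY d.patient_id
    """,
    ("DX-001", "LAB-A"),
).fetchall()
candidates = [r[0] for r in rows]
conn.close()
```

Parameterized queries keep the code lists outside the SQL text, which helps when the same logic is rerun at another site with a different value set.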

Algorithm Development Process - Modified • Standardized and structured representation of phenotype definition criteria • Use the NQF Quality Data Model (QDM) • Standardized representation of clinical data • Create new and re-use existing clinical element models (CEMs) [Diagram labels: Data, Transform, Mappings, NLP, SQL, Phenotype Algorithm, Rules, Semi-Automatic Execution, Evaluation, Visualization] [Welch et al. 2012] [Thompson et al., submitted 2012] [Li et al., submitted 2012]

NQF Quality Data Model (QDM) - I • Standard of the National Quality Forum (NQF) • A standard structure and grammar to represent quality measures precisely and accurately in a standardized format that can be used across electronic patient care systems • First (and only) standard for “eMeasures” • “All patients 65 years of age or older with at least two provider visits during the measurement period receiving influenza vaccine subcutaneously” • Implemented as set of XML schemas • Links to standard terminologies (ICD-9, ICD-10, SNOMED-CT, CPT-4, LOINC, RxNorm etc.)
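The quoted eMeasure sentence can be read as three computable conditions: an age threshold, a visit count within the measurement period, and a vaccination event. The sketch below is one possible reading, not the QDM XML representation. The `Patient` class is hypothetical, age is assumed to be evaluated at the start of the period, and the administration route (subcutaneous) is not modeled.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Patient:
    """Hypothetical record; field names are illustrative only."""
    patient_id: str
    birth_date: date
    visit_dates: List[date] = field(default_factory=list)
    flu_vaccine_dates: List[date] = field(default_factory=list)


def age_on(birth: date, on: date) -> int:
    # Subtract one year if the birthday has not yet occurred in the year of `on`.
    return on.year - birth.year - ((on.month, on.day) < (birth.month, birth.day))


def meets_measure(p: Patient, start: date, end: date) -> bool:
    old_enough = age_on(p.birth_date, start) >= 65  # assumption: age at period start
    visits = sum(start <= d <= end for d in p.visit_dates)
    vaccinated = any(start <= d <= end for d in p.flu_vaccine_dates)
    return old_enough and visits >= 2 and vaccinated


period_start, period_end = date(2011, 1, 1), date(2011, 12, 31)
patients = [
    Patient("p1", date(1940, 5, 1),
            [date(2011, 2, 1), date(2011, 9, 15)], [date(2011, 10, 1)]),
    Patient("p2", date(1950, 1, 1),
            [date(2011, 2, 1), date(2011, 9, 15)], [date(2011, 10, 1)]),
    Patient("p3", date(1940, 5, 1),
            [date(2010, 12, 1), date(2011, 9, 15)], [date(2011, 10, 1)]),
]
eligible = [p.patient_id for p in patients if meets_measure(p, period_start, period_end)]
```

Here `p2` fails the age threshold and `p3` has only one visit inside the period, so only `p1` satisfies all three conditions.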

NQF Quality Data Model (QDM) - II • Supports temporality & sequences • AND: "Procedure, Performed: eye exam" > 1 year(s) starts before or during "Measurement end date" • Groups of codes in a code set (ICD9, etc.) • Can group groups • Represented by OIDs, requires lookup • "Diagnosis, Active: steroid induced diabetes" using "steroid induced diabetes Value Set GROUPING (2.16.840.1.113883.3.464.0001.113)" • Focus on structured data • Would require extensions for NLP
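Because value sets can group other groups and are referenced by OID, using one requires a recursive lookup. The sketch below assumes a simple in-memory registry. The grouping OID is the one quoted on the slide, while the member identifiers and all codes are hypothetical placeholders rather than real value set content.

```python
from typing import Dict, List, Optional, Set

GROUPING_OID = "2.16.840.1.113883.3.464.0001.113"  # OID quoted on the slide

# Hypothetical registry: an entry holds codes, member OIDs, or both.
VALUE_SETS: Dict[str, Dict[str, List[str]]] = {
    GROUPING_OID: {"members": ["example.oid.icd9", "example.oid.snomed"]},
    "example.oid.icd9": {"codes": ["ICD9:PLACEHOLDER-1", "ICD9:PLACEHOLDER-2"]},
    "example.oid.snomed": {"codes": ["SNOMED:PLACEHOLDER-3"]},
}


def resolve(oid: str,
            registry: Dict[str, Dict[str, List[str]]],
            _seen: Optional[Set[str]] = None) -> Set[str]:
    """Flatten a (possibly nested) value set into its member codes."""
    seen = set() if _seen is None else _seen
    if oid in seen:          # guard against cyclic groupings
        return set()
    seen.add(oid)
    entry = registry[oid]    # raises KeyError for an unknown OID
    codes = set(entry.get("codes", []))
    for member in entry.get("members", []):
        codes |= resolve(member, registry, seen)
    return codes


codes = resolve(GROUPING_OID, VALUE_SETS)
```

A criterion such as "Diagnosis, Active" can then be tested by checking whether any of a patient's diagnosis codes falls in the flattened set.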

116 Meaningful Use Phase I Quality Measures

Example: Diabetes & Lipid Mgmt. - I

Example: Diabetes & Lipid Mgmt. - II

NQF Measure Authoring Tool (MAT)

Our task: human readable → machine computable [Thompson et al., submitted 2012]

Algorithm Development Process - Modified • Standardized and structured representation of phenotype definition criteria • Use the NQF Quality Data Model (QDM) • Conversion of structured phenotype criteria into executable queries • Use JBoss® Drools (DRLs) • Standardized representation of clinical data • Create new and re-use existing clinical element models (CEMs) [Diagram labels: Data, Transform, Mappings, NLP, SQL, Phenotype Algorithm, Rules, Semi-Automatic Execution, Evaluation, Visualization] [Welch et al. 2012] [Thompson et al., submitted 2012] [Li et al., submitted 2012]

JBoss® open-source Drools environment • Represents knowledge with declarative production rules • Origins in artificial intelligence expert systems • Simple when <pattern> then <action> rules specified in text files • Separation of data and logic into separate components • Forward chaining inference model (Rete algorithm) • Domain specific languages (DSL)

Drools inference architecture • Inference Execution Model • Define a Knowledge Base • Compiled Rules • Produces Production Memory • Extract Knowledge Session from Knowledge Base • Insert Facts (data) into Knowledge Session → “Agenda” • Fire Rules (Race Conditions/Infinite Loops) • Retrieve End Results
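The insert-facts, build-agenda, fire-rules cycle can be illustrated with a toy forward-chaining loop. This is a naive sketch that re-evaluates every rule on each cycle. It does not implement the Rete algorithm and does not use the Drools API. The `Rule` type, the fact strings, and the second "notify" rule are hypothetical.

```python
from typing import Callable, List, NamedTuple, Set, Tuple


class Rule(NamedTuple):
    name: str
    when: Callable[[Set[str]], bool]      # pattern over working memory
    then: Callable[[Set[str]], Set[str]]  # facts to assert when fired


def run_rules(rules: List[Rule], initial_facts: Set[str],
              max_cycles: int = 100) -> Tuple[Set[str], List[str]]:
    facts = set(initial_facts)            # working memory
    fired: List[str] = []
    for _ in range(max_cycles):           # cap guards against runaway rule sets
        # Agenda: rules whose condition holds and that have not fired yet.
        agenda = [r for r in rules if r.name not in fired and r.when(facts)]
        if not agenda:
            return facts, fired           # fixed point reached
        rule = agenda[0]
        facts |= rule.then(facts)
        fired.append(rule.name)
    raise RuntimeError("rule set did not reach a fixed point")


INSTRUCTION = "instruction:GLUCOSE_LESS_THAN_40_INSULIN_ON_MSG"
rules = [
    # Hypothetical follow-on rule, listed first to show that order of
    # declaration does not decide order of firing.
    Rule("notify", lambda f: INSTRUCTION in f, lambda f: {"alert:notify_clinician"}),
    Rule("low-glucose-insulin-on",
         lambda f: {"glucose<=40", "insulin_drip_on"} <= f,
         lambda f: {INSTRUCTION}),
]
final_facts, fired_order = run_rules(rules, {"glucose<=40", "insulin_drip_on"})
```

Letting each rule fire at most once is one simple way to avoid the infinite-loop hazard the slide mentions. A production engine handles this with more refined conflict resolution.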

Example Drools rule
    rule "Glucose <= 40, Insulin On"
    when
        $msg : GlucoseMsg(glucoseFinding <= 40, currentInsulinDrip > 0)
    then
        glucoseProtocolResult.setInstruction(GlucoseInstructions.GLUCOSE_LESS_THAN_40_INSULIN_ON_MSG);
    end
Slide callouts: {Rule Name}, {Java Class}, {binding}, {Class Getter Method}, {Class Setter Method}, Parameter

The “obvious” slide - T2DM Drools flow

Automatic translation from NQF QDM criteria to Drools • Measure Authoring Toolkit → Drools Engine: from non-executable to executable • Measures: XML-based structured representation → Drools scripts (converting measures to Drools scripts) • Data Types: XML-based structured representation → Fact Models (mapping data types and value sets) • Value Sets: saved in XLS files [Li et al., submitted 2012]
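The step from a structured criterion to a Drools script is, at its simplest, template filling. The sketch below is not the translator described by Li et al. It only shows the general shape of such a conversion. The `Constraint` class and the `to_drl` function are hypothetical, and the example reuses the glucose rule from the Drools slide.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_OPS = {"<", "<=", "==", ">=", ">"}


@dataclass
class Constraint:
    """Hypothetical structured form of one field comparison."""
    field: str
    op: str
    value: int


def to_drl(rule_name: str, fact_type: str,
           constraints: List[Constraint], action: str) -> str:
    """Render a single-pattern rule as DRL text."""
    for c in constraints:
        if c.op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {c.op}")
    pattern = ", ".join(f"{c.field} {c.op} {c.value}" for c in constraints)
    return "\n".join([
        f'rule "{rule_name}"',
        "when",
        f"    $msg : {fact_type}({pattern})",
        "then",
        f"    {action}",
        "end",
    ])


drl = to_drl(
    "Glucose <= 40, Insulin On",
    "GlucoseMsg",
    [Constraint("glucoseFinding", "<=", 40), Constraint("currentInsulinDrip", ">", 0)],
    "glucoseProtocolResult.setInstruction("
    "GlucoseInstructions.GLUCOSE_LESS_THAN_40_INSULIN_ON_MSG);",
)
```

A real converter would also need to map QDM data types onto fact models and expand value sets, as the slide indicates.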

SHARPn phenotyping architecture using CEMs, QDMs, and DRLs [Welch et al. 2012]

The SHARPn “phenotyping funnel” Intermountain EHR Mayo Clinic EHR [Welch et al. 2012] [Thompson et al., submitted 2012] [Li et al., submitted 2012]

Phenotype library and workbench - I http://phenotypeportal.org

Phenotype library and workbench - I http://phenotypeportal.org • Converts QDM to Drools • Rule execution by querying the CEM database • Generate summary reports
